Recording of the word 'six' spoken by a male speaker
Fricatives in speech
Fricatives are produced when air is forced through a narrow constriction in the vocal tract, for example between the tongue and the alveolar ridge just behind the upper teeth for /s/, or between the back of the tongue and the soft palate for /x/. This constriction generates turbulent airflow, which excites a broad range of frequencies and produces a noise-like acoustic signal. Because the turbulence consists of many small, approximately independent pressure fluctuations that sum at the microphone, the central limit theorem suggests that their combined amplitude tends toward a Gaussian (normal) distribution.
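The central limit argument can be checked numerically. The following is a minimal sketch, assuming each turbulent source contributes an independent, uniformly distributed fluctuation; the uniform model, the number of sources, and the helper name `excess_kurtosis` are illustrative assumptions, not measurements. Excess kurtosis is used as a simple Gaussianity indicator, since it is zero for a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: every source adds a small, independent, uniformly distributed
# pressure fluctuation, which is clearly non-Gaussian on its own.
n_sources = 100
n_samples = 20_000
contributions = rng.uniform(-1.0, 1.0, size=(n_samples, n_sources))

# Pressure at the microphone: sum over all sources for each time sample.
pressure = contributions.sum(axis=1)


def excess_kurtosis(x):
    """Fourth standardized moment minus 3; zero for a Gaussian."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0


kurt_single = excess_kurtosis(contributions[:, 0])  # theory for uniform: -1.2
kurt_sum = excess_kurtosis(pressure)                # theory: -1.2 / n_sources
print(f"single source: {kurt_single:.2f}, summed signal: {kurt_sum:.2f}")
```

A single source is far from Gaussian, while the sum is already very close to it, even though no individual contribution is Gaussian.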
Sound pressure histogram
The red box highlights the "x" part of the recording, which is pronounced /ks/. The samples inside the box are used to plot the histogram. In addition, we fit a Gaussian distribution by setting its mean and variance to the sample mean and sample variance of the data.
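The moment-matching fit can be sketched as follows. Since the recording itself is not part of this text, the array `segment` is a synthetic stand-in for the samples in the red box; its length and scale, as well as the helper `gaussian_pdf`, are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: synthetic noise stands in for the samples inside the red box.
segment = rng.normal(loc=0.0, scale=0.05, size=4000)

# Moment matching: the fitted Gaussian takes the sample mean and variance.
mu_hat = segment.mean()
var_hat = segment.var(ddof=1)  # unbiased sample variance


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


# density=True normalizes the histogram so that the bar areas sum to one,
# which makes it directly comparable to the fitted density.
density, edges = np.histogram(segment, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
fitted = gaussian_pdf(centers, mu_hat, var_hat)

max_abs_dev = np.max(np.abs(density - fitted))
print(f"mean={mu_hat:.4f}, std={np.sqrt(var_hat):.4f}, max deviation={max_abs_dev:.3f}")
```

With real data, `segment` would be replaced by the slice of the recording marked by the red box; the rest of the sketch stays the same.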
Sound pressure 2D histogram
While the amplitude distribution appears noise-like, consecutive time samples are not independent. The 2D histogram of consecutive sample pairs reveals this dependency. As we will see later, it hints at a high-pass-shaped frequency spectrum.
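To see how such a dependency shows up, consider the following sketch. It assumes a crude high-pass model, white Gaussian noise passed through a first-order difference filter with coefficient 0.9; this model is an illustrative assumption, not the actual fricative. The pairs of consecutive samples are then binned with `np.histogram2d`.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumption: white noise through a first-order difference filter serves as
# a simple high-pass stand-in for the fricative noise.
white = rng.normal(size=20_001)
x = white[1:] - 0.9 * white[:-1]

# Pairs of consecutive samples (x[k], x[k+1]).
current, following = x[:-1], x[1:]

# Joint histogram of the pairs; independent samples would give an
# axis-aligned blob, dependent samples a tilted one.
hist2d, xedges, yedges = np.histogram2d(current, following, bins=40)

# The correlation coefficient at lag 1 summarizes the tilt in one number.
rho = np.corrcoef(current, following)[0, 1]
print(f"lag-1 correlation coefficient: {rho:.3f}")
```

For this filter the theoretical lag-1 correlation coefficient is -0.9 / (1 + 0.9^2), roughly -0.5. The negative sign means that neighbouring samples tend to have opposite signs, which is the time-domain signature of energy concentrated at high frequencies.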
Ergodicity: Analyzing the “ks” Sound
We estimate the correlation structure of a speech signal assuming ergodicity, that is, we replace the ensemble average by a time average over a single recording.
\hat{R}_{XX}[\kappa] = \langle X[k] X[k+\kappa] \rangle_k = \frac{1}{N-\kappa} \sum_{k=0}^{N-\kappa-1} X[k] X[k+\kappa]
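The estimator above translates almost literally into code. The following is a minimal sketch for non-negative lags; the function name `autocorr_unbiased` and its `max_lag` parameter are our own choices for illustration.

```python
import numpy as np


def autocorr_unbiased(x, max_lag):
    """Time-average estimate of R_XX[kappa] for kappa = 0, ..., max_lag.

    Each lag averages the N - kappa available products x[k] * x[k + kappa].
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(x)")
    r = np.empty(max_lag + 1)
    for kappa in range(max_lag + 1):
        # x[:n - kappa] holds x[k], x[kappa:] holds x[k + kappa].
        r[kappa] = np.dot(x[: n - kappa], x[kappa:]) / (n - kappa)
    return r


# Small hand-checkable example: lags 0, 1, 2 give 7.5, 20/3, 5.5.
r_example = autocorr_unbiased([1.0, 2.0, 3.0, 4.0], max_lag=2)
print(r_example)
```

Note that fewer products are averaged as the lag grows, so the estimate becomes noisier for lags close to the signal length.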
Estimated Autocorrelation and PSD of ‘ks’ Sound
To estimate the PSD in a stable manner, short-time estimates need to be averaged. For more details see Welch’s method.
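The averaging idea can be sketched in a few lines. This is a simplified illustration of Welch-style averaging, assuming Hann-windowed segments with 50% overlap and a one-sided spectrum; the function name `welch_psd`, the segment length, the sampling rate, and the white-noise test signal are assumptions. In practice a library routine such as `scipy.signal.welch` offers more options.

```python
import numpy as np


def welch_psd(x, fs, nperseg=256):
    """One-sided PSD from averaged, Hann-windowed periodograms (50% overlap)."""
    x = np.asarray(x, dtype=float)
    if len(x) < nperseg:
        raise ValueError("signal is shorter than one segment")
    step = nperseg // 2
    window = np.hanning(nperseg)
    scale = fs * np.sum(window**2)  # normalizes to power per Hz
    periodograms = []
    for start in range(0, len(x) - nperseg + 1, step):
        segment = x[start : start + nperseg] * window
        periodograms.append(np.abs(np.fft.rfft(segment)) ** 2 / scale)
    psd = np.mean(periodograms, axis=0)
    # One-sided spectrum: double every bin except DC (and Nyquist if present).
    if nperseg % 2 == 0:
        psd[1:-1] *= 2.0
    else:
        psd[1:] *= 2.0
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd


rng = np.random.default_rng(3)
fs = 16_000.0                # assumed sampling rate in Hz
x = rng.normal(size=65_536)  # white-noise stand-in with unit variance

freqs, psd_avg = welch_psd(x, fs, nperseg=256)
_, psd_single = welch_psd(x[:256], fs, nperseg=256)  # one segment only


def relative_spread(p):
    """Standard deviation across frequency bins relative to the mean level."""
    return np.std(p[1:-1]) / np.mean(p[1:-1])


print(f"single periodogram spread: {relative_spread(psd_single):.2f}")
print(f"averaged estimate spread:  {relative_spread(psd_avg):.2f}")
```

For white noise the true PSD is flat, so the spread across frequency bins directly shows the estimation noise: a single periodogram fluctuates on the order of its own mean, while the averaged estimate is much smoother.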