Theoretically, it should be possible to recognize speech
directly from the digitized waveform. However, because of the large variability
of the speech signal, it is better to first perform feature extraction that
reduces that variability. In particular, feature extraction typically discards
sources of information that are irrelevant to the phonetic content, such as
whether the sound is voiced or unvoiced and, if voiced, the effects of
periodicity or pitch, the amplitude of the excitation signal, and the
fundamental frequency.
The reason for
computing the short-term spectrum is that the cochlea of the human ear performs
a quasi-frequency analysis. The analysis in the cochlea takes place on a
nonlinear frequency scale (known as the Bark scale or the mel scale), which is
approximately linear up to about 1000 Hz and approximately logarithmic above
that. Accordingly, it is very common in feature extraction to warp the
frequency axis after the spectral computation.
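As a concrete illustration of this warping, here is a minimal sketch of the mel scale, assuming the common O'Shaughnessy (HTK-style) formula mel(f) = 2595 · log10(1 + f/700); other variants of the mel and Bark scales exist, and the original text does not commit to a specific formula:

```python
import math

def hz_to_mel(f_hz):
    # O'Shaughnessy / HTK-style mel formula (one common convention):
    # roughly linear below ~1000 Hz, roughly logarithmic above.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used e.g. to place mel filterbank edges back in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Frequencies equally spaced in Hz become compressed at the high end
# when mapped to mels, reflecting the cochlea's nonlinear resolution.
for f in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_mel(f):7.1f} mel")
```

With this convention, 1000 Hz maps to approximately 1000 mel, and equal steps in Hz above that point correspond to progressively smaller steps in mel, which is the frequency warping referred to above.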