When we analyze audio signals, we usually adopt the
method of short-term analysis, since most audio signals are more or less stable
within a short period of time, say 20 ms or so. When we do frame blocking,
neighboring frames may overlap so that subtle changes in the audio signals are
captured. Note that each frame is the basic unit of our analysis.
Within each frame, we can observe the three most distinct acoustic features, as
follows.
- Volume : This feature represents the loudness of the audio signal, which is correlated with the amplitude of the signal. It is sometimes also referred to as the energy or intensity of the audio signal.
- Pitch : This feature represents the vibration rate of audio signals, which can be represented by the fundamental frequency, or equivalently, the reciprocal of the fundamental period of voiced audio signals.
- Timbre : Timbre is an acoustic feature that is defined conceptually. In general, timbre refers to the "content" of a frame of audio signals, which is ideally not affected much by pitch and intensity. Theoretically, for quasi-periodic audio signals, we can use the waveform within a fundamental period as the timbre of the frame. However, it is difficult to analyze the waveform within a fundamental period directly. Instead, we usually use the fast Fourier transform (FFT) to transform the time-domain waveform into a frequency-domain spectrum for further analysis. The amplitude spectrum simply represents the intensity of the waveform in each frequency band.
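To make frame blocking and the three features more concrete, here is a minimal sketch in plain Python. It is only one possible way to quantify them, not a definitive implementation. The function names, the sum-of-absolute-values definition of volume, the autocorrelation pitch estimator, the 50 to 500 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for illustration. A naive DFT stands in for the FFT so that the example needs only the standard library; in practice an FFT routine such as `numpy.fft.rfft` would be used instead.

```python
import cmath
import math


def frame_blocking(signal, frame_size, overlap):
    """Split a signal into frames; neighboring frames share `overlap` samples."""
    hop = frame_size - overlap
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, hop)]


def volume(frame):
    """Volume as the sum of absolute sample values after removing the mean.

    This is one assumed definition; other definitions (e.g. log energy) exist.
    """
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)


def pitch_autocorr(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of a frame.

    Picks the lag with the largest autocorrelation inside the assumed
    search range and returns the reciprocal of that fundamental period.
    """
    min_lag = int(fs / fmax)
    max_lag = min(int(fs / fmin), len(frame) - 1)

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return fs / best_lag


def amplitude_spectrum(frame):
    """Amplitude spectrum for bins 0..N/2, via a naive O(N^2) DFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# Synthetic test signal (an assumption for this demo): 0.1 s of a 200 Hz sine.
fs = 8000                                   # sampling rate in Hz
signal = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]

# 20 ms frames (160 samples at 8 kHz) with 50% overlap.
frames = frame_blocking(signal, frame_size=160, overlap=80)

frame = frames[0]
spectrum = amplitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)

print("number of frames:", len(frames))
print("volume of first frame:", volume(frame))
print("pitch of first frame (Hz):", pitch_autocorr(frame, fs))
print("spectral peak (Hz):", peak_bin * fs / len(frame))
```

For this clean tone, both the pitch estimate and the spectral peak land on 200 Hz. Real recordings are noisier, so a practical system would typically add windowing, a voicing check, and a more robust pitch tracker.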
These acoustic
features mostly correspond to human "perception" and therefore
cannot be represented exactly by mathematical formulas or quantities. However,
we still try to "quantify" these features for further
computer-based analysis, in the hope that the formulas or quantities used can
emulate human perception as closely as possible.