Signal processing: Extraction of audio acoustic features

The basic approach to the extraction of audio acoustic features can be summarized as follows:

1. Perform frame blocking to convert the audio stream into a sequence of frames. Each frame typically lasts 20 to 30 ms. If the frame is too long, we cannot capture the time-varying characteristics of the signal. If it is too short, we cannot extract valid acoustic features. In general, a frame should contain several fundamental periods of the signal.
2. To reduce the difference between neighboring frames, allow them to overlap.
3. Assuming the signal within a frame is stationary, extract acoustic features such as zero-crossing rate, volume, pitch, mel-frequency cepstral coefficients (MFCC), and linear predictive coding (LPC) coefficients.
4. Perform endpoint detection based on zero-crossing rate and volume, and keep the non-silence frames for further analysis.
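The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, volume is taken here as the sum of absolute sample values after removing the frame mean (other definitions exist), and the endpoint decision is simplified to a single volume threshold whose ratio (0.1 of the maximum volume) is arbitrary. The zero-crossing rate is computed as a feature but is not used in the keep/discard decision.

```python
import math


def frame_blocking(signal, frame_size, overlap):
    """Split a list of samples into overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    step = frame_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    return [signal[i:i + frame_size]
            for i in range(0, len(signal) - frame_size + 1, step)]


def volume(frame):
    """Sum of absolute sample values after removing the frame mean."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)


def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples in the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def detect_non_silence(frames, volume_ratio=0.1):
    """Indices of frames whose volume exceeds a fraction of the maximum volume.

    Simplified, volume-only endpoint detection; the ratio is an assumed value.
    """
    volumes = [volume(f) for f in frames]
    threshold = volume_ratio * max(volumes)
    return [i for i, v in enumerate(volumes) if v > threshold]


# Synthetic test signal: 0.1 s silence, 0.2 s of a 200 Hz tone, 0.1 s silence.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]
signal = [0.0] * 800 + tone + [0.0] * 800

# 25 ms frames (200 samples) with 10 ms of overlap (80 samples).
frames = frame_blocking(signal, frame_size=200, overlap=80)
kept = detect_non_silence(frames)
features = [(volume(frames[i]), zero_crossing_rate(frames[i])) for i in kept]
```

Here `kept` holds the indices of the frames classified as non-silence, and `features` holds a (volume, zero-crossing rate) pair for each of them. A more careful detector would typically combine both features, for example by using the zero-crossing rate to recover low-volume unvoiced sounds near the endpoints.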
The following terms are used often when carrying out the procedure above:
- Frame size: The number of sample points within each frame.
- Frame overlap: The number of sample points shared by two consecutive frames.
- Frame step (or hop size): The frame size minus the frame overlap.
- Frame rate: The number of frames per second, equal to the sampling frequency divided by the frame step.
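As a quick check of how these quantities relate, the following sketch computes the frame step, frame rate, and frame duration from a sampling frequency, frame size, and frame overlap. The function names and the example values (16 kHz sampling, 400-sample frames, 240-sample overlap) are assumptions chosen for illustration only.

```python
def frame_step(frame_size, overlap):
    """Frame step (hop size) in samples: frame size minus overlap."""
    return frame_size - overlap


def frame_rate(sample_rate, frame_size, overlap):
    """Frames per second: sampling frequency divided by the frame step."""
    return sample_rate / frame_step(frame_size, overlap)


def frame_duration_ms(sample_rate, frame_size):
    """Duration of one frame in milliseconds."""
    return 1000 * frame_size / sample_rate


# Example values (assumed): 16 kHz audio, 400-sample frames, 240-sample overlap.
sample_rate = 16000
frame_size = 400
overlap = 240

step = frame_step(frame_size, overlap)               # 160 samples
rate = frame_rate(sample_rate, frame_size, overlap)  # 100.0 frames per second
duration = frame_duration_ms(sample_rate, frame_size)  # 25.0 ms
```

With these values each frame lasts 25 ms, which falls inside the 20 to 30 ms range mentioned earlier, and a new frame starts every 10 ms.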
