The basic approach to extracting acoustic features from audio can be summarized as follows:
- Perform frame blocking, so that the stream of audio samples is converted into a set of frames. Each frame typically lasts about 20 to 30 ms. If the frame is too long, we cannot capture the time-varying characteristics of the audio signal. If it is too short, we cannot extract valid acoustic features. In general, a frame should contain several fundamental periods of the given audio signal.
- If we want to reduce the difference between neighboring frames, we can allow overlap between them.
- Assuming the audio signal within a frame is stationary, we can extract acoustic features such as zero crossing rate, volume, pitch, MFCC, LPC, etc.
- We can perform endpoint detection based on zero crossing rate and volume, and keep the non-silence frames for further analysis.
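
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the function names (`frame_blocking`, `zero_crossing_rate`, `volume`) are hypothetical, volume is taken as the sum of absolute sample values after removing the frame mean (one common definition among several), and the endpoint detection uses only a volume threshold set arbitrarily to 10% of the maximum, leaving out the zero crossing rate for brevity. The test signal (half a second of silence followed by a 200 Hz sine at 8 kHz) is synthetic.

```python
import math

def frame_blocking(signal, frame_size, overlap):
    """Split a list of samples into frames of frame_size samples with the given overlap."""
    step = frame_size - overlap  # frame step (hop size)
    if step <= 0:
        raise ValueError("overlap must be smaller than frame_size")
    # Only complete frames are kept; trailing samples that do not fill a frame are dropped.
    return [signal[start:start + frame_size]
            for start in range(0, len(signal) - frame_size + 1, step)]

def zero_crossing_rate(frame):
    """Count sign changes between consecutive samples (zero is treated as non-negative)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def volume(frame):
    """Sum of absolute sample values after mean removal (an assumed definition)."""
    mean = sum(frame) / len(frame)
    return sum(abs(x - mean) for x in frame)

# Synthetic test signal: 0.5 s of silence, then 0.5 s of a 200 Hz sine, sampled at 8 kHz.
sample_rate = 8000
silence = [0.0] * 4000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(4000)]
signal = silence + tone

# 256 samples at 8 kHz is 32 ms per frame; neighboring frames share half their samples.
frames = frame_blocking(signal, frame_size=256, overlap=128)

volumes = [volume(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]

# Crude endpoint detection: keep frames whose volume exceeds 10% of the maximum.
threshold = 0.1 * max(volumes)
voiced = [i for i, v in enumerate(volumes) if v > threshold]

print("number of frames:", len(frames))
print("first non-silence frame index:", voiced[0])
```

With these settings the frames that lie entirely inside the silent half have zero volume and are discarded, while the frames that cover the tone are kept. On real recordings the threshold would have to be tuned, and the zero crossing rate is typically used alongside volume to avoid dropping low-energy unvoiced sounds.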
The following terms are used frequently when carrying out the above procedure:
- Frame size: The number of sample points within each frame.
- Frame overlap: The number of sample points shared by two consecutive frames.
- Frame step (or hop size): The frame size minus the overlap.
- Frame rate: The number of frames per second, which is equal to the sampling frequency (in Hz) divided by the frame step (in samples).
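
As a quick check of how these quantities relate, the short sketch below computes them for one set of example values. The helper name `frame_parameters` and the numbers (16 kHz sampling frequency, 400-sample frames, 240-sample overlap) are assumptions chosen for illustration only.

```python
def frame_parameters(sample_rate, frame_size, overlap):
    """Derive frame duration, frame step and frame rate from the basic settings."""
    step = frame_size - overlap  # frame step (hop size) in samples
    return {
        "frame_duration_ms": 1000.0 * frame_size / sample_rate,
        "frame_step": step,
        "frame_rate": sample_rate / step,  # frames per second
    }

params = frame_parameters(sample_rate=16000, frame_size=400, overlap=240)
print(params)
# {'frame_duration_ms': 25.0, 'frame_step': 160, 'frame_rate': 100.0}
```

Here each frame lasts 25 ms, which falls in the 20 to 30 ms range mentioned above, and a frame step of 160 samples yields 100 frames per second.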