Thursday, September 10, 2015

LPC (Linear Predictive Coding)


Linear predictive coding is one of the important methods for speech analysis because it provides an estimate of the poles of the vocal tract transfer function, and hence of the formant frequencies produced by the vocal tract. LPC analyzes the speech signal by estimating the formants, removing their effect from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue. The basic idea behind LPC is that each sample can be approximated as a linear combination of a few past samples. Linear prediction provides a robust, reliable, and accurate way of estimating these parameters, and the computation involved in LPC processing is considerably less than that of cepstrum analysis.
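To make this concrete, here is a minimal MATLAB sketch of LPC analysis and inverse filtering. It assumes the Signal Processing Toolbox (for the lpc function); the file name 'so1.wav', the 30 ms frame position and the order p = 12 are illustrative assumptions, not values from this post.

[x, fs] = audioread('so1.wav');         % any speech file; 'so1.wav' is a placeholder
x = x(:, 1);                            % use the first channel if the file is stereo
frame = x(1 : round(0.03*fs));          % one 30 ms analysis frame (assumed position)
p = 12;                                 % assumed prediction order (number of poles)
a = lpc(frame, p);                      % linear prediction (all-pole) coefficients
residue = filter(a, 1, frame);          % inverse filtering -> prediction residual
predicted = filter([0 -a(2:end)], 1, frame);  % each sample predicted from past samples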


Liftering


Liftering is the cepstral-domain analogue of filtering in the frequency domain: a desired quefrency region is selected for analysis by multiplying the whole cepstrum by a rectangular window placed at that position. Two types of liftering are performed, low-time liftering and high-time liftering. Low-time liftering extracts the vocal tract characteristics in the quefrency domain, while high-time liftering extracts the excitation characteristics of the speech frame under analysis.
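As a rough illustration, the MATLAB sketch below builds the two rectangular lifter windows and applies them to the real cepstrum of one frame; the file name, frame position and the cutoff quefrency Lc are assumptions made here only for illustration.

[x, fs] = audioread('so1.wav');                 % placeholder file name
frame = x(1 : round(0.03*fs), 1);               % one 30 ms frame, first channel
c = real(ifft(log(abs(fft(frame)) + eps)));     % real cepstrum of the frame
Lc = 20;                                        % assumed cutoff quefrency (in samples)
low_lifter = zeros(size(c));
low_lifter([1:Lc, end-Lc+2:end]) = 1;           % low-time lifter (real cepstrum is even)
high_lifter = 1 - low_lifter;                   % high-time lifter
vocal_tract_part = c .* low_lifter;             % vocal tract characteristics
excitation_part  = c .* high_lifter;            % excitation characteristics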

Cepstrum Domain



Speech is composed of an excitation sequence convolved with the impulse response of the vocal tract model. It is often desirable to separate the two components so that one of them may be used in a recognition algorithm. The cepstrum is a common transform that can be used to separate the excitation signal (which carries the pitch) from the vocal tract transfer function (which carries the formant structure and hence the voice quality). These two portions are convolved in the time domain, and convolution in the time domain becomes multiplication in the frequency domain, which can be written as
                        X(w) = G(w) H(w)
Taking the log of the magnitude of both sides of this transform gives
                        log|X(w)| = log|G(w)| + log|H(w)|
Taking the IDFT of both sides of the above equation introduces a term called “quefrency”, which is the x-axis of the cepstrum domain.
This process is better understood with the help of a block diagram. A lifter is used to separate the high-quefrency part (excitation) from the low-quefrency part (transfer function).
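The MATLAB sketch below simply walks through these equations for a single frame; the file name and frame position are placeholders.

[x, fs] = audioread('so1.wav');          % placeholder file name
frame = x(1 : round(0.03*fs), 1);        % one short analysis frame
X = fft(frame);                          % X(w) = G(w) H(w)
logmag = log(abs(X) + eps);              % log|X(w)| = log|G(w)| + log|H(w)|
c = real(ifft(logmag));                  % IDFT -> cepstrum, indexed by quefrency
plot(c); xlabel('Quefrency (samples)');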

Thursday, September 3, 2015

Frame Blocking of speech signal


In this step, the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all of the speech is accounted for within one or more frames. Frame blocking is done because, when examined over a sufficiently short interval, the characteristics of the speech signal are fairly stationary, whereas over longer intervals they change to reflect the different speech sounds being spoken. Overlapping frames are used to avoid losing information at frame boundaries and to maintain correlation between adjacent frames.
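A small MATLAB sketch of this blocking is given below; the frame length (25 ms), the shift (10 ms) and the file name are assumed example values.

[x, fs] = audioread('so1.wav');              % placeholder file name
x = x(:, 1);
N = round(0.025 * fs);                       % frame length in samples
M = round(0.010 * fs);                       % frame shift (M < N); overlap = N - M
num_frames = floor((length(x) - N) / M) + 1;
frames = zeros(N, num_frames);
for k = 1:num_frames
    frames(:, k) = x((k-1)*M + 1 : (k-1)*M + N);   % k-th frame starts (k-1)*M samples in
end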

Tuesday, September 1, 2015

What is white noise?

White noise is a type of noise that is produced by combining sounds of all different frequencies together. If you took all of the imaginable tones that a human can hear and combined them together, you would have white noise.
The adjective "white" is used to describe this type of noise because of the way white light works. White light is light that is made up of all of the different colors (frequencies) of light combined together. In the same way, white noise is a combination of all of the different frequencies of sound. You can think of white noise as 20,000 tones all playing at the same time. Because white noise contains all frequencies, it is frequently used to mask other sounds.


Spectral and Temporal Features in Speech signal

There are two types of features of a speech signal:
  • The temporal features (time domain features), which are simple to extract and have an easy physical interpretation, such as the energy of the signal, the zero crossing rate, the maximum amplitude, the minimum energy, etc.
  • The spectral features (frequency based features), which are obtained by converting the time based signal into the frequency domain using the Fourier transform, such as the fundamental frequency, frequency components, spectral centroid, spectral flux, spectral density, spectral roll-off, etc. These features can be used to identify the notes, pitch, rhythm, and melody. A short sketch computing a few features of both kinds is given after this list.
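Here is a minimal MATLAB sketch of a few of these features for one frame; the file name and frame position are placeholders, and the formulas used (short-time energy, zero crossing rate, spectral centroid) are the usual textbook definitions rather than anything specific to this post.

[x, fs] = audioread('so1.wav');                    % placeholder file name
frame = x(1 : round(0.025*fs), 1);                 % one 25 ms frame
% temporal features
energy = sum(frame.^2);                            % short-time energy
zcr = sum(abs(diff(sign(frame)))) / (2*length(frame));   % zero crossing rate
% spectral feature
X = abs(fft(frame));
X = X(1 : floor(end/2));                           % keep positive frequencies
f = (0 : length(X)-1)' * fs / length(frame);       % frequency axis in Hz
centroid = sum(f .* X) / sum(X);                   % spectral centroid (Hz)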

The most successful spectral features used in speech are (i) Mel frequency cepstral coefficients (MFCC) and (ii) Perceptual Linear Prediction (PLP) features. It is well known that the basilar membrane in the inner ear actually analyzes the frequency content of the speech we hear, and this analysis can be modeled by a bank of constant-Q band-pass filters. There are also critical bands, which give rise to the phenomenon of masking, where one strong tone or burst can mask a weaker tone within the same critical band. Both MFCC and PLP capture these characteristics of our auditory system in some way; so, even though it looks strange, the same features give reasonably good performance for speech recognition, speaker recognition, language identification and even accent identification! However, these spectral features are not very robust to noise.
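To give a feel for the perceptual warping behind MFCC, the sketch below uses the commonly quoted mel-scale formula to place filterbank edges; the formula, the sampling rate and the number of filters are assumptions for illustration, not details taken from this post.

hz2mel = @(f) 2595 * log10(1 + f/700);     % Hz -> mel (commonly used approximation)
mel2hz = @(m) 700 * (10.^(m/2595) - 1);    % mel -> Hz
fs = 11000;                                % assumed sampling rate
num_filters = 20;                          % assumed number of mel filters
edges_mel = linspace(hz2mel(0), hz2mel(fs/2), num_filters + 2);
edges_hz = mel2hz(edges_mel);              % edges are denser at low frequencies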

On the other hand, some of the time domain (temporal) features, such as the plosion index and the maximum correlation coefficient, are relatively more robust to noise.

Wednesday, August 12, 2015

Remove silent regions in an audio signal

[ip, fs] = audioread('so1.wav');   % fs (sampling frequency) is returned by audioread
ip = ip(:, 1);                     % use the first channel in case the file is stereo
% plot(ip);
% step 1 - break the signal into frames of 0.04 seconds
frame_duration = 0.04;
frame_len = round(frame_duration * fs);
N = length(ip);
num_frames = floor(N / frame_len);

new_sig = zeros(N,1);
count=0;
for k = 1 : num_frames
    % extracting a frame of speech
    frame = ip( (k-1)*frame_len + 1 : frame_len*k );
    % step 2 - keep non-silent frames: those whose peak absolute
    % amplitude exceeds a threshold of 0.03
    max_val = max(abs(frame));   % abs() so that negative peaks are also detected
    if (max_val > 0.03)
        count = count + 1;
        new_sig((count-1)*frame_len + 1 : frame_len*count) = frame;
    end
end

new_sig(frame_len*count + 1 : end) = [];   % trim the unused tail of the buffer
plot(new_sig);