A Comparison Of Speech Coding With Linear Predictive Coding (LPC) And Code-Excited Linear Predictor Coding (CELP) By: Kendall Khodra Instructor: Dr. Kepuska
Introduction This project will develop Linear Predictive Coding (LPC) to process a speech signal. The objective is to mitigate the limited quality of the simple LPC model by using a more complex description of the excitation, Code-Excited Linear Prediction (CELP), and to compare its output with that of simple LPC.
Background Linear Predictive Coding (LPC) methods are among the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification, and speech storage. LPC has been considered one of the most powerful techniques for speech analysis. In fact, this technique is the basis of other more recent and sophisticated algorithms that are used for estimating speech parameters, e.g., pitch, formants, spectra, vocal-tract shape and low-bit-rate representations of speech.
The basic principle of linear prediction states that speech can be modeled as the output of a linear, time-varying system excited by either periodic pulses or random noise. These two kinds of acoustic sources are called voiced and unvoiced respectively. In this sense, voiced emissions are those generated by the vibration of the vocal cords in the presence of airflow, and unvoiced sounds are those generated when the vocal cords are relaxed.
A. Physical Model: • When you speak: • Air is pushed from your lungs through your vocal tract and out of your mouth comes speech.
For certain voiced sounds, your vocal cords (folds) vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration). • For certain fricative and plosive (or unvoiced) sounds your vocal cords do not vibrate but remain constantly open. • The shape of your vocal tract, which changes as you speak, determines the sound that you make. • The amount of air coming from your lungs determines the loudness of your voice.
B. Mathematical Model Block diagram of simplified mathematical model for speech production • The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white-noise sequence.
The relationship between the physical and the mathematical models:
• Vocal tract ↔ H(z) (LPC filter)
• Air ↔ u(n) (innovation)
• Vocal cord vibration ↔ V (voiced)
• Vocal cord vibration period ↔ T (pitch period)
• Fricatives and plosives ↔ UV (unvoiced)
where the vocal tract is represented by the system (transfer) function H(z).
The LPC Model The LPC method considers a speech sample s(n) at time n, and approximates it as a linear combination of the past p samples:

s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n) \qquad (1)

where G is the gain and u(n) the normalized excitation. The predictor coefficients (the a_k's) are determined (computed) by minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones (we will see later).
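As a small numerical illustration of equation (1) (the coefficients and samples below are made up for illustration, not values from this project), the predicted sample is simply a weighted sum of the most recent samples:
Matlab:
a    = [1.2, -0.5, 0.1];       % hypothetical 3rd-order predictor: a_1 ... a_p
past = [0.30, 0.25, 0.18];     % s(n-1), s(n-2), s(n-3)
s_pred = sum(a .* past);       % linear combination of past samples
% The term G*u(n) = s(n) - s_pred is what the excitation model must supply.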
Block diagram of an LPC In the LPC model the residual (excitation) is approximated during voicing by a quasi-periodic impulse train and during unvoicing by a white-noise sequence. This approximation is denoted by û(n). We then pass û(n), scaled by the gain G, through the synthesis filter 1/A(z).
LPC consists of the following steps • Pre-emphasis Filtering • Data Windowing • Autocorrelation Parameter Estimation • Pitch Period and Gain Estimation • Quantization • Decoding and Frame Interpolation
Pre-emphasis Filtering • When we speak, the speech signal experiences some spectral roll off due to the radiation effects of the sound from the mouth • As a result, the majority of the spectral energy is concentrated in the lower frequencies. • To have our model give equal weight to both low and high frequencies, we need to apply a high-pass filter to the original signal. • This is done with a one zero filter, called the pre-emphasis filter. The filter has the form: y[n] = 1 - a x[n] Most standards use a = 15/16 = .9375 ( our default) When we decode the speech, the last thing we do to each frame is to pass it through a de-emphasis filter to undo this effect. Matlab: speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
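A minimal sketch of the matching de-emphasis step at the decoder (the stand-in data vector is random here only to make the snippet self-contained; a = 0.9375 as above):
Matlab:
data = randn(1000, 1);                           % stand-in for speech samples
preemp = 15/16;                                  % a = 0.9375
emphasized = filter([1 -preemp], 1, data);       % pre-emphasis: y[n] = x[n] - a*x[n-1]
restored   = filter(1, [1 -preemp], emphasized); % de-emphasis (inverse one-pole filter)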
Data Windowing Because speech signals vary with time, this process is done on short chunks of the speech signal, which we call frames. Usually 30 to 50 ms frames give intelligible speech with good compression. • For implementation in this project we will use overlapping data frames to avoid discontinuities in the model. We used a frame width of 30 ms and an overlap of 10 ms. • A Hamming window was used to extract frames as shown below
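A minimal framing sketch under these settings (the 8 kHz sampling rate is an assumption, and data is the column vector of speech samples from the previous sketch):
Matlab:
fs = 8000;                            % assumed sampling rate
frameLen = round(0.030 * fs);         % 30 ms frame
step     = round(0.020 * fs);         % 20 ms advance -> 10 ms overlap
win = hamming(frameLen);
nFrames = floor((length(data) - frameLen) / step) + 1;
frames  = zeros(frameLen, nFrames);
for i = 1:nFrames
    idx = (i-1)*step + (1:frameLen);
    frames(:, i) = win .* data(idx);  % Hamming-windowed frame
end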
Determining Pitch Period For each frame, we must determine if the speech is voiced or unvoiced. We do this by searching for periodicities in the residual (prediction error) signal. To determine if the frame is voiced or unvoiced, we apply a threshold to the autocorrelation. Typically, this threshold is set at Rx(0) * 0.3. • If no values of the autocorrelation sequence exceed this threshold, then we declare the frame unvoiced. • If there are periodicities in the data, there should be spikes which exceed the threshold; in this case we declare the frame voiced. The distance between spikes in the autocorrelation function corresponds to the pitch period of the original signal.
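A sketch of this voiced/unvoiced decision (applied here to a windowed frame rather than the residual, for brevity; the 60-400 Hz pitch search range is an assumption, and frames and fs come from the sketch above):
Matlab:
frame = frames(:, 1);                            % one windowed frame
[acf, lags] = xcorr(frame);                      % autocorrelation of the frame
acf = acf(lags >= 0);                            % keep lags >= 0, so acf(1) = Rx(0)
minLag = round(fs/400); maxLag = round(fs/60);   % assumed pitch range 60-400 Hz
[peak, rel] = max(acf(minLag+1:maxLag+1));
if peak > 0.3 * acf(1)                           % threshold at Rx(0) * 0.3
    voiced = true;  pitchPeriod = minLag + rel - 1;  % lag of the spike (samples)
else
    voiced = false; pitchPeriod = 0;             % unvoiced frame
end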
LPC analyzes the speech signal by: • Estimating the formants • Removing their effects from the speech signal • Estimating the intensity and frequency of the remaining signal. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue. LPC synthesizes the speech signal by reversing the process: • Use the residue to create a source signal • Use the formants to create a filter (which represents the tube/tract) • Run the source through the filter, resulting in speech.
Estimating the Formants The coefficients of the difference equation (the prediction coefficients) characterize the formants. The LPC system needs to estimate these coefficients, which is done by minimizing the mean-square error between the predicted signal and the actual signal.
CELP (Code Excited Linear Predictor) A CELP coder does the same LPC modeling but then computes the errors between the original speech and the synthetic model and transmits both the model parameters and a very compressed representation of the errors (the compressed representation is an index into a 'code book' shared between coders and decoders -- this is why it is called "Code Excited"). A CELP coder does much more work than an LPC coder (usually about an order of magnitude more), but the result is much higher quality speech.
The perceptual weighting filter is defined as:

W(z) = \frac{A(z)}{A(z/r)}, \qquad 0 < r < 1

This filter is used to de-emphasize the frequency regions that correspond to the formants, as determined by the LPC analysis. The noise located in the formant regions, which is perceptibly more disturbing, can thereby be reduced. The de-emphasis is controlled by the factor r.
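A brief sketch of applying such a weighting, assuming the LPC polynomial A(z) is stored as a coefficient vector [1, a_1, ..., a_p] (the values below are stand-ins, not project data); the denominator A(z/r) is obtained by scaling each coefficient a_k by r^k:
Matlab:
a = [1 -1.2 0.5];            % stand-in LPC polynomial coefficients [1, a_1, ..., a_p]
r = 0.8;                     % assumed weighting factor, 0 < r < 1
p = length(a) - 1;
a_bw = a .* (r .^ (0:p));    % coefficients of A(z/r): a_k * r^k
e = randn(40, 1);            % stand-in error signal for one subframe
e_w = filter(a, a_bw, e);    % weighted error: applies W(z) = A(z)/A(z/r)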
After determining the formant synthesis filter 1/A(z), the pitch synthesis filter 1/P(z), and encoding data rate, we can do an excitation codebook search. The codebook search is performed in the subframes of an LPC frame. The subframe length is usually equal to or shorter than the pitch subframe length.
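A minimal brute-force codebook search for one subframe, under assumed sizes and a random stand-in codebook (the perceptual weighting filter is omitted here for brevity):
Matlab:
subLen   = 40;                          % assumed subframe length (samples)
codebook = randn(subLen, 128);          % hypothetical 128-entry excitation codebook
target   = randn(subLen, 1);            % stand-in target (residual) for this subframe
a        = [1 -1.2 0.5];                % stand-in LPC synthesis polynomial A(z)
bestErr = inf;
for j = 1:size(codebook, 2)
    synth = filter(1, a, codebook(:, j));          % candidate through 1/A(z)
    g     = (synth' * target) / (synth' * synth);  % optimal gain for this candidate
    err   = sum((target - g * synth).^2);          % squared error
    if err < bestErr
        bestErr = err; bestIdx = j; bestGain = g;  % transmit index + quantized gain
    end
end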
Autocorrelation Parameter Estimation The autocorrelation method assumes that the signal is identically zero outside the analysis interval (0<=m<=N-1). Then it tries to minimize the prediction error wherever it is nonzero, that is in the interval 0<=m<=N-1+p, where p is the order of the model used. The error is likely to be large at the beginning and at the end of this interval. This is the reason why the speech segment analyzed is usually tapered by the application of a Hamming Window.
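A sketch of computing the autocorrelation values R(0) ... R(p) of one windowed frame (these are the autoCorVec entries consumed by the Levinson recursion shown later; the order p = 10 is an assumption):
Matlab:
p = 10;                                   % assumed LPC order
frame = frames(:, 1);                     % a Hamming-windowed frame from above
autoCorVec = zeros(p + 1, 1);
for lag = 0:p
    autoCorVec(lag + 1) = frame(1:end-lag)' * frame(1+lag:end);   % R(lag)
end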
Finding the Parameters Our goal is to find the predictor coefficients a_k which minimize the squared prediction error in a short segment of speech. The mean short-time prediction error per frame is defined as:

E = \sum_m e^2(m) = \sum_m \left[ s(m) - \sum_{k=1}^{p} a_k\, s(m-k) \right]^2

To minimize this we take the derivative with respect to each a_i and set it to zero. This results in the equation:

\sum_{k=1}^{p} a_k \sum_m s(m-i)\, s(m-k) = \sum_m s(m-i)\, s(m), \qquad i = 1, \dots, p
Letting R(i,k) = \sum_m s(m-i)\, s(m-k), we have

\sum_{k=1}^{p} a_k\, R(i,k) = R(i,0), \qquad i = 1, \dots, p

which in matrix form is Ra = r.
This equation is solved using the Levinson-Durbin algorithm, which finds the filter coefficients a_i from the system Ra = r. The Levinson-Durbin algorithm reduces the cost of the solution from O(n^3) to O(n^2) by exploiting the fact that the matrix R is Toeplitz and Hermitian (symmetric for real speech data).
Matlab
% Levinson-Durbin recursion: solve Ra = r for the LPC coefficients
err(1) = autoCorVec(1);                         % zeroth-order prediction error = R(0)
k(1) = 0;
A = [];
for index = 1:L                                 % L = LPC order
    numerator   = [1 A.'] * autoCorVec(index+1:-1:2);
    denominator = -1 * err(index);
    k(index) = numerator / denominator;         % PARCOR coeffs
    A = [A + k(index)*flipud(A); k(index)];     % update predictor to order 'index'
    err(index+1) = (1 - k(index)^2) * err(index);   % updated prediction error
end
aCoeff(:, nframe) = [1; A];                     % LPC polynomial A(z) for this frame
parcor(:, nframe) = k';                         % reflection coefficients for this frame
Helpful Matlab tools used • synFrame = filter(1, A', residFrame); This filters the data in vector residFrame with the all-pole filter described by vector A • resid2 = dct(resid); This returns the discrete cosine transform coefficients of resid. Only the first 50 coefficients are kept, since most of the energy is stored there • resid3 = uencode(resid2, 4); This uniformly quantizes and encodes the data in the vector resid2 using N bits (here N = 4) • newsignal = udecode(resid3, 4); This reverses uencode, mapping the quantized values back to amplitudes
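A round-trip sketch tying these tools together (the residual here is random stand-in data, and the normalization step is an added assumption so that the default [-1, 1] input range of uencode is respected):
Matlab:
resid  = randn(240, 1);                  % stand-in residual for one frame
resid2 = dct(resid);                     % DCT of the residual
resid2(51:end) = 0;                      % keep only the first 50 coefficients
peak   = max(abs(resid2));
resid3 = uencode(resid2 / peak, 4);      % normalize, then 4-bit uniform quantization
resid4 = udecode(resid3, 4) * peak;      % decode and restore the original scale
residHat = idct(resid4);                 % approximate residual after the round trip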
It can be seen from the waveforms that the CELP output looks much more like the original signal, and hence CELP is a better method for speech coding. This is also evident from the log-magnitude spectrum. • The voice synthesized from the linear-prediction waveform is peaky and sounds buzzy, since the autocorrelation method on which it is based loses the absolute phase structure because of its minimum-phase characteristics.
Results [Audio samples] Male voice and female voice: original signal, LPC signal, and CELP signal (4-bit and 8-bit encoding).
Drawbacks The LPC method has inherent (quantization) errors and in most cases does not give an accurate solution. The tapering effect of the window (a Hamming window was used) also introduces error, since the waveform may not follow the assumed all-pole model. However, the tapering of the window has the advantage that the least-squares error in finding the solution is reduced.
Conclusion Comparing the original speech against the LPC speech and the CELP speech: in both cases, the reconstructed speech has lower quality than the input speech. Both reconstructions sound noisy, with the LPC model being nearly unintelligible; the sound seems whispered, with an extensive amount of noise. The CELP-reconstructed speech sounds more spoken and less whispered. In all, the CELP speech sounded closer to the original one, though still with a muffled sound.
Further investigation MELP • The MELP (Mixed-Excitation Linear Predictive) vocoder is the new 2400 bps Federal Standard speech coder. • It is robust in difficult background-noise environments such as those frequently encountered in commercial and military communication systems. • It is very efficient in its computational requirements. • The MELP vocoder is based on the traditional LPC parametric model, but also includes four additional features. These are mixed excitation, aperiodic pulses, pulse dispersion, and adaptive spectral enhancement.
The mixed excitation is implemented using a multi-band mixing model. The primary effect of this multi-band mixed excitation is to reduce the buzz usually associated with LPC vocoders, especially in broadband acoustic noise. • It requires an explicit multi-band voicing decision and source characterization.
References:
[1] J. L. Flanagan and L. R. Rabiner, Speech Synthesis, Dowden, Hutchinson & Ross, Inc., Stroudsburg, Pennsylvania, 1973.
[2] Z. Li and M. Drew, Fundamentals of Multimedia, Prentice Hall, 2003.
[3] Atlanta Signal Processors, Inc., The New 2400 bps Federal Standard Speech Coder (http://www.aspi.com/tech/specs/pdfs/melp.pdf)