
A 12-WEEK PROJECT IN Speech Coding and Recognition



  1. A 12-WEEK PROJECT IN Speech Coding and Recognition by Fu-Tien Hsiao and Vedrana Andersen

  2. Overview • An Introduction to Speech Signals (Vedrana) • Linear Prediction Analysis (Fu) • Speech Coding and Synthesis (Fu) • Speech Recognition (Vedrana)

  3. Speech Coding and Recognition AN INTRODUCTION TO SPEECH SIGNALS

  4. AN INTRODUCTION TO SPEECH SIGNALS: Speech Production • Flow of air from lungs • Vibrating vocal cords • Speech production cavities • Lips • Sound wave • Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)

  5. AN INTRODUCTION TO SPEECH SIGNALS: Speech Signals • Sampling frequency 8–16 kHz • Short-time stationary assumption (frames 20–40 ms)
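
The framing implied by the short-time stationarity assumption can be sketched in a few lines (an illustrative snippet, not part of the original project; the 30 ms frame and 15 ms hop are assumed values inside the stated 20–40 ms range):

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=30, hop_ms=15):
    """Split a signal into short overlapping frames, inside which the
    speech is treated as stationary."""
    frame_len = int(fs * frame_ms / 1000)   # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)           # 120 samples at 8 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.default_rng(0).standard_normal(8000)  # one second at 8 kHz
frames = frame_signal(x)                            # shape (65, 240)
```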

  6. AN INTRODUCTION TO SPEECH SIGNALS: Model for Speech Production • Excitation (periodic, noisy) • Vocal tract filter (nasal cavity, oral cavity, pharynx)

  7. AN INTRODUCTION TO SPEECH SIGNALS: Voiced and Unvoiced Sounds • Voiced sounds: periodic excitation, pitch period • Unvoiced sounds: noise-like excitation • Short-time measures: power and zero-crossing rate
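
The two short-time measures named above can be sketched directly (illustrative code, not from the project; the 100 Hz tone and the noise frame are assumed stand-ins for voiced and unvoiced speech):

```python
import numpy as np

def short_time_power(frame):
    """Mean squared amplitude of one frame (high for voiced speech)."""
    return np.mean(frame ** 2)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ
    (high for noise-like, unvoiced speech)."""
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2

fs = 8000
t = np.arange(240) / fs                                    # one 30 ms frame
voiced = np.sin(2 * np.pi * 100 * t)                       # periodic: low ZCR
unvoiced = np.random.default_rng(0).standard_normal(240)   # noisy: high ZCR
```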

  8. AN INTRODUCTION TO SPEECH SIGNALS: Frequency Domain • Pitch, harmonics (excitation) • Formants, envelope (vocal tract filter) • Harmonic product spectrum
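
A minimal harmonic product spectrum pitch estimator might look like this (a sketch only, not the project's code; the FFT size, harmonic count, and 50 Hz lower search bound are assumptions):

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=3, nfft=4096):
    """Estimate pitch via the harmonic product spectrum: multiply the
    magnitude spectrum by its downsampled copies, so that only the
    fundamental survives as a strong common peak."""
    spec = np.abs(np.fft.rfft(frame, nfft))
    hps = spec.copy()
    for h in range(2, n_harmonics + 1):
        n = len(spec) // h
        hps[:n] *= spec[::h][:n]
    lo = int(50 * nfft / fs)                   # ignore bins below 50 Hz
    peak = lo + np.argmax(hps[lo : len(spec) // n_harmonics])
    return peak * fs / nfft

fs = 8000
t = np.arange(800) / fs                        # a 100 ms frame
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 4))
pitch = hps_pitch(frame, fs)                   # close to 200 Hz
```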

  9. AN INTRODUCTION TO SPEECH SIGNALS: Speech Spectrograms • Time-varying formant structure • Narrowband / wideband

  10. Speech Coding and Recognition LINEAR PREDICTION ANALYSIS

  11. LINEAR PREDICTION ANALYSIS: Categories • Vocal Tract Filter • Linear Prediction Analysis • Error Minimization • Levinson-Durbin Recursion • Residual sequence u(n)

  12. LINEAR PREDICTION ANALYSIS: Vocal Tract Filter (1) • Vocal tract filter • What if we assume an all-pole filter? [Diagram: input, periodic impulse train → vocal tract filter → output, speech]

  13. LINEAR PREDICTION ANALYSIS: Vocal Tract Filter (2) • Autoregressive model (all-pole filter): s(n) = a1·s(n−1) + … + ap·s(n−p) + Au_g(n), where p is called the model order • Speech is a linear combination of past samples and an extra part, Au_g(n)

  14. LINEAR PREDICTION ANALYSIS: Linear Prediction Analysis (1) • Goal: how do we find the coefficients ak in this all-pole model? [Diagram: physical model vs. analysis system; impulse Au_g(n) → all-pole model → speech s(n); speech s(n) → unknown system '?' → error e(n)] • ak here is fixed but unknown; we try to find αk to estimate ak

  15. LINEAR PREDICTION ANALYSIS: Linear Prediction Analysis (2) • What is really inside the '?' box? • A predictor (P(z), an FIR filter), where ŝ(n) = α1·s(n−1) + α2·s(n−2) + … + αp·s(n−p) • Predictive error: e(n) = s(n) − ŝ(n), so the analysis filter is A(z) = 1 − P(z) • If αk ≈ ak, then e(n) ≈ Au_g(n) [Diagram: original s(n) → P(z) → predicted ŝ(n); subtraction gives e(n)]

  16. LINEAR PREDICTION ANALYSIS: Linear Prediction Analysis (3) • If we can find a predictor generating the smallest error e(n), one close to Au_g(n), then we can use A(z) to estimate the filter coefficients • The resulting synthesis filter 1/A(z) is then very similar to the vocal tract model [Diagram: e(n) ≈ Au_g(n) → 1/A(z) → ŝ(n)]

  17. LINEAR PREDICTION ANALYSIS: Error Minimization (1) • Problem: how to find the minimum error? • Energy of the error: E = Σn e(n)², where e(n) = s(n) − ŝ(n) is a function of the αi • Since E is a quadratic function of the αi, we can find its smallest value by setting ∂E/∂αi = 0 for each i

  18. LINEAR PREDICTION ANALYSIS: Error Minimization (2) • Differentiation gives a set of linear equations: Σk αk·R(|i−k|) = R(i) for i = 1, …, p • We define R(i) = Σn s(n)·s(n+i), which is actually the autocorrelation of s(n)

  19. LINEAR PREDICTION ANALYSIS: Error Minimization (3) • Hence, in matrix form the linear equations read R·α = r, with Rij = R(|i−j|) and ri = R(i) • The linear prediction coefficient vector α is our goal • How do we solve it efficiently?

  20. LINEAR PREDICTION ANALYSIS: Levinson-Durbin Recursion (1) • The L-D recursion method is based on the following characteristics of the matrix: • Symmetric • Toeplitz • Hence we can solve the system in O(p²) instead of O(p³) • Don't forget our objective, which is to find αk to simulate the vocal tract filter.
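
The O(p²) recursion can be sketched as below (a textbook-style implementation, not the project's own code; `r` holds the autocorrelation values R(0)…R(p), and the test signal is an assumed stand-in frame):

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the symmetric Toeplitz normal equations for the LP
    coefficients in O(p^2); returns (alpha, prediction error energy)."""
    a = np.zeros(p)
    err = r[0]
    for i in range(p):
        # reflection coefficient for order i + 1
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]   # update lower-order coeffs
        a = a_new
        err *= (1.0 - k ** 2)                 # error energy shrinks per order
    return a, err

rng = np.random.default_rng(0)
s = rng.standard_normal(400)                  # stand-in speech frame
p = 4
r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
alpha, err = levinson_durbin(r, p)
```

Solving the same Toeplitz system by 'brute force' with a general linear solver gives identical coefficients, matching the comparison made in the exercise.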

  21. LINEAR PREDICTION ANALYSIS: Levinson-Durbin Recursion (2) • In the exercise we solve the system both by 'brute force' and by L-D recursion; there is no difference in the resulting parameters • Error energy vs. predictor order

  22. LINEAR PREDICTION ANALYSIS: Residual Sequence u(n) • Once the filter coefficients are known, we can find the residual sequence u(n) by inverse filtering [Diagram: s(n) → A(z) → u(n)] • Compare the original s(n) with the residual u(n)
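
Inverse filtering through A(z) = 1 − Σk αk·z⁻ᵏ can be sketched as follows (illustrative code; the order-1 filter and impulse-train excitation in the sanity check are an assumed toy example):

```python
import numpy as np

def inverse_filter(s, alpha):
    """Pass s(n) through A(z) = 1 - sum_k alpha_k z^-k to get the
    residual u(n) = s(n) - sum_k alpha_k s(n-k)."""
    u = np.array(s, dtype=float)
    for k in range(1, len(alpha) + 1):
        u[k:] -= alpha[k - 1] * s[:-k]
    return u

# sanity check: synthesize s(n) with 1/A(z), then invert back
alpha = [0.9]                      # toy order-1 vocal tract filter
e = np.zeros(50)
e[::10] = 1.0                      # impulse-train excitation
s = np.zeros(50)
for n in range(len(s)):
    s[n] = e[n] + (0.9 * s[n - 1] if n >= 1 else 0.0)
u = inverse_filter(s, alpha)       # recovers the excitation e(n)
```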

  23. Speech Coding and Recognition SPEECH CODING AND SYNTHESIS

  24. SPEECH CODING AND SYNTHESIS: Categories • Analysis-by-Synthesis • Perceptual Weighting Filter • Linear Predictive Coding • Multi-Pulse Linear Prediction • Code-Excited Linear Prediction (CELP) • CELP Experiment • Quantization

  25. SPEECH CODING AND SYNTHESIS: Analysis-by-Synthesis (1) • Analyze the speech by estimating an LP synthesis filter • Compute a residual sequence as an excitation signal to reconstruct the signal • Encoder/decoder: parameters such as the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded

  26. SPEECH CODING AND SYNTHESIS: Analysis-by-Synthesis (2) [Diagram: encoder; s(n) → LP analysis → LP parameters to channel; excitation generator → LP synthesis filter → ŝ(n); error e(n) = s(n) − ŝ(n) feeds error minimization; excitation parameters to channel] • Processing is frame by frame • Reconstruction without error minimization vs. with error minimization

  27. SPEECH CODING AND SYNTHESIS: Perceptual Weighting Filter (1) • Perceptual masking effect: within the formant regions, the ear is less sensitive to noise • Idea: design a filter that de-emphasizes the error in the formant regions • Result: synthetic speech with more error near the formant peaks but less error elsewhere

  28. SPEECH CODING AND SYNTHESIS: Perceptual Weighting Filter (2) • In the frequency domain: LP synthesis filter vs. PW filter • Perceptual weighting coefficient α: • α = 1: no filtering • As α decreases, more filtering • The optimal α depends on perception

  29. SPEECH CODING AND SYNTHESIS: Perceptual Weighting Filter (3) • In the z-domain, LP filter vs. PW filter • Numerator: generates zeros at the original poles of the LP synthesis filter • Denominator: places the poles closer to the origin; α determines the distance
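
The standard CELP weighting filter W(z) = A(z)/A(z/α) has exactly the numerator/denominator structure described above, and its coefficients can be written down directly (an illustrative sketch; the example coefficients are assumed):

```python
import numpy as np

def pw_filter(a, alpha):
    """Coefficient arrays (numerator, denominator) of the perceptual
    weighting filter W(z) = A(z) / A(z/alpha), A(z) = 1 - sum_k a_k z^-k.
    The numerator keeps zeros at the LP synthesis poles; the denominator
    scales every root by alpha, pulling the poles toward the origin."""
    a = np.asarray(a, dtype=float)
    k = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a))                  # A(z)
    den = np.concatenate(([1.0], -a * alpha ** k))     # A(z/alpha)
    return num, den

num, den = pw_filter([1.2, -0.6], alpha=0.8)
```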

  30. SPEECH CODING AND SYNTHESIS: Linear Predictive Coding (1) • Builds on the methods above: the PW filter and analysis-by-synthesis • If the excitation signal ≈ an impulse train during voicing, we can get a reconstructed signal very close to the original • More often, however, the residual is far from an impulse train

  31. SPEECH CODING AND SYNTHESIS: Linear Predictive Coding (2) • Hence there are many coding schemes trying to improve on this • They differ primarily in the type of excitation signal • Two kinds: • Multi-Pulse Linear Prediction • Code-Excited Linear Prediction (CELP)

  32. SPEECH CODING AND SYNTHESIS: Multi-Pulse Linear Prediction (1) • Concept: represent the residual sequence by placing impulses so as to make ŝ(n) closer to s(n) [Diagram: s(n) → LP analysis; excitation generator → multi-pulse u(n) → LP synthesis filter → ŝ(n); the difference passes through the PW filter to error minimization]

  33. SPEECH CODING AND SYNTHESIS: Multi-Pulse Linear Prediction (2) [Figure: original, multi-pulse, synthetic, and error signals, annotated with steps s1–s4] • s1: Estimate the LPC filter without excitation • s2: Place one impulse (placement and amplitude) • s3: A new error is determined • s4: Repeat s2–s3 until reaching a desired minimum error

  34. SPEECH CODING AND SYNTHESIS: Code-Excited Linear Prediction (1) • The differences: • Represent the residual v(n) by codewords (found by exhaustive search) from a codebook of zero-mean Gaussian sequences • Consider primary pitch pulses, which are predictable over consecutive periods
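
A much-simplified exhaustive search can be sketched as below; note that a real CELP coder evaluates each codeword through the weighted synthesis filter, whereas this toy version matches codewords to a target excitation directly (codebook size, dimensions, and the scaled-codeword target are assumptions):

```python
import numpy as np

def search_codebook(target, codebook):
    """Exhaustive search: for each codeword compute the least-squares
    gain, keep the (index, gain) pair minimizing the squared error."""
    best = (0, 0.0, np.inf)
    for i, c in enumerate(codebook):
        g = np.dot(target, c) / np.dot(c, c)      # optimal gain for codeword c
        err = np.sum((target - g * c) ** 2)
        if err < best[2]:
            best = (i, g, err)
    return best[0], best[1]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 40))          # 16 zero-mean Gaussian codewords
target = 2.0 * codebook[5]                        # a scaled codeword as the target
idx, gain = search_codebook(target, codebook)     # recovers index 5, gain 2
```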

  35. SPEECH CODING AND SYNTHESIS: Code-Excited Linear Prediction (2) [Diagram: s(n) → LP analysis → LP parameters; Gaussian excitation codebook / multi-pulse generator → v(n) → pitch synthesis filter (pitch estimate) → u(n) → LP synthesis filter → ŝ(n); the difference from s(n) passes through the PW filter to error minimization]

  36. SPEECH CODING AND SYNTHESIS: CELP Experiment (1) • A CELP experiment [Figure: original signal (blue), excitation signal (below), reconstructed signal (green)]

  37. SPEECH CODING AND SYNTHESIS: CELP Experiment (2) • Test the quality for different settings: • LPC model order: initial M = 10, test M = 2 • PW coefficient

  38. SPEECH CODING AND SYNTHESIS: CELP Experiment (3) • Codebook (L, K) • K: codebook size; K strongly influences the computation time (reducing K from 1024 to 256 cuts the time from 13 to 6 seconds) • L: length of the random signal; L determines the number of subblocks in the frame • Initial setting (40, 1024), test setting (40, 16)

  39. SPEECH CODING AND SYNTHESIS: Quantization • With quantization: 16000 bps CELP vs. 9600 bps CELP • Trade-off: bandwidth efficiency vs. speech quality

  40. Speech Coding and Recognition SPEECH RECOGNITION

  41. SPEECH RECOGNITION: Dimensions of Difficulty • Speaker dependent / independent • Vocabulary size (small, medium, large) • Discrete words / continuous utterance • Quiet / noisy environment

  42. SPEECH RECOGNITION: Feature Extraction • Overlapping frames • Feature vector for each frame • Mel-cepstrum, difference cepstrum, energy, difference energy

  43. SPEECH RECOGNITION: Vector Quantization • Vector quantization • K-means algorithm • Observation sequence for the whole word
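
The K-means codebook training mentioned above can be sketched with plain Lloyd iterations (illustrative code, not the project's own; the two-cluster toy data, codebook size, and iteration count are assumptions):

```python
import numpy as np

def kmeans_codebook(X, K, iters=20, seed=0):
    """Train a VQ codebook with Lloyd's k-means iterations."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), K, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each feature vector to the nearest codeword
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each codeword to the centroid of its cell
        for k in range(K):
            if np.any(labels == k):
                codebook[k] = X[labels == k].mean(axis=0)
    return codebook, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),      # two well-separated
               rng.normal(5.0, 0.1, (20, 2))])     # feature clusters
codebook, labels = kmeans_codebook(X, K=2)
```

Quantizing each frame's feature vector to its nearest codeword index then yields the discrete observation sequence for the whole word.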

  44. SPEECH RECOGNITION: Hidden Markov Model (1) • Changing states, emitting symbols • Model parameters: π(1), A, B [Diagram: left-to-right model with states 1 2 3 4 5]

  45. SPEECH RECOGNITION: Hidden Markov Model (2) • Probability of transition • State transition matrix • State probability vector • State equation

  46. SPEECH RECOGNITION: Hidden Markov Model (3) • Probability of observing • Observation probability matrix • Observation probability vector • Observation equation

  47. SPEECH RECOGNITION: Hidden Markov Model (4) • Discrete observation hidden Markov model • Two HMM problems: • Training problem • Recognition problem

  48. SPEECH RECOGNITION: Recognition using HMM (1) [Figure: trellis of states (1, 2, 3) against time] • Determining the probability that a given HMM produced the observation sequence • Straightforward computation considers all possible paths, S^T of them

  49. SPEECH RECOGNITION: Recognition using HMM (2) • Forward-backward algorithm; only the forward part is needed • Forward partial observation • Forward probability

  50. SPEECH RECOGNITION: Recognition using HMM (3) • Initialization • Recursion • Termination
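
The initialization/recursion/termination steps of the forward algorithm can be sketched directly (an illustrative implementation; the one-state fair-coin model is an assumed toy example):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | model) in O(S^2 T)
    rather than summing over all S^T state paths.

    pi  : initial state probabilities        (S,)
    A   : state transition matrix A[i, j]    (S, S)
    B   : observation probabilities B[i, o]  (S, M)
    obs : sequence of observation indices"""
    alpha = pi * B[:, obs[0]]            # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # recursion over time
    return alpha.sum()                   # termination

# toy model: one state emitting two symbols with probability 1/2 each
pi = np.array([1.0])
A = np.array([[1.0]])
B = np.array([[0.5, 0.5]])
p = forward(pi, A, B, [0, 1, 0])         # (1/2)^3 = 0.125
```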
