A 12-WEEK PROJECT IN Speech Coding and Recognition by Fu-Tien Hsiao and Vedrana Andersen
Overview • An Introduction to Speech Signals (Vedrana) • Linear Prediction Analysis (Fu) • Speech Coding and Synthesis (Fu) • Speech Recognition (Vedrana)
Speech Coding and Recognition AN INTRODUCTION TO SPEECH SIGNALS
AN INTRODUCTION TO SPEECH SIGNALS: Speech Production • Flow of air from lungs • Vibrating vocal cords • Speech production cavities • Lips • Sound wave • Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)
AN INTRODUCTION TO SPEECH SIGNALS: Speech Signals • Sampling frequency 8–16 kHz • Short-time stationary assumption (frames of 20–40 ms)
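The short-time stationary assumption is usually put into practice by cutting the signal into overlapping frames. A minimal sketch in NumPy; the frame length and hop are illustrative choices (256 samples at 8 kHz is a 32 ms frame with 50% overlap):

```python
import numpy as np

def split_into_frames(signal, frame_len, hop):
    """Cut a 1-D signal into overlapping short-time frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

fs = 8000                                            # 8 kHz sampling, as above
x = np.random.default_rng(0).standard_normal(fs)     # 1 s of noise as a stand-in
frames = split_into_frames(x, frame_len=256, hop=128)
print(frames.shape)  # (61, 256)
```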
AN INTRODUCTION TO SPEECH SIGNALS: Model for Speech Production • Excitation (periodic, noisy) • Vocal tract filter (nasal cavity, oral cavity, pharynx)
AN INTRODUCTION TO SPEECH SIGNALS: Voiced and Unvoiced Sounds • Voiced sounds: periodic excitation, pitch period • Unvoiced sounds: noise-like excitation • Short-time measures: power and zero-crossing
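The two short-time measures can be sketched as follows; the "voiced" and "unvoiced" frames here are synthetic stand-ins (a sinusoid and low-amplitude noise):

```python
import numpy as np

def short_time_power(frame):
    return np.mean(frame ** 2)

def zero_crossing_rate(frame):
    # fraction of adjacent sample pairs whose sign changes
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2

fs = 8000
t = np.arange(256) / fs
voiced = np.sin(2 * np.pi * 100 * t)                            # periodic stand-in
unvoiced = 0.1 * np.random.default_rng(0).standard_normal(256)  # noise stand-in

print(zero_crossing_rate(voiced) < zero_crossing_rate(unvoiced))  # True
print(short_time_power(voiced) > short_time_power(unvoiced))      # True
```

Voiced frames tend to have high power and a low zero-crossing rate; unvoiced frames show the opposite pattern, which is what makes the pair useful for a rough voiced/unvoiced decision.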
AN INTRODUCTION TO SPEECH SIGNALS: Frequency Domain • Pitch, harmonics (excitation) • Formants, envelope (vocal tract filter) • Harmonic product spectrum
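A rough sketch of pitch estimation with the harmonic product spectrum. The test frame is a made-up harmonic signal; the window length and the number of downsampled spectra are illustrative choices:

```python
import numpy as np

def hps_pitch(frame, fs, n_spectra=3):
    """Pitch estimate via the harmonic product spectrum: multiply the
    magnitude spectrum with its 2x- and 3x-downsampled copies, so only
    the fundamental keeps support at every harmonic."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    hps = spec.copy()
    for k in range(2, n_spectra + 1):
        n = len(spec) // k
        hps[:n] *= spec[::k][:n]
    lo = int(50 * len(frame) / fs)            # skip the near-DC region
    peak = lo + int(np.argmax(hps[lo : len(spec) // n_spectra]))
    return peak * fs / len(frame)

fs = 8000
t = np.arange(2048) / fs
# synthetic voiced frame: 200 Hz fundamental with two weaker harmonics
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
print(f"estimated pitch: {hps_pitch(x, fs):.0f} Hz")
```

The estimate is quantized to the FFT bin spacing (fs / N ≈ 3.9 Hz here), so it lands near, not exactly on, 200 Hz.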
AN INTRODUCTION TO SPEECH SIGNALS: Speech Spectrograms • Time-varying formant structure • Narrowband / wideband
Speech Coding and Recognition LINEAR PREDICTION ANALYSIS
LINEAR PREDICTION ANALYSIS: Categories • Vocal Tract Filter • Linear Prediction Analysis • Error Minimization • Levinson-Durbin Recursion • Residual sequence u(n)
LINEAR PREDICTION ANALYSIS: Vocal Tract Filter (1) • Vocal tract filter: what if we assume an all-pole filter? • Input: periodic impulse train; output: speech
LINEAR PREDICTION ANALYSIS: Vocal Tract Filter (2) • Autoregressive model (all-pole filter), where p is called the model order • Speech is a linear combination of past samples plus an excitation term Au_g(z)
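Using the conventional LPC notation (gain A, glottal input u_g, matching the Au_g term above), the autoregressive model can be written as:

```latex
s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + A\,u_g(n)
\qquad\Longleftrightarrow\qquad
\frac{S(z)}{A\,U_g(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}
```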
LINEAR PREDICTION ANALYSIS: Linear Prediction Analysis (1) • Goal: how to find the coefficients a_k in this all-pole model? • Physical model vs. analysis system: the a_k of the physical model are fixed but unknown; we try to find α_k to estimate a_k • [Diagram: impulse Au_g(n) drives the all-pole model to produce speech s(n); the analysis system produces an error e(n)]
LINEAR PREDICTION ANALYSIS: Linear Prediction Analysis (2) • What is really inside the ? box? • A predictor (P(z), an FIR filter), where ŝ(n) = α_1 s(n−1) + α_2 s(n−2) + … + α_p s(n−p) • Prediction error: e(n) = s(n) − ŝ(n), with A(z) = 1 − P(z) • If α_k ≈ a_k, then e(n) ≈ Au_g(n)
LINEAR PREDICTION ANALYSIS: Linear Prediction Analysis (3) • If we can find a predictor whose error e(n) is smallest, so that e(n) ≈ Au_g(n), then we can use A(z) to estimate the filter coefficients • The synthesis filter 1/A(z) driven by e(n) is then very similar to the vocal tract model
LINEAR PREDICTION ANALYSIS: Error Minimization (1) • Problem: how to find the minimum error? • Energy of the error: E = Σ_n e²(n), where e(n) = s(n) − ŝ(n) is a function of the α_i • E is a quadratic function of the α_i, so the minimum is found by setting ∂E/∂α_i = 0 for each i
LINEAR PREDICTION ANALYSIS: Error Minimization (2) • Differentiation yields Σ_{k=1..p} α_k R(i−k) = R(i) for i = 1, …, p • where R(i) = Σ_n s(n) s(n−i); this is the autocorrelation of s(n) • A set of p linear equations
LINEAR PREDICTION ANALYSIS: Error Minimization (3) • Hence, in matrix form: Rα = r • The linear prediction coefficient vector α is our goal • How do we solve it efficiently?
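Written out with the autocorrelation values R(i), the system Rα = r takes the standard autocorrelation-method form:

```latex
\begin{bmatrix}
R(0) & R(1) & \cdots & R(p-1) \\
R(1) & R(0) & \cdots & R(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
R(p-1) & R(p-2) & \cdots & R(0)
\end{bmatrix}
\begin{bmatrix}\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p\end{bmatrix}
=
\begin{bmatrix}R(1) \\ R(2) \\ \vdots \\ R(p)\end{bmatrix}
```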
LINEAR PREDICTION ANALYSIS: Levinson-Durbin Recursion (1) • The L-D recursion exploits two properties of the matrix R: • Symmetric • Toeplitz • Hence we can solve the system in O(p²) instead of O(p³) • Don't forget our objective: find the α_k that simulate the vocal tract filter
LINEAR PREDICTION ANALYSIS: Levinson-Durbin Recursion (2) • In the exercise we solve the system both by brute force and by L-D recursion; the resulting parameters are identical • Error energy vs. predictor order
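A sketch of that comparison on a hypothetical AR(1)-like test signal: `levinson_durbin` follows the classic recursion and is checked against a direct O(p³) linear solve.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the p-th order LP normal equations from autocorrelation
    r[0..p] in O(p^2) using the Levinson-Durbin recursion."""
    a = np.zeros(p + 1)      # a[1..p] hold the current LP coefficients
    e = r[0]                 # prediction error energy
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i-1:0:-1])) / e   # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i-1:0:-1]
        e *= 1.0 - k * k
    return a[1:]

# hypothetical test signal: first-order recursion driven by white noise
rng = np.random.default_rng(0)
w = rng.standard_normal(4000)
s = np.zeros_like(w)
for n in range(len(w)):
    s[n] = 0.9 * s[n - 1] + w[n]

p = 8
r = np.array([np.dot(s[:len(s) - i], s[i:]) for i in range(p + 1)])
# brute force: build the full Toeplitz system and solve it directly
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
alpha_bf = np.linalg.solve(R, r[1:])
alpha_ld = levinson_durbin(r, p)
print(np.allclose(alpha_bf, alpha_ld))  # True
```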
LINEAR PREDICTION ANALYSIS: Residual sequence u(n) • Once the filter coefficients are known, we can find the residual sequence u(n) by inverse filtering: passing s(n) through A(z) • Compare the original s(n) with the residual u(n)
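Inverse filtering is just running s(n) through the FIR filter A(z). A sketch with a made-up first-order example, where the residual recovers the excitation exactly:

```python
import numpy as np

def residual(s, alpha):
    """Inverse filtering: pass s(n) through A(z) = 1 - sum_k alpha_k z^-k."""
    a = np.concatenate(([1.0], -np.asarray(alpha, dtype=float)))
    return np.convolve(s, a)[: len(s)]

# hypothetical example: s is generated by 1/A(z) with a_1 = 0.9,
# so inverse filtering with the true coefficient must recover u exactly
rng = np.random.default_rng(1)
u = rng.standard_normal(1000)
s = np.zeros_like(u)
for n in range(len(u)):
    s[n] = 0.9 * s[n - 1] + u[n]
u_hat = residual(s, [0.9])
print(np.allclose(u_hat, u))  # True
```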
Speech Coding and Recognition SPEECH CODING AND SYNTHESIS
SPEECH CODING AND SYNTHESIS: Categories • Analysis-by-Synthesis • Perceptual Weighting Filter • Linear Predictive Coding • Multi-Pulse Linear Prediction • Code-Excited Linear Prediction (CELP) • CELP Experiment • Quantization
SPEECH CODING AND SYNTHESIS: Analysis-by-Synthesis (1) • Analyze the speech by estimating an LP synthesis filter • Compute a residual sequence as an excitation signal to reconstruct the signal • Encoder/decoder: parameters such as the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded
SPEECH CODING AND SYNTHESIS: Analysis-by-Synthesis (2) • [Encoder block diagram: LP analysis yields the LP parameters; an excitation generator drives the LP synthesis filter to produce ŝ(n); error minimization on e(n) = s(n) − ŝ(n) selects the excitation parameters; LP and excitation parameters go to the channel] • Frame by frame • Without error minimization: • With error minimization:
SPEECH CODING AND SYNTHESIS: Perceptual Weighting Filter (1) • Perceptual masking effect: within the formant regions, one is less sensitive to noise • Idea: design a filter that de-emphasizes the error in the formant regions • Result: synthetic speech with more error near the formant peaks but less error elsewhere
SPEECH CODING AND SYNTHESIS: Perceptual Weighting Filter (2) • In the frequency domain: LP synthesis filter vs. PW filter • Perceptual weighting coefficient α: • α = 1: no filtering • As α decreases, more filtering • The optimal α depends on perception
SPEECH CODING AND SYNTHESIS: Perceptual Weighting Filter (3) • In the z domain: LP filter vs. PW filter • Numerator: generates zeros at the original poles of the LP synthesis filter • Denominator: places the poles closer to the origin; α determines the distance
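A sketch of the pole-shrinking, assuming the common weighting form W(z) = A(z)/A(z/α): scaling z by α multiplies the k-th coefficient by α^k and pulls every root toward the origin by the factor α. The second-order A(z) is a made-up example with a pole pair at radius 0.95:

```python
import numpy as np

def perceptual_weighting(lpc_a, alpha):
    """Build W(z) = A(z) / A(z/alpha) from A(z) coefficients
    [1, -a_1, ..., -a_p]; returns (numerator, denominator)."""
    k = np.arange(len(lpc_a))
    num = lpc_a                  # zeros at the original A(z) roots
    den = lpc_a * alpha ** k     # roots pulled toward the origin by alpha
    return num, den

# made-up 2nd-order A(z) with a complex pole pair at radius 0.95
A = np.array([1.0, -1.3, 0.9025])
num, den = perceptual_weighting(A, alpha=0.8)
print(np.abs(np.roots(den)) / np.abs(np.roots(A)))  # each ratio equals alpha
```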
SPEECH CODING AND SYNTHESIS: Linear Predictive Coding (1) • Based on the methods above: the PW filter and analysis-by-synthesis • If the excitation signal ≈ an impulse train during voicing, we can get a reconstructed signal very close to the original • More often, however, the residue is far from an impulse train
SPEECH CODING AND SYNTHESIS: Linear Predictive Coding (2) • Hence, many kinds of coding try to improve on this • They differ primarily in the type of excitation signal • Two kinds: • Multi-Pulse Linear Prediction • Code-Excited Linear Prediction (CELP)
SPEECH CODING AND SYNTHESIS: Multi-Pulse Linear Prediction (1) • Concept: represent the residual sequence by placing impulses so as to make ŝ(n) closer to s(n) • [Block diagram: LP analysis of s(n); a multi-pulse excitation u(n) drives the LP synthesis filter to give ŝ(n); the error is perceptually weighted and minimized]
SPEECH CODING AND SYNTHESIS: Multi-Pulse Linear Prediction (2) • s1: Estimate the LPC filter without excitation • s2: Place one impulse (position and amplitude) • s3: Determine the new error • s4: Repeat s2-s3 until a desired minimum error is reached • [Plot: original vs. multi-pulse synthetic signal and the error]
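Steps s2-s4 amount to a greedy search. A toy sketch, with a made-up decaying impulse response standing in for the (weighted) synthesis filter; at each step the impulse position and amplitude that most reduce the remaining error are chosen:

```python
import numpy as np

def shifted_response(h, m, N):
    """Impulse at position m passed through the synthesis filter h."""
    delta = np.zeros(N)
    delta[m] = 1.0
    return np.convolve(delta, h)[:N]

def multipulse_search(target, h, n_pulses):
    """Greedy multi-pulse excitation search (steps s2-s4)."""
    N = len(target)
    excitation = np.zeros(N)
    err = target.astype(float).copy()
    for _ in range(n_pulses):
        best_m, best_g, best_gain = 0, 0.0, -np.inf
        for m in range(N):
            hm = shifted_response(h, m, N)
            g = np.dot(err, hm) / np.dot(hm, hm)   # optimal amplitude
            gain = g * np.dot(err, hm)             # resulting error reduction
            if gain > best_gain:
                best_m, best_g, best_gain = m, g, gain
        excitation[best_m] += best_g
        err -= best_g * shifted_response(h, best_m, N)
    return excitation, err

rng = np.random.default_rng(2)
h = 0.9 ** np.arange(20)                  # made-up decaying impulse response
target = rng.standard_normal(64)
exc, err = multipulse_search(target, h, n_pulses=8)
print(np.sum(err ** 2) < np.sum(target ** 2))  # True
```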
SPEECH CODING AND SYNTHESIS: Code-Excited Linear Prediction (1) • The difference: • Represent the residue v(n) by codewords (exhaustive search) from a codebook of zero-mean Gaussian sequences • Consider the primary pitch pulses, which are predictable over consecutive periods
SPEECH CODING AND SYNTHESIS: Code-Excited Linear Prediction (2) • [Block diagram: LP analysis of s(n) gives the LP parameters; a Gaussian excitation codebook and multi-pulse generator, followed by a pitch synthesis filter using the pitch estimate, produce u(n); the LP synthesis filter outputs ŝ(n); the perceptually weighted error drives the minimization]
SPEECH CODING AND SYNTHESIS: CELP Experiment (1) • A CELP experiment • [Plots: original signal (blue), excitation signal (below), reconstructed signal (green)]
SPEECH CODING AND SYNTHESIS: CELP Experiment (2) • Test the quality for different settings: • LPC model order: initial M = 10, test M = 2 • PW coefficient
SPEECH CODING AND SYNTHESIS: CELP Experiment (3) • Codebook (L, K) • K: codebook size; K strongly influences the computation time (reducing K from 1024 to 256 cut the time from 13 to 6 seconds) • L: length of the random signal; L determines the number of subblocks in the frame • Initial setting (40, 1024), test setting (40, 16)
SPEECH CODING AND SYNTHESIS: Quantization • With quantization: • 16000 bps CELP • 9600 bps CELP • Trade-off: bandwidth efficiency vs. speech quality
Speech Coding and Recognition SPEECH RECOGNITION
SPEECH RECOGNITION: Dimensions of Difficulty • Speaker dependent / independent • Vocabulary size (small, medium, large) • Discrete words / continuous utterance • Quiet / noisy environment
SPEECH RECOGNITION: Feature Extraction • Overlapping frames • A feature vector for each frame • Mel-cepstrum, difference cepstrum, energy, difference energy
SPEECH RECOGNITION: Vector Quantization • Vector quantization • K-means algorithm • An observation sequence for the whole word
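A minimal K-means vector quantizer: the trained codebook maps each feature vector to the index of its nearest codeword, and those indices form the observation sequence fed to the HMM. The 2-D feature vectors here are synthetic stand-ins for the cepstral features above, and the deterministic initialization is an illustrative choice:

```python
import numpy as np

def kmeans_vq(features, K, n_iter=20):
    """Train a VQ codebook with K-means; returns (codebook, symbol sequence)."""
    # simple deterministic initialization: K vectors spread over the data
    idx = np.linspace(0, len(features) - 1, K).astype(int)
    codebook = features[idx].astype(float)
    for _ in range(n_iter):
        # assign every feature vector to its nearest codeword
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each codeword to the centroid of its cluster
        for k in range(K):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook, labels

# synthetic 2-D "feature vectors": two well-separated clusters
rng = np.random.default_rng(3)
f = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(1.0, 0.1, (50, 2))])
codebook, obs_sequence = kmeans_vq(f, K=2)
print(np.bincount(obs_sequence))  # [50 50]
```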
SPEECH RECOGNITION: Hidden Markov Model (1) • Changing states, emitting symbols • Model parameters: initial state distribution π(1), transition matrix A, observation matrix B • [Diagram: left-to-right model with states 1 2 3 4 5]
SPEECH RECOGNITION: Hidden Markov Model (2) • Probability of transition • State transition matrix • State probability vector • State equation
SPEECH RECOGNITION: Hidden Markov Model (3) • Probability of observing • Observation probability matrix • Observation probability vector • Observation equation
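One common convention for the state and observation equations (the notation is assumed here, since the slides' formulas did not survive extraction): with state probability vector p(t) and observation probability vector q(t),

```latex
a_{ij} = P(x_{t+1}=j \mid x_t=i), \qquad
\mathbf{p}(t+1) = A^{\mathsf T}\,\mathbf{p}(t)
\\[4pt]
b_{jk} = P(o_t = k \mid x_t = j), \qquad
\mathbf{q}(t) = B^{\mathsf T}\,\mathbf{p}(t)
```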
SPEECH RECOGNITION: Hidden Markov Model (4) • Discrete-observation hidden Markov model • Two HMM problems: • Training problem • Recognition problem
SPEECH RECOGNITION: Recognition using HMM (1) • Determine the probability that a given HMM produced the observation sequence • Straightforward computation enumerates all possible state paths: S^T of them • [Trellis diagram: states against time]
SPEECH RECOGNITION: Recognition using HMM (2) • Forward-backward algorithm; only the forward part is needed here • Forward partial observation • Forward probability
SPEECH RECOGNITION: Recognition using HMM (3) • Initialization • Recursion • Termination
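The three steps can be sketched directly. The tiny model parameters below are made up, and the result is checked against the straightforward computation over all S^T paths from the earlier slide:

```python
from itertools import product
import numpy as np

def forward_probability(pi, A, B, obs):
    """P(obs | model) by the forward algorithm.

    pi: initial state probabilities, shape (S,)
    A:  A[i, j] = P(state j at t+1 | state i at t), shape (S, S)
    B:  B[j, k] = P(symbol k | state j), shape (S, K)
    """
    alpha = pi * B[:, obs[0]]              # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # recursion over time
    return alpha.sum()                     # termination

# tiny hypothetical 2-state, 2-symbol model
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 0]

# straightforward computation: sum over all S^T = 8 state paths
brute = sum(
    pi[path[0]] * B[path[0], obs[0]] *
    np.prod([A[path[t - 1], path[t]] * B[path[t], obs[t]] for t in (1, 2)])
    for path in product(range(2), repeat=3)
)
print(np.isclose(forward_probability(pi, A, B, obs), brute))  # True
```

The forward recursion reuses partial sums, so its cost grows as S²T instead of the exponential S^T of the brute-force enumeration.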