Learn about Mel-Frequency Cepstral Coefficients (MFCC), the most widely used spectral representation in Automatic Speech Recognition (ASR). Understand the importance of frame division and windowing in capturing spectral information. Explore the computation of MFCC, Mel-scale, Mel Filter Bank, and log energy. Discover delta and double-delta features for capturing temporal information. Find out why MFCC is popular for ASR and how it improves Hidden Markov Model (HMM) modeling.
LSA 352: Speech Recognition and Synthesis. Dan Jurafsky. Lecture 6: Feature Extraction and Acoustic Modeling. IP Notice: Various slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al., Yun-Hsuan Sung, and Bryan Pellom. I’ll try to give correct credit on each slide, but I’ll probably miss some.
MFCC • Mel-Frequency Cepstral Coefficients (MFCC) • The most widely used spectral representation in ASR
Windowing • Why divide the speech signal into successive overlapping frames? • Speech is not a stationary signal; we want information about a region small enough that its spectral content is a useful cue. • Frames • Frame size: typically 10-25 ms • Frame shift: the length of time between successive frames, typically 5-10 ms
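A minimal sketch of this framing step in NumPy (the helper name and the 16 kHz example are illustrative, not part of the original slides):

import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (frame size and shift given in ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: 1 s of a 16 kHz signal -> 98 frames of 400 samples each
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)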
Windowing Slide from Bryan Pellom
Discrete Fourier Transform: computing a spectrum • A 24 ms Hamming-windowed signal • And its spectrum as computed by the DFT (plus other smoothing)
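A sketch of the same step in NumPy: apply a Hamming window to a single frame and take the magnitude of its DFT (computed with the FFT). The 400-sample random frame is just a stand-in for a real frame at 16 kHz:

import numpy as np

def frame_spectrum(frame, n_fft=512):
    """Hamming-window one frame and return its DFT magnitude spectrum."""
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed, n=n_fft))   # n_fft // 2 + 1 frequency bins

spectrum = frame_spectrum(np.random.randn(400))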
Mel-scale • Human hearing is not equally sensitive to all frequency bands • Less sensitive at higher frequencies, roughly above 1000 Hz • I.e., human perception of frequency is non-linear.
Mel-scale • A mel is a unit of pitch • Definition: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels • The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz • A common definition: mel(f) = 2595 log10(1 + f/700), so that, e.g., 1000 Hz maps to roughly 1000 mels
Mel Filter Bank Processing • Mel filter bank • Filters uniformly spaced below 1 kHz • Logarithmically spaced above 1 kHz
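A sketch of constructing such a bank of triangular filters, with filter edges equally spaced on the mel scale (so they end up denser in Hz below 1 kHz); the filter count and FFT size are illustrative choices:

import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)   # same mel definition as above
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges: equally spaced in mels, converted back to FFT bin indices
    edges_mel = np.linspace(hz2mel(0.0), hz2mel(sample_rate / 2.0), n_filters + 2)
    edges_bin = np.floor((n_fft + 1) * mel2hz(edges_mel) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = edges_bin[m - 1], edges_bin[m], edges_bin[m + 1]
        for k in range(left, centre):                        # rising slope
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                       # falling slope
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

fbank = mel_filterbank()   # shape (26, 257) for a 512-point FFT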
Log energy computation • Compute the logarithm of the squared magnitude of the output of the Mel filter bank
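Continuing the sketch: square the magnitude spectrum, pass it through the filter bank, and take the log (the small floor is only a numerical safeguard, not part of the definition):

import numpy as np

def log_mel_energies(spectrum, fbank, floor=1e-10):
    """Log of the Mel filter-bank outputs applied to the squared magnitude spectrum."""
    power = spectrum ** 2                   # squared magnitude of the DFT
    return np.log(fbank @ power + floor)    # floor avoids log(0)

# e.g. log_mel_energies(frame_spectrum(frame), mel_filterbank()), reusing the sketches above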
Log energy computation • Why log energy? • The logarithm compresses the dynamic range of values • Human response to signal level is logarithmic • Humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes • It makes frequency estimates less sensitive to slight variations in input (e.g., power variation due to the speaker’s mouth moving closer to the microphone) • Phase information is not helpful in speech
Dynamic Cepstral Coefficients • The cepstral coefficients do not capture energy • So we add an energy feature • Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release) • So we also want to add the changes in features (the slopes) • We call these delta features • We also add double-delta (acceleration) features
Delta and double-delta • Derivatives of the features, taken in order to capture temporal information
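A hedged sketch of delta features as a centred difference over time (HTK and most systems actually use a regression over several neighbouring frames, but the idea is the same); double-deltas are simply deltas of the deltas:

import numpy as np

def delta(features):
    """Frame-to-frame slope of each feature, approximated by a centred difference."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")   # repeat first/last frame
    return (padded[2:] - padded[:-2]) / 2.0

cepstra = np.random.randn(100, 13)          # stand-in for 13 cepstral/energy values per frame
d, dd = delta(cepstra), delta(delta(cepstra))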
Typical MFCC features • Window size: 25 ms • Window shift: 10 ms • Pre-emphasis coefficient: 0.97 • MFCC: • 12 MFCC (mel frequency cepstral coefficients) • 1 energy feature • 12 delta MFCC features • 12 double-delta MFCC features • 1 delta energy feature • 1 double-delta energy feature • Total: 39-dimensional features
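For reference, a sketch of a similar 39-dimensional front end using librosa (assuming librosa is installed; the file name is hypothetical, and here C0 plays the role of the energy term, matching the MFCC_0_D_A setup used in the HTK example later in these slides):

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)             # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)      # 25 ms window, 10 ms shift at 16 kHz
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2])                        # shape (39, n_frames)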
Why is MFCC so popular? • Efficient to compute • Incorporates a perceptual mel frequency scale • Separates the source and filter • The IDFT (DCT) decorrelates the features • This improves the diagonal-covariance assumption in HMM modeling • Alternative: PLP (Perceptual Linear Prediction)
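The decorrelation step is a DCT of the log filter-bank energies; a sketch with scipy (the 26 random values stand in for real log energies from the earlier sketches):

import numpy as np
from scipy.fftpack import dct

def mfcc_from_log_energies(log_energies, n_ceps=12):
    """DCT-II of the log filter-bank energies; keep C1..C12 (C0 is often used as an energy term)."""
    cepstrum = dct(log_energies, type=2, norm="ortho")
    return cepstrum[1:n_ceps + 1]

print(mfcc_from_log_energies(np.random.randn(26)))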
…and now what? • What we have = what we OBSERVE • A matrix of observations: O = [o1, o2, …, oT]
Multivariate Gaussians • Characterized by a vector of means and a covariance matrix • In practice: M mixtures of Gaussians per state
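As a sketch, the output distribution of one state is a weighted sum of multivariate Gaussians, b_j(o_t) = sum over m of c_jm N(o_t; mu_jm, Sigma_jm). A small illustration with scipy (the two-mixture, 33-dimensional shapes mirror the HTK prototype shown later):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(o, weights, means, covs):
    """b_j(o) = sum_m c_m * N(o; mu_m, Sigma_m) for one HMM state."""
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=cov)
               for c, mu, cov in zip(weights, means, covs))

weights = [0.6, 0.4]                          # mixture weights, as in the prototype below
means = [np.zeros(33), np.zeros(33)]
covs = [np.eye(33), np.eye(33)]               # diagonal covariance in practice
print(gmm_likelihood(np.zeros(33), weights, means, covs))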
Gaussian Intuitions: Off-diagonal • As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y. Text and figures from Andrew Ng’s lecture notes for CS229
Diagonal covariance • The diagonal contains the variance of each dimension, σii² • So this means we consider the variance of each acoustic feature (dimension) separately
Gaussian Intuitions: Size of Σ • μ = [0 0] in all three panels • Σ = I, Σ = 0.6I, Σ = 2I • As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed. Text and figures from Andrew Ng’s lecture notes for CS229
Observation “probability” • How “well” an observation “fits” a given state • Example: Obs. 3 in State 2. Text and figures from Andrew Ng’s lecture notes for CS229
Gaussian PDFs • A Gaussian is a probability density function; probability is the area under the curve • To make it a probability, we constrain the area under the curve to be 1 • BUT… • We will be using “point estimates”: the value of the Gaussian at a point • Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx • BUT… this is OK, since the same factor is omitted from all Gaussians (we take the argmax, so the comparison is still correct)
Gaussian as Probability Density Function • Point estimate
Using a Gaussian as an acoustic likelihood estimator • EXAMPLE: • Let’s suppose our observation is a single real-valued feature (instead of a 33-dimensional vector) • Then, if we had learned a Gaussian over the distribution of values of this feature • We could compute the likelihood of any given observation ot as the value of that Gaussian at ot: bj(ot) = (1 / sqrt(2π σj²)) exp(−(ot − μj)² / (2σj²))
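A one-line sketch of that point estimate in NumPy:

import numpy as np

def gaussian_likelihood(o_t, mu, var):
    """Value of the Gaussian pdf at o_t -- the 'point estimate' used as b_j(o_t)."""
    return np.exp(-(o_t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_likelihood(1.2, mu=1.0, var=0.25))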
So, how do we train an HMM λ? • First decide the topology: number of states, transitions, …
In our HTK example (HMM prototypes) • The file name MUST be equal to the linguistic unit: si, no and sil • We will use: • 12 states (14 in HTK: 12 + 2 non-emitting start/end states) • and two Gaussians (mixtures) per state • with dimension 33
In our HTK example (HMM prototypes) • So, for example, the file named si will be:

~o <VecSize> 33 <MFCC_0_D_A>
~h "si"
<BeginHMM>
<NumStates> 14
<State> 2 <NumMixes> 2
<Stream> 1
<Mixture> 1 0.6
<Mean> 33
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<Mixture> 2 0.4
<Mean> 33
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 2
<Stream> 1
......................................
In our HTK example (HMM prototypes)
......................................
<TransP> 14
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
…but, how do we train an HMM λ for a particular linguistic unit? • Training = “estimating” the transition and observation probabilities (mixture weights, means and variances)
…but, how do we train an HMM? • We have some training data for each linguistic unit
…but, how do we train an HMM? • Example for a (single) Gaussian characterized by a mean and a variance
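With a hard assignment of frames to states, the estimates are simply the sample mean and variance of the frames assigned to each state; a sketch (the random frames are a stand-in for real training data):

import numpy as np

def estimate_gaussian(frames):
    """ML estimates of the mean and (diagonal) variance from frames assigned to one state."""
    return frames.mean(axis=0), frames.var(axis=0)

state_frames = np.random.randn(50, 33)     # 50 frames of 33-dimensional features
mu, var = estimate_gaussian(state_frames)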
Uniform segmentation is used by HInit to produce the initial (output) HMMs
…what is the “hidden” sequence of states? • GIVEN: • an HMM λ • and a sequence of states S
…what is the “hidden” sequence of states? • The joint likelihood P(O, S | λ) can be easily obtained
…what is the “hidden” sequence of states? • …and by summing over all possible sequences S we obtain P(O | λ)
…what is the “hidden” sequence of states? • …and by summing over all possible sequences S we obtain P(O | λ), computed efficiently in Baum-Welch (forward-backward)
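A small sketch of the forward recursion that performs this sum over all state sequences efficiently (toy sizes; the observation likelihoods are assumed to be precomputed):

import numpy as np

def forward(obs_lik, trans, init):
    """P(O | lambda): obs_lik[t, j] = b_j(o_t), trans[i, j] = a_ij, init[j] = pi_j."""
    T, N = obs_lik.shape
    alpha = np.zeros((T, N))
    alpha[0] = init * obs_lik[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]   # sum over predecessor states
    return alpha[-1].sum()

obs_lik = np.array([[0.7, 0.2, 0.1],      # 4 frames, 3 states
                    [0.4, 0.5, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.1, 0.2, 0.7]])
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
print(forward(obs_lik, trans, init=np.array([1.0, 0.0, 0.0])))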
…so, with the optimum sequence of states on the training data, the transition and observation probabilities (weights, means and variances) can be re-estimated … UNTIL CONVERGENCE
HMM training using HRest produces the re-estimated (output) HMMs
…training is usually done using Baum-Welch • Simplifying (too much?): the re-estimation formulas weight each observation by its probability of occupying a particular state (the occupation probability) • The sums are weighted by the occupation probability and then normalised (see the HTKBook)
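In sketch form, the mean and variance updates weight each observation by the state occupation probability gamma_t(j) and normalise, in the spirit of the HTKBook re-estimation formulas (gamma is assumed to come from a forward-backward pass):

import numpy as np

def reestimate_mean_var(obs, gamma_j):
    """mu_j = sum_t gamma_t(j) o_t / sum_t gamma_t(j); diagonal variance analogously."""
    w = gamma_j / gamma_j.sum()                  # normalised occupation probabilities
    mu = (w[:, None] * obs).sum(axis=0)
    var = (w[:, None] * (obs - mu) ** 2).sum(axis=0)
    return mu, var

# obs: (T, 33) feature frames; gamma_j: (T,) occupation probabilities for state j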
…why not train all the HMMs together? …like a global HMM … In our example, in each file we have:
HMM training using HERest (embedded re-estimation) produces the output HMMs