
LSA 352 Speech Recognition and Synthesis

Learn about Mel-Frequency Cepstral Coefficients (MFCC), the most widely used spectral representation in Automatic Speech Recognition (ASR). Understand the importance of frame division and windowing in capturing spectral information. Explore the computation of MFCC, Mel-scale, Mel Filter Bank, and log energy. Discover delta and double-delta features for capturing temporal information. Find out why MFCC is popular for ASR and how it improves Hidden Markov Model (HMM) modeling.




Presentation Transcript


  1. LSA 352 Speech Recognition and Synthesis Dan Jurafsky Lecture 6: Feature Extraction and Acoustic Modeling IP Notice: Various slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al., Yun-Hsuan Sung, and Bryan Pellom. I’ll try to give correct credit on each slide, but I’ll probably miss some.

  2. MFCC • Mel-Frequency Cepstral Coefficient (MFCC) • Most widely used spectral representation in ASR

  3. Windowing • Why divide the speech signal into successive overlapping frames? • Speech is not a stationary signal; we want information about a small enough region that the spectral information is a useful cue. • Frames • Frame size: typically 10-25 ms • Frame shift: the length of time between successive frames, typically 5-10 ms
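As an illustration, the framing step described above can be sketched in a few lines of NumPy (the 25 ms / 10 ms values are the typical figures from the slide, not required ones):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into successive overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

# One second of a 16 kHz signal -> 25 ms frames every 10 ms
x = np.zeros(16000)
frames = frame_signal(x, 16000)   # shape (98, 400)
```

Each frame is 400 samples long (25 ms at 16 kHz) and consecutive frames overlap by 240 samples, so every part of the signal is seen by several frames.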

  4. Windowing Slide from Bryan Pellom

  5. MFCC

  6. Discrete Fourier Transform: computing a spectrum • A 24 ms Hamming-windowed signal • And its spectrum as computed by the DFT (plus other smoothing)
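A minimal sketch of this step, assuming a made-up 1 kHz test tone rather than the slide's actual speech signal:

```python
import numpy as np

# A 25 ms frame of a 16 kHz signal: apply a Hamming window, then the DFT.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr          # 400 samples
frame = np.sin(2 * np.pi * 1000 * t)         # hypothetical 1 kHz test tone
windowed = frame * np.hamming(len(frame))
spectrum = np.abs(np.fft.rfft(windowed))     # magnitude spectrum

# The strongest DFT bin should sit at (or very near) 1000 Hz
freqs = np.fft.rfftfreq(len(windowed), 1 / sr)
peak = freqs[np.argmax(spectrum)]
```

The Hamming window tapers the frame edges, which reduces the spectral leakage a rectangular cut would cause.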

  7. MFCC

  8. Mel-scale • Human hearing is not equally sensitive to all frequency bands • Less sensitive at higher frequencies, roughly > 1000 Hz • I.e., human perception of frequency is non-linear

  9. Mel-scale • A mel is a unit of pitch • Definition: pairs of sounds perceptually equidistant in pitch are separated by an equal number of mels • The mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz • Definition: mel(f) = 1127 ln(1 + f/700)
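The mel-scale definition above (in its natural-log form; 2595 log10(1 + f/700) is the equivalent base-10 form) can be written directly:

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to mels."""
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

# By construction, 1000 Hz maps to (almost exactly) 1000 mels
m1000 = hz_to_mel(1000.0)
```

Note the compression above 1 kHz: doubling the frequency from 1000 Hz to 2000 Hz adds far fewer than 1000 mels.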

  10. Mel Filter Bank Processing • Mel filter bank • Filters uniformly spaced below 1 kHz • Logarithmically spaced above 1 kHz
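A sketch of how such a filter bank can be built: triangular filters whose center frequencies are uniformly spaced on the mel scale (which gives roughly linear spacing below 1 kHz and logarithmic spacing above it). The 26-filter / 512-point-FFT configuration is a common choice, not one mandated by the slide.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers uniformly spaced on the mel scale."""
    def hz_to_mel(f):
        return 1127.0 * np.log(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (np.exp(m / 1127.0) - 1.0)

    # n_filters centers plus one edge point on each side, equally spaced in mels
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):            # rising slope
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fbank[i, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filterbank(26, 512, 16000)   # shape (26, 257)
```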

  11. MFCC

  12. Log energy computation • Compute the logarithm of the square magnitude of the output of the Mel filter bank

  13. Log energy computation • Why log energy? • The logarithm compresses the dynamic range of values • Human response to signal level is logarithmic • Humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes • Makes frequency estimates less sensitive to slight variations in input (power variation due to the speaker’s mouth moving closer to the microphone) • Phase information is not helpful in speech
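A minimal sketch of this step, assuming a random test frame and a placeholder all-ones filter bank (a real bank would have triangular rows):

```python
import numpy as np

# Squared-magnitude spectrum passed through a mel filter bank, then log-compressed.
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)                          # hypothetical frame
power_spectrum = np.abs(np.fft.rfft(frame)) ** 2          # 257 bins
fbank = np.ones((26, power_spectrum.size))                # placeholder bank
log_energies = np.log(fbank @ power_spectrum + 1e-10)     # floor avoids log(0)
```

The small additive floor is a common practical guard against taking the log of a zero-energy channel.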

  14. MFCC

  15. MFCC

  16. Dynamic Cepstral Coefficients • The cepstral coefficients do not capture energy • So we add an energy feature • Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release) • So we want to add the changes in features (the slopes) • We call these delta features • We also add double-delta acceleration features

  17. Delta and double-delta • Derivatives of the features, taken in order to obtain temporal information
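One common way to compute the deltas (used, for example, in HTK-style front ends) is a regression over a small window: d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2). A sketch:

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based delta features over a (T, D) feature matrix."""
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1))
        for t in range(T)
    ]) / denom

# On a linearly increasing feature, the interior deltas are exactly the slope
feats = np.arange(10, dtype=float).reshape(-1, 1)
d = deltas(feats)
```

Double-deltas are obtained by applying the same function to the delta matrix.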

  18. Typical MFCC features • Window size: 25ms • Window shift: 10ms • Pre-emphasis coefficient: 0.97 • MFCC: • 12 MFCC (mel frequency cepstral coefficients) • 1 energy feature • 12 delta MFCC features • 12 double-delta MFCC features • 1 delta energy feature • 1 double-delta energy feature • Total 39-dimensional features
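The 39-dimensional layout listed above can be sketched as follows. The random arrays stand in for real MFCCs and energy, and np.gradient is used as a simple stand-in for the regression-based deltas:

```python
import numpy as np

T = 100                                             # number of frames (arbitrary)
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((T, 12))                 # 12 cepstral coefficients
energy = rng.standard_normal((T, 1))                # 1 energy feature

static = np.hstack([mfcc, energy])                  # 13 static features
delta = np.gradient(static, axis=0)                 # 13 delta features
double_delta = np.gradient(delta, axis=0)           # 13 double-delta features

features = np.hstack([static, delta, double_delta]) # 13 + 13 + 13 = 39
```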

  19. Why is MFCC so popular? • Efficient to compute • Incorporates a perceptual Mel frequency scale • Separates the source and filter • The IDFT (implemented as a DCT) decorrelates the features • Improves the diagonal-covariance assumption in HMM modeling • Alternative: PLP

  20. …and now what? • What we have = what we OBSERVE • A matrix of observations: O = [o1, o2, …, oT]

  21. Multivariate Gaussians • Vector of means μ and covariance matrix Σ • M mixtures of Gaussians
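The densities the slide refers to can be sketched directly from their definitions (a generic implementation, not the slide's own code):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_pdf(x, weights, means, covs):
    """M-component mixture: weighted sum of component densities."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```

The mixture weights must sum to 1 so that the mixture is itself a valid density.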

  22. Gaussian Intuitions: Off-diagonal • As we increase the off-diagonal entries, more correlation between value of x and value of y Text and figures from Andrew Ng’s lecture notes for CS229

  23. Diagonal covariance • The diagonal contains the variance of each dimension, σii² • So this means we consider the variance of each acoustic feature (dimension) separately

  24. Gaussian Intuitions: Size of Σ • μ = [0 0] in all three cases • Σ = I, Σ = 0.6I, Σ = 2I • As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian is more compressed Text and figures from Andrew Ng’s lecture notes for CS229

  25. Observation “probability” How “well” an observation “fits” a given state Example: Obs. 3 in State 2 Text and figures from Andrew Ng’s lecture notes for CS229

  26. Gaussian PDFs • A Gaussian is a probability density function; probability is the area under the curve. • To make it a probability, we constrain the area under the curve to equal 1. • BUT… • We will be using “point estimates”: the value of the Gaussian at a point. • Technically these are not probabilities, since a pdf gives probability over an interval and needs to be multiplied by dx • BUT… this is OK since the same dx factor is omitted from all Gaussians (we will use argmax, so the result is still correct).

  27. Gaussian as Probability Density Function

  28. Gaussian as Probability Density Function Point estimate

  29. Using a Gaussian as an acoustic likelihood estimator EXAMPLE: • Let’s suppose our observation was a single real-valued feature (instead of a 33-dimensional vector) • Then if we had learned a Gaussian over the distribution of values of this feature • We could compute the likelihood of any given observation ot as follows:
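The single-feature likelihood computation described above can be sketched as follows (the observation value and the three state means are made up for illustration):

```python
import math

def gaussian_likelihood(o, mean, var):
    """Value ("point estimate") of a 1-D Gaussian density at observation o."""
    return math.exp(-(o - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical example: three states whose learned Gaussians have
# means 0.0, 1.0, 2.0 (variance 1.0), and an observation o = 1.2.
b = [gaussian_likelihood(1.2, mean, 1.0) for mean in (0.0, 1.0, 2.0)]

# The argmax over states is unaffected by the omitted dx factor.
best_state = max(range(len(b)), key=lambda s: b[s])
```

Here the state whose mean is closest to the observation (mean 1.0) wins the argmax, exactly as the point-estimate argument on the previous slides predicts.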

  30. …what about transition probabilities?

  31. So, how to train an HMM? HMM = λ First decide the topology: number of states, transitions, …

  32. In our HTK example (HMM prototypes) • The file name MUST be equal to the linguistic unit: si, no and sil • We will use: • 12 states (14 in HTK: 12 + 2 non-emitting start/end) • and two Gaussians (mixtures) per state… • …with dimension 33.

  33. In our HTK example (HMM prototypes) So, for example, the file named si will be:
~o <VecSize> 33 <MFCC_0_D_A>
~h "si"
<BeginHMM>
<NumStates> 14
<State> 2 <NumMixes> 2
<Stream> 1
<Mixture> 1 0.6
<Mean> 33
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<Mixture> 2 0.4
<Mean> 33
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 2
<Stream> 1
......................................

  34. In our HTK example (HMM prototypes) ......................................
<TransP> 14
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
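A quick structural check of this left-to-right topology (state 1 is a non-emitting entry, states 2-13 either self-loop with probability 0.6 or advance with 0.4, and state 14 is a non-emitting exit with an all-zero row):

```python
import numpy as np

# Rebuild the 14x14 transition matrix from the prototype file above.
A = np.zeros((14, 14))
A[0, 1] = 1.0                           # entry state jumps to first emitting state
for i in range(1, 13):                  # emitting states: self-loop or advance
    A[i, i], A[i, i + 1] = 0.6, 0.4

row_sums = A.sum(axis=1)                # every row but the exit row sums to 1
```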

  35. …but, how do we train an HMM? HMM = λ for a particular linguistic unit Train = “estimate” the transition and observation probabilities (weights, means and variances)

  36. …but, how do we train an HMM? We have some training data for each linguistic unit

  37. …but, how do we train an HMM? Example for a (single) Gaussian characterized by a mean and a variance

  38. …but, how do we train an HMM?

  39. HInit uses uniform segmentation of the training data to initialize the output HMMs

  40. …what is the “hidden” sequence of states? • GIVEN: • an HMM λ • and a sequence of states S

  41. …what is the “hidden” sequence of states? Can be easily obtained

  42. …what is the “hidden” sequence of states? …and by summing over all possible sequences S

  43. …what is the “hidden” sequence of states? …and by summing over all possible sequences S (Baum-Welch)

  44. …what is the “hidden” sequence of states?

  45. …so with the optimum sequence of states on the training data, the transition and observation probabilities (weights, means and variances) can be re-trained… UNTIL CONVERGENCE

  46. HMM training using HRest produces the re-estimated output HMMs

  47. …training is usually done using Baum-Welch Simplifying (perhaps too much?), the re-estimation formulas weight each observation by its probability of occupying a particular state (the occupation probability), then normalize by the total occupation probability (see the HTKBook)
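The mean update the slide alludes to can be sketched in one line: each observation is weighted by its state-occupation probability γ and the result is normalized. The numbers below are hypothetical:

```python
import numpy as np

# Baum-Welch style mean re-estimation for one state.
obs = np.array([[1.0], [2.0], [3.0]])   # T x D observation matrix (hypothetical)
gamma = np.array([0.2, 0.5, 0.3])       # P(occupying this state at time t)

# Occupation-weighted average of the observations
new_mean = (gamma[:, None] * obs).sum(axis=0) / gamma.sum()
```

Variance updates follow the same pattern, with squared deviations from the new mean in place of the raw observations.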

  48. …why not train all the HMMs together, like one global HMM? … In our example, in each file we have:

  49. Embedded HMM training using HERest produces the output HMMs
