
Speech recognition and the EM algorithm


Presentation Transcript


  1. Speech recognition and the EM algorithm Karthik Visweswariah, IBM Research India

  2. Speech recognition: The problem • Input: Audio data with a speaker saying a sentence in English • Output: string of words corresponding to the words spoken • Data resources • Large corpus (thousands of hours) of audio recordings with associated text

  3. Agenda (for the next two lectures) • Overview of the statistical approach to speech recognition • Discuss sub-components, indicating specific problems to be solved • Deeper dive into a couple of areas with general applicability • EM algorithm • Maximum likelihood estimation • Gaussian mixture models • EM algorithm itself • Application to machine translation • Decision trees (?)

  4. Evolution: 1960-present • Isolated digits: Filter bank analysis, time normalisation, dynamic programming • Isolated words, continuous digits: Pattern recognition, LPC analysis, clustering algorithms • Connected words: Statistical approaches, Hidden Markov Models • Continuous speech, Large Vocabulary • Speaker independence • Speaker adaptation • Discriminative training • Deep learning

  5. Early attempts • Trajectory of formant frequencies (resonant frequencies of the vocal tract) • Automatic speech recognition: Brief history of the technology development, B. H. Juang et al., 2006

  6. Simpler problem • Given the weight of a person, determine the gender of the person • Clearly cannot be done deterministically • Model probabilistically • Joint distribution: P(gender, weight) • Bayes: P(gender | weight) = P(gender) P(weight | gender) / P(weight) • P(gender): Just count up the genders of persons in the database • P(weight | gender) • Non-parametric: Histogram of weights for each gender • Parametric: Assume a Normal (Gaussian) distribution, estimate mean/variance separately for each gender • Choose the gender with the higher posterior probability given the weight
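
A minimal sketch of the parametric approach on this slide: estimate P(gender) by counting, fit one Gaussian per class for P(weight | gender), and pick the gender with the larger posterior. The training weights below are invented purely for illustration.

```python
# Hedged sketch of the slide's parametric approach: class priors by counting,
# one Gaussian P(weight | gender) per class, then pick the larger posterior.
import numpy as np
from scipy.stats import norm

# Toy training data (weights in kg); entirely made up for illustration.
weights = np.array([62., 55., 70., 58., 85., 78., 92., 66.])
genders = np.array(['F', 'F', 'M', 'F', 'M', 'M', 'M', 'F'])

classes = np.unique(genders)
priors = {c: np.mean(genders == c) for c in classes}                 # P(gender)
params = {c: (weights[genders == c].mean(),
              weights[genders == c].std(ddof=1)) for c in classes}   # mean, std per class

def classify(w):
    # Posterior is proportional to P(gender) * P(weight | gender); P(weight) cancels.
    scores = {c: priors[c] * norm.pdf(w, *params[c]) for c in classes}
    return max(scores, key=scores.get)

print(classify(60.0))   # likely 'F' under this toy model
print(classify(90.0))   # likely 'M'
```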

  7. How a speech recognizer works • [Block diagram: Audio → Signal processing → Feature vectors: x → Search: argmax P(w)P(x|w) → Words; the search uses the Acoustic model P(x|w) and the Language model P(w)]
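
The search box in the diagram is just Bayes' rule from slide 6 applied to word sequences; since P(x) does not depend on w, it drops out of the argmax:

```latex
\hat{w} \;=\; \arg\max_{w} P(w \mid x)
        \;=\; \arg\max_{w} \frac{P(w)\,P(x \mid w)}{P(x)}
        \;=\; \arg\max_{w} P(w)\,P(x \mid w)
```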

  8. How a speech recognizer works • [Block diagram repeated: Audio → Signal processing → Feature vectors: x → Search: argmax P(w)P(x|w) → Words; Acoustic model P(x|w), Language model P(w)]

  9. Feature extraction: How do we represent the data? • Usually the most important step in data science or data analytics • Also a function of the amount of data • Converts data into a vector of real numbers • Represent documents for email classification into spam/non-spam • Counts of various characters (?) • Counts of various words (?) Are all words equally important? • Special characters used, colors used (?) • Predict attrition of an employee: Performance, salary, … • Should we capture salary change as a percentage change rather than an absolute number? • Should we look at the performance of the manager? • Salaries of team members? • Interacts with the algorithm that is to be used downstream • Is the algorithm invariant to scale? • Can the algorithm handle correlations in features? • What other assumptions? • Domain/background knowledge comes in here
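
As a small, hedged illustration of the spam example in the list above (the vocabulary and messages are invented), turning an email into a vector of word counts could look like this:

```python
# Toy bag-of-words featurizer for the spam example above; vocabulary is invented.
import re
from collections import Counter

VOCAB = ["free", "winner", "meeting", "invoice", "click", "project"]

def featurize(email):
    # Count how often each vocabulary word appears; everything else is ignored.
    counts = Counter(re.findall(r"[a-z']+", email.lower()))
    return [counts[w] for w in VOCAB]

print(featurize("Click here: you are a winner, claim your FREE prize"))
print(featurize("Agenda for the project meeting and the invoice"))
```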

  10. Signal processing • Input: Raw sampled audio, 11 kHz or 22 kHz on desktop, 8 kHz for telephony • Output: 40-dimensional features, 60-100 vectors per second • Ideally: • Different sounds represented differently • Unnecessary variations removed • Noise • Speaker • Channel • Match modeling assumptions

  11. Signal processing (contd.) • Windowed FFT: sounds are easier to distinguish in frequency space • Mel binning: Measure sensitivity to frequencies by listening experiments • Sensitivity to a fixed difference in tone decreases with tone frequency • Log scale: Humans perceive volume on roughly a log scale • Decorrelate data (use DCT) ← called MFCC up to this point • Subtract mean: scale invariance, channel invariance
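
A hedged numpy sketch of the pipeline just listed (windowed FFT, mel binning, log, DCT, mean subtraction). The frame sizes, filter counts and mel formula follow common textbook conventions rather than anything stated in the lecture; with 25 ms frames and a 10 ms hop at 8 kHz this yields the roughly 100 vectors per second mentioned on slide 10.

```python
# Hedged sketch of an MFCC-style front end; parameters are conventional defaults.
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_fft=256, n_filters=26, n_ceps=13):
    # 1. Windowed FFT: overlapping frames, Hamming window, power spectrum.
    frames = [signal[s:s + frame_len] * np.hamming(frame_len)
              for s in range(0, len(signal) - frame_len, hop)]
    spectra = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # 2. Mel binning: pool FFT bins with triangular filters on the mel scale.
    mel_energies = spectra @ mel_filterbank(n_filters, n_fft, sr).T
    # 3. Log scale: loudness is perceived roughly logarithmically.
    log_mel = np.log(mel_energies + 1e-10)
    # 4. DCT to decorrelate -> cepstral coefficients (MFCCs up to this point).
    ceps = dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # 5. Mean subtraction for channel invariance.
    return ceps - ceps.mean(axis=0)

if __name__ == "__main__":
    x = np.random.randn(8000)          # one second of fake 8 kHz audio
    print(mfcc(x).shape)               # (number of frames, n_ceps)
```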

  12. Signal processing (contd.) • Model dynamics: Concatenate previous and next few feature vectors • Project down to throw away noise/reduce computation (Linear/Fisher Discriminant Analysis) • Linear transform learned to match diagonal Gaussian modeling assumption
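
A rough sketch of the two steps on this slide, under assumptions not stated here: splice ±4 neighbouring frames together, then project down with LDA. In a real system the per-frame class labels come from a forced alignment against context-dependent states; here they are random stand-ins.

```python
# Hedged sketch of frame splicing plus an LDA projection; labels are stand-ins.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def splice(feats, context=4):
    # feats: (T, d); pad the edges by repeating the first/last frame.
    padded = np.concatenate([feats[:1].repeat(context, 0), feats,
                             feats[-1:].repeat(context, 0)])
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

T, d = 500, 13
feats = np.random.randn(T, d)
labels = np.random.randint(0, 50, size=T)          # stand-in for aligned state labels
spliced = splice(feats)                             # (T, 9 * d)
lda = LinearDiscriminantAnalysis(n_components=40)   # keep the discriminative directions
projected = lda.fit_transform(spliced, labels)
print(projected.shape)                              # (T, 40)
```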

  13. How a speech recognizer works • [Block diagram repeated: Audio → Signal processing → Feature vectors: x → Search: argmax P(w)P(x|w) → Words; Acoustic model P(x|w), Language model P(w)]

  14. Language modeling

  15. Language modeling
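
The language model supplies the prior P(w) in the architecture diagram. As a hedged stand-in (not taken from these slides), the classical approach is a count-based n-gram model (the deck later mentions a five-gram model); a bigram version over a toy corpus:

```python
# Minimal bigram language model with add-one smoothing; the corpus is a toy stand-in.
from collections import Counter
import math

corpus = [
    "speech recognition is fun",
    "speech recognition is hard",
    "recognition of speech is useful",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

vocab_size = len(set(unigrams) | {"</s>"})

def log_prob(sentence):
    toks = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams do not get zero probability.
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return lp

print(log_prob("speech recognition is fun"))
print(log_prob("speech is recognition"))   # less likely word order -> lower log-probability
```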

  16. How a speech recognizer works • [Block diagram repeated: Audio → Signal processing → Feature vectors: x → Search: argmax P(w)P(x|w) → Words; Acoustic model P(x|w), Language model P(w)]

  17. Acoustic modeling • Need to model acoustic sequences given words: P(x|w) • Obviously cannot create a model for every word • Need to break words into the fundamental sounds • Cat • K AE T - represent the pronunciation using phonemes • At IBM we used 40-50 phonemes for English • Dictionaries • Hand-created lists of words with their alternate pronunciations • Handling new words • Automatic generation of pronunciations from spellings • Clearly a tricky task for English, e.g. foreign names
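
A toy illustration of the dictionary step described above; the lexicon entries and phone labels are only examples, not IBM's actual phone set.

```python
# Toy pronunciation lexicon mapping words to phoneme strings, with alternates.
LEXICON = {
    "cat":  ["K AE T"],
    "the":  ["DH AH", "DH IY"],
    "data": ["D EY T AH", "D AE T AH"],
}

def word_to_phones(word, lexicon=LEXICON):
    # Real systems fall back to an automatic spelling-to-sound model for new words.
    return lexicon.get(word.lower(), ["<unknown: needs automatic pronunciation>"])

for w in ["cat", "data", "Visweswariah"]:
    print(w, "->", word_to_phones(w))
```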

  18. Acoustic modeling (contd.)

  19. Acoustic modeling (contd.)
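
In this generation of systems the acoustic model's emission distributions are Gaussian mixture models, as the adaptation and deep-learning slides later in the deck note. A minimal, hedged sketch of evaluating a diagonal-covariance GMM log-density for a single feature vector, with made-up parameters:

```python
# Hedged sketch: log-density of one feature vector under a diagonal-covariance GMM.
import numpy as np

def gmm_log_density(x, weights, means, variances):
    # x: (d,), weights: (m,), means and variances: (m, d), diagonal covariances.
    log_comp = (-0.5 * np.sum((x - means) ** 2 / variances, axis=1)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1))
    # log sum_m w_m N(x; mu_m, Sigma_m), computed stably in the log domain.
    return np.logaddexp.reduce(np.log(weights) + log_comp)

rng = np.random.default_rng(0)
m, d = 4, 40                                   # 4 mixture components, 40-dim features
x = rng.standard_normal(d)
weights = np.full(m, 1.0 / m)                  # mixture weights sum to 1
means = rng.standard_normal((m, d))
variances = np.ones((m, d))
print(gmm_log_density(x, weights, means, variances))
```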

  20. Acoustic modeling (contd.) • Pronunciations change in continuous speech depending on neighboring words • “Give me” might sound more like “gimme” • Emission probabilities should depend on context • Use a different distribution for each different context? • Even with 40 phonemes, looking two phones to either side gives us 2.5 million possibilities => Way too many • Learn in which contexts the acoustics are different • Tie together contexts using a decision tree • At each node, allowed to ask questions about (typically) two phones to the left and right • E.g. is the first phoneme to the right a glottal stop? • Use entropy gain to grow the tree • End up with 2000 to 10000 context-dependent states from 120 context-independent states
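
A hedged sketch of the entropy-gain criterion named on this slide: given the labels observed across a set of contexts, a yes/no question about the context is scored by how much it reduces label entropy. The contexts, labels and question below are invented.

```python
# Sketch of the entropy-gain splitting criterion for growing a context tree.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_gain(data, question):
    # data: list of (context, label); question: context -> bool
    labels = [lab for _, lab in data]
    yes = [lab for ctx, lab in data if question(ctx)]
    no = [lab for ctx, lab in data if not question(ctx)]
    if not yes or not no:
        return 0.0
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(data)
    return entropy(labels) - weighted

# Toy data: context = (phone to the left, phone to the right), label = tied state.
data = [(("K", "T"), "s1"), (("K", "D"), "s1"), (("S", "T"), "s2"),
        (("S", "D"), "s2"), (("K", "N"), "s1"), (("S", "N"), "s2")]

question = lambda ctx: ctx[0] == "K"     # "Is the phone to the left a K?"
print(entropy_gain(data, question))      # 1.0 bit: the question cleanly splits s1 from s2
```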

  21. Acoustic modeling

  22. How a speech recognizer works • [Block diagram repeated: Audio → Signal processing → Feature vectors: x → Search: argmax P(w)P(x|w) → Words; Acoustic model P(x|w), Language model P(w)]

  23. Search

  24. Search • Current approach is to precompile the language model, dictionary, phone HMMs and decision tree into a complete graph • Use Weighted Finite State Machine technology heavily • Complications • Space of words is large (five-gram language model) • Context-dependent acoustic models look across word boundaries • Need to prune to keep search running at reasonable speeds • Throw away states that are far enough below the best state
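
A minimal illustration of the pruning rule in the last bullet (not the full WFST search): keep only hypotheses whose score is within a beam of the best one. The hypotheses, scores and beam width are invented.

```python
# Toy beam pruning: drop partial hypotheses whose log-score falls more than
# `beam` below the best one at the current frame.
def prune(hypotheses, beam=10.0):
    # hypotheses: dict mapping partial path -> log-score (higher is better)
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam}

active = {"the cat": -102.3, "the cad": -105.1, "a cat": -118.7, "thick at": -131.0}
print(prune(active))   # keeps "the cat" and "the cad", discards the far-off paths
```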

  25. Speaker/condition dependent systems • Humans can certainly do better with a little data: “adapt” to an unfamiliar accent or noise • With minutes of data we can certainly do better • Could change our acoustic models (Gaussian mixture models) based on the new data • Can change the signal processing • Techniques described work even without supervision • Do a speaker-independent decode, and pretend that the obtained word sequence is the truth

  26. Adaptation • [Block diagram repeated: Audio → Signal processing → Feature vectors: x → Search: argmax P(w)P(x|w) → Words; Acoustic model P(x|w), Language model P(w)]

  27. Vocal tract length normalization • Different speakers have different vocal tract lengths • Frequency axis is stretched or squished • At test time, estimate this frequency stretching/squishing and undo it • Just a single parameter, quantized to 10 different values • Try each value and pick the one that gives the best likelihood • To get the full benefit, need to retrain in this canonical feature space • Gaussian mixture models and decision trees benefit from being trained in this “cleaned up” feature space
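
A schematic of the single-parameter search described on this slide: try each quantized warp factor, score the warped features under the current acoustic model, and keep the best. The warping and scoring functions below are placeholders, not the real frequency-axis warp or model likelihood.

```python
# Schematic VTLN warp search over ~10 quantized warp factors; warp_features and
# score_model are placeholders for the real frequency warping and model likelihood.
import numpy as np

def warp_features(audio, alpha):
    # Placeholder: a real system re-runs the filterbank with a warped frequency
    # axis controlled by alpha; here we just return dummy features.
    return np.random.default_rng(hash(alpha) % 2**32).standard_normal((100, 40))

def score_model(feats):
    # Placeholder for the acoustic-model log-likelihood of the features.
    return float(-np.sum(feats ** 2))

def pick_warp(audio, alphas=np.linspace(0.88, 1.12, 10)):
    scores = {alpha: score_model(warp_features(audio, alpha)) for alpha in alphas}
    return max(scores, key=scores.get)

print(pick_warp(audio=None))   # best warp factor under the placeholder scores
```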

  28. Adaptation of models

  29. Adaptation of features

  30. Improvements obtained • Conversational telephony data, test set from a call center • Training data: 2000 hours of Fisher data (0.7 billion frames of acoustic data) • Language model built with hundreds of millions of words from various sources • Including data from the domain of interest (call center conversations, IT help desk) • Roughly 30 million parameters in the acoustic model • System performance measured by word error rate • Speaker independent system: 34.0% • Vocal Tract Length Normalized system: 29.0% • Linear transform adaptation: 27.5% • Discriminative feature space: 23.8% • Discriminative training of model: 22.6% • It's hard work to improve on the best systems, no silver bullet!
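
Word error rate, the metric quoted above, is the edit (Levenshtein) distance between the hypothesis and the reference word sequences, divided by the number of reference words. A small self-contained implementation:

```python
# Word error rate: (substitutions + deletions + insertions) / reference length,
# computed with standard edit-distance dynamic programming.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words
```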

  31. Current state of the art • Progress tracked on Switchboard (conversational telephony test set) • Replace GMMs for acoustic modeling with deep networks Source: http://arxiv.org/pdf/1505.05899v1.pdf

  32. Conclusions • Gave a brief overview of various components in practical state of the art speech recognition systems • Speech recognition technology has relied on generative statistical models with parameters learned from data • Moved away from hand coded knowledge • Discriminative estimation techniques are more expensive but give significant improvements • Deep learning has shown significant gains for speech recognition • Speech recognition systems are good enough to support several useful applications • But they are still sensitive to variations that humans can handle with ease
