Speech recognition and the EM algorithm
Karthik Visweswariah, IBM Research India
Speech recognition: The problem
• Input: Audio data with a speaker saying a sentence in English
• Output: String of words corresponding to the words spoken
• Data resources
  • Large corpus (thousands of hours) of audio recordings with associated text
Agenda (for the next two lectures)
• Overview of the statistical approach to speech recognition
• Discuss sub-components, indicating specific problems to be solved
• Deeper dive into a couple of areas with general applicability
  • EM algorithm
    • Maximum likelihood estimation
    • Gaussian mixture models
    • The EM algorithm itself
    • Application to machine translation
  • Decision trees (?)
Evolution: 1960-present
• Isolated digits: Filter bank analysis, time normalisation, dynamic programming
• Isolated words, continuous digits: Pattern recognition, LPC analysis, clustering algorithms
• Connected words: Statistical approaches, Hidden Markov Models
• Continuous speech, large vocabulary
  • Speaker independence
  • Speaker adaptation
  • Discriminative training
  • Deep learning
Early attempts
• Trajectory of formant frequencies (resonant frequencies of the vocal tract)
Source: Automatic speech recognition: Brief history of technology development, B. H. Juang et al., 2006
Simpler problem
• Given the weight of a person, determine the gender of the person
• Clearly cannot be done deterministically
• Model probabilistically
  • Joint distribution: P(gender, weight)
  • Bayes: P(gender | weight) = P(gender) P(weight | gender) / P(weight)
  • P(gender): just count up the genders of persons in the database
  • P(weight | gender)
    • Non-parametric: histogram of weights for each gender
    • Parametric: assume a Normal (Gaussian) distribution, estimate mean/variance separately for each gender
  • Choose the gender with the higher posterior probability given the weight (sketched in code below)
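A minimal sketch of the classifier described on this slide, assuming made-up weight data and one Gaussian per gender; the numbers are illustrative only:

```python
import numpy as np
from scipy.stats import norm

# Toy training data: weights in kg, one array per gender (made-up numbers).
weights = {
    "male":   np.array([78.0, 85.0, 92.0, 70.0, 88.0]),
    "female": np.array([58.0, 62.0, 71.0, 55.0, 66.0]),
}

# P(gender): relative counts in the "database".
counts = {g: len(w) for g, w in weights.items()}
total = sum(counts.values())
priors = {g: c / total for g, c in counts.items()}

# P(weight | gender): parametric option, one Gaussian per gender.
params = {g: (w.mean(), w.std(ddof=1)) for g, w in weights.items()}

def classify(x):
    # Posterior is proportional to P(gender) * P(weight | gender); P(weight) cancels in the argmax.
    scores = {g: priors[g] * norm.pdf(x, mu, sigma) for g, (mu, sigma) in params.items()}
    return max(scores, key=scores.get)

print(classify(80.0))  # -> "male" for this toy data
```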
How a speech recognizer works
[Block diagram: Audio → Signal processing → Feature vectors x → Search: argmax_w P(w) P(x|w) → Words, combining the acoustic model P(x|w) and the language model P(w)]
Feature extraction: How do we represent the data?
• Usually the most important step in data science or data analytics
• Also a function of the amount of data
• Converts data into a vector of real numbers
• Example: represent documents for classifying email into spam/non-spam (a toy sketch follows this list)
  • Counts of various characters (?)
  • Counts of various words (?) Are all words equally important?
  • Special characters used, colors used (?)
• Example: predict attrition of an employee from performance, salary, …
  • Should we capture salary change as a percentage change rather than in absolute numbers?
  • Should we look at the performance of the manager?
  • Salaries of team members?
• Interacts with the algorithm to be used downstream
  • Is the algorithm invariant to scale?
  • Can the algorithm handle correlations in features?
  • What other assumptions?
• Domain/background knowledge comes in here
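To make the spam/non-spam example concrete, here is a toy sketch of turning an email into a feature vector; the vocabulary and the choice of character counts are made up for illustration:

```python
from collections import Counter

def email_features(text, vocab):
    """Represent an email as counts of a fixed word vocabulary plus a couple
    of simple character counts (a made-up feature set for illustration)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab] + [text.count("!"), text.count("$")]

vocab = ["free", "winner", "meeting", "invoice"]
print(email_features("FREE entry!!! You are a winner $$$", vocab))
# -> [1, 1, 0, 0, 3, 3]
```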
Signal processing
• Input: raw sampled audio, 11 kHz or 22 kHz on the desktop, 8 kHz for telephony
• Output: 40-dimensional features, 60-100 vectors per second
• Ideally:
  • Different sounds are represented differently
  • Unnecessary variations are removed
    • Noise
    • Speaker
    • Channel
  • Match modeling assumptions
Signal processing (contd.)
• Windowed FFT: sounds are easier to distinguish in frequency space
• Mel binning: sensitivity to frequencies measured by listening experiments
  • Sensitivity to a fixed difference in tone decreases with tone frequency
• Log scale: humans perceive volume on roughly a log scale
• Decorrelate the data (using a DCT); the features up to this point are called MFCCs
• Subtract the mean: scale invariance, channel invariance
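A rough, self-contained sketch of the pipeline above (windowed FFT, mel binning, log, DCT, mean subtraction); the filter-bank details are simplified and the frame sizes are illustrative, so treat it as a toy version of a real front end:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_like(signal, sample_rate=8000, frame_len=200, hop=80, n_mel=24, n_ceps=13):
    """Rough sketch: windowed FFT -> mel binning -> log -> DCT -> mean subtraction."""
    # Windowed FFT: overlapping frames, Hamming window, power spectrum.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1)) ** 2

    # Mel binning: triangular filters spaced evenly on the mel scale.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mel + 2))
    bins = np.floor((frame_len // 2) * edges_hz / (sample_rate / 2.0)).astype(int)
    fbank = np.zeros((n_mel, power.shape[1]))
    for m in range(1, n_mel + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        fbank[m - 1, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    mel_energies = power @ fbank.T

    # Log scale (volume is perceived roughly logarithmically),
    # then decorrelate with a DCT: up to here these are MFCC-style features.
    ceps = dct(np.log(mel_energies + 1e-10), type=2, axis=1, norm="ortho")[:, :n_ceps]

    # Mean subtraction for scale/channel invariance.
    return ceps - ceps.mean(axis=0)

feats = mfcc_like(np.random.randn(8000))  # 1 second of fake 8 kHz audio
print(feats.shape)                         # (98, 13): roughly 100 vectors per second
```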
Signal processing (contd.)
• Model dynamics: concatenate the previous and next few feature vectors
• Project down to throw away noise and reduce computation (Linear/Fisher Discriminant Analysis)
• A linear transform is learned to match the diagonal-Gaussian modeling assumption
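A sketch of modeling dynamics by stacking neighboring frames and then projecting down, using scikit-learn's LDA as a stand-in for the discriminant analysis mentioned above; the dimensions, random features and per-frame labels are placeholders for what a real training pipeline would supply:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_context(feats, context=4):
    """Model dynamics by concatenating each frame with its +/- `context` neighbors."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

# Placeholder inputs: cepstral features and per-frame state labels that a real
# pipeline would get from feature extraction and a forced alignment.
feats = np.random.randn(1000, 13)
labels = np.random.randint(0, 10, size=1000)

stacked = stack_context(feats, context=4)              # (1000, 13 * 9) = (1000, 117)
lda = LinearDiscriminantAnalysis(n_components=9).fit(stacked, labels)
projected = lda.transform(stacked)                     # discard the "noisy" directions
print(projected.shape)                                 # (1000, 9)
```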
How a speech recognizer works
[Block diagram repeated: Audio → Signal processing → Feature vectors x → Search: argmax_w P(w) P(x|w) → Words, with acoustic model P(x|w) and language model P(w)]
Acoustic modeling
• Need to model acoustic sequences given words: P(x|w)
• Obviously cannot create a model for every word
• Need to break words into their fundamental sounds
  • Cat → K AE T: represent the pronunciation using phonemes
  • At IBM we used 40-50 phonemes for English
• Dictionaries
  • Hand-created lists of words with their alternate pronunciations (a toy lexicon is sketched below)
• Handling new words
  • Automatic generation of pronunciations from spellings
  • Clearly a tricky task for English, e.g. foreign names
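A toy illustration of a pronunciation dictionary with alternate pronunciations; the entries and phoneme symbols are illustrative, not an actual production lexicon:

```python
# A tiny hand-built lexicon: each word maps to one or more phoneme sequences.
# The symbols follow the "K AE T" style above; the entries are only illustrative.
lexicon = {
    "cat":    [["K", "AE", "T"]],
    "the":    [["DH", "AH"], ["DH", "IY"]],          # alternate pronunciations
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
}

def pronunciations(word):
    """Dictionary lookup; unseen words would fall back to automatic
    spelling-to-sound generation instead of returning None."""
    return lexicon.get(word.lower())

print(pronunciations("cat"))      # [['K', 'AE', 'T']]
print(pronunciations("Karthik"))  # None -> needs automatic pronunciation generation
```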
Acoustic modeling (contd.)
• Pronunciations change in continuous speech depending on neighboring words
  • "Give me" might sound more like "gimme"
• Emission probabilities should depend on context
  • Use a different distribution for each different context?
  • Even with 40 phonemes, looking two phones to either side gives about 2.5 million possibilities => way too many
• Learn the contexts in which the acoustics are different
  • Tie contexts together using a decision tree
  • At each node we are allowed to ask questions about (typically) two phones to the left and right
    • E.g. is the first phoneme to the right a glottal stop?
  • Use entropy gain to grow the tree (a simplified sketch follows below)
  • End up with 2,000 to 10,000 context-dependent states from 120 context-independent states
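A simplified sketch of scoring one phonetic question by entropy gain when growing the state-tying tree; real acoustic trees typically maximize a gain in Gaussian log-likelihood rather than discrete entropy, so treat this purely as an illustration of the splitting criterion:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy (bits) of the empirical distribution of a list of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_gain(samples, question):
    """samples: list of (context, label) pairs, where context is e.g.
    {"left": "K", "right": "T"} and label is some discrete acoustic class.
    question: a yes/no test on the context, e.g. lambda c: c["right"] in GLOTTALS.
    Returns the drop in label entropy when the node is split by the question."""
    yes = [lab for ctx, lab in samples if question(ctx)]
    no = [lab for ctx, lab in samples if not question(ctx)]
    before = entropy([lab for _, lab in samples])
    after = sum(len(part) / len(samples) * entropy(part) for part in (yes, no) if part)
    return before - after
```

Growing the tree then amounts to greedily picking, at each node, the allowed question with the largest gain and recursing until a few thousand leaves remain.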
How a speech recognizer works
[Block diagram repeated: Audio → Signal processing → Feature vectors x → Search: argmax_w P(w) P(x|w) → Words, with acoustic model P(x|w) and language model P(w)]
Search
• The current approach is to precompile the language model, dictionary, phone HMMs and decision tree into a complete graph
  • Uses Weighted Finite State Machine technology heavily
• Complications
  • The space of words is large (five-gram language model)
  • Context-dependent acoustic models look across word boundaries
• Need to prune to keep search running at reasonable speeds
  • Throw away states that are far enough below the best state (beam pruning, sketched below)
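A minimal sketch of the beam pruning step mentioned in the last bullet: keep only hypotheses whose log-score is within a fixed beam of the best one; the state names and scores are invented:

```python
def prune(hypotheses, beam=10.0):
    """Beam pruning: keep only graph states whose log-score is within `beam`
    of the best active state; everything else is dropped before the next frame."""
    best = max(hypotheses.values())
    return {state: score for state, score in hypotheses.items() if score >= best - beam}

active = {"s1": -120.3, "s2": -121.0, "s3": -145.8}  # invented states and log-scores
print(prune(active, beam=10.0))                       # s3 is far below the best and is dropped
```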
Speaker/condition dependent systems
• Humans can certainly do better with a little data: they "adapt" to an unfamiliar accent or noise
• With minutes of data we can certainly do better too
• We could change our acoustic models (Gaussian mixture models) based on the new data
• We can also change the signal processing
• The techniques described work even without supervision
  • Do a speaker-independent decode, and pretend that the obtained word sequence is the truth
Adaptation
[Block diagram repeated: Audio → Signal processing → Feature vectors x → Search: argmax_w P(w) P(x|w) → Words, with acoustic model P(x|w) and language model P(w)]
Vocal tract length normalization
• Different speakers have different vocal tract lengths
  • Frequencies are stretched or squished
• At test time, estimate this frequency stretching/squishing and undo it
  • Just a single parameter, quantized to 10 different values
  • Try each value and pick the one that gives the best likelihood (sketched below)
• To get the full benefit we need to retrain in this canonical feature space
  • Gaussian mixture models and decision trees benefit from being trained in this "cleaned up" feature space
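A sketch of the warp-factor search described above; `warp_features` (feature extraction with a frequency warp) and `model.log_likelihood` are hypothetical stand-ins for the system's real components:

```python
import numpy as np

def pick_warp_factor(audio, model, warp_features, factors=np.linspace(0.8, 1.2, 10)):
    """Try each quantized warp factor and keep the one giving the best likelihood.
    `warp_features(audio, alpha)` and `model.log_likelihood(feats)` are assumed
    to be supplied by the surrounding system (hypothetical names)."""
    best_alpha, best_ll = None, -np.inf
    for alpha in factors:
        feats = warp_features(audio, alpha)   # stretch/squish the frequency axis
        ll = model.log_likelihood(feats)      # score under the acoustic model
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```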
Improvements obtained
• Conversational telephony data, test set from a call center
• Training data: 2000 hours of Fisher data (0.7 billion frames of acoustic data)
• Language model built with hundreds of millions of words from various sources
  • Including data from the domain of interest (call center conversations, IT help desk)
• Roughly 30 million parameters in the acoustic model
• System performance measured by word error rate (computing WER is sketched below)
  • Speaker-independent system: 34.0%
  • Vocal tract length normalized system: 29.0%
  • Linear transform adaptation: 27.5%
  • Discriminative feature space: 23.8%
  • Discriminative training of the model: 22.6%
• It's hard work to improve on the best systems; there is no silver bullet!
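For reference, word error rate is the edit distance (substitutions + deletions + insertions) between the hypothesis and the reference transcript, divided by the number of reference words; a small sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed as the edit distance between the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between the first i reference and first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2/6 ~ 0.33
```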
Current state of the art
• Progress is tracked on Switchboard (a conversational telephony test set)
• Replace GMMs for acoustic modeling with deep networks
Source: http://arxiv.org/pdf/1505.05899v1.pdf
Conclusions
• Gave a brief overview of the various components in practical, state-of-the-art speech recognition systems
• Speech recognition technology has relied on generative statistical models with parameters learned from data
  • Moved away from hand-coded knowledge
• Discriminative estimation techniques are more expensive but give significant improvements
• Deep learning has shown significant gains for speech recognition
• Speech recognition systems are good enough to support several useful applications
  • But they are still sensitive to variations that humans handle with ease