
Speech and Language Processing


Presentation Transcript


  1. Speech and Language Processing Chapter 9 of SLP Automatic Speech Recognition

  2. Outline for ASR • ASR Architecture • The Noisy Channel Model • Five easy pieces of an ASR system • Feature Extraction • Acoustic Model • Language Model • Lexicon/Pronunciation Model (Introduction to HMMs again) • Decoder • Evaluation Speech and Language Processing Jurafsky and Martin

  3. Speech Recognition • Applications of Speech Recognition (ASR) • Dictation • Telephone-based information (directions, air travel, banking, etc.) • Hands-free (in car) • Speaker identification • Language identification • Second language ('L2') learning (accent reduction) • Audio archive searching

  4. LVCSR • Large Vocabulary Continuous Speech Recognition • ~20,000-64,000 words • Speaker-independent (vs. speaker-dependent) • Continuous speech (vs. isolated-word)

  5. Current error rates Ballpark numbers; exact numbers depend very much on the specific corpus

  6. HSR versus ASR • Conclusions: • Machines are about 5 times worse than humans • The gap increases with noisy speech • These numbers are rough; take them with a grain of salt

  7. Why is conversational speech harder? • A piece of an utterance without context • The same utterance with more context

  8. LVCSR Design Intuition • Build a statistical model of the speech-to-words process • Collect lots and lots of speech, and transcribe all the words. • Train the model on the labeled speech • Paradigm: Supervised Machine Learning + Search

  9. Speech Recognition Architecture

  10. The Noisy Channel Model Search through the space of all possible sentences. Pick the one that is most probable given the waveform.

  11. The Noisy Channel Model (II) • What is the most likely sentence out of all sentences in the language L given some acoustic input O? • Treat acoustic input O as sequence of individual observations • O = o1,o2,o3,…,ot • Define a sentence as a sequence of words: • W = w1,w2,w3,…,wn

  12. Noisy Channel Model (III) Probabilistic implication: pick the highest-probability sentence W: Ŵ = argmax_{W ∈ L} P(W|O) We can use Bayes' rule to rewrite this: Ŵ = argmax_W P(O|W) P(W) / P(O) Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax: Ŵ = argmax_W P(O|W) P(W)
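The decision rule above can be sketched in a few lines. This is a toy illustration only: the candidate sentences and their acoustic/language-model scores are made-up numbers, not outputs of a real recognizer; real systems work in log space exactly as shown to avoid floating-point underflow.

```python
import math

# Toy sketch of the noisy-channel decision rule:
#   W-hat = argmax_W P(O|W) * P(W),
# computed as log P(O|W) + log P(W). All probabilities are invented.
candidates = {
    # sentence: (log P(O|W) acoustic score, log P(W) language-model score)
    "recognize speech":   (math.log(1e-5), math.log(4e-6)),
    "wreck a nice beach": (math.log(3e-5), math.log(1e-9)),
}

def decode(cands):
    # Pick the sentence maximizing log P(O|W) + log P(W).
    return max(cands, key=lambda w: cands[w][0] + cands[w][1])

print(decode(candidates))  # "recognize speech" wins despite a lower acoustic score
```

Note how the language model overrules the acoustically better-scoring candidate: exactly the role the prior P(W) plays in the argmax.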

  13. Noisy channel model Ŵ = argmax_W P(O|W) P(W), where P(O|W) is the likelihood (acoustic model) and P(W) is the prior (language model)

  14. The noisy channel model Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)

  15. Speech Architecture meets Noisy Channel

  16. Architecture: Five easy pieces (only 3-4 for today) • HMMs, lexicons, and pronunciation • Feature extraction • Acoustic modeling • Decoding • Language modeling (seen this already)

  17. Lexicon • A list of words • Each one with a pronunciation in terms of phones • We get these from an on-line pronunciation dictionary • CMU dictionary: 127K words • http://www.speech.cs.cmu.edu/cgi-bin/cmudict • We’ll represent the lexicon as an HMM
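A lexicon of this kind is easy to load. The sketch below parses two entries written in the CMU dictionary's plain-text format (word followed by ARPAbet phones, `;;;` for comments); the embedded sample lines stand in for the real downloaded file.

```python
# Minimal sketch of parsing CMU-dictionary-style entries.
sample = """\
;;; toy excerpt in CMUdict format
SIX  S IH1 K S
FIVE  F AY1 V
"""

def parse_lexicon(text):
    lexicon = {}
    for line in text.splitlines():
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        word, *phones = line.split()            # word, then its phone string
        lexicon[word] = phones
    return lexicon

lex = parse_lexicon(sample)
print(lex["SIX"])  # ['S', 'IH1', 'K', 'S']
```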

  18. HMMs for speech: the word “six”

  19. Phones are not homogeneous!

  20. Each phone has 3 subphones

  21. Resulting HMM word model for “six” with its subphones
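The expansion from phones to the 3-subphone HMM states can be sketched directly. This is a hypothetical illustration: the `_b`/`_m`/`_e` naming for beginning/middle/end subphones is my own convention, not a fixed standard.

```python
# Hypothetical sketch: expand a word's phone string into the
# beginning/middle/end subphone HMM state sequence from the slide.
def subphone_states(phones):
    return [f"{p}_{part}" for p in phones for part in ("b", "m", "e")]

# "six" = [s ih k s] -> 4 phones x 3 subphones = 12 HMM states.
print(subphone_states(["s", "ih", "k", "s"]))
```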

  22. HMM for the digit recognition task

  23. Detecting Phones • Two stages • Feature extraction • Basically a slice of a spectrogram • Building a phone classifier (using a GMM classifier)

  24. MFCC: Mel-Frequency Cepstral Coefficients

  25. MFCC process: windowing

  26. MFCC process: windowing

  27. Applying a Hamming window to the signal, then computing the spectrum
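These two steps can be sketched with the standard library alone. The frame below is a synthetic pure tone (10 cycles over 200 samples), and the spectrum is computed with a naive DFT for clarity; real front ends use an FFT over ~25 ms frames.

```python
import math

# Sketch of windowing one frame and taking its magnitude spectrum.

def hamming(N):
    # Hamming window: w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1))
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def magnitude_spectrum(frame):
    # Naive DFT magnitude, non-negative frequencies only.
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / N) for n, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * n / N) for n, x in enumerate(frame))
        spec.append(math.hypot(re, im))
    return spec

# A 200-sample frame of a pure tone at 10 cycles per frame; after
# windowing, the spectrum still peaks at DFT bin 10.
frame = [math.sin(2 * math.pi * 10 * n / 200) for n in range(200)]
windowed = [w * x for w, x in zip(hamming(200), frame)]
spec = magnitude_spectrum(windowed)
print(spec.index(max(spec)))  # 10
```

The window tapers the frame edges to zero, which reduces the spectral leakage a rectangular cut would introduce.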

  28. Final Feature Vector • 39 features per 10 ms frame: • 12 MFCC features • 12 delta MFCC features • 12 delta-delta MFCC features • 1 (log) frame energy • 1 delta (log) frame energy • 1 delta-delta (log) frame energy • So each frame is represented by a 39-dimensional vector
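The delta features can be sketched as simple frame-to-frame differences. This uses the basic formula d_t = (c_{t+1} - c_{t-1}) / 2 with edge padding; real systems often use a regression over a wider window, and the cepstral values below are toy numbers.

```python
# Sketch of delta ("velocity") features over a list of per-frame
# coefficient vectors: d_t = (c_{t+1} - c_{t-1}) / 2, padded at the edges.
def deltas(frames):
    out = []
    T = len(frames)
    for t in range(T):
        prev = frames[max(t - 1, 0)]       # pad: reuse first/last frame
        nxt = frames[min(t + 1, T - 1)]
        out.append([(n - p) / 2 for p, n in zip(prev, nxt)])
    return out

cepstra = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]   # 3 toy frames, 2 coeffs each
d = deltas(cepstra)
dd = deltas(d)        # delta-deltas ("acceleration"): deltas of the deltas
print(d[1])           # [1.5, 3.0]
```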

  29. Acoustic Modeling (= Phone detection) • Given a 39-dimensional vector corresponding to the observation of one frame oi • And given a phone q we want to detect • Compute p(oi|q) • Most popular method: • GMM (Gaussian mixture models) • Other methods: • Neural nets, CRFs, SVMs, etc.

  30. Gaussian Mixture Models Also called “fully continuous HMMs” P(o|q) computed by a Gaussian: p(o|q) = (1/√(2πσ²)) exp(−(o − μ)²/(2σ²))

  31. Gaussians for Acoustic Modeling A Gaussian is parameterized by a mean and a variance. Different means shift the curve: P(o|q) is highest at the mean and low for observations far from the mean.

  32. Training Gaussians A (single) Gaussian is characterized by a mean and a variance Imagine that we had some training data in which each phone was labeled And imagine that we were just computing 1 single spectral value (real-valued number) as our acoustic observation We could just compute the mean and variance from the data: μ = (1/T) Σt ot, σ² = (1/T) Σt (ot − μ)²
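Training and using a single 1-D Gaussian is a few lines. The observations below are toy spectral values invented for the example, standing in for labeled frames of one phone.

```python
import math

# Sketch: fit a single 1-D Gaussian to labeled frames of one phone,
# then evaluate the likelihood p(o|q).

def fit_gaussian(samples):
    # mu = (1/T) sum o_t ; var = (1/T) sum (o_t - mu)^2
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, var

def gaussian_pdf(o, mu, var):
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

observations = [2.8, 3.1, 2.9, 3.2, 3.0]   # toy spectral values for phone q
mu, var = fit_gaussian(observations)
print(round(mu, 2))                         # 3.0
# Likelihood is highest at the mean, as the earlier slide illustrated.
print(gaussian_pdf(mu, mu, var) > gaussian_pdf(mu + 1.0, mu, var))  # True
```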

  33. But we need 39 Gaussians, not 1! The observation o is really a vector of length 39 So we need a vector of Gaussians (one mean and variance per dimension)

  34. Actually, a mixture of Gaussians Each phone is modeled by a weighted sum of different Gaussians Hence able to model complex facts about the data

  35. Gaussian acoustic modeling • Summary: each phone is represented by a GMM parameterized by • M mixture weights • M mean vectors • M covariance matrices • Usually assume the covariance matrix is diagonal • I.e., just keep a separate variance for each cepstral feature
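Putting the pieces together, a diagonal-covariance GMM likelihood looks like this. The 2-mixture, 2-dimensional parameters are toy values chosen only to make the comparison at the end obvious.

```python
import math

# Sketch of a diagonal-covariance GMM likelihood:
#   p(o|q) = sum_m c_m * prod_d N(o_d; mu_{m,d}, var_{m,d})

def diag_gauss_logpdf(o, mu, var):
    # Diagonal covariance: the log-density is a sum over dimensions.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mu, var))

def gmm_likelihood(o, weights, means, variances):
    return sum(c * math.exp(diag_gauss_logpdf(o, mu, var))
               for c, mu, var in zip(weights, means, variances))

weights = [0.6, 0.4]                       # M = 2 mixture weights
means = [[0.0, 0.0], [3.0, 3.0]]           # M mean vectors
variances = [[1.0, 1.0], [1.0, 1.0]]       # M diagonal "covariance matrices"

# An observation near a component mean scores far higher than one far
# from both components.
print(gmm_likelihood([0.1, -0.1], weights, means, variances) >
      gmm_likelihood([5.0, 5.0], weights, means, variances))   # True
```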

  36. Where we are • Given: a wave file • Goal: output a string of words • What we know: the acoustic model • How to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms • If we had a complete phonetic labeling of the training set, we know how to train a Gaussian “phone detector” for each phone • We also know how to represent each word as a sequence of phones • What we knew from Chapter 4: the language model • Next time: • Seeing all this back in the context of HMMs • Search: how to combine the language model and the acoustic model to produce a sequence of words

  37. Decoding • In principle: Ŵ = argmax_W P(O|W) P(W) • In practice:

  38. Why is ASR decoding hard?

  39. HMMs for speech

  40. HMM for digit recognition task

  41. The Evaluation (forward) problem for speech • The observation sequence O is a series of MFCC vectors • The hidden states W are the phones and words • For a given phone/word string W, our job is to evaluate P(O|W) • Intuition: how likely is the input to have been generated by just that word string W

  42. Evaluation for speech: Summing over all different paths! • f ay ay ay ay v v v v • f f ay ay ay ay v v v • f f f f ay ay ay ay v • f f ay ay ay ay ay ay v • f f ay ay ay ay ay ay ay ay v • f f ay v v v v v v v
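The forward algorithm sums P(O, path) over every one of these state paths without enumerating them. Below is a sketch for a left-to-right HMM over the phones of "five" (f ay v); all transition and emission probabilities are toy numbers chosen for illustration.

```python
# Sketch of the forward algorithm for a left-to-right HMM over
# the phones of "five": alpha[t][j] = P(o_1..o_t, state at t = j).

states = ["f", "ay", "v"]
a_stay, a_next = 0.5, 0.5        # left-to-right: self-loop or advance
# Toy emission probabilities b[state][frame] for 4 observation frames.
b = {"f":  [0.8, 0.4, 0.1, 0.1],
     "ay": [0.1, 0.5, 0.7, 0.2],
     "v":  [0.1, 0.1, 0.2, 0.7]}
T = 4

alpha = [[0.0] * len(states) for _ in range(T)]
alpha[0][0] = b["f"][0]                       # must start in "f"
for t in range(1, T):
    for j, s in enumerate(states):
        stay = alpha[t - 1][j] * a_stay       # sum over BOTH ways to reach j
        move = alpha[t - 1][j - 1] * a_next if j > 0 else 0.0
        alpha[t][j] = (stay + move) * b[s][t]

# P(O | "five"): total probability of ending in the final state "v".
print(alpha[T - 1][len(states) - 1])
```

The `stay + move` sum is exactly where all the different f/ay/v alignments listed on the slide get folded together.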

  43. The forward lattice for “five”

  44. The forward trellis for “five”

  45. Viterbi trellis for “five”

  46. Viterbi trellis for “five”

  47. Search space with bigrams

  48. Viterbi trellis

  49. Viterbi backtrace
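Viterbi decoding over the same toy "five" topology differs from the forward algorithm in one line: it takes a max over predecessors instead of a sum, and records a backpointer so the best state path can be recovered. The probabilities are the same toy numbers, not from a real model.

```python
# Sketch of Viterbi decoding with backtrace for a left-to-right HMM
# over the phones of "five" (f ay v), toy probabilities throughout.

states = ["f", "ay", "v"]
a_stay, a_next = 0.5, 0.5
b = {"f":  [0.8, 0.4, 0.1, 0.1],
     "ay": [0.1, 0.5, 0.7, 0.2],
     "v":  [0.1, 0.1, 0.2, 0.7]}
T = 4

vit = [[0.0] * len(states) for _ in range(T)]
backptr = [[0] * len(states) for _ in range(T)]
vit[0][0] = b["f"][0]
for t in range(1, T):
    for j, s in enumerate(states):
        cands = [(vit[t - 1][j] * a_stay, j)]            # stay in j
        if j > 0:
            cands.append((vit[t - 1][j - 1] * a_next, j - 1))  # advance
        best, backptr[t][j] = max(cands)                 # max, not sum
        vit[t][j] = best * b[s][t]

# Backtrace: follow backpointers from the final state to recover the path.
path = [len(states) - 1]
for t in range(T - 1, 0, -1):
    path.append(backptr[t][path[-1]])
path.reverse()
print([states[j] for j in path])   # ['f', 'ay', 'ay', 'v']
```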

  50. Training
