Automatic Speech Recognition Julia Hirschberg CS 6998
What is speech recognition? • Transcribing words? • Understanding meaning?
It’s hard to recognize speech... • People speak in very different ways • Across speaker variation • Within speaker variation • Speech sounds vary according to the speech context • Environment varies wrt noise • Transcription task must handle all of this and produce a transcript of spoken words
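Success: low word error rate, WER = (S + I + D) / N * 100, where S, I, D are the substitutions, insertions, and deletions against the reference and N is the number of reference words • "Thesis test" vs. "This is a test": 75% WER • Progress: • Very large training corpora • Fast machines and cheap storage • Bake-offs • Market for real-time systems • New representations and algorithms: Finite State Transducers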
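The WER figure on this slide can be checked with a word-level edit distance. The following is a minimal sketch, not a standard scoring tool: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (S + I + D) / N * 100, N = number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)

    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


# Slide example: one substitution (this -> thesis) and two deletions (is, a).
print(word_error_rate("this is a test", "thesis test"))  # 75.0
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.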
ASR and the Noisy Channel Model • Source --> noisy channel --> Hypothesis • Find the most likely input to have generated the (observed) “noisy” sentence by finding the most likely sentence W in the language given acoustic input O • $W' = \arg\max_{W} P(W \mid O)$ • Bayes rule • $W' = \arg\max_{W} \dfrac{P(O \mid W)\,P(W)}{P(O)}$
P(O) is the same for all hypothetical W, so • $W' = \arg\max_{W} P(O \mid W)\,P(W)$ • P(W) is the prior; P(O|W) is the (acoustic) likelihood
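The decision rule can be sketched in a few lines. Everything concrete here is an assumption for illustration: the two candidate sentences, their probabilities, and the function name `decode` are made up, and a real recognizer searches a far larger hypothesis space. Scores are combined in the log domain, which is the usual way to avoid underflow.

```python
import math


def decode(acoustic_loglik: dict, lm_prior: dict) -> str:
    """Return the W that maximizes log P(O|W) + log P(W)."""
    return max(
        acoustic_loglik,
        key=lambda w: acoustic_loglik[w] + math.log(lm_prior[w]),
    )


# Hypothetical scores: the acoustics slightly prefer the second candidate,
# but the language-model prior strongly prefers the first.
acoustic = {
    "recognize speech": math.log(0.4),
    "wreck a nice beach": math.log(0.5),
}
prior = {
    "recognize speech": 0.01,
    "wreck a nice beach": 0.0001,
}

best = decode(acoustic, prior)
print(best)  # recognize speech
```

P(O) never appears in the code because it is identical for every candidate and so cannot change which one wins.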
Simple Isolated Digit Recognition • Train 10 acoustic templates Mi: one per digit • Compare input x with each • Select most similar template j according to some comparison function f, minimizing differences • $j = \arg\min_{i} f(x, M_i)$
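The slide leaves the comparison function f unspecified. One classic choice for template matching is dynamic time warping (DTW), which tolerates differences in speaking rate. The sketch below assumes DTW and uses made-up one-dimensional "feature" sequences; real templates would be sequences of spectral feature vectors, and the names `dtw_distance` and `recognize` are illustrative.

```python
def dtw_distance(x, y):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = cheapest alignment of x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y frame repeated
                cost[i][j - 1],      # y advances, x frame repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(x, templates):
    """Return the label j = argmin_i f(x, M_i), with f = DTW distance."""
    return min(templates, key=lambda label: dtw_distance(x, templates[label]))


# Hypothetical templates (two labels instead of ten, for brevity).
templates = {"one": [1.0, 3.0, 2.0], "two": [5.0, 5.0, 1.0]}
x = [1.0, 1.0, 3.0, 3.0, 2.0]  # a slower rendition of "one"

print(recognize(x, templates))  # one
```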
Scaling Up: Continuous Speech Recognition • Collect training and test corpora of • Speech + word transcription • Speech + phonetic transcription • Built by hand or using TTS • Text corpus • Determine a representation for the signal • Build probabilistic • Acoustic model: signal to phones
Pronunciation model: phones to words • Language model: words to sentences • Select search procedures to decode new input given these training models
Representing the Signal • What parameters (features) of the waveform • Can be extracted automatically • Will preserve phonetic identity and distinguish it from other phones • Will be independent of speaker variability and channel conditions • Will not take up too much space • …Power Spectrum
Speech captured by microphone and digitized • Signal divided into frames • Power spectrum computed to represent energy in different bands of the signal • LPC spectrum, Cepstra, PLP • Each frame’s spectral features represented by small set of numbers
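The framing and power-spectrum steps can be sketched with the standard library alone. The 8 kHz sample rate, 25 ms frames, 10 ms hop, and Hamming window are common choices assumed here, not values given on the slides, and the direct DFT is written for clarity; a real front end would use an FFT and go on to compute cepstra or PLP features.

```python
import cmath
import math


def frames(signal, frame_len, hop):
    """Split a signal into overlapping fixed-length frames."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def power_spectrum(frame):
    """Hamming-window a frame and return the power in each DFT bin up to Nyquist."""
    n = len(frame)
    windowed = [
        v * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
        for t, v in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(windowed)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


sample_rate = 8000                   # assumed sample rate in Hz
frame_len, hop = 200, 80             # 25 ms frames every 10 ms at 8 kHz
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]

frame_list = frames(tone, frame_len, hop)
spec = power_spectrum(frame_list[0])
peak_bin = max(range(len(spec)), key=spec.__getitem__)
peak_hz = peak_bin * sample_rate / frame_len
print(len(frame_list), peak_hz)      # 8 frames; the peak sits at the 1000 Hz tone
```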
Why it works? • Different phonemes have different spectral characteristics • Why it doesn’t work? • Phonemes can have different properties in different acoustic contexts, spoken by different people, ...
Acoustic Models • Model likelihood of phone given spectral features and prior context • Usually represented as HMM • Set of states representing phones or other subword units • Transition probabilities on states: how likely is it to see one phone after another? • Observation/output likelihoods: how likely is spectral feature vector to be observed from state i, given state i-1?
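To show how transition and observation probabilities combine, here is a minimal sketch of the forward algorithm, which computes the likelihood of an observation sequence under an HMM. The two states, the discrete symbols "lo" and "hi" (standing in for quantized spectral feature vectors), and all probabilities are hypothetical; real acoustic models use continuous observation densities.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Return P(obs | HMM) by summing over all state paths (forward algorithm)."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical left-to-right model with two subword states.
states = ["s1", "s2"]
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s1": 0.0, "s2": 1.0},
}
emit_p = {
    "s1": {"lo": 0.9, "hi": 0.1},
    "s2": {"lo": 0.2, "hi": 0.8},
}

likelihood = forward_likelihood(["lo", "hi"], states, start_p, trans_p, emit_p)
print(likelihood)
```

For this toy model the two paths with nonzero probability contribute 0.9 * 0.6 * 0.1 and 0.9 * 0.4 * 0.8, which sum to 0.342.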
Train initial model on small hand-labeled corpus to get estimate of transition and observation probabilities • Tune parameters on large corpus with only transcription • Iterate until no further improvement
Pronunciation Model • Models likelihood of word given network of candidate phone hypotheses (weighted phone lattice) • Allophones: butter vs. but • Lexicon may be HMM or simple dictionary
Language Models • Models likelihood of word sequence given candidate word hypotheses • Grammars • Finite state or CFG • Ngrams • Corpus trained • Smoothing issues • Out of Vocabulary (OOV) problem
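As an illustration of a corpus-trained n-gram model and of why smoothing and OOV handling matter, here is a minimal bigram sketch with add-one (Laplace) smoothing. The three training sentences are made up, add-one smoothing is only the simplest option, and mapping unseen words to an `<unk>` token is one assumed way to handle OOV words.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts; return counts plus the vocabulary."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>", "<unk>"}
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Add-one smoothed P(word | prev); unseen words are mapped to <unk>."""
    if word not in vocab:
        word = "<unk>"
    if prev not in vocab and prev != "<s>":
        prev = "<unk>"
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


corpus = [
    "a train to chicago",
    "a train from baltimore",
    "a train from baltimore to chicago",
]
contexts, bigrams, vocab = train_bigram(corpus)

p_seen = bigram_prob("a", "train", contexts, bigrams, vocab)    # (3 + 1) / (3 + 8)
p_oov = bigram_prob("to", "boston", contexts, bigrams, vocab)   # (0 + 1) / (2 + 8)
print(p_seen, p_oov)
```

Without the added count, the unseen pair "to boston" would get probability zero and any sentence containing it could never be hypothesized.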
Search • Find the best hypothesis given • Lattice of subword units (AM) • Segmentation of all paths into possible words (PM) • Probabilities of word sequences (LM) • Huge search space • Viterbi decoding • Beam search
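Viterbi decoding can be sketched on a toy HMM. The states, symbols, and probabilities below are hypothetical, and the sketch multiplies raw probabilities for readability; a practical decoder would work with log probabilities and over a much larger network that combines the acoustic, pronunciation, and language models.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the single best path for obs."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Choose the best predecessor r for state s.
            prob, path = max(
                ((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical left-to-right model with two subword states.
states = ["s1", "s2"]
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {
    "s1": {"s1": 0.6, "s2": 0.4},
    "s2": {"s1": 0.0, "s2": 1.0},
}
emit_p = {
    "s1": {"lo": 0.9, "hi": 0.1},
    "s2": {"lo": 0.2, "hi": 0.8},
}

prob, path = viterbi(["lo", "lo", "hi"], states, start_p, trans_p, emit_p)
print(path, prob)
```

Beam search would modify the loop by keeping only the highest-scoring entries of `best` at each step, trading exactness for speed in a huge search space.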
Challenges for Transcription • Robustness to channel characteristics and noise • Portability to new applications • Adaptation: to speakers, to environments • LMs: simple ngrams need help • Confidence measures • OOV words • New speaking styles/genres • New applications
Challenges for Understanding • Recognizing communicative ‘problems’ • ASR errors • User corrections • Disfluencies and self-repairs
An Unsuccessful Dialogue
S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
U: I'd like to find a train from Ch- to Chicago from Baltimore.
S: Sorry, I can't understand you. Please repeat your utterance.
U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore ...
S: ...I heard you s...
U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.
U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-
Disfluencies and Self-Repairs • Disfluencies abound in spontaneous speech • every 4.6s in radio call-in (Blackmer & Mitton ’91) • hesitation: Ch- change strategy. • filled pause: Um Baltimore. • self-repair: Ba- uh Chicago. • Hard to recognize • Ch- change strategy. --> to D C D C today ten fifteen. • Um Baltimore. --> From Baltimore ten. • Ba- uh Chicago. --> For Boston Chicago.
Possibilities for Understanding • Recognizing speaker emotion • Identifying speech acts: okay • Locating topic boundaries for topic tracking, audio browsing, speech data mining