Automatic Speech Recognition (ASR): A Brief Overview

Radio Rex – 1920's ASR
Statistical ASR
• i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i)
  (1st term: acoustic model; 2nd term: language model)
• P(X | M_i) ≈ P(X | Q_i)  [Viterbi approx.], where Q_i is the best state sequence in M_i
• P(X | Q_i) is approximated by a product of local likelihoods (Markov, conditional independence assumptions)
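Since the Viterbi approximation replaces the sum over all state sequences with the single best sequence Q_i, the search reduces to dynamic programming. Below is a minimal log-domain sketch (my illustration, not code from the talk); the parameter shapes and names are assumptions.

```python
# Minimal Viterbi decoder sketch over a toy HMM (hypothetical example).
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S) -> (best path, log score)."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]          # best log score of paths ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers for path recovery
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[i, j]: best path into i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # walk backpointers from the final frame
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```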
Automatic Speech Recognition
[Figure: processing chain: Speech Production/Collection → Pre-processing → Feature Extraction → Hypothesis Generation → Cost Estimator → Decoding]
Simplified Model of Speech Production
• Sources (periodic or random): vocal vibration or turbulence (fine spectral structure)
• Filters: vocal tract, nasal tract, radiation (spectral envelope)
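As a hedged illustration of the source-filter idea (not from the slides), the sketch below excites a crude all-pole "vocal tract" filter with a periodic pulse train; the pitch, resonance frequencies, and pole radii are arbitrary choices.

```python
# Toy source-filter synthesis: periodic source shaped by an all-pole filter.
import numpy as np
from scipy.signal import lfilter

fs = 8000                                    # sample rate (Hz)
n = np.arange(fs // 2)                       # half a second of samples
source = (n % 80 == 0).astype(float)         # impulse train: 100 Hz pitch (8000/80)
poles = []                                   # resonances near 500 and 1500 Hz
for f, r in [(500, 0.97), (1500, 0.95)]:
    w = 2 * np.pi * f / fs
    poles += [r * np.exp(1j * w), r * np.exp(-1j * w)]
a = np.real(np.poly(poles))                  # all-pole denominator coefficients
speech_like = lfilter([1.0], a, source)      # fine structure x spectral envelope
```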
Pre-processing
[Figure: Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization]
Issues: noise and reverb, effect on modeling
Framewise Analysis of Speech
[Figure: the waveform is sliced into overlapping windows; Frame 1 → Feature Vector X1, Frame 2 → Feature Vector X2, …]
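A minimal framing sketch showing how a waveform becomes the sequence of frames that feature extraction turns into vectors X1, X2, …; the 25 ms window, 10 ms hop, and Hamming taper are my assumed defaults.

```python
# Slice a waveform into overlapping, windowed analysis frames.
import numpy as np

def frame_signal(x, fs, win_ms=25.0, hop_ms=10.0):
    """Return windowed frames of x, shape (n_frames, win_samples)."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // hop)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)          # one row per frame
```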
Feature Extraction
[Figure: Spectral Analysis → Auditory Model / Orthogonalize (cepstrum)]
Issues: design for discrimination, insensitivity to scaling and simple distortions
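A cepstral sketch of the "spectral analysis, then orthogonalize" step: log magnitude spectrum followed by a DCT, as in standard MFCC practice. This is my simplification of real front ends such as PLP or MFCC; the auditory (mel/Bark filterbank) stage is assumed and omitted here.

```python
# Simplified cepstral features from windowed frames (see framing sketch above).
import numpy as np
from scipy.fftpack import dct

def cepstral_features(frames, n_ceps=13):
    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # spectral analysis
    log_spec = np.log(spectrum + 1e-10)              # compress dynamic range
    return dct(log_spec, type=2, axis=1, norm='ortho')[:, :n_ceps]  # orthogonalize
```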
Representations are Important
[Figure: the same network trained on the raw speech waveform reaches 23% frame correct; trained on PLP features it reaches 70% frame correct.]
Spectral vs Temporal Processing
[Figure: spectral processing operates across frequency within a frame (e.g., cepstral analysis); temporal processing operates across time within a frequency band (e.g., mean removal).]
Hypothesis Generation
[Figure: candidate word sequences, e.g., "a dog is not a cat" vs. the scrambled "a cat not is a dog"]
Issue: models of language and task
Cost Estimation
• Distances
• -Log probabilities, from:
  – discrete distributions
  – Gaussians, mixtures
  – neural networks
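For example, with a Gaussian state model the "cost" of a feature vector is its negative log likelihood; a minimal diagonal-covariance sketch (my example, not the lecture's code):

```python
# Negative log likelihood of a feature vector under a diagonal Gaussian.
import numpy as np

def neg_log_gaussian(x, mean, var):
    """-log N(x; mean, diag(var)); lower cost means a better-matching state."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
```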
Language Models
• Most likely words for largest product: P(acoustics | words) P(words)
• P(words) = Π P(word | history)
• bigram: history is the previous word
• trigram: history is the previous 2 words
• n-gram: history is the previous n-1 words
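A toy bigram sketch of the counting behind P(word | history); the add-one smoothing and the `<s>` start token are my assumptions, not from the slides.

```python
# Count-based bigram language model with add-one smoothing (toy example).
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ['<s>'] + words
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def prob(word, history):                 # P(word | previous word), smoothed
        return (bigrams[(history, word)] + 1) / (unigrams[history] + vocab)
    return prob

p = train_bigram([['a', 'dog', 'is', 'not', 'a', 'cat']])
print(p('dog', 'a'))                         # 0.25, higher than p('is', 'a') = 0.125
```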
ASR System Architecture
[Figure: Speech Signal → Signal Processing (cepstrum) → Acoustic Probability Estimator (HMM state likelihoods, e.g., "z" = 0.81, "th" = 0.15, "t" = 0.03) → Decoder → Recognized Words ("zero", "three", "two"); the Decoder also consults a Language Model and a Pronunciation Lexicon.]
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the original CMU Dragon System (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, …) (1970-1993)
• Extended by others in the mid-1980's
Hidden Markov Model (graphical form)
[Figure: a chain of states q1 → q2 → q3 → q4, each emitting an observation x1, x2, x3, x4.]
Hidden Markov Model (state machine form)
[Figure: states q1, q2, q3 with emission probabilities P(x | q1), P(x | q2), P(x | q3) and transition probabilities P(q2 | q1), P(q3 | q2), P(q4 | q3).]
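For contrast with the Viterbi approximation sketched earlier, the exact likelihood P(X | M) sums over all state paths rather than taking the single best one; a minimal log-domain forward-algorithm sketch (same assumed parameter shapes as the Viterbi sketch):

```python
# Forward algorithm: exact log P(X | M) by summing over all state paths.
from scipy.special import logsumexp

def forward_log_likelihood(log_init, log_trans, log_emit):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S) -> log P(X | M)."""
    alpha = log_init + log_emit[0]                        # paths of length 1
    for t in range(1, log_emit.shape[0]):
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[t]
    return float(logsumexp(alpha))                        # sum over final states
```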
Markov Model
[Figure: two states q1 → q2]
P(x1, x2 | q1, q2) ≈ P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
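A tiny numeric check of this factorization, with made-up probabilities:

```python
# Made-up values; any numbers respecting the factorization would do.
P_q1    = 0.7   # P(q1): initial state probability
P_x1_q1 = 0.5   # P(x1 | q1): emission in state q1
P_q2_q1 = 0.4   # P(q2 | q1): transition q1 -> q2
P_x2_q2 = 0.6   # P(x2 | q2): emission in state q2
likelihood = P_q1 * P_x1_q1 * P_q2_q1 * P_x2_q2   # = 0.084
```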
HMM Training Steps
• Initialize estimators and models
• Estimate "hidden" variable probabilities
• Choose estimator parameters to maximize model likelihoods
• Assess and repeat steps as necessary
• A special case of Expectation Maximization (EM); see the sketch below
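A hard-EM ("Viterbi training") sketch of this loop, my simplification of full Baum-Welch re-estimation: align frames to states with the current model (the "hidden variable" step), then refit per-state Gaussians (the maximization step). The `decode` callback, e.g., the Viterbi sketch above adapted to these models, is assumed.

```python
# Simplified hard-EM training of per-state diagonal Gaussians.
import numpy as np

def viterbi_train(frames_per_utt, decode, n_states, n_iter=10):
    """decode(frames, means, variances) -> best state index per frame."""
    dim = frames_per_utt[0].shape[1]
    means = np.random.randn(n_states, dim)           # 1. initialize estimators
    variances = np.ones((n_states, dim))
    for _ in range(n_iter):                          # 4. repeat as necessary
        assigned = [[] for _ in range(n_states)]
        for X in frames_per_utt:                     # 2. estimate hidden states
            for x, q in zip(X, decode(X, means, variances)):
                assigned[q].append(x)
        for q, xs in enumerate(assigned):            # 3. maximize likelihood
            if xs:
                xs = np.array(xs)
                means[q] = xs.mean(axis=0)
                variances[q] = xs.var(axis=0) + 1e-6  # floor for stability
    return means, variances
```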
Progress in 3 Decades
• From digits to 60,000 words
• From single speakers to many
• From isolated words to continuous speech
• From no products to many products, some systems actually saving LOTS of money
Real Uses
• Telephone: phone company services (collect versus credit card)
• Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
• Dictation products: continuous recognition, speaker dependent/adaptive
But:
• Still <97% accuracy on "yes" over the telephone
• Unexpected rate of speech causes doubling or tripling of the error rate
• Unexpected accent hurts badly
• Performance on unrestricted speech is at 70% (with good acoustics)
• Don't know when we know
• Few advances in basic understanding
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over: global rate, local rate, pronunciation within speaker, pronunciation across speakers, phonemes in different contexts
Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for equal to or greater than "human performance"
ASR Dimensions • Speaker dependent, independent • Isolated, continuous, keywords • Lexicon size and difficulty • Task constraints, perplexity • Adverse or easy conditions • Natural or read speech
Telephone Speech
• Limited bandwidth (e.g., "f" vs "s" become hard to distinguish)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and handsfree acoustics
Hot Research Problems • Speech in noise • Multilingual conversational speech (EARS) • Portable (e.g., cellular) ASR • Question answering • Understanding meetings – or at least browsing them
Hot Research Approaches • New (multiple) features and models • New statistical dependencies • Multiple time scales • Multiple (larger) sound units • Dynamic/robust pronunciation models • Long-range language models • Incorporating prosody • Incorporating meaning • Non-speech modalities • Understanding confidence
Multi-frame analysis
• Incorporate multiple frames as a single observation
• LDA is the most common approach
• Neural networks
• Bayesian networks (graphical models, including Buried Markov Models)
Linear Discriminant Analysis (LDA)
[Figure: all variables for several frames, x1 … x5, are stacked into one vector x; a linear transformation to y is chosen to maximize the ratio: between-class variance / within-class variance.]
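A hedged sketch of the stacking-plus-LDA recipe using scikit-learn; the context width, dimensionalities, and the random stand-in data and per-frame phone labels are all assumptions for illustration.

```python
# Stack neighboring frames, then learn a discriminant projection with LDA.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_frames(feats, context=4):
    """Concatenate each frame with +/- context neighbors (edges clamped)."""
    T = len(feats)
    idx = np.clip(np.arange(T)[:, None] + np.arange(-context, context + 1), 0, T - 1)
    return feats[idx].reshape(T, -1)

feats = np.random.randn(1000, 13)               # e.g., cepstra: (frames, dims)
labels = np.random.randint(0, 40, size=1000)    # per-frame phone classes (fake)
X = stack_frames(feats)                         # 9 frames x 13 dims = 117 dims
lda = LinearDiscriminantAnalysis(n_components=24).fit(X, labels)
y = lda.transform(X)                            # reduced, discriminant features
```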
Multi-stream analysis • Multi-band systems • Multiple temporal properties • Multiple data-driven temporal filters
Another novel approach: Articulator dynamics • Natural representation of context • Production apparatus has mass, inertia • Difficult to accurately model • Can approximate with simple dynamics
Hidden Dynamic Models
"We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context-independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor."
– John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson … (See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
Hidden Dynamic Models
[Figure: block diagram with components: segmentation → target switch → target values → filter → neural network → speech pattern.]
Sources of Optimism
• Comparatively new research lines
• Many examples of improvements
• Moore's Law → much more processing
• Points toward joint development of front end and statistical components
Summary
• 2002 ASR is based on 50+ years of research
• Core algorithms → mature systems, 10-30 yrs
• Deeply difficult, but tasks can be chosen that are easier in SOME dimension
• Much more yet to do