Explore how people and machines process speech, including mechanical filtering, prosody, front-end processing, and recognition techniques such as hidden Markov models. Understand the complexity of speech perception and automatic speech recognition through the lens of human and machine capabilities.
Modular Processing in Human Speech Perception and Automatic Speech Recognition
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
Outline
I. How do People Process Speech?
  1. "Front End"
  2. "Classifier"
II. How do Machines Process Speech?
  3. Front End (e.g. MFCC, PLP)
  4. Classifier (e.g. Mixture Gaussian)
  5. Recognizer (e.g. DP, Stack Search)
2. Example of Pattern Matching: Prosody (Stress and Rhythm)
Every content word in English has a stressed syllable. Stress is part of the dictionary entry of the word, e.g. "the large congressional building."
Many words follow the rule (see the sketch below):
  Fewer than three syllables: the first syllable is stressed.
  Three or more syllables: the antepenultimate syllable is stressed.
There are many special cases.
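A minimal sketch of this default stress-placement rule, purely for illustration; real dictionary entries override it in the many lexical exceptions the slide mentions:

```python
def default_stress_index(num_syllables):
    """Default English stress position for a content word:
    fewer than three syllables -> first syllable,
    three or more -> antepenultimate (third from last).
    Many words are lexical exceptions to this rule."""
    if num_syllables < 3:
        return 0                      # first syllable
    return num_syllables - 3          # antepenultimate syllable

# default_stress_index(4) == 1, e.g. con-GRESS-ion-al
```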
2. How do People Use Prosody?
1. People use prosody to organize the listening experience.
  Infants use stress to learn new vocabulary.
  Adults recognize speech despite negative SNR partly by listening to stress patterns.
2. Prosody affects the signal in a probabilistic way.
  Fundamental frequency (F0) may be affected.
  Duration may be affected.
  Energy may be affected.
  … or none of the above may be affected.
2. How Would a Machine Use Prosody?
Goals:
  Disambiguate sentences with similar phonemic content.
  Create speech recognition algorithms which will fail less often in noisy environments.
Example:
  "The nurse brought a big Ernie doll."
  "The nurse brought a bigger needle."
4. Classifier: Statistical Classification
Classification: choose the "most probable" C
  C = argmax p(C|O)
    = argmax p(O|C) p(C) / p(O)
    = argmax p(O|C) p(C)
p(C) --- the "language model"
p(O|C) --- the "acoustic model"
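A minimal sketch of this decision rule, assuming (as a simplification of the mixture Gaussian models mentioned in the outline) one diagonal-covariance Gaussian acoustic model per class and priors supplied by a language model; the parameter layout is hypothetical:

```python
import numpy as np

def classify(o, acoustic_models, priors):
    """Bayes decision rule: choose the class C maximizing p(O|C) * p(C).

    o               -- observation vector (e.g. one feature frame)
    acoustic_models -- dict: class -> (mean, var) of a diagonal Gaussian
    priors          -- dict: class -> prior probability p(C)
    """
    best_class, best_score = None, -np.inf
    for c, (mean, var) in acoustic_models.items():
        # log p(o|C) for a diagonal-covariance Gaussian
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (o - mean) ** 2 / var
        )
        # p(O) is omitted: it is the same for every class C
        score = log_likelihood + np.log(priors[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```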
5. "Recognition" = Classification Across Multiple Times
Find the state sequence Q to maximize the "recognition probability":
  P(O,Q) = p(q1) p(o1|q1) p(q2|q1) p(o2|q2) …
5. Dynamic Programming Beam Search
1. Find the N best states at time t=1: maximize p(q1) p(o1|q1)
2. Find the N best states at time t=2: maximize p(q1) p(o1|q1) p(q2|q1) p(o2|q2)
3. Find the N best states at time t=3: maximize p(q1) p(o1|q1) p(q2|q1) p(o2|q2) p(q3|q2) p(o3|q3)
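A minimal sketch of this beam search over HMM states, assuming log-domain probabilities passed in as arrays; the argument names and toy interface are assumptions, not the notation of any particular toolkit:

```python
import numpy as np

def beam_search(log_init, log_trans, log_obs, beam_width):
    """Keep the N best partial state paths at each time step.

    log_init  -- log p(q1), shape (S,)
    log_trans -- log p(q_t | q_{t-1}), shape (S, S)
    log_obs   -- log p(o_t | q_t), shape (T, S)
    Returns the best state sequence found within the beam.
    """
    T, S = log_obs.shape
    # Each hypothesis is (log probability of the path, state path so far).
    hyps = [(log_init[s] + log_obs[0, s], [s]) for s in range(S)]
    hyps = sorted(hyps, reverse=True)[:beam_width]
    for t in range(1, T):
        extended = [
            (lp + log_trans[path[-1], s] + log_obs[t, s], path + [s])
            for lp, path in hyps
            for s in range(S)
        ]
        hyps = sorted(extended, reverse=True)[:beam_width]
    return hyps[0][1]
```

With beam_width equal to the number of states this reduces to exhaustive search over extensions of the surviving paths; a narrow beam trades optimality for speed, which is the point of the pruning described on the slide.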
5. Combining Words: Stack Search
for t = 1:T
  for t0 = 1:t
    • Find words w(t0,t) such that p(o(t0), …, o(t) | w(t0,t)) > thresh1
    • Create all possible word strings W(1,t) = [ W(1,t0-1), w(t0,t) ]
    • Prune: eliminate W(1,t) if p(W(1,t)) < thresh2
5. Stack Search with Prosodic Model
p(W(1,t)) = product of:
  • Acoustic probability: p(o(t0), o(t0+1), …, o(t) | w(t0,t))
  • Syntactic/semantic probability: p( w(t0,t) | W(1,t0-1) )
  • Prosodic probability: p( w(t0,t) | W(1,t0-1), PROSODY )
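A minimal sketch of one stack-search extension step with this three-way score, assuming log-domain probabilities and hypothetical scoring callables (acoustic_logprob, language_logprob, prosody_logprob) supplied elsewhere:

```python
def extend_hypotheses(stack, observations, t,
                      acoustic_logprob, language_logprob, prosody_logprob,
                      thresh1, thresh2):
    """Extend each partial word string W(1, t0-1) on the stack with a
    candidate word w(t0, t), scoring the extension as the sum of
    acoustic, syntactic/semantic, and prosodic log probabilities."""
    extended = []
    for t0, words, score in stack:                 # words = W(1, t0-1)
        # acoustic_logprob yields (word, log p(o(t0..t) | word)) pairs
        for w, ac in acoustic_logprob(observations, t0, t):
            if ac < thresh1:                       # acoustic pruning
                continue
            total = (score + ac
                     + language_logprob(w, words)
                     + prosody_logprob(w, words, observations, t0, t))
            if total >= thresh2:                   # stack pruning
                extended.append((t + 1, words + [w], total))
    return extended
```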
Conclusions
• The Front End mimics features of auditory processing (examples: mel-scale spectrum, perceptual LPC).
• The Classifier uses statistical methods (e.g. mixture Gaussian models).
• The Recognizer combines classifier probabilities with language information (e.g. dynamic programming, stack search).
Toward More Flexible Recognition: Composite Acoustic Cues
Types of Measurement Error
• Small Errors: Spectral Perturbation
• Large Errors: Pick the Wrong Peak
[Figure: example spectrum, amplitude (dB) vs. frequency (Hz)]
Large Errors are 20% of Total
• Std. dev. of small errors: 45-72 Hz
• Std. dev. of large errors: 218-1330 Hz
• P(large error): 0.17-0.22
[Figure: log PDF of measurement error (Hz), relative to manual transcriptions]
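A minimal sketch of one way to encode these statistics: model the formant-measurement error as a two-component Gaussian mixture (narrow component for spectral perturbations, broad component for wrong-peak errors). The specific parameter values below are illustrative picks from inside the quoted ranges, not values from the original study:

```python
import numpy as np

def error_log_pdf(e, p_large=0.2, sigma_small=60.0, sigma_large=700.0):
    """Log PDF of measurement error e (Hz): a mixture of a narrow Gaussian
    (small spectral perturbations) and a broad Gaussian (wrong-peak errors)."""
    def gauss(x, sigma):
        return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log((1 - p_large) * gauss(e, sigma_small)
                  + p_large * gauss(e, sigma_large))
```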
Solution: Composite Cues as State Variables
A Posteriori Measurement Distributions: 10 ms After /d/ in "dark"
[Figure panels: DFT amplitude, DFT convexity, and P(F | O, Q) vs. frequency (0-4000 Hz)]
II. What Can People Do That Machines Can't Do?
• Two voices at once (the TV is on --- why can't I talk to my toaster?)
• Reverberation (do I need to put padding on all of the walls?)
II. Example 2: Reverberation
• Recorded speech equals input(t - delay1) + input(t - delay2)
• The delays are longer than a vowel, so two different vowels get mixed together.
• Result: just like two different speakers!
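A minimal sketch of this two-path reverberation model on a sampled signal; the sample rate and the 150 ms second-path delay in the usage comment are assumptions chosen to be longer than a typical vowel:

```python
import numpy as np

def reverberate(x, delay1, delay2, fs=16000):
    """Simulate recorded speech as input(t - delay1) + input(t - delay2).
    x is a 1-D array of samples at rate fs; delays are in seconds."""
    d1, d2 = int(delay1 * fs), int(delay2 * fs)
    y = np.zeros(len(x) + max(d1, d2))
    y[d1:d1 + len(x)] += x          # first acoustic path
    y[d2:d2 + len(x)] += x          # second, later path
    return y

# Example: direct sound plus a reflection arriving 150 ms later,
# so samples from two different vowels overlap in the recording.
# y = reverberate(x, 0.0, 0.150)
```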
II. Example 2: Reverberation
The Only Way to Totally Avoid Reverberation:
IV. Response Generation
Database response: 12 flights
Priority ranking of information:
  1. Destination city
  2. Origin city
  3. Date
  4. Price
  …
Response generation: "There are 12 flights tomorrow morning from Champaign to San Francisco. What price range would you like to consider?"
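A minimal sketch of priority-driven response generation in this style: summarize the database result, then ask about the highest-priority slot the user has not yet filled. The slot names, template wording, and example values simply mirror the slide and are not from any particular dialogue system:

```python
PRIORITY = ["destination", "origin", "date", "price"]

def generate_response(db_result_count, known_slots):
    """State what the database returned, then ask about the
    highest-priority slot the user has not yet specified."""
    summary = f"There are {db_result_count} flights"
    if "date" in known_slots:
        summary += f" {known_slots['date']}"
    if "origin" in known_slots:
        summary += f" from {known_slots['origin']}"
    if "destination" in known_slots:
        summary += f" to {known_slots['destination']}"
    missing = [s for s in PRIORITY if s not in known_slots]
    question = f" What {missing[0]} range would you like to consider?" if missing else ""
    return summary + "." + question

# generate_response(12, {"destination": "San Francisco",
#                        "origin": "Champaign", "date": "tomorrow morning"})
# -> "There are 12 flights tomorrow morning from Champaign to San Francisco.
#     What price range would you like to consider?"
```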