Explore how people and machines process speech, including mechanical filtering, prosody, front-end processing, and recognition techniques such as hidden Markov models. Understand the complexity of speech perception and automatic speech recognition through the lens of human and machine capabilities.
Modular Processing in Human Speech Perception and Automatic Speech Recognition
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
Outline
I. How do People Process Speech?
  1. "Front End"
  2. "Classifier"
II. How do Machines Process Speech?
  3. Front End (e.g. MFCC, PLP)
  4. Classifier (e.g. Mixture Gaussian)
  5. Recognizer (e.g. DP, Stack Search)
2. Example of Pattern Matching: Prosody (Stress and Rhythm)
Every content word in English has a stressed syllable. Stress is part of the dictionary entry of the word, e.g. "the large congressional building."
Many words follow the rule (see the sketch below):
  Fewer than three syllables: the first syllable is stressed.
  Three or more syllables: the antepenultimate syllable is stressed.
There are many special cases.
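A minimal sketch of this default stress-placement rule, purely for illustration; real dictionary entries override it in the many lexical exceptions the slide mentions:

```python
def default_stress_index(num_syllables):
    """Default English stress position for a content word:
    fewer than three syllables -> first syllable,
    three or more -> antepenultimate (third from last).
    Many words are lexical exceptions to this rule."""
    if num_syllables < 3:
        return 0                      # first syllable
    return num_syllables - 3          # antepenultimate syllable

# default_stress_index(4) == 1, e.g. con-GRESS-ion-al
```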
2. How do People Use Prosody?
1. People use prosody to organize the listening experience.
  Infants use stress to learn new vocabulary.
  Adults recognize speech despite negative SNR partly by listening to stress patterns.
2. Prosody affects the signal in a probabilistic way.
  Fundamental frequency (F0) may be affected.
  Duration may be affected.
  Energy may be affected.
  … or none of the above may be affected.
2. How Would a Machine Use Prosody?
Goals:
  Disambiguate sentences with similar phonemic content.
  Create speech recognition algorithms which will fail less often in noisy environments.
Example:
  "The nurse brought a big Ernie doll."
  "The nurse brought a bigger needle."
4. Classifier: Statistical Classification
Classification: choose the "most probable" C
  C = argmax p(C|O)
    = argmax p(O|C) p(C) / p(O)
    = argmax p(O|C) p(C)
p(C) --- the "language model"
p(O|C) --- the "acoustic model"
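A minimal sketch of this decision rule, assuming (as a simplification of the mixture Gaussian models mentioned in the outline) one diagonal-covariance Gaussian acoustic model per class and priors supplied by a language model; the parameter layout is hypothetical:

```python
import numpy as np

def classify(o, acoustic_models, priors):
    """Bayes decision rule: choose the class C maximizing p(O|C) * p(C).

    o               -- observation vector (e.g. one feature frame)
    acoustic_models -- dict: class -> (mean, var) of a diagonal Gaussian
    priors          -- dict: class -> prior probability p(C)
    """
    best_class, best_score = None, -np.inf
    for c, (mean, var) in acoustic_models.items():
        # log p(o|C) for a diagonal-covariance Gaussian
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (o - mean) ** 2 / var
        )
        # p(O) is omitted: it is the same for every class C
        score = log_likelihood + np.log(priors[c])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```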
5. "Recognition" = Classification Across Multiple Times
Find the state sequence Q to maximize the "recognition probability":
  P(O,Q) = p(q1) p(o1|q1) p(q2|q1) p(o2|q2) …
5. Dynamic Programming Beam Search
1. Find the N best states at time t=1: maximize p(q1) p(o1|q1)
2. Find the N best states at time t=2: maximize p(q1) p(o1|q1) p(q2|q1) p(o2|q2)
3. Find the N best states at time t=3: maximize p(q1) p(o1|q1) p(q2|q1) p(o2|q2) p(q3|q2) p(o3|q3)
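A minimal sketch of this beam search over HMM states, assuming log-domain probabilities passed in as arrays; the argument names and toy interface are assumptions, not the notation of any particular toolkit:

```python
import numpy as np

def beam_search(log_init, log_trans, log_obs, beam_width):
    """Keep the N best partial state paths at each time step.

    log_init  -- log p(q1), shape (S,)
    log_trans -- log p(q_t | q_{t-1}), shape (S, S)
    log_obs   -- log p(o_t | q_t), shape (T, S)
    Returns the best state sequence found within the beam.
    """
    T, S = log_obs.shape
    # Each hypothesis is (log probability of the path, state path so far).
    hyps = [(log_init[s] + log_obs[0, s], [s]) for s in range(S)]
    hyps = sorted(hyps, reverse=True)[:beam_width]
    for t in range(1, T):
        extended = [
            (lp + log_trans[path[-1], s] + log_obs[t, s], path + [s])
            for lp, path in hyps
            for s in range(S)
        ]
        hyps = sorted(extended, reverse=True)[:beam_width]
    return hyps[0][1]
```

With beam_width equal to the number of states this reduces to exhaustive search over extensions of the surviving paths; a narrow beam trades optimality for speed, which is the point of the pruning described on the slide.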
5. Combining Words: Stack Search
for t = 1:T
  for t0 = 1:t
    • Find words w(t0,t) such that p(o(t0), …, o(t) | w(t0,t)) > thresh1
    • Create all possible word strings W(1,t) = [ W(1,t0-1), w(t0,t) ]
    • Prune: eliminate W(1,t) if p(W(1,t)) < thresh2
5. Stack Search with Prosodic Model
p(W(1,t)) = product of:
  • Acoustic probability: p(o(t0), o(t0+1), …, o(t) | w(t0,t))
  • Syntactic/semantic probability: p( w(t0,t) | W(1,t0-1) )
  • Prosodic probability: p( w(t0,t) | W(1,t0-1), PROSODY )
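A minimal sketch of one stack-search extension step with this three-way score, assuming log-domain probabilities and hypothetical scoring callables (acoustic_logprob, language_logprob, prosody_logprob) supplied elsewhere:

```python
def extend_hypotheses(stack, observations, t,
                      acoustic_logprob, language_logprob, prosody_logprob,
                      thresh1, thresh2):
    """Extend each partial word string W(1, t0-1) on the stack with a
    candidate word w(t0, t), scoring the extension as the sum of
    acoustic, syntactic/semantic, and prosodic log probabilities."""
    extended = []
    for t0, words, score in stack:                 # words = W(1, t0-1)
        # acoustic_logprob yields (word, log p(o(t0..t) | word)) pairs
        for w, ac in acoustic_logprob(observations, t0, t):
            if ac < thresh1:                       # acoustic pruning
                continue
            total = (score + ac
                     + language_logprob(w, words)
                     + prosody_logprob(w, words, observations, t0, t))
            if total >= thresh2:                   # stack pruning
                extended.append((t + 1, words + [w], total))
    return extended
```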
Conclusions
• The Front End mimics features of auditory processing (examples: mel-scale spectrum, perceptual LPC).
• The Classifier uses statistical methods (e.g. mixture Gaussian models).
• The Recognizer combines classifier probabilities with language information (e.g. dynamic programming, stack search).
Toward More Flexible Recognition: Composite Acoustic Cues
Types of Measurement Error
• Small Errors: Spectral Perturbation
• Large Errors: Pick the Wrong Peak
[Figure: example spectrum, amplitude (dB) vs. frequency (Hz)]
Large Errors are 20% of Total
• Std. dev. of small errors: 45-72 Hz
• Std. dev. of large errors: 218-1330 Hz
• P(large error): 0.17-0.22
[Figure: log PDF of measurement error (Hz), relative to manual transcriptions]
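A minimal sketch of one way to encode these statistics: model the formant-measurement error as a two-component Gaussian mixture (narrow component for spectral perturbations, broad component for wrong-peak errors). The specific parameter values below are illustrative picks from inside the quoted ranges, not values from the original study:

```python
import numpy as np

def error_log_pdf(e, p_large=0.2, sigma_small=60.0, sigma_large=700.0):
    """Log PDF of measurement error e (Hz): a mixture of a narrow Gaussian
    (small spectral perturbations) and a broad Gaussian (wrong-peak errors)."""
    def gauss(x, sigma):
        return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log((1 - p_large) * gauss(e, sigma_small)
                  + p_large * gauss(e, sigma_large))
```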
Solution: Composite Cues as State Variables
A Posteriori Measurement Distributions: 10 ms After /d/ in "dark"
[Figure panels: DFT amplitude, DFT convexity, and P(F | O, Q) vs. frequency (0-4000 Hz)]
II. What Can People Do That Machines Can't Do?
• Two voices at once (the TV is on --- why can't I talk to my toaster?)
• Reverberation (do I need to put padding on all of the walls?)
II. Example 2: Reverberation
• Recorded speech equals input(t - delay1) + input(t - delay2)
• The delays are longer than a vowel, so two different vowels get mixed together.
• Result: just like two different speakers!
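A minimal sketch of this two-path reverberation model on a sampled signal; the sample rate and the 150 ms second-path delay in the usage comment are assumptions chosen to be longer than a typical vowel:

```python
import numpy as np

def reverberate(x, delay1, delay2, fs=16000):
    """Simulate recorded speech as input(t - delay1) + input(t - delay2).
    x is a 1-D array of samples at rate fs; delays are in seconds."""
    d1, d2 = int(delay1 * fs), int(delay2 * fs)
    y = np.zeros(len(x) + max(d1, d2))
    y[d1:d1 + len(x)] += x          # first acoustic path
    y[d2:d2 + len(x)] += x          # second, later path
    return y

# Example: direct sound plus a reflection arriving 150 ms later,
# so samples from two different vowels overlap in the recording.
# y = reverberate(x, 0.0, 0.150)
```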
II. Example 2: Reverberation
The Only Way to Totally Avoid Reverberation:
IV. Response Generation
Database response: 12 flights
Priority ranking of information:
  1. Destination city
  2. Origin city
  3. Date
  4. Price
  …
Response generation: "There are 12 flights tomorrow morning from Champaign to San Francisco. What price range would you like to consider?"
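A minimal sketch of priority-driven response generation in this style: summarize the database result, then ask about the highest-priority slot the user has not yet filled. The slot names, template wording, and example values simply mirror the slide and are not from any particular dialogue system:

```python
PRIORITY = ["destination", "origin", "date", "price"]

def generate_response(db_result_count, known_slots):
    """State what the database returned, then ask about the
    highest-priority slot the user has not yet specified."""
    summary = f"There are {db_result_count} flights"
    if "date" in known_slots:
        summary += f" {known_slots['date']}"
    if "origin" in known_slots:
        summary += f" from {known_slots['origin']}"
    if "destination" in known_slots:
        summary += f" to {known_slots['destination']}"
    missing = [s for s in PRIORITY if s not in known_slots]
    question = f" What {missing[0]} range would you like to consider?" if missing else ""
    return summary + "." + question

# generate_response(12, {"destination": "San Francisco",
#                        "origin": "Champaign", "date": "tomorrow morning"})
# -> "There are 12 flights tomorrow morning from Champaign to San Francisco.
#     What price range would you like to consider?"
```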