
Modular Processing in Human and Machine Speech Perception

Explore how people and machines process speech, including mechanical filtering, prosody, front-end processing, and recognition techniques such as hidden Markov models. Understand the complexity of speech perception and automatic speech recognition through the lens of human and machine capabilities.





Presentation Transcript


  1. Modular Processing in Human Speech Perception and Automatic Speech Recognition
     Mark Hasegawa-Johnson (jhasegaw@uiuc.edu)

  2. Outline
     I. How do People Process Speech?
        1. “Front End”
        2. “Classifier”
     II. How do Machines Process Speech?
        3. Front End (e.g. MFCC, PLP)
        4. Classifier (e.g. mixture Gaussian)
        5. Recognizer (e.g. DP, stack search)

  3. 1. How Do People Process Speech?

  4. 1. Mechanical Filtering

  5. 1. Mechanical Pseudo-Fourier Transform

  6. 1. Rectify, Low-Pass Filter, and Adaptive Gain Control

  7. 2. Pattern Matcher, Response Generator

  8. 2. Example of Pattern Matching: Prosody (Stress and Rhythm)
     Every content word in English has a stressed syllable.
     Stress is part of the dictionary entry of the word, e.g. “the large congressional building.”
     Many words follow the rule:
        Fewer than three syllables: the first syllable is stressed.
        More than three: the antepenultimate syllable is stressed.
     There are many special cases.
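The default rule on the slide can be sketched as a tiny function. This is a minimal illustration of the heuristic only (real lexicons store stress per word, and the slide notes there are many special cases); the function name and 1-based indexing are my own choices.

```python
def stressed_syllable(num_syllables):
    """Default English stress rule from the slide (ignores the many exceptions).

    Returns the 1-based index of the stressed syllable.
    """
    if num_syllables < 3:
        return 1                  # fewer than three syllables: stress the first
    return num_syllables - 2      # otherwise: the antepenultimate syllable
```

For example, a two-syllable word gets stress on syllable 1, and a five-syllable word on syllable 3 (the antepenultimate).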

  9. 2. How Do People Use Prosody?
     1. People use prosody to organize the listening experience.
        Infants use stress to learn new vocabulary.
        Adults recognize speech despite negative SNR partly by listening to stress patterns.
     2. Prosody affects the signal in a probabilistic way.
        Fundamental frequency (F0) may be affected.
        Duration may be affected.
        Energy may be affected.
        … or none of the above may be affected.

  10. 2. Prosody: Meaning, Perception, and Acoustics

  11. 2. How Would a Machine Use Prosody?
      Goals:
         Disambiguate sentences with similar phonemic content.
         Create speech recognition algorithms that fail less often in noisy environments.
      Example:
         “The nurse brought a big Ernie doll.”
         “The nurse brought a bigger needle.”

  12. 3. How do Machines Recognize Speech?

  13. 3. Front End: Auditory Frequency Scaling

  14. 3. Front End: Mel-Scale Spectrum
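The mel scale underlying the front end warps linear frequency toward auditory resolution. A minimal sketch using the standard conversion formula (the function names are my own):

```python
import math

def hz_to_mel(f_hz):
    # Standard mel-scale formula: roughly linear below ~1 kHz, logarithmic above
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used to place mel-spaced filterbank edges in Hertz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to approximately 1000 mel, and equal mel steps above that cover progressively wider Hertz bands, mimicking the auditory frequency scaling of the previous slide.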

  15. 3. Front End: MFCC, PLP

  16. 4. Classifier: Statistical Classification
      Classification: choose the “most probable” C.
         C = argmax p(C|O) = argmax p(O|C) p(C) / p(O) = argmax p(O|C) p(C)
      p(C) is the “language model”; p(O|C) is the “acoustic model.”
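Because p(O) is the same for every class, the argmax needs only the acoustic and language-model terms, usually combined in the log domain. A minimal sketch with hypothetical numbers of my own:

```python
import math

def classify(log_acoustic, log_prior):
    # argmax_C p(O|C) p(C); the evidence p(O) is constant across C, so it drops out
    return max(log_acoustic, key=lambda c: log_acoustic[c] + log_prior[c])

# Toy two-class example (hypothetical values): a weaker acoustic match can still
# win if the language-model prior p(C) favors it strongly enough.
log_acoustic = {"yes": -10.0, "no": -12.0}               # log p(O|C)
log_prior = {"yes": math.log(0.3), "no": math.log(0.7)}  # log p(C)
```

Here "yes" wins: its two-nat acoustic advantage outweighs the prior's preference for "no."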

  17. 4. Classifier: Mixture Gaussian Model
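The mixture Gaussian acoustic model scores an observation as a weighted sum of Gaussian densities. A minimal scalar sketch (function names mine; real systems use multivariate Gaussians over feature vectors):

```python
import math

def log_gauss(x, mean, var):
    # log N(x; mean, var) for a scalar observation
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_gmm(x, weights, means, variances):
    # log sum_k w_k N(x; mu_k, var_k), computed with log-sum-exp for stability
    terms = [math.log(w) + log_gauss(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    mx = max(terms)
    return mx + math.log(sum(math.exp(t - mx) for t in terms))
```

A single-component mixture with weight 1 reduces exactly to a plain Gaussian, which makes the function easy to sanity-check.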

  18. 5. “Recognition” = Classification across Multiple Times
      Find the state sequence Q that maximizes the “recognition probability”:
         P(O,Q) = p(q1) p(o1|q1) p(q2|q1) p(o2|q2) …
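For a given state sequence Q, the product above can be evaluated directly. A minimal sketch with a discrete toy HMM (all model names and numbers below are illustrative, not from the slides; real decoders work in log probabilities to avoid underflow):

```python
def joint_prob(obs, states, init, trans, emit):
    """P(O,Q) = p(q1) p(o1|q1) * prod over t of p(qt|q(t-1)) p(ot|qt)."""
    p = init[states[0]] * emit[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
    return p

# Toy two-state HMM (hypothetical parameters)
init = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {0: 0.9, 1: 0.1}, "B": {0: 0.2, 1: 0.8}}
```

For example, joint_prob([0, 1], ["A", "B"], ...) multiplies 0.6 * 0.9 * 0.3 * 0.8.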

  19. 5. Recognition: Hidden Markov Models

  20. 5. HMM Phone Models

  21. 5. HMM Word Models

  22. 5. HMM Sentence Models

  23. 5. Dynamic Programming Beam Search
      1. Find the N best states at time t=1: maximize p(q1) p(o1|q1).
      2. Find the N best states at time t=2: maximize p(q1) p(o1|q1) p(q2|q1) p(o2|q2).
      3. Find the N best states at time t=3: maximize p(q1) p(o1|q1) p(q2|q1) p(o2|q2) p(q3|q2) p(o3|q3).
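The three steps above can be sketched as a loop that extends every surviving hypothesis by one state and then prunes back to the N best. This is a toy path-level illustration (parameters hypothetical; production decoders keep one hypothesis per state and use log probabilities):

```python
def beam_search(obs, states, init, trans, emit, beam=2):
    """Keep only the `beam` best partial state paths at each time step."""
    hyps = {(s,): init[s] * emit[s][obs[0]] for s in states}
    hyps = dict(sorted(hyps.items(), key=lambda kv: -kv[1])[:beam])
    for o in obs[1:]:
        new = {}
        for path, p in hyps.items():
            for s in states:
                # extend the path: multiply in transition and emission probabilities
                new[path + (s,)] = p * trans[path[-1]][s] * emit[s][o]
        hyps = dict(sorted(new.items(), key=lambda kv: -kv[1])[:beam])
    return max(hyps.items(), key=lambda kv: kv[1])

# Toy two-state HMM (hypothetical parameters)
init = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {0: 0.9, 1: 0.1}, "B": {0: 0.2, 1: 0.8}}
```

With a beam as wide as the state space, this toy search is exhaustive; narrowing the beam trades accuracy for speed, which is the point of the algorithm.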

  24. 5. Combining Words: Stack Search
      for t = 1:T, for t0 = 1:t,
         Find words w(t0,t) such that p(o(t0), …, o(t) | w(t0,t)) > thresh1.
         Create all possible word strings W(1,t) = [ W(1,t0-1), w(t0,t) ].
         Prune: eliminate W(1,t) if p(W(1,t)) < thresh2.
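The double loop on the slide can be made concrete with toy stand-ins for the acoustic and language-model scores. This is a sketch of the control flow only, not a real decoder: `word_score` and `lm_score` are hypothetical placeholders, and plain probabilities stand in for log probabilities.

```python
def stack_search(obs, vocab, word_score, lm_score, thresh1, thresh2):
    """Sketch of the stack search loop above.

    word_score(w, segment) -> p(o(t0..t) | w)   (toy stand-in for the acoustic model)
    lm_score(w, history)   -> p(w | history)    (toy stand-in for the language model)
    """
    stacks = {0: [((), 1.0)]}          # word-string hypotheses covering obs[:t]
    T = len(obs)
    for t in range(1, T + 1):
        stacks[t] = []
        for t0 in range(t):
            for w in vocab:
                pa = word_score(w, obs[t0:t])
                if pa <= thresh1:      # acoustic pruning (thresh1)
                    continue
                for words, p in stacks[t0]:
                    pw = p * pa * lm_score(w, words)
                    if pw > thresh2:   # word-string pruning (thresh2)
                        stacks[t].append((words + (w,), pw))
    return max(stacks[T], key=lambda h: h[1]) if stacks[T] else None
```

With an exact-match toy scorer over the symbols "abab" and a vocabulary containing both "ab" and "abab", the search prefers the single longer word because each extra word costs a language-model factor.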

  25. 5. Stack Search with Prosodic Model
      p(W(1,t)) is the product of:
         Acoustic probability: p(o(t0), o(t0+1), …, o(t) | w(t0,t))
         Syntactic/semantic probability: p( w(t0,t) | W(1,t0-1) )
         Prosodic probability: p( w(t0,t) | W(1,t0-1), PROSODY )

  26. Conclusions
      Front End mimics features of auditory processing (examples: mel-scale spectrum, perceptual LPC).
      Classifier uses statistical methods (e.g. mixture Gaussian model).
      Recognizer combines classifier probabilities and language information (e.g. dynamic programming, stack search).

  27. Toward More Flexible Recognition: Composite Acoustic Cues

  28. Types of Measurement Error
      Small errors: spectral perturbation.
      Large errors: pick the wrong peak.
      [Figure: spectrum, amplitude (dB) vs. frequency (Hertz)]

  29. Large Errors are 20% of Total
      Std dev of small errors: 45-72 Hz.
      Std dev of large errors: 218-1330 Hz.
      P(large error) = 0.17-0.22.
      [Figure: log PDF of measurement error (Hertz), relative to manual transcriptions]

  30. Solution: Composite Cues as State Variables

  31. Complexity of Solution Without Additional Constraints

  32. Useful Constraint #1: State Independence

  33. Useful Constraint #2: Hierarchical Dependence

  34. Test System Results

  35. A Posteriori Measurement Distributions: 10 ms After /d/ in “dark”
      [Figure: DFT amplitude, DFT convexity, and P(F | O, Q) vs. frequency (0-4000 Hertz)]

  36. II. What Can People Do That Machines Can’t Do?
      - Two voices at once (the TV is on --- why can’t I talk to my toaster?)
      - Reverberation (do I need to put padding on all of the walls?)

  37. II. Example 1: Two Voices at Once

  38. II. Example 1: Two Voices at Once

  39. II. Example 2: Reverberation
      - Recorded speech equals input(t - delay1) + input(t - delay2).
      - The delays are longer than a vowel, so two different vowels get mixed together.
      - Result: just like two different speakers!
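The two-delay model above is easy to simulate: sum two shifted copies of the input so that samples from different sounds land on top of each other. A minimal sketch (function name and sample values are my own):

```python
def reverberate(signal, delay1, delay2):
    """Toy reverberation: input(t - delay1) + input(t - delay2), in samples."""
    out = [0.0] * (len(signal) + max(delay1, delay2))
    for t, x in enumerate(signal):
        out[t + delay1] += x   # direct (or first-reflection) copy
        out[t + delay2] += x   # later reflection, overlapping earlier samples
    return out
```

When delay2 exceeds the length of one speech sound, the delayed copy of an earlier vowel overlaps the direct copy of a later one, which is exactly the "two different speakers" effect on the slide.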

  40. II. Example 2: Reverberation
      The only way to totally avoid reverberation:

  41. IV. Semantic Parsing

  42. IV. Response Generation
      Database response: 12 flights.
      Priority ranking of information:
         1. Destination city
         2. Origin city
         3. Date
         4. Price
         …
      Response generation: “There are 12 flights tomorrow morning from Champaign to San Francisco. What price range would you like to consider?”
