
Speech Recognition as a Pattern Matching Problem



1. Speech Recognition as a Pattern Matching Problem
• Input waveform = X
• Each allophone Y modeled by parameters L(Y)
• Acoustic model p(X|Y) modeled by a parameterized function f(X, L(Y))
• Language model p(Y1,…,YN, W1,…,WM) = probability of the word and allophone sequence, modeled using very big lookup tables
• Recognized word string (a brute-force sketch follows below):
  (W1,…,WM) = argmax_W Σ_Y p(X | Y1,…,YN) p(Y1,…,YN, W1,…,WM)
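
To make the decoding rule concrete, here is a minimal brute-force sketch in Python. The names `candidates`, `acoustic_score`, and `language_model` are hypothetical stand-ins for the search space, f(X, L), and the lookup-table language model; a real recognizer replaces this enumeration with Viterbi or beam search.

```python
# A minimal brute-force sketch of the decoding rule above.  All three
# arguments are hypothetical hooks, not names from the slides.

def recognize(X, candidates, acoustic_score, language_model):
    """Return argmax_W sum_Y p(X | Y) * p(Y, W).

    candidates: iterable of (W, allophone_sequences) pairs, where
    allophone_sequences lists every Y = (Y1,...,YN) consistent with
    the word string W = (W1,...,WM).
    """
    best_W, best_score = None, float("-inf")
    for W, allophone_sequences in candidates:
        # Marginalize the acoustic score over allophone sequences Y.
        score = sum(acoustic_score(X, Y) * language_model(Y, W)
                    for Y in allophone_sequences)
        if score > best_score:
            best_W, best_score = W, score
    return best_W
```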

2. The Problems of Speech Recognition
• What allophones should be distinguished?
  • Minimum: ~50 phonemes, including schwa
  • Left and right neighboring phonemes?
  • Unreleased vs. released stops?
  • Function word vs. content word?
  • Lexical stress, onset vs. coda, talker gender?
• What acoustic features X?
  • Spectrum once per 10 ms; pitch discarded
• What is the acoustic model f(X, L)?
  • CDHMM (continuous-density hidden Markov model); a sketch of its observation model follows below
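
The slides name the CDHMM but not its mixture structure; the sketch below assumes the common choice of a diagonal-covariance Gaussian mixture per state, and shows how log p(x | state) would be computed for one observation frame.

```python
# A sketch of the CDHMM observation likelihood, assuming diagonal-
# covariance Gaussian mixtures per state (an assumption, not a detail
# given on the slides).

import numpy as np

def log_gauss_diag(x, mean, var):
    """log N(x; mean, diag(var)) for one diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_obs_prob(x, weights, means, variances):
    """log p(x | state) for a K-component mixture.

    weights: (K,); means, variances: (K, D) for D-dimensional frames x.
    """
    log_components = [np.log(w) + log_gauss_diag(x, m, v)
                      for w, m, v in zip(weights, means, variances)]
    return np.logaddexp.reduce(log_components)  # log-sum-exp over components
```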

3. Prosody-Dependent Allophones
• 100 monophones (incl. schwa, unreleased vs. released stops, function vs. content)
• Split based on prosodic context: 200-600 prosody-dependent monophones (see the counting sketch below)
• Split based on left and right phonemes: 300-6000 prosody-dependent triphones
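
As a sanity check on these counts, a toy sketch of the splitting: crossing each monophone with the binary accent distinction and three phrase positions gives 100 × 2 × 3 = 600 labels, the upper end of the 200-600 range. The `ph+acc+pos` label convention is invented here for illustration.

```python
# A toy sketch, for illustration only; the label scheme is hypothetical.

def expand_allophones(monophones,
                      accents=("accented", "unaccented"),
                      positions=("initial", "medial", "final")):
    """Cross monophones with prosodic contexts; a real system would keep
    only the splits that the likelihood tests (slides 5-8) justify."""
    return [f"{ph}+{acc}+{pos}"
            for ph in monophones
            for acc in accents
            for pos in positions]

# 100 monophones x 2 accents x 3 positions = 600 labels, matching the
# upper end of the 200-600 range quoted on the slide.
print(len(expand_allophones([f"ph{i:02d}" for i in range(100)])))  # 600
```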

4. Prosodic Contexts that Might Matter
• Accented vs. unaccented
  • If the word has a pitch accent, phones in its primary-stress syllable are "accented"
• Phrase-initial vs. phrase-medial
  • If the word is phrase-initial, phones in the onset and nucleus of its first syllable are "phrase-initial"
• Phrase-final vs. phrase-medial
  • If the word is phrase-final, phones in the nucleus and coda of its last syllable are "phrase-final"
• How many levels of "phrase" should we model? How many levels of "accent"?
  • The Boston Radio News database has only enough data for binary distinctions: IP (intonational phrase) vs. non-IP, accent vs. non-accent.

5. Which Prosodic Contexts Matter? Method
• Train L(Y) to maximize log p(X(train,Y) | Y)
• Measure log p(X(test,Y) | Y)
• For an accent-dependent allophone Y, does phrase position matter? Compare (a sketch of this test follows below):
  log p(X(test,Y) | Y)  ?<  (1/3) [ log p(X(test,Y_initial) | Y_initial) + log p(X(test,Y_medial) | Y_medial) + log p(X(test,Y_final) | Y_final) ]
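
A hedged sketch of this test: `train_model` and `test_log_likelihood` are hypothetical stand-ins for the HMM training and scoring routines, and, following the slide's (1/3) factor, per-position scores are averaged before the comparison.

```python
# Does splitting allophone Y by phrase position improve held-out
# log-likelihood?  Both hooks below are hypothetical.

def avg_ll(model, data, test_log_likelihood):
    """Average held-out log-likelihood of `data` under `model`."""
    return sum(test_log_likelihood(model, x) for x in data) / len(data)

def phrase_position_matters(train, test, train_model, test_log_likelihood):
    """train/test: dicts mapping 'initial' | 'medial' | 'final' to lists
    of observation sequences for one accent-dependent allophone Y."""
    positions = ("initial", "medial", "final")

    # Pooled model: one L(Y) trained on all phrase positions together.
    pooled = train_model([x for p in positions for x in train[p]])
    pooled_score = avg_ll(pooled, [x for p in positions for x in test[p]],
                          test_log_likelihood)

    # Split models: one L(Y_pos) per phrase position, scores averaged.
    split_score = sum(avg_ll(train_model(train[p]), test[p],
                             test_log_likelihood)
                      for p in positions) / 3.0

    return pooled_score < split_score  # True => position matters for Y
```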

6. Which Prosodic Contexts Matter? Vowel Results: Everything Matters
• Phrase-initial vowels that vary by accent: 7/12
  • aa, ae, ah, ao, ay, ih, iy
• Phrase-medial vowels that vary by accent: 13/15
  • all but uh, ax
• Phrase-final vowels that vary by accent: 6/8
  • all but uh, ao
• Accented vowels that vary by position: 12/14
  • all but uh, oy
• Unaccented vowels that vary by position: 10/14
  • all but uh, ey, ay, ao

7. Which Prosodic Contexts Matter? Syllable-Initial Consonants
• Phrase-initial onsets that vary by accent: 4/13
  • b, h, r, t
• Phrase-medial onsets that vary by accent: 20/21
  • all but z
• Accented onsets that vary by position: 3/14
  • s, r, f
• Unaccented onsets that vary by position: 18/21
  • all but y, g, ch

8. Which Prosodic Contexts Matter? Syllable-Final Consonants
• Phrase-medial codas that vary by accent: 17/19
  • all but sh, v
• Phrase-final codas that vary by accent: 5/15
  • d, f, r, v, z
• Accented codas that vary by position: 14/16
  • all but ch, d, g
• Unaccented codas that vary by position: 17/21
  • all but ch, g, p, v

9. Which Prosodic Contexts Matter? A Model of the Results
[Figure: schematic model of the results, with one panel for vowels and one for consonants]

10. Acoustic Features for Prosody-Dependent Speech Recognition
• Spectrum once per 10 ms (MFCC), dMFCC, ddMFCC
• Energy, dEnergy, ddEnergy
• Pitch (a sketch of this feature follows below):
  • Correct pitch-halving and pitch-doubling errors
  • Compute minF0 per utterance
  • f(t) = log(F0(t) / minF0)
  • A TDNN or TDRNN computes f*(t) = P( accent(t) | f(t-50ms), …, f(t+50ms) )
  • Use f*(t) as the "observation" for an HMM
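
A minimal sketch of the pitch feature, assuming an F0 track sampled once per frame with 0 marking unvoiced frames. The slides do not specify the halving/doubling correction algorithm, so the median-based heuristic below is only illustrative.

```python
# Pitch feature f(t) = log(F0(t) / minF0), with a heuristic octave-error
# correction (an assumption; the slides only say errors are corrected).

import numpy as np

def pitch_feature(f0):
    """f0: per-frame F0 in Hz, 0 for unvoiced frames.  Returns f(t)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    med = np.median(f0[voiced])
    fixed = f0.copy()
    # Values near half (double) the utterance median are treated as
    # pitch-halving (pitch-doubling) errors and folded back.
    fixed[voiced & (f0 < 0.6 * med)] *= 2.0
    fixed[voiced & (f0 > 1.8 * med)] /= 2.0
    min_f0 = fixed[voiced].min()          # per-utterance minF0
    f = np.zeros_like(fixed)              # unvoiced frames stay at 0
    f[voiced] = np.log(fixed[voiced] / min_f0)
    return f
```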

11. TDRNN with One Output Unit
[Figure: network diagram. Inputs F0 and P_V enter the input layer; delay units (D) and an internal state layer provide recurrence; two hidden layers feed a single output unit separating pitch-accented from pitch-unaccented frames.]
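
A schematic numpy rendering of the architecture in the figure, purely to fix ideas: tapped input delays (the D boxes), a recurrent internal-state layer, two hidden layers, and one sigmoid output unit. All layer sizes and the number of delay taps are invented; the real network's dimensions are not given on the slide.

```python
# Untrained forward pass of a TDRNN-like network; sizes are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TDRNN:
    def __init__(self, n_in=2, n_taps=5, n_state=8, n_h1=16, n_h2=8):
        d = n_in * n_taps + n_state           # delayed inputs + state feedback
        self.W_state = rng.normal(0, 0.1, (n_state, d))
        self.W_h1 = rng.normal(0, 0.1, (n_h1, n_state))
        self.W_h2 = rng.normal(0, 0.1, (n_h2, n_h1))
        self.w_out = rng.normal(0, 0.1, n_h2)
        self.n_taps, self.n_state = n_taps, n_state

    def forward(self, x_seq):
        """x_seq: (T, n_in) frames, e.g. columns (pitch feature, P_V).
        Returns (T,) posteriors P(accent | local context)."""
        T, _ = x_seq.shape
        state = np.zeros(self.n_state)
        out = np.zeros(T)
        for t in range(T):
            # Tapped delay line over the last n_taps input frames ("D").
            taps = [x_seq[max(t - k, 0)] for k in range(self.n_taps)]
            z = np.concatenate(taps + [state])
            state = np.tanh(self.W_state @ z)     # recurrent state layer
            h1 = np.tanh(self.W_h1 @ state)       # 1st hidden layer
            h2 = np.tanh(self.W_h2 @ h1)          # 2nd hidden layer
            out[t] = sigmoid(self.w_out @ h2)     # single output unit
        return out

# Example: posteriors for 100 frames of (pitch feature, P_V) pairs.
posteriors = TDRNN().forward(rng.normal(size=(100, 2)))
```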

12. Training the TDRNN to Recognize Pitch Accents

13. Acoustic Model f(X, L) for Prosody-Dependent Speech Recognition
• Normalized phoneme duration is highly correlated with phrase position
• But duration is not available before phoneme recognition!
• Solution: semi-Markov model (a.k.a. HMM with explicit duration distributions); see the dynamic-programming sketch below:
  p(x1,…,xT | Y1,…,YN) = Σ_{d1,…,dN} p(d1|Y1) ⋯ p(dN|YN) p(x1,…,x_{d1} | Y1) p(x_{d1+1},…,x_{d1+d2} | Y2) ⋯
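
A minimal sketch of this likelihood for a fixed allophone sequence Y1,…,YN, computed by dynamic programming over segment end times. The hooks `dur_logprob(n, d)` and `seg_loglik(n, s, e)`, standing for log p(d | Yn) and log p(x_{s+1},…,x_e | Yn), are hypothetical, as is the `max_dur` cap.

```python
# Forward algorithm for the semi-Markov likelihood above, in log space.

import numpy as np

def semi_markov_loglik(T, N, dur_logprob, seg_loglik, max_dur=50):
    """log p(x_1..x_T | Y_1..Y_N), summing over durations d_1..d_N."""
    NEG_INF = -np.inf
    # alpha[n, t] = log p(x_1..x_t, first n allophones end at frame t)
    alpha = np.full((N + 1, T + 1), NEG_INF)
    alpha[0, 0] = 0.0
    for n in range(1, N + 1):
        for t in range(n, T + 1):             # each segment needs >= 1 frame
            terms = [alpha[n - 1, t - d]
                     + dur_logprob(n, d)      # log p(d | Yn)
                     + seg_loglik(n, t - d, t)  # log p(x_{t-d+1}..x_t | Yn)
                     for d in range(1, min(max_dur, t) + 1)
                     if alpha[n - 1, t - d] > NEG_INF]
            if terms:
                alpha[n, t] = np.logaddexp.reduce(terms)
    return alpha[N, T]
```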

14. Example: Distributions of Duration, Phrase-Final vs. Phrase-Medial

15. Some Recognition Results

16. Work in Progress
• Confirm these experiments with a state-of-the-art phoneme set and acoustic features
• Improve pitch features; improve duration modeling
• Spontaneous speech database (Switchboard):
  • Syntactically parse the available word transcriptions
  • "Guess" prosody from syntax
  • Train recognition models
  • Iteratively improve the prosodic transcription?
• Study the relationship between prosody and disfluency
