Speech Recognition as a Pattern Matching Problem
• Input waveform = X
• Each allophone Y modeled by parameters Λ(Y)
• Acoustic Model p(X|Y) modeled by a parameterized function f(X,Λ)
• Language Model p(Y1,…,YN,W1,…,WM) = probability of the word & allophone sequence, modeled using very big lookup tables
• Recognized word string: (W1,…,WM) = argmax_W Σ_Y p(X|Y1,…,YN) p(Y1,…,YN,W1,…,WM)
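The argmax-sum in the last bullet can be made concrete with a toy example. A minimal sketch, assuming hypothetical lookup tables; the allophone strings and probabilities are invented for illustration, and a real recognizer searches the exponentially many sequences with dynamic programming rather than enumeration:

```python
# Toy acoustic scores p(X | Y) for the observed waveform X
acoustic = {
    ("dh-ax", "k-ae-t"): 0.020,
    ("dh-iy", "k-ae-t"): 0.005,
    ("ax", "k-ae-t"): 0.008,
}
# Toy joint language-model probabilities p(Y1..YN, W1..WM)
language = {
    (("dh-ax", "k-ae-t"), ("the", "cat")): 0.60,
    (("dh-iy", "k-ae-t"), ("the", "cat")): 0.30,
    (("ax", "k-ae-t"), ("a", "cat")): 0.10,
}

def recognize():
    """argmax over W of sum over Y of p(X|Y) * p(Y, W)."""
    scores = {}
    for (Y, W), p_yw in language.items():
        scores[W] = scores.get(W, 0.0) + acoustic.get(Y, 0.0) * p_yw
    return max(scores, key=scores.get)

print(recognize())   # -> ('the', 'cat')
```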
The Problems of Speech Recognition
• What allophones should be distinguished?
  • Minimum: ~50 phonemes, including schwa
  • Left and right neighboring phonemes?
  • Unreleased vs. released stops?
  • Function word vs. content word?
  • Lexical stress, onset vs. coda, talker gender?
• What acoustic features X?
  • Spectrum once per 10 ms. Pitch discarded
• What is the acoustic model f(X,Λ)?
  • CDHMM (continuous-density hidden Markov model); its mixture emission density is sketched below
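Since the slide names a CDHMM as the acoustic model, here is a minimal sketch of its per-state emission density f(X,Λ): a diagonal-covariance Gaussian mixture evaluated on one feature frame. The weights, means, and variances below are illustrative stand-ins, not trained values:

```python
import numpy as np

def log_gmm(x, weights, means, variances):
    """log p(x | state) for a diagonal-covariance Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    comp = []
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # log of one Gaussian component, diagonal covariance
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(comp)   # log of the weighted mixture sum

# One 2-component mixture over a 3-dimensional feature vector:
print(log_gmm([0.1, -0.2, 0.3],
              weights=[0.7, 0.3],
              means=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
              variances=[[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]))
```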
Prosody-Dependent Allophones
• 100 monophones (incl. schwa, unreleased vs. released stops, function vs. content)
• Split based on prosodic context: 200-600 prosody-dependent monophones
• Split based on left, right phonemes: 300-6000 prosody-dependent triphones
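To make the counting concrete, a minimal sketch of how the inventory grows when monophones are crossed with prosodic tags and phonetic contexts. The tag set and label syntax are assumptions for illustration; real systems keep only contexts observed in training data, hence the 300-6000 range rather than the full cross-product:

```python
from itertools import product

monophones = ["aa", "ae", "b", "k", "t"]          # stand-in for ~100 units
prosody_tags = ["acc", "unacc", "init", "final"]  # binary splits in practice

# Prosody-dependent monophones: one unit per (phone, prosodic tag)
pd_monophones = [f"{p}+{tag}" for p, tag in product(monophones, prosody_tags)]

# Prosody-dependent triphones: additionally split on left/right neighbors
# (an unpruned cross-product; real sets are trimmed to attested contexts)
pd_triphones = [f"{l}-{p}+{r}/{tag}"
                for l, p, r in product(monophones, repeat=3)
                for tag in prosody_tags]

print(len(pd_monophones), len(pd_triphones))   # -> 20 500
```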
Prosodic Contexts that Might Matter
• Accented vs. Unaccented
  • If the word has a pitch accent, phones in the primary-stress syllable are “accented”
• Phrase-Initial vs. Phrase-Medial
  • If the word is phrase-initial, phones in the onset and nucleus of the 1st syllable are “phrase-initial”
• Phrase-Final vs. Phrase-Medial
  • If the word is phrase-final, phones in the nucleus and coda of the last syllable are “phrase-final” (these rules are sketched in code below)
• How many levels of “phrase” should we model? How many levels of “accent”?
  • The Boston Radio News database has only enough data for binary distinctions: IP/non-IP, accent/non-accent.
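A minimal sketch of these tagging rules, assuming a hypothetical word representation: a list of syllables, each with onset/nucleus/coda phone lists and a primary-stress flag. The data structure and field names are invented for illustration:

```python
def tag_phones(word_sylls, has_pitch_accent, phrase_initial, phrase_final):
    """Yield (phone, set of prosodic tags) for every phone in the word."""
    for i, syl in enumerate(word_sylls):
        for part in ("onset", "nucleus", "coda"):
            for phone in syl[part]:
                tags = set()
                # accented: primary-stress syllable of a pitch-accented word
                if has_pitch_accent and syl["primary_stress"]:
                    tags.add("accented")
                # phrase-initial: onset/nucleus of 1st syllable
                if phrase_initial and i == 0 and part in ("onset", "nucleus"):
                    tags.add("phrase-initial")
                # phrase-final: nucleus/coda of last syllable
                if (phrase_final and i == len(word_sylls) - 1
                        and part in ("nucleus", "coda")):
                    tags.add("phrase-final")
                yield phone, tags

word = [{"onset": ["b"], "nucleus": ["aa"], "coda": ["s"],
         "primary_stress": True}]
print(list(tag_phones(word, has_pitch_accent=True,
                      phrase_initial=False, phrase_final=True)))
```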
Which Prosodic Contexts Matter? Method
• Train Λ(Y) to maximize log p(X(train,Y)|Y)
• Measure log p(X(test,Y)|Y)
• For an accent-dependent allophone Y, does phrase position matter? Compare:
  log p(X(test,Y)|Y) <? (1/3) (log p(X(test,Y_initial)|Y_initial) + log p(X(test,Y_medial)|Y_medial) + log p(X(test,Y_final)|Y_final))
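A minimal sketch of this comparison; the score arguments stand in for measured test-set log-likelihoods of the pooled model and the three position-split models, and the numbers in the example are invented:

```python
def position_matters(score_pooled, score_init, score_medial, score_final):
    """True if position-split models beat the pooled model on held-out data."""
    split_avg = (score_init + score_medial + score_final) / 3.0
    return score_pooled < split_avg

# Hypothetical test-set log-likelihoods for one accent-dependent allophone:
print(position_matters(score_pooled=-52.1,
                       score_init=-51.3,
                       score_medial=-51.8,
                       score_final=-51.0))   # -> True: position matters
```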
Which Prosodic Contexts Matter? Vowel Results: Everything Matters
• Phrase-initial vowels that vary by accent: 7/12
  • aa, ae, ah, ao, ay, ih, iy
• Phrase-medial vowels that vary by accent: 13/15
  • all but uh, ax
• Phrase-final vowels that vary by accent: 6/8
  • all but uh, ao
• Accented vowels that vary by position: 12/14
  • all but uh, oy
• Unaccented vowels that vary by position: 10/14
  • all but uh, ey, ay, ao
Which Prosodic Contexts Matter? Syllable-Initial Consonants
• Phrase-initial onsets that vary by accent: 4/13
  • b, h, r, t
• Phrase-medial onsets that vary by accent: 20/21
  • all but z
• Accented onsets that vary by position: 3/14
  • s, r, f
• Unaccented onsets that vary by position: 18/21
  • all but y, g, ch
Which Prosodic Contexts Matter? Syllable-Final Consonants
• Phrase-medial codas that vary by accent: 17/19
  • all but sh, v
• Phrase-final codas that vary by accent: 5/15
  • d, f, r, v, z
• Accented codas that vary by position: 14/16
  • all but ch, d, g
• Unaccented codas that vary by position: 17/21
  • all but ch, g, p, v
Which Prosodic Contexts Matter? A Model of the Results
[Figures: Vowels, Consonants]
Acoustic Features for Prosody-Dependent Speech Recognition
• Spectrum once per 10 ms (MFCC), ΔMFCC, ΔΔMFCC
• Energy, ΔEnergy, ΔΔEnergy
• Pitch:
  • Correct pitch-halving and pitch-doubling errors
  • Compute minF0 per utterance
  • f(t) = log(F0(t)/minF0)
  • A TDNN or TDRNN computes f*(t) = P( accent(t) | f(t−50ms),…,f(t+50ms) )
  • Use f*(t) as the “observation” for an HMM
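A minimal sketch of this pitch preprocessing, assuming an F0 track in Hz with 0 marking unvoiced frames. The halving/doubling thresholds and the running-median reference are illustrative choices, not the exact method from the slides:

```python
import numpy as np

def pitch_feature(f0):
    """Return f(t) = log(F0(t)/minF0) after halving/doubling correction."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0                           # 0 marks unvoiced frames
    med = np.median(f0[voiced])               # reference for gross errors
    fixed = f0.copy()
    fixed[voiced & (f0 < 0.6 * med)] *= 2.0   # undo pitch halving
    fixed[voiced & (f0 > 1.7 * med)] /= 2.0   # undo pitch doubling
    f = np.zeros_like(fixed)
    f[voiced] = np.log(fixed[voiced] / fixed[voiced].min())  # minF0 per utt
    return f

print(pitch_feature([0, 110, 112, 55, 230, 115, 0]).round(2))
```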
TDRNN with One Output Unit
[Diagram: F0 and P_V inputs feed an input layer with delay (D) units and an internal state layer, followed by 1st and 2nd hidden layers and an output layer with a single unit separating pitch-accented from pitch-unaccented]
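In the spirit of the diagram, a minimal sketch of a time-delay network with one output unit, applied to an 11-frame (±50 ms at 10 ms/frame) window of the pitch feature. The weights are random stand-ins, and the recurrent internal-state layer of the TDRNN is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(8, 11))   # hidden layer over 11-frame window
W2 = rng.normal(scale=0.1, size=(1, 8))    # single "accented" output unit

def accent_posterior(f, t):
    """P(accent at frame t | f(t-50ms), ..., f(t+50ms)) for feature track f."""
    window = f[t - 5 : t + 6]                    # 11 frames = +/-50 ms
    h = np.tanh(W1 @ window)                     # hidden layer activations
    return 1.0 / (1.0 + np.exp(-(W2 @ h)[0]))    # sigmoid output in (0, 1)

f = np.abs(rng.normal(size=100))                 # stand-in f(t) track
print(accent_posterior(f, t=50))
```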
Acoustic Model f(X,Λ) for Prosody-Dependent Speech Recognition
• Normalized phoneme duration is highly correlated with phrase position
• Duration is not available before phoneme recognition!
• Solution: semi-Markov model (a.k.a. HMM with explicit duration distributions):
  P(x1,…,xT|Y1,…,YN) = Σ_{d1,…,dN} p(d1|Y1)⋯p(dN|YN) p(x1…x_d1|Y1) p(x_d1+1…x_d1+d2|Y2) ⋯
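A minimal sketch of this likelihood, computed by dynamic programming over segment boundaries rather than enumerating duration vectors. dur_logprob and seg_loglik are illustrative stand-ins for the trained duration model p(d|Y) and the segmental acoustic model:

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def semi_markov_loglik(T, phones, dur_logprob, seg_loglik, max_dur=30):
    """log P(x_1..x_T | Y_1..Y_N), summing over segment durations by DP."""
    NEG = float("-inf")
    N = len(phones)
    # alpha[n][t] = log prob that the first n phones explain frames 1..t
    alpha = [[NEG] * (T + 1) for _ in range(N + 1)]
    alpha[0][0] = 0.0
    for n, y in enumerate(phones, start=1):
        for t in range(1, T + 1):
            for d in range(1, min(max_dur, t) + 1):
                prev = alpha[n - 1][t - d]
                if prev > NEG:
                    cand = prev + dur_logprob(d, y) + seg_loglik(t - d, t, y)
                    alpha[n][t] = logaddexp(alpha[n][t], cand)
    return alpha[N][T]

# Toy usage with uniform stand-ins for the duration and segment models:
print(semi_markov_loglik(
    T=20, phones=["f", "ay", "n+final"],
    dur_logprob=lambda d, y: math.log(1 / 30),      # uniform p(d|Y)
    seg_loglik=lambda s, e, y: -0.5 * (e - s)))     # fake acoustic score
```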
Example: Distributions of Duration, Phrase-Final vs. Phrase-Medial
[Figure: duration distributions for phrase-final vs. phrase-medial phones]
Work in Progress
• Confirm these experiments with a state-of-the-art phoneme set & acoustic features
• Improve pitch features; improve duration modeling
• Spontaneous speech database (Switchboard):
  • Syntactic parse of available word transcriptions
  • “Guess” prosody from syntax
  • Train recognition models
  • Iteratively improve the prosodic transcription?
• Study the relationship between prosody and disfluency