Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA. Lecture 11. Articulatory Phonology.
Lecture 11. Articulatory Phonology • Surface phonology problems: reduction, assimilation, deletion • Articulatory Phonology • The mental lexicon (our mental storage for words) is made of Gestures, not phonemes • Overlap among the gestures results in inter-gesture competition; competition can result in reduction and/or assimilation • No mental concept of “sequencing” – instead, mental representation includes pair-wise coupling constraints between gestures • Speech motor control • Constriction area matters more than non-constriction area • Motor control model: only control the constrictions • Tract variables • Task dynamics • Prosody • Units of prosody: phrases and pitch accents • Prosodic gestures: spatial scaling, time stretching • Prosodic landmark detection
Pronunciation Variability (Read Speech) • Manner Class Assimilation: /t/ becomes part of the /n/ • Vowel Reduction: /iy/ becomes /ix/
Pronunciation Variability (Read Speech) • Syllable Merger: “carry an” becomes “carin” • Vowel Reduction: /iy/ becomes /ax/
Autosegmental Phonology (Goldsmith, 1975) • Inter-word phonological rules all have a simple form: manner or place assimilation • Hypothesis: instructions to the speech articulators are arranged in “autosegmental tiers,” i.e., on a kind of musical score with asynchronous rows • Assimilation = feature spreading • Example: /s/ /sh/ → /sh/ /sh/. Before spreading, both segments are [-nasal], [+strident], [+blade]; /s/ is [+anterior] and /sh/ is [-anterior]. After spreading, the two segments share a single [+blade] and a single [-anterior].
Articulatory Phonology (Browman and Goldstein, 1990) [Figure labels: TB-LOC, VELUM, TT-LOC, LIP-OP, TB-OPEN, TT-OPEN, VOICING] • Word is composed of “gestures” • Gestures are MENTAL speech planning units, but they have close correspondence to articulatory controls • Example: Mental Lexicon Entry for “she”: • TT-OPEN→FRICATIVE (/š/) • TT-LOC→PALATAL (/š/) • TB-OPEN→NARROW (whole word) • TB-LOC→PALATAL (whole word) • GLOTTIS-OPEN→WIDE (/š/) then GLOTTIS-OPEN→CRITICAL (/i/)
Articulatory Phonology (Browman and Goldstein, 1990) • Rule-based Phonologies: • Reduction and assimilation are CHANGES in the value of a distinctive feature, just like morpho-phonological processes • Autosegmental Phonologies: • Reduction and assimilation are SUBSTITUTIONS of a neighboring phone’s features in place of the current phone’s features • Articulatory Phonology: • “Frozen” word construction processes may result in the deletion or substitution of gestures in the lexicon, but… • The process of sequencing words to create a sentence never deletes or changes any gesture; all gestures stay in the mental representation all the time!! • Reduction and Assimilation can be explained by • Overlap among gestures • Competition among overlapping gestures, for control of the same articulators
Example: Manner-Class Assimilation • “Don’t Ask,” Careful Speech: tongue tip tier TT-CLOSED, TT-CLOSED, TT-FRIC; tongue body tier TB-OPEN, TB-OPEN, TB-CLOSED; glottis tier GL-CLO, GL-OPEN, GL-CRIT, GL-CRIT; phones /d/ /o/ /n/ /t/ /ae/ /s/ /k/ • “Don’t Ask,” Fast Speech: the same gestures on every tier (TT-CLOSED, TT-CLOSED, TT-FRIC; TB-OPEN, TB-OPEN, TB-CLOSED; GL-CLO, GL-OPEN, GL-CRIT, GL-CRIT); phones /d/ /o/ /n/ /ae/ /s/ /k/
What’s in the Lexicon? (Browman and Goldstein, 2000) • Experimental Observation: consonant clusters at the beginning of a syllable (/sp/ in “spat”) show less production variability than consonant clusters at the end of a syllable (/ps/ in “taps”) • Hypothesis: the mental lexicon includes GESTURES and PAIRWISE COUPLING CONSTRAINTS • Two kinds of coupling: simultaneous or sequential • Coda consonants FOLLOW the vowel, e.g., in “taps”: TB-WIDE→LIP-CLOSED→TT-CRITICAL • Onset consonants are produced SIMULTANEOUSLY with the start of the tongue body vowel gesture; therefore, in “spat,” both consonant gestures are coupled to the vowel gesture: TT-CRITICAL→TB-WIDE and LIP-CLOSED→TB-WIDE. Competition among them yields reduced variability in production.
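The hypothesis above, that a lexical entry stores gestures plus pairwise coupling constraints, can be sketched as a small data structure. This is only an illustration: the `Coupling` class, the `kind` strings, and the `competing_gestures` helper are hypothetical names, and only the “taps” coda and “spat” onset couplings named on the slide are encoded, not complete lexical entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coupling:
    """One pairwise coupling constraint between two gestures (hypothetical type)."""
    first: str
    second: str
    kind: str  # "simultaneous" or "sequential"


# Coda of "taps": each gesture FOLLOWS the previous one.
TAPS_CODA = [
    Coupling("TB-WIDE", "LIP-CLOSED", "sequential"),
    Coupling("LIP-CLOSED", "TT-CRITICAL", "sequential"),
]

# Onset of "spat": both consonant gestures are coupled to the vowel gesture.
SPAT_ONSET = [
    Coupling("TT-CRITICAL", "TB-WIDE", "simultaneous"),
    Coupling("LIP-CLOSED", "TB-WIDE", "simultaneous"),
]


def competing_gestures(couplings, anchor):
    """Return the gestures that are simultaneously coupled to the same anchor gesture."""
    return sorted(
        c.first for c in couplings
        if c.kind == "simultaneous" and c.second == anchor
    )


print(competing_gestures(SPAT_ONSET, "TB-WIDE"))
print(competing_gestures(TAPS_CODA, "TB-WIDE"))
```

In this sketch the onset yields two gestures competing for the same anchor, while the coda yields none, which mirrors the asymmetry the slide uses to explain the difference in production variability.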
Production Planning: Lexical Entry Turned Into a Gestural Score “SPAT:”
From Gestural Score to Acoustics • Perturbation Theory (Chiba and Kajiyama, 1941) showed that dFn ~ d log A(x) ≈ dA(x)/A(x) • The audibility of a change dA(x) is proportional to 1/A(x) • Changes near a constriction (small A(x)) are very audible • Changes elsewhere (large A(x)) are not very audible • Therefore, talkers carefully control A(x) only near a constriction: • Inter-utterance variability of A(x) is an increasing function of 1/A(x) (Perkell and Nelson, JASA 1985): E[(A(x) − m_A(x))²] ~ 1/m_A(x), where m_A(x) ≡ E[A(x)] • Inter-talker variability of A(x) is an increasing function of 1/A(x) (Hasegawa-Johnson et al., JSLHR 2003) • Inter-talker variability of log A(x) is independent of A(x) (Hasegawa-Johnson et al., JSLHR 2003): E[(log A(x) − m_logA(x))²] ~ constant
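As a quick numerical illustration of the perturbation-theory point, the sketch below compares the change in log A(x) produced by the same absolute area change at a narrow constriction and in an open region. The area values (0.2 cm² and 4.0 cm²) and the 0.1 cm² perturbation are illustrative assumptions, not measurements from the cited studies.

```python
import math


def log_area_change(area_cm2: float, delta_cm2: float) -> float:
    """Change in log A(x) when the area changes from area_cm2 to area_cm2 + delta_cm2."""
    return math.log((area_cm2 + delta_cm2) / area_cm2)


# Same absolute perturbation applied at two places (assumed, illustrative values).
at_constriction = log_area_change(0.2, 0.1)  # small A(x)
in_open_region = log_area_change(4.0, 0.1)   # large A(x)

print(f"d log A at the constriction: {at_constriction:.3f}")
print(f"d log A in the open region:  {in_open_region:.3f}")
```

Because the formant shift scales with d log A(x), the perturbation at the constriction matters far more acoustically than the identical perturbation in the open region.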
Constriction Control as a Model of Speech Motor Control (Stevens and House, JASA, 1955) • Vocal tract shape controlled by just three control parameters: • xPOS = POSition of tongue constriction • rCD = Constriction Degree = radius of the constriction • rLIP = effective radius of the lip constriction • All other vocal tract areas determined by A(x) = π r(x)², where r(x) = 0.7 + 0.144x² for 0 ≤ x ≤ 2.75 (larynx); r(x) = min(1.6, rCD + 0.025(1.2 − rCD)(x − xPOS)²) for 2.75 ≤ x ≤ xPOS (pharynx); r(x) = rCD + 0.025(1.2 − rCD)(x − xPOS)² for xPOS ≤ x ≤ 17 (mouth); r(x) = rLIP for 17 ≤ x ≤ 18 (lips) • x and r(x) are in centimeters, A(x) in cm²
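The three-parameter area function can be written out as a short function. This is a sketch under one stated assumption: the radius widens parabolically away from the constriction, so that rCD is the narrowest point of the tongue region. The function names, argument names, and the example parameter values are hypothetical.

```python
import math


def radius_cm(x: float, x_pos: float, r_cd: float, r_lip: float) -> float:
    """Vocal tract radius r(x) in cm, with x in cm measured from the glottis.

    Assumes the radius grows parabolically away from the constriction at x_pos.
    """
    if not 0.0 <= x <= 18.0:
        raise ValueError("x must lie between 0 and 18 cm")
    if x <= 2.75:                      # larynx
        return 0.7 + 0.144 * x ** 2
    parabola = r_cd + 0.025 * (1.2 - r_cd) * (x - x_pos) ** 2
    if x <= x_pos:                     # pharynx: radius is capped at 1.6 cm
        return min(1.6, parabola)
    if x <= 17.0:                      # mouth
        return parabola
    return r_lip                       # lips


def area_cm2(x: float, x_pos: float, r_cd: float, r_lip: float) -> float:
    """Cross-sectional area A(x) = pi * r(x)^2 in cm^2."""
    return math.pi * radius_cm(x, x_pos, r_cd, r_lip) ** 2


# Example with assumed parameter values: constriction 12 cm from the glottis.
for x in (1.0, 3.0, 12.0, 15.0, 17.5):
    print(f"x = {x:4.1f} cm  A(x) = {area_cm2(x, 12.0, 0.3, 0.5):.3f} cm^2")
```

The boundary points x = 2.75 and x = 17 belong to two regions in the slide's definition; this sketch assigns them to the larynx and mouth regions respectively.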
Extending the Model: Tract Variables (Saltzman and Munhall, 1989) • Languages treat tongue tip and tongue body differently, e.g., both can have constrictions at the same time • Therefore split (xPOS, rCD) → (TTPOS, TTCD, TBPOS, TBCD) • Talkers can independently control lip area and lip length • Therefore split (rLIP) → (LIPCD, LIPPOS) • Soft palate (“velum”) control: open vs. closed • Therefore we need a control variable VELCD • Glottis control: open (breathy), critical (voiced), closed (glottal stop) • Control variable GLOCD • The tract variable model: speech is controlled by a mental controller with an 8-dimensional control vector: a(t) = [LIPCD, LIPPOS, TTCD, TTPOS, TBCD, TBPOS, VELCD, GLOCD]^T
Task Dynamics: Connecting Gestures to Tract Variables (Saltzman and Munhall, 1989) • Lexicon Gestures are sequenced into a GESTURAL SCORE • The Gestural Score is “played” like a musical score. Each Gesture onset is turned into TRACT VARIABLE TARGETS, a(t). • The relationship between tract variable targets, a(t), and physical articulator positions, x(t), is given by a 2nd-order system: M d²x/dt² = K(t)(a(t) − x(t)) − R dx/dt • K(t) = effective tract-variable-stiffness matrix; controlled by the talker, but varies more slowly than a(t) • M = effective mass matrix • R = effective damping matrix
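The second-order system above can be simulated for a single tract variable to see how the articulator position approaches its target. The sketch below is a scalar version with constant stiffness; the mass, stiffness, step size, and the choice of critical damping (R = 2·sqrt(M·K)) are all assumptions made for illustration, and `simulate_tract_variable` is a hypothetical name.

```python
import math


def simulate_tract_variable(target, x0=0.0, mass=1.0, stiffness=2500.0,
                            duration_s=0.5, dt=1e-4):
    """Integrate M x'' = K (a - x) - R x' for one tract variable.

    Scalar sketch with constant K and an assumed critically damped R.
    Returns the list of positions, starting with x0.
    """
    damping = 2.0 * math.sqrt(mass * stiffness)  # assumed critical damping
    x, v = x0, 0.0
    trajectory = [x]
    for _ in range(int(round(duration_s / dt))):
        accel = (stiffness * (target - x) - damping * v) / mass
        v += dt * accel   # semi-implicit Euler: update velocity first,
        x += dt * v       # then position with the new velocity
        trajectory.append(x)
    return trajectory


# A gesture onset sets the target a(t) to 1.0 (arbitrary units) at t = 0.
positions = simulate_tract_variable(target=1.0)
print(f"position after 10 ms:  {positions[100]:.3f}")
print(f"position after 500 ms: {positions[-1]:.3f}")
```

With these assumed values the position rises smoothly toward the target, which is the qualitative behavior the gestural score relies on: a step change in a(t) becomes a gradual articulator movement.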
1. Prosodic Phrases • Prosodic Phrasing = the PERCEPTUAL grouping of words • Prosodic phrase boundaries are usually (not always) a subset of SYNTACTIC phrase boundaries • “I like ginger | chocolate ice cream | and cigars” • “I like ginger-chocolate ice cream | and cigars” • “I bought a book from | the old used bookstore downtown” • A hierarchy of phrases: • Intonational phrase = 1-5 accent phrases • Intermediate/Accent phrase = 1-5 prosodic words • Prosodic word = 1-2 dictionary words, e.g., “the+open | door” • Acoustic correlates of phrasing • Phrase-final syllable is MUCH LONGER (typically 50-100% longer) • Intonational phrase is often followed by a PAUSE • (Language-dependent): Phrase may end in a PHRASE TONE • Intermediate Phrase Tones in English: L-, H- (low and high) • Intonational Phrase Tones in English: L-L%, L-H%, H-L%, H-H%
2. Prominence/Pitch Accent • Prominence: Usually, a listener can tell which syllable in an accent phrase the talker thinks is most important. That syllable is called “prominent.” • Acoustic correlates of prominence (language-dependent): • DURATION: • English, Dutch, and other “stress-timed languages”: prominent syllables are longer • French, Japanese, and other “syllable-timed languages”: prominent syllables are not longer • HYPER-ARTICULATION: • prominent syllables are often more clearly pronounced • ENERGY: prominent syllables are louder • PITCH ACCENT (language-dependent) • English: • Extra high pitch: H* • Extra low pitch: L* • Various combinations (H*+L, L+H*, L*+H) • Swedish: • Single-peaked accents similar to English • Double-peaked accents perhaps unique to Swedish • Japanese: • F0 is high from beginning of accent phrase until prominent syllable, then drops • Chinese: • Lexical tone is HYPER-ARTICULATED (e.g., 3rd tone dips MORE than usual)
Example: “Massachusetts” [Figure panels: Unaccented; Accented, in which /u/ is longer and louder]
Example: “(if they think they can drink and drive, and) get away with it, they’ll pay.” [Figure: probability-of-voicing and pitch tracks for “get away with it they’ll pay,” with prosodic labels L*, H*, H-H%, HiF0, L-L%]
Do Prominence and Phrasing Affect Tongue Movement? (Fougeron and Keating, 1997) • Experiment: • Design an electropalate for each subject • Electropalate = a plastic insert covered with small electrodes. • When the tongue touches the palate, the touched electrodes detect contact • Keep track of the area and shape of tongue-palate contact as a function of time • Subjects read carrier sentences, target word in different positions • “book” Prominent: “the red book holder, not the red basket holder” • “book” Non-prominent: “the red book holder, not the blue book holder” • “book” Phrase-final: “the red book, Holbert, not the blue book” • Result: • Prominent words: longer + much more tongue-palate contact • Phrase-final words: longer; little change in tongue-palate contact
Do Prominence and Phrasing Affect the MFCCs? (Borys, Hasegawa-Johnson, and Cole, 2003) • Clustered Triphones (decision tree for N): R Vowel? Yes → N-VOW; No → L Stop? (No → N; Yes → STOP+N). WER: 36.2% • Prosody-Dependent Allophones (decision tree for N): R Vowel? Yes → N-VOW; No → Pitch Accent? (No → N; Yes → N*). WER: 25.4% • BUT: WER of baseline Monophone system = 25.1%
Prosody-dependent allophones: ASR clustering matches EPG • Fougeron & Keating (1997) EPG Classes: • Strengthened • Lengthened • Neutral
Why is there a relationship between Prosody and Tongue Movement?
What’s the Scale of a Gestural Score? [Figure: gestural score with TT-CLOSED, TT-CLOSED, TT-FRIC, TT-OPEN, TT-OPEN, TB-CLOSED, and VEL-OPEN gestures on a time axis t, with times t1 and t2 marked] • How much does the tongue tip open? (How many cm?) • What is t2-t1 in seconds?
Prosodic Gestures (Byrd and Saltzman) [Figure: block diagram] • Gestural Score in Relative Time: TT-CLOSED, TT-CLOSED, TT-FRIC, TT-OPEN, TT-OPEN, TB-CLOSED, VEL-OPEN • πs Prosodic Gestures (SPATIAL-SCALE-LARGE, REDUCED) set the Spatial Scale for Gesture Playback • πT Prosodic Gestures (TIME-SCALE-STRETCHED) set the Time Scale for Gesture Playback • Gestural Score “Playback Head” • Convert Gestural Score to Tract Variable Targets a(t) in Absolute Time
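The playback idea can be sketched as a mapping from a gestural score in relative time to targets in absolute time. This is a deliberately simplified illustration: it assumes one constant time-scale factor and one constant spatial-scale factor for the whole score, whereas prosodic gestures act over limited stretches of the score. The `Gesture` class, the `play_back` function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    """One gesture in a gestural score (hypothetical representation)."""
    name: str
    target: float   # tract variable target, arbitrary units
    onset: float    # relative time
    offset: float   # relative time


def play_back(score, time_scale, spatial_scale):
    """Convert relative-time gestures into absolute-time targets.

    time_scale: seconds per unit of relative time (larger = stretched).
    spatial_scale: multiplier on every target (larger = bigger movement).
    """
    return [
        Gesture(g.name, g.target * spatial_scale,
                g.onset * time_scale, g.offset * time_scale)
        for g in score
    ]


score = [
    Gesture("TT-CLOSED", 1.0, 0.0, 1.0),
    Gesture("VEL-OPEN", 0.5, 0.5, 1.5),
]

# Stretched and enlarged playback, e.g. for a prominent or phrase-final word.
for g in play_back(score, time_scale=0.2, spatial_scale=1.5):
    print(f"{g.name}: target {g.target:.2f}, {g.onset:.2f} s to {g.offset:.2f} s")
```

Keeping the score in relative time and applying the scales only at playback means the lexical entry itself never changes, which is consistent with the claim that sentence production does not delete or alter gestures.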
Convert Tract Variable Targets to Tract Variables, Then to Acoustics
Prosodic Landmark Detection (Kim, Hasegawa-Johnson and Chen, IEEE Sign. Proc. Letters, 2003)
The Time-Delay Recursive Neural Network (Kim, Neurocomputing, 1998) [Figure: network diagram. Inputs: F0 and Prob_Voice, passed through delay elements (D) as Time-Delayed Inputs, together with a Time-Delayed Internal State; 1st Hidden Layer; 2nd Hidden Layer; Output Layer with two outputs: Pitch Accented, Pitch Unaccented]
Summary • Surface phonology problems: reduction, assimilation, deletion • Articulatory Phonology • The mental lexicon (our mental storage for words) is made of Gestures, not phonemes • No mental concept of “sequencing” – instead, mental representation includes pair-wise coupling constraints between gestures • Speech motor control • Constriction area matters more than non-constriction area • Motor control model: only control the constrictions • Tract variables • Task dynamics • Prosody • Units of prosody: phrases and pitch accents • Prosodic gestures: spatial scaling, time stretching • Prosodic landmark detection