Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology. Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA. Lecture 3: Spectral Dynamics and the Production of Consonants.
Lecture 3: Spectral Dynamics and the Production of Consonants
• International Phonetic Alphabet
• Events in the Closure of a Nasal Consonant
  • Formant transitions: a perturbation model
  • Nasalized vowel
  • Nasal murmur
• Events in the Release of a Stop Consonant
  • Pre-voicing (voiced stops in carefully read English)
  • Transient (stops and affricates)
  • Frication (stops, affricates, and fricatives)
  • Aspiration (aspirated stops and /h/)
  • Formant Transitions (any consonant-vowel transition)
• Formant Tracking
  • Does it help Speech Recognition?
  • Methods for Vowels, and for Aspiration & Nasals
• Reminder – lab 1 due Monday!
International Phonetic Alphabet: Purpose and Brief History
• Purpose of the alphabet: to provide a universal notation for the sounds of the world’s languages
  • “Universal” = If any language on Earth distinguishes two phonemes, the IPA must also distinguish them
  • “Distinguish” = The meaning of a word changes when the phoneme changes, e.g. “cat” vs. “bat.”
• Very Brief History:
  • 1867: Alexander Melville Bell publishes a distinctive-feature-based phonetic notation in “Visible Speech: The Science of Universal Alphabetics.” His notation is rejected as being too expensive to print
  • 1886: International Phonetic Association founded in Paris by phoneticians from across Europe
  • 1991: Unicode provides a standard method for including IPA notation in computer documents
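International Phonetic Alphabet: Vowels
[Figure: IPA vowel chart annotated with approximate Pinyin and ARPABET equivalents; the IPA symbols were lost in extraction. Surviving labels, first group: u (zhu) / UW, o, UH, OW, AH, AO, a (ma) / AA. Second group: i, u (xu), IY, UX, EY, EH, a (zhang) / AE, a (ma). Central vowel: Pinyin e, ARPABET AX.]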
IPA: Regular Consonants
[Figure: IPA consonant chart with the Tongue Body and Tongue Blade regions marked; ARPABET annotations on the chart: Q, NG, DX, HH/HV, R, Y.]
ARPABET: F/V (labiodental), TH/DH (dental), S/Z (alveolar), SH/ZH (postalveolar or palatal)
Pinyin: s (alveolar), x (postalveolar), sh/r (retroflex)
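Affricates and Doubly-Articulated Consonants
[Figure: IPA chart of doubly-articulated consonants; ARPABET annotations: WH, W.]
Affricates in English and Chinese:
• Alveolar: Pinyin c/z; IPA ts/dz
• Post-alveolar: Pinyin q/j; ARPABET CH/JH; IPA tʃ/dʒ
• Retroflex: Pinyin ch/zh; IPA ʈʂ/ɖʐ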
Events in the Closure of a Nasal Consonant: Formant Transitions, Vowel Nasalization, Nasal Murmur
Formant Transitions: Labial Consonants. Spectrogram examples: “the mom”, “the bug”
Formant Transitions: Alveolar Consonants. Spectrogram examples: “the supper”, “the tug”
Formant Transitions: Post-alveolar Consonants. Spectrogram examples: “the shoe”, “the zsazsa”
Formant Transitions: Velar Consonants. Spectrogram examples: “the gut”, “sing a song”
Formant Transitions: A Perceptual Study The study: (1) Synthesize speech with different formant patterns, (2) record subject responses. Delattre, Liberman and Cooper, J. Acoust. Soc. Am. 1955.
Nasal Murmur
Spectrogram examples: “the mug”, “the nut”, “sing a song”
Observations:
• A low-frequency resonance (about 300 Hz) is always present
• The low-frequency resonance has a wide bandwidth (about 150 Hz)
• The energy of the low-frequency resonance is very constant
• Most high-frequency resonances are cancelled by zeros
• Different places of articulation have different high-frequency spectra
• The high-frequency spectrum is talker-dependent and variable
Resonances of a Nasal Consonant Reference: Fujimura, JASA 1962
Events in the Release of a Stop “Burst” = transient + frication (the part of the spectrogram whose transfer function has poles only at the front cavity resonance frequencies, not at the back cavity resonances).
Events in the Release of a Stop
[Figure: spectrograms of an aspirated stop (/t/) and an unaspirated stop (/b/), with labels: Transient, Frication, Aspiration, Voicing.]
Pre-voicing during Closure
To make a voiced stop in most European languages, the tongue root is relaxed, allowing it to expand, so that the vocal folds can continue vibrating for a little while after oral closure. The result is a low-frequency “voice bar” that may continue well into closure. In English, closure voicing is typical of read speech, but not of casual speech. Spectrogram example: “the bug”
Transfer Function During Transient and Frication: Poles
• Turbulence striking an obstacle makes noise
• Front cavity resonance frequency: $F_R = c / (4 L_f)$, where $c$ is the speed of sound and $L_f$ is the length of the front cavity
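As a rough numerical illustration of the quarter-wavelength formula on this slide, the sketch below evaluates F_R = c/(4 L_f) for a few front cavity lengths. The speed of sound (35400 cm/s, a typical figure for warm, humid air) and the example lengths are assumptions chosen for illustration, not values from the lecture; the function name is hypothetical.

```python
# Assumed speed of sound in warm, humid air, in cm/s (not given on the slide).
SPEED_OF_SOUND_CM_PER_S = 35400.0


def front_cavity_resonance_hz(front_cavity_length_cm, c=SPEED_OF_SOUND_CM_PER_S):
    """Quarter-wavelength resonance F_R = c / (4 * L_f) of the front cavity."""
    if front_cavity_length_cm <= 0:
        raise ValueError("front cavity length must be positive")
    return c / (4.0 * front_cavity_length_cm)


# Hypothetical front cavity lengths in cm: a shorter cavity (constriction
# nearer the lips) gives a higher resonance, a longer cavity a lower one.
for length_cm in (1.0, 2.0, 4.0):
    f_r = front_cavity_resonance_hz(length_cm)
    print(f"L_f = {length_cm:.1f} cm -> F_R = {f_r:.0f} Hz")
```

The inverse relationship is the point of the sketch: halving the front cavity length doubles the burst resonance frequency.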
Are Formant Frequencies Useful for Speech Recognition?
• Kopec and Bush (1992): WER(formants alone) > WER(cepstrum alone) > WER(formants and cepstrum together)
• How should we track formants?
  • In vowels: Autoregressive (AR) modeling (also known as LPC)
  • In aspiration, nasals: Autoregressive Moving Average (ARMA) modeling. Problem: no closed-form solution
  • In aspiration, nasals: Exponentially Weighted Autoregressive (EWAR; Zheng and Hasegawa-Johnson, ICASSP 2004)
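The AR (LPC) option listed for vowels can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it fits AR coefficients with the autocorrelation method, then reads formant frequencies and bandwidths off the roots of the prediction polynomial. All function names, the synthetic three-resonance test signal, the 8 kHz sampling rate, and the 90 Hz / 400 Hz pruning thresholds are assumptions made for the example. The ARMA and EWAR methods are not shown.

```python
import numpy as np


def resonances_to_ar(formants_hz, bandwidths_hz, fs):
    """Build all-pole coefficients a[0..p] (a[0] = 1) from resonance specs."""
    a = np.array([1.0])
    for f, b in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * b / fs)       # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs      # pole angle from center frequency
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a


def impulse_response(a, n_samples):
    """Impulse response of 1/A(z), computed by direct recursion."""
    h = np.zeros(n_samples)
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * h[n - k]
        h[n] = acc
    return h


def lpc(frame, order, use_window=True):
    """AR coefficients (a[0] = 1) via the autocorrelation method."""
    x = np.asarray(frame, dtype=float)
    if use_window:
        x = x * np.hamming(len(x))
    full = np.correlate(x, x, mode="full")
    r = full[len(x) - 1 : len(x) + order]            # lags 0..order
    # Yule-Walker normal equations: R a = -r[1:], with R Toeplitz in the lags.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    coeffs = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], coeffs))


def formants_from_lpc(a, fs, min_freq=90.0, max_bandwidth=400.0):
    """Convert AR polynomial roots to sorted (frequency, bandwidth) arrays in Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    keep = (freqs > min_freq) & (bws < max_bandwidth)
    order = np.argsort(freqs[keep])
    return freqs[keep][order], bws[keep][order]


# Synthetic vowel-like signal (assumed values): three resonances at 8 kHz.
fs = 8000.0
a_true = resonances_to_ar([500.0, 1500.0, 2500.0], [80.0, 80.0, 80.0], fs)
frame = impulse_response(a_true, 1024)

# No window here because the test signal is a decaying impulse response;
# for frames cut from running speech, use_window=True is the usual choice.
a_est = lpc(frame, order=6, use_window=False)
est_freqs, est_bws = formants_from_lpc(a_est, fs)
print("formants (Hz):  ", np.round(est_freqs, 1))
print("bandwidths (Hz):", np.round(est_bws, 1))
```

The model order (two poles per expected formant) is the main design choice. On real speech, extra poles are usually added to absorb the glottal source and lip radiation, and the pruning thresholds then decide which roots count as formants.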
Formant Tracking for Aspiration: “Auto-Regressive Moving Average” Model (ARMA)
Formant Tracking for Aspiration: “Exponentially Weighted Auto-Regressive” Model (EWAR) (Zheng and Hasegawa-Johnson, ICSLP 2004)
Summary
• International Phonetic Alphabet:
  • Useful on any computer with Unicode
  • International encoding for all sounds of the world’s languages
• Events in a nasal closure:
  • Formant transitions (perturbation model)
  • Vowel nasalization (sum of TFs)
  • Nasal murmur (impedance match at juncture)
• Events in release of a stop:
  • Pre-voicing in English voiced stops (read speech)
  • Transient (dp/dt ~ dA/dt)
  • Frication ((zero at f=0)/(front cavity resonances))
  • Aspiration ((zero at f=0)/(same poles as the vowel))
• Formant tracking:
  • In a vowel: use LPC
  • In aspiration, frication, or nasal murmur: ARMA is theoretically optimal, but computationally expensive
  • In aspiration, etc.: EWAR can be a good approximation to ARMA