Mandarin Chinese Speech Recognition
Mandarin Chinese • Tonal language (inflection matters!) • 1st tone – High, constant pitch (like saying “aaah”) • 2nd tone – Rising pitch (“Huh?”) • 3rd tone – Low pitch (“ugh”) • 4th tone – High pitch with a rapid descent (“No!”) • “5th tone” – Neutral, used for de-emphasized syllables • Monosyllabic language • Each character represents a single base syllable and tone • Most words consist of 1, 2, or 4 characters • Heavily contextual language
Mandarin Chinese and Speech Processing • Acoustic representations of Chinese syllables • Structural form • (consonant) + vowel + (consonant)
Mandarin Chinese and Speech Processing • Phone Sets • Initial/final phones [1] • e.g. shi, ge, zi = (sh + ib), (g + e), (z + if) • Initial phones: unvoiced • 1 phone • Final phones: voiced (tone 1-5) • Can consist of multiple phones
Mandarin Chinese and Speech Processing • Strong tonal recognition is crucial to distinguish between homonyms [3] (especially w/o context) • Creating tone models is difficult • Discontinuities exist in the F0 contour between voiced and unvoiced regions
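One common way to handle the F0 discontinuities mentioned above is to interpolate the pitch track across unvoiced frames before modeling it. The sketch below is only an illustration and is not taken from the cited papers: the function name `interpolate_f0` and the convention that `0.0` marks an unvoiced frame are assumptions.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (assumed to be marked with 0.0) by linear interpolation.

    Frames before the first voiced frame or after the last one are filled
    with the nearest voiced value. Hypothetical helper for illustration.
    """
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return list(f0)  # nothing to interpolate from

    out = [float(value) for value in f0]

    # Extend the first and last voiced values out to the edges of the track.
    for i in range(voiced[0]):
        out[i] = float(f0[voiced[0]])
    for i in range(voiced[-1] + 1, len(f0)):
        out[i] = float(f0[voiced[-1]])

    # Linearly interpolate inside each gap between consecutive voiced frames.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            out[i] = f0[a] + (f0[b] - f0[a]) * (i - a) / (b - a)
    return out


track = [0.0, 100.0, 0.0, 0.0, 130.0, 0.0]
smoothed = interpolate_f0(track)
```

Interpolation gives a continuous contour at the cost of inventing pitch values where none exist, which is the trade-off that motivates alternatives such as multi-space distributions.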
Prosody • Prosody: “the rhythmic and intonational aspect of language” [2] • Embedded Tone Modeling [4] • Explicit Tone Modeling [4]
Tone Modeling • Embedded Tone Modeling • Tonal acoustic units are joined with spectral features at each frame [4] • Explicit Tone Modeling • Tone recognition is completed independently and combined after post-processing [4]
Tone Modeling • Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling • Coarticulation • Neighboring syllables can cause variations in tone: • Bu4 + Dui4 = Bu2 Dui4 (“wrong”) • Ni3 + Hao3 = Ni2 Hao3 (“hello”)
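The two tone changes listed above can be written as simple rewrite rules. The sketch below covers only those two cases (a third tone before another third tone, and "bu" in fourth tone before another fourth tone). The function name `apply_sandhi` and the `(syllable, tone)` tuple format are assumptions, and real tone sandhi in running speech depends on phrasing that this sketch ignores.

```python
def apply_sandhi(syllables):
    """Apply two tone-sandhi rules to a list of (pinyin, tone) tuples.

    Rule 1: tone 3 followed by tone 3 becomes tone 2.
    Rule 2: "bu" in tone 4 followed by tone 4 becomes tone 2.
    Each decision looks at the underlying (unmodified) tone of the next syllable.
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        base, tone = syllables[i]
        next_tone = syllables[i + 1][1]
        if tone == 3 and next_tone == 3:
            out[i] = (base, 2)
        elif base == "bu" and tone == 4 and next_tone == 4:
            out[i] = (base, 2)
    return out


hello = apply_sandhi([("ni", 3), ("hao", 3)])
wrong = apply_sandhi([("bu", 4), ("dui", 4)])
```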
Embedded Tone Modeling: Two Stream Modeling (Ni, Liu, Xu [4]) • Spectral Stream – MFCCs (Mel-frequency cepstral coefficients) • Describes vocal tract information • Distinctive for phones (short time duration) • Pitch/Tone Stream – requires smoothing • Describes vibrations of the vocal cords • Independent of spectral features • d/dt(pitch), aka tone, and d²/dt²(pitch) are added • Embedded in an entire syllable • Affected by coarticulation (requires a longer time window), i.e. tone sandhi (context dependency)
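As a rough illustration of the pitch stream described above, the sketch below builds per-frame pitch features (F0 plus its first and second differences) and combines the log-likelihoods of the two streams with stream weights. The weighted sum is the usual multi-stream HMM formulation and is assumed here; it is not confirmed as the exact scheme in [4]. All function names and the weight values are hypothetical.

```python
def delta(track):
    """Central-difference approximation of d/dt; edge frames reuse the edge value."""
    n = len(track)
    out = []
    for t in range(n):
        prev_value = track[max(t - 1, 0)]
        next_value = track[min(t + 1, n - 1)]
        out.append((next_value - prev_value) / 2.0)
    return out


def pitch_stream(f0):
    """Return one [f0, d/dt f0, d2/dt2 f0] vector per frame of a smoothed F0 track."""
    d1 = delta(f0)
    d2 = delta(d1)
    return [[p, a, b] for p, a, b in zip(f0, d1, d2)]


def two_stream_log_likelihood(ll_spectral, ll_pitch, w_spectral=1.0, w_pitch=1.0):
    """Combine per-frame stream log-likelihoods with (assumed) stream weights."""
    return w_spectral * ll_spectral + w_pitch * ll_pitch


features = pitch_stream([1.0, 2.0, 4.0])
score = two_stream_log_likelihood(-10.0, -4.0, w_spectral=1.0, w_pitch=0.5)
```

Keeping the streams separate lets the pitch stream be down-weighted when the F0 estimate is unreliable, which is harder to do when pitch is simply appended to the spectral feature vector.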
Embedded Tone Modeling: Two Stream Modeling [4] • Tonal identification features • F0 • Energy • Duration • Coarticulation (continuous speech) • A two-stream embedded model is used first, followed by explicit tone modeling during lattice rescoring (alignment?) • Explicit tone modeling uses a maximum entropy framework [4] (a discriminative model)
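A maximum entropy classifier scores each tone as a weighted sum of features and normalizes the exponentiated scores into a posterior distribution. The sketch below shows only that general form. The feature name `f0_slope`, the weight values, and the function name are invented for illustration and are not the features or parameters used in [4].

```python
import math


def maxent_tone_posteriors(features, weights):
    """Return P(tone | features) under a maximum entropy (softmax) model.

    features: dict mapping feature name to value.
    weights:  dict mapping tone label to a dict of per-feature weights.
    """
    scores = {
        tone: sum(w.get(name, 0.0) * value for name, value in features.items())
        for tone, w in weights.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {tone: math.exp(s - top) for tone, s in scores.items()}
    total = sum(exp_scores.values())
    return {tone: e / total for tone, e in exp_scores.items()}


# Hypothetical weights: a rising contour favors tone 2, a falling one tone 4.
toy_weights = {1: {"f0_slope": 0.0}, 2: {"f0_slope": 2.0}, 4: {"f0_slope": -2.0}}
posteriors = maxent_tone_posteriors({"f0_slope": 1.0}, toy_weights)
best_tone = max(posteriors, key=posteriors.get)
```

In a rescoring setup, posteriors like these would be combined with the first-pass recognizer scores for each hypothesis in the lattice.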
Other Work: Chang, Zhou, Di, Huang, & Lee [1] • 3 Methods • Powerful language model (no tone modeling) • CER (character error rate) = 7.32% • Embedded 2 Stream • Tone stream + feature stream • CER = 6.43% • Embedded 1 Stream • Developed pitch extractor • Pitch track added to feature vector • CER = 6.03%
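The CER figures above are character error rates. As a reminder of how such a number is typically computed, the sketch below divides the character-level edit distance (substitutions, insertions, deletions) between a reference and a hypothesis by the reference length. The function name is an assumption, and the scoring details used in [1] may differ.

```python
def character_error_rate(reference, hypothesis):
    """Levenshtein distance between two strings divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


cer = character_error_rate("你好吗", "你好")  # one deletion over three characters
```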
Other Work: Qian, Soong [3] • F0 contour smoothing • Multi-Space Distribution (MSD) • Models 2 probability spaces • Unvoiced: discrete • Voiced (F0 contour): continuous
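The two probability spaces can be pictured with a minimal observation likelihood: an unvoiced frame receives a discrete probability mass, while a voiced frame receives the voiced-space weight times a continuous density over F0. The sketch below assumes a single Gaussian for the voiced space and uses `None` to mark unvoiced frames. These choices and the parameter values are illustrative and are not taken from [3].

```python
import math


def msd_likelihood(f0, voiced_weight, mean, variance):
    """Observation likelihood under a two-space multi-space distribution.

    f0 is None for an unvoiced frame; otherwise it is the observed F0 value.
    voiced_weight is the prior weight of the voiced (continuous) space, and
    the unvoiced (discrete) space takes the remaining probability mass.
    """
    if f0 is None:
        return 1.0 - voiced_weight
    density = math.exp(-((f0 - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )
    return voiced_weight * density


unvoiced = msd_likelihood(None, voiced_weight=0.8, mean=5.0, variance=1.0)
voiced = msd_likelihood(5.0, voiced_weight=0.8, mean=5.0, variance=1.0)
```

Because unvoiced frames are modeled directly, this formulation does not need F0 values to be invented for them.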
Other Work: Lamel, Gauvain, Le, Oparin, Meng [6] • Multi-Layer Perceptron features • Combined with MFCCs and pitch features • Compare language models • N-gram: back-off language model • Neural network language model • Language model adaptation
Other Work: O. Kalinli [7] • Replace prosodic features with biologically inspired auditory attention cues • Cochlear filtering, inner hair cell, etc. • Other features are extracted from the auditory spectrum • Intensity • Frequency contrast • Temporal contrast • Orientation (phase)
Other Work: Qian, Xu, Soong [8] • Cross-Lingual Voice Transformation • Phonetic mapping between languages • Difficult for Mandarin and English • Very different prosodic features
References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”
[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/
[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework”
[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004
[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models For Mandarin Speech to Text Transcription”, ICASSP, 2011
[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011