This article provides an introduction to Mandarin speech recognition, highlighting the importance of Mandarin in the world, the challenges in recognizing Mandarin, and the potential prospects for improving Mandarin speech recognition systems. Topics covered include Mandarin pronunciation, tone modeling, coarticulation, prosodic features, language models, and future studies in the field.
An Introduction to Mandarin Speech Recognition John Steinberg, Temple University
Speech Recognition Applications • Mobile Phone Technology • Translators & Prostheses • Automotive / GPS Devices • Intelligence Collection
Importance of Mandarin • [Figure comparing Mandarin and English [1]]
Importance of Mandarin • English speakers in the World: ~ 350 million [11] • Estimated # of current English learners in China: 200-350 million [12] • Estimated # of native Mandarin speakers: 1+ Billion [2]
Mandarin Chinese • Tonal language (inflection matters!) • 1st tone – High, constant pitch (Like saying “aaah”) • 2nd tone – Rising pitch (“Huh?”) • 3rd tone – Low, dipping pitch (“ugh”) • 4th tone – High pitch with a rapid descent (“No!”) • “5th tone” – Neutral, used for de-emphasized syllables • Characters • 8000+ characters compose 80k-200k common words • Act as morphemes • Are primarily monosyllabic • Have a single associated tone
Mandarin Chinese • Coarticulation: context can cause changes in tone • Bu4 + Dui4 = Bu2 Dui4 (wrong) • Ni3 + Hao3 = Ni2 Hao3 (hello)
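The two tone changes on this slide (usually described as tone sandhi) can be sketched as a small rule set. This is a minimal illustration, not a complete sandhi model: the function name `apply_tone_sandhi` and the numbered-pinyin input format are assumptions made for the example, and only the two rules shown on the slide are covered.

```python
def apply_tone_sandhi(syllables):
    """Apply two simplified tone-change rules to numbered-pinyin syllables.

    Rules (simplified, illustrative only):
      * a 3rd tone followed by another 3rd tone becomes a 2nd tone
      * "bu4" followed by a 4th tone becomes "bu2"
    Real speech also depends on prosodic grouping, which is ignored here.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        base, tone = syllables[i][:-1], syllables[i][-1]
        next_tone = syllables[i + 1][-1]  # tone digit of the following syllable
        if tone == "3" and next_tone == "3":
            result[i] = base + "2"
        elif base.lower() == "bu" and tone == "4" and next_tone == "4":
            result[i] = base + "2"
    return result


if __name__ == "__main__":
    print(apply_tone_sandhi(["bu4", "dui4"]))
    print(apply_tone_sandhi(["ni3", "hao3"]))
```

A recognizer that models tones has to account for such context-dependent changes, since the surface tone no longer matches the tone listed for the character in isolation.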
Mandarin Chinese • Heavily contextual language • Monosyllabic • Relatively few syllables compared to English [3] • English: ~10,000 syllables • Mandarin: ~1300 syllables including tones (~400 excluding tones) • High # of homophones
Challenges in Mandarin Recognition • Requires highly developed language model due to highly contextual nature of Mandarin • Tone modeling • Coarticulation • Large # of homophones • Chinese text is unsegmented • No standard lexicon • Chinese sentence/word structure is very flexible • Ex: Beijing DaXue -> BeiDa
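Because Chinese text is unsegmented, a recognizer or language-model pipeline needs some way to split character strings into words. One simple baseline is forward maximum matching against a lexicon. The sketch below assumes a tiny hypothetical lexicon (the entries and the `forward_max_match` name are illustrative, not from the presentation) and shows how both the full form 北京大学 (Beijing DaXue) and the abbreviation 北大 (BeiDa) need lexicon entries.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon match.

    Falls back to a single character when no lexicon entry matches.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon for illustration only.
LEXICON = {"北京", "大学", "北京大学", "北大", "学生"}

if __name__ == "__main__":
    print(forward_max_match("北京大学学生", LEXICON))
    print(forward_max_match("北大学生", LEXICON))
```

Greedy matching is only a baseline: it cannot resolve genuinely ambiguous splits, and with no standard lexicon its output depends entirely on which word list is chosen.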
Modeling Methods • Prosodic Features: describe tone (question vs. statement), rhythm, and focus of speech • Pitch Extraction: yields more precise character recognition • Stronger Language Models: determine context more accurately
Prosodic Units • Different prosodic units (labels) have been suggested [4] • EX: Syllable (SYL), Prosodic Word (PW), Minor Prosodic Phrase (MIP), Major Prosodic Phrase (MAP), & Intonation Group (IG) • Past labeling systems are primarily based on auditory perception • Prosodic break labeling is subjective and inconsistent • Auditory perception approach loses quantitative information • Impossible to replicate identical prosodic labels for an original speech signal
Prosodic Units • New, more objective prosodic cues include [4]: • Pause duration (directly measured) • Segment/syllable duration (directly measured) • F0 reset • F0 contains utterance-long intonation information, which must be separated from the tones of the individual syllables within the utterance • Quantitative description: log(F0) = phrase components + accent or tone components + log(baseline frequency)
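The quantitative description of F0 on this slide can be sketched in code, assuming it refers to a Fujisaki-style command-response model in which the components add in the log-F0 domain. The response-function shapes follow the usual form of that model; the parameter values (`alpha`, `beta`, `gamma`) and the command timings in the example are illustrative assumptions, not values from the presentation.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent/tone control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, baseline_hz, phrase_cmds, tone_cmds):
    """log F0(t) = log(baseline) + phrase components + tone components.

    phrase_cmds: list of (onset_time, amplitude)
    tone_cmds:   list of (onset_time, offset_time, amplitude);
                 amplitudes may be negative for low tones.
    """
    value = math.log(baseline_hz)
    for onset, amplitude in phrase_cmds:
        value += amplitude * phrase_response(t - onset)
    for onset, offset, amplitude in tone_cmds:
        value += amplitude * (accent_response(t - onset) - accent_response(t - offset))
    return value


if __name__ == "__main__":
    # One phrase command at 0.0 s and one tone command from 0.2 s to 0.4 s.
    for t in (0.0, 0.1, 0.3, 0.6):
        f0 = math.exp(log_f0(t, 100.0, [(0.0, 0.5)], [(0.2, 0.4, 0.3)]))
        print(f"t={t:.1f}s  F0={f0:.1f} Hz")
```

Separating the slowly varying phrase component from the local tone components is what lets utterance-level intonation be modeled apart from syllable-level tones.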
Language Models • N-grams – • 3 Steps: Syllable -> Character -> Word • Neural Networks – • Better suited to high dimensionality • Random Forests – • May be able to include morphology into language model [7]
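The syllable-to-character step of an n-gram pipeline can be illustrated with a toy bigram model that picks among homophone candidates. This is a minimal sketch under stated assumptions: the training corpus, the candidate table, and the function names are hypothetical, add-one smoothing stands in for whatever smoothing a real system would use, and the exhaustive search would be replaced by Viterbi decoding at realistic scale.

```python
import math
from collections import Counter
from itertools import product


def train_bigram(corpus):
    """Count character bigrams, with a sentence-start marker as first context."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence)
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sequence, model):
    """Add-one smoothed bigram log-probability of a character sequence."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(sequence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def decode(syllables, candidates, model):
    """Choose the most probable character string for a syllable sequence."""
    options = [candidates[s] for s in syllables]
    return max(
        ("".join(chars) for chars in product(*options)),
        key=lambda seq: log_prob(seq, model),
    )


# Hypothetical toy data for illustration only.
CORPUS = ["你好", "很好", "你是", "是的"]
CANDIDATES = {"ni3": ["你", "拟"], "hao3": ["好", "郝"]}

if __name__ == "__main__":
    model = train_bigram(CORPUS)
    print(decode(["ni3", "hao3"], CANDIDATES, model))
```

With so many homophones per syllable, the acoustic evidence alone cannot pick the character, so the surrounding context scored by the language model makes the decision.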
Recent Experimentation Broadcast News and Conversational Telephone Speech [9]
Future Studies • Continue studying current baseline systems/data sets • Further investigate possible language models • Compare effectiveness of prosodic features
References [1] K. Kūriákī, A Grammar of Modern Indo-European, Asociación Cultural Dnghu, 2007 [2] Wikipedia [3] W. Gu, K. Hirose, H. Fujisaki, “Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech”, ISCSLP, 2006 [4] J. Picone, A. Harati, “Why Study Engineering at Temple?”, Temple University College of Engineering Open House, October 9, 2010 [5] C.-H. Lee, “Advances in Chinese spoken language processing”, World Scientific Publishing Co., Singapore, 2007 [6] F.H. Liu, M. Picheny, P. Srinivasa, M. Monkowski, et al., “Speech Recognition on Mandarin Call Home: A Large Vocabulary, Conversational, and Telephone Speech Corpus”, ICASSP, 1996 [7] I. Oparin, L. Lamel, J. Gauvain, “Improving Mandarin Chinese STT system with Random Forests language models”, IEEE Xplore, 2010 [8] “The History of Automatic Speech Recognition Evaluations at NIST”, 2009, http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html [9] Schwartz, R.; Colthurst, T.; Duta, N.; Gish, H.; Iyer, R.; Kao, C.-L.; Liu, D.; Kimball, O.; Ma, J.; Makhoul, J.; Matsoukas, S.; Nguyen, L.; Noamany, M.; Prasad, R.; Xiang, B.; Xu, D.-X.; Gauvain, J.-L.; Lamel, L.; Schwenk, H.; Adda, G.; Chen, L., “Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system”, Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, vol. 3, pp. iii-753-6, 17-21 May 2004, doi: 10.1109/ICASSP.2004.1326654
[10] Lee, L.S.; Tseng, C.Y.; Gu, H.Y.; Liu, F.H.; Chang, C.H.; Lin, Y.H.; Lee, Y.; Tu, S.L.; Hsieh, S.H.; Chen, C.H., “Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary”, Speech and Audio Processing, IEEE Transactions on, vol. 1, no. 2, pp. 158-179, Apr 1993 [11] Lin-Shan Lee; Keh-Jiann Chen; Chiu-Yu Tseng; Renyuan Lyu; Lee-Feng Chien; Hsin-Min Wang; Jia-Lin Shen; Sung-Chien Lin; Yen-Ju Yang; Bo-Ren Bai; Chi-Ping Nee; Chun-Yi Liao; Shueh-Sheng Lin; Chung-Shu Yang; I-Jung Hung; Ming-Yu Lee; Rei-Chang Wang; Bo-Shen Lin; Yuan-Cheng Chang; Rung-Chiung Yang; Yung-Chi Huang; Chen-Yuan Lou; Tung-Sheng Lin, “Golden Mandarin(II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions”, Speech, Image Processing and Neural Networks, 1994. Proceedings, ISSIPNN '94., 1994 International Symposium on, vol. 1, pp. 155-159, 13-16 Apr 1994 [12] Ren-Yuan Lyu; Lee-Feng Chien; Shiao-Hong Hwang; Hung-Yun Hsieh; Rung-Chiuan Yang; Bo-Ren Bai; Jia-Chi Weng; Yen-Ju Yang; Shi-Wei Lin; Keh-Jiann Chen; Chiu-Yu Tseng; Lin-Shan Lee, “Golden Mandarin (III)-a user-adaptive prosodic-segment-based Mandarin dictation machine for Chinese language with very large vocabulary”, Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, vol. 1, pp. 57-60, 9-12 May 1995