
An Introduction to Mandarin Speech Recognition

John Steinberg, Temple University


Presentation Transcript


  1. An Introduction to Mandarin Speech Recognition John Steinberg, Temple University

  2. Speech Recognition Applications
     • Mobile Phone Technology
     • Translators & Prostheses
     • Automotive / GPS Devices
     • Intelligence Collection

  3. Speech Recognition: Basic Process [5]

  4. Importance of Mandarin [figure with panels labeled Mandarin and English] [1]

  5. Importance of Mandarin
     • English speakers in the world: ~350 million [11]
     • Estimated number of current English learners in China: 200-350 million [12]
     • Estimated number of native Mandarin speakers: 1+ billion [2]

  6. Importance of Mandarin [3]

  7. Importance of Mandarin [3]

  8. Mandarin Chinese
     • Tonal language (inflection matters!)
       • 1st tone – High, constant pitch (like saying “aaah”)
       • 2nd tone – Rising pitch (“Huh?”)
       • 3rd tone – Low, dipping pitch (“ugh”)
       • 4th tone – High pitch with a rapid descent (“No!”)
       • “5th tone” – Neutral; used for de-emphasized syllables
     • Characters
       • 8000+ characters compose 80k-200k common words
       • Act as morphemes
       • Are primarily monosyllabic
       • Have a single associated tone

  9. Mandarin Chinese
     • Coarticulation: context can cause changes in tone
       • Bu4 + Dui4 = Bu2 Dui4 (“wrong”)
       • Ni3 + Hao3 = Ni2 Hao3 (“hello”)
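The two tone changes on this slide can be written as simple rewrite rules over numbered pinyin. The sketch below is only an illustration: the function name `apply_tone_sandhi` is hypothetical, it covers just the two patterns shown (a 3rd tone before another 3rd tone, and "bu4" before a 4th tone), and it ignores the prosodic grouping that governs longer runs of 3rd tones in real speech.

```python
def apply_tone_sandhi(syllables):
    """Apply two common Mandarin tone-change rules to numbered pinyin.

    Illustrative only: looks at adjacent pairs, left to right, using the
    underlying (unchanged) tone of the following syllable.
    """
    result = list(syllables)
    for i in range(len(result) - 1):
        base, tone = result[i][:-1], result[i][-1]
        next_tone = syllables[i + 1][-1]
        # Rule 1: a 3rd tone followed by another 3rd tone is spoken as a 2nd tone.
        if tone == "3" and next_tone == "3":
            result[i] = base + "2"
        # Rule 2: "bu4" followed by a 4th tone is spoken as "bu2".
        elif base.lower() == "bu" and tone == "4" and next_tone == "4":
            result[i] = base + "2"
    return result


print(apply_tone_sandhi(["Bu4", "Dui4"]))  # ['Bu2', 'Dui4']
print(apply_tone_sandhi(["Ni3", "Hao3"]))  # ['Ni2', 'Hao3']
```

A recognizer that models tone has to account for this gap between the dictionary tone and the tone actually spoken.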

  10. Mandarin Chinese
      • Heavily contextual language
      • Monosyllabic
      • Relatively few syllables compared to English [3]
        • English: ~10,000 syllables
        • Mandarin: ~1300 syllables including tones (~400 excluding tones)
      • High number of homophones

  11. Challenges in Mandarin Recognition
      • Requires a highly developed language model because Mandarin is highly contextual
      • Tone modeling
      • Coarticulation
      • Large number of homophones
      • Chinese text is unsegmented
      • No standard lexicon
      • Chinese sentence/word structure is very flexible
        • Ex: Beijing DaXue -> BeiDa
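To make the "unsegmented text" and "no standard lexicon" points concrete, here is a minimal sketch of forward maximum matching, one simple dictionary-based segmentation method. The slides do not say which segmenter, if any, was used, so this is a swapped-in illustration. The `lexicon` contents and the function name `segment` are hypothetical.

```python
def segment(text, lexicon):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, falling back to a single character."""
    max_len = max(len(word) for word in lexicon)
    words, pos = [], 0
    while pos < len(text):
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if length == 1 or piece in lexicon:
                words.append(piece)
                pos += length
                break
    return words


# Hypothetical toy lexicon containing both the full name and its abbreviation.
lexicon = {"北京", "大学", "北京大学", "北大", "学生"}

print(segment("北京大学学生", lexicon))  # ['北京大学', '学生']
print(segment("北大学生", lexicon))      # ['北大', '学生']
```

The result depends entirely on what the lexicon contains: without the entry for the abbreviation 北大 (BeiDa), the second string would be split differently, which is one reason the lack of a standard lexicon matters.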

  12. Modeling Methods
      • Prosodic Features: describe tone (question vs. statement), rhythm, and focus of speech
      • Pitch Extraction: yields more precise character recognition
      • Stronger Language Models: determine context more accurately
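As a rough illustration of pitch extraction, the sketch below estimates the fundamental frequency (F0) of a single voiced frame by picking the peak of its autocorrelation. The slides do not name a specific pitch tracker, so this method is an assumption. The function name `estimate_f0`, the 50-400 Hz search range, and the synthetic 200 Hz test tone are also assumptions made for the example.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(samples) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test frame: 0.1 s of a pure 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
print(estimate_f0(frame, sample_rate))  # close to 200 Hz
```

Running this frame by frame over an utterance yields an F0 contour, from which tone features could then be derived. Real speech would also need voicing detection and smoothing, which are omitted here.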

  13. Prosodic Units
      • Different prosodic units (labels) have been suggested [4]
        • Ex: Syllable (SYL), Prosodic Word (PW), Minor Prosodic Phrase (MIP), Major Prosodic Phrase (MAP), and Intonation Group (IG)
      • Past labeling systems are primarily based on auditory perception
        • Prosodic break labeling is subjective and inconsistent
        • Auditory perception approach loses quantitative information
        • Impossible to replicate identical prosodic labels for an original speech signal

  14. Prosodic Units
      • New, more objective prosodic cues include [4]:
        • Pause duration (directly measured)
        • Segment/syllable duration (directly measured)
        • F0 reset
      • F0 contains utterance-long intonation information, which must be separated from the tones of the individual syllables within the utterance
      • Quantitative description: log(F0) = phrase components + accent or tone components + log(baseline frequency)
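The quantitative description on this slide has the same additive form as the Fujisaki command-response model, which is normally written in the log-F0 domain. Assuming that is the model intended, a minimal sketch looks like this. All function names, the default time constants, and the command values are illustrative assumptions, not values from the slides.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent/tone control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, baseline_hz, phrase_commands, tone_commands):
    """log F0(t) = log(baseline) + phrase components + tone components.

    phrase_commands: list of (onset_time, amplitude)
    tone_commands:   list of (start_time, end_time, amplitude); the amplitude
                     may be negative for tones that pull the pitch down.
    """
    value = math.log(baseline_hz)
    for onset, amp in phrase_commands:
        value += amp * phrase_response(t - onset)
    for start, end, amp in tone_commands:
        value += amp * (tone_response(t - start) - tone_response(t - end))
    return value


# Hypothetical command values, chosen only to illustrate the contour shape.
phrases = [(0.0, 0.5)]
tones = [(0.20, 0.45, 0.4), (0.55, 0.80, -0.3)]
for t in (0.0, 0.3, 0.7, 1.5):
    f0 = math.exp(log_f0(t, 120.0, phrases, tones))
    print(f"t={t:.1f}s  F0={f0:.1f} Hz")
```

The slow phrase component carries the utterance-long intonation, while the faster tone components carry the syllable-level tones. Fitting the two separately is what allows them to be pulled apart.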

  15. Language Models
      • N-grams
        • 3 steps: Syllable -> Character -> Word
      • Neural Networks
        • Better suited to high dimensionality
      • Random Forests
        • May be able to include morphology into language model [7]
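The first N-gram step, syllable to character, is where homophones have to be resolved. The sketch below uses a character bigram model with add-one smoothing and a Viterbi search to pick the most probable characters for a syllable sequence. The toy `corpus`, the `candidates` table, and the function names are all hypothetical, and the later character-to-word step is not covered.

```python
import math
from collections import Counter

# Hypothetical toy data: a tiny training text and a syllable-to-character table.
corpus = ["我是学生", "他是老师", "我是老师", "今天有事"]
candidates = {"wo": ["我"], "ta": ["他"], "shi": ["是", "事", "师"],
              "xue": ["学"], "sheng": ["生"]}

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    chars = ["<s>"] + list(sentence)
    unigrams.update(chars)
    bigrams.update(zip(chars, chars[1:]))
vocab_size = len(unigrams)


def bigram_logprob(prev, char):
    """Add-one smoothed estimate of log P(char | prev)."""
    return math.log((bigrams[(prev, char)] + 1) / (unigrams[prev] + vocab_size))


def decode(syllables):
    """Viterbi search for the most probable character sequence."""
    # Maps last character -> (log probability, characters chosen so far).
    paths = {"<s>": (0.0, [])}
    for syl in syllables:
        new_paths = {}
        for char in candidates[syl]:
            new_paths[char] = max(
                ((score + bigram_logprob(prev, char), seq + [char])
                 for prev, (score, seq) in paths.items()),
                key=lambda item: item[0],
            )
        paths = new_paths
    return "".join(max(paths.values(), key=lambda item: item[0])[1])


print(decode(["wo", "shi", "xue", "sheng"]))  # 我是学生
```

Here "shi" has three candidate characters, and only the surrounding context lets the model choose 是. This is the sense in which a heavily homophonic language leans on its language model.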

  16. Timeline

  17. Benchmark History [8]

  18. Recent Experimentation: Broadcast News and Conversational Telephone Speech [9]

  19. Future Studies
      • Continue studying current baseline systems/data sets
      • Further investigate possible language models
      • Compare effectiveness of prosodic features

  20. Questions?

  21. References
      [1] K. Kūriákī, A Grammar of Modern Indo-European, Asociación Cultural Dnghu, 2007.
      [2] Wikipedia.
      [3] W. Gu, K. Hirose, H. Fujisaki, “Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech,” ISCSLP, 2006.
      [4] J. Picone, A. Harati, “Why Study Engineering at Temple?,” Temple University College of Engineering Open House, October 9, 2010.
      [5] C.-H. Lee, “Advances in Chinese Spoken Language Processing,” World Scientific Publishing Co., Singapore, 2007.
      [6] F.H. Liu, M. Picheny, P. Srinivasa, M. Monkowski, et al., “Speech Recognition on Mandarin Call Home: A Large Vocabulary, Conversational, and Telephone Speech Corpus,” ICASSP, 1996.
      [7] I. Oparin, L. Lamel, J. Gauvain, “Improving Mandarin Chinese STT system with Random Forests language models,” IEEE Xplore, 2010.
      [8] “The History of Automatic Speech Recognition Evaluations at NIST,” 2009. http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html
      [9] R. Schwartz, T. Colthurst, N. Duta, H. Gish, R. Iyer, C.-L. Kao, D. Liu, O. Kimball, J. Ma, J. Makhoul, S. Matsoukas, L. Nguyen, M. Noamany, R. Prasad, B. Xiang, D.-X. Xu, J.-L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, “Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 3, pp. iii-753-6, 17-21 May 2004, doi: 10.1109/ICASSP.2004.1326654.

  22. References
      [10] L.S. Lee, C.Y. Tseng, H.Y. Gu, F.H. Liu, C.H. Chang, Y.H. Lin, Y. Lee, S.L. Tu, S.H. Hsieh, C.H. Chen, “Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 158-179, Apr 1993.
      [11] Lin-Shan Lee, Keh-Jiann Chen, Chiu-Yu Tseng, Renyuan Lyu, Lee-Feng Chien, Hsin-Min Wang, Jia-Lin Shen, Sung-Chien Lin, Yen-Ju Yang, Bo-Ren Bai, Chi-Ping Nee, Chun-Yi Liao, Shueh-Sheng Lin, Chung-Shu Yang, I-Jung Hung, Ming-Yu Lee, Rei-Chang Wang, Bo-Shen Lin, Yuan-Cheng Chang, Rung-Chiung Yang, Yung-Chi Huang, Chen-Yuan Lou, Tung-Sheng Lin, “Golden Mandarin (II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions,” Proceedings of the 1994 International Symposium on Speech, Image Processing and Neural Networks (ISSIPNN '94), vol. 1, pp. 155-159, 13-16 Apr 1994.
      [12] Ren-Yuan Lyu, Lee-Feng Chien, Shiao-Hong Hwang, Hung-Yun Hsieh, Rung-Chiuan Yang, Bo-Ren Bai, Jia-Chi Weng, Yen-Ju Yang, Shi-Wei Lin, Keh-Jiann Chen, Chiu-Yu Tseng, Lin-Shan Lee, “Golden Mandarin (III)-a user-adaptive prosodic-segment-based Mandarin dictation machine for Chinese language with very large vocabulary,” Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), vol. 1, pp. 57-60, 9-12 May 1995.
