Mandarin Chinese Speech Recognition

Mandarin Chinese
• Tonal language (inflection matters!)
  • 1st tone – High, constant pitch (like saying “aaah”)
  • 2nd tone – Rising pitch (“Huh?”)
  • 3rd tone – Low pitch (“ugh”)
  • 4th tone – High pitch with a rapid descent (“No!”)
  • “5th tone” – Neutral, used for de-emphasized syllables
• Monosyllabic language
  • Each character represents a single base syllable and tone
  • Most words consist of 1, 2, or 4 characters
• Heavily contextual language

Mandarin Chinese and Speech Processing
• Acoustic representations of Chinese syllables
• Structural form
  • (consonant) + vowel + (consonant)

Mandarin Chinese and Speech Processing
• Phone sets
  • Initial/final phones [1]
  • e.g. shi, ge, zi = (shi + ib), (ge + e), (z + if)
• Initial phones: unvoiced
  • 1 phone
• Final phones: voiced (tone 1-5)
  • Can consist of multiple phones
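The initial/final decomposition on this slide can be illustrated with a small sketch that splits a tone-numbered pinyin syllable into its parts. It assumes plain pinyin spelling rather than the phone labels of [1] (such as "ib" and "if"), and the function name `split_syllable` is hypothetical.

```python
# Standard pinyin initials. Two-letter initials come first so that
# "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


def split_syllable(syllable):
    """Split a pinyin syllable such as 'shi4' into (initial, final, tone).

    A missing tone digit is treated as the neutral "5th tone".
    Syllables with no consonant initial return an empty initial.
    """
    syllable = syllable.lower()
    tone = 5
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    return "", syllable, tone


if __name__ == "__main__":
    for s in ["shi4", "ge5", "zi3", "an1"]:
        print(s, split_syllable(s))
```

A real recognizer would map each final to one or more tonal phones from its own phone set; this sketch stops at the orthographic split.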

Mandarin Chinese and Speech Processing
• Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)
• Creating tone models is difficult
  • Discontinuities exist in the F0 contour between voiced and unvoiced regions
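One common way to cope with the F0 discontinuities mentioned above is to interpolate across unvoiced frames before modeling. The following is a minimal sketch, assuming a frame-level F0 track in Hz where 0 marks an unvoiced frame; that convention and the function name `interpolate_f0` are assumptions, not taken from the cited papers.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (value 0) by linear interpolation.

    Gaps between two voiced frames are filled linearly; leading and
    trailing unvoiced frames copy the nearest voiced value.
    """
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return list(f0)  # nothing to interpolate from
    out = [float(v) for v in f0]
    first, last = voiced[0], voiced[-1]
    for i in range(first):
        out[i] = float(f0[first])
    for i in range(last + 1, len(f0)):
        out[i] = float(f0[last])
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = f0[a] + frac * (f0[b] - f0[a])
    return out


if __name__ == "__main__":
    print(interpolate_f0([0, 100, 0, 120, 0]))
```

Interpolation yields a continuous contour that a single Gaussian stream can model, at the cost of inventing pitch values where none exist.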

Prosody
• Prosody: “the rhythmic and intonational aspect of language” [2]
• Embedded Tone Modeling [4]
• Explicit Tone Modeling [4]

Tone Modeling
• Embedded Tone Modeling
  • Tonal acoustic units are joined with spectral features at each frame [4]
• Explicit Tone Modeling
  • Tone recognition is completed independently and combined after post-processing [4]
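To make the embedded approach concrete, here is a minimal sketch of joining a pitch value to the spectral feature vector at each frame. The data layout (one list of floats per frame) and the name `embed_pitch` are illustrative assumptions, not the implementation used in [4].

```python
def embed_pitch(spectral_frames, pitch_track):
    """Append the pitch value of each frame to its spectral feature vector.

    spectral_frames: list of per-frame feature lists (e.g. MFCCs)
    pitch_track:     list of per-frame pitch values, same length
    """
    if len(spectral_frames) != len(pitch_track):
        raise ValueError("spectral and pitch tracks must have the same length")
    return [list(frame) + [pitch]
            for frame, pitch in zip(spectral_frames, pitch_track)]


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0]]   # made-up spectral features
    pitch = [100.0, 110.0]              # made-up pitch values in Hz
    print(embed_pitch(frames, pitch))
```

In the explicit approach, by contrast, the spectral features would be decoded on their own and a separate tone score would be combined with the result afterwards.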

Tone Modeling
• Pitch, energy, and duration (prosody) combined with lexical and syntactic features improve tonal labeling
• Coarticulation
  • Variations in syllables can cause variations in tone:
  • Bu4 + Dui4 = Bu2 Dui4 (“wrong”)
  • Ni3 + Hao3 = Ni2 Hao3 (“hello”)
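The two tone changes listed above can be written as simple rewrite rules. This is a minimal sketch that covers only those two cases (bu4 before a 4th tone, and a 3rd tone before another 3rd tone); it decides each change from the original tones and ignores the prosodic grouping that governs longer runs of 3rd tones. The name `apply_sandhi` is hypothetical.

```python
def apply_sandhi(syllables):
    """Apply two tone sandhi rules to a list of (base, tone) pairs.

    Rule 1: "bu" with tone 4 becomes tone 2 before a 4th-tone syllable.
    Rule 2: a 3rd-tone syllable becomes tone 2 before another 3rd tone.
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        base, tone = syllables[i]
        next_tone = syllables[i + 1][1]
        if base == "bu" and tone == 4 and next_tone == 4:
            out[i] = (base, 2)
        elif tone == 3 and next_tone == 3:
            out[i] = (base, 2)
    return out


if __name__ == "__main__":
    print(apply_sandhi([("bu", 4), ("dui", 4)]))
    print(apply_sandhi([("ni", 3), ("hao", 3)]))
```

A tone model trained on dictionary tones would see surface contours that disagree with its labels in exactly these contexts, which is why coarticulation has to be modeled.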

Embedded Tone Modeling: Two Stream Modeling (Ni, Liu, Xu)
• Spectral stream – MFCCs (Mel-frequency cepstral coefficients)
  • Describe vocal tract information
  • Distinctive for phones (short time duration)
• Pitch/tone stream – requires smoothing
  • Describes vibrations of the vocal cords
  • Independent of spectral features
  • d/dt(pitch), aka tone, and d²/dt²(pitch) are added
  • Embedded in an entire syllable
  • Affected by coarticulation (requires a longer time window), i.e. tone sandhi – context dependency
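The first and second time derivatives of pitch mentioned above can be approximated from a frame-level pitch track. The sketch below uses a simple central difference with edge padding, which is an assumption made for brevity; speech toolkits typically use a regression over a wider window. The names `deltas` and `pitch_stream` are hypothetical.

```python
def deltas(track):
    """Central-difference derivative of a track, with edge padding."""
    n = len(track)
    if n == 0:
        return []
    padded = [track[0]] + list(track) + [track[-1]]
    return [(padded[i + 2] - padded[i]) / 2.0 for i in range(n)]


def pitch_stream(track):
    """Build per-frame [pitch, delta, delta-delta] vectors."""
    d1 = deltas(track)
    d2 = deltas(d1)
    return [[p, a, b] for p, a, b in zip(track, d1, d2)]


if __name__ == "__main__":
    smoothed_pitch = [100.0, 110.0, 120.0, 130.0]  # made-up values in Hz
    for frame in pitch_stream(smoothed_pitch):
        print(frame)
```

The derivative is only meaningful on a smoothed, continuous pitch track, which is one reason the pitch stream requires smoothing first.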

Embedded Tone Modeling: Two Stream Modeling [4]
• Tonal identification features
  • F0
  • Energy
  • Duration
  • Coarticulation (continuous speech)
• Initially use a 2-stream embedded model, followed by explicit modeling during lattice rescoring (alignment?)
• Explicit tone modeling uses a maximum entropy framework [4] (discriminative model)
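The rescoring step can be sketched as a weighted combination of the first-pass score and a tone model score. This example simplifies the lattice to an n-best list and takes the tone posteriors as given numbers; in [4] they would come from the maximum entropy tone classifier. The dictionary keys, the weight value, and the name `rescore` are all illustrative assumptions.

```python
import math


def rescore(hypotheses, tone_weight=0.5):
    """Pick the best hypothesis after adding a weighted tone score.

    Each hypothesis is a dict with:
      "text":       the hypothesized syllable sequence
      "first_pass": log score from the first (embedded) decoding pass
      "tone_probs": posterior probability of each hypothesized tone
    """
    def total(hyp):
        tone_score = sum(math.log(p) for p in hyp["tone_probs"])
        return hyp["first_pass"] + tone_weight * tone_score

    return max(hypotheses, key=total)


if __name__ == "__main__":
    nbest = [  # made-up scores for illustration only
        {"text": "ma1 ma5", "first_pass": -10.0, "tone_probs": [0.2, 0.2]},
        {"text": "ma3 ma4", "first_pass": -10.5, "tone_probs": [0.9, 0.8]},
    ]
    print(rescore(nbest)["text"])
```

With a tone weight of zero the first-pass ranking is unchanged, so the weight controls how much the explicit tone model is allowed to override the embedded one.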

Explicit Tone Modeling [4]

Other Work: Chang, Zhou, Di, Huang, & Lee [1]
• 3 methods
  • Powerful language model (no tone modeling)
    • CER = 7.32%
  • Embedded 2-stream
    • Tone stream + feature stream
    • CER = 6.43%
  • Embedded 1-stream
    • Developed pitch extractor
    • Pitch track added to feature vector
    • CER = 6.03%
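CER (character error rate) is the character-level edit distance between the reference and the recognizer output, divided by the reference length. The sketch below shows that computation along with a helper for the relative reduction between two CER values; the function names are hypothetical and the test strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction from baseline to improved."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(cer("你好世界", "你号世界"))
    print(relative_reduction(7.32, 6.43))
    print(relative_reduction(7.32, 6.03))
```

Applied to the figures on this slide, the two-stream and one-stream systems correspond to relative reductions of roughly 12% and 18% over the 7.32% baseline.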

Other Work: Qian, Soong [3]
• F0 contour smoothing
• Multi-Space Distribution (MSD)
  • Models 2 probability spaces
    • Unvoiced: discrete
    • Voiced (F0 contour): continuous
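The two-space idea can be sketched as a per-frame likelihood in which unvoiced frames draw from a discrete space and voiced frames from a continuous density over F0. This minimal example assumes a single Gaussian for the voiced space and uses `None` to mark an unvoiced frame; both choices, the parameter values, and the name `msd_log_likelihood` are illustrative and not taken from [3].

```python
import math


def msd_log_likelihood(obs, voiced_weight, mean, var):
    """Log-likelihood of one frame under a two-space MSD.

    obs:           None for an unvoiced frame, otherwise the F0 value
    voiced_weight: prior weight of the voiced space (0 < weight < 1)
    mean, var:     Gaussian parameters of the voiced F0 space
    """
    if obs is None:
        # Discrete (zero-dimensional) space: only the space weight counts.
        return math.log(1.0 - voiced_weight)
    gauss = -0.5 * (math.log(2.0 * math.pi * var) + (obs - mean) ** 2 / var)
    return math.log(voiced_weight) + gauss


if __name__ == "__main__":
    frames = [None, 195.0, 205.0, None]  # made-up F0 observations in Hz
    total = sum(msd_log_likelihood(f, 0.7, 200.0, 400.0) for f in frames)
    print(total)
```

Because unvoiced frames are scored directly, no artificial F0 values need to be interpolated into the unvoiced regions.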

Other Work: Lamel, Gauvain, Le, Oparin, Meng [6]
• Multi-layer perceptron features
  • Combined with MFCCs and pitch features
• Compare language models
  • N-gram: back-off language model
  • Neural network language model
• Language model adaptation

Other Work: O. Kalinli [7]
• Replace prosodic features with biologically inspired auditory attention cues
  • Cochlear filtering, inner hair cell, etc.
• Other features are extracted from the auditory spectrum
  • Intensity
  • Frequency contrast
  • Temporal contrast
  • Orientation (phase)

Other Work: Qian, Xu, Soong [8]
• Cross-lingual voice transformation
• Phonetic mapping between languages
  • Difficult for Mandarin and English
  • Very different prosodic features

References
[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”
[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/
[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009
[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition using Prosodic and Lexical Information in Maximum Entropy Framework”
[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004
[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models For Mandarin Speech to Text Transcription”, ICASSP, 2011
[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011
