This study models pronunciation variation in Mandarin, including sound changes, phone changes, and accent variation. The goal is to establish a corpus annotated with spontaneous phenomena and use it to improve pronunciation modeling.
NCMMSC'01, 20-22 Nov. 2001, Shenzhen, China Mandarin Pronunciation Variation Modeling Thomas Fang Zheng Center of Speech Technology, State Key Lab of Intelligent Technology and Systems, Department of Computer Science & Technology, Tsinghua University fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/
Motivation • In spontaneous speech, the pronunciations of individual words differ; there are often • sound changes, and • phone changes (a change can be an insertion, a deletion, or a substitution). • For Chinese there is • an additional accent problem even when people are speaking Mandarin, due to different dialect backgrounds (Chinese has 7 major dialects), plus • colloquialism, grammar, and style. • Goal: modeling the pronunciation variations • Establishing a corpus with spontaneous phenomena, because we need to know what the canonical phones change into. • Finding solutions to pronunciation modeling, both theoretically and practically.
Overview
Necessity of establishing a new annotated spontaneous speech corpus • The existing databases (incl. Broadcast News, CallHome, CallFriend, …) do not cover all the Chinese spoken-language phenomena • Sound changes: voiced, unvoiced, nasalization, … • Phone changes: retroflexed, OOV-phoneme, … • The existing databases do not contain pronunciation variation information for bootstrap training • A Chinese Annotated Spontaneous Speech (CASS) corpus was therefore established before the 2000 Workshop on Speech and Language Processing (WS00) at JHU • Completely spontaneous (discourses, lectures, …) • Remarkable background noise, accent backgrounds, … • Recorded onto tapes and then digitized
Chinese Annotated Spontaneous Speech (CASS) Corpus • CASS has a five-tier transcription: • Character level: base form • Syllable (Pinyin) level (with tone): base form • Initial/Final (IF) level: base form, with time boundaries • SAMPA-C level: surface form • Miscellaneous level: used for garbage modeling • Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese • Example:
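The example itself appeared as an image on the slide; the following is a hypothetical five-tier entry assembled from the SAMPA-C symbols that appear later in this deck (illustrative only, not actual CASS data):

```python
# A hypothetical five-tier CASS annotation for one syllable (illustrative,
# not taken from the actual corpus).  Times are in seconds.
entry = {
    "character": "长",                      # tier 1: Chinese character (base form)
    "syllable":  "chang2",                  # tier 2: Pinyin with tone (base form)
    "initial_final": [                      # tier 3: IF level with time boundaries
        {"unit": "ch",  "start": 1.23, "end": 1.31},
        {"unit": "ang", "start": 1.31, "end": 1.52},
    ],
    "sampa_c":  ["ts`_h_v", "AN"],          # tier 4: surface form (voiced 'ch' here)
    "misc":     ["breathing"],              # tier 5: spontaneous phenomena / garbage
}
```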
SAMPA-C: Machine-Readable IPA for Chinese • Phonologic consonants: 23 • Phonologic vowels: 9 • Initials: 21 • Finals: 38 • Retroflexed finals: 38 • Tones and silences • Sound changes • Spontaneous phenomena labels
Key Points in PM (1) • Choosing and generating the speech recognition unit (SRU) set • so as to describe the phone changes and sound changes well • could be syllables, semi-syllables, or INITIALs/FINALs • Constructing a multi-pronunciation lexicon (MPL) • a syllable-to-SRU lexicon reflecting the relation between the grammatical units and the acoustic models • Acoustically modeling spontaneous speech • theoretical framework • CD modeling; confusion matrix; data-driven
Key Points in PM (2) • Customizing the decoding algorithm according to the new lexicon • an improved time-synchronous search algorithm to reduce the path expansion (caused by CD modeling) • an A*-based tree-trellis search algorithm to score multiple pronunciation variations simultaneously in the path • Modifying the statistical language model
Establishment of the Multi-Pronunciation Lexicon • Two major approaches • Defined by linguists and phoneticians • Data-driven: confusion matrices, rewriting rules, decision trees, … • Our method: • Find all possible pronunciations in SAMPA-C from the database • Reduce the size according to occurrence frequencies
Learning Pronunciations • Definition of Generalized Initial/Finals (GIFs): collect all surface forms observed in SAMPA-C and choose the most frequent ones as GIFs, e.g. • z → ts: canonical • z → ts_v: voiced • z → ts`: changed to 'zh' • z → ts`_v: changed to voiced 'zh' • e → 7: canonical • e → 7`: retroflexed or changed to 'er' • e → @: changed • Definition of Generalized Syllables (GSs), defined over the GIF set: a probabilistic lexicon giving the surface form for each IF and syllable, P([GIF_i] GIF_f | syllable), e.g. for the syllable 'chang': • chang [0.7850] ts`_h AN • chang [0.1215] ts`_h_v AN • chang [0.0280] ts`_v AN • chang [0.0187] <deletion> AN • chang [0.0187] z` AN • chang [0.0093] <deletion> iAN • chang [0.0093] ts_h AN • chang [0.0093] ts`_h UN
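A minimal sketch of how such a probabilistic lexicon could be learned from the SAMPA-C tier, following the frequency-based method on the previous slide (the function and parameter names are assumptions, not from the paper):

```python
from collections import Counter, defaultdict

def build_mpl(pairs, min_prob=0.01):
    """Build a probabilistic multi-pronunciation lexicon from
    (syllable, surface_form) pairs, where surface_form is a tuple of GIF
    symbols read off the SAMPA-C tier.  Variants whose relative frequency
    falls below min_prob are pruned and the rest renormalized."""
    counts = defaultdict(Counter)
    for syllable, surface in pairs:
        counts[syllable][surface] += 1
    lexicon = {}
    for syllable, ctr in counts.items():
        total = sum(ctr.values())
        kept = {sf: n / total for sf, n in ctr.items() if n / total >= min_prob}
        norm = sum(kept.values())
        lexicon[syllable] = {sf: p / norm for sf, p in kept.items()}
    return lexicon

# e.g. build_mpl([("chang", ("ts`_h", "AN")), ("chang", ("ts`_h_v", "AN"))])
```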
Probabilistic Pronunciation Modeling (Diagram on slide: AM, LM, refined AM, output probabilities.) • Theory • Recognizer goal: K* = argmax_K P(K|A) = argmax_K P(A|K) P(K) • Applying the independence assumption: P(A|K) = ∏_n P(a_n | k_n) • Pronunciation modeling part, via introducing the surface form s: P(a|k) = Σ_s P(a|k,s) P(s|k) • Symbols • a: acoustic signal; k: IF; s: GIF • A, K, S: the corresponding strings
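A toy numeric illustration of the marginalization P(a|k) = Σ_s P(a|k,s) P(s|k), with made-up probabilities for the IF 'zh' (the numbers are assumptions, not from the paper):

```python
# Marginalizing the acoustic score over surface forms:
#   P(a|k) = sum over s of P(a|k,s) * P(s|k)
p_s_given_k  = {"ts`": 0.80, "ts`_v": 0.15, "ts": 0.05}     # P(s|k), from the MPL
p_a_given_ks = {"ts`": 0.010, "ts`_v": 0.020, "ts": 0.001}  # P(a|k,s), from the AMs

p_a_given_k = sum(p_a_given_ks[s] * p_s_given_k[s] for s in p_s_given_k)
print(p_a_given_k)  # 0.008 + 0.003 + 0.00005 = 0.01105
```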
Refined Acoustic Modeling (RAM) • P(a|k,s) is the RAM • It cannot be trained directly; the solutions could be: • use P(a|k) instead – IF modeling • use P(a|s) instead – GIF modeling • adapt P(a|k) to P(a|k,s) – B-GIF modeling • adapt P(a|s) to P(a|k,s) – S-GIF modeling • An IF-GIF transcription should be generated from the IF and GIF transcriptions • More data would be needed, but the data amount is fixed, so adaptation is used (see the sketch after the diagram below)
(Diagram: the B-GIF scheme adapts the baseform model P(a|k) to P(a|k,s); the S-GIF scheme adapts the surface-form model P(a|s) to P(a|k,s). The IF and GIF models branch into variants IF1/IF2/IF3 and GIF1/GIF2/GIF3; the RAM is generated via adaptation technology.)
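A sketch of the seed-model choice behind the two adaptation schemes in the diagram (the actual adaptation step, e.g. re-estimation on IF-GIF-labeled data, is omitted; all names here are assumptions):

```python
def seed_model(if_unit, gif_unit, scheme, if_models, gif_models):
    """Choose the seed acoustic model to adapt toward the IF-GIF pair (k, s).
    B-GIF starts from the baseform IF model P(a|k); S-GIF starts from the
    surface-form GIF model P(a|s).  The chosen seed is then adapted on the
    (scarce) data labeled with the IF-GIF pair itself."""
    if scheme == "B-GIF":
        return if_models[if_unit]    # adapt P(a|k) toward P(a|k,s)
    if scheme == "S-GIF":
        return gif_models[gif_unit]  # adapt P(a|s) toward P(a|k,s)
    raise ValueError(f"unknown scheme: {scheme}")
```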
Probabilistic Pronunciation Modeling (recap) • Pronunciation modeling part (2/2): in P(a|k) = Σ_s P(a|k,s) P(s|k), the refined acoustic model P(a|k,s) was handled above; the remaining factor, the surface-form output probability P(s|k), is modeled next.
Surface-Form Output Probability Modeling (SOPM) • P(s|k) is the SOPM • Solution: direct output probabilities (DOP) learned from CASS • Problem: data sparseness • Idea: syllable-level data sparseness does NOT imply IF/GIF-level data sparseness • New solution – Context-Dependent Weighting (CDW): • P(GIF|IF) = Σ_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF), where IF_L is the left-context IF • P(GIF | (IF_L, IF)): GIF output probability given the context • P(IF_L | IF): IF transition probability • Both items can be learned from CASS
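A minimal sketch of the CDW computation, assuming two probability tables counted from the CASS transcriptions: ctx_output[(L, IF)][GIF] = P(GIF | (L, IF)) and ctx_trans[IF][L] = P(L | IF) (the table names are assumptions):

```python
def cdw_prob(gif, if_unit, ctx_output, ctx_trans):
    """Context-Dependent Weighting:
    P(GIF | IF) = sum over left contexts L of P(GIF | (L, IF)) * P(L | IF).
    Contexts unseen in training simply contribute zero."""
    return sum(ctx_output.get((L, if_unit), {}).get(gif, 0.0) * p_L
               for L, p_L in ctx_trans[if_unit].items())
```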
Generating the SOPM via CDW • P(S-Syl | B-Syl), with B-Syl = (i, f) and S-Syl = (gi, gf) • CDW: P(S-Syl | B-Syl) = P(gi | i) · P(gf | f) • Three ways to estimate such a factor: • P(GIF|IF) = Σ_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF) • Q(GIF|IF) = max_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF) • M_L(GIF|IF) = P(GIF | (L, IF)) · P(L | IF), for a fixed left context L • Different estimations of P(S-Syl | B-Syl): • P(gi | i) · P(gf | f) • P(gi | i) · Q(gf | f) • P(gi | i) · M_i(gf | f)
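Continuing the previous sketch, the three estimators of P(S-Syl | B-Syl) differ only in how the final's factor is computed: the full sum, the max over contexts, or a single context fixed to the syllable's own initial (the probability tables are again assumptions):

```python
def syllable_prob(gi, gf, i, f, ctx_output, ctx_trans, mode="sum"):
    """Estimate P(S-Syl | B-Syl) for B-Syl = (i, f), S-Syl = (gi, gf).
    The initial's factor P(gi|i) uses the full CDW sum (cdw_prob from the
    previous sketch); the final's factor uses either the full sum P(gf|f),
    the max over contexts Q(gf|f), or M_i(gf|f) with the left context
    fixed to the initial i."""
    p_init = cdw_prob(gi, i, ctx_output, ctx_trans)
    terms = {L: ctx_output.get((L, f), {}).get(gf, 0.0) * p_L
             for L, p_L in ctx_trans[f].items()}
    if mode == "sum":        # P(gf | f)
        p_final = sum(terms.values())
    elif mode == "max":      # Q(gf | f)
        p_final = max(terms.values(), default=0.0)
    else:                    # M_i(gf | f): left context fixed to the initial i
        p_final = terms.get(i, 0.0)
    return p_init * p_final
```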
Can CDW be better (1)? • Pronunciation Lexicon's Intrinsic Confusion (PLIC) • Introducing the MPL is useful for pronunciation variation modeling, but • it enlarges the confusion among syllables • The recognition target: the IF string • What we actually get: a GIF string • Even if the GIF recognizer achieved 100% accuracy, we could not recover the IF string perfectly, because of the MPL
Can CDW be better (2)? • Pronunciation Lexicon's Intrinsic Confusion (PLIC) • reflects the extent of the syllable-level intrinsic confusion • is a lower bound on the syllable error rate • CDW can reduce PLIC
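The PLIC formula itself is not reproduced in this transcript; the following is one plausible formalization consistent with the bullets above (an assumption, not the paper's definition): even a perfect surface-form recognizer must map each GIF surface form back to a single syllable, so the residual syllable error rate is bounded below by the confusion built into the lexicon.

```python
from collections import defaultdict

def plic(lexicon, priors):
    """A plausible formalization of PLIC (assumption, not the slide's actual
    formula).  The best achievable syllable accuracy under the MPL is
        sum over surface forms s of max over syllables b of P(s|b) * P(b),
    so PLIC, the error-rate lower bound, is one minus that quantity.
    lexicon[b][s] = P(s|b); priors[b] = P(b)."""
    best = defaultdict(float)
    for b, surfaces in lexicon.items():
        for s, p in surfaces.items():
            best[s] = max(best[s], p * priors[b])
    return 1.0 - sum(best.values())
```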
Can CDW be better (3)? (Figure on slide.)
Experiment Conditions • The CASS corpus was used for the experiments • Training set: 3 hours of data • Testing set: 15 minutes of data • Features: MFCC + Δ + ΔΔ + E (with CMN) • Toolkit: HTK • Accuracy calculated at the syllable level (N reference syllables, H hits, I insertions): • %Cor = H / N × 100% • %Acc = (H − I) / N × 100%
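A small worked example of the scoring formulas above, in the standard HTK convention (the function name is an assumption):

```python
def syllable_scores(num_ref, hits, insertions):
    """HTK-style syllable scoring: N reference syllables, H hits, I insertions.
    %Cor = H / N * 100%;  %Acc = (H - I) / N * 100%."""
    return {"%Cor": 100.0 * hits / num_ref,
            "%Acc": 100.0 * (hits - insertions) / num_ref}

# e.g. syllable_scores(1000, 620, 35) -> {'%Cor': 62.0, '%Acc': 58.5}
```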
Experimental Results (table on slide.)
Question: does the method still work when more data without phonetic transcription become available?
Using more data w/o IF transcription • A question: is the above method useful when only a small amount of data with IF transcription is available? • The answer depends on how we use the data without IF transcription. • Two parts of data: • Seed database: the part with phonetic transcription • Extra database: the part without phonetic transcription
What's the purpose of these two databases? • Seed database • to define the SRU set (surface forms) • to train the initial acoustic models • to train the initial CDW weights • Extra database • to refine the existing acoustic models • to refine the CDW weights
How to use the extra database? • The problem is that the extra database contains only higher-level transcriptions (say, syllables instead of IFs) • An algorithm is needed to generate the phonetic-level (IF-level) transcription • Our solution is the iterative forced-alignment based transcription (IFABT) algorithm
Steps for IFABT (1) • Use the forced-alignment technique and the MPL to decode both the seed database and the extra database • to generate IF-GIF transcriptions under the constraints of the previous canonical syllable-level transcription • Use these two databases with IF-GIF transcriptions • to redefine the MPL • to retrain the CDW weights • to retrain the IF-GIF models • The above two steps are repeated until the results are satisfactory (a sketch follows).
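A minimal sketch of the IFABT loop. It reuses build_mpl from the earlier lexicon sketch; forced_align, train_cdw, and train_models stand for recognizer components that the deck does not show, so they are passed in as callables (all names here are assumptions):

```python
def ifabt(seed_db, extra_db, forced_align, train_cdw, train_models,
          mpl, cdw_weights, models, n_iter=3):
    """Iterative Forced-Alignment Based Transcription (minimal sketch).
    forced_align(utt, mpl, models) is assumed to return the utterance's
    IF-GIF labels as (syllable, surface_form) pairs, constrained by the
    canonical syllable transcription; mpl/cdw_weights/models are the seeds
    trained on the transcribed seed database."""
    for _ in range(n_iter):  # "repeated until satisfactory"
        labels = [forced_align(utt, mpl, models)
                  for utt in list(seed_db) + list(extra_db)]
        pairs = [(syl, sf) for utt in labels for syl, sf in utt]
        mpl = build_mpl(pairs)           # redefine the MPL (earlier sketch)
        cdw_weights = train_cdw(labels)  # retrain the CDW weights
        models = train_models(labels)    # retrain the IF-GIF models
    return mpl, cdw_weights, models
```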
Steps for IFABT (2) (flowchart on slide.)
Experiments on CASS-II (1) • Database • Enlarged from 3 hrs to 6 hrs, to • cover more spontaneous phenomena, and • provide more training data • The additional 3 hrs of data are transcribed only at the canonical syllable level
Experiments on CASS-II (2) (Results table on slide, comparing training on CASS-I alone with CASS-I/-II.)
Summary • An annotated spontaneous speech corpus is important. • At the syllable level, using GIFs as acoustic models always achieves better results than using IFs. • Both context-dependent modeling and Gaussian density sharing are good methods for pronunciation variation modeling. • Context-dependent weighting is more useful than Gaussian density sharing for pronunciation modeling, because it can reduce the MPL's PLIC value. • The IFABT method is helpful when more data with higher-level transcriptions, yet without phonetic transcriptions, are available.
References • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Mandarin Pronunciation Modeling Based on CASS Corpus," Sino-French Symposium on Speech and Language Processing, pp. 47-53, Oct. 16, 2000, Beijing. • Pascale Fung, William Byrne, ZHENG Fang Thomas, Terri Kamm, LIU Yi, SONG Zhanjiang, Veera Venkataramani, and Umar Ruhi, "Pronunciation Modeling of Mandarin Casual Speech," Workshop 2000 on Speech and Language Processing: Final Report for MPM Group, http://www.clsp.jhu.edu/index.shtml. • Zhanjiang Song, "Research on Pronunciation Modeling for Spontaneous Chinese Speech Recognition," Ph.D. dissertation, Tsinghua University, Beijing, China, Apr. 2001. • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Modeling Pronunciation Variation Using Context-Dependent Weighting and B/S Refined Acoustic Modeling," EuroSpeech, 1:57-60, Sept. 3-7, 2001, Aalborg, Denmark. • Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Mandarin Pronunciation Modeling Based on CASS Corpus," to appear in J. Computer Science & Technology.
Announcement (1) • ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology, September 14-15, 2002, Colorado, USA • http://www.clsp.jhu.edu/pmla2002/
Announcement (2) • International Joint Conference of SNLP-O-COCOSDA, May 9-11, 2002, Prachuapkirikhan, Thailand • http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/
Thanks for listening Thomas Fang Zheng Center of Speech Technology, State Key Lab of Intelligent Technology and Systems, Department of Computer Science & Technology, Tsinghua University fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/