Speech Activities in CST
17 Jan 01, at CMU

Thomas Fang Zheng
Center of Speech Technology
State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology, Tsinghua University
fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/
Center of Speech Technology
• Founded in 1979 as the Speech Laboratory
• Joined the State Key Laboratory of Intelligent Technology and Systems in 1999 and was renamed the Center of Speech Technology
• http://sp.cs.tsinghua.edu.cn/
Members of CST in 2001
Funding Resources
• State fundamental research plans: NSF, 863, 973, 985
• Collaborations with industry:
  • Microsoft
  • IBM
  • Intel
  • Lucent Technologies
  • Nokia
  • Weniwen
  • SoundTek
  • Keysun
  • ...
Speech Research Activities
• Acoustic Modeling
  • Feature extraction and selection
  • Acoustic modeling: accurate & fast AM, search
  • Robustness: speech enhancement, fractals
  • Speaker adaptation, speaker normalization
  • Chinese pronunciation modeling
• Language Modeling
  • Characteristics of Chinese
  • Language modeling and search
  • LM adaptation & new-word induction
• Natural/Spoken Language Understanding (NLU/SLU)
  • NLU: GLR-based parsing
  • SLU: keyword-based robust parsing
  • Dialogue manager
• Applications
  • Command and control
  • Keyword spotting
  • Language learning
  • Input method editor
  • Chinese dictation machine
  • Spoken dialogues
  • Speaker identification and verification
• Resources
Chinese Pronunciation Modeling
• Motivation
  • In spontaneous speech, the pronunciations of individual words vary; there are often
    • sound changes, and
    • phone changes (change includes insertion, deletion, and substitution).
  • For Chinese there are additional problems:
    • accents, even when people are speaking Mandarin, due to different dialect backgrounds (Chinese has 7 major dialects)
    • colloquialism, grammar, and style
• Goal: modeling the pronunciation variations
  • Establish a corpus with spontaneous phenomena, because we need to know what the canonical phones change into.
  • Find solutions to pronunciation modeling, both theoretically and practically.
Necessity of a New Annotated Spontaneous Speech Corpus
• The existing databases (incl. Broadcast News, CallHome, CallFriend, ...) do not cover all Chinese spoken-language phenomena
  • Sound changes: voiced, unvoiced, nasalization, ...
  • Phone changes: retroflexed, OOV-phoneme, ...
• The existing databases do not contain pronunciation-variation information for bootstrap training
• A Chinese Annotated Spontaneous Speech (CASS) corpus was therefore established before WS00 on LSP at JHU
  • Completely spontaneous (discourses, lectures, ...)
  • Considerable background noise, accent backgrounds, ...
  • Recorded onto tapes and then digitized
Chinese Annotated Spontaneous Speech (CASS) Corpus w/ Five-Tier Transcription
• Character level: base form
• Syllable (or Pinyin) level (w/ tone): base form
• Initial/Final (IF) level: base form, w/ time boundaries
• SAMPA-C level: surface form
• Miscellaneous level: used for garbage modeling
  • Lengthening, breathing, laughing, coughing, disfluency, noise, silence, murmur (unclear), modal, smack, non-Chinese
• Example: see the sketch below
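The example from the original slide is not reproduced here; the following is a minimal illustrative sketch of what a five-tier annotation could look like for one hypothetical syllable, using only SAMPA-C symbols quoted elsewhere in this talk (the actual CASS file format may differ).

```python
# Hypothetical five-tier CASS-style annotation for a single syllable.
# Tier names follow the slide; the surface form ts`_h_v (voiced
# aspirated retroflex) is one of the "chang" variants listed later.
annotation = {
    "character":     "长",                    # character tier (base form)
    "syllable":      "chang2",                # Pinyin w/ tone (base form)
    "initial_final": [("ch", 0.00, 0.08),     # IF tier w/ time boundaries
                      ("ang", 0.08, 0.31)],   # (times are invented)
    "sampa_c":       ["ts`_h_v", "AN"],       # surface form tier
    "miscellaneous": ["lengthening"],         # garbage-modeling labels
}
```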
SAMPA-C: Machine-Readable IPA for Chinese
• Phonologic consonants: 23
• Phonologic vowels: 9
• Initials: 21
• Finals: 38
• Retroflexed finals: 38
• Tones and silences
• Sound changes
• Spontaneous phenomena labels
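To make the notation concrete, here is a tiny illustrative fragment of a Pinyin-to-SAMPA-C table, limited to the mappings quoted on the next slides; the full inventory (23 consonants, 9 vowels, 21 initials, 38 finals, ...) is not listed in this talk.

```python
# Illustrative fragment only; symbols taken from the GIF examples below.
PINYIN_TO_SAMPA_C = {
    "z":  "ts",      # canonical initial
    "zh": "ts`",     # retroflex initial
    "ch": "ts`_h",   # aspirated retroflex initial
    "e":  "7",       # canonical final
    "er": "7`",      # retroflexed final
}
# Sound-change diacritics seen in this talk: "_v" marks voicing,
# so a voiced realization of "z" is written "ts_v".
```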
Key Points in Pronunciation Modeling
• Choosing and generating the recognition units
  • The semi-syllable, i.e. Initial/Final (IF), is suitable for Chinese
• Establishing a multi-pronunciation (syllable to semi-syllable) lexicon
  • to reflect the pronunciation variation
• Acoustic modeling of the spontaneous speech
  • Theoretical framework
• Customized decoding algorithm
  • according to the new lexicon
• Modified language modeling
  • according to the new lexicon
Establishment of the Multi-Entry Lexicon
• Two major approaches
  • Defined by linguists
  • Data-driven (confusion matrix, rewriting rules, decision tree, ...)
• Our method:
  • Find all possible pronunciations in SAMPA-C from the database
  • Reduce the size according to occurrence frequencies
Surface Forms for IF and Syllable: Learning Pronunciations
• Definition of Generalized Initial-Finals (GIFs): collect all surface forms observed in the corpus and choose the most frequent ones as GIFs, e.g.
  • z  ts     : canonical
  • z  ts_v   : voiced
  • z  ts`    : changed to 'zh'
  • z  ts`_v  : changed to voiced 'zh'
  • e  7      : canonical
  • e  7`     : retroflexed, or changed to 'er'
  • e  @      : changed
• Definition of Generalized Syllables (GSs), defined over the GIF set: a probabilistic lexicon of entries P([GIF_i] GIF_f | Syllable), where the GIF initial may be absent, e.g.
  • chang [0.7850] ts`_h      AN
  • chang [0.1215] ts`_h_v    AN
  • chang [0.0280] ts`_v      AN
  • chang [0.0187] <deletion> AN
  • chang [0.0187] z`         AN
  • chang [0.0093] <deletion> iAN
  • chang [0.0093] ts_h       AN
  • chang [0.0093] ts`_h      UN
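A minimal data-driven sketch of how such a probabilistic lexicon could be built from the SAMPA-C tier, assuming the input is a simple list of (syllable, surface GIF pair) observations; the counting and frequency-based pruning follow the two steps named above, but the actual CST tooling is not shown in the slides.

```python
from collections import Counter, defaultdict

def build_lexicon(observations, min_prob=0.005):
    """observations: iterable of (syllable, (gif_initial, gif_final))
    pairs read off the aligned syllable and SAMPA-C tiers.
    Returns {syllable: {(gi, gf): probability}}, pruned by frequency."""
    counts = defaultdict(Counter)
    for syllable, surface in observations:
        counts[syllable][surface] += 1

    lexicon = {}
    for syllable, surf_counts in counts.items():
        total = sum(surf_counts.values())
        kept = {s: c / total for s, c in surf_counts.items()
                if c / total >= min_prob}          # drop rare variants
        norm = sum(kept.values())                  # renormalize after pruning
        lexicon[syllable] = {s: p / norm for s, p in kept.items()}
    return lexicon

# e.g. build_lexicon([("chang", ("ts`_h", "AN")),
#                     ("chang", ("ts`_h_v", "AN"))])
```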
Probabilistic Pronunciation Modeling: Theory
[Block diagram: AM, LM, Refined AM, Output Prob.]
• Recognizer goal
  • K* = argmax_K P(K|A) = argmax_K P(A|K) P(K)
• Applying the independence assumption
  • P(A|K) = ∏_n P(a_n | k_n)
• Pronunciation-modeling part, via introducing the surface form s
  • P(a|k) = Σ_s P(a|k, s) P(s|k)
• Symbols
  • a: acoustic signal; k: IF; s: GIF
  • A, K, S: the corresponding strings
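A toy numeric sketch of the marginalization P(a|k) = Σ_s P(a|k,s) P(s|k), assuming the two factors are available as plain dictionaries; in the real system P(a|k,s) would come from the refined acoustic models and P(s|k) from the surface-form output probability model described below.

```python
def p_a_given_k(a, k, p_a_given_ks, p_s_given_k):
    """Marginalize the surface form s out of the refined model:
    P(a|k) = sum_s P(a|k,s) * P(s|k)."""
    return sum(p_a_given_ks[(a, k, s)] * p_sk
               for s, p_sk in p_s_given_k[k].items())

# Toy example: canonical IF "z" realized as GIF "ts" or voiced "ts_v"
p_s_given_k = {"z": {"ts": 0.8, "ts_v": 0.2}}
p_a_given_ks = {("frame", "z", "ts"): 0.6, ("frame", "z", "ts_v"): 0.3}
print(p_a_given_k("frame", "z", p_a_given_ks, p_s_given_k))  # ≈ 0.54
```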
Refined Acoustic Modeling (RAM): P(a|k, s)
• It cannot be trained directly; possible solutions:
  • Use P(a|k) instead -- IF modeling
  • Use P(a|s) instead -- GIF modeling
  • Adapt P(a|k) to P(a|k, s) -- B-GIF modeling
  • Adapt P(a|s) to P(a|k, s) -- S-GIF modeling
• An IF-GIF transcription must be generated from the IF and GIF transcriptions
• More data are needed, but the data amount is fixed
  • hence the use of adaptation
Generating RAM via Adaptation
• B-GIF scheme: adapt P(a|k) to P(a|k, s), i.e. each base-form IF model is adapted toward its surface GIF variants
• S-GIF scheme: adapt P(a|s) to P(a|k, s), i.e. each surface-form GIF model is adapted toward the canonical IFs it realizes
[Diagram: one IF fanning out to GIF1/GIF2/GIF3 (B-GIF); IF1/IF2/IF3 converging on one GIF (S-GIF)]
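A minimal sketch of producing the IF-GIF pair transcription mentioned above, assuming the IF and GIF tiers are already aligned one-to-one; insertions and deletions, which the corpus does contain, would need a real alignment step first.

```python
def if_gif_transcription(if_tier, gif_tier):
    """Zip aligned canonical IF and surface GIF tiers into the IF-GIF
    units used to train the refined models P(a | k, s)."""
    assert len(if_tier) == len(gif_tier), "tiers must be pre-aligned"
    return [f"{k}+{s}" for k, s in zip(if_tier, gif_tier)]

# e.g. if_gif_transcription(["ch", "ang"], ["ts`_h_v", "AN"])
#   -> ["ch+ts`_h_v", "ang+AN"]
```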
Surface-Form Output Probability Modeling (SOPM): P(s|k)
• Solution: direct output probabilities (DOP) learned from CASS
  • Problem: data sparseness
• Idea: syllable-level data sparseness DOESN'T mean IF/GIF-level data sparseness
• New solution: Context-Dependent Weighting (CDW)
  • P(GIF|IF) = Σ_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF), where IF_L is the left-context IF
  • P(GIF | (IF_L, IF)): GIF output probability given the context
  • P(IF_L | IF): IF transition probability
  • Both items can be learned from CASS
Generating SOPM via CDW
• P(S-Syl | B-Syl): B-Syl = (i, f), S-Syl = (gi, gf)
• CDW:
  • P(S-Syl | B-Syl) = P(gi | i) · P(gf | f)
  • P(GIF|IF) = Σ_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF)
  • Q(GIF|IF) = max_{IF_L} P(GIF | (IF_L, IF)) · P(IF_L | IF)
  • M_L(GIF|IF) = P(GIF | (L, IF)) · P(L | IF), for a fixed left context L
• Different estimates of P(S-Syl | B-Syl):
  • P(gi | i) · P(gf | f)
  • P(gi | i) · Q(gf | f)
  • P(gi | i) · M_i(gf | f)
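A small sketch of the three CDW estimates, assuming the context-dependent output probabilities and the transition probabilities have already been counted from CASS; the function and variable names are illustrative.

```python
def cdw_sum(gif, if_, p_out, p_trans):
    """P(GIF|IF) = sum over left contexts IF_L of
    P(GIF | (IF_L, IF)) * P(IF_L | IF)."""
    return sum(p_out.get((gif, left, if_), 0.0) * p
               for left, p in p_trans[if_].items())

def cdw_max(gif, if_, p_out, p_trans):
    """Q(GIF|IF): like cdw_sum, but keeping only the best left context."""
    return max(p_out.get((gif, left, if_), 0.0) * p
               for left, p in p_trans[if_].items())

def cdw_fixed(gif, if_, left, p_out, p_trans):
    """M_L(GIF|IF): the single term for a known left context L; for a
    final f inside a syllable the left context is that syllable's own
    initial i, giving the M_i(gf|f) estimate on the slide."""
    return p_out.get((gif, left, if_), 0.0) * p_trans[if_][left]
```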
Experimental Conditions
• The CASS corpus was used for the experiments
  • Training set: 3 hours of data
  • Testing set: 15 minutes of data
• Features
  • MFCC + Δ + ΔΔ + E (with CMN)
• Toolkit: HTK
• Accuracy calculated at the syllable level
  • %Cor = Hit / Num × 100%
  • %Acc = (Hit − Ins) / Num × 100%
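A one-function sketch of the two syllable-level scores, following the standard HTK-style definitions assumed above (Hit = correctly recognized syllables, Ins = insertions, Num = reference syllables).

```python
def syllable_scores(hit, ins, num):
    """Return (%Cor, %Acc): correctness ignores insertions,
    accuracy penalizes them."""
    cor = hit / num * 100.0
    acc = (hit - ins) / num * 100.0
    return cor, acc

# e.g. syllable_scores(hit=820, ins=35, num=1000) -> (82.0, 78.5)
```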
Experimental Results
[Results table from the original slide not reproduced in this text version]
Experiments Done after WS00
• Database
  • Enlarged the database from 3 hrs to 6 hrs, to
    • cover more spontaneous phenomena, and
    • provide more training data
  • The additional 3 hrs of data are transcribed only at the canonical syllable level
  • A recursive procedure is adopted to generate the annotation at the GIF level (a sketch of this loop follows below):
    • Forced alignment using the IF-to-GIF multi-pronunciation lexicon
    • Retraining the GIF models
    • Adjusting the GIF sets
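A schematic sketch of that recursive annotation procedure; `force_align` and `train_models` are hypothetical helpers passed in as parameters, and since the slides do not state a stopping criterion, a fixed iteration count is used here.

```python
def bootstrap_gif_annotation(utterances, lexicon, models,
                             force_align, train_models, n_iter=5):
    """Iteratively label syllable-only data at the GIF level.
    force_align(utt, lexicon, models) -> best GIF label sequence
    train_models(utterances, labels)  -> retrained GIF models
    Both are supplied by the caller; the slides do not name the tooling."""
    labels = None
    for _ in range(n_iter):
        # Pick the best GIF sequence for each utterance under current models
        labels = [force_align(u, lexicon, models) for u in utterances]
        # Retrain the GIF models on the newly labeled data
        models = train_models(utterances, labels)
        # (the GIF set itself may also be adjusted between iterations)
    return models, labels
```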
Experiments Done after WS00 (cont'd)
• Acoustic modeling
  • Using context-dependent modeling: Tri-IF/Tri-GIF (decision-tree based)
  • Performing state-level Gaussian sharing
  • Performance improved by ~6% absolute
• Future work:
  • Deleted interpolation
  • Optimizing the AM decoder
  • Refining and integrating the LM
Summary
• An annotated spontaneous speech corpus is important
  • Data sparseness should be considered
  • The transcription effort is very large (1:150)
• S-GIF adaptation is useful in RAM
• CDW is helpful in estimating SOPM
• A data-driven procedure is useful for borrowing data w/o annotation
• Context-dependent modeling and state-level Gaussian sharing are good approaches
• Deleted interpolation could be used to borrow data
References
• Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Mandarin Pronunciation Modeling Based on CASS Corpus," Sino-French Symposium on Speech and Language Processing, pp. 47-53, Oct. 16, 2000, Beijing.
• Pascale Fung, William Byrne, Thomas Fang Zheng, Terri Kamm, Yi Liu, Zhanjiang Song, Veera Venkataramani, and Umar Ruhi, "Pronunciation Modeling of Mandarin Casual Speech," Workshop 2000 on Speech and Language Processing: Final Report for the MPM Group, http://www.clsp.jhu.edu/index.shtml.
• Fang Zheng, Zhanjiang Song, Pascale Fung, and William Byrne, "Modeling Pronunciation Variation Using Context-Dependent Weighting and B/S Refined Acoustic Modeling," to appear in EuroSpeech, Sept. 3-7, 2001, Aalborg, Denmark.
Thanks for listening!

Thomas Fang Zheng
Center of Speech Technology
State Key Lab of Intelligent Technology and Systems
Department of Computer Science & Technology, Tsinghua University
fzheng@sp.cs.tsinghua.edu.cn, http://sp.cs.tsinghua.edu.cn/~fzheng/