Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf
12/12/2007
Outline • Goal: a highly accurate Mandarin ASR system • Baseline: System-2006 • Improvements • Acoustic segmentation • Two complementary, comparable systems • Language models and adaptation • More data • Error analysis • Future work
Background: System-2006 • 849M words of training text • 60K-word lexicon • Static 5-gram rescoring • 465 hrs of acoustic training • Two AMs (same 72-phone pronunciation lexicon) • MFCC+pitch (42-dim), SAT+fMPE, CW MPE, 3000x128 Gaussians • MFCC+MLP+pitch (74-dim), SAT+fMPE, nonCW MPE, 3000x64 Gaussians • CER 18.4% on Eval06
2007 Increased Training Data • 870 hours of acoustic training data. 3500x128 Gaussians. • 1.2G words of training text. Trigrams and 4-grams.
Acoustic Segmentation
[Diagram: segmenter HMM with Start/null and End/null states connecting speech, silence, and noise models]
• The former segmenter caused high deletion errors: it misclassified some speech segments as noise. • Minimum speech-segment duration: 18 frames x 30 ms = 540 ms ≈ 0.5 s
New Acoustic Segmenter
[Diagram: segmenter HMM with Start/null and End/null states connecting Mandarin 1, Mandarin 2, Foreign, silence, and noise models]
• Allow shorter speech durations. • Model Mandarin vs. foreign (English) speech separately.
Improved Acoustic Segmentation
[Table: CER results with pruned trigram, SI nonCW-MLP MPE models, on Eval06]
Decoding Architecture
[Diagram: Aachen segmentation; MLP nonCW decoding with quick trigram (qLM3); cross-adaptation into two branches, PLP CW SAT+fMPE (MLLR, LM3) and MLP CW SAT (MLLR, LM3); each followed by qLM4 adaptation/rescoring; final Confusion Network Combination]
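The last stage above merges the two systems' word posteriors. Below is a minimal sketch of slot-wise confusion network combination, assuming the two networks have already been aligned; real CNC also handles null arcs and the alignment itself, and the system weight `w_a` is an illustrative parameter, not a value from the slides.

```python
from collections import defaultdict
from typing import Dict, List

def cnc(net_a: List[Dict[str, float]],
        net_b: List[Dict[str, float]],
        w_a: float = 0.5) -> List[str]:
    """Combine two pre-aligned confusion networks slot by slot:
    sum weighted word posteriors, then pick the best word per slot."""
    output = []
    for slot_a, slot_b in zip(net_a, net_b):
        combined = defaultdict(float)
        for word, p in slot_a.items():
            combined[word] += w_a * p
        for word, p in slot_b.items():
            combined[word] += (1.0 - w_a) * p
        output.append(max(combined, key=combined.get))
    return output
```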
Two Sets of Acoustic Models • For cross adaptation and system combo • Different error behaviors • Similar error rate performance
MLP Phoneme Posterior Features • Compute Tandem features from PLP+pitch input. • Compute HATs features from 19 critical bands. • Combine the Tandem and HATs posterior vectors into one. • PCA(log(.)): 71 → 32 dimensions. • MFCC + pitch + MLP = 74 dims. (See the sketch after the HATs diagram.)
Tandem Features [T1, T2, …, T71]
[Diagram: MLP (42x9) x 15000 x 71, with inputs PLP (39x9) and pitch (3x9)]
• Input: 9 frames of PLP+pitch
HATs Features [H1, H2, …, H71]
[Diagram: 19 critical-band MLPs E1 E2 … E19 (51x60x71), feeding a merger MLP (60*19)x8000x71]
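A minimal numpy sketch of the feature pipeline on the preceding slides: merge the Tandem and HATs posteriors, apply log + PCA (71 → 32), and append the result to the 42-dim MFCC+pitch stream. The random inputs, the frame count, and the simple averaging rule for merging the two posterior streams are illustrative assumptions; the slides do not specify the merging rule.

```python
import numpy as np
from sklearn.decomposition import PCA

n_frames = 1000
tandem = np.random.dirichlet(np.ones(71), size=n_frames)  # Tandem posteriors [T1..T71]
hats = np.random.dirichlet(np.ones(71), size=n_frames)    # HATs posteriors  [H1..H71]

# Combine the two posterior streams into one 71-dim vector per frame
# (simple average here; the actual rule is an assumption).
posteriors = 0.5 * (tandem + hats)

# PCA(log(.)): log-compress, then reduce 71 -> 32 dimensions.
log_post = np.log(posteriors + 1e-10)
mlp32 = PCA(n_components=32).fit_transform(log_post)

# Append to the 42-dim MFCC+pitch front end: 42 + 32 = 74 dims total.
mfcc_pitch = np.random.randn(n_frames, 42)
features = np.hstack([mfcc_pitch, mlp32])
assert features.shape == (n_frames, 74)
```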
MLP and Pitch Features
[Table: CER results with nonCW ML models, Hub4 training, MLLR, LM2, on Eval04]
Phone-81: Diphthongs for BC • Add diphthongs (4x4=16) for fast speech and to model longer triphone context. • Maintain unique syllabification. • Syllable-ending /W/ and /Y/ are no longer needed.
Phone-81: Frequent Neutral Tones for BC • Neutral tones are more common in conversation. • Neutral tones were not modeled before; the 3rd tone was used as a replacement. • Add 3 neutral tones for frequent characters.
Phone-81: Special CI Phones for BC • Filled pauses (hmm, ah) are common in BC; add two CI phones for them. • Add CI /V/ for English.
Phone-81: Simplification of Other Phones • Now 72+14+3+3 = 92 phones: too many triphones to model. • Merge similar phones to reduce the number of triphones, e.g. I2 was previously modeled by I1, now by i2. • 92 − (4x3−1) = 81 phones.
Different Phone Sets
[Table: CER results with pruned trigram, SI nonCW-PLP ML models, on dev07]
Indeed different error behaviors: good for system combination.
PLP Models with fMPE Transform • Train a PLP model with an fMPE transform to compete with the MLP model. • Smaller ML-trained Gaussian posterior model: 3500x32, CW+SAT. • 5 neighboring frames of Gaussian posteriors. • M is 42 x (3500*32*5); h is (3500*32*5) x 1. • Ref: Zheng, ICASSP 2007.
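A hedged sketch of the fMPE transform described above (cf. Zheng, ICASSP 2007): the high-dimensional Gaussian-posterior vector h, stacked over 5 neighboring frames, is projected by the learned matrix M and added to the original 42-dim feature. The zero-initialized M and the random x and h are placeholders; in the real system M is trained with the MPE criterion.

```python
import numpy as np

feat_dim = 42
n_post = 3500 * 32 * 5            # Gaussian posteriors stacked over 5 frames

M = np.zeros((feat_dim, n_post), dtype=np.float32)  # learned via MPE; zeros here
x = np.random.randn(feat_dim).astype(np.float32)    # original 42-dim frame
h = np.random.rand(n_post).astype(np.float32)       # posterior vector, (3500*32*5) x 1

# fMPE: add a learned projection of the posterior vector to the feature.
y = x + M @ h
```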
Topic-Based LM Adaptation
[Diagram: Latent Dirichlet Allocation topic model with priors θ, θ0; one sentence adapted using {w | w in the same story, within a 4 s window}]
• The 4 s window makes adaptation more robust to ASR errors. • The words {w} are weighted by distance.
Topic-Based LM Adaptation • Training: one topic per sentence; train 64 topic-dependent LMs. • Testing: top-n topics per sentence, weighting words in the neighboring 4 s of speech (see the sketch below).
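A minimal sketch of the per-sentence adaptation above. `topic_lms` and `infer_topic_weights` are hypothetical stand-ins for the 64 trained topic LMs and for LDA inference over the 4 s context window; `top_n = 4` is an illustrative choice, not a value from the slides.

```python
import math
from typing import Callable, Dict, List, Tuple

def adapted_logprob(
    ngram: Tuple[str, ...],
    context_words: List[str],      # words within the surrounding 4 s window
    topic_lms: List[Callable[[Tuple[str, ...]], float]],  # 64 topic LMs: ngram -> P
    infer_topic_weights: Callable[[List[str]], Dict[int, float]],  # LDA inference
    top_n: int = 4,
) -> float:
    """Interpolate the top-n topic-dependent LMs, weighted by the
    topic posteriors inferred from the context window."""
    weights = infer_topic_weights(context_words)          # topic id -> weight
    top = sorted(weights.items(), key=lambda kv: -kv[1])[:top_n]
    norm = sum(w for _, w in top)
    p = sum((w / norm) * topic_lms[t](ngram) for t, w in top)
    return math.log(p)
```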
Topic-Based LM Adaptation: Open Questions • Is each LMi still 60K words? • Per-sentence adaptation? • Computational cost?
LM Adaptation and CNC on Dev07
[Table: results for the UW 2-system combination only]
2006 vs. 2007 on Eval07
[Table: CER comparison between the 2006 and 2007 systems]
37% relative improvement!!
RWTH Demo • UW acoustic segmenter. • RWTH single-system ASR; foreign (Korean) speech skipped; mis-recognized words highlighted. • Manual sentence segmentation. • Machine translation. • Not real-time.
MT Error Analysis on Extreme Cases • CER is not directly related to HTER; genre matters. • But a better CER does ease MT.
MT Error Analysis • (a) worst BN: OOV names • (b) worst BC: overlapped speech • (c) best BN: composite sentences • (d) best BC: simple sentences with disfluency and re-starts. • *.html, *.wav
Error Analysis • OOV (especially names): problematic for ASR, MT, distillation.
Error Analysis • MT errors high on BN: • Composite syntactic structure. • Syntactic parsing would be useful. • MT errors high on BC: • Overlapped speech. • High ASR errors due to disfluency. • Conjecture: MT on perfect BC ASR output is easy, given its simple/short sentence structure.
Next ASR: Chinese Organization Names • Semi-automatic abbreviation generation for long words: • Segment a long word into a sequence of shorter words. • Extract the 1st character of each shorter word: • 世界卫生组织 → 世卫 • (Make sure the abbreviations are in the MT translation table, too.) A toy sketch follows.
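A minimal sketch of the abbreviation rule above: segment the long name into shorter words and keep the first character of each. `segment` is a hypothetical word segmenter; the toy segmenter below handles only the slide's example.

```python
def abbreviate(long_word: str, segment) -> str:
    """e.g. segment('世界卫生组织') -> ['世界', '卫生组织'] gives '世卫'."""
    return "".join(w[0] for w in segment(long_word))

# Toy segmenter for this one example name:
toy_segment = lambda s: ["世界", "卫生组织"]
assert abbreviate("世界卫生组织", toy_segment) == "世卫"
```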
Next ASR: Chinese Person Names • Mandarin has a high rate of homophones: 408 syllables vs. ~6000 common characters, i.e. ~14 homophone characters per syllable!! • Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But MT doesn't care, as long as the syllables are correct! • Recognize repetitions of the same name in the same snippet: CNC at the syllable level (see the sketch below). • Xu {Chang, Cheng} {Lin, Min, Ming} • Huang Zhu {Qin, Qi} • After syllable-level CNC, apply the same name to all occurrences in Pinyin.
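A hedged sketch of the idea above: vote across all recognized occurrences of the same name at the syllable (Pinyin) level, then apply the winning syllable sequence everywhere. The hypotheses below mirror the slide's 'Xu {Chang, Cheng} {Lin, Min, Ming}' example and are assumed, not taken from real output.

```python
from collections import Counter
from typing import List

def resolve_name(occurrences: List[List[str]]) -> List[str]:
    """Pick, per syllable slot, the variant most often recognized across
    all occurrences of the name, and apply it to every occurrence."""
    n_slots = len(occurrences[0])
    best = []
    for i in range(n_slots):
        votes = Counter(occ[i] for occ in occurrences)
        best.append(votes.most_common(1)[0][0])
    return best

# The same name recognized three times in one snippet:
hyps = [["Xu", "Chang", "Lin"], ["Xu", "Cheng", "Ming"], ["Xu", "Chang", "Ming"]]
print(resolve_name(hyps))  # -> ['Xu', 'Chang', 'Ming'], applied to all occurrences
```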
Next ASR: Foreign Names • English spellings in the lexicon, with (multiple) Mandarin pronunciations: • Bush /bu4 shi2/ or /bu4 xi1/ • Bin Laden /ben1 la1 deng1/ or /ben3 la1 deng1/ • John /yue1 han4/ • Sadr /sa4 de2 er3/ • Name mapping from MT? • Need to do name tagging on the training text (Yang Liu), convert Chinese names to English spelling, and re-train the n-gram.
Next ASR: LM • LM adaptation with finer topics, each topic with a small vocabulary. • Spontaneous speech: should the n-gram back off to content words in search or in N-best rescoring? Text parsing modeling? • 我想那(也)(也)也是 → 我想那也是 • I think it, (too), (too), is, too. → I think it is, too. • If optimizing CER, the stm must be designed so that disfluencies are optionally deletable: 小孩(儿)
Next ASR: AM • Add explicit tone modeling (Lei07). • Prosodic info: duration and pitch contour at the word level. • Various backoff schemes for infrequent words. • Better understand why outside regions do not help AM adaptation. • Add an SD MLLR regression tree (Mandal06). • Improve automatic speaker clustering: • Smaller clusters, better performance. • Gender ID first.
ASR & MT Integration • Do we need to merge lexicons (ASR ↔ MT)? • Do we need to use the same word segmenter? • Is word- or character-level CNC output better for MT? • Open questions and feedback welcome!