Building A Highly Accurate Mandarin Speech Recognizer


  1. Building A Highly Accurate Mandarin Speech Recognizer Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU), Mari Ostendorf. 12/12/2007

  2. Outline • Goal: A highly accurate Mandarin ASR • Background • Acoustic segmentation • Acoustic models and adaptation • Language models and adaptation • Cross adaptation • System combination • Error analysis → Future

  3. Background • 870 hours of acoustic training data. • Unigram (N=1) maximum-likelihood (ML) Chinese word segmentation. • 60K-word lexicon. • 1.2G words of LM training text; trigrams and 4-grams.

  4. Acoustic Segmentation [diagram: segmenter HMM with speech, silence, and noise models between Start/null and End/null states] • The former segmenter caused high deletion errors: it misclassified some speech segments as noise. • Minimum speech-segment duration: 18*30 ms = 540 ms ≈ 0.5 s.
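
The 540 ms figure comes from requiring 18 consecutive frames at a 30 ms step. Below is a minimal post-processing sketch of that kind of minimum-duration constraint, the strictness of which is what the new segmenter (next slide) relaxes. The real system enforces the constraint inside the HMM segmenter; the frame-level labels and function name here are illustrative.

```python
# Sketch: drop speech segments shorter than a minimum duration.
# The 30 ms frame step and 18-frame (540 ms) minimum come from the slide;
# the frame-level label representation is an illustrative assumption.

FRAME_MS = 30
MIN_SPEECH_FRAMES = 18  # 18 * 30 ms = 540 ms ~ 0.5 s

def enforce_min_speech(labels):
    """Relabel speech runs shorter than MIN_SPEECH_FRAMES as noise."""
    out = list(labels)
    i = 0
    while i < len(out):
        if out[i] == "speech":
            j = i
            while j < len(out) and out[j] == "speech":
                j += 1
            if j - i < MIN_SPEECH_FRAMES:      # run too short: rejected as noise
                out[i:j] = ["noise"] * (j - i)
            i = j
        else:
            i += 1
    return out

# Example: a 10-frame (300 ms) burst is rejected, a 20-frame run is kept.
labels = ["noise"] * 5 + ["speech"] * 10 + ["noise"] * 5 + ["speech"] * 20
print(enforce_min_speech(labels).count("speech"))  # -> 20
```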

  5. New Acoustic Segmenter [diagram: segmenter HMM with silence, noise, Foreign, Mandarin 1, and Mandarin 2 models between Start/null and End/null states] • Allow shorter speech durations. • Model Mandarin vs. foreign (English) speech separately.

  6. Two Sets of Acoustic Models • For cross adaptation and system combo • Different error behaviors • Similar error rate performance

  7. MLP Phoneme Posterior Features • Compute Tandem features with pitch+PLP input. • Compute HATs features with 19 critical bands. • Combine the Tandem and HATs posterior vectors into one. • Log(PCA(71 → 32)) • MFCC + pitch + MLP = 74-dim • 3500x128 Gaussians, MPE trained. • Both cross-word (CW) and nonCW triphones trained.
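
A rough numpy sketch of the per-frame feature assembly described on slides 7-9. The dimensions (71 posteriors, PCA to 32, 42-dim MFCC+pitch, 74-dim total) follow the slides; the averaging used to combine the two posterior streams, the log/PCA ordering, and the placeholder `pca_basis` are assumptions.

```python
import numpy as np

# Sketch of the front-end feature assembly (dimensions from the slides; the
# combination rule and log/PCA ordering are assumptions; pca_basis would be
# estimated on training data).

def combine_mlp_features(tandem_post, hats_post, mfcc_pitch, pca_basis):
    """tandem_post, hats_post: (71,) phone posteriors from the two MLPs.
    mfcc_pitch: (42,) MFCC + pitch features.
    pca_basis: (71, 32) projection estimated on training data.
    Returns a 74-dim feature vector (42 + 32)."""
    post = 0.5 * (tandem_post + hats_post)        # combine the two posterior streams
    logp = np.log(post + 1e-10)                   # log posteriors
    mlp32 = logp @ pca_basis                      # PCA: 71 -> 32
    return np.concatenate([mfcc_pitch, mlp32])    # 42 + 32 = 74 dims

# Toy usage with random stand-ins:
rng = np.random.default_rng(0)
t = rng.dirichlet(np.ones(71))
h = rng.dirichlet(np.ones(71))
feat = combine_mlp_features(t, h, rng.normal(size=42), rng.normal(size=(71, 32)))
print(feat.shape)  # (74,)
```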

  8. Tandem Features [T1, T2, …, T71] [diagram: MLP of size (42x9) x 15000 x 71, with PLP (39x9) and pitch (3x9) inputs] • Input: 9 frames of PLP+pitch.

  9. HATS Features [H1, H2, …, H71] [diagram: critical-band energy trajectories E1…E19 feed per-band MLPs of size 51x60x71, whose hidden activations feed a merger MLP of size (60*19)x8000x71]

  10. Phone-81: Diphthongs for BC • Add diphthongs (4x4=16) for fast speech and for modeling longer triphone context. • Maintain unique syllabification. • Syllable-ending W and Y are no longer needed.

  11. Phone-81: Frequent Neutral Tones for BC • Neutral tones are more common in conversation. • Neutral tones were not modeled before; the 3rd tone was used as a replacement. • Add 3 neutral tones for frequent characters.

  12. Phone-81: Special CI Phones for BC • Filled pauses (hmm, ah) are common in BC; add two CI phones for them. • Add CI /V/ for English.

  13. Phone-81: Simplification of Other Phones • Now 72+14+3+3 = 92 phones, too many triphones to model. • Merge similar phones to reduce the number of triphones; e.g., I2 was modeled by I1 and is now i2. • 92 – (4x3–1) = 81 phones.

  14. PLP Models with fMPE Transform • PLP model with fMPE transform to compete with the MLP model. • Smaller ML-trained Gaussian posterior model: 3500x32, CW+SAT. • 5 neighboring frames of Gaussian posteriors. • M is 42 x (3500*32*5), h is (3500*32*5) x 1. • Ref: Zheng, ICASSP 2007.
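
A minimal sketch of how an offset of the shape given above could be applied per frame: h stacks the Gaussian posteriors of 5 neighboring frames and M maps it to a 42-dim offset added to the original feature. The dense matrix multiply and the reduced toy sizes are simplifications (the slide's real M is 42 x (3500*32*5), and practical systems exploit the sparsity of the posteriors).

```python
import numpy as np

# Sketch of an fMPE-style feature transform (slide 14).  Sizes below are the
# slide's values; the toy usage shrinks them so the example runs instantly.

FEAT_DIM = 42
N_POST = 3500 * 32      # Gaussian posterior dimension per frame (slide 14)
N_FRAMES = 5            # neighboring frames stacked into h

def fmpe_transform(x_t, posteriors, M):
    """x_t: (42,) original feature; posteriors: list of per-frame posterior
    vectors; M: (42, len(h)) trained fMPE matrix.  Returns x_t + M h."""
    h = np.concatenate(posteriors)   # (N_POST * N_FRAMES,) in the real system
    return x_t + M @ h

# Toy usage with a tiny posterior dimension.
rng = np.random.default_rng(0)
toy_post_dim, toy_frames = 100, 5
posts = [rng.dirichlet(np.ones(toy_post_dim)) for _ in range(toy_frames)]
M = 0.01 * rng.normal(size=(FEAT_DIM, toy_post_dim * toy_frames))
x = rng.normal(size=FEAT_DIM)
print(fmpe_transform(x, posts, M).shape)  # (42,)
```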

  15. Topic-based LM Adaptation [diagram: a Latent Dirichlet Allocation topic model maps {w | w in the same story, within a 4-sec window around one sentence} to topic posteriors q, given prior q0] • The 4s window is used to make adaptation more robust against ASR errors. • {w} are weighted based on distance.

  16. Topic-based LM Adaptation [diagram: Latent Dirichlet Allocation topic model applied to one sentence] • Training: one topic per sentence; train 64 topic-dependent 4-gram LMs: LM1, LM2, …, LM64. • Decoding: top n topics per sentence, where qi' > threshold.

  17. Improved Acoustic Segmentation (pruned trigram, SI nonCW-MLP MPE, on eval06)

  18. Different Phone Sets (pruned trigram, SI nonCW-PLP ML, on dev07) • Indeed different error behaviors: good for system combo.

  19. Decoding Architecture [diagram: decoding flow involving the Aachen system, an MLP nonCW pass with qLM3, cross-adapted PLP CW+SAT+fMPE and MLP CW+SAT passes (MLLR, LM3), qLM4 adapt/rescore steps, and Confusion Network Combination]

  20. Topic-based LM Adaptation (NTU) • Training, per sentence: • 64 topics: q = (q1, q2, …, qm). • Topic(sentence) = k = argmax {q1, q2, …, qm}. • Train 64 topic-dependent (TD) 4-grams. • Testing, per utterance: • {w}: N-best confidence-based weighting + distance weighting. • Pick all TD 4-grams whose qi is above a threshold. • Interpolate with the topic-independent 4-gram (see the sketch below). • Rescore the N-best list.
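
A sketch of the decoding-time step described above: topics whose posterior exceeds a threshold contribute their topic-dependent 4-gram, the mixture is combined with the topic-independent LM, and the N-best list is rescored. The weighting scheme (renormalized posteriors plus a fixed weight on the topic-independent LM) and the callable LM interface are assumptions, not the authors' exact formulation.

```python
import math

# Sketch of topic-based LM interpolation at decoding time (slide 20).
# td_lms[i](ngram) and ti_lm(ngram) are assumed callables returning word
# probabilities; the 0.5 topic-independent weight is an assumption.

def interpolated_prob(ngram, q, td_lms, ti_lm, threshold=0.05, ti_weight=0.5):
    """q: topic posteriors (length 64) for the current utterance.
    td_lms: 64 topic-dependent 4-gram LMs; ti_lm: topic-independent 4-gram."""
    active = [(i, qi) for i, qi in enumerate(q) if qi > threshold]
    if not active:                       # no confident topic: fall back
        return ti_lm(ngram)
    z = sum(qi for _, qi in active)
    td_prob = sum((qi / z) * td_lms[i](ngram) for i, qi in active)
    return ti_weight * ti_lm(ngram) + (1.0 - ti_weight) * td_prob

def rescore_nbest(nbest, q, td_lms, ti_lm, lm_scale=1.0):
    """nbest: list of (acoustic_score, [4-gram tuples]) hypotheses.
    Returns hypotheses re-ranked by acoustic + interpolated LM score."""
    scored = []
    for ac, ngrams in nbest:
        lm = sum(math.log(interpolated_prob(g, q, td_lms, ti_lm)) for g in ngrams)
        scored.append((ac + lm_scale * lm, ngrams))
    return sorted(scored, key=lambda s: s[0], reverse=True)
```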

  21. CERs with different LMs (internal use)

  22. Topic-based LM Adaptation (NTU) • "q" (as in qLM3/qLM4) means "quick", i.e., tightly pruned. • Oracle CNC: 4.7%. Could it be a broken word sequence? Need to verify that with word perplexity and HTER.

  23. 2006 ASR System vs. 2007: CER on Eval07, a 37% relative improvement!!

  24. Eval07 BN ASR Error Distribution • 66 BN snippets (avg CER 3.4%) [chart: CER (%) vs. % of snippets, SRI system]

  25. Eval07 BC ASR Error Distribution • 53 BC snippets (avg CER 15.9%) [chart: CER (%) vs. % of snippets, SRI system]

  26. What Worked for Mandarin ASR? • MLP features • MPE • CW+SAT • fMPE • Improved acoustic segmentation, particularly for deletion errors. • CNC Rover.

  27. Small Help for ASR • Topic-dep. LM adaptation. • Outside regions for additional AM adaptation data. • A new phone set with diphthongs to offer different error behaviors. • Pitch input in tandem features. • Cross adaptation with Aachen. → A successful collaboration among 5 team members from 3 continents.

  28. Error Analysis on Extreme Cases • CER not directly related to HTER; genre matters. • Better CER does ease MT.

  29. Error Analysis • (a) worst BN: OOV names • (b) worst BC: overlapped speech • (c) best BN: composite sentences • (d) best BC: simple sentences with disfluency and re-starts.

  30. Error Analysis • OOV (especially names): problematic for both ASR and MT. • Overlapped speech → what to do? • Content-word mis-recognition (not all errors are equal!), e.g., 升值 (increase in value) → 甚至 (even). Parsing scores?

  31. Error Analysis • MT BN high errors • Composite syntax structure. • Syntactic parsing would be useful. • MT BC high errors • Overlapped speech • ASR high errors due to disfluency • Conjecture: MT on perfect BC ASR is easy, for its simple/short sentence structure

  32. Next ASR: Chinese OOV Org Names • Semi-automatic abbreviation generation for long words. • Segment a long word into a sequence of shorter words. • Extract the 1st character of each shorter word, e.g., World Health Organization → WHO (see the sketch below). • (Make sure they are in the MT translation table, too.)
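
A toy sketch of the first-character rule on the slide's World Health Organization example. The lookup-table segmenter is a hardcoded stand-in for the real word segmenter, and, as the slide says, generated candidates would still need semi-automatic vetting.

```python
# Sketch of abbreviation generation for long organization names (slide 32):
# segment the long word into shorter words, then take the first character of
# each.  The segmentation below is a hardcoded stand-in for a real segmenter.

def abbreviate(long_word, segmenter):
    parts = segmenter(long_word)
    return "".join(p[0] for p in parts)

# Toy segmenter: a lookup table standing in for the real word segmenter.
toy_segmentations = {
    "世界卫生组织": ["世界", "卫生", "组织"],   # World Health Organization
}
seg = lambda w: toy_segmentations.get(w, [w])

print(abbreviate("世界卫生组织", seg))  # -> 世卫组 (a candidate; still needs vetting)
```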

  33. Next ASR: Chinese OOV Personal Names • Mandarin has a high rate of homophones: 408 syllables → ~6000 common characters, about 14 homophonic characters per syllable!! • Given a spoken Chinese OOV name, there is no way to be sure which characters to use; but MT doesn't care anyway, as long as the syllables are correct!! • Recognizing repetition of the same name in the same snippet: CNC at the syllable level (see the sketch below). • Xu → {Chang, Cheng} → {Lin, Min, Ming}; Huang → Zhu → {Qin, Qi}. • After syllable CNC, apply the same name to all occurrences, in Pinyin.
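
A simplified sketch of the syllable-level consistency idea: align the repeated occurrences of a name slot by slot, vote per slot, and write the winning Pinyin sequence back to every occurrence. Plain counting here is a stand-in for posterior-weighted confusion-network combination, and the occurrence data is a toy example built from the slide's first name pattern.

```python
from collections import Counter

# Sketch of syllable-level consensus for a repeated OOV name (slide 33).
# Each occurrence is a list of per-slot syllable hypotheses; vote per slot
# across occurrences, then apply the winning sequence to all occurrences.

def consensus_name(occurrences):
    n_slots = len(occurrences[0])
    consensus = []
    for slot in range(n_slots):
        votes = Counter(occ[slot] for occ in occurrences)
        consensus.append(votes.most_common(1)[0][0])
    return consensus

# Toy example: three occurrences of the same 3-syllable name in one snippet.
occurrences = [
    ["xu", "chang", "lin"],
    ["xu", "cheng", "min"],
    ["xu", "chang", "lin"],
]
print(consensus_name(occurrences))  # -> ['xu', 'chang', 'lin'], applied everywhere
```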

  34. Next ASR: English OOV Names • English spelling in Lexicon, with (multiple) Mandarin pronunciations: • Bush /bu4 shi2/ or /bu4 xi1/ • Bin Laden /ben1 la1 deng1/ or /ben1 la1 dan1/ • John /yue1 han4/ • Sadr /sa4 de2 er3/ • Name mapping from MT? • Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, re-train n-gram.

  35. Next ASR: LM • LM adaptation with fine topics, each topic with a small vocabulary size. • Spontaneous speech: n-gram backtraces to content words in search or N-best? Text paring modeling? • 我想那(也)(也)也是 → 我想那也是 • I think it, (too), (too), is, too. → I think it is, too. (see the sketch below) • If optimizing CER, the STM reference needs to be designed such that disfluencies are optionally deletable.
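
A tiny sketch of the repeated-filler collapse illustrated by the example above: immediate repetitions of a token are removed so that 我想那也也也是 becomes 我想那也是. Whether such paring should happen in search, in N-best post-processing, or only via optionally deletable tokens in the scoring reference is exactly the question the slide leaves open; this is post-processing only.

```python
# Sketch: collapse immediate repetitions of a token, as in the slide's example
# 我 想 那 (也) (也) 也 是 -> 我 想 那 也 是.

def collapse_repeats(tokens):
    out = []
    for tok in tokens:
        if not out or tok != out[-1]:
            out.append(tok)
    return out

print("".join(collapse_repeats(list("我想那也也也是"))))  # -> 我想那也是
```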

  36. Next ASR: AM • Add explicit tone modeling (Lei07). • Prosody info: duration and pitch contour at the word level. • Various backoff schemes for infrequent words. • More understanding of why outside regions are not helping with AM adaptation. • Add SD MLLR regression trees (Mandal06). • Improve automatic speaker clustering: smaller clusters, better performance.

  37. ASR & MT Integration • Do we need to merge lexicons? ASR <= MT. • Do we need to use the same word segmenter? • Is word- or character-level CNC output better for MT? • Open questions and feedback!!!
