ADVANCES IN MANDARIN BROADCAST SPEECH RECOGNITION

M.Y. Hwang(1), W. Wang(2), X. Lei(1), J. Zheng(2), O. Cetin(3), G. Peng(1)
(1) Department of Electrical Engineering, University of Washington, Seattle, WA, USA
(2) SRI International, Menlo Park, CA, USA
(3) ICSI, Berkeley, CA, USA

Overview
• Goal: build a highly accurate Mandarin speech recognizer for broadcast news (BN) and broadcast conversations (BC).
• Improvements over the previous version:
  • Increased training data
  • Discriminative features
  • Frame-level discriminative training criterion
  • Multiple-pass AM adaptation
  • System combination
  • LM adaptation
• Overall: 24%-64% relative CER reduction.

Perplexity

Two Acoustic Models
1. MFCC, 39-dim, CW (cross-word) + SAT, fMPE+MPE, 3000 x 128 Gaussians.
2. MFCC + MPE-phoneme posterior (MLP) feature, 74-dim, nonCW, fMPE+MPE, 3000 x 64 Gaussians.

MLP features:
[1] Zheng et al., ICASSP 2007, "Combining discriminative feature, transform, and model training for large vocabulary speech recognition."
[2] Chen et al., ICSLP 2004, "Learning long-term temporal features in LVCSR using neural networks."

Character Error Rates

Increased Training Data
• Acoustic training data increased from 97 hours to 465 hours (2/3 BN, 1/3 BC).
• Training text increased from 420M words to 849M words; maximum-likelihood (ML) word segmentation is used for the Chinese text.
• Lexicon size increased from 49K to 60K words (including 1,700 English words).
• Six component LMs are trained and interpolated into one.
• Bigrams (qLM2), trigrams (LM3, qLM3), and 5-grams (LM5a, LM5b) are trained; LM5b uses count-based smoothing.

Test Data

LM Adaptation for BC (Dev05bc)
• LM_BN = (1)-(6) BN text + EARS Conversational Telephone Speech (159M words)
• LM_BC = (2) + (6) BC text
• LM_ALL = interpolation(LM_BN, LM_BC)
• LM_BN-C = LM_BN adapted on (2) GALE BC
• LM_BN' = LM_BN adapted dynamically by the first-pass hypothesis h, with one adaptation per show i so as to maximize the likelihood of h (weight estimation is sketched below).
• The entire recognition process is restarted after LM adaptation.
• The same strategy yields no improvement on the BN test data, since BN training text is already plentiful.

Decoding Architecture
• Acoustic segmentation, VTLN/CMN/CVN, speaker clustering.
1. First pass: nonCW MLP system decoded with qLM3, producing hypotheses h (used for MLLR adaptation and per-show LM adaptation).
2. CW MFCC system, MLLR-adapted, decoded with LM3; lattices rescored with LM5a and LM5b.
3. nonCW MLP system, MLLR-adapted, decoded with LM3; lattices rescored with LM5a and LM5b.
• Confusion network combination of the two rescored outputs (a combination sketch follows below).

Future Work
• Topic-dependent language model adaptation.
• Machine-translation (MT) targeted error rates.
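The per-show LM adaptation described above amounts to re-estimating linear interpolation weights over the component LMs so that the interpolated LM maximizes the likelihood of the first-pass hypothesis h. The Python sketch below shows that weight estimation with EM; the function names and the toy per-token probabilities are illustrative assumptions and are not taken from the poster.

import math

def em_interpolation_weights(word_probs, num_iters=20):
    """Estimate linear-interpolation weights for component LMs via EM.

    word_probs: one tuple per word token of the adaptation text (e.g. the
        first-pass hypothesis h of a show); each tuple holds the probability
        that each component LM assigns to that token.
    Returns weights that (locally) maximize the likelihood of the adaptation
    text under the interpolated LM.
    """
    num_lms = len(word_probs[0])
    weights = [1.0 / num_lms] * num_lms          # start from uniform weights
    for _ in range(num_iters):
        counts = [0.0] * num_lms
        for probs in word_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            if mix <= 0.0:
                continue
            # E-step: posterior responsibility of each component LM for this token
            for i, (w, p) in enumerate(zip(weights, probs)):
                counts[i] += w * p / mix
        total = sum(counts)
        # M-step: re-normalize the expected counts into new weights
        weights = [c / total for c in counts]
    return weights

def interpolated_logprob(word_probs, weights):
    """Log-likelihood of the adaptation text under the interpolated LM."""
    return sum(math.log(sum(w * p for w, p in zip(weights, probs)))
               for probs in word_probs)

if __name__ == "__main__":
    # Toy example: per-token probabilities from two hypothetical component LMs
    # (in practice these would be LM_BN and LM_BC scored on hypothesis h).
    h_probs = [(0.02, 0.05), (0.10, 0.01), (0.03, 0.04), (0.07, 0.02)]
    lambdas = em_interpolation_weights(h_probs)
    print("weights:", lambdas)
    print("log-likelihood:", interpolated_logprob(h_probs, lambdas))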
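The decoding architecture ends with confusion network combination of the two rescored system outputs. The sketch below covers only the final weighted-voting step and assumes the systems' confusion networks have already been aligned bin by bin; the alignment itself, which is the hard part of real confusion network combination, is omitted, and all names and toy posteriors are hypothetical.

from collections import defaultdict

def combine_confusion_networks(networks, system_weights=None):
    """Weighted-voting combination of pre-aligned confusion networks.

    networks: one confusion network per system; each network is a list of
        bins, and each bin is a dict mapping a word (or "" for a null arc)
        to its posterior probability in that system.
    All networks are assumed to be aligned so that bin j in one system
    corresponds to bin j in the others.
    Returns the combined 1-best word sequence.
    """
    if system_weights is None:
        system_weights = [1.0 / len(networks)] * len(networks)

    num_bins = len(networks[0])
    output = []
    for j in range(num_bins):
        scores = defaultdict(float)
        for net, weight in zip(networks, system_weights):
            for word, posterior in net[j].items():
                scores[word] += weight * posterior
        best = max(scores, key=scores.get)
        if best:                       # skip the null (deletion) hypothesis
            output.append(best)
    return output

if __name__ == "__main__":
    # Toy example with two hypothetical systems and three aligned bins.
    sys_a = [{"今天": 0.8, "进店": 0.2}, {"天气": 0.6, "": 0.4}, {"很好": 1.0}]
    sys_b = [{"今天": 0.9, "": 0.1}, {"天气": 0.9, "电器": 0.1}, {"很好": 0.7, "很久": 0.3}]
    print(" ".join(combine_confusion_networks([sys_a, sys_b])))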