Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang, Xin Lei, Wen Wang*, Takahiro Shinozaki University of Washington, *SRI 9/19/2006, Interspeech, Pittsburgh
Outline • The task • Text training data and language modeling • Acoustic training data and acoustic modeling • Decoding structure • Experimental results • Recent progress and future direction
The Task • Mandarin broadcast news (BN) transcription • Mainland Mandarin speech • TV/radio programs in China, USA • CCTV中央电视台 • NTDTV 新唐人电视台 • PHOENIX TV 凤凰卫视 • VOA 美国之音 • RFA 自由亚洲电台 • CNR 中国广播网
Text Training Data • LM1: • 1997 Mandarin BN Hub4 transcriptions • Chinese TDT2,3,4 • Multiple-translation Chinese (MTC) corpus, parts 1, 2, 3 • LM2: Gigaword XIN 2001-2004 (China) • LM3: Gigaword ZBN 2001-2004 (Singapore) • LM4: Gigaword CNA 2001-2004 (Taiwan) • Altogether 420M words. • The 4 LMs are interpolated
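The slide says only that the four LMs are interpolated; it does not say how the weights were chosen. A minimal sketch, assuming linear interpolation with weights estimated by EM on held-out text (a common choice, not confirmed by the talk). The function names and the held-out probabilities are hypothetical toy values.

```python
import math

def em_interpolation_weights(component_probs, iterations=50):
    """Estimate mixture weights for linearly interpolated LMs with EM.

    component_probs[t][k] is the probability that component LM k assigns
    to held-out token t given its history (toy numbers, not from the talk).
    """
    num_lms = len(component_probs[0])
    weights = [1.0 / num_lms] * num_lms  # start from uniform weights
    for _ in range(iterations):
        counts = [0.0] * num_lms
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for k in range(num_lms):
                # Posterior that token t was generated by component k.
                counts[k] += weights[k] * probs[k] / mix
        total = sum(counts)
        weights = [c / total for c in counts]
    return weights

def perplexity(component_probs, weights):
    """Perplexity of the interpolated LM on the held-out tokens."""
    log_sum = sum(
        math.log(sum(w * p for w, p in zip(weights, probs)))
        for probs in component_probs
    )
    return math.exp(-log_sum / len(component_probs))

# Hypothetical per-token probabilities from LM1..LM4 on four held-out tokens.
heldout = [
    [0.20, 0.05, 0.01, 0.02],
    [0.10, 0.30, 0.02, 0.01],
    [0.25, 0.02, 0.03, 0.05],
    [0.15, 0.10, 0.01, 0.01],
]
weights = em_interpolation_weights(heldout)
uniform = [0.25] * 4
print("weights:", [round(w, 3) for w in weights])
print("perplexity (EM weights):", perplexity(heldout, weights))
print("perplexity (uniform):   ", perplexity(heldout, uniform))
```

Because EM never decreases the held-out likelihood, the tuned weights should give a perplexity no higher than the uniform starting point.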
Chinese Word Segmentation • BBN 64k-word lexicon, derived from LDC • Longest-first match with the 64k-lexicon • Choose most frequent 49k words as new lexicon • Train n-gram • Use unigram part to re-do word segmentation based on the ML path
Chinese Word Segmentation • Longest-first • 民进党/和亲/民党… • The Green Party made peace with the Min Party via marriage… • Maximum-likelihood • 民进党/和/亲民党… • The Green Party and the Qin-Min Party...
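The two segmentation strategies can be contrasted on the slide's own example. This is a minimal sketch, not the authors' implementation: the five-word lexicon and its unigram probabilities are made up for illustration, and unknown single characters get a hypothetical fixed penalty (`oov_logp`).

```python
import math

# Toy unigram probabilities (made up for illustration, not from the talk).
UNIGRAM = {"民进党": 0.02, "亲民党": 0.01, "和": 0.05, "和亲": 0.0001, "民党": 0.0001}
MAX_LEN = max(len(w) for w in UNIGRAM)

def longest_first(text):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            # Fall back to a single character when nothing in the lexicon matches.
            if text[i:i + n] in UNIGRAM or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words

def max_likelihood(text, oov_logp=-20.0):
    """Viterbi search for the segmentation with the highest unigram likelihood."""
    # best[i] = (log prob of the best segmentation of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = text[start:end]
            if word in UNIGRAM:
                logp = math.log(UNIGRAM[word])
            elif len(word) == 1:
                logp = oov_logp  # hypothetical penalty for an unknown character
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the backpointers from the end of the string.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

sentence = "民进党和亲民党"
print("/".join(longest_first(sentence)))   # 民进党/和亲/民党
print("/".join(max_likelihood(sentence)))  # 民进党/和/亲民党
```

The greedy matcher commits to 和亲 because it is the longest match at that position, while the likelihood search prefers the frequent words 和 and 亲民党.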
Perplexity • 49k-word lexicon
Acoustic Training Data *auto selection via a flexible alignment with closed caption
Acoustic Feature Representation • 39-dim MFCC cepstra + Δ + ΔΔ • 3-dim pitch + Δ + ΔΔ • Auto speaker clustering • VTLN per auto speaker • Speaker-based CMN+CVN for training
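A minimal sketch of two of the steps above: appending first- and second-order difference features, then normalizing each dimension to zero mean and unit variance over one speaker's frames (CMN+CVN). The slide does not give the delta formula, so a simple central difference with replicated edges is assumed here; the two-dimensional toy frames stand in for real cepstra and pitch.

```python
import math

def add_deltas(frames):
    """Append delta and delta-delta features to every frame.

    Assumes a central difference with replicated edge frames; the actual
    regression window used in the system is not stated on the slide.
    """
    def delta(seq):
        out = []
        for t in range(len(seq)):
            prev = seq[max(t - 1, 0)]
            nxt = seq[min(t + 1, len(seq) - 1)]
            out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
        return out

    d = delta(frames)
    dd = delta(d)
    return [f + a + b for f, a, b in zip(frames, d, dd)]

def cmvn(frames):
    """Zero-mean, unit-variance normalization per dimension over one speaker."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    # Guard against a constant dimension by falling back to a scale of 1.0.
    std = [
        math.sqrt(sum((f[k] - mean[k]) ** 2 for f in frames) / n) or 1.0
        for k in range(dim)
    ]
    return [[(f[k] - mean[k]) / std[k] for k in range(dim)] for f in frames]

# Toy static features for one automatically clustered speaker (hypothetical values).
static = [[1.0, 10.0], [2.0, 14.0], [4.0, 12.0], [3.0, 16.0]]
normalized = cmvn(add_deltas(static))
print(len(normalized[0]), "dimensions per frame")
```

In a real front end the statistics would be accumulated over all frames assigned to the same speaker cluster rather than over four frames.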
Acoustic Models • 2500 senones (clustered states) × 32 Gaussians • ML training vs. MPE training with phone lattices • Gender-independent • nonCW vs. CW triphones • Speaker-adaptive training (SAT): $\mathcal{N}(x;\, A\mu + b,\, A\Sigma A^{T}) = |A|^{-1}\,\mathcal{N}(A^{-1}(x - b);\, \mu,\, \Sigma)$ • Linear transformation $A^{-1}x + (-A^{-1}b)$ applied in the feature domain.
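The SAT identity says that transforming the Gaussian's mean and covariance is equivalent, up to the Jacobian factor, to applying the inverse transform to the feature. A quick numerical check in one dimension, where the matrix A reduces to a scalar `a`; all values are hypothetical and chosen only for illustration.

```python
import math

def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

a, b = 1.7, -0.4      # hypothetical per-speaker transform (scalar case of A and b)
mean, var = 0.3, 2.0  # hypothetical speaker-independent Gaussian
x = 1.25              # an observed feature value

# Model-domain view: transform the Gaussian parameters.
model_side = gaussian(x, a * mean + b, a * var * a)

# Feature-domain view: transform the observation and scale by the Jacobian 1/|a|.
feature_side = gaussian((x - b) / a, mean, var) / abs(a)

print(model_side, feature_side)
```

The two values agree to floating-point precision, which is why the transform can be applied once to the features instead of to every Gaussian in the model.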
2-Pass Search Architecture • Search 1: nonCW, nonSAT, ML model + small bigram → hypothesis • Adaptation on the hypothesis: SAT, MLLR • Search 2: CW, SAT, MPE model + big 4-gram → final word sequence
More Recent Progress • Added more acoustic (440 hrs) and text training data (840M words). • Increased and improved lexicon (60k words). • fMPE training. • Added ICSI feature as a second system. • 5-gram LM. • Between the MFCC system and the ICSI system: • Cross adaptation • ROVER • 3.7% on dev04, 12.1% on eval04. • Submitted to ICASSP 2007
Challenges • Channel compensation • Conversational speech • Overlapped speech • Speech with music background • Commercials • Language ID (in addition to English) • Is CER the best measurement for MT?
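For reference, the character error rate (CER) raised in the last bullet is usually computed as the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below assumes that standard Levenshtein definition; the slides do not describe the scoring tool that was used, and the hypothesis string is an invented example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

reference = "民进党和亲民党"
hypothesis = "民进党和新民党"  # invented recognition output with one wrong character
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because the score is computed over characters, it does not depend on how either string is segmented into words, which is part of why it may not reflect what matters to a downstream translation system.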