Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition Konstantin Markov and Satoshi Nakamura ATR Spoken Language Communication Research Laboratories (ATR-SLC) Kyoto, Japan SRIV2006, Toulouse, France
Outline
• Motivation and previous studies.
• HMM-based accent acoustic modeling.
• Hybrid HMM/BN acoustic model for accented speech.
• Evaluation and results.
• Conclusion.
Motivation and Previous Studies
• Accent variability:
  • Causes performance degradation due to the mismatch between training and testing conditions.
  • Is becoming a major factor for public ASR applications.
• Differences due to accent variability are mainly:
  • Phonetic:
    • Lexicon modification (Liu, ICASSP'98).
    • Accent-dependent dictionary (Humphries, ICASSP'98).
  • Acoustic (addressed in this work):
    • Pooled-data HMM (Chengalvarayan, Eurospeech'01).
    • Accent identification (Huang, ICASSP'05).
HMM-based approaches (1)
• Accent-dependent data (A, B, C) are pooled and used to train a single multi-accent acoustic model (MA-HMM).
• Decoding: input speech → Feature Extraction → Decoder → recognition result.
HMM-based approaches (2)
• Accent-dependent data (A, B, C) train separate accent-dependent HMMs (A-HMM, B-HMM, C-HMM), which are combined into a parallel acoustic model (PA-HMM).
• Decoding: input speech → Feature Extraction → Decoder → recognition result.
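The parallel-model idea can be sketched in a few lines: the utterance is decoded against each accent-dependent model and the best-scoring hypothesis wins. This is an illustrative sketch, not the paper's code; the function name, the score values, and the word sequences are all made up.

```python
# Sketch of the PA-HMM decoding idea: score the utterance with each
# accent-dependent model in parallel, keep the best-scoring result.
def pick_parallel_result(hypotheses):
    """hypotheses: dict mapping accent -> (log_likelihood, word_sequence)."""
    best = max(hypotheses, key=lambda a: hypotheses[a][0])
    return best, hypotheses[best][1]

# Illustrative, made-up log-likelihoods for one utterance:
hyps = {
    "US":  (-1234.5, ["i", "would", "like", "a", "room"]),
    "BRT": (-1290.2, ["i", "would", "like", "a", "room"]),
    "AUS": (-1310.7, ["i", "would", "like", "a", "room"]),
}
accent, words = pick_parallel_result(hyps)  # accent == "US"
```

In practice the selection is implicit in the decoder's search rather than an explicit post-hoc comparison, but the effect is the same: the accent model that best fits the input dominates the result.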
HMM-based approaches (3)
• Accent-dependent data (A, B, C) train gender-dependent HMMs (M-HMM, F-HMM), which are combined into a parallel gender-dependent acoustic model (GD-HMM).
• Decoding: input speech → Feature Extraction → Decoder → recognition result.
Hybrid HMM/BN Background
• HMM/BN structure:
  • HMM at the top level: models the temporal characteristics of speech by state transitions (states q1, q2, q3).
  • Bayesian Network (BN) at the bottom level: represents the state PDFs.
• Simple BN example with three variables: Q (HMM state), M (mixture component index), and X (observation).
  • State PDF: a mixture of Gaussians, one component per value of M.
  • State output probability: if M is hidden, it is marginalized out:
    P(x|q) = Σ_m P(M=m|q) P(x|M=m).
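The marginalization over the hidden mixture index M reduces to an ordinary Gaussian mixture. A minimal one-dimensional sketch (toy weights, means, and variances chosen for illustration):

```python
import math

def gauss_pdf(x, mean, var):
    # One-dimensional Gaussian density N(x; mean, var).
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def state_output_prob(x, weights, means, variances):
    # P(x|q) = sum_m P(M=m|q) * N(x; mu_m, var_m), with M hidden.
    return sum(w * gauss_pdf(x, mu, v)
               for w, mu, v in zip(weights, means, variances))

# Toy two-component state PDF (illustrative numbers):
p = state_output_prob(0.5, weights=[0.6, 0.4],
                      means=[0.0, 1.0], variances=[1.0, 1.0])
```

The real model uses multivariate Gaussians over the 25-dimensional feature vectors, but the marginalization has exactly this shape.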
HMM/BN based Accent Model
• Accent (A) and gender (G) are modeled as additional variables of the BN.
• The BN topology includes:
  • G = {F, M}
  • A = {A, B, C}
• When G and A are hidden, the state output probability marginalizes over both:
  P(x|q) = Σ_g Σ_a P(x|q, g, a) P(g, a|q).
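The double marginalization over hidden gender and accent can be sketched as a weighted sum over all (g, a) combinations. The conditional PDF table and priors below are hypothetical placeholders, not the paper's actual parameters:

```python
# Sketch of P(x|q) = sum_g sum_a P(x|q,g,a) * P(g,a|q) with G and A hidden.
def marginal_output_prob(x, cond_pdfs, priors):
    """cond_pdfs[(g, a)] is a callable returning P(x|q,g,a);
    priors[(g, a)] is P(g,a|q).  Both interfaces are hypothetical."""
    return sum(priors[key] * cond_pdfs[key](x) for key in cond_pdfs)

genders, accents = ["F", "M"], ["A", "B", "C"]
uniform = 1.0 / (len(genders) * len(accents))
priors = {(g, a): uniform for g in genders for a in accents}
# Constant dummy densities, one per (gender, accent) combination:
cond_pdfs = {key: (lambda x: 0.25) for key in priors}
p = marginal_output_prob(None, cond_pdfs, priors)  # 6 * (1/6) * 0.25 = 0.25
```

With two genders and three accents, each state output probability is a sum of six conditional terms; in the trained model the priors are estimated from the labelled data rather than being uniform.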
HMM/BN Training
• Initial conditions:
  • Bootstrap HMM: gives the (tied) state structure.
  • Labelled data: each feature vector has an accent and gender label.
• Training algorithm:
  • Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
  • Step 2: Initialization of BN parameters.
  • Step 3: Forward-Backward based embedded HMM/BN training.
  • Step 4: If the convergence criterion is met, stop; otherwise go to Step 3.
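The four steps form a standard align-initialize-iterate loop. A generic driver sketch, where `align`, `init`, and `em_step` are supplied callables standing in for the real operations (Viterbi alignment, BN initialization, Forward-Backward training), which are far too involved to reproduce here:

```python
# Skeleton of the four-step training procedure described above.
def train_hmm_bn(align, init, em_step, data, max_iters=50, tol=1e-4):
    labels = align(data)                    # Step 1: Viterbi alignment
    params = init(labels, data)             # Step 2: BN parameter init
    prev_ll = float("-inf")
    for _ in range(max_iters):
        params, ll = em_step(params, data)  # Step 3: embedded EM update
        if ll - prev_ll < tol:              # Step 4: convergence check
            break
        prev_ll = ll
    return params

# Toy usage: an "EM step" whose likelihood gain shrinks each iteration.
def toy_em(params, data):
    new = (params + 1.0) / 2.0        # parameter moves toward 1.0
    return new, -abs(1.0 - new)       # pseudo log-likelihood improves
final = train_hmm_bn(lambda d: None, lambda l, d: 0.0, toy_em, data=None)
```

The convergence test on the log-likelihood gain mirrors the usual stopping rule for EM-style training; the `tol` and `max_iters` values here are arbitrary.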
HMM/BN approach
• Accent- and gender-dependent data (A(M), B(M), C(M), A(F), B(F), C(F)) train a single HMM/BN acoustic model.
• Decoding: input speech → Feature Extraction → Decoder → recognition result.
Comparison of state distributions (figure comparing the state distributions of the MA-HMM, PA-HMM, GD-HMM, and HMM/BN models; not reproduced).
Database and speech pre-processing
• Database:
  • Accents: American (US), British (BRT), Australian (AUS).
  • Speakers: 100 per accent (90 for training + 10 for evaluation).
  • 300 utterances per speaker.
  • Speech material is the same for each accent: travel-arrangement dialogs.
• Speech feature extraction:
  • 20 ms frames at a 10 ms rate.
  • 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
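The framing arithmetic implied by these front-end settings (20 ms analysis windows advanced every 10 ms) is worth making explicit; this small helper is for illustration only:

```python
# Number of complete analysis frames for 20 ms windows at a 10 ms shift.
def num_frames(duration_ms, frame_ms=20, shift_ms=10):
    if duration_ms < frame_ms:
        return 0
    return 1 + (duration_ms - frame_ms) // shift_ms

n = num_frames(3000)  # a 3-second utterance yields 299 frames
```

Each of those frames is then mapped to one 25-dimensional feature vector (12 MFCC + 12 ΔMFCC + ΔE).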
Models
• Acoustic models:
  • All HMM-based AMs have:
    • Three states, left-to-right, triphone contexts.
    • 3,275 states (MDL-SSS).
    • Variants with 6, 18, 30, and 42 total Gaussians per state.
  • HMM/BN model:
    • Same state structure as the HMM models.
    • Same number of Gaussian components.
• Language model:
  • Bi-gram and tri-gram (600,000 training sentences).
  • 35,000-word vocabulary.
  • Test data perplexity: 116.5 and 27.8.
• Pronunciation lexicon: American English.
Evaluation results (results chart not reproduced).
Evaluation results: word accuracies (%), all models with a total of 42 Gaussians per state (table not reproduced).
Conclusions
• In the matched-accent case, accent-dependent models are the best choice.
• Among the multi-accent models, the HMM/BN performs best, almost matching the accent-dependent models, but it requires more mixture components.
• The multi-accent HMM is the most efficient in terms of the performance/complexity trade-off.
• The different performance levels of the accent-dependent models are apparently caused by phonetic differences between the accents.