Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition

Konstantin Markov and Satoshi Nakamura
ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan
SRIV2006, Toulouse, France

Outline
• Motivation and previous studies.
• HMM-based accent acoustic modeling.
• Hybrid HMM/BN acoustic model for accented speech.
• Evaluation and results.
• Conclusion.

Motivation and Previous Studies
• Accent variability:
  • Causes performance degradation due to the mismatch between training and testing conditions.
  • Becomes a major factor in public ASR applications.
• Differences due to accent variability are mainly:
  • Phonetic:
    • Lexicon modification (Liu, ICASSP’98).
    • Accent-dependent dictionary (Humphries, ICASSP’98).
  • Acoustic (addressed in this work):
    • Pooled-data HMM (Chengalvarayan, Eurospeech’01).
    • Accent identification (Huang, ICASSP’05).

HMM-based approaches (1)
• Accent-dependent data: A, B, C.
• Pooled data: A,B,C.
• Multi-accent AM: MA-HMM.
• Recognition: input speech → Feature Extraction → Decoder → recognition result.

HMM-based approaches (2)
• Accent-dependent data: A, B, C.
• Accent-dependent HMMs: A-HMM, B-HMM, C-HMM.
• Parallel AM: PA-HMM.
• Recognition: input speech → Feature Extraction → Decoder → recognition result.

HMM-based approaches (3)
• Accent-dependent data: A, B, C.
• Gender-dependent HMMs: M-HMM, F-HMM.
• Parallel AM: GD-HMM.
• Recognition: input speech → Feature Extraction → Decoder → recognition result.

Hybrid HMM/BN Background
• HMM/BN structure:
  • HMM at the top level: models the temporal characteristics of speech through state transitions.
  • BN at the bottom level: represents the state PDFs.
• BN topologies:
  • Simple BN example (figure): HMM states q1, q2, q3 at the top; a Bayesian network with nodes Q (HMM state), M (mixture component index), and X (observation) below.
  • Formulas shown on the slide (not preserved in this transcript): the state PDF, the state output probability, and its form when M is hidden.
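The formulas for the simple BN example can be sketched as follows. This is a reconstruction under stated assumptions, not a copy of the slide: it assumes the BN has edges Q → M, Q → X, and M → X, and that p(x | m, q) is a Gaussian density. Under those assumptions, hiding M yields the familiar Gaussian-mixture state output probability of a conventional HMM.

```latex
\begin{align}
  % State PDF: joint distribution of X and M given the HMM state Q = q
  P(x, m \mid q) &= P(m \mid q)\, p(x \mid m, q) \\
  % State output probability when M is hidden: marginalize over m
  b_q(x) = p(x \mid q) &= \sum_{m} P(m \mid q)\, p(x \mid m, q)
\end{align}
```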

HMM/BN-based Accent Model
• Accent and gender are modeled as additional variables of the BN.
• The BN topology (figure not preserved in this transcript):
  • G = {F, M}
  • A = {A, B, C}
• When G and A are hidden: (formula not preserved in this transcript)
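When G and A are hidden, the state output probability has to sum them out. The sketch below illustrates only that marginalization, p(x | q) = Σ_g Σ_a P(g, a | q) p(x | g, a, q), which follows from the law of total probability. Everything else is an assumption made for illustration: the scalar feature, the single Gaussian per (gender, accent) pair, the uniform prior, and all numeric values. It does not reproduce the BN topology from the slide.

```python
import math

GENDERS = ("F", "M")        # G = {F, M}
ACCENTS = ("A", "B", "C")   # A = {A, B, C}


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def state_output_prob(x, prior, params):
    """p(x | q) = sum over (g, a) of P(g, a | q) * p(x | g, a, q)."""
    total = 0.0
    for g in GENDERS:
        for a in ACCENTS:
            mean, var = params[(g, a)]
            total += prior[(g, a)] * gaussian_pdf(x, mean, var)
    return total


# Hypothetical values for a single state q (not taken from the presentation).
PRIOR = {(g, a): 1.0 / 6.0 for g in GENDERS for a in ACCENTS}
PARAMS = {
    ("F", "A"): (1.0, 1.0), ("F", "B"): (1.5, 1.2), ("F", "C"): (0.8, 0.9),
    ("M", "A"): (-1.0, 1.0), ("M", "B"): (-0.5, 1.1), ("M", "C"): (-1.2, 0.8),
}

if __name__ == "__main__":
    print(state_output_prob(0.3, PRIOR, PARAMS))
```

If the accent or gender of a speaker were known, the same function could be called with a prior concentrated on the matching (g, a) entries instead of a uniform one.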

HMM/BN Training
• Initial conditions:
  • Bootstrap HMM: gives the (tied) state structure.
  • Labelled data: each feature vector has an accent label and a gender label.
• Training algorithm:
  Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
  Step 2: Initialization of the BN parameters.
  Step 3: Forward-Backward based embedded HMM/BN training.
  Step 4: If the convergence criterion is met, stop; otherwise, go to Step 3.
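The control flow of Steps 3 and 4 (re-estimate, check convergence, repeat) can be sketched with a much simpler stand-in. The example below runs EM on a 1-D Gaussian mixture instead of Forward-Backward embedded HMM/BN training, so it shows only the iterate-until-converged pattern, not the authors' training procedure. The data, the initial parameters, the log-likelihood threshold `tol`, and all function names are hypothetical.

```python
import math


def gmm_em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture.

    Returns the re-estimated parameters and the log-likelihood of the data
    under the *input* parameters.
    """
    k = len(weights)
    resp_sum, x_sum, xx_sum = [0.0] * k, [0.0] * k, [0.0] * k
    loglik = 0.0
    for x in data:
        joint = [
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        ]
        total = sum(joint)
        loglik += math.log(total)
        for i in range(k):
            r = joint[i] / total          # posterior of component i for x
            resp_sum[i] += r
            x_sum[i] += r * x
            xx_sum[i] += r * x * x
    new_w = [s / len(data) for s in resp_sum]
    new_m = [x_sum[i] / resp_sum[i] for i in range(k)]
    new_v = [max(xx_sum[i] / resp_sum[i] - new_m[i] ** 2, 1e-6) for i in range(k)]
    return (new_w, new_m, new_v), loglik


def train(data, params, tol=1e-6, max_iter=200):
    """Repeat the re-estimation step until the log-likelihood gain falls below tol."""
    prev = -math.inf
    history = []
    for _ in range(max_iter):
        params, loglik = gmm_em_step(data, *params)   # analogue of Step 3
        history.append(loglik)
        if loglik - prev < tol:                       # analogue of Step 4
            break
        prev = loglik
    return params, history


# Hypothetical toy data and initial parameters (analogue of Steps 1 and 2).
DATA = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
INIT = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])

if __name__ == "__main__":
    (w, m, v), history = train(DATA, INIT)
    print(m, len(history))
```

A log-likelihood threshold is only one possible convergence criterion; a fixed iteration count or accuracy on held-out data would fit the same loop.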

HMM/BN approach
• Accent-dependent and gender-dependent data: A(M), B(M), C(M), A(F), B(F), C(F).
• HMM/BN AM: HMM/BN.
• Recognition: input speech → Feature Extraction → Decoder → recognition result.

Comparison of state distributions
• Figure panels: MA-HMM, PA-HMM, GD-HMM, HMM/BN.

Database and speech pre-processing
• Database:
  • Accents:
    • American (US).
    • British (BRT).
    • Australian (AUS).
  • Speakers / utterances:
    • 100 speakers per accent (90 for training + 10 for evaluation).
    • 300 utterances per speaker.
    • Same speech material for each accent.
    • Travel arrangement dialogs.
• Speech feature extraction:
  • 20 ms frames at a 10 ms rate.
  • 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
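The database and framing figures above imply some simple arithmetic, sketched here. The utterance counts follow directly from the slide. The frame-count formula assumes only full frames are kept with no padding at the signal edges, and the 3-second utterance is a hypothetical example; neither comes from the presentation.

```python
def num_frames(duration_ms, frame_ms=20, shift_ms=10):
    """Number of full frames in a signal, assuming no padding at the edges."""
    if duration_ms < frame_ms:
        return 0
    return 1 + (duration_ms - frame_ms) // shift_ms


# Feature vector layout from the slide: 12 MFCC + 12 delta-MFCC + delta-energy.
N_MFCC, N_DELTA_MFCC, N_DELTA_ENERGY = 12, 12, 1
FEATURE_DIM = N_MFCC + N_DELTA_MFCC + N_DELTA_ENERGY

# Database layout from the slide.
ACCENTS = ("US", "BRT", "AUS")
TRAIN_SPEAKERS, EVAL_SPEAKERS = 90, 10
UTTS_PER_SPEAKER = 300

TRAIN_UTTS_PER_ACCENT = TRAIN_SPEAKERS * UTTS_PER_SPEAKER
EVAL_UTTS_PER_ACCENT = EVAL_SPEAKERS * UTTS_PER_SPEAKER
TRAIN_UTTS_TOTAL = TRAIN_UTTS_PER_ACCENT * len(ACCENTS)
EVAL_UTTS_TOTAL = EVAL_UTTS_PER_ACCENT * len(ACCENTS)

if __name__ == "__main__":
    print("feature dimension:", FEATURE_DIM)
    print("training utterances per accent:", TRAIN_UTTS_PER_ACCENT)
    print("evaluation utterances per accent:", EVAL_UTTS_PER_ACCENT)
    # Hypothetical 3-second utterance, used only to illustrate the framing.
    print("frames in a 3 s utterance:", num_frames(3000))
```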

Models
• Acoustic models:
  • All HMM-based AMs have:
    • Three states, left-to-right, triphone contexts.
    • 3,275 states (MDL-SSS).
    • Variants with 6, 18, 30, and 42 total Gaussians per state.
  • HMM/BN model:
    • Same state structure as the HMM models.
    • Same number of Gaussian components.
• Language model:
  • Bigram and trigram (600,000 training sentences).
  • 35,000-word vocabulary.
  • Test data perplexity: 116.5 and 27.8.
• Pronunciation lexicon: American English.
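To give a rough sense of model size, the sketch below counts Gaussians and free parameters for each variant. The state count, the per-state Gaussian totals, and the 25-dimensional features come from the slides. The diagonal covariance and the even split of the per-state total across parallel models (for example, three accent-dependent HMMs) are assumptions made here for illustration and are not stated in the presentation.

```python
N_STATES = 3275                      # from the slide (MDL-SSS state tying)
FEATURE_DIM = 25                     # 12 MFCC + 12 delta-MFCC + delta-energy
GAUSSIANS_PER_STATE = (6, 18, 30, 42)


def total_gaussians(per_state, n_states=N_STATES):
    """Total number of Gaussians in the acoustic model."""
    return per_state * n_states


def diag_gaussian_params(n_gaussians, dim=FEATURE_DIM):
    """Means + variances + one mixture weight per Gaussian.

    Assumes diagonal covariance matrices (an assumption, not from the slides).
    """
    return n_gaussians * (2 * dim + 1)


def per_model_share(total_per_state, n_parallel_models):
    """Gaussians per state in each parallel model, assuming an even split."""
    share, remainder = divmod(total_per_state, n_parallel_models)
    if remainder:
        raise ValueError("total does not split evenly across the models")
    return share


if __name__ == "__main__":
    for per_state in GAUSSIANS_PER_STATE:
        n = total_gaussians(per_state)
        print(per_state, n, diag_gaussian_params(n))
```

Under the even-split assumption, a total of 42 Gaussians per state would correspond to `per_model_share(42, 3)` components in each of three accent-dependent HMMs.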

Evaluation results

Evaluation results
Word accuracies (%), all models with a total of 42 Gaussians per state.

Conclusions
• In the matched accent case, accent-dependent models are the best choice.
• The HMM/BN is the best, almost matching the results of accent-dependent models, but requires more mixture components.
• The multi-accent HMM is the most efficient in terms of performance and complexity.
• The different performance levels of the accent-dependent models are apparently caused by phonetic differences between the accents.
