
Boosting HMM acoustic models in large vocabulary speech recognition


Presentation Transcript


  1. Boosting HMM acoustic models in large vocabulary speech recognition Carsten Meyer, Hauke Schramm Philips Research Laboratories, Germany SPEECH COMMUNICATION 2006

  2. AdaBoost introduction • The AdaBoost algorithm was introduced for transforming a “weak” learning rule into a “strong” one • The basic idea is to train a series of classifiers, each based on the classification performance of the previous classifier on the training data • In multi-class classification, a popular variant is the AdaBoost.M2 algorithm • AdaBoost.M2 is applicable whenever a mapping $h_t(x, y) \in [0, 1]$ can be defined for classifier $t$ which is related to the classification criterion

  3. AdaBoost.M2 (Freund and Schapire, 1997)
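For reference, the AdaBoost.M2 procedure of Freund and Schapire (1997), on which the paper builds, is in outline: initialize a distribution $D_1(i, y) = \frac{1}{m(|Y|-1)}$ over all pairs of a training pattern $i$ and an incorrect label $y \neq y_i$ ($m$ training patterns, label set $Y$). For $t = 1, \dots, T$: train classifier $t$ on the weighted data and obtain a mapping $h_t(x, y) \in [0, 1]$; compute the pseudo-loss $\epsilon_t = \tfrac{1}{2} \sum_{i,\, y \neq y_i} D_t(i, y)\,\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr)$ and set $\beta_t = \epsilon_t / (1 - \epsilon_t)$; update $D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\, \beta_t^{\frac{1}{2}(1 + h_t(x_i, y_i) - h_t(x_i, y))}$, with $Z_t$ a normalization constant. The combined classifier is $h_f(x) = \arg\max_y \sum_t \log(1/\beta_t)\, h_t(x, y)$.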

  4. AdaBoost introduction • The update rule is designed to guarantee an upper bound on the training error of the combined classifier which decreases exponentially with the number of individual classifiers • In multi-class problems, the label weights are summed up to give a single weight for each training pattern (see the formula below)
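In the notation above, summing the AdaBoost.M2 distribution over the incorrect labels gives the per-pattern weight

$w_t(i) = \sum_{y \neq y_i} D_t(i, y),$

which is the form of utterance weight used for the boosted acoustic model training described below.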

  5. Introduction • Why are there only a few studies so far applying boosting to acoustic model training? • Speech recognition is an extremely complex, large-scale classification problem • The main motivation for applying AdaBoost to speech recognition is its theoretical foundation, which provides explicit bounds on the training error and, in terms of margins, on the generalization error

  6. Introduction • In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector into a phoneme symbol [ICASSP04][Dimitrakakis] • This requires phoneme posterior probabilities for each frame • The problem is that conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors • Hence the frame-level boosting approach cannot be applied straightforwardly

  7. Utterance approach for boosting in ASR • An intuitive way of applying boosting to HMM speech recognition is at the utterance level • Thus, boosting is used to improve upon an initial ranking of candidate word sequences • The utterance approach has two advantages: • First, it is directly related to the sentence error rate • Second, it is computationally much less expensive than boosting applied at the level of feature vectors

  8. Utterance approach for boosting in ASR • In the utterance approach, we define the input pattern $x_i$ to be the sequence of feature vectors corresponding to the entire utterance $i$ • $y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$ • The a posteriori confidence measure $h_t(x_i, y)$ is calculated on the basis of the N-best list for utterance $i$ (see the sketch below)
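The paper computes $h_t(x_i, y)$ as an a posteriori probability estimated from the N-best list; a common form of such an estimate, given here as an illustrative sketch (with $\lambda$ a likelihood scaling exponent), is

$h_t(x_i, y) \approx \frac{\bigl[p(x_i \mid y)\, p(y)\bigr]^{\lambda}}{\sum_{y' \in \mathrm{Nbest}(x_i)} \bigl[p(x_i \mid y')\, p(y')\bigr]^{\lambda}},$

i.e. the scaled acoustic and language model scores of hypothesis $y$, normalized over all hypotheses in the N-best list for utterance $i$.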

  9. Utterance approach for boosting in ASR • Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight for each training utterance • These weights are subsequently used in maximum likelihood and discriminative training of the Gaussian mixture models (see the sketch below)
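As a minimal sketch (not the authors' implementation; function and variable names are assumptions), utterance weights can enter ML re-estimation of the Gaussian means simply by scaling each utterance's sufficient statistics:

import numpy as np

def reestimate_means(utterances, utt_weights, num_components, dim):
    # utterances: list of (frames, gammas) pairs, where frames is a (T, dim) array of
    # feature vectors and gammas a (T, num_components) array of occupation probabilities.
    # The boosting weight of an utterance simply scales its sufficient statistics.
    first_order = np.zeros((num_components, dim))   # weighted sums of gamma * x
    counts = np.zeros(num_components)               # weighted occupation counts
    for (frames, gammas), w in zip(utterances, utt_weights):
        first_order += w * (gammas.T @ frames)
        counts += w * gammas.sum(axis=0)
    return first_order / counts[:, None]            # updated mean vectors, one per Gaussian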

  10. Utterance approach for boosting in ASR • Some problems are encountered when applying it to large-scale continuous speech applications: • N-best lists of reasonable length (e.g. N=100) generally contain only a tiny fraction of the possible classification results • This has two consequences: • In training, it may lead to sub-optimal utterance weights • In recognition, Eq. (1) cannot be applied appropriately

  11. Utterance approach for CSR--Training • Training • A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in “chopping” the training data • For long sentences, this simply means inserting additional sentence break symbols at silence intervals of at least a given minimum length (see the sketch below) • This reduces the number of possible classifications of each sentence “fragment”, so that the resulting N-best lists cover a sufficiently large fraction of the hypotheses
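A minimal sketch of this chopping step, assuming a forced alignment with silence segments marked as "sil" (names and the threshold are illustrative, not taken from the paper):

def chop_transcription(aligned_tokens, min_sil_sec=0.3):
    # aligned_tokens: list of (token, start_sec, end_sec) tuples from a forced alignment.
    # A sentence break is inserted at every silence interval of at least min_sil_sec,
    # yielding shorter sentence fragments with fewer possible classifications.
    fragments, current = [], []
    for token, start, end in aligned_tokens:
        if token == "sil" and (end - start) >= min_sil_sec:
            if current:
                fragments.append(current)   # close the fragment at this long pause
                current = []
        elif token != "sil":
            current.append(token)
    if current:
        fragments.append(current)
    return fragments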

  12. Utterance approach for CSR--Decoding • Decoding: lexical approach for model combination • A single-pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level • The basic idea is to add a new pronunciation model by “replicating” the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix “_t” to each phoneme symbol) • The new phoneme symbols “au_1”, “au_2”, … represent the underlying acoustic models of boosting iterations 1, 2, …, alongside the original symbol “au” (see the sketch below)
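A minimal sketch of the phoneme replication step (illustrative code; names are assumptions):

def replicate_phoneme_set(phonemes, iteration):
    # Returns a copy of the phoneme symbols tagged for boosting iteration t,
    # e.g. "au" -> "au_2" for t = 2; iteration 0 keeps the original symbols.
    if iteration == 0:
        return list(phonemes)
    return [f"{p}_{iteration}" for p in phonemes]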

  13. Utterance approach for CSR--Decoding • Decoding: lexical approach for model combination (cont.) • Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (e.g. the transcription “sic a” gains the variant “sic_1 a_1” in iteration 1) • Use the reweighted training data to train the boosted classifier • Decoding is then performed using the extended lexicon and the set of acoustic models weighted by their unigram prior probabilities, which are estimated on the training data (a weighted summation at the word level; see the sketch below)
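Continuing the sketch, extending the decoding lexicon and attaching the model priors could look roughly as follows (illustrative code, not the authors' implementation):

def extend_lexicon(lexicon, model_priors):
    # lexicon: dict mapping each word to a list of transcriptions (lists of phoneme symbols).
    # Every transcription is added once per acoustic model, rewritten in that model's
    # replicated phoneme set and weighted by the unigram prior p(M_t) estimated on the
    # training data (t = 0 is the baseline model with the original phoneme symbols).
    extended = {}
    for word, transcriptions in lexicon.items():
        variants = []
        for phones in transcriptions:
            for t, prior in enumerate(model_priors):
                tagged = phones if t == 0 else [f"{p}_{t}" for p in phones]
                variants.append((tagged, prior))
        extended[word] = variants
    return extended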

  14. In more detail • Flow diagram (summarized): in each boosting iteration t, the training corpus is phonetically re-transcribed with the replicated phoneme set (“_t” suffix), the boosted model Mt is trained by ML/MMI on the reweighted data, the corresponding pronunciation variants (e.g. “sic a”, “sic_1 a_1”, …) extend the decoding lexicon, and the models M1, M2, …, Mt are combined, either unweighted or weighted

  15. In more detail

  16. Weighted model combination • Word level model combination
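Schematically, word-level model combination scores a word $w$ by the prior-weighted sum over the boosted acoustic models (a sketch consistent with the lexical approach above):

$p(x \mid w) \;\approx\; \sum_{t} p(M_t)\, p(x \mid w, M_t),$

where the weights $p(M_t)$ are the unigram prior probabilities of the models, estimated on the training data.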

  17. Experiments • Isolated word recognition • Telephone-bandwidth large vocabulary isolated word recognition • SpeechDat(II) German material • Continuous speech recognition • Professional dictation and Switchboard

  18. Isolated word recognition • Database: • Training corpus: consists of 18k utterances (4.3h) of city, company, first and family names • Evaluations: • LILI test corpus: 10k single-word utterances (3.5h); 10k-word lexicon; (matched conditions) • Names corpus: an in-house collection of 676 utterances (0.5h); two different decoding lexica: 10k lex, 190k lex; (acoustic conditions are matched, whereas there is a lexical mismatch) • Office corpus: 3.2k utterances (1.5h), recorded with a microphone in clean conditions; 20k lexicon; (an acoustic mismatch to the training conditions)

  19. Isolated word recognition • Boosting ML models

  20. Isolated word recognition • Combining boosting and discriminative training • The experiments in isolated word recognition showed that boosting may improve the best test error rates

  21. Continuous speech recognition • Database • Professional dictation • An in-house data collection of real-life recordings of medical reports • The acoustic training corpus consists of about 58h of data • Evaluations were carried out on two test corpora: • The development corpus consists of 5.0h of speech • The evaluation corpus consists of 3.3h of speech • Switchboard • Consists of spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) speech • Evaluation corpus: • Contains about 1h (0.5h) of male (female) speech

  22. Continuous speech recognition • Professional dictation:

  23. Switchboard:

  24. Conclusions • In this paper, a boosting approach which can be applied to any HMM-based speech recognizer was presented and evaluated • The increased recognizer complexity, and thus decoding effort, of the boosted systems is a major drawback compared to other training techniques such as discriminative training

  25. References • [ICASSP02][C.Meyer] Utterance-Level Boosting of HMM Speech Recognizers • [ICML02][C.Meyer] Towards Large Margin Speech Recognizers by Boosting and Discriminative Training • [ICSLP00][C.Meyer] Rival Training: Efficient Use of Data in Discriminative Training • [ICASSP00][Schramm and Aubert] Efficient Integration of Multiple Pronunciations in a Large Vocabulary Decoder
