
ICASSP 2009: Acoustic Model Survey



  1. ICASSP 2009: Acoustic Model Survey Yueng-Tien, Lo

  2. Discriminative Training of Hierarchical Acoustic Models For Large Vocabulary Continuous Speech Recognition Hung-An Chang and James R. Glass MIT Computer Science and Artificial Intelligence Laboratory Cambridge, Massachusetts, 02139, USA {hung_an, glass}@csail.mit.edu

  3. Outline • Introduction • Hierarchical Gaussian Mixture Models • Discriminative Training • Experiments • Conclusion

  4. Introduction-1 • In recent years, discriminative training methods have demonstrated considerable success in acoustic modeling for speech recognition. • Another approach to improve the acoustic modeling component is to utilize a more flexible model structure by constructing a hierarchical tree for the models.

  5. Introduction-2 • The paper presents a hierarchical acoustic modeling scheme that can be trained using discriminative methods for LVCSR tasks. • A model hierarchy is constructed by first using a top-down divisive clustering procedure to create a decision tree; hierarchical layers are then selected by using different stopping criteria to traverse the decision tree.

  6. Hierarchical Gaussian Mixture Models • A tree structure represents the current status of clustering; each node in the tree represents a set of acoustic contexts and their corresponding training data. • In general, the node and the question are chosen such that a clustering objective, such as the data log-likelihood, is improved as much as possible at each step.
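The greedy split selection described above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it assumes each node is summarized by a single diagonal Gaussian, so a node's data log-likelihood (up to a constant) depends only on its frame count and per-dimension variances.

```python
import math

def node_loglik(n_frames, variances):
    """Data log-likelihood of a node under a single diagonal-Gaussian fit,
    up to an additive constant: -n/2 * sum_d log(sigma_d^2)."""
    return -0.5 * n_frames * sum(math.log(v) for v in variances)

def split_gain(parent, left, right):
    """Gain in data log-likelihood from splitting a node with a candidate
    question. Each argument is an (n_frames, variances) pair (hypothetical
    sufficient-statistics format). The clustering greedily applies the
    (node, question) pair with the largest gain at every step."""
    return (node_loglik(*left) + node_loglik(*right)) - node_loglik(*parent)
```

A split that separates the data into two tighter (lower-variance) children yields a positive gain, which is why the greedy procedure tends to group acoustically similar contexts.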

  7. Model Hierarchy Construction • A hierarchical model can be constructed from a decision tree by running the clustering algorithm multiple times using different stopping criteria.
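One way to realize the multiple-stopping-criteria idea is to traverse the same decision tree with different thresholds, each traversal yielding one layer of the hierarchy. The minimal `(frame_count, children)` node format below is an assumption for illustration:

```python
def layer_frontier(node, min_frames):
    """Collect the leaves of one hierarchy layer: a node terminates the
    layer if it has no children or holds fewer than `min_frames` training
    frames. A looser threshold yields a coarser layer; a tighter one
    yields a finer layer, so one tree provides every level of the hierarchy."""
    count, children = node
    if not children or count < min_frames:
        return [node]
    frontier = []
    for child in children:
        frontier.extend(layer_frontier(child, min_frames))
    return frontier
```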

  8. Model Scoring • The leaf nodes of a hierarchical model represent the set of its output acoustic labels. • The log-likelihood of a feature vector x with respect to a leaf class c is computed as a weighted combination of the GMM log-likelihoods of c and its ancestors in the hierarchy.
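Concretely, the hierarchical score can be taken as a weighted sum of per-layer log-likelihoods along the root-to-leaf path. The 1-D single-Gaussian scorer below is an assumed stand-in for each layer's GMM; the weights play the role of the layer combination weights ωk:

```python
import math

def gauss_loglik(mean, var):
    """Return a scorer x -> log N(x; mean, var); a one-component, 1-D
    stand-in for a layer's GMM."""
    def ll(x):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return ll

def hierarchical_score(x, path_models, weights):
    """Weighted combination of per-layer log-likelihoods along the
    root-to-leaf path of leaf c: sum_k w_k * a_k(x)."""
    return sum(w * model(x) for w, model in zip(weights, path_models))
```

Coarse ancestor models, trained on more data, regularize the score of their sparse leaf descendants.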

  9. Discriminative Training : MCE training • MCE training seeks to minimize the number of incorrectly recognized utterances in the training set by increasing the difference between the log-likelihood score of the correct transcription and that of an incorrect hypothesis.
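A minimal sketch of the smoothed per-utterance MCE loss; the sigmoid slope `gamma` is an assumed hyperparameter:

```python
import math

def mce_loss(score_correct, score_competitor, gamma=1.0):
    """Sigmoid of the misclassification measure
    d = (competitor score) - (correct-transcription score).
    Near 1 when the utterance is misrecognized, near 0 when the correct
    transcription wins by a margin; summed over utterances it is a smooth
    approximation to the count of misrecognized training utterances."""
    d = score_competitor - score_correct
    return 1.0 / (1.0 + math.exp(-gamma * d))
```

Gradient descent on this loss pushes the correct transcription's score up and the competitor's down, which is exactly the margin-increasing behavior described above.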

  10. Discriminative Training with Hierarchical Model • Given the loss function L, the gradient can be decomposed into a combination of the gradients of the individual acoustic scores, as would be done for discriminative training of non-hierarchical models. • As a result, training a hierarchical model reduces to first computing the gradients with respect to all acoustic scores, as in training a non-hierarchical model, and then distributing the contribution of the gradient into the different levels according to ωk.
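This decomposition can be read as a chain rule: if the hierarchical score is a weighted sum over layers, the loss gradient with respect to each layer's acoustic score is the flat-model gradient scaled by that layer's weight. An illustrative helper:

```python
def distribute_gradient(d_loss_d_score, layer_weights):
    """Split dL/da, computed exactly as for a non-hierarchical model,
    across the hierarchy levels: dL/da_k = (dL/da) * w_k, since the
    combined score is a = sum_k w_k * a_k."""
    return [d_loss_d_score * w for w in layer_weights]
```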

  11. Experiments- MIT Lecture Corpus • The MIT Lecture Corpus contains audio recordings and manual transcriptions for approximately 300 hours of MIT lectures from eight different courses, and nearly 100 MIT World seminars given on a variety of topics.

  12. SUMMIT Recognizer • The SUMMIT landmark-based speech recognizer first computes a set of perceptually important time points as landmarks based on an acoustic difference measure, and extracts a feature vector around each landmark. • The acoustic landmarks are represented by a set of diphone labels to model the left and right contexts of the landmarks. • The diphones are clustered using top-down decision tree clustering and are modeled by a set of GMM parameters.

  13. Baseline Models • Although discriminative training always reduced the WERs, the improvements became less pronounced as the number of mixture components increased. • This fact suggests that discriminative training has made the model over-fit the training data as the number of parameters increases.

  14. Hierarchical Models • Table 1 summarizes the WERs of the hierarchical model before and after discriminative training. • The statistics of log-likelihood differences show that the hierarchy prevents large decreases in the log-likelihood and can potentially make the model more resilient to over-fitting.

  15. Conclusion • In this paper, we have described how to construct a hierarchical model for an LVCSR task using top-down decision tree clustering, and how to combine a hierarchical acoustic model with discriminative training. • In the future, we plan to further improve the model by applying a more flexible hierarchical structure.

  16. Discriminative Pronunciation Learning Using Phonetic Decoder and Minimum-Classification-Error Criterion Oriol Vinyals, Li Deng, Dong Yu, and Alex Acero Microsoft Research, Redmond, WA International Computer Science Institute, Berkeley, CA

  17. Outline • Introduction • Discriminative Pronunciation Learning • Experimental Evaluation • Discussions and Future Work

  18. Introduction-1 • Standard pronunciations are derived from existing dictionaries, and do not cover the full diversity of lexical items that represent the various ways of pronouncing the same word. • On the one hand, alternative pronunciations, obtained typically by manual addition or by maximum likelihood learning, increase the coverage of pronunciation variability. • On the other hand, they may also lead to greater confusability between different lexical items.

  19. Introduction-2 • To overcome the difficulty of increased lexical confusability while introducing alternative pronunciations, one can use discriminative learning to intentionally minimize the confusability. • In our work, we directly exploit the MCE criterion to select the more “discriminative” pronunciation alternatives, where these alternatives are derived from high-quality N-best lists or lattices in the phonetic recognition results.

  20. Discriminative Pronunciation Learning-1 • In speech recognition exploiting alternative pronunciations, the decision rule can be approximated by Ŵ = argmaxW maxq p(X | q, W) P(W), replacing the sum over alternative pronunciations with the single best one.

  21. where X is the sequence of acoustic observations (generally feature vectors extracted from the speech waveform), W is a sequence of hypothesized words or a sentence, and q = 1, 2, . . . , N is the index over the multiple phone sequences that are alternative pronunciations of sentence W. Each pronunciation sequence is associated with the q-th path in the lattice of word pronunciations, and has an implicit dependency on W.
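Under those definitions, the decision rule can be sketched as a joint search over word sequences and their pronunciation variants. The dict-based score tables below are an illustrative stand-in for real decoder lattices:

```python
import math

def decode(acoustic_scores, lm_scores):
    """Pick the (W, q) pair maximizing log p(X | q, W) + log P(W).
    acoustic_scores: {(word_sequence, q): acoustic log-likelihood};
    lm_scores: {word_sequence: language-model log probability}."""
    best, best_total = None, -math.inf
    for (w, q), am in acoustic_scores.items():
        total = am + lm_scores[w]
        if total > best_total:
            best, best_total = (w, q), total
    return best
```

Note how an added pronunciation variant (a second q for the same W) can rescue a sentence the baseline lexicon would lose, which is exactly the benefit, and the confusability risk, discussed above.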

  22. Discriminative Pronunciation Learning-2 • The learning procedure selects pronunciation(s) from the phonetic recognition results for each word such that the sentence error rate on the training data is minimized.

  23. Discriminative Pronunciation Learning-3 • Each term of the objective function depends on the observation from the r-th sentence or utterance, the correct sequence of words for the r-th sentence, the incorrect hypotheses (obtained from N-best lists produced by the word decoder), the total number of hypotheses considered for a particular sentence, and Λ, the family of parameters to be estimated to optimize the objective function.

  24. Discriminative Pronunciation Learning-4 • We now define the “MCE score” for a candidate pronunciation. • It is an approximation of the number of errors corrected after using the new pronunciation p. • Thus, if the value is positive, then there is an improvement of the recognizer performance, measured by training-set error reduction, with the new pronunciation p for word w.
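The selection rule can be sketched as counting corrected errors per candidate and keeping only those with a positive score. This is a simplified counting stand-in for the paper's smoothed MCE objective, and the candidate format and error counts are illustrative:

```python
def mce_score(errors_before, errors_after):
    """Approximate number of training-set errors corrected by adding a new
    pronunciation: errors with the baseline lexicon minus errors with the
    candidate added. Positive means the candidate helps."""
    return errors_before - errors_after

def select_pronunciations(candidates):
    """Keep candidates with a positive MCE score, best first.
    candidates: list of (pronunciation, errors_before, errors_after)."""
    kept = [(p, mce_score(b, a)) for p, b, a in candidates]
    return sorted((t for t in kept if t[1] > 0), key=lambda t: -t[1])
```

Scoring each candidate independently is what makes the procedure greedy, which leads directly to the limitation discussed on the next slide.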

  25. Discriminative Pronunciation Learning-5 • One of the limitations of finding the (word, pronunciation) pairs in a greedy way is the assumption that errors corrected by a particular pair are uncorrelated with errors corrected by any other pair.

  26. Experimental Evaluation • In the experimental evaluation of the discriminative pronunciation learning technique described in the preceding section, we used the Windows-Live-Search-for-Mobile (WLS4M) database, which consists of very large quantities of short utterances from spoken queries to the WLS4M engine.

  27. Baseline System • The baseline is a standard HMM-GMM speech recognizer provided as part of the Microsoft Speech API (SAPI). • The acoustic model is trained on about 3000 hours of speech, using PLP features and an HLDA transformation. • The bigram language model is used and trained with realistic system deployment data. • The recognizer has a dictionary with 64,000 distinct entries.

  28. Performance results • The comparative performance results between the baseline recognizer and the new one with MCE-based pronunciation learning are summarized in Table 2.

  29. Discussions and Future Work • The maximum-likelihood-based approach to generating multiple pronunciations creates greater flexibility in the ways that people pronounce words, but the addition of new entries often vastly enlarges confusability across different lexical items. • This is because maximum likelihood learning does not take such lexical confusability into account.
