ICASSP 2009: Acoustic Model Survey (Yueng-Tien Lo)
Discriminative Training of Hierarchical Acoustic Models for Large Vocabulary Continuous Speech Recognition
Hung-An Chang and James R. Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts, 02139, USA
{hung_an, glass}@csail.mit.edu
Outline • Introduction • Hierarchical Gaussian Mixture Models • Discriminative Training • Experiments • Conclusion
Introduction-1 • In recent years, discriminative training methods have demonstrated considerable success. • Another approach to improve the acoustic modeling component is to utilize a more flexible model structure by constructing a hierarchical tree for the models.
Introduction-2 • This paper presents a hierarchical acoustic modeling scheme that can be trained using discriminative methods for LVCSR tasks. • A model hierarchy is constructed by first using a top-down divisive clustering procedure to create a decision tree; hierarchical layers are then selected by using different stopping criteria to traverse the decision tree.
Hierarchical Gaussian Mixture Models • A tree structure represents the current status of clustering; each node in the tree represents a set of acoustic contexts and their corresponding training data. • At each step, the node and the question are chosen such that a clustering objective, such as the data log-likelihood, is optimized.
Model Hierarchy Construction • A hierarchical model can be constructed from a decision tree by running the clustering algorithm multiple times using different stopping criteria.
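The divisive clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual procedure: node data are modeled with a single 1-D Gaussian, and `gauss_ll`, `best_split`, `grow_tree`, and the `min_gain` stopping threshold are all hypothetical names.

```python
import math

def gauss_ll(data):
    """Log-likelihood of samples under a single ML-fitted Gaussian."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n + 1e-6
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in data)

def best_split(data, questions):
    """Pick the yes/no question whose partition maximizes the LL gain."""
    base = gauss_ll(data)
    best = None
    for q in questions:
        yes = [x for x in data if q(x)]
        no = [x for x in data if not q(x)]
        if len(yes) < 2 or len(no) < 2:
            continue
        gain = gauss_ll(yes) + gauss_ll(no) - base
        if best is None or gain > best[0]:
            best = (gain, yes, no)
    return best

def grow_tree(data, questions, min_gain):
    """Top-down divisive clustering; min_gain is the stopping criterion."""
    split = best_split(data, questions)
    if split is None or split[0] < min_gain:
        return {"data": data}                      # leaf node
    _, yes, no = split
    return {"yes": grow_tree(yes, questions, min_gain),
            "no": grow_tree(no, questions, min_gain)}
```

Growing the same tree with several values of `min_gain` corresponds to clustering with several stopping criteria, each run defining one layer of the hierarchy, from coarse (strict criterion, few leaves) to fine (loose criterion, many leaves).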
Model Scoring • The leaf nodes of a hierarchical model represent the set of its output acoustic labels. • The log-likelihood of a feature vector x with respect to a leaf label c can be computed as a weighted combination of the GMM log-likelihoods of the nodes on the path from the root to c: s(x, c) = Σk ωk log p(x | λk(c)), where λk(c) is the ancestor of c at level k and the level weights ωk sum to one.
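As a sketch of scoring a leaf label by blending the log-likelihoods along its root-to-leaf path (the function name, the toy numbers, and the dictionary-based node lookup are assumptions for illustration):

```python
def hierarchical_score(x, path, weights, node_loglike):
    """Score a feature vector x for a leaf label by combining the GMM
    log-likelihoods of every node on the root-to-leaf path; the level
    weights form a convex combination (they sum to one)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * node_loglike(node, x) for node, w in zip(path, weights))
```

Because coarse ancestor models are trained on more data than fine leaf models, blending them can smooth poorly estimated leaf scores.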
Discriminative Training : MCE training • MCE training seeks to minimize the number of incorrectly recognized utterances in the training set by increasing the difference between the log-likelihood score of the correct transcription and that of an incorrect hypothesis.
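A common way to smooth that utterance-level error count into a trainable objective is a sigmoid of the score gap. The sketch below takes the best competitor as the anti-hypothesis score, which is one standard choice; the function name and the gamma default are assumptions, not the paper's exact settings.

```python
import math

def mce_loss(correct_score, competitor_scores, gamma=1.0):
    """Smoothed 0/1 sentence error: sigmoid of the gap between the best
    incorrect hypothesis and the correct transcription's score."""
    d = max(competitor_scores) - correct_score   # misclassification measure
    return 1.0 / (1.0 + math.exp(-gamma * d))    # near 1 if wrong, near 0 if right
```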
Discriminative Training with Hierarchical Models • Given the loss function L, its gradient can be decomposed into a combination of the gradients of the individual acoustic scores, as in discriminative training of non-hierarchical models. • As a result, training a hierarchical model reduces to first computing the gradients with respect to all acoustic scores, as would be done for a non-hierarchical model, and then distributing the contribution of each gradient across the levels of the hierarchy according to the weights ωk.
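Since each leaf score is a weighted sum over its ancestors' scores, the chain rule gives ds/ds_k = ωk for the level-k ancestor, so distributing the gradient is a one-line accumulation. A sketch with hypothetical names:

```python
def distribute_gradient(dL_ds, path, weights, grads):
    """For a leaf score s = sum_k w_k * s_k, each node on the root-to-leaf
    path receives dL/ds_k = (dL/ds) * w_k; contributions from different
    leaves that share an ancestor accumulate in `grads`."""
    for node, w in zip(path, weights):
        grads[node] = grads.get(node, 0.0) + dL_ds * w
    return grads
```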
Experiments: MIT Lecture Corpus • The MIT Lecture Corpus contains audio recordings and manual transcriptions for approximately 300 hours of MIT lectures from eight different courses, and nearly 100 MIT World seminars given on a variety of topics.
SUMMIT Recognizer • The SUMMIT landmark-based speech recognizer first computes a set of perceptually important time points as landmarks, based on an acoustic difference measure, and extracts a feature vector around each landmark. • The acoustic landmarks are represented by a set of diphone labels to model the left and right contexts of the landmarks. • The diphones are clustered using top-down decision-tree clustering and are modeled by a set of GMM parameters.
Baseline Models • Although discriminative training always reduced the WERs, the improvements became less effective as the number of mixture components increased. • This fact suggests that discriminative training has made the model over-fit the training data as the number of parameters increases.
Hierarchical Models • Table 1 summarizes the WERs of the hierarchical model before and after discriminative training. • The statistics of log-likelihood differences show that the hierarchy prevents large decreases in the log-likelihood and can potentially make the model more resilient to over-fitting.
Conclusion • In this paper, we have described how to construct a hierarchical model for LVCSR tasks using top-down decision-tree clustering, and how to combine a hierarchical acoustic model with discriminative training. • In the future, we plan to further improve the model by applying a more flexible hierarchical structure.
Discriminative Pronunciation Learning Using Phonetic Decoder and Minimum-Classification-Error Criterion
Oriol Vinyals, Li Deng, Dong Yu, and Alex Acero
Microsoft Research, Redmond, WA
International Computer Science Institute, Berkeley, CA
Outline • Introduction • Discriminative Pronunciation Learning • Experimental Evaluation • Discussions and Future Work
Introduction-1 • Standard pronunciations are derived from existing dictionaries and do not cover the full diversity of lexical forms that represent the various ways of pronouncing the same word. • On the one hand, alternative pronunciations, obtained typically by manual addition or by maximum-likelihood learning, increase the coverage of pronunciation variability. • On the other hand, they may also lead to greater confusability between different lexical items.
Introduction-2 • To overcome the difficulty of increased lexical confusability when introducing alternative pronunciations, one can use discriminative learning to explicitly minimize the confusability. • In our work, we directly exploit the MCE criterion for selecting the more “discriminative” pronunciation alternatives, where these alternatives are derived from high-quality N-best lists or lattices in the phonetic recognition results.
Discriminative Pronunciation Learning-1 • In speech recognition exploiting alternative pronunciations, the decision rule can be approximated by
Ŵ = argmax over W of [ max over q of P(X | Qq(W)) P(Qq(W) | W) P(W) ]
• where X is the sequence of acoustic observations (generally feature vectors extracted from the speech waveform), W is a sequence of hypothesized words or a sentence, and q = 1, 2, . . . , N indexes the multiple phone sequences Qq(W) that are alternative pronunciations of sentence W. Each Qq(W) is associated with the q-th path in the lattice of word pronunciations, and has an implicit dependency on W.
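In the log domain, this decision rule becomes an argmax over words with an inner max over each word's pronunciation variants. A minimal single-word sketch (the function signature and the toy score tables are hypothetical, and the inner max over q is the usual Viterbi-style approximation):

```python
import math

def decode(words, prons, acoustic_ll, pron_logprob, lm_logprob):
    """Return the word maximizing log P(X|Q) + log P(Q|W) + log P(W),
    taking the best pronunciation variant q of each word."""
    best_w, best_score = None, -math.inf
    for w in words:
        score = lm_logprob[w] + max(
            acoustic_ll[q] + pron_logprob[(w, q)] for q in prons[w])
        if score > best_score:
            best_w, best_score = w, score
    return best_w
```

With per-pronunciation probabilities P(Q|W) learned from data, adding variants helps only if it does not pull the max toward a confusable competing word, which is exactly the confusability the MCE criterion is meant to control.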
Discriminative Pronunciation Learning-2 • The method selects pronunciation(s) from the phonetic recognition results for each word such that the sentence error rate on the training data is minimized.
Discriminative Pronunciation Learning-3 • where Xr is the acoustic observation sequence of the r-th sentence or utterance, Wr refers to the correct sequence of words for the r-th sentence, the competing Wr,k (k = 1, . . . , Nr) refer to the incorrect hypotheses (obtained from N-best lists produced by the word decoder), Nr is the total number of hypotheses considered for a particular sentence, and Λ is the family of parameters to be estimated to optimize the objective function.
Discriminative Pronunciation Learning-4 • We now define the “MCE score” of a candidate pronunciation p for a word w. • Note that the MCE score is an approximation of the number of corrected errors after adding the new pronunciation p. • Thus, if its value is positive, there is an improvement in recognizer performance, measured by training-set error reduction, with the new pronunciation p for word w.
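The selection rule described above amounts to keeping only candidate (word, pronunciation) pairs with a positive MCE score. A minimal sketch (the function name and toy scores are assumptions):

```python
def select_pronunciations(candidates, mce_score):
    """Greedily keep the (word, pronunciation) pairs whose MCE score is
    positive, i.e. those estimated to reduce training-set errors."""
    return [(w, p) for (w, p) in candidates if mce_score(w, p) > 0.0]
```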
Discriminative Pronunciation Learning-5 • One limitation of finding the (word, pronunciation) pairs greedily is the assumption that the errors corrected by a particular pair are uncorrelated with the errors corrected by any other pair.
Experimental Evaluation • In the experimental evaluation of the discriminative pronunciation learning technique described in the preceding section, we used the Windows-Live-Search-for-Mobile (WLS4M) database, which consists of very large quantities of short utterances from spoken queries to the WLS4M engine.
Baseline System • The baseline is a standard HMM-GMM speech recognizer provided as part of the Microsoft Speech API (SAPI). • The acoustic model is trained on about 3000 hours of speech, using PLP features and an HLDA transformation. • A bigram language model trained on realistic system-deployment data is used. • The recognizer has a dictionary with 64,000 distinct entries.
Performance results • The comparative performance results between the baseline recognizer and the new one with MCE-based pronunciation learning are summarized in Table 2.
Discussions and Future Work • The maximum-likelihood-based approach to generating multiple pronunciations creates greater flexibility in the ways people pronounce words, but the addition of new entries often vastly enlarges the confusability across different lexical items. • This is because maximum-likelihood learning does not take such lexical confusability into account.