Discriminative MLE training using a product of Gaussian likelihoods
T. Nagarajan and Douglas O'Shaughnessy
INRS-EMT, University of Quebec, Montreal, Canada
ICSLP 06
Outline • Introduction • Model selection methods • Bayesian information criterion • Product of Gaussian likelihoods • Experimental setup • PoG-based model selection • Performance analysis • Conclusions
References • [Padmanabhan and Bahl, IEEE Trans. Speech and Audio Processing, 2000], Model Complexity Adaptation Using a Discriminant Measure • [Airey and Gales, ICASSP 2003], Product of Gaussians and Multiple Stream Systems
Introduction • Defining the structure of an HMM is one of the major issues in acoustic modeling • The number of states is usually chosen based on either the number of acoustic variations that one may expect across the utterance or the length of the utterance • The number of mixtures per state can be chosen based on the amount of training data available • Proportional to the number of data samples (PD) • Alternative model selection criteria • Bayesian Information Criterion (also referred to as MDL, Minimum Description Length) • Discriminative Model Complexity Adaptation (MCA)
Introduction • The major difference between the proposed technique and others lies in the fact that the complexity of a model is adjusted by considering only the training examples of other classes, not their corresponding models • In this technique, the focus is on how well a given model can discriminate between the training data of different classes
Model selection methods • The focus is on choosing a proper topology for the models to be generated, especially the number of mixtures per state • We compare the performance of the proposed technique with that of the conventional BIC-based technique • For this work, acoustic modeling is carried out using the MLE algorithm only • These systems are implemented using the HTK toolkit
Bayesian Information Criterion • The number of components of a model is chosen by maximizing an objective function that is essentially the likelihood of the training examples of the model, penalized by the number of components in that model and the number of training examples (see below) • α is an additional penalty factor used to control the complexities of the resultant models
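One standard form of this objective, consistent with the description above (the paper's exact parameterization may differ), is

$$\mathrm{BIC}(\lambda_i) \;=\; \log p(X_i \mid \lambda_i) \;-\; \alpha\,\frac{d_i}{2}\,\log N_i$$

where $X_i$ is the set of training examples for class $C_i$, $d_i$ is the number of free parameters of $\lambda_i$ (which grows with the number of mixture components), and $N_i$ is the number of training examples; the component count that maximizes $\mathrm{BIC}(\lambda_i)$ is selected.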
Bayesian Information Criterion • Disadvantage • It does not consider information about other classes • This may lead to an increased error rate, especially when other classes are highly competitive and closely resemble the class being modeled
Product of Gaussian likelihoods • Figure 1(b) is possible only when the model λi is well trained • During training of the model, the acoustic likelihoods of all the utterances of class Cj should be maximized as much as possible • To maximize the likelihood on the training data, the estimation procedure often tries to make the variances of all the mixture components very small • But this often provides poor matches to independent test data
Product of Gaussian likelihoods • Thus, it is always better to reduce the overlap between the likelihood Gaussians (say, Nii and Nij) of utterances of different classes (say, Ci and Cj) for a given model (λi) • We assume that two Gaussians overlap with each other if either of the following conditions is met: • The means of Nii and Nij are close to each other, irrespective of their corresponding variances • Either of their variances is wide enough that both Gaussians overlap considerably • In order to quantify the amount of overlap between two Gaussians, we can use error bounds, like the Chernoff or Bhattacharyya bounds (shown below)
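For reference, the Bhattacharyya bound just mentioned takes the following standard form for two classes with priors $P_1, P_2$ and one-dimensional Gaussian likelihood distributions $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$:

$$P_e \;\le\; \sqrt{P_1 P_2}\, e^{-D_B}, \qquad D_B \;=\; \frac{(\mu_1-\mu_2)^2}{4(\sigma_1^2+\sigma_2^2)} \;+\; \frac{1}{2}\ln\frac{\sigma_1^2+\sigma_2^2}{2\sigma_1\sigma_2}$$

The second term makes the dependence on the variances explicit, which is exactly the sensitivity objected to on the next slide.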
Product of Gaussian likelihoods • However, these bounds are sensitive to the variances of the Gaussians • Here, we use a similar "product of Gaussians" logic for estimating the amount of overlap between two Gaussians, via the identity below
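This rests on the standard closed-form result that the integral of a product of two Gaussians is itself a Gaussian evaluated at the difference of the means:

$$\int \mathcal{N}(x;\mu_1,\sigma_1^2)\,\mathcal{N}(x;\mu_2,\sigma_2^2)\,dx \;=\; \mathcal{N}(\mu_1;\mu_2,\sigma_1^2+\sigma_2^2) \;=\; \frac{\exp\!\left(-\frac{(\mu_1-\mu_2)^2}{2(\sigma_1^2+\sigma_2^2)}\right)}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}}$$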
Product of Gaussian likelihoods • In order to quantify the amount of overlap between two different Gaussians, we define a ratio, O, based on the product of the two Gaussians
Product of Gaussian likelihoods • For identical Gaussians, however, we expect the overlap O to be equal to 1, so a normalized version is used (see the sketch below) • The resultant O_N is used as a measure to estimate the amount of overlap between two Gaussians
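A minimal numerical sketch of one plausible normalized product-of-Gaussians overlap with this property (O_N = 1 for identical Gaussians); this instantiation is an assumption and may differ from the exact ratio defined in the paper:

```python
import numpy as np

def gauss_product_integral(mu1, var1, mu2, var2):
    """Closed-form integral of the product of two 1-D Gaussians:
    int N(x; mu1, var1) * N(x; mu2, var2) dx = N(mu1; mu2, var1 + var2)."""
    v = var1 + var2
    return np.exp(-(mu1 - mu2) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

def normalized_overlap(mu1, var1, mu2, var2):
    """Normalized overlap O_N: 1.0 for identical Gaussians, near 0 for
    well-separated ones, obtained by dividing the cross product integral
    by the geometric mean of the two self product integrals."""
    num = gauss_product_integral(mu1, var1, mu2, var2)
    den = np.sqrt(gauss_product_integral(mu1, var1, mu1, var1) *
                  gauss_product_integral(mu2, var2, mu2, var2))
    return num / den

print(normalized_overlap(0.0, 1.0, 0.0, 1.0))  # identical -> 1.0
print(normalized_overlap(0.0, 1.0, 5.0, 1.0))  # well separated -> ~0.002
```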
Experimental setup • The TIMIT corpus is considered for both training and testing • The word lexicon is created only with the test words • For the words that are common to the train and test data, pronunciation variants are taken from the transcriptions provided with the corpus • For the rest of the test words, only one transcription is considered • Syllables are extracted from the phonetic segmentation boundaries • Only the 200 syllables that have more than 50 examples are considered • The rest are replaced by their corresponding phonemes • 200 syllable and 46 monophone models are initialized
Experimental setup • For the initialized models • the number of states is fixed based on the number of phonemes for a given sub-word unit • the number of mixtures per state is set to one • For the re-estimation of model parameters • a standard Viterbi alignment procedure is used • the number of mixtures per state for each model is then increased up to 30, in steps of 1, by a conventional mixture-splitting procedure (sketched below) • each time, the model parameters are re-estimated twice
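A sketch of this splitting-and-re-estimation schedule, driving the HTK tools HHEd (mixture splitting via its MU edit command) and HERest (Baum-Welch re-estimation); all file names, the config, and the state range are illustrative placeholders, not the paper's actual setup:

```python
import subprocess

MAX_MIX = 30       # maximum number of mixtures per state
REEST_PASSES = 2   # parameters re-estimated twice after each increment

for n_mix in range(2, MAX_MIX + 1):
    # HHEd edit script raising every state to n_mix mixture components.
    # The state range [2-4] assumes 3-emitting-state models; the models
    # in the paper have variable state counts.
    with open("split.hed", "w") as f:
        f.write(f"MU {n_mix} {{*.state[2-4].mix}}\n")
    subprocess.run(["HHEd", "-H", "hmmdefs", "-M", "hmm_tmp",
                    "split.hed", "modellist"], check=True)
    for _ in range(REEST_PASSES):
        subprocess.run(["HERest", "-C", "config", "-I", "train.mlf",
                        "-S", "train.scp", "-H", "hmm_tmp/hmmdefs",
                        "-M", "hmm_tmp", "modellist"], check=True)
```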
Performance analysis • Since the model size otherwise grows uncontrollably, the maximum number of mixtures per state is fixed at 30
Conclusions • In conventional techniques for model optimization, the topology of a model is optimized either without considering other classes or by considering only a subset of competing models • In this work, by contrast, we consider whether a given model can discriminate the training utterances of other classes from its own
Discriminative model complexity adaptation • The discriminant measure is a two-dimensional vector