Large scale discriminative training for speech recognition P.C. Woodland and D. Povey Cambridge University Engineering Department, 2000. Presented by 士弘
Outline • Introduction • MMIE criterion • Extended Baum-Welch algorithm • Improving MMIE generalization • Experimental setup • Hub5 MMIE training experiments • Conclusion
Introduction • MMIE training is an alternative to MLE. • In MLE training, model parameters are adjusted to increase the likelihood of the word strings corresponding to the training utterances, without taking account of the probability of other possible word strings. • In contrast to MLE, discriminative training schemes take account of possible competing word hypotheses and try to reduce the probability of incorrect hypotheses.
Three issues • Unfortunately, the discriminative optimization of HMM parameters is much more complex than the conventional MLE framework: • Heavy computational load for LVCSR → addressed with word lattices. • No closed-form solution → approximate re-estimation formulae. • More iterations required → careful control of the step size.
Introducing MMIE Training • It is well known that the decoder with minimum probability of error is the so-called maximum a posteriori (MAP) decoder, $\hat{W} = \arg\max_W P(W \mid Y)$. [Figure: source-channel view, with the word sequence W passed through an acoustic "channel" to produce the observations Y]
Introducing MMIE Training • Of course, we know neither of these true distributions, so instead we use the model-based decoder $\hat{W} = \arg\max_W p_\lambda(Y \mid W)\,P(W)$, where $p_\lambda(Y \mid W)$ is the acoustic (HMM) likelihood and $P(W)$ is the language model probability.
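As a rough illustration (not from the paper) of the MAP decoding rule above, the sketch below picks, from a hypothetical list of competing word strings, the one maximizing log p(Y|W) + log P(W); the hypotheses, scores, and function name are invented for illustration.

```python
import math

def map_decode(hypotheses):
    """Pick the word string maximizing log p(Y|W) + log P(W) (MAP rule)."""
    return max(hypotheses, key=lambda h: h["acoustic_loglik"] + h["lm_logprob"])

# Hypothetical competing word strings for one utterance.
hyps = [
    {"words": "the cat sat", "acoustic_loglik": -1530.2, "lm_logprob": math.log(1e-4)},
    {"words": "the cat sad", "acoustic_loglik": -1528.9, "lm_logprob": math.log(1e-6)},
    {"words": "a cat sat",   "acoustic_loglik": -1535.7, "lm_logprob": math.log(2e-4)},
]
print(map_decode(hyps)["words"])
```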
MMIE criterion • The MMIE objective function is the (log) posterior probability of the correct transcriptions of the training data:
$$F_{\mathrm{MMIE}}(\lambda) = \sum_{r=1}^{R} \log \frac{p_\lambda(\mathcal{O}_r \mid \mathcal{M}_{w_r})\,P(w_r)}{\sum_{\hat{w}} p_\lambda(\mathcal{O}_r \mid \mathcal{M}_{\hat{w}})\,P(\hat{w})}$$
where $\mathcal{M}_w$ is the composite model corresponding to word sequence $w$ and $P(w)$ is the language model probability (the "Language Model" label on the original slide points to these $P(\cdot)$ terms).
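The following is a minimal sketch of evaluating this criterion, assuming per-utterance numerator scores (correct transcription) and denominator scores (all competing hypotheses, e.g. from a lattice) are already available as combined acoustic-plus-LM log values; the function and variable names are illustrative only.

```python
import numpy as np

def mmie_objective(num_logliks, den_logliks_per_utt):
    """F_MMIE = sum_r [ numerator log score - logsumexp over all hypotheses of utt r ].

    num_logliks:         log p(O_r | M_{w_r}) + log P(w_r) for each correct transcription.
    den_logliks_per_utt: per utterance, an array of log p(O_r | M_w) + log P(w) over all
                         hypotheses w considered (including the correct one).
    """
    total = 0.0
    for num, dens in zip(num_logliks, den_logliks_per_utt):
        den = np.logaddexp.reduce(dens)  # log of the summed denominator
        total += num - den
    return total

# Toy example: two utterances, a handful of competing hypotheses each.
print(mmie_objective(
    num_logliks=[-100.0, -80.0],
    den_logliks_per_utt=[np.array([-100.0, -103.0, -101.5]),
                         np.array([-80.0, -79.5, -85.0])]))
```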
[Figure: the MMIE criterion together with an auxiliary function and an auxiliary function for smoothing]
Estimating HMM Parameters with MMIE Basic Concepts • Consider the discrete HMM case:
Estimating HMM Parameters with MMIE Basic Concepts • Then the forward-backward algorithm provides the occupation counts (the posterior probability of occupying a given state, and of emitting a given symbol from it, at each time). • In Baum-Welch (MLE) training, these occupation counts are used to reestimate the values of the HMM parameters.
Estimating HMM Parameters with MMIE Basic Concepts • Note that these counts can also be expressed in terms of the gradient of the log-likelihood with respect to the model parameters.
Estimating HMM Parameters with MMIE Basic Concepts • Proof:
Estimating HMM Parameters with MMIE Basic Concepts • For Gaussian densities, the reestimated mean vector and covariance matrix will be computed as
Estimating HMM Parameters with MMIE Basic Concepts • Now, whatever MMIE parameter estimation technique is used, the value of the gradient has to be computed at each iteration for every parameter to be estimated. • In order to do so, let us define a general (denominator) model, written here as $\mathcal{M}_{\mathrm{den}}$, containing a path corresponding to every possible word sequence W in the application.
Estimating HMM Parameters with MMIE Basic Concepts • The advantage of defining such a model is that it implicitly takes care of the sum over all word sequences in the denominator of the MMIE objective function. • The gradient can then be computed as the difference between the occupation counts collected on the correct-transcription (numerator) model and those collected on the general (denominator) model.
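For the discrete-output case this count difference can be written as follows (a sketch of the standard result; the notation is assumed here: $b_j(k)$ is the output probability of symbol $k$ in state $j$, and $\gamma^{\mathrm{num}}$, $\gamma^{\mathrm{den}}$ are occupation counts accumulated on the numerator and denominator models):

$$\frac{\partial F_{\mathrm{MMIE}}}{\partial \log b_j(k)} = \gamma_j^{\mathrm{num}}(k) - \gamma_j^{\mathrm{den}}(k)$$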
Estimating HMM Parameters with MMIE Basic Concepts • If a training utterance is well recognized, the denominator counts will be dominated by the paths corresponding to the correct transcription, so roughly the same amount is added to and subtracted from each count, with negligible effect on its final value. • If the utterance is poorly recognized, some counts in the model for the correct transcription are incremented while counts in the models of competing hypotheses are decremented.
Estimating HMM Parameters with MMIE Alternatives to Gradient Descent • Gradient descent is a fairly safe parameter estimation technique in that, with a small enough step size, it should converge to some local optimum of the objective function. • The problem, of course, is that we do not want to use small step sizes, since they mean slow convergence, which usually cannot be afforded given the computational cost of each iteration.
Beyond the Standard MMIE Training Formulation • The use of such a compensation factor fits quite naturally within the MMIE framework, where it appears as an empirically estimated constant.
Extended Baum-Welch algorithm • The MMIE objective function can be optimized by any of the standard gradient methods; however, such approaches are often slow to converge. • Analogous to the Baum-Welch algorithm for MLE training, Gopalakrishnan et al. (ICASSP '89) showed that re-estimation formulae of the form
$$\hat{\lambda}_{jk} = \frac{\lambda_{jk}\left(\frac{\partial F}{\partial \lambda_{jk}} + D\right)}{\sum_{k'} \lambda_{jk'}\left(\frac{\partial F}{\partial \lambda_{jk'}} + D\right)} \qquad (4)$$
converge to a local optimum of the MMIE criterion for a sufficiently large value of the constant D (here $\lambda_{jk}$ is a discrete probability parameter such as a transition probability or mixture weight).
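As a small numerical sketch of an update of this form (not the paper's implementation), assume the gradient values for a discrete distribution are already available; the arrays and the value of D below are illustrative.

```python
import numpy as np

def ebw_discrete_update(params, grad, D):
    """Extended Baum-Welch style update of a discrete distribution (cf. Eq. 4).

    params: current probabilities (sum to 1); grad: dF/dparams; D: smoothing constant,
    chosen large enough that every updated value stays positive.
    """
    unnormalized = params * (grad + D)
    if np.any(unnormalized <= 0):
        raise ValueError("D is too small: some updated parameters are not positive")
    return unnormalized / unnormalized.sum()

# Toy mixture-weight-like distribution and an arbitrary gradient.
w = np.array([0.5, 0.3, 0.2])
g = np.array([1.0, -2.0, 0.5])
print(ebw_discrete_update(w, g, D=5.0))
```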
Mean and variance updates • For a continuous density HMM system, Eq. (4) does not lead to a closed-form solution for the means and variances. However, Normandin showed that the following re-estimation formulae can be used:
$$\hat{\mu}_{jm} = \frac{\theta_{jm}^{\mathrm{num}}(\mathcal{O}) - \theta_{jm}^{\mathrm{den}}(\mathcal{O}) + D\,\mu_{jm}}{\gamma_{jm}^{\mathrm{num}} - \gamma_{jm}^{\mathrm{den}} + D} \qquad (5)$$
$$\hat{\sigma}_{jm}^{2} = \frac{\theta_{jm}^{\mathrm{num}}(\mathcal{O}^{2}) - \theta_{jm}^{\mathrm{den}}(\mathcal{O}^{2}) + D\left(\sigma_{jm}^{2} + \mu_{jm}^{2}\right)}{\gamma_{jm}^{\mathrm{num}} - \gamma_{jm}^{\mathrm{den}} + D} - \hat{\mu}_{jm}^{2} \qquad (6)$$
where $\theta_{jm}(\mathcal{O})$ and $\theta_{jm}(\mathcal{O}^2)$ are the sums of the data and squared data weighted by the occupation probability of Gaussian m of state j, $\gamma_{jm}$ is the corresponding occupancy, and the superscripts num/den denote statistics accumulated on the numerator (correct transcription) and denominator (recognition) models.
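A sketch of Eqs. (5) and (6) for a single diagonal-covariance Gaussian, assuming the numerator and denominator statistics (occupancies, gamma-weighted data sums, and gamma-weighted squared-data sums) have already been accumulated; the dictionary layout and all values are illustrative.

```python
import numpy as np

def ebw_gaussian_update(mu, var, num, den, D):
    """EBW mean/variance update (cf. Eqs. 5 and 6) for one diagonal Gaussian.

    num/den: dicts with 'occ' (scalar occupancy), 'x' (sum of gamma * o_t) and
    'x2' (sum of gamma * o_t**2), accumulated on the numerator/denominator models.
    """
    denom = num["occ"] - den["occ"] + D
    new_mu = (num["x"] - den["x"] + D * mu) / denom
    new_var = (num["x2"] - den["x2"] + D * (var + mu**2)) / denom - new_mu**2
    return new_mu, new_var

# Toy two-dimensional Gaussian and hand-made statistics.
mu = np.array([0.0, 1.0])
var = np.array([1.0, 2.0])
num = {"occ": 10.0, "x": np.array([1.0, 12.0]), "x2": np.array([11.0, 35.0])}
den = {"occ": 8.0,  "x": np.array([0.5, 9.0]),  "x2": np.array([9.0, 28.0])}
print(ebw_gaussian_update(mu, var, num, den, D=20.0))
```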
Setting the constant D • The speed of convergence of MMI-based optimization using Eqs. (4) and (5) is directly related to the value of the constant D. • Small D results in a larger step size and hence a potentially faster rate of convergence. • However, using small values of D typically results in oscillatory behavior which reduces the rate of convergence. • In practice a useful lower bound on D is the value which ensures that all variances remain positive.
Setting the constant D • Furthermore, using a single global value of D can lead to very slow convergence. • In preliminary experiments, it was found that the convergence speed could be further improved if D was set on a per-Gaussian level. • It was set to the maximum of: • twice the value necessary to ensure positive variance updates for all dimensions of the Gaussian; • a global constant E multiplied by the denominator occupancy, with E = 1, 2, or "halfmax".
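A sketch of this per-Gaussian rule, under the assumption that the minimum D giving positive variances is found by a simple doubling search (the paper does not specify this search); it reuses the ebw_gaussian_update sketch above, and all names are illustrative.

```python
import numpy as np

def min_D_for_positive_var(mu, var, num, den, update_fn, start=1.0):
    """Smallest D (found by doubling) for which all updated variances stay positive."""
    D = start
    while True:
        if num["occ"] - den["occ"] + D > 0:
            _, new_var = update_fn(mu, var, num, den, D)
            if np.all(new_var > 0):
                return D
        D *= 2.0

def per_gaussian_D(mu, var, num, den, update_fn, E=2.0):
    """D = max(2 x smallest D keeping variances positive, E x denominator occupancy)."""
    return max(2.0 * min_D_for_positive_var(mu, var, num, den, update_fn),
               E * den["occ"])

# Usage with the Gaussian update sketch above:
# D = per_gaussian_D(mu, var, num, den, ebw_gaussian_update, E=2.0)
```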
Mixture weight and transition probability updates • The originally proposed EBW re-estimation formula for the mixture weight parameters follows directly from Equation (4):
$$\hat{c}_{jm} = \frac{c_{jm}\left(\frac{\partial F}{\partial c_{jm}} + C\right)}{\sum_{m'} c_{jm'}\left(\frac{\partial F}{\partial c_{jm'}} + C\right)}$$
• The constant C is chosen such that all mixture weights remain positive.
Mixture weight and transition probability updates • However, the derivative is extremely sensitive to small-valued parameters. As an alternative, a more robust approximation for the derivative was suggested.
Improving MMIE generalization • A key issue in MMIE training is generalization performance. • While MMIE training often greatly reduces training set error relative to an MLE baseline, the reduction in error rate on an independent test set is normally much smaller, i.e. the generalization performance is poorer. • There have been a number of approaches to improving generalization performance for MMIE-type training schemes, some of which are discussed below.
Frame Discrimination • FD replaces the recognition-model likelihood in the denominator of the MMIE objective function with all Gaussians in parallel (a unigram Gaussian-level language model based on training set occurrences is used). • This, as would be expected, reduces training set performance compared to MMIE but improves generalization.
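A rough sketch of the frame-level denominator used in FD, assuming per-frame Gaussian log-likelihoods and unigram (training-set frequency) log priors over the Gaussians are available; everything named here is illustrative.

```python
import numpy as np

def fd_denominator_loglik(frame_gauss_logliks, gauss_log_priors):
    """Per-utterance FD denominator: all Gaussians 'in parallel' at every frame.

    frame_gauss_logliks: (T, G) array of log N(o_t; mu_g, var_g) for every Gaussian g.
    gauss_log_priors:    (G,) log unigram prior of each Gaussian (from training counts).
    Returns the total denominator log-likelihood summed over the T frames.
    """
    per_frame = np.logaddexp.reduce(frame_gauss_logliks + gauss_log_priors, axis=1)
    return per_frame.sum()

# Toy example: 3 frames, 4 Gaussians.
rng = np.random.default_rng(0)
logliks = rng.normal(loc=-40.0, scale=2.0, size=(3, 4))
priors = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
print(fd_denominator_loglik(logliks, priors))
```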
Weakened Language Models • It was shown that improved test-set performance could be obtained by using a unigram LM during MMIE training, even though a bigram or trigram was used during recognition. • The aim is to put more focus on the discrimination provided by the acoustic model by loosening the language model constraints.
Acoustic Model “Scaling” • When combining the likelihood from a GMM-based acoustic model with the LM, it is usual to scale the LM log probability. • This is necessary because, primarily due to invalid modeling assumptions, the HMM underestimates the probability of acoustic vector sequences, leading to a very wide dynamic range of likelihood values.
Acoustic Model “Scaling” • If language model scaling is used, one particular state-sequence tends to dominate the likelihood at any point in time and hence dominates any sums using path likelihoods. • If acoustic scaling is used, there will be several paths that have fairly similar likelihoods which make a non-negligible contribution to the summations.
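A small sketch (not from the paper) of why this matters: scaling the path log-likelihoods down by a factor before normalizing spreads the posterior mass over several paths instead of letting one path dominate; the scale value and scores below are illustrative.

```python
import numpy as np

def path_posteriors(path_logliks, inv_scale=1.0):
    """Posterior over competing paths after scaling log-likelihoods by inv_scale."""
    scaled = np.asarray(path_logliks, dtype=float) * inv_scale
    scaled -= scaled.max()               # subtract the max for numerical stability
    p = np.exp(scaled)
    return p / p.sum()

paths = [-1000.0, -1012.0, -1015.0]              # typical wide dynamic range
print(path_posteriors(paths, inv_scale=1.0))     # one path takes essentially all the mass
print(path_posteriors(paths, inv_scale=1/12.0))  # acoustic scaling: several paths contribute
```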
Experimental setup • Basic CU-HTK Hub5 system • CMS (cepstral mean subtraction) + VTLN (vocal tract length normalization) + unsupervised adaptation • LM: interpolation of Hub5 training transcriptions and Broadcast News • Trigram LM