
Large scale discriminative training for speech recognition



  1. Large scale discriminative training for speech recognition P.C. Woodland and D. Povey, Cambridge University Engineering Department, 2000. Presented by 士弘

  2. Outline • Introduction • MMIE criterion • Extended Baum-Welch algorithm • Improving MMIE generalization • Experimental setup • Hub5 MMIE training experiments • Conclusion

  3. Introduction • MMIE training is an alternative to MLE training. • During MLE training, model parameters are adjusted to increase the likelihood of the word strings corresponding to the training utterances, without taking account of the probability of other possible word strings. • In contrast to MLE, discriminative training schemes take account of possible competing word hypotheses and try to reduce the probability of incorrect hypotheses.

  4. Three issues • Unfortunately, the discriminative optimization of HMM parameters is much more complex than in the conventional MLE framework: • Computational load on LVCSR (addressed with lattices) • No closed-form solution (approximations are needed) • More iterations required (the step size must be controlled)

  5. Introducing MMIE Training • It is well known that the decoder with minimum probability of error is the so-called maximum a posteriori (MAP) decoder. [Figure: noisy channel model, W → Channel → Y]
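In standard notation (a hedged reconstruction, with W the word sequence and Y the observed acoustics), the MAP decoding rule is

\[
\hat{W} = \arg\max_{W} P(W \mid Y) = \arg\max_{W} \frac{p(Y \mid W)\,P(W)}{p(Y)} = \arg\max_{W} p(Y \mid W)\,P(W).
\]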

  6. Introducing MMIE Training • Of course, we know neither of these values, so instead we use
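Since neither the true acoustic distribution nor the true prior over word sequences is known, the plug-in rule presumably intended here substitutes the HMM likelihood and the language model (a sketch, assuming the usual notation p_\lambda and M_W for the composite HMM of word sequence W):

\[
\hat{W} = \arg\max_{W} p_\lambda(Y \mid M_W)\, P(W).
\]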

  7. MMIE criterion Language Model
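The criterion itself, written out for R training utterances O_r with reference transcriptions w_r (the standard form from the Woodland and Povey paper; the "Language Model" label on the slide points at the P(\cdot) terms):

\[
\mathcal{F}_{\mathrm{MMIE}}(\lambda) = \sum_{r=1}^{R} \log
\frac{p_\lambda(O_r \mid M_{w_r})\, P(w_r)}
     {\sum_{\hat{w}} p_\lambda(O_r \mid M_{\hat{w}})\, P(\hat{w})}.
\]

The numerator is the likelihood of the correct transcription; the denominator sums over all competing word sequences \hat{w}, weighted by the language model.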

  8. Auxiliary function • Auxiliary function for smoothing the MMIE criterion

  9. Estimating HMM Parameters with MMIE Basic Concepts • Consider the discrete HMM case :
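A hedged sketch of the discrete-HMM likelihood the slide builds on, assuming transition probabilities a_{ij} and discrete output probabilities b_j(k):

\[
p_\lambda(O \mid M) = \sum_{s_1,\dots,s_T} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t).
\]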

  10. Estimating HMM Parameters with MMIE Basic Concepts • Then • In Baum-Welch (MLE) training, is used to reestimate the values of HMM parameters.
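A plausible reconstruction of the quantities involved, assuming the standard forward-backward notation: the state occupancy and the expected symbol count

\[
\gamma_j(t) = P(s_t = j \mid O, \lambda), \qquad
c_j(k) = \sum_{t:\, o_t = k} \gamma_j(t),
\]

so that in Baum-Welch (MLE) training the output probabilities are re-estimated as \hat{b}_j(k) = c_j(k) / \sum_{k'} c_j(k').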

  11. Estimating HMM Parameters with MMIE Basic Concepts • Note that this can also be expressed in terms of the gradient
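The relation in question is presumably the standard identity linking the Baum-Welch count to the gradient of the log likelihood (same notation as above):

\[
c_j(k) = b_j(k)\, \frac{\partial \log p_\lambda(O \mid M)}{\partial b_j(k)}.
\]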

  12. Estimating HMM Parameters with MMIE Basic Concepts • Proof :
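A sketch of the argument, under the same assumptions: writing the likelihood as a sum over state sequences s, with n_{jk}(s) the number of times state j emits symbol k along s, each path probability contains the factor b_j(k)^{n_{jk}(s)}, so

\[
b_j(k)\, \frac{\partial \log p_\lambda(O \mid M)}{\partial b_j(k)}
= \frac{1}{p_\lambda(O \mid M)} \sum_{s} n_{jk}(s)\, P(O, s \mid \lambda)
= \sum_{s} n_{jk}(s)\, P(s \mid O, \lambda)
= c_j(k).
\]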

  13. Estimating HMM Parameters with MMIE Basic Concepts • For Gaussian densities, the reestimated mean vector and covariance matrix will be computed as
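Assuming the slide shows the usual occupancy-weighted (Baum-Welch) updates, a hedged reconstruction for the Gaussian attached to state j is

\[
\hat{\mu}_j = \frac{\sum_t \gamma_j(t)\, o_t}{\sum_t \gamma_j(t)}, \qquad
\hat{\Sigma}_j = \frac{\sum_t \gamma_j(t)\, (o_t - \hat{\mu}_j)(o_t - \hat{\mu}_j)^{\top}}{\sum_t \gamma_j(t)}.
\]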

  14. Estimating HMM Parameters with MMIE Basic Concepts • Now, whatever MMIE parameter estimation technique is used, the value of the gradient will have to be computed at each iteration for every parameter that must be estimated. • In order to do so, let us define a composite model containing a path corresponding to every possible word sequence W in the application.
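Writing the composite model as M_{den} (the symbol is assumed here, following the paper's convention), it can be defined so that its likelihood equals the MMIE denominator:

\[
p_\lambda(O \mid M_{den}) = \sum_{\hat{w}} p_\lambda(O \mid M_{\hat{w}})\, P(\hat{w}).
\]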

  15. Estimating HMM Parameters with MMIE Basic Concepts • The advantage of defining such a model is that it implicitly takes care of the sum in the denominator of the MMIE criterion. • Then, the value of the gradient is:
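A reconstruction of the gradient, split into a numerator (correct transcription) term and a denominator (composite model) term:

\[
\frac{\partial \mathcal{F}_{\mathrm{MMIE}}}{\partial \lambda}
= \sum_{r} \left(
\frac{\partial \log p_\lambda(O_r \mid M_{w_r})}{\partial \lambda}
- \frac{\partial \log p_\lambda(O_r \mid M_{den})}{\partial \lambda}
\right).
\]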

  16. Estimating HMM Parameters with MMIE Basic Concepts

  17. Estimating HMM Parameters with MMIE Basic Concepts
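Slides 16 and 17 presumably express this gradient through forward-backward counts; combining the count-gradient identity above with the numerator and denominator terms gives, for a discrete output probability (a hedged reconstruction),

\[
b_j(k)\, \frac{\partial \mathcal{F}_{\mathrm{MMIE}}}{\partial b_j(k)}
= c^{num}_j(k) - c^{den}_j(k),
\]

where c^{num}_j(k) is accumulated against the correct-transcription models M_{w_r} and c^{den}_j(k) against M_{den}. This difference of counts is what the next slide discusses.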

  18. Estimating HMM Parameters with MMIE Basic Concepts • If the correct transcription already dominates the denominator (the utterance is well recognized), then the denominator counts will be dominated by the paths corresponding to the correct transcription, and roughly the same amount will be added to and subtracted from each count, with negligible effect on the ultimate value of these counts. • If the correct transcription does not dominate (the utterance is misrecognized), then some counts in the model for the correct transcription will be incremented while counts in other models will be decremented.

  19. Estimating HMM Parameters with MMIE Alternatives to Gradient Descent • Gradient descent is a fairly safe parameter estimation technique in that, with a small enough step size, it should converge to some local optimum of the objective function. • The problem, of course, is that we do not want to use small step sizes, since they also mean slow convergence, something that usually cannot be afforded given the computational cost of each iteration.

  20. Beyond the Standard MMIE Training Formulation • The use of this compensation factor fits quite naturally within the MMIE framework, where it enters as an empirically estimated constant (see the sketch below).
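One plausible reading, given the scaling discussion later in the talk (the exponent \kappa is an assumed symbol, and this is a sketch rather than the slide's own formula): the language model probabilities in the criterion are raised to an empirically estimated power,

\[
\mathcal{F}(\lambda) = \sum_{r} \log
\frac{p_\lambda(O_r \mid M_{w_r})\, P(w_r)^{\kappa}}
     {\sum_{\hat{w}} p_\lambda(O_r \mid M_{\hat{w}})\, P(\hat{w})^{\kappa}}.
\]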

  21. Extended Baum-Welch algorithm • The MMIE objective function can be optimized by any of the standard gradient methods; however, such approaches are often slow to converge. • Analogous to the Baum-Welch algorithm for MLE training, Gopalakrishnan et al. (ICASSP 89) showed that re-estimation formulae of the form of Eq. (4) below can be used.
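Eq. (4), reconstructed in the usual extended Baum-Welch form for the parameters \theta_{jk} of a discrete distribution, with D a sufficiently large constant:

\[
\hat{\theta}_{jk} =
\frac{\theta_{jk}\left( \frac{\partial \mathcal{F}}{\partial \theta_{jk}} + D \right)}
     {\sum_{k'} \theta_{jk'}\left( \frac{\partial \mathcal{F}}{\partial \theta_{jk'}} + D \right)}.
\qquad (4)
\]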

  22. Mean and variance updates • For a continuous density HMM system, Eq. (4) does not lead to a closed-form solution for the means and variances. However, Normandin has shown that re-estimation formulae of the form of Eqs. (5) and (6) below can be used.
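Eqs. (5) and (6), reconstructed in the paper's notation, where \gamma^{num}_{jm} and \gamma^{den}_{jm} are the numerator and denominator occupancies of Gaussian m of state j, and \theta_{jm}(O), \theta_{jm}(O^2) are the corresponding sums of data and squared data:

\[
\hat{\mu}_{jm} =
\frac{\big\{\theta^{num}_{jm}(O) - \theta^{den}_{jm}(O)\big\} + D\,\mu_{jm}}
     {\big\{\gamma^{num}_{jm} - \gamma^{den}_{jm}\big\} + D} \qquad (5)
\]

\[
\hat{\sigma}^2_{jm} =
\frac{\big\{\theta^{num}_{jm}(O^2) - \theta^{den}_{jm}(O^2)\big\} + D\big(\sigma^2_{jm} + \mu^2_{jm}\big)}
     {\big\{\gamma^{num}_{jm} - \gamma^{den}_{jm}\big\} + D} - \hat{\mu}^2_{jm} \qquad (6)
\]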

  23. Setting the constant D • The speed of convergence of MMI-based optimization using Eqs. (4) and (5) is directly related to the value of the constant D. • Small D results in a larger step size and hence a potentially faster rate of convergence. • However, using small values of D typically results in oscillatory behavior which reduces the rate of convergence. • In practice a useful lower bound on D is the value which ensures that all variances remain positive.

  24. Setting the constant D

  25. Setting the constant D • Furthermore, using a single global value of D can lead to very slow convergence. • In preliminary experiments, it was found that the convergence speed could be further improved if D was set on a per-Gaussian level. • It was set at the maximum of • Twice the value necessary to ensure positive variance updates for all dimensions for the Gaussian • A global constant E multiplied by the denominator occupancy • E=1,2 or halfmax
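The per-Gaussian rule just listed can be written compactly as follows, where D^{\min}_{jm} denotes the smallest value keeping all variance updates in Eq. (6) positive for that Gaussian (the symbol is assumed here):

\[
D_{jm} = \max\!\big( 2\, D^{\min}_{jm},\; E\, \gamma^{den}_{jm} \big),
\qquad E \in \{1,\ 2,\ \text{halfmax}\}.
\]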

  26. Mixture weight and transition probability updates • The originally proposed EBW re-estimation formula for the mixture weight parameters follows directly from Equation (4) • The constant C is chosen such that all mixture weights are positive.
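The update in question, reconstructed in its usual form, with C chosen so that all re-estimated weights stay positive:

\[
\hat{c}_{jm} =
\frac{c_{jm}\left( \frac{\partial \mathcal{F}}{\partial c_{jm}} + C \right)}
     {\sum_{m'} c_{jm'}\left( \frac{\partial \mathcal{F}}{\partial c_{jm'}} + C \right)}.
\]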

  27. Mixture weight and transition probability updates • However, the derivative is extremely sensitive to small-valued parameters. As an alternative, a more robust approximation for the derivative was suggested.
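One commonly cited form of this approximation, following Merialdo, replaces the exact derivative (\gamma^{num}_{jm} - \gamma^{den}_{jm})/c_{jm} by ratios of occupancies; this is a reconstruction and should be treated as a sketch:

\[
\frac{\partial \mathcal{F}}{\partial c_{jm}} \;\approx\;
\frac{\gamma^{num}_{jm}}{\sum_{m'} \gamma^{num}_{jm'}} -
\frac{\gamma^{den}_{jm}}{\sum_{m'} \gamma^{den}_{jm'}}.
\]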

  28. Improving MMIE generalization • A key issue in MMIE training is generalization performance. • While MMIE training often greatly reduces training set error from an MLE baseline, the reduction in error rate on an independent test set is normally much smaller; that is, the generalization performance is poorer. • There have been a number of approaches that try to improve generalization performance for MMIE-type training schemes, some of which are discussed later.

  29. Frame Discrimination • FD replaces the recognition model likelihood in the denominator of the MMIE criterion with all Gaussians in parallel (a unigram Gaussian-level language model based on training set occurrences is used). • This in turn, as would be expected, reduces training set performance compared to MMIE but improves generalization.
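A hedged sketch of what "all Gaussians in parallel" means for the denominator, with P(m) the unigram Gaussian-level prior estimated from training set occupation counts (notation assumed):

\[
p_\lambda(O_r \mid M_{den}) \;\approx\; \prod_{t} \sum_{m} P(m)\, \mathcal{N}(o_t;\, \mu_m, \Sigma_m).
\]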

  30. Weakened Language Models • It was shown that improved test-set performance could be obtained by using a unigram LM during MMIE training, even though a bigram or trigram was used during recognition. • The aim is to provide more focus on the discrimination provided by the acoustic model by loosening the language model constraints.

  31. Acoustic Model “Scaling” • When combining the likelihood from a GMM-based acoustic model and the LM, it is usual to scale the LM log probability. • This is necessary because, primarily due to invalid modeling assumptions, the HMM underestimates the probability of acoustic vector sequences, leading to a very wide dynamic range of likelihood values.

  32. Acoustic Model “Scaling” • If language model scaling is used, one particular state-sequence tends to dominate the likelihood at any point in time and hence dominates any sums using path likelihoods. • If acoustic scaling is used, there will be several paths that have fairly similar likelihoods which make a non-negligible contribution to the summations.
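The contrast can be written as follows (a sketch, with \kappa the usual LM scale factor, a symbol assumed here):

\[
\text{LM scaling: } p_\lambda(O \mid W)\, P(W)^{\kappa}
\qquad\qquad
\text{acoustic scaling: } p_\lambda(O \mid W)^{1/\kappa}\, P(W).
\]

Both give the same best path, but the second compresses the dynamic range of the path likelihoods, so several competing paths retain non-negligible weight in the denominator sums.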

  33. Experimental setup • Basic CU-HTK Hub5 system • CMS + VTLN + unsupervised adaptation • LM (interpolation of Hub5 training transcriptions and Broadcast News) • Trigram LM

  34. Training Corpus

  35. Result

  36. Result

  37. Result

  38. Conclusion
