
Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition

Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition. Fei Sha*, Lawrence K. Saul University of Pennsylvania *University of California. Outline. Introduction Large margin mixture models Large margin classification Mixture models Extensions


Presentation Transcript


  1. Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition Fei Sha*, Lawrence K. Saul University of Pennsylvania *University of California

  2. Outline • Introduction • Large margin mixture models • Large margin classification • Mixture models • Extensions • Handling of outliers • Segmental training • Experimental results • Conclusion

  3. Introduction(1/3) • Much of the acoustic-phonetic modeling in ASR is handled by GMMs, but maximum likelihood (ML) estimation of GMMs does not directly optimize recognition performance • It is therefore of interest to develop alternative learning paradigms that optimize discriminative measures of performance • Large margin GMMs have many parallels to support vector machines (SVMs) but use ellipsoids to model classes instead of half-spaces (Figures: from MathWorld and Wikipedia)

  4. Introduction(2/3) • SVMs currently provide state-of-the-art performance in pattern recognition • The simplest setting for SVMs is binary classification, which computes the linear decision boundary that maximizes the margin of correct classification • For various reasons, it can be challenging to apply SVMs to large problems in multiway as opposed to binary classification • First, to apply the kernel trick, one must construct a large kernel matrix with as many rows and columns as training examples • Second, the training complexity increases with the number of classes

  5. Introduction(3/3) • As in SVMs, the proposed approach is based on the idea of margin maximization • Model parameters are trained discriminatively to maximize the margin of correct classification, as measured in terms of Mahalanobis distance • The required optimization is convex over the model’s parameter space of positive semidefinite matrices and can be performed efficiently

  6. Large margin mixture models(1/6) • The simplest large margin GMM represents each class of labeled examples by a single ellipsoid • Each ellipsoid is parameterized by • a vector “centroid” $\mu_c \in \mathbb{R}^d$ • a positive semidefinite “orientation” matrix $\Psi_c \in \mathbb{R}^{d \times d}$ • these are analogous to the means and inverse covariance matrices of multivariate Gaussians • in addition, a nonnegative scalar offset $\theta_c \ge 0$ for each class is used in the scoring procedure • Thus, let $(\mu_c, \Psi_c, \theta_c)$ denote the centroid, orientation matrix, and offset representing examples in class $c$

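  7. Large margin mixture models(2/6) • We label an example $x$ by whichever ellipsoid has the smallest Mahalanobis distance (plus offset) to its centroid: $y = \arg\min_c \{ (x - \mu_c)^\top \Psi_c (x - \mu_c) + \theta_c \}$ (1), where $\mu_c$, $\Psi_c$, and $\theta_c$ are the centroid, orientation matrix, and offset of class $c$, and the quadratic term is the Mahalanobis distance • The goal of learning is to estimate the parameters for each class of labeled examples that optimize the performance of this decision rule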
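The decision rule on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the two-class toy parameters are assumptions made here. The rule is taken to be the argmin over classes of the quadratic distance to the centroid plus the class offset, as the slide describes in words.

```python
def mahalanobis_sq(x, mu, psi):
    """Quadratic distance (x - mu)^T psi (x - mu) for a square matrix psi."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d[i] * psi[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))


def classify(x, classes):
    """Return the label whose ellipsoid gives the smallest distance plus offset."""
    return min(
        classes,
        key=lambda c: mahalanobis_sq(x, classes[c]["mu"], classes[c]["psi"])
        + classes[c]["theta"],
    )


# Hypothetical two-class toy model in two dimensions.
toy_classes = {
    "a": {"mu": [0.0, 0.0], "psi": [[1.0, 0.0], [0.0, 1.0]], "theta": 0.0},
    "b": {"mu": [3.0, 0.0], "psi": [[1.0, 0.0], [0.0, 4.0]], "theta": 0.5},
}

print(classify([1.0, 0.0], toy_classes))  # prints: a
print(classify([2.5, 0.0], toy_classes))  # prints: b
```

The offset lets a class with a tight ellipsoid still lose to a neighbor when both distances are small, which is why it is scored together with the distance.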

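  8. Large margin mixture models(3/6) • It’s useful to collect the ellipsoid parameters of each class (centroid $\mu_c$, orientation matrix $\Psi_c$, offset $\theta_c$) in a single enlarged positive semidefinite matrix: $\Phi_c = \begin{bmatrix} \Psi_c & -\Psi_c \mu_c \\ -\mu_c^\top \Psi_c & \mu_c^\top \Psi_c \mu_c + \theta_c \end{bmatrix}$ (2) • We can then rewrite the decision rule in eq.(1) as simply: $y = \arg\min_c \{ z^\top \Phi_c z \}$ (3), where $z = [x; 1]$ is the example $x$ augmented with a unit element • The goal of learning is simply to estimate the single matrix $\Phi_c$ for each class of labeled examples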
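As a sanity check on the enlarged-matrix form, the sketch below builds a block matrix from a centroid, an orientation matrix, and an offset, then confirms numerically that the quadratic form in the augmented vector equals the distance plus offset. The block layout used here (orientation matrix in the upper left, offset folded into the lower-right corner) is an assumption that reproduces that identity. The helper names `build_phi`, `score`, and `direct_score` are hypothetical.

```python
def build_phi(mu, psi, theta):
    """Assemble the (d+1) x (d+1) matrix whose quadratic form in z = [x, 1]
    equals (x - mu)^T psi (x - mu) + theta. Assumes psi is symmetric."""
    d = len(mu)
    psi_mu = [sum(psi[i][j] * mu[j] for j in range(d)) for i in range(d)]
    corner = sum(mu[i] * psi_mu[i] for i in range(d)) + theta
    phi = [list(psi[i]) + [-psi_mu[i]] for i in range(d)]
    phi.append([-v for v in psi_mu] + [corner])
    return phi


def score(phi, x):
    """Quadratic form z^T phi z with z = [x, 1]."""
    z = list(x) + [1.0]
    n = len(z)
    return sum(z[i] * phi[i][j] * z[j] for i in range(n) for j in range(n))


def direct_score(x, mu, psi, theta):
    """Distance plus offset computed without the enlarged matrix."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    n = len(d)
    return sum(d[i] * psi[i][j] * d[j] for i in range(n) for j in range(n)) + theta


# Hypothetical parameters for one class.
mu, psi, theta = [3.0, 0.0], [[1.0, 0.0], [0.0, 4.0]], 0.5
phi = build_phi(mu, psi, theta)
x = [2.5, 1.0]
print(score(phi, x), direct_score(x, mu, psi, theta))  # both are 4.75
```

Working with one matrix per class makes every score linear in the parameters being learned, which is what the later convexity claims rely on.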

  9. Large margin mixture models(4/6)

  10. Large margin classification(5/6) • In more detail, let $\{(x_n, y_n)\}_{n=1}^{N}$ denote a set of N labeled examples drawn from C classes, where $x_n \in \mathbb{R}^d$ and $y_n \in \{1, 2, \ldots, C\}$ • In large margin GMMs, we seek matrices Φc such that all the examples in the training set are correctly classified by a large margin, i.e., situated far from the decision boundaries that define competing classes • For the nth example with class label yn, this condition can be written as: $\forall c \neq y_n,\; z_n^\top \Phi_c z_n \ge 1 + z_n^\top \Phi_{y_n} z_n$ (4), where $z_n = [x_n; 1]$ • Eq. (4) states that for each competing class c ≠ yn, the Mahalanobis distance (plus offset) to the cth centroid exceeds the Mahalanobis distance (plus offset) to the target centroid by a margin of at least one unit

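  11. Large margin classification(6/6) • We adopt a convex loss function for training large margin GMMs • Letting [f]+ = max(0,f) denote the so-called “hinge” function, the loss is: $L = \gamma \sum_c \mathrm{trace}(\Psi_c) + \sum_n \sum_{c \neq y_n} [\, 1 + z_n^\top (\Phi_{y_n} - \Phi_c) z_n \,]_+$ (5), where $\Psi_c$ is the orientation block of Φc, $z_n = [x_n; 1]$, and $\gamma$ is a regularization constant • The first term regularizes the matrices Φc; the second term penalizes margin violations of eq.(4) • The loss function is a piecewise linear, convex function of the Φc, which are further constrained to be positive semidefinite • Its optimization can thus be formulated as a problem in semidefinite programming that can be generically solved by interior point algorithms with polynomial time guarantees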
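To make the hinge loss concrete, here is a small sketch that evaluates it for fixed matrices on a toy one-dimensional problem. It only evaluates the objective and does not perform the semidefinite optimization. The function names, the toy data, and the specific regularizer (trace of the upper-left block of each matrix, scaled by `gamma`) are assumptions made for illustration.

```python
def score(phi, x):
    """Quadratic form z^T phi z with z = [x, 1]."""
    z = list(x) + [1.0]
    n = len(z)
    return sum(z[i] * phi[i][j] * z[j] for i in range(n) for j in range(n))


def hinge(f):
    """The hinge function [f]+ = max(0, f)."""
    return max(0.0, f)


def large_margin_loss(phis, examples, gamma=0.0):
    """Sum of hinge penalties for margin violations plus a trace regularizer."""
    margin_term = 0.0
    for x, y in examples:
        target = score(phis[y], x)
        for c, phi in phis.items():
            if c != y:
                # Violation when the competing score is not at least 1 larger.
                margin_term += hinge(1.0 + target - score(phi, x))
    # Assumed regularizer: trace of the upper-left d x d block of each matrix.
    reg = sum(sum(phi[i][i] for i in range(len(phi) - 1)) for phi in phis.values())
    return margin_term + gamma * reg


# Hypothetical 1-D model: class "a" centered at 0, class "b" centered at 3.
phis = {
    "a": [[1.0, 0.0], [0.0, 0.0]],
    "b": [[1.0, -3.0], [-3.0, 9.0]],
}
examples = [([0.0], "a"), ([1.5], "a"), ([3.0], "b")]

print(large_margin_loss(phis, examples))             # prints: 1.0
print(large_margin_loss(phis, examples, gamma=0.5))  # prints: 2.0
```

Only the example at 1.5 contributes: it sits exactly on the decision boundary, so it is not misclassified but still violates the unit margin.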

  12. Mixture models(1/2) • Now, extend the previous model to represent each class by multiple ellipsoids • Let Φcm denote the matrix for the mth ellipsoid in class c • Each example xn has not only a class label yn, but also a mixture component label mn • The component labels are obtained by fitting a GMM to the examples in each class by ML estimation, then, for each example, computing the mixture component with the highest posterior probability under this GMM
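The labeling step for mixture components can be sketched as follows, assuming the per-class GMM has already been fitted by ML estimation. For brevity the sketch assumes diagonal covariances, which the slides do not specify. The function names and toy parameters are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def proxy_component(x, weights, means, variances):
    """Index of the mixture component with the highest posterior for x.

    The posterior is proportional to weight * density, so the argmax of the
    log joint is also the argmax of the posterior.
    """
    log_joint = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    return max(range(len(log_joint)), key=log_joint.__getitem__)


# Hypothetical two-component GMM for one class.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

print(proxy_component([0.5, 0.2], weights, means, variances))  # prints: 0
print(proxy_component([3.5, 4.2], weights, means, variances))  # prints: 1
```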

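  13. Mixture models(2/2) • Given joint labels (yn, mn), we replace the large margin criterion in eq. (4) by: $\forall c \neq y_n,\; -\log \sum_m e^{-z_n^\top \Phi_{cm} z_n} \ge 1 + z_n^\top \Phi_{y_n m_n} z_n$ (6), where $z_n = [x_n; 1]$ • To see that this enforces a unit margin against every ellipsoid of each competing class, note that $\min_m a_m \ge -\log \sum_m e^{-a_m}$ • Replace the loss function: the hinge terms become $[\, 1 + z_n^\top \Phi_{y_n m_n} z_n + \log \sum_m e^{-z_n^\top \Phi_{cm} z_n} \,]_+$ • It is no longer piecewise linear, but remains convex • Presenter's note: need derivation!!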
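The bound behind the mixture criterion follows from a short argument, sketched here. The symbols are introduced for illustration only: $a_1, \ldots, a_M$ are generic real numbers standing in for the scores of the ellipsoids of one competing class, and $a^\ast$ is their minimum.

```latex
\begin{align*}
\sum_{m=1}^{M} e^{-a_m} &\ge e^{-a^\ast}, \qquad a^\ast = \min_m a_m
  && \text{(all terms are positive; keep only the largest)} \\
\log \sum_{m=1}^{M} e^{-a_m} &\ge -a^\ast
  && \text{(the logarithm is increasing)} \\
-\log \sum_{m=1}^{M} e^{-a_m} &\le \min_m a_m
  && \text{(negate both sides)}
\end{align*}
```

So whenever the soft minimum on the left is at least one unit above the target score, every individual $a_m$ is as well. Convexity is preserved because each score is linear in the matrices, the log-sum-exp of linear functions is convex, and the hinge function is convex and nondecreasing.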

  14. Extensions • Two further extensions of large margin GMMs are important for problems in ASR: • Handling of outliers • Segmental training • Handling of outliers • Many discriminative learning algorithms are sensitive to outliers • We adopt a simple strategy to detect outliers and reduce their malicious effect on learning

  15. Handling of outliers(1/2) • Outliers are detected using ML estimates of the mean and covariance matrix of each class • These estimates are used to initialize matrices $\Phi_c^{ML}$ of the form in eq. (2) • For each example xn, we compute the accumulated hinge loss incurred by violations of the large margin constraints: $h_n^{ML} = \sum_{c \neq y_n} [\, 1 + z_n^\top (\Phi_{y_n}^{ML} - \Phi_c^{ML}) z_n \,]_+$ (7), where $z_n = [x_n; 1]$ • We associate the outliers with large values of $h_n^{ML}$

  16. Handling of outliers(2/2) • In particular, correcting one badly misclassified outlier decreases the cost function proposed in eq. (5) more than correcting multiple examples that lie just barely on the wrong side of a decision boundary • We reweight the hinge loss terms in eq. (5) involving example xn by a multiplicative factor of $\min(1, 1/h_n^{ML})$, where $h_n^{ML}$ is the accumulated hinge loss of xn under the ML-initialized model • This reweighting equalizes the losses incurred by all initially misclassified examples
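The effect of the reweighting can be seen on a handful of numbers. The sketch below assumes the multiplicative factor is min(1, 1/h), where h is the example's accumulated hinge loss under the ML-initialized model. That form is an assumption consistent with the slide's statement that all initially misclassified examples end up with equal loss. The function name and the toy values are hypothetical.

```python
def outlier_weights(h_ml):
    """Per-example factors min(1, 1/h); examples with h <= 1 keep weight 1."""
    return [1.0 if h <= 1.0 else 1.0 / h for h in h_ml]


# Hypothetical accumulated hinge losses under the ML-initialized model.
h_ml = [0.0, 0.5, 1.0, 4.0, 8.0]
weights = outlier_weights(h_ml)
reweighted = [h * w for h, w in zip(h_ml, weights)]

print(weights)     # prints: [1.0, 1.0, 1.0, 0.25, 0.125]
print(reweighted)  # prints: [0.0, 0.5, 1.0, 1.0, 1.0]
```

After reweighting, the example with loss 8 counts no more than the one with loss 4, so a single extreme outlier cannot dominate the objective at the start of training.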

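  17. Segmental training • We can relax margin constraints to apply, collectively, to multiple examples known to share the same class label • Let p index the $\ell$ frames $\{x_{np}\}_{p=1}^{\ell}$ in the nth phonetic segment • For segmental training, we rewrite the constraints as: $\forall c \neq y_n,\; -\frac{1}{\ell} \sum_{p=1}^{\ell} \log \sum_m e^{-z_{np}^\top \Phi_{cm} z_{np}} \ge 1 + \frac{1}{\ell} \sum_{p=1}^{\ell} z_{np}^\top \Phi_{y_n m_{np}} z_{np}$ (8), where $z_{np} = [x_{np}; 1]$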

  18. Experimental results(1/3) • We applied large margin GMMs to problems in phonetic classification and recognition on the TIMIT database • We mapped the 61 phonetic labels in TIMIT to 48 classes and trained ML and large margin GMMs for each class • Our front end computed MFCCs with 25 ms windows at a 10 ms frame rate • We retained the first 13 MFCC coefficients of each frame, along with their first and second time derivatives • GMMs modeled these 39-dimensional feature vectors after they were whitened by PCA
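The 39-dimensional figure comes from stacking 13 static coefficients with their first and second time derivatives (13 × 3). The sketch below shows only that stacking step. The derivative estimate (a central difference with edge frames repeated) is an assumption, since the slides do not say how the derivatives were computed. The PCA whitening step is omitted, and the function names and input values are hypothetical.

```python
def deltas(frames):
    """Central-difference time derivative, repeating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_features(static):
    """Concatenate static coefficients with first and second derivatives."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Hypothetical input: 5 frames of 13 static coefficients each.
static = [[float(t * k) for k in range(13)] for t in range(5)]
features = stack_features(static)

print(len(features), len(features[0]))  # prints: 5 39
```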

  19. Experimental results(2/3) • Phonetic classification • We assume that the speech has been correctly segmented into phonetic units, but that the phonetic class label of each segment is unknown • The best large margin GMM yields a slightly lower classification error rate than the state-of-the-art result (21.7%) obtained by hidden conditional random fields • Phonetic recognition • The recognizers were first-order HMMs with one context-independent state per phonetic class • The large margin GMMs lead to consistently lower error rates

  20. Experimental results(3/3)

  21. Conclusion • We have shown how to learn GMMs for multiway classification based on principles similar to those of large margin classification in SVMs • Classes are represented by ellipsoids whose location, shape, and size are discriminatively trained to maximize the margin of correct classification, as measured in terms of Mahalanobis distance • Ongoing work • Investigating the use of context-dependent phone models • Studying schemes for integrating the large margin training of GMMs with sequence models such as HMMs and/or conditional random fields

  22. Mahalanobis distance • In statistics, Mahalanobis distance is a distance measure introduced by P.C. Mahalanobis in 1936 • It is based on correlations between variables, by which different patterns can be identified and analysed • It’s a useful way of determining the similarity of an unknown sample set to a known one

  23. Ellipsoid • Any quadratic form can be written in the form $x^\top A x$ • For example:
