1 / 72

Linear Discriminant Feature Extraction for Speech Recognition

This article discusses the history, statistical facts, and solutions of linear discriminant analysis (LDA) for speech recognition. It covers the problem, solution, and limitations of LDA in speech recognition applications.

bburkhalter
Download Presentation

Linear Discriminant Feature Extraction for Speech Recognition

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Linear Discriminant Feature Extraction for Speech Recognition Hung-Shin Lee Master Student Spoken Language Processing Lab National Taiwan Normal University 2008/08/14

  2. Arborescence TotalVariation (Campbell, 1984) (Liang, 2007) (Li, 2000) (Loog, 2001) (Loog, 2004) WLDA(Lee, 2008a) DE-LDA(Lee, 2008b) HDA(Saon, 2000) PLDA(Sakai, 2007) HLDA(Kumar, 1998) Linear Discriminant Analysis Discriminant Analysis Error-Rate RelatedLimitations Homoscedasticity Overemphasis Distance Measure Empirical Error Formulations Optimizations Generalized Variance AlternativeFormulations Linear Discriminant Feature Extraction for Speech Recognition

  3. Outline • History (till 1998) - LDA and Speech Recognition • Statistical Facts • Linear Discriminant Analysis (LDA)- Formulations - Optimizations • Some Remarks- Trace & Determinant - Solvability- Decorrelation - Geometry • Error-Rate Related Limitations- Overemphasis - Distance Measure - Empirical Error • Alternative Formulations • Conclusion Linear Discriminant Feature Extraction for Speech Recognition

  4. 1979 • Hunt first used LDA for separating syllables. History (till 1998) - LDA and Speech Recognition 1987 • Brown verified that LDA is superior to PCA for a DHMM classifier. • Brown captured the contextual information by applying LDA on an augmented feature vector. 1989~ • Researchers applied LDA to continuous HMM speech recognition system and reported improved performance on small vocabulary tasks. 1990~ • On large vocabulary phoneme-based systems, LDA led to mixed results. 1992~ • Haeb-Umbach Defined sub-phone units (states) as classes to be discriminated in the LDA transform proved most effective for a continuous mixture density based speech recognizer. Linear Discriminant Feature Extraction for Speech Recognition

  5. Statistical Facts (1) Note that here the sample covariance is a biased estimate of the population covariance. • Class mean vector (sample): • Class covariance (sample): • Total mean vector (sample): Linear Discriminant Feature Extraction for Speech Recognition

  6. Statistical Facts (2) • Within-class scatter: • Between-class scatter: An estimate of the prior probability for class i Note that in general a covariance matrix is symmetric and positive definite with positive eigenvalues. cf. Appendix A class-mean difference Linear Discriminant Feature Extraction for Speech Recognition

  7. Statistical Facts (3) • Total covariance (sample): cf. Appendix B Linear Discriminant Feature Extraction for Speech Recognition

  8. Linear Discriminant Analysis - Problem (1) • PROBLEM: To separate populations. • Suppose we have C classes from n-dimensional distributions. We want to project these classes onto a p-dimensional subspace (p < n) so that the variation among the classes is as large as possible, relative to the variation within the classes. (Wilks, 1963) X2 ? 2 1 ? 3 X1 Linear Discriminant Feature Extraction for Speech Recognition

  9. Linear Discriminant Analysis - Problem (2) • Practically speaking, after the projection, we want Linear Discriminant Feature Extraction for Speech Recognition

  10. Linear Discriminant Analysis - Solution (1) • SOLUTION: Linear Discriminant Analysis (LDA) • Thus, The goal of LDA is to seek a linear transformation that reduces the dimensionality of a given n-dimensional feature vector to p (p < n) by maximizing the discrimination criteria: (Fukunaga, 1990)where This technique was developed by R. A. Fisher (1936) for the two-class case and extended by C. R. Rao (1948)to handle the multiclass case. Linear Discriminant Feature Extraction for Speech Recognition

  11. Linear Discriminant Analysis - Solution (2) • Theoretically, LDA can be interpreted from two aspects: (Johnson et al., 2002), (Gao et al., 2006), (Welch, 1939) • One comes from the Bayesian decision theory with LDA as a straightforward application of C Gaussians with equal covariance. • The other is Fisher’s LDA, which is defined by maximizing the ratio of the between-class and within-class scatter in a linear feature space. • Here only the latter is discussed. Linear Discriminant Feature Extraction for Speech Recognition

  12. Linear Discriminant Analysis - Trace (1) • In , what is the meaning of “trace”? • The Mahalanobis distance can be used as a measurement of class separation. (Fisher, 1936) • Why don’t we first use principal component analysis (PCA) to decorrelate variables, and then use the Euclidean distance and a measure of class separation? Mahalanobis distance is a distance measure based on correlations between variables and is scale-invariant. Linear Discriminant Feature Extraction for Speech Recognition

  13. Linear Discriminant Analysis - Trace (2) • In , what is the meaning of “trace”? (cont.) • Assume all classes share the same covariance , the square of the Mahalanobis distance between and is • The average of for all classes is Linear Discriminant Feature Extraction for Speech Recognition

  14. Linear Discriminant Analysis - Determinant (1) • In , what is the meaning of “determinant”? • The determinant is the product of the eigenvalues, and hence is the product of the “variances” in the principal directions, thereby measuring the square of the hyperellipsoidal scattering volume. (Duda et al., 2001) • The determinant of a nonsingular covariance matrix can also be viewed as the generalized variance. (Wilks, 1963) • The volume of space occupied by the cloud of data points is proportional to the square root of the generalized variance. Linear Discriminant Feature Extraction for Speech Recognition

  15. Linear Discriminant Analysis - Determinant (2) • The concepts of the generalized variance ( ): Linear Discriminant Feature Extraction for Speech Recognition

  16. Linear Discriminant Analysis - Determinant (3) • The concepts of the generalized variance ( ): (cont.) Linear Discriminant Feature Extraction for Speech Recognition

  17. Linear Discriminant Analysis - Determinant (4) • The concepts of the generalized variance ( ): (cont.) Linear Discriminant Feature Extraction for Speech Recognition

  18. Linear Discriminant Analysis - Determinant (5) • The concepts of the generalized variance ( ): • Assume all classes share the same covariance Linear Discriminant Feature Extraction for Speech Recognition

  19. Linear Discriminant Analysis - Determinant (6) • The concepts of the generalized variance ( ): (cont.) • Assume all classes share the same covariance Linear Discriminant Feature Extraction for Speech Recognition

  20. Linear Discriminant Analysis - Optimization (1) • Optimization of : (Fukunaga, 1990) • Differentiating with respect to , and setting the result to zero. Linear Discriminant Feature Extraction for Speech Recognition

  21. Linear Discriminant Analysis - Optimization (2) cf. Appendix C • Optimization of : (cont.) • Simultaneously diagonalize and . • are the eigenvalues of . Note that the eigenvector matrix is not unique. Linear Discriminant Feature Extraction for Speech Recognition

  22. Linear Discriminant Analysis - Optimization (3) Note that all of the eigenvalues are positive. • Optimization of : (cont.) • Note that the original criterion becomes • That is, we can maximize by selecting the largest p eigenvalues of . The corresponding eigenvectors form the transformation matrix. • Note also that , since has a maximum rank of C-1. is the sum of C rank one or less matrices, and because only C-1 of these are independent, Linear Discriminant Feature Extraction for Speech Recognition

  23. Linear Discriminant Analysis - Optimization (4) • Optimization of : • Differentiating with respect to , and setting the result to zero. Linear Discriminant Feature Extraction for Speech Recognition

  24. Linear Discriminant Analysis - Optimization (5) • Optimization of : (cont.) • We can see that the procedure is similar to that of • That is, we can maximize by selecting the largest p eigenvalues of . The corresponding eigenvectors form the transformation matrix. Note that all of the eigenvalues are positive. Linear Discriminant Feature Extraction for Speech Recognition

  25. Some Remarks - Trace & Determinant • Comparisons between two measures of class separation: (Pena, 2003) Linear Discriminant Feature Extraction for Speech Recognition

  26. Some Remarks - Solvability • The solving procedure of LDA is lightweight. • We can see that if a optimization problem can be expressed asthen it can be solved as a generalized eigen-analysis problem with and being the i-th eigenvector and eigenvalue of . Prieto (2003) demonstrated a general solution of the optimization problem of LDA. Linear Discriminant Feature Extraction for Speech Recognition

  27. Some Remarks - Decorrelation (1) • Unlike principal component analysis (PCA), the linear discriminant transformation from the original variates to the new variates is not orthogonal. • But the linear discriminant transformation makes the transformed variates statistically uncorrelated. (Krzanowski, 1988) • That’s why we sometimes use LDA to replace discrete cosine transform (DCT). (Li, 2004) LDA Speech Filter Bank Feature DCT Linear Discriminant Feature Extraction for Speech Recognition

  28. Some Remarks - Decorrelation (2) • From Any two particular eigenvalue/eigenvector pairs and , • Pre-multiplying by and , respectively. Linear Discriminant Feature Extraction for Speech Recognition

  29. Some Remarks - Decorrelation (3) • To overcome arbitrary scaling of , it is usual to adopt the normalization • The optimization problems and can be equivalent to • Many researchers tried to modify the constraint to make their tasks applicable. (Sammon, 1970), (Foley, 1975), (Duchene, 1988) With the constraint, will be unique. Linear Discriminant Feature Extraction for Speech Recognition

  30. Some Remarks - Decorrelation (4) • Thus, the LDA transformation are uncorrelated both within and between classes, and are scaled to have unit variance within classes. • With the constraint, we can give LDA a geometrical meaning. Linear Discriminant Feature Extraction for Speech Recognition

  31. Some Remarks - Geometry (1) • The derivation of the LDA transformation matrix can be geometrically viewed as a two-stage procedure: (Campbell, 1981) • In the first stage, is used as an orthogonal and whitening transformation of the original feature vectors (or variables). X2 Y2 The distribution contour for each class turns to be a unit circle - that’s convenient for measuring class separation by the Euclidean distance. 2 2 1 1 3 3 X1 Y1 Linear Discriminant Feature Extraction for Speech Recognition

  32. Some Remarks - Geometry (2) • The second stage involves a principal component analysis (PCA) on the transformed class means, which seeks new axes that coincide with the directions having the maximum variations of the class means. Z1 PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on Y2 2 1 3 Y1 Linear Discriminant Feature Extraction for Speech Recognition

  33. Some Remarks - Geometry (3) • The algebraic meaning: (Lee, 2008a) • After the first stage (the whitening stage): • The second stage (the PCA stage): Linear Discriminant Feature Extraction for Speech Recognition

  34. Some Remarks - Geometry (4) • We can see that after the two stages, both between- and within-class scatter matrices become diagonal. • To get back the transformation matrix for the original variables, has to be pre-multiplied by . • It can be algebraically proven that also maximizes the LDA criterion in the original feature space. Linear Discriminant Feature Extraction for Speech Recognition

  35. Some Remarks - Geometry (5) • Thus, we can give an alternative algorithm for deriving an unique LDA transformation. • According to the geometric analysis of LDA, at least two possible directions are offered to further generalize LDA. • To obtain more effective estimates of the within-class scatter • To modify the between-class scatter for better class discrimination. Linear Discriminant Feature Extraction for Speech Recognition

  36. Error-Rate Related Limitations - Overemphasis (1) • After the whitening stage, we can see: Z1 After the projection, the distance between the class pairs with larger distance in the original space are still larger. Y2 2 1 3 Y1 Linear Discriminant Feature Extraction for Speech Recognition

  37. Error-Rate Related Limitations - Overemphasis (2) • The larger , the more the direction becomes visible in the eigenvectors corresponding to the larger eigenvalues of . • Thus, there are large distances between class pairs completely dominate the eigenvalue decomposition. • Consequently, there is a overlap among the remaining classes, leading to an overall low and suboptimal classification error rate. Y2 2 1 The principal axe is dominantly determined by the distance between classes 2 and 3, but classes 1 and 2 need to get more separation for better classification accuracy. 3 Z1 overlap Y1 Linear Discriminant Feature Extraction for Speech Recognition

  38. Error-Rate Related Limitations - Overemphasis (3) • To alleviate the overemphasis of the influence of classes that are already well-separated, some weighting based approaches were proposed. • Modifying the LDA criterion by replacing with the following weighted form: weighting function Linear Discriminant Feature Extraction for Speech Recognition

  39. Error-Rate Related Limitations - Overemphasis (4) • The weighting based LDA has a good solvability, since is invariant to any linear transformations (whitening). • Thus, similar to LDA, we can give an algorithm for deriving an unique transformation. Linear Discriminant Feature Extraction for Speech Recognition

  40. Error-Rate Related Limitations - Overemphasis (5) • A simple and intuitional weighting function is: (Li, 2000), (Liang, 2007) • The above function in essence is a monotonically decreasing functions of such that those class pairs with large will not be overemphasized. Linear Discriminant Feature Extraction for Speech Recognition

  41. Error-Rate Related Limitations - Distance Measure (1) • As a distance-measure based approach, LDA tries to maximize the Mahalanobis distance between each class-mean pairs. • LDA is not directly associated with the classification error. • But, LDA is optimal for classification in a Bayesian sense on the following conditions: • The two-class problem • The classes are normal-distributed with equal-covariance. Linear Discriminant Feature Extraction for Speech Recognition

  42. Limitations & Improvements - Distance Measure (2) • Under the equal-covariance assumption, for two normal-distributed classes, LDA is an optimal approach on two-class classification. • The Bayes error: Linear Discriminant Feature Extraction for Speech Recognition

  43. Error-Rate Related Limitations - Distance Measure (3) • The Chernoff bound: • That is, maximizing the LDA criterion is equivalent to minimizing the Bayes error. Linear Discriminant Feature Extraction for Speech Recognition

  44. Error-Rate Related Limitations - Distance Measure (4) • On multi-class classification, LDA is suboptimal. • The multi-class classification error rates have not been fully investigated. • Geometrically speaking, after the whitening stage, even if the equal-covariance assumption is satisfied, LDA can not guarantee the smaller (not necessarily minimum) overlap among overall classes. • Loog (2001) proposed a new criterion, the approximate pairwise accuracy criterion (aPAC), to solve this problem. • aPAC is derived from an attempt to approximate the Bayes error for pairs of classes. • aPAC still retains the equal-covariance assumption of LDA, and simultaneously retains the computational simplicity of LDA. Linear Discriminant Feature Extraction for Speech Recognition

  45. Error-Rate Related Limitations - Distance Measure (5) • The criterion of aPAC is . • means the error function that is twice the integral of the Gaussian distribution with 0 mean and variance of 1/2. • aPAC can well confine the influence of outlier classes - that makes them more robust than LDA. • But, it cannot be guaranteed that aPAC always lead to improved classification rate. Linear Discriminant Feature Extraction for Speech Recognition

  46. Error-Rate Related Limitations - Distance Measure (6) • Loog’s method (aPAC) (2001) can also solve the overemphasis problem. • The above function in essence is a monotonically decreasing functions of such that those class pairs with large will not be overemphasized. Linear Discriminant Feature Extraction for Speech Recognition

  47. Error-Rate Related Limitations - Distance Measure (7) • To deal with the heteroscedastic data, Loog (2004) modified the aPAC, and proposed a new idea of directed distance matrices (DDMs). • DDMs can be considered as generalizations of , which is based on the Chernoff distance. • The Chernoff distance gives an upper bound on the error probability for two normally distributed densities. Linear Discriminant Feature Extraction for Speech Recognition

  48. Error-Rate Related Limitations - Distance Measure (8) • The Chernoff criterion is It can be showed that Linear Discriminant Feature Extraction for Speech Recognition

  49. Error-Rate Related Limitations - Distance Measure (9) • Although Loog’s method (2004) skillfully transformed the distance-measure based approach into classification-accuracy based one, it suffers some limitations. • It is not distribution-free. • Similar to Loog’s method (2001), it approximates the theoretical C-class Bayes error by sum of the two-class errors, which is an upper bound to the C-class error. • Not all of the classifiers are completely designed as Bayesian contextual ones. Linear Discriminant Feature Extraction for Speech Recognition

  50. Error-Rate Related Limitations - Empirical Error Rate (1) • Lee (2008a) incorporated the empirical classification information from the training data into the derivation of LDA to form a classifier-related objective function. • Advocating the concept of pairwise class confusion information, Lee’s method takes the empirical classification error rates resulted from a given classifier into consideration, rather than operating merely in the Bayesian sense. • The corresponding weighting function is expressed as the number of samples that originally belong to class i but are misallocated to class j by the classifier. Linear Discriminant Feature Extraction for Speech Recognition

More Related