730 likes | 898 Views
Linear Discriminant Feature Extraction for Speech Recognition. Hung-Shin Lee Master Student Spoken Language Processing Lab National Taiwan Normal University 2008/08/14. Arborescence. Total Variation. (Campbell, 1984). (Liang, 2007). (Li, 2000). (Loog, 2001). (Loog, 2004).
E N D
Linear Discriminant Feature Extraction for Speech Recognition Hung-Shin Lee Master Student Spoken Language Processing Lab National Taiwan Normal University 2008/08/14
Arborescence TotalVariation (Campbell, 1984) (Liang, 2007) (Li, 2000) (Loog, 2001) (Loog, 2004) WLDA(Lee, 2008a) DE-LDA(Lee, 2008b) HDA(Saon, 2000) PLDA(Sakai, 2007) HLDA(Kumar, 1998) Linear Discriminant Analysis Discriminant Analysis Error-Rate RelatedLimitations Homoscedasticity Overemphasis Distance Measure Empirical Error Formulations Optimizations Generalized Variance AlternativeFormulations Linear Discriminant Feature Extraction for Speech Recognition
Outline • History (till 1998) - LDA and Speech Recognition • Statistical Facts • Linear Discriminant Analysis (LDA)- Formulations - Optimizations • Some Remarks- Trace & Determinant - Solvability- Decorrelation - Geometry • Error-Rate Related Limitations- Overemphasis - Distance Measure - Empirical Error • Alternative Formulations • Conclusion Linear Discriminant Feature Extraction for Speech Recognition
1979 • Hunt first used LDA for separating syllables. History (till 1998) - LDA and Speech Recognition 1987 • Brown verified that LDA is superior to PCA for a DHMM classifier. • Brown captured the contextual information by applying LDA on an augmented feature vector. 1989~ • Researchers applied LDA to continuous HMM speech recognition system and reported improved performance on small vocabulary tasks. 1990~ • On large vocabulary phoneme-based systems, LDA led to mixed results. 1992~ • Haeb-Umbach Defined sub-phone units (states) as classes to be discriminated in the LDA transform proved most effective for a continuous mixture density based speech recognizer. Linear Discriminant Feature Extraction for Speech Recognition
Statistical Facts (1) Note that here the sample covariance is a biased estimate of the population covariance. • Class mean vector (sample): • Class covariance (sample): • Total mean vector (sample): Linear Discriminant Feature Extraction for Speech Recognition
Statistical Facts (2) • Within-class scatter: • Between-class scatter: An estimate of the prior probability for class i Note that in general a covariance matrix is symmetric and positive definite with positive eigenvalues. cf. Appendix A class-mean difference Linear Discriminant Feature Extraction for Speech Recognition
Statistical Facts (3) • Total covariance (sample): cf. Appendix B Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Problem (1) • PROBLEM: To separate populations. • Suppose we have C classes from n-dimensional distributions. We want to project these classes onto a p-dimensional subspace (p < n) so that the variation among the classes is as large as possible, relative to the variation within the classes. (Wilks, 1963) X2 ? 2 1 ? 3 X1 Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Problem (2) • Practically speaking, after the projection, we want Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Solution (1) • SOLUTION: Linear Discriminant Analysis (LDA) • Thus, The goal of LDA is to seek a linear transformation that reduces the dimensionality of a given n-dimensional feature vector to p (p < n) by maximizing the discrimination criteria: (Fukunaga, 1990)where This technique was developed by R. A. Fisher (1936) for the two-class case and extended by C. R. Rao (1948)to handle the multiclass case. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Solution (2) • Theoretically, LDA can be interpreted from two aspects: (Johnson et al., 2002), (Gao et al., 2006), (Welch, 1939) • One comes from the Bayesian decision theory with LDA as a straightforward application of C Gaussians with equal covariance. • The other is Fisher’s LDA, which is defined by maximizing the ratio of the between-class and within-class scatter in a linear feature space. • Here only the latter is discussed. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Trace (1) • In , what is the meaning of “trace”? • The Mahalanobis distance can be used as a measurement of class separation. (Fisher, 1936) • Why don’t we first use principal component analysis (PCA) to decorrelate variables, and then use the Euclidean distance and a measure of class separation? Mahalanobis distance is a distance measure based on correlations between variables and is scale-invariant. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Trace (2) • In , what is the meaning of “trace”? (cont.) • Assume all classes share the same covariance , the square of the Mahalanobis distance between and is • The average of for all classes is Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Determinant (1) • In , what is the meaning of “determinant”? • The determinant is the product of the eigenvalues, and hence is the product of the “variances” in the principal directions, thereby measuring the square of the hyperellipsoidal scattering volume. (Duda et al., 2001) • The determinant of a nonsingular covariance matrix can also be viewed as the generalized variance. (Wilks, 1963) • The volume of space occupied by the cloud of data points is proportional to the square root of the generalized variance. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Determinant (2) • The concepts of the generalized variance ( ): Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Determinant (3) • The concepts of the generalized variance ( ): (cont.) Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Determinant (4) • The concepts of the generalized variance ( ): (cont.) Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Determinant (5) • The concepts of the generalized variance ( ): • Assume all classes share the same covariance Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Determinant (6) • The concepts of the generalized variance ( ): (cont.) • Assume all classes share the same covariance Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Optimization (1) • Optimization of : (Fukunaga, 1990) • Differentiating with respect to , and setting the result to zero. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Optimization (2) cf. Appendix C • Optimization of : (cont.) • Simultaneously diagonalize and . • are the eigenvalues of . Note that the eigenvector matrix is not unique. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Optimization (3) Note that all of the eigenvalues are positive. • Optimization of : (cont.) • Note that the original criterion becomes • That is, we can maximize by selecting the largest p eigenvalues of . The corresponding eigenvectors form the transformation matrix. • Note also that , since has a maximum rank of C-1. is the sum of C rank one or less matrices, and because only C-1 of these are independent, Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Optimization (4) • Optimization of : • Differentiating with respect to , and setting the result to zero. Linear Discriminant Feature Extraction for Speech Recognition
Linear Discriminant Analysis - Optimization (5) • Optimization of : (cont.) • We can see that the procedure is similar to that of • That is, we can maximize by selecting the largest p eigenvalues of . The corresponding eigenvectors form the transformation matrix. Note that all of the eigenvalues are positive. Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Trace & Determinant • Comparisons between two measures of class separation: (Pena, 2003) Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Solvability • The solving procedure of LDA is lightweight. • We can see that if a optimization problem can be expressed asthen it can be solved as a generalized eigen-analysis problem with and being the i-th eigenvector and eigenvalue of . Prieto (2003) demonstrated a general solution of the optimization problem of LDA. Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Decorrelation (1) • Unlike principal component analysis (PCA), the linear discriminant transformation from the original variates to the new variates is not orthogonal. • But the linear discriminant transformation makes the transformed variates statistically uncorrelated. (Krzanowski, 1988) • That’s why we sometimes use LDA to replace discrete cosine transform (DCT). (Li, 2004) LDA Speech Filter Bank Feature DCT Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Decorrelation (2) • From Any two particular eigenvalue/eigenvector pairs and , • Pre-multiplying by and , respectively. Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Decorrelation (3) • To overcome arbitrary scaling of , it is usual to adopt the normalization • The optimization problems and can be equivalent to • Many researchers tried to modify the constraint to make their tasks applicable. (Sammon, 1970), (Foley, 1975), (Duchene, 1988) With the constraint, will be unique. Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Decorrelation (4) • Thus, the LDA transformation are uncorrelated both within and between classes, and are scaled to have unit variance within classes. • With the constraint, we can give LDA a geometrical meaning. Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Geometry (1) • The derivation of the LDA transformation matrix can be geometrically viewed as a two-stage procedure: (Campbell, 1981) • In the first stage, is used as an orthogonal and whitening transformation of the original feature vectors (or variables). X2 Y2 The distribution contour for each class turns to be a unit circle - that’s convenient for measuring class separation by the Euclidean distance. 2 2 1 1 3 3 X1 Y1 Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Geometry (2) • The second stage involves a principal component analysis (PCA) on the transformed class means, which seeks new axes that coincide with the directions having the maximum variations of the class means. Z1 PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on Y2 2 1 3 Y1 Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Geometry (3) • The algebraic meaning: (Lee, 2008a) • After the first stage (the whitening stage): • The second stage (the PCA stage): Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Geometry (4) • We can see that after the two stages, both between- and within-class scatter matrices become diagonal. • To get back the transformation matrix for the original variables, has to be pre-multiplied by . • It can be algebraically proven that also maximizes the LDA criterion in the original feature space. Linear Discriminant Feature Extraction for Speech Recognition
Some Remarks - Geometry (5) • Thus, we can give an alternative algorithm for deriving an unique LDA transformation. • According to the geometric analysis of LDA, at least two possible directions are offered to further generalize LDA. • To obtain more effective estimates of the within-class scatter • To modify the between-class scatter for better class discrimination. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Overemphasis (1) • After the whitening stage, we can see: Z1 After the projection, the distance between the class pairs with larger distance in the original space are still larger. Y2 2 1 3 Y1 Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Overemphasis (2) • The larger , the more the direction becomes visible in the eigenvectors corresponding to the larger eigenvalues of . • Thus, there are large distances between class pairs completely dominate the eigenvalue decomposition. • Consequently, there is a overlap among the remaining classes, leading to an overall low and suboptimal classification error rate. Y2 2 1 The principal axe is dominantly determined by the distance between classes 2 and 3, but classes 1 and 2 need to get more separation for better classification accuracy. 3 Z1 overlap Y1 Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Overemphasis (3) • To alleviate the overemphasis of the influence of classes that are already well-separated, some weighting based approaches were proposed. • Modifying the LDA criterion by replacing with the following weighted form: weighting function Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Overemphasis (4) • The weighting based LDA has a good solvability, since is invariant to any linear transformations (whitening). • Thus, similar to LDA, we can give an algorithm for deriving an unique transformation. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Overemphasis (5) • A simple and intuitional weighting function is: (Li, 2000), (Liang, 2007) • The above function in essence is a monotonically decreasing functions of such that those class pairs with large will not be overemphasized. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (1) • As a distance-measure based approach, LDA tries to maximize the Mahalanobis distance between each class-mean pairs. • LDA is not directly associated with the classification error. • But, LDA is optimal for classification in a Bayesian sense on the following conditions: • The two-class problem • The classes are normal-distributed with equal-covariance. Linear Discriminant Feature Extraction for Speech Recognition
Limitations & Improvements - Distance Measure (2) • Under the equal-covariance assumption, for two normal-distributed classes, LDA is an optimal approach on two-class classification. • The Bayes error: Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (3) • The Chernoff bound: • That is, maximizing the LDA criterion is equivalent to minimizing the Bayes error. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (4) • On multi-class classification, LDA is suboptimal. • The multi-class classification error rates have not been fully investigated. • Geometrically speaking, after the whitening stage, even if the equal-covariance assumption is satisfied, LDA can not guarantee the smaller (not necessarily minimum) overlap among overall classes. • Loog (2001) proposed a new criterion, the approximate pairwise accuracy criterion (aPAC), to solve this problem. • aPAC is derived from an attempt to approximate the Bayes error for pairs of classes. • aPAC still retains the equal-covariance assumption of LDA, and simultaneously retains the computational simplicity of LDA. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (5) • The criterion of aPAC is . • means the error function that is twice the integral of the Gaussian distribution with 0 mean and variance of 1/2. • aPAC can well confine the influence of outlier classes - that makes them more robust than LDA. • But, it cannot be guaranteed that aPAC always lead to improved classification rate. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (6) • Loog’s method (aPAC) (2001) can also solve the overemphasis problem. • The above function in essence is a monotonically decreasing functions of such that those class pairs with large will not be overemphasized. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (7) • To deal with the heteroscedastic data, Loog (2004) modified the aPAC, and proposed a new idea of directed distance matrices (DDMs). • DDMs can be considered as generalizations of , which is based on the Chernoff distance. • The Chernoff distance gives an upper bound on the error probability for two normally distributed densities. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (8) • The Chernoff criterion is It can be showed that Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Distance Measure (9) • Although Loog’s method (2004) skillfully transformed the distance-measure based approach into classification-accuracy based one, it suffers some limitations. • It is not distribution-free. • Similar to Loog’s method (2001), it approximates the theoretical C-class Bayes error by sum of the two-class errors, which is an upper bound to the C-class error. • Not all of the classifiers are completely designed as Bayesian contextual ones. Linear Discriminant Feature Extraction for Speech Recognition
Error-Rate Related Limitations - Empirical Error Rate (1) • Lee (2008a) incorporated the empirical classification information from the training data into the derivation of LDA to form a classifier-related objective function. • Advocating the concept of pairwise class confusion information, Lee’s method takes the empirical classification error rates resulted from a given classifier into consideration, rather than operating merely in the Bayesian sense. • The corresponding weighting function is expressed as the number of samples that originally belong to class i but are misallocated to class j by the classifier. Linear Discriminant Feature Extraction for Speech Recognition