
6. Principal Component Analysis

1st Escuela Red ProTIC - Tandil, April 18-28, 2006



  1. Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information.

  2. By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

  3. A principal component is
     • the direction of maximum variance in the input space
     • the principal eigenvector of the covariance matrix
     Goal: relate these two definitions.

  4. Variance. A random variable $x$ fluctuates about its mean value $\langle x \rangle$; the variance is the average of the square of the fluctuations:
     $$\operatorname{var}(x) = \langle (x - \langle x \rangle)^2 \rangle$$

  5. Covariance. A pair of random variables $x$ and $y$, each fluctuating about its mean value; the covariance is the average of the product of the fluctuations:
     $$\operatorname{cov}(x, y) = \langle (x - \langle x \rangle)(y - \langle y \rangle) \rangle$$
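
As a concrete check of slides 4 and 5, here is a minimal NumPy sketch; the sample data x and y are invented for the example. Variance is the average squared fluctuation, covariance the average product of fluctuations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # y correlated with x

# Variance: average of the square of the fluctuations about the mean.
var_x = np.mean((x - x.mean()) ** 2)

# Covariance: average of the product of the two fluctuations.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

assert np.isclose(var_x, np.var(x))
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```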


  7. Covariance matrix. For $N$ random variables $x_1, \ldots, x_N$, the covariance matrix is the $N \times N$ symmetric matrix
     $$C_{ij} = \langle (x_i - \langle x_i \rangle)(x_j - \langle x_j \rangle) \rangle.$$
     Its diagonal elements are the variances: $C_{ii} = \operatorname{var}(x_i)$.
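
The covariance matrix can be built directly from this definition. A short sketch with invented data (N = 3 variables); the asserts check symmetry and that the diagonal holds the variances:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))          # 500 samples of N = 3 variables

F = X - X.mean(axis=0)                 # fluctuations about the mean values
C = F.T @ F / len(F)                   # C_ij = average of F_i * F_j

assert np.allclose(C, C.T)                       # N x N symmetric
assert np.allclose(np.diag(C), F.var(axis=0))    # diagonal = variances
```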

  8. Principal components: the eigenvectors of $C$ with the $k$ largest eigenvalues. Now you can calculate them, but what do they mean?

  9. Covariance to variance. From the covariance matrix, the variance of any projection can be calculated: for a unit vector $w$,
     $$\operatorname{var}(w^\top x) = w^\top C w.$$
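
This identity is easy to verify numerically. A sketch with invented data, projecting onto a random unit vector w:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
F = X - X.mean(axis=0)
C = F.T @ F / len(F)                   # covariance matrix

w = rng.normal(size=3)
w /= np.linalg.norm(w)                 # w: unit vector

# Variance of the data projected onto w equals w^T C w.
assert np.allclose((F @ w).var(), w @ C @ w)
```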

  10. Maximizing the variance along $w$. Maximize $w^\top C w$ subject to $\|w\| = 1$; the maximizer is the principal eigenvector of $C$ (the one with the largest eigenvalue), and the maximum value attained is that eigenvalue $\lambda_1$.
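
A numerical check of this claim, with invented anisotropic data: no random unit vector attains a larger projected variance than the principal eigenvector of C:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic data
F = X - X.mean(axis=0)
C = F.T @ F / len(F)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
lam1, w1 = eigvals[-1], eigvecs[:, -1]   # principal eigenvalue/eigenvector

trials = rng.normal(size=(1000, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
proj_vars = np.einsum('ij,jk,ik->i', trials, C, trials)  # w^T C w per trial
assert proj_vars.max() <= lam1 + 1e-12
assert np.allclose(w1 @ C @ w1, lam1)
```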

  11. Geometric picture of principal components:
      • the 1st PC $z_1$ is a minimum-distance fit to a line in $X$ space
      • the 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC

  12.-26. [Figure and equation slides; no text recoverable from the transcript.]

  27. Algebraic definition of PCs. Given a sample of $n$ observations on a vector of $p$ variables $x = (x_1, x_2, \ldots, x_p)$, define the first principal component of the sample by the linear transformation
      $$z_1 = a_1^\top x = \sum_{i=1}^{p} a_{i1} x_i,$$
      where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})$ is chosen such that $\operatorname{var}[z_1]$ is maximum.

  28. Likewise, define the $k$th PC of the sample by the linear transformation
      $$z_k = a_k^\top x, \qquad k = 1, \ldots, p,$$
      where the vector $a_k = (a_{1k}, a_{2k}, \ldots, a_{pk})$ is chosen such that $\operatorname{var}[z_k]$ is maximum, subject to $\operatorname{cov}[z_k, z_l] = 0$ for $k > l \ge 1$ and $a_k^\top a_k = 1$.

  29. To find $a_1$, first note that
      $$\operatorname{var}[z_1] = \langle z_1^2 \rangle - \langle z_1 \rangle^2 = a_1^\top S a_1,$$
      where $S$ is the covariance matrix for the variables $x$.

  30. To find $a_1$, maximize $a_1^\top S a_1$ subject to $a_1^\top a_1 = 1$. Let $\lambda$ be a Lagrange multiplier; then maximize
      $$a_1^\top S a_1 - \lambda (a_1^\top a_1 - 1).$$
      Differentiating with respect to $a_1$ gives $S a_1 - \lambda a_1 = 0$, therefore $a_1$ is an eigenvector of $S$ corresponding to eigenvalue $\lambda$.

  31. Algebraic derivation of $\lambda_1$. We have maximized
      $$\operatorname{var}[z_1] = a_1^\top S a_1 = \lambda a_1^\top a_1 = \lambda.$$
      So $\lambda$ is the largest eigenvalue of $S$, written $\lambda_1$. The first PC $z_1$ retains the greatest amount of variation in the sample.

  32. To find the vector $a_2$, maximize $a_2^\top S a_2$ subject to $\operatorname{cov}[z_2, z_1] = 0$ and $a_2^\top a_2 = 1$. First note that
      $$\operatorname{cov}[z_2, z_1] = a_1^\top S a_2 = \lambda_1 a_1^\top a_2,$$
      then let $\lambda$ and $\varphi$ be Lagrange multipliers, and maximize
      $$a_2^\top S a_2 - \lambda (a_2^\top a_2 - 1) - \varphi\, a_2^\top a_1.$$

  33. We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda_2$ is the second largest. In general,
      $$\operatorname{var}[z_k] = a_k^\top S a_k = \lambda_k:$$
      the $k$th largest eigenvalue of $S$ is the variance of the $k$th PC, and the $k$th PC $z_k$ retains the $k$th greatest fraction of the variation in the sample.
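
A sketch verifying this on invented sample data: the variances of the PC scores reproduce the eigenvalues of S in descending order:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated variables
X = X - X.mean(axis=0)
S = X.T @ X / (len(X) - 1)             # sample covariance matrix

eigvals, A = np.linalg.eigh(S)
eigvals, A = eigvals[::-1], A[:, ::-1] # sort descending: kth column = a_k

Z = X @ A                              # PC scores z_k = a_k^T x
# var[z_k] equals the kth largest eigenvalue lambda_k.
assert np.allclose(Z.var(axis=0, ddof=1), eigvals)
```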

  34. Algebraic formulation of PCA. Given a sample of $n$ observations on a vector of $p$ variables $x$, define a vector of $p$ PCs
      $$z = A^\top x,$$
      where $A$ is an orthogonal $p \times p$ matrix whose $k$th column $a_k$ is the $k$th eigenvector of $S$.

  35. Then
      $$\Lambda = A^\top S A$$
      is the covariance matrix of the PCs, being diagonal with elements $\lambda_1, \ldots, \lambda_p$.

  36. In general it is useful to define standardized variables by
      $$x_j' = \frac{x_j}{\sqrt{s_{jj}}},$$
      where $s_{jj} = \operatorname{var}[x_j]$. If the $x_j$ are each measured about their sample mean, then the covariance matrix of $x' = (x_1', \ldots, x_p')$ will be equal to the correlation matrix of $x$, and the PCs will be dimensionless.
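
A sketch of the standardization step, with invented scales mimicking mixed units: the covariance matrix of the standardized variables coincides with the correlation matrix of the originals:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])  # mixed scales
X = X - X.mean(axis=0)                  # measured about the sample means

Xs = X / X.std(axis=0, ddof=1)          # standardized variables

S_std = Xs.T @ Xs / (len(Xs) - 1)
assert np.allclose(S_std, np.corrcoef(X, rowvar=False))
```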

  37. Practical computation of PCs. Given a sample of $n$ observations on a vector of $p$ variables $x$ (each measured about its sample mean), compute the covariance matrix
      $$S = \frac{1}{n-1} X^\top X,$$
      where $X$ is the $n \times p$ matrix whose $i$th row is the $i$th observation $x_i^\top$.

  38. Then, compute the $n \times p$ matrix $Z$ whose $i$th row is the PC score for the $i$th observation:
      $$Z = X A.$$
      Write $X = Z A^\top$ to decompose each observation into PCs:
      $$x_i = \sum_{k=1}^{p} z_{ik}\, a_k.$$
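
Slides 37 and 38 in one sketch with invented data: build S from the mean-centred data matrix, take the eigenvectors of S as the columns of A, and check the exact decomposition X = Z Aᵀ:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                 # each variable about its sample mean

S = X.T @ X / (len(X) - 1)             # p x p covariance matrix
eigvals, A = np.linalg.eigh(S)
A = A[:, ::-1]                         # kth column = kth eigenvector of S

Z = X @ A                              # n x p matrix of PC scores
assert np.allclose(X, Z @ A.T)         # A orthogonal: exact decomposition
```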

  39. Usage of PCA: data compression. Because the $k$th PC retains the $k$th greatest fraction of the variation, we can approximate each observation by truncating the sum at the first $m < p$ PCs:
      $$x_i \approx \sum_{k=1}^{m} z_{ik}\, a_k.$$

  40. This reduces the dimensionality of the data from $p$ to $m < p$ by approximating
      $$X \approx Z_m A_m^\top,$$
      where $Z_m$ is the $n \times m$ portion of $Z$ and $A_m$ is the $p \times m$ portion of $A$.
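
A compression sketch; the data are invented with near rank-2 structure so that truncating at m = 2 is almost lossless:

```python
import numpy as np

rng = np.random.default_rng(6)
# Data whose variation lies mostly in 2 of p = 5 directions, plus small noise.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
X += 0.05 * rng.normal(size=X.shape)
X = X - X.mean(axis=0)

S = X.T @ X / (len(X) - 1)
eigvals, A = np.linalg.eigh(S)
A = A[:, ::-1]

m = 2
Zm, Am = (X @ A)[:, :m], A[:, :m]      # n x m scores, p x m eigenvectors
X_hat = Zm @ Am.T                      # rank-m approximation of X

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error with m={m}: {rel_err:.3f}")  # small
```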
