420 likes | 638 Views
6.Principal Component Analysis. Principal component analysis (PCA) is a technique that is useful for the compression and classification of data.
E N D
6.Principal Component Analysis Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information. 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains. 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis Principal component • direction of maximum variance in the input space • principal eigenvector of the covariance matrix Goal: Relate these two definitions 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis • Variance A random variablefluctuating about its mean value Average of the square of the fluctuations 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis • Covariance Pair of random variables, each fluctuating about its mean value. Average of product of fluctuations 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis • Covariance matrix N random variables NxN symmetric matrix Diagonal elements are variances 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis Principal components • eigenvectors with k largest eigenvaluesNow you can calculate them, but what do they mean? 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis • Covariance to variance From the covariance, the variance of any projection can be calculated. w:unit vector 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis • Maximizing parallel variance Principal eigenvector of C (the one with the largest eigenvalue) 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis Geometric picture of principal components • the 1st PC z1 is a minimum distance fit to a line in X space • the 2nd PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis Then… 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis Algebraic definition of PCs Given a sample of n observations on a vector of p variables x = (x1,x2,….,xp) define the first principal component of the sample by the linear transformation λ where the vector is chosen such that is maximum 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Likewise, define the kth PC of the sample by the linear transformation where the vector is chosen such that is maximum subject to 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal Component Analysis To find first note that is the covariance matrix for the variables 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis To find maximize subject to Let λ be a Lagrange multiplier then maximize by differentiating… therefore is an eigenvector of corresponding to eigenvalue 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Algebraic derivation of We have maximized So is the largest eigenvalue of The first PC retains the greatest amount of variation in the sample. 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis To find vector maximize subject to First note that then let λ and φ be Lagrange multipliers, and maximize 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis We find that is also an eigenvector of whose eigenvalue is the second largest. In general The kth largest eigenvalue of is the variance of the kth PC. The kth PC retains the kth greatest fraction of the variation in the sample. 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Algebraic formulation of PCA Given a sample of n observations on a vector of p variables define a vector of p PCs according to where is an orthogonal p x p matrix whose kth column is the kth eigenvector of 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Then is the covariance matrix of the PCs, being diagonal with elements 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis In general it is useful to define standardized variables by If the are each measured about their sample mean then the covariance matrix of will be equal to the correlation matrix of and the PCs will be dimensionless 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Practical computation of PCs Given a sample of n observations on a vector of p variables (each measured about its sample mean) compute the covariance matrix where is the n x p matrix whose ith row is the ith observation 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Then, compute the n x p matrix whose ith row is the PC score for the ith observation: Write to decompose each observation into PCs: 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis Usage of PCA: Data compression Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs: 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6.Principal ComponentAnalysis This reduces the dimensionality of the data from p to m < p by approximating where is the n x m portion of and is the p x m portion of 1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006