Data Analytics Courses

Dimension Reduction using Principal Components Analysis (PCA)

Application of dimension reduction • Computational advantage for other algorithms • Face recognition— image data (pixels) along new axes works better for recognizing faces • Image compression

Data for 25 undergraduate programs at business schools in US universities in 1995. Univ Brown CalTech CMU Columbia Cornell Dartmouth Duke Georgetown Harvard JohnsHopkins MIT Northwestern NotreDame PennState Princeton Purdue Stanford TexasA&M UCBerkeley UChicago UMichigan UPenn UVA UWisconsin Yale SAT 1310 1415 1260 1310 1280 1340 1315 1255 1400 1305 1380 1260 1255 1081 1375 1005 1360 1075 1240 1290 1180 1285 1225 1085 1375 Top10 Accept SFRatio Expenses GradRate 89 22 13 100 25 6 62 59 9 76 24 12 83 33 13 89 23 10 90 30 12 74 24 12 91 14 11 75 44 7 94 30 10 85 39 11 81 42 13 38 54 18 91 14 8 28 90 19 90 20 12 49 67 25 95 40 17 75 50 13 65 68 16 80 36 11 77 44 14 40 69 15 95 19 11 Use PCA to: 1) Reduce # columns 22,704 63,575 25,026 31,510 21,864 32,162 31,585 20,126 39,525 58,691 34,870 28,052 15,122 10,185 30,220 9,066 36,450 8,704 15,140 38,380 15,470 27,553 13,349 11,857 43,514 94 81 72 88 90 95 95 92 97 87 91 89 94 80 95 69 93 67 78 87 85 90 92 71 96 Additional benefits: 2) Identify relation between columns 3) Visualize universities in 2D Source: US News & World Report, Sept 18 1995

PCA Input ? Output PC1 PC2 PC3 PC4 PC5 PC6 Univ SAT Top10 Accept SFRatio Expenses GradRate 1310 89 22 13 1415 100 25 6 1260 62 59 9 1310 76 24 12 1280 83 33 13 1340 89 23 10 1315 90 30 12 1255 74 24 12 1400 91 14 11 1305 75 44 7 Brown CalTech CMU Columbia Cornell Dartmouth Duke Georgetown Harvard JohnsHopkins 22,704 63,575 25,026 31,510 21,864 32,162 31,585 20,126 39,525 58,691 94 81 72 88 90 95 95 92 97 87 Hope is that a fewer columns may capture most of the information from the original dataset

The Primitive Idea – Intuition First How to compress the data loosing the least amount of information?

Input Output PCA • p principal components (= p weighted averages of original measurements) • p measurements/ original columns • Uncorrelated • Ordered by variance • Keep top principal components; drop rest • Correlated

Mechanism PC1 PC2 PC3 PC4 PC5 PC6 Univ SAT Top10 Accept SFRatio Expenses GradRate 1310 89 22 13 1415 100 25 6 1260 62 59 9 1310 76 24 12 1280 83 33 13 1340 89 23 10 1315 90 30 12 1255 74 24 12 1400 91 14 11 1305 75 44 7 Brown CalTech CMU Columbia Cornell Dartmouth Duke Georgetown Harvard JohnsHopkins 22,704 63,575 25,026 31,510 21,864 32,162 31,585 20,126 39,525 58,691 94 81 72 88 90 95 95 92 97 87 The ithprincipal component is a weighted average of original measurements/columns: PC a = X a + X a + X + … i i1 1 i2 2 ip p Weights (aij) are chosen such that: 1. PCs are ordered by their variance (PC1has largest variance, followed by PC2, PC3, and so on) 2. Pairs of PCs have correlation = 0 3. For each PC, sum of squared weights =1

PC a = X a + X a + X + … i i1 1 i2 2 ip p Demystifying weight computation • Main idea: high variance = lots of information Var(X ) Var(X ) Var(PC i i i a a + … + + 2 2 2 1 2 1 2 2 2 a = + + … + + ) Var(X ) 1 2 p 1 2 ip a a Cov(X ,X ) a a Cov(X ,X ) 1 1 i i i p- ip p- p Also want, Covar ( , ) , 0 when = ≠ PC PC i j i j • Goal: Find weights aijthat maximize variance of PCi, while keeping PCiuncorrelated to other PCs. • The covariance matrix of the X’s is needed.

Standardize the inputs Univ Brown CalTech CMU Columbia Cornell Dartmouth Duke Georgetown Harvard JohnsHopkins MIT Northwestern NotreDame PennState Princeton Purdue Stanford TexasA&M UCBerkeley UChicago UMichigan UPenn UVA UWisconsin Yale Z_SAT Z_Top10 Z_Accept Z_SFRatio Z_Expenses Z_GradRate 0.4020 0.6442 -0.8719 0.0688 1.3710 1.2103 -0.7198 -1.6522 -0.0594 -0.7451 1.0037 -0.9146 0.4020 -0.0247 -0.7705 -0.1770 0.1251 0.3355 -0.3143 0.0688 0.6788 0.6442 -0.8212 -0.6687 0.4481 0.6957 -0.4664 -0.1770 -0.1056 -0.1276 -0.7705 -0.1770 1.2326 0.7471 -1.2774 -0.4229 0.3559 -0.0762 0.2433 -1.4063 1.0480 0.9015 -0.4664 -0.6687 -0.0594 0.4384 -0.0101 -0.4229 -0.1056 0.2326 0.1419 0.0688 -1.7113 -1.9800 0.7502 1.2981 1.0018 0.7471 -1.2774 -1.1605 -2.4127 -2.4946 2.5751 1.5440 0.8634 0.6957 -0.9733 -0.1770 -1.7667 -1.4140 1.4092 3.0192 -0.2440 0.9530 0.0406 1.0523 0.2174 -0.0762 0.5475 0.0688 -0.7977 -0.5907 1.4599 0.8064 0.1713 0.1811 -0.1622 -0.4229 -0.3824 0.0268 0.2433 0.3147 -1.6744 -1.8771 1.5106 0.5606 1.0018 0.9530 -1.0240 -0.4229 Why? • variables with large variances will have bigger influence on result -0.3247 2.5087 -0.1637 0.2858 -0.3829 0.3310 0.2910 -0.5034 0.8414 2.1701 0.5187 0.0460 -0.8503 -1.1926 0.1963 -1.2702 0.6282 -1.2953 -0.8491 0.7620 -0.8262 0.0114 -0.9732 -1.0767 1.1179 0.8037 -0.6315 -1.6251 0.1413 0.3621 0.9141 0.9141 0.5829 1.1349 0.0309 0.4725 0.2517 0.8037 -0.7419 0.9141 -1.9563 0.6933 -2.1771 -0.9627 0.0309 -0.1899 0.3621 0.5829 -1.7355 1.0245 Solution • Standardize before applying PCA Excel: =standardize(cell, average(column), stdev(column))

Standardization shortcut for PCA • Rather than standardize the data manually, you can use correlation matrix instead of covariance matrix as input • PCA with and without standardization gives different results!

PCA Transform > Principal Components (correlation matrix has been used here) • PCs are uncorrelated • Var(PC1) > Var (PC2) > ... Principal Components PC a = X a + X a + X + … i i1 1 i2 2 ip p Scaled Data PC Scores

Computing principal scores • For each record, we can compute their score on each PC. • Multiply each weight (aij) by the appropriate Xij • Example for Brown University (using standardized numbers): • PC Score1 for Brown University = (– 0.458)(0.40) +(–0.427)(.64) +(0.424)(–0.87) +(0.391)(.07) + (–0.363)(–0.32) + (–0.379)(.80) = –0.989

R Code for PCA (Assignment) OPTIONAL R Code install.packages("gdata") ## for reading xlsfiles install.packages("xlsx") ## ” for reading xlsxfiles mydata<-read.xlsx("University Ranking.xlsx",1) ## use read.csv for csvfiles mydata## make sure the data is loaded correctly help(princomp) ## to understand the apifor princomp pcaObj<-princomp(mydata[1:25,2:7], cor= TRUE, scores = TRUE, covmat= NULL) ## the first column in mydatahas university names ## princomp(mydata, cor= TRUE) not_same_asprcomp(mydata, scale=TRUE); similar, but different summary(pcaObj) loadings(pcaObj) plot(pcaObj) biplot(pcaObj) pcaObj$loadings pcaObj$scores

Goal #1: Reduce data dimension • PCs are ordered by their variance (=information) • Choose top few PCs and drop the rest! Example: • PC1captures most ??% of the information. • The first 2 PCs capture ??% • Data reduction: use only two variables instead of 6.

Matrix Transpose OPTIONAL: R code help(matrix) A<-matrix(c(1,2),nrow=1,ncol=2,byrow=TRUE) A t(A) B<-matrix(c(1,2,3,4),nrow=2,ncol=2,byrow=TRUE) B t(B) C<matrix(c(1,2,3,4,5,6),nrow=3,ncol=2,byrow=TRUE) C t(C)

Matrix Multiplication OPTIONAL R Code A<- matrix(c(1,2,3,4,5,6),nrow=3,ncol=2,byrow= TRUE) A B<- matrix(c(1,2,3,4,5,6,7,8),nrow=2,ncol=4,byro w=TRUE) B C<-A%*%B D<-t(B)%*%t(A) ## note, B%*%A is not possible; how does D look like?

Matrix Inverse If, , identity matrix × = A B I A1 - Then, B = Identity matrix : OPTIONAL R Code   0 1 0 0 ... ## How to create nˣn Identity matrix?   1 0 0 0 ...   help(diag) A<-diag(5)   . . . . . . ... .   0 0 ...1 0 ## find inverse of a matrix solve(A)       0 0 1 0 ...

Data Compression     [ ] PCScores PrincipalC omponents = × ScaledData     × N p × × N p p p [ ] ScaledData × N p [ ] 1 − [ ] PCScores = × PrincipalC omponents × N p × p p c = Number of components kept; c ≤ p   T [ ] PCScores = × PrincipalC omponents   × N p × p p Approximat ion : [ ] T [ ] [ ] PCScores = × Approximat edScaledDa ta PrincipalC omponents × N c × N p × c p

Goal #2: Learn relationships with PCA by interpreting the weights • ai1,…, aipare the coefficients for PCi. • They describe the role of original X variables in computing PCi. • Useful in providing context-specific interpretation of each PC.

PC1Scores (choose one or more) 1. are approximately a simple average of the 6 variables 2. measure the degree of high Accept & SFRatio, but low Expenses, GradRate, SAT, and Top10

Goal #3: Use PCA for visualization • The first 2 (or 3) PCs provide a way to project the data from a p-dimensional space onto a 2D (or 3D) space

Scatter Plot: PC2vs. PC1 scores

Monitoring batch processes using PCA • Multivariate data at different time points • Historical database of successful batches are used • Multivariate trajectory data is projected to low-dimensional space >>> Simple monitoring charts to spot outlier

Your Turn! 1. If we use a subset of the principal components, is this useful for prediction? for explanation? 2. What are advantages and weaknesses of PCA compared to choosing a subset of the variables? 3. PCA vs. Clustering

Data Analytics Courses