Principal Component Analysis Biosystems Data Analysis UNIVERSITY OF AMSTERDAM
From molecule to networks • Protein network of SRD5A2 • Yeast metabolic network of glycolysis
Disease gene network
Biological data • [Figure: DATA matrix with samples along one axis and genes, proteins or metabolites along the other; the example value 5.31 is shown twice.] • Repeatability • Reproducibility • Biological variability
How to explore such networks • [Figure: the DATA matrix (samples by genes, proteins or metabolites) is turned into a correlation matrix with genes, proteins or metabolites along both axes.] • Results are specific for the selected samples/situation.
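A minimal sketch of the step from data matrix to correlation matrix, assuming samples are stored as rows and variables as columns. The data are randomly generated placeholders, not values from the slides.

```python
import numpy as np

# Hypothetical data: 6 samples (rows) by 4 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
# Make variable 1 follow variable 0 so that one strong correlation exists.
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=6)

# rowvar=False treats each column as a variable, giving a 4 x 4 matrix.
R = np.corrcoef(X, rowvar=False)
print(np.round(R, 2))
```

Because the correlations are computed from the selected rows only, a different set of samples would in general give a different matrix.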
Goals • If you measure multiple variables on an object, it can be important to analyze the measurements simultaneously. • Understand the most important tool in multivariate data analysis: Principal Component Analysis.
Multiple measurements • If there is a mutual relationship between two or more measurements, they are correlated. • There are strong correlations and weak correlations: • Weak: capabilities in sports and month of birth • Strong: mass of an object and the weight of that object on the earth's surface
Correlation • Correlation occurs everywhere! • Example: mean height vs. age of a group of young children • A strong linear relationship between height and age is seen. • For young children, height and age are correlated. Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics (1989).
Correlation in spectroscopy • Example: a pure compound is measured at two wavelengths (230 nm and 265 nm) over a range of concentrations.
Conc. (MMol)    Intensity at 230nm    Intensity at 265nm
 5              0.166                 0.090
10              0.332                 0.181
15              0.498                 0.270
20              0.664                 0.362
25              0.831                 0.453
[Figure: absorbance (units) versus wavelength (nm), 200 to 300 nm, with 230 nm and 265 nm marked.]
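As a quick check on the table, the correlation coefficient between the two intensity columns can be computed directly. This is a small sketch using only the five tabulated rows; the variable names are chosen here for illustration.

```python
import numpy as np

# Values copied from the table on this slide.
conc = np.array([5, 10, 15, 20, 25])
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])

# Pearson correlation between the two wavelengths, and with concentration.
r_wavelengths = np.corrcoef(a230, a265)[0, 1]
r_conc = np.corrcoef(conc, a230)[0, 1]
print(f"r(230, 265)  = {r_wavelengths:.5f}")
print(f"r(conc, 230) = {r_conc:.5f}")
```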
Correlation in spectroscopy • The intensities at 230 and 265 nm are highly correlated. • The data is not two-dimensional, but one-dimensional. • There is only one factor underlying the data: concentration. [Figure: absorbance at 265 nm (units) versus absorbance at 230 nm (units); the points lie along a line in the direction of increasing concentration.]
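One way to see that the two-wavelength data is effectively one-dimensional is to look at the singular values of the mean-centred 5 x 2 matrix. The sketch below reuses the tabulated intensities; mean-centring before the decomposition is an assumption made here, not something stated on the slide.

```python
import numpy as np

# Intensities from the spectroscopy table (5 samples, 2 wavelengths).
X = np.column_stack([
    [0.166, 0.332, 0.498, 0.664, 0.831],  # 230 nm
    [0.090, 0.181, 0.270, 0.362, 0.453],  # 265 nm
])
Xc = X - X.mean(axis=0)  # assumption: centre each column first

# Squared singular values are proportional to the variance along each direction.
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(np.round(100 * explained, 3))  # percent of variation per direction
```

Almost all of the variation falls along the first direction, which is what "one underlying factor" means in practice.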
The data matrix • Information often comes in the form of a matrix: objects (rows) × variables (columns). • For example, • Spectroscopy: sample × wavelength • Proteomics: patient × protein
Large amounts of data • In (bio)chemical analysis, the measured data matrices can be very large. • An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers! • The metabolomes of 100 patients yield a data matrix of size 100 × 1000 = 100,000 numbers. • We need a way of extracting the important information from large data matrices.
Principal Component Analysis • Data reduction • PCA reduces large data matrices into two smaller matrices which can be more easily examined, plotted and interpreted. • Data exploration • PCA extracts the most important factors (principal components) from the data. These factors describe multivariate interactions between the measured variables. • Data understanding • Principal components can be used to classify samples, identify compound spectra, determine biomarkers, etc.
Different views of PCA • Statistically, PCA is a multivariate analysis technique closely related to • eigenvector analysis • singular value decomposition (SVD) • In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: $X = TP^{T} + E$ • Geometrically, PCA is a projection technique in which X is projected onto a subspace of reduced dimensions.
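The link between the eigenvector view and the SVD view can be checked numerically. The following is a sketch on random placeholder data, assuming mean-centred columns and a covariance matrix normalised by I - 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))   # hypothetical data: 20 objects, 5 variables
Xc = X - X.mean(axis=0)

# Eigenvector view: eigen-decomposition of the covariance matrix.
C = np.cov(Xc, rowvar=False)   # divides by I - 1
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]            # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

# SVD view: singular values and right singular vectors of the centred data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_evals = s**2 / (Xc.shape[0] - 1)

same_values = np.allclose(evals, svd_evals)
same_vectors = np.allclose(np.abs(evecs), np.abs(Vt.T))  # equal up to sign
print(same_values, same_vectors)
```

The eigenvectors and the right singular vectors agree only up to sign, which is why absolute values are compared.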
PCA: mathematics • The basic equation for PCA is written as $X = TP^{T} + E$, where X (I × J) is a data matrix, T (I × R) are the scores, P (J × R) are the loadings and E (I × J) are the residuals. R is the number of principal components used to describe X.
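A minimal sketch of this decomposition, computed here through the SVD (one of several possible algorithms; the slides do not say which one was used). The function name `pca` and the random data are illustrative assumptions.

```python
import numpy as np

def pca(X, R):
    """Return scores T (I x R), loadings P (J x R) and residuals E (I x J).

    X is assumed to be preprocessed already (e.g. mean-centred).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :R] * s[:R]   # scores: one column per component
    P = Vt[:R].T           # loadings: one column per component
    E = X - T @ P.T        # whatever the R components do not describe
    return T, P, E

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 6))   # hypothetical data, I = 10, J = 6
X = X - X.mean(axis=0)
T, P, E = pca(X, R=2)
print(T.shape, P.shape, E.shape)
```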
Principal components • Principal components describe maximum variance and are calculated in order of importance, e.g.
Principal comp.    % X explained    Total % X explained
1                  45.6             45.6
2                  23.9             69.5
3                  18.1             87.6
4                   1.3             88.9
and so on... up to 100%
• A principal component is defined by one pair of loadings and scores, $(p_r, t_r)$, sometimes also known as a latent variable.
PCA: matrices • $X = TP^{T} + E = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_R p_R^{T} + E$ • Each term $t_r p_r^{T}$ (one scores vector times one loadings vector) is one principal component.
Scores and loadings • Scores • relationships between objects • orthogonal, $T^{T}T$ = diagonal matrix • Loadings • relationships between variables • orthonormal, $P^{T}P = I$ (identity matrix) • Similarities and differences between objects (or variables) can be seen by plotting scores (or loadings) against each other.
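The two orthogonality properties can be verified numerically. This is a sketch on random placeholder data with scores and loadings obtained from the SVD, which is an assumption about the algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 5))   # hypothetical data
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s     # scores, all 5 components
P = Vt.T      # loadings, all 5 components

TtT = T.T @ T                          # should be diagonal
PtP = P.T @ P                          # should be the identity
off_diag = TtT - np.diag(np.diag(TtT))
print(np.round(TtT, 6))
print(np.round(PtP, 6))
```

The diagonal of `TtT` holds the squared singular values, so it also orders the components by the amount of variation they describe.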
Numbers example
PCA: simple projection • Simplest case: two correlated variables • PC1 describes 99.77% of the total variation in X. • PC2 describes residual variation (0.23%). [Figure: the data with the PC1 and PC2 directions drawn in, and the resulting scores plot after PCA.]
PCA: projections • PCA is a projection technique. • Each row of the data matrix X (I × J) can be considered as a point in J-dimensional space. This data is projected orthogonally onto a subspace of lower dimensionality. • In the previous example, we projected the two-dimensional data onto a one-dimensional space, i.e. onto a line. • Now we will project some J-dimensional data onto a two-dimensional space, i.e. onto a plane.
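The projection onto a plane can be sketched as follows, using random placeholder data with J = 6. The plane is spanned by the first two loading vectors, and the residual of every point is perpendicular to that plane.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 6))   # hypothetical data: 15 points in 6 dimensions
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T        # J x 2: two orthonormal directions spanning the plane

T = Xc @ P          # coordinates of each point within the plane (I x 2)
X_hat = T @ P.T     # the projected points, expressed in the original space
E = Xc - X_hat      # what is lost by the projection

# Orthogonal projection: the residuals have no component inside the plane.
print(np.abs(E @ P).max())
```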
Example: Protein data • Protein consumption across Europe was studied. • 9 variables describe different sources of protein. • 25 objects are the different countries. • Data matrix has dimensions 25 × 9. • Which countries are similar? • Which foods are related to red meat consumption? Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und Marktlehre, Kiel (1973).
PCA on the protein data • The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.
Percent Variance Captured by PCA Model
Principal    Eigenvalue    % Variance    % Variance
Component    of            Captured      Captured
Number       Cov(X)        This PC       Total
---------    ----------    ----------    ----------
1            4.01e+000     44.52          44.52
2            1.63e+000     18.17          62.68
3            1.13e+000     12.53          75.22
4            9.55e-001     10.61          85.82
5            4.64e-001      5.15          90.98
6            3.25e-001      3.61          94.59
7            2.72e-001      3.02          97.61
8            1.16e-001      1.29          98.90
9            9.91e-002      1.10         100.00
How many principal components do you want to keep? 4
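The percentage columns follow from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues. The sketch below recomputes them from the eigenvalues printed in the table. Because those are rounded to three significant figures, the results agree with the table only to within rounding.

```python
import numpy as np

# Eigenvalues of Cov(X) as listed in the table above.
eigenvalues = np.array([4.01, 1.63, 1.13, 0.955, 0.464,
                        0.325, 0.272, 0.116, 0.0991])

percent = 100 * eigenvalues / eigenvalues.sum()   # % variance, this PC
cumulative = np.cumsum(percent)                   # % variance, total

for r, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{r}: {p:6.2f} %   cumulative {c:6.2f} %")
```

The eigenvalues sum to approximately 9, as expected when nine variables are each scaled to unit variance.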
Scores: PC1 vs PC2 [Figure: scores plot of the 25 countries, each point labelled by country; horizontal axis Scores PC 1 (44.52%), vertical axis Scores PC 2 (18.17%).]
Loadings [Figure: PC1 and PC2 loadings for the nine variables: Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg.]
Biplot: PC1 vs PC2 • SE Europeans eat cereal crops. • PC2 primarily says that the Spanish and Portuguese especially like fruit, vegetables, fish, oils. [Figure: biplot showing country scores and food loadings together; horizontal axis PC 1, vertical axis PC 2.]
Biplot: PC1 vs PC3 • The Dutch like ‘patat’ (fries)... ...with mayonnaise!? • Red meat and milk are correlated. • Scandinavians eat fish! [Figure: biplot showing country scores and food loadings together; horizontal axis PC 1, vertical axis PC 3.]
Residuals • It is also important to look at the model residuals, E. • Ideally, the residuals will not contain any structure - just unsystematic variation (noise).
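Residuals • The (squared) model residuals can be summed along the object or variable direction: $\sum_{j=1}^{J} e_{ij}^{2}$ for object $i$ and $\sum_{i=1}^{I} e_{ij}^{2}$ for variable $j$, where $e_{ij}$ are the elements of E. • Country 23 (USSR) fits the model least well.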
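The two sums of squared residuals can be sketched as follows. The data are random placeholders (not the protein data), and the model is a two-component PCA computed via the SVD.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 5))    # hypothetical data: 8 objects, 5 variables
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
R = 2
E = Xc - (U[:, :R] * s[:R]) @ Vt[:R]   # residuals of the R-component model

ss_objects = np.sum(E**2, axis=1)      # one value per object (row)
ss_variables = np.sum(E**2, axis=0)    # one value per variable (column)

worst_object = int(np.argmax(ss_objects))
print("object with the largest residual:", worst_object)
```

The object with the largest sum is the one the model describes least well, which is how a statement such as the one about country 23 would be obtained.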
Centering and scaling • We are often interested in the differences between objects, not in their absolute values. • protein data: differences between countries • If different variables are measured in different units, some scaling is needed to give each variable an equal chance of contributing to the model.
Mean-centering • Subtract the mean from each column of X. [Figure: example matrix with column means 6.525, 36.75 and 10840 before centering, and 0.0, 0.0 and 0.0 after.]
Scaling • Divide each column of X by its standard deviation. [Figure: example matrix with column standard deviations 0.171, 1.139 and 704.8 before scaling, and 1.0, 1.0 and 1.0 after.]
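Mean-centering followed by scaling to unit variance can be sketched in a few lines. The matrix below is a made-up example with three variables on very different scales, and the use of the sample standard deviation (`ddof=1`) is an assumption; the slides do not say which normalisation was used.

```python
import numpy as np

# Hypothetical data: 4 objects, 3 variables measured in different units.
X = np.array([
    [6.3, 35.0, 10100.0],
    [6.5, 37.0, 11500.0],
    [6.7, 36.5, 10300.0],
    [6.6, 38.5, 11460.0],
])

col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)   # assumption: sample standard deviation

X_centred = X - col_mean          # every column now has mean 0
X_auto = X_centred / col_std      # every column now has standard deviation 1

print(np.round(X_auto.mean(axis=0), 12))
print(np.round(X_auto.std(axis=0, ddof=1), 12))
```

Without the scaling step, the third variable would dominate the principal components simply because its numbers are larger.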
How many PCs to use? • $X = TP^{T} + E$, where $TP^{T}$ is the systematic variation and $E$ is the noise. • Too few PCs: • some systematic variation is not described. • model does not fully summarise the data. • Too many PCs: • later PCs describe noise. • model is not robust when applied to new data. • How to select the correct number of PCs?
How many PCs to use? • Eigenvalue plots [Figure: eigenvalue plot annotated "Knee here - select 4 PCs".] • Select components where explained % variance > noise level • Look at PC scores and loadings - do they make sense?! Do residuals have structure? • Cross-validation
Cross-validation • Remove subset of the data - test set. • Build model on remaining data - training set. • Project test set onto model - calculate residuals. • Repeat for next test set. • Repeat for R = 1,2,3... • Calculate PRESS (predicted residual sum of squares).
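The steps above can be sketched as follows. One detail is an assumption not stated on the slide: each test-set value is predicted from the other variables in the same row. If complete test rows are simply projected onto the loadings, the residuals can only shrink as R grows, so PRESS would never show a minimum. The function name `press_curve`, the fold layout and the simulated data are all illustrative.

```python
import numpy as np

def press_curve(X, max_R, n_splits=5):
    """PRESS for R = 1..max_R with row-wise test sets."""
    I, J = X.shape
    folds = np.array_split(np.arange(I), n_splits)
    press = np.zeros(max_R)
    for test_idx in folds:
        # Build the model on the training set only.
        train = np.delete(X, test_idx, axis=0)
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        test = X[test_idx] - mean
        for R in range(1, max_R + 1):
            P = Vt[:R].T                      # J x R training loadings
            for j in range(J):
                keep = np.arange(J) != j
                # Estimate the scores without variable j, then predict j.
                t, *_ = np.linalg.lstsq(P[keep], test[:, keep].T, rcond=None)
                pred_j = P[j] @ t
                press[R - 1] += np.sum((test[:, j] - pred_j) ** 2)
    return press

# Hypothetical data: two underlying factors plus a little noise.
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6))
X += 0.1 * rng.normal(size=X.shape)

press = press_curve(X, max_R=4)
print(np.round(press, 3))
```

The number of components would then be chosen by looking for a minimum in `press`, as in the PRESS plot.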
PRESS plot • First minimum at 2 PCs • Overall minimum at 4 PCs • 8 PCs gives very high CV error [Figure: eigenvalue of Cov(X) and PRESS(r) plotted against the number of latent variables, 1 to 8.]
Outliers • Outliers are objects which are very different from the rest of the data. These can have a large effect on the principal component model and should be removed. [Figure: an outlying point, labelled "bad experiment", marked "Remove outlier".]
Outliers • Outliers can also be found in the model space or in the residuals. [Figure: scores plot, Scores PC 1 (horizontal) versus Scores PC 2 (vertical), both axes from -8 to 8.]
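The slide does not name a statistic for either case. A common choice, used here as an assumption, is Hotelling's T² for distance within the model space and the sum of squared residuals (often called Q) for distance from it. The data are random placeholders with one row deliberately shifted.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 5))   # hypothetical data
X[0] += 8.0                    # hypothetical outlier: object 0 is shifted

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
R = 2
T = U[:, :R] * s[:R]           # scores
P = Vt[:R].T                   # loadings
E = Xc - T @ P.T               # residuals

# Model space: squared scores, each scaled by the variance of its component.
score_var = s[:R] ** 2 / (Xc.shape[0] - 1)
t2 = np.sum(T**2 / score_var, axis=1)

# Residual space: sum of squared residuals per object.
q = np.sum(E**2, axis=1)

print("largest T2 at object", int(np.argmax(t2)))
print("largest Q  at object", int(np.argmax(q)))
```

An object can be extreme in one measure and unremarkable in the other, so both are worth inspecting.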
Model extrapolation can be dangerous! • Linear model was valid for this age range... • ...but is not valid for 30 year olds!
Conclusions • Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices - scores and loadings: $X = TP^{T} + E$ • Principal components • describe the important variation in the data. • are calculated in order of importance. • are orthogonal.
Conclusions • Scores plots and biplots can be useful for exploring and understanding the data. • It is often correct to mean-center and scale the variables prior to analysis. • It is important to include the correct number of PCs in the PCA model. One method for determining this is called cross-validation.