Sparse Principal Component Analysis Zou, Hastie, Tibshirani presented by Banu Dost
Outline • Microarray expression data • PCA on expression data • SPCA • Definition • Motivation • Method • Linear regression • Lasso/elastic net • 2 Examples
DNA Microarrays • Gene → (expression) → Protein • Monitors gene expression levels on a genomic scale (~abundance of each protein in the cell). [Figure from "Genome-Wide Expression in Escherichia coli K-12", Blattner et al., 1999]
DNA Microarrays • One (micro-)array = one snapshot of the cell • Each spot represents a gene • The intensity of the colors indicates the expression level of the corresponding gene in a certain experiment • Noisy: each expression level is only an approximation.
Microarray Expression Data [Figure: n genes × m arrays data matrix] • n >> m: n is on the order of thousands, while m is on the order of hundreds.
PCA on Expression Data [Figure: a linear transformation maps the n genes × m arrays matrix into eigen-genes and eigen-arrays]
PCA on Expression Data • SVD of the expression matrix: $E = U D V^T$ • Columns of U: eigen-arrays (length n); rows of $V^T$: eigen-genes (length m); diagonal entries of D: eigen-expression levels. • Each eigen-gene is expressed only in the corresponding eigen-array, with the corresponding eigen-expression level. (Alter et al., 2000)
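As an illustration, the decomposition above is just a standard SVD. A minimal numpy sketch (the matrix sizes are made up; real data has n in the thousands):

```python
import numpy as np

# Toy expression matrix: n genes (rows) x m arrays (columns).
rng = np.random.default_rng(0)
n_genes, m_arrays = 200, 10
E = rng.normal(size=(n_genes, m_arrays))

# SVD: E = U @ np.diag(d) @ Vt
# Columns of U -> eigen-arrays (length n_genes)
# Rows of Vt   -> eigen-genes  (length m_arrays)
# Entries of d -> eigen-expression levels
U, d, Vt = np.linalg.svd(E, full_matrices=False)

# Fraction of variance captured by each eigen-gene
explained = d**2 / np.sum(d**2)
print(explained[:3])
```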
Drawbacks of PCA • In $E = U D V^T$, the PCs (eigen-genes) are linear combinations of all n variables (genes). • Each PC corresponds to a loading vector (a column of V). • Loadings = the coefficients of the variables in the linear combination. • Difficult to interpret.
SPCA: Motivation • Idea: Reduce the number of explicitly used variables (genes). • Approach: Modify PCA so that PCs have sparse loadings = sparse PCA (SPCA)
SPCA • Writes PCA as a regression-type optimization problem. • Uses the lasso: a variable selection technique that produces sparse models. • Result: a modified PCA whose PCs have sparse loadings.
Linear Regression Problem • Input variables: $X_1, \dots, X_p$ • Response variable: $Y$ • Regression coefficients: $\beta = (\beta_1, \dots, \beta_p)^T$ • Multivariate linear model: $Y = \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
Linear Regression Problem • Matrix form: $Y = X\beta + \varepsilon$, with response Y, training data X (N × p), coefficients β, and error ε. • N observations, p predictors. • Goal: estimate the coefficients β.
Least Squares Solution • $\hat{\beta}_{OLS} = \arg\min_{\beta} \|Y - X\beta\|^2 = (X^T X)^{-1} X^T Y$
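A minimal numpy sketch of the closed-form solution above; the sizes, noise level, and true coefficients are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 5
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed form: beta_hat = (X^T X)^{-1} X^T Y
# (solve is preferred over explicitly inverting X^T X)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.round(beta_ols, 2))
```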
Lasso Solution • $\hat{\beta}_{lasso} = \arg\min_{\beta} \|Y - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$ • Pros: • The lasso continuously shrinks the coefficients toward zero. • Produces a sparse model. • Serves as a variable selection method.
Lasso Solution • Cons: • The number of selected variables is limited by n, the number of observations; e.g., in microarray expression data, n (arrays) << p (genes). • Selects only one of a set of highly correlated variables and does not care which one ends up in the final model.
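The shrinkage/selection behavior can be seen with scikit-learn's Lasso estimator (assuming scikit-learn is available; the alpha values and synthetic data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, p = 50, 10
X = rng.normal(size=(N, p))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=N)

# Larger alpha (the penalty weight) drives more coefficients
# exactly to zero: the lasso selects variables, not just shrinks.
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, Y)
    print(f"alpha={alpha}: {np.sum(model.coef_ != 0)} non-zero coefficients")
```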
Elastic Net Solution • $\hat{\beta}_{enet} = \arg\min_{\beta} \|Y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j|$ • Pros: • The limitation of the lasso is removed by the ridge constraint: the number of selected variables is no longer bounded by n. • Grouping effect: selects a group of highly correlated variables together once one of them is selected (the lasso picks only one, arbitrarily). • Well suited as a gene selection method in microarray data analysis.
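A hedged sketch of the grouping effect, again with scikit-learn; the penalty settings (alpha, l1_ratio) are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
N = 100
z = rng.normal(size=N)
# Three nearly identical (highly correlated) predictors + three noise columns.
X = np.column_stack([z + 0.01 * rng.normal(size=N) for _ in range(3)]
                    + [rng.normal(size=N) for _ in range(3)])
Y = 2.0 * z + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, Y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)

# The lasso tends to concentrate weight on one of the correlated trio;
# the elastic net tends to spread weight across the whole group.
print("lasso:", np.round(lasso.coef_, 2))
print("enet :", np.round(enet.coef_, 2))
```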
SPCA • Goal: • Construct a regression framework in which PCA can be reconstructed exactly. • Use lasso/elastic-net to construct modified PCs with sparse loadings.
Reconstruction of PCA in a Regression Framework • Idea: each PC is a linear combination of the p variables, so its loadings can be recovered by regressing the PC on the p variables. • In $X = U D V^T$: the ith PC is the response, the training data X (p variables) are the predictors, and the regression coefficients recover the ith loading vector.
Theorem 1 • Let $X = U D V^T$ and $Y_i = U_i D_{ii}$ (the ith PC). For any $\lambda > 0$, the ridge estimate $\hat{\beta} = \arg\min_{\beta} \|Y_i - X\beta\|^2 + \lambda \|\beta\|^2$ satisfies $\hat{\beta} / \|\hat{\beta}\| = V_i$.
Reconstruction of PCA in a Regression Framework • By Theorem 1, we can reconstruct the loadings of the PCs exactly by solving a linear regression problem. • This is not an alternative to PCA, since it uses PCA's results. • The ridge penalty is not there to penalize the coefficients; it ensures the reconstruction of the PCs (important when p > n). • Now, add a lasso penalty to the problem to penalize the absolute values of the coefficients.
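Before adding the lasso penalty, a quick numerical check of Theorem 1 (a numpy sketch; the sizes and lambda are arbitrary). The normalized ridge coefficients should equal the first loading vector up to sign:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 6
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)            # center, as PCA assumes

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                     # first loading vector
y1 = X @ v1                    # first principal component

lam = 10.0                     # any lambda > 0 works, per Theorem 1
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y1)
v_hat = beta / np.linalg.norm(beta)

print(np.allclose(np.abs(v_hat), np.abs(v1)))   # True
```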
Summary • Goal: Construct PCs with sparse loadings. • Method: • Step 1: Perform PCA. • Step 2: For each PC $Y_i$, solve $\hat{\beta} = \arg\min_{\beta} \|Y_i - X\beta\|^2 + \lambda \|\beta\|^2 + \lambda_1 \|\beta\|_1$ and take $\hat{V}_i = \hat{\beta}/\|\hat{\beta}\|$ as the sparse approximation of the ith loading vector (sketched below).
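A minimal sketch of this two-step recipe, using scikit-learn's ElasticNet as the ridge + lasso solver. Note that sklearn parameterizes the penalties via alpha and l1_ratio rather than the paper's (λ, λ1), so the values below are illustrative only:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
N, p = 100, 8
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)

# Step 1: PCA via SVD; take the first PC as the response.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
y1 = X @ Vt[0]

# Step 2: ridge + lasso regression of the PC on X; the L1 part
# zeroes out small loadings, giving a sparse approximation of V_1.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, fit_intercept=False).fit(X, y1)
beta = enet.coef_
v_sparse = beta / np.linalg.norm(beta) if np.any(beta) else beta
print(np.round(v_sparse, 3))
```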
Transformation of PCA to a Regression Problem • Question: “Can we derive PCs from a ‘self-contained’ regression type criterion?”
SPCA criterion • Self-contained criterion for the first k sparse PCs: $(\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \|x_i - A B^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, subject to $A^T A = I_k$, where the columns $\beta_j$ of B give the loadings. • The lasso penalty (the $\lambda_{1,j}$ terms) is added for sparseness.
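For experimentation, scikit-learn ships a SparsePCA estimator; it optimizes a related dictionary-learning style criterion rather than exactly the criterion above, so treat it only as an approximate stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Ordinary PCA loadings are dense; sparse PCA zeroes many of them.
print("PCA zero loadings:  ", np.sum(pca.components_ == 0))
print("SPCA zero loadings: ", np.sum(spca.components_ == 0))
```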
Simulation Example: PCA vs. SPCA • Data points X = (X1, X2, …, X10): 10 variables. • Model to generate the data, with 3 hidden factors V1, V2, V3: • V1 ~ N(0, 290) • V2 ~ N(0, 300) • V3 = -0.3 V1 + 0.925 V2 + e, e ~ N(0, 1) (factor variances: 290, 300, 298) • Xi = V1 + ei1, ei1 ~ N(0, 1), i = 1,2,3,4 (4 variables associated with V1) • Xi = V2 + ei2, ei2 ~ N(0, 1), i = 5,6,7,8 (4 variables associated with V2) • Xi = V3 + ei3, ei3 ~ N(0, 1), i = 9,10 (2 variables associated with V3)
Simulation Example: PCA vs. SPCA • How many observations? PCA and SPCA are performed on the exact covariance matrix, i.e., as if there were infinitely many data points. • We expect to derive 2 PCs with the right sparse loadings: • One from (X5, X6, X7, X8), recovering V2 • One from (X1, X2, X3, X4), recovering V1 • A sampling-based sketch follows below.
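A sampling-based sketch of this setup: most libraries cannot take an exact covariance matrix directly, so a large sample approximates it. The penalty value is illustrative, and (as noted earlier) sklearn's SparsePCA criterion differs from the paper's:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
N = 10_000                     # many samples ~ "exact covariance" regime

V1 = rng.normal(0, np.sqrt(290), N)
V2 = rng.normal(0, np.sqrt(300), N)
V3 = -0.3 * V1 + 0.925 * V2 + rng.normal(0, 1, N)

X = np.column_stack(
    [V1 + rng.normal(0, 1, N) for _ in range(4)]      # X1..X4
    + [V2 + rng.normal(0, 1, N) for _ in range(4)]    # X5..X8
    + [V3 + rng.normal(0, 1, N) for _ in range(2)])   # X9, X10

spca = SparsePCA(n_components=2, alpha=50.0, random_state=0).fit(X)
# With a suitable penalty, PC1 should load only on X5..X8 (recovering V2)
# and PC2 only on X1..X4 (recovering V1).
print(np.round(spca.components_, 2))
```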
Table of Loadings [Table: loadings of the first two PCs, columns labeled V1 and V2]
SPCA on Microarray Data • Ramaswamy dataset: p = 16063 genes, n = 144 samples (p >> n). • SPCA applied to find the leading PC. • λ = ∞; a sequence of λ1 values is used, so the number of non-zero loadings varies.
Sparse leading PC of Microarray Data [Figure: percentage of explained variance vs. number of non-zero loadings]
Conclusion • Two methods were proposed to derive PCs with sparse loadings: • Direct sparse approximations from the PCs. • Formulating PCA as a self-contained regression problem and deriving PCs with ridge and lasso constraints.