Sparse Principal Component Analysis Zou, Hastie, Tibshirani presented by Banu Dost
Outline • Microarray expression data • PCA on expression data • SPCA • Definition • Motivation • Method • Linear regression • Lasso/elastic net • 2 Examples
DNA Microarrays • Gene → (expression) → Protein • Monitors gene expression levels on a genomic scale (~abundance of each protein in the cell). [Figure from "Genome-Wide Expression in Escherichia coli K-12", Blattner et al., 1999]
DNA Microarrays • One (micro-)array = one snapshot of the cell • Each spot represents a gene • The intensity of the colors indicates the expression level of the corresponding gene in a certain experiment • Noisy: each expression level is only an approximation.
Microarray Expression Data [Figure: n genes × m arrays data matrix] • n >> m: n is on the order of thousands, while m is on the order of hundreds.
PCA on Expression Data [Figure: a linear transformation maps the n genes × m arrays matrix into eigen-genes and eigen-arrays]
PCA on Expression Data • SVD of the expression matrix: $E = U D V^T$ • Columns of U: eigen-arrays (length n); rows of $V^T$: eigen-genes (length m); diagonal entries of D: eigen-expression levels. • Each eigen-gene is expressed only in the corresponding eigen-array, with the corresponding eigen-expression level. (Alter et al., 2000)
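As an illustration, the decomposition above is just a standard SVD. A minimal numpy sketch (the matrix sizes are made up; real data has n in the thousands):

```python
import numpy as np

# Toy expression matrix: n genes (rows) x m arrays (columns).
rng = np.random.default_rng(0)
n_genes, m_arrays = 200, 10
E = rng.normal(size=(n_genes, m_arrays))

# SVD: E = U @ np.diag(d) @ Vt
# Columns of U -> eigen-arrays (length n_genes)
# Rows of Vt   -> eigen-genes  (length m_arrays)
# Entries of d -> eigen-expression levels
U, d, Vt = np.linalg.svd(E, full_matrices=False)

# Fraction of variance captured by each eigen-gene
explained = d**2 / np.sum(d**2)
print(explained[:3])
```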
Drawbacks of PCA • In $E = U D V^T$, the PCs (eigen-genes) are linear combinations of all n variables (genes). • Each PC corresponds to a loading vector (a column of V). • Loadings = the coefficients of the variables in the linear combination. • Difficult to interpret.
SPCA: Motivation • Idea: Reduce the number of explicitly used variables (genes). • Approach: Modify PCA so that PCs have sparse loadings = sparse PCA (SPCA)
SPCA • Writes PCA as a regression-type optimization problem. • Uses the lasso: a variable selection technique that produces sparse models. • Result: a modified PCA whose PCs have sparse loadings.
Linear Regression Problem • Input variables: $X_1, \dots, X_p$ • Response variable: $Y$ • Regression coefficients: $\beta = (\beta_1, \dots, \beta_p)^T$ • Multivariate linear model: $Y = \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon$
Linear Regression Problem • Matrix form: $Y = X\beta + \varepsilon$, with response Y, training data X (N × p), coefficients β, and error ε. • N observations, p predictors. • Goal: estimate the coefficients β.
Least Squares Solution • $\hat{\beta}_{OLS} = \arg\min_{\beta} \|Y - X\beta\|^2 = (X^T X)^{-1} X^T Y$
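A minimal numpy sketch of the closed-form solution above; the sizes, noise level, and true coefficients are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 5
X = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Closed form: beta_hat = (X^T X)^{-1} X^T Y
# (solve is preferred over explicitly inverting X^T X)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.round(beta_ols, 2))
```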
Lasso Solution • $\hat{\beta}_{lasso} = \arg\min_{\beta} \|Y - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$ • Pros: • The lasso continuously shrinks the coefficients toward zero. • Produces a sparse model. • Serves as a variable selection method.
Lasso Solution • Cons: • The number of selected variables is limited by n, the number of observations; e.g., in microarray expression data, n (arrays) << p (genes). • Selects only one of a set of highly correlated variables and does not care which one ends up in the final model.
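The shrinkage/selection behavior can be seen with scikit-learn's Lasso estimator (assuming scikit-learn is available; the alpha values and synthetic data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, p = 50, 10
X = rng.normal(size=(N, p))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=N)

# Larger alpha (the penalty weight) drives more coefficients
# exactly to zero: the lasso selects variables, not just shrinks.
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, Y)
    print(f"alpha={alpha}: {np.sum(model.coef_ != 0)} non-zero coefficients")
```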
Elastic Net Solution • $\hat{\beta}_{enet} = \arg\min_{\beta} \|Y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j|$ • Pros: • The limitation of the lasso is removed by the ridge constraint: the number of selected variables is no longer bounded by n. • Grouping effect: selects a group of highly correlated variables together once one of them is selected (the lasso picks only one, arbitrarily). • Well suited as a gene selection method in microarray data analysis.
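A hedged sketch of the grouping effect, again with scikit-learn; the penalty settings (alpha, l1_ratio) are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(3)
N = 100
z = rng.normal(size=N)
# Three nearly identical (highly correlated) predictors + three noise columns.
X = np.column_stack([z + 0.01 * rng.normal(size=N) for _ in range(3)]
                    + [rng.normal(size=N) for _ in range(3)])
Y = 2.0 * z + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, Y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)

# The lasso tends to concentrate weight on one of the correlated trio;
# the elastic net tends to spread weight across the whole group.
print("lasso:", np.round(lasso.coef_, 2))
print("enet :", np.round(enet.coef_, 2))
```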
SPCA • Goal: • Construct a regression framework in which PCA can be reconstructed exactly. • Use lasso/elastic-net to construct modified PCs with sparse loadings.
Reconstruction of PCA in a Regression Framework • Idea: each PC is a linear combination of the p variables, so its loadings can be recovered by regressing the PC on the p variables. • In $X = U D V^T$: the ith PC is the response, the training data X (p variables) are the predictors, and the regression coefficients recover the ith loading vector.
Theorem 1 • Let $X = U D V^T$ and $Y_i = U_i D_{ii}$ (the ith PC). For any $\lambda > 0$, the ridge estimate $\hat{\beta} = \arg\min_{\beta} \|Y_i - X\beta\|^2 + \lambda \|\beta\|^2$ satisfies $\hat{\beta} / \|\hat{\beta}\| = V_i$.
Reconstruction of PCA in a Regression Framework • By Theorem 1, we can reconstruct the loadings of the PCs exactly by solving a linear regression problem. • This is not an alternative to PCA, since it uses PCA's results. • The ridge penalty is not there to penalize the coefficients; it ensures the reconstruction of the PCs (important when p > n). • Now, add a lasso penalty to the problem to penalize the absolute values of the coefficients.
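Before adding the lasso penalty, a quick numerical check of Theorem 1 (a numpy sketch; the sizes and lambda are arbitrary). The normalized ridge coefficients should equal the first loading vector up to sign:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 100, 6
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)            # center, as PCA assumes

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                     # first loading vector
y1 = X @ v1                    # first principal component

lam = 10.0                     # any lambda > 0 works, per Theorem 1
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y1)
v_hat = beta / np.linalg.norm(beta)

print(np.allclose(np.abs(v_hat), np.abs(v1)))   # True
```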
Summary • Goal: Construct PCs with sparse loadings. • Method: • Step 1: Perform PCA. • Step 2: For each PC $Y_i$, solve $\hat{\beta} = \arg\min_{\beta} \|Y_i - X\beta\|^2 + \lambda \|\beta\|^2 + \lambda_1 \|\beta\|_1$ and take $\hat{V}_i = \hat{\beta}/\|\hat{\beta}\|$ as the sparse approximation of the ith loading vector (sketched below).
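A minimal sketch of this two-step recipe, using scikit-learn's ElasticNet as the ridge + lasso solver. Note that sklearn parameterizes the penalties via alpha and l1_ratio rather than the paper's (λ, λ1), so the values below are illustrative only:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
N, p = 100, 8
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)

# Step 1: PCA via SVD; take the first PC as the response.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
y1 = X @ Vt[0]

# Step 2: ridge + lasso regression of the PC on X; the L1 part
# zeroes out small loadings, giving a sparse approximation of V_1.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, fit_intercept=False).fit(X, y1)
beta = enet.coef_
v_sparse = beta / np.linalg.norm(beta) if np.any(beta) else beta
print(np.round(v_sparse, 3))
```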
Transformation of PCA to a Regression Problem • Question: “Can we derive PCs from a ‘self-contained’ regression type criterion?”
SPCA criterion • Self-contained criterion for the first k sparse PCs: $(\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \|x_i - A B^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, subject to $A^T A = I_k$, where the columns $\beta_j$ of B give the loadings. • The lasso penalty (the $\lambda_{1,j}$ terms) is added for sparseness.
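For experimentation, scikit-learn ships a SparsePCA estimator; it optimizes a related dictionary-learning style criterion rather than exactly the criterion above, so treat it only as an approximate stand-in:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)

# Ordinary PCA loadings are dense; sparse PCA zeroes many of them.
print("PCA zero loadings:  ", np.sum(pca.components_ == 0))
print("SPCA zero loadings: ", np.sum(spca.components_ == 0))
```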
Simulation Example: PCA vs. SPCA • Data points X = (X1, X2, …, X10): 10 variables. • Model to generate the data, with 3 hidden factors V1, V2, V3: • V1 ~ N(0, 290) • V2 ~ N(0, 300) • V3 = -0.3 V1 + 0.925 V2 + e, e ~ N(0, 1) (factor variances: 290, 300, 298) • Xi = V1 + ei1, ei1 ~ N(0, 1), i = 1,2,3,4 (4 variables associated with V1) • Xi = V2 + ei2, ei2 ~ N(0, 1), i = 5,6,7,8 (4 variables associated with V2) • Xi = V3 + ei3, ei3 ~ N(0, 1), i = 9,10 (2 variables associated with V3)
Simulation Example: PCA vs. SPCA • How many observations? PCA and SPCA are performed on the exact covariance matrix, i.e., as if there were infinitely many data points. • We expect to derive 2 PCs with the right sparse loadings: • One from (X5, X6, X7, X8), recovering V2 • One from (X1, X2, X3, X4), recovering V1 • A sampling-based sketch follows below.
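A sampling-based sketch of this setup: most libraries cannot take an exact covariance matrix directly, so a large sample approximates it. The penalty value is illustrative, and (as noted earlier) sklearn's SparsePCA criterion differs from the paper's:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
N = 10_000                     # many samples ~ "exact covariance" regime

V1 = rng.normal(0, np.sqrt(290), N)
V2 = rng.normal(0, np.sqrt(300), N)
V3 = -0.3 * V1 + 0.925 * V2 + rng.normal(0, 1, N)

X = np.column_stack(
    [V1 + rng.normal(0, 1, N) for _ in range(4)]      # X1..X4
    + [V2 + rng.normal(0, 1, N) for _ in range(4)]    # X5..X8
    + [V3 + rng.normal(0, 1, N) for _ in range(2)])   # X9, X10

spca = SparsePCA(n_components=2, alpha=50.0, random_state=0).fit(X)
# With a suitable penalty, PC1 should load only on X5..X8 (recovering V2)
# and PC2 only on X1..X4 (recovering V1).
print(np.round(spca.components_, 2))
```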
Table of Loadings [Table: loadings of the first two PCs, columns labeled V1 and V2]
SPCA on Microarray Data • Ramaswamy dataset: p = 16063 genes, n = 144 samples (p >> n). • SPCA applied to find the leading PC. • λ = ∞; a sequence of λ1 values is used, so the number of non-zero loadings varies.
Sparse leading PC of Microarray Data [Figure: percentage of explained variance vs. number of non-zero loadings]
Conclusion • Two methods were proposed to derive PCs with sparse loadings: • Direct sparse approximations from the PCs. • Formulating PCA as a self-contained regression problem and deriving PCs with ridge and lasso constraints.