EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar: Mining Biological Data • Instructor: Luke Huan • Fall, 2006

Administrative • Project design is due Oct 30th • ~3 weeks from now • Include the following items in the document • The goal of the project • A brief introduction of the overall project • A list of background materials that will be covered in the final report • A high level design of your project • A testing plan

Overview • Gain insight into high-dimensional space by projection pursuit (feature reduction) • PCA: principal component analysis • A data analysis tool • Mathematical background • PCA and gene expression profile analysis • Briefly

A Group of Related Techniques • Unsupervised • Principal Component Analysis (PCA) • Latent Semantic Indexing (LSI): truncated SVD • Independent Component Analysis (ICA) • Canonical Correlation Analysis (CCA) • Supervised • Linear Discriminant Analysis (LDA) • Semi-supervised • Research topic

Rediscovery – Renaming of PCA • Statistics: • Principal Component Analysis (PCA) • Social Sciences: • Factor Analysis (PCA is a subset) • Probability / Electrical Eng: • Karhunen–Loève expansion • Applied Mathematics: • Proper Orthogonal Decomposition (POD) • Geo-Sciences: • Empirical Orthogonal Functions (EOF)

An Interesting Historical Note • The 1st (?) application of PCA to Functional Data Analysis: • Rao, C. R. (1958) Some statistical methods for comparison of growth curves, Biometrics, 14, 1-17. • 1st Paper with “Curves as Data” viewpoint

What is Principal Component Analysis? • Principal component analysis (PCA) • Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables • Retains most of the sample's information. • Useful for the compression and classification of data. • By information we mean the variation present in the sample, given by the correlations between the original variables. • The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
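
The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's implementation: the function name `pca`, the toy data, and the choice to eigen-decompose the sample covariance are all assumptions made for the example.

```python
import numpy as np

def pca(X):
    """PCA of an (n_samples, n_features) array via the sample covariance.

    Returns (components, variances, scores): unit-length directions as
    columns, the variance along each, and the data in the new coordinates.
    """
    Xc = X - X.mean(axis=0)                        # center each variable
    cov = np.cov(Xc, rowvar=False)                 # sample covariance (divides by n - 1)
    variances, components = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(variances)[::-1]            # reorder: largest variance first
    variances, components = variances[order], components[:, order]
    scores = Xc @ components                       # the new variables (PCs)
    return components, variances, scores

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples of 3 strongly correlated variables.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

components, variances, scores = pca(X)
fraction = variances / variances.sum()             # share of total variance per PC
print(fraction)                                    # first entry dominates
print(np.round(np.cov(scores, rowvar=False), 6))   # off-diagonal entries are ~0
```

Keeping only the first column of `scores` reduces three variables to one while retaining most of the variation, which is the sense of "information" used on the slide.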

A Geometric Picture • The 1st PC is the line in the space such that the “projected” data set has the largest total variance • The 2nd PC is the line, orthogonal to the 1st PC, that captures the most of the remaining variance • PCs are a series of linear fits to a sample, each orthogonal to all the previous ones

Connect Math to Graphics • 2-d Toy Example • Panels: Feature Space, Object Space • Data Points (Curves) are columns of data matrix, $X$

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • Sample Mean, $\bar{X}$

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • Residuals from Mean = Data - Mean

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • Recentered Data = Mean Residuals, shifted to 0

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • PC1 Direction = η = Eigenvector (w/ biggest λ)

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • Centered Data, PC1 Projection, Residual

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • PC2 Direction = η = Eigenvector (w/ 2nd biggest λ)

Connect Math to Graphics (Cont.) • 2-d Toy Example • Panels: Feature Space, Object Space • Centered Data, PC2 Projection, Residual

Connect Math to Graphics (Cont.) • Note for this 2-d Example: • PC1 Residuals = PC2 Projections • PC2 Residuals = PC1 Projections • (i.e. colors common across these pics)
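
The note about residuals and projections in the 2-d example can be checked numerically. The sketch below uses made-up data (the slide's actual toy data set is not given) and a hypothetical helper `project`; it follows the slides' convention that data points are columns of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-d toy data, one data point per column.
mixing = np.array([[2.0, 0.5], [0.5, 1.0]])
X = mixing @ rng.normal(size=(2, 50)) + np.array([[3.0], [1.0]])

centered = X - X.mean(axis=1, keepdims=True)       # residuals from the mean
eigvals, eigvecs = np.linalg.eigh(centered @ centered.T)
eta1, eta2 = eigvecs[:, 1], eigvecs[:, 0]          # eigh sorts eigenvalues ascending

def project(direction, data):
    """Project each column of data onto the line spanned by a unit vector."""
    return np.outer(direction, direction @ data)

pc1_proj = project(eta1, centered)
pc2_proj = project(eta2, centered)
pc1_resid = centered - pc1_proj
pc2_resid = centered - pc2_proj

print(np.allclose(pc1_resid, pc2_proj))   # True
print(np.allclose(pc2_resid, pc1_proj))   # True
```

The identity holds because in two dimensions the two eigenvectors form a complete orthonormal basis, so whatever PC1 does not explain lies entirely along PC2.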

PCA and Complex Data Analysis • Data set is a set of curves • How to find clusters? • Treat curves as points in a high-dimensional space • Applications in gene expression profile analysis • Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808

N-D Toy Example • Upper left: the mean • Upper right: residuals from the mean • Lower left: projections of the mean residuals onto the PC1 direction • Lower right: further residuals from the PC1 projections

Yeast Cell Cycle Data Central question: Which genes are “periodic” over 2 cell cycles?

Yeast Cell Cycle Data, PCA analysis Periodic genes? Naïve approach: Simple PCA

Yeast Cell Cycle Data, FDA View • Central question: which genes are “periodic” over 2 cell cycles? • Naïve approach: Simple PCA • Doesn’t work • No apparent (2 cycle) periodic structure? • Eigenvalues suggest large amount of “variation” • PCA finds “directions of maximal variation” • Often, but not always, same as “interesting directions” • Here need better approach to study periodicities

Yeast Cell Cycles, Freq. 2 Proj. PCA on Freq. 2 Periodic Component Of Data

PCA for 2D Surfaces • 2-d M-Rep Example: Corpus Callosum • Figure labels: Atoms, Spokes, Implied Boundary

Pros and Cons • PCA works for • Multi-dimensional Gaussian distributions • It doesn’t work for • Gaussian mixtures • Data in non-Euclidean spaces

Detailed Look at PCA • Three important (and interesting) viewpoints: • Mathematics • Numerics • Statistics • 1st: Review linear alg. and multivar. prob.

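Review of Linear Algebra • Vector Space: • set of “vectors”, $x, y, \ldots$, • and “scalars” (coefficients, i.e. elements of a field), $a, b, \ldots$, • “closed” under “linear combination” ($\sum_i a_i x_i$ in space) • For example: $\mathbb{R}^d = \{(x_1, \ldots, x_d)^T : x_1, \ldots, x_d \in \mathbb{R}\}$, “$d$ dim Euclid’n space”

Subspace • Subspace: • subset that is again a vector space, i.e. closed under linear combination • Examples: • lines through the origin • planes through the origin • all linear combos of a subset of vectors (= a hyperplane through origin)

Basis • Basis of subspace: set of vectors that • span, i.e. everything is a lin. comb. of them • are linearly indep’t, i.e. lin. comb. is unique • Example: • “unit vector basis” in $\mathbb{R}^d$ • e.g. $(x_1, \ldots, x_d)^T = x_1 e_1 + \cdots + x_d e_d$, where $e_j$ has a 1 in position $j$ and 0 elsewhere

Basis Matrix • Basis matrix of a subspace of $\mathbb{R}^d$ • Given a basis $v_1, \ldots, v_n$: • create the matrix of columns: $B = (v_1 \; \cdots \; v_n)$, of size $d \times n$ • Then a “linear combo” is a matrix multiplicat’n: $\sum_{i=1}^n a_i v_i = B a$, where $a = (a_1, \ldots, a_n)^T$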
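
As a small illustration of the basis matrix idea, the sketch below builds a basis matrix from two hypothetical basis vectors and confirms that a linear combination equals a matrix-vector product. The names `v1`, `v2`, `B`, and `a` are chosen for the example only.

```python
import numpy as np

# Hypothetical basis of a 2-dimensional subspace of R^3.
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])
B = np.column_stack([v1, v2])       # basis matrix: basis vectors as columns, shape (3, 2)
a = np.array([2.0, -1.0])           # coefficients of the linear combination

combo_sum = a[0] * v1 + a[1] * v2   # linear combination written out term by term
combo_mat = B @ a                   # the same combination as a matrix product

print(combo_sum)
print(combo_mat)
```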

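Linear Transformation • Aside on matrix multiplication (linear transformation): for matrices $A = (a_{ij})$ of size $m \times k$ and $B = (b_{ij})$ of size $k \times n$ • Define the “matrix product” $AB = \left(\sum_{l=1}^k a_{il} b_{lj}\right)_{ij}$, of size $m \times n$ • (“inner products” of rows of $A$ with columns of $B$) • (composition of linear transformations)

Matrix Trace • For a square matrix $A = (a_{ij})$ of size $m \times m$ • Define $\mathrm{tr}(A) = \sum_{i=1}^m a_{ii}$ • Trace commutes with matrix multiplication: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$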
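
A quick numerical check of the last two slides, as a sketch with arbitrary random matrices: the matrix product is assembled entry by entry from inner products, and the trace identity is verified for non-square factors, where the two products have different sizes.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 2))

# Entry (i, j) of AB is the inner product of row i of A with column j of B.
AB = np.array([[A[i, :] @ B[:, j] for j in range(2)] for i in range(2)])
print(np.allclose(AB, A @ B))                         # True

# AB is 2 x 2 and BA is 3 x 3, yet the traces agree.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))   # True
```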

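Dimension • Dimension of subspace (a notion of “size”): • number of elements in a basis (unique) • $\dim(\mathbb{R}^d) = d$ (use basis above) • Example • Dimension of a line is 1 • Dimension of a plane is 2 • Dimension is “degrees of freedom”

Vector Norm • in $\mathbb{R}^d$, $\|x\| = \left(\sum_{j=1}^d x_j^2\right)^{1/2} = (x^T x)^{1/2}$ • Idea: “length” of the vector • “length normalized vector”: $x / \|x\|$ • (has length one, thus on surf. of unit sphere) • get “distance” as: $d(x, y) = \|x - y\| = \sqrt{(x - y)^T (x - y)}$

Inner Product • Inner (dot, scalar) product: • for vectors $x$ and $y$, $\langle x, y \rangle = \sum_{j=1}^d x_j y_j = x^T y$ • related to norm, via $\|x\| = \sqrt{\langle x, x \rangle}$ • measures “angle between $x$ and $y$” as: $\mathrm{angle}(x, y) = \cos^{-1}\left(\frac{\langle x, y \rangle}{\|x\| \, \|y\|}\right)$ • key to “orthogonality”, i.e. “perpendicul’ty”: • $x \perp y$ if and only if $\langle x, y \rangle = 0$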
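
The norm, distance, and angle can be computed directly from inner products. The two vectors below are hypothetical and were picked to be perpendicular, so the angle should come out as a right angle.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([-4.0, 3.0])

inner_xy = x @ y                        # inner (dot) product
norm_x = np.sqrt(x @ x)                 # norm defined through the inner product
unit_x = x / norm_x                     # length-normalized vector
distance = np.linalg.norm(x - y)        # Euclidean distance between x and y

cos_angle = inner_xy / (np.linalg.norm(x) * np.linalg.norm(y))
angle_deg = np.degrees(np.arccos(cos_angle))

print(norm_x)       # 5.0
print(inner_xy)     # 0.0, so x and y are orthogonal
print(angle_deg)
```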

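Orthonormal Basis • Orthonormal basis $v_1, \ldots, v_n$: • All ortho to each other, i.e. $\langle v_i, v_j \rangle = 0$, for $i \ne j$ • All have length 1, i.e. $\langle v_i, v_i \rangle = 1$, for $i = 1, \ldots, n$ • “Spectral Representation”: $x = \sum_{i=1}^n a_i v_i$ where $a_i = \langle x, v_i \rangle$ • check: $\langle x, v_i \rangle = \left\langle \sum_{j=1}^n a_j v_j, v_i \right\rangle = \sum_{j=1}^n a_j \langle v_j, v_i \rangle = a_i$ • Matrix notation: $x = B a$ where $a^T = x^T B$, i.e. $a = B^T x$ • $a$ is called “transform (e.g. Fourier, wavelet) of $x$”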
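
A minimal sketch of the spectral representation, assuming an orthonormal basis of the whole space obtained from a QR factorization of a random matrix (an arbitrary choice for the example): the coefficients are inner products with the basis vectors, and the vector is recovered from them.

```python
import numpy as np

rng = np.random.default_rng(3)
# Columns of B form an orthonormal basis of R^4 (Q factor of a random matrix).
B, _ = np.linalg.qr(rng.normal(size=(4, 4)))
x = rng.normal(size=4)

a = B.T @ x            # coefficients: inner products of x with each basis vector
x_rebuilt = B @ a      # spectral representation: sum of coefficient times basis vector

print(np.allclose(B.T @ B, np.eye(4)))   # True: the basis is orthonormal
print(np.allclose(x_rebuilt, x))         # True: x is recovered from its transform
```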

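Vector Projection • Projection of a vector $x$ onto a subspace $V$: • Idea: member of $V$ that is closest to $x$ • (i.e. “approx’n”) • Find $P_V x \in V$ that solves: $\min_{v \in V} \|x - v\|$ (“least squa’s”) • General solution in $\mathbb{R}^d$: for basis matrix $B_V$, $P_V x = B_V (B_V^T B_V)^{-1} B_V^T x$ • So “proj’n operator” is “matrix mult’n”: $P_V = B_V (B_V^T B_V)^{-1} B_V^T$ • (thus projection is another linear operation)

Vector Projection (cont) • Projection using orthonormal basis $v_1, \ldots, v_n$: • Basis matrix is “orthonormal”: $B_V^T B_V = I_{n \times n}$ • So $P_V x = B_V B_V^T x$ = Recon(Coeffs of $x$ “in $V$ dir’n”) • For “orthogonal complement”, $V^\perp$, • $x = P_V x + P_{V^\perp} x$ and $\|x\|^2 = \|P_V x\|^2 + \|P_{V^\perp} x\|^2$ • Parseval inequality: $\|P_V x\|^2 = \sum_{i=1}^n \langle x, v_i \rangle^2 = \sum_{i=1}^n a_i^2 = \|a\|^2 \le \|x\|^2$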
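
The two projection slides can be compared side by side. The sketch below assumes the standard least-squares projection formula for a general basis matrix and an orthonormal basis of the same subspace obtained by QR; the subspace and vector are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(5, 2))       # basis matrix of a 2-d subspace V of R^5 (not orthonormal)
x = rng.normal(size=5)

# General projection operator: B (B^T B)^{-1} B^T, computed with a linear solve.
P = B @ np.linalg.solve(B.T @ B, B.T)
proj = P @ x

# With an orthonormal basis of the same subspace the operator simplifies to Q Q^T.
Q, _ = np.linalg.qr(B)
proj_q = Q @ (Q.T @ x)

resid = x - proj                  # component in the orthogonal complement
print(np.allclose(proj, proj_q))                          # True
print(np.allclose(B.T @ resid, 0.0))                      # True: residual is orthogonal to V
print(np.isclose(x @ x, proj @ proj + resid @ resid))     # True: squared lengths add up
```

Since the residual term is non-negative, the last identity also gives the inequality that the projection is never longer than the original vector.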

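Random Vectors • Given a “random vector” $X = (X_1, \ldots, X_d)^T$ • A “center” of the distribution is the mean vector, $\mu = E X = (E X_1, \ldots, E X_d)^T$ • A “measure of spread” is the covariance matrix: $\Sigma = \mathrm{cov}(X) = E\left[(X - \mu)(X - \mu)^T\right]$, with entries $\Sigma_{jk} = \mathrm{cov}(X_j, X_k)$

Empirically • Given a random sample $X_1, \ldots, X_n$, estimate the theoretical mean $\mu$ with the sample mean: $\hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$

Empirically (Cont.) • And estimate the “theoretical cov.” $\Sigma$ with the “sample cov.”: $\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T$

With Linear Algebra • Outer product representation: $\hat{\Sigma} = \tilde{X} \tilde{X}^T$, where: $\tilde{X} = \frac{1}{\sqrt{n-1}} \left(X_1 - \bar{X} \; \cdots \; X_n - \bar{X}\right)$, of size $d \times n$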
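
The sample mean, sample covariance, and outer product representation can be checked against each other numerically. This sketch assumes the usual $1/(n-1)$ normalization and stores the sample vectors as columns; the data are random and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 100
X = rng.normal(size=(d, n))                    # each column is one sample vector

mean = X.mean(axis=1, keepdims=True)           # sample mean, shape (d, 1)

# Sample covariance as a sum of outer products of the centered vectors.
cov_sum = sum(
    np.outer(X[:, i] - mean[:, 0], X[:, i] - mean[:, 0]) for i in range(n)
) / (n - 1)

# Outer product representation: scaled, centered data matrix times its transpose.
X_tilde = (X - mean) / np.sqrt(n - 1)
cov_outer = X_tilde @ X_tilde.T

print(np.allclose(cov_sum, cov_outer))         # True
print(np.allclose(cov_outer, np.cov(X)))       # True: np.cov treats rows as variables
```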

PCA as an Optimization Problem Find “direction of greatest variability”:
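
One standard way to state this problem, given here as an assumption since the slide's own formula did not survive, is to maximize $u^T \hat{\Sigma} u$ over unit vectors $u$, where $\hat{\Sigma}$ is the sample covariance. The maximizer is the eigenvector with the largest eigenvalue. The sketch below compares that eigenvector with a brute-force scan over directions in a made-up 2-d data set.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical 2-d data, one data point per column.
X = np.array([[3.0, 1.0], [1.0, 1.0]]) @ rng.normal(size=(2, 200))
cov = np.cov(X)

eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
u_best = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue

def variability(u):
    """Variance of the data projected onto the unit direction u."""
    return u @ cov @ u

# Brute force: evaluate the variability for unit vectors on a half circle.
angles = np.linspace(0.0, np.pi, 1000)
candidates = np.stack([np.cos(angles), np.sin(angles)])      # shape (2, 1000)
values = np.einsum("ik,ij,jk->k", candidates, cov, candidates)

print(np.isclose(variability(u_best), eigvals[-1]))          # True
print(values.max() <= variability(u_best) + 1e-12)           # True
```

The scan never beats the top eigenvector, which is the reason PCA can be computed by an eigen-decomposition instead of a search over directions.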

Applications of PCA • Eigenfaces for recognition. Turk and Pentland. 1991. • Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001. • Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003. • Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808
