Microarray analysis

Algorithms in Computational Biology, Spring 2006. Written by Itai Sharon

Microarrays • Measure the expression of genes in the cell • “Count” the number of mRNA molecules that attach to biological probes • Expression data is gathered for many genes (thousands) at once • Data is gathered over several experiments • Either at several time points or under different conditions

Relative expression of gene i in experiment j

Detecting Patterns in Expression Data • Genes may have similar expression patterns because • They are part of the same complex (protein-protein interactions) • They are part of the same pathway • They have similar regulatory elements • They have similar functions (part of a fail-safe mechanism) • A popular solution: clustering (already covered) • Hierarchical clustering, K-means, agglomerative, ... • Today: dimensionality reduction • PCA • SVD

Why Dimensionality Reduction • Using irrelevant data may harm accuracy • Clustering algorithms do not perform well on high-dimensional data • High-dimensional data cannot be visualized directly; reducing it to 2 or 3 dimensions makes plotting possible

Principal Component Analysis (PCA) • PCA seeks a linear projection that best describes the data in a least mean squares sense • Finds a set of principal components (PCs) • A PC defines a projection that captures the maximum amount of variation in a dataset • Each PC is orthogonal to all other PCs • Reduce dimensionality by picking the most informative PCs • Namely, to reduce from dimension d to dimension d’, pick the d’ most informative PCs

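PCA - Steps Input: a dataset $S$ of $n$ samples in $d$ dimensions • Subtract the mean from each dimension • Compute the covariance matrix $\Sigma$ for the $d$ dimensions • The covariance of two variables $X$ and $Y$: $\mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \bar{X})(Y_k - \bar{Y})$ • The covariance matrix: $\Sigma_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$ for $1 \le i, j \le d$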
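The first two steps can be sketched in NumPy as follows. This is a minimal illustration, assuming the dataset is stored as an array with one sample per row and one dimension per column, and assuming the sample covariance with the 1/(n-1) normalization; the function name `covariance_matrix` and the toy data are hypothetical.

```python
import numpy as np

def covariance_matrix(data):
    """Covariance matrix of an (n samples x d dimensions) array."""
    centered = data - data.mean(axis=0)       # subtract the mean of each dimension
    n = data.shape[0]
    # Entry (i, j) is the covariance between dimension i and dimension j.
    return centered.T @ centered / (n - 1)

# Hypothetical toy data: 5 samples in 3 dimensions.
rng = np.random.default_rng(0)
samples = rng.normal(size=(5, 3))
sigma = covariance_matrix(samples)

print(sigma.shape)                                         # d x d
print(np.allclose(sigma, np.cov(samples, rowvar=False)))   # agrees with NumPy's own routine
```

The matrix is symmetric, and its diagonal holds the variance of each dimension.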

PCA - Steps (cont.) • Compute the eigenvectors and eigenvalues of the covariance matrix • Choose the most informative PCs, construct a feature vector • Eigenvectors with the highest eigenvalues carry the most information • The feature vector is simply the combination of all chosen eigenvectors: $\mathrm{FeatureVector} = (\mathrm{eig}_1, \mathrm{eig}_2, \dots, \mathrm{eig}_{d'})$ • Transform the dataset to the new axis system • For each $s \in S$: $s' = \mathrm{FeatureVector}^{T} \cdot s$
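Putting the PCA steps together, a minimal NumPy sketch might look like this. It assumes samples are rows, uses `np.linalg.eigh` because the covariance matrix is symmetric, and stores the chosen eigenvectors as columns of `feature_vector`; the function name `pca` and the toy dataset are hypothetical.

```python
import numpy as np

def pca(data, n_components):
    """Project an (n x d) dataset onto its n_components most informative PCs."""
    centered = data - data.mean(axis=0)              # subtract the mean from each dimension
    cov = np.cov(centered, rowvar=False)             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1][:n_components] # indices of the largest eigenvalues
    feature_vector = eigvecs[:, order]               # d x d' matrix, one eigenvector per column
    projected = centered @ feature_vector            # each row is FeatureVector^T applied to a sample
    return projected, feature_vector, eigvals[order]

# Hypothetical toy data: 200 samples in 3 dimensions, most variation along one direction.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
data = np.column_stack([t,
                        2 * t + 0.1 * rng.normal(size=200),
                        0.1 * rng.normal(size=200)])

projected, feature_vector, eigvals = pca(data, n_components=2)
print(projected.shape)          # 200 samples, now in 2 dimensions
print(eigvals / eigvals.sum())  # share of the retained variance carried by each chosen PC
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is why the eigenvalues can be used to rank the PCs.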

When Things Get Messy… • PCA is fine when the initial dimension is not too big • Space and time requirements grow at least as $d^2$, the size of the covariance matrix • Otherwise we have a problem… • E.g. when $d = 10^4$, the covariance matrix alone has $10^8$ entries… • Luckily an alternative exists: SVD
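One common way to sidestep the d x d covariance matrix is to apply a thin SVD directly to the mean-centered data. The sketch below illustrates that general idea and is not necessarily the exact procedure intended in the lecture; the function name `pca_via_svd` and the array sizes are hypothetical. The squared singular values divided by n-1 match the eigenvalues of the covariance matrix.

```python
import numpy as np

def pca_via_svd(data, n_components):
    """PCA of an (n x d) dataset without forming the d x d covariance matrix."""
    centered = data - data.mean(axis=0)
    # Thin SVD: u is n x k, s has length k, vt is k x d, with k = min(n, d).
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                     # rows are the principal directions
    variances = s[:n_components] ** 2 / (data.shape[0] - 1)
    projected = centered @ components.T                # n x n_components
    return projected, components, variances

# Hypothetical sizes: few samples, many dimensions.
rng = np.random.default_rng(2)
data = rng.normal(size=(14, 2000))
projected, components, variances = pca_via_svd(data, n_components=3)
print(projected.shape, components.shape)
```

With n much smaller than d, the largest matrices handled here are n x d, instead of the d x d matrix required by the covariance route.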

Eigengenes, Eigenarrays and SVD • The idea: • Use the singular value decomposition (SVD) theorem to transform the dataset from the gene/array space to the eigengene/eigenarray space • Eigengenes, eigenarrays and eigenvalues: • Each dimension is represented by an eigengene/eigenarray/eigenvalue triplet • Eigenvalues are used for ranking dimensions • Paper: • Alter et al., 2000

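Singular Value Decomposition (SVD) • Theorem: if $E$ is a real $M \times N$ matrix, then there exist orthogonal matrices $U = [u_1, \dots, u_M] \in \mathbb{R}^{M \times M}$ and $V = [v_1, \dots, v_N] \in \mathbb{R}^{N \times N}$ such that $U^{T} E V = W = \mathrm{diag}(\sigma_1, \dots, \sigma_p) \in \mathbb{R}^{M \times N}$, where $p = \min(M, N)$ and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_p \ge 0$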

SVD • $\sigma_i$ is the $i$-th singular value of $E$. $u_i$ and $v_i$ are the $i$-th left singular vector and right singular vector of $E$, respectively. • It holds that $E v_i = \sigma_i u_i$ and $E^{T} u_i = \sigma_i v_i$ • Efficient algorithms for calculating the SVD exist

Orthogonality of Decomposition
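The SVD theorem and the orthogonality of its factors can be checked numerically. The sketch below uses `numpy.linalg.svd` on a small random matrix; the matrix sizes are hypothetical, and the diagonal matrix of singular values is called `W` here as an assumed naming choice.

```python
import numpy as np

rng = np.random.default_rng(3)
E = rng.normal(size=(6, 4))                 # hypothetical real M x N matrix, M = 6, N = 4
M, N = E.shape
p = min(M, N)

# full_matrices=True returns square orthogonal factors: U is M x M, Vt is N x N.
U, sigma, Vt = np.linalg.svd(E, full_matrices=True)
V = Vt.T

# Build the M x N matrix with the singular values on its diagonal.
W = np.zeros((M, N))
W[:p, :p] = np.diag(sigma)

print(np.allclose(U.T @ E @ V, W))          # U^T E V is diagonal
print(np.allclose(U.T @ U, np.eye(M)))      # columns of U are orthonormal
print(np.allclose(V.T @ V, np.eye(N)))      # columns of V are orthonormal
print(np.all(np.diff(sigma) <= 0))          # singular values are sorted in decreasing order

# Each singular triplet links a left and a right singular vector.
for i in range(p):
    assert np.allclose(E @ V[:, i], sigma[i] * U[:, i])
    assert np.allclose(E.T @ U[:, i], sigma[i] * V[:, i])
```

Because `U` and `V` are orthogonal, the decomposition can be inverted as `E = U @ W @ V.T`.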

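SVD and Microarray analysis • Reduction from the $N$ genes × $M$ arrays space to the $p$ eigengenes × $p$ eigenarrays space, with $p = \min(M, N)$ (the expression matrix has genes as rows and arrays as columns, so $N$ and $M$ swap roles relative to the theorem) • $W$ is the eigenexpression matrix • $U$ represents the expression of genes over eigenarrays • $V$ represents the expression of eigengenes over arrays • The “fraction of eigenexpression”: $f_l = \sigma_l^2 / \sum_{k=1}^{p} \sigma_k^2$ • “Shannon entropy” of the dataset: $0 \le d = -\frac{1}{\log p} \sum_{k=1}^{p} f_k \log f_k \le 1$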
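The two summary quantities can be computed directly from the singular values. In this sketch the fraction of eigenexpression is taken as each squared singular value divided by the sum of all squared singular values, and the entropy is normalized by the logarithm of the number of singular values so that it lies between 0 and 1, following the definitions in Alter et al., 2000. The function names and the example values are hypothetical.

```python
import numpy as np

def eigenexpression_fractions(singular_values):
    """Relative weight of each eigengene/eigenarray pair."""
    squared = np.asarray(singular_values, dtype=float) ** 2
    return squared / squared.sum()

def shannon_entropy(fractions):
    """Normalized entropy in [0, 1]; needs at least two fractions."""
    f = np.asarray(fractions, dtype=float)
    nonzero = f[f > 0]                       # a zero fraction contributes nothing
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(len(f)))

# Hypothetical singular values of a small expression matrix.
fractions = eigenexpression_fractions([3.0, 1.0, 1.0])
print(fractions)                             # the first pair carries 9/11 of the expression
print(shannon_entropy(fractions))

# Extreme cases: one dominant pair gives entropy 0, equal weights give entropy 1.
print(shannon_entropy(eigenexpression_fractions([2.0, 0.0, 0.0])))
print(shannon_entropy(eigenexpression_fractions([1.0, 1.0, 1.0, 1.0])))
```

A low entropy indicates that a few eigengenes capture most of the expression, which is the situation in which dimensionality reduction loses little information.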

Example: Cell cycle of Saccharomyces cerevisiae • Data is available for 5981 genes over 14 time points (at ½ hour intervals) • 784 genes were classified as cell-cycle regulated (with no missing values)

Data Sorting • For eigengenes 1 and 2, plot the correlation of each gene with both on a 2-D plot • The X-axis represents the correlation with eigengene 2, the Y-axis the correlation with eigengene 1 • Sort the genes by angular distance
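The sorting procedure can be sketched as follows. The sketch assumes a genes x arrays matrix, takes the first two right singular vectors as eigengenes 1 and 2, uses Pearson correlation as the correlation measure (an assumption; the slides do not specify it), and measures the angle with `arctan2`. The function names and the sinusoidal toy data are hypothetical and are not the yeast dataset.

```python
import numpy as np

def correlation_with(E, pattern):
    """Pearson correlation of every row of E with a 1-D pattern."""
    rows = E - E.mean(axis=1, keepdims=True)
    pat = pattern - pattern.mean()
    return rows @ pat / (np.linalg.norm(rows, axis=1) * np.linalg.norm(pat))

def sort_by_angle(E):
    """Order the genes (rows of E) by their angle in the eigengene 2 / eigengene 1 plane."""
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    y = correlation_with(E, Vt[0])           # Y-axis: correlation with eigengene 1
    x = correlation_with(E, Vt[1])           # X-axis: correlation with eigengene 2
    angle = np.arctan2(y, x)                 # angle of each gene in the 2-D plot
    return np.argsort(angle), x, y, angle

# Hypothetical toy data: 50 genes with periodic profiles and random phases, 14 time points.
rng = np.random.default_rng(4)
phases = rng.uniform(0, 2 * np.pi, size=50)
t = np.arange(14)
E = np.cos(2 * np.pi * t / 7 - phases[:, None])

order, x, y, angle = sort_by_angle(E)
sorted_E = E[order]                          # genes with neighboring phases end up adjacent
print(order[:5])
```

In this toy case every gene lies on the unit circle of the plot, so the angle alone determines its position in the sorted order.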

Further Reading • PCA: • L. Smith: A Tutorial on Principal Components Analysis • Eigengenes, eigenvectors and SVD: • O. Alter, P. Brown & D. Botstein: Singular Value Decomposition for Genome-wide Expression Data Processing and Modeling, PNAS 97(18), 2000
