Microarray data analysis

Microarray data analysis Jeremy Glasner Genetics 875 November 29, 2007

Why cluster data?
• TMI (too much information): can’t “see” patterns in the data
• Reduce complexity in data sets
• Allow “visualization” of complex data

Preliminary questions you need to ask before you start clustering
• What genes and experiments to cluster?
• What normalization, standardization, or transformation should be applied to the data?
• What distance function should be used?
• What clustering method should be used?

Cluster differentially expressed genes or all genes?
Determine which changes are significant:
• Fixed cutoff (fold-change > 4)
• Replication allows assessment of variability
• Common statistics such as the t-test are often used for gene expression data. The significance of the t statistic is then determined by referring to the t distribution. This assumes that the data are normally distributed, which may not be true.
• Gene expression experiments may require thousands of statistical tests, and significance thresholds should be adjusted to reflect this. The standard Bonferroni correction multiplies each p-value by the number of tests (capping the result at 1), but it is likely too conservative.
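
The multiple-testing adjustment described above can be sketched in a few lines of Python. This is only an illustration: the function name `bonferroni` and the raw p-values are made up for the example, and the 0.05 threshold is an assumed significance level, not one given in the lecture.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: p * (number of tests), capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical raw p-values for four genes (illustrative numbers only).
raw = [0.001, 0.02, 0.04, 0.5]
adjusted = bonferroni(raw)

# Assumed significance level of 0.05, applied after adjustment.
significant = [p < 0.05 for p in adjusted]
```

With only four tests, two genes that pass an unadjusted 0.05 cutoff no longer pass after adjustment. With thousands of genes the effect is far stronger, which is why the correction is often considered too conservative.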

Simple scatterplot for two experiments

Principal Components Analysis (PCA, closely related to SVD)
Definition: Principal Components - A set of variables that define a projection that encapsulates the maximum amount of variation in a dataset, each orthogonal (and therefore uncorrelated) to all previous principal components of the same dataset.
• With 1000 genes and 10 experiments we have either 1000 data points in 10-dimensional space or 10 data points in 1000-dimensional space
• The data, though clumped around several central points in that hyperspace, will generally tend towards one direction. If one were to draw a straight line that best describes that direction, then that line is the first principal component (PC).
• Any variation that is not captured by that first PC is captured by subsequent orthogonal PCs.
• Singular Value Decomposition (SVD) is a matrix factorization commonly used to compute PCA: the SVD of the mean-centered data matrix gives the same components as an eigendecomposition of its covariance matrix.
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm
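
To make the idea of the first PC concrete, here is a minimal sketch that estimates it by power iteration on the sample covariance matrix. Power iteration is swapped in for brevity and is not the SVD route mentioned on the slide. The function name and the toy data are assumptions for illustration, and the sketch assumes the starting vector is not orthogonal to the first PC.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Estimate the first PC of `rows` (a list of equal-length numeric lists)
    by power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    # Mean-center each variable (column).
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeatedly multiply a vector by the covariance matrix and renormalize;
    # it converges towards the direction of greatest variance.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: four points lying exactly on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(points)
```

For these points the returned unit vector is proportional to (1, 2), the direction of the line along which all the variation lies.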

Pattern Discovery - assign objects to classes
• Unsupervised learning - The classes are unknown a priori and need to be “discovered” from the data, e.g. cluster analysis, class discovery, unsupervised pattern recognition
• Supervised learning - The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects. This information is then used to classify future observations, e.g. classification, discriminant analysis, class prediction, supervised pattern recognition
From: Eisen MB, Spellman PT, Brown PO and Botstein D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.

Different distance measures
• Euclidean distance - the straight-line distance between two vectors; takes into account both the direction and the magnitude of the vectors
• Manhattan distance - the distance measured along directions parallel to the coordinate axes (the sum of absolute coordinate differences), meaning that no diagonal moves are allowed
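
A short sketch of the two measures, using made-up two-dimensional profiles. The function names and the example vectors are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis (no diagonal moves)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical expression profiles of two genes in two experiments.
gene_a = [0.0, 0.0]
gene_b = [3.0, 4.0]

d_euclid = euclidean(gene_a, gene_b)     # diagonal of a 3-by-4 rectangle
d_manhattan = manhattan(gene_a, gene_b)  # walk 3 along x, then 4 along y
```

Here the Euclidean distance is 5 and the Manhattan distance is 7. The Manhattan distance is never smaller than the Euclidean distance for the same pair of vectors.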

More distance metrics
• Correlation distance
• Chebychev distance
• Angle between vectors
• Squared Euclidean distance
• Standardized Euclidean distance
• Mahalanobis distance
Differentially expressed genes varying in the same way
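
As a rough illustration of why correlation distance suits genes that vary in the same way, the sketch below compares it with the Chebychev (Chebyshev) distance. It assumes correlation distance is defined as 1 minus the Pearson correlation, a common convention but not one stated on the slide. The profiles are invented.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def correlation_distance(a, b):
    """Assumed definition: 1 - Pearson r (0 = same shape, 2 = opposite shape)."""
    return 1.0 - pearson(a, b)

def chebyshev(a, b):
    """Largest absolute difference along any single axis."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical profiles: same shape at different magnitudes, and a mirror image.
low = [1.0, 2.0, 3.0, 2.0]
high = [10.0, 20.0, 30.0, 20.0]
opposite = [3.0, 2.0, 1.0, 2.0]

d_same_shape = correlation_distance(low, high)
d_opposite = correlation_distance(low, opposite)
d_cheb = chebyshev(low, high)
```

The `low` and `high` profiles have a correlation distance of essentially zero even though their magnitudes differ tenfold, while a magnitude-sensitive measure such as the Chebychev distance places them far apart.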

Yet more distance measures- inter-cluster distances

Hierarchical clustering of expression data From Eisen et al., PNAS 95:14863

Clustering from CGH data

K-means Clustering
K-means clustering starts from an initial assignment of items to K clusters and proceeds by repeated application of a three-step process:
1) the mean vector for all items in each cluster is computed
2) items are reassigned to the cluster whose center is closest to the item
3) steps 1 and 2 are repeated until the assignments no longer change or the maximum number of cycles is reached
The parameters controlling k-means clustering are:
1) the number of clusters (K)
2) the maximum number of cycles
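
The loop above can be sketched as follows. This is a minimal illustration, not a production implementation. It assumes the caller supplies K initial centers (so the first cycle begins with the assignment step), uses squared Euclidean distance, and leaves a center unchanged if its cluster becomes empty. All names and data are hypothetical.

```python
def kmeans(items, initial_centers, max_cycles=100):
    """Cluster `items` (list of vectors) around K centers.
    Returns (cluster index per item, final centers)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(c) for c in initial_centers]  # copy; do not mutate input
    assignment = None
    for _ in range(max_cycles):
        # Reassign each item to the cluster whose center is closest.
        new_assignment = [
            min(range(len(centers)), key=lambda k: sqdist(item, centers[k]))
            for item in items
        ]
        if new_assignment == assignment:
            break  # converged: no item changed cluster
        assignment = new_assignment
        # Recompute the mean vector of the items in each cluster.
        for k in range(len(centers)):
            members = [it for it, a in zip(items, assignment) if a == k]
            if members:  # keep the old center if the cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Toy data: two obvious groups; K = 2 with hypothetical starting centers.
items = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
labels, centers = kmeans(items, [[0.0, 0.0], [10.0, 10.0]])
```

Both parameters from the slide appear here: K is the number of initial centers passed in, and `max_cycles` bounds the number of iterations.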

Cluster visualized as a line graph of expression profiles
[Figure: line graph of Log2 signal intensity at log phase, 1hr, 2hr, 3hr, 6hr and 10hr; labels in the figure: acetate utilization, prp genes, propionate, fatty acid oxidation, fad]

Self-Organizing Maps Figure From: Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999 Mar 16;96(6):2907-12.

A self-organizing map of expression profiles

3-D Topo map of gene expression patterns
Kim et al., A Gene Expression Map for Caenorhabditis elegans. Science, 14 September 2001
• Caenorhabditis elegans gene expression terrain map created by VxInsight at lowest resolution, showing a three-dimensional representation of 44 gene mountains derived from 553 microarray hybridizations and consisting of 17,661 genes
• Correlations of gene expression profiles are represented as distances in two dimensions, and gene density as the third dimension

Heat maps of mountains Development 130, 1621-1634 (2003)

Combining results from different methods David J. Lockhart & Elizabeth A. Winzeler. NATURE VOL 405 15 JUNE 2000

Integrating high-throughput data with other biological information

Mapping expression data onto metabolic pathways http://www.genome.ad.jp/kegg/ http://biocyc.org:1555/ECOLI/expression.html

Machine Learning
A form of artificial intelligence that is used here to classify objects into known groups. For example: given a set of patients with a disease and a collection of gene expression profiles, we could train a model on the known cases and use it to predict the disease in samples where it is unknown. Training examples are essential for these methods.
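
The train-then-predict idea can be illustrated with one of the simplest possible classifiers, a nearest-centroid rule. This particular method is an assumption chosen for brevity, not one named in the lecture, and the profiles and labels below are invented.

```python
def train_centroids(profiles, labels):
    """Training step: compute the mean expression profile of each known class."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }

def predict(centroids, profile):
    """Prediction step: return the class whose centroid is closest."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(centroids[label], profile))

# Hypothetical training set: two genes measured in four labeled samples.
train_profiles = [[5.0, 1.0], [6.0, 1.5], [1.0, 4.0], [0.5, 5.0]]
train_labels = ["disease", "disease", "healthy", "healthy"]
model = train_centroids(train_profiles, train_labels)

# Classify two samples whose status is unknown.
call_1 = predict(model, [5.2, 0.8])
call_2 = predict(model, [1.0, 4.2])
```

Without the labeled training samples there would be no centroids to compare against, which is the sense in which training examples are essential.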

Machine learning to predict regulatory states of genes http://www.ebi.ac.uk/microarray/Research/networks/reconstruction.html#

General strategy for machine learning http://www.ebi.ac.uk/microarray/Research/networks/reconstruction.html#

A decision tree http://www.ebi.ac.uk/microarray/Research/networks/reconstruction.html#

Transcription factor binding site identification by gene expression analysis
• Typically examine expression in a mutant that under- or overproduces a transcriptional regulator.
• Potential targets of the regulator are identified by finding significant differences in gene expression between the mutant and wild-type.
• Upstream regions of the candidate target genes are searched for over-represented sequences (motifs), usually using a Gibbs sampling approach.
• Once motifs are identified, a matrix describing the motif can be used to search the genome for additional potential sites.
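
The last step, searching a sequence with a matrix that describes the motif, might look like the sketch below. It covers only the scanning step, not Gibbs sampling. It assumes a log-odds position weight matrix built from aligned sites with a pseudocount of 1 and a uniform background of 0.25 per base, and it expects sequences containing only A, C, G and T. The sites and the sequence are made up.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned sites,
    assuming a uniform background frequency of 0.25 per base."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return a list of (position, score)."""
    width = len(pwm)
    return [
        (start, sum(pwm[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Hypothetical aligned binding sites found upstream of candidate target genes.
sites = ["TGACA", "TGACA", "TGATA", "TGACA"]
pwm = build_pwm(sites)

# Scan a made-up stretch of sequence and keep the best-scoring window.
hits = scan("CCTGACAGG", pwm)
best_pos, best_score = max(hits, key=lambda hit: hit[1])
```

In practice one would scan both strands and choose a score threshold for calling a site. Both are left out here to keep the sketch short.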
