More Microarray Analysis: Unsupervised Approaches

More Microarray Analysis:Unsupervised Approaches Matt Hibbs Troyanskaya Lab

Outline • Gene Expression vs. DNA applications • A little more normalization (missing values) • Unsupervised Analysis • Basic Clustering • Statistical Enrichment • PCA/SVD • Advanced Clustering • Search-based Approaches

Expression / DNA • Some similar concepts to analysis, but often very different goals • Expression – clustering, guilt by association, functional enrichment • DNA – signal processing, spatial relationships, motif finding • Visualized differently (Heat maps vs. karyoscope)

The missing value problem • Microarrays can have systematic or random missing values • Some algorithms can’t deal with missing values (PCA/SVD in particular) • Instead of hoping missing values won’t bias the analysis, better to estimate them accurately

Spatial Defects

KNN Impute • Idea: use genes with similar expression profiles to estimate missing values 2 | | 5 | 7 | 3 | 1 Gene X 2 |4.3| 5 | 7 | 3 | 1 Gene X 2 | 4 | 5 | 7 | 3 | 2 Gene A 2 | 4 | 5 | 7 | 3 | 2 Gene A 8 | 9 | 2 | 1 | 4 | 9 Gene B 8 | 9 | 2 | 1 | 4 | 9 Gene B 3 | 5 | 6 | 7 | 3 | 2 Gene C 3 | 5 | 6 | 7 | 3 | 2 Gene C

Imputation affects downstream analysis Complete data set Data set with 30% entries missing and filled with zeros (zero values appear black) Data set with missing values estimated by KNNimpute algorithm

Unsupervised Analysis • Supervised techniques great if you have starting information (e.g. labels) • But, we often we don’t know enough beforehand to apply these methods • Unsupervised techniques are exploratory • Let the data organize itself, then try to find biological meaning • Approaches to understand whole data • Visualization often helpful

Clustering • Let the data organize itself • Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups) • Identify subsets of genes (or experiments) that are related by some measure

Quick Example Conditions Genes

Why cluster? • “Guilt by association” – if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway • Visualization: datasets are too large to be able to get information out without reorganizing the data

Clustering Techniques • Algorithm (Method) • Hierarchical • K-means • Self Organizing Maps • QT-Clustering • NNN • . • . • . • Distance Metric • Euclidean (L2) • Pearson Correlation • Spearman Correlation • Manhattan (L1) • Kendall’s t • . • . • .

Distance Metrics • Choice of distance measure is important for most clustering techniques • Pair-wise metrics – compare vectors of numbers • e.g. genes x & y, ea. with n measurements Euclidean Distance Pearson Correlation Spearman Correlation

Distance Metrics Euclidean Distance Pearson Correlation Spearman Correlation

Hierarchical clustering • Imposes (pair-wise) hierarchical structure on all of the data • Often good for visualization • Basic Method (agglomerative): • Calculate all pair-wise distances • Join the closest pair • Calculate pair’s distance to all others • Repeat from 2 until all joined

Hierarchical clustering

HC – Interior Distances • Three typical variants to calculate interior distances within the tree • Average linkage: mean/median over all possible pair-wise values • Single linkage: minimum pair-wise distance • Complete linkage: maximum pair-wise distance

Hierarchical clustering: problems • Hard to define distinct clusters • Genes assigned to clusters on the basis of all experiments • Optimizing node ordering hard (finding the optimal solution is NP-hard) • Can be driven by one strong cluster – a problem for gene expression b/c data in row space is often highly correlated

HC: Real Example • Demo in JavaTreeView & HIDRA • Spellman et al., 1998: yeast alpha-factor sync cell cycle timecourse

HC: Another Example • Expression of tumors hierarchically clustered • Expression groups by clinical class Garber et al.

K-means Clustering • Groups genes into a pre-defined number of independent clusters • Basic algorithm: • Define k = number of clusters • Randomly initialize each cluster with a seed (often with a random gene) • Assign each gene to the cluster with the most similar seed • Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster • Repeat 3 & 4 until convergence (e.g. No genes move, means don’t change much, etc.)

K-means example

K-means: problems • Have to set k ahead of time • Ways to choose “optimal” k: minimize within-cluster variation compared to random data or held out data • Each gene only belongs to exactly 1 cluster • One cluster has no influence on the others (one dimensional clustering) • Genes assigned to clusters on the basis of all experiments

K-means: Real Example • Demo in TIGR MeV • Spellman et al. alpha-factor cell cycle

Clustering “Tweaks” • Fuzzy clustering – allows genes to be “partially” in different clusters • Dependent clusters – consider between-cluster distances as well as within-cluster • Bi-clustering – look for patterns across subsets of conditions • Very hard problem (NP-complete) • Practical solutions use heuristics/simplifications that may affect biological interpretation

Cluster Evaluation • Mathematical consistency • Compare coherency of clusters to background • Look for functional consistency in clusters • Requires a gold standard, often based on GO, MIPS, etc. • Evaluate likelihood of enrichment in clusters • Hypergeometric distribution, etc. • Several tools available

Gene Ontology • Organization of curated biological knowledge • 3 branches: biological process, molecular function, cellular component

Hypergeometric Distribution • Probability of observing x or more genes in a cluster of n genes with a common annotation • N = total number of genes in genome • M = number of genes with annotation • n = number of genes in cluster • x = number of genes in cluster with annotation • Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.) • Additional genes in clusters with strong enrichment may be related

GO term Enrichment Tools • SGD’s & Princeton’s GoTermFinder • http://go.princeton.edu • GOLEM (http://function.princeton.edu/GOLEM) • HIDRA Sealfon et al., 2006

More Unsupervised Methods • Search-based approaches • Starting with a query gene/condition, find most related group • Singular Value Decomposition (SVD) & Principal Component Analysis (PCA) • Decomposition of data matrix into “patterns” “weights” and “contributions” • Real names are “principal components”“singular values” and “left/right eigenvectors” • Used to remove noise, reduce dimensionality, identify common/dominant signals

SVD (& PCA) • SVD is the method, PCA is performing SVD on centered data • Projects data into another orthonormal basis • New basis ordered by variance explained X U  Vt = Singular values “Eigen-genes” Original Data matrix “Eigen-conditions”

SVD SVD

SVD: Real Example • Demo in TIGR MeV • Spellman et al., 1998 cell cycle time courses • alpha-factor sync • cdc15 sync

DNA arrays / Sequence-based Analysis • Methods so far focused on expression data • Other uses of microarrays often sequence based: CGH, ChIP-chip, SNP scanner • Data has important, inherent order • Most analysis methods developed from signal processing techniques (e.g. sound) • View data in chromosomal order (karyoscope) • Tools: JavaTreeView, IGB, Chippy

CGH Example • Demo in JavaTreeView

Aneuploidy affects expression too rpl20aD/ rpl20aD, Chromosome XV (data from Hughes et al. (2000))

Software Tools • JavaTreeView – viz, karyoscope • HIDRA – viz, mult. datasets, search • Cluster (Eisen lab) – clustering • TIGR MeV – clustering, viz • IGB – Affy’s CGH browser • ChIPpy – ChIP-chip analysis

Summary • Unsupervised Analysis • Let the data organize itself, find patterns • Clustering: Distance Metric + Algorithm • SVD/PCA – auto find dominant patterns • Impute missing values (KNN) • CGH – Karyoscope view • Questions?

More Microarray Analysis: Unsupervised Approaches