Microarray Data Analysis: Clustering and Validation Measures Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo, Italy
What we want (typically) Genes, Expression Levels, Gene Expression Matrix • Group functionally related genes together • Basic Axiom of Computational Biology: Guilt by Association. A high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… not always • Clustering
What we want (typically) Clustering Solution
Limitations in the Analysis Process
Limitations: Microarray Technology • MIAME, we have a problem (Robert Shields, Trends in Genetics, 2006) • …no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself • A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak
Limitations: Visualization Tools • One of these two clusters is random noise… Which one?
Limitations: Statistics • Towards sound epistemological foundations of statistical methods for high-dimensional biology (T. Mehta et al., Nature Genetics, 2004) • Many papers in omics research describe the development or application of statistical methods; many of those are questionable
Overview of the Remaining Part • Clustering as a three-step process • Internal Validation Techniques • External Validation Techniques • Experiments • One-stop-shop software systems • Some Issues I Really Had to Talk About
Cluster Analysis as a Three-Step Process
What is clustering? • Group similar objects together • Clustering experiments • Clustering genes
What is Clustering? • Goal: partition the observations {xi} so that • C(i) = C(j) if xi and xj are “similar” • C(i) ≠ C(j) if xi and xj are “dissimilar” • Natural questions: • What is a cluster? • How do I choose a good similarity function? • How do I choose a good algorithm? • APPLICATION and DATA DEPENDENT • How many clusters are REALLY present in the data?
What’s a Cluster? • No rigorous definition • Subjective • Scale/resolution dependent (e.g., hierarchy)
Step One • Choose a good similarity function • Euclidean Distance: captures magnitude and pattern of expression, i.e., direction • Correlation functions: capture the pattern of expression, i.e., direction • Etc.
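The contrast between the two families of functions can be made concrete with a minimal pure-Python sketch (the two gene profiles g1 and g2 are made-up illustrations, not data from the talk):

```python
import math

def euclidean(x, y):
    # Sensitive to both magnitude and pattern (direction) of expression.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # Sensitive only to the pattern: profiles that rise and fall together
    # score 1 even when their magnitudes differ.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)

# Hypothetical profiles: same expression pattern, different magnitude.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]
```

Here pearson(g1, g2) is exactly 1 (identical direction), while euclidean(g1, g2) is large; which behavior is “right” depends on the application, as the slides stress.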
Step Two • Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize • Compactness: intra-cluster variation small • Favor well-separated or spherical clusters but fail on more complex cluster shapes • K-means, Average Link Hierarchical Clustering • Connectedness: neighboring items should share the same cluster • Robust with respect to cluster shapes, but fails when separation in the data is poor • Single Link Hierarchical Clustering, CAST, CLICK • Spatial Separation: a poor performer by itself, usually coupled with other criteria • Simulated Annealing, Tabu Search
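As a small illustration of the “connectedness” family, here is a toy single-link agglomerative clustering in pure Python (the 1-D points are made up; real implementations such as those cited on the slide are far more efficient):

```python
def single_link(points, k):
    # Agglomerative clustering with single linkage (a "connectedness"
    # criterion): repeatedly merge the two clusters whose closest members
    # are nearest, until k clusters remain.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def link(c1, c2):
        # Single link: distance between the closest pair across clusters.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated groups on a line: single link recovers them.
pts = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
groups = single_link(pts, 2)
```

Because single link only looks at nearest neighbors, it can follow elongated, non-spherical clusters, but it chains clusters together when the groups are poorly separated, exactly the trade-off the slide describes.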
Step Three • An index that tells us how many clusters are really present in the data: Consistency/Uniformity. More likely to be 2 than 3; more likely to be 2 than 36? (It depends: what if each circle represents 1000 objects?)
Step Three • An index that tells us: Separability. Increasing confidence that there are 2 clusters.
Step Three • An index that is • independent of cluster “volume”? • independent of cluster size? • independent of cluster shape? • sensitive to outliers? • etc. • Theoretically sound: Gap Statistics • Data driven and validated: many
Internal Validation Measures How many clusters are really present in the data? Assess cluster quality. Internal: no external knowledge about the dataset is given
The Basic Scheme • Given an index F, a function of a clustering solution • A black box producing clustering solutions with k = 2, …, m clusters • Compute F on each solution to decide which k is best
Internal Validation Measures • Within-Cluster Sum of Squares [Folklore] • Gap Statistics [Tibshirani, Walther, Hastie 2001] • FOM [Yeung, Haynor, Ruzzo 2001] • Consensus Clustering [Monti et al., 2003] • Etc…
Within-Cluster Sum of Squares Wk = Σr=1..k (1/(2nr)) Σ xi,xj ∈ Cr d(xi, xj)², equivalently the summed squared distances of the points of each cluster Cr to its centroid
Within-Cluster Sum of Squares Measure of compactness of clusters
Using Wk to determine the # of clusters Idea of the L-curve method: use the k corresponding to the “elbow”, i.e., the last significant improvement in goodness-of-fit
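The elbow idea can be sketched in pure Python. This is a toy, not the talk's setup: the 1-D dataset is made up, and a small deterministic k-means (centers seeded with the first k points) stands in for the clustering black box:

```python
def wk(clusters):
    # Within-cluster sum of squares: summed squared distance of each
    # point to its cluster centroid -- a measure of compactness.
    total = 0.0
    for c in clusters:
        centroid = [sum(v) / len(c) for v in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid))
                     for p in c)
    return total

def kmeans(points, k, iters=50):
    # Deterministic toy k-means, standing in for the "black box"
    # clustering algorithm of the basic scheme.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        centers = [[sum(v) / len(c) for v in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

# Hypothetical 1-D dataset with two obvious groups.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
curve = [wk(kmeans(pts, k)) for k in (1, 2, 3)]
# Wk drops sharply from k=1 to k=2 and flattens afterwards:
# the "elbow" suggests two clusters.
```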
Example • Yeast Cell Cycle dataset, 698 genes and 72 conditions • Five functional classes: the gold-standard solution • Algorithm: K-means with the Average Link solution as input and Euclidean distance • We want to know how many clusters are predicted by Wk, with K-means as an “oracle”
Example
Problems with the Use of Wk • No reference clustering solution to compare against, i.e., no null model • The values of Wk are not normalized and therefore cannot be compared directly • In a nutshell: we get values of Wk but we do not quite know how far we are from randomness • The Gap Statistics takes care of those problems
The Gap Statistics • Based on solid statistical work for the 1-D case, i.e., where the objects to be clustered are scalars; takes care of the problems outlined for Wk • Extended to work in higher dimensions, with no supporting theory • Validated experimentally
Sample Uniformly and at Random • Align with the feature axes (data-geometry independent) [figure: observations, their bounding box aligned with the feature axes, and Monte Carlo samples drawn inside it]
Computation of the Gap Statistic for b = 1 to B compute a Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.) for k = 1 to K cluster the observations into k groups and compute log Wk for b = 1 to B cluster the b-th M.C. sample into k groups and compute log Wkb compute Gap(k) = (1/B) Σb log Wkb − log Wk compute sd(k), the s.d. of {log Wkb}b=1,…,B set the total s.e. s(k) = sd(k)·√(1 + 1/B) Find the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1)
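The procedure can be sketched in pure Python. Again a toy: the 1-D data are made up, a small deterministic k-means plays the clustering black box, and the reference distribution is uniform in the bounding box of the data, as on the previous slide:

```python
import math
import random

def wk(clusters):
    # Within-cluster sum of squares (compactness).
    total = 0.0
    for c in clusters:
        mu = [sum(v) / len(c) for v in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in c)
    return total

def kmeans(points, k, iters=50):
    # Deterministic toy k-means seeded with the first k points.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        centers = [[sum(v) / len(c) for v in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def gap_statistic(points, max_k=3, B=20, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[i] for p in points) for i in range(dim)]
    hi = [max(p[i] for p in points) for i in range(dim)]
    log_w = [math.log(wk(kmeans(points, k))) for k in range(1, max_k + 1)]
    gaps, s = [], []
    for k in range(1, max_k + 1):
        # B Monte Carlo samples, uniform in the bounding box of the data.
        ref = []
        for _ in range(B):
            sample = [tuple(rng.uniform(lo[i], hi[i]) for i in range(dim))
                      for _ in points]
            ref.append(math.log(wk(kmeans(sample, k))))
        mean = sum(ref) / B
        sd = math.sqrt(sum((r - mean) ** 2 for r in ref) / B)
        gaps.append(mean - log_w[k - 1])       # Gap(k)
        s.append(sd * math.sqrt(1 + 1 / B))    # total s.e.
    for k in range(1, max_k):                  # smallest k with
        if gaps[k - 1] >= gaps[k] - s[k]:      # Gap(k) >= Gap(k+1) - s(k+1)
            return k, gaps
    return max_k, gaps

pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
k_hat, gaps = gap_statistic(pts)
```

On these two well-separated groups, the observed log Wk falls far below the reference values at k = 2, so the rule selects two clusters.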
Example • The same experimental setting as for the Within-Cluster Sum of Squares • We want to know whether the Gap Statistics predicts 5 clusters, with K-means as an “oracle”
Example
Figure of Merit • A purely experimental approach, designed and validated specifically for microarray data
FOM [figure: an n × m expression matrix R(g, e), genes g = 1…n by experiments e = 1…m, with the rows grouped into clusters C1, …, Ci, …, Ck]
FOM • Leave out one experiment (condition) e, cluster the genes on the remaining m − 1 experiments, and measure how well the clusters predict the left-out values: FOM(e, k) = √( (1/n) Σi=1..k Σg∈Ci (R(g, e) − μCi(e))² ), where μCi(e) is the mean expression of cluster Ci in condition e • The aggregate FOM(k) sums FOM(e, k) over all left-out experiments e; lower is better
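The leave-one-condition-out scheme of the Figure of Merit can be sketched in pure Python (a toy, assuming a small deterministic k-means as the clustering algorithm under evaluation; the 6-gene expression matrix is made up):

```python
import math

def kmeans_idx(points, k, iters=50):
    # Deterministic toy k-means over row indices, seeded with the first
    # k rows; stands in for the clustering algorithm under evaluation.
    centers = [list(points[i]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for g, p in enumerate(points):
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(g)
        centers = [[sum(points[g][d] for g in c) / len(c)
                    for d in range(len(points[0]))] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def fom(matrix, k):
    # 2-norm figure of merit: leave out each condition e in turn, cluster
    # the genes on the remaining conditions, then measure how tightly the
    # clusters agree on the left-out column.
    n, m = len(matrix), len(matrix[0])
    total = 0.0
    for e in range(m):
        reduced = [tuple(row[:e] + row[e + 1:]) for row in matrix]
        sq = 0.0
        for idxs in kmeans_idx(reduced, k):
            mu = sum(matrix[g][e] for g in idxs) / len(idxs)
            sq += sum((matrix[g][e] - mu) ** 2 for g in idxs)
        total += math.sqrt(sq / n)
    return total

# Hypothetical 6-gene, 4-condition matrix with two obvious gene groups.
genes = [[0.0] * 4, [0.5] * 4, [1.0] * 4,
         [10.0] * 4, [10.5] * 4, [11.0] * 4]
fom1, fom2 = fom(genes, 1), fom(genes, 2)
```

With two genuine groups in the data, the two-cluster solution predicts the left-out condition far better, so fom2 is much smaller than fom1.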
Example • The same experimental setting as for the Within-Cluster Sum of Squares • We want to know whether FOM indicates 5 clusters in the dataset, with K-means as an “oracle” • Hint: look for the elbow in the FOM plot, exactly as for the Wk curve
Example
External Validation Measures Given two partitions of the same dataset, how close are they? Assess the quality of a partition against a given gold standard. External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of biology, the elements in a cluster must be biologically correlated, i.e., belong to the same functional group of genes
Some External Validation Measures • The two partitions must have the same number of classes • Jaccard Index • Minkowski score • Rand Index [Rand 71] • The two partitions can have a different number of classes • The Adjusted Rand Index [Hubert and Arabie 85] • The F measure [van Rijsbergen 79]
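As a representative of the first family, the Jaccard index over pairs of objects can be sketched in a few lines of pure Python (the label vectors below are made-up illustrations):

```python
from itertools import combinations

def jaccard(labels_a, labels_b):
    # Jaccard index over pairs of objects: of all pairs co-clustered in
    # at least one partition, the fraction co-clustered in both.
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    return n11 / (n11 + n10 + n01)
```

Pair counting makes the index independent of how the clusters are labeled: [0, 0, 1, 1] and [1, 1, 0, 0] describe the same partition and score 1.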
Some External Validation Measures • A problem with the mentioned indices: what is their expected value? • In very intuitive terms: if one picks two partitions blindly, among the possible partitions of the data, what value of the index should we expect? The same problem we had with the Gap Statistics.
The Adjusted Rand Index • It takes as input two partitions, not necessarily having the same number of classes • Value 1, its maximum, means perfect agreement • The expected value of the index, i.e., its value on two randomly correlated partitions, is zero • Note 1: the index may take negative values • Note 2: the same property is not shared by the other mentioned indices, including its relative, the Rand Index • The index must be maximized • We will see some of its uses later
Adjusted Rand Index • Compare clusters to classes • Consider the # of pairs of objects
Example (Adjusted Rand) Closed form in the paper by Handl et al. (supplementary material)
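The pair-counting closed form of Hubert and Arabie can be sketched in pure Python (label vectors are made-up illustrations; the adjustment term subtracts the index's expected value under chance):

```python
from math import comb
from collections import Counter

def adjusted_rand(labels_a, labels_b):
    # Adjusted Rand Index from the pair-counting contingency table.
    # Perfect agreement scores 1; random agreement scores ~0 (and the
    # index can go negative, unlike the plain Rand Index).
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)
```

Because only pair co-memberships are counted, the two partitions may have different numbers of classes, as the previous slide notes; relabeled but identical partitions score exactly 1.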
Some Experiments, or on the Need for Benchmark Datasets
How Do I Pick: • Distance and Similarity Functions, given algorithm and data set • Algorithm, given data set • Internal Validation Measures, given data set
Different Distances, Same Algorithm and Implementation (K-means)
Same Distance, Two Different Implementations of the Same Algorithm: not all K-means are equal
Performance of Different Algorithms: Precision