Microarray Data Analysis: Clustering and Validation Measures Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo, Italy
What we want (typically) Genes, Expression Levels, Gene Expression Matrix • Group functionally related genes together • Basic Axiom of Computational Biology: Guilt by Association. A high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… not always • Clustering
What we want (typically) Clustering Solution
Limitations in the Analysis Process
Limitations: Microarray Technology • MIAME, we have a problem (Robert Shields, Trends in Genetics, 2006) • …no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself • A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak
Limitations: Visualization Tools • One of these two clusters is random noise… Which one?
Limitations: Statistics • Towards sound epistemological foundations of statistical methods for high-dimensional biology (T. Mehta et al., Nature Genetics, 2004) • Many papers in omics research describe the development or application of statistical methods; many of those are questionable
Overview of the Remaining Part • Clustering as a three-step process • Internal Validation Techniques • External Validation Techniques • Experiments • One-stop-shop software systems • Some Issues I Really Had to Talk About
Cluster Analysis as a Three-Step Process
What is clustering? • Group similar objects together • Clustering experiments • Clustering genes
What is Clustering? • Goal: partition the observations {xi} so that • C(i) = C(j) if xi and xj are “similar” • C(i) ≠ C(j) if xi and xj are “dissimilar” • Natural questions: • What is a cluster? • How do I choose a good similarity function? • How do I choose a good algorithm? • APPLICATION and DATA DEPENDENT • How many clusters are REALLY present in the data?
What’s a Cluster? • No rigorous definition • Subjective • Scale/resolution dependent (e.g., hierarchy)
Step One • Choose a good similarity function • Euclidean Distance: captures magnitude and pattern of expression, i.e., direction • Correlation functions: capture the pattern of expression, i.e., direction • Etc.
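The contrast between the two families of functions can be made concrete with a minimal pure-Python sketch (the two gene profiles g1 and g2 are made-up illustrations, not data from the talk):

```python
import math

def euclidean(x, y):
    # Sensitive to both magnitude and pattern (direction) of expression.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    # Sensitive only to the pattern: profiles that rise and fall together
    # score 1 even when their magnitudes differ.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)

# Hypothetical profiles: same expression pattern, different magnitude.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]
```

Here pearson(g1, g2) is exactly 1 (identical direction), while euclidean(g1, g2) is large; which behavior is “right” depends on the application, as the slides stress.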
Step Two • Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize • Compactness: intra-cluster variation small • Favor well-separated or spherical clusters but fail on more complex cluster shapes • K-means, Average Link Hierarchical Clustering • Connectedness: neighboring items should share the same cluster • Robust with respect to cluster shapes, but fails when separation in the data is poor • Single Link Hierarchical Clustering, CAST, CLICK • Spatial Separation: a poor performer by itself, usually coupled with other criteria • Simulated Annealing, Tabu Search
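As a small illustration of the “connectedness” family, here is a toy single-link agglomerative clustering in pure Python (the 1-D points are made up; real implementations such as those cited on the slide are far more efficient):

```python
def single_link(points, k):
    # Agglomerative clustering with single linkage (a "connectedness"
    # criterion): repeatedly merge the two clusters whose closest members
    # are nearest, until k clusters remain.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def link(c1, c2):
        # Single link: distance between the closest pair across clusters.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated groups on a line: single link recovers them.
pts = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
groups = single_link(pts, 2)
```

Because single link only looks at nearest neighbors, it can follow elongated, non-spherical clusters, but it chains clusters together when the groups are poorly separated, exactly the trade-off the slide describes.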
Step Three • An index that tells us how many clusters are really present in the data: Consistency/Uniformity. More likely to be 2 than 3; more likely to be 2 than 36? (It depends: what if each circle represents 1000 objects?)
Step Three • An index that tells us: Separability. Increasing confidence that there are 2 clusters.
Step Three • An index that is • independent of cluster “volume”? • independent of cluster size? • independent of cluster shape? • sensitive to outliers? • etc. • Theoretically sound: Gap Statistics • Data driven and validated: many
Internal Validation Measures How many clusters are really present in the data? Assess cluster quality. Internal: no external knowledge about the dataset is given
The Basic Scheme • Given an index F, a function of a clustering solution • A black box producing clustering solutions with k = 2, …, m clusters • Compute F on each solution to decide which k is best
Internal Validation Measures • Within-Cluster Sum of Squares [Folklore] • Gap Statistics [Tibshirani, Walther, Hastie 2001] • FOM [Yeung, Haynor, Ruzzo 2001] • Consensus Clustering [Monti et al., 2003] • Etc…
Within-Cluster Sum of Squares Wk = Σr=1..k (1/(2nr)) Σ xi,xj ∈ Cr d(xi, xj)², equivalently the summed squared distances of the points of each cluster Cr to its centroid
Within-Cluster Sum of Squares Measure of compactness of clusters
Using Wk to determine the # of clusters Idea of the L-curve method: use the k corresponding to the “elbow”, i.e., the last significant improvement in goodness-of-fit
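The elbow idea can be sketched in pure Python. This is a toy, not the talk's setup: the 1-D dataset is made up, and a small deterministic k-means (centers seeded with the first k points) stands in for the clustering black box:

```python
def wk(clusters):
    # Within-cluster sum of squares: summed squared distance of each
    # point to its cluster centroid -- a measure of compactness.
    total = 0.0
    for c in clusters:
        centroid = [sum(v) / len(c) for v in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid))
                     for p in c)
    return total

def kmeans(points, k, iters=50):
    # Deterministic toy k-means, standing in for the "black box"
    # clustering algorithm of the basic scheme.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        centers = [[sum(v) / len(c) for v in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

# Hypothetical 1-D dataset with two obvious groups.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
curve = [wk(kmeans(pts, k)) for k in (1, 2, 3)]
# Wk drops sharply from k=1 to k=2 and flattens afterwards:
# the "elbow" suggests two clusters.
```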
Example • Yeast Cell Cycle dataset, 698 genes and 72 conditions • Five functional classes: the gold-standard solution • Algorithm: K-means with the Average Link solution as input and Euclidean distance • We want to know how many clusters are predicted by Wk, with K-means as an “oracle”
Example
Problems with the Use of Wk • No reference clustering solution to compare against, i.e., no null model • The values of Wk are not normalized and therefore cannot be compared directly • In a nutshell: we get values of Wk but we do not quite know how far we are from randomness • The Gap Statistics takes care of those problems
The Gap Statistics • Based on solid statistical work for the 1-D case, i.e., where the objects to be clustered are scalars; takes care of the problems outlined for Wk • Extended to work in higher dimensions, with no supporting theory • Validated experimentally
Sample Uniformly and at Random • Align with the feature axes (data-geometry independent) [figure: observations, their bounding box aligned with the feature axes, and Monte Carlo samples drawn inside it]
Computation of the Gap Statistic for b = 1 to B compute a Monte Carlo sample X1b, X2b, …, Xnb (n is # obs.) for k = 1 to K cluster the observations into k groups and compute log Wk for b = 1 to B cluster the b-th M.C. sample into k groups and compute log Wkb compute Gap(k) = (1/B) Σb log Wkb − log Wk compute sd(k), the s.d. of {log Wkb}b=1,…,B set the total s.e. s(k) = sd(k)·√(1 + 1/B) Find the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1)
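The procedure can be sketched in pure Python. Again a toy: the 1-D data are made up, a small deterministic k-means plays the clustering black box, and the reference distribution is uniform in the bounding box of the data, as on the previous slide:

```python
import math
import random

def wk(clusters):
    # Within-cluster sum of squares (compactness).
    total = 0.0
    for c in clusters:
        mu = [sum(v) / len(c) for v in zip(*c)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in c)
    return total

def kmeans(points, k, iters=50):
    # Deterministic toy k-means seeded with the first k points.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        centers = [[sum(v) / len(c) for v in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def gap_statistic(points, max_k=3, B=20, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[i] for p in points) for i in range(dim)]
    hi = [max(p[i] for p in points) for i in range(dim)]
    log_w = [math.log(wk(kmeans(points, k))) for k in range(1, max_k + 1)]
    gaps, s = [], []
    for k in range(1, max_k + 1):
        # B Monte Carlo samples, uniform in the bounding box of the data.
        ref = []
        for _ in range(B):
            sample = [tuple(rng.uniform(lo[i], hi[i]) for i in range(dim))
                      for _ in points]
            ref.append(math.log(wk(kmeans(sample, k))))
        mean = sum(ref) / B
        sd = math.sqrt(sum((r - mean) ** 2 for r in ref) / B)
        gaps.append(mean - log_w[k - 1])       # Gap(k)
        s.append(sd * math.sqrt(1 + 1 / B))    # total s.e.
    for k in range(1, max_k):                  # smallest k with
        if gaps[k - 1] >= gaps[k] - s[k]:      # Gap(k) >= Gap(k+1) - s(k+1)
            return k, gaps
    return max_k, gaps

pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
k_hat, gaps = gap_statistic(pts)
```

On these two well-separated groups, the observed log Wk falls far below the reference values at k = 2, so the rule selects two clusters.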
Example • The same experimental setting as for the Within-Cluster Sum of Squares • We want to know whether the Gap Statistics predicts 5 clusters, with K-means as an “oracle”
Example
Figure of Merit • A purely experimental approach, designed and validated specifically for microarray data
FOM [figure: an n × m expression matrix R(g, e), genes g = 1…n by experiments e = 1…m, with the rows grouped into clusters C1, …, Ci, …, Ck]
FOM • Leave out one experiment (condition) e, cluster the genes on the remaining m − 1 experiments, and measure how well the clusters predict the left-out values: FOM(e, k) = √( (1/n) Σi=1..k Σg∈Ci (R(g, e) − μCi(e))² ), where μCi(e) is the mean expression of cluster Ci in condition e • The aggregate FOM(k) sums FOM(e, k) over all left-out experiments e; lower is better
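The leave-one-condition-out scheme of the Figure of Merit can be sketched in pure Python (a toy, assuming a small deterministic k-means as the clustering algorithm under evaluation; the 6-gene expression matrix is made up):

```python
import math

def kmeans_idx(points, k, iters=50):
    # Deterministic toy k-means over row indices, seeded with the first
    # k rows; stands in for the clustering algorithm under evaluation.
    centers = [list(points[i]) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for g, p in enumerate(points):
            i = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(g)
        centers = [[sum(points[g][d] for g in c) / len(c)
                    for d in range(len(points[0]))] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def fom(matrix, k):
    # 2-norm figure of merit: leave out each condition e in turn, cluster
    # the genes on the remaining conditions, then measure how tightly the
    # clusters agree on the left-out column.
    n, m = len(matrix), len(matrix[0])
    total = 0.0
    for e in range(m):
        reduced = [tuple(row[:e] + row[e + 1:]) for row in matrix]
        sq = 0.0
        for idxs in kmeans_idx(reduced, k):
            mu = sum(matrix[g][e] for g in idxs) / len(idxs)
            sq += sum((matrix[g][e] - mu) ** 2 for g in idxs)
        total += math.sqrt(sq / n)
    return total

# Hypothetical 6-gene, 4-condition matrix with two obvious gene groups.
genes = [[0.0] * 4, [0.5] * 4, [1.0] * 4,
         [10.0] * 4, [10.5] * 4, [11.0] * 4]
fom1, fom2 = fom(genes, 1), fom(genes, 2)
```

With two genuine groups in the data, the two-cluster solution predicts the left-out condition far better, so fom2 is much smaller than fom1.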
Example • The same experimental setting as for the Within-Cluster Sum of Squares • We want to know whether FOM indicates 5 clusters in the dataset, with K-means as an “oracle” • Hint: look for the elbow in the FOM plot, exactly as for the Wk curve
Example
External Validation Measures Given two partitions of the same dataset, how close are they? Assess the quality of a partition against a given gold standard. External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of biology, the elements in a cluster must be biologically correlated, i.e., belong to the same functional group of genes
Some External Validation Measures • The two partitions must have the same number of classes • Jaccard Index • Minkowski score • Rand Index [Rand 71] • The two partitions can have a different number of classes • The Adjusted Rand Index [Hubert and Arabie 85] • The F measure [van Rijsbergen 79]
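As a representative of the first family, the Jaccard index over pairs of objects can be sketched in a few lines of pure Python (the label vectors below are made-up illustrations):

```python
from itertools import combinations

def jaccard(labels_a, labels_b):
    # Jaccard index over pairs of objects: of all pairs co-clustered in
    # at least one partition, the fraction co-clustered in both.
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    return n11 / (n11 + n10 + n01)
```

Pair counting makes the index independent of how the clusters are labeled: [0, 0, 1, 1] and [1, 1, 0, 0] describe the same partition and score 1.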
Some External Validation Measures • A problem with the mentioned indices: what is their expected value? • In very intuitive terms: if one picks two partitions blindly, among the possible partitions of the data, what value of the index should we expect? The same problem we had with the Gap Statistics.
The Adjusted Rand Index • It takes as input two partitions, not necessarily having the same number of classes • Value 1, its maximum, means perfect agreement • The expected value of the index, i.e., its value on two randomly correlated partitions, is zero • Note 1: the index may take negative values • Note 2: the same property is not shared by the other mentioned indices, including its relative, the Rand Index • The index must be maximized • We will see some of its uses later
Adjusted Rand Index • Compare clusters to classes • Consider the # of pairs of objects
Example (Adjusted Rand) Closed form in the paper by Handl et al. (supplementary material)
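The pair-counting closed form of Hubert and Arabie can be sketched in pure Python (label vectors are made-up illustrations; the adjustment term subtracts the index's expected value under chance):

```python
from math import comb
from collections import Counter

def adjusted_rand(labels_a, labels_b):
    # Adjusted Rand Index from the pair-counting contingency table.
    # Perfect agreement scores 1; random agreement scores ~0 (and the
    # index can go negative, unlike the plain Rand Index).
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_cells - expected) / (max_index - expected)
```

Because only pair co-memberships are counted, the two partitions may have different numbers of classes, as the previous slide notes; relabeled but identical partitions score exactly 1.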
Some Experiments, or on the Need for Benchmark Datasets
How Do I Pick: • Distance and Similarity Functions, given algorithm and data set • Algorithm, given data set • Internal Validation Measures, given data set
Different Distances, Same Algorithm and Implementation (K-means)
Same Distance, Two Different Implementations of the Same Algorithm: not all K-means are equal
Performance of Different Algorithms: Precision