Clustering Preliminaries
• Log2 transformation
• Row centering and normalization
• Filtering
Log2 Transformation
Advantages of log2 transformation:
• The noise becomes independent of the mean, and similar differences have the same meaning along the whole dynamic range of the values.
• We would like dist(100, 200) = dist(1000, 2000).
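A minimal sketch, assuming the data sit in a NumPy array of raw expression values: after the log2 transform, a two-fold change corresponds to the same difference (1.0) everywhere in the dynamic range.

```python
import numpy as np

raw = np.array([100.0, 200.0, 1000.0, 2000.0])
logged = np.log2(raw)

# Both two-fold changes now have the same distance:
print(logged[1] - logged[0])  # 1.0 -> dist(100, 200)
print(logged[3] - logged[2])  # 1.0 -> dist(1000, 2000)
```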
Row Centering & Normalization
• Centering: y = x - mean(x)
• Normalization: z = y / stdev(y)
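A minimal sketch of both steps with NumPy; `expr` is a hypothetical genes-by-samples matrix, and the sample standard deviation (ddof=1) is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = np.log2(rng.uniform(50, 5000, size=(4, 6)))  # toy genes x samples

y = expr - expr.mean(axis=1, keepdims=True)   # y = x - mean(x), per row
z = y / y.std(axis=1, ddof=1, keepdims=True)  # z = y / stdev(y), per row

print(z.mean(axis=1))          # ~0 for every row
print(z.std(axis=1, ddof=1))   # 1 for every row
```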
Filtering Genes
• Filtering is very important for unsupervised analysis, since many noisy genes may totally mask the structure in the data.
• After finding a hypothesis, one can identify marker genes in a larger dataset via supervised analysis.
[Diagram: all genes are filtered before clustering, while supervised analysis / marker selection operates on the full gene set.]
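A minimal sketch of a variation filter, with hypothetical fold-change and absolute-difference cutoffs (the exact thresholds are a per-dataset choice, not values from the source).

```python
import numpy as np

def variation_filter(expr, min_fold=3.0, min_diff=100.0):
    """Keep rows (genes) of a raw genes-x-samples matrix whose
    max/min fold change and max-min difference exceed the cutoffs."""
    hi = expr.max(axis=1)
    lo = expr.min(axis=1)
    keep = (hi / lo >= min_fold) & (hi - lo >= min_diff)
    return expr[keep], keep
```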
Clustering / Class Discovery
• Aim: Partition data (e.g. genes or samples) into sub-groups (clusters), such that points of the same cluster are "more similar".
• Challenge: The task is not well defined; there is no single objective function / evaluation criterion.
• Example: How many clusters? 2 + noise, 3 + noise, 20, or, hierarchically, 2-3 + noise.
• One has to choose:
- a similarity/distance measure
- a clustering method
- how to evaluate the clusters
Clustering in GenePattern
• Representative-based: find representatives/centroids
- K-means: KMeansClustering
- Self-Organizing Maps (SOM): SOMClustering
• Bottom-up (agglomerative): HierarchicalClustering, which hierarchically unites clusters
- single linkage
- complete linkage
- average linkage
• Clustering-like: NMFConsensus, PCA (Principal Components Analysis)
There is no single BEST method! For easy problems most of them work; each algorithm has its own assumptions, strengths, and weaknesses.
K-means Clustering
Aim: Partition the data points into K subsets and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.
K-means: Algorithm (illustrated with K = 3)
1. Initialize centroids at random positions.
2. Iterate:
• Assign each data point to its closest centroid.
• Move each centroid to the center of its assigned points.
3. Stop when converged.
The algorithm is guaranteed to reach a local minimum.
[Figure: centroid positions and point assignments at iterations 0, 1, and 2.]
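A minimal sketch of this loop (Lloyd's algorithm) in NumPy. The random initialization is one simple choice; because the result depends on it, real runs typically use several restarts. Empty clusters are not handled here.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the center of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```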
K-means: Summary
• The result depends on the initial centroid positions.
• Fast: each iteration only computes distances from data points to centroids.
• The number of clusters must be preset.
• Fails for non-spherical distributions.
Hierarchical Clustering
[Figure: a dendrogram over five points; the height of each junction is the distance between the joined clusters.]
Hierarchical Clustering
• Need to define the distance between the new cluster and the other clusters:
- Single linkage: distance between the closest pair.
- Complete linkage: distance between the farthest pair.
- Average linkage: average distance between all pairs, or the distance between cluster centers.
• The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).
[Figure: dendrogram; the vertical axis is the distance between joined clusters.]
A SciPy sketch of these linkage rules follows.
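A minimal sketch with SciPy's agglomerative clustering; the `method` argument (`'single'`, `'complete'`, `'average'`) selects the linkage rule described above, and the toy matrix stands in for real expression data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
data = rng.random((20, 5))                  # toy genes x samples matrix

Z = linkage(data, method='average', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
dendrogram(Z)                                    # plotting needs matplotlib
```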
Average Linkage
[Figure: average-linkage clustering of leukemia samples and genes.]
Single and Complete Linkage
[Figure: single-linkage and complete-linkage clusterings of the leukemia samples and genes.]
Similarity/Distance Measures
Decide which samples/genes should be clustered together:
• Euclidean: the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula.
• Pearson correlation: a parametric measure of the strength of linear dependence between two variables.
• Absolute Pearson correlation: the absolute value of the Pearson correlation.
• Spearman rank correlation: a non-parametric measure of the strength of monotonic dependence between two variables.
• Uncentered correlation: same as Pearson, but assumes the mean is 0.
• Absolute uncentered correlation: the absolute value of the uncentered correlation.
• Kendall's tau: a non-parametric measure of the degree of correspondence between two rankings.
• City-block/Manhattan: the distance traveled between two points when a grid-like path is followed.
These measures are sketched in code below.
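A minimal sketch computing most of these measures for two profiles with SciPy; the uncentered correlation has no dedicated SciPy one-liner, so it is written out directly.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

print(euclidean(x, y))      # Euclidean ("ruler") distance
print(cityblock(x, y))      # city-block / Manhattan distance
print(pearsonr(x, y)[0])    # Pearson correlation
print(spearmanr(x, y)[0])   # Spearman rank correlation
print(kendalltau(x, y)[0])  # Kendall's tau

# Uncentered correlation: Pearson with the means assumed to be 0.
print(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```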
A Reasonable Distance Measure
Euclidean distance between samples and between genes, computed on row-centered and normalized data:
• Genes: close means correlated.
• Samples: genes with similar profiles (such as Gene 1 and Gene 2 in the figure) make a similar contribution to the distance between two samples (such as Sample 1 and Sample 5).
[Figure: expression profiles of Genes 1-4 across Samples 1-5.]
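The reason this works for genes: on z-scored rows, squared Euclidean distance is a monotone function of Pearson correlation, d^2 = 2n(1 - r) for length-n vectors. A minimal sketch verifying this identity:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(10), rng.random(10)

def zscore(v):
    return (v - v.mean()) / v.std()   # population std (ddof=0)

zx, zy = zscore(x), zscore(y)
r = np.corrcoef(x, y)[0, 1]           # Pearson correlation
d2 = np.sum((zx - zy) ** 2)           # squared Euclidean distance

print(np.isclose(d2, 2 * len(x) * (1 - r)))  # True
```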
Pitfalls in Clustering
• Elongated clusters
• Filaments
• Clusters of different sizes
Compact, Separated Clusters
All methods work.
(Adapted from E. Domany)
Elongated Clusters
• Single linkage succeeds in partitioning the data.
• Average linkage fails.
Single Linkage Is Not Robust: Filament
(Adapted from E. Domany)
Single Linkage Is Not Robust: Filament with a Point Removed
(Adapted from E. Domany)
Two-way Clustering
Two independent cluster analyses, one on the genes and one on the samples, are used to reorder the data (two-way clustering).
Hierarchical Clustering: Summary
• Results depend on the distance update method:
- Single linkage: elongated clusters.
- Complete linkage: sphere-like clusters.
• Greedy iterative process.
• NOT robust against noise.
• No inherent measure for choosing the clusters; we return to this point in cluster validation.
Validating the Number of Clusters
How do we know how many real clusters exist in the dataset?
Consensus Clustering
• Generate "perturbed" datasets D1, D2, ..., Dn from the original dataset.
• Apply the clustering algorithm to each Di, giving Clustering1, Clustering2, ..., Clusteringn.
• Compute the consensus matrix, which counts the proportion of times two samples are clustered together:
- 1 (red): the two samples always cluster together.
- 0 (white): the two samples never cluster together.
• Build a dendrogram from the consensus matrix and reorder the matrix according to it; clean red diagonal blocks (e.g. C1, C2, C3) indicate robust clusters.
Validation by Consistency / Robustness Analysis
Aim: Measure the agreement between clustering results on "perturbed" versions of the data.
Method: Iterate N times:
1. Generate a "perturbed" version of the original dataset by subsampling, resampling with repeats, or adding noise.
2. Cluster the perturbed dataset.
Then calculate the fraction of iterations in which each pair of samples belongs to the same cluster, and optimize the number of clusters K by choosing the value that yields the most consistent results.
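A minimal sketch of this procedure using subsampling and the SciPy hierarchical clustering shown earlier; the helper name and the 80% subsample fraction are illustrative choices, not values from the source.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_matrix(data, k, n_iter=100, frac=0.8, seed=0):
    """Entry (i, j): fraction of runs in which samples i and j
    fall in the same cluster, among runs containing both."""
    rng = np.random.default_rng(seed)
    n = len(data)
    together = np.zeros((n, n))
    counted = np.zeros((n, n))
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Z = linkage(data[idx], method='average')
        labels = fcluster(Z, t=k, criterion='maxclust')
        same = labels[:, None] == labels[None, :]
        counted[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    return together / np.maximum(counted, 1)

# Choose the K whose consensus matrix looks most like clean 0/1 blocks.
```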
Clustering Cookbook
1. Reduce the number of genes by variation filtering; use stricter parameters than for comparative marker selection.
2. Choose a method for cluster discovery (e.g. hierarchical clustering).
3. Select a number of clusters.
4. Check the sensitivity of the clusters to the filtering and clustering parameters.
5. Validate on independent data sets.
6. Internally test the robustness of the clusters with consensus clustering.