More About Clustering
Naomi Altman, Nov '06
Assessing Clusters Some things we might like to do: • Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles. • Decide how many clusters "are" in the data. • Display the clusters graphically. • Determine cluster stability. • Determine stability of genes in a cluster.
Assessing Clusters • Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles. • Decide how many clusters "are" in the data. Distance similarity: genes in the same cluster should be closer together than genes in different clusters. Expression similarity: genes in the same cluster should have expression profiles that are more similar (on features of interest) than genes in different clusters. (And if this is not true, maybe we need a different similarity measure.) • Display the clusters graphically. • Determine cluster stability. • Determine stability of genes in a cluster.
Distance similarity: Genes in the same cluster should be closer together than genes in different clusters. We need to be careful about how we define "closer".
Some criteria
Minimize the total within-cluster distances (raised to a power p). K-means clustering finds a local minimum of this criterion for p=2; we might also consider p=1. Clearly, the larger K is, the smaller this criterion, so to choose K we could consider a scree plot (as sketched below).
The silhouette is a measure of how much closer points in a cluster are to one another than to points outside the cluster. Observations with large positive silhouette (~1) are well clustered, those with silhouette around 0 lie between clusters, and those with negative silhouette are placed in the "wrong" cluster.
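A minimal scree-plot sketch in R, assuming the M.yeast matrix used on the next slide: plot the total within-cluster sum of squares against K and look for an "elbow".

# scree plot: total within-cluster SS for K = 1, ..., 10
wss = numeric(10)
for (k in 1:10) wss[k] = kmeans(M.yeast, centers=k, nstart=10)$tot.withinss
plot(1:10, wss, type="b", xlab="Number of clusters K",
     ylab="Total within-cluster SS", main="Scree Plot")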
# M.yeast is the data with 294 genes and 24 times (arrays)
# (the numeric columns 2:25 of the yeast.M data frame)
library(cluster)
kmeans6.out = kmeans(M.yeast, 6)
sil6 = silhouette(kmeans6.out$cluster, dist(M.yeast, method="euclidean"))
plot(sil6)
summary(sil6)
Some diagnostics for the Cell Cycle Data
Silhouette of 294 units in 3 clusters.
Cluster sizes and average silhouette widths:
        114        145         35
  0.1754309  0.1837610  0.1968444
Individual silhouette widths:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -0.01413  0.10110  0.17530  0.18210  0.26640  0.38640
Using Height
In hierarchical clustering, we can use the cluster height to help determine where to define clusters.
complete.out = hclust(dist(M.yeast), method="complete")
plot(1:293, complete.out$height, ylab="Distance at Join", main="Complete Linkage")
Using Height in Hierarchical Clustering
Gaps might indicate heights at which to cut the tree to form clusters (a sketch follows).
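One way to locate the gaps, a sketch using the complete.out fit from the previous slide: look at the jumps between consecutive join heights and cut in the middle of the largest gap.

h = complete.out$height            # join heights, in increasing order
gaps = diff(h)                     # jump between consecutive joins
cut.height = (h[which.max(gaps)] + h[which.max(gaps)+1]) / 2
clust.gap = cutree(complete.out, h=cut.height)
table(clust.gap)                   # resulting cluster sizes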
Assessing Clusters • Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles. • Decide how many clusters "are" in the data. Display the clusters graphically. How do the clusters relate to the treatments? What are the "typical" expression profiles for genes in the clusters? • Determine cluster stability. • Determine stability of genes in a cluster.
Plotting Cluster Centers
# All of the clustering routines provide a vector in the same order
# as the genes, with the cluster number.
# For hierarchical methods, you need to cut the tree to form the
# clusters. You can do it by height or by the desired number of clusters.
clust6 = cutree(complete.out, k=6)
par(mfrow=c(2,3))
for (i in 1:6) {
  M = M.yeast[clust6==i, ]
  meansi = apply(M, 2, mean)   # mean expression profile of cluster i
  plot(time, meansi, xlab="Time", ylab="Mean Expression",
       main=paste("Cluster", i), type="l")   # "time" is the vector of 24 time points
}
Cluster sizes, Complete Linkage vs. Single Linkage:

cluster            1    2    3    4    5    6
Complete (n)      75   17  100   62   37    3
Single   (n)     287    1    1    3    1    1
Cluster sizes, Complete Linkage vs. K-means:

cluster            1    2    3    4    5    6
Complete (n)      75   17  100   62   37    3
K-means  (n)      35   31   61   45   91   31
The genes do not look much like the cluster means - wrong metric! We are interested in the cyclic pattern - we should use correlation distance.
Correlation Distance by Z-scores
zscore = function(v) (v - mean(v)) / sqrt(var(v))
yeast.Z = apply(yeast.M[,2:25], 1, zscore)
# apply saves the results in the columns - we want genes in rows
yeast.Z = t(yeast.Z)
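Why z-scores give correlation distance: for z-scored rows, ||z_x - z_y||^2 = 2(n-1)(1 - r), so Euclidean distance on yeast.Z orders gene pairs exactly as 1 - r does. A sketch of computing the correlation distance directly:

# gene-gene correlation distance; cor() works on columns, so transpose
cor.dist = as.dist(1 - cor(t(yeast.Z)))
cor.out = hclust(cor.dist, method="complete")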
[Figure: heatmaps of the raw data (left) and the z-scores (right)]
Assessing Clusters • Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles. • Decide how many clusters "are" in the data. • Display the clusters graphically. Determine cluster stability. We already know that different clustering methods and metrics cluster differently. How stable are the clusters if we: add noise; leave out a treatment, an array, or a gene? Removing noise stabilizes real clusters. • Determine stability of genes in a cluster.
The Effect of Noise on Clustering Each data value could be anywhere inside the circle. The red gene might be clustered with the blue or green genes. The correlation between the black and red genes is 1.0. Correlation distance will blow up patterns that are just due to noise.
Assessing "Noise" When we have replicates for the treatments, we have several ways to assess noise. 1) Bootstrap arrays - i.e. resample with replacement within a treatment. 2) Subsample arrays - i.e. take subsets of arrays within a treatment 3) Take residuals from a fitted model a) Bootstrap the residuals b) Use the residuals to estimate an error distribution and resample from the distribution
Resampling from Arrays
We have treatments 1, ..., T and for each treatment t there are n_t arrays.
Bootstrap: sample n_t arrays with replacement.
Subsample: sample k_t arrays without replacement for each treatment.
Recluster. Look at changes in clusters or cluster summaries for each reclustering. (Note that cluster labels will change.) A sketch follows.
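A sketch of the array bootstrap, assuming a hypothetical factor "treatment" giving the treatment of each column of M.yeast:

# resample arrays with replacement, separately within each treatment
boot.cols = unlist(lapply(levels(treatment), function(t) {
  cols = which(treatment == t)
  cols[sample.int(length(cols), replace=TRUE)]
}))
M.boot = M.yeast[, boot.cols]
kmeans.boot = kmeans(M.boot, 6)   # recluster and compare to the original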
Using Residuals
e.g. We have treatments 1, ..., T and for each treatment t there are n_t arrays. For each gene i and treatment t there is an expression mean m_it. For gene i, treatment t, array a, there is an expression value y_ita. The estimated noise is r_ita = y_ita - m_it.
Obtain bootstrap samples by sampling with replacement from the r_ita (for each gene, but ignoring t and a), giving residuals r*_ita. The bootstrap samples are y*_ita = m_it + r*_ita. (If you are worried about correlation among the genes on the same array, you can sample the entire vector of errors.)
Alternatively, use the histogram of the r_ita to estimate a distribution for each gene (e.g. N(0, s_i^2)) and sample from the distribution (but it is then harder to deal with correlation among genes).
Recluster. Look at changes in clusters or cluster summaries for each reclustering. (Note that cluster labels will change.) A sketch of the residual bootstrap follows.
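A sketch of the residual bootstrap, with "means.mat" a hypothetical genes-by-arrays matrix holding each gene's treatment mean m_it repeated across that treatment's arrays:

resid.mat = M.yeast - means.mat                # r_ita = y_ita - m_it
r.star = t(apply(resid.mat, 1, function(r)     # resample residuals
  sample(r, length(r), replace=TRUE)))         # within each gene
M.star = means.mat + r.star                    # y*_ita = m_it + r*_ita
kmeans.star = kmeans(M.star, 6)                # recluster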
Removing genes breaks up clusters.
Removing arrays changes similarity.
Other Perturbations Recluster: Random subsamples of genes Random subsamples of treatments
Using the Simulated Samples If the clusters are well-separated, we expect the changes due to the perturbations to be small. If the clusters are just "zip-coding" i.e. dividing a big cluster into nearby but not separated regions, we expect the changes due to the perturbations to be large. If the clusters are due to noise, we expect changes due to the perturbations to be large.
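To quantify the changes, cross-tabulate the original and perturbed cluster labels; a sketch (labels are arbitrary, so never compare them directly):

# rows and columns of the table show how genes move between clusterings
table(original = kmeans6.out$cluster, perturbed = kmeans.star$cluster)
# a label-free agreement index, e.g. the adjusted Rand index from the
# mclust package, summarizes each comparison in a single number:
# library(mclust); adjustedRandIndex(kmeans6.out$cluster, kmeans.star$cluster)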
The original data and a slightly perturbed version lead to perturbed dendrograms and cluster centers
Reducing Noise to Find Structure
As for differential expression analysis, we hope to find replicable structure by clustering. Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.
1. Replicate, replicate, replicate
We can cluster arrays just like we cluster genes. If an array does not cluster with its biological replicates, then either there is no differential expression or it is a bad array.
For each gene, the mean over the replicates is less noisy by a factor of √(#reps): 2 reps give ~30% less noise, 4 reps give 50% less.
Instead of clustering the profile over arrays, cluster over the treatment means (a sketch follows).
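A sketch of clustering on treatment means, again assuming the hypothetical "treatment" factor over the columns of M.yeast:

# genes x treatments matrix of replicate means, then cluster as before
M.means = sapply(levels(treatment), function(t)
  rowMeans(M.yeast[, treatment == t, drop=FALSE]))
mean.out = hclust(dist(M.means), method="complete")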
Reducing Noise to Find Structure
As for differential expression analysis, we hope to find replicable structure by clustering. Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.
2. Filter genes
Remove non-differentially expressed genes from the data matrix before clustering - especially when using scale-free distances. Best: if possible, do a differential expression analysis and use only the significant genes. (Recall the noise figure: the correlation between the black and red genes is 1.0, so correlation distance will blow up patterns that are just due to noise.) A simple filter is sketched below.
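When no differential expression analysis is available, a crude variance filter is one option; a sketch (the 75% cutoff is arbitrary):

# keep only the most variable genes before clustering
gene.var = apply(yeast.M[,2:25], 1, var)
M.filtered = yeast.M[gene.var > quantile(gene.var, 0.75), 2:25]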
Reducing Noise to Find Structure
As for differential expression analysis, we hope to find replicable structure by clustering. Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.
3. Reduce dimension
The initial dimension for clustering genes is the number of arrays. Using treatment means reduces this to the number of treatments. If there are no replicates, dimension reduction can be used, such as regressing on the most significant eigengenes or on a pattern of interest such as sine and cosine waves. The regression coefficients are then used as the input to the clustering routine.
library(limma)                              # for lmFit
svd.z = svd(yeast.Z)                        # SVD of the z-scored data
design.reg = model.matrix(~svd.z$v[,1:4])   # first 4 eigengenes as regressors
fit.reg = lmFit(yeast.Z, design.reg)        # per-gene regression coefficients
pc.out = hclust(dist(fit.reg$coef))         # cluster genes by their coefficients
Assessing Clusters • Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles. • Decide how many clusters "are" in the data. • Display the clusters graphically. • Determine cluster stability. Determine stability of genes in a cluster. If the clusters stay distinguishable during the simulation, count the number of times each gene is in each cluster. If the clusters are less stable, count the number of times the pair gene i and gene j are in the same cluster (a sketch follows).
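A sketch of the pairwise counts, with perturb() a placeholder for any of the resampling schemes above:

# count how often each pair of genes co-clusters over B reclusterings
n = nrow(M.yeast); B = 100
same = matrix(0, n, n)
for (b in 1:B) {
  cl = kmeans(perturb(M.yeast), 6)$cluster
  same = same + outer(cl, cl, "==")
}
# 1 - proportion co-clustered can then serve as a distance and be reclustered:
stable.out = hclust(as.dist(1 - same/B))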
Summarizing the Perturbed Data
Dendrograms: compute the consensus tree. There are various methods based on the number of times each branch of the dendrogram appears in the perturbed data. (I use phylogenetics software for this.)
Clusters: if clusters are stable, put each gene into the cluster it joins most often. If clusters are less stable, use the pairwise counts as a new distance measure and recluster based on this measure.
Other Partitioning Methods • Partitioning around medoids (PAM): instead of averages, use multidimensional medians as centroids (cluster "prototypes"). Dudoit and Fridlyand (2002). • Self-organizing maps (SOM): add an underlying "topology" (a neighboring structure on a lattice) that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999). • Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002). • Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).
Assessing the Clusters Computationally
The bottom line is that the clustering is "good" if it is biologically meaningful (but this is hard to assess). Computationally we can:
1) Use a goodness-of-cluster measure, such as the within-cluster distances compared to the between-cluster distances (sketched below).
2) Perturb the data and assess cluster changes:
   a) add noise (maybe residuals after ANOVA)
   b) resample (genes, arrays)
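A sketch of criterion 1, comparing average within- and between-cluster distances using clust6 and M.yeast from the earlier slides:

D = as.matrix(dist(M.yeast))
same.cl = outer(clust6, clust6, "==")
mean(D[same.cl & upper.tri(D)])    # average within-cluster distance
mean(D[!same.cl & upper.tri(D)])   # average between-cluster distance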