Clustering methods used in microarray data analysis Steve Horvath Human Genetics and Biostatistics UCLA Acknowledgement: based in part on lecture notes from Darlene Goldstein web site: http://ludwig-sun2.unil.ch/~darlene/
Contents • Background on clustering • k-means clustering • hierarchical clustering
References for clustering • Gentleman, Carey, et al. (Bioinformatics and Computational Biology Solutions Using R) Chapters 11, 12, 13 • T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series in Statistics • L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability
Clustering • Historically, objects are clustered into groups • periodic table of the elements (chemistry) • taxonomy (zoology, botany) • Why cluster? • Understand the global structure of the data: see the forest instead of the trees • detect heterogeneity in the data, e.g. different tumor classes • Find biological pathways (cluster gene expression profiles) • Find data outliers (cluster microarray samples)
Classification, Clustering and Prediction WARNING • Many people talk about “classification” when they mean clustering (unsupervised learning) • Other people talk about classification when they mean prediction (supervised learning) • Usually, the meaning is context specific. I prefer to avoid the term classification and to talk about clustering or prediction or another more specific term. • Common denominator: classification divides objects into groups based on a set of values • Unlike a theory, a clustering is neither true nor false, and should be judged largely on the usefulness of its results. • CLUSTERING IS AND ALWAYS WILL BE SOMEWHAT OF AN ART FORM • However, a classification (clustering) may be useful for suggesting a theory, which could then be tested
Cluster analysis • Addresses the problem: Given n objects, each described by p variables (or features), derive a useful division into a number of classes • Usually want a partition of objects • But also ‘fuzzy clustering’ • Could also take an exploratory perspective • ‘Unsupervised learning’
Wordy Definition Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.
Clustering Gene Expression Data • Can cluster genes (rows), e.g. to (attempt to) identify groups of co-regulated genes • Can cluster samples (columns), e.g. to identify tumors based on profiles • Can cluster both rows and columns at the same time (to my knowledge, not in R)
Clustering Gene Expression Data • Leads to readily interpretable figures • Can be helpful for identifying patterns in time or space • Useful (essential?) when seeking new subclasses of samples • Can be used for exploratory purposes
Similarity = Proximity • Similarity s_ij indicates the strength of the relationship between two objects i and j • Usually 0 ≤ s_ij ≤ 1 • Ex 1: absolute value of the Pearson correlation coefficient • Use of correlation-based similarity is quite common in gene expression studies but is in general contentious... • Ex 2: co-expression network methods: topological overlap matrix • Ex 3: random forest similarity
Proximity matrices are the input to most clustering algorithms Proximity between pairs of objects: similarity or dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j)
Dissimilarity and Distance • Associated with a similarity measure s_ij bounded by 0 and 1 is a dissimilarity d_ij = 1 - s_ij • Distance measures have the metric property (d_ij + d_ik ≥ d_jk) • Many examples: Euclidean (‘as the crow flies’), Manhattan (‘city block’), etc. • The distance measure has a large effect on performance • The behavior of a distance measure is related to the scale of measurement
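As a concrete illustration, here is a minimal R sketch of both kinds of dissimilarity on a small simulated expression matrix (all object names are illustrative):

set.seed(1)
x <- matrix(rnorm(10 * 6), nrow = 10)     # 10 genes (rows) x 6 samples (columns)
d.euc <- dist(x, method = "euclidean")    # 'as the crow flies'
d.man <- dist(x, method = "manhattan")    # 'city block'
s <- abs(cor(t(x)))                       # correlation-based similarity between genes
d.cor <- as.dist(1 - s)                   # dissimilarity d_ij = 1 - s_ij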
Partitioning Methods • Partition the objects into a prespecified number of groups K • Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within cluster sums of squares) • Examples: k-means, self-organizing maps (SOM), partitioning around medoids (PAM), model-based clustering
K-means clustering • Prespecify the number of clusters K and the cluster ‘centers’ • Minimize the within-cluster sum of squares from the centers • Iterate (until cluster assignments do not change): • For a given cluster assignment, find the cluster means • For a given set of means, minimize the within-cluster sum of squares by allocating each object to the closest cluster mean • Intended for situations where all variables are quantitative, with (squared) Euclidean distance (so scale variables suitably before use)
PAM clustering • Also need to prespecify number of clusters K • Unlike K-means, the cluster centers (‘medoids’) are objects, not averages of objects • Can use general dissimilarity • Minimize (unsquared) distances from objects to cluster centers, so more robust than K-means
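A brief sketch using the pam function from the cluster package; the data are simulated and the choice K = 2 is arbitrary:

library(cluster)
set.seed(1)
x <- matrix(rnorm(20 * 4), nrow = 20)
pam.fit <- pam(x, k = 2)       # PAM on the raw data (Euclidean dissimilarities)
pam.fit$medoids                # cluster centers are actual observations (medoids)
pam.fit$clustering             # cluster assignment of each observation
# pam also accepts a general dissimilarity, e.g. a correlation-based one
pam.cor <- pam(as.dist(1 - abs(cor(t(x)))), k = 2, diss = TRUE)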
Combinatorial clustering algorithms. Example: K-means clustering
Clustering algorithms • Goal: partition the observations into groups ("clusters") so that the pairwise dissimilarities between those assigned to the same cluster tend to be smaller than those in different clusters. • 3 types of clustering algorithms: mixture modeling, mode seekers (e.g. PRIM algorithm), and combinatorial algorithms. • We focus on the most popular combinatorial algorithms.
Combinatorial clustering algorithms Most popular clustering algorithms directly assign each observation to a group or cluster without regard to a probability model describing the data. Notation: label observations by an integer i in {1,...,N} and clusters by an integer k in {1,...,K}. The cluster assignments can be characterized by a many-to-one mapping C(i) that assigns the i-th observation to the k-th cluster: C(i)=k (aka encoder). One seeks a particular encoder C*(i) that minimizes a particular *loss* function (aka energy function).
Loss functions for judging clusterings One seeks a particular encoder C*(i) that minimizes a particular *loss* function (aka energy function). Example: the within-cluster point scatter
W(C) = (1/2) Σ_{k=1}^{K} Σ_{C(i)=k} Σ_{C(i')=k} d(x_i, x_i')
Cluster analysis by combinatorial optimization Straightforward in principle: Simply minimize W(C) over all possible assignments of the N data points to K clusters. Unfortunately such optimization by complete enumeration is feasible only for small data sets. For this reason practical clustering algorithms are able to examine only a fraction of all possible encoders C. The goal is to identify a small subset that is likely to contain the optimal one or at least a good sub-optimal partition. Feasible strategies are based on iterative greedy descent.
K-means clustering is a very popular iterative descent clustering method. Setting: all variables are of the quantitative type and one uses the squared Euclidean distance d(x_i, x_i') = ||x_i - x_i'||^2. In this case
W(C) = (1/2) Σ_{k=1}^{K} Σ_{C(i)=k} Σ_{C(i')=k} ||x_i - x_i'||^2
Note that this can be re-expressed as
W(C) = Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i - m_k||^2
where m_k is the mean vector of cluster k and N_k is the number of observations assigned to cluster k.
Thus one can obtain the optimal C* by solving the enlarged optimization problem
min over C and {m_1,...,m_K} of Σ_{k=1}^{K} N_k Σ_{C(i)=k} ||x_i - m_k||^2
This can be minimized by the alternating optimization procedure given on the next slide…
K-means clustering algorithm leads to a local minimum 1. For a given cluster assignment C, the total cluster variance is minimized with respect to {m_1,...,m_K}, yielding the means of the currently assigned clusters, i.e. find the cluster means. 2. Given the current set of means, the total cluster variance is minimized by assigning each observation to the closest (current) cluster mean, that is C(i) = argmin_k ||x_i - m_k||^2. 3. Steps 1 and 2 are iterated until the assignments do not change.
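These steps translate directly into a bare-bones R sketch (illustrative only; empty clusters are not handled, and in practice one would simply call kmeans()):

naive.kmeans <- function(x, K, max.iter = 100) {
  means <- x[sample(nrow(x), K), , drop = FALSE]   # start from K random observations as means
  assign <- rep(0, nrow(x))
  for (iter in 1:max.iter) {
    # step 2: assign each observation to the closest current mean
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, means[k, ], "-")^2))
    new.assign <- apply(d2, 1, which.min)
    if (all(new.assign == assign)) break           # step 3: stop when assignments are stable
    assign <- new.assign
    # step 1: recompute each cluster mean from its currently assigned observations
    for (k in 1:K) means[k, ] <- colMeans(x[assign == k, , drop = FALSE])
  }
  list(cluster = assign, centers = means)
}
set.seed(1)
fit <- naive.kmeans(matrix(rnorm(100 * 2), ncol = 2), K = 3)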
Recommendations for k-means clustering • Either: start with many different random choices of starting means, and choose the solution having the smallest value of the objective function. • Or: use another clustering method (e.g. hierarchical clustering) to determine an initial set of cluster centers.
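Both recommendations can be followed with the standard kmeans function; a hedged sketch on simulated data (object names are illustrative):

set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
# many random starts: keep the run with the smallest within-cluster sum of squares
km <- kmeans(x, centers = 3, nstart = 25)
km$tot.withinss
# or: initialize the centers from a hierarchical clustering of the same data
hc <- hclust(dist(x), method = "average")
init <- apply(x, 2, function(v) tapply(v, cutree(hc, k = 3), mean))  # 3 x 2 matrix of group means
km2 <- kmeans(x, centers = init)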
Agglomerative clustering, hierarchical clustering and dendrograms
Hierarchical Clustering • Produce a dendrogram • Avoid prespecification of the number of clusters K • The tree can be built in two distinct ways: • Bottom-up: agglomerative clustering • Top-down: divisive clustering
Agglomerative Methods • Start with n mRNA sample (or G gene) clusters • At each step, merge the two closest clusters using a measure of between-cluster dissimilarity that reflects the shape of the clusters • Examples of between-cluster dissimilarities: • Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities • Single-link (nearest neighbour, NN): minimum of pairwise dissimilarities • Complete-link (furthest neighbour, FN): maximum of pairwise dissimilarities
Agglomerative clustering • Agglomerative clustering algorithms begin with every observation representing a singleton cluster. • At each of the N-1 steps, the closest 2 (least dissimilar) clusters are merged into a single cluster. • Therefore a measure of dissimilarity between 2 clusters must be defined.
Different intergroup dissimilarities Let G and H represent 2 groups.
Single linkage uses the smallest pairwise dissimilarity: d_SL(G,H) = min { d_ii' : i in G, i' in H }
Complete linkage uses the largest pairwise dissimilarity: d_CL(G,H) = max { d_ii' : i in G, i' in H }
Group average uses the mean pairwise dissimilarity: d_GA(G,H) = (1 / (N_G N_H)) Σ_{i in G} Σ_{i' in H} d_ii'
Comparing different linkage methods If there is a strong clustering tendency, all 3 methods produce similar results. Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"); good for elongated clusters (like pearls on a string), but chaining can be a drawback. Complete linkage may lead to clusters in which some observations are much closer to members of other clusters than to some members of their own cluster; use it for very compact clusters. Group average clustering represents a compromise between the extremes of single and complete linkage; use it for ball-shaped clusters.
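The three linkage methods can be compared directly in R with hclust; a minimal sketch on simulated data:

set.seed(1)
x <- matrix(rnorm(40 * 2), ncol = 2)
d <- dist(x)
hc.single   <- hclust(d, method = "single")     # nearest neighbour
hc.complete <- hclust(d, method = "complete")   # furthest neighbour
hc.average  <- hclust(d, method = "average")    # UPGMA
par(mfrow = c(1, 3))
plot(hc.average); plot(hc.complete); plot(hc.single)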
Dendrogram Recursive binary splitting/agglomeration can be represented by a rooted binary tree. The root node represents the entire data set. The N terminal nodes of the tree represent individual observations. Each nonterminal node ("parent") has two daughter nodes. Thus the binary tree can be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its 2 daughters. A dendrogram provides a complete description of the hierarchical clustering in graphical format.
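Cutting the tree at a chosen height, or into a chosen number of groups, turns the hierarchy into an ordinary partition; a brief sketch with cutree (simulated data, illustrative cut values):

set.seed(1)
hc <- hclust(dist(matrix(rnorm(40 * 2), ncol = 2)), method = "average")
grp.k <- cutree(hc, k = 3)    # cut into 3 clusters
grp.h <- cutree(hc, h = 1.5)  # cut at dendrogram height 1.5
table(grp.k)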
Comments on dendrograms Caution: different hierarchical methods, as well as small changes in the data, can lead to different dendrograms. Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data. In general a dendrogram is a description of the results of the algorithm and not a graphical summary of the data. It is a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality d_ii' ≤ max(d_ik, d_i'k) for all i, i', k
Figure 1: dendrograms obtained with average, complete, and single linkage.
Divisive Methods • Start with only one cluster • At each step, split clusters into two parts • Advantage: Obtain the main structure of the data (i.e. focus on upper levels of dendrogram) • Disadvantage: Computational difficulties when considering all possible divisions into two groups
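In R, divisive hierarchical clustering is available through the diana function (DIvisive ANAlysis) in the cluster package; a short hedged sketch on simulated data:

library(cluster)
set.seed(1)
x <- matrix(rnorm(40 * 2), ncol = 2)
dv <- diana(dist(x))              # divisive hierarchical clustering on a dissimilarity
plot(dv, which.plots = 2)         # dendrogram
cutree(as.hclust(dv), k = 3)      # extract 3 clusters from the divisive tree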
Partitioning vs. Hierarchical • Partitioning • Advantage: Provides clusters that satisfy some optimality criterion (approximately) • Disadvantages: Need initial K, long computation time • Hierarchical • Advantage: Fast computation (agglomerative) • Disadvantages: Rigid, cannot correct later for erroneous decisions made earlier • Word on the street: most data analysts prefer hierarchical clustering over partitioning methods when it comes to gene expression data
Generic Clustering Tasks • Estimating number of clusters • Assigning each object to a cluster • Assessing strength/confidence of cluster assignments for individual objects • Assessing cluster homogeneity
How many clusters K? • Many suggestions for how to decide this! • Milligan and Cooper (Psychometrika 50:159-179, 1985) studied 30 methods • A number of new methods, including GAP (Tibshirani) and clest (Fridlyand and Dudoit, uses bootstrapping), see also prediction strength methods http://www.genetics.ucla.edu/labs/horvath/GeneralPredictionStrength/
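One simple heuristic that is easy to compute with the cluster package (not one of the methods cited above, just an illustration) is the average silhouette width of PAM solutions over a range of K:

library(cluster)
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
# average silhouette width for K = 2,...,6; larger is better
avg.sil <- sapply(2:6, function(k) pam(x, k)$silinfo$avg.width)
K.hat <- (2:6)[which.max(avg.sil)]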
R: clustering • A number of R packages (libraries) contain functions to carry out clustering, including: • mva: kmeans, hclust (in recent versions of R these functions are part of the stats package) • cluster: pam (among others) • cclust: convex clustering, also methods to estimate K • mclust: model-based clustering • GeneSOM