Clustering Petter Mostad
Clustering vs. class prediction • Class prediction: • A learning set of objects with known classes • Goal: put new objects into existing classes • Also called: Supervised learning, or classification • Clustering: • No learning set, no given classes • Goal: discover the ”best” classes or groupings • Also called: Unsupervised learning, or class discovery
Overview • General clustering theory • Steps, methods, algorithms, issues... • Clustering microarray data • Recommendations for this kind of data • Programs for clustering • Some other visualization techniques
Issues in clustering • Used to explore and visualize data, with few preconceptions • Many subjective choices must be made, so a clustering output tends to be subjective • It is difficult to get truly statistically ”significant” conclusions • Algorithms will always produce clusters, whether any exist in the data or not
Steps in clustering • Feature selection and extraction • Defining and computing similarities • Clustering or grouping objects • Assessing, presenting, and using the result
1. Feature selection and extraction • Deciding which measurements matter for similarity • Data reduction • Filtering away objects • Normalization of measurements
The data matrix • Every row contains the measurements for one object. • Similarities are computed between all pairs of rows • If measurements are of same type, one can instead cluster them! [Figure: data matrix with objects as rows and measurements as columns]
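2. Defining and computing similarities • Similarity measures for continuous data vectors: • Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ • Minkowski distance (including Manhattan metric, $p = 1$): $d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$ • Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is a covariance matrix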
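A minimal sketch of the three distance measures named on this slide, using only the Python standard library. The function names are illustrative, and the Mahalanobis version is restricted to two measurements per object so that the inverse of S can be written out by hand.

```python
import math

def minkowski(x, y, p=2.0):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def mahalanobis_2d(x, y, S):
    """Mahalanobis distance for 2-D vectors; S is a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    # Explicit inverse of a 2x2 matrix (assumes S is invertible).
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    d = [x[0] - y[0], x[1] - y[1]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

x, y = (0.0, 0.0), (3.0, 4.0)
euclid = minkowski(x, y, p=2)
manhattan = minkowski(x, y, p=1)
# With S equal to the identity, Mahalanobis reduces to Euclidean distance.
maha_identity = mahalanobis_2d(x, y, [[1.0, 0.0], [0.0, 1.0]])
# A larger variance along the first axis shrinks differences along it.
maha_scaled = mahalanobis_2d(x, y, [[9.0, 0.0], [0.0, 1.0]])
```

With S equal to the identity matrix the Mahalanobis distance coincides with the Euclidean one, which is a quick sanity check for any implementation.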
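Centered and non-centered (absolute) Pearson correlation • centered: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the two vectors • non-centered: $r = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2}}$ • Spearman rank correlation • Compute the ranking of the numbers in each vector • Find correlation between ranking numbers • ....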
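The correlation measures on this slide can be sketched as follows. This is an illustrative version, not a reference implementation: the function names are made up here, and the ranking helper assumes there are no tied values.

```python
import math

def pearson(x, y, centered=True):
    """Pearson correlation; centered=False skips subtracting the means."""
    mx = sum(x) / len(x) if centered else 0.0
    my = sum(y) / len(y) if centered else 0.0
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def ranks(v):
    """Ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]   # monotone in x, but not linear
r_centered = pearson(x, y)
r_noncentered = pearson(x, y, centered=False)
r_spearman = spearman(x, y)
```

Because y increases monotonically with x, the Spearman correlation is exactly 1 even though the relationship is not linear, while the Pearson correlations fall slightly below 1.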
Geometrical view of clustering • If measurements are coordinates, objects become points in some space • If the similarity measure is Euclidean distance, the goal is to group nearby points • Note: When we have only 2 or 3 measurements per object, visual inspection often does better than most algorithms
Similarity measures for discrete data • Comparing two binary vectors, count the numbers a, b, c, d of 1-1’s, 1-0’s, 0-1’s, and 0-0’s, respectively • Construct different similarity measurements based on these numbers, for example the simple matching coefficient $(a + d)/(a + b + c + d)$ or the Jaccard coefficient $a/(a + b + c)$ • Similarity of, for example, trees or other objects can be defined in reasonable ways
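Two widely used measures built from the counts a, b, c, d are sketched here: the simple matching coefficient and the Jaccard coefficient. They are chosen as illustrations only, and the function names are hypothetical.

```python
def binary_counts(u, v):
    """Return (a, b, c, d): the numbers of 1-1, 1-0, 0-1 and 0-0 positions."""
    a = sum(1 for p, q in zip(u, v) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(u, v) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(u, v) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(u, v) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching(u, v):
    """Fraction of positions where the two vectors agree."""
    a, b, c, d = binary_counts(u, v)
    return (a + d) / (a + b + c + d)

def jaccard(u, v):
    """Like simple matching, but shared zeros (d) are ignored."""
    a, b, c, _ = binary_counts(u, v)
    return a / (a + b + c)

u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1]
counts = binary_counts(u, v)
sm = simple_matching(u, v)
jc = jaccard(u, v)
```

The Jaccard coefficient differs from simple matching only in dropping the 0-0 count, which matters when shared absences carry little information.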
Similarities using contexts • Mutual Neighbour Distance: $MND(x, y) = NN(x, y) + NN(y, x)$, where $NN(x, y)$ is the neighbour number of x with respect to y • This is not a metric, but similarities do not need to be based on metrics.
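A small sketch of the Mutual Neighbour Distance, taking it as the sum of the two neighbour numbers. It assumes Euclidean distance for ranking neighbours and no exact ties; the helper names are illustrative.

```python
import math

def neighbour_number(points, i, j):
    """Rank of points[j] among the neighbours of points[i] (1 = nearest)."""
    others = [k for k in range(len(points)) if k != i]
    others.sort(key=lambda k: math.dist(points[i], points[k]))
    return others.index(j) + 1

def mnd(points, i, j):
    """Mutual Neighbour Distance: sum of the two neighbour numbers."""
    return neighbour_number(points, i, j) + neighbour_number(points, j, i)

# Four points on a line: A=0, B=1, C=1.4, D=5.
points = [(0.0,), (1.0,), (1.4,), (5.0,)]
mnd_ab = mnd(points, 0, 1)   # B is A's nearest, but A is only B's 2nd nearest
mnd_bc = mnd(points, 1, 2)   # B and C are each other's nearest neighbours
```

Here A and B are closer in the "context" sense than their raw distance alone suggests only if few other points lie between them; the value depends on the surrounding points, not just on the pair.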
3. Clustering or grouping • Hierarchical clusterings • Divisive: Starts with one big cluster and subdivides one cluster in each step • Agglomerative: Starts with each object in a separate cluster. In each step, joins the two closest clusters • Partitional clusterings • Probabilistic or fuzzy clusterings
Hierarchical clustering • Agglomerative clustering depends on type of linkage, i.e., how to compute the distance between merged cluster (UV) and old cluster (W): • d(UV, W) = min(d(U, W), d(V,W)) (single linkage) • d(UV, W) = max(d(U,W), d(V,W)) (complete linkage) • d(UV, W) = average over all distances between objects in (UV) and objects in W (average linkage, or UPGMA: Unweighted Pair Group Method with Arithmetic mean) • The output is a dendrogram • A simplification of average linkage is often implemented (“average group linkage”): It may lead to inverted dendrograms!
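The linkage rules above can be sketched as a naive agglomerative procedure. This is a deliberately simple version (it recomputes all object-to-object distances at every step) with hypothetical function names, assuming Euclidean distance between objects.

```python
import math
from itertools import combinations

def cluster_distance(points, U, W, linkage):
    """Distance between clusters U and W (tuples of point indices)."""
    ds = [math.dist(points[i], points[j]) for i in U for j in W]
    if linkage == "single":
        return min(ds)
    if linkage == "complete":
        return max(ds)
    if linkage == "average":        # UPGMA: mean over all object pairs
        return sum(ds) / len(ds)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(points, linkage="average"):
    """Return the list of merges as (cluster, cluster, height) triples."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters under the chosen linkage.
        U, W = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(points, pair[0], pair[1], linkage))
        height = cluster_distance(points, U, W, linkage)
        clusters = [c for c in clusters if c not in (U, W)] + [U + W]
        merges.append((U, W, height))
    return merges

points = [(0.0,), (1.0,), (5.0,), (7.0,)]
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
average = agglomerate(points, "average")
```

In this example all three linkages produce the same tree shape, but the height of the final merge differs (4, 7 and 5.5 respectively), which is what changes the look of the dendrogram.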
Dendrograms, visualizations • The data matrix is often visualized using three colors, representing positive, negative, and zero values. • Hierarchical clustering results are often represented with a dendrogram. The similarity at which clusters merge should correspond to the height of the corresponding horizontal line in the dendrogram! • To display the dendrogram, the objects (rows or columns) need to be sorted. Each time two clusters are merged, they can be placed in either of two orders.
Ward’s hierarchical clustering • Agglomerative. • Goal: minimize ”Error Sum of Squares” (ESS) at every step. • ESS = The sum over all clusters, of the sum of the squares of the distances from the objects to the cluster centroid. • When joining two clusters, find the pair that results in the smallest increase in ESS.
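One step of Ward's method can be sketched as below: compute the ESS for every candidate merge and keep the cheapest. Since the ESS before the merge is the same for all candidates, picking the smallest resulting ESS is the same as picking the smallest increase. Function names are illustrative.

```python
def ess(points, clusters):
    """Error Sum of Squares: squared distances to each cluster centroid."""
    total = 0.0
    for cluster in clusters:
        dim = len(points[cluster[0]])
        centroid = [sum(points[i][k] for i in cluster) / len(cluster)
                    for k in range(dim)]
        total += sum((points[i][k] - centroid[k]) ** 2
                     for i in cluster for k in range(dim))
    return total

def ward_step(points, clusters):
    """Merge the pair of clusters that gives the smallest ESS afterwards."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            merged = [c for k, c in enumerate(clusters) if k not in (a, b)]
            merged.append(clusters[a] + clusters[b])
            cost = ess(points, merged)
            if best is None or cost < best[0]:
                best = (cost, merged)
    return best[1], best[0]

points = [(0.0,), (1.0,), (5.0,), (7.0,)]
clusters = [(0,), (1,), (2,), (3,)]
new_clusters, new_ess = ward_step(points, clusters)
```

Calling `ward_step` repeatedly until one cluster remains gives the full hierarchy.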
Partitional clusterings • The number of desired clusters is fixed at the start • K-means clustering: • Partition into k initial clusters • Iteratively, reassign points to groups with the closest centroid. Recompute centroids. • Repeat until stability • The result may depend on initial clusters • May include a procedure joining or splitting clusters according to size • The choice of number of clusters may not be obvious
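A minimal sketch of the k-means iteration described above. As a simplifying assumption it starts from initial centroids rather than an initial partition (the centroids of an initial partition would serve equally well), and it omits the optional joining or splitting of clusters.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means; returns (labels, centroids) once assignments are stable."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # stability reached
            break
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Running the function from several different starting centroids is a simple way to see how much the result depends on the initial clusters.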
Probabilistic or fuzzy clustering • The output is, for each object and each cluster, a probability or weight that the object belongs to the cluster • Example: The observations are modelled as produced by drawing from a number of probability densities (often multivariate normal). Parameters are then estimated with Maximum Likelihood (for example using EM algorithm). • Example: A ”fuzzy” version of k-means, where weights for objects are changed iteratively
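The "fuzzy" version of k-means can be sketched with the standard fuzzy c-means update rules, which is one common formulation and assumed here. The fuzziness exponent `m`, the fixed iteration count, and the function names are illustrative choices.

```python
import math

def memberships(points, centroids, m=2.0):
    """Weight of each object in each cluster; every row sums to 1."""
    weights = []
    for p in points:
        # Guard against a zero distance when a point sits on a centroid.
        d = [max(math.dist(p, c), 1e-12) for c in centroids]
        row = [1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                         for k in range(len(d)))
               for j in range(len(d))]
        weights.append(row)
    return weights

def update_centroids(points, weights, m=2.0):
    """Centroids as means of the points, weighted by membership ** m."""
    dim = len(points[0])
    out = []
    for j in range(len(weights[0])):
        w = [row[j] ** m for row in weights]
        out.append(tuple(sum(wi * p[t] for wi, p in zip(w, points)) / sum(w)
                         for t in range(dim)))
    return out

def fuzzy_kmeans(points, centroids, m=2.0, iterations=50):
    """Alternate between updating weights and centroids."""
    for _ in range(iterations):
        weights = memberships(points, centroids, m)
        centroids = update_centroids(points, weights, m)
    return weights, centroids

points = [(0.0,), (1.0,), (5.5,), (10.0,), (11.0,)]
weights, centroids = fuzzy_kmeans(points, [(0.0,), (11.0,)])
```

In this example the point at 5.5 lies midway between the two groups and ends up with a weight of about 0.5 for each cluster, whereas a hard clustering would be forced to assign it to one of them.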
Neural networks for clustering • Neural networks are mathematical models made to be similar to actual neural networks • They consist of layers of nodes that send out ”signals” based probabilistically on input signals • The best-known uses are in classification, i.e., with learning sets
Clustering as optimization • Given similarity definition and definition of what is an ”optimal” clustering, it can often be a huge algorithmic challenge to find the optimum. • Example: Subdivide many thousand objects into 50 clusters, minimizing e.g. the sum of the squared distances to centroids. • Then, algorithms for optimization are central.
Genetic algorithms • Tries to use ”evolution” to obtain good solutions to a problem • A number of solutions are kept at every step: They may then mate or mutate, to produce new solutions. The ”fittest” solutions are kept. • Can be seen as an optimization algorithm • A great challenge to design ways of mating and mutating that produce an efficient algorithm
Simulated annealing • A general optimization technique • Iterative: At every step, nearby solutions are chosen with probabilities depending on their optimality (so even less optimal solutions may be chosen) • As the algorithm proceeds and the ”temperature” falls, the probability of choosing less optimal solutions also falls. • A good general way to avoid getting stuck in local optima.
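A generic sketch of simulated annealing for minimizing a cost function. The Metropolis acceptance rule and the geometric cooling schedule are common choices assumed here, not something the slide specifies; the parameter names, the toy cost function `bumpy`, and the neighbour function `step` are hypothetical.

```python
import math
import random

def anneal(cost, neighbour, start, t0=1.0, cooling=0.95, steps=500, seed=0):
    """Minimize cost(); returns the best solution seen during the run."""
    rng = random.Random(seed)
    current = best = start
    t = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept a worse candidate with
        # probability exp(-delta / t), which shrinks as t falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
        if cost(current) < cost(best):
            best = current
        t *= cooling    # lower the "temperature"
    return best

def bumpy(x):
    """Toy cost over the integers with several local minima."""
    return (x - 7) ** 2 + 10 * math.cos(x)

def step(x, rng):
    """A nearby solution: move one unit left or right."""
    return x + rng.choice([-1, 1])

start = -20
best = anneal(bumpy, step, start)
```

For clustering, a solution would be an assignment of objects to clusters, a neighbour would move one object to another cluster, and the cost could be the sum of squared distances to centroids.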
4. Assessing and using the result • Visualization and summarization of the clusters • Note: You should always investigate the dependence of your results on the choices you have made for the clustering!
Examples of applications of clustering • Image analysis • Speech recognition • Data mining • ....
Clustering microarray data • Samples are columns, genes are rows, in the data matrix • What values to cluster? • What is a biologically relevant measure of similarity? • One can cluster genes and/or samples [Figure: data matrix with genes as rows and samples as columns]
Clustering microarray data • Use logged data, usually • Data should be on the same scale (this is usually the case if the data is already normalized) • You may have to filter away genes that show too little variation over samples. • Use an appropriate distance measure for the question you want to focus on (Pearson correlation often works OK). • Use an appropriate clustering algorithm (hierarchical average linkage usually works OK). • If you draw some conclusion from the clustering results, try to vary your clustering choices to see how stable these results are. • Clustering works best as a tool to generate hypotheses and ideas, which may then be tested in other ways.
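The first recommendations (log the data, filter genes with too little variation) can be sketched as a small preprocessing step. The base-2 logarithm and the variance threshold of 0.5 are arbitrary illustrative assumptions, and the values in `expression` are made up.

```python
import math
from statistics import pvariance

def log_transform(matrix):
    """Take log2 of every value (assumes all values are positive)."""
    return [[math.log2(v) for v in row] for row in matrix]

def filter_genes(matrix, min_variance):
    """Keep only the rows (genes) whose variance over samples is large enough."""
    return [row for row in matrix if pvariance(row) >= min_variance]

# Rows are genes, columns are samples.
expression = [
    [8.0, 8.0, 8.0, 8.0],      # flat gene: no variation over samples
    [1.0, 4.0, 16.0, 64.0],    # gene that varies strongly
]
logged = log_transform(expression)
kept = filter_genes(logged, min_variance=0.5)
```

Only the second gene survives the filter; the first has zero variance and would contribute nothing but noise to a correlation-based distance.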
Clustering to confirm or reject hypotheses? • A clustering may appear to validate, or be validated by, a grouping derived by using other data • Caution: The many different ways to do a clustering may make it possible to tweak it to produce the clusters you want • There is a huge and complex multiple testing problem • Note that small changes in data can change result dramatically • If you insist on trying to get ”significance”: • Using permutations of data • Using resampling of data (bootstrapping)
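One way to use permutations is sketched below, under the assumption that the grouping being tested comes from other data (not from clustering the same measurements). The statistic, the mean distance between objects in the same group, is an illustrative choice, as are the function names and the number of permutations.

```python
import math
import random
from itertools import combinations

def mean_within_distance(points, labels):
    """Mean distance over all pairs of objects that share a label."""
    ds = [math.dist(points[i], points[j])
          for i, j in combinations(range(len(points)), 2)
          if labels[i] == labels[j]]
    return sum(ds) / len(ds)

def permutation_p_value(points, labels, n_perm=999, seed=0):
    """Share of label permutations at least as tight as the observed grouping."""
    rng = random.Random(seed)
    observed = mean_within_distance(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # break any link between labels and data
        if mean_within_distance(points, shuffled) <= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)

points = [(0.0,), (1.0,), (2.0,), (3.0,), (20.0,), (21.0,), (22.0,), (23.0,)]
labels = ["A"] * 4 + ["B"] * 4
p_value = permutation_p_value(points, labels)
```

The p-value only addresses this one statistic and this one grouping; it does not account for the many clustering choices that may have been tried beforehand.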
How to do clustering: Programs • A good program for clustering and visualization: HCE • Great visualization options • Adapted to microarray data • http://www.cs.umd.edu/hcil/hce/ • Can import similarity matrices • Classic for microarray data: Cluster & TreeView (Eisen) • R/BioConductor: package cluster, hclust function, heatmap function, ... • Many other programs/packages
Other visualization techniques: Principal Components • The principal components can be viewed as the axes of a “better” coordinate system for the data. • “Better” in the sense that the data is maximally spread out along the first principal components. • The principal components correspond to eigenvectors of the covariance matrix of the data. • The eigenvalues represent the part of the total variance explained by each of the principal components.
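A sketch of how the first principal component can be found as the leading eigenvector of the covariance matrix, here using power iteration to stay within the standard library. The starting vector of all ones and the fixed iteration count are simplifying assumptions (the start must not be orthogonal to the leading eigenvector, and the data must have non-zero variance).

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns (measurements)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(dim)]
    centred = [[r[k] - means[k] for k in range(dim)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, by power iteration."""
    dim = len(cov)
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue.
    eigenvalue = sum(v[a] * cov[a][b] * v[b]
                     for a in range(dim) for b in range(dim))
    return v, eigenvalue

# Points lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov = covariance_matrix(rows)
pc1, eigenvalue = first_principal_component(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])   # share of total variance
```

Because the example points lie exactly on a line, the first principal component points along that line and explains all of the variance.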
Other visualization techniques: Multidimensional scaling • Start with some points in a very high dimension. • Goal: Display these points in a lower dimension, so that distances between them are similar to distances in original dimension. • May also try to preserve only the ranking of the pairwise distances. • Makes it possible to use powerful visual inspection, in 2 or 3 dimensions. • Can sometimes give very convincing pictures separating samples in a predicted way.