Cluster Analysis I 9/28/2012
Outline • Introduction • Distance and similarity measures for individual data points • A few widely used methods: hierarchical clustering, K-means, model-based clustering
Introduction • To group or segment a collection of objects into subsets or "clusters", such that objects within each cluster are more closely related to one another than to objects assigned to different clusters. • Sometimes the goal is to arrange the clusters into a natural hierarchy. • Cluster genes: a similar expression pattern suggests co-regulation. • Cluster samples: identify potential sub-classes of disease.
Introduction • Assigning subjects into groups. • Estimating the number of clusters. • Assessing the strength/confidence of cluster assignments for individual objects.
Proximity Matrix • An N×N matrix D (N = number of objects) whose element d(i, i') records the proximity (dissimilarity) between objects i and i'. • Most often we have measurements on p attributes for each object; the dissimilarity between two objects can then be defined by combining the attribute-wise dissimilarities, $D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})$.
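For concreteness, here is a small NumPy/SciPy sketch (toy data made up for illustration) of building such an N×N proximity matrix, using Euclidean distance as the dissimilarity:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: N = 5 objects measured on p = 3 attributes (values are made up).
X = np.array([[1.0, 2.0, 0.5],
              [1.1, 1.9, 0.4],
              [4.0, 0.2, 3.3],
              [3.9, 0.1, 3.5],
              [2.5, 2.5, 2.5]])

# N x N proximity matrix: entry (i, i') is the dissimilarity between objects i and i'.
D = squareform(pdist(X, metric="euclidean"))
print(D.shape)                                            # (5, 5)
print(np.allclose(D, D.T), np.allclose(np.diag(D), 0.0))  # symmetric, zero diagonal
```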
Dissimilarity Measures • Two main classes of distance for continuous variables: • Distance metric (scale-dependent) • 1- Correlation coefficients (scale-invariant)
Minkowski distance • For vectors $x = (x_1, \dots, x_S)$ and $y = (y_1, \dots, y_S)$ of length S, the Minkowski family of distance measures is defined as $d_k(x, y) = \left(\sum_{s=1}^{S} |x_s - y_s|^k\right)^{1/k}$.
Two commonly used special cases • Manhattan distance (a.k.a. city-block distance, k = 1): $d_1(x, y) = \sum_{s=1}^{S} |x_s - y_s|$ • Euclidean distance (k = 2): $d_2(x, y) = \left(\sum_{s=1}^{S} (x_s - y_s)^2\right)^{1/2}$
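A quick numeric check of these definitions, using NumPy and SciPy's built-in Minkowski distance (the helper minkowski_dist below is our own, written directly from the formula above):

```python
import numpy as np
from scipy.spatial.distance import minkowski

def minkowski_dist(x, y, k):
    """Minkowski distance of order k between two equal-length vectors."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 4.0])

print(minkowski_dist(x, y, 1))                 # Manhattan: 1 + 2 + 1 = 4
print(minkowski_dist(x, y, 2))                 # Euclidean: sqrt(6) ~ 2.449
print(minkowski(x, y, 1), minkowski(x, y, 2))  # same values from SciPy
```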
Mahalanobis distance • Takes the correlation structure into account: $d_M(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$, where $\Sigma$ is the covariance matrix. • When an identity covariance matrix is assumed, it is the same as Euclidean distance.
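A small sketch using SciPy (the covariance matrix is estimated from made-up toy data); note how it collapses to the Euclidean distance once the identity covariance is plugged in:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Made-up data, used only to estimate a covariance matrix Sigma.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix Sigma^{-1}

x, y = X[0], X[1]
print(mahalanobis(x, y, VI))                  # Mahalanobis distance
# With the identity covariance matrix, the same formula is just the Euclidean distance.
print(mahalanobis(x, y, np.eye(3)), np.linalg.norm(x - y))
```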
Pearson correlation and inner product • Pearson correlation: $r = \frac{\sum_s (x_s - \bar{x})(y_s - \bar{y})}{\sqrt{\sum_s (x_s - \bar{x})^2 \sum_s (y_s - \bar{y})^2}}$ • After standardization (mean 0, standard deviation 1), the correlation is simply the scaled inner product, $r = \frac{1}{S}\sum_s x_s y_s$, and $1 - r$ is proportional to the squared Euclidean distance. • Sensitive to outliers.
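A short numeric illustration of the standardization point (toy vectors generated only for the example); this is also the reason the standardization discussed on a later slide makes the two dissimilarities equivalent:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(size=50)        # two correlated toy vectors

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation

# Standardize each vector to mean 0 and standard deviation 1.
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

# After standardization, the correlation is the scaled inner product,
# and 1 - r is proportional to the squared Euclidean distance.
print(np.isclose(r, np.dot(xs, ys) / len(x)))
print(np.isclose(np.sum((xs - ys) ** 2), 2 * len(x) * (1 - r)))
```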
Spearman correlation • Calculated using the ranks of the two vectors (note: the sum of the ranks 1, ..., n is n(n+1)/2).
Spearman correlation • When there are no tied observations: $\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of the i-th pair. • Robust to outliers since it is based on the ranks of the data.
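A brief check that SciPy's spearmanr agrees with the closed-form rank formula when there are no ties, and that a single outlier barely moves the rank-based measure:

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

rng = np.random.default_rng(2)
x = rng.normal(size=20)
y = x + rng.normal(scale=0.5, size=20)
y[0] = 100.0                     # an outlier; it barely affects a rank-based measure

rho_scipy, _ = spearmanr(x, y)

# Closed-form version based on rank differences (valid when there are no ties).
d = rankdata(x) - rankdata(y)
n = len(x)
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(np.isclose(rho_scipy, rho_formula))
```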
Standardization of the data • Standardize gene rows to mean 0 and stdev 1. • Advantage: makes Euclidean distance and correlation equivalent. Many useful methods require the data to be in Euclidean space.
Clustering methods • Clustering algorithms come in two flavors: hierarchical methods, which organize the objects into a tree of nested clusters, and partitioning methods, which divide the objects into a pre-specified number of clusters.
Hierarchical clustering • Produces a tree or dendrogram. • It avoids specifying how many clusters are appropriate: cutting the tree at some level provides a partition for each possible k. • The tree can be built in two distinct ways • Bottom-up: agglomerative clustering (most used). • Top-down: divisive clustering.
Agglomerative Methods • The most popular hierarchical clustering approach. • Start with n clusters, one per object. • At each step, merge the two closest clusters according to a measure of between-cluster dissimilarity: single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), or average linkage (average pairwise distance).
Comparison of the three methods • Single-link • Elongated clusters • Individual decision, sensitive to outliers • Complete-link • Compact clusters • Individual decision, sensitive to outliers • Average-link or centroid • “In between” • Group decision, insensitive to outliers.
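As an illustration, the sketch below (SciPy's hierarchical clustering routines, with made-up two-group toy data) builds the agglomerative tree under each of the three linkage methods and cuts it into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy data: two loose groups of points (values are made up).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=0.0, size=(10, 2)),
               rng.normal(loc=5.0, size=(10, 2))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")  # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
# dendrogram(Z) would draw the tree for the last linkage method.
```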
Divisive Methods • Begin with the entire data set as a single cluster, and recursively divide one of the existing clusters into two daughter clusters. • Continue until each cluster has only one object, or until all members of each cluster coincide with one another. • Not as popular as agglomerative methods.
Divisive Algorithms • At each division, another clustering method, e.g. K-means with K = 2, could be used to split a cluster. • Smith et al. (1965) proposed a method that does not rely on another clustering method: • Start with a single cluster G; move the object that is furthest from the others (the one with the highest average pairwise distance) to a new splinter cluster H. • At each subsequent step, move to H the object in G with the largest difference between its average pairwise distance to the objects remaining in G and its average pairwise distance to the objects in H. • Stop when every object remaining in G is, on average, closer to the objects in G than to the objects in H.
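A minimal sketch of the splitting step described above (the function split_once is our own illustration, not code from the cited paper):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def split_once(D):
    """One divisive split following the procedure sketched above.

    D is the full pairwise distance matrix of the cluster G being split.
    Returns a boolean mask: True = moved to the splinter group H.
    """
    n = D.shape[0]
    in_H = np.zeros(n, dtype=bool)
    # Seed H with the object having the largest average distance to the others.
    in_H[np.argmax(D.sum(axis=1) / (n - 1))] = True
    while in_H.sum() < n - 1:
        G_idx, H_idx = np.where(~in_H)[0], np.where(in_H)[0]
        # For each object still in G: average distance to G minus average distance to H.
        diff = (D[np.ix_(G_idx, G_idx)].sum(axis=1) / (len(G_idx) - 1)
                - D[np.ix_(G_idx, H_idx)].mean(axis=1))
        best = np.argmax(diff)
        if diff[best] <= 0:        # everything left in G is closer to G than to H: stop
            break
        in_H[G_idx[best]] = True
    return in_H

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (8, 2)), rng.normal(6, 1, (8, 2))])
print(split_once(squareform(pdist(X))))
```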
Hierarchical clustering • The most overused statistical method in gene expression analysis. • Gives us pretty pictures. • Results tend to be unstable and sensitive to small changes in the data.
Partitioning method • Partition the data (size N) into a pre-specified number K of mutually exclusive and exhaustive groups: a many-to-one mapping, or encoder k = C(i), that assigns the ith observation to the kth cluster. • Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimization of a specific loss function.
Partitioning method • A natural loss function would be the within-cluster point scatter: $W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k} d(x_i, x_{i'})$ • The total point scatter: $T = \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N} d(x_i, x_{i'}) = W(C) + B(C)$, where $B(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')\neq k} d(x_i, x_{i'})$ is the between-cluster point scatter. • Since T is fixed, minimizing W(C) is equivalent to maximizing B(C).
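A quick numerical check of the decomposition T = W(C) + B(C) on made-up data with an arbitrary encoder, together with the cluster-mean form of W that squared Euclidean distance allows (the form K-means uses later):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 4))
labels = rng.integers(0, 3, size=30)            # an arbitrary encoder C(i) with K = 3

D = squareform(pdist(X, metric="sqeuclidean"))  # d(x_i, x_i') = squared Euclidean distance
same = labels[:, None] == labels[None, :]

T = 0.5 * D.sum()                 # total point scatter
W = 0.5 * D[same].sum()           # within-cluster point scatter
B = 0.5 * D[~same].sum()          # between-cluster point scatter
print(np.isclose(T, W + B))       # T = W + B, so minimizing W maximizes B

# For squared Euclidean distance, W can also be written with cluster means:
# sum_k N_k * sum_{C(i)=k} ||x_i - xbar_k||^2.
W_means = sum((labels == k).sum() * ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in np.unique(labels))
print(np.isclose(W, W_means))
```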
Partitioning method • In principle, we simply need to minimize W or maximize B over all possible assignments of the N objects to K clusters. • However, the number of distinct assignments, $S(N, K) = \frac{1}{K!}\sum_{k=1}^{K}(-1)^{K-k}\binom{K}{k}k^N$, grows extremely rapidly as N and K increase.
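The count of distinct assignments is the Stirling number of the second kind; a small helper (our own, written from that formula) shows how quickly it explodes:

```python
from math import comb, factorial

def num_assignments(N, K):
    """Number of distinct ways to assign N objects to K non-empty clusters
    (the Stirling number of the second kind)."""
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

print(num_assignments(10, 4))   # 34,105
print(num_assignments(19, 4))   # already around 1e10; exhaustive enumeration is hopeless
```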
Partitioning method • In practice, we can only examine a small fraction of all possible encoders. • Such feasible strategies are based on iterative greedy descent: • An initial partition is specified. • At each iterative step, the cluster assignments are changed in such a way that the value of the criterion is improved from its previous value.
K-means • Choose the squared Euclidean distance as the dissimilarity measure: $d(x_i, x_{i'}) = \sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = \lVert x_i - x_{i'}\rVert^2$. • Minimize the within-cluster point scatter, which then becomes $W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^2$, where $\bar{x}_k$ is the mean vector of cluster k and $N_k$ is the number of objects in cluster k.
K-means Algorithm (closely related to the EM algorithm for estimating a certain Gaussian mixture model) • 1. Choose K centroids at random. • 2. Make an initial partition of the objects into K clusters by assigning each object to its closest centroid. • 3. M step: recalculate the centroid (mean) of each of the K clusters. • 4. E step: reassign each object to its closest centroid. • 5. Repeat steps 3 and 4 until no reallocations occur. A minimal sketch follows below.
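A minimal NumPy sketch of these steps (the function kmeans is our own illustration; empty-cluster handling and other refinements are omitted):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and centroid-update steps.
    Empty-cluster handling and other refinements are omitted for brevity."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # step 1: random centroids
    for _ in range(n_iter):
        # Assignment step: each object goes to its closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the objects assigned to it.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):             # no change -> converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(np.bincount(labels), centroids.round(2))
```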
K-means: local minimum problem • [Figure: K-means runs from different sets of initial values; the initialization marked "x" converges to a local minimum.]
K-means: discussion • Advantages: • Fast and easy. • Nice relationship with the Gaussian mixture model. • Disadvantages: • Can converge to a local minimum (so it should be run from multiple initial values; see the sketch below). • Requires the number of clusters K to be specified (or estimated). • Does not allow scattered objects to be left unassigned (cf. tight clustering).
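One common remedy for the local-minimum problem, sketched here with scikit-learn's KMeans on made-up blob data: run many random initializations and keep the fit with the smallest within-cluster sum of squares (inertia_):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=0)

# A single random start can land in a poor local minimum; many random starts keep
# the solution with the smallest within-cluster sum of squares.
one_start = KMeans(n_clusters=4, init="random", n_init=1, random_state=1).fit(X)
many_start = KMeans(n_clusters=4, init="random", n_init=25, random_state=1).fit(X)
print(one_start.inertia_, many_start.inertia_)   # the multi-start fit is never worse
```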
Model based clustering • Fraley and Raftery (1998) applied a Gaussian mixture model. • The parameters can be estimated by the EM algorithm. • Cluster membership is decided by the posterior probability that each object belongs to cluster k.
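A short sketch of this approach using scikit-learn's GaussianMixture on made-up blob data (this illustrates the general Gaussian-mixture idea, not Fraley and Raftery's specific software):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 2.0], random_state=0)

# Fit a 3-component Gaussian mixture by EM; covariance_type controls the cluster covariance structure.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

post = gmm.predict_proba(X)        # posterior probability of each object belonging to each cluster
labels = post.argmax(axis=1)       # hard assignment = component with the highest posterior
print(post[:3].round(3))
print(np.bincount(labels))
print(gmm.bic(X))                  # BIC is one way to compare different numbers of components
```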
Review of EM algorithm • It is widely used for solving missing-data problems. • Here the missing data are the cluster memberships. • Let us review the EM algorithm with a simple example.
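As a simple example of this kind, here is a hedged sketch (our own, assuming a two-component univariate Gaussian mixture) of what the E and M steps compute:

```python
import numpy as np

rng = np.random.default_rng(8)
# Simple example: a two-component univariate Gaussian mixture with unknown memberships.
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1, 50)])

# Initial guesses for the mixing proportions, means, and variances.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # E step: posterior probability (responsibility) that each point came from each component.
    dens = pi * normal_pdf(x[:, None], mu, var)        # shape (n, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: update the parameters using responsibility-weighted averages.
    nk = gamma.sum(axis=0)
    pi = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi.round(2), mu.round(2), var.round(2))   # roughly recovers (0.75, 0.25), (0, 4), (1, 1)
```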
The CML approach • Indicator variables $z_{ik}$, identifying the mixture component of origin for each observation $x_i$, are treated as unknown parameters. • Two CML criteria have been proposed, according to the sampling scheme.
Two CMLs • C1: random sampling within each cluster (leads to the classification likelihood). • C2: random sampling from a population with the mixture density (leads to the mixture likelihood).
-- Classification likelihood: $C_1(\theta; z) = \sum_{k=1}^{K}\sum_{i:\, z_{ik}=1} \log f(x_i; \theta_k)$ -- Mixture likelihood: $C_2(\theta, \pi; z) = \sum_{k=1}^{K}\sum_{i:\, z_{ik}=1} \log\big(\pi_k f(x_i; \theta_k)\big)$ Gaussian assumption: $f(x; \theta_k) = (2\pi)^{-p/2}\,|\Sigma_k|^{-1/2}\exp\!\big(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1}(x - \mu_k)\big)$
Related to K-means • When $f(x; \theta_k)$ is assumed to be Gaussian with a covariance matrix that is identical and spherical across all clusters, i.e. $\Sigma_k = \sigma^2 I$ for all k, the classification likelihood reduces (up to constants) to $-\frac{1}{2\sigma^2}\sum_{k}\sum_{C(i)=k}\lVert x_i - \mu_k\rVert^2$. • So maximizing the C1-CML criterion is equivalent to minimizing W.
Model-based methods • Advantages: • Flexibility in the cluster covariance structure. • Rigorous statistical inference with a full model. • Disadvantages: • Model selection is usually difficult, and the data may not fit a Gaussian model. • Too many parameters to estimate with a complex covariance structure. • Local minimum problem.
References • Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning (2nd ed.), New York: Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011), Cluster Analysis (5th ed.), West Sussex, UK: John Wiley & Sons Ltd. • Celeux, G., and Govaert, G. (1992), "A Classification EM Algorithm for Clustering and Two Stochastic Versions," Computational Statistics & Data Analysis, 14, 315-332.