Overview
• What is clustering?
• Similarity/distance metrics
• Hierarchical clustering algorithms
• K-means
What is clustering?
• Group similar objects together
• Objects in the same cluster (group) are more similar to each other than objects in different clusters
• An exploratory data analysis tool
How to define similarity?
• Similarity metric: a measure of pairwise similarity or dissimilarity
• Examples: correlation coefficient, Euclidean distance
(Figure: a raw n × p matrix of n genes by p experiments is converted into an n × n gene-by-gene similarity matrix.)
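The raw-matrix-to-similarity-matrix step can be sketched in a few lines of NumPy; the expression values below are made-up toy data, not from any real experiment.

```python
import numpy as np

# Assumed toy data: raw matrix of n = 3 genes (rows) x p = 4 experiments (columns).
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same shape as gene 0 (perfectly correlated)
    [4.0, 3.0, 2.0, 1.0],   # opposite shape (perfectly anti-correlated)
])

# n x n similarity matrix of pairwise correlation coefficients between genes.
similarity = np.corrcoef(raw)
print(np.round(similarity, 2))
```

The resulting matrix is symmetric with ones on the diagonal; entry (i, j) is the correlation between the profiles of genes i and j.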
Similarity metrics • Euclidean distance • Correlation coefficient
Example
• Correlation(X, Y) = 1, Distance(X, Y) = 4
• Correlation(X, Z) = −1, Distance(X, Z) = 2.83
• Correlation(X, W) = 1, Distance(X, W) = 1.41
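Both metrics are easy to compute directly. The vectors below are assumptions for illustration (the slide does not give its vectors); X and Z are chosen so that, as on the slide, they are perfectly anti-correlated at distance ≈ 2.83, while Y shows that two perfectly correlated profiles can still be far apart.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two profiles (sensitive to magnitude).
    return float(np.sqrt(np.sum((a - b) ** 2)))

def correlation(a, b):
    # Pearson correlation coefficient (sensitive to shape, not magnitude).
    return float(np.corrcoef(a, b)[0, 1])

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 4.0, 5.0])  # X shifted upward: same shape, different level
Z = np.array([3.0, 2.0, 1.0])  # X reversed: opposite shape

print(correlation(X, Y), euclidean(X, Y))  # corr = 1.0, distance ≈ 3.46
print(correlation(X, Z), euclidean(X, Z))  # corr = -1.0, distance ≈ 2.83
```

This is why the choice of metric matters: correlation groups profiles by shape, while Euclidean distance groups them by absolute level.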
Clustering algorithms
• Inputs: raw data matrix or similarity matrix; number of clusters
• Two classifications of clustering algorithms:
• Hierarchical
• Partitional
Steps to Cluster Analysis
• The two key steps in cluster analysis are measuring the distances between objects and grouping the objects based upon the resulting distances (linkages).
• The distances provide a measure of similarity between objects and may be computed in a variety of ways, such as the Euclidean distance.
• Linkages determine how the distance between two groups of objects is measured.
Hierarchical Clustering
• Agglomerative (bottom-up)
• Algorithm:
• Initialize: each item is its own cluster
• Iterate: select the two most similar clusters and merge them
• Halt: when the required number of clusters is reached
• The sequence of merges can be visualized as a dendrogram
Clustering Procedure – Linkage Methods
• The single linkage method is based on minimum distance, or the nearest-neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.
• The complete linkage method is similar to single linkage, except that it is based on the maximum distance, or the furthest-neighbor approach. In complete linkage, the distance between two clusters is the distance between their two furthest points.
• The average linkage method works similarly, except that the distance between two clusters is defined as the average of the distances between all pairs of objects, with one member of each pair drawn from each cluster.
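The agglomerative algorithm with the three linkage rules can be sketched as follows. This is a minimal NumPy illustration, not an optimized implementation; the example points are assumptions chosen to form two obvious groups.

```python
import numpy as np

def cluster_distance(d, a, b, linkage):
    # Distance between clusters a and b (lists of point indices),
    # given a full pairwise distance matrix d, under the chosen linkage.
    pair = d[np.ix_(a, b)]
    if linkage == "single":    # nearest neighbor: closest pair of points
        return pair.min()
    if linkage == "complete":  # furthest neighbor: furthest pair of points
        return pair.max()
    return pair.mean()         # average linkage: mean over all pairs

def agglomerative(points, k, linkage="single"):
    # Start with every point in its own cluster, then repeatedly
    # merge the two closest clusters until only k remain.
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(d, clusters[ab[0]], clusters[ab[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(agglomerative(pts, 2, "single"))  # two tight groups: {0, 1} and {2, 3}
```

Swapping `"single"` for `"complete"` or `"average"` changes only how inter-cluster distance is scored, which is exactly the difference between the three methods described above.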
Hierarchical Clustering – Single Linkage (worked example)
(Figure sequence: starting from the distance matrix over four items A, B, C, D, the two closest clusters are merged at each step — at distances 2, 3, and 10 — until all items form a single cluster.)
Hierarchical: Single Link
• Cluster similarity = similarity of the two most similar members
• − Potentially long and skinny clusters
• + Fast
Example: single link
(Figure sequence: dendrogram built step by step over items 1–5 using single linkage.)
Hierarchical: Complete Link
• Cluster similarity = similarity of the two least similar members
• + Tight clusters
• − Slow
Example: complete link
(Figure sequence: dendrogram built step by step over items 1–5 using complete linkage.)
Hierarchical: Average Link
• Cluster similarity = average similarity of all pairs
• + Tight clusters
• − Slow
Example: average link
(Figure sequence: dendrogram built step by step over items 1–5 using average linkage.)
Hierarchical divisive clustering algorithms
• Top-down: start with all the objects in one cluster
• Successively split into smaller clusters
• Tend to be less efficient than agglomerative algorithms
• Resolver implemented a deterministic annealing approach
Partitioning – Non-hierarchical
• Assign each item to the cluster whose centroid (mean) is closest
• Recalculate the centroids of the cluster receiving the item and the cluster losing the item
• Repeat the last step until no reassignments occur
Details of k-means
• Group the cases into a collection of K clusters
• Iterate until convergence:
• Assign each data point to the closest centroid
• Compute new centroids as the means of the assigned points
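The assign/recompute loop above can be sketched as follows. This is a minimal version of Lloyd's algorithm; the four example points are assumptions forming two well-separated pairs, and edge cases such as empty clusters are not handled.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    # Lloyd's algorithm: alternate between assigning points to the
    # nearest centroid and recomputing each centroid as its cluster mean.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the closest centroid for each point.
        labels = np.argmin(
            ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1
        )
        # Update step: move each centroid to the mean of its assigned points.
        new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centroids):  # converged: centroids stopped moving
            break
        centroids = new
    return labels, centroids

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(pts, 2)
print(labels)  # points 0 and 1 share one label; points 2 and 3 share the other
```

The loop converges because each step can only decrease the total within-cluster squared distance, which is why k-means is guaranteed to reach a local (not necessarily global) optimum.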
Properties of k-means
• Fast
• Proven to converge to a local optimum
• In practice, converges quickly
• Tends to produce spherical, equal-sized clusters
Summary
• Definition of clustering
• Pairwise similarity: correlation, Euclidean distance
• Clustering algorithms: hierarchical (single-link, complete-link, average-link), K-means
• Different clustering algorithms can produce different clusters