Overview
• What is clustering?
• Similarity/distance metrics
• Hierarchical clustering algorithms
• K-means
What is clustering?
• Group similar objects together
• Objects in the same cluster (group) are more similar to each other than objects in different clusters
• An exploratory data analysis tool
How to define similarity?
• Similarity metric: a measure of pairwise similarity or dissimilarity
• Examples: correlation coefficient, Euclidean distance
(Figure: a raw n × p matrix of n genes by p experiments is converted into an n × n gene-by-gene similarity matrix.)
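The raw-matrix-to-similarity-matrix step can be sketched in a few lines of NumPy; the expression values below are made-up toy data, not from any real experiment.

```python
import numpy as np

# Assumed toy data: raw matrix of n = 3 genes (rows) x p = 4 experiments (columns).
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same shape as gene 0 (perfectly correlated)
    [4.0, 3.0, 2.0, 1.0],   # opposite shape (perfectly anti-correlated)
])

# n x n similarity matrix of pairwise correlation coefficients between genes.
similarity = np.corrcoef(raw)
print(np.round(similarity, 2))
```

The resulting matrix is symmetric with ones on the diagonal; entry (i, j) is the correlation between the profiles of genes i and j.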
Similarity metrics • Euclidean distance • Correlation coefficient
Example
• Correlation(X, Y) = 1, Distance(X, Y) = 4
• Correlation(X, Z) = −1, Distance(X, Z) = 2.83
• Correlation(X, W) = 1, Distance(X, W) = 1.41
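Both metrics are easy to compute directly. The vectors below are assumptions for illustration (the slide does not give its vectors); X and Z are chosen so that, as on the slide, they are perfectly anti-correlated at distance ≈ 2.83, while Y shows that two perfectly correlated profiles can still be far apart.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two profiles (sensitive to magnitude).
    return float(np.sqrt(np.sum((a - b) ** 2)))

def correlation(a, b):
    # Pearson correlation coefficient (sensitive to shape, not magnitude).
    return float(np.corrcoef(a, b)[0, 1])

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 4.0, 5.0])  # X shifted upward: same shape, different level
Z = np.array([3.0, 2.0, 1.0])  # X reversed: opposite shape

print(correlation(X, Y), euclidean(X, Y))  # corr = 1.0, distance ≈ 3.46
print(correlation(X, Z), euclidean(X, Z))  # corr = -1.0, distance ≈ 2.83
```

This is why the choice of metric matters: correlation groups profiles by shape, while Euclidean distance groups them by absolute level.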
Clustering algorithms
• Inputs: raw data matrix or similarity matrix; number of clusters
• Two classifications of clustering algorithms:
• Hierarchical
• Partitional
Steps to Cluster Analysis
• The two key steps in cluster analysis are measuring the distances between objects and grouping the objects based upon the resulting distances (linkages).
• The distances provide a measure of similarity between objects and may be computed in a variety of ways, such as the Euclidean distance.
• Linkages determine how the distance between two groups of objects is measured.
Hierarchical Clustering
• Agglomerative (bottom-up)
• Algorithm:
• Initialize: each item is its own cluster
• Iterate: select the two most similar clusters and merge them
• Halt: when the required number of clusters is reached
• The sequence of merges can be visualized as a dendrogram
Clustering Procedure – Linkage Methods
• The single linkage method is based on minimum distance, or the nearest-neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.
• The complete linkage method is similar to single linkage, except that it is based on the maximum distance, or the furthest-neighbor approach. In complete linkage, the distance between two clusters is the distance between their two furthest points.
• The average linkage method works similarly, except that the distance between two clusters is defined as the average of the distances between all pairs of objects, with one member of each pair drawn from each cluster.
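The agglomerative algorithm with the three linkage rules can be sketched as follows. This is a minimal NumPy illustration, not an optimized implementation; the example points are assumptions chosen to form two obvious groups.

```python
import numpy as np

def cluster_distance(d, a, b, linkage):
    # Distance between clusters a and b (lists of point indices),
    # given a full pairwise distance matrix d, under the chosen linkage.
    pair = d[np.ix_(a, b)]
    if linkage == "single":    # nearest neighbor: closest pair of points
        return pair.min()
    if linkage == "complete":  # furthest neighbor: furthest pair of points
        return pair.max()
    return pair.mean()         # average linkage: mean over all pairs

def agglomerative(points, k, linkage="single"):
    # Start with every point in its own cluster, then repeatedly
    # merge the two closest clusters until only k remain.
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(d, clusters[ab[0]], clusters[ab[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(agglomerative(pts, 2, "single"))  # two tight groups: {0, 1} and {2, 3}
```

Swapping `"single"` for `"complete"` or `"average"` changes only how inter-cluster distance is scored, which is exactly the difference between the three methods described above.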
Hierarchical Clustering – Single Linkage (worked example)
(Figure sequence: starting from the distance matrix over four items A, B, C, D, the two closest clusters are merged at each step — at distances 2, 3, and 10 — until all items form a single cluster.)
Hierarchical: Single Link
• Cluster similarity = similarity of the two most similar members
• − Potentially long and skinny clusters
• + Fast
Example: single link
(Figure sequence: dendrogram built step by step over items 1–5 using single linkage.)
Hierarchical: Complete Link
• Cluster similarity = similarity of the two least similar members
• + Tight clusters
• − Slow
Example: complete link
(Figure sequence: dendrogram built step by step over items 1–5 using complete linkage.)
Hierarchical: Average Link
• Cluster similarity = average similarity of all pairs
• + Tight clusters
• − Slow
Example: average link
(Figure sequence: dendrogram built step by step over items 1–5 using average linkage.)
Hierarchical divisive clustering algorithms
• Top-down: start with all the objects in one cluster
• Successively split into smaller clusters
• Tend to be less efficient than agglomerative algorithms
• Resolver implemented a deterministic annealing approach
Partitioning – Non-hierarchical
• Assign each item to the cluster whose centroid (mean) is closest
• Recalculate the centroids of the cluster receiving the item and the cluster losing the item
• Repeat the last step until no reassignments occur
Details of k-means
• Group the cases into a collection of K clusters
• Iterate until convergence:
• Assign each data point to the closest centroid
• Compute new centroids as the means of the assigned points
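The assign/recompute loop above can be sketched as follows. This is a minimal version of Lloyd's algorithm; the four example points are assumptions forming two well-separated pairs, and edge cases such as empty clusters are not handled.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    # Lloyd's algorithm: alternate between assigning points to the
    # nearest centroid and recomputing each centroid as its cluster mean.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: index of the closest centroid for each point.
        labels = np.argmin(
            ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1
        )
        # Update step: move each centroid to the mean of its assigned points.
        new = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centroids):  # converged: centroids stopped moving
            break
        centroids = new
    return labels, centroids

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(pts, 2)
print(labels)  # points 0 and 1 share one label; points 2 and 3 share the other
```

The loop converges because each step can only decrease the total within-cluster squared distance, which is why k-means is guaranteed to reach a local (not necessarily global) optimum.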
Properties of k-means
• Fast
• Proven to converge to a local optimum
• In practice, converges quickly
• Tends to produce spherical, equal-sized clusters
Summary
• Definition of clustering
• Pairwise similarity: correlation, Euclidean distance
• Clustering algorithms: hierarchical (single-link, complete-link, average-link), K-means
• Different clustering algorithms can produce different clusters