Clustering

Clustering 조이현

Overview • What is clustering? • Clustering algorithms

What is clustering? • Clustering • The act of grouping similar objects into sets • Clustering vs. classification • Classification assigns objects to predefined groups • Clustering infers the groups from the objects themselves

Clustering algorithms • Hierarchical • Bottom-up (agglomerative clustering) • Top-down (divisive clustering) • Non-Hierarchical • K-means (can be fuzzy) • Single-pass (incremental)

Hierarchical Clustering • Bottom-up (agglomerative clustering) • Start with each object in its own cluster • Repeatedly join the two clusters with maximum similarity • Top-down (divisive clustering) • Start with all objects in one cluster • Divide it into groups • Repeatedly split the least coherent cluster

Agglomerative clustering
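
The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration, not the method from the slides: it assumes 1-D numeric points, uses negative absolute distance as the similarity, and the names `single_link_similarity` and `agglomerative` are hypothetical. The recorded merge order is the information a dendrogram displays.

```python
from itertools import combinations


def single_link_similarity(a, b):
    # Similarity of the two most similar members. Assumption: similarity
    # is the negative absolute distance between 1-D points.
    return max(-abs(x - y) for x in a for y in b)


def agglomerative(points, num_clusters):
    # Start with each object in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # merge history, in the order the merges happened
    while len(clusters) > num_clusters:
        # Find the pair of clusters with maximum similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges
```

For example, `agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], 3)` groups the points as {1.0, 1.5}, {5.0, 5.2} and {9.0}. The sketch recomputes every pairwise similarity on each pass, so it is slower than the complexities quoted for the variants later in the deck.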

Clustering result: dendrogram

Hierarchical clustering variants • Various ways of calculating cluster similarity • Single-link (minimum distance between members) • Complete-link (maximum distance between members) • Group-average (average over members)

Single Link • Similarity of the two most similar members • Time complexity • O(n^2) • Locally coherent • Close objects are in the same cluster • Chaining effect (clusters can become elongated)

Complete Link • Similarity of the two least similar members • Time complexity • O(n^3) • Focused on global cluster quality • Avoids elongated clusters

Group average • Average similarity between members • Time complexity • O(n^2) • Compromise between single-link and complete-link
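
The three variants differ only in how the member-to-member similarities are combined. The sketch below is illustrative and rests on assumptions: the `similarity` function (one that decreases with Euclidean distance) and all function names are hypothetical, and `group_average` averages only the cross-cluster pairs, which is one common reading of "average similarity between members".

```python
import math


def similarity(x, y):
    # Assumed similarity: 1 for identical points, falling toward 0
    # as the Euclidean distance grows.
    return 1.0 / (1.0 + math.dist(x, y))


def single_link(a, b):
    # Similarity of the two most similar members.
    return max(similarity(x, y) for x in a for y in b)


def complete_link(a, b):
    # Similarity of the two least similar members.
    return min(similarity(x, y) for x in a for y in b)


def group_average(a, b):
    # Average similarity over all cross-cluster member pairs.
    sims = [similarity(x, y) for x in a for y in b]
    return sum(sims) / len(sims)
```

For any two clusters, the group-average value lies between the complete-link and single-link values, which matches its description as a compromise between the two.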

K-means clustering • Defines clusters by the center of mass of their members • Initial cluster centers are selected at random • Assign each object to the cluster whose center is nearest • Re-compute the center of each cluster • Repeat the assignment and re-computation steps until the stopping criterion is satisfied
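
The steps above can be sketched as follows. This is a minimal version under stated assumptions: points are tuples of numbers, distance is Euclidean, the initial centers are sampled from the data, and the stopping criterion is "centers no longer change" (capped by `max_iters`). The name `kmeans` and its parameters are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Initial centers: k distinct data points chosen at random.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assign each object to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Re-compute each center as the mean (center of mass) of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Stopping criterion: the centers did not move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Because the initial centers are random, different seeds can converge to different partitions. A common remedy is to run the procedure several times and keep the result with the smallest total distance from objects to their centers.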

K-means clustering (k=3)

Single-pass threshold
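
The slide body for this method did not survive in the transcript, so the following is only a sketch of one common formulation of single-pass (incremental) clustering, not necessarily the one presented. It assumes points are tuples of numbers, uses Euclidean distance to a running centroid, and treats `threshold` as a maximum distance; the name `single_pass` is hypothetical.

```python
import math


def single_pass(points, threshold):
    clusters = []   # member lists, one per cluster
    centroids = []  # running centroid of each cluster
    for p in points:
        if centroids:
            # Find the existing cluster whose centroid is closest to p.
            best = min(range(len(centroids)),
                       key=lambda c: math.dist(p, centroids[c]))
            if math.dist(p, centroids[best]) <= threshold:
                # Close enough: join that cluster and update its centroid.
                clusters[best].append(p)
                members = clusters[best]
                centroids[best] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
                continue
        # Otherwise p starts a new cluster of its own.
        clusters.append([p])
        centroids.append(tuple(p))
    return clusters
```

Each object is examined once, so the method is fast, but the result depends on the order in which the objects arrive and on the chosen threshold.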

Properties of hierarchical and flat clustering • Hierarchical • Preferable for detailed data analysis • Provides more information than flat clustering • No single best algorithm (dependent on application) • Less efficient than flat clustering (an N x N similarity matrix is required) • Flat (non-hierarchical) • Preferable if efficiency is a consideration or data sets are very large • K-means is the conceptually simplest method • K-means assumes a simple Euclidean representation space and so can’t be used for many data sets
