Clustering 조이현
Overview • What is clustering? • Clustering algorithms
What is clustering? • Clustering • The act of grouping similar objects into sets • Clustering vs. classification • Classification assigns objects to predefined groups • Clustering infers the groups from the objects themselves
Clustering algorithms • Hierarchical • Bottom-up (agglomerative clustering) • Top-down (divisive clustering) • Non-Hierarchical • K-means (can be fuzzy) • Single-pass (incremental)
Hierarchical Clustering • Bottom-up (agglomerative clustering) • Start with each object as its own cluster • Repeatedly join the two clusters with maximum similarity • Top-down (divisive clustering) • Start with all objects in one cluster • Divide them into groups • Repeatedly split the least coherent cluster
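The bottom-up procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objects are 1-D numbers, `similarity` is a hypothetical measure (negative absolute difference), the similarity of two clusters is taken from their two closest members, and merging stops once `k` clusters remain. None of these choices come from the slides.

```python
from itertools import combinations

def similarity(a, b):
    """Hypothetical similarity for 1-D points: larger means closer."""
    return -abs(a - b)

def cluster_similarity(c1, c2):
    """Similarity of the two closest members (one possible choice)."""
    return max(similarity(a, b) for a in c1 for b in c2)

def agglomerative(points, k):
    """Merge clusters bottom-up until only k clusters remain."""
    clusters = [[p] for p in points]  # start with each object on its own
    while len(clusters) > k:
        # Find the pair of clusters with maximum similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
print(result)
```

Recording each merge instead of stopping at `k` would give the full hierarchy; the fixed `k` is used here only to keep the example short.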
Hierarchical clustering variants • Various ways of calculating cluster similarity • Single-link (minimum distance between members) • Complete-link (maximum distance between members) • Group-average (average over members)
Single Link • Similarity of the two most similar members • Time complexity • O(n²) • Locally coherent • Close objects are in the same cluster • Chaining effect (can produce elongated clusters)
Complete Link • Similarity of the two least similar members • Time complexity • O(n³) • Focused on global cluster quality • Avoids elongated clusters
Group average • Average similarity between members • Time complexity • O(n²) • Compromise between single-link and complete-link
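The three ways of scoring a candidate merge can be compared side by side. The sketch below again assumes 1-D points and a hypothetical `similarity` (negative absolute difference). For group-average it assumes the average is taken over all pairs in the merged cluster; another common definition averages only the cross-cluster pairs.

```python
from itertools import combinations

def similarity(a, b):
    """Hypothetical similarity for 1-D points: larger means closer."""
    return -abs(a - b)

def single_link(c1, c2):
    """Similarity of the two most similar members."""
    return max(similarity(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Similarity of the two least similar members."""
    return min(similarity(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    """Average similarity over all pairs in the merged cluster (assumed definition)."""
    pairs = list(combinations(c1 + c2, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

c1, c2 = [1.0, 2.0], [4.0, 7.0]
scores = {
    "single": single_link(c1, c2),
    "complete": complete_link(c1, c2),
    "average": group_average(c1, c2),
}
print(scores)
```

On this example the group-average score falls between the other two.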
K-means clustering • Defines clusters by the center of mass of their members • Initial cluster centers are selected at random • Assign each object to the cluster whose center is nearest • Re-compute the center of each cluster • Repeat the assignment and re-computation steps until the stopping criterion is satisfied
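The steps above can be sketched in plain Python. The assumptions here are not from the slides: points are tuples of floats, distance is squared Euclidean, the stopping criterion is "centers no longer change" (with a `max_iter` cap), an empty cluster keeps its old center, and the `seed` parameter exists only to make the random initialization repeatable.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Center of mass of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: re-compute centers (keep the old one if a cluster is empty).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # stopping criterion: nothing moved
            break
        centers = new_centers
    return centers, clusters

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(data, k=2)
print(centers)
```

In general the result depends on the initial centers, so it is common to run the procedure several times with different seeds and keep the best clustering.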
Single-pass threshold
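The slide gives only a title, so the following is a guess at the usual single-pass scheme, not a reconstruction of the original content. Assumptions: objects are 1-D numbers processed in order, each object is compared with the centroid of every existing cluster, and it joins the closest cluster if the distance is within a hypothetical `threshold`; otherwise it starts a new cluster.

```python
def single_pass(points, threshold):
    """Incremental clustering in one pass over the data (assumed scheme)."""
    clusters = []
    for p in points:
        best, best_dist = None, None
        for c in clusters:
            centroid = sum(c) / len(c)
            d = abs(p - centroid)
            if best_dist is None or d < best_dist:
                best, best_dist = c, d
        if best is not None and best_dist <= threshold:
            best.append(p)        # close enough: join the nearest cluster
        else:
            clusters.append([p])  # too far from everything: start a new cluster
    return clusters

groups = single_pass([1.0, 1.3, 5.0, 0.9, 5.4, 9.0], threshold=1.0)
print(groups)
```

The outcome depends on both the threshold and the order in which the objects arrive.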
Properties of hierarchical and flat clustering • Hierarchical • Preferable for detailed data analysis • Provides more information than flat • No single best algorithm (dependent on application) • Less efficient than flat (N × N similarity matrix required) • Flat (non-hierarchical) • Preferable if efficiency is a consideration or data sets are very large • K-means is the conceptually simplest method • K-means assumes a simple Euclidean representation space and so can't be used for many data sets