This article provides an overview of clustering methods, focusing on K-means clustering and its issues. It discusses data recovery clustering, one-by-one clustering, and other approaches. The article also addresses the issues of determining K and weighting variables. Computational experiments are conducted to evaluate different methods.
The K-Means Method in Cluster Analysis and Its Intellectualization. B.G. Mirkin, Professor, Department of Data Analysis and Artificial Intelligence, NRU HSE, Moscow, RF; Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK
Outline: • Clustering as empirical classification • K-Means and its issues: • (1) Determining K and initialization • (2) Weighting variables • Addressing (1): • Data recovery clustering and K-Means (Mirkin 1987, 1990) • One-by-one clustering: Anomalous patterns and iK-Means • Other approaches • Computational experiment • Addressing (2): • Three-stage K-Means • Minkowski K-Means • Computational experiment • Conclusion
WHAT IS CLUSTERING; WHAT IS DATA • K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids • WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering • DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward;Extensions to Other Data Types; One-by-One Clustering • DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters • GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
Referred recent work:
• B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: An experimental study with different cluster spreads, J. of Classification, 27, 1, 3-41
• B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1, 3, 252-60
• B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-75
What is clustering? • Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis
Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996). Pluto doesn't fit either of the two clusters of planets; it originated another cluster (September 2006)
Example: A Few Clusters Clustering interface to WEB search engines (Grouper): Query: Israel (after O. Zamir and O. Etzioni 2001)
Clustering algorithms: • Nearest neighbour • Agglomerative clustering • Divisive clustering • Conceptual clustering • K-means • Kohonen SOM • Spectral clustering • ………………….
Batch K-Means: a generic clustering method
Entities are presented as multidimensional points (*)
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the clusters thus obtained
3. Iterate 1 and 2 until convergence
[Figure: scatter of points (*) with K = 3 hypothetical centroids (@)]
K-Means: a generic clustering method
Entities are presented as multidimensional points (*)
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the clusters thus obtained
3. Iterate 1 and 2 until convergence
4. Output final centroids and clusters
[Figure: final positions of the three centroids (@) among the points (*)]
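Steps 0 to 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's implementation: the function name `batch_kmeans`, the `max_iter` cap, the squared Euclidean distance, and the toy data are all assumptions made for the example.

```python
import numpy as np

def batch_kmeans(Y, seeds, max_iter=100):
    """Batch K-Means: alternate minimum-distance assignment and centroid update."""
    Y = np.asarray(Y, dtype=float)
    centroids = np.asarray(seeds, dtype=float).copy()   # Step 0: the K seeds
    labels = None
    for _ in range(max_iter):
        # Step 1: squared Euclidean distance from every point to every centroid.
        dist = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 3: assignments stopped changing, so the method converged.
        labels = new_labels
        # Step 2: move each centroid to the gravity centre (mean) of its cluster.
        for k in range(len(centroids)):
            members = Y[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    # Summary squared distance of the points to their final centroids.
    W = float(((Y - centroids[labels]) ** 2).sum())
    return labels, centroids, W  # Step 4: output

# Hypothetical toy data: two tight groups of three points each.
Y = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centroids, W = batch_kmeans(Y, seeds=Y[[0, 3]])
```

With seeds taken from different groups the result is the expected two-cluster partition; seeds taken from the same group may lead to a different, shallower minimum, which is the initialisation issue discussed next.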
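K-Means criterion: summary distance to cluster centroids. Minimize $W(S,c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$, where $S_k$ is the k-th cluster, $c_k$ its centroid, and $d$ the distance (squared Euclidean in the conventional case).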
Advantages of K-Means
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, 'on-line'
Shortcomings of K-Means
- Initialisation: no advice on K or initial centroids
- No deep minima
- No defence against irrelevant features
Initial centroids, two-cluster case:
[Figure: correct initial centroids; panels: Initial, Final]
[Figure: different initial centroids leading to a wrong result; panels: Initial, Final]
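(1) To address: • Number of clusters. Issue: the minimum of the criterion decreases as K grows, $W_K < W_{K-1}$, so the criterion by itself cannot indicate the right K • Initial setting • Deeper minimum • The latter two are interrelated: a good initial setting leads to a deeper minimum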
Number K: conventional approach
• Take a range $R_K$ of K, say K = 3, 4, …, 15
• For each $K \in R_K$, run K-Means 100-200 times from randomly chosen initial centroids and keep the best result, $W(S,c) = W_K$
• Compare $W_K$ for all $K \in R_K$ in a special way and choose the best, using, for example:
• Gap statistic (2001)
• Jump statistic (2003)
• Hartigan (1975): in ascending order of K, pick the first K at which $H_K = (W_K / W_{K+1} - 1)(N - K - 1) \le 10$
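Hartigan's rule is easy to state in code once the best criterion values are available. The sketch below assumes the rule in its usual form, H_K = (W_K / W_{K+1} - 1)(N - K - 1) with threshold 10; the function name `hartigan_k` and the W values in the example are hypothetical, made up for illustration only.

```python
def hartigan_k(W, N, threshold=10.0):
    """Return the first K (in ascending order) with H_K <= threshold.

    W maps K to the best K-Means criterion value W_K found for that K;
    N is the number of entities. Returns None if no K qualifies.
    """
    for K in sorted(W):
        if K + 1 not in W:
            break  # H_K needs W_{K+1}, which was not computed
        H = (W[K] / W[K + 1] - 1.0) * (N - K - 1)
        if H <= threshold:
            return K
    return None

# Hypothetical criterion values for N = 100 entities.
W_values = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 95.0, 5: 92.0}
best_K = hartigan_k(W_values, N=100)
```

In this made-up example the criterion drops sharply up to K = 3 and only marginally afterwards, so the rule stops at K = 3.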
(1) Addressing • *Number of clusters • * Initial setting • with a PCA-like method in the data recovery approach
Representing a partition. Cluster k: centroid $c_{kv}$ (v: feature); binary 1/0 membership $z_{ik}$ (i: entity)
Basic equations (same as for PCA, but score vectors $z_k$ constrained to be binary). Notation: y: data entry; z: 1/0 membership, not score; c: cluster centroid; N: cardinality; i: entity; v: feature/category; k: cluster
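The formula on this slide did not survive extraction. A plausible reconstruction from the notation legend, and from the single-cluster model given later for Anomalous Pattern clustering, is the following; it should be read as an assumption about what the slide showed, not as a quotation.

```latex
y_{iv} = \sum_{k=1}^{K} c_{kv}\, z_{ik} + e_{iv},
\qquad z_{ik} \in \{0, 1\},
\qquad \sum_{k=1}^{K} z_{ik} = 1 ,
```

Here $e_{iv}$ is the residual, and the last condition says that every entity belongs to exactly one cluster. In PCA the $z_{ik}$ would be unconstrained scores; here they are cluster memberships.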
Quadratic data scatter decomposition (Pythagorean). K-Means: alternating least-squares minimisation. Notation: y: data entry; z: 1/0 membership; c: cluster centroid; N: cardinality; i: entity; v: feature/category; k: cluster
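The decomposition formula itself is missing from the extracted slide. In its usual form, assumed here, the data scatter T = Σ y_iv² splits into a part explained by the clusters, B = Σ_k N_k Σ_v c_kv², and the unexplained part W, which is the K-Means criterion with squared Euclidean distance. A short numerical check on random data (the function name `scatter_decomposition` and the data are illustrative assumptions):

```python
import numpy as np

def scatter_decomposition(Y, labels):
    """Return (T, B, W): data scatter, explained part, within-cluster part."""
    Y = np.asarray(Y, dtype=float)
    T = float((Y ** 2).sum())
    B = 0.0
    W = 0.0
    for k in np.unique(labels):
        S = Y[labels == k]
        c = S.mean(axis=0)                    # centroid = within-cluster mean
        B += len(S) * float((c ** 2).sum())   # N_k * sum_v c_kv^2
        W += float(((S - c) ** 2).sum())      # squared residuals e_iv
    return T, B, W

rng = np.random.default_rng(0)
Y = rng.normal(size=(12, 3))
Y = Y - Y.mean(axis=0)            # centre the data at the grand mean
labels = np.arange(12) % 3        # an arbitrary partition into 3 clusters
T, B, W = scatter_decomposition(Y, labels)
```

Because T is fixed by the data, minimising W over partitions is the same as maximising B, which is what makes several of the criteria on the following slides equivalent.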
Equivalent criteria (1)
• A. Bilinear residuals squared: MIN. Minimizing difference between data and cluster structure
• B. Distance-to-centre squared: MIN. Minimizing difference between data and cluster structure
Equivalent criteria (2)
• C. Within-group error squared: MIN. Minimizing difference between data and cluster structure
• D. Within-group variance squared: MIN. Minimizing within-cluster variance
Equivalent criteria (3)
• E. Semi-averaged within distance squared: MIN. Minimizing dissimilarities within clusters
• F. Semi-averaged within similarity squared: MAX. Maximizing similarities within clusters
Equivalent criteria (4)
• G. Distant centroids: MAX. Finding anomalous types
• H. Consensus partition: MAX. Maximizing correlation between sought partition and given variables
Equivalent criteria (5)
• I. Spectral clusters: MAX. Maximizing summary Rayleigh quotient over binary vectors
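PCA-inspired Anomalous Pattern clustering: $y_{iv} = c_v z_i + e_{iv}$, where $z_i = 1$ if $i \in S$ and $z_i = 0$ if $i \notin S$. With squared Euclidean distance, the centroid $c_S$ must be anomalous, that is, distant from the reference point and therefore interesting.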
Initial setting with Anomalous Pattern cluster
[Figure: Tom Sawyer illustration]
Anomalous Pattern clusters: iterate
[Figure: Tom Sawyer illustration with clusters numbered 1, 2, 3]
iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?) [Figure: final clusters]
iK-Means: Defining K and initial setting with iterative Anomalous Pattern clustering
• Find all Anomalous Pattern clusters
• Remove smaller (e.g., singleton) clusters
• Put the number of remaining clusters as K and initialise K-Means with their centres
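The three steps can be sketched as follows. This is an illustrative reading of the slides, not the author's code: it assumes the data are centred so that the grand mean serves as the reference point, that each Anomalous Pattern starts from the entity farthest from the reference point, and that a cluster keeps the entities closer to its centre than to the reference point. The names `anomalous_pattern`, `ik_means`, and `min_size` are hypothetical.

```python
import numpy as np

def anomalous_pattern(Y, max_iter=100):
    """Extract one anomalous cluster from rows of Y; the reference point is 0."""
    d0 = (Y ** 2).sum(axis=1)         # squared distance to the reference point
    c = Y[d0.argmax()].copy()         # seed: entity farthest from the reference
    member = np.zeros(len(Y), dtype=bool)
    for _ in range(max_iter):
        dc = ((Y - c) ** 2).sum(axis=1)
        new_member = dc < d0          # closer to c than to the reference point
        if not new_member.any() or np.array_equal(new_member, member):
            break
        member = new_member
        c = Y[member].mean(axis=0)
    if not member.any():              # degenerate case: all rows at the reference
        member = np.ones(len(Y), dtype=bool)
        c = Y.mean(axis=0)
    return member, c

def ik_means(Y, min_size=2, max_iter=100):
    """Choose K and seeds by Anomalous Patterns, then run K-Means from them."""
    Y = np.asarray(Y, dtype=float)
    mean = Y.mean(axis=0)
    Yc = Y - mean                     # grand mean becomes the reference point
    remaining = np.arange(len(Yc))
    centres = []
    while len(remaining) > 0:         # extract patterns one by one
        member, c = anomalous_pattern(Yc[remaining])
        if member.sum() >= min_size:  # drop small (e.g. singleton) clusters
            centres.append(c)
        remaining = remaining[~member]
    if not centres:
        raise ValueError("no cluster reached min_size")
    centroids = np.array(centres)     # K = number of remaining clusters
    labels = None
    for _ in range(max_iter):         # plain K-Means from the AP centres
        dist = ((Yc[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(len(centroids)):
            if (labels == k).any():
                centroids[k] = Yc[labels == k].mean(axis=0)
    return labels, centroids + mean   # centroids back in the original scale

# Hypothetical toy data: three well-separated groups of four points.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 0], [10, 1], [11, 0], [11, 1],
              [5, 10], [5, 11], [6, 10], [6, 11]], dtype=float)
labels, centroids = ik_means(Y)
```

On this toy set the procedure finds three patterns, so K = 3 is obtained without being specified in advance.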
Study of eight number-of-clusters methods (joint work with Mark Chiang):
• Variance based: • Hartigan (HK) • Calinski & Harabasz (CH) • Jump Statistic (JS)
• Structure based: • Silhouette Width (SW)
• Consensus based: • Consensus Distribution area (CD) • Consensus Distribution mean (DD)
• Sequential extraction of APs (iK-Means): • Least Squares (LS) • Least Moduli (LM)
Experimental results at 9 Gaussian clusters (3 spread patterns), data size 1000 x 15. Two winners counted each time. [Table not recovered; legend: 1-time winner, 2-times winner, 3-times winner]
(2) To address: weighting features according to relevance
• w: feature weights = scale factors
• 3-step K-Means, iterated till convergence:
• Given s, c, find w (weights)
• Given w, c, find s (clusters)
• Given s, w, find c (centroids)
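The first of the three steps (given s and c, find w) has a closed form under one common formulation, assumed here because the slide does not show the criterion: weights are non-negative, sum to 1, and enter the criterion as Σ_v w_v^p D_v, where D_v is the within-cluster dispersion of feature v and p > 1. Minimising under these assumptions gives w_v = 1 / Σ_u (D_v / D_u)^{1/(p-1)}. The sketch uses one global weight per feature; the function name `update_weights` and the `eps` guard are illustrative.

```python
import numpy as np

def update_weights(Y, labels, centroids, p=2.0, eps=1e-12):
    """Given clusters and centroids, return feature weights that sum to 1."""
    Y = np.asarray(Y, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # D_v: within-cluster dispersion of feature v; eps avoids division by zero.
    D = (np.abs(Y - centroids[labels]) ** p).sum(axis=0) + eps
    # ratios[v, u] = (D_v / D_u) ** (1 / (p - 1))
    ratios = (D[:, None] / D[None, :]) ** (1.0 / (p - 1.0))
    return 1.0 / ratios.sum(axis=1)

# Hypothetical example: feature 0 is tight within clusters, feature 1 is noisy.
Y = np.array([[0., 0.], [1., 4.], [10., 1.], [11., 5.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5, 2.0], [10.5, 3.0]])
w = update_weights(Y, labels, centroids, p=2.0)
```

Here the dispersions are D = (1, 16), so the tight feature receives weight 16/17 and the noisy one 1/17: the less a feature scatters within clusters, the larger its weight.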
Minkowski's centres • Minimize $d(c) = \sum_i |y_i - c|^p$ over c, separately for each feature • At Minkowski power p > 1, d(c) is convex • Gradient method
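Since d(c) is convex for p > 1, any one-dimensional convex minimiser will find the Minkowski centre of a feature. The sketch below swaps the gradient method named on the slide for a ternary search, which relies on the same convexity and needs no step size; the function name `minkowski_centre` and the iteration count are assumptions made for this example.

```python
import numpy as np

def minkowski_centre(y, p, n_iter=200):
    """Minimise d(c) = sum_i |y_i - c|**p over c for one feature, p > 1."""
    y = np.asarray(y, dtype=float)
    lo, hi = y.min(), y.max()         # the minimiser lies within the data range

    def d(c):
        return (np.abs(y - c) ** p).sum()

    for _ in range(n_iter):           # each pass shrinks the interval by a third
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if d(m1) < d(m2):
            hi = m2                   # the minimum cannot lie to the right of m2
        else:
            lo = m1                   # the minimum cannot lie to the left of m1
    return (lo + hi) / 2.0

values = [1.0, 2.0, 6.0]                  # hypothetical feature values
centre_p2 = minkowski_centre(values, 2)   # p = 2 gives the mean
centre_p3 = minkowski_centre(values, 3)   # larger p pulls towards the midrange
```

At p = 2 the centre coincides with the mean; as p grows it moves towards the midrange of the values.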
Minkowski's metric effects • The more uniform the distribution of the entities over a feature, the smaller its weight • Uniform distribution: w = 0 • The best Minkowski power is data dependent • The best power can be learnt from data in a semi-supervised manner (with clustering of all objects) • Example: on Fisher's Iris data, iMWK-Means gives only 5 errors (a record)
Conclusion: The data recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects.
Further work:
• Extending the approach to other data types: text, sequence, image, web page
• Upgrading K-Means to address the issue of interpretation of the results
[Figure: data recovery scheme; labels: Data, Coder, Model, Decoder, Data clustering, Clusters, Data recovery]
HEFCE survey of students’ satisfaction • HEFCE method: ALL 93 of highest mark • STRATA: 43 best, ranging 71.8 to 84.6