The K-Means Method in Cluster Analysis and Its Intellectualization

B.G. Mirkin, Professor, Department of Data Analysis and Artificial Intelligence, National Research University Higher School of Economics (NRU HSE), Moscow, Russia; Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK

Outline:
• Clustering as empirical classification
• K-Means and its issues:
  • (1) Determining K and initialization
  • (2) Weighting variables
• Addressing (1):
  • Data recovery clustering and K-Means (Mirkin 1987, 1990)
  • One-by-one clustering: Anomalous patterns and iK-Means
  • Other approaches
  • Computational experiment
• Addressing (2):
  • Three-stage K-Means
  • Minkowski K-Means
  • Computational experiment
• Conclusion

• WHAT IS CLUSTERING; WHAT IS DATA
• K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
• WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
• DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
• DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
• GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

Referred recent work:
• B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: An experimental study with different cluster spreads, Journal of Classification, 27, 1, 3-41
• B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1, 3, 252-60
• B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-75

What is clustering? • Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis

Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996). Pluto doesn’t fit in the two clusters of planets: it originated another cluster (September 2006)

Example: A Few Clusters Clustering interface to WEB search engines (Grouper): Query: Israel (after O. Zamir and O. Etzioni 2001)

Clustering algorithms: • Nearest neighbour • Agglomerative clustering • Divisive clustering • Conceptual clustering • K-means • Kohonen SOM • Spectral clustering • ………………….

Batch K-Means: a generic clustering method
Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds), e.g. K = 3 hypothetical centroids (@)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the clusters thus obtained
3. Iterate 1 and 2 until convergence
4. Output final centroids and clusters
[Figure: entities (*) and K = 3 hypothetical centroids (@), initial and final positions]
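Steps 0 to 4 can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance as the minimum distance rule, and the function names and the toy data are made up for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gravity_centre(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def batch_k_means(points, seeds, max_iter=100):
    """Batch K-Means started from the given seeds (step 0)."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Step 1: minimum distance rule.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Step 3: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: move each centroid to the gravity centre of its cluster.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = gravity_centre(members)
    # Step 4: output final centroids and clusters.
    return centroids, labels


# Toy data (made up): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = batch_k_means(points, seeds=[(0.0, 0.0), (9.0, 9.0)])
```

The result depends on the seeds passed in, which is exactly the initialization issue raised later in the talk.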

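K-Means criterion: summary distance to cluster centroids. Minimize $W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$, where $S_k$ is the k-th cluster, $c_k$ is its centroid and $d$ is the squared Euclidean distance.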

Advantages of K-Means:
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, "on-line"
Shortcomings of K-Means:
- Initialisation: no advice on K or initial centroids
- No deep minima
- No defence against irrelevant features

[Figure: two-cluster case. Initial centroids: correct (panels: Initial, Final)]

[Figure: two-cluster case. Different initial centroids: wrong (panels: Initial, Final)]

(1) To address:
• Number of clusters
  • Issue: the criterion decreases as K grows, $W_K < W_{K-1}$, so it cannot select K by itself
• Initial setting
• Deeper minimum
• The last two are interrelated: a good initial setting leads to a deeper minimum

Number K: conventional approach
• Take a range $R_K$ of K, say K = 3, 4, …, 15
• For each $K \in R_K$: run K-Means 100-200 times from randomly chosen initial centroids and choose the best of the results, $W(S, c) = W_K$
• Compare $W_K$ for all $K \in R_K$ in a special way and choose the best, such as:
  • Gap statistic (2001)
  • Jump statistic (2003)
  • Hartigan (1975): in the ascending order of K, pick the first K at which $H_K = (W_K / W_{K+1} - 1)(N - K - 1) \le 10$
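Hartigan's rule is easy to state in code. The sketch below assumes the best criterion values $W_K$ have already been obtained for consecutive K (for example, by multiple K-Means runs); the function name is hypothetical and the numbers are made up purely for illustration.

```python
def hartigan_number_of_clusters(w_by_k, n_entities, threshold=10.0):
    """Return the first K, in ascending order, at which
    H_K = (W_K / W_{K+1} - 1) * (N - K - 1) <= threshold, or None."""
    for k in sorted(w_by_k):
        if k + 1 not in w_by_k:
            break  # H_K needs W_{K+1}, which is not available
        h_k = (w_by_k[k] / w_by_k[k + 1] - 1.0) * (n_entities - k - 1)
        if h_k <= threshold:
            return k
    return None


# Made-up criterion values for N = 100 entities (illustration only):
# the criterion drops sharply up to K = 3 and flattens afterwards.
w_by_k = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 95.0, 5: 92.0}
chosen_k = hartigan_number_of_clusters(w_by_k, n_entities=100)
```

Returning None when no K in the range passes the threshold keeps the caller aware that the range may need to be extended.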

(1) Addressing:
• Number of clusters
• Initial setting
with a PCA-like method in the data recovery approach

Representing a partition: cluster k is given by its centroid $c_{kv}$ (v - feature) and binary 1/0 membership $z_{ik}$ (i - entity)

Basic equations (same as for PCA, but score vectors $z_k$ constrained to be binary): $y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}$, where $e_{iv}$ is the residual. Notation: y - data entry, z - 1/0 membership (not score), c - cluster centroid, N - cardinality, i - entity, v - feature/category, k - cluster

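Quadratic data scatter decomposition (Pythagorean): $\sum_{i} \sum_{v} y_{iv}^2 = \sum_{k} \sum_{v} c_{kv}^2 N_k + \sum_{k} \sum_{i \in S_k} \sum_{v} (y_{iv} - c_{kv})^2$. K-Means: alternating least-squares minimisation. Notation: y - data entry, z - 1/0 membership, c - cluster centroid, N - cardinality, i - entity, v - feature/category, k - cluster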
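The Pythagorean decomposition can be checked numerically: for any partition with centroids at the within-cluster means, the data scatter splits into a part explained by the cluster structure and the unexplained within-cluster part. The sketch below uses a hypothetical helper and a made-up 4 x 2 data table.

```python
def scatter_decomposition(points, labels):
    """Return (data scatter, explained part, unexplained part)."""
    total = sum(x * x for p in points for x in p)
    explained = 0.0
    unexplained = 0.0
    for k in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == k]
        n_k = len(members)
        # Centroid: within-cluster mean of every feature.
        c_k = [sum(col) / n_k for col in zip(*members)]
        # Explained part: N_k times the sum of squared centroid components.
        explained += n_k * sum(c * c for c in c_k)
        # Unexplained part: squared deviations from the centroid.
        unexplained += sum((x - c) ** 2
                           for p in members for x, c in zip(p, c_k))
    return total, explained, unexplained


# Made-up data: four entities, two features, two clusters.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0), (7.0, 2.0)]
labels = [0, 0, 1, 1]
total, explained, unexplained = scatter_decomposition(points, labels)
```

Since the total is fixed by the data, minimizing the unexplained part is the same as maximizing the explained part.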

Equivalent criteria (1)
A. Bilinear residuals squared (MIN): minimizing the difference between data and cluster structure
B. Distance-to-centre squared (MIN): minimizing the difference between data and cluster structure

Equivalent criteria (2)
C. Within-group error squared (MIN): minimizing the difference between data and cluster structure
D. Within-group variance squared (MIN): minimizing within-cluster variance

Equivalent criteria (3)
E. Semi-averaged within distance squared (MIN): minimizing dissimilarities within clusters
F. Semi-averaged within similarity squared (MAX): maximizing similarities within clusters

Equivalent criteria (4)
G. Distant centroids (MAX): finding anomalous types
H. Consensus partition (MAX): maximizing correlation between the sought partition and the given variables

Equivalent criteria (5)
I. Spectral clusters (MAX): maximizing the summary Rayleigh quotient over binary vectors
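As one illustration of the equivalence, the distance-to-centre criterion (B) and the semi-averaged within-cluster distance (E) give the same value on any partition. The sketch assumes E has the usual form, the sum over clusters of all pairwise squared Euclidean distances within the cluster divided by twice the cluster's cardinality; function names and data are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_to_centre(clusters):
    """Criterion B: summary squared distance to cluster centroids."""
    total = 0.0
    for members in clusters:
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sq_dist(p, centre) for p in members)
    return total


def semi_averaged_within_distance(clusters):
    """Criterion E (assumed form): for each cluster, the sum of squared
    distances over all ordered pairs of members, divided by 2 * N_k."""
    total = 0.0
    for members in clusters:
        pairwise = sum(sq_dist(p, q) for p in members for q in members)
        total += pairwise / (2 * len(members))
    return total


# Made-up partition: one cluster of three entities, one of two.
clusters = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    [(5.0, 5.0), (7.0, 5.0)],
]
b_value = distance_to_centre(clusters)
e_value = semi_averaged_within_distance(clusters)
```

Form E needs only pairwise distances, so it applies even when centroids are not available, for example with dissimilarity data.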

PCA-inspired Anomalous Pattern clustering: $y_{iv} = c_v z_i + e_{iv}$, where $z_i = 1$ if $i \in S$ and $z_i = 0$ if $i \notin S$. With the squared Euclidean distance, the centroid $c_S$ must be anomalous, that is, interesting

[Figure: initial setting with an Anomalous Pattern cluster ("Tom Sawyer" illustration)]

[Figure: Anomalous Pattern clusters 1, 2, 3, extracted iteratively ("Tom Sawyer" illustration)]

iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?). [Figure: final clusters]

iK-Means: Defining K and Initial Setting with Iterative Anomalous Pattern Clustering
• Find all Anomalous Pattern clusters
• Remove smaller (e.g., singleton) clusters
• Put the number of remaining clusters as K and initialise K-Means with their centres
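The three steps above can be sketched as follows. The slides do not spell out how a single Anomalous Pattern is extracted, so this sketch assumes the procedure described in the referred publications: the reference point is the grand mean, the seed is the entity farthest from it, and the cluster collects the entities that are closer to its centroid than to the reference point. All names, the `min_size` parameter and the toy data are made up.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def anomalous_pattern(points, origin, max_iter=100):
    """Extract one Anomalous Pattern; return (centroid, member indices)."""
    # Seed: the entity farthest from the reference point.
    seed = max(range(len(points)), key=lambda i: sq_dist(points[i], origin))
    centroid, members = points[seed], None
    for _ in range(max_iter):
        # Entities closer to the centroid than to the reference point.
        new_members = [i for i, p in enumerate(points)
                       if sq_dist(p, centroid) < sq_dist(p, origin)]
        if not new_members:  # degenerate case: all entities at the origin
            new_members = list(range(len(points)))
        if new_members == members:
            break
        members = new_members
        centroid = mean([points[i] for i in members])
    return centroid, members


def ik_means_seeds(points, min_size=2):
    """Extract Anomalous Patterns one by one; keep the centroids of those
    with at least min_size entities (singletons are dropped by default)."""
    origin = mean(points)  # assumed reference point: the grand mean
    remaining, seeds = list(points), []
    while remaining:
        centroid, members = anomalous_pattern(remaining, origin)
        if len(members) >= min_size:
            seeds.append(centroid)
        chosen = set(members)
        remaining = [p for i, p in enumerate(remaining) if i not in chosen]
    return seeds


def k_means(points, seeds, max_iter=100):
    """Plain batch K-Means started from the given seeds."""
    centroids, labels = list(seeds), None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
    return centroids, labels


# Made-up data: two groups of four entities and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (30.0, -20.0)]
seeds = ik_means_seeds(points)          # K is the number of seeds
centroids, labels = k_means(points, seeds)
```

On this toy data the outlier forms a singleton pattern that is discarded, so K-Means starts from the two group centres; the outlier is then assigned to the nearer of the two clusters.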

Study of eight number-of-clusters methods (joint work with Mark Chiang):
• Variance based:
  • Hartigan (HK)
  • Calinski & Harabasz (CH)
  • Jump Statistic (JS)
• Structure based:
  • Silhouette Width (SW)
• Consensus based:
  • Consensus Distribution area (CD)
  • Consensus Distribution mean (DD)
• Sequential extraction of APs (iK-Means):
  • Least Squares (LS)
  • Least Moduli (LM)

Experimental results at 9 Gaussian clusters (3 spread patterns), 1000 x 15 data size. Two winners counted each time. [Table legend: 1-time winner, 2-times winner, 3-times winner]

(2) To address: weighting features according to relevance
• w: feature weights = scale factors
• 3-step K-Means, iterated till convergence:
  • Given s, c, find w (weights)
  • Given w, c, find s (clusters)
  • Given s, w, find c (centroids)
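A minimal sketch of the three-step loop is given below. The slide does not state the weight formula, so the sketch assumes the cluster-specific update used in Minkowski weighted K-Means, $w_{kv} = 1 / \sum_u (D_{kv} / D_{ku})^{1/(p-1)}$, where $D_{kv}$ is the within-cluster dispersion of feature v and `p` is the Minkowski power (a name chosen here). The loop takes centroids as means, which is exact only at p = 2; the small constant `eps` guarding against zero dispersions is also an assumption.

```python
def minkowski_weights(points, labels, centroids, p, eps=1e-9):
    """Given s and c, find w: one weight vector per cluster."""
    if p <= 1:
        raise ValueError("the weight update assumes p > 1")
    n_features = len(centroids[0])
    weights = []
    for k, centre in enumerate(centroids):
        # Within-cluster dispersion of every feature.
        disp = [eps] * n_features
        for point, lab in zip(points, labels):
            if lab == k:
                for v in range(n_features):
                    disp[v] += abs(point[v] - centre[v]) ** p
        weights.append([
            1.0 / sum((disp[v] / disp[u]) ** (1.0 / (p - 1))
                      for u in range(n_features))
            for v in range(n_features)
        ])
    return weights


def weighted_distance(point, centre, w, p):
    """Minkowski distance (p-th power) with weights as scale factors."""
    return sum((wv * abs(x - c)) ** p for wv, x, c in zip(w, point, centre))


def three_step_k_means(points, seeds, p=2.0, max_iter=50):
    centroids = [tuple(s) for s in seeds]
    n_features = len(points[0])
    weights = [[1.0 / n_features] * n_features for _ in centroids]
    labels = None
    for _ in range(max_iter):
        # Given w, c, find s (clusters).
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: weighted_distance(pt, centroids[k],
                                                weights[k], p))
            for pt in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Given s, w, find c (centroids); the mean is exact at p = 2 only.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
        # Given s, c, find w (weights).
        weights = minkowski_weights(points, labels, centroids, p)
    return labels, centroids, weights


# Made-up data: feature 0 separates two groups, feature 1 is noise.
points = [(0.0, -5.0), (1.0, 5.0), (0.0, 4.0), (1.0, -4.0),
          (10.0, -5.0), (11.0, 5.0), (10.0, 4.0), (11.0, -4.0)]
labels, centroids, weights = three_step_k_means(
    points, seeds=[(0.0, -5.0), (10.0, -5.0)])
```

In this example the noisy second feature ends up with a much smaller weight than the first in both clusters, which is the intended defence against irrelevant features.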

Minkowski's centers
• Minimize over c the summary Minkowski distance d(c) from the cluster's entities to c
• At a Minkowski power greater than 1, d(c) is convex
• Gradient method
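Because d(c) is convex at powers greater than 1, a Minkowski center can be found feature by feature with any one-dimensional descent scheme. The sketch below is a stand-in for the gradient method named on the slide: it bisects on the sign of the derivative of $\sum_i |y_i - c|^p$, which is increasing in c. The function name and the symbol `p` are chosen here for illustration.

```python
def minkowski_center(values, p, tol=1e-10):
    """Minimize sum(|y - c| ** p) over c for a single feature, p > 1."""
    if p <= 1:
        raise ValueError("convexity argument assumes p > 1")

    def derivative(c):
        # d/dc of sum |y - c|^p, up to the positive factor p.
        return sum((1.0 if c > y else -1.0) * abs(c - y) ** (p - 1)
                   for y in values if y != c)

    lo, hi = min(values), max(values)
    # The derivative is <= 0 at lo and >= 0 at hi, and increasing in c,
    # so its zero can be bracketed by bisection.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if derivative(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


values = [0.0, 1.0, 5.0]
center_p2 = minkowski_center(values, p=2.0)    # the mean at p = 2
center_p20 = minkowski_center(values, p=20.0)  # moves toward the midrange
```

At p = 2 the center coincides with the mean; as the power grows it moves toward the midrange of the feature.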

Minkowski's metric effects
• The more uniform the distribution of the entities over a feature, the smaller its weight
• Uniform distribution implies w = 0
• The best Minkowski power is data dependent
• The best Minkowski power can be learnt from data in a semi-supervised manner (with clustering of all objects)
• Example: at Fisher's Iris, iMWK-Means gives 5 errors only (a record)

Conclusion: The data recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects.
Further work:
• Extending the approach to other data types: text, sequence, image, web page
• Upgrading K-Means to address the issue of interpretation of the results
[Diagram labels: Data, Coder (clustering), Model (clusters), Decoder (data recovery)]

HEFCE survey of students’ satisfaction • HEFCE method: ALL 93 of highest mark • STRATA: 43 best, ranging 71.8 to 84.6
