Adapting the Right Measures for K-means Clustering. Junjie Wu (wujj@buaa.edu.cn), Beihang University. Joint work with Hui Xiong (Rutgers Univ.) and Jian Chen (Tsinghua Univ.)
Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks
Clustering and Cluster Validation (pipeline: Input Data → Data Preprocessing → Clustering → Cluster Validation → Output Clusters) • Cluster analysis provides insight into the data by dividing the objects into groups (clusters) such that objects in a cluster are more similar to each other than to objects in other clusters. • Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988] • How to be “quantitative”: employ the measures. • How to be “objective”: validate the measures!
Cluster Validation Measures • A Typical View of Cluster Validation Measures: • External measures • Match a cluster structure to prior information, e.g., class labels. • E.g., Rand index, Γ statistic, F-measure, Mutual Information • Internal measures • Assess the fit between the cluster structure and the data themselves. • E.g., Silhouette index, CPCC, Γ statistic • Relative measures • Decide which of two structures is better; often used for selecting the right clustering parameters, e.g., the number of clusters. • E.g., Dunn’s indices, Davies-Bouldin index, partition coefficient • Other Views: • Partitional vs. Hierarchical Indices • Fuzzy vs. Non-Fuzzy Indices • Statistics-based vs. Information-based Indices
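As a concrete illustration of an external measure, the sketch below computes the (unadjusted) Rand index directly from class labels and cluster labels; the function name and the toy label vectors are our own, not from the slides.

```python
# Minimal sketch of one external measure: the Rand index, i.e. the fraction of
# object pairs on which the class partition and the cluster partition agree.
from itertools import combinations

def rand_index(classes, clusters):
    pairs = list(combinations(range(len(classes)), 2))
    agree = sum(
        (classes[i] == classes[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 5 of 6 pairs agree -> 0.833...
```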
Research Motivations • There is little work on evaluating the effectiveness of cluster validation measures in a systematic way. Many questions remain! • Which measures are widely used? • Are these measures objective? • Why and how should these measures be normalized? • What are the properties and interrelationships of these measures? • How can the right measures be adapted for a specific clustering algorithm? • The answers to the above questions are essential to the success of cluster analysis!
The Scope of this Study • To provide an organized study of external validation measures for K-means clustering. • K-means is a well-known, widely used, and successful clustering method. • 16 external measures studied, 13 remained.
Main Contributions • In general, we provided an organized study of selecting the right measures for K-means clustering. Specifically, we • Reviewed 16 well-known external validation measures; • Identified some defective measures; • Established the importance of measure normalization and designed normalization solutions for several validation measures; • Revealed some major properties of these external measures, such as consistency, sensitivity, and symmetry; • Provided final guidance for adapting the right measures for K-means clustering.
Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks
K-means: The Uniform Effect • For data sets with skewed class distributions, K-means tends to produce clusters with relatively uniform sizes. (Illustrated on the document data set “sports”.)
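The uniform effect is easy to reproduce. The sketch below, which assumes scikit-learn and a made-up two-class data set with a 950/50 split, shows that 2-means typically returns two clusters of far more similar size than the true classes.

```python
# Hedged illustration of the uniform effect on synthetic, skewed data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
big   = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(950, 2))  # large true class
small = rng.normal(loc=[2.5, 0.0], scale=1.0, size=(50, 2))   # small true class
X = np.vstack([big, small])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # cluster sizes are typically much closer to equal than 950 vs. 50
```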
A Necessary Selection Criterion • Two clustering results for a sample data set (figure): the uniform clustering has CV1 = 0, which is far away from the CV0 = 1.166 of the true classes, whereas the other clustering has CV1 = 1.125, close to CV0. A good measure should favor the clustering whose CV1 is close to CV0 (see the sketch below for how CV is computed).
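A small helper for the CV values above, assuming CV here is the coefficient of variation of the cluster sizes (standard deviation over mean); CV0 is computed on the true class sizes and CV1 on the clustering result. The toy label vectors are our own.

```python
# Coefficient of variation (std / mean) of cluster sizes, assumed to be what
# CV0 / CV1 denote on this slide.
import numpy as np

def size_cv(labels):
    sizes = np.bincount(np.asarray(labels))
    sizes = sizes[sizes > 0]                 # ignore labels that never occur
    return sizes.std(ddof=1) / sizes.mean()

true_classes = [0] * 90 + [1] * 5 + [2] * 5      # skewed "true" class sizes
kmeans_like  = [0] * 35 + [1] * 33 + [2] * 32    # near-uniform clustering
print(size_cv(true_classes), size_cv(kmeans_like))  # CV0 is large, CV1 is near 0
```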
Identifying Defective Measures: An Example • The cluster validation results for the two clusterings (table omitted) expose Entropy (E), Purity (P), and Mutual Information (MI) as defective. • Now only 10 measures remain.
Exploring the Defectiveness • Entropy and Purity view the cluster-class contingency table from only one (“horizontal”) direction: Purity is based on Σ_j max_i n_ij / n, leaving out the complementary (“vertical”) term, and Entropy likewise omits the conditional entropy H(P|C). • Mutual Information is also identified as defective and is filtered out together with them. (A sketch of these three measures follows below.)
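For reference, here is a sketch of the three filtered-out measures, Entropy (E), Purity (P), and Mutual Information (MI), computed from the cluster-class contingency table using their common textbook definitions; the helper name and the base-2 logarithm are our own choices.

```python
# E, P, and MI over a contingency table N (rows = clusters, columns = classes).
import numpy as np

def entropy_purity_mi(N):
    N = np.asarray(N, dtype=float)
    n = N.sum()
    p_cluster = N.sum(axis=1) / n                 # cluster size fractions
    p_class   = N.sum(axis=0) / n                 # class size fractions
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = N / N.sum(axis=1, keepdims=True)   # class mix inside each cluster
        h_rows = -np.nansum(np.where(cond > 0, cond * np.log2(cond), 0.0), axis=1)
        P = N / n
        mi = np.nansum(np.where(P > 0, P * np.log2(P / np.outer(p_cluster, p_class)), 0.0))
    entropy = float((p_cluster * h_rows).sum())   # weighted within-cluster entropy
    purity  = float(N.max(axis=1).sum() / n)      # mass of each cluster's dominant class
    return entropy, purity, float(mi)

print(entropy_purity_mi([[45, 5], [5, 45]]))      # toy table: 2 clusters x 2 classes
```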
Improving the Defective Measures • Variation of Information (VI) vs. Entropy (E) • van Dongen criterion (VD) vs. Purity (P)
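A companion sketch of the two improved measures, Variation of Information (VI) and the van Dongen criterion (VD), over the same contingency table; dividing VD by 2n is one common convention and an assumption here, not necessarily the slide's exact form.

```python
# VI = H(C) + H(K) - 2*MI(C, K); VD = (2n - sum of row maxima - sum of column maxima) / (2n).
import numpy as np

def vi_and_vd(N):
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n
    p_row, p_col = P.sum(axis=1), P.sum(axis=0)

    def H(p):                                   # Shannon entropy of a distribution
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    with np.errstate(divide="ignore", invalid="ignore"):
        mi = float(np.nansum(np.where(P > 0, P * np.log2(P / np.outer(p_row, p_col)), 0.0)))
    vi = H(p_row) + H(p_col) - 2 * mi
    vd = (2 * n - N.max(axis=1).sum() - N.max(axis=0).sum()) / (2 * n)
    return vi, vd

print(vi_and_vd([[45, 5], [5, 45]]))
```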
Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks
Two Normalization Methods • Normalization enables measures to be used for comparing clustering results across different data sets. • Two types of normalization schemes: • Statistics-based normalization • Extreme value-based normalization • Basic assumption: the contingency table follows a multivariate hypergeometric distribution with the row and column sums held fixed.
Normalization Solutions • The normalized measures (table of formulas omitted), grouped into Type I and Type II normalizations.
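As an illustration of what such normalizations look like, the sketch below uses the Rand index: the adjusted Rand index in scikit-learn implements the statistics-based form (R - E[R]) / (max R - E[R]) under the hypergeometric model, and the generic helper shows the extreme value-based form (S - min S) / (max S - min S). The helper and the toy labels are our own; the paper's exact per-measure formulas are in its Type I / Type II table.

```python
# Statistics-based vs. extreme value-based normalization, illustrated on the Rand index.
from sklearn.metrics import adjusted_rand_score   # (R - E[R]) / (max R - E[R])

def extreme_value_normalize(score, score_min, score_max):
    """Extreme value-based normalization: rescale a raw score into [0, 1]."""
    return (score - score_min) / (score_max - score_min)

classes  = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_score(classes, clusters))
print(extreme_value_normalize(0.6, 0.0, 1.0))     # placeholder bounds for a generic measure
```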
Test Normalizations: The DCV Criterion and the Settings • The DCV Criterion • DCV = CV1 - CV0 • As the DCV values go down, the clustering results produced by K-means tend to deviate from the “true” class distributions. • As the DCV values go down, a good measure is therefore expected to report worse clustering performance. • The Experimental Setup • Data Sets: simulated + sampled, with increasing DCV. • Tools: MATLAB 7.1, CLUTO 2.1.1
Normalization Experiments: The Results (consistency with DCV measured by Kendall’s rank correlation; see the sketch below) • Remark • If we use the unnormalized measures for cluster validation, only three measures, namely R, Γ, and Γ’, have strong consistency with DCV. • All the normalized measures show perfect consistency with DCV except for Fn and ξn. • Value ranges are wider after normalization.
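A hedged sketch of the consistency check itself: Kendall's rank correlation between the DCV values of a series of data sets and the scores a validation measure assigns to the corresponding K-means results. The numeric vectors are made-up placeholders, not the paper's experimental values.

```python
# Consistency of a measure with DCV, via Kendall's tau (scipy).
from scipy.stats import kendalltau

dcv_values     = [0.9, 0.7, 0.5, 0.3, 0.1]       # data sets ordered by decreasing DCV
measure_scores = [0.95, 0.88, 0.80, 0.64, 0.41]  # hypothetical scores of one normalized measure
tau, p_value = kendalltau(dcv_values, measure_scores)
print(tau, p_value)   # tau near 1 means the measure degrades together with DCV, i.e. strong consistency
```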
The Impact of the Number of Clusters • Remark • The values of all the measures change as the number of clusters increases. • The normalized measures can capture the same optimal cluster number: 5. (Results on the data set “la2”.)
Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks
The Consistency • The Experimental Setup • Data Sets: 29 benchmark document data sets. • Tools: CLUTO. • Consistency metric: Kendall’s rank correlation. • Result: Correlations of the Measures (table omitted) • The normalized measures have much stronger consistency with one another than the unnormalized measures.
The Consistency, Cont’d • Hierarchical Clustering on the Normalized Measures (dendrogram omitted) • Some of the normalized measures are equivalent. • Some are more similar to one another. • Others show inconsistency in varying degrees. • Only 7 normalized measures remained!
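The sketch below shows the general shape of such an analysis, assuming the measures are hierarchically clustered using 1 - |Kendall correlation| between their score vectors as the distance; the measure names and score vectors are hypothetical stand-ins, not the paper's 29-data-set results.

```python
# Hierarchical clustering of validation measures by how similarly they rank clusterings.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from scipy.stats import kendalltau

measures = {                      # hypothetical scores of each measure on a few data sets
    "Rn":  [0.80, 0.55, 0.91, 0.40],
    "FMn": [0.78, 0.57, 0.90, 0.42],
    "VIn": [0.30, 0.60, 0.20, 0.75],   # VI is distance-like (lower is better)
}
names = list(measures)
k = len(names)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        tau, _ = kendalltau(measures[names[i]], measures[names[j]])
        dist[i, j] = dist[j, i] = 1 - abs(tau)   # |tau| ignores the orientation of a measure

Z = linkage(squareform(dist), method="average")
print(dendrogram(Z, labels=names, no_plot=True)["ivl"])  # leaf order groups similar measures
```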
The Sensitivity • Remarks • All the measures show different validation results for the two clusterings except for VDn and Fn. • VIn is the most sensitive measure.
The Selection Process: An Overview • The Way to the Right Measures • Step I: Discard M, MAP, and GK. 13 measures remained. • Step II: Filter out E, P, and MI. 10 measures remained. • Step III: Normalize the measures. 10 normalized measures remained. • Step IV: Discard the measures shown to be equivalent or redundant by the consistency analysis. 7 normalized measures remained. • Step V: Filter out Fn and ξn. 5 normalized measures remained. • Step VI: Discard FMn and Γn. 3 normalized measures remained. • The Three Right Measures for K-means Clustering • Normalized van Dongen criterion (VDn) • Normalized variation of information (VIn) • Normalized Rand index (Rn)
Insights • Guidance for K-means Clustering Validation • It is most suitable to use VDn, since VDn has a simple computational form, satisfies all the mathematically sound properties, and measures well on data with imbalanced class distributions. • When the clustering performances are hard to distinguish, we may use VIn instead, since VIn is highly sensitive to changes in the clustering. • Rn can also be used as a complement to the above two measures.
Outline • Introduction • Defective Validation Measures • Measure Normalization • Measure Properties • Concluding Remarks
Conclusions • In this study, we compared and contrasted external validation measures for K-means clustering. • It is necessary to normalize validation measures before they can be employed for cluster validation. • We provided normalization solutions for the measures whose normalized forms were not previously available. • We summarized the key properties of these measures; these properties should be considered before deciding which measure is right to use in practice. • We investigated the relationships among these validation measures.
Thank You! http://datamining.buaa.edu.cn