A density-based cluster validity approach using multi-representatives
Presenter: Lin, Shu-Han
Authors: Maria Halkidi, Michalis Vazirgiannis
Pattern Recognition Letters 29 (2008)
Outline
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Personal Comments
Motivation
• Clustering algorithms that rest on different assumptions often produce qualitatively different results. The resulting partitionings of a data set therefore need to be evaluated for validity against widely accepted criteria.
• The paper motivates cluster validity assessment with an example: K-Means defines different partitionings when it runs with different input parameter values (ipvs). For each choice it only finds the best possible partitioning for the given ipvs, with no indication that the resulting clusters are the ones that best fit the data.
Fig. The different partitionings defined by K-Means when it runs with different ipvs.
Objectives
• To define and evaluate a new validity index, CDbw (Composed Density between and within clusters), and a methodology that, given a data set S and a set of algorithms A = {alg_i}, enables:
  • (i) finding the input parameter values that lead each alg_i to its best possible clustering result (i.e., the best partitioning of the data set that alg_i can define);
  • (ii) given the results of (i), finding the alg_i that returns the best partitioning of S among those defined by the considered algorithms.
Fig. Partitioning of DS3 into three clusters as defined by different clustering algorithms: (a) K-Means, (b) CURE and (c) DBSCAN, CLUTO.
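The two-step selection in (i) and (ii) can be sketched as a nested search. This is a minimal sketch, not the authors' code: `select_best`, `threshold_split`, `even_split`, and `gap_validity` are hypothetical names, the two "algorithms" are toy one-dimensional splitters, and `gap_validity` is a toy stand-in for CDbw that only illustrates where a validity index plugs in.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

Labels = List[int]
Algorithm = Callable[[Sequence[float], Any], Labels]
Validity = Callable[[Sequence[float], Labels], float]


def select_best(
    data: Sequence[float],
    algorithms: Dict[str, Algorithm],
    param_grid: Dict[str, Iterable[Any]],
    validity: Validity,
) -> Tuple[str, Any, float]:
    """Return (algorithm name, parameter value, score) with the highest validity."""
    best_per_algorithm = {}
    for name, run in algorithms.items():
        # Step (i): best input parameter value for this algorithm.
        scored = [(validity(data, run(data, p)), p) for p in param_grid[name]]
        best_per_algorithm[name] = max(scored, key=lambda sp: sp[0])
    # Step (ii): best algorithm among the per-algorithm winners.
    winner = max(best_per_algorithm, key=lambda n: best_per_algorithm[n][0])
    score, param = best_per_algorithm[winner]
    return winner, param, score


# Hypothetical toy algorithms on 1-D data (assumptions for illustration only).
def threshold_split(data: Sequence[float], t: float) -> Labels:
    return [int(x > t) for x in data]


def even_split(data: Sequence[float], m: int) -> Labels:
    return [int(x % m == 0) for x in data]


# Toy validity index, NOT CDbw: the gap between cluster 0 and cluster 1.
def gap_validity(data: Sequence[float], labels: Labels) -> float:
    low = [x for x, lab in zip(data, labels) if lab == 0]
    high = [x for x, lab in zip(data, labels) if lab == 1]
    if not low or not high:
        return 0.0
    return min(high) - max(low)


data = [1, 2, 3, 10, 11, 12]
result = select_best(
    data,
    {"threshold": threshold_split, "even": even_split},
    {"threshold": [2, 5, 11], "even": [2, 3]},
    gap_validity,
)
print(result)
```

Swapping `gap_validity` for a real validity index leaves the search unchanged, which is the sense in which the methodology is independent of the algorithm being evaluated.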
Methodology
• A cluster validity approach based on density
Fig. Inter-cluster density definition.
Methodology (Cont.)
• (A) Cluster representative points definition
  • Closest representative points
  • Respective closest representative points
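The slide's formulas did not survive, so the following is only one plausible reading of the two bullets: the closest representative of a point is its nearest representative in the other cluster, and a pair counts as "respective closest" when each is the closest representative of the other. The function names and sample coordinates are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Sequence[float]


def closest_rep(v: Point, reps: List[Point]) -> int:
    """Index of the representative in `reps` nearest to point v."""
    return min(range(len(reps)), key=lambda k: dist(v, reps[k]))


def respective_closest_reps(
    reps_i: List[Point], reps_j: List[Point]
) -> List[Tuple[int, int]]:
    """Index pairs (k, l) where reps_i[k] and reps_j[l] are mutually closest."""
    pairs = []
    for k, v in enumerate(reps_i):
        l = closest_rep(v, reps_j)
        # Keep the pair only if the relation also holds in the other direction.
        if closest_rep(reps_j[l], reps_i) == k:
            pairs.append((k, l))
    return pairs


# Hypothetical representatives of two clusters.
reps_i = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
reps_j = [(3.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
pairs = respective_closest_reps(reps_i, reps_j)
print(pairs)
```

Only the representatives facing each other across the gap form a pair here, which is why several representatives per cluster can describe the boundary between two clusters better than a single centroid.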
Methodology (Cont.)
• (B) Clusters’ separation in terms of density
  • Density between clusters
  • Inter-cluster density
  • Clusters’ separation (Sep)
• Stdev: the standard deviation, a measure of the dispersion of a set of values
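As an illustration of measuring density between two clusters, the sketch below counts the share of points that fall near the midpoint of a pair of facing representatives, using a radius in the role that stdev plays on the slide. This is an assumption about how the idea can be realised, not the paper's exact formula; `midpoint`, `neighbourhood_fraction`, and the sample data are hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Sequence[float]


def midpoint(a: Point, b: Point) -> Tuple[float, ...]:
    """Point halfway between a and b."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def neighbourhood_fraction(points: List[Point], centre: Point, radius: float) -> float:
    """Fraction of `points` lying strictly within `radius` of `centre`."""
    inside = sum(1 for p in points if dist(p, centre) < radius)
    return inside / len(points)


# Two hypothetical clusters on a line and one pair of facing representatives.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (8.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centre = midpoint((2.0, 0.0), (8.0, 0.0))
radius = 1.5  # stands in for stdev; chosen by hand for this toy example

well_separated = neighbourhood_fraction(points, centre, radius)
bridged = neighbourhood_fraction(points + [(5.0, 0.5)], centre, radius)
print(well_separated, bridged)
```

An empty region between the clusters gives a value of zero, while any point bridging the gap raises it, so a lower value indicates better-separated clusters.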
Methodology (Cont.)
• (C) Clusters’ compactness in terms of density
  • The compactness of a clustering
  • Relative intra-cluster density
• s ∈ [0.1, 0.8] (user-defined)
Fig. Labels: Ci.center, Vij, stdev; s = 0.1, 0.2, ..., 0.8.
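The figure labels suggest that each representative Vij is moved toward Ci.center by a factor s, with s = 0.8 lying closest to the center. A minimal sketch under that assumption follows; `shrink` and the sample coordinates are hypothetical.

```python
from typing import Sequence, Tuple


def shrink(rep: Sequence[float], center: Sequence[float], s: float) -> Tuple[float, ...]:
    """Move `rep` toward `center` by the fraction s (s=0 keeps it, s=1 reaches the center)."""
    return tuple(v + s * (c - v) for v, c in zip(rep, center))


# The slide's range of user-defined values: 0.1, 0.2, ..., 0.8.
s_values = [round(0.1 * k, 1) for k in range(1, 9)]

center = (0.0, 0.0)
rep = (4.0, 2.0)
halfway = shrink(rep, center, 0.5)
shrunk = [shrink(rep, center, s) for s in s_values]
print(halfway, shrunk[-1])
```

Evaluating the density around the shrunken representatives for each s gives one intra-cluster density value per s, which is what the compactness bullets refer to.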
Methodology (Cont.)
• (D) Assessing the quality of a data clustering
  • Clusters’ cohesion
    • Intra-density changes
    • Cohesion
  • Separation wrt compactness
• (E) CDbw definition
• s ∈ [0.1, 0.8] (user-defined)
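The formulas for (D) and (E) did not survive on the slide. The sketch below assumes the composition Compactness = mean intra-cluster density over the s values, Intra_change = mean absolute change between consecutive s values, Cohesion = Compactness / (1 + Intra_change), SC = Sep × Compactness, and CDbw = Cohesion × SC. Treat this as an assumption to check against the paper; the function name and the input numbers are hypothetical.

```python
from typing import Sequence


def cdbw_from_parts(sep: float, intra_dens: Sequence[float]) -> float:
    """Combine a separation value and per-s intra-cluster densities into one score."""
    n = len(intra_dens)
    if n < 2:
        raise ValueError("need intra-cluster densities for at least two values of s")
    # Compactness: average intra-cluster density over the s values.
    compactness = sum(intra_dens) / n
    # Intra-density changes: average absolute change between consecutive s values.
    steps = zip(intra_dens, intra_dens[1:])
    intra_change = sum(abs(b - a) for a, b in steps) / (n - 1)
    cohesion = compactness / (1 + intra_change)
    # Separation with respect to compactness.
    sc = sep * compactness
    return cohesion * sc


# Same average density, but the second clustering fluctuates as s changes.
even = cdbw_from_parts(2.0, [0.5, 0.5, 0.5, 0.5])
uneven = cdbw_from_parts(2.0, [0.25, 0.75, 0.25, 0.75])
print(even, uneven)
```

With equal separation and equal average density, the clustering whose density changes more across s scores lower, which matches the role the cohesion criterion plays in the conclusions.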
Experiments
• (A) Select the partitioning that best fits the data set
Fig. CDbw as a function of the number of clusters for DS1 (CLUTO).
Fig. Nd_Set: CDbw vs. the number of clusters for a 120-dimensional data set.
Experiments (Cont.)
• (B) Select clustering algorithm
Table. Best partitioning found by CDbw for different clustering algorithms.
Experiments (Cont.)
Fig. Synthetic data sets: (a) DS1 and partitioning of DS1 using CLUTO, (b) K-Means, and (c) CURE.
Fig. Partitioning of DS3 into three clusters as defined by different clustering algorithms: (a) K-Means, (b) CURE and (c) DBSCAN, CLUTO.
Experiments (Cont.)
Table. Accuracy of the clusterings presented with respect to the expected partitioning of DS2.
• (C) Comparison to other cluster validity indices
Fig. Partitioning of DS2 into four clusters as defined by (a) K-Means, (b) CURE, (c) the CLUTO algorithm and (d) DBSCAN.
Table. Best partitioning proposed by validity indices compared with CDbw*.
Conclusions
• The paper defines a new validity index, CDbw, and a methodology for finding, among the clusterings defined by one algorithm or by different clustering algorithms, the one that best fits the data.
• It achieves this by considering multiple representative points per cluster. Unlike other validity indices, it includes a cohesion criterion that estimates density changes within clusters.
Personal Comments
• Advantage
  • Accuracy
  • Data independent
  • Algorithm independent
• Drawback
  • …
• Application
  • …