COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity Eric Bae and James Bailey Proceedings of the IEEE International Conference on Data Mining, ICDM, 2006 Presenter: 吳建良
Outline • Motivation • Problem Definition • COALA • Quantitative Evaluation • Experiment
Motivation • Traditional clustering techniques • Produce only a single solution • It is difficult for the user to validate whether the solution is in fact appropriate, particularly if the dataset is large and complex • The user has limited knowledge about the clustering algorithm being used • Goal: provide another, alternative clustering solution • High quality, yet different from the original solution
Related Work - Ensemble Clustering • Objective • Generate multiple clusterings • Merge them to offer a final consensus clustering • Method • Apply many algorithms • Change initial conditions of an algorithm • Random samples of data
Challenges • An inability to know which algorithms to apply and how many • A difficulty in quantitatively evaluating the degree of (dis)similarity/quality for the candidate solutions • The inefficiency of running algorithms multiple times
Requirements of clustering • Dissimilarity requirement • Given two clusterings C and S, they can be presented as solutions if they are as dissimilar from one another as possible • Enforced through cannot-link constraints • Quality requirement • Given two clusterings C and S, they can be considered as solutions only if both are high-quality clusterings • Enforced through a quality threshold ω
Problem Definition • Given a clustering C (provided as pre-defined class labels) with r clusters, find a second clustering S with r clusters that has high dissimilarity to C but also satisfies the quality requirement threshold ω
Notation • D = {x1, x2, …, xn}: a set of n objects • C = {c1, c2, …, cr}: the existing clustering (background knowledge) • S = {s1, s2, …, sr}: the new clustering sought with respect to C • d(ci, cj): the distance between clusters ci and cj • Average linkage: the average distance over all pairs of objects drawn one from ci and one from cj
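Written out, the average-linkage cluster distance used above is the standard formulation (stated here for completeness; the original slide showed it as a figure):

d(c_i, c_j) = \frac{1}{|c_i|\,|c_j|} \sum_{x \in c_i} \sum_{y \in c_j} d(x, y)

where d(x, y) is the underlying object-level distance.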
Cannot-link Constraints • A cannot-link constraint is a pair of distinct data objects (xi, xj) • In any feasible clustering, objects xi and xj must not be in the same cluster • Cannot-link constraints are used to ensure that the second clustering S is dissimilar from the given clustering C • Each pair of objects that is in the same cluster in C is added to the constraint set L (sketched below)
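As a minimal sketch (our own illustrative Python, not the authors' code), the constraint set L can be materialized from the given clustering C like this:

from itertools import combinations

def build_cannot_link(C):
    # Every pair of objects co-clustered in C becomes a cannot-link constraint.
    L = set()
    for cluster in C:
        for xi, xj in combinations(sorted(cluster), 2):
            L.add((xi, xj))
    return L

# Clustering from the example slides: C = {{A,B,C,D}, {E,F}}
C = [{"A", "B", "C", "D"}, {"E", "F"}]
L = build_cannot_link(C)
# {('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D'), ('E','F')}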
COALA • An agglomerative hierarchical clustering algorithm • Starts by treating each object as a singleton cluster • At each iteration, two candidate pairs of clusters for a possible merge are found • Qualitative pair (cq1, cq2): the pair with the minimum distance over all pairs of clusters • Dissimilar pair (co1, co2): the pair with the minimum distance over all pairs of clusters that also satisfies the cannot-link constraints
COALA (contd.) • Quality threshold ω balances the trade-off between the qualitative merge and the dissimilar merge • If d(cq1, cq2) / d(co1, co2) < ω then merge(cq1, cq2), else merge(co1, co2) • If no pair of clusters satisfies the cannot-link constraints, the algorithm proceeds with the merge of the qualitative pair
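Putting the two COALA slides together, one plausible reading of the merge loop is the sketch below (illustrative Python, not the authors' code; avg_linkage, violates, and the ratio test d(cq1, cq2)/d(co1, co2) < ω are our reconstruction, and dist is any object-level distance function):

from itertools import combinations

def avg_linkage(ci, cj, dist):
    # Average-linkage distance between clusters ci and cj.
    return sum(dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

def violates(ci, cj, L):
    # Merging ci and cj violates L if it would co-cluster a cannot-linked pair.
    return any((min(x, y), max(x, y)) in L for x in ci for y in cj)

def coala(objects, L, omega, dist, r):
    clusters = [frozenset([x]) for x in objects]
    while len(clusters) > r:
        pairs = [(avg_linkage(ci, cj, dist), ci, cj)
                 for ci, cj in combinations(clusters, 2)]
        dq, q1, q2 = min(pairs, key=lambda p: p[0])        # qualitative pair
        allowed = [p for p in pairs if not violates(p[1], p[2], L)]
        if allowed:
            do, o1, o2 = min(allowed, key=lambda p: p[0])  # dissimilar pair
            a, b = (q1, q2) if dq / do < omega else (o1, o2)
        else:
            a, b = q1, q2  # no constraint-satisfying pair: qualitative merge
        clusters = [c for c in clusters if c is not a and c is not b] + [a | b]
    return clusters

The worked example on the next slides can be traced through this loop step by step.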
Example • C = {{A, B, C, D}, {E, F}}, quality threshold ω = 0.6 • Initialization: each point forms a singleton cluster • Cannot-link constraint set L: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (E,F)
Example (contd.) • Minimal qualitative pair: (A,B), (B,C), or (D,F); suppose (A,B) is picked • Minimal dissimilar pair: (D,F) • d(A,B)/d(D,F) = 1 ≥ ω, so the dissimilar pair (D,F) is merged
Example (contd.) • Minimal qualitative pair: (A,B) or (B,C); suppose (A,B) is picked • Minimal dissimilar pair: (C,E) • d(A,B)/d(C,E) < ω, so the qualitative pair (A,B) is merged
Example (contd.) • Minimal qualitative pair: ({A,B}, C) • Minimal dissimilar pair: (C,E) • d({A,B}, C)/d(C,E) < ω, so the qualitative pair ({A,B}, C) is merged
Example (contd.) • Minimal qualitative pair: ({D,F}, E) • Minimal dissimilar pair: ({A,B,C}, E) • d({D,F}, E)/d({A,B,C}, E) < ω, so the qualitative pair ({D,F}, E) is merged • Final result: S = {{A, B, C}, {D, E, F}}
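The whole example can be replayed on the sketch from the COALA slide; the coordinates below are hypothetical, chosen only so that the pairwise distances order the merges as in the original figures:

import math

points = {"A": (0, 1), "B": (1, 0), "C": (2, 1),
          "D": (6, 1), "E": (8, 3), "F": (7, 0)}
dist = lambda x, y: math.dist(points[x], points[y])
C = [{"A", "B", "C", "D"}, {"E", "F"}]

S = coala(points, build_cannot_link(C), omega=0.6, dist=dist, r=2)
# S == [frozenset({'A','B','C'}), frozenset({'D','E','F'})]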
Quantitative Evaluation • Dissimilarity • Jaccard index: J(C, S) = N11 / (N11 + N01 + N10) • N11: the number of pairs of points in the same cluster in both C and S • N00: the number of pairs in different clusters in both C and S • N01, N10: the number of pairs in the same cluster in one clustering but not in the other • Quality • Dunn index: DI(S) = min_{i≠j} δ(si, sj) / max_k Δ(sk) • δ: cluster-to-cluster distance • Δ: cluster diameter measure
Quantitative Evaluation (contd.) • Jaccard index value ↓ ⇒ dissimilarity ↑ • Dunn index value ↑ ⇒ quality ↑ • Overall clustering score (DQ): combines the two measures, rewarding clusterings that are both dissimilar (low Jaccard) and of high quality (high Dunn); both indices are sketched below
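As a minimal sketch of the two measures (illustrative Python; flat label vectors for the Jaccard index, and the same dist convention as before for the Dunn index, are our assumptions):

from itertools import combinations

def jaccard_index(labels_c, labels_s):
    # Pair-counting Jaccard: N11 / (N11 + N01 + N10); lower = more dissimilar.
    n11 = n_mixed = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_s = labels_s[i] == labels_s[j]
        if same_c and same_s:
            n11 += 1
        elif same_c or same_s:
            n_mixed += 1  # N01 + N10 counted together
    return n11 / (n11 + n_mixed)

def dunn_index(clusters, dist):
    # Min inter-cluster distance (δ) over max cluster diameter (Δ);
    # δ is taken here as the closest pair across clusters, one common choice.
    delta = min(min(dist(x, y) for x in ci for y in cj)
                for ci, cj in combinations(clusters, 2))
    Delta = max(max((dist(x, y) for x, y in combinations(sorted(c), 2)),
                    default=0.0)
                for c in clusters)
    return delta / Delta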
Experiment • Synthetic datasets
Two competing approaches • Naïve method (sketched below) • Apply the k-means algorithm three times using different initial points • Select the two clusterings with the highest DQ • From those two, select the clustering with higher quality as the 'known' clustering • CIB (Conditional Information Bottleneck) • Retrieves dissimilar clusterings • Finds the optimal assignment of objects to clusters while preserving as much feature information as possible, conditioned on the information provided by the pre-defined class labels
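A hedged sketch of the naïve baseline (using scikit-learn's KMeans; dq_score and quality are stand-ins for the paper's DQ measure and a quality measure such as the Dunn index, since the slide does not give their exact forms):

from itertools import combinations
from sklearn.cluster import KMeans

def naive_baseline(X, r, dq_score, quality, seeds=(0, 1, 2)):
    # Run k-means three times from different initial points.
    runs = [KMeans(n_clusters=r, n_init=1, random_state=s).fit_predict(X)
            for s in seeds]
    # Keep the pair of clusterings with the highest DQ.
    c1, c2 = max(combinations(runs, 2), key=lambda pair: dq_score(*pair))
    # The higher-quality one of the pair serves as the 'known' clustering.
    known, alternate = (c1, c2) if quality(c1) >= quality(c2) else (c2, c1)
    return known, alternate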
Results (contd.) • Four real-world datasets