COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity Eric Bae and James Bailey Proceedings of the IEEE International Conference on Data Mining, ICDM, 2006 Presenter: 吳建良
Outline • Motivation • Problem Definition • COALA • Quantitative Evaluation • Experiment
Motivation • Traditional clustering techniques • Produce only a single solution • It is difficult for the user to validate whether the solution is in fact appropriate, particularly if the dataset is large and complex • The user has limited knowledge about the clustering algorithm being used • Goal: provide another, alternative clustering solution • High quality, yet different from the original solution
Related Work - Ensemble Clustering • Objective • Generate multiple clusterings • Merge them to offer a final consensus clustering • Method • Apply many algorithms • Change initial conditions of an algorithm • Random samples of data
Challenges • An inability to know which algorithms to apply and how many • A difficulty in quantitatively evaluating the degree of (dis)similarity/quality for the candidate solutions • The inefficiency of running algorithms multiple times
Requirements of clustering • Dissimilarity requirement • Given two clusterings C and S, they can be presented as solutions if they are as dissimilar from one another as possible • Enforced through cannot-link constraints • Quality requirement • Given two clusterings C and S, they can be considered as solutions only if both are high-quality clusterings • Enforced through a quality threshold ω
Problem Definition • Given a clustering C (provided as pre-defined class labels) with r clusters, find a second clustering S with r clusters that has high dissimilarity to C but also satisfies the quality requirement threshold ω
Notation • D = {x1, x2, …, xn}: a set of n objects • C = {c1, c2, …, cr}: the existing clustering (background knowledge) • S = {s1, s2, …, sr}: the new clustering sought with respect to C • d(ci, cj): the distance between clusters ci and cj • Average linkage: the average distance over all pairs of objects drawn one from ci and one from cj
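Written out, the average-linkage cluster distance used above is the standard formulation (stated here for completeness; the original slide showed it as a figure):

d(c_i, c_j) = \frac{1}{|c_i|\,|c_j|} \sum_{x \in c_i} \sum_{y \in c_j} d(x, y)

where d(x, y) is the underlying object-level distance.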
Cannot-link Constraints • A cannot-link constraint is a pair of distinct data objects (xi, xj) • In any feasible clustering, objects xi and xj must not be in the same cluster • Cannot-link constraints are used to ensure that the second clustering S is dissimilar from the given clustering C • Each pair of objects that is in the same cluster in C is added to the constraint set L (sketched below)
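As a minimal sketch (our own illustrative Python, not the authors' code), the constraint set L can be materialized from the given clustering C like this:

from itertools import combinations

def build_cannot_link(C):
    # Every pair of objects co-clustered in C becomes a cannot-link constraint.
    L = set()
    for cluster in C:
        for xi, xj in combinations(sorted(cluster), 2):
            L.add((xi, xj))
    return L

# Clustering from the example slides: C = {{A,B,C,D}, {E,F}}
C = [{"A", "B", "C", "D"}, {"E", "F"}]
L = build_cannot_link(C)
# {('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D'), ('E','F')}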
COALA • An agglomerative hierarchical clustering algorithm • Starts by treating each object as a singleton cluster • At each iteration, two candidate pairs of clusters for a possible merge are found • Qualitative pair (cq1, cq2): the pair with the minimum distance over all pairs of clusters • Dissimilar pair (co1, co2): the pair with the minimum distance over all pairs of clusters that also satisfies the cannot-link constraints
COALA (contd.) • Quality threshold ω balances the trade-off between the qualitative merge and the dissimilar merge • If d(cq1, cq2) / d(co1, co2) < ω then merge(cq1, cq2), else merge(co1, co2) • If no pair of clusters satisfies the cannot-link constraints, the algorithm proceeds with the merge of the qualitative pair
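Putting the two COALA slides together, one plausible reading of the merge loop is the sketch below (illustrative Python, not the authors' code; avg_linkage, violates, and the ratio test d(cq1, cq2)/d(co1, co2) < ω are our reconstruction, and dist is any object-level distance function):

from itertools import combinations

def avg_linkage(ci, cj, dist):
    # Average-linkage distance between clusters ci and cj.
    return sum(dist(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

def violates(ci, cj, L):
    # Merging ci and cj violates L if it would co-cluster a cannot-linked pair.
    return any((min(x, y), max(x, y)) in L for x in ci for y in cj)

def coala(objects, L, omega, dist, r):
    clusters = [frozenset([x]) for x in objects]
    while len(clusters) > r:
        pairs = [(avg_linkage(ci, cj, dist), ci, cj)
                 for ci, cj in combinations(clusters, 2)]
        dq, q1, q2 = min(pairs, key=lambda p: p[0])        # qualitative pair
        allowed = [p for p in pairs if not violates(p[1], p[2], L)]
        if allowed:
            do, o1, o2 = min(allowed, key=lambda p: p[0])  # dissimilar pair
            a, b = (q1, q2) if dq / do < omega else (o1, o2)
        else:
            a, b = q1, q2  # no constraint-satisfying pair: qualitative merge
        clusters = [c for c in clusters if c is not a and c is not b] + [a | b]
    return clusters

The worked example on the next slides can be traced through this loop step by step.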
Example • C = {{A, B, C, D}, {E, F}}, quality threshold ω = 0.6 • Initialization: each point forms a singleton cluster • Cannot-link constraint set L: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D), (E,F)
Example (contd.) • Minimal qualitative pair: (A,B), (B,C), or (D,F); suppose (A,B) is picked • Minimal dissimilar pair: (D,F) • d(A,B)/d(D,F) = 1 ≥ ω, so the dissimilar pair (D,F) is merged
Example (contd.) • Minimal qualitative pair: (A,B) or (B,C); suppose (A,B) is picked • Minimal dissimilar pair: (C,E) • d(A,B)/d(C,E) < ω, so the qualitative pair (A,B) is merged
Example (contd.) • Minimal qualitative pair: ({A,B}, C) • Minimal dissimilar pair: (C,E) • d({A,B}, C)/d(C,E) < ω, so the qualitative pair ({A,B}, C) is merged
Example (contd.) • Minimal qualitative pair: ({D,F}, E) • Minimal dissimilar pair: ({A,B,C}, E) • d({D,F}, E)/d({A,B,C}, E) < ω, so the qualitative pair ({D,F}, E) is merged • Final result: S = {{A, B, C}, {D, E, F}}
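The whole example can be replayed on the sketch from the COALA slide; the coordinates below are hypothetical, chosen only so that the pairwise distances order the merges as in the original figures:

import math

points = {"A": (0, 1), "B": (1, 0), "C": (2, 1),
          "D": (6, 1), "E": (8, 3), "F": (7, 0)}
dist = lambda x, y: math.dist(points[x], points[y])
C = [{"A", "B", "C", "D"}, {"E", "F"}]

S = coala(points, build_cannot_link(C), omega=0.6, dist=dist, r=2)
# S == [frozenset({'A','B','C'}), frozenset({'D','E','F'})]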
Quantitative Evaluation • Dissimilarity • Jaccard index: J(C, S) = N11 / (N11 + N01 + N10) • N11: the number of pairs of points in the same cluster in both C and S • N00: the number of pairs in different clusters in both C and S • N01, N10: the number of pairs in the same cluster in one clustering but not in the other • Quality • Dunn index: DI(S) = min_{i≠j} δ(si, sj) / max_k Δ(sk) • δ: cluster-to-cluster distance • Δ: cluster diameter measure
Quantitative Evaluation (contd.) • Jaccard index value ↓ ⇒ dissimilarity ↑ • Dunn index value ↑ ⇒ quality ↑ • Overall clustering score (DQ): combines the two measures, rewarding clusterings that are both dissimilar (low Jaccard) and of high quality (high Dunn); both indices are sketched below
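As a minimal sketch of the two measures (illustrative Python; flat label vectors for the Jaccard index, and the same dist convention as before for the Dunn index, are our assumptions):

from itertools import combinations

def jaccard_index(labels_c, labels_s):
    # Pair-counting Jaccard: N11 / (N11 + N01 + N10); lower = more dissimilar.
    n11 = n_mixed = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_s = labels_s[i] == labels_s[j]
        if same_c and same_s:
            n11 += 1
        elif same_c or same_s:
            n_mixed += 1  # N01 + N10 counted together
    return n11 / (n11 + n_mixed)

def dunn_index(clusters, dist):
    # Min inter-cluster distance (δ) over max cluster diameter (Δ);
    # δ is taken here as the closest pair across clusters, one common choice.
    delta = min(min(dist(x, y) for x in ci for y in cj)
                for ci, cj in combinations(clusters, 2))
    Delta = max(max((dist(x, y) for x, y in combinations(sorted(c), 2)),
                    default=0.0)
                for c in clusters)
    return delta / Delta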
Experiment • Synthetic datasets
Two competing approaches • Naïve method (sketched below) • Apply the k-means algorithm three times using different initial points • Select the two clusterings with the highest DQ • From those two, select the clustering with higher quality as the 'known' clustering • CIB (Conditional Information Bottleneck) • Retrieves dissimilar clusterings • Finds the optimal assignment of objects to clusters while preserving as much feature information as possible, conditioned on the information provided by the pre-defined class labels
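A hedged sketch of the naïve baseline (using scikit-learn's KMeans; dq_score and quality are stand-ins for the paper's DQ measure and a quality measure such as the Dunn index, since the slide does not give their exact forms):

from itertools import combinations
from sklearn.cluster import KMeans

def naive_baseline(X, r, dq_score, quality, seeds=(0, 1, 2)):
    # Run k-means three times from different initial points.
    runs = [KMeans(n_clusters=r, n_init=1, random_state=s).fit_predict(X)
            for s in seeds]
    # Keep the pair of clusterings with the highest DQ.
    c1, c2 = max(combinations(runs, 2), key=lambda pair: dq_score(*pair))
    # The higher-quality one of the pair serves as the 'known' clustering.
    known, alternate = (c1, c2) if quality(c1) >= quality(c2) else (c2, c1)
    return known, alternate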
Results (contd.) • Four real-world datasets