Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla
Clustering • Say we want to cluster n objects of some kind (documents, images, text strings) • But we don’t have a meaningful way to project them into Euclidean space. • Idea of Cohen, McCallum, Richman: use past data to train up f(x,y) = same/different. • Then run f on all pairs and try to find the most consistent clustering.
The problem Example: Harry B., Harry Bovik, H. Bovik, Tom X. (+: same, −: different). Train up f(x,y) = same/different. Run f on all pairs. Find the most consistent clustering. Totally consistent: + edges inside clusters, − edges between clusters. Each violated edge is a disagreement.
The problem Problem: Given a complete graph on n vertices, each edge labeled + or −, the goal is to find a partition of the vertices as consistent as possible with the edge labels: Max #(agreements) or Min #(disagreements). There is no k: the number of clusters could be anything.
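To make the objective concrete, here is a minimal Python sketch (function names are hypothetical, not from the talk) that counts agreements and disagreements of a candidate partition on a complete ±-labelled graph:

```python
from itertools import combinations

def disagreements(vertices, positive, clustering):
    """Count disagreements of `clustering` with the edge labels.

    positive: set of frozenset pairs labelled '+'; every other pair is '-'.
    clustering: dict mapping each vertex to a cluster id.
    A '+' edge across clusters, or a '-' edge inside a cluster, disagrees.
    """
    bad = 0
    for u, v in combinations(vertices, 2):
        same_cluster = clustering[u] == clustering[v]
        plus_edge = frozenset((u, v)) in positive
        if plus_edge != same_cluster:
            bad += 1
    return bad

def agreements(vertices, positive, clustering):
    # agreements and disagreements partition the n(n-1)/2 pairs
    n = len(vertices)
    return n * (n - 1) // 2 - disagreements(vertices, positive, clustering)
```

Minimizing disagreements and maximizing agreements have the same optimum clustering, but approximating them is very different, as the results below show.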
The Problem Noise removal: there is some true clustering, but some edges are incorrect; we still want to do well. Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis class of limited representation power. E.g.: research communities via the collaboration graph.
Our results • Constant-factor approximation for minimizing disagreements. • PTAS for maximizing agreements. • Results for the random-noise case.
PTAS for maximizing agreements Easy to get ½ of the edges. Goal: additive approximation of εn². Standard approach: • Draw a small sample, • Guess a partition of the sample, • Compute the partition of the remainder. Can do this directly, or plug into the General Property Tester of [GGR]. • Running time doubly exponential in 1/ε, or singly exponential with a bad exponent.
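The "easy to get ½ of the edges" claim can be shown with a tiny sketch (hypothetical names): all-singletons agrees with every − edge, one big cluster agrees with every + edge, and since the two counts sum to all pairs, the better of the two gets at least half right.

```python
def half_agreement_baseline(vertices, positive):
    """Return a clustering agreeing with at least half of all pairs.

    One big cluster gets every '+' edge right; all singletons get every
    '-' edge right. The two agreement counts sum to n(n-1)/2, so the
    larger one is at least half of all edges.
    """
    n = len(vertices)
    total_pairs = n * (n - 1) // 2
    n_plus = len(positive)
    n_minus = total_pairs - n_plus
    if n_plus >= n_minus:
        return {v: 0 for v in vertices}, n_plus               # one cluster
    return {v: i for i, v in enumerate(vertices)}, n_minus    # singletons
```

This is why the interesting goal is the additive εn² approximation rather than a multiplicative one.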
Minimizing Disagreements Goal: get a constant-factor approximation. Problem: even if we can find a cluster that is as good as the best cluster of OPT, we are headed toward O(log n) [Set-Cover-like analysis]. We need a way of lower-bounding D_OPT.
Lower bounding idea: bad triangles Consider a triangle with two + edges and one − edge (a "bad triangle"). We know any clustering has to disagree with at least one of these three edges.
Lower bounding idea: bad triangles If there are several such triangles that are edge-disjoint, any clustering makes a mistake on each one. Example on vertices 1–5 with + edges (1,2), (2,3), (3,4), (4,5): edge-disjoint bad triangles (1,2,3) and (3,4,5). So D_OPT ≥ #{edge-disjoint bad triangles}. (This bound is not tight.) How can we use this?
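The lower bound can be computed greedily; the sketch below (hypothetical names) collects any maximal set of edge-disjoint bad triangles, which is a valid (possibly non-maximum) lower bound on D_OPT:

```python
from itertools import combinations

def bad_triangle_lower_bound(vertices, positive):
    """Greedily collect edge-disjoint 'bad' triangles (two '+' edges and
    one '-' edge). Any clustering must err on at least one edge of each,
    and the triangles share no edges, so the count lower-bounds D_OPT."""
    used = set()   # edges already claimed by a chosen triangle
    count = 0
    for a, b, c in combinations(vertices, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if any(e in used for e in edges):
            continue
        n_plus = sum(e in positive for e in edges)
        if n_plus == 2:          # bad triangle: + + -
            used.update(edges)
            count += 1
    return count
```

On the 5-vertex example above it finds the two triangles (1,2,3) and (3,4,5).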
δ-clean clusters Given a clustering, a vertex v in cluster C is δ-good if it has few disagreements: |N−(v) ∩ C| < δ|C| and |N+(v) \ C| < δ|C| (+: similar, −: dissimilar). Essentially, N+(v) ≈ C(v). A cluster C is δ-clean if all v ∈ C are δ-good.
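The δ-good condition translates directly into code; a sketch with hypothetical names, where `positive` is the set of + pairs:

```python
def is_delta_good(v, cluster, positive, all_vertices, delta):
    """v is delta-good w.r.t. its cluster C if it has fewer than delta*|C|
    '-' neighbours inside C and fewer than delta*|C| '+' neighbours outside C."""
    C = set(cluster)
    thresh = delta * len(C)
    minus_inside = sum(1 for u in C if u != v
                       and frozenset((u, v)) not in positive)
    plus_outside = sum(1 for u in all_vertices if u not in C
                       and frozenset((u, v)) in positive)
    return minus_inside < thresh and plus_outside < thresh

def is_delta_clean(cluster, positive, all_vertices, delta):
    # a cluster is delta-clean when every member is delta-good
    return all(is_delta_good(v, cluster, positive, all_vertices, delta)
               for v in cluster)
```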
Observation Any δ-clean clustering is an 8-approximation, for δ < 1/4. Idea: charge mistakes to bad triangles. Intuitively, for each wrong − edge (u,v) inside a cluster there are enough choices of a common + neighbour w to form a bad triangle, and these bad triangles can be chosen edge-disjoint. A similar argument handles + edges between two clusters.
General Structure of Argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: that may not be possible! Approach: 1) Show that a clustering of a special form, OPT(δ), exists: its clusters are either δ-clean or singletons, and it makes not many more mistakes than OPT. 2) Produce something "close" to OPT(δ).
Existence of OPT(δ) Start from OPT and identify the δ/3-bad vertices in each cluster. • Move δ/3-bad vertices out as singletons. • If "many" (a ≥ δ/3 fraction) of a cluster's vertices are δ/3-bad, split the whole cluster into singletons. The result, OPT(δ), consists of singletons and δ-clean clusters, with D_OPT(δ) = O(1) · D_OPT.
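The construction above can be sketched in code (all names hypothetical; the δ/3 thresholds follow the slide, and the claim that the surviving cores are δ-clean is the point of the existence proof, not checked here):

```python
def delta_bad(v, C, positive, V, d):
    # v violates the d-good condition w.r.t. cluster C
    t = d * len(C)
    minus_in = sum(1 for u in C if u != v
                   and frozenset((u, v)) not in positive)
    plus_out = sum(1 for u in V if u not in C
                   and frozenset((u, v)) in positive)
    return minus_in >= t or plus_out >= t

def make_opt_delta(clusters, positive, V, delta):
    """Rebuild a clustering into singletons plus (claimed) delta-clean cores."""
    out = []
    for C in clusters:
        bad = {v for v in C if delta_bad(v, C, positive, V, delta / 3)}
        if len(bad) >= (delta / 3) * len(C):
            out.extend([{v} for v in C])      # many bad: split whole cluster
        else:
            out.append(set(C) - bad)          # keep the clean core
            out.extend([{v} for v in bad])    # evict bad vertices as singletons
    return out
```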
Main Result Guarantee of the algorithm: • Non-singleton clusters are 11δ-clean. • Singletons are a subset of the singletons of OPT(δ). Choose δ = 1/44, so the non-singletons are ¼-clean. Bound mistakes among non-singletons by edge-disjoint bad triangles, and mistakes involving singletons by those of OPT(δ). Approximation ratio = 9/δ² + 8.
Open Problems • How about a small constant? Is D_OPT ≤ 2 · #{edge-disjoint bad triangles}? Is D_OPT ≤ 2 · #{fractional edge-disjoint bad triangles}? • Extend to {−1, 0, +1} weights: approximation to a constant, or even a log factor? • Extend to the weighted case? Related: the Clique Partitioning Problem and cutting-plane algorithms [Grötschel et al.].
Nice features of the formulation • There is no k (OPT can have anywhere from 1 to n clusters). • If a perfect solution exists, it is easy to find: C(v) = N+(v). • Easy to get agreement on ½ of the edges.
Algorithm • Pick a vertex v. Let C(v) = N+(v). • Modify C(v): • Remove 3δ-bad vertices from C(v). • Add 7δ-good vertices into C(v). • Output C(v) and delete it. Repeat until done, or until the above always makes empty clusters; output the remaining nodes as singletons.
Step 1 Choose v; let C = the + neighbors of v (together with v itself).
Step 2 Vertex removal phase: if x is 3δ-bad, set C = C − {x}. • No vertex in C1 is removed. • All vertices in C2 are removed.
Step 3 Vertex addition phase: add 7δ-good vertices to C. • All remaining vertices in C1 will be added. • None in C2 is added. • The resulting cluster C is 11δ-clean.
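The three steps above can be put together in a sketch (hypothetical names; the 3δ/7δ thresholds follow the slides, while details like the order of eviction are guesses):

```python
def cautious_cluster(V, positive, delta):
    """Grow a cluster from the + neighbourhood of a vertex, evict 3*delta-bad
    vertices, then absorb 7*delta-good outsiders; leftovers become singletons."""
    remaining = set(V)
    clusters = []

    def is_bad(x, C, d):
        # x violates the d-good condition w.r.t. C
        t = d * len(C)
        minus_in = sum(1 for u in C
                       if u != x and frozenset((u, x)) not in positive)
        plus_out = sum(1 for u in remaining - C
                       if u != x and frozenset((u, x)) in positive)
        return minus_in >= t or plus_out >= t

    while remaining:
        v = next(iter(remaining))
        C = {u for u in remaining if frozenset((u, v)) in positive} | {v}
        # vertex removal phase: evict 3*delta-bad vertices until stable
        changed = True
        while changed:
            changed = False
            for x in list(C):
                if is_bad(x, C, 3 * delta):
                    C.discard(x)
                    changed = True
        # vertex addition phase: absorb 7*delta-good vertices
        C |= {y for y in remaining - C if not is_bad(y, C, 7 * delta)}
        if not C:
            C = {v}   # nothing survived: v becomes a singleton
        clusters.append(C)
        remaining -= C
    return clusters
```

On two disjoint + triangles it recovers the two clusters exactly.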
Case 2: v is a singleton in OPT(δ) Choose v, C = the + neighbors of v. The same idea works.