Correlation Clustering. Nikhil Bansal. Joint work with Avrim Blum and Shuchi Chawla.
Previous Approaches: Documents are mapped to points, e.g. Doc1 -> (1,0,1), together with a distance among points.
Previous Approaches: k-min clustering, k-min sum, k-median, … Techniques: approximation algorithms, matrix methods, AI techniques, … k-min clustering: minimize the maximum diameter. k-min sum: minimize the sum of distances within clusters. (Figure: example with k = 3.)
Some Limitations: 1) Have to specify “k”. If k is not restricted, it is best to just put each vertex in its own individual cluster.
Some Limitations: 2) Restrictions on edge weights: the edge weights must form a metric.
Some Limitations: 3) No clean notion of quality of clustering. E.g., minimize the distance sum within clusters: what really is my cluster quality?
Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion
Our Approach: A classifier takes 2 documents and returns a weight W in [-1,+1] indicating their similarity (+1: similar, -1: dissimilar). In this talk, W = -1 or +1. Our Goal: Find a clustering which agrees with this labeling. (Figure: example graph with edges labeled -1, -1, +1.)
Disagreements (+1: similar, -1: dissimilar): A disagreement is a -1 edge within a cluster or a +1 edge crossing between clusters. In the example figure, 2 edges have disagreements. Our Goal: Minimize the number of disagreements.
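The definition of a disagreement can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `count_disagreements`, the dictionary encoding of the labeled graph, and the 4-vertex instance are assumptions made for the example and do not come from the talk.

```python
def count_disagreements(labels, cluster_of):
    """Count edges whose label conflicts with the clustering.

    labels: dict mapping a pair (u, v) with u < v to +1 or -1 (assumed encoding).
    cluster_of: dict mapping each vertex to a cluster id.
    """
    disagreements = 0
    for (u, v), w in labels.items():
        same = cluster_of[u] == cluster_of[v]
        # A -1 edge inside a cluster, or a +1 edge across clusters, disagrees.
        if (w == -1 and same) or (w == +1 and not same):
            disagreements += 1
    return disagreements


# Hypothetical 4-vertex instance (not the figure from the slide).
labels = {
    (0, 1): +1, (0, 2): +1, (1, 2): -1,
    (0, 3): -1, (1, 3): -1, (2, 3): +1,
}
cluster_of = {0: "A", 1: "A", 2: "A", 3: "B"}
print(count_disagreements(labels, cluster_of))
```

Here the -1 edge (1, 2) lies inside cluster A and the +1 edge (2, 3) crosses between A and B, so this clustering has two disagreements.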
Comparison: 1) Clean notion of quality of clustering: # disagreements -> quality.
Comparison: 2) Do not have to specify “k”: k is determined by the edge labels. (Figure: example graph with edges labeled +1, +1, +1, -1, -1, -1.)
Comparison: 3) Arbitrary edge weights: no metric, no dependence.
A Closer Look: Goal: Given a graph with +1, -1 edges, cluster to minimize disagreements. Question: Can we always avoid disagreements? Answer: No. In a triangle with edges +1, +1, -1, any clustering has at least 1 disagreement.
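The triangle claim is small enough to check exhaustively. The sketch below brute-forces every clustering of a tiny instance; the helper name `min_disagreements` and the pair-keyed dictionary encoding are assumptions for illustration, and the enumeration is exponential, so it is only meant for a handful of vertices.

```python
from itertools import product


def min_disagreements(n, labels):
    """Brute-force the minimum number of disagreements over all clusterings.

    Tries every assignment of the n vertices to cluster ids 0..n-1,
    so it is only usable for very small n.
    """
    best = None
    for assignment in product(range(n), repeat=n):
        d = sum(
            1
            for (u, v), w in labels.items()
            # Disagreement: -1 edge inside a cluster, or +1 edge across clusters.
            if (w == -1) == (assignment[u] == assignment[v])
        )
        if best is None or d < best:
            best = d
    return best


# Triangle with edges +1, +1, -1, as on the slide.
triangle = {(0, 1): +1, (1, 2): +1, (0, 2): -1}
print(min_disagreements(3, triangle))
```

Every one of the possible clusterings of the three vertices leaves at least one edge in disagreement, so the minimum is 1.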
Minimizing Disagreements: Minimizing disagreements is NP-hard, so we will look for approximation algorithms. (Figure: example graph with six +1 edges and three -1 edges, and a clustering with 1 disagreement.)
Agreements vs. Disagreements: Observation: Agreements + Disagreements = n(n-1)/2 (the total number of edges), so minimizing disagreements ⇔ maximizing agreements. The two are very different in terms of approximation, however. Example: Opt has 1 disagreement, we have n disagreements. Disagreements: ratio n. Agreements: ratio ≈ 1.
Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion
Maximizing Agreements: A 2-approximation is easy. Algorithm: If #(+1 edges) > #(-1 edges), put all points in a single cluster; else, put each point in its own individual cluster. Proof: Opt's agreements are at most n(n-1)/2 (the total number of edges); we agree on at least n(n-1)/4 (at least half of the edges).
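The 2-approximation is short enough to write out. The following is a minimal sketch assuming a complete graph on vertices 0..n-1 with labels stored in a dictionary keyed by pairs (u, v), u < v; the function names and the example instance are hypothetical.

```python
def two_approx_max_agreements(n, labels):
    """Return a clustering (vertex -> cluster id) agreeing with at least half the edges."""
    positives = sum(1 for w in labels.values() if w == +1)
    negatives = len(labels) - positives
    if positives > negatives:
        # One big cluster agrees with every +1 edge.
        return {v: 0 for v in range(n)}
    # Singletons agree with every -1 edge.
    return {v: v for v in range(n)}


def count_agreements(labels, cluster_of):
    """Count +1 edges inside clusters plus -1 edges between clusters."""
    return sum(
        1
        for (u, v), w in labels.items()
        if (w == +1) == (cluster_of[u] == cluster_of[v])
    )


# Hypothetical instance with four +1 edges and two -1 edges.
labels = {
    (0, 1): +1, (0, 2): +1, (1, 2): +1,
    (0, 3): -1, (1, 3): -1, (2, 3): +1,
}
clustering = two_approx_max_agreements(4, labels)
agreements = count_agreements(labels, clustering)
print(clustering, agreements)
```

Since +1 edges are in the majority here, everything goes into one cluster and the four +1 edges are agreements, which is at least half of the six edges.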
Our Result • A PTAS for max. agreements: (1+ε)-approximation, Time = n^O(poly(1/ε))
Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion
Our Result • An O(1) approximation for minimizing disagreements
Approximation for Disagreements: To prove: D_alg ≤ c · D_opt, where D_alg is our disagreements and D_opt is Opt's disagreements. Roadmap: 1) Notation 2) Show existence of Opt(δ) 3) Describe the algorithm 4) Show our clustering is close to Opt(δ)
Notation (+1: similar, -1: dissimilar): Given a clustering, a vertex v in cluster C is δ-good if it has few disagreements: within C < δ|C|, and outside C < δ|C|. A δ-bad vertex has ≥ δ|C| disagreements. A cluster C is δ-clean if all v ∈ C are δ-good. (Figure: a δ-good vertex v and a δ-bad vertex v, each shown with its cluster C.)
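As a rough illustration of this notation, the sketch below checks δ-goodness and δ-cleanness directly from the definition. It assumes that "within C" counts -1 edges from v to other members of C and "outside C" counts +1 edges from v to vertices not in C; the function names and the 5-vertex example are hypothetical.

```python
from itertools import combinations


def is_good(v, cluster, vertices, labels, delta):
    """True if v has fewer than delta*|cluster| disagreements inside and outside."""
    def label(a, b):
        return labels[(min(a, b), max(a, b))]

    bound = delta * len(cluster)
    # -1 edges from v to members of its own cluster.
    inside = sum(1 for u in cluster if u != v and label(u, v) == -1)
    # +1 edges from v to vertices outside the cluster.
    outside = sum(1 for u in vertices if u not in cluster and label(u, v) == +1)
    return inside < bound and outside < bound


def is_clean(cluster, vertices, labels, delta):
    """True if every vertex of the cluster is delta-good."""
    return all(is_good(v, cluster, vertices, labels, delta) for v in cluster)


# Hypothetical instance: {0,1,2,3} is a +1 clique, vertex 4 is -1 to everyone.
vertices = range(5)
labels = {(u, v): (+1 if v < 4 else -1) for u, v in combinations(vertices, 2)}
cluster = {0, 1, 2, 3}
print(is_clean(cluster, vertices, labels, 0.25))

# Flip one edge so that vertex 0 has a +1 edge leaving the cluster.
labels[(0, 4)] = +1
print(is_good(0, cluster, vertices, labels, 0.25))
```

With δ = 0.25 and |C| = 4 the bound is 1, so a single +1 edge leaving the cluster is already enough to make vertex 0 δ-bad.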
Approximation for Disagreements: To prove: D_alg ≤ c · D_opt, where D_alg is our disagreements and D_opt is Opt's disagreements. Roadmap: 1) Notation 2) Show existence of Opt(δ) 3) Describe the algorithm 4) Show our clustering is close to Opt(δ)
Existence of Opt(δ): Main Idea: Opt -> Opt(δ). Opt(δ): 1) All “non-singleton” clusters are δ-clean 2) Constant times worse than Opt: D_opt(δ) = O(1/δ²) · D_opt
Transforming OPT to OPT(δ): An imaginary procedure applied to Opt. (Figure: optimum clustering with clusters C1 and C2.)
Transforming OPT to OPT(δ): Identify δ/3-bad vertices. (Figure: optimum clustering with the δ/3-bad vertices of C1 and C2 marked.)
Transforming OPT to OPT(δ): • Move δ/3-bad vertices out • If “many” (≥ δ/3 fraction of the cluster) are δ/3-bad, “split” the cluster. (Figures: from the optimum clustering to OPT(δ), one cluster has a vertex moved out and the other is split.)
Transforming OPT to OPT(δ): Disagreements of OPT(δ). Split (C1): disagreements earlier ≥ (δ/3)²|C1|²; added ≤ |C1|²/2. Do not split (C2): earlier, each moved vertex had ≥ (δ/3)|C2| disagreements; added, each has ≤ |C2|. So total disagreements increase by a factor of O(1/δ²).
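The O(1/δ²) factor can be read off by dividing the added disagreements by the ones Opt already paid for. The short calculation below takes the slide's bounds at face value and is only a sketch of the arithmetic, not the full proof.

```latex
\[
\text{Split } (C_1):\quad
\frac{\text{added}}{\text{earlier}}
\;\le\; \frac{|C_1|^2/2}{(\delta/3)^2\,|C_1|^2}
\;=\; \frac{9}{2\delta^2},
\qquad
\text{Do not split } (C_2):\quad
\frac{\text{added per moved vertex}}{\text{earlier per moved vertex}}
\;\le\; \frac{|C_2|}{(\delta/3)\,|C_2|}
\;=\; \frac{3}{\delta}.
\]
```

For δ ≤ 1 both ratios are O(1/δ²), which matches the stated bound.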
Transforming OPT to OPT(δ): “Non-singleton” clusters are δ-clean: a vertex that was δ/3-good earlier is still δ-good. (Figure: optimum clustering with C1 and C2, and the resulting OPT(δ).)
Approximation for Disagreements: To prove: D_alg ≤ c · D_opt, where D_alg is our disagreements and D_opt is Opt's disagreements. Roadmap: 1) Notation 2) Show existence of Opt(δ) 3) Describe the algorithm 4) Show our clustering is close to Opt(δ)
Main Result: Opt(δ) is δ-clean; the clustering produced by the algorithm is 11δ-clean.
The algorithm: Input: Graph G. Output: A clustering of G.
1) Pick an arbitrary v ∈ G, let C = +1 neighbors of v
2) Vertex Removal Phase: Remove bad vertices from C
3) Vertex Addition Phase: Add good vertices to C
4) Repeat on G - C
Step 1: Choose v, C = +1 neighbors of v. (Figure: v and C drawn together with clusters C1 and C2.)
Step 2: Vertex Removal Phase: If x is 3δ-bad, C = C - {x}. • No vertex in C1 is removed • All vertices in C2 are removed
Step 3: Vertex Addition Phase: Add 7δ-good vertices to C. • All remaining vertices in C1 will be added • None in C2 is added • Cluster C is 11δ-clean
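The three steps can be put together in a short Python sketch. Several details here are assumptions filled in for the sake of a runnable example and are not stated on the slides: v itself is included in C, goodness is measured against the current C within the remaining graph, bad vertices are removed one at a time, all 7δ-good vertices are added in one batch, and v becomes a singleton if C ends up empty. The function name `cautious_clustering` is also hypothetical.

```python
from itertools import combinations


def cautious_clustering(n, labels, delta):
    """Sketch of the remove-then-add clustering loop (assumptions in the lead-in)."""
    def positive(a, b):
        return labels[(min(a, b), max(a, b))] == +1

    def is_good(x, cluster, pool, factor):
        # x is (factor*delta)-good w.r.t. cluster if it has few -1 edges into
        # the cluster and few +1 edges to remaining vertices outside it.
        bound = factor * delta * len(cluster)
        inside = sum(1 for u in cluster if u != x and not positive(u, x))
        outside = sum(
            1 for u in pool if u not in cluster and u != x and positive(u, x)
        )
        return inside < bound and outside < bound

    remaining = set(range(n))
    clusters = []
    while remaining:
        v = min(remaining)  # "arbitrary" choice, made deterministic here
        # Step 1: C = v together with its +1 neighbors in the remaining graph.
        cluster = {v} | {u for u in remaining if u != v and positive(u, v)}
        # Step 2: vertex removal phase, dropping 3*delta-bad vertices one by one.
        while True:
            bad = next(
                (x for x in sorted(cluster)
                 if not is_good(x, cluster, remaining, 3)),
                None,
            )
            if bad is None:
                break
            cluster.remove(bad)
        # Step 3: vertex addition phase, adding all 7*delta-good vertices at once.
        cluster |= {
            y for y in remaining - cluster if is_good(y, cluster, remaining, 7)
        }
        if not cluster:
            cluster = {v}  # assumed fallback so that every round makes progress
        clusters.append(cluster)
        # Step 4: repeat on G - C.
        remaining -= cluster
    return clusters


# Hypothetical instance: two +1 cliques {0..3} and {4..7}, all cross edges -1.
labels = {
    (u, v): (+1 if (u < 4) == (v < 4) else -1)
    for u, v in combinations(range(8), 2)
}
print(cautious_clustering(8, labels, 0.1))
```

On this consistent instance no vertex is removed or added, and the two cliques come back as the two clusters.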
Case 2: v is a singleton in OPT(δ): Choose v, C = +1 neighbors of v. The same idea works.
Main Result: Opt(δ) is δ-clean; the algorithm's clustering is 11δ-clean.
Approximation for Disagreements: To prove: D_alg ≤ c · D_opt, where D_alg is our disagreements and D_opt is Opt's disagreements. Roadmap: 1) Notation 2) Show existence of Opt(δ) 3) Describe the algorithm 4) Show our clustering is close to Opt(δ)
Our Disagreements: • Disagreements: • Involving singletons (Type 1) • In non-singletons. Type 1 ≤ D_opt(δ). (Figure: Opt(δ) with clusters C1 and C2 next to the algorithm's 11δ-clean clustering, with +1 edges marked.)
Disagreements in Non-Singletons: Lemma: If δ < ¼, disagreements in δ-clean clusterings are ≤ 8 D_opt. Erroneous triangle: a triangle with edges +1, +1, -1. Disagreements of OPT ≥ # of edge-disjoint erroneous triangles.
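The lower bound via erroneous triangles can be illustrated with a greedy packing: every erroneous triangle forces at least one disagreement on one of its edges, so any set of edge-disjoint erroneous triangles gives a lower bound on Opt's disagreements. The greedy scan below is just one simple way to find such a set and is not claimed to be the argument used in the talk; the function name and the example instance are hypothetical.

```python
from itertools import combinations


def edge_disjoint_erroneous_triangles(n, labels):
    """Greedily pack edge-disjoint triangles whose edges are labeled +1, +1, -1."""
    used = set()
    packed = []
    for a, b, c in combinations(range(n), 3):
        edges = [(a, b), (a, c), (b, c)]  # a < b < c, so keys match `labels`
        if sorted(labels[e] for e in edges) != [-1, 1, 1]:
            continue  # not an erroneous triangle
        if any(e in used for e in edges):
            continue  # shares an edge with a triangle already packed
        used.update(edges)
        packed.append((a, b, c))
    return packed


# Hypothetical 4-vertex instance.
labels = {
    (0, 1): +1, (0, 2): +1, (1, 2): -1,
    (0, 3): -1, (1, 3): -1, (2, 3): +1,
}
print(edge_disjoint_erroneous_triangles(4, labels))
```

Here triangles (0, 1, 2) and (0, 2, 3) are both erroneous but share the edge (0, 2), so only one of them can be packed and the resulting lower bound on the number of disagreements is 1.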
Disagreements in Non-Singletons: Lemma: If δ < ¼, errors in δ-clean clusterings are ≤ 8 D_opt. Proof idea: For each disagreement we will find an edge-disjoint erroneous triangle. In a δ-clean cluster C there are lots of these (≥ ½|C|), so they cannot all be used up. (Figure: erroneous triangle with edges +1, +1, -1 in a δ-clean cluster C.)
Disagreements in Non-Singletons: Lemma: If δ < ¼, disagreements in δ-clean clusterings are ≤ 8 D_opt. An identical argument works. (Figure: erroneous triangle with edges -1, +1, +1 and δ-clean clusters.)