Clustering Correlation Approaches with Approximation Algorithms

Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla

Introduction

Previous Approaches: Documents are mapped to points, e.g. Doc1 -> (1,0,1), and a distance is defined among the points.

Previous Approaches: k-min clustering, k-min sum, k-median, … Tools: approximation algorithms, matrix methods, AI techniques, … Example with k = 3. k-min clustering: minimize the maximum diameter. k-min sum: minimize the sum of distances within clusters.

Some Limitations 1) Have to specify “k”. If k is not restricted, it is best to just put each vertex in its own individual cluster.

Some Limitations 2) Restrictions on edge weights: the edge weights must form a metric.

Some Limitations 3) No clean notion of clustering quality. E.g., minimize the sum of distances within clusters: what really is my cluster quality?

Outline • Introduction • Our Approach + Problem Formulation • Approximating Agreements • Approximating Disagreements • Conclusion

Our Approach: A classifier takes 2 documents and returns a weight W in [-1,+1] indicating their similarity (+1: similar, -1: dissimilar). In this talk, W = -1 or +1. Our goal: find a clustering which agrees with this labeling. [Figure: three documents with edges labeled -1, -1, +1]

A Disagreement (+1: similar, -1: dissimilar): a -1 edge within a cluster, or a +1 edge crossing between clusters. [Figure: a clustering of three documents with edges -1, -1, +1 in which 2 edges are disagreements] Our goal: minimize the number of disagreements.
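The disagreement count defined on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `count_disagreements`, the dictionary encoding of edge labels, and the three-document example are assumptions made here, not part of the talk.

```python
from itertools import combinations

def count_disagreements(sign, cluster_of):
    """Count -1 edges inside a cluster plus +1 edges between clusters.

    sign:       dict mapping frozenset({u, v}) -> +1 or -1 (hypothetical encoding)
    cluster_of: dict mapping each vertex -> its cluster id
    """
    bad = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        label = sign[frozenset((u, v))]
        if (label == -1 and same_cluster) or (label == 1 and not same_cluster):
            bad += 1
    return bad

# Hypothetical instance: one +1 edge and two -1 edges.
sign = {
    frozenset(("a", "b")): 1,
    frozenset(("a", "c")): -1,
    frozenset(("b", "c")): -1,
}
all_together = count_disagreements(sign, {"a": 0, "b": 0, "c": 0})
split_off_c = count_disagreements(sign, {"a": 0, "b": 0, "c": 1})
```

Putting all three documents in one cluster leaves both -1 edges inside the cluster, while separating `c` agrees with every label.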

Comparison: 1) Clean notion of quality of clustering # disagreements -> Quality

Comparison: 2) Do not have to specify “k”: k is determined by the edge labels. [Figure: graph with edges labeled +1, +1, +1, -1, -1, -1]

Comparison: 3) Arbitrary Edge Weights No metric No dependence

A Closer Look: Goal: given a graph with +1/-1 edges, cluster to minimize disagreements. Question: Can we always avoid disagreements? Answer: No. In a triangle with edges +1, +1, -1, any clustering has at least 1 disagreement.
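The claim about the +1, +1, -1 triangle can be checked by brute force. The sketch below enumerates every clustering of a tiny instance; the helper name `min_disagreements` and the vertex names are assumptions for illustration, and the enumeration is exponential, so it is only meant for toy inputs.

```python
from itertools import combinations, product

def min_disagreements(vertices, sign):
    """Try every assignment of cluster ids and return the fewest disagreements."""
    best = None
    n = len(vertices)
    # Assigning ids from range(n) to each vertex covers every partition (with repeats).
    for ids in product(range(n), repeat=n):
        cluster_of = dict(zip(vertices, ids))
        cost = sum(
            1
            for u, v in combinations(vertices, 2)
            # A pair disagrees when "labeled +1" and "same cluster" differ.
            if (sign[frozenset((u, v))] == 1) != (cluster_of[u] == cluster_of[v])
        )
        best = cost if best is None else min(best, cost)
    return best

triangle = {frozenset("ab"): 1, frozenset("bc"): 1, frozenset("ac"): -1}
print(min_disagreements("abc", triangle))  # prints 1
```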

Minimizing Disagreements: [Figure: graph with six +1 edges and three -1 edges, clustered with 1 disagreement] Minimizing disagreements is NP-hard, so we will look for approximation algorithms.

Agreements vs. Disagreements: Observation: Agreements + Disagreements = total number of edges, so minimizing disagreements ⇔ maximizing agreements. Yet the two are very different in terms of approximation. Example: Opt has 1 disagreement, we have n disagreements. Disagreements: ratio n. Agreements: ratio ≈ 1.
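A small arithmetic sketch of why the two objectives behave so differently. It assumes a complete graph on n vertices, so there are n(n-1)/2 labeled pairs; the function name and the choice n = 100 are illustrative only.

```python
def approximation_ratios(n, opt_disagreements, our_disagreements):
    """Return (disagreement ratio, agreement ratio) for a complete graph on n vertices."""
    total_edges = n * (n - 1) // 2  # assumption: every pair of vertices is labeled
    disagreement_ratio = our_disagreements / opt_disagreements
    agreement_ratio = (total_edges - our_disagreements) / (total_edges - opt_disagreements)
    return disagreement_ratio, agreement_ratio

# Opt makes 1 disagreement, we make n = 100 of them.
dis_ratio, agr_ratio = approximation_ratios(100, 1, 100)
print(dis_ratio, agr_ratio)
```

With these numbers the disagreement ratio is 100 while the agreement ratio stays close to 1, because both solutions agree with almost all of the 4950 edges.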

Maximizing Agreements: A 2-approximation is easy. Algorithm: if #(+1 edges) > #(-1 edges), put all points in a single cluster; else, put each point in its own individual cluster. Proof: Opt's agreements are at most the total number of edges; we agree on at least half of the edges.
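The majority rule on this slide is short enough to write out directly. The following is a minimal sketch; the names `majority_clustering` and `count_agreements`, the edge encoding, and the four-vertex example are assumptions for illustration.

```python
from itertools import combinations

def majority_clustering(vertices, sign):
    """One big cluster if +1 edges are the majority, otherwise all singletons."""
    plus = sum(1 for label in sign.values() if label == 1)
    minus = len(sign) - plus
    if plus > minus:
        return {v: 0 for v in vertices}
    return {v: i for i, v in enumerate(vertices)}

def count_agreements(vertices, sign, cluster_of):
    """Count +1 edges inside clusters plus -1 edges between clusters."""
    return sum(
        1
        for u, v in combinations(vertices, 2)
        if (sign[frozenset((u, v))] == 1) == (cluster_of[u] == cluster_of[v])
    )

# Hypothetical instance: 4 vertices, four +1 edges and two -1 edges.
vertices = [0, 1, 2, 3]
sign = {frozenset(p): 1 for p in combinations(vertices, 2)}
sign[frozenset((0, 3))] = -1
sign[frozenset((1, 3))] = -1

clustering = majority_clustering(vertices, sign)
agreements = count_agreements(vertices, sign, clustering)
```

Whichever branch is taken, the clustering agrees with the majority label, so it agrees with at least half of the edges, and no clustering can agree with more than all of them.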

Our Result • A PTAS for max. agreements: (1+ε) approximation, Time = n^O(poly(1/ε))

Our Result • An O(1) approximation for minimizing disagreements

Approximation for Disagreements To prove: Dalg ≤ c · Dopt, where Dalg: our disagreements, Dopt: Opt's disagreements. Roadmap: 1) Notation 2) Show existence of Opt(δ) 3) Describe the algorithm 4) Show our clustering is close to Opt(δ)

Notation (+1: similar, -1: dissimilar): Given a clustering, a vertex v is δ-good with respect to a cluster C if it has few disagreements: fewer than δ|C| within C (-1 edges into C) and fewer than δ|C| outside C (+1 edges leaving C). A δ-bad vertex has ≥ δ|C| disagreements. Cluster C is δ-clean if all v ∈ C are δ-good.
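The goodness and cleanness tests from this slide can be sketched as two predicates. The parameter is called `delta` here; the function names, the edge encoding, and the five-vertex example are assumptions for illustration, and the strict inequalities follow the slide's wording.

```python
from itertools import combinations

def is_delta_good(v, C, vertices, sign, delta):
    """v is good w.r.t. C if it has < delta*|C| disagreements inside and outside C."""
    inside = sum(1 for u in C if u != v and sign[frozenset((u, v))] == -1)
    outside = sum(
        1 for u in vertices
        if u not in C and u != v and sign[frozenset((u, v))] == 1
    )
    return inside < delta * len(C) and outside < delta * len(C)

def is_delta_clean(C, vertices, sign, delta):
    """C is clean if every one of its vertices is good w.r.t. C."""
    return all(is_delta_good(v, C, vertices, sign, delta) for v in C)

# Hypothetical instance: {0,1,2,3} is a +1 clique, vertex 4 is -1 to everyone.
vertices = [0, 1, 2, 3, 4]
sign = {frozenset(p): (1 if 4 not in p else -1) for p in combinations(vertices, 2)}
C = {0, 1, 2, 3}

clean_before = is_delta_clean(C, vertices, sign, 0.25)
sign[frozenset((0, 1))] = -1          # introduce one -1 edge inside C
clean_after = is_delta_clean(C, vertices, sign, 0.25)
```

After the flip, vertices 0 and 1 each have one disagreement inside C, which is not strictly below 0.25 · 4 = 1, so the cluster stops being clean at this `delta`.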

Approximation for Disagreements To prove: Dalg ≤ c · Dopt, where Dalg: our disagreements, Dopt: Opt's disagreements. Roadmap: 1) Notation 2) Show existence of Opt(δ) 3) Describe the algorithm 4) Show our clustering is close to Opt(δ)

Existence of Opt(δ) Main idea: Opt -> Opt(δ). Opt(δ): 1) All “non-singleton” clusters are δ-clean 2) Constant times worse than Opt: Dopt(δ) = O(1/δ²) · Dopt

Transforming OPT to OPT(δ): An imaginary procedure applied to Opt. [Figure: optimum clustering with clusters C1, C2]

Transforming OPT to OPT(δ): Identify δ/3-bad vertices. [Figure: optimum clustering C1, C2 with the δ/3-bad vertices marked]

Transforming OPT to OPT(δ): • Move δ/3-bad vertices out of their cluster • If “many” (≥ δ/3 fraction) of a cluster's vertices are δ/3-bad, “split” the cluster [Figure: optimum clustering to OPT(δ): a bad vertex moves out of C2; C1 is split]

Transforming OPT to OPT(δ): Disagreements of OPT(δ). Split (C1): earlier ≥ (δ/3)²|C1|² disagreements; splitting adds ≤ |C1|²/2. Do not split (C2): earlier, each moved vertex had ≥ (δ/3)|C2| disagreements; moving it adds ≤ |C2|. So total disagreements increase by a factor of O(1/δ²).
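The two cases on this slide can be turned into explicit ratios. The following is a worked restatement of the slide's bounds, not an additional result; it treats per-vertex counts loosely, which changes the totals only by a constant factor.

```latex
\[
\text{split } C_1:\quad
\frac{\text{added}}{\text{earlier}}
\;\le\; \frac{|C_1|^2/2}{(\delta/3)^2\,|C_1|^2}
\;=\; \frac{9}{2\delta^2},
\qquad
\text{move } v \text{ out of } C_2:\quad
\frac{\text{added}}{\text{earlier}}
\;\le\; \frac{|C_2|}{(\delta/3)\,|C_2|}
\;=\; \frac{3}{\delta}.
\]
```

For δ ≤ 1 we have 3/δ ≤ 3/δ², so in both cases the added disagreements are within an O(1/δ²) factor of those already paid by the optimum.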

Transforming OPT to OPT(δ): “Non-singleton” clusters are δ-clean: a vertex that was δ/3-good earlier is still δ-good. [Figure: optimum clustering C1, C2 and OPT(δ)]

Main Result Opt() -clean Clustering produced by Algorithm 11-clean

The algorithm Input: Graph G Output: A clustering of G 1) Pick an arbitrary v ∈ G, let C = +1 neighbors of v 2) Vertex Removal Phase: Remove bad vertices from C 3) Vertex Addition Phase: Add good vertices to C 4) Repeat on G-C

Step 1: Choose v, C = +1 neighbors of v. [Figure: v and C drawn over the clusters C1, C2]

Step 2: Vertex Removal Phase: If x is 3δ-bad, C = C-{x}. • No vertex in C1 is removed • All vertices in C2 are removed [Figure: v and C drawn over the clusters C1, C2]

Step 3: Vertex Addition Phase: Add 7δ-good vertices to C. • All remaining vertices in C1 will be added • None in C2 is added • Cluster C is 11δ-clean [Figure: v and C drawn over the clusters C1, C2]
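Steps 1 to 3, repeated on the remaining graph, can be sketched as follows. This is a rough illustration under stated assumptions, not the talk's exact procedure: the function names, the edge encoding, using `min()` as the "arbitrary" pick, the singleton fallback when C becomes empty, and the value `delta = 0.02` are all choices made here. Goodness is evaluated within the vertices that are still unclustered.

```python
from itertools import combinations

def is_good(v, C, remaining, sign, delta):
    """v has < delta*|C| disagreements inside C and < delta*|C| outside C."""
    inside = sum(1 for u in C if u != v and sign[frozenset((u, v))] == -1)
    outside = sum(
        1 for u in remaining
        if u not in C and u != v and sign[frozenset((u, v))] == 1
    )
    return inside < delta * len(C) and outside < delta * len(C)

def cluster(vertices, sign, delta):
    remaining = set(vertices)
    clusters = []
    while remaining:
        # Step 1: pick v (min() only makes runs reproducible) and its +1 neighbors.
        v = min(remaining)
        C = {v} | {u for u in remaining if u != v and sign[frozenset((u, v))] == 1}
        # Step 2: vertex removal phase, drop 3*delta-bad vertices one at a time.
        while True:
            bad = [x for x in C if not is_good(x, C, remaining, sign, 3 * delta)]
            if not bad:
                break
            C.remove(min(bad))
        if C:
            # Step 3: vertex addition phase, add vertices that are 7*delta-good w.r.t. C.
            C |= {y for y in remaining - C if is_good(y, C, remaining, sign, 7 * delta)}
        else:
            C = {v}  # assumption: fall back to a singleton so the loop makes progress
        clusters.append(C)
        remaining -= C  # Step 4: repeat on G - C
    return clusters

# Hypothetical noiseless instance: two groups, +1 inside a group, -1 across.
sign = {
    frozenset(p): (1 if (p[0] < 4) == (p[1] < 4) else -1)
    for p in combinations(range(8), 2)
}
clusters = cluster(range(8), sign, delta=0.02)
```

On this noiseless input the sketch recovers the two planted groups. With such a small `delta`, a single mislabeled edge in a small cluster already makes its endpoints bad, so noisy toy instances tend to dissolve into singletons.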

Case 2: v is a singleton in OPT(δ). Choose v, C = +1 neighbors of v. The same idea works.

Main Result Opt() -clean Algorithm 11-clean

Our Disagreements: • Type 1: involving singletons, ≤ Dopt(δ) • Type 2: in non-singletons [Figure: Opt(δ) clusters C1, C2 next to the algorithm's 11δ-clean clusters, with +1 edges marked]

Disagreements in Non-Singletons Lemma: If δ < ¼, disagreements in δ-clean clusterings are ≤ 8 Dopt. Erroneous triangle: a triangle with edges +1, +1, -1. Disagreements of OPT ≥ # of edge-disjoint erroneous triangles.

Disagreements in Non-Singletons Lemma: If δ < ¼, errors in δ-clean clusterings are ≤ 8 Dopt. Proof idea: for each disagreement, we will find an edge-disjoint erroneous triangle. There are lots of these (≥ ½|C|) for each disagreement, so they cannot all be used up. [Figure: +1, +1, -1 triangles inside a δ-clean cluster C]

Disagreements in Non-Singletons Lemma: If δ < ¼, disagreements in δ-clean clusterings are ≤ 8 Dopt. An identical argument works for disagreements between δ-clean clusters. [Figure: -1, +1, +1 triangle across δ-clean clusters]
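The lower bound used in this lemma (every clustering must pay at least one disagreement per edge-disjoint erroneous triangle) can be illustrated with a greedy packing. The sketch below is not the charging argument from the proof; it only collects some edge-disjoint +1, +1, -1 triangles, and the function name and four-vertex example are assumptions.

```python
from itertools import combinations

def edge_disjoint_erroneous_triangles(vertices, sign):
    """Greedily pick triangles with labels +1, +1, -1 that share no edge."""
    used_edges = set()
    picked = []
    for tri in combinations(vertices, 3):
        edges = [frozenset(pair) for pair in combinations(tri, 2)]
        labels = sorted(sign[e] for e in edges)
        if labels == [-1, 1, 1] and not any(e in used_edges for e in edges):
            used_edges.update(edges)
            picked.append(tri)
    return picked

# Hypothetical instance: all pairs are +1 except the pair (0, 2).
vertices = [0, 1, 2, 3]
sign = {frozenset(p): 1 for p in combinations(vertices, 2)}
sign[frozenset((0, 2))] = -1

triangles = edge_disjoint_erroneous_triangles(vertices, sign)
lower_bound = len(triangles)
```

Here both erroneous triangles contain the edge (0, 2), so only one can be packed; the resulting bound of 1 matches the single disagreement obtained by putting all four vertices in one cluster.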
