Correlation Clustering
Shuchi Chawla, Carnegie Mellon University
Joint work with Nikhil Bansal and Avrim Blum
Document Clustering
• Given a collection of documents, classify them into salient topics
• Typical characteristics:
  • No well-defined "similarity metric"
  • Number of clusters is unknown
  • No predefined topics – desirable to figure them out as part of the algorithm
Research Communities
• Given data on research papers, divide researchers into communities by co-authorship
• Typical characteristics:
  • How to divide really depends on the given set of researchers
  • Fuzzy boundaries
Traditional Approaches to Clustering
• Approximation algorithms: k-means, k-median, k-min sum
• Matrix methods: spectral clustering
• AI techniques: EM, classification algorithms
Problems with Traditional Approaches
• Dependence on an underlying metric
  • Objective functions are meaningless without a metric, e.g. k-means
  • Some algorithms work only on specific metrics (such as Euclidean), e.g. spectral methods
Problems with Traditional Approaches
• Fixed number of clusters
  • Objective is meaningless without a prespecified number of clusters
  • e.g. for k-means or k-median, if k is unspecified, it is best to put each point in its own cluster (every point becomes its own center, at cost 0)
Problems with Traditional Approaches
• No clean notion of the "quality" of a clustering
  • Objective functions do not directly translate to how many items have been grouped wrongly
  • Heuristic approaches
  • Objective functions derived from generative models
Cohen, McCallum & Richman's Idea
• "Learn" a similarity measure on documents
  • may not be a metric!
  • f(x,y) = amount of similarity between x and y
• Use labeled data to train up this function

Our Task
• Classify all pairs with the learned function
• Find the "most consistent" clustering
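A minimal sketch of the first step of this pipeline, assuming the pairwise '+'/'-' labels come from thresholding the learned f (the function name and the 0.5 threshold below are illustrative placeholders, not from the talk):

```python
from itertools import combinations

# Turn a learned pairwise similarity f(x, y) into '+'/'-' edge labels
# by thresholding; `learned_similarity` stands in for the trained function.
def label_pairs(items, learned_similarity, threshold=0.5):
    labels = {}
    for x, y in combinations(items, 2):
        labels[(x, y)] = '+' if learned_similarity(x, y) >= threshold else '-'
    return labels
```

The clustering step that follows sees only these '+'/'-' labels; the raw similarity values play no further role.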
An Example
[Figure: four nodes – Harry B., Harry Bovik, H. Bovik, Tom X. – joined by edges labeled '+' (same) or '-' (different); the highlighted edge is a "disagreement", an edge whose label the chosen clustering violates]
• Consistent clustering: + edges inside clusters, - edges between clusters
• Task: Find the most consistent clustering, i.e. the fewest possible disagreements or, equivalently, the maximum possible agreements
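Why "fewest disagreements" and "maximum agreements" pick out the same optimum: under any fixed clustering, each of the $\binom{n}{2}$ labeled pairs is either an agreement or a disagreement, so

\[ \mathrm{agreements}(\mathcal{C}) + \mathrm{disagreements}(\mathcal{C}) = \binom{n}{2}. \]

The exact optima coincide, but the two objectives behave very differently for approximation, which is why the results later in the talk treat them separately.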
Correlation Clustering
• Given a complete graph with each edge labeled '+' or '-'
• Our measure of a clustering: how many edge labels does it agree with?
• The number of clusters depends on the edge labels
• NP-complete; we consider approximations
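A minimal sketch of this objective (the dict-based representation and function name are mine, not from the paper): edge labels map node pairs to '+' or '-', and a clustering maps each node to a cluster id.

```python
# Count the disagreements of a clustering: '+' edges cut between
# clusters plus '-' edges kept inside a cluster.
def count_disagreements(labels, cluster_of):
    mistakes = 0
    for (x, y), sign in labels.items():
        together = cluster_of[x] == cluster_of[y]
        if (sign == '+' and not together) or (sign == '-' and together):
            mistakes += 1
    return mistakes
```

Minimizing this count over all partitions, with no k specified anywhere, is exactly the correlation clustering objective.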
Compared to Traditional Approaches…
• Do not have to specify k
• No condition on weights – they can be arbitrary
• Clean notion of the quality of a clustering – the number of examples on which the clustering differs from f
• If a good (perfect) clustering exists, it is easy to find
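Why the last point holds: a clustering with zero disagreements forces the '+' edges to form disjoint cliques, so the connected components of the '+' graph recover it exactly. A sketch under that assumption (the helper name is mine):

```python
# If a zero-disagreement clustering exists, the '+' edges form disjoint
# cliques; flood-filling '+' components recovers the clusters exactly.
def perfect_clustering(nodes, labels):
    cluster_of = {}
    for v in nodes:
        if v in cluster_of:
            continue
        stack, cluster_of[v] = [v], v   # start a new '+' component at v
        while stack:
            x = stack.pop()
            for u in nodes:
                if u not in cluster_of and (
                        labels.get((x, u)) == '+' or labels.get((u, x)) == '+'):
                    cluster_of[u] = v
                    stack.append(u)
    return cluster_of
```

One can then confirm optimality by checking that count_disagreements returns 0 on the result.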
Some Machine Learning Justification
• Noise removal
  • There is some true classification function f, but there are a few errors in the data
  • We want to recover the true function
• Agnostic learning
  • There is no inherent clustering
  • Try to find the best representation using a hypothesis with limited expressivity
Our Results
• A constant factor approximation for minimizing disagreements
• A PTAS for maximizing agreements
• Results for the random noise case
Minimizing Disagreements
• Goal: a constant-factor approximation
• Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound)
• Idea: lower bound D_OPT, the number of disagreements made by the optimal clustering
Lower Bounding Idea: Bad Triangles
• A "bad triangle": a triangle with two + edges and one - edge
• Any clustering has to disagree with at least one of its three edges (grouping all three nodes violates the - edge; any split cuts at least one + edge)
• So if the graph contains several edge-disjoint bad triangles, any clustering makes a mistake on each one
[Figure: five nodes 1–5 containing the edge-disjoint bad triangles (1,2,3) and (1,4,5)]
• Therefore D_OPT ≥ #{edge-disjoint bad triangles}
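A sketch of how one might compute such a bound, reusing the label representation above (the greedy strategy and helper names are mine; greedy yields a valid, though not necessarily maximum, set of edge-disjoint bad triangles):

```python
from itertools import combinations

# Greedily collect edge-disjoint bad triangles (two '+' edges, one '-').
# Every clustering errs on at least one edge of each collected triangle,
# and the triangles share no edges, so D_OPT >= the returned count.
# Assumes a complete labeling: every pair appears in `labels`.
def bad_triangle_lower_bound(nodes, labels):
    def sign(x, y):
        return labels.get((x, y)) or labels.get((y, x))
    used, count = set(), 0
    for tri in combinations(nodes, 3):
        edges = [tuple(sorted(pair)) for pair in combinations(tri, 2)]
        if sorted(sign(*e) for e in edges) == ['+', '+', '-'] \
                and not any(e in used for e in edges):
            used.update(edges)
            count += 1
    return count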
Using the Lower Bound
• d-clean cluster: a cluster C in which every node has fewer than d·|C| "bad" edges (- edges inside C, + edges leaving C)
• d-clean clusters have few bad triangles ⇒ few mistakes
• Possible solution: find a d-clean clustering
• Caveat: a d-clean clustering may not exist
• Instead, we show there is always a clustering whose clusters are each d-clean or singletons, and that it makes few mistakes
• Its nice structure helps us find it easily
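A sketch of the d-clean test as stated on the slide (the paper's formal definition may differ in detail; the function name is mine):

```python
# Check whether cluster C is d-clean: every node in C must have fewer
# than d * |C| bad edges, where a bad edge is a '-' edge to a node
# inside C or a '+' edge to a node outside C.
def is_d_clean(C, nodes, labels, d):
    def sign(x, y):
        return labels.get((x, y)) or labels.get((y, x))
    C = set(C)
    for v in C:
        bad = sum(1 for u in nodes
                  if u != v and (u in C) == (sign(u, v) == '-'))
        if bad >= d * len(C):
            return False
    return True
```

The test `(u in C) == (sign(u, v) == '-')` is true exactly for '-' edges inside C and '+' edges leaving it, the two kinds of disagreement a cluster can cause.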
Extensions & Open Problems • Weighted edges or incomplete graph • Recent work by Bartal et al • log-approximation based on multiway cut • Better constant for unweighted case • Can we use bad triangles (or polygons) more directly for a tighter bound? • Experimental performance Shuchi Chawla, Carnegie Mellon University
Other Problems I Have Worked On
• Game theory and mechanism design
• Approximation algorithms for Orienteering and related problems
• Online search algorithms based on machine learning approaches
• Theoretical properties of power-law graphs
• Currently working on privacy with Cynthia
Thanks!
Shuchi Chawla, Carnegie Mellon University