Explore low-rank weighted correlation clustering and its implications for graph clustering: polynomial-time solutions and NP-hardness results for rank-d matrices, heuristic algorithms, and connections to specialized algorithms.
My relationship with correlation clustering started in 2016
• From June to July 2016 I visited Melbourne as part of the East Asia and Pacific Summer Institute Fellowship.
• Tony and I studied weighted correlation clustering with low-rank advice.
• The project was based on an observation Tony and my advisor David Gleich made in 2015 about rank-1 correlation clustering.
[figure: small signed graph with weighted edges]
Nate Veldt
Many algorithms focus on complete, unweighted correlation clustering
• Given a signed graph G
• Each edge indicates similarity (+) or dissimilarity (−)
[figure: complete signed graph with + and − edges]
In general, edges can be weighted
Weights can be stored in an adjacency matrix.
[figure: signed graph with weighted edges and its adjacency matrix]
The rank-1 positive semidefinite case is very simple
If the weight matrix is A = vv^T for some vector v, then each entry A_ij = v_i v_j is positive exactly when v_i and v_j have the same sign. Ordering the nodes by their entries in v and splitting them by sign gives a perfect clustering: every positive edge is inside a cluster and every negative edge crosses clusters.
[figure: rank-1 example; ordering v gives a perfect clustering]
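The rank-1 observation above can be checked in a few lines. This is a minimal sketch (the vector `v` is a made-up example): build A = vv^T, cluster nodes by the sign of v, and confirm there are no mistakes.

```python
import numpy as np

# Rank-1 PSD case: A = v v^T, so sign(A[i, j]) = sign(v_i * v_j),
# and splitting nodes by the sign of v is a perfect clustering.
v = np.array([3.0, -2.0, 2.0, -1.0])   # hypothetical node values
A = np.outer(v, v)                      # rank-1 PSD weight matrix

clusters = (v > 0).astype(int)          # cluster 0: v_i < 0, cluster 1: v_i > 0

# Count mistakes: positive edges across clusters, negative edges within.
mistakes = 0
n = len(v)
for i in range(n):
    for j in range(i + 1, n):
        same = clusters[i] == clusters[j]
        if (A[i, j] > 0 and not same) or (A[i, j] < 0 and same):
            mistakes += 1
print(mistakes)  # 0
```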
A simple solution for rank-1 positive semidefinite correlation clustering always exists. What happens for other low-rank matrices?
Our contributions:
• A polynomial-time solution for rank-d PSD matrices
• An NP-hardness result for matrices with negative eigenvalues
• A heuristic algorithm for PSD matrices
Rank-d PSD correlation clustering is equivalent to clustering vectors in R^d
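The equivalence rests on the fact that any rank-d PSD matrix A factors as A = VV^T, so each node i gets a vector x_i in R^d with A_ij = x_i · x_j. A small sketch of recovering such vectors via an eigendecomposition (the data here is synthetic):

```python
import numpy as np

# A PSD matrix of rank d factors as A = V V^T, giving each node a
# vector in R^d whose dot products reproduce the edge weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))      # hypothetical points in R^2
A = X @ X.T                           # rank-2 PSD weight matrix

w, U = np.linalg.eigh(A)
keep = w > 1e-10                      # nonzero eigenvalues give the rank
V = U[:, keep] * np.sqrt(w[keep])     # rows of V are the recovered vectors

print(np.allclose(V @ V.T, A))  # True
```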
Main observations
For each cluster C_i, define its "sum point" S_i as the sum of the vectors in C_i.
• For a fixed clustering, the objective can be written in terms of the sum points S_i.
• Also, we can show that the number of clusters in an optimal solution is bounded above by d + 1.
[figure: cluster C_i with its sum point S_i]
Why is the number of clusters bounded?
If the clustering is optimal, the sum points will have pairwise negative dot products, i.e. S_i · S_j < 0 for i ≠ j. If not, this would indicate that clusters i and j on the whole are "similar", and merging them would improve the objective.
Fact: The maximum number of vectors in R^d with pairwise negative dot products is d + 1. [Rankin 1947]
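The merging step has a one-line algebraic core. Assuming the within-cluster weight takes the form Σ_i ||S_i||² (this form is my assumption; the slides only say the objective can be written in terms of sum points), merging clusters i and j changes it by exactly 2 S_i · S_j, so a positive dot product means the merge helps:

```python
import numpy as np

# Assumed identity behind the merging argument:
# ||S_i + S_j||^2 - ||S_i||^2 - ||S_j||^2 = 2 * (S_i . S_j),
# so merging helps precisely when the sum points have positive dot product.
rng = np.random.default_rng(1)
Si = rng.standard_normal(3)   # hypothetical sum points
Sj = rng.standard_normal(3)

gain = np.dot(Si + Sj, Si + Sj) - np.dot(Si, Si) - np.dot(Sj, Sj)
print(np.isclose(gain, 2 * np.dot(Si, Sj)))  # True
```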
Our problem can be seen as a special case of the vector partition problem
The vector partition problem can be solved in polynomial time by visiting the vertices of the d²-dimensional signing zonotope [Onn & Schulman 2001]. This leads to a polynomial-time algorithm for rank-d positive semidefinite CC.
In practice we developed a faster heuristic for sampling vertices of the zonotope.
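The zonotope algorithm itself is involved, but the d + 1 cluster bound already enables a tiny brute-force baseline (this is a sanity check of my own, not the paper's method): enumerate all assignments of n points into at most d + 1 clusters and keep the best.

```python
import itertools
import numpy as np

# Brute-force baseline (not the zonotope algorithm): an optimal rank-d
# PSD solution needs at most d + 1 clusters, so for tiny n we can try
# every labeling with d + 1 labels.
def cc_mistakes(A, labels):
    """Total weight of disagreements for a signed weight matrix A."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            if A[i, j] > 0 and not same:
                total += A[i, j]        # positive edge cut apart
            elif A[i, j] < 0 and same:
                total += -A[i, j]       # negative edge kept together
    return total

X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [0.1, -1.0]])
A = X @ X.T                 # rank-2 PSD, so 3 clusters suffice
d = 2
best = min(itertools.product(range(d + 1), repeat=len(X)),
           key=lambda labels: cc_mistakes(A, labels))
print(cc_mistakes(A, best))
```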
Observation: Assuming low-rank edge weights leads to new complexity results, new algorithms, and connections to other problems.
General question: What other special weighted versions of correlation clustering lead to specialized algorithms and new connections?
A new idea: simple but unequal weights for positive and negative edges
Assign weights with respect to a resolution parameter λ ∈ (0, 1).
• No particular connection to low-rank correlation clustering. However, it similarly leads to:
• New complexity results and algorithms
• Connections to other known partitioning problems
This is motivated by applications to graph clustering
Given G = (V, E), construct a signed graph G' = (V, E+, E−), an instance of correlation clustering. Without weights, correlation clustering on G' is the same as a problem called cluster editing.
[figure: graph G and its signed graph G']
Adding weights: each edge of G becomes a positive edge of weight 1 − λ, and each non-edge becomes a negative edge of weight λ. The parameter λ controls your interpretation of the existence or absence of an edge in the network.
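The construction is mechanical enough to sketch directly (the weight scheme follows the (1 − λ)/λ split used in the two-cluster derivation later in the talk):

```python
import numpy as np

# LambdaCC construction: edges of G get positive weight 1 - lam,
# non-edges get negative weight lam.
def lambda_cc_weights(adj, lam):
    """adj: 0/1 symmetric adjacency matrix; returns signed weight matrix."""
    W = np.where(adj == 1, 1.0 - lam, -lam)
    np.fill_diagonal(W, 0.0)   # no self-loops
    return W

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
W = lambda_cc_weights(adj, lam=0.25)
print(W[0, 1], W[0, 3])  # edge -> +0.75, non-edge -> -0.25
```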
LambdaCC generalizes several graph clustering objectives
Modularity, normalized cut, sparsest cut, cluster deletion, and standard correlation clustering (cluster editing), in both degree-weighted and standard variants (m = |E|).
Let's take a quick look at two of these: sparsest cut and cluster deletion!
Sparse and dense clustering objectives
• Sparsest cut: minimize cut(S) / (|S| |S̄|) over sets S (the scaled version).
• Cluster deletion: minimize the number of edges removed to partition the graph into cliques.
Consider a restriction to two clusters S and S̄
Positive mistakes: (1 − λ) cut(S)
Negative mistakes: λ|E−| − λ[ |S||S̄| − cut(S) ]
Total weight of mistakes = cut(S) − λ|S||S̄| + λ|E−|
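The identity above is easy to verify numerically. This sketch counts the weighted mistakes of a bipartition directly and compares against the closed form (the graph is a made-up example):

```python
import numpy as np

# Check: total mistakes = cut(S) - lam*|S|*|S_bar| + lam*|E_minus|.
def two_cluster_mistakes(adj, S, lam):
    """Direct count of LambdaCC mistakes for the bipartition (S, V \\ S)."""
    n = adj.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            same = (i in S) == (j in S)
            if adj[i, j] == 1 and not same:   # positive edge cut
                total += 1.0 - lam
            elif adj[i, j] == 0 and same:     # negative edge inside
                total += lam
    return total

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
S, lam = {0, 1}, 0.3
n = adj.shape[0]
cut = sum(adj[i, j] for i in S for j in range(n) if j not in S)
E_minus = n * (n - 1) // 2 - adj.sum() // 2   # number of non-edges
formula = cut - lam * len(S) * (n - len(S)) + lam * E_minus
print(np.isclose(two_cluster_mistakes(adj, S, lam), formula))  # True
```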
Two-cluster LambdaCC can be written as minimizing cut(S) − λ|S||S̄|, plus the constant λ|E−|.
Note that cut(S) − λ|S||S̄| < 0 exactly when cut(S) / (|S||S̄|) < λ. This is a scaled version of sparsest cut!
The relationship with sparsest cut holds in general
The general LambdaCC objective can be written in an analogous form.
Theorem: Minimizing this objective produces clusters with scaled sparsest cut at most λ (if they exist). There exists some λ' such that minimizing LambdaCC will return the minimum sparsest cut partition.
For large λ, LambdaCC generalizes cluster deletion
Cluster deletion is correlation clustering with infinite penalties on negative edges. We show this is equivalent to LambdaCC for the right choice of λ, where λ ≫ (1 − λ).
Algorithms for LambdaCC
• Adapting the approach of van Zuylen and Williamson, we obtain new algorithms based on LP relaxations:
  • ThreeLP: a 3-approximation for LambdaCC when λ > ½
  • TwoLP: a 2-approximation for cluster deletion (the best known approximation for cluster deletion!)
• We also provide scalable heuristic algorithms:
  • Lambda-Louvain: based on the Louvain method for modularity
  • GrowCluster: a greedy agglomeration technique
[A. van Zuylen and D. P. Williamson. Deterministic pivoting algorithms for constrained ranking and clustering problems. Mathematics of Operations Research, 34(3):594–620, 2009.]
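To give a flavor of the greedy agglomeration idea, here is a minimal sketch in the spirit of GrowCluster; the actual algorithm's details are not in the slides, so the stopping rule and update are my assumptions. It grows one cluster from a seed, always adding the node that most decreases the LambdaCC objective:

```python
import numpy as np

# Hypothetical greedy sketch (not the paper's exact GrowCluster):
# grow a cluster from a seed, adding whichever node improves the
# LambdaCC objective most, until no addition helps.
def grow_cluster(adj, seed, lam):
    n = adj.shape[0]
    cluster, outside = {seed}, set(range(n)) - {seed}
    while True:
        # Adding v removes a (1 - lam) mistake per edge to the cluster
        # and adds a lam mistake per non-edge to the cluster.
        def delta(v):
            edges = sum(adj[v, u] for u in cluster)
            return -edges * (1.0 - lam) + (len(cluster) - edges) * lam
        best = min(outside, key=delta, default=None)
        if best is None or delta(best) >= 0:
            return cluster
        cluster.add(best)
        outside.remove(best)

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
print(sorted(grow_cluster(adj, seed=0, lam=0.4)))  # [0, 1, 2]
```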
We cluster social networks with various λ to understand the correlation between communities and metadata attributes
Dataset: Cornell University (Facebook100). For each attribute (student/faculty status, graduation year, dorm) we plot the probability that two people who share a cluster also share the metadata attribute, alongside the probability that they share a related fake attribute. The gap shows that there is a noticeable correlation between each attribute and the clustering.
The same experiment on Swarthmore and Yale: student/faculty status and graduation year peak early, while the dorm attribute is more correlated with small, dense communities.
Conclusions and other work
• We've considered several special cases of correlation clustering that come with novel approximation guarantees and are motivated by different applications in data science.
Other work:
• Solving the LP relaxation of CC (with James Saunderson)
• Choosing λ for LambdaCC
• Higher-order correlation clustering
Future work:
• Correlation clustering for record linkage
• Practical algorithms for higher-order correlation clustering
• New questions about other low-rank objectives
Thanks!
Papers (arXiv): 1611.07305 (at WWW 2017), 1712.05825 (at WWW 2018), 1809.09493 (ISAAC, to appear), 1809.01678 (submitted)
Software (GitHub): nveldt/LamCC, nveldt/MetricOptimization
With David Gleich (Purdue), Tony Wirth (Melbourne), and James Saunderson (Monash)