Clustering Social Networks Isabelle Stanton, University of Virginia Master of Science Thesis Defense
Outline • Motivation • Previous Work • Finding Tightly Knit Clusters • Finding Loosely Knit Clusters • Group Recommendations • Future Work
Motivation • Many large social networks exist • A fundamental problem is finding communities automatically • Viral and Targeted Marketing • Recommendation Engines
Previous Work – Spectral Methods • Cuts the graph based on an eigenvector • Spectral Methods: • Kannan, Vempala, Vetta 2000, Spielman and Teng 1996, Shi and Malik 2000, Kempe and McSherry 2004, Karypis and Kumar 1998 and many others • cut = partitioning of all elements
Communities in Social Networks • Disjoint partitionings are not good for social networks
Objective: Internal Density • Each vertex in C is adjacent to at least a β fraction of (the rest of) C • Examples: β = 1/2, β = 3/4, β = 1
Objective: External Sparsity • Each vertex outside of C is adjacent to at most an α fraction of C • Example (figure): α = 1/5, β = 1
(α, β)-Clusters • C is an (α, β)-cluster if: • Internally Dense: Every vertex in the cluster neighbors at least a β fraction of the cluster • Externally Sparse: Every vertex outside the cluster neighbors at most an α fraction of the cluster • Examples (figures): a (1/4, 2/3)-cluster and a (1/4, 1)-cluster
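The two conditions can be checked directly on a small graph. The following is a minimal sketch, not the thesis code: the adjacency-dict representation, the function name `is_alpha_beta_cluster`, and the toy graph are all hypothetical, and it assumes that a vertex counts as adjacent to itself so that a clique reaches β = 1.

```python
from fractions import Fraction


def is_alpha_beta_cluster(adj, cluster, alpha, beta):
    """Check the (alpha, beta)-cluster conditions for `cluster` in graph `adj`.

    adj: dict mapping each vertex to the set of its neighbors (undirected).
    Assumption: a vertex counts as adjacent to itself, so a clique has beta = 1.
    """
    cluster = set(cluster)
    size = len(cluster)
    for v, nbrs in adj.items():
        if v in cluster:
            # Internally dense: v (counted with itself) sees >= beta * |C|.
            if len((nbrs | {v}) & cluster) < beta * size:
                return False
        else:
            # Externally sparse: v sees <= alpha * |C|.
            if len(nbrs & cluster) > alpha * size:
                return False
    return True


# Hypothetical toy graph: triangle a-b-c with a pendant vertex d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(is_alpha_beta_cluster(toy, {"a", "b", "c"}, Fraction(1, 3), 1))  # True
print(is_alpha_beta_cluster(toy, {"a", "b", "c"}, Fraction(1, 4), 1))  # False
```

The triangle fails for α = 1/4 because d neighbors one of its three vertices, which is more than a quarter of the cluster. Passing `Fraction` values avoids floating-point rounding at the thresholds.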
Contributions of this work • Definition of criterion • Combinatorial results • 3 overlap results • Bound on number of (α,1) clusters • Three algorithms for varying cases • Experiments validating assumptions on real social networks • Novel formulation of group recommendation problem with experiments
Previous Work – (α, β)-clusters • Solved Areas: • (1 − ε, 1): Tsukiyama et al., Johnson et al. • α = 0: connected components • Our Contributions: • β > ½ + α/2: Algorithms 1 and 2 • α < β^3: Algorithm 3 • [Figure: solved regions of the (α, β) parameter space, with α and β each ranging from 0 to 1]
Outline • Motivation • Previous Work • Finding Tightly Knit Clusters • Finding Loosely Knit Clusters • Group Recommendations • Future Work
Too Many Clusters • [Figure: graph on n vertices x_1, y_1, x_2, y_2, ..., x_{n/2}, y_{n/2}; the edges drawn are the MISSING edges] • Problem: Every vertex in every cluster has as many neighbors outside the cluster as in it
ρ-Champions • [Figure: collaboration graph on Ben Stiller, Gwyneth Paltrow, Will Ferrell, Vince Vaughn, Wes Anderson, Owen Wilson, Steve Martin, Bill Murray and Anjelica Huston, with one vertex marked as the ρ-champion]
ρ-Champions • Def: A vertex is a ρ-champion of C if it has at most ρ|C| neighbors outside C • Claim: If ρ < 2β – 1 – α, every vertex can ρ-champion at most one cluster
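The definition translates into a one-line test. This is a sketch under stated assumptions: the graph is a dict of neighbor sets, the name `is_rho_champion` is hypothetical, and the champion is required to belong to C (the slide does not say so explicitly).

```python
from fractions import Fraction


def is_rho_champion(adj, v, cluster, rho):
    """Return True if v has at most rho * |C| neighbors outside the cluster C.

    Assumption: a champion must itself be a member of C.
    """
    cluster = set(cluster)
    if v not in cluster:
        return False
    outside = len(adj[v] - cluster)  # neighbors of v that are not in C
    return outside <= rho * len(cluster)


# Hypothetical toy graph: triangle a-b-c with a pendant vertex d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(is_rho_champion(toy, "b", {"a", "b", "c"}, 0))                # True
print(is_rho_champion(toy, "a", {"a", "b", "c"}, Fraction(1, 4)))   # False
```

Here b has no neighbors outside the triangle, so it is a champion even for ρ = 0, while a has one outside neighbor and needs ρ ≥ 1/3.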
Outline • Motivation • Previous Work • Combinatorial properties • Finding Tightly Knit Clusters • Deterministic Algorithm • Finding Loosely Knit Clusters • Group Recommendations • Future Work
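Intuition behind the Algorithm • Let c be a ρ-champion of C • If v is in C, then v and c share at least (2β − 1)|C| neighbors • If v is outside C, then v and c share at most (ρ + α)|C| neighbors • [Figure: for v inside C, the neighborhoods of v and c each cover β|C| of the cluster and overlap in (2β − 1)|C|; for v outside C, v has α|C| neighbors in C and c has ρ|C| neighbors outside C]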
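Both bounds follow from counting. The derivation below is a sketch of one way to obtain them, not the thesis proof; N(x) denotes the neighborhood of x, a notation introduced here for illustration.

```latex
% v inside C: both v and c have at least \beta|C| neighbors in C,
% so by inclusion-exclusion inside C
\begin{align*}
|N(v) \cap N(c) \cap C| &\ge |N(v) \cap C| + |N(c) \cap C| - |C| \\
                        &\ge \beta|C| + \beta|C| - |C| = (2\beta - 1)|C|.
\end{align*}
% v outside C: split the common neighbors by whether they lie in C
\begin{align*}
|N(v) \cap N(c)| &= |N(v) \cap N(c) \cap C| + |N(v) \cap N(c) \setminus C| \\
                 &\le |N(v) \cap C| + |N(c) \setminus C| \\
                 &\le \alpha|C| + \rho|C| = (\rho + \alpha)|C|.
\end{align*}
```

The two cases are separated by a threshold exactly when 2β − 1 > ρ + α, which rearranges to β > ½ + (ρ + α)/2.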
Deterministic Algorithm • To find all clusters of size s: • For each c in V do: • C ← ∅ • For each v within two steps of c do: • If v and c share at least (2β − 1)s neighbors then add v to C • If C is an (α, β)-cluster then output C
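The steps above can be sketched in Python. This is an illustrative version under stated assumptions, not the thesis implementation: the graph is a dict of neighbor sets, every helper name is hypothetical, a vertex counts as its own neighbor, and candidates whose size differs from s are discarded.

```python
from fractions import Fraction


def closed(adj, v):
    """Neighborhood of v including v itself (assumed self-adjacency)."""
    return adj[v] | {v}


def is_cluster(adj, C, alpha, beta):
    """Check internal density and external sparsity of C."""
    for v in adj:
        if v in C:
            if len(closed(adj, v) & C) < beta * len(C):
                return False
        elif len(adj[v] & C) > alpha * len(C):
            return False
    return True


def two_step_ball(adj, c):
    """All vertices within two steps of c (c, its neighbors, their neighbors)."""
    ball = closed(adj, c)
    for u in adj[c]:
        ball |= adj[u]
    return ball


def find_clusters(adj, s, alpha, beta):
    """Try every vertex c as a champion and collect the size-s clusters found."""
    found = set()
    for c in adj:
        threshold = (2 * beta - 1) * s
        C = {v for v in two_step_ball(adj, c)
             if len(closed(adj, v) & closed(adj, c)) >= threshold}
        # Assumption: only candidates of exactly the requested size are kept.
        if len(C) == s and is_cluster(adj, C, alpha, beta):
            found.add(frozenset(C))
    return found


# Hypothetical example: two triangles joined by the edge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
clusters = find_clusters(graph, s=3, alpha=Fraction(1, 3), beta=1)
print(sorted(sorted(c) for c in clusters))  # [[1, 2, 3], [4, 5, 6]]
```

Each triangle is recovered from any of its three vertices, so the result is stored as a set of frozensets to remove duplicates.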
Algorithmic Guarantees • Claim: Our algorithm will find all clusters of size s where β > ½ + (ρ + α)/2 • Runs in O(d^{0.7} n^{1.9} + n^{2+o(1)}) time, where d is the average degree • d is a small constant for social networks, so O(n^2)
Evaluation • Do ρ-champions exist in real graphs? • Tsukiyama’s algorithm finds all maximal cliques ((1-ε, 1)-clusters) in a graph • We compare our algorithm’s output with Tsukiyama’s ground truth
Theory Co-Author Dataset Results • Found 797 of 854 clusters ~ 93%
Outline • Motivation • Previous Work • Combinatorial properties • Finding Tightly Knit Clusters • Finding Loosely Knit Clusters • Technical Challenges • Randomized Algorithm • Group Recommendations • Future Work
Loosely Knit Clusters • β ≤ ½ • Technical Problem (figure): a (0, 1/2)-cluster
Connectivity Assumption • Every subset of a cluster has a vertex that is outside the subset but inside the cluster and that neighbors more than a β-fraction of the subset • [Figures: one example that satisfies the assumption (β = 2/7) and one that does not]
Loosely Knit Randomized Algorithm • α < β^3 • Two phases • Phase 1: • Draw a sample of the ρ-champion’s neighbors • Sample neighbors to add to the seed • Stop when the seed is “big enough” • Phase 2: • Exploit connectivity assumption to deterministically grow the seed into the cluster
Example • Phase 1: • Sample of the ρ-champion’s neighbors • Sample neighbors to add to the sample • Stop when the sample is “big enough” • Phase 2 • Deterministically grow cluster
Why does this work? • Random sampling guarantees the expected number of neighbors an outside vertex has with the seed is small • The connectivity assumption guarantees we’ll always make progress • Guarantees: Finds all clusters where α < β^3 with probability 1 – δ • Runs in time O((n^3/δ) log(n/δ) |C|^2)
Outline • Motivation • Previous Work • Combinatorial properties • Finding Tightly Knit Clusters • Finding Loosely Knit Clusters • Group Recommendations • Future Work
Group Recommendations • Clustering isn’t the end goal • What can we do with (α,β)-clusters? • We built a group recommendation engine powered by our clusters • Recommended groups to users of Orkut and LiveJournal
Recommendation Model • Hofmann and Puzicha ’99 • [Figure: worked example of the model with sets of 60 and 20 people, annotated with membership counts and fractions]
Previous Work • Kleinberg and Sandler: Given the Groups × Users matrix, use matrix decomposition • Their code handles at most about 100K variables • No one uses the friendship graph or clusters!
Experimental Setup • Hold out 10% of users with group memberships • Cluster the rest • Create recommendations for held out users based on clusters
Results – LiveJournal Dataset • Held out: 355,495 users • Succeeded on: 210,455 users
Conclusions • Defined (α, β)-clusters • Focus: Overlapping clusters • Introduced ρ-champions • Developed algorithms for a subset of the problem • Ran experiments to validate assumptions and show utility of the clusters • Introduced new interpretation of the recommendation model
Future Work • Algorithms that reduce the necessary α-β gap • Relaxing ρ-champion restriction • Weighted and directed graphs • Decentralized algorithms • Streaming algorithms • Expanding work on group recommendations
Citations • Clustering Social Networks, N. Mishra, R. Schreiber, I. Stanton and R. E. Tarjan, The 5th Workshop on Algorithms and Models for the Web-Graph (WAW 2007), LNCS, vol. 4863, pp. 56-67. • Clustering Social Networks, N. Mishra, R. Schreiber, I. Stanton and R. E. Tarjan, Journal of Internet Mathematics (under submission)
NewKid Algorithm • Input: Graph, Groups, (α,β)-clusters • For each group g and cluster c: • P(g|c) = |members of c in g| / |members of c| • For each new kid, u: • P(c|u) = |friends of u in c| / |friends of u| • Recommend the group g that maximizes Σ_c P(g|c) P(c|u)
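The scoring rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: friendships and group memberships are dicts of sets, clusters are non-empty sets, and the function names and the example users and groups are hypothetical.

```python
from collections import defaultdict


def newkid_scores(friends, groups, clusters, u):
    """Score every group g for new user u as sum_c P(g|c) * P(c|u)."""
    scores = defaultdict(float)
    n_friends = len(friends[u])
    if n_friends == 0:
        return scores  # no friends, so no evidence for any group
    for c in clusters:
        p_c_u = len(friends[u] & c) / n_friends       # P(c|u)
        if p_c_u == 0:
            continue
        for g, members in groups.items():
            p_g_c = len(members & c) / len(c)         # P(g|c)
            scores[g] += p_g_c * p_c_u
    return scores


def newkid_recommend(friends, groups, clusters, u):
    """Return the highest-scoring group for u, or None if nothing scores."""
    scores = newkid_scores(friends, groups, clusters, u)
    return max(scores, key=scores.get) if scores else None


# Hypothetical data: two clusters, two groups, one new user.
clusters = [{"a", "b", "c"}, {"d", "e"}]
groups = {"chess": {"a", "b"}, "hiking": {"c", "d", "e"}}
friends = {"newkid": {"a", "b", "d"}}
print(newkid_recommend(friends, groups, clusters, "newkid"))  # hiking
```

In this example the new user has more friends in the first cluster, yet "hiking" wins (score 5/9 against 4/9 for "chess") because it has members in both clusters.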
HEP Co-Author Dataset Results • Found 115 of 126 clusters ~ 90%
LiveJournal Dataset Results • Too big to run Tsukiyama’s algorithm • Found 4289 clusters, of which 876 have large ρ-champions
Datasets • High Energy Physics Co-Authorship Graph • Theory Co-Authorship Graph • A subset of LiveJournal.com • Notation: τ(v) = the neighbors and neighbors’ neighbors of v
Randomized Algorithm • To find all (α, β)-clusters of size s: • For each c in V do: • Repeat k times: • Draw a random sample S of size t from c’s neighbors • C ← S ∪ {c} • For each v within two steps of c do: • If v has at least ((2β − 1)/β)·t neighbors in S then add v to C • If C is an (α, β)-cluster then output C
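A sampling version of the loop can be sketched as below. It is an illustration under stated assumptions rather than the thesis implementation: the graph is a dict of neighbor sets with sortable vertex labels, all function names are hypothetical, a vertex counts as its own neighbor, vertices with fewer than t neighbors are skipped, candidates of size other than s are discarded, and the `seed` parameter exists only to make runs repeatable.

```python
import random
from fractions import Fraction


def closed(adj, v):
    """Neighborhood of v including v itself (assumed self-adjacency)."""
    return adj[v] | {v}


def is_cluster(adj, C, alpha, beta):
    """Check internal density and external sparsity of C."""
    for v in adj:
        if v in C:
            if len(closed(adj, v) & C) < beta * len(C):
                return False
        elif len(adj[v] & C) > alpha * len(C):
            return False
    return True


def two_step_ball(adj, c):
    """All vertices within two steps of c."""
    ball = closed(adj, c)
    for u in adj[c]:
        ball |= adj[u]
    return ball


def find_clusters_randomized(adj, s, alpha, beta, t, k, seed=0):
    """For each c, draw k samples of t neighbors and grow a candidate from each."""
    rng = random.Random(seed)
    found = set()
    for c in adj:
        nbrs = sorted(adj[c])
        if len(nbrs) < t:
            continue  # assumption: too few neighbors to draw a sample of size t
        threshold = (2 * beta - 1) / beta * t
        for _ in range(k):
            S = set(rng.sample(nbrs, t))
            C = S | {c}
            for v in two_step_ball(adj, c):
                if len(closed(adj, v) & S) >= threshold:
                    C.add(v)
            if len(C) == s and is_cluster(adj, C, alpha, beta):
                found.add(frozenset(C))
    return found


# Hypothetical example: a 4-clique on 1..4 with a pendant vertex 5 attached to 4.
graph = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
result = find_clusters_randomized(graph, s=4, alpha=Fraction(1, 4), beta=1, t=2, k=20)
print(sorted(sorted(c) for c in result))  # [[1, 2, 3, 4]]
```

Compared with the deterministic version, each candidate vertex is tested against the t sampled vertices instead of the champion's whole neighborhood, which is where the smaller per-vertex cost comes from.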
Randomized Algorithm • t = O(log(n/δ)), k = O(n/δ) • Guarantees: Finds all clusters where α < 2β – 1 with probability 1 – δ • Runs in time O((n^3/δ) log(n/δ) (log(n/δ) + |C|)) • Worst case: O((n^4/δ) log(n/δ)) • Average case: O((n^2/δ) log^2(n/δ) d^2)
Combinatorial Properties - Overlaps • Let A and B be (α, β)-clusters with |A|=|B| • Theorem: A and B overlap by at most (1-(β-α))|A| vertices
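One way to see why a bound of this form holds is a short counting argument. The sketch below assumes A ≠ B and is not taken from the thesis; N(v) denotes the neighborhood of v, a notation introduced here.

```latex
% Since |A| = |B| and A \ne B, pick a vertex v \in B \setminus A.
% v is in B, so it has at least \beta|B| neighbors in B.
% v is outside A, so it has at most \alpha|A| neighbors in A, hence in A \cap B.
\begin{align*}
\beta|B| \;\le\; |N(v) \cap B|
  &= |N(v) \cap A \cap B| + |N(v) \cap (B \setminus A)| \\
  &\le \alpha|A| + |B| - |A \cap B|.
\end{align*}
% Substituting |B| = |A| and rearranging:
\[
|A \cap B| \;\le\; (1 - \beta + \alpha)\,|A| \;=\; \bigl(1 - (\beta - \alpha)\bigr)|A|.
\]
```

For example, with α = 1/4, β = 1 and |A| = 8 the bound is 2, so two distinct clusters of that size share at most two vertices.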
Combinatorial Properties - |Clusters| • Claim: There are at most (α,1)-clusters of size s in a graph • Bound is tight as α→ 1 and α = 0. Seems loose elsewhere • Proof is from Steiner Systems • 7 points, block size = 3, restriction = 2 • {1,2,4},{2,3,5},{3,4,6},{4,5,7},{1,5,6},{2,6,7},{1,3,7}
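The seven blocks listed on this slide can be checked mechanically. The sketch below verifies only the Steiner-system property (every pair of points lies in exactly one block); how the thesis proof uses it is not shown on the slide, and the function name `is_steiner_system` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Blocks copied from the slide: 7 points, block size 3, restriction 2.
BLOCKS = [{1, 2, 4}, {2, 3, 5}, {3, 4, 6}, {4, 5, 7}, {1, 5, 6}, {2, 6, 7}, {1, 3, 7}]
POINTS = set(range(1, 8))


def is_steiner_system(blocks, points, block_size, t):
    """True if every t-subset of `points` lies in exactly one block."""
    if any(len(b) != block_size or not b <= points for b in blocks):
        return False
    counts = Counter(sub for b in blocks for sub in combinations(sorted(b), t))
    return all(counts[sub] == 1 for sub in combinations(sorted(points), t))


# Largest number of points shared by two distinct blocks.
max_overlap = max(len(a & b) for a, b in combinations(BLOCKS, 2))

print(is_steiner_system(BLOCKS, POINTS, block_size=3, t=2))  # True
print(max_overlap)  # 1
```

Because each pair of points appears in exactly one block, no two blocks can share two points, which is the kind of limited-overlap structure the counting bound relies on.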
Outline • Motivation • Previous Work • Combinatorial properties • Finding Tightly Knit Clusters • Finding Loosely Knit Clusters • Experiments and Group Recommendations • Are ρ-champions valid? • What are these clusters good for? • Future Work