Clustering

Communities • Previously studied structure of the network • Short paths; navigability; heavy-tailed degrees • Communities • Cohesive sets of nodes • Link structure: “many” edges inside, “few” outside • “More” interconnections than “expected” • Many different definitions… • Over the next few classes we will cover some community definitions in more detail • Density • Modularity • Local spectral

Why Communities • Purely operational reasons • Compressing a network • Identifying link-spam • Predicting biological functions in protein networks • Link recommendation • Studying the network at appropriate level of detail • zooming in and out • Information propagation in links that cross communities • “weak ties” [Granovetter ‘73] • Social capital theory • groups with higher social capital “flourish” more

Micro-markets • Micro-markets are important for experimentation with pricing

Micro-markets in ad-analytics • What is the CTR and advertiser ROI of sports gambling keywords? • (Figure: 1.4 million advertisers and 10 million keywords, with groups labeled Movies, Media, Sports, Sport videos, Gambling, and Sports Gambling)

Communities capture network dynamics • Zachary’s karate club study • Small-scale, in-depth study of how communities predict change in a network • Group dynamics in LiveJournal and DBLP • How do people decide to join groups?

Zachary’s karate club (1977) • Studied 34 members for a period of 2 years • Edges denote friendships • During the study, the club broke into two pieces • One around the instructor, one around the administrator • Split along the minimum cut separating the instructor from the administrator

Group Evolution (Backstrom, Kleinberg ’05) • Membership: what determines whether a person will join a group? • Randomly? • Along friendship edges? • Growth: what characteristics of a group make it grow? • Change: how do groups change over time?

Probability of joining • Probability of joining increases with number of friends in group, but has diminishing returns

Second-order Effect • Of the two candidates x and y, who is more likely to join? • Two possible theories: • “Weak ties = information” [Granovetter ‘73]: more information obtained when ties are “weak” • “Social capital = more active community” [Coleman ‘88]: safety and more engagement when community is active • (Figure: candidates x and y)

Social capital Wins

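Dense subgraphs • Define community in terms of dense subgraphs • Densest subgraph: $\max_{S \subseteq V} |E(S)| / |S|$, where $E(S)$ is the set of edges with both endpoints in $S$ • -density: • Densest at least k: $\max_{|S| \ge k} |E(S)| / |S|$ • Densest at most k: $\max_{|S| \le k} |E(S)| / |S|$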
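
To make the three objectives concrete, here is a minimal brute-force sketch. It assumes the usual definition of density, |E(S)| / |S| (edges inside S divided by the number of nodes in S). The function names and the toy graph are illustrative and not from the lecture. The search is exponential in the number of nodes, so it only suits tiny examples.

```python
from itertools import combinations

def density(edges, nodes):
    """Edge density |E(S)| / |S| of the subgraph induced by `nodes`."""
    s = set(nodes)
    if not s:
        return 0.0
    inside = sum(1 for u, v in edges if u in s and v in s)
    return inside / len(s)

def densest_subgraph_bruteforce(edges, min_size=1, max_size=None):
    """Try every node subset S with min_size <= |S| <= max_size.

    min_size = k gives "densest at least k"; max_size = k gives
    "densest at most k". Exponential time: for tiny graphs only.
    """
    nodes = sorted({x for e in edges for x in e})
    max_size = len(nodes) if max_size is None else max_size
    best, best_density = (), 0.0
    for k in range(min_size, max_size + 1):
        for subset in combinations(nodes, k):
            d = density(edges, subset)
            if d > best_density:
                best, best_density = subset, d
    return set(best), best_density

# Toy graph (hypothetical): a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

print(densest_subgraph_bruteforce(EDGES))              # unconstrained
print(densest_subgraph_bruteforce(EDGES, min_size=5))  # densest at least 5
print(densest_subgraph_bruteforce(EDGES, max_size=3))  # densest at most 3
```

On this toy graph the unconstrained optimum is the 4-clique; forcing at least 5 nodes pulls in node 4 and lowers the density.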

Mining dense graphs • For the webgraph • For detecting link-spam, for use in webgraph compression • For search engines • Looking for reasonably large dense subgraphs • Should be scalable = small memory + time • Algorithm • [Gibson, Kumar, Tomkins] give a fast heuristic • no theoretical guarantee • mines dense bipartite subgraphs

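Tool: Minhashing • Sets A and B • Want to keep a sketch such that the Jaccard coefficient $J(A, B) = |A \cap B| / |A \cup B|$ can be estimated quickly • Consider a random permutation $\pi$ of the universe • For each set A, maintain $\min_{a \in A} \pi(a)$ • Then $\Pr[\min_{a \in A} \pi(a) = \min_{b \in B} \pi(b)] = J(A, B)$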
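
A minimal sketch of the minhash idea. It relies on the standard fact that, under a uniformly random permutation of the universe, two sets share the same minimum with probability equal to their Jaccard coefficient. Keeping c independent minima per set and counting agreements then estimates the coefficient. The helper names, the universe, and the choice c = 400 are assumptions made for illustration.

```python
import random

def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def random_permutations(universe, c, seed=0):
    """c random permutations of the universe, each stored as element -> rank."""
    rng = random.Random(seed)
    universe = list(universe)
    perms = []
    for _ in range(c):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        perms.append(dict(zip(universe, ranks)))
    return perms

def minhash_signature(items, perms):
    """Sketch of a non-empty set: its minimum rank under each permutation."""
    return [min(pi[x] for x in items) for pi in perms]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(0, 60))
B = set(range(30, 90))
perms = random_permutations(range(100), c=400, seed=0)
sig_a = minhash_signature(A, perms)
sig_b = minhash_signature(B, perms)
print("exact:", jaccard(A, B), "estimate:", estimate_jaccard(sig_a, sig_b))
```

The signature has fixed length c regardless of set size, which is what makes comparisons cheap on very large graphs.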

Shingling and Minhashing • For each node, consider “shingles” of s-subsets of neighbors • Compute c-minhash signatures for each node • For each signature, find s-shingle of nodes that contain it • Compute minhash, iterate • Output all connected components as clusters • Each original node can be present in multiple nodes now, so clusters are overlapping • (Figure, s = 3: v1: [s1, s2, s3]; v2: [s1, s4, s5]; v3: [s4, s7, s6])
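
The sketch below illustrates only the first level of this pipeline, under simplifying assumptions. Each node gets c shingles, where a shingle is the s neighbors with the smallest values under one hash function, and nodes that share a shingle are merged with union-find. The second shingling pass and the overlapping clusters of the actual heuristic are omitted. The hash construction and all names are illustrative choices, not the authors' implementation.

```python
import hashlib
from collections import defaultdict

def _rank(seed, x):
    """Deterministic pseudo-random rank of x under hash function number `seed`."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def shingles(neighbors, s, c):
    """c min-hash shingles: for each hash function, the s lowest-ranked neighbors."""
    if len(neighbors) < s:
        return set()
    return {
        tuple(sorted(sorted(neighbors, key=lambda x: _rank(j, x))[:s]))
        for j in range(c)
    }

def shingle_clusters(adj, s=2, c=3):
    """Single-level simplification: connect nodes sharing at least one shingle."""
    by_shingle = defaultdict(set)
    for v, nbrs in adj.items():
        for sh in shingles(nbrs, s, c):
            by_shingle[sh].add(v)

    parent = {v: v for v in adj}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for group in by_shingle.values():
        first, *rest = sorted(group)
        for v in rest:
            parent[find(v)] = find(first)

    comps = defaultdict(set)
    for v in adj:
        comps[find(v)].add(v)
    return [comp for comp in comps.values() if len(comp) > 1]

# Hypothetical toy input: node -> set of out-neighbors.
ADJ = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 4},
    "c": {7, 8, 9},
    "d": {7, 8, 9},
    "e": {20, 21},
}
print(shingle_clusters(ADJ))
```

Nodes with identical neighbor sets always share every shingle, and nodes with disjoint neighbor sets never share one, so the toy input yields the groups {a, b} and {c, d}.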

Results • Run on graph with 50M nodes and 11B edges • Can play around with (s, c) to get different density • Most clusters were attributable to link-spam • Justified using handpicked editorial labeling • Dense subgraphs exhibited faster growth

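Algorithms with Guarantees? • Finding the set S with maximum density $|E(S)| / |S|$ • Consider the decision version: is there a set S whose density is at least a given threshold? • Can be solved exactly using max-flow (or LP) • (Figure: nodes u and v labeled $d_u$ and $d_v$)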
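
For the LP route, one standard formulation is Charikar's linear program, shown here as a sketch. The variable names $x_{uv}$ (one per edge) and $y_v$ (one per node) are notation introduced for this example and do not come from the slides. It assumes density is measured as $|E(S)| / |S|$.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{(u,v) \in E} x_{uv} \\
\text{subject to}\quad & x_{uv} \le y_u, \quad x_{uv} \le y_v && \forall (u,v) \in E, \\
                       & \sum_{v \in V} y_v \le 1, \\
                       & x_{uv} \ge 0, \quad y_v \ge 0.
\end{align*}
```

Any set $S$ yields a feasible solution of value $|E(S)| / |S|$ by setting $y_v = 1/|S|$ for $v \in S$ and $x_{uv} = 1/|S|$ for edges inside $S$ (all other variables zero), so the LP optimum is at least the maximum density. Charikar shows the two are in fact equal.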

Algorithms • Can be solved using parametric max-flow because of special structure of the graph • Time $O(V^2 E)$ • Not practical for large networks • Can we get a faster approximation?

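Faster Approximation • Greedy algorithm: repeatedly remove a minimum-degree node and keep the densest of the intermediate subgraphs • Theorem [Charikar, Khuller-Saha]: this greedy algorithm gives a 2-approximation to the densest subgraph problem • Can be implemented in linear time (how?)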
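
A minimal sketch of greedy peeling, assuming the algorithm the theorem refers to is the standard procedure: repeatedly delete a minimum-degree node and remember the densest intermediate subgraph. For brevity this version uses a binary heap with lazy deletion, which costs O(E log V). The linear-time variant would keep nodes in buckets indexed by degree instead. Function names and the toy graph are illustrative.

```python
import heapq
from collections import defaultdict

def greedy_peel(edges):
    """Return (node set, density) of the best subgraph seen while peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    m = sum(deg.values()) // 2            # edges among alive nodes
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)

    best_density = m / len(alive) if alive else 0.0
    best_size = len(alive)
    order = []                            # nodes in removal order
    while alive:
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue                      # stale heap entry, skip it
        alive.remove(v)
        order.append(v)
        m -= deg[v]                       # deg[v] counts alive neighbors only
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
        if alive and m / len(alive) > best_density:
            best_density, best_size = m / len(alive), len(alive)

    # The best subgraph is what was left after the first n - best_size removals.
    removed = set(order[: len(order) - best_size])
    return set(adj) - removed, best_density

# Toy graph (hypothetical): a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(greedy_peel(EDGES))
```

On this toy graph peeling removes nodes 5 and 4 first and then reaches the 4-clique, which here is also the exact optimum. In general the guarantee is only a factor of 2.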

Reducing passes of Greedy Algorithm (Bahmani, Kumar, Vassilvitskii ’12) • Will need only $O(\log(n)/\epsilon)$ passes • Subgraph reduced by a constant factor at each stage • $(2 + 2\epsilon)$-approximation, similar proof • Similar constructions for DalkS, directed versions
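
A minimal sketch of the batched variant, assuming the rule from Bahmani et al.: in each pass, remove every node whose degree in the current subgraph is at most 2(1 + ε) times the current density, and keep the densest subgraph seen. Since the average degree is twice the density, every pass removes at least one node, so the loop terminates. This in-memory version is for illustration only; the names and the toy graph are not from the lecture.

```python
from collections import defaultdict

def batch_peel(edges, eps=0.1):
    """Return (best node set, its density, number of passes)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    best, best_density, passes = set(alive), 0.0, 0
    while alive:
        # One pass over the data: degrees and density of the current subgraph.
        deg = {v: sum(1 for w in adj[v] if w in alive) for v in alive}
        m = sum(deg.values()) // 2
        rho = m / len(alive)
        if rho > best_density:
            best, best_density = set(alive), rho
        # Remove all low-degree nodes at once instead of one at a time.
        threshold = 2 * (1 + eps) * rho
        alive -= {v for v in alive if deg[v] <= threshold}
        passes += 1
    return best, best_density, passes

# Toy graph (hypothetical): a 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
print(batch_peel(EDGES, eps=0.1))
```

A larger ε removes more nodes per pass, trading approximation quality for fewer passes.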

Empirical Results • Much better performance in practice than predicted • Number of passes is at most 10-12 in practice

Variants: approximations or hardness • Exact-k • Getting PTAS is UGC-hard (Khot ’05) • Feige et al. $O(n^{0.33-a})$ approx. • Bhaskara et al. $O(n^{0.25+a})$ approx. • DalkS (at least k): • NP-hard • 2-approx (Andersen-Chellapilla, Khuller-Saha) • DamkS (at most k): • As hard as Exact-k • Directed graphs • 2-approx (Charikar, Khuller-Saha)

Applications of densest subgraph • Important in many theoretical settings, mostly as a candidate hard problem • Establishing hardness of financial derivatives; cryptography • variants hard even for random graphs • Practical implications • In making reachability and distance queries efficient • biology • Mining coherent dense subgraphs across massive biological networks for functional discovery [HYHHZ ’05] • dense protein interaction subgraph corresponds to a protein complex [BD’03] [SM’03]
