Clustering
Communities • Previously studied structure of the network • Short paths; navigability; heavy tailed degrees • Communities • cohesive set of nodes • Link structure: “many” edges inside, “few” outside • “more” interconnections than ‘expected’ • many different definitions… • Over the next few classes we will cover some community definitions in more detail • Density • Modularity • Local Spectral
Why Communities • Purely operational reasons • Compressing a network • Identifying link-spam • Predicting biological functions in protein networks • Link recommendation • Studying the network at appropriate level of detail • zooming in and out • Information propagation in links that cross communities • “weak ties” [Granovetter ‘73] • Social capital theory • groups with higher social capital “flourish” more
Micro-markets • Micro-markets are important for experimentation with pricing
Micro-markets in ad-analytics • What is the CTR and advertiser ROI of sports gambling keywords? • [Figure: advertiser-keyword graph with 1.4 million advertisers and 10 million keywords; cluster labels: Movies, Media, Sports, Sport videos, Gambling, Sports Gambling]
Communities capture network dynamics • Zachary’s karate club study • Small-scale, in-depth study of how communities predict change in a network • Group dynamics in LiveJournal and DBLP • How do people decide to join groups?
Zachary’s karate club (1977) • Studied 34 members for a period of 2 years • Edges denote friendships • During the study, the club broke into two pieces • One around the instructor, one around the administrator • The split followed the min-cut separating instructor and administrator
Group Evolution (Backstrom, Kleinberg ’05) • Membership: what determines whether a person will join a group? • Randomly? • Along friendship edges? • Growth: What are the characteristics of a group that make it grow? • Change: How do groups change over time?
Probability of joining • Probability of joining increases with number of friends in group, but has diminishing returns
Second-order Effect • Of the two candidates x and y, who is more likely to join? • Two possible theories: • “Weak ties = information” [Granovetter ‘73]: more information obtained when ties are “weak” • “Social capital = more active community” [Coleman ‘88]: safety and more engagement when community is active • [Figure: candidate nodes x and y]
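Dense subgraphs • Define community in terms of dense subgraphs • Density of a node set S: $\rho(S) = |E(S)|/|S|$, where $E(S)$ is the set of edges with both endpoints in S • Densest subgraph: $\max_{S \subseteq V} |E(S)|/|S|$ • -density: • Densest at least k: $\max_{|S| \ge k} |E(S)|/|S|$ • Densest at most k: $\max_{|S| \le k} |E(S)|/|S|$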
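The three objectives can be checked by hand on a toy graph. The sketch below is only an illustration: it assumes the usual density |E(S)|/|S| (edges inside S divided by nodes in S) and tries every subset, so it is exponential and only usable on tiny inputs. The function names and the toy graph are made up for this example.

```python
from itertools import combinations

def density(edges, nodes):
    """Density |E(S)| / |S| of the subgraph induced by `nodes`."""
    s = set(nodes)
    if not s:
        return 0.0
    inside = sum(1 for u, v in edges if u in s and v in s)
    return inside / len(s)

def densest_brute_force(edges, min_size=1, max_size=None):
    """Try every node subset with size in [min_size, max_size].

    min_size=k gives "densest at least k"; max_size=k gives
    "densest at most k"; the defaults give the plain densest subgraph.
    """
    vertices = sorted({x for e in edges for x in e})
    if max_size is None:
        max_size = len(vertices)
    best_density, best_set = -1.0, ()
    for k in range(min_size, max_size + 1):
        for subset in combinations(vertices, k):
            d = density(edges, subset)
            if d > best_density:
                best_density, best_set = d, subset
    return best_set, best_density

# Toy graph (hypothetical): a 4-clique {0,1,2,3} with a path 3-4-5 attached.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

if __name__ == "__main__":
    print(densest_brute_force(EDGES))              # plain densest subgraph
    print(densest_brute_force(EDGES, min_size=5))  # densest at least 5
    print(densest_brute_force(EDGES, max_size=2))  # densest at most 2
```

On this graph the unconstrained optimum is the 4-clique (6 edges over 4 nodes), while the size constraints force sparser answers.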
Mining dense graphs • For the webgraph • For detecting link-spam, for use in webgraph compression • For search engines • Looking for reasonably large dense subgraphs • Should be scalable = small memory + time • Algorithm • [Gibson, Kumar, Tomkins] give a fast heuristic • no theoretical guarantee • mines dense bipartite subgraphs
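Tool: Minhashing • Sets A and B • Want to keep a sketch such that the Jaccard coefficient $J(A,B) = |A \cap B| / |A \cup B|$ can be estimated quickly • Consider a random permutation $\pi$ of the universe • For each set A, maintain $\min_{a \in A} \pi(a)$ • Then $\Pr[\min_{a \in A} \pi(a) = \min_{b \in B} \pi(b)] = |A \cap B| / |A \cup B|$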
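A small sketch of the idea, assuming the standard minhash scheme: under a random permutation, the minimum-ranked elements of A and B coincide with probability equal to the Jaccard coefficient |A ∩ B| / |A ∪ B|, so the fraction of agreeing minima over many permutations estimates it. Explicit permutations are used here for clarity; practical implementations typically use hash functions instead. All names are illustrative.

```python
import random

def make_permutation(universe, seed):
    """Random permutation of `universe`, returned as element -> rank."""
    items = sorted(universe)
    random.Random(seed).shuffle(items)
    return {x: rank for rank, x in enumerate(items)}

def minhash_sketch(s, permutations):
    """For each permutation keep only the smallest rank seen in `s`."""
    return [min(perm[x] for x in s) for perm in permutations]

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of permutations on which the two minima agree."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)

def jaccard(a, b):
    """Exact Jaccard coefficient, for comparison."""
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    A = set(range(0, 60))
    B = set(range(30, 90))
    perms = [make_permutation(A | B, seed) for seed in range(500)]
    est = estimate_jaccard(minhash_sketch(A, perms), minhash_sketch(B, perms))
    print("exact:", jaccard(A, B), "estimate:", est)
```

The sketch size (500 permutations here) trades memory for accuracy; the estimate's standard deviation shrinks like one over the square root of the number of permutations.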
Shingling and Minhashing • For each node, consider “shingles” of s-subsets of neighbors • Compute c-minhash signatures for each node • For each signature, find s-shingle of nodes that contain it • Compute minhash, iterate. • Output all connected components as clusters • each original node can appear in multiple components, so clusters may overlap • Example (s=3): v1 → [s1, s2, s3], v2 → [s1, s4, s5], v3 → [s4, s7, s6]
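A rough sketch of the first shingling pass only, under simplifying assumptions: every s-subset of a node's neighbor list is enumerated and the c subsets with the smallest hash values are kept as that node's signature. Enumerating all s-subsets is only feasible for small neighbor lists; it stands in for the min-wise hashing a scalable implementation would use. `stable_hash`, `shingles` and `shingle_index` are hypothetical names, not taken from the original paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def stable_hash(obj):
    """Deterministic hash, so results do not depend on Python's hash seed."""
    return int(hashlib.sha1(repr(obj).encode()).hexdigest(), 16)

def shingles(neighbors, s, c):
    """Return the c s-subsets of `neighbors` with the smallest hash values."""
    subsets = combinations(sorted(neighbors), s)
    return sorted(subsets, key=stable_hash)[:c]

def shingle_index(adj, s, c):
    """Map each shingle to the set of nodes whose signature contains it."""
    index = defaultdict(set)
    for node, nbrs in adj.items():
        for sh in shingles(nbrs, s, c):
            index[sh].add(node)
    return index

if __name__ == "__main__":
    # Hypothetical toy graph: node -> set of out-neighbors.
    adj = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 4}, "c": {7, 8, 9}}
    for sh, nodes in shingle_index(adj, s=3, c=2).items():
        print(sh, sorted(nodes))
```

Nodes with heavily overlapping neighbor lists tend to share shingles, so they end up grouped under the same index entries; the later passes on the slide repeat the same step on this shingle-to-node mapping.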
Results • Run on graph with 50M nodes and 11B edges • Can play around with (s, c) to get different density • Most clusters were attributable to link-spam • Justified using handpicked editorial labeling • Dense subgraphs exhibited faster growth
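Algorithms with Guarantees? • Finding the set S with maximum density $|E(S)|/|S|$ • Consider the decision version: is there a set S with $|E(S)|/|S| \ge c$? • Can be solved exactly using max-flow (or LP) • [Figure: flow network; nodes u, v labeled with their degrees $d_u$, $d_v$]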
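For reference, one standard LP for the densest subgraph problem is the formulation due to Charikar. The slide does not say which LP it has in mind, so treat this as an assumed example. There is a variable $x_{uv}$ per edge and a variable $y_v$ per node:

```latex
\begin{align*}
\text{maximize}   \quad & \sum_{(u,v) \in E} x_{uv} \\
\text{subject to} \quad & x_{uv} \le y_u, \quad x_{uv} \le y_v && \forall (u,v) \in E \\
                        & \sum_{v \in V} y_v \le 1 \\
                        & x_{uv} \ge 0, \quad y_v \ge 0
\end{align*}
```

Setting $y_v = 1/|S|$ for $v \in S$ and $x_{uv} = 1/|S|$ for edges inside $S$ gives objective value $|E(S)|/|S|$, so the LP optimum is at least the maximum density; Charikar shows that the two are in fact equal.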
Algorithms • Can be solved using parametric max-flow because of special structure of the graph • Time $O(V^2 E)$ • Not practical for large networks • Can we get a faster approximation?
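Faster Approximation • Greedy algorithm: repeatedly remove the minimum-degree node and keep the densest intermediate subgraph • Theorem [Charikar, Khuller-Saha]: The greedy algorithm gives a 2-approximation to the densest subgraph problem. • Can be implemented in linear time (how?)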
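A minimal sketch of the greedy peeling procedure usually associated with this theorem, assuming that is the algorithm meant: repeatedly delete a minimum-degree node and remember the densest intermediate subgraph. This version uses a binary heap with lazy deletion, so it runs in O(E log V) rather than linear time; a bucket queue indexed by degree is one way to reach linear time. The function name `greedy_peel` is illustrative.

```python
import heapq
from collections import defaultdict

def greedy_peel(edges):
    """Return (node_set, density) of the densest subgraph seen while peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    m = sum(deg.values()) // 2          # edges among alive nodes
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    best_density = m / len(alive) if alive else 0.0
    best_size = len(alive)
    removed_order = []
    while heap:
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue                    # stale heap entry, skip it
        alive.remove(v)
        removed_order.append(v)
        m -= deg[v]
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
        if alive and m / len(alive) > best_density:
            best_density, best_size = m / len(alive), len(alive)
    # The best subgraph is whatever remained when `best_size` nodes were left.
    best_set = set(adj) - set(removed_order[: len(adj) - best_size])
    return best_set, best_density

if __name__ == "__main__":
    # Hypothetical toy graph: a 4-clique {0,1,2,3} with a path 3-4-5 attached.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(greedy_peel(edges))
```

On the toy graph the peeling removes the path nodes first and reports the 4-clique; in general the reported density is only guaranteed to be within a factor 2 of the optimum.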
Reducing passes of Greedy Algorithm (Bahmani, Kumar, Vassilvitskii ’12) • Will need only $O(\log(n)/\epsilon)$ passes • Subgraph reduced by a constant factor at each stage • $(2 + 2\epsilon)$-approximation, similar proof • Similar constructions for DalkS, directed versions
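A sketch of the batched variant, assuming the rule from the Bahmani, Kumar and Vassilvitskii paper: in each pass, remove every node whose degree is at most 2(1 + ε) times the current density. Since the average degree is exactly twice the density, at least one node qualifies in every pass, so the loop terminates. This is an in-memory illustration, not a streaming or MapReduce implementation, and `batch_peel` is a made-up name.

```python
from collections import defaultdict

def batch_peel(edges, eps=0.1):
    """Return (node_set, density, passes) for the multi-pass peeling scheme."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    best_set, best_density, passes = set(), -1.0, 0
    while alive:
        # One pass over the remaining graph: degrees and current density.
        deg = {v: sum(1 for w in adj[v] if w in alive) for v in alive}
        m = sum(deg.values()) // 2
        rho = m / len(alive)
        if rho > best_density:
            best_set, best_density = set(alive), rho
        # Drop all low-degree nodes at once instead of one per step.
        threshold = 2 * (1 + eps) * rho
        alive -= {v for v in alive if deg[v] <= threshold}
        passes += 1
    return best_set, best_density, passes

if __name__ == "__main__":
    # Hypothetical toy graph: a 4-clique {0,1,2,3} with a path 3-4-5 attached.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(batch_peel(edges, eps=0.1))
```

Larger values of `eps` remove more nodes per pass (fewer passes) at the cost of a weaker approximation guarantee.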
Empirical Results • Much better performance in practice than predicted • Number of passes is at most 10-12 in practice
Variants: approximations or hardness • Exact-k • getting PTAS is UGC hard (Khot ’05) • Feige et al. $O(n^{0.33-a})$ approx. • Bhaskara et al. $O(n^{0.25+a})$ approx. • DalkS (at least k): • NP hard • 2-approx (Andersen-Chellapilla, Khuller-Saha) • DamkS (at most k): • As hard as Exact-k • Directed graphs • 2-approx (Charikar, Khuller-Saha)
Applications of densest subgraph • Important in many theoretical settings, mostly as a candidate hard problem • Establishing hardness of financial derivatives; cryptography • variants hard even for random graphs • Practical implications • In making reachability and distance queries efficient • biology • Mining coherent dense subgraphs across massive biological networks for functional discovery [HYHHZ ’05] • dense protein interaction subgraph corresponds to a protein complex [BD’03] [SM’03]