Clustering


olympe

Presentation Transcript


  1. Clustering

  2. Communities
     • Previously studied structure of the network
       • Short paths; navigability; heavy-tailed degrees
     • Communities
       • Cohesive set of nodes
       • Link structure: “many” edges inside, “few” outside
       • “More” interconnections than “expected”
       • Many different definitions…
     • Over the next few classes we will cover some community definitions in more detail
       • Density
       • Modularity
       • Local spectral

  3. Why Communities
     • Purely operational reasons
       • Compressing a network
       • Identifying link-spam
       • Predicting biological functions in protein networks
       • Link recommendation
     • Studying the network at an appropriate level of detail
       • Zooming in and out
     • Information propagation in links that cross communities
       • “Weak ties” [Granovetter ’73]
     • Social capital theory
       • Groups with higher social capital “flourish” more

  4. Micro-markets
     • Micro-markets are important for experimentation with pricing

  5. Micro-markets in ad-analytics
     • What is the CTR and advertiser ROI of sports gambling keywords?
     • [Figure: 1.4 million advertisers and 10 million keywords; labeled regions: Movies, Media, Sports, Sport videos, Gambling, Sports Gambling]

  6. Communities capture network dynamics
     • Zachary’s karate club study
       • Small-scale, in-depth study of how communities predict change in a network
     • Group dynamics in LiveJournal and DBLP
       • How do people decide to join groups?

  7. Zachary’s karate club (1977)
     • Studied 34 members for a period of 2 years
     • Edges denote friendships
     • During the study, the club broke into two pieces
       • Instructor and administrator
       • Split along the min-cut separating instructor and administrator

  8. Group Evolution (Backstrom, Kleinberg ’05)
     • Membership: what determines whether a person will join a group?
       • Randomly?
       • Along friendship edges?
     • Growth: what are the characteristics of a group that make it grow?
     • Change: how do groups change over time?

  9. Probability of joining • Probability of joining increases with number of friends in group, but has diminishing returns

  10. Second-order Effect
     • Of the two candidates x and y, who is more likely to join?
     • Two possible theories:
       • “Weak ties = information” [Granovetter ’73]: more information obtained when ties are “weak”
       • “Social capital = more active community” [Coleman ’88]: safety and more engagement when community is active
     • [Figure: candidates x and y]

  11. Social capital Wins

  12. Dense subgraphs
     • Define community in terms of dense subgraphs
     • Densest subgraph:
     • -density:
     • Densest at least k:
     • Densest at most k:
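
The formulas for these objectives are not present in the transcript. For reference, the standard definitions from the densest-subgraph literature can be written as below. The notation ($\rho$, $E(S)$) is an assumption and may differ from the slide, and the "-density" variant is left out because its symbol cannot be recovered.

```latex
% Assumed notation: G = (V, E) is undirected; E(S) is the set of edges
% with both endpoints in S.
\[
  \rho(S) = \frac{|E(S)|}{|S|}
\]
\begin{align*}
  \text{Densest subgraph:}   \quad & \max_{S \subseteq V} \rho(S) \\
  \text{Densest at least } k \text{ (DalkS):} \quad & \max_{S \subseteq V,\ |S| \ge k} \rho(S) \\
  \text{Densest at most } k \text{ (DamkS):}  \quad & \max_{S \subseteq V,\ |S| \le k} \rho(S)
\end{align*}
```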

  13. Mining dense graphs
     • For the webgraph
       • For detecting link-spam, for use in webgraph compression
       • For search engines
     • Looking for reasonably large dense subgraphs
       • Should be scalable = small memory + time
     • Algorithm
       • [Gibson, Kumar, Tomkins] give a fast heuristic
       • No theoretical guarantee
       • Mines dense bipartite subgraphs

  14. Tool: Minhashing
     • Sets A and B
     • Want to keep a sketch such that the Jaccard coefficient $J(A, B) = |A \cap B| / |A \cup B|$ can be estimated quickly
     • Consider a random permutation $\pi$
     • For each set A, maintain $\min_{a \in A} \pi(a)$
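
A minimal sketch of the minhashing idea, assuming a small universe that is known in advance so that explicit random permutations can be stored. The function names (`make_permutations`, `minhash_sketch`, `estimate_jaccard`) are illustrative and do not come from the slides. Under one random permutation, two sets share the same minimum with probability equal to their Jaccard coefficient, so the fraction of agreeing positions across k independent permutations estimates it.

```python
import random


def make_permutations(universe, k, seed=0):
    """Build k random permutations of the universe as element -> rank maps."""
    rng = random.Random(seed)
    items = sorted(universe)
    perms = []
    for _ in range(k):
        ranks = list(range(len(items)))
        rng.shuffle(ranks)
        perms.append(dict(zip(items, ranks)))
    return perms


def minhash_sketch(s, perms):
    """For each permutation, keep the smallest rank of any element of s.

    s must be non-empty and contained in the universe.
    """
    return [min(p[x] for x in s) for p in perms]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of permutations on which the two minima agree."""
    agree = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return agree / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard coefficient, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    perms = make_permutations(range(100), k=200)
    a = set(range(0, 60))
    b = set(range(30, 90))
    est = estimate_jaccard(minhash_sketch(a, perms), minhash_sketch(b, perms))
    print("exact:", exact_jaccard(a, b), "estimate:", est)
```

For universes too large to permute explicitly, practical implementations typically substitute hash functions for the permutations; that substitution is not shown here.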

  15. Shingling and Minhashing
     • For each node, consider “shingles” of s-subsets of neighbors
     • Compute c-minhash signatures for each node
     • For each signature, find the s-shingles of nodes that contain it
     • Compute minhash, iterate
     • Output all connected components as clusters
       • Each original node can be present in multiple nodes now, so clusters are overlapping
     • Example with s = 3: v1 [s1, s2, s3], v2 [s1, s4, s5], v3 [s4, s7, s6]

  16. Results
     • Run on a graph with 50M nodes and 11B edges
     • Can play around with (s, c) to get different density
     • Most clusters were attributable to link-spam
       • Justified using handpicked editorial labeling
     • Dense subgraphs exhibited faster growth

  17. Algorithms with Guarantees?
     • Finding the set S with maximum density
     • Consider the decision version
     • Can be solved exactly using max-flow (or LP)
     • [Figure: nodes u and v with labels du and dv]
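
The LP mentioned here is not spelled out in the transcript. One standard formulation from the densest-subgraph literature (due to Charikar) is sketched below; the variable names $x_{ij}$ and $y_i$ are assumptions, and the slide may have used a different but equivalent form. Its optimal value equals the maximum density $\max_S |E(S)|/|S|$.

```latex
% Assumed notation: y_i is a fractional indicator for node i,
% x_{ij} a fractional indicator for edge (i, j).
\begin{align*}
  \text{maximize}   \quad & \sum_{(i,j) \in E} x_{ij} \\
  \text{subject to} \quad & x_{ij} \le y_i, \quad x_{ij} \le y_j
                            && \text{for all } (i,j) \in E, \\
                          & \sum_{i \in V} y_i \le 1, \\
                          & x_{ij} \ge 0, \quad y_i \ge 0 .
\end{align*}
```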

  18. Algorithms
     • Can be solved using parametric max-flow because of the special structure of the graph
     • Time $O(V^2 E)$
     • Not practical for large networks
     • Can we get a faster approximation?

  19. Faster Approximation
     • Theorem [Charikar, Khuller+Saha]: The above algorithm gives a 2-approximation to the densest subgraph problem.
     • Can be implemented in linear time (how?)
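
The algorithm the theorem refers to is not reproduced in the transcript. A minimal sketch, assuming it is the greedy peeling procedure from the cited work: repeatedly delete a minimum-degree node and return the densest intermediate subgraph. The function name `greedy_densest_subgraph` is illustrative. This version uses a heap with lazy deletion, so it runs in O(E log V) rather than linear time; replacing the heap with buckets indexed by degree is one common way to reach linear time.

```python
import heapq
from collections import defaultdict


def greedy_densest_subgraph(edges):
    """Greedy peeling on an undirected graph given as (u, v) pairs.

    Returns (nodes, density) for the densest subgraph seen while peeling,
    where density = |E(S)| / |S|. Node labels must be mutually comparable
    (e.g. all ints) because they break ties inside the heap.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    alive = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    heap = [(len(adj[u]), u) for u in alive]
    heapq.heapify(heap)

    removal_order = []
    best_density = m / len(alive) if alive else 0.0
    best_removed = 0  # number of removals at which the best density occurred

    while alive:
        deg, u = heapq.heappop(heap)
        if u not in alive or deg != len(adj[u]):
            continue  # stale heap entry
        alive.remove(u)
        removal_order.append(u)
        m -= len(adj[u])
        for w in adj[u]:
            adj[w].discard(u)
            heapq.heappush(heap, (len(adj[w]), w))
        adj[u].clear()
        if alive:
            density = m / len(alive)
            if density > best_density:
                best_density = density
                best_removed = len(removal_order)

    best_nodes = set(adj) - set(removal_order[:best_removed])
    return best_nodes, best_density


if __name__ == "__main__":
    # A 4-clique on {0, 1, 2, 3} with a pendant path 3 - 4 - 5.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(greedy_densest_subgraph(k4 + [(3, 4), (4, 5)]))
```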

  20. Reducing passes of Greedy Algorithm (Bahmani, Kumar, Vassilvitskii ’12)
     • Will need only $O(\log(n)/\epsilon)$ passes
     • Subgraph reduced by a constant factor at each stage
     • $(2 + 2\epsilon)$-approximation, similar proof
     • Similar constructions for DalkS, directed versions
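
A minimal in-memory sketch of the pass-reduction idea, assuming the rule from the cited paper: in each pass, remove every node whose degree is at most 2(1 + eps) times the current density, and keep the densest intermediate subgraph. The name `batch_peel_densest` and the parameter `eps` are illustrative; the original targets streaming and MapReduce settings, which this sketch does not model.

```python
from collections import defaultdict


def batch_peel_densest(edges, eps=0.1):
    """Batch peeling: returns (nodes, density, passes).

    density = |E(S)| / |S|. Requires eps > 0, which guarantees that every
    pass removes at least one node (some node has degree <= average degree).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")

    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    alive = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes = set(alive)
    best_density = m / len(alive) if alive else 0.0
    passes = 0

    while alive:
        passes += 1
        density = m / len(alive)
        threshold = 2 * (1 + eps) * density
        # Decide the whole batch from degrees at the start of the pass.
        batch = [u for u in alive if len(adj[u]) <= threshold]
        for u in batch:
            alive.remove(u)
            m -= len(adj[u])  # edges to nodes not yet removed
            for w in adj[u]:
                adj[w].discard(u)
            adj[u].clear()
        if alive:
            density = m / len(alive)
            if density > best_density:
                best_density = density
                best_nodes = set(alive)

    return best_nodes, best_density, passes


if __name__ == "__main__":
    # A 4-clique on {0, 1, 2, 3} with a pendant path 3 - 4 - 5.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(batch_peel_densest(k4 + [(3, 4), (4, 5)], eps=0.1))
```

A larger `eps` removes more nodes per pass, trading approximation quality for fewer passes.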

  21. Empirical Results • Much better performance in practice than predicted • Number of passes is at most 10-12 in practice

  22. Variants: approximations or hardness
     • Exact-k
       • Getting a PTAS is UGC-hard (Khot ’05)
       • Feige et al.: $O(n^{0.33-a})$ approximation
       • Bhaskara et al.: $O(n^{0.25+a})$ approximation
     • DalkS (at least k)
       • NP-hard
       • 2-approximation (Andersen-Chellapilla, Khuller-Saha)
     • DamkS (at most k)
       • As hard as Exact-k
     • Directed graphs
       • 2-approximation (Charikar, Khuller-Saha)

  23. Applications of densest subgraph
     • Important in many theoretical settings, mostly as a candidate hard problem
       • Establishing hardness of financial derivatives; cryptography
       • Variants hard even for random graphs
     • Practical implications
       • In making reachability and distance queries efficient
       • Biology
         • Mining coherent dense subgraphs across massive biological networks for functional discovery [HYHHZ ’05]
         • A dense protein interaction subgraph corresponds to a protein complex [BD ’03] [SM ’03]
