
What is the right clustering of this graph?

dylan

Presentation Transcript


  1. What is the right clustering of this graph?

  2. Clique Percolation A community is a collection of adjacent k-cliques. Questions: What is a good k? How to find cliques?

  3. Clique Finding
      • Find the largest clique in a graph? NP-complete (as a decision problem)
      • Find a maximal clique containing a given node? Polynomial
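The polynomial case can be illustrated with a greedy extension: start from the node and keep adding any vertex adjacent to every member so far. This is a minimal sketch, assuming the graph is stored as a dict `adj` mapping each vertex to its neighbour set (undirected, no self-loops); the function name is hypothetical. The result is maximal (nothing more can be added) but not necessarily the largest clique containing the node.

```python
def maximal_clique_containing(adj, node):
    """Greedily extend {node} to a maximal clique.

    adj maps each vertex to the set of its neighbours (undirected, no self-loops).
    """
    clique = {node}
    candidates = set(adj[node])      # vertices adjacent to every clique member
    while candidates:
        v = min(candidates)          # deterministic pick; any candidate would do
        clique.add(v)
        candidates &= adj[v]         # keep only vertices that are also adjacent to v
    return clique


# Small example graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}
print(maximal_clique_containing(adj, 0))
print(maximal_clique_containing(adj, 3))
```

Each step intersects the candidate set with one neighbour set, so the running time is polynomial in the graph size.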

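  4. Percolation Algorithm
      • Find all maximal cliques
      • Create the clique-clique overlap matrix (entry i, j is the number of nodes shared by cliques i and j)
      • Ignore diagonal entries less than k and off-diagonal entries less than k-1; the connected components of what remains are the communities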
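The three steps can be sketched in plain Python. This is an illustrative sketch, not the implementation behind the quoted timings: it assumes a dict-of-sets graph `adj`, uses the basic Bron-Kerbosch recursion (without pivoting) for maximal cliques, and follows the usual convention that two cliques are linked when they share at least k-1 nodes. All function names are hypothetical.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k):
    """Return the k-clique communities as a list of node sets."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]  # diagonal >= k
    parent = list(range(len(cliques)))                          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:               # off-diagonal >= k-1
            parent[find(i)] = find(j)

    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return list(groups.values())


# Triangles 0-1-2 and 1-2-3 share an edge; triangle 4-5-6 hangs off node 3.
adj = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
communities = clique_percolation(adj, 3)
print(sorted(sorted(c) for c in communities))
```

With k = 3 the two triangles that share an edge percolate into one community, while the edge 3-4 is too small a clique to connect it to the third triangle.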

  5. Running Time • Maximal clique finding is output-polynomial • Extensively studied • “we note that a complete analysis of a co-authorship network with 127000 links takes less than 2 hours on a PC.”

  6. A more theoretical approach
      Important features of a community:
      • Internally dense
      • Externally sparse
      Clique percolation ignores external sparsity. Modularity defines it in terms of the edge cut.

  7. (α, β)-clustering
      A cluster C is an (α, β)-cluster if
      • Internally dense: every vertex in the cluster neighbors at least a β fraction of the cluster
      • Externally sparse: every vertex outside the cluster neighbors at most an α fraction of the cluster
      [Figure labels: (1/5, 4/5), (1/5, 4/5)]
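The two conditions translate directly into a membership check. A minimal sketch, assuming a dict-of-sets graph and assuming that each vertex counts as its own neighbour (a closed-neighbourhood convention that the slide does not state); the function name is hypothetical.

```python
from fractions import Fraction


def is_alpha_beta_cluster(adj, cluster, alpha, beta):
    """Check the internally-dense / externally-sparse conditions for `cluster`."""
    cluster = set(cluster)
    size = len(cluster)
    for v in adj:
        # Assumed convention: v counts as adjacent to itself.
        inside = len((adj[v] | {v}) & cluster)
        if v in cluster and inside < beta * size:
            return False          # not internally dense
        if v not in cluster and inside > alpha * size:
            return False          # not externally sparse
    return True


# A 5-clique on nodes 0..4 plus an outside node 5 attached to node 0 only.
adj = {v: {u for u in range(5) if u != v} for v in range(5)}
adj[0].add(5)
adj[5] = {0}

alpha, beta = Fraction(1, 5), Fraction(4, 5)
print(is_alpha_beta_cluster(adj, range(5), alpha, beta))
```

Using `Fraction` for α and β avoids floating-point surprises when a count sits exactly on the threshold, as node 5 does here.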

  8. First approach: ρ-champions
      [Figure: example network of actors: Jack Black, Ben Stiller, Gwyneth Paltrow, Will Ferrell, Vince Vaughn, Wes Anderson, Owen Wilson, Elijah Wood, Dan Aykroyd, Steve Martin, Bill Murray, Scarlett Johansson, Anjelica Huston]

  9. Algorithm with ρ-champions
      • Let c be a ρ-champion of C
      • If v is in C, then v and c share at least (2β-1)|C| neighbors
      • If v is outside C, then v and c share at most (α+ρ)|C| neighbors
      [Figure labels: v, c, α|C|, β|C|, ρ|C|, (2β-1)|C|]
      Runs in O(d^0.7 n^1.9 + n^(2+o(1))) time, where d is the average degree
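The two neighbor-count bounds follow from short counting arguments. The derivation below is a sketch under one assumption that the slide does not spell out: a ρ-champion of C is a vertex c in C with at most ρ|C| neighbors outside C. N(v) denotes the neighbors of v.

```latex
% Case 1: v \in C. Both v and c neighbor at least \beta|C| vertices of C,
% so by inclusion-exclusion inside C:
\[
|N(v) \cap N(c)| \;\ge\; |N(v) \cap C| + |N(c) \cap C| - |C|
\;\ge\; \beta|C| + \beta|C| - |C| \;=\; (2\beta - 1)|C| .
\]

% Case 2: v \notin C. Shared neighbors inside C are bounded by the external
% sparsity of v; shared neighbors outside C are bounded by the champion property of c:
\[
|N(v) \cap N(c)| \;\le\; |N(v) \cap C| + |N(c) \setminus C|
\;\le\; \alpha|C| + \rho|C| \;=\; (\alpha + \rho)|C| .
\]
```

The two cases can be told apart by counting shared neighbors only when 2β - 1 > α + ρ, which is one reason the approach needs β > 1/2.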

  10. Discussion
      • Pros
          • Very parallel
          • Experiments show good results
          • Are a good feature in recommendation algorithms
      • Cons
          • β > 1/2 doesn't seem realistic
          • The champion is fairly restrictive
          • Not based on observed data

  11. Finding Overlapping Communities
      Assumptions
      • Community edges are chosen according to the expected affinity (degree) model.
      • Maximality assumption with gap.
      • Community membership accounts for a significant portion of each node's edges.

  12. Another Algorithm Style
      • Grow a community from a set of seed nodes.
      • Clique finding:
          • Pick s starting nodes at random
          • For each starting node v, sample a clique containing v; call the set of sampled cliques S
          • For each clique in S, grow it to a maximal clique
          • Output it if it satisfies your conditions
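A seed-and-grow loop in this style might look like the following. It is a rough sketch under several assumptions: the graph is a dict of neighbour sets, each seed is grown by adding random compatible neighbours, and the acceptance condition is a hypothetical minimum size (`min_size`), standing in for whatever conditions the application requires.

```python
import random


def sample_maximal_cliques(adj, s, min_size=3, seed=0):
    """Grow maximal cliques from s random seed nodes; keep those passing the filter."""
    rng = random.Random(seed)
    starts = rng.sample(sorted(adj), min(s, len(adj)))
    found = set()
    for v in starts:
        clique = {v}
        candidates = set(adj[v])          # vertices adjacent to all clique members
        while candidates:
            u = rng.choice(sorted(candidates))
            clique.add(u)
            candidates &= adj[u]
        if len(clique) >= min_size:       # hypothetical acceptance condition
            found.add(frozenset(clique))
    return found


adj = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
result = sample_maximal_cliques(adj, s=7)
print(sorted(sorted(c) for c in result))
```

Because growth is randomized, different seeds can land on different maximal cliques; storing results as frozensets removes duplicates found from several seeds.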

  13. Ego Networks You are the ego. Your friends form the ego network.

  14. Sociology on Ego Networks Functions Served by Ego Networks • Social support • Sense-making • Social control • Access to resources • Behavioral models

  15. Dunbar Circles Dunbar number - “the theoretical cognitive limit to the number of people with whom one can maintain stable social relationships.” between 100 and 250

  16. Community Detection with Egonets
      • Idea 1: when you remove the ego, the egonet becomes disconnected components.
      • Idea 2: it becomes weakly connected components.
      [Figure: example egonet with groups labeled microsoft, eecs, TCS, college, uva, radio, family, grade school]
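Idea 1 can be sketched as follows: take the ego's friends, drop the ego, and collect the connected components among the friends. This assumes a dict-of-sets graph; the node names and the function name are made up for illustration.

```python
def ego_components(adj, ego):
    """Connected components of the ego network after removing the ego itself."""
    unseen = set(adj[ego])                 # the ego's friends
    components = []
    while unseen:
        start = unseen.pop()
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in adj[u] & unseen:      # only follow edges between friends
                unseen.discard(w)
                component.add(w)
                stack.append(w)
        components.append(component)
    return components


adj = {
    "me": {"ann", "bob", "cat", "dan"},
    "ann": {"me", "bob"},
    "bob": {"me", "ann"},
    "cat": {"me", "dan"},
    "dan": {"me", "cat", "eve"},
    "eve": {"dan"},                        # friend of a friend, not in the egonet
}
groups = ego_components(adj, "me")
print(sorted(sorted(g) for g in groups))
```

Each component is a candidate community seen from the ego's point of view (for example, coworkers versus family).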

  17. Egonet-based Systems: DEMON

  18. DEMON
      • Apply a community detection algorithm (Label Propagation) to the egonet
      • Repeat this for every user in the network
      Community definition: the set of communities is the set of maximal sets that ‘contain’ the egonet communities.

  19. DEMON
      • Merge the results.
      Output: set of overlapping communities
      Running time:
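The overall flow can be sketched in a few lines. This is a simplified illustration, not the published DEMON implementation: it assumes a dict-of-sets graph, uses a deterministic label propagation (ties go to the smallest label, with an assumed cap of 100 sweeps) so the example is reproducible, and merges by keeping only the sets that are not contained in another set, following the community definition on the slide.

```python
from collections import Counter


def label_propagation(adj, nodes):
    """Deterministic label propagation restricted to `nodes`."""
    labels = {v: v for v in nodes}
    order = sorted(nodes)
    for _ in range(100):                       # sweep cap (assumed)
        changed = False
        for v in order:
            counts = Counter(labels[u] for u in adj[v] if u in labels)
            if not counts:
                continue                       # isolated inside the egonet
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    groups = {}
    for v, l in labels.items():
        groups.setdefault(l, set()).add(v)
    return list(groups.values())


def demon(adj):
    """Local communities from every egonet, merged into maximal sets."""
    local = set()
    for ego in adj:
        # Egonet without the ego: run label propagation among the friends.
        for community in label_propagation(adj, adj[ego]):
            local.add(frozenset(community | {ego}))
    # Keep only sets that are not strictly contained in another set.
    return {c for c in local if not any(c < d for d in local)}


# Two triangles, 0-1-2 and 2-3-4, overlapping in node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
communities = demon(adj)
print(sorted(sorted(c) for c in communities))
```

Node 2 ends up in both communities, which is the overlapping behaviour the method is after.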

  20. Cornell Study (slides due to Bruno Abrahao)
      [Figure panels: Metis, ‘Real’ Community, Random Walk, Louvain, Infomap, Newman-Modularity]

  21. Community Detection
      • Community structure is not well defined
          • different people have different notions of community structure
      • Traditional strategy
          • (1) start with an expectation of what a community should look like, e.g., a set of nodes that interact more within the set than with the outside
          • (2) define an optimization problem
          • (3) design a heuristic
          • (4) the solution gives the desired communities

  22. Key questions
      • A multitude of algorithms
          • different objective functions
          • different heuristics
      • How dissimilar are their outputs?
      • Communities may differ from the proposed mathematical constructs
          • e.g., preponderance of links to the outside
      • Which algorithms extract communities that most closely resemble the structure of real communities?

  23. Obstacles to answering the questions
      • We don't know what properties communities possess
      • We can't characterize communities in the absence of negative examples
          • Look at real communities and determine their structure
          • do other sets that are not communities have these properties?
          • every other connected set could be a negative example: intractable
          • sets that are not annotated could also be communities
      • We don't know what metrics we should use
          • modularity, conductance, clustering coefficient...

  24. Building structural classes
      [Diagram labels: Network, Algorithm, Apply, Extract community examples]

  25. Building structural classes
      [Diagram: Algorithm 1 → Class 1, Algorithm 2 → Class 2, Algorithm 3 → Class 3, Algorithm 4 → Class 4, ..., Algorithm k → Class k]

  26. Building a feature space
      [Diagram labels: Labeled Example, Feature Vector]

  27. Building a feature space
      [Diagram label: Feature Space]
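Mapping a community to a point in feature space might look like the sketch below. The three features (size, internal edge density, conductance) are illustrative choices only; the slides do not list the features the study actually used. The graph is assumed to be a dict of neighbour sets, and the function name is hypothetical.

```python
def community_features(adj, community):
    """Return [size, internal density, conductance] for a set of nodes."""
    c = set(community)
    n = len(c)
    internal = sum(len(adj[v] & c) for v in c) // 2   # each internal edge is seen twice
    boundary = sum(len(adj[v] - c) for v in c)        # edges leaving the set
    volume = 2 * internal + boundary                  # sum of degrees inside the set
    total_volume = sum(len(adj[v]) for v in adj)      # sum of all degrees
    density = internal / (n * (n - 1) / 2) if n > 1 else 0.0
    denom = min(volume, total_volume - volume)
    conductance = boundary / denom if denom else 0.0
    return [n, density, conductance]


adj = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
features = community_features(adj, {4, 5, 6})
print(features)
```

Every labeled example, whether extracted by an algorithm or annotated in the data, would be turned into such a vector before any classifier is trained.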

  28. Class Separability
      • Measure inter-class separability: are the classes separable?
      • Separability = distinct structures
      [Diagram label: Feature Space]

  29. Large-scale network datasets
      • Social
      • Commercial
      • Biological
      Facebook + Rice University data used with permission of Mislove et al. Other datasets publicly available.

  30. Community detection algorithms
      • BFS (random connected subgraphs)
      • Random-walk-based (with and without restart)
      • (α,β)-communities
      • InfoMap
      • Markov Clustering
      • Metis
      • Louvain
      • Newman-Clauset-Moore
      • Link Communities
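The simplest entry in this list, a random connected subgraph grown by BFS, can be sketched as below. This is an illustration under assumptions: a dict-of-sets graph, a target `size` chosen by the caller, and a random visiting order among unvisited neighbours. The function name is hypothetical and the sampling details in the study may differ.

```python
import random
from collections import deque


def random_bfs_subgraph(adj, size, seed=0):
    """Grow a connected node set of up to `size` nodes by BFS from a random start."""
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))
    visited = {start}
    queue = deque([start])
    while queue and len(visited) < size:
        u = queue.popleft()
        neighbours = sorted(adj[u] - visited)
        rng.shuffle(neighbours)               # random order among new neighbours
        for w in neighbours:
            if len(visited) == size:
                break
            visited.add(w)
            queue.append(w)
    return visited


adj = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
sub = random_bfs_subgraph(adj, size=4)
print(sorted(sub))
```

Every added node is adjacent to a node already in the set, so the result is connected by construction, which makes it a natural structural baseline.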

  31. Annotated communities
      Metadata included in the datasets identifies exemplar communities that form in these domains. Dataset shown: Facebook + Rice University.

  32. Annotated communities
      To what extent are the classes separable? Train a probabilistic k-way classifier (SVM, k-NN) on the annotated communities and on the communities extracted by Algorithm 1, Algorithm 2, ...

  33. Probabilistic multi-class learners
      Classify (cross-validation). The probabilistic k-way classifier (SVM, k-NN) outputs one probability per class, e.g. Pr(Algorithm 1) = 0.05, Pr(Algorithm 2) = 0.08, ..., Pr(Annotated) = 0.48
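A probabilistic k-way classifier can be as simple as k-NN with vote fractions used as class probabilities. The sketch below is a toy stand-in for the SVM and k-NN learners named on the slide: the feature vectors and the helper name are invented for illustration, and a real experiment would add cross-validation and feature scaling.

```python
import math
from collections import Counter


def knn_class_probabilities(train, query, k=3):
    """train is a list of (feature_vector, label); returns {label: probability}."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


# Toy two-dimensional feature vectors, one class per source of communities.
train = [
    ([0.0, 0.0], "Algorithm 1"),
    ([0.1, 0.0], "Algorithm 1"),
    ([1.0, 1.0], "Annotated"),
    ([0.9, 1.0], "Annotated"),
    ([1.0, 0.9], "Annotated"),
]
near = knn_class_probabilities(train, [0.8, 0.8], k=3)
overall = knn_class_probabilities(train, [0.8, 0.8], k=5)
print(near)
print(overall)
```

If held-out examples from different classes get confidently distinct predictions, the classes occupy separable regions of the feature space.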

  34. Cross-validation performance

  35. Matching annotated communities • Which algorithms extract communities that most closely resemble the structure of annotated communities?

  36. Probabilistic multi-class learners
      Learn a probabilistic k-way classifier from the communities extracted by Algorithm 1, Algorithm 2, ..., Algorithm k

  37. Probabilistic multi-class learners
      Classify. The probabilistic k-way classifier outputs one probability per class, e.g. Pr(Algorithm 1) = 0.02, Pr(Algorithm 2) = 0.19, ..., Pr(Algorithm k) = 0.12

  38. Classification of annotated into extracted

  39. Step 1: identifying the most important features
      7 features out of 36 retain the discriminative power of the full set

  40. Tendencies of algorithms with respect to most discriminative features

  41. Summary
      • Traditional methods are unsupervised
          • they find a particular type of community
          • little sensitivity to different purposes, structures of interest and domains of application
      • Our approach suggests a supervised approach to community detection
          • user specifies what they intended to find through examples (real or synthetic)
          • algorithm learns from those examples and retrieves similar structures in the network

  42. Experimental Assignment • Goal: Do some data mining research, comparing real networks and the models in class • Due: Email a report by Friday, October 12.
