What is the right clustering of this graph?

Clique Percolation
A community is a collection of adjacent k-cliques (two k-cliques are adjacent if they share k−1 nodes).
Questions:
• What is a good k?
• How to find cliques?

Clique Finding
• Find the largest clique in a graph? NP-complete
• Find a maximal clique containing a node? Polynomial

Percolation Algorithm
• Find all maximal cliques
• Create the clique-clique overlap matrix
• Ignore diagonal entries less than k and off-diagonal entries less than k−1

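The three steps above can be sketched in plain Python. This is a minimal illustration, not the implementation behind the timing quoted on the next slide. It assumes an undirected graph stored as a dict of neighbor sets, uses Bron-Kerbosch with pivoting for the maximal cliques, and assumes the usual convention that two maximal cliques of size at least k belong to the same community when they share at least k−1 nodes. The names `from_edges`, `maximal_cliques` and `clique_percolation` are made up for this sketch.

```python
from itertools import combinations

def from_edges(edges):
    """Build an undirected adjacency dict: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    found = []
    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return found

def clique_percolation(adj, k):
    """Return k-clique communities as sorted lists of nodes."""
    # Diagonal of the overlap matrix: keep only cliques of size >= k.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # Off-diagonal entries: keep only overlaps of at least k - 1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())

# Two triangles sharing an edge, plus a third triangle hanging off node 3.
adj = from_edges([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)])
print(clique_percolation(adj, 3))  # [[0, 1, 2, 3], [3, 4, 5]]
```

Node 3 ends up in both communities, which is the overlap that clique percolation is designed to allow.
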
Running Time
• Maximal clique finding is output-polynomial
• Extensively studied
• “we note that a complete analysis of a co-authorship network with 127000 links takes less than 2 hours on a PC.”

A More Theoretical Approach
Important features of a community:
• Internally dense
• Externally sparse
Clique percolation ignores external sparsity.
Modularity defines it as the edge cut.

(α, β)-clustering
A cluster C is an (α, β)-cluster if
• Internally dense: every vertex in the cluster neighbors at least a β fraction of the cluster
• Externally sparse: every vertex outside the cluster neighbors at most an α fraction of the cluster
Figure labels: (1/5, 4/5), (1/5, 4/5)

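The definition can be checked directly for a candidate set. The sketch below is one possible reading of it: it assumes an undirected graph stored as a dict of neighbor sets and assumes that every vertex counts as its own neighbor (closed neighborhoods), which the slide does not state. The function name `is_alpha_beta_cluster` is made up for this example, and exact fractions are used to avoid rounding at the thresholds.

```python
from fractions import Fraction
from itertools import combinations

def is_alpha_beta_cluster(adj, cluster, alpha, beta):
    """Check the internally-dense / externally-sparse conditions for `cluster`."""
    cluster = set(cluster)
    size = len(cluster)
    for v in adj:
        # Assumption: closed neighborhood, so v counts as adjacent to itself.
        inside = len((adj[v] | {v}) & cluster)
        if v in cluster and inside < beta * size:
            return False  # not internally dense
        if v not in cluster and inside > alpha * size:
            return False  # not externally sparse
    return True

# A 5-clique on nodes 0..4, plus node 5 attached only to node 0.
adj = {v: set() for v in range(6)}
for u, v in list(combinations(range(5), 2)) + [(0, 5)]:
    adj[u].add(v)
    adj[v].add(u)

print(is_alpha_beta_cluster(adj, {0, 1, 2, 3, 4}, Fraction(1, 5), Fraction(4, 5)))  # True
print(is_alpha_beta_cluster(adj, {0, 1, 2, 3}, Fraction(1, 5), Fraction(4, 5)))     # False
```

The second call fails because node 4 lies outside the candidate set yet neighbors all of it.
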
First approach: ρ-champions
Figure node labels: Jack Black, Ben Stiller, Gwyneth Paltrow, Will Ferrell, Vince Vaughn, Wes Anderson, Owen Wilson, Elijah Wood, Dan Aykroyd, Steve Martin, Bill Murray, Scarlett Johansson, Anjelica Huston

Algorithm with ρ-Champions
• Let c be a ρ-champion (a vertex of C with at most ρ|C| neighbors outside C)
• If v is in C, then v and c share at least (2β−1)|C| neighbors
• If v is outside C, then v and c share at most (α+ρ)|C| neighbors
Figure labels: v, c, α|C|, β|C|, ρ|C|, (2β−1)|C|
Runs in O(d^0.7 n^1.9 + n^(2+o(1))) time, where d is the average degree

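The shared-neighbor test suggests a simple way to recover a cluster once a champion is known. The sketch below is only an illustration of that test, not the algorithm with the running time quoted above. It assumes the champion `c` and a guess `size` for |C| are given as inputs, and it assumes closed neighborhoods (each vertex is its own neighbor). The name `cluster_from_champion` is made up for this example.

```python
from fractions import Fraction
from itertools import combinations

def cluster_from_champion(adj, c, size, beta):
    """Collect every vertex sharing at least (2*beta - 1) * size neighbors with c."""
    threshold = (2 * beta - 1) * size
    # Assumption: closed neighborhoods, so each vertex neighbors itself.
    closed = {v: adj[v] | {v} for v in adj}
    return {v for v in adj if len(closed[v] & closed[c]) >= threshold}

# A 5-clique on nodes 0..4, a pendant path 0 - 5 - 6 leaving it.
adj = {v: set() for v in range(7)}
for u, v in list(combinations(range(5), 2)) + [(0, 5), (5, 6)]:
    adj[u].add(v)
    adj[v].add(u)

# Node 1 has no neighbors outside the clique, so it is a champion for any rho.
print(sorted(cluster_from_champion(adj, 1, 5, Fraction(4, 5))))  # [0, 1, 2, 3, 4]
```

With β = 4/5 and |C| = 5 the threshold is 3 shared neighbors; node 5 shares only node 0 with the champion and is left out.
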
Discussion
• Pros
  • Very parallel
  • Experiments show good results
  • A good feature in recommendation algorithms
• Cons
  • β > 1/2 doesn’t seem realistic
  • The champion is fairly restrictive
  • Not based on observed data

Finding Overlapping Communities
Assumptions
• Community edges are chosen according to the expected affinity (degree) model.
• Maximality assumption with gap
• Community membership accounts for a significant portion of each node’s edges.

Another Algorithm Style
• Grow a community from a set of seed nodes.
• Clique finding:
  • Pick s starting nodes at random
  • For each starting node v, sample
  • For each clique in S, grow to maximal clique.
  • Output if it satisfies your conditions.

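A rough sketch of the seed-and-grow style follows. The slide does not say what is sampled for each starting node, so this version simply treats the starting node itself as a one-node clique; that is an assumption of the sketch, as are the names `grow_to_maximal_clique` and `sample_maximal_cliques` and the `min_size` output condition. The graph is an undirected dict of neighbor sets without self-loops.

```python
import random

def grow_to_maximal_clique(adj, seed_clique, rng):
    """Greedily add common neighbors until no vertex can extend the clique."""
    clique = set(seed_clique)
    candidates = set.intersection(*(adj[v] for v in clique)) - clique
    while candidates:
        v = rng.choice(sorted(candidates))
        clique.add(v)
        candidates &= adj[v]  # v is dropped too, since v is not in adj[v]
    return frozenset(clique)

def sample_maximal_cliques(adj, s, min_size=3, seed=0):
    """Pick s random starting nodes, grow each, keep cliques meeting the condition."""
    rng = random.Random(seed)
    starts = rng.sample(sorted(adj), min(s, len(adj)))
    found = {grow_to_maximal_clique(adj, {v}, rng) for v in starts}
    return {c for c in found if len(c) >= min_size}

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for clique in sorted(sample_maximal_cliques(adj, s=6), key=sorted):
    print(sorted(clique))
```

Because each run follows a single random path, this finds some maximal cliques rather than all of them, trading completeness for speed and easy parallelism.
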
Ego Networks You are the ego. Your friends form the ego network.
Sociology on Ego Networks
Functions Served by Ego Networks
• Social support
• Sense-making
• Social control
• Access to resources
• Behavioral models

Dunbar Circles
Dunbar number: “the theoretical cognitive limit to the number of people with whom one can maintain stable social relationships.” Between 100 and 250.

Community Detection with Egonets
• Idea 1 – When you remove the ego, the egonet breaks into disconnected components.
• Idea 2 – It breaks into weakly connected components.
Figure labels: microsoft, eecs, TCS, college, uva, radio, family, Grade school

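Idea 1 can be sketched as a connected-components search over the ego's friends with the ego removed. This is a minimal illustration that assumes an undirected graph stored as a dict of neighbor sets; the function name `egonet_communities` and the toy names in the example are made up.

```python
def egonet_communities(adj, ego):
    """Connected components among the ego's friends once the ego is removed."""
    unseen = set(adj[ego])  # only friends of the ego belong to the egonet
    groups = []
    while unseen:
        start = unseen.pop()
        group, stack = {start}, [start]
        while stack:
            u = stack.pop()
            # Follow only edges that stay inside the egonet.
            for w in adj[u] & unseen:
                unseen.discard(w)
                group.add(w)
                stack.append(w)
        groups.append(group)
    return groups

adj = {
    "me": {"a", "b", "c", "d", "e"},
    "a": {"me", "b", "x"},
    "b": {"me", "a"},
    "c": {"me", "d"},
    "d": {"me", "c"},
    "e": {"me"},
    "x": {"a"},
}
print(sorted(sorted(g) for g in egonet_communities(adj, "me")))
# [['a', 'b'], ['c', 'd'], ['e']]
```

Node x is a friend of a but not of the ego, so it never enters any group.
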
DEMON
• Apply a community detection algorithm (Label Propagation) to the egonet
• Repeat this for every user in the network.
Community definition: the set of communities is the set of maximal sets that ‘contain’ the egonet communities.

DEMON
• Merge the results.
Output: Set of overlapping communities
Running time:

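The DEMON steps can be sketched end to end. This is a simplified reading of the slides rather than the reference implementation: it runs a basic asynchronous label propagation on each egonet with the ego removed, adds the ego back to every local community, and merges by keeping only the maximal sets, following the community definition above. The function names, the tie-breaking rule and the `max_rounds` cap are assumptions of this sketch.

```python
import random

def label_propagation(adj, nodes, rng, max_rounds=100):
    """Asynchronous label propagation restricted to the node set `nodes`."""
    label = {v: v for v in nodes}
    order = sorted(nodes)
    for _ in range(max_rounds):
        changed = False
        rng.shuffle(order)
        for v in order:
            counts = {}
            for w in adj[v] & nodes:
                counts[label[w]] = counts.get(label[w], 0) + 1
            if not counts:
                continue  # isolated inside the egonet
            best = max(counts.values())
            if counts.get(label[v], 0) == best:
                continue  # keep the current label on ties
            label[v] = rng.choice(sorted(l for l, n in counts.items() if n == best))
            changed = True
        if not changed:
            break
    groups = {}
    for v, l in label.items():
        groups.setdefault(l, set()).add(v)
    return list(groups.values())

def demon(adj, seed=0):
    """Local communities from every egonet, merged by keeping maximal sets."""
    rng = random.Random(seed)
    local = set()
    for ego in adj:
        for group in label_propagation(adj, set(adj[ego]), rng):
            local.add(frozenset(group | {ego}))
    # Drop any community strictly contained in another one.
    return [set(c) for c in local if not any(c < other for other in local)]

# Two triangles sharing node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
print(sorted(sorted(c) for c in demon(adj)))  # [[0, 1, 2], [2, 3, 4]]
```

Node 2 appears in both output communities, so the result is an overlapping cover rather than a partition.
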
Cornell Study
Slides due to Bruno Abrahao
Figure labels: Metis, ‘Real’ Community, Random Walk, Louvain, Infomap, Newman-Modularity

Community Detection
• Community structure is not well defined
  • different people have different notions of community structure
• Traditional strategy
  • (1) start with an expectation of what a community should look like
    • e.g., a set of nodes that interact more within the set than with the outside
  • (2) define an optimization problem
  • (3) design a heuristic
  • (4) the solution gives the desired communities

Key questions
• A multitude of algorithms
  • different objective functions
  • different heuristics
• How dissimilar are their outputs?
• Communities may differ from the proposed mathematical constructs
  • e.g., preponderance of links to the outside
• Which algorithms extract communities that most closely resemble the structure of real communities?

Obstacles to answering the questions
• We don't know what properties communities possess
• We can't characterize communities in the absence of negative examples
  • Look at real communities and determine their structure
  • do other sets that are not communities have these properties?
  • every other connected set could be a negative example - intractable
  • sets that are not annotated could also be communities
• We don't know what metrics we should use
  • modularity, conductance, clustering coefficient...

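Two of the candidate metrics named above can be computed in a few lines. The sketch uses the standard definitions: conductance of a node set is the number of cut edges divided by the smaller of the two volumes, and modularity of a partition sums, over parts, the fraction of internal edges minus the squared fraction of total degree. It assumes an undirected, unweighted graph stored as a dict of neighbor sets; the function names are made up for this example.

```python
def conductance(adj, nodes):
    """Cut edges divided by the smaller volume (sum of degrees) of the two sides."""
    nodes = set(nodes)
    cut = sum(len(adj[v] - nodes) for v in nodes)
    vol_in = sum(len(adj[v]) for v in nodes)
    vol_out = sum(len(adj[v]) for v in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0

def modularity(adj, partition):
    """Newman modularity Q of a partition given as an iterable of node sets."""
    two_m = sum(len(adj[v]) for v in adj)  # twice the number of edges
    q = 0.0
    for part in partition:
        part = set(part)
        internal = sum(len(adj[v] & part) for v in part)  # each edge counted twice
        degree = sum(len(adj[v]) for v in part)
        q += internal / two_m - (degree / two_m) ** 2
    return q

# Two triangles joined by the single edge 2 - 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(conductance(adj, {0, 1, 2}))               # 1/7, about 0.143
print(modularity(adj, [{0, 1, 2}, {3, 4, 5}]))   # 5/14, about 0.357
```

Low conductance and high modularity both favor the two-triangle split here, but on real networks the metrics can disagree, which is the obstacle the slide points at.
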
Building structural classes
Figure: Network → Apply Algorithm → Extract community examples
Figure: Algorithm 1 → Class 1, Algorithm 2 → Class 2, Algorithm 3 → Class 3, Algorithm 4 → Class 4, ..., Algorithm k → Class k

Building a feature space
Figure labels: Labeled Example, Feature Vector, Feature Space

Class Separability Measure
• Inter-class separability
• Are the classes separable?
• Separability = Distinct structures
Figure label: Feature Space

Large-scale network datasets
• Social
• Commercial
• Biological
Facebook + Rice University data with permission of Mislove et al. Other datasets publicly available.

Community detection algorithms
• BFS (Random connected subgraphs)
• Random-Walk-based (with and without restart)
• (α,β)-communities
• InfoMap
• Markov Clustering
• Metis
• Louvain
• Newman-Clauset-Moore
• Link Communities

Annotated communities
Metadata included in the datasets identifies exemplar communities that form in these domains.
Figure label: Facebook + Rice University

To what extent are the classes separable?
Train a probabilistic k-way classifier (SVM, k-NN) on the classes: Annotated communities, Algorithm 1, Algorithm 2

Probabilistic multi-class learners
Classify (cross-validation) with the probabilistic k-way classifier (SVM, k-NN):
• Pr(Algorithm 1) = 0.05
• Pr(Algorithm 2) = 0.08
• ...
• Pr(Annotated) = 0.48

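A probabilistic k-way classifier of the k-NN kind can be sketched without any library: the class probabilities for a query are the vote fractions among its k nearest labeled examples. This is only an illustration of the idea. The two-dimensional feature vectors below are made-up toy values, not features or results from the study, and the function name `knn_class_probabilities` is hypothetical. Cross-validation is left out.

```python
import math
from collections import Counter

def knn_class_probabilities(train, query, k=3):
    """train: list of (feature_vector, class_label). Returns {label: probability}."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    labels = sorted({label for _, label in train})
    return {label: votes[label] / len(nearest) for label in labels}

# Toy labeled examples (hypothetical feature values, for illustration only).
train = [
    ((0.1, 0.9), "Algorithm 1"), ((0.2, 0.8), "Algorithm 1"),
    ((0.8, 0.2), "Algorithm 2"), ((0.9, 0.1), "Algorithm 2"),
    ((0.5, 0.5), "Annotated"), ((0.55, 0.45), "Annotated"), ((0.6, 0.5), "Annotated"),
]
print(knn_class_probabilities(train, (0.52, 0.5), k=5))
# {'Algorithm 1': 0.2, 'Algorithm 2': 0.2, 'Annotated': 0.6}
```

Reading the output as on the slide, a class whose examples rarely receive high probability for any other class is well separated in the feature space.
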
Matching annotated communities
• Which algorithms extract communities that most closely resemble the structure of annotated communities?

Probabilistic multi-class learners
Learn a probabilistic k-way classifier from the classes: Algorithm 1, Algorithm 2, ..., Algorithm k

Probabilistic multi-class learners
Classify with the probabilistic k-way classifier:
• Pr(Algorithm 1) = 0.02
• Pr(Algorithm 2) = 0.19
• ...
• Pr(Algorithm k) = 0.12

Step 1: identifying the most important features
7 features out of 36 retain the discriminative power of the full set

Tendencies of algorithms with respect to the most discriminative features

Summary
• Traditional methods are unsupervised
  • they find a particular type of community
  • little sensitivity to different purposes, structures of interest and domains of application
• Our work suggests a supervised approach to community detection
  • user specifies what they intend to find through examples (real or synthetic)
  • algorithm learns from those examples and retrieves similar structures in the network

Experimental Assignment
• Goal: Do some data mining research, comparing real networks and the models in class
• Due: Email a report by Friday, October 12.