290 likes | 609 Views
Community Detection . Laks V.S. Lakshmanan (based on Girvan & Newman. Finding and evaluating community structure in networks . Physical Review E 69, 026113 (2004). M. E. J. Newman. Fast algorithm for detecting community structure in networks . Physical Review E 69, 066133 (2004). .
E N D
Community Detection Laks V.S. Lakshmanan (based on Girvan & Newman. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004). M. E. J. Newman.Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004).
The Problem • Can we partition the network into groups s.t. the inter-group edges are sparse while the intra-group edges are dense? • Why is it interesting/useful? • Understanding comm. structure – means to understanding n/w structure. • Graph partitioning – similar problem; graph of processes, edges=communication; assign sub-graphs to processors to minimize inter-processor comm. & balance processor load. (NP-hard in general.) • Diff. w/ graph partitioning.
A Hierarchical Clustering Approach • Define a notion of similarity or affinity between nodes. • E.g.: := #node-disjoint paths between and . • := #edge-disjoint paths between and . • := weighted sum of all paths, with longer paths weighted down, e.g., Katz! • Qn: how can we compute #2, 3 fast? • (Efficient algorithms for Katz have been developed.)
Community detection via hierarchical clustering • Compute all pairwise node similarities for every edge present. • Repeatedly add edges with greatest similarity. • leads to a tree (called dendrogram). • A slice throguh the dendogram represents a clustering or comm. structure.
Limitations of HC approach • “Misplaces” nodes in the periphery. • E.g.: Which community should 5 belong to? Alternative approach based on “edge betweenness”. 1 5 2 4 3
Key Intuition • An inter-comm. edge has a higher “betweenness” compared to an intra-comm. edge, i.e., more paths between node pairs pass through it. • Start with G. • Repeatedly remove edges with highest betweenness until <some stopping criterion>. • Communities = resulting components.
Basic Algorithm • repeat { • Calculate betweenness of all edges; • Remove one with highest betweenness, breaking ties arbitrarily; } • Until no edges left. • Remarks: • Which betweenness score? • Calculate upfront and reuse or recalculate? • Can we incrementally recalculate after each edge removal? • Related algorithms for node betweenness by Newman and Brandes.
A Real Example (Zachary’s Karate Club) With recalculation of betweenness. Without recalculation of betweenness.
Scalability Issues • Edge betweenness for all edges can be computed in time (=#edges, =#nodes). [Newman 2001] – details soon. • Recalculation makes algorithm , so not feasible for large networks.
Computing edge betweenness • An Example b d a g Compute #geodesics from every node to g. c e f Breadth-first search – means for doing many things.
Computing edge betweenness • An Example b d d=0 w=1 a g c e f Breadth-first search – means for doing many things.
Computing edge betweenness • An Example d=1 w=1 b d d=0 w=1 a g c e d=1 w=1 f Breadth-first search – means for doing many things.
Computing edge betweenness • An Example d=2 w=2 d=1 w=1 b d d=0 w=1 a g c e d=1 w=1 d=2 w=2 f d=2 w=2 Breadth-first search – means for doing many things.
Computing edge betweenness • An Example Have all info. we need for edge betweenness now. d=2 w=2 d=1 w=1 b d d=0 w=1 d=3 w=4 a g c e d=1 w=1 d=2 w=2 f d=2 w=2 Breadth-first search – means for doing many things.
Computing edge betweenness • An Example Note: a and f are like leaves: no geodesic to g from other nodes passes through them. d=2 w=2 d=1 w=1 b d d=0 w=1 2/4 d=3 w=4 1/2 a g 2/4 c e d=1 w=1 d=2 w=2 1/2 f d=2 w=2 Breadth-first search – means for doing many things.
Computing edge betweenness • An Example Note: a and f are like leaves: no geodesic to g from other nodes passes through them. d=2 w=2 d=1 w=1 b d ½(1+2/4) d=0 w=1 2/4 d=3 w=4 ½(1+2/4) 1/2 a g 2/4 c e d=1 w=1 ½(1+2/4) ½(1+2/4) d=2 w=2 1/2 f d=2 w=2 Breadth-first search – means for doing many things.
Computing edge betweenness • An Example 1/1[ 1+½(1+2/4)+1/2(1+2/4)+1/2] d=2 w=2 d=1 w=1 Note: a and f are like leaves: no geodesic to g from other nodes passes through them. b d ½(1+2/4) d=0 w=1 2/4 d=3 w=4 ½(1+2/4) 1/2 a g 2/4 c e d=1 w=1 ½(1+2/4) ½(1+2/4) d=2 w=2 1/2 f d=2 w=2 Breadth-first search – means for doing many things.
EB Computation summary • For any one target node, compute weights of nodes by BFS; = #geodesics from to target. • Suppose rest of (containing target). • Then intuitively, of the geodesics from to the target node go through .
EB Computation summary (contd.) • For any edge ( further from target than ), = • The above is wrt a specific target node. • Overall bet for any edge = sum of bet wrt every node treated as target node.
EB computation – complexity analysis • For any one target node, BFS gives bet of every edge w.r.t. that target node, in time. • Doing so for every node treated as target node time for final betweenness score for every edge. • Quite elegant, but recalculation bumps up complexity to • Need more scalable approaches for CD.
On scaling up CD algorithm • determine intelligently which edges need their bet recalculated, when an edge is removed. • When is removed, needs to be recalculated only if is in the same connected component as . • For a very large component, doesn’t prune much. • Perhaps it’s only important to determine the edge with the next highest bet. • can we maintain enough “state” so that when is removed, we can recalculate incrementally, i.e., not from scratch? Point to ponder!
Closing Remarks 1/2 • Newman also proposed other bases for defining edge betweenness. • Electrical current flow through the edge where every edge is viewed as unit resistance and we consider all source-sink pairs. • Based on random walks. • Both less effective and more expensive than geodesics (see paper for details). • What about directed and weighted cases?
Closing Remarks 2/2 • Goodness metric of community division. • Helpful when we don’t know the ground truth. • Q = ∑i(eii– ai2 ), where Ekxk= matrix of community division: eij = fraction of edges linking comm. i to comm. j; ai = ∑j eij . Q measures fraction of intra-comm. edges over what is expected by chance (assuming uniform distribution). See paper for details of experimental results. • Turns out study of influence/information propagation can suggest new ways of detecting communities: will revisit this issue after we study influence propagation.
Recommended Reading • J. Ruan and W. Zhang. An Efcient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks. ICDM 2007. • M. E. J. Newman "Modularity and community structure in networks", physics/0602124 = Proceedings of the National Academy of Sciences (USA) 103 (2006): 87577—8582. • Jure Leskovec, Kevin J. Lang, and Michael W. Mahoney. Empirical Comparison of Algorithms for Network Community Detection. WWW 2010. • M. E. J. Newman. Communities, modules and large-scale structure in networks. Nature Physics 8, 25–31 (2012) doi:10.1038/nphys2162 Received 23 September 2011 Accepted 04 November 2011 Published online 22 December 2011.