A scalable multilevel algorithm for community structure detection • Melih Onus (Arizona State University) • Hristo Djidjev (Los Alamos National Laboratory) • Models and Algorithms for the Web Graph (WAW 2006), November 29 – December 2, 2006
Community Structure Detection Problem • The problem of identifying communities in a network is usually modeled as a graph clustering problem • Vertices correspond to individual items • Edges describe relationships • The communities correspond to subgraphs • Dense connections between vertices from the same subgraph • Fewer connections between vertices in different subgraphs
Motivation: Why detect communities? • Analyze and understand the information contained in the huge amount of data available on the WWW • Find related commercial items • Recommendation systems • Important for • Social networks • Ad-hoc networks • Protein interaction networks • Genetic networks
Motivation: Why detect communities? • Predict how much someone is going to love a movie based on their movie preferences • Grand Prize: $1,000,000
Outline of the talk • Previous work • Graph partitioning problem • Our approach • Modularity • Reduction • Multilevel graph partitioning • Experimental results • Conclusions
Previous Work • Two main classes • Agglomerative methods (addition of edges) • Divisive methods (removal of edges) • Algorithms based on • Laplacian matrix • Centrality measures • Flow models • Random walks • Resistor networks • Optimization • Existing methods are either not fast enough or not accurate enough
Graph Partitioning Problem • Given a graph G(V, E), find a partition of V such that • The partition is balanced (i.e., all subsets contain roughly the same number of vertices) • The cut size is minimized (i.e., the number of edges with endpoints in different subsets is as small as possible) • Previous Work: • Kernighan-Lin algorithm • Spectral partitioning • Multilevel algorithms
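The two criteria on this slide can be made concrete with a small sketch. The helper names `cut_size` and `part_sizes` and the toy graph (two triangles joined by one edge) are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def cut_size(edges, part):
    """Count the edges whose endpoints lie in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def part_sizes(part):
    """Number of vertices in each subset, used to judge balance."""
    return Counter(part.values())

# Toy graph (assumed for illustration): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print("cut size:", cut_size(edges, part))
print("subset sizes:", dict(part_sizes(part)))
```

Here the partition is perfectly balanced (three vertices per subset) and only the joining edge is cut.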
Kernighan-Lin Algorithm • Find an initial random partition • Improve it by a greedy procedure that swaps pairs of vertices (u, v) from different parts • Minimize the size of the cut set
Graph Partitioning vs Graph Clustering • Graph clustering: • Find clusters • Community sizes may differ • The number of subsets varies • Graph partitioning: • Minimize cut size • Equal number of vertices in each subset • The number of subsets is an input • Algorithms for graph partitioning cannot be used directly to produce good-quality clusterings
Our approach • Convert the original graph G into a complete graph G' • Find a min-cut of G' using a modified graph partitioning method • This produces a good-quality (high-modularity) clustering of G
Modularity • A useful measure of clustering quality • Introduced by Newman [6] • Modularity of a partitioning = (number of edges within communities) – (expected number of such edges) • We are trying to find a division of the graph with high modularity
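As a minimal sketch of the definition above, the function below computes modularity in Newman's commonly used normalized form, Q = Σ_c [ L_c/m − (D_c/2m)² ], where L_c is the number of edges inside community c, D_c is the sum of degrees in c, and m is the number of edges. Dividing by m and using the degree-based expectation are assumptions here; the slide states only the unnormalized difference. The function name and toy graph are illustrative.

```python
def modularity(edges, community):
    """Normalized modularity of a clustering of a simple undirected graph.

    edges: list of (u, v) pairs; community: dict vertex -> community label.
    """
    m = len(edges)
    internal = {}    # edges with both endpoints in the community
    degree_sum = {}  # sum of vertex degrees in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Toy graph (assumed): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "a" for v in range(6)}

print(modularity(edges, two_clusters))  # 5/14, about 0.357
print(modularity(edges, one_cluster))   # 0.0
```

Putting everything into a single community always gives zero, which is why a positive value signals more internal edges than the random model predicts.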
Reduction Min-Cut Problem: The problem of finding a minimum cut in a complete edge-weighted graph G' Graph Clustering Problem: The problem of finding a clustering of maximum modularity in G
Reduction • Graph Clustering Problem: maximize modularity • Modularity of a partitioning = (number of edges within communities) – (expected number of such edges) • Equivalently, minimize (–modularity) = (cut size) – (expected cut size) • Min-Cut Problem: minimize cut size
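The step from maximizing modularity to minimizing a cut can be spelled out as follows. This is a sketch under stated assumptions: $a_{ij}$ is the adjacency indicator of G, $p_{ij}$ is the edge probability of the random model, $c(i)$ is the community of vertex $i$, and the weights $a_{ij} - p_{ij}$ assigned to the complete graph G' are an assumed construction consistent with the slide, not a formula quoted from it.

```latex
\begin{align*}
\text{Modularity}
  &= \sum_{\substack{i<j \\ c(i)=c(j)}} \left(a_{ij} - p_{ij}\right) \\
  &= \sum_{i<j} \left(a_{ij} - p_{ij}\right)
     \;-\; \sum_{\substack{i<j \\ c(i)\neq c(j)}} \left(a_{ij} - p_{ij}\right) \\
  &= \underbrace{\left(m - \mathbb{E}[m]\right)}_{\text{independent of the partition}}
     \;-\; \bigl(\text{cut size} - \text{expected cut size}\bigr).
\end{align*}
```

Because the first term does not depend on the partition (and is zero when the random model preserves the expected number of edges), maximizing modularity in G amounts to minimizing the cut in a complete graph G' whose edge $(i,j)$ carries weight $a_{ij} - p_{ij}$.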
Random Graph Models • $p_{ij}$: the probability that there is an edge between vertices $i$ and $j$ in a random graph from a given distribution • Erdős-Rényi model: $p_{ij} = p$ for every pair of vertices • Chung-Lu model: $p_{ij} = w(i)\,w(j)/2m$, where $w(v)$ is the degree of vertex $v$ and $m$ is the number of edges
Multilevel graph partitioning • A fast and accurate method for producing high-quality partitions • Consists of three phases: • Coarsening phase • Partitioning phase • Uncoarsening and refinement phase
Coarsening Phase • Find a maximal matching and collapse each matched edge into a single vertex • Recursive coarsening: < G = G1, G2, …, Gk >
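One coarsening step can be sketched as below. This is a simplified illustration, not the METIS implementation: it builds a greedy maximal matching in random visiting order, ignores vertex weights, and uses a hypothetical function name `coarsen` with a dict-of-dicts adjacency format chosen for the example.

```python
import random

def coarsen(adj, seed=0):
    """Collapse a greedy maximal matching of a weighted undirected graph.

    adj: dict vertex -> dict neighbor -> edge weight (stored symmetrically).
    Returns (coarse_adj, mapping), where mapping sends each vertex to its
    coarse vertex id.
    """
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    mapping = {}
    next_id = 0
    for u in order:
        if u in mapping:
            continue  # already matched as someone's partner
        mapping[u] = next_id
        for v in adj[u]:
            if v != u and v not in mapping:
                mapping[v] = next_id  # match u with its first free neighbor
                break
        next_id += 1
    # Sum the weights of parallel edges between distinct coarse vertices.
    coarse = {c: {} for c in range(next_id)}
    for u, nbrs in adj.items():
        cu = mapping[u]
        for v, w in nbrs.items():
            cv = mapping[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, mapping

# Toy graph (assumed): the 4-cycle 0-1-2-3-0 with unit weights.
cycle = {0: {1: 1, 3: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 0: 1}}
coarse, mapping = coarsen(cycle)
print(coarse)   # two coarse vertices joined by an edge of weight 2
print(mapping)
```

Applying `coarsen` repeatedly to its own output yields the sequence G1, G2, …, Gk from the slide; accumulating edge weights keeps cut sizes in the coarse graph equal to the corresponding cut sizes in the original.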
Partitioning Phase • Partition the coarsest graph Gk • Uses greedy graph growing partitioning
Uncoarsening and Refinement Phase • Project the partitioning Pi of Gi back to a partitioning Pi-1 of the finer graph Gi-1 • Gi-1 has more degrees of freedom than Gi • Improve Pi-1 using the KL algorithm
Implementation • Our implementation is based on the graph partitioning package METIS [3], which employs a multilevel strategy • We convert the graph partitioning algorithm into a clustering algorithm • The optimal clustering might not be balanced, so we ignore the restrictions that control the sizes of the parts • The number of parts in the optimal clustering is not known, so we employ a recursive bisection procedure • The original graph G might be sparse, while the transformed graph G' is complete, so our algorithm does not explicitly generate G'
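Modularity: Erdős-Rényi Model • $n_1$, $n_2$: numbers of vertices in parts 1 and 2; $p$: edge probability • Current bisection: $(-\text{Modularity}) = \text{cut size} - n_1 n_2 p$ • After moving one vertex from part 2 to part 1: $(-\text{Modularity})' = \text{cut size}' - (n_1+1)(n_2-1)p$
Modularity: Chung-Lu Model • $w_i$: sum of degrees in part $i$; $w(v)$: degree of the moved vertex $v$; $m$: number of edges • Current bisection: $(-\text{Modularity}) = \text{cut size} - w_1 w_2/2m$ • After moving $v$ from part 2 to part 1: $(-\text{Modularity})' = \text{cut size}' - (w_1 + w(v))(w_2 - w(v))/2m$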
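The update formula for the Chung-Lu model means a refinement step can evaluate a vertex move from local information only, without building the complete graph G'. The sketch below assumes a simple graph without self-loops and parts labeled 1 and 2; `objective` and `move_delta` are hypothetical names, and the recomputation from scratch is included only to check the local formula.

```python
def build_adjacency(edges):
    """Adjacency sets of a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def objective(edges, side):
    """(-Modularity) = cut size - w1*w2/(2m) for a bisection into parts 1 and 2."""
    m = len(edges)
    cut = sum(1 for u, v in edges if side[u] != side[v])
    w = {1: 0, 2: 0}
    for u, v in edges:
        w[side[u]] += 1
        w[side[v]] += 1
    return cut - w[1] * w[2] / (2 * m)

def move_delta(adj, side, w1, w2, m, v):
    """Change in the objective when v moves from part 2 to part 1."""
    wv = len(adj[v])
    to_part1 = sum(1 for u in adj[v] if side[u] == 1)
    to_part2 = wv - to_part1
    cut_change = to_part2 - to_part1  # edges to part 2 become cut, edges to part 1 stop being cut
    expected_change = ((w1 + wv) * (w2 - wv) - w1 * w2) / (2 * m)
    return cut_change - expected_change

# Toy graph (assumed): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
side = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
adj = build_adjacency(edges)

delta = move_delta(adj, side, w1=7, w2=7, m=len(edges), v=3)
before = objective(edges, side)
after = objective(edges, {**side, 3: 1})
print(delta, after - before)  # both equal 23/14
```

The positive change indicates that moving vertex 3 out of its triangle would lower the modularity, so a KL-style refinement would reject this move.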
Analysis • Time Complexity: O(n+m) • Experiments • Random Graphs • k-community graphs • nd.edu
Experiment I: Random Graphs • We generated random graphs with 128 vertices and 4 communities of 32 vertices each • The expected degree of every vertex is 16 • The out-degree (the expected number of edges from a vertex to other communities) varies
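A generator for graphs of this kind might look like the sketch below. It assumes that "out degree" means the expected number of edges from a vertex to other communities and that edges are drawn independently; the function name, parameters, and seed are illustrative and not taken from the talk.

```python
import random

def planted_partition(z_out, n=128, k=4, expected_degree=16, seed=0):
    """Random graph with k equal communities and a given expected out-degree.

    Each vertex has z_out expected neighbors outside its community and
    expected_degree - z_out expected neighbors inside it.
    """
    rng = random.Random(seed)
    size = n // k
    z_in = expected_degree - z_out
    p_in = z_in / (size - 1)   # edge probability inside a community
    p_out = z_out / (n - size) # edge probability between communities
    community = {v: v // size for v in range(n)}
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community

edges, community = planted_partition(z_out=4)
mean_degree = 2 * len(edges) / len(community)
print(len(edges), round(mean_degree, 2))
```

Raising `z_out` toward 8 blurs the community boundaries, which makes the planted clustering harder to recover.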
Experiment II: k-community graphs • We generated graphs with k communities • The size of each community is 100 • The expected number of edges within a community is equal to the expected number of edges leaving the community • The probability of an edge within a community varies between 0.5 and 0.1 • Results show that the graphs are clustered about 99% correctly
Experiment III: nd.edu • The data consists of the complete map of the nd.edu domain, which contains 325,729 documents and 1,090,108 links • Our algorithm clusters this graph into 280 clusters with modularity 0.925579 • This high modularity indicates strong community structure in the graph • We show the dendrogram generated by our algorithm • The sizes of the rectangles are proportional to the sizes of the communities
Conclusions • Community structure detection problem • A scalable algorithm • Based on multilevel graph partitioning • Uses modularity as a quality measure