Overcoming Resolution Limits in MDL Community Detection L. Karl Branting The MITRE Corporation
Outline • Utility functions in community detection • Resolution limits • MDL-based community detection • Previous: RB and AP • New: SGE • Experimental Evaluation • Lessons
Utility functions in community detection • Two components of community detection algorithms • Utility function – quality criterion to be optimized • Search strategy – procedure for finding optimal partition • Examples • Girvan & Newman (2003) • Utility function: modularity • Search strategy: greedy divisive hierarchical clustering (iteratively remove highest betweenness edge) • Newman (2003) • Utility function: modularity • Search strategy: greedy agglomerative hierarchical clustering (iteratively choose highest modularity merge) • Tasgin & Bingol (2006) • Utility function: modularity • Search strategy: genetic algorithm
Utility functions in community detection • Other search strategies used with modularity • Rattigan, Maier, Jensen (2007) • Utility function: modularity • Search strategy: greedy divisive hierarchical clustering using a Network Structured Index to approximate edge betweenness • Donetti & Munoz (2004) • Utility function: modularity • Search strategy: greedy agglomerative hierarchical clustering with spectral division
Utility functions in community detection • Statistical Approaches • Zhang, Qiu, Giles, Foley, & Yen (2007) • Utility function: log-likelihood (LDA parameters) • Search strategy: fixed-point iteration • Compression-Based Approaches • Rosvall & Bergstrom (2007) • Utility function: Minimum Description Length • Search strategy: simulated annealing • Chakrabarti (2004) • Utility function: Minimum Description Length • Search strategy: exhaustive search for k, hill-climbing given k • Utility function implicit in search strategy • Raghavan, Albert, & Kumara (2007) – marker passing • Cliques, cores, etc.
Modularity • W(Dii) = number of edges internal to group i • li = number of edges incident to vertices in group i • l = total number of edges • Intuitive – expresses intuition that ratio of internal to external edges is greater for groups than for non-groups • Popular • Imperfect • Fortunato & Barthelemy (2007): resolution limit, groups are conflated when they are smaller than a threshold that depends on the size of the whole graph • Rosvall & Bergstrom (2007): biased towards same-sized groups
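As a reference point, here is a minimal sketch of the standard Newman-Girvan modularity, Q = sum over groups of [internal edges / total edges − (summed degree / (2 · total edges))²]. It assumes that `internal[g]` plays the role of W(Dii), that `total` is l, and that li can be read as the summed degree of the vertices in group i. That last reading is an assumption, as are the function name and the toy graph.

```python
from collections import defaultdict

def modularity(edges, group_of):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges:    list of (u, v) pairs, each undirected edge listed once
    group_of: dict mapping each vertex to its group label
    """
    total = len(edges)
    internal = defaultdict(int)   # edges with both ends in the group
    degree = defaultdict(int)     # summed degree of the group's vertices
    for u, v in edges:
        degree[group_of[u]] += 1
        degree[group_of[v]] += 1
        if group_of[u] == group_of[v]:
            internal[group_of[u]] += 1
    return sum(internal[g] / total - (degree[g] / (2 * total)) ** 2
               for g in degree)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "a" for v in range(6)}

q_split = modularity(edges, split)     # one group per triangle
q_merged = modularity(edges, merged)   # everything in one group
print(q_split, q_merged)
```

On this toy graph the two-triangle partition scores higher than the single merged group, which is the behaviour the "intuitive" bullet describes.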
Resolution Limit • Ring graph R15,4 • 15 communities • 4 nodes per community • Community structure that maximizes modularity conflates groups
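The conflation on R15,4 can be checked by hand with a short sketch. It assumes each community is a 4-vertex clique and that neighbouring communities are joined by exactly one edge (the ring-of-cliques construction of Fortunato & Barthelemy). The slide does not spell out the construction, so both the construction and the helper name `ring_modularity` are assumptions.

```python
def ring_modularity(m, c, merge_pairs):
    """Standard modularity of a ring of m cliques with c vertices each.

    Assumes adjacent cliques are joined by exactly one edge (m >= 3).
    merge_pairs=False scores the partition with one group per clique.
    merge_pairs=True merges neighbouring cliques two at a time
    (one clique stays on its own when m is odd).
    """
    clique_edges = c * (c - 1) // 2
    total = m * (clique_edges + 1)          # m cliques plus m ring edges
    clique_degree = 2 * clique_edges + 2    # each clique touches two ring edges

    def term(internal, degree):
        # Contribution of one group to the modularity sum.
        return internal / total - (degree / (2 * total)) ** 2

    single = term(clique_edges, clique_degree)
    if not merge_pairs:
        return m * single
    pair = term(2 * clique_edges + 1, 2 * clique_degree)
    return (m // 2) * pair + (m % 2) * single

q_true = ring_modularity(15, 4, merge_pairs=False)
q_merged = ring_modularity(15, 4, merge_pairs=True)
print(f"one group per clique:    {q_true:.4f}")
print(f"cliques merged in pairs: {q_merged:.4f}")
```

Under these assumptions the paired partition scores about 0.7949 against 0.7905 for the true 15-group partition, so a modularity maximizer prefers the conflated structure. The exact values of m at which this happens depend on how the ring is built.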
Approaches to modularity’s resolution limit • Apply recursively to large communities (Ruan & Zhang 2007) • Apply locally (Clauset 2005) • Choose a different utility function
Description Length • Utility of community structure is sum of bits needed to represent • Community structure + • Graph given community structure • Search strategy attempts to minimize description length • There is no unique bit count • Undecidability of Kolmogorov complexity • Previous approaches • Rosvall & Bergstrom (2007): RB • Handles group size skew better than modularity • Chakrabarti (2004): AP • Comparison • Similar breakdown of bits • Different calculation
Components of Description • Components (details in paper) • 1. Bits to represent number of nodes in graph (ignored because not specific to community structure) • 2. Bits to represent number of groups • 3. Bits to represent mapping between nodes and groups • 4. Bits needed for number of group-to-group edges • 5. Bits needed for adjacencies between nodes • Purpose • 2, 3, 4: represent group structure • 1, 5: represent graph as a whole
Surprising Experimental Result • RB, AP, and modularity compared as utility functions • Applied to ring graphs Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9 • Search strategy: greedy divisive hierarchical clustering (iteratively remove highest betweenness edge) • Unsurprising result. Modularity led to conflated groups for: • m > 8 and c = 3 • m > 10 and c = 4 • m > 11 and c = 5 • m > 13 and c = 6,7 • Surprising result. • Both RB and AP conflated at least one pair of groups in every Rm,c!
Hypothesis • Both RB and AP require at least one bit per pair of groups in term 4 (bits for the number of group-to-group edges) • Perhaps this estimate causes group conflation • Term 4 grows as the square of the number of groups • If the graph is sparse, conflating groups may save more bits through the reduction in term 4 than it costs through the increase in term 5 (bits for adjacencies between nodes) • Components • 1. Bits to represent number of nodes in graph (ignored because not specific to community structure) • 2. Bits to represent number of groups • 3. Bits to represent mapping between nodes and groups • 4. Bits needed for number of group-to-group edges • 5. Bits needed for adjacencies between nodes
SGE (Sparse Graph Encoding) • Components • 1. Bits to represent number of nodes in graph: ignored, as in RB and AP • 2. Bits to represent number of groups: follows RB • 3. Bits to represent mapping between nodes and groups: similar to AP • 4. Bits needed for number of group-to-group edges: split into 2 terms • 4a. Which pairs of groups are connected (much less than one bit per pair if pairs are sparsely or densely connected) • 4b. Number of edges between connected groups (grows as number of connected pairs, not total number of pairs) • 5. Bits needed for adjacencies between nodes: follows RB
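To see why naming the connected pairs can cost far less than one bit per pair, the sketch below compares the two counts on rings of m groups, where only m of the m(m−1)/2 pairs are connected. The enumerative code used here (log2 of a binomial coefficient) is a hypothetical stand-in chosen for illustration, not the SGE formula from the paper. It also leaves out the small cost of transmitting the number of connected pairs.

```python
from math import comb, log2

def bits_one_per_pair(num_groups):
    """Lower bound when every pair of groups costs at least one bit."""
    return num_groups * (num_groups - 1) // 2

def bits_connected_pairs(num_groups, num_connected):
    """Bits to name which pairs of groups are connected.

    Hypothetical illustration: log2 of the number of ways to choose
    the connected pairs among all pairs of groups.
    """
    pairs = num_groups * (num_groups - 1) // 2
    return log2(comb(pairs, num_connected))

# Ring of m groups: each group is connected to two neighbours only.
for m in (4, 8, 16):
    dense = bits_one_per_pair(m)
    sparse = bits_connected_pairs(m, m)
    print(f"m={m:2d}  one bit per pair: {dense:3d}  connected-pair code: {sparse:.1f}")
```

The first count grows quadratically in the number of groups, while the second stays small whenever almost none, or almost all, of the pairs are connected.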
Performance of SGE on Ring Graphs • Correct community structure found for every Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9 except • R4,3 • R13,3 • Results confirm hypothesis that resolution limit in RB and AP is result of over-counting term 4: the bits needed for group-to-group edges • Significance • Ring graphs rare in real world • How does SGE compare on more realistic graphs?
Uniform random graph • Similar to graphs in Rosvall & Bergstrom (2007) • Test set • 32 vertices • 4 groups • average degree 6 • size ratio {1.0,1.25,1.5,1.75,2.0} • Proportion internal edges {0.6,0.75,0.9} • Example: • 32 vertices • 4 groups • average degree 6 • size ratio 1.25 • Proportion internal edges 0.67
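A test graph with these parameters could be generated along the following lines. This is a sketch under stated assumptions: group sizes are passed in explicitly rather than derived from a size ratio, the edge count is fixed at n · average degree / 2, and internal and external edges are drawn uniformly without replacement. The function name `planted_partition` and the sampling scheme are illustrative and not taken from the paper.

```python
import random
from itertools import combinations

def planted_partition(group_sizes, avg_degree, frac_internal, seed=0):
    """Random graph with a fixed share of edges inside the groups.

    Returns (edges, group_of), where group_of maps vertex -> group index.
    """
    rng = random.Random(seed)
    group_of = {}
    for g, size in enumerate(group_sizes):
        for _ in range(size):
            group_of[len(group_of)] = g   # vertices are numbered 0..n-1
    n = len(group_of)
    num_edges = round(n * avg_degree / 2)
    num_internal = round(num_edges * frac_internal)

    # Split all vertex pairs into within-group and between-group pairs.
    internal, external = [], []
    for u, v in combinations(range(n), 2):
        (internal if group_of[u] == group_of[v] else external).append((u, v))

    edges = rng.sample(internal, num_internal)
    edges += rng.sample(external, num_edges - num_internal)
    return edges, group_of

# 32 vertices, 4 equal groups, average degree 6, 75% internal edges.
edges, group_of = planted_partition([8, 8, 8, 8], avg_degree=6,
                                    frac_internal=0.75)
print(len(edges), "edges")
```

Unequal groups are obtained by changing the size list, for example `[11, 9, 7, 5]`, provided each group is large enough to hold its share of internal edges.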
Embedded Barabasi-Albert Graphs • Test set • 4 communities separately generated by preferential attachment • In each community • 4 initial vertices • 2-4 edges added per time step • 20 time steps • Example • 4 communities • 4 initial vertices • 3 edges added per time step • 20 time steps
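One community of this kind could be grown as sketched below. The sketch assumes the initial vertices start out fully connected and that each time step adds one new vertex with the given number of edges to distinct existing vertices. The slide does not state either detail, and how the four communities are then linked to each other is not covered here.

```python
import random

def preferential_attachment(initial, edges_per_step, steps, seed=0):
    """Grow one community by preferential attachment; returns an edge list."""
    rng = random.Random(seed)
    # Assumption: the initial vertices form a clique.
    edges = [(u, v) for u in range(initial) for v in range(u + 1, initial)]
    for new in range(initial, initial + steps):
        # Each vertex appears once per incident edge, so a uniform draw
        # picks a vertex with probability proportional to its degree.
        endpoints = [x for edge in edges for x in edge]
        targets = set()
        while len(targets) < edges_per_step:
            targets.add(rng.choice(endpoints))
        edges += [(t, new) for t in sorted(targets)]
    return edges

# Parameters from the example: 4 initial vertices, 3 edges per step, 20 steps.
community = preferential_attachment(initial=4, edges_per_step=3, steps=20)
print(len(community), "edges")
```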
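Evaluation Criteria • Rand index (Rand 1971) • Adjusted Rand index (Hubert & Arabie 1985) • F-measure – based on same-cluster pairs • Recall = (pairs grouped together in both the true and the detected partition) / (pairs grouped together in the true partition) • Precision = (pairs grouped together in both the true and the detected partition) / (pairs grouped together in the detected partition) • F-measure = 2 · Precision · Recall / (Precision + Recall)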
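The pair-counting criteria can be sketched as follows, assuming the usual definitions: every pair of vertices is classified by whether it shares a group in the true partition and in the detected one, and the F-measure is taken as the harmonic mean of precision and recall. The function name and the toy labels are illustrative, and the adjusted Rand index is left out for brevity.

```python
from itertools import combinations

def pair_scores(true_labels, found_labels):
    """Rand index, pairwise precision, recall and F-measure.

    Both arguments are sequences giving one group label per vertex.
    """
    both = only_true = only_found = neither = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_found = found_labels[i] == found_labels[j]
        if same_true and same_found:
            both += 1
        elif same_true:
            only_true += 1
        elif same_found:
            only_found += 1
        else:
            neither += 1
    rand = (both + neither) / (both + only_true + only_found + neither)
    precision = both / (both + only_found) if both + only_found else 0.0
    recall = both / (both + only_true) if both + only_true else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return rand, precision, recall, f_measure

truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 0, 0, 0, 0]   # the two true groups conflated into one
rand, precision, recall, f_measure = pair_scores(truth, found)
print(rand, precision, recall, f_measure)
```

Conflating two groups keeps recall at 1.0 but lowers precision, which is why the pairwise F-measure penalizes the kind of merging seen on ring graphs.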
Summary of Evaluation • Random graphs • Weak community structure, balanced group sizes: modularity is best • Weak community structure, imbalanced group sizes: RB is best (as per Rosvall & Bergstrom 2007) • Strong community structure, balanced group sizes: not much difference • Strong community structure, imbalanced group sizes: modularity is particularly bad (as per Rosvall & Bergstrom 2007); SGE slightly better than RB and AP • EBA graphs • Sparse: AP and SGE weaker than modularity and RB • Dense: essentially identical accuracy
Conclusion • Narrow • Conflation of groups by MDL in sparse graphs (e.g., ring graphs) can be avoided by adjusting group-to-group edge counts. • This change doesn’t hurt performance in more common types of graphs. • Compression-based clustering works well, but requires tinkering • Modularity detects weak structure well when graph not too big and groups not too imbalanced • Broad • Still unclear what utility function is best overall • Needed: theory relating graph typology to utility functions