Large networks, clusters and Kronecker products

Jure Leskovec (jure@cs.stanford.edu) Computer Science Department, Cornell University / Stanford University • Joint work with: Jon Kleinberg (Cornell), Christos Faloutsos (CMU), Michael Mahoney (Stanford), Kevin Lang (Yahoo), Anirban Dasgupta (Yahoo)

Rich data: Networks • Large on-line computing applications have detailed records of human activity: • On-line communities: Facebook (120 million) • Communication: Instant Messenger (~1 billion) • News and social media: Blogging (250 million) • We model the data as a network (an interaction graph) • We can observe and study phenomena at scales not possible before • Figure: communication network

Small vs. Large networks • Community (cluster) structure of networks • Figure: tiny part of a large social network • Figure: collaborations in NetSci (N=380) • What is the structure of the network? How can we model that?

[w/ Mahoney, Lang, Dasgupta, WWW ’08] How expressed are communities? • How community-like is a set of nodes? • Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure. • Score: conductance (normalized cut) Φ(S) of a node set S • Small Φ(S) == more community-like sets of nodes • Figure: node sets S and S’
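The conductance formula on this slide did not survive as text, so here is a minimal sketch under an assumed, commonly used definition: Φ(S) = cut(S) / min(vol(S), vol(V minus S)), where cut(S) counts edges leaving S and vol is the sum of node degrees. The toy graph and the names `conductance` and `build_adjacency` are illustrative assumptions, not taken from the talk.

```python
def conductance(adj, S):
    """Conductance of node set S, assuming cut(S) / min(vol(S), vol(rest))."""
    S = set(S)
    # Edges with exactly one endpoint in S.
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    # Volume = sum of node degrees on each side of the cut.
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    denom = min(vol_S, vol_rest)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = build_adjacency(toy_edges)

phi_triangle = conductance(adj, {0, 1, 2})  # one triangle: a community-like set
phi_mixed = conductance(adj, {0, 3})        # one node from each triangle
print(phi_triangle, phi_mixed)
```

On this toy graph the triangle scores 1/7 while the mixed pair scores 1, which matches the reading that a smaller Φ(S) means a more community-like set.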

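[w/ Mahoney, Lang, Dasgupta, WWW ’08] Network Community Profile Plot • We define: Network community profile (NCP) plot • Plot the score of the best community of size k, i.e. Φ(k) = the minimum of Φ(S) over all node sets S with |S| = k • Figure example: k=5 gives Φ(5)=0.25, k=7 gives Φ(7)=0.18 • Axes: community size, log k (x) vs. log Φ(k) (y)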
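As a rough illustration of the NCP definition, the sketch below computes Φ(k) by exhaustive search on a tiny graph. This is only feasible for toy inputs; the talk relies on approximation algorithms because the exact problem is NP-hard. The conductance definition (cut over the smaller volume), the toy graph, and the function names are assumptions for illustration.

```python
from itertools import combinations


def conductance(adj, S):
    """Assumed definition: cut(S) / min(vol(S), vol(rest))."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    denom = min(vol_S, vol_rest)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom


def ncp(adj, max_k):
    """Brute-force NCP: for each size k, the lowest conductance over all k-subsets."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, S) for S in combinations(nodes, k))
        for k in range(1, max_k + 1)
    }


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {}
for u, v in toy_edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

ncp_profile = ncp(adj, max_k=3)
for k, phi in ncp_profile.items():
    print(k, phi)
```

Plotting `ncp_profile` on log-log axes would give the NCP plot for this toy graph; the dip at k=3 corresponds to a single triangle being the best community.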

[w/ Mahoney, Lang, Dasgupta, WWW ’08] NCP plot: Network Science • Collaborations between scientists in Networks [Newman, 2005] • Axes: community size, log k (x) vs. conductance, log Φ(k) (y)

[w/ Mahoney, Lang, Dasgupta, WWW ’08] NCP plot: Large network • Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

[w/ Mahoney, Lang, Dasgupta, WWW ’08] More NCP plots of networks

[w/ Mahoney, Lang, Dasgupta, WWW ’08] NCP: LiveJournal (n=5M, e=42M) • Better and better communities up to the best community, which has ~100 nodes • Beyond that size, communities get worse and worse • Axes: k (community size) vs. Φ(k) (conductance)

[w/ Mahoney, Lang, Dasgupta, WWW ’08] Community size is bounded! • Each dot is a different network • Practically constant!

Structure of large networks • Denser and denser core of the network • Core contains ~60% of nodes and ~80% of edges • Small good communities • Core-periphery (jellyfish, octopus) • So, what’s a good model?

[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05] Kronecker product: Definition • The Kronecker product of matrices A and B is the block matrix C = A ⊗ B whose (i, j) block is a_ij · B • If A is N x M and B is K x L, then C is N*K x M*L • We define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
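A minimal sketch of the Kronecker product on plain nested lists, using the standard block definition (block (i, j) of the result is A[i][j] times B). The function name `kron` and the small example matrices are illustrative assumptions.

```python
def kron(A, B):
    """Kronecker product of an N x M matrix A and a K x L matrix B (lists of rows)."""
    n, m = len(A), len(A[0])
    k, l = len(B), len(B[0])
    # Row r of the result lies in block row r // k at offset r % k (same for columns).
    return [
        [A[r // k][c // l] * B[r % k][c % l] for c in range(m * l)]
        for r in range(n * k)
    ]


# Hypothetical 2-node adjacency matrices.
A = [[1, 1],
     [0, 1]]
B = [[1, 0],
     [1, 1]]

C = kron(A, B)
for row in C:
    print(row)

# Non-square case: a 1 x 2 matrix times a 2 x 1 matrix gives a 2 x 2 matrix.
D = kron([[2, 3]], [[1], [4]])
print(D)
```

The result `C` is 4 x 4: copies of `B` sit where `A` has a 1, and a block of zeros sits where `A` has a 0.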

[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05] Kronecker graphs • Kronecker graph: a growing sequence of graphs obtained by iterating the Kronecker product • Each Kronecker multiplication multiplies the number of nodes by the size of the initiator, so the graph grows exponentially with the number of iterations • One can easily use multiple initiator matrices (G1’, G1’’, G1’’’) that can be of different sizes
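To make the growth concrete, here is a small sketch that iterates the Kronecker product of a single initiator with itself. The 3 x 3 initiator (a 3-node chain with self-loops) and the helper names are assumptions chosen for illustration.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    k, l = len(B), len(B[0])
    return [
        [A[r // k][c // l] * B[r % k][c % l] for c in range(len(A[0]) * l)]
        for r in range(len(A) * k)
    ]


def kronecker_power(G1, k):
    """k-th Kronecker power of the initiator G1 (k >= 1)."""
    G = G1
    for _ in range(k - 1):
        G = kron(G, G1)
    return G


# Hypothetical initiator: 3-node chain with self-loops.
G1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]

sizes = []
edge_counts = []
for k in (1, 2, 3):
    Gk = kronecker_power(G1, k)
    sizes.append(len(Gk))
    # Count nonzero adjacency entries (self-loops included).
    edge_counts.append(sum(sum(row) for row in Gk))

print(sizes)        # number of nodes at each iteration
print(edge_counts)  # number of nonzero entries at each iteration
```

With this initiator the node count goes 3, 9, 27 and the nonzero-entry count goes 7, 49, 343: both are powers of the initiator's own counts.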

[w/ Chakrabarti, Kleinberg, Faloutsos, PKDD ’05] Kronecker graphs • Starting intuition: Recursion & self-similarity • Kronecker graphs mimic real networks: • Theorem: Power-law degree distribution, Densification, Shrinking/stabilizing diameter, Spectral properties • Figure: edge probabilities pij for the initiator (3x3) and its Kronecker powers (9x9), (27x27)

Various Kronecker initiator matrices

Kronecker graphs: Interpretation • Initiator matrix G1 is a similarity matrix • Node u is described with k binary attributes: u1, u2, …, uk • Probability of a link between nodes u, v: P(u,v) = ∏ G1[ui, vi], with the product over i = 1, …, k • Example: with G1 = [a b; c d], u = (0,1,1,0) and v = (1,1,0,1), P(u,v) = b∙d∙c∙b • Given a real graph, how to estimate the initiator G1?
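The attribute-based formula can be checked directly. The sketch below evaluates P(u,v) as a product of initiator entries for the slide's example vectors, and compares it with the matching entry of the 4th Kronecker power. The numeric values for a, b, c, d and the helper names are hypothetical; the layout G1 = [a b; c d] is read off the slide's example.

```python
def edge_probability(G1, u, v):
    """P(u, v) = product over attributes i of G1[u_i][v_i]."""
    p = 1.0
    for ui, vi in zip(u, v):
        p *= G1[ui][vi]
    return p


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    k, l = len(B), len(B[0])
    return [
        [A[r // k][c // l] * B[r % k][c % l] for c in range(len(A[0]) * l)]
        for r in range(len(A) * k)
    ]


# Hypothetical numeric initiator standing in for [a b; c d].
a, b, c, d = 0.9, 0.6, 0.4, 0.1
G1 = [[a, b],
      [c, d]]

u = (0, 1, 1, 0)
v = (1, 1, 0, 1)
p_uv = edge_probability(G1, u, v)  # G1[0][1] * G1[1][1] * G1[1][0] * G1[0][1]

# Same value via the Kronecker power: read each attribute vector as a binary index.
P = G1
for _ in range(len(u) - 1):
    P = kron(P, G1)
row = int("".join(str(bit) for bit in u), 2)
col = int("".join(str(bit) for bit in v), 2)
p_kron = P[row][col]

print(p_uv, p_kron)
```

Both routes give b∙d∙c∙b, so the attribute view and the Kronecker-power view describe the same edge probability.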

Estimating Kronecker graphs • Want to generate realistic networks: given a real network, generate a synthetic network and compare graph properties, e.g., degree distribution • How to estimate the initiator matrix: • Method of moments [Owen ’09]: compare counts of subgraphs and solve • Maximum likelihood [Leskovec & Faloutsos, ’07]: arg max over G1 of P(G | G1), where G is the given network • SVD [Van Loan & Pitsianis ’93]: can solve using SVD
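As a rough sketch of the maximum-likelihood idea, the code below scores candidate initiators by log P(G | G1), treating every adjacency entry as an independent coin flip with probability taken from the k-th Kronecker power. It assumes the node labeling of the observed graph is already aligned with the Kronecker ordering; handling the unknown node correspondence is the hard part and is not attempted here. The observed matrix, the candidate initiators, and the function names are hypothetical.

```python
import math


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    k, l = len(B), len(B[0])
    return [
        [A[r // k][c // l] * B[r % k][c % l] for c in range(len(A[0]) * l)]
        for r in range(len(A) * k)
    ]


def kronecker_power(G1, k):
    """k-th Kronecker power of G1 (k >= 1)."""
    P = G1
    for _ in range(k - 1):
        P = kron(P, G1)
    return P


def log_likelihood(adjacency, G1, k):
    """Sum of log p over present entries and log(1 - p) over absent ones.

    Assumes all initiator entries lie strictly between 0 and 1 and that
    node i of the graph corresponds to row i of the Kronecker power.
    """
    P = kronecker_power(G1, k)
    ll = 0.0
    for row_a, row_p in zip(adjacency, P):
        for present, p in zip(row_a, row_p):
            ll += math.log(p) if present else math.log(1.0 - p)
    return ll


# Hypothetical observed 4-node graph: node 0 is well connected, node 3 is isolated.
observed = [[1, 1, 1, 0],
            [1, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 0, 0, 0]]

core_periphery = [[0.9, 0.5], [0.5, 0.1]]
flipped = [[0.1, 0.5], [0.5, 0.9]]

ll_core = log_likelihood(observed, core_periphery, k=2)
ll_flipped = log_likelihood(observed, flipped, k=2)
print(ll_core, ll_flipped)
```

The candidate with the higher log-likelihood is preferred; a full estimator would search over initiator values rather than compare two fixed candidates.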

[w/ Dasgupta-Lang-Mahoney, WWW ’08] Kronecker & Network structure • What do estimated parameters tell us about the network structure? • Figure: a edges, b edges, c edges, d edges

[w/ Dasgupta-Lang-Mahoney, WWW ’08] Kronecker & Network structure • What do estimated parameters tell us about the network structure? • Core: 0.9 edges • Periphery: 0.1 edges • Between core and periphery: 0.5 edges (shown twice in the figure) • Core-periphery (jellyfish, octopus)

Small vs. Large networks • Small and large networks are very different • Initiator G1 shown for: • Scientific collaborations (N=397, E=914) • Collaboration network (N=4,158, E=13,422)

Conclusion • Computational tools as probes into the structure of large networks • Community structure of large networks: • Core-periphery structure • Scale to natural community size: Dunbar number • Model: Kronecker graphs • Analytically tractable: provable properties • Can efficiently estimate parameters from data • Implications: • No large clusters: no/little hierarchical structure • Can’t be well embedded: no underlying geometry

Reflections • Why are networks the way they are? • Only recently have basic properties been observed on a large scale • Confirms social science intuitions; calls others into question • What are good tractable network models? • Builds intuition and understanding • Benefits of working with large data • Observe structures not visible at smaller scales

jure@cs.stanford.edu http://cs.stanford.edu/~jure

References • Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, by J. Leskovec, J. Kleinberg, C. Faloutsos, KDD 2005 • Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, by J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, PKDD 2005 • Scalable Modeling of Real Graphs using Kronecker Multiplication, by J. Leskovec, C. Faloutsos, ICML 2007 • Statistical Properties of Community Structure in Large Social and Information Networks, by J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney, WWW 2008 • Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters, by J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney, arXiv 2008
