Jure Leskovec (jure@cs.stanford.edu), Computer Science Department, Cornell University / Stanford University. Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta. Size matters: 1) Cluster structure of large networks 2) Searching the world’s social network.
Rich data: Networks • Large on-line computing applications have detailed records of human activity: • On-line communities: Facebook (120 million) • Communication: Instant Messenger (~1 billion) • News and Social media: Blogging (250 million) • We model the data as a network (an interaction graph) • We can observe and study phenomena at scales not possible before [Figure: communication network]
Outline • The Small-world experiment: • On a 240 million node communication network of Microsoft Instant Messenger • Small vs. large networks: • Modeling community (cluster) structure of large networks [Figures: Zachary’s karate club (N=34); tiny part of a large social network]
How expressed are communities? • How community-like is a set of nodes? • Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure. • Conductance (normalized cut): Φ(S) = # edges cut / # edges inside • Small Φ(S) corresponds to more community-like sets of nodes [Figure labels: S, S’]
Community score (quality) What is “best” community of 5 nodes? • Score: Φ(S) = # edges cut / # edges inside
Community score (quality) What is “best” community of 5 nodes? Bad community Φ=5/6 = 0.83 • Score: Φ(S) = # edges cut / # edges inside
Community score (quality) What is “best” community of 5 nodes? Bad community Φ=5/7 = 0.7 Better community Φ=2/5 = 0.4 • Score: Φ(S) = # edges cut / # edges inside
Community score (quality) What is “best” community of 5 nodes? Bad community Φ=5/7 = 0.7 Best community Φ=2/8 = 0.25 Better community Φ=2/5 = 0.4 • Score: Φ(S) = # edges cut / # edges inside
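The score on these slides can be computed directly from an edge list. The sketch below follows the slide's ratio (# edges cut / # edges inside); note that the standard definition of conductance divides the cut by the volume (sum of degrees) of the set instead. The function name `community_score` and the toy graph are illustrative assumptions, not taken from the talk.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]


def community_score(edges: Iterable[Edge], nodes: Set[int]) -> float:
    """Slide score: (# edges cut) / (# edges inside) for the node set `nodes`."""
    cut = 0
    inside = 0
    for u, v in edges:
        in_u, in_v = u in nodes, v in nodes
        if in_u and in_v:
            inside += 1      # both endpoints in the set
        elif in_u or in_v:
            cut += 1         # exactly one endpoint in the set
    # A set with no internal edges is treated as the worst possible community.
    return float("inf") if inside == 0 else cut / inside


# Hypothetical toy graph: two triangles joined by one bridge edge (2, 3).
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

if __name__ == "__main__":
    print(community_score(TOY_EDGES, {0, 1, 2}))  # 1 cut edge over 3 inside edges
    print(community_score(TOY_EDGES, {1, 2, 3}))  # 4 cut edges over 2 inside edges
```

A lower score means fewer edges leave the set relative to the edges it contains, so the triangle {0, 1, 2} scores better than the set {1, 2, 3} that straddles the bridge.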
Network Community Profile Plot • We define: Network community profile (NCP) plot: plot the score of the best community of size k • Examples: k=5 gives Φ(5)=0.25; k=7 gives Φ(7)=0.18 [Plot: x-axis community size, log k; y-axis log Φ(k)]
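For a tiny graph the NCP can be computed by brute force: enumerate every node set of size k and keep the best (lowest) score. This is only a sketch of the definition; it is exponential in the graph size, and the talk instead relies on approximation algorithms for large networks. The names `community_score`, `ncp` and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def community_score(edges, nodes):
    """Slide score: (# edges cut) / (# edges inside); inf if no inside edges."""
    cut = sum(1 for u, v in edges if (u in nodes) != (v in nodes))
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return float("inf") if inside == 0 else cut / inside


def ncp(edges, max_k):
    """Return {k: best score over all node sets of size k} for k = 1..max_k."""
    all_nodes = sorted({n for edge in edges for n in edge})
    profile = {}
    for k in range(1, max_k + 1):
        profile[k] = min(
            community_score(edges, set(subset))
            for subset in combinations(all_nodes, k)
        )
    return profile


# Hypothetical toy graph: two triangles joined by one bridge edge (2, 3).
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

if __name__ == "__main__":
    for k, phi in ncp(TOY_EDGES, 4).items():
        print(k, phi)
```

On this toy graph the profile dips at k=3 (one full triangle) and rises again at k=4, a miniature version of the down-then-up shape discussed for large networks.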
NCP plot: Low-dimensional and random graphs Hierarchically nested clusters d-dimensional meshes
NCP plot: Zachary’s karate club • Zachary’s university karate club social network • During the study club split into 2 • The split (squares vs. circles) corresponds to cut B
NCP plot: Network Science • Collaborations between scientists in Networks [Newman, 2005]
Present work: Large networks • Previous work mostly focused on community structure of small networks (~100 nodes) • We examined 108 different large networks
Example of a large network • Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)
NCP: LiveJournal (N=5M, E=42M) • Downward part: better and better communities • Best community has ~100 nodes • Upward part: communities get worse and worse [Plot: x-axis k (community size); y-axis Φ(k) (conductance)]
Explanation: Downward part • Small clusters on the edge of the network are responsible for the downward part of the NCP plot [Figure: NCP plot with the best cluster marked]
Explanation: Upward part • Each additional edge inside the cluster costs more: Φ=1/3 = 0.33, Φ=2/4 = 0.5, Φ=8/6 = 1.3, Φ=64/14 = 4.6 • Each node has twice as many children [Figure: NCP plot]
Suggested network structure • Network structure: Core-periphery (jellyfish, octopus) • Denser and denser core of the network • Core contains ~60% nodes and ~80% edges • Whiskers are responsible for good communities
What is a good model? • What is a good model that explains such network structure? [Figure panels: NCP shapes “Flat and Down”, “Flat”, “Down and Flat”; models shown: Geometric Pref. Attachment, Small World, Pref. attachment]
Forest Fire model works
• Forest Fire [LKF05]: connections spread like a fire
  • New node joins the network
  • Selects a seed node
  • Connects to some of its neighbors
  • Continue recursively
• As the community grows it blends into the core of the network
• Notes:
  • Preferential attachment flavor: second neighbor is not uniform at random.
  • Copying flavor: since we burn the seed’s neighbors.
  • Hierarchical flavor: seed is parent.
  • “Local” flavor: burn “near” (in a diffusion sense) the seed vertex.
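The four steps of the model can be sketched as a short generator. This is a simplified, undirected variant written for illustration: each neighbor is burned independently with probability `burn_prob`, whereas the published Forest Fire model is directed and uses separate forward and backward burning parameters. The function name and parameters are assumptions.

```python
import random
from collections import deque


def forest_fire_graph(n, burn_prob=0.3, seed=0):
    """Grow an undirected graph on nodes 0..n-1 with a simplified forest fire."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        # Step 1-2: the new node joins and selects a seed node uniformly at random.
        start = rng.randrange(new)
        burned = {start}
        frontier = deque([start])
        # Step 3-4: the fire spreads recursively to some neighbors of burned nodes.
        while frontier:
            node = frontier.popleft()
            for nb in sorted(adj[node]):
                if nb not in burned and rng.random() < burn_prob:
                    burned.add(nb)
                    frontier.append(nb)
        # The new node links to every node the fire reached.
        adj[new] = set(burned)
        for b in burned:
            adj[b].add(new)
    return adj


if __name__ == "__main__":
    g = forest_fire_graph(200, burn_prob=0.3, seed=42)
    n_edges = sum(len(nbrs) for nbrs in g.values()) // 2
    print(len(g), "nodes,", n_edges, "edges")
```

Because every new node links at least to its seed, the generated graph is always connected; larger `burn_prob` values let fires spread further and produce denser graphs.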
Forest Fire NCP plot rewired network
Typical cluster size • How does the size of best cluster scale with the size of the network?
Size of best cluster over time • Cluster size remains constant (even if one allows nesting) over time [Figure: LinkedIn network over time]
Cluster size vs. network size • Each dot is a different network
Connections • The Dunbar number • 150 individuals is maximum community size • What edges “mean” and community identification • Using node and edge types/attributes • Implications for machine learning • No large clusters • No/little (assortative) hierarchical structure • Can’t be well embedded – no underlying geometry
Joint work with Eric Horvitz, Microsoft Research The small-world of the MSN Instant Messenger
The Small-world experiment • The Small-world experiment [Milgram ’67, Dodds-Muhamad-Watts ’03] • People send letters from Nebraska to Boston • How many steps does it take? • 6.2 on the average, thus “6 degrees of separation” [Figure: Milgram’s small world experiment]
The Small-world experiment • 1) Short paths exist in a social network • 2) People are able to find them (using only partial knowledge of the network) • Local search: forwarding a message from s toward the target t, where d(s,t)=h • Good nodes: d=h-1 • Bad nodes: d≥h
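The good/bad distinction can be made concrete with breadth-first search from the target: a neighbor of s is good if it is one hop closer to t than s is. The sketch below assumes an undirected graph stored as an adjacency dict; the function names and the toy graph are hypothetical. It also computes the chance of success when forwarding to a uniformly random neighbor.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def good_neighbors(adj, s, t):
    """Neighbors of s with d(v, t) = h - 1, where h = d(s, t)."""
    dist_to_t = bfs_distances(adj, t)
    if s not in dist_to_t:
        return set()  # t is unreachable from s
    h = dist_to_t[s]
    return {v for v in adj[s] if dist_to_t.get(v) == h - 1}


def success_probability(adj, s, t):
    """Chance that a uniformly random neighbor of s is a good node."""
    return len(good_neighbors(adj, s, t)) / len(adj[s])


# Hypothetical toy graph: only "a" lies on a shortest path from "s" to "t".
TOY_ADJ = {
    "s": ["a", "b", "c"],
    "a": ["s", "t"],
    "b": ["s", "c"],
    "c": ["s", "b"],
    "t": ["a"],
}

if __name__ == "__main__":
    print(good_neighbors(TOY_ADJ, "s", "t"))
    print(success_probability(TOY_ADJ, "s", "t"))
```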
Our dataset: Instant Messaging • Contact (buddy) list • Messaging window
MSN communication • We collected the data for June 2006: 4.5 TB of compressed data • 245 million users logged in • 180 million users engaged in conversations • 255 billion exchanged messages • 1 billion conversations / day
MSN network • The network: 180M nodes, 1.3B undirected edges
MSN: path lengths • MSN Messenger network: number of steps between pairs of people • Avg. path length 6.6 • 90% of the people can be reached in < 8 hops
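Both statistics on this slide come from the distribution of hop distances between pairs of nodes. The sketch below computes that distribution exactly on a small graph by running BFS from every node; on a network with 180M nodes one would presumably run BFS from a sample of sources instead, which is an assumption here and not a description of the study's procedure. All names are illustrative.

```python
from collections import Counter, deque


def hop_histogram(adj):
    """Count ordered node pairs at each hop distance (BFS from every node)."""
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                hist[d] += 1
    return hist


def mean_hops(hist):
    """Average path length over all reachable pairs."""
    total = sum(hist.values())
    return sum(d * count for d, count in hist.items()) / total


def fraction_within(hist, max_hops):
    """Fraction of reachable pairs separated by at most `max_hops` hops."""
    total = sum(hist.values())
    return sum(count for d, count in hist.items() if d <= max_hops) / total


# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
PATH_ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

if __name__ == "__main__":
    hist = hop_histogram(PATH_ADJ)
    print(dict(hist), mean_hops(hist), fraction_within(hist, 2))
```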
Degree distribution: A node that exchanged messages with ~2 million people
Robustness of shortest paths • Short paths exist and they are robust [Plot series: Both-way links; All links; Randomized network (same degree distr.)]
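One common way to build a randomized network with the same degree distribution is repeated double-edge swaps: pick two edges (a, b) and (c, d) and replace them with (a, d) and (c, b). Whether the talk used this exact procedure is not stated, so treat the sketch as one possible construction; the function names are assumptions, and the input is assumed to be a simple undirected edge list.

```python
import random
from collections import Counter


def degree_counts(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def rewire_preserving_degrees(edges, n_swaps, seed=0):
    """Randomize a simple undirected graph while keeping every node's degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint could create a self-loop
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


if __name__ == "__main__":
    ring = [(i, (i + 1) % 10) for i in range(10)]
    print(rewire_preserving_degrees(ring, n_swaps=20, seed=3))
```

Comparing path lengths on the original and the rewired edge list then shows how much of the short-path structure is explained by the degree sequence alone.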
Learning to search in a network • What is the decision function that makes me forward the message to the target? • What are the characteristics of shortest paths? • How hard is it to find them? • With d(s,t)=h: good nodes have d=h-1, bad nodes have d≥h
How hard is it to find a good node? • Probability of success if we forward to a random neighbor [Figure: s, t]
Algorithm accuracy at hops • Use a decision tree to learn a classifier: • Model: 0.4128 • Random: 0.0207
The learned model Green bar is prob. that node is good
Comparing search heuristics
• Pick a pair of nodes: start at s
• Walk until we hit the target t, where the next node is chosen by one of these heuristics:

Search alg.   % found   Mean path length
Random        0.0008    3,709
MinGeoDist    0.0282    778
MaxDeg        0.0158    4,964
Deg/Geo2      0.1446    2,676
Cntry         0.0108    402
Cntry*Deg     0.1313    3,114
Lang          0.0055    1,699
Lang*Deg      0.0496    3,163
Age           0.0012    2,890
Age*Deg       0.0203    5,324

• It works! (in a network with 180 million nodes)
-- Milgram’s path completion is 29%
-- Dodds, Muhamad, Watts: 0.015% comp
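The walk procedure can be sketched with a pluggable scoring function that ranks the candidate next nodes. Only the Random and MaxDeg heuristics are shown, since the others need node attributes (location, country, language, age). Several details are assumptions made so the sketch terminates sensibly: the walk steps straight to t when t is a neighbor, prefers unvisited neighbors, and gives up after `max_steps`. None of these choices are confirmed by the slides.

```python
import random


def heuristic_walk(adj, s, t, score, max_steps=1000, seed=0):
    """Walk from s toward t; return the number of steps, or None on failure."""
    rng = random.Random(seed)
    current, visited, steps = s, {s}, 0
    while current != t and steps < max_steps:
        neighbors = list(adj[current])
        if not neighbors:
            return None  # dead end
        if t in neighbors:
            current = t  # assumption: a node recognizes the target next door
        else:
            # Assumption: prefer neighbors not visited yet to avoid ping-pong.
            fresh = [v for v in neighbors if v not in visited] or neighbors
            scored = [(score(adj, v, rng), v) for v in fresh]
            top = max(sc for sc, _ in scored)
            current = rng.choice([v for sc, v in scored if sc == top])
        visited.add(current)
        steps += 1
    return steps if current == t else None


def random_score(adj, v, rng):
    """Random heuristic: every candidate gets a random rank."""
    return rng.random()


def max_degree_score(adj, v, rng):
    """MaxDeg heuristic: prefer the neighbor with the most contacts."""
    return len(adj[v])


# Hypothetical toy graph with one hub that is adjacent to the target.
HUB_ADJ = {
    "s": ["hub", "leaf"],
    "leaf": ["s"],
    "hub": ["s", "t", "x", "y"],
    "t": ["hub"],
    "x": ["hub"],
    "y": ["hub"],
}

if __name__ == "__main__":
    print(heuristic_walk(HUB_ADJ, "s", "t", max_degree_score))
    print(heuristic_walk(HUB_ADJ, "s", "t", random_score, seed=1))
```

Running many such walks over random (s, t) pairs and recording the fraction that finish and their mean length would give the two columns of the table above.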
Conclusions and reflections • Why are networks the way they are? • Only recently have basic properties been observed on a large scale • Confirms social science intuitions; calls others into question • Benefits of working with large data • Observe structures not visible at smaller scales