Algorithms for Large Data Sets. Ziv Bar-Yossef. Lecture 7, May 14, 2006. http://www.ee.technion.ac.il/courses/049011. Web Structure I: Power Laws and Small World Phenomenon.
Outline • Power laws • The preferential attachment model • Small-world networks • The Watts-Strogatz model
Observed Phenomena • Few multi-billionaires, but many with modest income [Pareto, 1896] • Few frequent words, but many infrequent words [Zipf, 1932] • Few “mega-cities” but many small towns [Zipf, 1949] • Few web pages with high degree, but many with low degree [Kumar et al, 99] [Barabási & Albert, 99] All the above obey power laws.
Power Law (Pareto) Distribution • $\Pr[X \ge x] = (k/x)^{\alpha}$ for $x \ge k$ • $\alpha > 0$: shape parameter (“slope”) • $k > 0$: location parameter • Ex: ($k = \$1000$, $\alpha = 2$) • 1/100 earn ≥ $10,000 • 1/10,000 earn ≥ $100,000 • 1/1,000,000 earn ≥ $1,000,000
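The income example can be checked numerically. The following is a minimal sketch, assuming the Pareto tail has the form Pr[X ≥ x] = (k/x)^α for x ≥ k; the function name `pareto_tail` is chosen here for illustration and is not from the lecture.

```python
def pareto_tail(x, k, alpha):
    """Pr[X >= x] for a Pareto distribution with location k and shape alpha.

    Assumes the tail form (k / x) ** alpha for x >= k.
    """
    if x < k:
        return 1.0  # every sample is at least k
    return (k / x) ** alpha


if __name__ == "__main__":
    # Reproduce the slide's example: k = $1000, alpha = 2.
    for income in (10_000, 100_000, 1_000_000):
        print(income, pareto_tail(income, k=1_000, alpha=2))
```

Because only the ratio x/k enters the formula, the same fractions come out for any choice of k, which is the scale-free property.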
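Power Law Properties • PDF: $f(x) = \alpha k^{\alpha} / x^{\alpha+1}$ for $x \ge k$ • Infinite mean for $\alpha \le 1$ • Infinite variance for $\alpha \le 2$ • When X is discrete, $\Pr[X = x] \propto 1/x^{\alpha+1}$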
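The two thresholds follow from direct integration. The worked computation below is a sketch that assumes the Pareto density $f(x) = \alpha k^{\alpha} / x^{\alpha+1}$ on $x \ge k$; it is not taken from the slides.

```latex
\begin{align*}
\mathbb{E}[X]
  &= \int_k^\infty x \cdot \frac{\alpha k^{\alpha}}{x^{\alpha+1}}\,dx
   = \alpha k^{\alpha} \int_k^\infty x^{-\alpha}\,dx
   = \begin{cases}
       \dfrac{\alpha k}{\alpha - 1} & \text{if } \alpha > 1,\\[1ex]
       \infty & \text{if } \alpha \le 1,
     \end{cases}
\\[1ex]
\mathbb{E}[X^2]
  &= \alpha k^{\alpha} \int_k^\infty x^{1-\alpha}\,dx
   = \begin{cases}
       \dfrac{\alpha k^{2}}{\alpha - 2} & \text{if } \alpha > 2,\\[1ex]
       \infty & \text{if } \alpha \le 2.
     \end{cases}
\end{align*}
```

So for $1 < \alpha \le 2$ the mean is finite while the second moment, and hence the variance, is infinite.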
Power Law Graphs • [Figure: a power law on a linear-scale plot and on a log-log plot; on the log-log plot it is a straight line with slope $-\alpha - 1$]
Scale-Free Distributions • Power laws are invariant to scale: $\Pr[X \ge ck] = c^{-\alpha}$ for every $c \ge 1$, whatever the value of k • Ex: (k = arbitrary, $\alpha = 2$) • 1/100 earn ≥ 10k • 1/10,000 earn ≥ 100k • 1/1,000,000 earn ≥ 1000k
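Heavy Tailed Distributions • In many “classical” distributions the tail $\Pr[X \ge x]$ decays at least exponentially fast in x: “light tail” • Ex: normal, exponential • In power law distributions the tail decays only polynomially, $\Pr[X \ge x] \propto x^{-\alpha}$: “heavy tail”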
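Zipf’s Law • Size of r-th largest city is $s(r) \propto r^{-\beta}$ • Equivalent to a power law: • X = size of a city • Change variables: among n cities, a city of size $x = s(r)$ has rank $r \propto x^{-1/\beta}$, so $\Pr[X \ge x] = r/n \propto x^{-1/\beta}$, a power law with $\alpha = 1/\beta$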
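The change of variables can be checked numerically. This is a minimal sketch, assuming the rank-size form s(r) ∝ r^(-β); the symbol β and the helper names `loglog_slope` and `zipf_tail_slope` are chosen here for illustration. It builds sizes that follow Zipf's law exactly and fits the slope of the tail Pr[X ≥ x] on a log-log scale, which should come out as -1/β.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


def zipf_tail_slope(n, beta):
    """Fit the log-log slope of Pr[X >= x] for n sizes with s(r) = r ** (-beta)."""
    ranks = range(1, n + 1)
    sizes = [r ** (-beta) for r in ranks]
    # Exactly r of the n items have size >= s(r), so Pr[X >= s(r)] = r / n.
    tails = [r / n for r in ranks]
    return loglog_slope(sizes, tails)


if __name__ == "__main__":
    print(zipf_tail_slope(1000, beta=0.5))  # close to -1 / 0.5
```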
Power Laws and the Internet
• Web graph
  • In- and out-degrees (in slope: ~2.1, out slope: ~2.7) [Kumar et al. 99, Barabási & Albert 99, Broder et al. 00]
  • Sizes of connected components [Broder et al. 00]
  • Website sizes [Huberman & Adamic 99]
• Internet graph
  • Degrees [Faloutsos, Faloutsos & Faloutsos 99]
  • Eigenvalues [Mihail & Papadimitriou 02]
• Traffic
  • Number of visits to websites
Power Laws and Graphs • If X is a random web page, then $\Pr[\mathrm{indeg}(X) = k] \propto 1/k^{2.1}$ (and similarly for out-degrees, with exponent ~2.7) • What random graph model explains this phenomenon?
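Erdős-Rényi Random Graphs • $G_{n,p}$ • n: size of the graph (fixed) • p: edge existence probability (fixed) • Every pair u,v is connected by an edge independently with probability p. • Theorem [Erdős & Rényi, 60] For any node x in $G_{n,p}$, the degree is binomially distributed: $\Pr[\deg(x) = k] = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$, so its tail is light rather than a power law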
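The degree of a fixed node in G(n, p) counts successes among n - 1 independent edge trials, so it is Binomial(n - 1, p). The sketch below compares that binomial probability with a power law of exponent 2.1 at a large degree; the function name `degree_pmf` and the parameter values are illustrative assumptions.

```python
import math


def degree_pmf(n, p, k):
    """Pr[deg(x) = k] for a fixed node x in G(n, p): Binomial(n - 1, p)."""
    return math.comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)


if __name__ == "__main__":
    n, p = 1000, 0.01  # expected degree is about 10
    for k in (10, 20, 50):
        # Binomial probability next to an (unnormalized) power law k ** -2.1.
        print(k, degree_pmf(n, p, k), k ** -2.1)
```

Far above the mean degree the binomial probability becomes negligible, while the power law still assigns noticeable weight, which is why G(n, p) cannot explain the observed web degrees.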
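Preferential Attachment [Barabási & Albert 99] • A novel random graph model • Initialization: the graph starts with a single node with two self loops. • Growth: at every step a new node v is added to the graph. v has a self loop and connects to one neighbor. • Preferential attachment: v connects to u with probability $\mathrm{indeg}(u) / \sum_{w} \mathrm{indeg}(w) = \mathrm{indeg}(u) / 2t$, where t is the number of nodes already in the graph (each node contributes a self loop and one outgoing edge, so the indegrees sum to 2t) • The rich get richer / The winner takes it all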
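The growth process can be simulated in a few lines. This is a minimal sketch, assuming the attachment probability is proportional to the current indegree of u (with every self loop counted as one incoming edge); the function name `preferential_attachment` and the `targets` list trick are illustrative choices, not the lecture's code.

```python
import random
from collections import Counter


def preferential_attachment(n, seed=0):
    """Grow the graph to n nodes and return the list of indegrees."""
    rng = random.Random(seed)
    indeg = [2]       # node 0 starts with two self loops
    targets = [0, 0]  # one entry per edge head, so a uniform pick is indegree-biased
    for v in range(1, n):
        u = rng.choice(targets)  # Pr[u] = indeg(u) / (sum of all indegrees)
        indeg.append(1)          # the self loop of the new node v
        indeg[u] += 1            # the edge v -> u
        targets.extend([v, u])
    return indeg


if __name__ == "__main__":
    degrees = preferential_attachment(20_000)
    histogram = Counter(degrees)
    for k in (1, 2, 4, 8, 16):
        print(k, histogram[k] / len(degrees))
```

Keeping one list entry per edge head makes each step O(1): a uniform draw from `targets` selects node u with probability equal to its share of the total indegree.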
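Why Does it Work? • $N_{k,t}$: # of nodes whose indegree = k after t steps (the indegrees then sum to 2t) • k > 1: $\Pr[N_{k,t+1} = N_{k,t} + 1] = \frac{(k-1) N_{k-1,t}}{2t}$ and $\Pr[N_{k,t+1} = N_{k,t} - 1] = \frac{k N_{k,t}}{2t}$ • Expected growth: $E[N_{k,t+1}] - N_{k,t} = \frac{(k-1) N_{k-1,t} - k N_{k,t}}{2t}$ • k = 1: $E[N_{1,t+1}] - N_{1,t} = 1 - \frac{N_{1,t}}{2t}$ (the new node always arrives with indegree 1)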
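Why Does it Work? (2) • Fact: After sufficiently many steps, $N_{k,t}/t$ reaches a “steady state”. • $c_k$ = value of $N_{k,t}/t$ at the steady state. • Since $N_{k,t} = c_k t$ at steady state, $E[N_{k,t+1}] - N_{k,t} = c_k$ • Hence, $c_k = \frac{(k-1) c_{k-1} - k c_k}{2}$ for k > 1, and $c_1 = 1 - \frac{c_1}{2}$ • Therefore: $c_k = \frac{k-1}{k+2}\, c_{k-1}$ and $c_1 = \frac{2}{3}$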
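Why Does it Work? (3) • Then: $c_k = c_1 \prod_{j=2}^{k} \frac{j-1}{j+2} = \frac{2}{3} \cdot \frac{3!\,(k-1)!}{(k+2)!}$ • And: $c_k = \frac{4}{k(k+1)(k+2)}$ • Therefore: $c_k \sim 4 k^{-3}$, a power law with exponent 3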
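The steady-state computation can be verified with exact rational arithmetic. The sketch below assumes the attachment probability indeg(u)/2t, which gives the steady-state equations c_1 = 1 - c_1/2 and c_k = ((k-1) c_{k-1} - k c_k)/2; the function name `steady_state` is illustrative.

```python
from fractions import Fraction


def steady_state(kmax):
    """Return {k: c_k} from c_1 = 2/3 and c_k = c_{k-1} * (k - 1) / (k + 2)."""
    c = {1: Fraction(2, 3)}
    for k in range(2, kmax + 1):
        c[k] = c[k - 1] * Fraction(k - 1, k + 2)
    return c


if __name__ == "__main__":
    c = steady_state(6)
    for k, value in c.items():
        # The closed form 4 / (k (k+1) (k+2)) is printed alongside for comparison.
        print(k, value, Fraction(4, k * (k + 1) * (k + 2)))
```

Under these assumptions the values c_k also sum to 1 as kmax grows, as they should for fractions of nodes.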
Six Degrees of Separation [Stanley Milgram, 67] • “Random starters” in Nebraska, Kansas, etc. • Destinations: in Boston • Intermediaries send postcards to Milgram • Findings: average of 6 postcards • “Conclusion”: every two people in the US are connected by a path of length ~6
Small-World Networks • Average diameter: length of shortest path from u to v, averaged over all pairs u,v • Clustering coefficient: fraction of neighbors of v that are neighbors of each other, averaged over all v • Small-world network: a sparse graph with average diameter O(log n) and a constant clustering coefficient
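Both quantities are straightforward to compute on a small graph. The following is a minimal sketch, assuming an undirected graph stored as a dict that maps each node to the set of its neighbors; that representation and the function names are choices made here. Nodes with fewer than two neighbors are counted as having clustering 0, and unreachable pairs are skipped when averaging distances.

```python
from collections import deque
from itertools import combinations


def average_distance(adj):
    """Mean shortest-path length over ordered pairs of distinct, connected nodes."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs  # assumes the graph has at least one edge


def clustering_coefficient(adj):
    """Average over v of the fraction of pairs of v's neighbors that are adjacent."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # contributes 0 to the average
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


if __name__ == "__main__":
    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    print(average_distance(triangle), clustering_coefficient(triangle))
```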
The Web as a Small World Network
• Low diameter
  • Study of a synthetic web graph model [Albert, Jeong, Barabási 99]
    • Average diameter of the Web is ~19
    • Grows logarithmically with the size of the Web
  • Study of a large crawl [Broder et al. 00]
    • Average diameter of the strongly connected component (SCC) is ~16
    • Maximum diameter of the SCC is ≥ 28
  • Diameter of host graph [Adamic 99]
    • Average diameter of SCC: ~4
• High clustering coefficient
  • Clustering coefficient of host graph [Adamic 99]
    • Clustering coefficient: ~0.08 (compared to 0.001 in a comparable random graph)
Model for Small-World Networks [Watts & Strogatz 98]
• One extreme: random networks
  • Low diameter
  • Low clustering coefficient
• Other extreme: “regular” networks (e.g., a lattice)
  • High clustering coefficient
  • High diameter
• Small-world: interpolation between the two
  • Low diameter
  • High clustering coefficient
• Regularity: social networking
• Randomness: individual interests
Random Network The model: • n vertices • Every pair u,v is connected by an edge with probability p = d/n Properties: • Expected number of edges: ~dn/2 • Graph is connected w.h.p. (when d ≥ c ln n for a constant c > 1) • Diameter: O(log n) w.h.p. • Clustering coefficient: ~p = d/n = o(1)
Ring Lattice The model: • n vertices on a circle • Every vertex has d neighbors: the d/2 vertices to its right and the d/2 vertices to its left Properties: • Number of edges: dn/2 • Graph is connected • Diameter: O(n/d) • Clustering coefficient: $\frac{3(d-2)}{4(d-1)}$, which tends to 3/4 as d grows (a constant, independent of n)
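The lattice properties can be confirmed on a concrete instance. This sketch builds the ring lattice as a dict of neighbor sets (a representation chosen here) and compares the measured clustering of a vertex with the known closed form 3(d-2)/(4(d-1)), which holds when n is large compared with d; the function names are illustrative.

```python
from itertools import combinations


def ring_lattice(n, d):
    """n vertices on a circle, each adjacent to the d/2 nearest vertices on either side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            w = (v + i) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def lattice_clustering_formula(d):
    """Closed form for the ring lattice, assuming n is much larger than d."""
    return 3 * (d - 2) / (4 * (d - 1))


if __name__ == "__main__":
    adj = ring_lattice(100, 6)
    print(sum(len(s) for s in adj.values()) // 2)  # number of edges, dn/2
    print(local_clustering(adj, 0), lattice_clustering_formula(6))
```

By symmetry every vertex has the same local clustering, so checking one vertex is enough.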
Random Rewiring
• Start from a ring lattice
• for i = 1 to d/2 do
  • for v = 1 to n do
    • Pick the i-th clockwise nearest neighbor of v
    • With probability p, replace this neighbor by a random vertex
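The rewiring loop translates almost directly into code. The following is a minimal sketch, assuming an undirected graph stored as a dict of neighbor sets, n larger than d, and a replacement vertex drawn uniformly among vertices that are neither v itself nor already adjacent to v (so no self loops or duplicate edges appear). The function name `watts_strogatz` and the `seed` parameter are illustrative.

```python
import random


def watts_strogatz(n, d, p, seed=0):
    """Ring lattice on n vertices with degree d, each lattice edge rewired with probability p."""
    rng = random.Random(seed)
    # Start from a ring lattice.
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            w = (v + i) % n
            adj[v].add(w)
            adj[w].add(v)
    # Rewire: outer loop over neighbor distance i, inner loop over vertices v.
    for i in range(1, d // 2 + 1):
        for v in range(n):
            if rng.random() >= p:
                continue  # keep this lattice edge
            old = (v + i) % n  # the i-th clockwise nearest neighbor of v
            candidates = [u for u in range(n) if u != v and u not in adj[v]]
            if not candidates:
                continue  # v is already adjacent to every other vertex
            new = rng.choice(candidates)
            adj[v].discard(old)
            adj[old].discard(v)
            adj[v].add(new)
            adj[new].add(v)
    return adj


if __name__ == "__main__":
    g = watts_strogatz(200, 6, 0.1)
    print(sum(len(s) for s in g.values()) // 2)  # edge count stays dn/2
```

Rewiring moves edges rather than adding or deleting them, so the graph stays sparse with dn/2 edges for every p. Building the candidate list costs O(n) per rewired edge, which is fine for a sketch but not for large graphs.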
Analysis
• If p = 0, ring lattice
  • High clustering coefficient
  • High diameter
• If p = 1, random network
  • Logarithmic diameter
  • Low clustering coefficient
• However,
  • Diameter goes down rapidly as p grows
  • Clustering coefficient goes down slowly as p grows
• Therefore, for small p, we get a small-world network.
  • Logarithmic diameter
  • High clustering coefficient