Algorithms for Large Data Sets

Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 7 May 14, 2006 http://www.ee.technion.ac.il/courses/049011

Web Structure I: Power Laws and the Small-World Phenomenon

Outline
• Power laws
• The preferential attachment model
• Small-world networks
• The Watts-Strogatz model

Observed Phenomena
• Few multi-billionaires, but many with modest income [Pareto, 1896]
• Few frequent words, but many infrequent words [Zipf, 1932]
• Few “mega-cities”, but many small towns [Zipf, 1949]
• Few web pages with high degree, but many with low degree [Kumar et al. 99] [Barabási & Albert 99]
All of the above obey power laws.

Power Law (Pareto) Distribution
• \( \Pr[X \ge x] = (k/x)^{\alpha} \) for \( x \ge k \)
• α > 0: shape parameter (“slope”)
• k > 0: location parameter
• Ex: (k = $1000, α = 2)
  • 1/100 earn ≥ $10,000
  • 1/10,000 earn ≥ $100,000
  • 1/1,000,000 earn ≥ $1,000,000
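The income example can be checked numerically. The sketch below assumes the Pareto tail Pr[X ≥ x] = (k/x)^α for x ≥ k, which is the form the three example figures are consistent with; the function name `pareto_tail` is illustrative and not from the lecture.

```python
def pareto_tail(x, k, alpha):
    """Return Pr[X >= x] for a Pareto distribution with location k and shape alpha."""
    if x <= k:
        return 1.0  # all of the probability mass lies at or above k
    return (k / x) ** alpha

# Slide example: k = $1000, alpha = 2.
for income in (10_000, 100_000, 1_000_000):
    print(income, pareto_tail(income, k=1000, alpha=2))
```

Each tenfold increase in income divides the tail probability by 10^α = 100.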

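Power Law Properties
• PDF: \( f(x) = \alpha k^{\alpha} / x^{\alpha+1} \) for \( x \ge k \)
• Infinite mean for α ≤ 1
• Infinite variance for α ≤ 2
• When X is discrete, \( \Pr[X = x] \propto 1/x^{\alpha+1} \)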
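A minimal sketch of sampling from a power law by inverse-transform sampling, assuming the tail Pr[X ≥ x] = (k/x)^α and the PDF α k^α / x^(α+1). The names `pareto_pdf` and `sample_pareto`, the seed, and the sample size are illustrative choices, not part of the lecture.

```python
import random

def pareto_pdf(x, k, alpha):
    """Density alpha * k^alpha / x^(alpha + 1) for x >= k, and 0 below k."""
    if x < k:
        return 0.0
    return alpha * k**alpha / x ** (alpha + 1)

def sample_pareto(k, alpha, rng):
    """Draw one sample: if U is uniform on (0, 1], then k / U^(1/alpha) has tail (k/x)^alpha."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so we never divide by zero
    return k / u ** (1.0 / alpha)

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [sample_pareto(1000, 2, rng) for _ in range(200_000)]
tail = sum(s >= 10_000 for s in samples) / len(samples)
print(tail)  # should be close to (1000 / 10000)^2 = 0.01
```

For α ≤ 2 the variance is infinite, so the sample mean converges slowly. That is why the check uses a tail probability rather than the mean.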

Power Law Graphs
[Figure: the power law PDF on a linear-scale plot and on a log-log plot. On the log-log plot the PDF is a straight line with slope −α − 1.]

Scale-Free Distributions
• Power laws are invariant to scale
• Ex: (k = arbitrary, α = 2)
  • 1/100 earn ≥ 10k
  • 1/10,000 earn ≥ 100k
  • 1/1,000,000 earn ≥ 1000k
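A short derivation of the scale invariance, assuming the Pareto tail \( \Pr[X \ge x] = (k/x)^{\alpha} \). The scaling factor c is an illustrative symbol that does not appear on the slide.

```latex
\[
\Pr[X \ge c x \mid X \ge x]
  = \frac{\Pr[X \ge c x]}{\Pr[X \ge x]}
  = \frac{\bigl(k/(c x)\bigr)^{\alpha}}{(k/x)^{\alpha}}
  = c^{-\alpha},
  \qquad x \ge k,\ c \ge 1 .
\]
```

The result does not depend on x or k. With α = 2 and c = 10 it gives 1/100 at every scale, which matches the example.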

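Heavy Tailed Distributions
• In many “classical” distributions, \( \Pr[X \ge x] \) decays at least exponentially fast in x: “light tail”
  • Ex: normal, exponential
• In power law distributions, \( \Pr[X \ge x] \) decays only polynomially in x: “heavy tail”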

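Zipf’s Law
• Size of the r-th largest city is \( \propto 1/r^{\beta} \), i.e., about \( c/r^{\beta} \) for constants c, β > 0
• Equivalent to a power law:
  • X = size of a city chosen uniformly at random from n cities
  • Change variables: a city has size ≥ x exactly when its rank satisfies \( r \le (c/x)^{1/\beta} \), so \( \Pr[X \ge x] \approx \frac{1}{n} (c/x)^{1/\beta} \), a power law with shape α = 1/β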
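The equivalence between the rank form and the tail form can be checked on synthetic data. This sketch assumes city sizes of exactly c / r^β for ranks r = 1..n. The symbols c, β, and n and all function names are hypothetical choices for illustration.

```python
def zipf_sizes(n, c, beta):
    """Sizes of n cities where the r-th largest has size c / r^beta."""
    return [c / r**beta for r in range(1, n + 1)]

def empirical_tail(sizes, x):
    """Fraction of cities whose size is at least x."""
    return sum(s >= x for s in sizes) / len(sizes)

def power_law_tail(x, n, c, beta):
    """Tail predicted by the change of variables: (1/n) * (c/x)^(1/beta)."""
    return (c / x) ** (1.0 / beta) / n

n, c, beta = 1000, 1e6, 1.0
sizes = zipf_sizes(n, c, beta)
x = sizes[9]  # size of the 10th largest city
print(empirical_tail(sizes, x), power_law_tail(x, n, c, beta))
```

Both values equal r/n at the size of the r-th largest city, so the rank law and the power law describe the same data.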

Power Laws and the Internet
• Web graph
  • In- and out-degrees (in slope: ~2.1, out slope: ~2.7) [Kumar et al. 99, Barabási & Albert 99, Broder et al. 00]
  • Sizes of connected components [Broder et al. 00]
  • Website sizes [Huberman & Adamic 99]
• Internet graph
  • Degrees [Faloutsos, Faloutsos & Faloutsos 99]
  • Eigenvalues [Mihail & Papadimitriou 02]
• Traffic
  • Number of visits to websites

Power Laws and Graphs
• If X is a random web page, then \( \Pr[\mathrm{indegree}(X) = k] \propto 1/k^{2.1} \)
• What random graph model explains this phenomenon?

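Erdős-Rényi Random Graphs
• \( G_{n,p} \)
  • n: size of the graph (fixed)
  • p: edge existence probability (fixed)
  • Every pair u,v is connected by an edge with probability p, independently of all other pairs.
• Theorem [Erdős & Rényi, 60]: For any node x in \( G_{n,p} \), \( \Pr[\deg(x) = k] = \binom{n-1}{k} p^{k} (1-p)^{n-1-k} \)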
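A small sketch of the degree distribution in G(n, p), assuming each of the n − 1 possible edges at a node is present independently with probability p, so the degree is binomial. The function name `degree_pmf` and the values of n and p are illustrative.

```python
from math import comb

def degree_pmf(n, p, k):
    """Pr[deg(x) = k] in G(n, p): x has n - 1 potential neighbors, each present with probability p."""
    return comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)

n, p = 50, 0.1
pmf = [degree_pmf(n, p, k) for k in range(n)]  # degrees 0 .. n - 1
mean_degree = sum(k * q for k, q in enumerate(pmf))
print(mean_degree)  # should be close to (n - 1) * p
```

A binomial degree is concentrated around its mean and its tail falls off exponentially, so this model does not produce the heavy-tailed degrees observed on the Web.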

Preferential Attachment [Barabási & Albert 99]
• A novel random graph model
• Initialization: the graph starts with a single node with two self loops.
• Growth: at every step a new node v is added to the graph. v has a self loop and connects to one neighbor.
• Preferential attachment: v connects to u with probability proportional to the indegree of u: \( \Pr[v \to u] = \mathrm{indeg}(u) / \sum_{w} \mathrm{indeg}(w) \)
• The rich get richer / The winner takes it all
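A minimal simulation sketch of the model on this slide. It assumes the new node picks its neighbor among the nodes already present, with probability proportional to indegree, before its own self loop is counted; the slide does not spell out that detail. The function name and seed are illustrative.

```python
import random
from collections import Counter

def preferential_attachment(num_nodes, seed=0):
    """Return the list of indegrees after growing the graph to num_nodes nodes."""
    rng = random.Random(seed)
    indegree = [2]      # node 0 starts with two self loops
    endpoints = [0, 0]  # each node appears once per unit of indegree
    for v in range(1, num_nodes):
        u = rng.choice(endpoints)  # picks u with probability indeg(u) / total indegree
        indegree[u] += 1           # edge v -> u
        indegree.append(1)         # self loop of v
        endpoints.extend((u, v))
    return indegree

indeg = preferential_attachment(20_000)
counts = Counter(indeg)
frac_one = counts[1] / len(indeg)
print(frac_one, max(indeg))
```

Sampling uniformly from `endpoints` is a standard way to draw a node in proportion to its indegree without recomputing weights at every step.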

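Why Does it Work?
• \( n_{k,t} \): # of nodes whose indegree = k after t steps (the total indegree after t steps is 2t)
• k > 1: a new edge into a node of indegree k − 1 creates a node of indegree k, and a new edge into a node of indegree k removes one
  • Expected growth: \( E[n_{k,t+1} - n_{k,t}] = \frac{(k-1)\,n_{k-1,t}}{2t} - \frac{k\,n_{k,t}}{2t} \)
• k = 1: \( E[n_{1,t+1} - n_{1,t}] = 1 - \frac{n_{1,t}}{2t} \) (the new node always has indegree 1)

Why Does it Work? (2)
• Fact: After sufficiently many steps, \( n_{k,t}/t \) reaches a “steady state”.
• \( c_k \) = value of \( n_{k,t}/t \) at the steady state.
• Since \( n_{k,t} = c_k t \) at steady state, \( E[n_{k,t+1} - n_{k,t}] = c_k (t+1) - c_k t = c_k \)
• Hence, \( c_1 = 1 - \frac{c_1}{2} \) and, for k > 1, \( c_k = \frac{(k-1)\,c_{k-1} - k\,c_k}{2} \)
• Therefore: \( c_1 = \frac{2}{3} \) and \( c_k = c_{k-1} \cdot \frac{k-1}{k+2} \)

Why Does it Work? (3)
• Then: \( c_k = \frac{2}{3} \prod_{j=2}^{k} \frac{j-1}{j+2} \)
• And: \( \prod_{j=2}^{k} \frac{j-1}{j+2} = \frac{3!\,(k-1)!}{(k+2)!} = \frac{6}{k(k+1)(k+2)} \)
• Therefore: \( c_k = \frac{4}{k(k+1)(k+2)} \approx \frac{4}{k^{3}} \), a power law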
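The steady-state fractions can be computed exactly. This sketch assumes the recurrence c_1 = 2/3 and c_k = c_{k-1} (k − 1)/(k + 2), which follows from the growth model when the total indegree after t steps is 2t. It compares the result with the closed form 4 / (k(k + 1)(k + 2)). The function name is illustrative.

```python
from fractions import Fraction

def steady_state_fractions(max_k):
    """Return {k: c_k} for k = 1..max_k using the steady-state recurrence."""
    c = {1: Fraction(2, 3)}
    for k in range(2, max_k + 1):
        c[k] = c[k - 1] * Fraction(k - 1, k + 2)
    return c

c = steady_state_fractions(50)
closed_form = {k: Fraction(4, k * (k + 1) * (k + 2)) for k in c}
print(c == closed_form, float(sum(c.values())))
```

Exact rational arithmetic avoids rounding error, so the recurrence and the closed form can be compared with plain equality.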

Six Degrees of Separation [Stanley Milgram, 67]
• “Random starters” in Nebraska, Kansas, etc.
• Destinations: in Boston
• Intermediaries send postcards to Milgram
• Findings: average of 6 postcards
• “Conclusion”: every two people in the US are connected by a path of length ~6

Small-World Networks
• Average diameter: length of the shortest path from u to v, averaged over all pairs u,v
• Clustering coefficient: fraction of pairs of neighbors of v that are themselves neighbors, averaged over all v
• Small-world network: a sparse graph with average diameter O(log n) and a constant clustering coefficient
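Both quantities can be computed directly from an adjacency structure. The sketch below assumes an undirected graph stored as a dict mapping each vertex to the set of its neighbors. It averages distances over reachable pairs only and treats vertices with fewer than two neighbors as having clustering 0. Those conventions and the function names are assumptions, not from the lecture.

```python
from collections import deque
from itertools import combinations

def average_distance(adj):
    """Average shortest-path length over all ordered pairs of distinct, mutually reachable vertices."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs  # assumes the graph has at least one edge

def clustering_coefficient(adj):
    """Average over v of the fraction of pairs of v's neighbors that are adjacent."""
    coeffs = []
    for v, nbrs in adj.items():
        if len(nbrs) < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (len(nbrs) * (len(nbrs) - 1) / 2))
    return sum(coeffs) / len(coeffs)

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(average_distance(triangle), clustering_coefficient(triangle))
print(average_distance(path), clustering_coefficient(path))
```

One breadth-first search per vertex costs O(n·m) in total. That is fine for small examples but too slow for Web-scale graphs, where sampling source vertices is the usual workaround.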

The Web as a Small World Network
• Low diameter
  • Study of a synthetic web graph model [Albert, Jeong, Barabási 99]
    • Average diameter of the Web is ~19
    • Grows logarithmically with the size of the Web
  • Study of a large crawl [Broder et al. 00]
    • Average diameter of the SCC is ~16
    • Maximum diameter of the SCC is ≥ 28
  • Diameter of host graph [Adamic 99]
    • Average diameter of SCC: ~4
• High clustering coefficient
  • Clustering coefficient of host graph [Adamic 99]
    • Clustering coefficient: ~0.08 (compared to 0.001 in a comparable random graph)

Model for Small-World Networks [Watts & Strogatz 98]
• One extreme: random networks
  • Low diameter
  • Low clustering coefficient
• Other extreme: “regular” networks (e.g., a lattice)
  • High clustering coefficient
  • High diameter
• Small-world: interpolation between the two
  • Low diameter
  • High clustering coefficient
• Regularity: social networking
• Randomness: individual interests

Random Network
The model:
• n vertices
• Every pair u,v is connected by an edge with probability p = d/n
Properties:
• Expected number of edges: ~dn/2
• Graph is connected w.h.p. (for d ≥ (1 + ε) ln n)
• Diameter: O(log n) w.h.p.
• Clustering coefficient: ~p = d/n = o(1)

Ring Lattice
The model:
• n vertices on a circle
• Every vertex has d neighbors: the d/2 vertices to its right and the d/2 vertices to its left
Properties:
• Number of edges: dn/2
• Graph is connected
• Diameter: O(n/d)
• Clustering coefficient: \( \frac{3(d-2)}{4(d-1)} \approx \frac{3}{4} \)
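A sketch that builds the ring lattice and checks its properties on a small instance. It assumes d is even and much smaller than n, and compares the clustering of one vertex with the known value 3(d − 2) / (4(d − 1)) for this lattice. The function names and the instance n = 100, d = 6 are illustrative.

```python
from itertools import combinations

def ring_lattice(n, d):
    """Adjacency sets for n vertices on a circle, each joined to its d/2 nearest vertices on either side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            w = (v + i) % n  # i-th clockwise neighbor of v
            adj[v].add(w)
            adj[w].add(v)
    return adj

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are adjacent to each other."""
    nbrs = adj[v]
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

n, d = 100, 6
adj = ring_lattice(n, d)
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print(num_edges, local_clustering(adj, 0), 3 * (d - 2) / (4 * (d - 1)))
```

By symmetry every vertex has the same local clustering, so checking a single vertex is enough.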

Random Rewiring
• Start from a ring lattice
• for i = 1 to d/2 do
    for v = 1 to n do
      Pick the i-th clockwise nearest neighbor of v
      With probability p, replace this neighbor by a random vertex
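A runnable sketch of the rewiring loop above. The function name `watts_strogatz`, the `seed` parameter, and the choice to exclude self loops and duplicate edges when picking the random vertex are assumptions. The slide says only "a random vertex".

```python
import random

def watts_strogatz(n, d, p, seed=0):
    """Build a ring lattice (d even), then rewire each clockwise edge with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, d // 2 + 1):
            adj[v].add((v + i) % n)
            adj[(v + i) % n].add(v)
    for i in range(1, d // 2 + 1):      # lap i handles the i-th clockwise neighbors
        for v in range(n):
            old = (v + i) % n
            if old not in adj[v] or rng.random() >= p:
                continue                # keep the lattice edge
            # Assumption: the new endpoint is neither v itself nor an existing neighbor of v.
            candidates = [w for w in range(n) if w != v and w not in adj[v]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[v].discard(old)
            adj[old].discard(v)
            adj[v].add(new)
            adj[new].add(v)
    return adj

lattice = watts_strogatz(50, 4, p=0.0)
rewired = watts_strogatz(50, 4, p=0.2)
print(sorted(lattice[0]), sum(len(s) for s in rewired.values()) // 2)
```

Because every rewiring removes one edge and adds one new edge, the number of edges stays at dn/2 for every p. Only the structure of the graph changes.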

Analysis
• If p = 0, ring lattice
  • High clustering coefficient
  • High diameter
• If p = 1, random network
  • Logarithmic diameter
  • Low clustering coefficient
• However,
  • Diameter goes down rapidly as p grows
  • Clustering coefficient goes down slowly as p grows
• Therefore, for small p, we get a small-world network.
  • Logarithmic diameter
  • High clustering coefficient

End of Lecture 7
