Algorithms for Large Data Sets

Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006 http://www.ee.technion.ac.il/courses/049011

Data Streams (cont.)

Outline • Distinct elements • Lp norms • Notation: for integers a < b, [a,b] = {a, a+1, …, b}

Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02] • Input: a vector x ∈ [1,m]^n • Goal: find D = number of distinct elements of x • Exact algorithms: need Ω(m) bits of space • Deterministic algorithms: need Ω(m) bits of space • Approximate randomized algorithms: O(log m) bits of space

Distinct Elements, 1st Attempt • Let M >> m^2 • Pick a “random hash function” h: [1,m] → [1,M] • h(1),…,h(m) are chosen uniformly and independently from [1,M] • Since M >> m^2, probability of collisions is tiny • min ← M • for i = 1 to n do • read x_i from stream • if h(x_i) < min, min ← h(x_i) • output M/min
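
The first attempt can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the function name, the choice M = m^3, and the lazily filled table standing in for the fully random h are all assumptions made for the example.

```python
import random

def min_hash_estimate(stream, m, seed=0):
    """Estimate the number of distinct values in `stream` (values in [1, m])."""
    rng = random.Random(seed)
    M = m ** 3                  # assumption: any M >> m^2 would do
    h = {}                      # table for the random function h: [1,m] -> [1,M]
    minimum = M
    for x in stream:
        if x not in h:
            # Draw h(x) uniformly and independently the first time x is seen.
            h[x] = rng.randint(1, M)
        minimum = min(minimum, h[x])
    return M / minimum
```

The table `h` is what makes this version expensive: it can grow to m entries of log M bits each.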

Distinct Elements: Analysis • Space: • O(log M) = O(log m) for min • O(m log M) = O(m log m) for h • Too much! • Worse than the naïve O(m) space algorithm • Next: show how to use more “space-efficient” hash functions

Small Families of Hash Functions • H = {h | h: [1,m] → [1,M]}: a family of hash functions • |H| = O(m^c) for some constant c • Therefore, each h ∈ H can be represented in O(log m) bits • Need H to be “explicit”: given the representation of h, can compute h(x), for any x, efficiently. • How do we make sure H has the “random-like” properties of random hash functions?

Universal Hash Functions [Carter, Wegman 79] • H is a 2-universal family of hash functions if: for all x ≠ y ∈ [1,m] and for all z,w ∈ [1,M], when choosing h from H uniformly at random, Pr[h(x) = z and h(y) = w] = 1/M^2 • Conclusions: • For each x, h(x) is uniform in [1,M] • For all x ≠ y, h(x) and h(y) are independent • h(1),…,h(m) is a sequence of uniform pairwise-independent random variables • k-universal families: straightforward generalization

Construction of a Universal Family • Suppose M = prime power • [1,M] can be viewed as a finite field F_M • [1,m] can be viewed as elements of F_M • H = { h_{a,b} | a,b ∈ F_M } is defined as: h_{a,b}(x) = ax + b • Note: • |H| = M^2 • If x ≠ y ∈ F_M and z,w ∈ F_M, then h_{a,b}(x) = z and h_{a,b}(y) = w iff ax + b = z and ay + b = w • Since x ≠ y, the above system has a unique solution (a,b) • Hence, Pr_{a,b}[h_{a,b}(x) = z and h_{a,b}(y) = w] = 1/M^2.
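
The counting argument can be checked exhaustively for a small field. The sketch below assumes M = 7 (a prime, so arithmetic mod 7 is the field) and represents field elements as 0,…,6 rather than 1,…,M; the helper names are made up for the example.

```python
from itertools import product

P = 7  # assumption: a small prime, so integers mod P form the field F_P

def h(a, b, x):
    """The hash function h_{a,b}(x) = ax + b over F_P."""
    return (a * x + b) % P

def count_matching(x, y, z, w):
    """Number of pairs (a, b) with h_{a,b}(x) = z and h_{a,b}(y) = w."""
    return sum(1 for a, b in product(range(P), repeat=2)
               if h(a, b, x) == z and h(a, b, y) == w)

def is_two_universal():
    """True if every x != y and every (z, w) is hit by exactly one (a, b)."""
    return all(count_matching(x, y, z, w) == 1
               for x, y, z, w in product(range(P), repeat=4) if x != y)
```

Exactly one of the P^2 pairs (a,b) matches each constraint, which is the 1/M^2 probability in the definition.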

Distinct Elements, 2nd Attempt • Use 2-universal hash functions rather than a random hash function • Space: • O(log m) for tracking the minimum • O(log m) for storing the hash function • Correctness: • Part 1: • h(a_1),…,h(a_D) are still uniform in [1,M] • Linearity of expectation holds regardless of whether Z_1,…,Z_k are independent or not. • Part 2: • h(a_1),…,h(a_D) are still uniform in [1,M] • Main point: variance of pairwise independent variables is additive: Var[Z_1 + … + Z_k] = Var[Z_1] + … + Var[Z_k]

Distinct Elements, Better Approximation • So far we had a factor 6 approximation. • How do we get a better one? • 1 + ε approximation algorithm: • Find the t = O(1/ε^2) smallest hash values, rather than just the smallest one. • If v is the largest among these, output tM/v • Space: O(1/ε^2 log m) • Better algorithm: O(1/ε^2 + log m)
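
A sketch of the t-smallest-values estimator, using a hash function of the form ax + b over a prime field. The prime 2^61 - 1, the function name, and the fallback for streams with fewer than t distinct hash values are assumptions of this example; the constant hidden in t = O(1/ε^2) is left to the caller.

```python
import heapq
import random

P = (1 << 61) - 1   # a Mersenne prime playing the role of M (assumes m < P)

def t_smallest_estimate(stream, t, seed=0):
    """Estimate the number of distinct values from the t smallest hash values."""
    rng = random.Random(seed)
    a, b = rng.randrange(P), rng.randrange(P)   # h_{a,b}(x) = ax + b over F_P
    heap, kept = [], set()                      # negated max-heap of the t smallest hashes
    for x in stream:
        v = (a * x + b) % P + 1                 # shift the hash into [1, P]
        if v in kept:
            continue                            # this hash value is already tracked
        if len(heap) < t:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)   # drop the current largest
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < t:
        return len(heap)                        # fewer than t distinct hashes were seen
    return t * P / -heap[0]                     # t * M / v
```

Only t hash values are stored at any time, which matches the O(1/ε^2 log m) space bound.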

Lp Norms • Input: an integer vector x ∈ [-m,+m]^n • Goal: find ||x||_p = L_p norm of x • Popular instantiations: • L_2: Euclidean distance • L_1: Manhattan distance • L_∞: max • L_0: # of non-zeros (assuming 1/0 = 1, 0^0 = 0) • Not a norm • Data stream algorithm: • Can be done trivially in O(log m) space
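
When the coordinates x_1,…,x_n arrive one at a time, a single accumulator is enough. The following is a minimal sketch with a made-up function name; it treats p = 0 and p = ∞ as the special cases listed above.

```python
def lp_norm(stream, p):
    """One-pass L_p norm of the coordinates in `stream`; p may be 0 or float('inf')."""
    inf = float("inf")
    acc = 0
    for xi in stream:
        if p == inf:
            acc = max(acc, abs(xi))        # L_inf: largest absolute value
        elif p == 0:
            acc += 1 if xi != 0 else 0     # L_0: count of non-zeros (0^0 = 0)
        else:
            acc += abs(xi) ** p            # running sum of |x_i|^p
    if p == 0 or p == inf:
        return acc
    return acc ** (1 / p)
```

For example, `lp_norm([3, -4], 2)` evaluates the Euclidean norm of (3, -4).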

Lp Norms: The “Cash Register” Model • Input: a sequence X of N pairs (i_1,a_1),…,(i_N,a_N) • For each j, i_j ∈ {1,…,n} • For each j, a_j ∈ [-m,m] • Ex: X = (1,3), (3,-2), (1,-5), (2,4), (2,1) • For each i = 1,…,n, let S_i = { j | i_j = i } • Ex: S_1 = {1,3}, S_2 = {4,5}, S_3 = {2} • Define: x_i = Σ_{j ∈ S_i} a_j • Ex: x_1 = -2, x_2 = 5, x_3 = -2 • Goal: find ||x||_p = L_p norm of x
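
The definition of x can be reproduced on the slide's example. This sketch stores x explicitly, so it is only a baseline for checking the definitions, not a small-space algorithm; the function name is an assumption.

```python
from collections import defaultdict

def accumulate(updates):
    """Sum the increments a_j per coordinate i_j, giving x_i = sum of a_j over S_i."""
    x = defaultdict(int)
    for i, a in updates:
        x[i] += a
    return dict(x)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
x = accumulate(X)                          # x_1 = -2, x_2 = 5, x_3 = -2
l1 = sum(abs(v) for v in x.values())       # ||x||_1 = 2 + 5 + 2
```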

Lp Norms in the “Cash Register” Model: Applications • Standard L_p norms • L_p distances • Input: two vectors x,y ∈ [-m,+m]^n (interleaved arbitrarily) • Goal: find ||x - y||_p • Frequency moments: • Input: a vector X ∈ [1,n]^N • Ex: X = (1 2 3 1 1 2) • For each i = 1,…,n, define: x_i = frequency of i in X • Ex: x_1 = 3, x_2 = 2, x_3 = 1 • Goal: output ||x||_p • Special cases: • p = ∞: Most frequent element • p = 0: Distinct elements

Lp Norms: State of the Art Results • 0 < p ≤ 2: O(log n log m) space algorithm [Indyk 00] • 2 < p < ∞: O(n^(1-2/p) log m) space algorithm [Indyk, Woodruff 05] • Ω(n^(1-2/p-o(1))) space lower bound [Saks, Sun 02], [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03] • p = ∞: O(n) space algorithm [Alon, Matias, Szegedy 96] • Ω(n) space lower bound [Alon, Matias, Szegedy 96] • p = 0 (distinct elements): O(log n + 1/ε^2) space algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02] • Ω(log n + 1/ε^2) space lower bound [Alon, Matias, Szegedy 96], [Indyk, Woodruff 03]

Stable Distributions • D: distribution on R, x ∈ R^n, p ∈ (0,2] • The distribution D_x: • Z_1,…,Z_n: i.i.d. random variables with distribution D • D_x = distribution of Σ_i x_i Z_i • The distribution D_{p,x}: • Z: random variable with distribution D • D_{p,x} = distribution of ||x||_p Z • Definition: D is p-stable, if for every x, D_x = D_{p,x}. • Examples: • p = 2: Standard normal distribution. • p = 1: Cauchy distribution. • Other p’s: no closed form pdf.
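
The 1-stability of the Cauchy distribution can be observed empirically. The sketch below samples Σ_i x_i Z_i many times and compares the median of its absolute value with ||x||_1, using the fact that |Z| has median 1 for a standard Cauchy Z. The vector, the sample count, and the helper names are arbitrary choices for illustration.

```python
import math
import random
import statistics

def cauchy(rng):
    """One standard Cauchy sample via the inverse cdf tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def median_abs_combination(x, trials, seed=0):
    """Median of |sum_i x_i Z_i| over `trials` independent draws of Z_1..Z_n."""
    rng = random.Random(seed)
    samples = [abs(sum(xi * cauchy(rng) for xi in x)) for _ in range(trials)]
    return statistics.median(samples)

x = [3, -1, 4, -2]
ratio = median_abs_combination(x, trials=20000) / sum(abs(v) for v in x)
# If D_x = D_{1,x}, the ratio should be close to median(|Z|) = 1.
```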

Indyk’s Algorithm • For simplicity, assume p = 1. • Input: a sequence X = (i_1,a_1),…,(i_N,a_N) • Output: a value z s.t. Pr[(1-ε)||x||_1 ≤ z ≤ (1+ε)||x||_1] ≥ 1-δ • “Cauchy hash function”: h: [1,n] → R • h(1),…,h(n) are i.i.d. with Cauchy distribution • In practice, use bounded precision

Indyk’s Algorithm, 1st Attempt • k ← O(1/ε^2 log(1/δ)) • generate k Cauchy hash functions h_1,…,h_k • for t = 1,…,k do • A_t ← 0 • for j = 1,…,N do • read (i_j,a_j) from data stream • for t = 1,…,k do • A_t ← A_t + a_j h_t(i_j) • output median(|A_1|,…,|A_k|)
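
A direct sketch of this first attempt, with the k hash functions stored as explicit tables of Cauchy samples. The function name and the choice of k are assumptions. The output is the median of the absolute values |A_t|, since a Cauchy variable has median 0 while its absolute value has median 1.

```python
import math
import random
import statistics

def l1_estimate(updates, n, k, seed=0):
    """Estimate ||x||_1 from cash-register updates (i, a) with i in [1, n]."""
    rng = random.Random(seed)
    # h[t][i-1] holds h_t(i): k * n stored samples, which is the costly part.
    h = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
         for _ in range(k)]
    A = [0.0] * k
    for i, a in updates:
        for t in range(k):
            A[t] += a * h[t][i - 1]
    return statistics.median(abs(v) for v in A)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
estimate = l1_estimate(X, n=3, k=4001)     # true value: ||x||_1 = 9
```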

Correctness Analysis • Fix some t ∈ [1,k] • What value does A_t have at the end of the execution? A_t = Σ_{i=1..n} x_i h_t(i) • Recall: h_t(1),…,h_t(n) are i.i.d. with 1-stable distribution • Therefore, A_t is distributed the same as: ||x||_1 Z • Z: random variable with Cauchy distribution

Correctness Analysis (cont.) • Z_1,…,Z_k: i.i.d. random variables with Cauchy distribution • Output of algorithm: median(|A_1|,…,|A_k|) • Same distribution as: median(||x||_1 |Z_1|,…,||x||_1 |Z_k|) = ||x||_1 median(|Z_1|,…,|Z_k|) • Conclusion: enough to show: Pr[median(|Z_1|,…,|Z_k|) ∈ [1-ε, 1+ε]] ≥ 1-δ

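Correctness Analysis (cont.) • Claim: Let Z be distributed Cauchy. Then, Pr[|Z| ≤ 1] = 1/2 • Proof: The cdf of the Cauchy distribution is: F(z) = 1/2 + arctan(z)/π • Therefore, Pr[|Z| ≤ 1] = F(1) - F(-1) = (2/π) arctan(1) = 1/2 • Claim: Let Z be distributed Cauchy. For any sufficiently small ε > 0, Pr[|Z| ≤ 1-ε] ≤ 1/2 - ε/4 and Pr[|Z| ≤ 1+ε] ≥ 1/2 + ε/4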
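
These claims can be checked numerically. The standard Cauchy cdf is F(z) = 1/2 + arctan(z)/π, so Pr[|Z| ≤ c] = (2/π) arctan(c). The sketch assumes the second claim is the pair of bounds Pr[|Z| ≤ 1-ε] ≤ 1/2 - ε/4 and Pr[|Z| ≤ 1+ε] ≥ 1/2 + ε/4, which matches the ε/4 slack used later in the Chernoff-Hoeffding step; the function names are made up.

```python
import math

def prob_abs_at_most(c):
    """Pr[|Z| <= c] for a standard Cauchy Z, i.e. F(c) - F(-c)."""
    return 2 * math.atan(c) / math.pi

def claim_holds(eps):
    """Check both assumed bounds for a given eps."""
    return (prob_abs_at_most(1 - eps) <= 0.5 - eps / 4 and
            prob_abs_at_most(1 + eps) >= 0.5 + eps / 4)
```

The bounds hold for small ε such as 0.01 or 0.1 but fail at ε = 1, which is why the claim is stated only for sufficiently small ε.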

Correctness Analysis (cont.) • Claim: Let Z_1,…,Z_k be k = O(1/ε^2 log(1/δ)) i.i.d. Cauchy random variables. Then, Pr[median(|Z_1|,…,|Z_k|) ∈ [1-ε, 1+ε]] ≥ 1-δ • Proof: • For j = 1,…,k, let Y_j = 1 if |Z_j| < 1-ε, and Y_j = 0 otherwise • Then, median(|Z_1|,…,|Z_k|) < 1-ε iff Σ_j Y_j ≥ k/2 • E[Σ_j Y_j] ≤ k/2 - εk/4 • By the Chernoff-Hoeffding bound, Pr[Σ_j Y_j ≥ k/2] < δ/2 • Similar analysis shows: Pr[median(|Z_1|,…,|Z_k|) > 1+ε] < δ/2
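
One way to make the Chernoff-Hoeffding step concrete is sketched below. It assumes the additive Hoeffding form for independent Y_j ∈ {0,1} and the bound E[Σ_j Y_j] ≤ k/2 - εk/4; the resulting constant 8 is one valid choice and not necessarily the one used in the lecture.

```latex
\begin{align*}
\Pr\Big[\sum_{j} Y_j \ge \tfrac{k}{2}\Big]
  &\le \Pr\Big[\sum_{j} Y_j - \mathbb{E}\Big[\sum_{j} Y_j\Big] \ge \tfrac{\varepsilon k}{4}\Big] \\
  &\le \exp\!\Big(-\frac{2(\varepsilon k/4)^2}{k}\Big)
   = \exp\!\Big(-\frac{\varepsilon^2 k}{8}\Big) \\
  &\le \frac{\delta}{2}
  \quad\text{whenever}\quad k \ge \frac{8}{\varepsilon^2}\ln\frac{2}{\delta}.
\end{align*}
```

This is where the choice k = O(1/ε^2 log(1/δ)) comes from.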

Space Analysis • Space used: k = O(1/ε^2 log(1/δ)) times: • A_t: O(log m) bits • h_t: O(n log m) bits • Too much! • This time we really need h_t(1),…,h_t(n) to be totally independent • Otherwise, resulting distribution is not stable • Cannot use universal hashing • What can we do?

Pseudo-Random Generators for Space-Bounded Computations [Nisan 90] • Notation: U_k = a random sequence of k bits • An S-space R-random bits randomized algorithm A: • Uses at most S bits of space • Uses at most R random bits • Accesses random bits sequentially • A(x,U_R): (random) output of A on input x • Nisan’s pseudo-random generator: G: {0,1}^(S log R) → {0,1}^R s.t. • For every S-space R-random bits randomized algorithm A, • for every input x, • A(x,U_R) has almost the same distribution as A(x,G(U_{S log R}))

Space Analysis • Suppose input stream is guaranteed to come in the following order: • First all pairs of the form (1,*) • Then, all pairs of the form (2,*), • … • Finally, all pairs of the form (n,*) • Then, we can generate the values h_t(1),…,h_t(n) on the fly, and no need to store them • O(log m) bits will suffice to store the hash function • Therefore, for such input streams, Indyk’s algorithm uses: • O(log m) bits of space • O(n log m) random bits

Space Analysis (cont.) • Conclusion: For “ordered” input streams, Indyk’s algorithm is an O(log m)-space O(n log m)-random bits randomized algorithm. • Can use Nisan’s generator • h_t can now be generated from only O(log m log n) random bits • Space needed: O(log n log m) bits • Crucial observation: Indyk’s algorithm does not depend on the order of the input stream. • Conclusion: If we generate the Cauchy hash functions using Nisan’s generator, then Indyk’s algorithm will work even for “unordered” streams.
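
The two ingredients of this argument, hash values recomputed from a short seed and order independence of the counters, can be illustrated as follows. Note the loud assumption: SHA-256 is used here only as a convenient stand-in for a seeded generator. It is not Nisan's generator and carries none of its guarantees; the function names are hypothetical.

```python
import hashlib
import math

def h(seed, t, i):
    """Cauchy value h_t(i), recomputed on demand from (seed, t, i); nothing is stored."""
    digest = hashlib.sha256(f"{seed}:{t}:{i}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2 ** 64     # roughly uniform in [0, 1]
    return math.tan(math.pi * (u - 0.5))

def sketch(updates, k, seed=0):
    """The counters A_1..A_k after processing the updates in the given order."""
    A = [0.0] * k
    for i, a in updates:
        for t in range(k):
            A[t] += a * h(seed, t, i)
    return A

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
in_order = sketch(sorted(X), k=5)      # "ordered" stream: all (1,*), then (2,*), ...
as_given = sketch(X, k=5)              # the same updates in arbitrary order
```

Both runs end with the same counters up to floating-point rounding, because each A_t is a sum of the same terms a_j h_t(i_j).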

Wrapping Up • Space used: k = O(1/ε^2 log(1/δ)) times: • A_t: O(log m) bits • h_t: O(log n log m) bits (using Nisan’s generator) • Total: O(1/ε^2 log(1/δ) log n log m) bits

End of Lecture 13
