Optimal Approximations of the Frequency Moments of Data Streams
Piotr Indyk, David Woodruff
The Streaming Model
Example stream: 4 3 7 3 1 1 7 …
• Stream of elements $a_1, \dots, a_n$, each in $\{1, \dots, m\}$
• Want to compute statistics on the stream
• Elements arranged in adversarial order
• Algorithms are given one pass over the stream
• Goal: minimum-space algorithm
Frequency Moments [AMS96]
• $n$ = stream size, $m$ = universe size
• $f_i$ = # occurrences of item $i$
• $k$-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
• $F_0$ = # of distinct elements
• $F_1 = n$ = stream size
• $F_2$ = self-join size
Why are frequency moments important?
Applications
• Estimating distinct elements with low space
• Estimate query selectivity to a huge DB without sorting
• Routers gather # of distinct destinations
• $F_2$ estimates the size of self-joins: for example, with $f_B = 2$ and $f_A = 1$, $f_B^2 + f_A^2 = 4 + 1 = 5$
• $F_k$ measures data skewness
The Best Deterministic Algorithm
• Trivial algorithm for $F_k$:
  • Store/update $f_i$ for each item $i$, sum $f_i^k$ at the end
  • Space = $O(m \log n)$: $m$ items $i$, $\log n$ bits to count $f_i$
• Negative results [AMS96]:
  • Computing $F_k$ exactly requires $\Omega(m)$ space
  • Any deterministic algorithm that outputs $X$ with $|F_k - X| < \epsilon F_k$ must use $\Omega(m)$ space
What about randomized algorithms?
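The trivial algorithm above can be sketched in a few lines of Python. This is only an illustration: the function name `exact_moment` is made up here, and the input is the example stream shown on the streaming-model slide. It keeps one counter per distinct item, which is exactly the $O(m \log n)$ space the slide describes.

```python
from collections import Counter

def exact_moment(stream, k):
    """Trivial algorithm: keep one counter f_i per item, then sum f_i^k."""
    counts = Counter()
    for item in stream:          # a single pass over the stream
        counts[item] += 1        # store/update f_i
    if k == 0:
        return len(counts)       # F_0 = number of distinct items
    return sum(f ** k for f in counts.values())

stream = [4, 3, 7, 3, 1, 1, 7]   # example stream from the slides
print(exact_moment(stream, 0))   # 4
print(exact_moment(stream, 1))   # 7
print(exact_moment(stream, 2))   # 13
```

Here the frequencies are $f_4 = 1$ and $f_1 = f_3 = f_7 = 2$, so $F_2 = 1 + 4 + 4 + 4 = 13$.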
Randomized Approx Algs for $F_k$
• A randomized algorithm $\epsilon$-approximates $F_k$ if it outputs $X$ s.t. $\Pr[|F_k - X| < \epsilon F_k] > 2/3$
• Previous work (table suppresses polylog($mn$) factors): [table not preserved]
Matching Upper Bound
Our contribution: for every $k > 2$ there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\epsilon$-approximate $F_k$
• Additional features:
  1. Works even if we allow deletions, that is, a stream of elements $(i, +)$, $(i, -)$
  2. Constant update time
Techniques
• Previous algorithms [AMS96, CK04, G04]:
  1. Cleverly construct a small-space estimator $X$ s.t. $E[X] = F_k$ and $\mathrm{Var}[X]$ is small
  2. Apply Chebyshev's inequality
• Our "algorithm":
  1. Divide frequencies into "buckets": $0, [1, 2), [2, 4), [4, 8), \dots, [2^{i-1}, 2^i), \dots$
  2. Estimate the size $s_i$ of each bucket
  3. Output $X = \sum_i s_i 2^{ik}$
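To make the bucketing step concrete, here is a minimal Python sketch that computes the bucket sizes $s_i$ exactly (from a full frequency table, so it is not a small-space method) and then evaluates $X = \sum_i s_i 2^{ik}$. The helper names are illustrative assumptions. With power-of-two buckets as on the slide, every frequency $f$ in bucket $i$ satisfies $f^k < 2^{ik} \le 2^k f^k$, so this $X$ lies within a factor $2^k$ of $F_k$. That remark is an observation about this sketch, not a statement about the authors' construction.

```python
from collections import Counter

def bucket_index(f):
    """Index i >= 1 such that 2**(i-1) <= f < 2**i (f is a positive int)."""
    return f.bit_length()

def bucket_sizes(stream):
    """s_i = number of items whose frequency falls in [2**(i-1), 2**i)."""
    freqs = Counter(stream)
    return Counter(bucket_index(f) for f in freqs.values())

def bucketed_moment(sizes, k):
    """X = sum_i s_i * 2**(i*k), the estimator from the slide."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

stream = [4, 3, 7, 3, 1, 1, 7]           # example stream from the slides
sizes = bucket_sizes(stream)             # s_1 = 1, s_2 = 3
x = bucketed_moment(sizes, 2)            # 1*4 + 3*16 = 52
f2 = sum(f ** 2 for f in Counter(stream).values())  # exact F_2 = 13
print(dict(sizes), x, f2)
```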
What's Left?
• Remaining problem: estimate $s_i$ = # of elements with frequency in each bucket $[2^{i-1}, 2^i)$
• Is this always easy? No.
  • Suppose it were always easy: then we could approximate the maximum frequency
  • This is HARD: $\Omega(m)$ space [AMS96]
• However, $\Omega(m)$ only applies to "worst-case" streams; otherwise we can do better: CountSketch [CCF-C02]
For the moment, let's assume:
1. $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space (we remove this using CountSketch)
2. We have a very long RAM of random bits (we remove this using Nisan's generator)
General Idea: Max + Sampling
• Restrict the input stream to a random subset of items in $\{1, \dots, m\}$, where items are included independently with probability $p$.
• Example: stream … 4 3 7 3 1 1 7 …, random subset = {1, 3}
General Idea: Max + Sampling
• Restrict the input to a random subset of items in $\{1, \dots, m\}$, where items are included independently with probability $p$.
• What are the chances the maximum lies in $S_i$ = elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?
  $q = (1-p)^{\sum_{j > i} s_j} \cdot (1 - (1-p)^{s_i})$
• Idea:
  1. Estimate $q$ as $q'$ by taking independent trials and computing the fraction of trials in which the max lies in $S_i$
  2. If we have already estimated $s_j$ for $j > i$, solve this expression for $s_i$.
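As a sanity check on the expression for $q$, the following Python sketch compares it with a Monte Carlo estimate $q'$. Everything concrete here is an assumption made for illustration: the class sizes (50, 20, 5), the value $p = 0.05$, the number of trials, and the use of Python's `random` module in place of hash functions. Each trial samples items independently with probability $p$ and checks whether the largest sampled frequency falls in $S_2$.

```python
import random

def q_closed_form(p, s_i, s_above):
    """q = (1-p)**s_above * (1 - (1-p)**s_i)."""
    return (1 - p) ** s_above * (1 - (1 - p) ** s_i)

def estimate_q(freqs, i, p, trials, rng):
    """Fraction of trials in which the sampled maximum lies in S_i."""
    hits = 0
    for _ in range(trials):
        sampled = [f for f in freqs if rng.random() < p]
        # f.bit_length() == i exactly when 2**(i-1) <= f < 2**i
        if sampled and max(sampled).bit_length() == i:
            hits += 1
    return hits / trials

# Hypothetical instance: 50 items in S_1, 20 in S_2, 5 in S_3.
freqs = [1] * 50 + [2, 3] * 10 + [4, 5, 6, 7, 4]
rng = random.Random(0)
q_true = q_closed_form(p=0.05, s_i=20, s_above=5)
q_est = estimate_q(freqs, i=2, p=0.05, trials=20000, rng=rng)
print(round(q_true, 4), round(q_est, 4))
```

The two printed values should agree up to sampling noise, which shrinks as the number of trials grows.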
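When is this estimate any good?
Recall $q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$, so estimate $s_i$ as
  $s'_i = \ln\left(1 - q' / (1-p)^{\sum_{j > i} s'_j}\right) / \ln(1-p)$
Need:
1. $s'_j \approx s_j$ for $j > i$ (holds inductively) and $q' \approx q$ (tight concentration of $q'$)
2. Requires $\exists\, p$ so that $q > 1/R$, where $R$ = # of trials used to estimate $q$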
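Solving the expression for $q$ for $s_i$ is a short computation, sketched below in Python. The function name `class_size_from_q` and the numbers are illustrative assumptions. The second call shows why tight concentration of $q'$ matters: a small relative error in $q$ already moves the recovered class size.

```python
import math

def class_size_from_q(q, p, s_above):
    """Solve q = (1-p)**s_above * (1 - (1-p)**s_i) for s_i, with 0 < p < 1."""
    ratio = q / (1 - p) ** s_above          # equals 1 - (1-p)**s_i
    if not 0 <= ratio < 1:
        raise ValueError("q is inconsistent with p and s_above")
    return math.log(1 - ratio) / math.log(1 - p)

p, s_above, s_i = 0.05, 5, 20
q = (1 - p) ** s_above * (1 - (1 - p) ** s_i)
print(class_size_from_q(q, p, s_above))          # close to 20
print(class_size_from_q(1.02 * q, p, s_above))   # a 2% error in q gives a larger value
```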
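When is this estimate any good?
$q = (1-p)^{\sum_{j > i} s_j} (1 - (1-p)^{s_i})$
• $p$ too large? $\Rightarrow$ $q$ too small (the first factor vanishes)
• $p$ too small? $\Rightarrow$ $q$ too small (the second factor vanishes)
Motivates the following: say a class $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
If $R$ is on the order of $\log n$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$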
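The "contributes" condition can be checked directly once class sizes are known. The Python sketch below uses made-up class sizes and $R = 10$ purely for illustration, and compares the bucketed sum over contributing classes with the sum over all classes.

```python
def contributing_classes(sizes, R):
    """Indices i with s_i > (sum of s_j over j > i) / R."""
    result = []
    for i in sorted(sizes):
        s_above = sum(s for j, s in sizes.items() if j > i)
        if sizes[i] > s_above / R:
            result.append(i)
    return result

def bucketed_moment(sizes, k, classes=None):
    """sum_i s_i * 2**(i*k), optionally restricted to the given classes."""
    keep = sizes.keys() if classes is None else classes
    return sum(sizes[i] * 2 ** (i * k) for i in keep)

# Hypothetical class sizes s_1, ..., s_5.
sizes = {1: 1000, 2: 100, 3: 1, 4: 50, 5: 2}
good = contributing_classes(sizes, R=10)
ratio = bucketed_moment(sizes, 3, good) / bucketed_moment(sizes, 3)
print(good, round(ratio, 4))
```

In this instance class 3 does not contribute (it has 1 element against 52 elements in higher classes), and dropping it changes the bucketed sum for $k = 3$ by well under 1%.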
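The Idealized Algorithm
• Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
• Restrict the stream Str to $\mathrm{Str}_j^r$, those items $i$ with $h_j^r(i) = 1$
• For each $\mathrm{Str}_j^r$, compute $\mathrm{Max}(\mathrm{Str}_j^r)$
• To estimate $s_i$ given $s'_t$ for $t > i$, find some $j$ for which "enough" of the $\mathrm{Max}(\mathrm{Str}_j^r)$ come from $S_i$, and then set $s'_i$ by solving the expression for $q$ with $p = 2^{-j}$ and the observed fraction $q'$
• Output $F'_k = \sum_i s'_i 2^{ik}$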
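The first three steps of the idealized algorithm (hashing, restricting, and calling Max) can be sketched in Python as follows. This is a sketch under loud assumptions: Python's `random` module stands in for the random string, the Max oracle is replaced by an exact count that uses far more space than $O(B)$, the levels run over $j = 0, \dots, \lceil \log_2 m \rceil$, and all function names are invented here. The estimation of $s'_i$ from these maxima is not included.

```python
import math
import random
from collections import Counter

def max_frequency(stream):
    """Stand-in for the Max oracle: exact maximum frequency (0 if empty)."""
    counts = Counter(stream)
    return max(counts.values()) if counts else 0

def restricted_maxima(stream, m, R, seed=0):
    """Return {(j, r): Max(Str_j^r)} for j in 0..ceil(log2 m), r in 1..R."""
    rng = random.Random(seed)
    levels = math.ceil(math.log2(m))
    maxima = {}
    for j in range(levels + 1):
        for r in range(1, R + 1):
            # h_j^r maps each item to a value in {1, ..., 2**j}
            h = {i: rng.randint(1, 2 ** j) for i in range(1, m + 1)}
            # Keep the items with h_j^r(i) = 1, i.e. with probability 2**-j
            sub = [a for a in stream if h[a] == 1]
            maxima[(j, r)] = max_frequency(sub)
    return maxima

stream = [4, 3, 7, 3, 1, 1, 7]           # example stream from the slides
maxima = restricted_maxima(stream, m=8, R=3)
print(maxima[(0, 1)])                    # 2: level j = 0 keeps every item
```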
Removing the assumptions
1. Assumption: $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space
[CCF-C02]: $\exists$ a 1-pass $O(B)$-space algorithm CountSketch which, given a stream Str, outputs all $x$ for which $f_x^2 \geq F_2 / B$
Recall: $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
Lemma: If $S_i$ (the class of frequencies in $[2^{i-1}, 2^i)$) contributes, then [conclusion not preserved]
Proof: Hölder's inequality.
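For readers unfamiliar with CountSketch, here is a minimal Python sketch of the data structure. It is an illustration under assumptions, not the construction used in the talk: it only answers point queries (it does not enumerate all heavy items), the simple linear hashes and the parameters `depth` and `width` are chosen for brevity, and the class and method names are invented. It does show the two properties the slides rely on: frequent items are estimated accurately, and updates can be negative, which is how deletions are handled.

```python
import random
from statistics import median

class CountSketch:
    """Minimal CountSketch: `depth` rows of `width` signed counters."""
    P = (1 << 61) - 1  # a Mersenne prime used as the hash modulus

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # One set of (bucket hash, sign hash) coefficients per row.
        self.params = [tuple(rng.randrange(1, self.P) for _ in range(4))
                       for _ in range(depth)]

    def _bucket_and_sign(self, row, x):
        a, b, c, d = self.params[row]
        bucket = ((a * x + b) % self.P) % self.width
        sign = 1 if ((c * x + d) % self.P) & 1 else -1
        return bucket, sign

    def update(self, x, delta=1):
        """Process (x, +) with delta=1 or (x, -) with delta=-1."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            self.table[row][bucket] += sign * delta

    def estimate(self, x):
        """Median over rows of the signed counter that x hashes to."""
        ests = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            ests.append(sign * self.table[row][bucket])
        return median(ests)

sketch = CountSketch(depth=5, width=64)
for _ in range(1000):
    sketch.update(1)             # one heavy item with frequency 1000
for x in range(2, 202):
    sketch.update(x)             # 200 light items with frequency 1
print(sketch.estimate(1))        # close to 1000
```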
Removing the assumptions
2. Assumption: we have an infinite string of random bits
[Indyk00]: Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, \dots, R_n$, that, when processing a stream, maintains a variable $C$ and updates it as follows: $C = C + f(i, R_i)$
Then $R_1, \dots, R_n$ can be generated using Nisan's PRG, and:
• The new algorithm $A'$ has space $\tilde{O}(S)$
• The outputs of $A'$ and $A$ are indistinguishable
Our algorithm follows this framework
Conclusions
• Result: tight $\tilde{O}(m^{1-2/k})$ upper bound
• Handles deletions $(j, -)$
• $\tilde{O}(1)$ update time
• Open problem: reduce the $\tilde{O}$ factors