Optimal Approximations of the Frequency Moments of Data Streams


Presentation Transcript


  1. Optimal Approximations of the Frequency Moments of Data Streams. Piotr Indyk, David Woodruff

  2. The Streaming Model
     • Example stream: … 4 3 7 3 1 1 7 …
     • Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$
     • Want to compute statistics on the stream
     • Elements arranged in adversarial order
     • Algorithms are given one pass over the stream
     • Goal: minimum-space algorithm

  3. Frequency Moments [AMS96]
     • $n$ = stream size, $m$ = universe size
     • $f_i$ = number of occurrences of item $i$
     • $k$-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
     • $F_0$ = number of distinct elements
     • $F_1 = n$ = stream size
     • $F_2$ = self-join size
     Why are frequency moments important?
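
As a small worked example of the definition, the block below evaluates the first three moments, treating the seven elements 4 3 7 3 1 1 7 shown on the streaming-model slide as the entire stream (an assumption made only for illustration).

```latex
% Stream: 4 3 7 3 1 1 7  =>  f_1 = 2, f_3 = 2, f_4 = 1, f_7 = 2
\begin{align*}
F_0 &= |\{ i : f_i > 0 \}| = 4, \\
F_1 &= 2 + 2 + 1 + 2 = 7 = n, \\
F_2 &= 2^2 + 2^2 + 1^2 + 2^2 = 13.
\end{align*}
```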

  4. Applications
     • Estimating distinct elements with low space
     • Estimate query selectivity to a huge DB without sorting
     • Routers gather the number of distinct destinations
     • $F_2$ estimates the size of self-joins, e.g. $f_B^2 + f_A^2 = 4 + 1 = 5$
     • $F_k$ measures data skewness

  5. The Best Deterministic Algorithm
     • Trivial algorithm for $F_k$
       • Store and update $f_i$ for each item $i$; sum $f_i^k$ at the end
       • Space = $O(m \log n)$: $m$ items $i$, $\log n$ bits to count each $f_i$
     • Negative results [AMS96]:
       • Computing $F_k$ exactly requires $\Omega(m)$ space
       • Any deterministic algorithm that outputs $X$ with $|F_k - X| < \varepsilon F_k$ must use $\Omega(m)$ space
     What about randomized algorithms?
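
The trivial algorithm can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `exact_moment` is hypothetical, and the `Counter` holds one entry per distinct item, which is where the $O(m \log n)$ space goes.

```python
from collections import Counter

def exact_moment(stream, k):
    """Trivial algorithm: keep one counter f_i per item, then sum f_i^k."""
    counts = Counter()
    for item in stream:          # single pass over the stream
        counts[item] += 1
    if k == 0:
        # F_0 counts distinct items (avoids evaluating 0 ** 0)
        return sum(1 for f in counts.values() if f > 0)
    return sum(f ** k for f in counts.values())

stream = [4, 3, 7, 3, 1, 1, 7]
print([exact_moment(stream, k) for k in range(4)])  # [4, 7, 13, 25]
```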

  6. Randomized Approx Algs for $F_k$
     • A randomized algorithm $\varepsilon$-approximates $F_k$ if it outputs $X$ such that $\Pr[|F_k - X| < \varepsilon F_k] > 2/3$
     • Previous work (table suppresses $\mathrm{polylog}(mn)$ factors)

  7. Matching Upper Bound
     Our contribution: for every $k$ there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\varepsilon$-approximate $F_k$
     • Additional features:
       1. Works even if we allow deletions, that is, a stream of elements $(i, +)$, $(i, -)$
       2. Constant update time

  8. Techniques
     • Previous algorithms [AMS96, CK04, G04]
       1. Cleverly construct a small-space estimator $X$ such that $E[X] = F_k$ and $\mathrm{Var}[X]$ is small
       2. Apply Chebyshev's inequality
     • Our "algorithm"
       1. Divide frequencies into "buckets": $0, [1, 2), [2, 4), [4, 8), \ldots, [2^{i-1}, 2^i), \ldots$
       2. Estimate the size $s_i$ of each bucket
       3. Output $X = \sum_i s_i 2^{ik}$
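
To make the bucketing step concrete, here is a minimal Python sketch that uses exact bucket sizes $s_i$ (the streaming algorithm only estimates them). The helper names `bucket_sizes` and `bucket_estimate` are assumptions for illustration. Because every frequency $f$ in bucket $i$ satisfies $f < 2^i \le 2f$, the output taken literally with power-of-two buckets satisfies $F_k < X \le 2^k F_k$ on a non-empty stream.

```python
from collections import Counter

def bucket_sizes(stream):
    """Return {i: s_i}, where s_i counts items with frequency in [2^(i-1), 2^i)."""
    sizes = Counter()
    for f in Counter(stream).values():
        # For f >= 1, f lies in [2^(i-1), 2^i) exactly when f.bit_length() == i.
        sizes[f.bit_length()] += 1
    return dict(sizes)

def bucket_estimate(sizes, k):
    """X = sum_i s_i * 2^(i*k), the estimator from the slide."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

stream = [4, 3, 7, 3, 1, 1, 7]
sizes = bucket_sizes(stream)          # three items in [2, 4), one item in [1, 2)
print(sizes, bucket_estimate(sizes, 2))
```

On this stream $F_2 = 13$ and the sketch returns $X = 52 = 2^2 \cdot 13$, the worst case of the bound above, since every frequency sits at the bottom of its bucket.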

  9. What's Left?
     • Remaining problem: estimate $s_i$ = number of elements with frequency in each bucket $[2^{i-1}, 2^i)$
     • Is this always easy? No.
       • If it were always easy, we could approximate the maximum frequency
       • This is HARD: $\Omega(m)$ space [AMS96]
     • However, the $\Omega(m)$ bound only applies to "worst-case" streams; otherwise we can do better: CountSketch [CCF-C02]

  10. For the moment, let's assume:
     1. There exists a 1-pass oracle Max that returns the maximum frequency using $O(B)$ space (we remove this using CountSketch)
     2. We have a very long RAM of random bits (we remove this using Nisan's generator)
     [Figure: max-frequency items]

  11. General Idea: Max + Sampling
     • Restrict the input stream to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$
     • Example: stream … 4 3 7 3 1 1 7 …, random subset = {1, 3}

  12. General Idea: Max + Sampling
     • Restrict the input to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$
     • What are the chances that the maximum lies in $S_i$ = elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?
       $q = (1-p)^{\sum_{j>i} s_j} \cdot (1 - (1-p)^{s_i})$
     Idea:
       1. Estimate $q$ as $q'$ by taking independent trials and computing the fraction of trials in which the max lies in $S_i$
       2. If $s_j$ has already been estimated for $j > i$, solve this expression for $s_i$
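
The expression for $q$ says that the maximum of the sampled stream lies in $S_i$ exactly when no item of a higher class is sampled and at least one item of $S_i$ is. The Python sketch below checks the closed form against a brute-force sum over all subsets of a tiny instance. The function names and the dictionary `sizes` (class index to $s_j$) are hypothetical, chosen only for this illustration.

```python
from itertools import product

def q_closed_form(p, sizes, i):
    """q = (1-p)^(sum_{j>i} s_j) * (1 - (1-p)^(s_i))."""
    higher = sum(s for j, s in sizes.items() if j > i)
    return (1 - p) ** higher * (1 - (1 - p) ** sizes.get(i, 0))

def q_brute_force(p, sizes, i):
    """Add up the probability of every subset whose highest sampled class is i."""
    classes = [j for j, s in sizes.items() for _ in range(s)]  # one entry per item
    total = 0.0
    for keep in product([False, True], repeat=len(classes)):
        kept = [c for c, flag in zip(classes, keep) if flag]
        if kept and max(kept) == i:
            total += p ** len(kept) * (1 - p) ** (len(classes) - len(kept))
    return total

sizes = {1: 2, 2: 3, 3: 1}   # s_1 = 2, s_2 = 3, s_3 = 1
print(q_closed_form(0.3, sizes, 2), q_brute_force(0.3, sizes, 2))
```

The brute-force version is exponential in the number of items and is only meant as a sanity check on small inputs.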

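  13. When is this estimate any good?
     Recall $q = (1-p)^{\sum_{j>i} s_j} \cdot (1 - (1-p)^{s_i})$, so we estimate $s_i$ by solving this expression with $q'$ in place of $q$.
     Need:
       1. Accurate estimates of $s_j$ for $j > i$ (holds inductively) and tight concentration of $q'$
       2. Requires that there exists $p$ such that $q > 1/R$, where $R$ = number of trials used to estimate $q$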
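
Solving the expression for $s_i$ is elementary algebra. The derivation below is a reconstruction from the formula for $q$, not a formula taken from the slides; the shorthand $T$ is introduced here for the number of items in higher classes.

```latex
% Let T = \sum_{j > i} s_j and assume 0 < p < 1.
\begin{align*}
q &= (1-p)^{T} \left( 1 - (1-p)^{s_i} \right) \\
(1-p)^{s_i} &= 1 - \frac{q}{(1-p)^{T}} \\
s_i &= \frac{\ln\!\left( 1 - q\,(1-p)^{-T} \right)}{\ln(1-p)}
\end{align*}
```

An estimate $s'_i$ would then be obtained by substituting $q'$ for $q$ and the earlier estimates $s'_j$ for the $s_j$ inside $T$.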

  14. When is this estimate any good?
     $q = (1-p)^{\sum_{j>i} s_j} \cdot (1 - (1-p)^{s_i})$
     • $p$ too large? $\Rightarrow$ $q$ too small
     • $p$ too small? $\Rightarrow$ $q$ too small
     Motivates the following: say a class $S_i$ contributes if and only if $s_i > \sum_{j>i} s_j / R$
     If $R = \Omega(\log n)$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$
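
The definition of a contributing class is easy to state in code. The following Python sketch, with a hypothetical function name and made-up class sizes, scans classes from the highest index down while keeping a running total of the higher classes.

```python
def contributing_classes(sizes, R):
    """Return the class indices i with s_i > (sum_{j>i} s_j) / R, in increasing order."""
    result = []
    higher = 0                       # running value of sum_{j>i} s_j
    for i in sorted(sizes, reverse=True):
        if sizes[i] > higher / R:
            result.append(i)
        higher += sizes[i]
    return sorted(result)

# Illustrative sizes: class 2 is swamped by the 51 items in higher classes.
print(contributing_classes({1: 1000, 2: 3, 3: 50, 4: 1}, R=10))  # [1, 3, 4]
```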

  15. The Idealized Algorithm
     • Use the random string to generate hash functions $h_{jr} : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
     • Restrict the stream $Str$ to $Str_{jr}$, those items $i$ with $h_{jr}(i) = 1$
     • For each $Str_{jr}$, compute $\mathrm{Max}(Str_{jr})$
     • To estimate $s_i$ given $s'_t$ for $t > i$, find some $j$ for which "enough" of the $\mathrm{Max}(Str_{jr})$ come from $S_i$, and then set $s'_i$ by solving the expression for $q$
     • Output $F'_k = \sum_i s'_i 2^{ik}$
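
The steps above can be simulated offline under both idealized assumptions. In this Python sketch, which is an illustration and not the authors' implementation, exact counts stand in for the Max oracle, a fresh random draw per item and trial stands in for the event $h_{jr}(x) = 1$ (probability $2^{-j}$), and all function names are hypothetical. The level $j$ is picked by hand rather than by the "enough trials" rule.

```python
import math
import random
from collections import Counter

def freq_class(f):
    """Index i of the class S_i with f in [2^(i-1), 2^i)."""
    return f.bit_length()

def max_class_fractions(freq, j, R, rng):
    """Fraction of R trials at sampling rate 2^-j whose Max falls in each class."""
    hits = Counter()
    for _ in range(R):
        # Keep each item with probability 2^-j (stand-in for h_jr(x) = 1).
        kept = [f for f in freq.values() if rng.random() < 2.0 ** -j]
        if kept:
            hits[freq_class(max(kept))] += 1   # exact max plays the Max oracle
    return {i: c / R for i, c in hits.items()}

def solve_for_size(q_est, p, higher):
    """Invert q = (1-p)^higher * (1 - (1-p)^s) for s; requires 0 < p < 1."""
    inner = 1.0 - q_est / (1.0 - p) ** higher
    if inner <= 0.0:
        raise ValueError("q_est is inconsistent with this p; use a smaller p")
    return math.log(inner) / math.log(1.0 - p)

rng = random.Random(0)
freq = Counter({x: 1 for x in range(200)})     # 200 items in S_1
freq.update({1000: 5, 1001: 6})                # 2 items in S_3
q3 = max_class_fractions(freq, j=4, R=2000, rng=rng).get(3, 0.0)
s3 = solve_for_size(q3, 2.0 ** -4, higher=0)
q1 = max_class_fractions(freq, j=8, R=2000, rng=rng).get(1, 0.0)
s1 = solve_for_size(q1, 2.0 ** -8, higher=round(s3))
print(s3, s1)   # estimates of s_3 = 2 and s_1 = 200, subject to sampling noise
```

Different classes need different sampling rates here: at $j = 4$ almost every trial contains an item of $S_1$, so $q$ for that class saturates and cannot be inverted reliably, which is why the sketch switches to $j = 8$.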

  16. Removing the assumptions
     1. Assumption: there exists a 1-pass oracle Max returning the maximum frequency using $O(B)$ space
     [CCF-C02]: there exists a 1-pass $O(B)$-space algorithm CountSketch which, given a stream $Str$, outputs all $x$ for which $f_x^2 \ge F_2 / B$
     Recall: $S_i$ contributes if and only if $s_i > \sum_{j>i} s_j / R$
     Lemma: if $S_i$ (frequencies in $[2^{i-1}, 2^i)$) contributes, then [formula missing]
     Proof: Hölder's inequality.
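
For readers unfamiliar with CountSketch, the following is a minimal Python sketch of its core data structure: a few rows of signed counters, with the per-row median used as a point estimate of $f_x$. The class layout, the parameter names `depth` and `width`, and the simple linear hash functions are assumptions made for illustration. Recovering the full list of heavy items $x$ with $f_x^2 \ge F_2 / B$ needs extra candidate tracking that is not shown.

```python
import random
from statistics import median

class CountSketch:
    """Minimal CountSketch: `depth` rows of `width` signed counters."""
    PRIME = (1 << 61) - 1   # modulus for the illustrative linear hash functions

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # Per row: (a, b) choose the bucket, (c, d) choose the sign.
        self.params = [tuple(rng.randrange(1, self.PRIME) for _ in range(4))
                       for _ in range(depth)]

    def _bucket_and_sign(self, row, x):
        a, b, c, d = self.params[row]
        bucket = ((a * x + b) % self.PRIME) % self.width
        sign = 1 if ((c * x + d) % self.PRIME) & 1 else -1
        return bucket, sign

    def update(self, x, delta=1):
        """Process (x, +) with delta=1, or a deletion (x, -) with delta=-1."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            self.table[row][bucket] += sign * delta

    def estimate(self, x):
        """Median over rows of sign * counter, an estimate of f_x."""
        values = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            values.append(sign * self.table[row][bucket])
        return median(values)

sketch = CountSketch(depth=5, width=64)
for _ in range(5):
    sketch.update(7)        # five insertions of item 7
sketch.update(7, -1)        # one deletion
print(sketch.estimate(7))   # 4, exact here because no other item collides
```

Because updates are linear, insertions and deletions are handled by the same code path, which matches the $(i, +)$, $(i, -)$ stream model from the earlier slide.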

  17. Removing the assumptions
     2. Assumption: we have an infinite string of random bits
     [Indyk00]: Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable $C$ and updates it as follows: $C = C + f(i, R_i)$
     Then $R_1, \ldots, R_n$ can be generated using Nisan's PRG, and:
     • The new algorithm $A'$ has space $\tilde{O}(S)$
     • The outputs of $A'$ and $A$ are indistinguishable
     Our algorithm follows this framework

  18. Conclusions
     • Result: tight $\tilde{O}(m^{1-2/k})$ upper bound
       • Handles deletions $(j, -)$
       • $\tilde{O}(1)$ update time
     • Open problem: reduce the $\tilde{O}$ factors
