
Sublinear Algorithms for Distinct Count and Impossibility Results in Streaming

Learn about the Median trick, Chernoff bound, Distinct Elements counting, and why deterministic exact algorithms won't work for streaming data. Explore the usage of randomization and approximations in sublinear algorithms.

jessd


Presentation Transcript


  1. Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results • COMS E6998-9 F15

  2. Administrivia, Plan • Website moved: sublinear.wikischolars.columbia.edu/main • Piazza: sign-up! • Plan: • Median trick, Chernoff bound (from Tue) • Distinct Elements Count • Impossibility Results

  3. Last Lecture • Counting frequency • Morris Algorithm: • Initialize $X = 0$ • On increment, $X = X + 1$ with prob. $2^{-X}$ • Estimator: $2^X - 1$
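
A minimal sketch of the Morris counter recalled on this slide, assuming the standard rule (keep an exponent, increment it with probability 2^-X, report 2^X - 1). The function name `morris_count` and the parameters chosen for the demo are illustrative, not from the lecture.

```python
import random

def morris_count(n_increments, rng):
    """Run one Morris counter over n_increments events and return its estimate."""
    x = 0  # stored exponent; needs only about log log n bits
    for _ in range(n_increments):
        # Increment the exponent with probability 2^-x.
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1

# A single run is noisy, so average many independent runs to see the
# estimator concentrate around the true count (200 here).
rng = random.Random(0)
estimates = [morris_count(200, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate)
```

Averaging reduces the variance but does not by itself give a high-probability guarantee; that is what the median trick on the next slides is for.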

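  4. “Median trick” • Chernoff/Hoeffding bounds: $X_1, \dots, X_m$ are independent r.v. in $\{0,1\}$, $\mu = E[\sum_i X_i]$, $\epsilon \in [0,1]$: $\Pr[|\sum_i X_i - \mu| > \epsilon\mu] \le 2e^{-\epsilon^2\mu/3}$ • Algorithm $A$: outputs a value in the correct range with 90% probability • Algorithm $A^*$: outputs a value in the correct range with probability $1 - \delta$ • Median trick: • Repeat $A$ for $m = O(\log 1/\delta)$ times • Take median of the answers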

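  5. Using Chernoff for Median trick • Chernoff: $\Pr[|\sum_i X_i - \mu| > \epsilon\mu] \le 2e^{-\epsilon^2\mu/3}$ • Define $X_i = 1$ iff the $i$-th copy of $A$ is correct • $E[X_i] = 0.9$ ($A$ is correct with 90% prob.), so $\mu = 0.9m$ • New alg $A^*$ is correct when $\sum_i X_i > 0.5m$ • Use Chernoff to bound: $\Pr[\sum_i X_i \le 0.5m] \le \Pr[|\sum_i X_i - \mu| > \mu/3] \le 2e^{-\mu/27} = 2e^{-m/30} \le \delta$ for $m = O(\log 1/\delta)$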
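
The median trick can be sketched as follows. The base algorithm here is a made-up stand-in (`noisy_estimator` is hypothetical): it lands within 10% of the truth with probability 0.9 and is wildly off otherwise. The point is only that the median of m independent copies is correct whenever more than half of the copies are.

```python
import random
import statistics

def noisy_estimator(truth, rng):
    """Hypothetical algorithm A: in the correct range w.p. 0.9, far off otherwise."""
    if rng.random() < 0.9:
        return truth * rng.uniform(0.9, 1.1)
    return truth * rng.choice([0.01, 100.0])

def median_trick(truth, m, rng):
    """Algorithm A*: run m independent copies of A and return the median answer."""
    return statistics.median(noisy_estimator(truth, rng) for _ in range(m))

rng = random.Random(0)
truth = 1000.0
trials = 500
single_failures = sum(
    not (900.0 <= noisy_estimator(truth, rng) <= 1100.0) for _ in range(trials)
)
median_failures = sum(
    not (900.0 <= median_trick(truth, 35, rng) <= 1100.0) for _ in range(trials)
)
print(single_failures, median_failures)
```

A single copy fails roughly one time in ten, while the median of 35 copies fails only if at least 18 of them fail at once, which Chernoff shows is exponentially unlikely in m.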

  6. Problem: Distinct Elements • Streaming elements from $[n]$ • Approximate the number of elements with non-zero frequency • Length of stream = $m$ • Space required? • $O(n)$ bits (one bit per possible element) • $O(m \log n)$ bits (store the whole stream)

  7. Algorithm for approximating DE • Main tool: hash function $h: [n] \to [0,1]$ (where from? will return later…) • $h(i)$ random in $[0,1]$ • Algorithm [Flajolet-Martin 1985] • Init $z = 1$ • When see element $i$: $z = \min(z, h(i))$ • Estimator: $1/z - 1$

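  8. Analysis • Algorithm DE: • Init: $z = 1$ • when see element $i$: $z = \min(z, h(i))$ • Estimator: $1/z - 1$ • Let $d$ = count of distinct elements • Claim 1: $E[z] = \frac{1}{d+1}$ • Proof: • $z$ = minimum of $d$ random numbers in $[0,1]$ • Pick another random number $a \in [0,1]$ • What’s the probability that $a < z$? • 1) exactly $z$ for a fixed $z$, so $E[z]$ overall • 2) probability that $a$ is the smallest among $d+1$ reals: $\frac{1}{d+1}$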

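  9. Analysis 2 • Algorithm DE: • Init: $z = 1$ • when see element $i$: $z = \min(z, h(i))$ • Estimator: $1/z - 1$ • Need variance too… • Can prove $Var[z] \le 2/d^2$ • How do we get a $(1+\epsilon)$ approximation though? • We can take $Z = (z_1 + \dots + z_k)/k$ for independent $z_1, \dots, z_k$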
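
A rough sketch of the min-hash estimator with averaging over k independent copies. It assumes an idealized, fully random hash and simulates it by memoizing one random number per (copy, element) pair in a dictionary. That table takes linear space, so this illustrates the estimator only, not a small-space implementation; the name `fm_estimate` is illustrative.

```python
import random

def fm_estimate(stream, k, seed=0):
    """Average k independent minima z_1..z_k and return 1/Z - 1."""
    rng = random.Random(seed)
    hashes = [dict() for _ in range(k)]  # simulated random hash functions (assumption)
    z = [1.0] * k                        # Init: z = 1 for every copy
    for item in stream:
        for j in range(k):
            table = hashes[j]
            if item not in table:
                table[item] = rng.random()  # h_j(item), uniform in [0, 1)
            z[j] = min(z[j], table[item])
    z_bar = sum(z) / k
    return 1.0 / z_bar - 1.0

# 1000 distinct elements, each appearing twice.
stream = list(range(1000)) * 2
estimate = fm_estimate(stream, k=400)
print(estimate)
```

Repeated elements hash to the same value and never change the minimum, which is why the estimate depends only on the number of distinct elements.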

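  10. Alternative: Bottom-k • Algorithm DE: • Init: $z = 1$ • when see element $i$: $z = \min(z, h(i))$ • Estimator: $1/z - 1$ • Bottom-k alg. [BJKS’02]: • Init $(z_1, \dots, z_k) = (1, \dots, 1)$ • Keep the $k$ smallest hashes seen, $z_1 \le z_2 \le \dots \le z_k$ • Estimator: $\hat{d} = k/z_k$ • Proof: will prove a $(1+\epsilon)$ approximation with 90% probability for $k = O(1/\epsilon^2)$ • Probability that $\hat{d} > (1+\epsilon)d$ is at most 0.05 • Probability that $\hat{d} < (1-\epsilon)d$ is at most 0.05 • Overall only 0.1 probability outside the correct range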

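  11. Analysis for Bottom-k • Algorithm Bottom-k: • Init: $(z_1, \dots, z_k) = (1, \dots, 1)$ • Keep the $k$ smallest hashes seen, using $h$ • Estimator: $\hat{d} = k/z_k$ • Compute: $\Pr[\hat{d} > (1+\epsilon)d]$ • Suppose we see $\{1 \dots d\}$ • Define $X_i = 1$ iff $h(i) < \frac{k}{(1+\epsilon)d}$ • Then: $\hat{d} > (1+\epsilon)d$ iff $\sum_i X_i \ge k$ • We have: $E[\sum_i X_i] = \frac{k}{1+\epsilon}$ and $Var[\sum_i X_i] \le \frac{k}{1+\epsilon} \le k$ • By Chebyshev: $\Pr[|\sum_i X_i - \frac{k}{1+\epsilon}| > \sqrt{20k}] \le 0.05$ • So $\sum_i X_i < k$ with probability at least 0.95 requires $\frac{k}{1+\epsilon} + \sqrt{20k} \le k$ • Implied by $\sqrt{20k} \le \epsilon k/2$, which holds for $k \ge 80/\epsilon^2$ (using $\epsilon \le 1$)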
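
The Bottom-k estimator could be sketched as below. The hash `h` is an assumption made for the demo: it seeds Python's generator with the element to get a repeatable pseudo-random value in [0, 1), standing in for the idealized random hash. The fallback for fewer than k distinct elements is also a choice made here, not something stated on the slide.

```python
import heapq
import random

def h(item, seed=0):
    """Stand-in hash to [0, 1): repeatable pseudo-random value per item (assumption)."""
    return random.Random(f"{seed}:{item}").random()

def bottom_k_estimate(stream, k, seed=0):
    heap = []     # max-heap via negation: holds the k smallest hashes seen so far
    kept = set()  # values currently in the heap, to skip repeated elements
    for item in stream:
        v = h(item, seed)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            # v beats the current k-th smallest hash: swap it in.
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < k:
        return len(heap)       # fewer than k distinct elements: count is exact
    return k / (-heap[0])      # estimator k / z_k

# 5000 distinct elements, each appearing twice.
stream = [i % 5000 for i in range(10000)]
estimate = bottom_k_estimate(stream, k=400)
print(estimate)
```

Compared with averaging k separate minima, this evaluates one hash per stream element instead of k, while still storing k values.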

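  12. Hash functions in Streaming • We used $h: [n] \to [0,1]$ • Issue 1: reals? • Issue 2: how do we store it? • Issue 1: • Ok with: $h: [n] \to \{0, \frac{1}{M}, \frac{2}{M}, \dots, 1\}$ for $M \ge n^3$ • Probability that $d$ random numbers collide: • at most $d^2/M \le 1/n$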

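  13. Issue 2: bounded randomness • Pairwise independent hash functions • Definition: $h: [n] \to \{0, \dots, M-1\}$ s.t. • $\Pr[h(i) = a \text{ and } h(j) = b] = 1/M^2$ for all $i \ne j$ and all $a, b$ • (i.e., like random on pairs) • Such a hash function is enough: • Variance cares only about pairs! • We defined $X_i = 1$ iff $h(i) < \frac{k}{(1+\epsilon)d}$ • And computed $Var[\sum_i X_i] = \sum_i Var[X_i]$, which is the same for fully random and pairwise independent $h$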

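  14. Pairwise-Independent: example • Definition: $h: [n] \to \{0, \dots, M-1\}$ s.t. • $\Pr[h(i) = a \text{ and } h(j) = b] = 1/M^2$ for all $i \ne j$ and all $a, b$ • (A) construction: • Suppose $M \ge n$ is prime • Pick $p, q \in \{0, \dots, M-1\}$ at random and set $h(i) = p \cdot i + q \pmod{M}$ • Space: only $O(\log M)$ bits • Proof of correctness: • $h(i) = a$ and $h(j) = b$: system of 2 equations in 2 unknowns ($p, q$) • Exactly one pair $(p, q)$ satisfies it • Probability it is chosen: exactly $1/M^2$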
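
For a small prime the pairwise-independence claim can be checked exhaustively. The sketch below assumes the linear construction h(i) = p·i + q mod M with M = 7 (a value picked only for the demo) and counts, over all M² choices of (p, q), how often each output pair (h(i), h(j)) occurs.

```python
from collections import Counter
from itertools import product

M = 7  # small prime, chosen only for this demo

def h(p, q, i):
    """Linear hash h(i) = p*i + q mod M, described by the two numbers p and q."""
    return (p * i + q) % M

def pair_counts(i, j):
    """Count each output pair (h(i), h(j)) over all M*M choices of (p, q)."""
    return Counter(
        (h(p, q, i), h(p, q, j)) for p, q in product(range(M), repeat=2)
    )

counts = pair_counts(2, 5)
# Each of the M*M possible output pairs is hit by exactly one (p, q),
# so each has probability exactly 1/M^2 under a random choice of (p, q).
print(len(counts), set(counts.values()))
```

Storing the function means storing only p and q, which is where the logarithmic space bound comes from.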

  15. Impossibility Results • Relaxations: • Approximation • Randomization • Need both for $o(n)$ space

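  16. Deterministic Exact Won’t Work • Suppose algorithm $A$, estimator $R$ • $A$ uses space $s \ll n$ • We build the following stream: • Let vector $x \in \{0,1\}^n$ • $i$ in stream iff $x_i = 1$ • Run $A$ on it and let $\sigma$ be the memory content • Then append $i$ and look at the estimate: didn’t change means $x_i = 1$, increased means $x_i = 0$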

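  17. Deterministic Exact Won’t Work • Using $\sigma$, can recover entire $x$! • “$\sigma$ = encoding of a string $x$ of length $n$” • But $\sigma$ has only $s \ll n$ bits! • Can think of $A$ as a map $A: \{0,1\}^n \to \{0,1\}^s$ • Recovery: append $i$ after reaching $\sigma$; estimate didn’t change means $x_i = 1$, increased means $x_i = 0$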

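  18. Deterministic Exact Won’t Work • Using $\sigma$, can recover entire $x$ • “$\sigma$ = encoding of a string $x$ of length $n$” • But $\sigma$ has only $s \ll n$ bits! • Can think of $A$ as a map $A: \{0,1\}^n \to \{0,1\}^s$ • Must be injective • Otherwise, suppose $A(x) = A(x')$ for some $x \ne x'$ • The recovery implies $x = x'$, a contradiction • Hence $s \ge n$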
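
The recovery argument can be made concrete with a toy exact algorithm. Everything below is hypothetical: the "algorithm" simply stores the set of seen elements, which already uses about n bits, and is there only to show how the memory state alone determines x when each i is appended and the count is checked.

```python
def run_exact(stream):
    """Hypothetical exact deterministic algorithm: its memory is the set of seen items."""
    return frozenset(stream)

def append(state, i):
    """Feed one more stream element to the algorithm, starting from a memory state."""
    return state | {i}

def estimate(state):
    """Exact number of distinct elements."""
    return len(state)

def recover(state, n):
    """Decode x from the memory state alone, as in the injectivity argument."""
    base = estimate(state)
    # Appending i leaves the count unchanged iff i was already in the stream.
    return [1 if estimate(append(state, i)) == base else 0 for i in range(n)]

x = [1, 0, 1, 1, 0, 0, 1, 0]
stream = [i for i, bit in enumerate(x) if bit == 1]
sigma = run_exact(stream)
print(recover(sigma, len(x)) == x)
```

Since every x can be decoded from its own state, two different inputs can never share a state, so the state needs at least n bits.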

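  19. Deterministic Approx Won’t Work Either • Similar: use $A$ to compress $x$ from a code • Code: a large set $T \subset \{0,1\}^n$ s.t. any two distinct $x, y \in T$ differ in many coordinates • Use $A$ to encode an input $x \in T$ into $\sigma$ • For each $y \in T$ check whether $y = x$: • Append $y$ to the stream • If $y = x$, then the estimate stays about the same; otherwise it grows by more than the approximation error • By injectivity of $A$ on $T$: $2^s \ge |T|$, or $s \ge \log_2 |T|$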

  20. Concluding Remarks • Median trick + Chernoff • Distinct Elements • Can also store hashes approximately (store number of leading zeros) • $O(\log\log n)$ bits per hash value • Plus other bells and whistles • HyperLogLog • Impossibility results • Can also prove randomized, exact won’t work
