Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results. COMS E6998-9 F15. Administrivia, Plan. Website moved: sublinear.wikischolars.columbia.edu/main. Piazza: sign-up! Plan: Median trick, Chernoff bound (from Tue); Distinct Elements Count; Impossibility Results.
Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results
COMS E6998-9 F15
Administrivia, Plan
• Website moved: sublinear.wikischolars.columbia.edu/main
• Piazza: sign-up!
• Plan:
  • Median trick, Chernoff bound (from Tue)
  • Distinct Elements Count
  • Impossibility Results
Last Lecture
• Counting frequency
• Morris Algorithm:
  • Initialize $X = 0$
  • On increment, $X = X + 1$ with prob. $1/2^X$
  • Estimator: $2^X - 1$
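The Morris counter recalled above can be sketched in a few lines. This is a minimal illustration, assuming the standard form of the algorithm (increment $X$ with probability $2^{-X}$, estimate $2^X - 1$); the class and function names are hypothetical, and the averaging helper is only there to show that the estimator is unbiased on average.

```python
import random

class MorrisCounter:
    """Approximate counter: stores only X, roughly log2 of the true count."""

    def __init__(self, rng=None):
        self.x = 0
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Increase X with probability 2^{-X}; the first increment always succeeds.
        if self.rng.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        # E[2^X] = n + 1 after n increments, so 2^X - 1 is unbiased.
        return 2 ** self.x - 1


def average_estimate(n, trials, seed=0):
    """Average the estimator over independent counters (illustration only)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counter = MorrisCounter(rng)
        for _ in range(n):
            counter.increment()
        total += counter.estimate()
    return total / trials
```

A single counter has large variance (on the order of $n^2/2$), which is why the lecture turns to repetition and the median trick next.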
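“Median trick”
• Chernoff/Hoeffding bounds: $X_1, \dots, X_k$ are independent r.v. in $\{0,1\}$, $\mu = E[\sum_i X_i]$, $\varepsilon \in [0,1]$:
  $\Pr\left[\left|\sum_i X_i - \mu\right| \ge \varepsilon\mu\right] \le 2e^{-\varepsilon^2\mu/3}$
• Algorithm $A$: outputs a value in the correct range with 90% probability
• Algorithm $A^*$: outputs a value in the correct range with probability $1-\delta$
• Median trick:
  • Repeat $A$ for $k = O(\log 1/\delta)$ times
  • Take the median of the answers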
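Using Chernoff for Median trick
• Chernoff: $\Pr\left[\left|\sum_i X_i - \mu\right| \ge \varepsilon\mu\right] \le 2e^{-\varepsilon^2\mu/3}$
• Define $X_i = 1$ iff the $i$-th copy of $A$ is correct
• $E[X_i] = 0.9$ ($A$ is correct with 90% prob.), so $\mu = 0.9k$
• New alg $A^*$ is correct when $\sum_i X_i > 0.5k$ (more than half of the copies are correct, so the median is too)
• Use Chernoff to bound:
  $\Pr\left[\sum_i X_i \le 0.5k\right] \le \Pr\left[\left|\sum_i X_i - \mu\right| \ge 0.4k\right] \le 2e^{-\varepsilon^2\mu/3} \le \delta$
  with $\varepsilon = 0.4/0.9$, for $k = O(\log 1/\delta)$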
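The calculation above can be made concrete. The sketch below assumes a base algorithm that lands in the correct range with probability 0.9, and computes the number of copies from the Chernoff bound with $\varepsilon = 0.4/0.9$ and $\mu = 0.9k$; `copies_needed`, `median_trick`, and the synthetic `make_noisy_estimator` are hypothetical names introduced for illustration.

```python
import math
import random
import statistics

def copies_needed(delta, success=0.9):
    """Smallest k with 2*exp(-eps^2 * mu / 3) <= delta, where mu = success * k."""
    eps = (success - 0.5) / success      # deviation 0.4k written as eps * mu
    rate = eps ** 2 * success / 3        # exponent per copy
    return math.ceil(math.log(2 / delta) / rate)

def median_trick(base_alg, k):
    """Run the base algorithm k times independently and return the median."""
    return statistics.median([base_alg() for _ in range(k)])

def make_noisy_estimator(truth, rng):
    """Synthetic base algorithm: within 5% of truth w.p. 0.9, garbage otherwise."""
    def base_alg():
        if rng.random() < 0.9:
            return truth * rng.uniform(0.95, 1.05)
        return rng.choice([0.0, 100.0 * truth])
    return base_alg
```

For example, `copies_needed(0.01)` gives 90 copies. The constant from this bound is not tight; in practice far fewer repetitions usually suffice.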
Problem: Distinct Elements
• Streaming elements from $[n]$
• Approximate the number of elements with non-zero frequency
• Length of stream = $m$
• Space required?
  • $O(n)$ bits (one bit per possible element)
  • $O(m \log n)$ bits (store the whole stream)
Algorithm for approximating DE
• Main tool: hash function $h: [n] \to [0,1]$
  • $h(i)$ random in $[0,1]$
  • Where from? Will return later…
• Algorithm [Flajolet-Martin 1985]:
  • Init $z = 1$
  • When see element $i$: $z = \min(z, h(i))$
  • Estimator: $1/z - 1$
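Analysis
• Algorithm DE:
  • Init: $z = 1$
  • When see element $i$: $z = \min(z, h(i))$
  • Estimator: $1/z - 1$
• Let $d$ = count of distinct elements
• Claim 1: $E[z] = \frac{1}{d+1}$
• Proof:
  • $z$ = minimum of $d$ random numbers in $[0,1]$
  • Pick another random number $a \in [0,1]$
  • What's the probability that $a < z$?
  • 1) exactly $z$ (given $z$), so $E[z]$ overall
  • 2) probability that $a$ is the smallest among $d+1$ reals: $\frac{1}{d+1}$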
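Claim 1 is easy to check empirically. The sketch below assumes an idealized fully random hash, simulated by memoizing one uniform value per element; `RandomHash` is a hypothetical helper that stores a table, so it is only a stand-in for the small-space hash functions discussed later.

```python
import random

class RandomHash:
    """Idealized hash [n] -> [0,1): one memoized uniform value per element.
    Assumption for illustration; it stores a table, so it is not small-space."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._table = {}

    def __call__(self, i):
        if i not in self._table:
            self._table[i] = self._rng.random()
        return self._table[i]

def fm_min_hash(stream, h):
    """Flajolet-Martin state: the minimum hash value seen so far."""
    z = 1.0
    for i in stream:
        z = min(z, h(i))
    return z

def fm_estimate(stream, h):
    """Estimator 1/z - 1 for the number of distinct elements."""
    return 1.0 / fm_min_hash(stream, h) - 1.0

def average_min_hash(d, trials, seed=0):
    """Average z over fresh hash functions, to compare with 1/(d+1)."""
    total = 0.0
    for t in range(trials):
        h = RandomHash(seed * 1_000_003 + t)
        total += fm_min_hash(range(d), h)
    return total / trials
```

Repeated elements hash to the same value, so they never change $z$; this is what makes the state depend only on the set of distinct elements.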
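Analysis 2
• Algorithm DE:
  • Init: $z = 1$
  • When see element $i$: $z = \min(z, h(i))$
  • Estimator: $1/z - 1$
• Need variance too…
• Can prove $\mathrm{Var}[z] \le 2/d^2$
• How do we get a $(1+\varepsilon)$ approximation though?
• We can take $Z = (z_1 + \dots + z_k)/k$ for independent $z_1, \dots, z_k$, with $k = O(1/\varepsilon^2)$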
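Averaging independent copies might look like the following sketch. It assumes $k$ independent idealized hash functions (again simulated with a hypothetical memoizing `RandomHash`), averages the $k$ minima into $Z$, and returns $1/Z - 1$; the function name and seeding scheme are illustrative choices, not part of the lecture.

```python
import random

class RandomHash:
    """Idealized hash [n] -> [0,1): one memoized uniform value per element.
    Assumption for illustration; it stores a table, so it is not small-space."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._table = {}

    def __call__(self, i):
        if i not in self._table:
            self._table[i] = self._rng.random()
        return self._table[i]

def averaged_fm_estimate(stream, k, seed=0):
    """Run k independent min-hash copies and invert the average minimum."""
    hashes = [RandomHash(seed * 1_000_003 + j) for j in range(k)]
    mins = [1.0] * k
    for i in stream:
        for j, h in enumerate(hashes):
            mins[j] = min(mins[j], h(i))
    z_avg = sum(mins) / k          # concentrates around 1/(d+1)
    return 1.0 / z_avg - 1.0
```

Note that each stream element now costs $k$ hash evaluations, which is one motivation for the Bottom-k variant that uses a single hash function.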
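Alternative: Bottom-k
• Algorithm DE:
  • Init: $z = 1$
  • When see element $i$: $z = \min(z, h(i))$
  • Estimator: $1/z - 1$
• Bottom-k alg. [BJKS'02]:
  • Init $(z_1, \dots, z_k) = (1, \dots, 1)$, with $k = O(1/\varepsilon^2)$
  • Keep the $k$ smallest hashes seen, $z_1 \le \dots \le z_k$
  • Estimator: $\hat d = k/z_k$
• Proof: will prove
  • Probability that $\hat d > (1+\varepsilon)d$ is at most 0.05
  • Probability that $\hat d < (1-\varepsilon)d$ is at most 0.05
  • Overall, only 0.1 probability of being outside the correct range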
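Analysis for Bottom-k
• Algorithm Bottom-k:
  • Init: $(z_1, \dots, z_k) = (1, \dots, 1)$
  • Keep the $k$ smallest hashes seen, $z_1 \le \dots \le z_k$, using $h$
  • Estimator: $\hat d = k/z_k$
• Compute: $\Pr[\hat d > (1+\varepsilon)d]$
• Suppose we see $\{1 \dots d\}$
• Define $X_i = 1$ iff $h(i) < \frac{k}{(1+\varepsilon)d}$
• Then: $\hat d > (1+\varepsilon)d$ iff $z_k < \frac{k}{(1+\varepsilon)d}$ iff $\sum_i X_i \ge k$
• We have: $E[\sum_i X_i] = \frac{k}{1+\varepsilon}$ and $\mathrm{Var}[\sum_i X_i] \le \sum_i E[X_i^2] = \frac{k}{1+\varepsilon} \le k$
• By Chebyshev: $\Pr\left[\left|\sum_i X_i - \frac{k}{1+\varepsilon}\right| \ge \sqrt{20k}\right] \le 0.05$
  or: $\sum_i X_i < \frac{k}{1+\varepsilon} + \sqrt{20k}$ with probability at least 0.95
• Concluding $\sum_i X_i < k$ requires $\frac{k}{1+\varepsilon} + \sqrt{20k} \le k$
• Implied by $\sqrt{20k} \le \varepsilon k/2$, i.e. $k \ge 80/\varepsilon^2$, for $\varepsilon \le 1$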
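A Bottom-k sketch can be implemented with a bounded max-heap. The version below is a minimal illustration under two assumptions: the hash is an idealized fully random one (the hypothetical memoizing `RandomHash`), and when fewer than $k$ distinct hashes have been seen the function returns their exact count, which is a design choice made here rather than something stated on the slides.

```python
import heapq
import random

class RandomHash:
    """Idealized hash [n] -> [0,1): one memoized uniform value per element.
    Assumption for illustration; it stores a table, so it is not small-space."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self._table = {}

    def __call__(self, i):
        if i not in self._table:
            self._table[i] = self._rng.random()
        return self._table[i]

def bottom_k_estimate(stream, h, k):
    """Keep the k smallest distinct hash values; estimate d as k / z_k."""
    heap = []      # negated values, so heap[0] is minus the largest kept hash
    kept = set()   # hash values currently in the heap (skips repeated elements)
    for i in stream:
        v = h(i)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            # Insert v and evict the current largest of the kept hashes.
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < k:
        return len(heap)   # fewer than k distinct hashes: the count is exact
    return k / -heap[0]    # -heap[0] is z_k, the k-th smallest hash
```

With $k = 400$ the relative error is typically around $1/\sqrt{k} = 5\%$, in line with $k = O(1/\varepsilon^2)$.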
Hash functions in Streaming
• We used $h: [n] \to [0,1]$
• Issue 1: reals?
• Issue 2: how do we store it?
• Issue 1:
  • Ok with: $h: [n] \to \{0, \frac{1}{M}, \frac{2}{M}, \dots, 1\}$ for $M \gg n^3$
  • Probability that $d$ random numbers collide: at most $1/n$
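The collision bound follows from a union bound over pairs. The short derivation below is a sketch, assuming the hash values are independent and uniform over a grid of $M + 1$ points with $M \ge n^3$ (the symbol $M$ is notation assumed here), and using $d \le n$.

```latex
\begin{align*}
\Pr[\exists\, i \ne j:\ h(i) = h(j)]
  &\le \binom{d}{2} \cdot \frac{1}{M+1}
   \le \frac{d^2}{2M}
   \le \frac{n^2}{2M}
   \le \frac{1}{2n}
   \le \frac{1}{n}
  \qquad \text{for } M \ge n^3 .
\end{align*}
```

So each hash value can be stored with $O(\log M) = O(\log n)$ bits without noticeably changing the analysis.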
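Issue 2: bounded randomness
• Pairwise independent hash functions
• Definition: $h: [n] \to [M]$ s.t.
  • for all $i \ne j$ and $a, b \in [M]$: $\Pr[h(i) = a \wedge h(j) = b] = 1/M^2$
  • (i.e., like random on pairs)
• Such a hash function is enough:
  • Variance cares only about pairs!
  • We defined $X_i = 1$ iff $h(i) < \frac{k}{(1+\varepsilon)d}$
  • And computed $\mathrm{Var}[\sum_i X_i] = E[(\sum_i X_i)^2] - (E[\sum_i X_i])^2$, which depends only on $E[X_i]$ and $E[X_i X_j]$: same for fully random and pairwise independent $h$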
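Pairwise-Independent: example
• Definition: $h: [n] \to [M]$ s.t.
  • for all $i \ne j$ and $a, b \in [M]$: $\Pr[h(i) = a \wedge h(j) = b] = 1/M^2$
• (A) construction:
  • Suppose $M \ge n$ is prime (identify $[M]$ with $\{0, 1, \dots, M-1\}$)
  • Pick $p, q \in \{0, 1, \dots, M-1\}$ at random
  • $h(i) = p \cdot i + q \pmod M$
  • Space: only $O(\log M) = O(\log n)$ bits
• Proof of correctness:
  • $h(i) = a$ and $h(j) = b$: system of 2 equations in 2 unknowns ($p, q$)
  • Exactly one pair $(p, q)$ satisfies it
  • Probability it is chosen: exactly $1/M^2$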
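The construction and its correctness argument can be checked by brute force for a small prime. This sketch assumes the linear family $h(i) = (p \cdot i + q) \bmod M$ with coefficients called `p` and `q` here; `sample_hash` and `count_solutions` are hypothetical helper names.

```python
import random
from itertools import product

def sample_hash(M, rng):
    """Draw h(i) = (p*i + q) mod M with p, q uniform in {0, ..., M-1}; M prime."""
    p = rng.randrange(M)
    q = rng.randrange(M)
    return lambda i: (p * i + q) % M

def count_solutions(M, i, j, a, b):
    """Number of coefficient pairs (p, q) with h(i) = a and h(j) = b."""
    return sum(
        1
        for p, q in product(range(M), repeat=2)
        if (p * i + q) % M == a and (p * j + q) % M == b
    )
```

For a prime $M$ and $i \ne j \pmod M$, `count_solutions` is always 1 out of $M^2$ equally likely pairs, which is exactly the pairwise-independence condition. Storing the function takes just the two coefficients.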
Impossibility Results
• Relaxations:
  • Approximation
  • Randomization
• Need both for $o(n)$ space
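Deterministic Exact Won't Work
• Suppose algorithm $A$, estimator $R$
• $A$ uses space $s \ll n$
• We build the following stream:
  • Let vector $x \in \{0,1\}^n$
  • $i$ in stream iff $x_i = 1$
• Run $A$ on it and let $\sigma$ be the memory content
[Figure: appending an element $i$ to the stream; the estimate either didn't change or increased]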
Deterministic Exact Won't Work
• Using $\sigma$, can recover the entire $x$!
  • For each $i \in [n]$: continue $A$ from $\sigma$ on element $i$; if the estimate didn't change then $x_i = 1$, if it increased then $x_i = 0$
• “$\sigma$ = encoding of a string $x$ of length $n$”
• But $\sigma$ has only $s \ll n$ bits!
• Can think of $A: \{0,1\}^n \to \{0,1\}^s$
• Must be injective
  • Otherwise, suppose $A(x) = A(x')$ for some $x \ne x'$
  • The recovery implies $x = x'$
• Hence $s \ge n$
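The recovery step can be illustrated with a toy exact algorithm. The sketch below assumes the simplest deterministic exact counter, one that stores the set of seen elements as its memory content; `ExactDistinct`, `encode`, and `decode` are hypothetical names. The point is only that the decoder uses nothing but the memory content and the estimator, so the memory must carry all $n$ bits of $x$.

```python
class ExactDistinct:
    """Toy deterministic exact algorithm: its memory is the set of seen elements."""

    def __init__(self, memory=frozenset()):
        self.seen = set(memory)

    def update(self, i):
        self.seen.add(i)

    def estimate(self):
        return len(self.seen)

    def memory(self):
        return frozenset(self.seen)

def encode(x):
    """Stream the indices i with x[i] == 1 and return the memory content."""
    alg = ExactDistinct()
    for i, bit in enumerate(x):
        if bit:
            alg.update(i)
    return alg.memory()

def decode(sigma, n):
    """Recover x from the memory content alone, probing one element at a time."""
    x = []
    for i in range(n):
        alg = ExactDistinct(sigma)   # restart from the saved memory content
        before = alg.estimate()
        alg.update(i)
        # Count unchanged: i was already in the stream, so x[i] = 1.
        x.append(1 if alg.estimate() == before else 0)
    return x
```

For instance, `decode(encode([1, 0, 1, 1]), 4)` returns `[1, 0, 1, 1]`. Any exact deterministic algorithm admits the same decoder, which is why its memory cannot be shorter than $n$ bits.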
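Deterministic Approx Won't Work Either
• Similar: use $A$ to compress $x$ from a code
• Code: set $T \subset \{0,1\}^n$ of size $2^{\Omega(n)}$ s.t. for all distinct $x, y \in T$: $y$ has $\Omega(n)$ ones in coordinates where $x$ has zeros
• Use $A$ to encode an input $x \in T$ into $\sigma = A(x)$
• For each $y \in T$, check whether $y = x$:
  • Append $y$ to the stream, continuing from $\sigma$
  • If the estimate is (approximately) unchanged, then $y = x$
• By injectivity of $A$ on $T$: $2^s \ge |T|$, or $s = \Omega(n)$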
Concluding Remarks
• Median trick + Chernoff
• Distinct Elements
  • Can also store hashes approximately (store the number of leading zeros)
  • $O(\log\log n)$ bits per hash value
  • Plus other bells and whistles
  • HyperLogLog
• Impossibility results
  • Can also prove that randomized, exact won't work