Lecture 4: CountSketch, High Frequencies
COMS E6998-9, F15
Plan • Scribe? • Plan: • CountMin / CountSketch (continuing from last time) • High frequency moments via Precision Sampling
Part 1: CountMin / CountSketch • Let f_i be the frequency of element i in the stream • Last lecture: • 2nd moment: Tug of War • Max / heavy hitters: CountMin
CountMin: overall
Algorithm CountMin:
Initialize(L, w):
  array S[L][w]
  L hash functions h_1, …, h_L, into {1, …, w}
Process(int i):
  for(j=0; j<L; j++) S[j][h_j(i)] += 1;
Estimator:
  foreach i in PossibleIP { f̃_i = min_j (S[j][h_j(i)]); }
• Heavy hitters (threshold φ): report i iff f̃_i ≥ φ‖f‖₁ • If f_i ≤ (φ−ε)‖f‖₁, not reported • If f_i ≥ φ‖f‖₁, reported as heavy hitter • Space: L·w = O(1/ε · log n) cells
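The pseudocode above can be sketched in Python roughly as follows; this is a toy illustration, and the class interface, hash family (random linear hashes mod a Mersenne prime), and parameter choices are my own, not from the lecture:

```python
import random

class CountMin:
    """Toy CountMin sketch: L rows of w counters; estimates never underestimate."""
    def __init__(self, L, w, seed=0):
        rng = random.Random(seed)
        self.L, self.w = L, w
        self.p = 2_147_483_647  # Mersenne prime 2^31 - 1
        # one random linear hash (a*i + b) mod p mod w per row
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(L)]
        self.S = [[0] * w for _ in range(L)]

    def _h(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.p) % self.w

    def process(self, i):
        for j in range(self.L):
            self.S[j][self._h(j, i)] += 1

    def estimate(self, i):
        # min over rows: each cell holds f_i plus nonnegative collision mass
        return min(self.S[j][self._h(j, i)] for j in range(self.L))
```

Because every counter only ever increases, `estimate(i)` is always at least the true frequency f_i; the min over rows keeps the overestimate small.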
Time • Can improve the time to find the heavy hitters (no need to enumerate all of PossibleIP); space degrades to O(1/ε · log² n) • Idea: dyadic intervals • Each level: one CountMin sketch on the virtual stream of dyadic intervals at that level • Find heavy hitters by following the heavy intervals down the tree • (figure: the (virtual) stream at level 0 has 1 element, the interval [1,n]; level 1 has 2 elements, [1,n/2] and [n/2+1,n]; level j has 2^j elements [1,n/2^j], …; the (real) stream at the bottom has the n elements 1, 2, …, n)
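The tree descent can be illustrated as follows. For clarity this toy version computes interval counts exactly (a real implementation would query the per-level CountMin sketch instead); the function name and interface are mine:

```python
def heavy_hitter_dyadic(freq, n, phi):
    """Find all items with frequency >= phi * total by descending the dyadic tree.

    freq: dict item -> frequency, items in [1, n].
    count(lo, hi) stands in for a per-level CountMin query on the virtual
    stream of dyadic intervals.
    """
    total = sum(freq.values())

    def count(lo, hi):
        return sum(v for k, v in freq.items() if lo <= k <= hi)

    out, stack = [], [(1, n)]
    while stack:
        lo, hi = stack.pop()
        if count(lo, hi) < phi * total:
            continue  # a heavy hitter cannot hide inside a light interval
        if lo == hi:
            out.append(lo)
        else:
            mid = (lo + hi) // 2
            stack.extend([(lo, mid), (mid + 1, hi)])
    return sorted(out)
```

Only intervals whose count clears the threshold are split further, so at each of the log n levels at most O(1/φ) intervals survive, which is what makes the query fast.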
CountMin: linearity • Is CountMin linear? • Can we get CountMin(x+y) from CountMin(x) and CountMin(y)? • Just sum the two! • i.e., sum the two arrays cell by cell, assuming both sketches use the same hash functions • Used a lot in practice https://sites.google.com/site/countminsketch/
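Linearity is easy to check concretely: sketching two streams separately with shared hash functions and summing the arrays gives exactly the sketch of the concatenated stream. A minimal demonstration (function names are mine):

```python
import random

def make_hashes(L, w, seed=0):
    """Shared hash functions: random linear hashes mod a Mersenne prime."""
    rng = random.Random(seed)
    p = 2_147_483_647
    return [(rng.randrange(1, p), rng.randrange(p)) for _ in range(L)], p

def sketch(stream, L, w, hashes, p):
    S = [[0] * w for _ in range(L)]
    for i in stream:
        for j, (a, b) in enumerate(hashes):
            S[j][((a * i + b) % p) % w] += 1
    return S

def merge(S1, S2):
    # linearity: CountMin(x + y) = CountMin(x) + CountMin(y), cell by cell,
    # provided both sketches were built with the same hash functions
    return [[c1 + c2 for c1, c2 in zip(r1, r2)] for r1, r2 in zip(S1, S2)]
```

This is why CountMin is convenient in distributed settings: each machine sketches its own stream and a coordinator just adds the arrays.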
CountSketch
Algorithm CountSketch:
Initialize(L, w):
  array S[L][w]
  L hash func's h_1, …, h_L, into [w]
  L hash func's g_1, …, g_L, into {−1, +1}
Process(int i, real δ):
  for(j=0; j<L; j++) S[j][h_j(i)] += g_j(i)·δ;
Estimator:
  foreach i in PossibleIP { f̃_i = median_j ( g_j(i)·S[j][h_j(i)] ); }
• How about decrements (δ < 0)? • Or general streaming • "Heavy hitter": report i if f̃_i is above the threshold • "min" is an issue once cell contents can cancel • But median is still ok • Ideas to improve it further? • Use Tug of War in each bucket => CountSketch • Better in a certain sense (cancellations in a cell reduce the error)
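A toy Python version of the pseudocode above; the class interface and the hash family (linear hashes, with parity for the ±1 signs) are my own choices for illustration:

```python
import random
from statistics import median

class CountSketch:
    """Toy CountSketch: signed counters, median estimator; handles decrements."""
    def __init__(self, L, w, seed=0):
        rng = random.Random(seed)
        self.L, self.w = L, w
        self.p = 2_147_483_647
        self.h = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(L)]
        self.g = [(rng.randrange(1, self.p), rng.randrange(self.p)) for _ in range(L)]
        self.S = [[0.0] * w for _ in range(L)]

    def _bucket(self, j, i):
        a, b = self.h[j]
        return ((a * i + b) % self.p) % self.w

    def _sign(self, j, i):
        a, b = self.g[j]
        return 1 if ((a * i + b) % self.p) % 2 == 0 else -1

    def process(self, i, delta=1.0):
        # delta may be negative: general (turnstile) streaming is fine
        for j in range(self.L):
            self.S[j][self._bucket(j, i)] += self._sign(j, i) * delta

    def estimate(self, i):
        # median, not min: cells can cancel, so min no longer upper-bounds f_i
        return median(self._sign(j, i) * self.S[j][self._bucket(j, i)]
                      for j in range(self.L))
```

Each row's cell is a little Tug of War sketch restricted to the items hashing there, which is exactly why collisions tend to cancel instead of accumulating.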
CountSketch → Compressed Sensing • Sparse approximations: • k-sparse approximation f* of f: the k-sparse vector minimizing ‖f − f*‖ • Solution: f* = the k heaviest elements of f • Compressed Sensing: [Candes-Romberg-Tao'04, Donoho'04] • Want to acquire signal x • Acquisition: linear measurements Sx (sketch) • Goal: recover a k-sparse approximation x* from Sx • Error guarantee: ‖x − x*‖ ≤ C · min_{k-sparse x'} ‖x − x'‖ • Theorem: need only an O(k log(n/k))–size sketch!
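The "solution" bullet is worth making concrete: with direct access to the vector, the best k-sparse approximation just keeps the k entries of largest magnitude (the compressed-sensing problem is recovering roughly this from the sketch alone). A minimal sketch, with a function name of my own:

```python
def k_sparse_approx(f, k):
    """Best k-sparse approximation of a vector: keep the k heaviest entries
    (by absolute value), zero out the rest."""
    top = sorted(range(len(f)), key=lambda i: abs(f[i]), reverse=True)[:k]
    xstar = [0.0] * len(f)
    for i in top:
        xstar[i] = f[i]
    return xstar
```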
Signal Acquisition for CS • Single pixel camera [Takhar-Laska-Wakin-Duarte-Baron-Sarvotham-Kelly-Baraniuk'06] • One linear measurement = one row of the sketch matrix • CountSketch: a version of Compressed Sensing • Set w = O(k), L = O(log n) • x*: take all the heavy hitters (or the k largest estimates) • Space: O(k log n) • source: http://dsp.rice.edu/sites/dsp.rice.edu/files/cs/cscam-SPIEJan06.pdf
Back to Moments • General moments: • p-th moment: F_p = Σ_i f_i^p • normalized: (F_p)^{1/p} = ‖f‖_p • p = 2: • via Tug of War (Lec. 3) • p = 0: count # distinct! • [Flajolet-Martin] from Lec. 2 • p = 1: Σ_i |f_i| • p ∈ (0, 2): will see later (for all such p) • p = ∞ (normalized): max_i f_i • Impossible to approximate in small space, but can find heavy hitters (Lec. 3) • Remains: p > 2? • Space: O(n^{1−2/p}) via Precision Sampling (next)
A task: estimate sum • (figure: compute an estimate S̃ from a few sampled terms a_i) • Given: n quantities a_1, …, a_n in the range [0, 1] • Goal: estimate S = a_1 + … + a_n "cheaply" • Standard sampling: pick a random set J of size m • Estimator: S̃ = (n/m) · Σ_{i∈J} a_i • Chebyshev bound: S̃ = S ± O(n/√m) with 90% success probability • For constant additive error, need m = Ω(n)
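The standard sampling estimator from the slide is a one-liner; this demo (function name mine) just implements S̃ = (n/m)·(sum of m uniform samples):

```python
import random

def sample_sum(a, m, seed=0):
    """Estimate S = sum(a) from m uniformly random samples:
    S~ = (n/m) * (sum of sampled values)."""
    rng = random.Random(seed)
    n = len(a)
    return n / m * sum(a[rng.randrange(n)] for _ in range(m))
```

The estimator is unbiased, but its standard deviation scales like n/√m, which is why constant additive error forces m to be on the order of n: the values must essentially all be read.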
Precision Sampling Framework • (figure: compute an estimate S̃ from rough estimates ã_1, …, ã_n with precisions u_1, …, u_n) • Alternative "access" to the a_i's: • For each term a_i, we get a (rough) estimate ã_i • up to some precision u_i, chosen in advance: |ã_i − a_i| ≤ u_i • Challenge: achieve a good trade-off between • quality of the approximation S̃ to S • use only weak precisions u_i (minimize the "cost" of estimating ã_i)
Formalization • Sum Estimator: 1. fix precisions u_i • Adversary: 2. fix a_1, …, a_n and ã_1, …, ã_n s.t. a_i ∈ [0, 1] and |ã_i − a_i| ≤ u_i • Sum Estimator: 3. given the ã_i's, output S̃ s.t. |S̃ − S| is small (for S = Σ_i a_i) • What is the cost? • Here, average cost = (1/n) · Σ_i 1/u_i • to achieve precision u_i, use "resources" 1/u_i: e.g., if ã_i is itself a sum computed by subsampling, then one needs about 1/u_i samples • For example, can choose all u_i = 1/n • Average cost ≈ n
Precision Sampling Lemma • Goal: estimate S = Σ_i a_i from {ã_i} satisfying |ã_i − a_i| ≤ u_i. • Precision Sampling Lemma: can get, with 90% success: • O(1) additive error and 1.5 multiplicative error: S/1.5 − O(1) ≤ S̃ ≤ 1.5·S + O(1) • with average cost equal to O(log n) • Example: distinguish Σ_i a_i = 3 vs Σ_i a_i = 0 • Consider two extreme cases: • if three a_i = 1: enough to have a crude approximation (constant precision) for all i • if all a_i = 3/n: only a few i's need a good approximation, and the rest can have u_i = 1
Precision Sampling: Algorithm • Precision Sampling Lemma: can get, with 90% success: • O(1) additive error and 1.5 multiplicative error: S/1.5 − O(1) ≤ S̃ ≤ 1.5·S + O(1) • with average cost equal to O(log n) • Algorithm: • Choose each u_i = t_i, with the t_i i.i.d. Exp(1) random variables • Estimator: S̃ = max_i ã_i/t_i. • Proof of correctness: • Claim 1: max_i a_i/t_i is distributed as S/Exp(1), since min_i t_i/a_i is the minimum of independent exponentials with rates a_i, i.e., exponential with rate Σ_i a_i = S • Hence, S̃ = Θ(S) with constant probability (boost by taking a median of repetitions) • Claim 2: Avg cost = (1/n) Σ_i E[min(1/t_i, n)] = O(log n) (precisions are only needed up to additive 1/n, so 1/t_i is truncated at n)
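The exponential-variable trick in Claim 1 can be checked empirically. This demo ignores the precision/cost aspect (it uses the exact a_i rather than rough estimates ã_i) and only shows that max_i a_i/t_i concentrates around S after a median-of-repetitions correction; the function names and the ln 2 rescaling (the median of S/Exp(1) is S/ln 2) are my own framing:

```python
import math
import random

def one_estimate(a, rng):
    """Draw t_i ~ Exp(1); max_i a_i/t_i is distributed as S/Exp(1), because
    min_i t_i/a_i is exponential with rate S = sum(a)."""
    return max(ai / rng.expovariate(1.0) for ai in a)

def estimate_sum(a, reps=301, seed=0):
    """Median over repetitions, rescaled: median of S/Exp(1) is S/ln 2."""
    rng = random.Random(seed)
    ests = sorted(one_estimate(a, rng) for _ in range(reps))
    return ests[reps // 2] * math.log(2)
```

Note how the max "finds" the sum: a term a_i wins the max roughly with probability a_i/S, so large terms are sampled preferentially, which is exactly why crude precision suffices for most terms.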
F_p-moments via Prec. Sampling • Theorem: linear sketch for the p-moment with O(1) approximation, and O(n^{1−2/p} log n) space (with 90% success probability). • Sketch: • Pick random t_1, …, t_n i.i.d. Exp(1), and • let x_i = f_i / t_i^{1/p} • Hash the x_i's into a hash table S of w cells, with random signs (as in CountSketch) • Estimator: F̃_p = max_j |S[j]|^p • Linear in f ⇒ a linear sketch
Correctness of estimation
Algorithm PrecisionSamplingFp:
Initialize(w):
  array S[w]
  hash func h, into [w]
  hash func g, into {−1, +1}
  reals t_1, …, t_n from distribution Exp(1)
Process(vector f):
  for(i=0; i<n; i++) S[h(i)] += g(i)·f_i/t_i^{1/p};
Estimator:
  F̃_p = max_j |S[j]|^p
• Theorem: F̃_p is an O(1) approximation to F_p with 90% probability, with w = O(n^{1−2/p} log n) cells • Proof: • Use the Precision Sampling Lemma (with terms a_i = f_i^p) • Need to show the noise from other elements in each cell is small • more precisely: |S[h(i)]| ≈ |x_i| for the large coordinates x_i
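The full pipeline above fits in a few lines of Python. This is a toy version with choices of my own: fresh per-element randomness replaces the hash functions h and g (fine here since the whole vector is processed at once, though it is no longer a reusable linear sketch), and the median-of-repetitions with the ln 2 correction (the median of F_p/Exp(1) is F_p/ln 2) stands in for the success-probability boosting:

```python
import math
import random

def fp_sketch_once(f, p, w=500, seed=None):
    """One run: x_i = f_i / t_i^(1/p) with t_i ~ Exp(1), signed hashing into
    w cells; max_j |S[j]|^p is distributed roughly as F_p/Exp(1) for large w."""
    rng = random.Random(seed)
    S = [0.0] * w
    for fi in f:
        t = rng.expovariate(1.0)
        sign = 1 if rng.random() < 0.5 else -1  # stands in for g(i)
        S[rng.randrange(w)] += sign * fi / t ** (1.0 / p)  # cell h(i)
    return max(abs(c) for c in S) ** p

def estimate_fp(f, p, reps=101, seed=0):
    rng = random.Random(seed)
    ests = sorted(fp_sketch_once(f, p, seed=rng.randrange(2**30))
                  for _ in range(reps))
    return ests[reps // 2] * math.log(2)  # median of F_p/Exp(1) is F_p/ln 2
```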
Correctness 2 • Claim: the chaff in each cell is small • Consider the cell h(i) containing a large x_i • How much chaff (mass from the other elements hashed there) is there? • What is E[chaff²]? • E[chaff²] ≤ ‖x‖₂²/w (random signs ⇒ cross terms cancel in expectation) • By Markov's: chaff² ≤ 10·‖x‖₂²/w with probability > 90% • Set w = O(n^{1−2/p} log n); then the chaff is small compared to the top |x_i| ≈ F_p^{1/p}, where E[‖x‖₂²] = O(F_2) = O(n^{1−2/p}·F_p^{2/p}) since E[1/t^{2/p}] = O(1) for an exponential r.v. t when p > 2
Correctness (final) • Claim: chaff_i² ≤ 10·‖x‖₂²/w, • where chaff_i is the contribution of the other elements to cell h(i) • Proved: E[chaff_i²] ≤ ‖x‖₂²/w • this implies the claim with 90% probability for a fixed i • But need it for all the important i's! • Want: small chaff with high probability, for some smallish w • Can indeed prove it for w = O(n^{1−2/p} log n) with a strong concentration inequality (Bernstein).
Recap • CountSketch: • Linear sketch for general streaming • p-moment for p > 2 • Via Precision Sampling • Estimate of a sum from poor estimates of its terms • Sketch: exponential variables + CountSketch