Explore sublinear algorithms for computing statistics over a data stream using limited space. Topics include sketching, sampling, and other sublinear algorithms, covering distinct elements, heavy hitters, moments, and more.
Sketching, Sampling and other Sublinear Algorithms: Streaming. Alex Andoni (MSR SVC)
A scenario
• Challenge: compute something on the table of packets, using small space.
• Example stream of IPs: 131.107.65.14, 18.9.22.69, 131.107.65.14, 80.97.56.20, 18.9.22.69, 80.97.56.20, 131.107.65.14
• Example of "something":
• # distinct IPs
• max frequency
• other statistics…
Sublinear: a panacea?
• Sub-linear space algorithm for solving the Travelling Salesperson Problem?
• Sorry, perhaps a different lecture
• Even very simple problems are hard to solve sublinearly:
• Ex: the count of distinct IPs seen
• Will settle for:
• Approximate algorithms: 1+ε approximation, i.e., true answer ≤ output ≤ (1+ε) · (true answer)
• Randomized: the above holds with probability 95%
• A quick and dirty way to get a sense of the data
Streaming data
• Data through a router
• Data stored on a hard drive, or streamed remotely
• More efficient to do a linear scan on a hard drive
• Working memory is the (smaller) main memory
Application areas • Data can come from: • Network logs, sensor data • Real time data • Search queries, served ads • Databases (query planning) • …
Problem 1: # distinct elements
• Problem: compute the number of distinct elements in the stream
• Example stream: 2, 5, 7, 5, 5 (three distinct elements)
• Trivial solution: O(n) space for n distinct elements
• Will see: O(log n) space (approximate, randomized)
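For contrast with the sketches that follow, a minimal Python version of the trivial solution (the function name is mine, not from the slides): it remembers every distinct element, hence linear space.

    def count_distinct_exact(stream):
        # Stores every distinct element seen: Theta(n) space for n distinct items.
        seen = set()
        for x in stream:
            seen.add(x)
        return len(seen)

    print(count_distinct_exact([2, 5, 7, 5, 5]))  # -> 3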
Distinct Elements: idea 1 [Flajolet-Martin'85, Alon-Matias-Szegedy'96]
Algorithm DISTINCT:
Initialize: minHash = 1; hash function h into [0,1]
Process(int i): if (h(i) < minHash) minHash = h(i);
Output: 1/minHash - 1
• Algorithm:
• Hash function h maps each element into [0,1]
• Compute minHash = min over the stream of h(i)
• Output 1/minHash - 1
• "Analysis":
• repeats of the same element i don't matter: h(i) is always the same
• E[minHash] = 1/(n+1) for n distinct elements, so 1/minHash - 1 ≈ n
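A minimal Python sketch of the DISTINCT pseudocode above, assuming md5 (with a salt) as a stand-in for the random hash function; the helper names are mine.

    import hashlib

    def h(x, salt=0):
        # Deterministic "random-looking" hash of x into [0, 1).
        digest = hashlib.md5(f"{salt}:{x}".encode()).hexdigest()
        return int(digest, 16) / 16**32

    def distinct_minhash(stream, salt=0):
        min_hash = 1.0
        for x in stream:
            min_hash = min(min_hash, h(x, salt))   # repeats of x hash identically
        return 1.0 / min_hash - 1                  # E[min] = 1/(n+1) for n distinct

    # A single hash function gives a high-variance estimate; in practice one
    # averages over several salts (or tracks the k-th smallest hash value).
    stream = list(range(1000)) * 3                 # 1000 distinct elements
    print(distinct_minhash(stream))                # roughly 1000, up to noise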
Distinct Elements: idea 2
Algorithm DISTINCT (v2):
Initialize: minHash2 = 0; hash function h into [0,1]
Process(int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(i));
Output: 2^minHash2
Example: x = 0.0000001100101 has ZEROS(x) = 6 (leading zero bits of the binary expansion)
• Store approximately: instead of minHash itself, store just the count of leading zeros
• Need only O(log log n) bits of state
• Randomness: 2-wise independence is enough!
• O(log n) random bits
• Better accuracy using more space:
• 1±ε error: repeat O(1/ε²) times with different hash functions
• HyperLogLog: can also do it with just one hash function [FFGM'07]
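A minimal Python sketch of idea 2, again assuming a salted md5 hash; ZEROS counts leading zero bits of the 128-bit hash, i.e., of h(x) viewed as a number in [0,1). Function names are mine.

    import hashlib

    def zeros(x, salt=0):
        # ZEROS(h(x)): leading zero bits of the 128-bit hash value.
        v = int(hashlib.md5(f"{salt}:{x}".encode()).hexdigest(), 16)
        return 128 - v.bit_length()

    def distinct_loglog(stream, salt=0):
        # The only state is one small counter: O(log log n) bits.
        z = 0
        for x in stream:
            z = max(z, zeros(x, salt))
        return 2 ** z

    # Averaging over several salts (or HyperLogLog's harmonic mean over
    # many counters) reduces the variance of this single-counter estimate.
    print(distinct_loglog(list(range(1000)) * 3))  # a power of 2 near 1000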
Problem 2: max count / heavy hitters
• Problem: compute the maximum frequency of an element in the stream
• Example stream: 2, 5, 7, 5, 5 (max frequency is 3)
• Bad news:
• Hard to distinguish whether an element repeated (max = 1 vs 2)
• Good news:
• Can find the "heavy hitters":
• elements with frequency > total frequency / s
• using space proportional to s
Heavy Hitters: CountMin [Charikar-Chen-FarachColton'04, Cormode-Muthukrishnan'05]
Algorithm CountMin:
Initialize(w, L):
  array Sketch[L][w]
  L hash functions h[L], into {0,…,w-1}
Process(int i):
  for(j=0; j<L; j++)
    Sketch[j][ h[j](i) ] += 1;
Output:
  foreach i in PossibleIP {
    freq[i] = int.MaxValue;
    for(j=0; j<L; j++)
      freq[i] = min(freq[i], Sketch[j][ h[j](i) ]);
  }
  // freq[] is the frequency estimate
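A runnable Python sketch of CountMin, with Python's built-in hash (mixed with a per-row salt) standing in for the L hash functions; class and method names are mine.

    import random

    class CountMin:
        def __init__(self, w, L):
            self.w, self.L = w, L
            self.sketch = [[0] * w for _ in range(L)]
            # One salt per row stands in for the L independent hash functions.
            self.salts = [random.randrange(2**32) for _ in range(L)]

        def _h(self, j, x):
            return hash((self.salts[j], x)) % self.w

        def process(self, x):
            for j in range(self.L):
                self.sketch[j][self._h(j, x)] += 1

        def freq(self, x):
            # Collisions only inflate counts, so each row overestimates:
            # take the minimum across rows.
            return min(self.sketch[j][self._h(j, x)] for j in range(self.L))

    cm = CountMin(w=100, L=5)
    for x in [7, 2, 5, 5, 5]:
        cm.process(x)
    print(cm.freq(5))  # -> 3 (exact here; an overestimate in general)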
Heavy Hitters: analysis
• freq[5] = frequency of 5, plus "extra mass" from colliding elements
• Expected "extra mass" ≤ total mass / w
• Chebyshev: the estimate is good with probability > 1/2 per row
• take L = O(log n) rows to get high probability (for all elements simultaneously)
• Compute the heavy hitters from freq[]
Problem 3: Moments
• Problem: compute the frequency moment F_k = Σ_i f_i^k, where f_i is the frequency of element i
• variance: k = 2
• higher moments for k > 2:
• skewness (k=3), kurtosis (k=4), etc
• a different proxy for the max frequency: for frequencies (1, 3, 2), F_2 = 1+9+4 = 14 while F_4 = 1+81+16 = 98; larger k weights the largest frequency more heavily
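A short Python snippet computing the quantity being approximated (the exact, linear-space computation, not the streaming algorithm); it reproduces the slide's arithmetic.

    from collections import Counter

    def moment(stream, k):
        # F_k = sum over distinct elements of (frequency)^k.
        freqs = Counter(stream)
        return sum(f ** k for f in freqs.values())

    stream = ["a"] * 1 + ["b"] * 3 + ["c"] * 2     # frequencies (1, 3, 2)
    print(moment(stream, 2))                       # 1 + 9 + 4 = 14
    print(moment(stream, 4))                       # 1 + 81 + 16 = 98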
F_2 moment
• Use the Johnson-Lindenstrauss lemma! (2nd lecture)
• Store the sketch y = Sx
• x = frequency vector of the stream
• S = k-by-n matrix of Gaussian entries
• Update on element i: y += S · e_i (add the i-th column of S)
• Guarantees:
• 1±ε approximation with k = O(1/ε²) counters (words)
• O(1/ε²) time per update
• Better: ±1 entries, O(1) update time [AMS'96, TZ'04]
• F_k: precision sampling => next
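A minimal Python sketch of the AMS ±1 variant described above; it stores the sign matrix explicitly for clarity (a real implementation derives the signs from a 4-wise independent hash instead). Class and method names are mine.

    import random

    class F2Sketch:
        def __init__(self, n, k, seed=0):
            rng = random.Random(seed)
            self.k = k
            # k-by-n matrix of +/-1 signs (Gaussian entries also work).
            self.S = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
            self.y = [0] * k                    # y = S x, maintained incrementally

        def update(self, i, delta=1):
            # Element i arrives: x_i += delta, so add delta times column i of S.
            for r in range(self.k):
                self.y[r] += delta * self.S[r][i]

        def estimate(self):
            # E[(S x)_r ** 2] = F_2 for each row; average over the k rows.
            return sum(v * v for v in self.y) / self.k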
Scenario 2: distributed traffic
• Statistics on the traffic difference/aggregate between two routers
• E.g.: traffic differs by how many packets?
• Linearity is the power!
• Sketch(data A) + Sketch(data B) = Sketch(data A + data B)
• Sketch(data A) - Sketch(data B) = Sketch(data A - data B)
• Two sketches are sufficient to compute something on the difference or sum
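A hypothetical two-router demo of this linearity, reusing the F2Sketch class from the previous snippet: sharing the seed gives both routers the same matrix S, so their sketches subtract coordinate-wise.

    # Same seed => same S, so Sketch(a) - Sketch(b) = Sketch(a - b).
    A = F2Sketch(n=1000, k=400, seed=42)
    B = F2Sketch(n=1000, k=400, seed=42)
    for pkt in [3, 7, 7, 9]:
        A.update(pkt)
    for pkt in [3, 7, 9, 9]:
        B.update(pkt)
    diff = [ya - yb for ya, yb in zip(A.y, B.y)]
    print(sum(v * v for v in diff) / A.k)   # approximates F_2(a - b) = 2 here
                                            # (one extra 7, one missing 9)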
Common primitive: estimate the sum
• Given: n quantities a_1, a_2, …, a_n in the range [0,1]
• Goal: estimate S = a_1 + a_2 + … + a_n "cheaply"
• Standard sampling: pick a random set J of m indices
• Estimator: S̃ = (n/m) · Σ_{j∈J} a_j
• Chebyshev bound: S̃ = S ± O(√(nS/m)) with 90% success probability
• For constant additive error, need m = Ω(n) samples
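A small Python illustration of why standard sampling is expensive here (the setup with three hidden ones is a hypothetical instance, foreshadowing the example two slides below).

    import random

    def sample_sum(a, m):
        # Estimate S = sum(a) from m uniform samples (with replacement).
        n = len(a)
        picks = [a[random.randrange(n)] for _ in range(m)]
        return n * sum(picks) / m

    # With three a_i = 1 hidden among n = 10000 entries, a small sample
    # almost always misses every nonzero term: detecting S = 3 vs S = 0
    # needs m = Omega(n) samples.
    a = [0.0] * 9997 + [1.0] * 3
    print(sample_sum(a, 100))    # usually prints 0.0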
Precision Sampling Framework
• Alternative "access" to the a_i's:
• For each term a_i, we get a (rough) estimate ã_i
• up to some additive precision u_i, chosen in advance: |ã_i - a_i| < u_i
• Challenge: achieve a good trade-off between
• quality of the approximation to S
• using only weak precisions u_i (minimize the "cost" of estimating the ã_i's)
Formalization
• The game (Sum Estimator vs. Adversary):
• 1. Estimator fixes the precisions u_i; Adversary fixes a_1, …, a_n
• 2. Adversary fixes estimates ã_i s.t. |ã_i - a_i| < u_i
• 3. Given the ã_i's, the Estimator outputs S̃ s.t. |S̃ - Σ a_i| < 1 (say)
• What is the cost?
• Here, average cost = (1/n) · Σ 1/u_i
• to achieve precision u_i, one uses "resources" 1/u_i: e.g., if a_i is itself a sum computed by subsampling, then one needs about 1/u_i samples to estimate it to within ±u_i
• For example, can choose all u_i = ε/n
• Average cost ≈ n/ε
Precision Sampling Lemma [A-Krauthgamer-Onak'11]
• Goal: estimate ∑ a_i from {ã_i} satisfying |a_i - ã_i| < u_i.
• Precision Sampling Lemma: can get, with 90% success:
• O(1) additive error and 1.5 multiplicative error: S - O(1) < S̃ < 1.5·S + O(1)
• with average cost equal to O(log n)
• More generally: ε additive and 1+ε multiplicative error, S - ε < S̃ < (1+ε)·S + ε, with average cost O(ε^-3 log n)
• Example: distinguish Σ a_i = 3 vs Σ a_i = 0
• Consider two extreme cases:
• if three a_i = 1: enough to have a crude approximation for all of them (u_i = 0.1)
• if all a_i = 3/n: only a few need good approximation (u_i = 1/n), and the rest can have u_i = 1
Precision Sampling Algorithm
• Precision Sampling Lemma: can get, with 90% success:
• O(1) additive error and 1.5 multiplicative error: S - O(1) < S̃ < 1.5·S + O(1)
• with average cost equal to O(log n)
• Algorithm:
• Choose each u_i ∈ [0,1] i.i.d. uniformly
• Estimator: S̃ = count of the i's s.t. ã_i / u_i > 6 (up to a normalization constant)
• Proof of correctness:
• we use only ã_i which are 1.5-approximations to a_i
• E[S̃] ≈ ∑ Pr[a_i / u_i > 6] = ∑ a_i / 6
• E[1/u_i] = O(log n) w.h.p.
• For the ε version (S - ε < S̃ < (1+ε)·S + ε, cost O(ε^-3 log n)): use a concrete distribution, u_i = minimum of O(ε^-3) uniform r.v.'s, with an estimator that is a function of [ã_i/u_i - 4/ε]+ and the u_i's
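A minimal Python sketch of the O(1)-error algorithm above, with a simulated adversary that perturbs each a_i by up to the requested precision u_i; function names and the averaging over runs are mine.

    import random

    def precision_sampling_estimate(a, threshold=6.0):
        # Pick the precisions u_i ~ Uniform(0,1) i.i.d., in advance.
        u = [random.random() for _ in a]
        # Simulated adversary: any estimate within u_i of a_i is allowed.
        a_tilde = [ai + random.uniform(-ui, ui) for ai, ui in zip(a, u)]
        # Count terms whose estimate is large relative to its precision;
        # E[count] ~ sum(a_i) / threshold, so rescale by the threshold.
        count = sum(1 for at, ui in zip(a_tilde, u) if at / ui > threshold)
        return threshold * count

    # Distinguishing sum = 3 from sum = 0 (compare with standard sampling):
    a = [0.0] * 997 + [1.0] * 3
    runs = [precision_sampling_estimate(a) for _ in range(1000)]
    print(sum(runs) / len(runs))   # near 3, within the lemma's constant
                                   # factors; all-zero input would give 0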
Moments (k > 2) via precision sampling
• Theorem: linear sketch for F_k with O(1) approximation, and O(n^{1-2/k} log n) space (90% succ. prob.).
• Sketch:
• Pick random u_i ∈ [0,1], and let y_i = x_i / u_i^{1/k}
• throw the y_i's into a hash table H,
• with roughly n^{1-2/k} · log n cells
• Estimator: read off the largest cells of H (a max-type statistic of the cell values)
• Randomness: bounded (O(1)-wise) independence suffices
Streaming++ • LOTS of work in the area: • Surveys • Muthukrishnan: http://algo.research.googlepages.com/eight.ps • McGregor: http://people.cs.umass.edu/~mcgregor/papers/08-graphmining.pdf • Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf • Open problems: http://sublinear.info • Examples: • Moments, sampling • Median estimation, longest increasing sequence • Graph algorithms • E.g., dynamic graph connectivity [AGG’12, KKM’13,…] • Numerical algorithms (e.g., regression, SVD approximation) • Fastest (sparse) regression […CW’13,MM’13,KN’13,LMP’13] • related to Compressed Sensing