
Sketching, Sampling and other Sublinear Algorithms: Streaming

Explore sublinear algorithms for computing something on a table using limited space. Topics include sketching, sampling, and other sublinear algorithms. Learn about distinct elements, heavy hitters, moments, and more.



Presentation Transcript


  1. Sketching, Sampling and other Sublinear Algorithms: Streaming. Alex Andoni (MSR SVC)

  2. A scenario • Challenge: compute something on the table, using small space. • Example of “something”: • # distinct IPs • max frequency • other statistics… • (Figure: a stream of IP addresses, e.g. 131.107.65.14, 18.9.22.69, 131.107.65.14, 80.97.56.20, 18.9.22.69, 80.97.56.20, 131.107.65.14.)

  3. Sublinear: a panacea? • Sub-linear space algorithm for solving the Travelling Salesperson Problem? • Sorry, perhaps a different lecture • Even very simple problems are hard to solve sublinearly: • Ex: what is the count of distinct IPs seen? • Will settle for: • Approximate algorithms: a (1+ε)-approximation means true answer ≤ output ≤ (1+ε) · (true answer) • Randomized: the above holds with probability 95% • A quick and dirty way to get a sense of the data

  4. Streaming data • Data through a router • Data stored on a hard drive, or streamed remotely • More efficient to do a linear scan on a hard drive • Working memory is the (smaller) main memory

  5. Application areas • Data can come from: • Network logs, sensor data • Real time data • Search queries, served ads • Databases (query planning) • …

  6. Problem 1: # distinct elements • Problem: compute the number of distinct elements in the stream (e.g., the stream 2 5 7 5 5 has 3 distinct elements) • Trivial solution: O(n) space, storing the n distinct elements • Will see: logarithmic space (approximate)

  7. Distinct Elements: idea 1 [Flajolet-Martin’85, Alon-Matias-Szegedy’96] Algorithm DISTINCT: Initialize: minHash = 1; hash function h into [0,1] Process(int i): if (h(i) < minHash) minHash = h(i); Output: 1/minHash − 1 • Algorithm: • Hash function h maps elements into [0,1] • Compute minHash = minimum of h(i) over the stream • Output is 1/minHash − 1 • “Analysis”: • repeats of the same element i don’t matter (h(i) is the same each time) • E[minHash] = 1/(n+1), for n distinct elements
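As an illustrative sketch (not the authors' code), the min-hash idea fits in a few lines of Python. A per-element seeded `random.Random` stands in for the ideal hash function h into [0,1), and averaging minHash over independent repetitions before inverting tames the heavy-tailed variance of a single estimate:

```python
import random

def min_hash(stream, seed):
    """One pass: track the minimum of a simulated hash h(x) in [0,1)."""
    m = 1.0
    for x in stream:
        h = random.Random(1_000_003 * seed + x).random()  # same x -> same h(x)
        m = min(m, h)
    return m

def distinct_estimate(stream, reps=200):
    """Average minHash over reps hash functions, then output 1/avg - 1."""
    avg = sum(min_hash(stream, s) for s in range(reps)) / reps
    return 1.0 / avg - 1.0
```

Since E[minHash] = 1/(n+1), the averaged minimum concentrates around 1/(n+1) and the inverted value lands near n.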

  8. Distinct Elements: idea 2 Algorithm DISTINCT: Initialize: minHash2 = 0; hash function h into [0,1] Process(int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(i)); Output: 2^minHash2 • Store minHash only approximately: • Store just the count of leading zeros of the hash value’s binary expansion, e.g. x = 0.0000001100101 has ZEROS(x) = 6 • Need only O(log log n) bits for the counter • Randomness: 2-wise independence is enough! O(log n) bits for the hash function • Better accuracy using more space: • error ε: repeat O(1/ε²) times with different hash functions • HyperLogLog: can do it with just one hash function [FFGM’07]
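A rough simulation of idea 2 (my own sketch, not the slides' code): ZEROS counts leading zero bits of the hash value, the counter only ever grows, and the output 2^minHash2 is correct only up to a constant factor, so the check below takes a median over seeds and allows wide slack:

```python
import random

def zeros(x):
    """Number of leading zero bits of x in (0,1) after the binary point."""
    z = 0
    while x < 0.5:
        x *= 2.0
        z += 1
    return z

def distinct_estimate(stream, seed):
    min_hash2 = 0
    for x in stream:
        h = random.Random(2_000_003 * seed + x).random()
        # h < 2^-minHash2 means h has at least minHash2 leading zeros
        if h > 0.0 and h < 0.5 ** min_hash2:
            min_hash2 = zeros(h)
    return 2 ** min_hash2
```

The counter value is at most about log₂ n, which is why O(log log n) bits suffice to store it.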

  9. Problem 2: max count (heavy hitters) • Problem: compute the maximum frequency of an element in the stream (e.g., in 2 5 7 5 5 the max frequency is 3) • Bad news: • Hard to distinguish whether an element repeated (max = 1 vs 2) • Good news: • Can find “heavy hitters” • elements with frequency > total frequency / s • using space proportional to s

  10. Heavy Hitters: CountMin [Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05] Algorithm CountMin: Initialize(w, L): array Sketch[L][w]; L hash functions h[0..L−1], into {0,…,w−1} Process(int i): for (j = 0; j < L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for (j = 0; j < L; j++) freq[i] = min(freq[i], Sketch[j][ h[j](i) ]); } // freq[] is the frequency estimate
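The pseudocode above translates almost line for line; here is a minimal Python rendition (an illustrative sketch, using simple modular hashing in place of the slide's unspecified hash family):

```python
import random

class CountMin:
    def __init__(self, w, L, seed=0):
        rng = random.Random(seed)
        self.w, self.L = w, L
        self.sketch = [[0] * w for _ in range(L)]
        # one (a, b) pair per row: h_j(i) = ((a*i + b) mod p) mod w
        self.p = (1 << 31) - 1
        self.rows = [(rng.randrange(1, self.p), rng.randrange(self.p))
                     for _ in range(L)]

    def _h(self, j, i):
        a, b = self.rows[j]
        return ((a * i + b) % self.p) % self.w

    def process(self, i):
        for j in range(self.L):
            self.sketch[j][self._h(j, i)] += 1

    def freq(self, i):
        # min over rows trims collision mass; the estimate never undercounts
        return min(self.sketch[j][self._h(j, i)] for j in range(self.L))
```

On the stream 2 5 7 5 5, freq(5) is at least the true count 3 and at most the total mass 5.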

  11. Heavy Hitters: analysis (algorithm as on slide 10) • freq[5] = frequency of 5, plus “extra mass” from colliding elements • Expected “extra mass” ≤ total mass / w • Markov: the estimate is good with probability > 1/2 per row • take the min over the L rows to get high probability (for all elements) • Compute heavy hitters from freq[]

  12. Problem 3: Moments • Problem: compute the frequency moment F_k = Σ_i f_i^k, where f_i is the frequency of element i • F_2: second moment (self-join size) • higher moments for k > 2: • Skewness (k=3), kurtosis (k=4), etc • F_k for large k: a different proxy for max frequency • Example: frequencies (1, 3, 2) give F_2 = 1+9+4 = 14 and F_4 = 1+81+16 = 98

  13. F_2 moment • Use Johnson-Lindenstrauss lemma! (2nd lecture) • Store sketch y = Gx • x = frequency vector • G = k-by-n matrix of Gaussian entries, k = O(1/ε²) • Update on element i: add the i-th column of G to y • Guarantees: ‖y‖² = (1±ε) F_2 • O(1/ε²) counters (words) • O(1/ε²) time to update • Better: ±1 entries, faster update [AMS’96, TZ’04] • F_k for k > 2: precision sampling => next
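A tiny simulation of the ±1 variant (my sketch; random signs are simulated with a seeded `random.Random` rather than the 4-wise independent family the references use):

```python
import random

def sign(i, j):
    """Simulated random ±1 hash for sketch row j."""
    return 1 if random.Random(1_000_003 * j + i).random() < 0.5 else -1

def f2_estimate(stream, k=1000):
    """AMS-style F2 estimate: y_j = sum_i s_j(i) f_i, then average y_j^2."""
    y = [0] * k
    for x in stream:          # linear update per stream element
        for j in range(k):
            y[j] += sign(x, j)
    return sum(v * v for v in y) / k
```

Cross terms s_j(a)s_j(b) cancel in expectation, so E[y_j²] = Σ f_i² = F_2; averaging over k rows controls the variance.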

  14. Scenario 2: distributed traffic • Statistics on the traffic difference/aggregate between two routers • E.g.: traffic differs by how many packets? • Linearity is the power! • Sketch(data A) + Sketch(data B) = Sketch(data A + data B) • Sketch(data A) − Sketch(data B) = Sketch(data A − data B) • Two sketches are sufficient to compute something on the difference or sum
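Linearity is easy to verify on any linear sketch. The toy ±1 sketch below (an illustration, with the same simulated sign hash as the F_2 example) shows that subtracting two routers' sketches gives exactly the sketch of the frequency difference, whose squared norm estimates how much the traffic differs:

```python
import random

def sign(i, j):
    return 1 if random.Random(1_000_003 * j + i).random() < 0.5 else -1

def sketch(stream, k=500):
    """Linear ±1 sketch of the stream's frequency vector."""
    y = [0] * k
    for x in stream:
        for j in range(k):
            y[j] += sign(x, j)
    return y

a = [1, 2, 3, 2]          # packets seen at router A
b = [1, 2, 2, 4]          # packets seen at router B
diff = [p - q for p, q in zip(sketch(a), sketch(b))]   # Sketch(A) - Sketch(B)
# frequency difference is +1 for element 3 and -1 for element 4, so F2 = 2
est_f2_of_diff = sum(v * v for v in diff) / len(diff)
```

No coordination is needed: each router sketches its own stream, and only the two small vectors are shipped and subtracted.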

  15. Common primitive: estimate a sum • Given: n quantities a_1, …, a_n in the range [0,1] • Goal: estimate S = a_1 + … + a_n “cheaply” • Standard sampling: pick a random set J of m indices • Estimator: S̃ = (n/m) · Σ_{i∈J} a_i • Chebyshev bound: the error guarantee holds with 90% success probability • For constant additive error, need m = Ω(n) samples
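The standard estimator is two lines of Python (a quick sketch; the input values are illustrative, not from the slides):

```python
import random

def sample_sum(a, m, rng):
    """Estimate sum(a) from m samples with replacement, scaled by n/m."""
    n = len(a)
    return (n / m) * sum(a[rng.randrange(n)] for _ in range(m))
```

Each sample has expectation S/n, so the scaled sum is unbiased; Chebyshev then bounds the deviation in terms of m.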

  16. Precision Sampling Framework • Alternative “access” to the a_i’s: • For each term a_i, we get a (rough) estimate ã_i • up to some precision u_i, chosen in advance: |a_i − ã_i| < u_i • Challenge: achieve a good trade-off between • quality of the approximation to S • using only weak precisions (minimize the “cost” of estimating the a_i’s)

  17. Formalization • What is cost? • Here, average cost = (1/n) · Σ_i 1/u_i • to achieve precision u_i, one uses “resources” 1/u_i: e.g., if ã_i is itself a sum computed by subsampling, then one needs 1/u_i samples • For example, can choose all u_i = 1/n • Average cost ≈ n • The game between the Sum Estimator and an Adversary: • 1. Estimator fixes the precisions u_i; Adversary fixes the a_i’s • 2. Adversary fixes ã_i’s s.t. |a_i − ã_i| < u_i • 3. given the ã_i’s, the Estimator outputs S̃ approximating S = Σ a_i

  18. Precision Sampling Lemma [A-Krauthgamer-Onak’11] • Goal: estimate ∑ a_i from {ã_i} satisfying |a_i − ã_i| < u_i • Precision Sampling Lemma: can get, with 90% success: • O(1) additive error and 1.5 multiplicative error: S − O(1) < S̃ < 1.5·S + O(1) • with average cost equal to O(log n) • More generally: additive error ε and multiplicative error 1+ε, i.e. S − ε < S̃ < (1+ε)·S + ε, with average cost O(ε⁻³ log n) • Example: distinguish Σa_i = 3 vs Σa_i = 0 • Consider two extreme cases: • if three a_i = 1: enough to have a crude approximation for all (u_i = 0.1) • if all a_i = 3/n: need only a few with good approximation u_i = 1/n, and the rest with u_i = 1

  19. Precision Sampling Algorithm • Precision Sampling Lemma: can get, with 90% success: • O(1) additive error and 1.5 multiplicative error: S − O(1) < S̃ < 1.5·S + O(1) • with average cost equal to O(log n) • Algorithm: • Choose each u_i ∈ [0,1] i.i.d. uniformly • Estimator: S̃ = count of the i’s s.t. ã_i / u_i > 6 (up to a normalization constant) • Proof of correctness: • we use only the ã_i which are 1.5-approximations to a_i • E[S̃] ≈ ∑ Pr[a_i / u_i > 6] = ∑ a_i / 6 • the average cost (1/n) ∑ 1/u_i = O(log n) w.h.p. • For the 1+ε guarantee (cost O(ε⁻³ log n)): use a concrete distribution, u_i = minimum of O(ε⁻³) uniform random variables, and an estimator that is a function of [ã_i/u_i − 4/ε]₊ and the u_i’s
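A simulation of the basic estimator (my sketch; it feeds the estimator exact values ã_i = a_i, the easiest adversary, and checks only that the count is unbiased after normalization):

```python
import random

def psl_estimate(a, rng):
    """Draw u_i uniform; count the i with a_i/u_i > 6; normalize by 6."""
    count = 0
    for ai in a:
        u = rng.random()
        if u > 0 and ai > 6 * u:   # a_i/u_i > 6, which has Pr = a_i/6 for a_i <= 6
            count += 1
    return 6 * count
```

The indicator for each i fires with probability a_i/6, so the scaled count has expectation Σ a_i; averaging the Σa_i = 3 example from slide 18 over many runs recovers 3.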

  20. F_k moments (k > 2) via precision sampling • Theorem: linear sketch for F_k with 1+ε approximation, and O(n^{1−2/k} · poly(1/ε, log n)) space (90% succ. prob.). • Sketch: • Pick random u_i’s, and let y_i = x_i / u_i^{1/k} • throw the y_i’s into one hash table H, • with O(n^{1−2/k}) cells (for constant ε) • Estimator: read off the large cells of H • Randomness: bounded independence suffices

  21. Streaming++ • LOTS of work in the area: • Surveys: • Muthukrishnan: http://algo.research.googlepages.com/eight.ps • McGregor: http://people.cs.umass.edu/~mcgregor/papers/08-graphmining.pdf • Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11/Notes/lecnotes.pdf • Open problems: http://sublinear.info • Examples: • Moments, sampling • Median estimation, longest increasing subsequence • Graph algorithms • E.g., dynamic graph connectivity [AGG’12, KKM’13, …] • Numerical algorithms (e.g., regression, SVD approximation) • Fastest (sparse) regression […CW’13, MM’13, KN’13, LMP’13] • related to Compressed Sensing
