310 likes | 443 Views
Sublinear Algorihms for Big Data. Lecture 2. Grigory Yaroslavtsev http://grigory.us. Recap. (Markov) For every ( Chebyshev ) For every ( Chernoff ) Let be independent and identically distributed r.vs with range [0, c] and expectation . Then if and. Today. Approximate Median
E N D
SublinearAlgorihms for Big Data Lecture 2 GrigoryYaroslavtsev http://grigory.us
Recap • (Markov) For every • (Chebyshev) For every • (Chernoff) Let be independent and identically distributedr.vs with range [0, c] and expectation . Then if and
Today • Approximate Median • Alon-Mathias-Szegedy Sampling • Frequency Moments • Distinct Elements • Count-Min
Data Streams • Stream: elements from universe , e.g. • = frequency of in the stream = # of occurrences of value
Approximate Median • (all distinct) and let • Problem: Find -approximate median, i.e. such that • Exercise: Can we approximate the value of the median with additive error in sublinear time? • Algorithm: Return the median of a sample of size taken from (with replacement).
Approximate Median • Problem: Find -approximate median, i.e. such that • Algorithm: Return the median of a sample of size taken from (with replacement). • Claim: If then this algorithm gives -median with probability
Approximate Median • Partition into 3 groups • Key fact: If less than elements from each of and are in sample then its median is in • Let if -th sample is in and 0 otherwise. • Let . By Chernoff, if • Same for + union bound error probability
AMS Sampling • Problem: Estimate , for an arbitrary function with • Estimator: Sample ,where is sampled uniformly at random from and compute: Output: • Expectation:
Frequency Moments • Define for • # number of distinct elements • # elements • = “Gini index”, “surprise index”
Frequency Moments • Define for • Use AMS estimator with • Exercise: , where • Repeat times and take average . By Chernoff: • Taking gives
Frequency Moments • Lemma: • Result: memory suffices for -approximation of • Question: What if we don’t know • Then we can use probabilistic guessing (similar to Morris’s algorithm), replacing with .
Frequency Moments • Lemma: • Exercise:Hint: worst-case when . Use convexity of • Case 1:
Frequency Moments • Lemma: • Case 2:
Hash Functions • Definition: A family of functions from is -wise independent if for any distinct and • Example: If for prime is a -wise independent family of hash functions.
Linear Sketches • Sketching algorithm: picks a random matrix where and computes • Can be incrementally updated: • We have a sketch • When arrives, new frequencies are • Updating the sketch: • Need to choose random matrices carefully
Problem: -approximation for • Algorithm: • Let , where entries of each row are -wise independent and rows are independent • Don’t store the matrix: 4-wise independent hash functions • Compute average squared entries “appropriately” • Analysis: • Let be any entry of . • Lemma: • Lemma: Var
Let be a row of with entries • We used 2-wise independence for .
: Distinct Elements • Problem:-approximation for • Simplified: For fixed with prob. distinguish: vs. • Original problem reduces by trying values of T:
: Distinct Elements • Simplified: For fixed with prob. distinguish: vs. • Algorithm: • Choose random sets where • Compute • If at least of the values are zero, output
vs. • Algorithm: • Choose random sets where • Compute • If at least of the values are zero, output • Analysis: • If , then • If , then • Chernoff: gives correctness w.p.
vs. • Analysis: • If , then • If , then • If is large and is small then: • If • If
Count-Min Sketch • https://sites.google.com/site/countminsketch/ • Stream: elements from universe , e.g. • = frequency of in the stream = # of occurrences of value • Problems: • Point Query: For estimate • Range Query: For estimate • Quantile Query: For find with • Heavy Hitters: For find all with
Count-Min Sketch: Construction • Let be -wise independent hash functions • We maintain counters with values: elements in the stream with • For every the value and so: • If and then: .
Count-Min Sketch: Analysis • Define random variables such that • Define if and 0 otherwise: • By -wise independence: • By Markov inequality,
Count-Min Sketch: Analysis • All are independent • The w.p. there exists such that • CountMin estimates values up to with total memory
Dyadic Intervals • Define partitions of : … • Exercise: Any interval can be written as a disjoint union of at most such intervals. • Example: For
Count-Min: Range Queries and Quantiles • Range Query: For estimate • Approximate median: Find such that: and
Count-Min: Range Queries and Quantiles • Algorithm: Construct Count-Min sketches, one for each such that for any we have an estimate for such that: • To estimate let be decomposition: • Hence,
Count-Min: Heavy Hitters • Heavy Hitters: For find all with but no elements with • Algorithm: • Consider binary tree whose leaves are [n] and associate internal nodes with intervals corresponding to descendant leaves • Compute Count-Min sketches for each • Level-by-level from root, mark children of marked nodes if • Return all marked leaves • Finds heavy-hitters in steps
Thank you! • Questions?