250 likes | 380 Views
Sublinear Algorihms for Big Data. Lecture 3. Grigory Yaroslavtsev http://grigory.us. SOFSEM 2015. URL: http://www.sofsem.cz 41 st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15) When and where?
E N D
SublinearAlgorihms for Big Data Lecture 3 GrigoryYaroslavtsev http://grigory.us
SOFSEM 2015 • URL: http://www.sofsem.cz • 41st International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM’15) • When and where? • January 24-29, 2015. Czech Republic, Pec pod Snezkou • Deadlines: • August 1st (tomorrow!): Abstract submission • August 15th: Full papers (proceedings in LNCS) • I am on the Program Committee ;)
Today • Count-Min (continued), Count Sketch • Sampling methods • -sampling • -sampling • Sparse recovery • -sparse recovery • Graph sketching
Recap • Stream: elements from universe , e.g. • = frequency of in the stream = # of occurrences of value
Count-Min • are -wise independent hash functions • Maintain counters with values: elements in the stream with • For every the value and so: • If and then: .
More about Count-Min • Authors: Graham Cormode, S. Muthukrishnan [LATIN’04] • Count-Min is linear: Count-Min(S1 + S2) = Count-Min(S1) + Count-Min(S2) • Deterministic version: CR-Precis • Count-Min vs. Bloom filters • Allows to approximate values, not just 0/1 (set membership) • Doesn’t require mutual independence (only 2-wise) • FAQ and Applications: • https://sites.google.com/site/countminsketch/home/ • https://sites.google.com/site/countminsketch/home/faq
Fully Dynamic Streams • Stream: updates that define vector where . • Example: For • Count Sketch: Count-Min with random signs and median instead of min:
Count Sketch • In addition to use random signs • Estimate: • Parameters:
-Sampling • Stream: updates that define vector where . • -Sampling: Return random and :
Application: Social Networks • Each of people in a social network is friends with some arbitrary set of other people • Each person knows only about their friends • With no communication in the network, each person sends a postcard to Mark Z. • If Mark wants to know if the graph is connected, how long should the postcards be?
Optimal estimation • Yesterday: -approximate • space for • space for • New algorithm: Let be an -sample. Return , where is an estimate of • Expectation:
Optimal estimation • New algorithm: Let be an -sample. Return , where is an estimate of • Variance: • Exercise: Show that • Overall: • Apply average + median to copies
-Sampling: Basic Overview • Assume Weight by where where • For some value return if there is a unique such that • Probability is returned if is large enough: • Probability some value is returned repeat times.
-Sampling: Part 1 • Use Count-Sketch with parameters to sketch • To estimate and • Lemma: With high probability if • Corollary: With high probability if and • Exercise: for large
Proof of Lemma • Let • By the analysis of Count Sketch and by Markov: • If then • If , then where the last inequality holds with probability • Take median over repetitions high probability
-Sampling: Part 2 • Let if and otherwise • If there is a unique with then return • Note that if then and so , therefore • Lemma: With probability there is a unique such that . If so then • Thm: Repeat times. Space:
Proof of Lemma • Let We can upper-bound Similarly, • Using independence of , probability of unique with :
Proof of Lemma • Let We can upper-bound Similarly, • We just showed: • If there is a unique i, probability is:
-sampling • Maintain and -approximation to • Hash items using for • For each , maintain: • Lemma: At level there is a unique element in the streams that maps to (with constant probability) • Uniqueness is verified if . If so, then output as the index and as the count.
Proof of Lemma • Let and note that • For any • Probability there exists a unique such that , • Holds even if are only 2-wise independent
Sparse Recovery • Goal: Find such that is minimized among s with at most non-zero entries. • Definition: • Exercise: where are indices of largest • Using space we can find such that and
Count-Min Revisited • Use Count-Min with • For let for some row • Let be the indices with max. frequencies. Let be the event there doesn’t exist with • Then for • Because w.h.p. all ’s approx . up to
Sparse Recovery Algorithm • Use Count-Min with • Let be frequency estimates: • Let be with all but the k-th largest entries replaced by 0. • Lemma:
Let be indices corresponding to largest value of and . • For a vector and denote as the vector formed by zeroing out all entries of except for those in . + 2 + 2
Thank you! • Questions?