
Streaming Algorithms

This presentation discusses streaming algorithms for processing data streams in real time and computing functions of the stream, such as counting bits, estimating heavy hitters, and answering approximate set-membership queries with Bloom filters.


Presentation Transcript


  1. Streaming Algorithms CS6234 Advanced Algorithms, February 10, 2015

  2. The stream model Data sequentially enters at a rapid rate from one or more inputs. We cannot store the entire stream; processing happens in real time with limited memory (usually sublinear in the size of the stream). Goal: compute a function of the stream, e.g., the median, the number of distinct elements, or the longest increasing subsequence. An approximate answer is usually preferable.

  3. Overview Counting bits with DGIM algorithm Bloom Filter Count-Min Sketch Approximate Heavy Hitters AMS Sketch AMS Sketch Applications

  4. Counting bits with DGIM algorithm Presented by Dmitrii Kharkovskii

  5. Sliding windows A useful model: queries are about a window of length N, the N most recent elements received (or the last N time units). Interesting case: N is still so large that the window cannot be stored. Or, there are so many streams that windows for all of them cannot be stored.

  6. Problem description Problem: Given a stream of 0's and 1's, answer queries of the form "how many 1's are in the last k bits?", where k ≤ N. Obvious solution: Store the most recent N bits (i.e., window size = N); when a new bit arrives, discard the (N+1)st-oldest bit. Real problem: this is slow, since we need to scan k bits to answer a query. And what if we cannot afford to store N bits? Then we estimate with an approximate answer.
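The obvious solution above can be sketched in a few lines; `ExactWindowCounter` is a hypothetical helper name, not from the slides:

```python
from collections import deque

class ExactWindowCounter:
    """The obvious solution: store the most recent N bits and scan
    them to answer a query. O(N) space, O(k) time per query."""

    def __init__(self, N):
        # deque with maxlen=N automatically discards the (N+1)st-oldest bit
        self.window = deque(maxlen=N)

    def update(self, bit):
        self.window.append(bit)

    def query(self, k):
        # scan the last k bits and count the 1s
        return sum(list(self.window)[-k:])
```

The O(N) space and O(k) query time are exactly the costs the DGIM algorithm below avoids.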

  7. Datar-Gionis-Indyk-Motwani Algorithm (DGIM) Overview Approximate answer. Uses O(log² N) bits of memory. Performance guarantee: error no more than 50%. Possible to decrease the error to any fraction ε > 0 with O((1/ε) log² N) memory. Possible to generalize to streams of positive integers.

  8. Main idea of the algorithm Represent the window as a set of exponentially growing non-overlapping buckets

  9. Timestamps Each bit in the stream has a timestamp: its position in the stream from the beginning. Record timestamps modulo N (the window size), using O(log N) bits. Store the most recent timestamp to identify the position of any other bit in the window.

  10. Buckets Each bucket has two components: The timestamp of its most recent end, which needs O(log N) bits. The size of the bucket, i.e., the number of ones in it. The size is always a power of two, 2^j, so to store j we need O(log log N) bits. Each bucket therefore needs O(log N) bits.

  11. Representing the stream by buckets The right end of a bucket is always a position with a 1. Every position with a 1 is in some bucket. Buckets do not overlap. There are one or two buckets of any given size, up to some maximum size. All sizes must be a power of 2. Buckets cannot decrease in size as we move to the left (back in time).

  12. Updating buckets when a new bit arrives Drop the oldest bucket if it no longer overlaps the window. If the current bit is zero, no changes are needed. If the current bit is one: Create a new bucket for it, with size = 1 and timestamp = current time modulo N. If there are now 3 buckets of size 1, merge the two oldest into one of size 2. If there are then 3 buckets of size 2, merge the two oldest into one of size 4. ...

  13. Example of updating process

  14. Query Answering How many ones are in the most recent k bits? Find all buckets overlapping with the last k bits. Sum the sizes of all but the oldest one. Add half of the size of the oldest one. Example: Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24
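The update and query rules of slides 12 and 14 can be put together in a compact sketch. This is a minimal illustration, not the authors' reference code: it stores absolute timestamps for clarity (a real implementation would store them modulo N, as slide 9 notes), and the class name `DGIM` is our own:

```python
class DGIM:
    """Minimal DGIM sketch: count 1s in the last k bits approximately,
    keeping O(log N) buckets of exponentially growing sizes."""

    def __init__(self, window_size):
        self.N = window_size
        self.time = 0
        self.buckets = []  # (timestamp of right end, size), newest first

    def update(self, bit):
        self.time += 1
        # drop the oldest bucket once it slides out of the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        # a 1-bit starts a new bucket of size 1
        self.buckets.insert(0, (self.time, 1))
        # whenever three buckets share a size, merge the two oldest
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]        # more recent end of the pair
                size = self.buckets[i + 1][1] * 2
                self.buckets[i + 1:i + 3] = [(ts, size)]
            else:
                i += 1

    def query(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        total, oldest_size = 0, 0
        for ts, size in self.buckets:
            if ts > self.time - k:   # bucket overlaps the last k bits
                total += size
                oldest_size = size
        # count all overlapping buckets fully, except only half the oldest
        return total - oldest_size // 2
```

For example, after streaming eight 1s with N = 8, the buckets are (sizes, newest first) 1, 1, 2, 4, and `query(8)` returns 8 − 4/2 = 6, within the 50% guarantee of the true count 8.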

  15. Memory requirements

  16. Performance guarantee Suppose the oldest overlapping bucket has size 2^j. By taking half of it, the maximum error is 2^(j−1). There is at least one bucket of every size less than 2^j, so the true sum is at least 1 + 2 + 4 + … + 2^(j−1) = 2^j − 1. Moreover, the rightmost bit of the oldest bucket is always a 1. Hence the error is at most 50%.

  17. References J. Leskovec, A. Rajaraman, J. Ullman. "Mining of Massive Datasets". Cambridge University Press

  18. Bloom Filter Presented by Naheed Anjum Arafat

  19. Motivation: The "Set Membership" Problem • x: an element • S: a finite set of elements • Input: x, S • Output: • True (if x is in S) • False (if x is not in S) • Streaming algorithm: • Limited space per item • Limited processing time per item • Approximate answer based on a summary/sketch of the data stream in memory. Classical solution: binary search on a sorted array of size |S|. Runtime complexity: O(log |S|)

  20. Bloom Filter • Consists of • a vector of n Boolean values, initially all set to False (space: O(n)) • k independent and uniform hash functions h_1, h_2, …, h_k, each outputting a value in the range {0, 1, …, n−1} (figure example: n = 10)

  21. Bloom Filter • For each element s ∈ S, the Boolean values at positions h_1(s), h_2(s), …, h_k(s) are set to True. • Complexity of insertion: O(k) (figure example: k = 3, h_1(s) = 1, h_2(s) = 6, h_3(s) = 4)

  22. Bloom Filter • For each element s ∈ S, the Boolean values at positions h_1(s), h_2(s), …, h_k(s) are set to True. Note: a particular Boolean value may be set to True several times. (figure example: k = 3, h_1(s) = 4, h_2(s) = 9, h_3(s) = 7)

  23. Algorithm to Approximate Set Membership Query Input: x (which may or may not be an element of S) Output: Boolean For all i ∈ {1, …, k}: if the bit at position h_i(x) is False, return False. Return True. Runtime complexity: O(k)
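The insertion and query procedures above fit in a small class. This is a minimal sketch: the slides leave the hash functions abstract, so the double-hashing scheme below (deriving k positions from two halves of a SHA-256 digest) is our own illustrative choice, as is the class name `BloomFilter`:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: n bits, k hash positions per item."""

    def __init__(self, n, k):
        self.n = n                  # number of Boolean values
        self.k = k                  # number of hash functions
        self.bits = [False] * n     # initially all False: O(n) space

    def _positions(self, item):
        # derive k positions via double hashing: h_i(x) = h1 + i*h2 mod n
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, item):
        # set the bit at each of the k positions: O(k) insertion
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True only if all k bits are set: O(k) query, may false-positive
        return all(self.bits[pos] for pos in self._positions(item))
```

A query that returns False is always correct; a query that returns True may be a false positive, as the next slide illustrates.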

  24. Algorithm to Approximate Set Membership Query False positive! (figure example: k = 3; elements with hash positions {1, 6, 4} and {4, 9, 7} were inserted; a query for x with positions {6, 1, 9} finds all three bits set to True and returns True, although x was never inserted.)

  25. Error Types • False negative: answering "is not there" for an element which is there • Never happens for a Bloom filter • False positive: answering "is there" for an element which is not there • Might happen. How likely?

  26. Probability of false positives n = size of table, m = number of items, k = number of hash functions. Consider a particular bit j, 0 ≤ j ≤ n−1. Probability that one hash function does not set bit j when hashing one item: 1 − 1/n. Probability that it does not set bit j after hashing m items: (1 − 1/n)^m

  27. Probability of false positives n = size of table, m = number of items, k = number of hash functions. Probability that none of the k hash functions sets bit j after hashing m items: (1 − 1/n)^{km}. We know that (1 − 1/n)^n ≈ e^{−1}, so (1 − 1/n)^{km} = ((1 − 1/n)^n)^{km/n} ≈ e^{−km/n}

  28. Probability of false positives n = size of table, m = number of items, k = number of hash functions. Probability that bit j is not set: e^{−km/n}. Approximate probability of a false positive, i.e., that all k bits of a new element are already set: (1 − e^{−km/n})^k. For fixed m and n, which value of k minimizes this bound? k_opt = (n/m) ln 2, where n/m is the number of bits per item. The false positive probability is then (1/2)^{k_opt} ≈ (0.6185)^{n/m}
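The two formulas on this slide are easy to evaluate numerically. The helper names below are our own, chosen for illustration:

```python
import math

def false_positive_rate(n, m, k):
    """Approximate FP probability (1 - e^{-km/n})^k from the slide."""
    return (1 - math.exp(-k * m / n)) ** k

def optimal_k(n, m):
    """The k minimizing the bound: (n/m) * ln 2."""
    return (n / m) * math.log(2)

# With 10 bits per item (n = 1000, m = 100):
#   optimal_k(1000, 100)            ≈ 6.93, so use k = 7
#   false_positive_rate(1000, 100, 7) ≈ 0.0082
```

So at 10 bits per item, the best achievable false positive rate is under 1%, matching the (0.6185)^{n/m} rule of thumb.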

  29. Bloom Filters: cons • Nonzero (if small) false positive probability • Cannot handle deletions • The size of the bit vector has to be set a priori in order to maintain a predetermined FP rate. Resolved in "Scalable Bloom Filters": Almeida, Paulo; Baquero, Carlos; Preguiça, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101 (6): 255–261

  30. References • https://en.wikipedia.org/wiki/Bloom_filter • Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research • Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge

  31. Count-Min Sketch Erick Purwanto A0050717L

  32. Motivation Count-Min Sketch • Implemented in real systems • AT&T: network switches analyze network traffic using limited memory • Google: implemented on top of the MapReduce parallel processing infrastructure • Simple, and used to solve other problems • Heavy Hitters by Joseph • Second Moment, AMS Sketch by Manupa • Inner Product, Self Join by Sapumal

  33. Frequency Query • Given a stream of update (increment) operations to a data vector of length n • We want to know, at each time, the frequency f_i of item i • Assume f_i ≥ 0 • Trivial if we keep a count array of length n • We want sublinear space • Probabilistically approximately correct

  34. Count-Min Sketch • Assumption: • a family of pairwise-independent hash functions • sample d hash functions h_1, …, h_d • Use: the d independent hash functions and a d × w integer array CM[][]

  35. Count-Min Sketch • Algorithm to Update: • Inc(i): for each row j, CM[j][h_j(i)] += 1

  36. Count-Min Sketch • Algorithm to estimate a Frequency Query: • Count(i) = min over rows j of CM[j][h_j(i)]

  37. Collision • Entry CM[j][h_j(i)] is an estimate of the frequency of item i at row j • For example, items i and i′ with h_j(i) = h_j(i′) share a counter in row j • Let f_i be the frequency of i, and let the random variable X_{j,i} be the total frequency of all items i′ ≠ i with h_j(i′) = h_j(i) in row j

  38. Count-Min Sketch Analysis • Estimate of the frequency of i at row j: CM[j][h_j(i)] = f_i + X_{j,i}

  39. Count-Min Sketch Analysis • Let ε be the approximation error, and set w = e/ε • The expected contribution of the other items, with F = Σ_i f_i the total count: E[X_{j,i}] ≤ F/w = εF/e

  40. Count-Min Sketch Analysis • Markov's Inequality: Pr[X ≥ a] ≤ E[X]/a for nonnegative X • Probability a row's estimate is far from the true value: Pr[CM[j][h_j(i)] ≥ f_i + εF] = Pr[X_{j,i} ≥ εF] ≤ E[X_{j,i}]/(εF) ≤ 1/e

  41. Count-Min Sketch Analysis • Let δ be the failure probability, and set d = ln(1/δ) • The rows are independent, so the probability that the final estimate is far from the true value: Pr[min_j CM[j][h_j(i)] ≥ f_i + εF] ≤ (1/e)^d = δ

  42. Count-Min Sketch • Result • A dynamic data structure CM supporting item frequency queries • Set w = e/ε and d = ln(1/δ) • With probability at least 1 − δ: f_i ≤ Count(i) ≤ f_i + εF • Sublinear space O((1/ε) ln(1/δ)), depending on neither n nor the stream length • Running time O(ln(1/δ)) per update and per frequency query
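The structure of slides 34-36 and the parameter choices above can be sketched as follows. This is an illustration under stated assumptions: the slides only require pairwise-independent hashing, which we realize here with the standard ((a·x + b) mod p) mod w construction over integer items; the class name is our own:

```python
import math
import random

class CountMinSketch:
    """Count-Min Sketch with w = ceil(e/eps) columns and
    d = ceil(ln(1/delta)) rows, for non-negative integer items."""

    _P = (1 << 61) - 1  # Mersenne prime for the hash family

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]
        rnd = random.Random(seed)
        # one pairwise-independent hash h_j(x) = ((a*x + b) mod p) mod w per row
        self.params = [(rnd.randrange(1, self._P), rnd.randrange(self._P))
                       for _ in range(self.d)]

    def _hash(self, j, x):
        a, b = self.params[j]
        return ((a * x + b) % self._P) % self.w

    def update(self, x, count=1):
        # Inc: add count to one counter per row, O(d) time
        for j in range(self.d):
            self.table[j][self._hash(j, x)] += count

    def query(self, x):
        # Count: the minimum over rows never underestimates f_x
        return min(self.table[j][self._hash(j, x)] for j in range(self.d))
```

Because counts are non-negative, `query` can only overestimate, and by at most εF with probability at least 1 − δ.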

  43. Approximate Heavy Hitters TaeHoon Joseph, Kim

  44. Count-Min Sketch (CMS) • Update takes O(d) time • increments d counter values, one per row • Query takes O(d) time • returns the minimum of d counter values

  45. Heavy Hitters Problem • Input: • An array of length n with m distinct items • Objective: • Find all items that occur more than n/k times in the array • There can be at most k such items • Parameter: k

  46. Heavy Hitters Problem: Naïve Solution • The trivial solution is to use an array • Store all items and each item's frequency • Find all items that have frequencies > n/k

  47. ε-Heavy Hitters Problem (ε-HH) • Relaxation of the Heavy Hitters Problem • Requires sub-linear space, so it cannot solve the exact problem • Parameters: ε and k

  48. ε-Heavy Hitters Problem (ε-HH) • Returns every item that occurs more than n/k times • May also return some items that occur more than n/k − εn times • Can be solved with a Count-Min Sketch

  49. Naïve Solution using CMS

  50. Naïve Solution using CMS • Build a CMS over the stream, then query the frequency of every possible item • Return the items with estimated frequency > n/k • Slow: one query per item in the domain
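The naïve approach can be sketched end to end. This is an illustration, not the presenter's code: `heavy_hitters` and its parameters are hypothetical names, the CMS is inlined with the same ((a·x + b) mod p) mod w hashing assumption as before, and the final loop is exactly the slow query-every-item step the slide criticizes:

```python
import math
import random

def heavy_hitters(stream, domain_size, k, eps=0.001, delta=0.01, seed=1):
    """Naive epsilon-HH: one pass to build a Count-Min Sketch,
    then query every item in the domain (the slow part)."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1 / delta))
    p = (1 << 61) - 1
    rnd = random.Random(seed)
    params = [(rnd.randrange(1, p), rnd.randrange(p)) for _ in range(d)]
    table = [[0] * w for _ in range(d)]

    def h(j, x):
        a, b = params[j]
        return ((a * x + b) % p) % w

    # one pass over the stream: O(d) work per item
    for x in stream:
        for j in range(d):
            table[j][h(j, x)] += 1

    threshold = len(stream) / k
    # slow step: query the estimated frequency of every possible item
    return [x for x in range(domain_size)
            if min(table[j][h(j, x)] for j in range(d)) > threshold]
```

For a domain of size n this final loop costs O(n·d) time, which motivates the smarter heavy-hitter structures that follow.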
