Streaming Algorithms

Streaming Algorithms CS6234 Advanced Algorithms, February 10, 2015

The stream model • Data enters sequentially at a rapid rate from one or more inputs • We cannot store the entire stream • Processing happens in real time • Memory is limited (usually sublinear in the size of the stream) • Goal: compute a function of the stream, e.g., median, number of distinct elements, longest increasing subsequence • An approximate answer is usually preferable

Overview • Counting bits with DGIM algorithm • Bloom Filter • Count-Min Sketch • Approximate Heavy Hitters • AMS Sketch • AMS Sketch Applications

Counting bits with DGIM algorithm Presented by Dmitrii Kharkovskii

Sliding windows • A useful model: queries are about a window of length N • The window holds the N most recent elements received (or the last N time units) • Interesting case: N is so large that the window cannot be stored • Or there are so many streams that windows for all of them cannot be stored

Problem description • Problem: given a stream of 0’s and 1’s, answer queries of the form “how many 1’s are in the last k bits?” where k ≤ N • Obvious solution: store the most recent N bits (i.e., window size = N); when a new bit arrives, discard the (N+1)st bit • Real problem: slow, since we need to scan k bits to count • What if we cannot afford to store N bits? Estimate with an approximate answer
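The obvious solution above can be sketched in a few lines of Python. This is only an illustration of the baseline that DGIM improves on; the class name `ExactWindowCounter` and its methods are hypothetical, not from the slides.

```python
from collections import deque
from itertools import islice


class ExactWindowCounter:
    """Keeps the last N bits explicitly: O(N) memory, O(k) time per query."""

    def __init__(self, n):
        # A bounded deque discards the (N+1)st most recent bit automatically.
        self.window = deque(maxlen=n)

    def add(self, bit):
        self.window.append(bit)

    def count_ones(self, k):
        # Scan the k most recent bits, newest first.
        return sum(islice(reversed(self.window), k))
```

Both costs named on the slide are visible here: the deque holds all N bits, and every query scans k of them.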

Datar-Gionis-Indyk-Motwani Algorithm (DGIM) • Overview • Approximate answer • Uses $O(\log^2 N)$ bits of memory • Performance guarantee: error no more than 50% • Possible to decrease the error to any fraction 𝜀 > 0 with $O(\log^2 N)$ memory • Possible to generalize to a stream of positive integers

Main idea of the algorithm Represent the window as a set of exponentially growing non-overlapping buckets

Timestamps • Each bit in the stream has a timestamp: its position in the stream from the beginning • Record timestamps modulo N (window size), which takes $O(\log N)$ bits • Store the most recent timestamp to identify the position of any other bit in the window

Buckets • Each bucket has two components • Timestamp of its most recent end: needs $O(\log N)$ bits • Size of the bucket, i.e., the number of ones in it: the size is always $2^j$, so to store j we need $O(\log \log N)$ bits • Each bucket needs $O(\log N)$ bits

Representing the stream by buckets The right end of a bucket is always a position with a 1. Every position with a 1 is in some bucket. Buckets do not overlap. There are one or two buckets of any given size, up to some maximum size. All sizes must be a power of 2. Buckets cannot decrease in size as we move to the left (back in time).

Updating buckets when a new bit arrives • Drop the last (oldest) bucket if it no longer overlaps the window • If the current bit is zero, no other changes are needed • If the current bit is one: create a new bucket for it with size = 1 and timestamp = current time modulo N • If there are now 3 buckets of size 1, merge the two oldest into one of size 2 • If there are now 3 buckets of size 2, merge the two oldest into one of size 4 • ...

Example of updating process

Query Answering • How many ones are in the most recent k bits? • Find all buckets overlapping with the last k bits • Sum the sizes of all but the oldest one • Add half of the size of the oldest one • Example: Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24
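The update and query rules can be put together in a short Python sketch. It is a simplified illustration under stated assumptions: timestamps are kept as absolute positions instead of modulo N, buckets are stored newest first in a plain list, and the names `DGIM`, `add` and `count_ones` are hypothetical.

```python
class DGIM:
    """Approximate count of 1s in a sliding window (sketch, absolute timestamps)."""

    def __init__(self, n):
        self.n = n          # window length N
        self.time = 0       # position of the most recent bit
        self.buckets = []   # (timestamp of most recent end, size), newest first

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent end has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, (self.time, 1))
        # While three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            ts, size = self.buckets[i + 1]  # merged bucket keeps the newer end
            self.buckets[i + 1:i + 3] = [(ts, 2 * size)]
            i += 1

    def count_ones(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        total = 0
        oldest = 0
        for ts, size in self.buckets:
            if ts <= self.time - k:
                break       # this bucket ends before the last k bits
            total += size
            oldest = size
        # All newer buckets in full, plus half of the oldest overlapping one.
        # A size-1 oldest bucket is counted in full because its single 1 is in range.
        return total - oldest // 2
```

Feeding seven consecutive 1s leaves buckets of sizes 1, 2 and 4, and a query over the last 7 bits returns 1 + 2 + 4/2 = 5 against a true count of 7, which is within the 50% bound.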

Memory requirements

Performance guarantee • Suppose the last (oldest) bucket has size $2^r$ • By taking half of it, the maximum error is $2^{r-1}$ • There is at least one bucket of every size less than $2^r$ • The true sum is therefore at least $1 + 2 + 4 + \dots + 2^{r-1} = 2^r - 1$ from those buckets • The right end of the last bucket is always a 1, so the true sum is at least $2^r$ • Error is at most 50%

References • J. Leskovec, A. Rajaraman, J. Ullman. “Mining of Massive Datasets”. Cambridge University Press

Bloom Filter Presented by Naheed Anjum Arafat

Motivation: The “Set Membership” Problem • x: an element • S: a finite set of elements • Input: x, S • Output: True (if x in S), False (if x not in S) • Streaming algorithm: limited space per item, limited processing time per item, approximate answer based on a summary/sketch of the data stream in memory • Exact solution: binary search on a sorted array of size |S| • Runtime complexity: O(log|S|)

Bloom Filter • Consists of • a vector of n Boolean values, initially all set to false (space complexity: O(n)) • k independent and uniform hash functions $h_0, h_1, \dots, h_{k-1}$, each of which outputs a value in the range {0, 1, … , n-1} • Example: n = 10

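Bloom Filter • For each element s ∈ S, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true • Complexity of insertion: O(k) • Example (k = 3): the element hashes to positions 1, 6 and 4, and those three values are set to T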

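Bloom Filter • For each element s ∈ S, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true • Note: a particular Boolean value may be set to true several times • Example (k = 3): a second element hashes to positions 4, 9 and 7; position 4 is already true, so only two new values become T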

Algorithm to Approximate Set Membership Query • Input: x (may or may not be an element of S) • Output: Boolean • For all i ∈ {0, 1, …, k-1}: if the Boolean value at position $h_i(x)$ is False, return False • Return True • Runtime complexity: O(k)
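Insertion and the membership query fit in a small Python sketch. One assumption to note: the k hash functions are simulated by salting SHA-256 with the index i, which stands in for the independent uniform hash functions of the slides. The names `BloomFilter`, `add` and `might_contain` are illustrative.

```python
import hashlib


class BloomFilter:
    def __init__(self, n, k):
        self.n = n                  # number of Boolean values
        self.k = k                  # number of hash functions
        self.bits = [False] * n

    def _positions(self, item):
        # Assumed stand-in for h_0 .. h_{k-1}: SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        # O(k): set the value at every hashed position.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # O(k): False is always correct, True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))
```

A typical use is `bf = BloomFilter(1000, 3)`, then `bf.add("apple")`; afterwards `bf.might_contain("apple")` is always True, while a query for an element that was never added is usually, but not always, False.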

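Algorithm to Approximate Set Membership Query • False positive! • Example (k = 3): one inserted element hashes to positions 6, 4, 1 and another to 4, 9, 7 • A queried element that hashes to 6, 1, 9 finds all three positions set, so the filter answers True although the element was never inserted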

Error Types • False Negative – Answering “is not there” for an element which “is there” • Never happens for a Bloom Filter • False Positive – Answering “is there” for an element which “is not there” • Might happen. How likely?

Probability of false positives • n = size of table, m = number of items, k = number of hash functions • Consider a particular bit j, 0 <= j <= n-1 • Probability that $h_i$ does not set bit j after hashing only 1 item: $1 - \frac{1}{n}$ • Probability that $h_i$ does not set bit j after hashing m items: $\left(1 - \frac{1}{n}\right)^{m}$

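Probability of false positives • n = size of table, m = number of items, k = number of hash functions • Probability that none of the hash functions sets bit j after hashing m items: $\left(1 - \frac{1}{n}\right)^{km}$ • We know that, for large n, $\left(1 - \frac{1}{n}\right)^{n} \approx \frac{1}{e} = e^{-1}$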

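Probability of false positives • n = size of table, m = number of items, k = number of hash functions • Approximate probability of a false positive • Probability that bit j is not set: $\left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}$ • Probability that all k bits of a new element are already set: $\left(1 - e^{-km/n}\right)^{k}$ • For fixed m and n, which value of k minimizes this bound? $k_{opt} = \frac{n}{m} \ln 2$, where n/m is the number of bits per item • The resulting probability of a false positive: $\left(\frac{1}{2}\right)^{k_{opt}} \approx (0.6185)^{n/m}$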
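The standard approximation for the false positive probability, $(1 - e^{-km/n})^k$ with optimum $k = (n/m) \ln 2$, is easy to evaluate numerically. The helper names below are hypothetical and the snippet only restates those two formulas.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate probability that all k bits of a new element are already set."""
    return (1.0 - math.exp(-k * m / n)) ** k


def optimal_k(n, m):
    """Number of hash functions that minimizes the bound for fixed n and m."""
    return (n / m) * math.log(2)
```

For example, with 10 bits per item (n = 1000, m = 100) `optimal_k` gives about 6.9, and using k = 7 hash functions yields a false positive rate of roughly 0.8%, in line with $(0.6185)^{10}$.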

Bloom Filters: cons • Small false positive probability • Cannot handle deletions • The size of the bit vector has to be set a priori in order to maintain a predetermined false positive rate • Resolved in “Scalable Bloom Filter” – Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), "Scalable Bloom Filters", Information Processing Letters 101 (6): 255–261

References • https://en.wikipedia.org/wiki/Bloom_filter • Graham Cormode, Sketch Techniques for Approximate Query Processing, ATT Research • Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge

Count-Min Sketch Erick Purwanto A0050717L

Motivation Count-Min Sketch • Implemented in real systems • AT&T: network switch to analyze network traffic using limited memory • Google: implemented on top of MapReduce parallel processing infrastructure • Simple and used to solve other problems • Heavy Hitters by Joseph • Second Moment, AMS Sketch by Manupa • Inner Product, Self Join by Sapumal

Frequency Query • Given a data stream of length $n$ whose items come from a universe $[1..m]$, and an update (increment) operation $Inc(j)$, • we want to know at each time what the frequency $f_j$ of item $j$ is • assume frequency $f_j \ge 0$ • Trivial if we have a count array of size $m$ • we want sublinear space • probabilistically approximately correct

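Count-Min Sketch • Assumption: • a family $H$ of 2-independent (pairwise independent) hash functions • sample $d$ hash functions $h_1, \dots, h_d$ from $H$, with $h_i : [1..m] \to [1..w]$ • Use: $d$ indep. hash func. and an integer array CM[$d$, $w$] with $d$ rows and $w$ columns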

Count-Min Sketch • Algorithm to Update: • $Inc(j)$: for each row $i = 1, \dots, d$, CM[$i$, $h_i(j)$] += 1

Count-Min Sketch • Algorithm to estimate Frequency Query: • $Count(j)$: $\hat{f}_j = \min_{1 \le i \le d}$ CM[$i$, $h_i(j)$]
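A minimal Python sketch of the update and query operations follows. It assumes integer items and uses the textbook pairwise-independent family $h(x) = ((ax + b) \bmod p) \bmod w$ with a fixed prime $p = 2^{61} - 1$; the class and method names are illustrative, and rows and columns are 0-indexed.

```python
import random


class CountMinSketch:
    P = (1 << 61) - 1  # a Mersenne prime, assumed larger than any item

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w = w  # columns per row
        self.d = d  # rows, one hash function each
        self.table = [[0] * w for _ in range(d)]
        # One (a, b) pair per row defines h_i(x) = ((a*x + b) mod P) mod w.
        self.coeffs = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]

    def _col(self, i, item):
        a, b = self.coeffs[i]
        return ((a * item + b) % self.P) % self.w

    def inc(self, item):
        # Update: increment one counter in every row.
        for i in range(self.d):
            self.table[i][self._col(i, item)] += 1

    def count(self, item):
        # Query: the minimum over rows is the least contaminated estimate.
        return min(self.table[i][self._col(i, item)] for i in range(self.d))
```

Because counters are only ever incremented, `count` never returns less than the true frequency; collisions can only push the estimate up.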

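Collision • Entry CM[$i$, $h_i(j)$] is an estimate of the frequency of item $j$ at row $i$ • it overestimates whenever other items collide with $j$ in that row • Let $f_j$: frequency of $j$, and random variable $X_{i,j}$: total frequency of all items $\ell \ne j$ with $h_i(\ell) = h_i(j)$ at row $i$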

Count-Min Sketch Analysis • Estimate of the frequency of $j$ at row $i$: $\hat{f}_{i,j} = $ CM[$i$, $h_i(j)$] $= f_j + X_{i,j}$

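Count-Min Sketch Analysis • Let $\varepsilon$: approximation error, and set $w = e / \varepsilon$ • The expectation of the other items' contribution: $E[X_{i,j}] = \sum_{\ell \ne j} f_\ell \cdot \Pr[h_i(\ell) = h_i(j)] \le \frac{n}{w} = \frac{\varepsilon}{e} n$.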

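Count-Min Sketch Analysis • Markov Inequality: for a nonnegative random variable $X$ and $c > 0$, $\Pr[X \ge c \cdot E[X]] \le \frac{1}{c}$ • Probability that a row estimate is far from the true value: $\Pr[\hat{f}_{i,j} > f_j + \varepsilon n] = \Pr[X_{i,j} > \varepsilon n] \le \Pr[X_{i,j} \ge e \cdot E[X_{i,j}]] \le \frac{1}{e}$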

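Count-Min Sketch Analysis • Let $\delta$: failure probability, and set $d = \ln(1/\delta)$ • Probability that the final estimate is far from the true value: $\Pr[\hat{f}_j > f_j + \varepsilon n] = \Pr[\forall i : \hat{f}_{i,j} > f_j + \varepsilon n] \le \left(\frac{1}{e}\right)^{d} = \delta$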

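Count-Min Sketch • Result • dynamic data structure CM, item frequency query • set $w = e/\varepsilon$ and $d = \ln(1/\delta)$ • with probability at least $1 - \delta$, $f_j \le \hat{f}_j \le f_j + \varepsilon n$ • sublinear space, the number of counters $w \cdot d$ does not depend on $n$ nor $m$ • running time: update $O(d)$ and freq. query $O(d)$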
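To make the sizing concrete, the snippet below computes the table dimensions from the two accuracy parameters. It assumes the common choice $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$ (rounded up to integers); the function name `cms_dimensions` is hypothetical.

```python
import math


def cms_dimensions(eps, delta):
    """Return (w, d) for additive error eps*n with failure probability delta."""
    w = math.ceil(math.e / eps)           # columns per row
    d = math.ceil(math.log(1.0 / delta))  # rows, i.e. number of hash functions
    return w, d
```

For instance, `cms_dimensions(0.01, 0.01)` gives 272 columns and 5 rows, so 1360 counters suffice regardless of the stream length or the number of distinct items.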

Approximate Heavy Hitters TaeHoon Joseph, Kim

Count-Min Sketch (CMS) • $Inc(j)$ takes $O(d)$ time • update $d$ values • $Count(j)$ takes $O(d)$ time • return the minimum of $d$ values

Heavy Hitters Problem • Input: • An array of length $n$ with $m$ distinct items • Objective: • Find all items that occur more than $n/k$ times in the array • there can be at most $k$ such items • Parameter $k$

Heavy Hitters Problem: Naïve Solution • Trivial solution is to use a count array with one entry per distinct item • Store all items and each item’s frequency • Find all items that have frequency greater than $n/k$
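The exact baseline is a single counting pass, sketched here in Python with a dictionary in place of the count array. The function name `heavy_hitters_exact` is illustrative; memory grows linearly with the number of distinct items, which is what the sketch-based approach avoids.

```python
from collections import Counter


def heavy_hitters_exact(items, k):
    """Return every item that occurs more than n/k times, using one counter per distinct item."""
    n = len(items)
    counts = Counter(items)
    return {x for x, c in counts.items() if c > n / k}
```

For example, `heavy_hitters_exact([1, 1, 1, 2, 2, 3, 4, 1], 4)` returns `{1}`: the threshold is 8/4 = 2, and only item 1 (four occurrences) exceeds it.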

$\varepsilon$-Heavy Hitters Problem ($\varepsilon$-HH) • Relax the Heavy Hitters Problem • The exact problem cannot be solved in sub-linear space • parameters: $k$ and $\varepsilon$

$\varepsilon$-Heavy Hitters Problem ($\varepsilon$-HH) • Returns every item that occurs more than $n/k$ times • May also return some items that occur more than $n/k - \varepsilon n$ times • Solved with a Count-Min Sketch


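Naïve Solution using CMS • Query the frequency of all items: $Count(j)$ for every $j \in [1..m]$ • Return items with $Count(j) \ge n/k$ • slow: one $O(d)$ query for each of the $m$ items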
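The naïve approach can be sketched as follows. The snippet is self-contained, so it rebuilds a small Count-Min table inline; integer items, the hash family $((ax + b) \bmod p) \bmod w$, and all function names are assumptions made for illustration.

```python
import random

P = (1 << 61) - 1  # prime modulus for the assumed hash family


def build_sketch(stream, w, d, seed=0):
    """Feed every item of the stream into a d x w Count-Min table."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    table = [[0] * w for _ in range(d)]
    for item in stream:
        for i, (a, b) in enumerate(coeffs):
            table[i][((a * item + b) % P) % w] += 1
    return table, coeffs


def count(table, coeffs, item):
    """Count-Min estimate: minimum counter over all rows."""
    w = len(table[0])
    return min(table[i][((a * item + b) % P) % w]
               for i, (a, b) in enumerate(coeffs))


def naive_heavy_hitters(stream, universe, k, w, d):
    """Query every candidate item; this loop over the universe is the slow part."""
    table, coeffs = build_sketch(stream, w, d)
    n = len(stream)
    return [j for j in universe if count(table, coeffs, j) >= n / k]
```

Since the estimate never falls below the true frequency, every true heavy hitter is reported; the price is one query per item of the universe, plus occasional extra items whose counts were inflated by collisions.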
