Streaming Algorithms. Joe Kelley, Data Engineer. July 2013
Leading Provider of Data Science & Engineering for Big Analytics. Accelerating Your Time to Value. IMAGINE. ILLUMINATE. IMPLEMENT. Strategy and Roadmap. Training and Education. Hands-On Data Science and Data Engineering
What is a Streaming Algorithm? • Operates on a continuous stream of data • Unknown or infinite size • Only one pass; options: • Store it • Lose it • Store an approximation • Limited processing time per item • Limited total memory
Why use a Streaming Algorithm? • Compare to typical “Big Data” approach: store everything, analyze later, scale linearly • Streaming Pros: • Lower latency • Lower storage cost • Streaming Cons: • Less flexibility • Lower precision (sometimes) • Answer? • Why not both?
General Techniques • Tunable Approximation • Sampling • Sliding window • Fixed number • Fixed percentage • Hashing: useful randomness
Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough
Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough • Algorithm: for each element e: with probability 0.01: store e else: throw out e Can lead to some insidious statistical “bugs”… (see the sketch below)
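A minimal, runnable Python sketch of this per-event sampling (the store() callable is a hypothetical stand-in for whatever sink you use):

import random

SAMPLE_RATE = 0.01  # keep roughly 1% of events

def sample_stream(stream, store):
    # Keep each event independently with probability SAMPLE_RATE
    for e in stream:
        if random.random() < SAMPLE_RATE:
            store(e)  # e.g. append to a file or table
        # otherwise the event is discarded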
Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough • Query: • How many errors has the average device encountered? • Answer: • SELECT AVG(n) FROM ( • SELECT COUNT(*) AS n FROM events • WHERE event = 'ERROR' • GROUP BY device_id • ) • Simple… but off by up to 100x, since each device had only 1% of its events sampled. • Can we just multiply by 100? Not quite: devices whose sampled count is zero vanish from the GROUP BY entirely, biasing the average.
Example 1: Sampling device error rates • Stream of (device_id, event, timestamp) • Scenario: • Not enough space to store everything • For simple queries, storing 1% is good enough • Better Algorithm: for each element e: if (hash(e.device_id) mod 100) == 0: store e else: throw out e Now 1% of devices have all of their events stored, so per-device queries need no correction. Choose what to hash carefully... or keep a separate sample hashed on each field you query by
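A Python sketch of the hash-based variant, assuming each event carries a device_id attribute; hashlib is used here because it is stable across processes (Python's built-in hash() is salted per process):

import hashlib

def keep_device(device_id, percent=1):
    # Deterministically keep all events for ~percent% of devices
    digest = hashlib.md5(str(device_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent

def sample_by_device(stream, store):
    for e in stream:
        if keep_device(e.device_id):
            store(e)  # every event for a sampled device is kept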
Example 2: Sampling fixed number Want to sample a fixed count (k), not a fixed percentage. Algorithm: Let arr = array of size k for each element e: if arr is not yet full: add e to arr else: with probability p: replace a random element of arr with e else: throw out e • Choice of p is crucial: • p = constant → prefers more recent elements (higher p = more recent) • p = k/n → sample uniformly from the entire stream (classic reservoir sampling; see the sketch below)
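A runnable Python sketch of the uniform p = k/n case (this is Algorithm R; picking a random index below n and replacing only when it lands in the reservoir gives exactly the k/n replacement probability):

import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of k elements from a stream
    # of unknown length: the n-th element is kept with probability k/n.
    arr = []
    for n, e in enumerate(stream, start=1):
        if len(arr) < k:
            arr.append(e)            # fill the reservoir first
        else:
            j = random.randrange(n)  # uniform in [0, n)
            if j < k:                # true with probability k/n
                arr[j] = e           # replace a uniformly random slot
    return arr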
Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Naïve approach: • Store all user_id’s in a list/tree/hashtable • Millions of users = a lot of memory • Better approach: • Store all user_id’s in a database • Good, but maybe it’s not fast enough… • What if an approximate count is ok?
Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Approximate count is ok • Flajolet-Martin Idea: • Hash each user_id into a bit string • Count the trailing zeros • Remember maximum number of trailing zeros seen
Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Intuition: • If we had seen 2 distinct users, we would expect 1 trailing zero • If we had seen 4, we would expect 2 trailing zeros • If we had seen 2^k, we would expect k trailing zeros • In general, if the maximum number of trailing zeros seen is r, then 2^r is a reasonable estimate of the number of distinct users • Want more precision? Use more independent hash functions, and combine the results • Median alone = only get powers of two • Mean = subject to skew • Median of means of groups works well in practice
Example 3: Counting unique users • Input: stream of (user_id, action, timestamp) • Want to know how many distinct users are seen over a time period • Flajolet-Martin, all together: arr = int[k] for each item e: for i in 0...k-1: z = trailing_zeros(hash_i(e)) if z > arr[i]: arr[i] = z means = group_means(arr) median = median(means) return pow(2, median)
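A runnable Python version of the above; simulating the k independent hash functions by salting a single hash is this sketch's assumption, not part of the original slides:

import hashlib
import statistics

def trailing_zeros(x):
    # Number of trailing zero bits (treat 0 as 0, a simplification)
    return (x & -x).bit_length() - 1 if x else 0

def salted_hash(item, salt):
    digest = hashlib.md5(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest, "big")

def fm_distinct_count(stream, k=32, group_size=4):
    arr = [0] * k
    for e in stream:
        for i in range(k):
            z = trailing_zeros(salted_hash(e, i))
            if z > arr[i]:
                arr[i] = z
    # Median of group means, as on the slide
    groups = [arr[i:i + group_size] for i in range(0, k, group_size)]
    means = [sum(g) / len(g) for g in groups]
    return 2 ** statistics.median(means)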
Example 3: Counting unique users • Flajolet-Martin in practice • Devil is in the details • Tunable precision • more hash functions = more precise • See the paper for bounds on precision • Tunable latency • more hash functions = higher latency • faster hash functions = lower latency • faster hash functions = more possibility of correlation = less precision • Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
Example 4: Counting Individual Item Frequencies • Want to keep track of how many times each item has appeared in the stream • Many applications: • How popular is each search term? • How many times has this hashtag been tweeted? • Which IP addresses are DDoS’ing me? • Again, two obvious approaches: • In-memory hashmap of item → count • Database • But can we be more clever?
Example 4: Counting Individual Item Frequencies • Want to keep track of how many times each item has appeared in the stream • Idea: • Maintain array of counts • Hash each item, increment array at that index • To check the count of an item, hash again and check array at that index • Over-estimates because of hash “collisions”
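A minimal sketch of this single-array idea (the width W is an assumed tuning parameter; bigger means fewer collisions):

import hashlib

W = 1 << 16          # array width
counts = [0] * W

def _index(item):
    digest = hashlib.md5(str(item).encode()).digest()
    return int.from_bytes(digest[:4], "big") % W

def add(item):
    counts[_index(item)] += 1

def estimate(item):
    # Over-estimates when other items hash to the same index
    return counts[_index(item)]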
Example 4: Counting Individual Item Frequencies • Count-Min Sketch algorithm: • Maintain 2-d array of size w x d • Choose d different hash functions; each row in array corresponds to one hash function • Hash each item with every hash function, increment the appropriate position in each row • To query an item, hash it d times again, take the minimum value from all rows
Example 4: Counting Individual Item Frequencies Want to keep track of how many times each item has appeared in the stream Count-Min Sketch, all together: arr = int[d][w] for each item e: for i in 0...d-1: j = hash_i(e) mod w arr[i][j]++ def frequency(q): min = +infinity for i in 0...d-1: j = hash_i(q) mod w if arr[i][j] < min: min = arr[i][j] return min
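A runnable Python sketch of the same structure; as before, salting one hash to stand in for d independent hash functions is this sketch's assumption:

import hashlib

class CountMinSketch:
    def __init__(self, w=1024, d=5):
        self.w, self.d = w, d
        self.arr = [[0] * w for _ in range(d)]

    def _index(self, item, i):
        # Row i uses hash function i, simulated by salting
        digest = hashlib.md5(f"{i}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item):
        for i in range(self.d):
            self.arr[i][self._index(item, i)] += 1

    def frequency(self, q):
        # Minimum across rows limits the over-count from collisions
        return min(self.arr[i][self._index(q, i)] for i in range(self.d))

For example, after cms = CountMinSketch() and cms.add("foo"), cms.frequency("foo") returns at least 1: the estimate can exceed the true count due to collisions, but never undercounts.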
Example 4: Counting Individual Item Frequencies • Count-Min Sketch in practice • Devil is in the details • Tunable precision • Bigger array = more precise • See the paper for bounds on precision • Tunable latency • more hash functions = higher latency • Better at estimating more frequent items • Can subtract out estimation of collisions • Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
Questions? • Feel free to reach out • www.thinkbiganalytics.com • joe.kelley@thinkbiganalytics.com • www.slideshare.net/jfkelley1 • References: • http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf • http://infolab.stanford.edu/~ullman/mmds.html • We’re hiring! Engineers and Data Scientists