Maintaining Stream Statistics Over Sliding Windows Paper by Mayur Datar, Aristides Gionis, Piotr Indyk, Rajeev Motwani. Presentation by Adam Morrison.
Sliding Window Intro • Infinite stream. • Only last N elements relevant. • Packet streams. • N is huge. • Stronger model…
Model • Count memory bits. • Online algorithm. (Figure: elements 1–7 arrive one at a time; each active element carries a timestamp giving its recency within the window.)
Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else
Basic Counting • Exact Solution? (Counter?) • Exact solution requires Ω(N) bits.
Approximate Basic Counting • Solution: Approximate the answer and bound the relative error. (Example: ε = 0.05 — answering 100 is acceptable whenever the true count is in [95, 105].)
The idea • Dynamic histogram of active 1s. • New 1s go into the rightmost bucket. • For each bucket keep the timestamp of the most recent 1 and the bucket’s size. • When the timestamp expires, free the bucket. • Questions: Bucket sizes? Policy for creating new buckets? What is it good for?
Example (N=4). (Figure: a short stream of 1s grouped into buckets, each labeled with its timestamp and size.)
(Timestamps are easy) • Keep a cyclic counter mod N. (Figure: N=15; timestamps wrap from 14 back to 0.)
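The wraparound arithmetic can be sketched in a few lines of Python (an illustration, not the paper’s pseudocode; the function name is mine): since every active element is at most N − 1 steps old, subtraction modulo N recovers its true age.

```python
N = 15  # window size from the slide's example

def age(ts: int, now: int) -> int:
    # ts and now are both kept modulo N; for elements at most
    # N - 1 steps old, wraparound subtraction gives the true age
    return (now - ts) % N

# check the mod-N arithmetic against absolute timestamps
for abs_ts in range(100):
    for delta in range(N):          # representable ages: 0 .. N-1
        assert age(abs_ts % N, (abs_ts + delta) % N) == delta
```

An element exactly N steps old collides with age 0, so expiry has to be detected at the arrival whose mod-N timestamp matches the stored one.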
What does the histogram buy us? • A bucket is active iff it contains an active 1. • Only the last bucket might contain expired 1s.
Estimating number of 1s • T – sum of all bucket sizes but the last. • So there are at least T 1s. • C – size of the last bucket. • Actual # of 1s in the last bucket can be anything from 1 to C. • Conclusion: the true count lies in [T + 1, T + C]; estimate T + C/2.
Error of the estimate T + C/2, whatever the true count turns out to be: • Absolute: at most C/2. • Relative: at most (C/2)/(T + 1).
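As a small illustration (the function name is mine, not from the deck), the estimate and the interval it is drawn from:

```python
def estimate_ones(bucket_sizes):
    """Estimate the active 1s from bucket sizes, oldest bucket first.

    C (the oldest bucket) may be partially expired, so the true count
    is somewhere in [T + 1, T + C]; T + C/2 is off by at most C/2.
    """
    if not bucket_sizes:
        return 0.0
    c, rest = bucket_sizes[0], bucket_sizes[1:]
    t = sum(rest)          # fully active buckets
    return t + c / 2.0     # midpoint of the feasible interval
```

For buckets of sizes 4, 2, 2, 1 (oldest first), T = 5 and C = 4, so the estimate is 7.0 with absolute error at most 2.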
Bounding the error • Goal: relative error at most ε = 1/k. • Achieved if at all times, for every bucket j, C_j/2 ≤ (1/k)(1 + Σ_{i&lt;j} C_i), where the sum runs over the buckets more recent than j.
Exponential Histogram How can we do that? (With as few buckets as possible?) • Non-decreasing bucket sizes (from newest to oldest). • Bucket sizes constrained to powers of 2. • At most k/2 + 1 buckets of each size. • For all sizes but that of the last bucket, at least k/2 buckets of each size.
New 1 – create a bucket of size 1. Check if the invariant is violated. Too many buckets of a size – merge the two oldest of that size (keeping the newer timestamp); the merge may cascade into larger sizes. (Figure: step-by-step bucket states during a sequence of insertions and merges.)
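The whole maintenance loop can be sketched as follows (a from-scratch illustration; the class and method names are mine, and the merge scan is kept simple rather than optimized):

```python
class ExponentialHistogram:
    """Sliding-window count of 1s with relative error roughly 1/k."""

    def __init__(self, window: int, k: int):
        self.window = window
        self.max_per_size = k // 2 + 1  # at most k/2 + 1 buckets per size
        self.time = 0
        self.buckets = []  # (timestamp, size), newest first, sizes nondecreasing

    def add(self, bit: int) -> None:
        self.time += 1
        # free the oldest bucket once its most recent 1 leaves the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit:
            self.buckets.insert(0, (self.time, 1))
            self._cascade()

    def _cascade(self) -> None:
        i, size = 0, 1
        while i < len(self.buckets):
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1                  # run of buckets of this size: i..j-1
            if j - i <= self.max_per_size:
                break                   # invariant holds; larger sizes unchanged
            # too many: merge the two oldest of this size (indices j-2, j-1),
            # keeping the newer of their timestamps
            ts = self.buckets[j - 2][0]
            self.buckets[j - 2 : j] = [(ts, 2 * size)]
            i, size = j - 2, 2 * size   # the merge may overflow the next size

    def estimate(self) -> float:
        if not self.buckets:
            return 0.0
        last = self.buckets[-1][1]      # only this bucket may be partial
        return sum(s for _, s in self.buckets) - last + last / 2.0
```

On an all-ones stream with window 20 and k = 4, the estimate stays within C/2 of the true count of 20, where C is the size of the oldest bucket.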
Why it works (correctness) • The invariant guarantees at least k/2 buckets of each of the sizes 1, 2, …, C/2 below the last bucket’s size C. • So T ≥ (k/2)(1 + 2 + … + C/2) = (k/2)(C − 1), i.e., the possible error C/2 is about a 1/k fraction of the guaranteed count T, bounding the relative error by ε.
Why it works (space) • Can account for all N 1s with just O(k log(N/k)) buckets: with at least k/2 buckets of each size, r distinct sizes already cover (k/2)(2^r − 1) 1s, so r = O(log(N/k)).
Space usage • # of buckets: O(k log(N/k)). • Bucket (size + timestamp): O(log N) bits. • T counter for estimation: O(log N) bits. • Total: O(k log² N) = O((1/ε) log² N) bits.
Operations • Estimation: O(1) (maintain T incrementally). • Insertion: cascading merges make it O(log N) worst case. • But only O(1) amortized! • A bucket of size B accounts for all operations related to it: B inserts, B − 1 merges (and maybe a delete). • The total size of all buckets over the lifetime (including deleted ones) equals the number of insertions.
Plan • Basic Counting • Given a bit stream, maintain at every time instant the count of 1s in the last N elements. • Sum • Given an integer stream, maintain the sum of the last N elements. • Everything else
Extending to Sum • Integers in range [0, R]. • On value V, insert V 1s (all sharing the arriving item’s timestamp). • Timestamps: O(log N) bits. • Bucket counter: O(log(NR)) bits. • # of buckets: O(k log(NR/k)). • Total space: O(k log²(NR)) bits. • But insertion takes Θ(R)!
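The naive reduction is easy to state in code (a toy sketch; `add_value` and `add_bit` are my names, and `add_bit` stands for any sliding-window bit counter such as the EH):

```python
def add_value(add_bit, v: int, R: int) -> None:
    """Feed integer v in [0, R] to a bit-stream counter as v unary 1s.

    All v bits conceptually share the arriving item's timestamp; the
    loop is what makes one insertion cost Theta(R) in the worst case.
    """
    assert 0 <= v <= R
    for _ in range(v):
        add_bit(1)

# toy usage: a plain list stands in for the bit counter
bits = []
for v in [3, 0, 5]:
    add_value(bits.append, v, R=10)
# bits now holds 3 + 0 + 5 = 8 ones
```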
Reducing insertion time • If we had a way to rebuild the entire histogram… • We could buffer new values… • And rebuild the histogram when the buffer reaches size B. • If a rebuild takes time O(B + k log(NR)), the amortized cost per item is O(1 + (k log(NR))/B). • Picking B = k log(NR) gives O(1) amortized time.
k/2 canonical representation • The k/2 canonical representation of S: write S = Σ_j n_j·2^j where every n_j is k/2 or k/2 + 1, except that the largest size is unconstrained – exactly mirroring the EH invariant. • If S is the total size of the buckets, computing its k/2 canonical representation would help us rebuild the histogram. • Would it really? Is this representation unique?
Computing it: find the largest j for which (k/2)(2^j − 1) ≤ S; take k/2 buckets of each size 1, 2, …, 2^{j−1}, then distribute the remainder S − (k/2)(2^j − 1): bit i of the remainder adds one extra bucket of size 2^i, and what is left over forms buckets of the largest size. Total time required is O(log S). (Example: k/2 = 1, S = 5 gives j = 2: one bucket of size 1 and two of size 2.)
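A possible implementation of this computation (my reconstruction; the slide’s own pseudocode is not preserved, and the function name is mine):

```python
def canonical(S: int, k_half: int):
    """k/2-canonical representation of S (k_half plays the role of k/2).

    Returns counts, where counts[i] is the number of buckets of size
    2**i; every count is k_half or k_half + 1 except possibly for the
    largest size, which is unconstrained.
    """
    if S <= 0:
        return []
    # largest j with k_half * (2**j - 1) <= S
    j = 0
    while k_half * (2 ** (j + 1) - 1) <= S:
        j += 1
    counts = [k_half] * j                  # k_half buckets of 2^0 .. 2^(j-1)
    rem = S - k_half * (2 ** j - 1)
    for i in range(j):                     # bit i set -> one extra 2**i bucket
        if rem & (1 << i):
            counts[i] += 1
    if rem >> j:                           # leftover: buckets of the largest size
        counts.append(rem >> j)
    return counts
```

`canonical(5, 1)` returns `[1, 2]` – one bucket of size 1 and two of size 2 – matching the example above; by construction the counts always sum (weighted by 2^i) back to S.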
Calculate the S1 + S2 representation. • If a value gets “unindexed”, it will never be indexed in the future. (Figure: worked example of merging two canonical representations.)
Plan • Everything else: • Lower Bounds • More about timestamps. • Applications. • More problems.
Lower bounds • The Basic Counting and Sum algorithms are optimal. • Similar techniques show that lots of other problems are intractable. (Later.)
(Figure: lower-bound argument – the stream is split into big blocks of d subblocks each, and the algorithm must be able to identify the leftmost such subblock.) Same idea works for Sum.
Randomized bound • The lower bound applies to randomized algorithms as well. • Yao’s minimax principle: the expected space complexity of the optimal deterministic algorithm on a fixed input distribution is a lower bound on the expected space complexity of any randomized algorithm.
Timestamps • Define the window based on real time – equate timestamp with clock. • If much fewer than N items can arrive during the window, memory usage is reduced. • No work needs to be done when items don’t arrive, so deletions can be deferred.
Applications • Adapting algorithms to the sliding-window model by using EH to replace counters. • Counters require O(log N) bits; EH takes O((1/ε) log² N). • Also a (1 + ε) factor loss in accuracy.
More Problems • Min/Max • Storing a subsequence of (say) minima is optimal. • Distinct values • Basic Counting reduces to it.
Other Problems • Distinct values with deletions. • Factor-2 estimation requires Ω(N) space. • Map 1s in a bit string to distinct values. Pad with zeros to infer the value of the last bit, then use deletion to cancel that bit. • Repeat.
Other Problems • Sum with negative integers. • Factor-2 estimation requires Ω(N) space. • Map 1s in the bit string to (−1, 1) and 0s to (1, −1). • Pad with 0s and query at odd time instants.