Maintaining Time-Decaying Stream Aggregates • Edith Cohen, Martin Strauss • AT&T Labs-Research • PODS 2003
The Problem • A data stream is a sequence of data items observed over time. • Multiple massive data streams are present. • Storage constraints allow maintaining only a compact summary of the “essence” of the information in each stream. • The relevance of information decays with time. • Thus, when aggregating across time, older information should be discounted.
Applications • IP routing (RED protocol): a time-decayed average of previous queue lengths is used to estimate impending congestion at a router. • Internet gateway selection: track the quality (e.g., packet loss rate) of alternative paths to select a more reliable one. • Usage statistics of phone customers: AT&T has about 100M customers. • More…
Decay Functions • A decay function g(x) >= 0 is non-increasing and defined for x >= 1. • f(t) >= 0 is the value of the data item observed at time t. • The weight at time T of an item observed at time t is g(T-t). • The decayed value of the item is f(t)g(T-t).
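Time-Decaying Sum • The time-decaying sum at time T is the sum of the decayed values f(t)g(T-t) over all items observed at times t < T. • When the f(t) are 0/1, we refer to the problem as time-decaying count. • Maintaining the decaying sum exactly can in general require a linear number of bits. • We consider maintaining it approximately, to within a factor of (1 ± ε).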
Time-Decaying Average • The time-decaying weighted average of the observed values is Σ_{t<T} f(t)g(T-t) / Σ_{t<T} g(T-t), where f(t) is the value of the item observed at time t. • Maintaining the time-decaying average reduces to maintaining two time-decaying sums.
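As a point of reference, the decayed sum and decayed average can be computed exactly by brute force if every item is stored. The sketch below is only an illustration of the definitions, not the authors' method. The names `decayed_sum` and `decayed_average` and the `(t, f)` item representation are assumptions, and the average is taken to be the decayed sum of the values divided by the decayed sum of the weights.

```python
def decayed_sum(items, g, T):
    """Exact decayed sum at time T.

    items: iterable of (t, f) pairs, where f >= 0 was observed at time t.
    g:     decay function, non-increasing, defined for x >= 1.
    Only items with t < T contribute, each with weight g(T - t).
    """
    return sum(f * g(T - t) for t, f in items if t < T)


def decayed_average(items, g, T):
    """Exact decayed average at time T, as a ratio of two decayed sums."""
    # Denominator: the same sum with every value replaced by 1.
    weights = [(t, 1) for t, _ in items]
    denominator = decayed_sum(weights, g, T)
    if denominator == 0:
        return 0.0  # no item carries any weight
    return decayed_sum(items, g, T) / denominator
```

For example, with items `[(1, 1), (2, 0), (3, 1)]`, `g(x) = 1/x`, and `T = 4`, the decayed sum is 1/3 + 0 + 1 = 4/3. This baseline uses storage linear in the stream length, which is the cost the approximate summaries in the talk are designed to avoid.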
Interesting Families of Decay Functions • Exponential decay [Jacobson 88] • Sliding windows [DGIM02]: g(x)=1 for x<W, g(x)=0 otherwise • Polynomial decay • General decay functions…
Exponential Decay • Used in networking applications (RED). • Very simple maintenance [update rule not preserved in this transcript]. • Lemma: exact tracking and approximate tracking each have a stated storage bound in bits [bounds not preserved in this transcript].
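The slide's own update rule is missing from this transcript, so the following is a sketch under a stated assumption: that exponential decay means g(x) = exp(-λx) for some rate λ > 0, with one time step per observation. Under that assumption the decayed sum can be kept in a single number, because every existing weight shrinks by the same factor at each step. The class name `ExponentialDecaySum` and the parameter `lam` are hypothetical.

```python
import math


class ExponentialDecaySum:
    """Running decayed sum for the ASSUMED decay g(x) = exp(-lam * x)."""

    def __init__(self, lam):
        self.factor = math.exp(-lam)  # per-step decay factor g(x+1)/g(x)
        self.value = 0.0              # decayed sum at the current time

    def observe(self, f):
        """Record the item f for the current step and advance time by one.

        After the call, the item just observed has age 1 (weight g(1)) and
        every older item has aged by one more step.
        """
        self.value = self.factor * (self.value + f)
        return self.value
```

With `lam = math.log(2)` and the items 1, 0, 1, the running value ends at 1/8 + 0 + 1/2 = 0.625, matching the brute-force definition. This constant-state recurrence relies on g(x+1)/g(x) being the same for all x, which is why it does not carry over to other decay functions.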
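Sliding Window Decay • “Sharp threshold”: items inside the window have full weight and items outside it have none. • Lemma [DGIM02]: sliding window decay can be approximately tracked using O(log^2 W) bits for a fixed approximation factor (for 0/1 or polynomial-size values). • The upper bound uses the Exponential Histogram (EH) technique.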
Polynomial Decay • Lemma: lower bound [not preserved in this transcript]; upper bound O(log N log log N) bits (N is the elapsed time). • Often more appropriate for applications than exponential or sliding window decay. • More efficient than sliding window decay (nearly quadratic gap), and almost as efficient as exponential decay.
General Decay Functions • Lemma: any decay function can be (approximately) maintained using O(log^2 N) bits (N is the minimum of the elapsed time and the smallest x for which g(x)=0). • The algorithm is based on an adaptation of the Exponential Histograms technique. • Sliding windows [DGIM02] (window setting not preserved in this transcript) are as “hard” to maintain as general decay.
Why Polynomial Decay? • Link performance over time [figure: performance of Link A and Link B, on a scale from good to bad, up to time t0] • Which link should we select past time t0? Initially A or B, eventually B.
Link Selection Example (cont.) • Polynomial decay (by tuning the parameter): initially A or B, eventually B. • Exponential decay: constant relative value of A and B, so either A forever or B forever. • Sliding window decay: first B, then A, then the same… • Polynomial decay can model our expectation (as can other smooth subexponential functions…).
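One way to see why exponential decay gives a "constant relative value" is to compare the weights of an older and a newer observation as time passes. The sketch below assumes the forms g(x) = exp(-λx) for exponential decay and g(x) = x^(-α) for polynomial decay, which this transcript does not spell out. The helper names `weight_ratio`, `exponential`, and `polynomial` are hypothetical.

```python
import math


def weight_ratio(g, t_old, t_new, T):
    """Weight of the item observed at t_old relative to the one at t_new,
    both evaluated at time T (requires t_old < t_new < T)."""
    return g(T - t_old) / g(T - t_new)


def exponential(lam):
    """ASSUMED exponential decay g(x) = exp(-lam * x)."""
    return lambda x: math.exp(-lam * x)


def polynomial(alpha):
    """ASSUMED polynomial decay g(x) = x ** (-alpha)."""
    return lambda x: x ** (-alpha)
```

For the exponential form the ratio equals exp(-λ(t_new - t_old)) at every T, so the relative standing of two past observations never changes. For the polynomial form the ratio is ((T - t_new)/(T - t_old))^α, which moves toward 1 as T grows, so old and recent observations gradually approach equal weight.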
Summary of Bounds • Bounds are for approximation to within a factor of (1 ± ε) [table of bounds not preserved in this transcript]. • N is the minimum of the elapsed time and the smallest x for which g(x)=0.
Bucketing the Stream • Example stream (in time order): 1 0 0 1 1 0 1 0 0 1 • Buckets: time width 4, count 2; time width 3, count 2; time width 3, count 1. Merging the first two gives time width 7, count 4. • A histogram is determined by time boundaries and bucket counts. • Time boundaries can be fixed (counts maintained per stream). • Counts can be fixed (time boundaries maintained per stream).
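The bucketing in this example can be reproduced with a few lines of code. This is a minimal sketch in which a bucket is a `(time_width, count)` pair. The function names `bucketize` and `merge` are assumptions made for illustration.

```python
def bucketize(bits, widths):
    """Split a 0/1 stream into consecutive buckets of the given time widths.

    Returns a list of (time_width, count) pairs, in stream order.
    """
    buckets = []
    position = 0
    for width in widths:
        count = sum(bits[position:position + width])
        buckets.append((width, count))
        position += width
    return buckets


def merge(a, b):
    """Merge two adjacent buckets: widths add up and counts add up."""
    return (a[0] + b[0], a[1] + b[1])
```

Calling `bucketize([1, 0, 0, 1, 1, 0, 1, 0, 0, 1], [4, 3, 3])` gives `[(4, 2), (3, 2), (3, 1)]`, and merging the first two buckets gives `(7, 4)`, as on the slide. After a merge, the positions of the items inside the merged bucket are no longer known, which is the source of the approximation error.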
Exponential Histograms [DGIM02] • Introduced for sliding windows. • Bucket counts are independent of the stream. • The sum of the bucket counts is a constant-factor approximation for the count over the window. • Each new item is placed in a new bucket. • Two buckets are merged when their combined count is at most a fraction of the combined count of all earlier buckets. • Buckets whose start time is more than W time units in the past are discarded.
Exponential Histograms (cont.) • Example for a factor-2 approximation (bucket counts after each arrival): 1; 1, 1; 1, 1, 1; 1, 1, 2 (merge); 1, 1, 1, 2; 1, 1, 2, 2 (merge) • Values whose time is “in question” (before or after W) are aggregated in the least recent bucket.
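The bucket-count sequence in this example can be reproduced with a small sketch. It uses a common formulation of exponential histograms in which at most a fixed number of buckets of each count is kept and the two oldest buckets of a count are merged on overflow. That rule, the limit of 3, and the names `ExponentialHistogram`, `max_per_size`, `counts`, and `estimate` are assumptions chosen to match the example, not necessarily the exact rule used in the talk.

```python
class ExponentialHistogram:
    """Approximate count of 1s among the last `window` items of a 0/1 stream."""

    def __init__(self, window, max_per_size=3):
        self.window = window
        self.max_per_size = max_per_size  # ASSUMED limit per bucket count
        self.time = 0
        # Most recent bucket first; each entry is [newest_timestamp, count].
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Discard buckets whose newest item has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, [self.time, 1])
        # Cascade: while some count has too many buckets, merge its two oldest.
        i = 0
        while i < len(self.buckets):
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.max_per_size:
                break
            self.buckets[j - 2][1] += self.buckets[j - 1][1]  # keep newer timestamp
            del self.buckets[j - 1]
            i = j - 2  # the merged bucket begins the run of the next count

    def counts(self):
        """Bucket counts, most recent first."""
        return [count for _, count in self.buckets]

    def estimate(self):
        """All bucket counts, minus half of the oldest (it may straddle the window edge)."""
        if not self.buckets:
            return 0
        return sum(self.counts()) - self.buckets[-1][1] // 2
```

Feeding six 1s into `ExponentialHistogram(window=100)` and reading `counts()` after each arrival gives `[1]`, `[1, 1]`, `[1, 1, 1]`, `[1, 1, 2]`, `[1, 1, 1, 2]`, `[1, 1, 2, 2]`. Only the oldest bucket can straddle the window boundary, so the estimation error is bounded by that bucket's count.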
EH Properties • The number of buckets is O(log W). For each bucket we record the exact start time, so we need O(log W) storage per bucket (O(log^2 W) in total). • An EH for sliding window W can be used to approximate sliding window j for all j<W. • Lemma: an EH can be used to approximate general decay functions (with W = the minimum of the elapsed time and the smallest x for which g(x)=0).
Reducing any Decay Function to Sliding Windows • Decay function g(x) [figure not preserved in this transcript] • From the (approximate) sliding window sums for all W<=N we can compute the (approximate) decayed sum according to g(). • With an EH with W=N we can compute (approximately) decayed sums according to all decay functions g() up to elapsed time N (or forever if g(N)=0).
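The reduction can be made concrete with summation by parts. Write S_W for the sum of the values of the W most recent items (ages 1 through W); this indexing is a convention chosen here for the sketch. Then the decayed sum over ages 1..N equals g(N)·S_N + Σ_{W=1}^{N-1} (g(W) − g(W+1))·S_W. Because g is non-increasing, every coefficient is non-negative, so replacing each S_W by a (1 ± ε) approximation yields a (1 ± ε) approximation of the decayed sum. The function name `decayed_sum_from_windows` is hypothetical.

```python
def decayed_sum_from_windows(window_sums, g):
    """Decayed sum computed only from sliding-window sums.

    window_sums[W - 1] is S_W, the (exact or approximate) sum of the values
    of the W most recent items, for W = 1..N.
    g is a non-increasing decay function defined for x >= 1.
    """
    n = len(window_sums)
    if n == 0:
        return 0
    # Term for the full window, weighted by the smallest weight g(N).
    total = g(n) * window_sums[-1]
    # Each shorter window W contributes the weight drop g(W) - g(W + 1) >= 0.
    for w in range(1, n):
        total += (g(w) - g(w + 1)) * window_sums[w - 1]
    return total
```

For items with values 1, 0, 1, 1 at ages 1 to 4 and g(x) = 1/x, the window sums are 1, 1, 2, 3 and the function returns 1 + 1/3 + 1/4 = 19/12, the same as the direct definition.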
Weight-Based Merging • Bucket start times depend only on elapsed time. • WBM histograms apply to decay functions where g(x)/g(x+1) is non-increasing. • The number of buckets is O(log(g(1)/g(N))). • O(log log N) storage per bucket (for approximate bucket counts). • More efficient than EH on decay that is slightly super-polynomial or slower. • O(log N log log N) storage for polynomial decay.
WBM Histograms: How? • Region boundaries b1, b2, b3, … [definition not preserved in this transcript] • At most 2 buckets per region. • The current most-recent bucket is sealed and a new bucket is started at each T such that T mod b1 = 0. • Two consecutive buckets that are in the same region (according to their elapsed start and end times) are merged.
WBMH Example • g(x)=1/x, (1+ε)=2 • Regions (by weight): {1, 1/2}, {1/3, 1/4, 1/5, 1/6}, {1/7, 1/8, …, 1/14} • [figure: histogram buckets at T=1, T=2, T=3, T=4, T=5, T=6]
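The regions in this example are consistent with grouping ages so that the weights inside one region differ by at most a factor of (1+ε). The sketch below computes such regions greedily for any non-increasing g. The greedy rule is an assumption inferred from the example, and the names `weight_regions` and `inverse_decay` are hypothetical. Exact fractions are used so that boundary comparisons such as 1/3 versus 2·(1/6) are not affected by rounding.

```python
from fractions import Fraction


def inverse_decay(x):
    """The example's decay function g(x) = 1/x, as an exact fraction."""
    return Fraction(1, x)


def weight_regions(g, factor, max_age):
    """Partition the ages 1..max_age into consecutive regions.

    Each region (start, end) is extended as far as possible while
    g(start) <= factor * g(end), i.e. while all weights in the region
    stay within a factor of `factor` of each other.
    """
    regions = []
    start = 1
    while start <= max_age:
        end = start
        while end + 1 <= max_age and g(start) <= factor * g(end + 1):
            end += 1
        regions.append((start, end))
        start = end + 1
    return regions
```

`weight_regions(inverse_decay, 2, 14)` returns `[(1, 2), (3, 6), (7, 14)]`, that is, the weight groups {1, 1/2}, {1/3, …, 1/6}, and {1/7, …, 1/14}. For g(x)=1/x the number of regions grows logarithmically with the maximum age, in line with the O(log(g(1)/g(N))) bucket bound for weight-based merging.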
Conclusion • Summary: efficient computation of time-decayed sums/averages for general decay functions; very efficient computation for polynomial decay. • Open question: O(log N) storage for polynomial decay? • Subsequent related work: spatial decay (sensor nets/P2P nets).