Maintaining Variance over Data Stream Windows. Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan, Stanford University. ACM Symp. on Principles of Database Systems (PODS 2003), June 2003. Presented by C.-L. Lin.
Data streams
• Traditional DBMS (Database Management System): data stored in finite and persistent data sets, one-time queries, relatively low update rate.
• Data Stream Management System (DSMS): data input as continuous, possibly infinite data streams, continuous queries, etc.
• An example of a continuous query: in a telecom company, we are interested in finding all outgoing calls longer than 2 minutes.
• New applications
• Sensor networks
• Network monitoring and traffic engineering
• Telecom call records
• Network security
• Financial applications
• Manufacturing processes
• Web logs and clickstreams
• Massive data sets
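The telecom example of a continuous query can be sketched as a filter that runs over records as they arrive. This is only an illustration: the `Call` record, its field names, and the function name are assumptions, not part of the talk.

```python
from typing import Iterable, Iterator, NamedTuple


class Call(NamedTuple):
    # Hypothetical call record; the field names are assumptions.
    caller: str
    duration_s: float
    outgoing: bool


def long_outgoing_calls(stream: Iterable[Call],
                        min_seconds: float = 120.0) -> Iterator[Call]:
    """Yield outgoing calls longer than min_seconds as they arrive.

    The stream is consumed one record at a time and nothing is buffered,
    so the query keeps running for as long as the stream produces data.
    """
    for call in stream:
        if call.outgoing and call.duration_s > min_seconds:
            yield call


calls = [Call("a", 45.0, True), Call("b", 300.0, True), Call("c", 500.0, False)]
matches = list(long_outgoing_calls(calls))
```

Here only call "b" matches: "a" is too short and "c" is not an outgoing call.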
[Figure: Data Stream Management System (DSMS). Labels: data stream values 32 32 13 16 32 5 13 7; User/Application; Query; Query Results; Stream Query Processor; Summary data (sum, count, variance, …); Limited Memory and/or Disk.]
Sliding Window Model
[Figure: a stream of values on a time axis. Labels: Time Increases; Timestamps 7 6 5 4 3 2 1; Window Size = N; Expired data; Future data; Current Time.]
• When N is large (many hours, days, or months), we cannot buffer the entire sliding window in memory.
• Buffering the window requires O(N log R) bits of memory, where R is an upper bound on the absolute value of the data.
• So we cannot compute the sum, count, or variance exactly at every instant.
• Instead, compute the variance over the sliding window approximately, using as little memory as possible.
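For contrast, here is a minimal sketch of the exact approach that the slide rules out for large N: it buffers every element of the window, so its memory grows linearly with N. The class name is hypothetical, and the variance is assumed to be the sum of squared deviations from the mean (no division by n), which agrees with the bound V ≤ NR² used elsewhere in the talk.

```python
from collections import deque


class ExactWindowVariance:
    """Exact variance over the last N elements, at the cost of O(N) memory."""

    def __init__(self, window_size: int) -> None:
        # A bounded deque drops the oldest element once it is full.
        self.window = deque(maxlen=window_size)

    def add(self, x: float) -> None:
        self.window.append(x)

    def variance(self) -> float:
        # Assumption: variance = sum of squared deviations from the mean.
        n = len(self.window)
        if n == 0:
            return 0.0
        mean = sum(self.window) / n
        return sum((x - mean) ** 2 for x in self.window)


w = ExactWindowVariance(window_size=3)
for value in [1, 2, 3, 4]:
    w.add(value)
exact = w.variance()  # window now holds 2, 3, 4
```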
Review (1) Mean (2) Variance (3) Relative estimation error
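A small sketch of the three quantities under review. Two assumptions are made here: the variance is taken as the sum of squared deviations from the mean, without dividing by n (consistent with the bound V ≤ NR² quoted later in the talk), and the relative error is |estimate − actual| / actual. The function names are illustrative.

```python
from typing import Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def variance(xs: Sequence[float]) -> float:
    # Assumption: unnormalized variance, i.e. sum of squared deviations.
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs)


def relative_error(estimate: float, actual: float) -> float:
    # Assumption: error measured relative to the true value (actual != 0).
    return abs(estimate - actual) / actual


xs = [4, 7, 1, 4]
mu = mean(xs)                   # 4.0
v = variance(xs)                # 0 + 9 + 9 + 0 = 18.0
err = relative_error(19.8, v)   # about 0.1
```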
The Concept of Buckets (1/2)
[Figure: stream elements on a time axis, partitioned into buckets. Labels: Time Increases; Timestamps 9 8 7 6 5 4 3 2 1; Elements; buckets B4 B3 B2 B1; Bucket Timestamps 9 6 3 1; Suffix buckets.]
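The Concept of Buckets (2/2)
[Figure: buckets Bm, Bm-1, …, B3, B2, B1 on the time axis. Labels: Time; Bm*; Window size N.]
• For each bucket Bi, maintain its timestamp ti, size ni, mean μi, and variance Vi.
• Proof later.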
Estimated Variance
[Figure: buckets Bm, Bm-1, …, B3, B2, B1 on the time axis. Labels: Time; Bm*; Window size N; Error!!!]
Lemma 1
Proof. Define δi = μi − μi,j and δj = μj − μi,j.
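The quantities μi, μj, and μi,j in the proof are the means of two buckets and of their union. A hedged sketch of how such statistics combine, using the standard pairwise identity for sums of squared deviations (presumably what Lemma 1 states, though the statement itself is not reproduced here): n = ni + nj, μ = (ni·μi + nj·μj)/n, V = Vi + Vj + (ni·nj/n)(μi − μj)². The `Stats` type and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Stats:
    n: int        # number of elements
    mean: float   # mean of the elements
    var: float    # sum of squared deviations from the mean (assumption)


def stats_of(xs: Sequence[float]) -> Stats:
    mu = sum(xs) / len(xs)
    return Stats(len(xs), mu, sum((x - mu) ** 2 for x in xs))


def merge(a: Stats, b: Stats) -> Stats:
    """Combine two summaries without access to the underlying elements."""
    n = a.n + b.n
    mu = (a.n * a.mean + b.n * b.mean) / n
    # The cross term accounts for the gap between the two means.
    var = a.var + b.var + (a.n * b.n / n) * (a.mean - b.mean) ** 2
    return Stats(n, mu, var)


left, right = [1, 2, 3], [10, 20]
merged = merge(stats_of(left), stats_of(right))
direct = stats_of(left + right)
```

Because the cross term is never negative, the merged variance is at least the sum of the two input variances.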
When a new element xt arrives
• Create a new bucket for xt. The new bucket becomes B1 with V1 = 0, μ1 = xt, n1 = 1. Each old bucket Bi becomes Bi+1.
• If tm > N, delete the oldest bucket Bm. Bucket Bm-1 becomes the new oldest bucket; update the suffix bucket Bm-1*.
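The arrival step can be sketched as follows. This is a simplified illustration, not the full algorithm: it covers only bucket creation, renumbering, and expiry, and it omits merging and the suffix-bucket update. The class and field names are assumptions. Each bucket stores the absolute arrival index of its most recent element, and the relative timestamp (newest element = 1) is derived from it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    arrival: int   # absolute arrival index of the most recent element (assumption)
    n: int         # bucket size
    mean: float    # bucket mean
    var: float     # bucket variance


class BucketList:
    def __init__(self, window_size: int) -> None:
        self.window_size = window_size
        self.now = 0
        self.buckets: List[Bucket] = []   # buckets[0] is B1, buckets[-1] is Bm

    def timestamp(self, b: Bucket) -> int:
        # Relative timestamp: the newest element has timestamp 1.
        return self.now - b.arrival + 1

    def insert(self, x: float) -> None:
        self.now += 1
        # The new element becomes B1 with V1 = 0, mean = x, n1 = 1;
        # inserting at the front shifts every Bi to Bi+1.
        self.buckets.insert(0, Bucket(self.now, 1, x, 0.0))
        # If tm > N, the oldest bucket has expired and is deleted.
        if self.timestamp(self.buckets[-1]) > self.window_size:
            self.buckets.pop()


bl = BucketList(window_size=3)
for value in [5.0, 6.0, 7.0, 8.0]:
    bl.insert(value)
```

After four arrivals with N = 3, the bucket holding 5.0 has expired and three single-element buckets remain.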
Bucket Merge
• Invariant 1: For every bucket Bi,
• Ensures that the relative error is ≤ ε.
• Invariant 2: For each i > 1, for every bucket Bi,
• This invariant ensures that the total number of buckets is small: O((1/ε²) log NR²).
Number of Buckets
• Lemma 2: The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²) log NR²), where R is an upper bound on the absolute value of the data elements.
Proof.
• From the merge rule: the variance of the union of two buckets is no less than the sum of the individual variances.
• By Invariant 2, the variance of the suffix bucket Bi* doubles after every O(1/ε²) buckets.
• Total number of buckets: no more than O((1/ε²) log V), where V is the variance of the last N points. V is no more than NR², so the bound is O((1/ε²) log NR²).
[Figure labels: V3,4; V3*; V5*]
Space Complexity
(1) By Lemma 2, the number of buckets maintained at any point in time is O((1/ε²) log NR²).
(2) Each bucket requires constant space ==> overall memory is O((1/ε²) log NR²).
But, counting bits per bucket:
• Timestamps: O(log N)
• Bucket size:
• Mean:
• Variance: O(log V) = O(log NR²)
Estimation Error
• Estimated Variance:
• Actual Variance:
• Error: