Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream
Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu
ICDE 2004
Outline
• Motivation
• Problem definition
• Quantile sketch
• Sliding window model
• n-of-N model
• Conclusion
Motivation
• Data elements seen early may be outdated, so quantile summaries of the most recently seen elements matter more.
• Example: the top-ranked Web pages among the N most recently accessed pages should predict page accesses more accurately than the top-ranked pages among all pages accessed so far, because users' interests change.
Problem Definitions
• ψ-quantile: a ψ-quantile (ψ ∈ (0, 1]) of an ordered sequence of N data elements is the element with rank ⌈ψN⌉.
• Quantile query: given ψ, find the data element with rank ⌈ψN⌉ among all elements in the stream.
• Variation: restrict the query to the N most recent elements (sliding window model).
Example: N = 16
• Sorted sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12
• The 0.5-quantile is the element of rank 8 (0.5 × 16), which is 8.
• The 0.75-quantile is the element of rank 12 (0.75 × 16), which is 10.
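The rank-based definition can be checked directly on the 16-element example. This is a minimal sketch of an exact (non-streaming) quantile; the function name `exact_quantile` is an illustrative choice, not something from the slides.

```python
import math

def exact_quantile(elements, psi):
    """Return the element of rank ceil(psi * N) (ranks are 1-based) in sorted order."""
    if not 0 < psi <= 1:
        raise ValueError("psi must lie in (0, 1]")
    ordered = sorted(elements)
    rank = math.ceil(psi * len(ordered))
    return ordered[rank - 1]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
print(exact_quantile(data, 0.5))   # rank 8  -> 8
print(exact_quantile(data, 0.75))  # rank 12 -> 10
```

Sorting the whole input needs memory proportional to N, which is what the sketches on the later slides avoid.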
Three Different Models
• Data stream model: compute the ψ-quantile over all data items seen so far.
• Example: the 0.5-quantile returns 10 at time t11 and 8 at time t15.
Three Different Models (contd.)
• Sliding window model: compute the ψ-quantile over the N most recent elements seen so far in the data stream.
• Example with window size N = 12: the 0.5-quantile returns 10 at time t11 and 6 at time t15.
Three Different Models (contd.)
• n-of-N model: for any n ≤ N, compute the ψ-quantile over the n most recent elements seen so far in the data stream.
• Example with N = 12: the 0.5-quantile returns 8 at time t11 for n = 8, and 3 at time t15 for n = 4.
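The three models differ only in which suffix of the stream is queried, which a few lines of code make concrete. The slides' stream figure is not available in the text, so `STREAM` below is an assumed 16-element stream, chosen to agree with every value quoted on the three model slides and with the sorted sequence of the N = 16 example; element i is taken to arrive at time t_i, starting from t0.

```python
import math

def quantile(elements, psi):
    """Exact psi-quantile: element of rank ceil(psi * len) in sorted order."""
    ordered = sorted(elements)
    return ordered[math.ceil(psi * len(ordered)) - 1]

# Assumption: reconstructed stream, not taken from the slides.
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def data_stream_model(t, psi):
    """Quantile over everything seen up to and including time t."""
    return quantile(STREAM[: t + 1], psi)

def sliding_window_model(t, psi, N):
    """Quantile over the N most recent elements at time t."""
    return quantile(STREAM[: t + 1][-N:], psi)

def n_of_N_model(t, psi, n, N):
    """Quantile over the n most recent elements at time t, for any n <= N."""
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= N")
    return quantile(STREAM[: t + 1][-n:], psi)

print(data_stream_model(11, 0.5), data_stream_model(15, 0.5))              # 10 8
print(sliding_window_model(11, 0.5, 12), sliding_window_model(15, 0.5, 12))  # 10 6
print(n_of_N_model(11, 0.5, 8, 12), n_of_N_model(15, 0.5, 4, 12))          # 8 3
```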
ε-approximate
• A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to lie in the interval [r − εN, r + εN].
• Example: for the 16-element sequence 1, 2, ..., 11, 11, 11, 12, a 0.25-approximate summary may answer a 0.5-quantile query with any element in {4, 5, 6, 7, 8, 9, 10}.
• Example: for a data stream of 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank lies in [40, 60].
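Both examples follow from the interval [r − εN, r + εN] with r = ⌈ψN⌉, as the following sketch shows. The helper names `rank_interval` and `acceptable_answers` are illustrative, and the interval is clipped to the valid ranks 1..N.

```python
import math

def rank_interval(N, psi, eps):
    """Ranks an eps-approximate summary may return for a psi-quantile query."""
    r = math.ceil(psi * N)
    lo = max(1, math.ceil(r - eps * N))
    hi = min(N, math.floor(r + eps * N))
    return lo, hi

def acceptable_answers(elements, psi, eps):
    """Distinct values whose rank falls inside the allowed interval."""
    ordered = sorted(elements)
    lo, hi = rank_interval(len(ordered), psi, eps)
    return sorted(set(ordered[lo - 1 : hi]))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
print(acceptable_answers(data, 0.5, 0.25))  # [4, 5, 6, 7, 8, 9, 10]
print(rank_interval(100, 0.5, 0.1))         # (40, 60)
```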
Quantile Sketch
• Data structure: $\{(v_i, r_i^-, r_i^+) : 1 \le i \le m\}$
• Each value $v_i$ is one of the elements seen so far
• $r_i^-$ is a lower bound on the rank of $v_i$
• $r_i^+$ is an upper bound on the rank of $v_i$
• $v_i \le v_{i+1}$ for $1 \le i \le m-1$
• $r_i^- \le r_{i+1}^-$ for $1 \le i \le m-1$
• $r_i^- \le r_i \le r_i^+$, where $r_i$ is the rank of $v_i$
The Summary Data Structure
• Given $g_i = r_i^- - r_{i-1}^-$ and $\Delta_i = r_i^+ - r_i^-$:
• $r_i^- = \sum_{j \le i} g_j$
• $r_i^+ = \sum_{j \le i} g_j + \Delta_i$
• $v_1$ and $v_m$ always correspond to the minimum and the maximum elements seen so far.
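The two representations carry the same information, so converting between them is a prefix sum in one direction and a difference in the other. The sketch below assumes $r_0^- = 0$ (the slide does not state the base case) and uses illustrative function names.

```python
from itertools import accumulate

def to_rank_bounds(gk_tuples):
    """Convert (v, g, delta) tuples into (v, r_minus, r_plus) tuples."""
    r_minus = list(accumulate(g for _, g, _ in gk_tuples))  # prefix sums of g
    return [(v, lo, lo + d) for (v, _, d), lo in zip(gk_tuples, r_minus)]

def to_gk_tuples(sketch):
    """Convert (v, r_minus, r_plus) tuples into (v, g, delta) tuples."""
    out, prev_lo = [], 0  # assumption: r_0^- = 0
    for v, lo, hi in sketch:
        out.append((v, lo - prev_lo, hi - lo))
        prev_lo = lo
    return out

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
gk = to_gk_tuples(sketch)
print(gk)  # [(1, 1, 0), (2, 1, 7), (3, 1, 7), (5, 1, 6), (10, 6, 0), (12, 6, 0)]
print(to_rank_bounds(gk) == sketch)  # True
```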
Example
• Quantile sketch consisting of 6 tuples $(v_i, r_i^-, r_i^+)$: {(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
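ε-approximate sketch
• Theorem: suppose a sketch S satisfies
• 1. $r_1^+ \le \varepsilon N + 1$,
• 2. $r_m^- \ge (1-\varepsilon)N$,
• 3. $r_i^+ \le r_{i-1}^- + 2\varepsilon N$ for $2 \le i \le m$.
• Then S is ε-approximate. That is, for each ψ ∈ (0, 1] there is a tuple $(v_i, r_i^-, r_i^+)$ in S such that $\lceil \psi N \rceil - \varepsilon N \le r_i^- \le r_i^+ \le \lceil \psi N \rceil + \varepsilon N$.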
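Query
• Quantile sketch consisting of 6 tuples, with ε = 0.25 and N = 16: {(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
• A 0.5-quantile query asks for the $v_i$ of rank r = 8, with εN = 4.
• Find the first tuple that satisfies $r - \varepsilon N \le r_i^- \le r_i^+ \le r + \varepsilon N$ (here $4 \le r_i^- \le r_i^+ \le 12$) and return its $v_i$: (5, 4, 10) ⇒ return 5.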
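The query rule is a linear scan for the first tuple whose rank bounds both fit inside [r − εN, r + εN]. The sketch below also checks three sufficient conditions on the example sketch; the third one, $r_i^+ \le r_{i-1}^- + 2\varepsilon N$, is the usual GK-style gap bound and is treated here as an assumption, since the slide text does not preserve the formula. Function names are illustrative.

```python
import math

def satisfies_conditions(sketch, N, eps):
    """Check the three sufficient conditions (the gap bound is an assumption)."""
    first_ok = sketch[0][2] <= eps * N + 1
    last_ok = sketch[-1][1] >= (1 - eps) * N
    gaps_ok = all(cur[2] <= prev[1] + 2 * eps * N
                  for prev, cur in zip(sketch, sketch[1:]))
    return first_ok and last_ok and gaps_ok

def query(sketch, N, psi, eps):
    """Return v_i of the first tuple with r - eps*N <= r_minus <= r_plus <= r + eps*N."""
    r = math.ceil(psi * N)
    for v, lo, hi in sketch:
        if r - eps * N <= lo and hi <= r + eps * N:
            return v
    return None  # cannot happen if the sketch is eps-approximate

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
print(satisfies_conditions(sketch, N=16, eps=0.25))  # True
print(query(sketch, N=16, psi=0.5, eps=0.25))        # 5
```

The answer 5 has true rank 5 in the 16-element example, which lies within 4 of the requested rank 8.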
Dilemma
• Memory is bounded
• GK-algorithm space requirement: $O\left(\frac{1}{\varepsilon}\log(\varepsilon N)\right)$
One-Pass Summary for Sliding Windows
• Continuously divide the stream into buckets based on the arrival order of the data elements
• The capacity of each bucket is $\lfloor \varepsilon N / 2 \rfloor$
• For each bucket, maintain an $\frac{\varepsilon}{4}$-approximate sketch continuously with the GK-algorithm
• Once a bucket is full, its $\frac{\varepsilon}{4}$-approximate sketch is compressed into an $\frac{\varepsilon}{2}$-approximate sketch
• The oldest bucket expires when the total number of elements currently held reaches N + 1
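[Figure: the most recent N elements divided into buckets, from the expired bucket on one side to the current bucket on the other; the current bucket is maintained by GK, and each full bucket holds a compressed approximate sketch.]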
Example: N = 8, ε = 1, bucket capacity $\lfloor \varepsilon N / 2 \rfloor$ = 4
• Stream: 1 2 3 4 5 6 7 8 9
• [Figure: the current bucket advances through the stream; each bucket is compressed into an approximate sketch once it is full, and the oldest bucket expires when element 9 arrives.]
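The bucketing and expiry rules can be traced on this example with a small simulation. As a simplification, each bucket below stores its raw elements where the algorithm would store a sketch, and the capacity ⌊εN/2⌋ is inferred from the example's value of 4; `simulate_buckets` is an illustrative name.

```python
import math
from collections import deque

def simulate_buckets(stream, N, eps):
    """Partition a stream into fixed-capacity buckets and expire the oldest one.

    Assumption: capacity is floor(eps * N / 2); raw elements stand in for sketches.
    """
    capacity = math.floor(eps * N / 2)
    if capacity < 1:
        raise ValueError("eps * N / 2 must be at least 1")
    buckets = deque()  # oldest bucket on the left
    total = 0          # number of elements held across all buckets
    for x in stream:
        if not buckets or len(buckets[-1]) == capacity:
            buckets.append([])  # previous bucket is full: start a new current bucket
        buckets[-1].append(x)
        total += 1
        if total == N + 1:      # one element too many: drop the oldest bucket
            total -= len(buckets.popleft())
    return list(buckets)

print(simulate_buckets(range(1, 10), N=8, eps=1))  # [[5, 6, 7, 8], [9]]
```

After element 9 arrives, only 5 of the 8 most recent elements are still summarised; the other 3 went with the expired bucket, which is the gap the "lift" slide addresses.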
Compress
• Compress an $\frac{\varepsilon}{2}$-approximate sketch into an ε-approximate sketch
• The compressed sketch needs only $O\left(\frac{1}{\varepsilon}\right)$ tuples
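One plausible way to realise such a compression is to sample the finer sketch at rank multiples of ξN and keep only those tuples plus the first and last. This is a sketch under stated assumptions, not the paper's pseudocode: the input is assumed to be ξ/2-approximate, ξN is assumed to be an integer, and the name `compress` and its parameters are illustrative.

```python
import math

def compress(sketch, N, xi):
    """Keep O(1/xi) tuples of a (xi/2)-approximate sketch over N elements.

    For each target rank j * xi * N, keep the first tuple whose rank bounds
    lie within xi * N / 2 of the target; always keep the first and last tuple.
    """
    step = xi * N
    keep = {0, len(sketch) - 1}
    for j in range(1, math.ceil(1 / xi) + 1):
        target = min(N, j * step)
        for idx, (_, lo, hi) in enumerate(sketch):
            if target - step / 2 <= lo and hi <= target + step / 2:
                keep.add(idx)
                break
    return [sketch[i] for i in sorted(keep)]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
# An exact sketch (every element with its exact rank) is trivially (xi/2)-approximate.
exact = [(v, i, i) for i, v in enumerate(sorted(data), start=1)]
small = compress(exact, N=16, xi=0.25)
print(small)
# [(1, 1, 1), (2, 2, 2), (6, 6, 6), (10, 10, 10), (11, 14, 14), (12, 16, 16)]
```

Consecutive kept tuples are at most ξN apart in target rank, so the upper bound of one exceeds the lower bound of the previous by at most 2ξN, which is the gap that an ξ-approximate sketch tolerates.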
Merge
• There are h data streams $D_i$, and each $D_i$ has $N_i$ data elements. Suppose each $S_i$ is an ε-approximate sketch of $D_i$.
• $S_{merge}$ is a sketch of $\bigcup_{i=1}^{h} D_i$
• $|S_{merge}| = \sum_{i=1}^{h} |S_i|$
• Then $S_{merge}$ is also an ε-approximate sketch
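The slide does not spell out how rank bounds are combined, so the following is one standard way to do it, offered as an assumption: a tuple's new lower bound adds, from every other sketch, the lower bound of its largest strictly smaller value, and its new upper bound adds the upper bound of its smallest strictly larger value minus one (or the whole stream size if there is none). Each input sketch is assumed to be sorted by value.

```python
def merge(sketches, sizes):
    """Merge sketches of disjoint streams; sizes[i] is N_i for sketches[i].

    Every input tuple survives, so the result has sum(len(s)) tuples.
    """
    merged = []
    for k, sk in enumerate(sketches):
        for v, lo, hi in sk:
            new_lo, new_hi = lo, hi
            for j, other in enumerate(sketches):
                if j == k:
                    continue
                below = [t for t in other if t[0] < v]  # values strictly smaller than v
                above = [t for t in other if t[0] > v]  # values strictly larger than v
                new_lo += below[-1][1] if below else 0
                new_hi += above[0][2] - 1 if above else sizes[j]
            merged.append((v, new_lo, new_hi))
    merged.sort()
    return merged

s1 = [(1, 1, 1), (3, 2, 2), (5, 3, 3), (7, 4, 4)]  # exact sketch of 1, 3, 5, 7
s2 = [(2, 1, 1), (4, 2, 2), (6, 3, 3), (8, 4, 4)]  # exact sketch of 2, 4, 6, 8
print(merge([s1, s2], sizes=[4, 4]))
# [(1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8)]
```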
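Another Problem
• Example: stream 1, 2, 3, 4, 5, 6, 7, 8, 9 with ε = 1 and N = 8
• [Figure: the bucket holding 1, 2, 3, 4 has expired; the remaining buckets, including the current one, each hold an approximate sketch.]
• The first tuple in $S_{merge}$ is (5, 1, 1), but the rank of 5 among the N most recent elements is 4.
• $S_{merge}$ by itself is therefore not a correct approximate sketch of the N most recent elements.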
Lift
• To solve the previous problem, we use a “lift” operation that raises the value of $r_i^+$ by $\lfloor \varepsilon N / 2 \rfloor$ for each tuple i
• If S is an $\frac{\varepsilon}{2}$-approximate sketch, then $S_{lift}$ is an ε-approximate sketch
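Query
• Step 1. Merge the local sketches of all buckets, including the current bucket, into $S_{merge}$
• Step 2. Lift $S_{merge}$ to obtain $S_{lift}$
• Step 3. For a given rank $r = \lceil \psi N \rceil$, find the first tuple in $S_{lift}$ such that $r - \varepsilon N \le r_i^- \le r_i^+ \le r + \varepsilon N$, and return $v_i$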
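Steps 2 and 3 can be traced on the stream 1, ..., 9 with N = 8 and ε = 1. In the sketch below the merged sketch is written out by hand as the exact sketch of the surviving elements 5, ..., 9, and the lift amount ⌊εN/2⌋ is an assumption inferred from the bucket capacity in the example; `lift` and `query` are illustrative names.

```python
import math

def lift(sketch, N, eps):
    """Raise every upper rank bound by floor(eps * N / 2) (assumed lift amount)."""
    shift = math.floor(eps * N / 2)
    return [(v, lo, hi + shift) for v, lo, hi in sketch]

def query(sketch, N, psi, eps):
    """First tuple with r - eps*N <= r_minus <= r_plus <= r + eps*N, r = ceil(psi*N)."""
    r = math.ceil(psi * N)
    for v, lo, hi in sketch:
        if r - eps * N <= lo and hi <= r + eps * N:
            return v
    return None

N, eps = 8, 1
window = [2, 3, 4, 5, 6, 7, 8, 9]  # the 8 most recent elements of 1..9
s_merge = [(5, 1, 1), (6, 2, 2), (7, 3, 3), (8, 4, 4), (9, 5, 5)]  # covers only 5..9
s_lift = lift(s_merge, N, eps)
print(s_lift)  # [(5, 1, 5), (6, 2, 6), (7, 3, 7), (8, 4, 8), (9, 5, 9)]

# After lifting, every true rank in the window falls inside its tuple's bounds.
true_rank = {v: sorted(window).index(v) + 1 for v, _, _ in s_lift}
print(all(lo <= true_rank[v] <= hi for v, lo, hi in s_lift))  # True
print(query(s_lift, N, psi=0.5, eps=eps))  # 5
```

Before the lift, the tuple (5, 1, 1) claims rank 1 for an element whose rank in the window is 4; widening the upper bound by the maximum number of expired but still relevant elements repairs this.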
One-Pass Summary under n-of-N
• EH partitioning technique
• EH maintains at most $\lceil 1/\lambda \rceil + 1$ “i-buckets” for each i
• For N elements, the number of buckets in EH is always $O\left(\frac{\log(\lambda N)}{\lambda}\right)$
• [Figure: elements e1 to e10 arriving one at a time. When the number of 1-buckets reaches 4, 1-buckets are merged into a 2-bucket; when the number of 2-buckets reaches 4, 2-buckets are merged into a 4-bucket.]
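The merge cascade in the figure can be reproduced with a short simulation. This is a minimal sketch under assumptions: λ = 1/2 (the value used on a later example slide), a per-size limit of ⌈1/λ⌉ + 1 buckets, merging of the two oldest buckets of a size once the limit is exceeded, and no expiry of old buckets; `eh_insert` is an illustrative name.

```python
import math

def eh_insert(buckets, item, lam):
    """Append item as a new 1-bucket, then merge while any size is over its limit.

    buckets is ordered oldest first; bucket sizes are powers of two.
    """
    limit = math.ceil(1 / lam) + 1
    buckets.append([item])
    size = 1
    while True:
        same = [i for i, b in enumerate(buckets) if len(b) == size]
        if len(same) <= limit:
            break
        i, j = same[0], same[1]  # the two oldest buckets of this size (adjacent)
        buckets[i : j + 1] = [buckets[i] + buckets[j]]
        size *= 2                # the merge may overflow the next size up
    return buckets

buckets = []
for e in range(1, 11):
    eh_insert(buckets, e, lam=0.5)
print(buckets)  # [[1, 2, 3, 4], [5, 6], [7, 8], [9], [10]]
```

With λ = 1/2 the limit is 3, so the fourth 1-bucket triggers a merge at e4, and at e10 the fourth 2-bucket triggers the first merge into a 4-bucket.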
Sketch Construction
• Use the EH technique to partition a data stream
• Maintain a sketch $S_b$ for each bucket b
• Choose $\lambda = \frac{\varepsilon}{\varepsilon + 2}$
• Maintain an $\frac{\varepsilon}{2}$-approximate sketch for each $S_b$
Example (λ = 1/2)
• Construct a sketch $S_b$ for each bucket b to summarize the data elements from the earliest element in b up to now
• [Figure: buckets f, e, d, c, b, a, labelled as 4-buckets, 2-buckets and 1-buckets, with their sketches $S_f$, $S_e$, $S_d$, $S_c$, $S_b$, $S_a$; in a second panel a further bucket g with sketch $S_g$ is shown.]
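n-of-N Query
• Step 1. [Figure: buckets f, e, d, c, b, a (4-buckets, 2-buckets and 1-buckets) with sketches $S_f$, $S_e$, $S_d$, $S_c$, $S_b$, $S_a$; the query range n is marked against the buckets to select a sketch.]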
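n-of-N Query (contd.)
• Step 2. Lift the selected sketch ($S_e$ in the figure) by $\lfloor \varepsilon n / 2 \rfloor$ to obtain $S_{lift}$
• Step 3. For a given rank r, find the first tuple in $S_{lift}$ such that $r - \varepsilon n \le r_i^- \le r_i^+ \le r + \varepsilon n$, and return $v_i$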
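The three query steps can be sketched end to end under explicit assumptions, since the slide text does not preserve the selection rule or the formulas. Assumed here: the selected bucket is the oldest one whose suffix (from its first element up to now) has at most n elements, the lift amount is ⌊εn/2⌋, the tolerance is εn, and an exact sorted suffix stands in for the per-bucket sketch $S_b$. The stream values and the name `n_of_N_query` are hypothetical.

```python
import math

def n_of_N_query(stream, bucket_starts, n, psi, eps):
    """Answer a psi-quantile query over the n most recent elements.

    bucket_starts: 0-based arrival index at which each EH bucket begins.
    """
    total = len(stream)
    # Step 1 (assumed rule): oldest bucket whose suffix holds at most n elements.
    start = min(s for s in bucket_starts if total - s <= n)
    suffix = sorted(stream[start:])
    sketch = [(v, i, i) for i, v in enumerate(suffix, start=1)]  # exact stand-in for S_b
    # Step 2: lift the upper rank bounds (assumed amount floor(eps * n / 2)).
    shift = math.floor(eps * n / 2)
    lifted = [(v, lo, hi + shift) for v, lo, hi in sketch]
    # Step 3: first tuple whose bounds fit within eps * n of the requested rank.
    r = math.ceil(psi * n)
    for v, lo, hi in lifted:
        if r - eps * n <= lo and hi <= r + eps * n:
            return v
    return None

stream = [7, 3, 9, 1, 8, 2, 6, 4, 10, 5]  # hypothetical data, 10 arrivals
bucket_starts = [0, 4, 6, 8, 9]           # EH buckets of sizes 4, 2, 2, 1, 1
answer = n_of_N_query(stream, bucket_starts, n=5, psi=0.5, eps=0.5)
print(answer)  # 4
```

Among the 5 most recent elements (2, 6, 4, 10, 5) the value 4 has rank 2, within εn = 2.5 of the requested rank 3, so the answer is acceptable even though the sketch used covers only the last 4 elements.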
Conclusions
• The work presented is among the attempts to develop space-efficient, one-pass, deterministic quantile summary algorithms with performance guarantees under the sliding window model of data streams.