Online Computation and Continuous Maintaining of Quantile Summaries • Tian Xia • Database Lab @ CCIS, Northeastern University • April 16, 2004
References • M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66, 2001. • X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, pages 362-373, 2004
Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model
Problem Definitions • φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉. • Quantile Query: Given φ, find the data element with rank ⌈φN⌉ among all elements in the stream. • Variation: the N most recent elements (sliding window model). • ε-approximate: For a query with exact rank r = ⌈φN⌉, return an element whose rank lies within the interval [r - εN, r + εN].
Example of A Quantile Query • Data stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12. • The 0.5-quantile is the element with rank ⌈0.5 · 16⌉ = 8, which is 8. • A 0.25-approximate 0.5-quantile query may return any element with rank in [4, 12], i.e. one of {4, 5, 6, 7, 8, 9, 10}.
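The rank definitions can be checked against this example with a brute-force sketch in Python. The function names `exact_quantile` and `approx_candidates` are hypothetical, and sorting the whole stream is exactly what a streaming summary is meant to avoid; the sketch only illustrates the definitions of a φ-quantile and of an ε-approximate answer.

```python
import math

STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N); ranks start at 1."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def approx_candidates(data, phi, eps):
    """Return the distinct values whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo, hi = r - eps * n, r + eps * n
    return sorted({v for rank, v in enumerate(ordered, start=1) if lo <= rank <= hi})
```

Calling `exact_quantile(STREAM, 0.5)` and `approx_candidates(STREAM, 0.5, 0.25)` corresponds to the two queries on the slide.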
Why Approximation? • Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm that exactly computes a φ-quantile of N data elements in p passes requires Ω(N^(1/p)) space. • Approximate quantile techniques are necessary to achieve sub-linear space.
Quantile Summary • Quantile Summary: A small number of objects from the input data sequence that a quantile estimator can use to answer quantile queries. • Other summary methods for large data sets include the average, standard deviation, histograms, counting sketches (FM-sketch), etc.
Properties of A Good Quantile Estimator • Provide tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is ε-approximate. • Data independent. • Use as small a memory footprint as possible, which includes temporary storage.
Previous Work • Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an ε-approximate quantile summary. • Space complexity: O((1/ε) log^2(εN)). • It requires advance knowledge of N, the size of the data set, so it does not work in a data stream environment.
Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model
Contributions of GK-algorithm • Dynamically adjusts the quantile summary as N, the total number of data elements in the data stream, grows. • Space complexity is reduced to O((1/ε) log(εN)).
Assumptions • A new data element arrives after each unit of time. • n denotes both the number of elements of the data sequence, as well as the current time. • A data element is represented by its value v. • rmin(v) and rmax(v) denote respectively the lower and upper bounds on the actual rank r of v among the elements seen so far.
The Summary Data Structure • GK-algorithm maintains a summary data structure S=S(n) at any point in time n. • S(n) consists of an ordered (non-decreasing) sequence of tuples which corresponds to a subset of the elements seen thus far.
The Summary Data Structure • S = {t0, t1, …, ts-1}, where ti = (vi, gi, Δi). • vi is the value of one of the elements seen so far. • gi = rmin(vi) - rmin(vi-1) • Δi = rmax(vi) - rmin(vi) • v0 and vs-1 always correspond to the minimum and the maximum elements seen so far.
The Summary Data Structure • Given gi = rmin(vi) - rmin(vi-1) and Δi = rmax(vi) - rmin(vi), • rmin(vi) = Σ(j ≤ i) gj • rmax(vi) = Σ(j ≤ i) gj + Δi • gi + Δi - 1 is an upper bound on the total number of elements that may have fallen between vi-1 and vi. • rmin(vs-1) = Σ(j ≤ s-1) gj = n.
Example of A Quantile Summary • Data stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples. • For clarity, the tuples of the above summary can be rewritten in the form ti = (vi, rmin(vi), rmax(vi)): {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}.
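The conversion from (v, g, Δ) tuples to rank bounds is a running sum of the g values: rmin is the prefix sum of g, and rmax adds Δ to it. A minimal Python sketch, where the helper name `rank_bounds` is hypothetical:

```python
from itertools import accumulate

SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]

def rank_bounds(summary):
    """Convert (v, g, delta) tuples into (v, rmin, rmax) tuples."""
    # rmin(v_i) is the sum of g_j for all j <= i.
    rmins = accumulate(g for _, g, _ in summary)
    # rmax(v_i) is rmin(v_i) + delta_i.
    return [(v, rmin, rmin + delta) for (v, _, delta), rmin in zip(summary, rmins)]
```

Applied to `SUMMARY`, this gives the rewritten form shown on the slide, and the last rmin equals n = 16.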
Error Rate? • PROPOSITION 1: Given a quantile summary S, a φ-quantile can always be identified to within an error of max_i(gi + Δi)/2. • COROLLARY 1: If at any time n the summary S(n) satisfies max_i(gi + Δi) ≤ 2εn, then we can answer any φ-quantile query to within εn precision.
QUANTILE(φ) • QUANTILE(φ): To compute an ε-approximate φ-quantile from the summary S(n) after n data elements, compute the rank r = ⌈φn⌉. Find i such that both r - rmin(vi) ≤ εn and rmax(vi) - r ≤ εn, and return vi. • i.e. r - εn ≤ rmin(vi) ≤ rmax(vi) ≤ r + εn.
Example of A Quantile Summary • Data stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the data stream: max_i(gi + Δi) = 8 = 2 · 0.25 · 16. • A 0.25-approximate 0.5-quantile query returns the element of tuple (4,1,6) or of tuple (10,6,0).
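The QUANTILE(φ) lookup and the check from Corollary 1 (max(g + Δ) ≤ 2εn) can be sketched in Python as follows. The function names are hypothetical, and returning the first qualifying tuple is a choice made here; any tuple that satisfies both rank conditions is a valid answer.

```python
import math

SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]

def is_eps_approximate(summary, eps, n):
    """Corollary 1: the summary answers queries within eps*n if max(g + delta) <= 2*eps*n."""
    return max(g + delta for _, g, delta in summary) <= 2 * eps * n

def quantile(summary, phi, eps, n):
    """Return the first v_i with r - rmin(v_i) <= eps*n and rmax(v_i) - r <= eps*n."""
    r = math.ceil(phi * n)
    rmin = 0
    for v, g, delta in summary:
        rmin += g               # rmin(v_i) = sum of g_j for j <= i
        rmax = rmin + delta     # rmax(v_i) = rmin(v_i) + delta_i
        if r - rmin <= eps * n and rmax - r <= eps * n:
            return v
    return None                 # no tuple qualifies; the summary is not eps-approximate
```

For the example summary with ε = 0.25, φ = 0.5 and n = 16, the target rank is r = 8 and the first qualifying tuple is (4,1,6).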
Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model
How does their algorithm work? • Insert a tuple in the summary corresponding to each new incoming element. • Periodically sweep over the summary to “merge” some of the tuples into their neighbors. • Merging keeps the space requirement bounded. • At all times max_i(gi + Δi) ≤ 2εn. • What to merge & how to merge?
INSERT(v) • INSERT(v): Find the smallest i such that vi-1 ≤ v < vi, and insert the tuple (v, 1, ⌊2εn⌋) between ti-1 and ti. Increment s. As a special case, if v is the new minimum or the new maximum element seen, insert (v, 1, 0) instead.
Example of INSERT • Elements inserted, in order: t0 = 12, t8 = 6, t3 = 10, t4 = 1. • S={(12, 1, 0)}, n=1 • S={(6, 1, 0), (12, 1, 0)}, n=2 • S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3 • S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4
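The INSERT steps above can be sketched in Python. This is a minimal sketch under two assumptions: ε = 0.25 (which is consistent with the Δ values in the example), and n is the number of elements seen before the new one arrives. The function name `insert` and the list-of-tuples representation are illustrative choices.

```python
import math

def insert(summary, v, n, eps):
    """Insert v into a list of (value, g, delta) tuples kept sorted by value.

    n is the number of elements seen before v (an assumption of this sketch).
    """
    if not summary or v < summary[0][0]:
        summary.insert(0, (v, 1, 0))        # new minimum: delta = 0
        return
    if v >= summary[-1][0]:
        summary.append((v, 1, 0))           # new maximum: delta = 0
        return
    # Smallest i with v_(i-1) <= v < v_i: the first tuple whose value exceeds v.
    i = next(k for k, t in enumerate(summary) if v < t[0])
    summary.insert(i, (v, 1, math.floor(2 * eps * n)))
```

Feeding 12, 6, 10, 1 in that order with ε = 0.25 reproduces the four summaries listed on the slide. For values of ε that are not exactly representable in binary floating point, `2 * eps * n` may need exact arithmetic (for example `fractions.Fraction`) before taking the floor.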
Merge • The summary grows with every insertion. • Intuitively, two tuples (vi, gi, Δi) and (vj, gj, Δj) can be merged into a new tuple (vk, gk, Δk) as long as gk + Δk ≤ 2εn. • An individual tuple is full if gk + Δk = ⌊2εn⌋. • Capacity and band are introduced to decide which tuples to merge.
Capacity and Band • The capacity of a tuple is the maximum number of elements that can be counted by gi before the tuple becomes full (gi ≤ 2εn - Δi). • The merge phase frees up space by merging tuples with small capacities into tuples with similar or larger capacities. • Bands: Roughly speaking, divide the Δs into bands that lie between elements of (0, ½·2εn, ¾·2εn, …, ((2^i - 1)/2^i)·2εn, …, 2εn - 1, 2εn). • The larger the capacity (i.e. the smaller the Δ), the larger the band.
Example of A Quantile Summary • Data stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples. • (2,1,7) and (3,1,7) are in the lowest band. (1,1,0), (10,6,0) and (12,6,0) are in the highest band.
Band • Strictly: for α from 1 to ⌈log 2εn⌉ and p = ⌊2εn⌋, band_α is the set of all Δ such that p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)). • If two Δs are ever in the same band, they never appear in different bands as n increases. • In band_0, Δ = ⌊2εn⌋. • A tree structure is imposed to facilitate merges between bands.
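The band definition can be turned into a small Python function. With p = ⌊2εn⌋, band α holds the Δ values with p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)), and band 0 holds Δ = p. The sketch below is an illustration, not the authors' code: the name `band` is hypothetical, and it lets α run until a band is found, so that Δ = 0 always lands in the highest band (which can be one past ⌈log 2εn⌉ when p is a power of two).

```python
def band(delta, p):
    """Return the band index of delta, where p = floor(2 * eps * n)."""
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0                                        # band 0 holds only delta = p
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1                                      # smaller deltas fall in higher bands
```

For the running example (ε = 0.25, n = 16, so p = 8), Δ = 7 falls in band 1, Δ = 6 in band 2, and Δ = 0 in the highest band, which matches the ordering described for the example summary.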
Tree Representation • Given a summary S = {t0, t1, …, ts-1}, the tree T associated with S contains a node Vi for each ti and a special root node R. • The parent of a node Vi is the node Vj such that j is the least index greater than i with band(tj) > band(ti). If no such index exists, R is the parent.
Tree Representation • [Figure: tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)] • PROPOSITION 3: The children of any node in T are always arranged in non-increasing order of band in S. • PROPOSITION 4: For any node V, the set of all its descendants in T forms a contiguous segment in S.
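The parent of a node is its nearest neighbor to the right with a strictly greater band, or the root R if there is none. A short Python sketch of that rule follows; the function name `parents` is hypothetical. The band values used for the example summary are an assumption obtained by applying the band definition with p = ⌊2 · 0.25 · 16⌋ = 8, which puts Δ = 7 in band 1, Δ = 6 in band 2, and Δ = 0 in the highest band (numbered 4 here). Only the relative order of the bands matters for the tree.

```python
def parents(bands):
    """For each tuple index i, return the index of its parent, or None for the root R.

    The parent of V_i is V_j for the least j > i with bands[j] > bands[i].
    """
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

# Bands of (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0) under the assumption above.
EXAMPLE_BANDS = [4, 1, 1, 2, 4, 4]
```

Under these band values, (2,1,7) and (3,1,7) are children of (4,1,6), which is a child of (10,6,0); the remaining tuples hang directly off R. The descendants of (10,6,0) occupy consecutive positions in S, as Proposition 4 states.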
Merge Actually • GK-algorithm merges a node together with all its descendants into either its parent node or its right sibling. • The tuple that results from the merge must not be full, i.e. gi + Δi < 2εn. • The operation is called COMPRESS().
COMPRESS() • The operation COMPRESS tries to merge a node together with all its descendants into either its parent node or its right sibling.
COMPRESS()
  for i from s-2 to 0 do
    if ((BAND(Δi, 2εn) ≤ BAND(Δi+1, 2εn)) && (g*i + gi+1 + Δi+1 < 2εn)) then
      DELETE all descendants of ti and the tuple ti itself;
    end if
  end for
end COMPRESS
• g*i denotes the sum of the g-values of the tuple ti and all its descendants in T.
DELETE (vi) • DELETE(vi): To delete the tuple (vi, gi,Δi) from S, replace (vi, gi,Δi) and (vi+1, gi+1,Δi+1) by the new tuple (vi+1, gi+ gi+1,Δi+1), and decrement s.
Example of COMPRESS and DELETE • Data stream so far (t0 to t5): 12, 10, 11, 10, 1, 10. • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6 • COMPRESS merges the tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0). • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6
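COMPRESS and DELETE can be sketched together in Python. This is a sketch under stated assumptions rather than the authors' implementation: ε = 0.25 as in the example; the minimum tuple at index 0 is never deleted, so the summary keeps the smallest element (the slide's loop bound "to 0" would otherwise allow it); Δ = 0 is placed in the highest band; and the descendants of ti are taken to be the contiguous run of tuples directly to its left with a strictly smaller band, which is what Proposition 4 implies. The names `band` and `compress` are illustrative.

```python
def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); band 0 holds delta = p."""
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0
    alpha = 1
    while not (p - 2**alpha - p % 2**alpha < delta <= p - 2**(alpha - 1) - p % 2**(alpha - 1)):
        alpha += 1
    return alpha

def compress(summary, n, eps):
    """Merge tuples of a sorted list of (v, g, delta) in place, scanning right to left."""
    p = int(2 * eps * n)
    i = len(summary) - 2
    while i >= 1:                                   # assumption: keep the minimum at index 0
        _, _, delta = summary[i]
        nv, ng, ndelta = summary[i + 1]
        b = band(delta, p)
        if b <= band(ndelta, p):
            j = i                                   # extend left over the descendants of t_i
            while j - 1 >= 1 and band(summary[j - 1][2], p) < b:
                j -= 1
            g_star = sum(t[1] for t in summary[j:i + 1])
            if g_star + ng + ndelta < 2 * eps * n:
                # DELETE: fold t_j .. t_i into their right neighbor t_(i+1).
                summary[j:i + 2] = [(nv, g_star + ng, ndelta)]
                i = j
        i -= 1
```

Running `compress` on the six-tuple summary above with n = 6 merges (11, 1, 1) into (12, 1, 0) and leaves the other tuples untouched, because every other candidate merge would reach or exceed 2εn = 3.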
Pseudo-Code for the whole algorithm
Initial State
  S ← ∅; s ← 0; n ← 0;
Algorithm
  To add the (n+1)st element, v, to summary S(n):
    if (n ≡ 0 mod 1/(2ε)) then
      COMPRESS();
    end if
    INSERT(v);
    n = n+1;
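The control flow of the whole algorithm is short: run COMPRESS every 1/(2ε) arrivals, then INSERT the new element and advance n. The Python sketch below shows only this scheduling. The function name `add_element` is hypothetical, `insert` and `compress` are passed in as callables with the assumed signature `(summary, ..., n, eps)`, and rounding 1/(2ε) to an integer period is an assumption for values of ε where it is not a whole number.

```python
def add_element(summary, v, n, eps, insert, compress):
    """Add the (n+1)st element v to the summary and return the updated count.

    insert(summary, v, n, eps) and compress(summary, n, eps) are supplied by the caller.
    """
    period = max(1, round(1 / (2 * eps)))   # COMPRESS runs when n is a multiple of 1/(2*eps)
    if n % period == 0:
        compress(summary, n, eps)
    insert(summary, v, n, eps)
    return n + 1
```

With ε = 0.25 the period is 2, so COMPRESS is attempted before the elements at n = 0, 2, 4, and so on, which matches the even values of n at which the examples compress.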
A Complete Example • Data stream so far (t0 to t6): 12, 10, 11, 10, 1, 10, 11. • S={(10, 1, 0), (12, 1, 0)}, n=2 • S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4 • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6 • Perform COMPRESS when t6 arrives. • S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5
A Complete Example • Data stream so far (t0 to t8): 12, 10, 11, 10, 1, 10, 11, 9, 6. • S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7 • Perform COMPRESS when t8 arrives. • S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5
A Complete Example • Data stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5 • Perform COMPRESS. • S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4 • Finally: S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6
Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model
Band Property • Observe that the number of bands and the number of elements in each band determine the space complexity. • PROPOSITION 2: At any point in time n and for any α ≥ 1, band_α(n) contains either 2^α or 2^(α-1) distinct values of Δ. • Since no more than 1/(2ε) elements with any given Δ are inserted, band_α is a summary of at most 2^α/(2ε) elements in the stream.
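The count in Proposition 2 follows from the width of a band. Writing p = ⌊2εn⌋ and taking band α to be the set of Δ with p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)), a short derivation (a sketch that assumes p ≥ 2^α, so the lower end of the band is not cut off at zero) gives:

```latex
\begin{align*}
|\mathrm{band}_\alpha|
  &= \bigl[p - 2^{\alpha-1} - (p \bmod 2^{\alpha-1})\bigr]
   - \bigl[p - 2^{\alpha} - (p \bmod 2^{\alpha})\bigr] \\
  &= 2^{\alpha-1} + (p \bmod 2^{\alpha}) - (p \bmod 2^{\alpha-1}).
\end{align*}
% Since (p mod 2^alpha) equals either (p mod 2^(alpha-1)) or (p mod 2^(alpha-1)) + 2^(alpha-1),
% the width is either 2^(alpha-1) or 2^alpha.
```

For example, with p = 8 band 1 contains the single value Δ = 7 (2^0 values), while with p = 5 band 1 contains Δ = 3 and Δ = 4 (2^1 values).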
LEMMAs • LEMMA 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value α. • That is, only a small number of nodes can have a child in band α. See Proposition 3.
LEMMAs • A full pair of tuples (ti-1, ti): band(ti-1) ≤ band(ti). The tuple ti-1 is the left partner and ti is the right partner in this full pair. • LEMMA 4: At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full tuple pair.
Full Pair Example • [Figure: tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)] • {(2,1,7), (3,1,7)} is a full pair. • {(1,1,0), (2,1,7)} is not a full pair. • (2,1,7) can only be a left partner!
Space Efficiency • Any band_α(n) node either is a right partner of a full pair, or can only be a left partner. • By Proposition 3, a band_α(n) node that can only be a left partner occurs only once for every parent of nodes from band_α(n). • By Lemmas 3 and 4, the number of nodes in any band is bounded by 3/(2ε) + 4/ε = 11/(2ε).
Space Efficiency • The number of bands is at most ⌈log(2εn)⌉ + 1. • THEOREM: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn). • GK-algorithm’s space complexity is O((1/ε) log(εN)).
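To get a feel for the theorem, the bound (11/(2ε)) log(2εn) can be evaluated numerically. The Python sketch below assumes the logarithm is base 2 (the bands are defined with powers of two) and requires 2εn ≥ 1; the function name `gk_tuple_bound` is hypothetical.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case number of tuples in S(n): (11 / (2*eps)) * log2(2*eps*n)."""
    if 2 * eps * n < 1:
        raise ValueError("the bound is only meaningful when 2*eps*n >= 1")
    return (11 / (2 * eps)) * math.log2(2 * eps * n)
```

For the running example (ε = 0.25, n = 16) the bound evaluates to 22 · 3 = 66 tuples, which is larger than the stream itself. The worst-case bound only becomes informative when n is large relative to 1/ε, because it grows logarithmically in n while the stream grows linearly.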
Outline of this talk • Quantile Estimation Overview • GK-quantile Summary Algorithm • Data Structure • Operations • Space Complexity Analysis • Sliding Window Model
Sliding Window Model • Under the sliding window model, a summary is maintained for the N most recently seen data elements. • Eliminating exactly the out-dated elements requires O(N) space. • Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.
n-of-N Model • A summary is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0,1] and any n ≤ N, we can return φ-quantiles among the n most recent elements seen so far in the data stream. • Lin et al. (ICDE 2004) proposed a one-pass summary algorithm that combines the EH (exponential histogram) partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.
Example of n-of-N model • Data stream (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. • Assume the sliding window size is 16 in an n-of-N model. A quantile query can be answered for any 1 ≤ n ≤ 16. • The 0.5-quantile is 6 for n=12 and 3 for n=4. • FYI: The sorted order of the whole sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
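The semantics of an n-of-N query can be pinned down with a brute-force Python sketch that stores the whole window and sorts the n most recent elements. This is the O(N)-space baseline that the summary-based approach is designed to avoid, not the algorithm of Lin et al.; the function name `recent_quantile` is hypothetical.

```python
import math
from collections import deque

STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def recent_quantile(window, n, phi):
    """Exact phi-quantile among the n most recent elements of the window (1 <= n <= len(window))."""
    recent = sorted(list(window)[-n:])
    return recent[math.ceil(phi * n) - 1]

# A deque with maxlen=N drops the oldest element automatically once N elements are stored.
WINDOW = deque(STREAM, maxlen=16)
```

With this window, `recent_quantile(WINDOW, 12, 0.5)` looks at t4 through t15 and `recent_quantile(WINDOW, 4, 0.5)` looks at t12 through t15, matching the two queries on the slide.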