Space-Efficient Online Computation of Quantile Summaries. SIGMOD 01. Michael Greenwald & Sanjeev Khanna. Presented by ellery.
Outline • Introduction • The summary data structure • Operation and algorithm • Tree representation • Analysis and experimental result • Conclusion
Introduction • Space-efficient computation of quantile summaries of very large data sets in a single pass. • Quantile queries: given a quantile ψ, return the value whose rank is ⌈ψN⌉
Example • N = 16, sorted data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12 • The 0.5-quantile returns the element ranked 8 (0.5 × 16), which is 8 • The 0.75-quantile returns the element ranked 12 (0.75 × 16), which is 10
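The example above can be reproduced with a sort-based lookup. This is only a baseline sketch: it stores and sorts all N values, which is exactly what a quantile summary avoids. The helper name `exact_quantile` and the use of a ceiling for non-integer ranks are assumptions made for illustration.

```python
import math


def exact_quantile(data, psi):
    """Return the element of rank ceil(psi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(data)
    rank = math.ceil(psi * len(ordered))
    return ordered[rank - 1]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
median = exact_quantile(data, 0.5)           # rank 8
third_quartile = exact_quantile(data, 0.75)  # rank 12
```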
Requirements • Explicit & tunable a priori guarantees on the precision of the approximation • As small a memory footprint as possible • Online: single pass over the data • Data-independent performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations. • Data-independent setup: no a priori knowledge required about the data set (size, range, distribution, order).
ε-approximate • A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to be within the interval [r − εN, r + εN] • Example: a data stream with 100 elements; the 0.5-quantile with ε = 0.1 returns a value v. Here r = 50 and εN = 10, so the true rank of v is within [40, 60]
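The allowed interval in the example can be computed directly. The helper below is hypothetical; rounding the target rank up and the slack εN down are assumptions for cases where they are not integers.

```python
import math


def allowed_rank_interval(n, psi, eps):
    """Interval [r - eps*n, r + eps*n] that the returned value's rank must fall in."""
    r = math.ceil(psi * n)        # target rank of the psi-quantile
    slack = math.floor(eps * n)   # permitted rank error
    return r - slack, r + slack


low, high = allowed_rank_interval(n=100, psi=0.5, eps=0.1)
```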
The Summary Data Structure • Let rmin(v) and rmax(v) denote the lower and upper bounds on the rank of v • Each tuple ti = (vi, gi, Δi), where gi = rmin(vi) − rmin(vi−1) and Δi = rmax(vi) − rmin(vi)
Example • ε = .01, N = 1750 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540]
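The rank intervals on this slide can be recovered from the {g, Δ} pairs, assuming gi = rmin(vi) − rmin(vi−1) and Δi = rmax(vi) − rmin(vi), which is how the numbers shown fit together. The function name `rank_bounds` is hypothetical, and the starting value 486 is derived from the slide (501 − 15), since the tuples to the left of 192 are not shown.

```python
def rank_bounds(tuples, rmin_before=0):
    """Turn (v, g, delta) tuples into (v, rmin, rmax) triples.

    rmin_before is the rmin of the tuple just left of the first one given.
    """
    bounds = []
    rmin = rmin_before
    for v, g, delta in tuples:
        rmin += g                          # rmin(v_i) = sum of g_j for j <= i
        bounds.append((v, rmin, rmin + delta))
    return bounds


shown = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
bounds = rank_bounds(shown, rmin_before=486)
```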
Query • Sketch S is ε-approximate; that is, for each ψ ∈ (0,1], there is a (vi, rmin(vi), rmax(vi)) in S such that ⌈ψN⌉ − εN ≤ rmin(vi) and rmax(vi) ≤ ⌈ψN⌉ + εN • vi is our answer for the ψ-quantile
Corollary • If at any time n, the summary S(n) satisfies the property that max over i of (gi + Δi) ≤ 2εn, then we can answer any ψ-quantile query to within an εn precision.
Overview of Summary Data Structure • ε = .01, N = 1800 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540] • Quantile ψ = .29? Compute r = ψN = 522 and choose the best vi
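A minimal sketch of the lookup on this slide, assuming "best vi" means a tuple whose rank interval lies within εN of the target rank r. The function name `query` and the (v, rmin, rmax) layout are illustrative choices, and only the three tuples visible on the slide are included.

```python
import math


def query(bounds, psi, n, eps):
    """Return a value whose [rmin, rmax] interval is within eps*n of rank ceil(psi*n)."""
    r = math.ceil(psi * n)
    slack = eps * n
    for v, rmin, rmax in bounds:
        if r - rmin <= slack and rmax - r <= slack:
            return v
    return None  # no visible tuple qualifies


visible = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
answer = query(visible, psi=0.29, n=1800, eps=0.01)
```

With r = 522 and εN = 18, the tuple for 192 is too far left (522 − 501 = 21), while 201 satisfies both conditions.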
Overview of Summary Data Structure • ε = .01, N = 1800, 2εN = 36 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540] • If rmax(vi) − rmin(vi−1) ≤ 2εN for every i, then the summary is ε-approximate. • Our goal: always maintain this property. • Tuple formulation of this rule: gi + Δi ≤ 2εN
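The tuple form of the rule is easy to check mechanically. The helper `satisfies_invariant` is a hypothetical name, and the second call uses a made-up tighter ε purely to show a failing case.

```python
def satisfies_invariant(tuples, eps, n):
    """True if every (v, g, delta) tuple has g + delta <= 2 * eps * n."""
    limit = 2 * eps * n
    return all(g + delta <= limit for _, g, delta in tuples)


shown = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
ok = satisfies_invariant(shown, eps=0.01, n=1800)          # limit is 36
too_tight = satisfies_invariant(shown, eps=0.009, n=1800)  # limit is 32.4
```

The largest g + Δ among the tuples shown is 35 (for 201), so the rule holds at ε = .01 but not at the tighter setting.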
Overview of Summary Data Structure • ε = .01, N = 1800, 2εN = 36 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540] • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN • Insert new observations into the summary • New observation: 197
Overview of Summary Data Structure • ε = .01, N = 1800, 2εN = 36 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 197 [502,536] (new); 201 {28,7} [529,536]; 204 {10,1} [539,540] • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN • Insert new observations into the summary
Overview of Summary Data Structure • ε = .01, N = 1801, 2εN = 36.02 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 197 {1,34} [502,536]; 201 {28,7} [530,537]; 204 {10,1} [540,541] • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN • Insert new observations into the summary • Insert the new tuple before the ith tuple: gnew = 1; Δnew = gi + Δi − 1
Overview of Summary Data Structure • ε = .01, N = 1801, 2εN = 36.02 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 197 {1,34} [502,536]; 201 {28,7} [530,537]; 204 {10,1} [540,541] • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN • Insert new observations into the summary • Delete all "superfluous" entries.
Overview of Summary Data Structure • ε = .01, N = 1801, 2εN = 36.02 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {28,7} [530,537]; 204 {10,1} [540,541] (the entry for 197, {1,34}, is being deleted) • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN • Insert new observations into the summary • Delete all "superfluous" entries.
Overview of Summary Data Structure • ε = .01, N = 1801, 2εN = 36.02 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {29,7} [530,537]; 204 {10,1} [540,541] • Goal: always maintain an ε-approximate summary: rmax(vi) − rmin(vi−1) = gi + Δi ≤ 2εN • Insert new observations into the summary • Delete all "superfluous" entries: gi = gi + gi−1
Overview of Summary Data Structure • ε = .01, N = 1801, 2εN = 36.02 • Tuples, shown as v {g, Δ} [rmin, rmax]: 192 {15,2} [501,503]; 201 {29,7} [530,537]; 204 {10,1} [540,541] • Insert: gnew = 1; Δnew = gi + Δi − 1 • Delete: gi = gi + gi−1
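The insert and delete rules from this walkthrough can be replayed on the three tuples shown. This is a sketch under the slide's rules only; the function names and the plain-list representation of (v, g, Δ) tuples are assumptions for illustration.

```python
def insert_before(tuples, i, v):
    """Insert observation v before tuple i: g_new = 1, delta_new = g_i + delta_i - 1."""
    _, g_i, delta_i = tuples[i]
    tuples.insert(i, (v, 1, g_i + delta_i - 1))


def delete(tuples, i):
    """Delete tuple i by adding its g to the tuple on its right."""
    _, g_i, _ = tuples[i]
    v_next, g_next, delta_next = tuples[i + 1]
    tuples[i + 1] = (v_next, g_i + g_next, delta_next)
    del tuples[i]


summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
insert_before(summary, 1, 197)   # 197 lands between 192 and 201
after_insert = list(summary)
delete(summary, 1)               # 197 is merged into 201
```

After the insert, 197 carries {1, 34}; after the delete, 201 carries {29, 7}, matching the slides.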
Terminology • Full tuple: a tuple is full if gi + Δi = 2εN • Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one • Capacity: number of observations that can be counted by gi before the tuple becomes full (= 2εN − Δi) • General strategy: delete tuples with small capacity and preserve tuples with large capacity.
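Operations • Insert(v): find the smallest i such that vi−1 ≤ v < vi, and insert the tuple (v, 1, gi + Δi − 1) before ti • Delete(vi): to delete (vi, gi, Δi) from S, replace (vi, gi, Δi) and (vi+1, gi+1, Δi+1) by the new tuple (vi+1, gi + gi+1, Δi+1) • Compress(): from right to left, merge all mergeable pairs.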
GK Algorithm • To add the (n+1)st observation, v, to summary S(n): • Is n ≡ 0 mod 1/(2ε)? • Yes: COMPRESS(), then INSERT(v) • No: INSERT(v) only
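The loop can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: the compress step here merges any adjacent pair that stays within 2εn and ignores the band and subtree rules, so it keeps the ε guarantee but makes no claim about the space bound. The class name `GKSketch`, the compress period of 1/(2ε) insertions, and the Δ = 0 rule for a new minimum or maximum are assumptions.

```python
import bisect
import math
import random


class GKSketch:
    """Simplified GK-style summary; compress() omits the band logic."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [v, g, delta] entries, sorted by v

    def insert(self, v):
        values = [t[0] for t in self.tuples]
        i = bisect.bisect_right(values, v)  # index of first tuple with value > v
        if i == 0 or i == len(self.tuples):
            delta = 0                        # new minimum or maximum: rank is exact
        else:
            _, g_i, delta_i = self.tuples[i]
            delta = g_i + delta_i - 1        # insert rule from the slides
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = 2 * self.eps * self.n
        i = len(self.tuples) - 2             # scan right to left, keep both end tuples
        while i >= 1:
            g_i = self.tuples[i][1]
            _, g_next, delta_next = self.tuples[i + 1]
            if g_i + g_next + delta_next <= limit:
                self.tuples[i + 1][1] = g_i + g_next  # delete rule from the slides
                del self.tuples[i]
            i -= 1

    def add(self, v):
        period = max(1, int(1 / (2 * self.eps)))
        if self.n > 0 and self.n % period == 0:
            self.compress()
        self.insert(v)

    def query(self, psi):
        r = math.ceil(psi * self.n)
        slack = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= slack and rmin + delta - r <= slack:
                return v
        return self.tuples[-1][0]


random.seed(0)
stream = list(range(1, 10001))  # distinct values, so the true rank of x is x
random.shuffle(stream)
sketch = GKSketch(eps=0.01)
for x in stream:
    sketch.add(x)
median = sketch.query(0.5)
```

Because the stream is a permutation of 1..10000, the returned median can be checked directly against the allowed rank window of 5000 ± 100.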
Tree Representation • ε = .001, N = 7,000, 2εN = 14 • Group tuples with similar capacities into bands • First (least index) node to the right with higher capacity band becomes parent.
Δ-range    Capacity    Band
0-7        8-15        3
8-11       4-7         2
12-13      2-3         1
14         1           0
[Figure, shown in four animation steps: 21 tuples labeled by band (left to right: 0 0 0 3 1 2 1 1 1 0 3 0 1 2 3 1 2 0 1 1 3) are arranged into a tree under a root R, with band 3 nodes at the top and band 0 nodes at the bottom.]
Operation (compress) • General strategy: delete tuples with small capacity and preserve tuples with large capacity. • 1) Deletion cannot leave descendants unmerged: it must delete entire subtrees • 2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity. • 3) Deletion cannot create an over-full tuple (i.e., one with g + Δ > ⌊2εN⌋)
Analysis • Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn)
Experimental Result • Measurement: • |S| • Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles • Optimal max observed ε • Compared 3 algorithms: • MRL • Preallocated (1/3 the number of stored observations as MRL) • Adaptive: allocate a new quantile only when observed error is about to exceed desired ε
Conclusion • Better worst-case behavior than previous algorithms • It does not require a priori knowledge of the parameter N