Space-Efficient Online Computation of Quantile Summaries

Michael Greenwald & Sanjeev Khanna, SIGMOD 2001
Presented by ellery

Outline
• Introduction
• The summary data structure
• Operation and algorithm
• Tree representation
• Analysis and experimental results
• Conclusion

Introduction
• Space-efficient computation of quantile summaries of very large data sets in a single pass.
• Quantile queries: given a quantile φ, return the value whose rank is φN.

Example: N = 16, sorted data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12
• The 0.5 quantile returns the element ranked 8 (0.5 × 16), which is 8.
• The 0.75 quantile returns the element ranked 12 (0.75 × 16), which is 10.
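
The example above can be reproduced with a minimal sketch that sorts the data and picks the element at rank φN. The function name `exact_quantile` and the use of a ceiling for non-integer ranks are assumptions made for illustration; this is the exact (non-streaming) baseline, not the summary algorithm itself.

```python
import math

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(data)
    rank = max(1, math.ceil(phi * len(ordered)))  # assumption: round ranks up
    return ordered[rank - 1]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
print(exact_quantile(data, 0.5))   # element ranked 8
print(exact_quantile(data, 0.75))  # element ranked 12
```

Sorting needs all N observations in memory, which is what the single-pass summary is meant to avoid.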

Requirements
• Explicit & tunable a priori guarantees on the precision of the approximation.
• As small a memory footprint as possible.
• Online: single pass over the data.
• Data Independent Performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations.
• Data Independent Setup: no a priori knowledge required about the data set (size, range, distribution, order).

ε-approximate
• A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r' is guaranteed to be within the interval [r − εN, r + εN].
• Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank is within [40, 60].
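
As a rough illustration of the definition, the sketch below checks whether a returned value is an acceptable ε-approximate answer by comparing its true rank in the sorted data with the interval [r − εN, r + εN]. The helper name `rank_is_within_eps` and the handling of duplicate values (any rank the value occupies may count) are assumptions, not part of the slides.

```python
from bisect import bisect_left, bisect_right

def rank_is_within_eps(data, value, phi, eps):
    """True if some rank occupied by `value` lies in [r - eps*N, r + eps*N], r = phi*N."""
    ordered = sorted(data)
    n = len(ordered)
    r = phi * n
    lowest = bisect_left(ordered, value) + 1   # smallest 1-based rank of value
    highest = bisect_right(ordered, value)     # largest 1-based rank of value
    if highest < lowest:
        return False                           # value does not occur in the data
    return lowest <= r + eps * n and highest >= r - eps * n

stream = list(range(1, 101))                   # 100 elements, rank of x is x
print(rank_is_within_eps(stream, 40, 0.5, 0.1))
print(rank_is_within_eps(stream, 61, 0.5, 0.1))
```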

The Summary Data Structure
• Let rmin(v) and rmax(v) denote the lower and upper bounds on the rank of v.
• Each tuple ti = (vi, gi, Δi), where gi = rmin(vi) − rmin(vi−1) and Δi = rmax(vi) − rmin(vi).

Example .01, N=1750 {28,7} {10,1} {15,2} 192 204 201 [501,503] [539,540] [529,536]

Query
• Sketch S is ε-approximate. That is, for each φ ∈ (0, 1], there is a (vi, rmin(vi), rmax(vi)) in S such that r − εN ≤ rmin(vi) and rmax(vi) ≤ r + εN, where r = φN.
• vi is our answer for the φ-quantile.

Corollary
• If at any time n, the summary S(n) satisfies the property that max over i of (gi + Δi) ≤ 2εn, then we can answer any φ-quantile query to within an εn precision.

Overview of Summary Data Structure
ε = .01, N = 1800
Summary: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540]
• Quantile φ = .29? Compute r = φN = 522 and choose the best vi.
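
A minimal sketch of the lookup on this slide, assuming that "best" means a tuple whose whole rank interval lies within εN of the target rank r (r − rmin ≤ εN and rmax − r ≤ εN). The function name `query_rank` is hypothetical, and returning the first qualifying tuple is a simplification.

```python
def query_rank(bounds, r, eps, n):
    """bounds: list of (v, rmin, rmax). Return the first v whose interval is within eps*n of r."""
    tolerance = eps * n
    for v, rmin, rmax in bounds:
        if r - rmin <= tolerance and rmax - r <= tolerance:
            return v
    return None  # no tuple in this window qualifies

# Visible part of the summary on the slide: (v, rmin, rmax)
bounds = [(192, 501, 503), (201, 529, 536), (204, 539, 540)]
print(query_rank(bounds, r=522, eps=0.01, n=1800))
```

With εN = 18, the tuple for 192 is rejected because its rmin is 21 below r, while 201 has its whole interval [529, 536] inside [504, 540].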

Overview of Summary Data Structure
ε = .01, N = 1800, 2εN = 36
Summary: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540]
• If (rmax(vi+1) − rmin(vi)) ≤ 2εN for every i, then the summary is ε-approximate.
• Our goal: always maintain this property.
• Tuple formulation of this rule: gi + Δi ≤ 2εN.

Overview of Summary Data Structure
ε = .01, N = 1800, 2εN = 36
Summary: 192 {15,2} [501,503]; 201 {28,7} [529,536]; 204 {10,1} [539,540]
• Goal: always maintain an ε-approximate summary: (rmax(vi+1) − rmin(vi)) = (gi + Δi) ≤ 2εN.
• Insert new observations into the summary: the new observation 197 falls between 192 and 201, so its rank lies in [502, 536].

Overview of Summary Data Structure
ε = .01, N = 1801, 2εN = 36.02
Summary: 192 {15,2} [501,503]; 197 {1,34} [502,536]; 201 {28,7} [530,537]; 204 {10,1} [540,541]
• Goal: always maintain an ε-approximate summary: (rmax(vi+1) − rmin(vi)) = (gi + Δi) ≤ 2εN.
• Insert new observations into the summary.
• Insert the new tuple before the ith tuple: gnew = 1; Δnew = gi + Δi − 1.
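
The insertion rule on this slide (gnew = 1, Δnew = gi + Δi − 1) can be sketched as below. The function name `insert_observation` is hypothetical. Giving a new minimum or maximum Δ = 0 is an assumption taken from the paper's special case, since the slide does not cover it.

```python
import bisect

def insert_observation(summary, v):
    """summary: list of (v, g, delta) sorted by v. Returns a new list containing v."""
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)        # smallest i with v < v_i
    if i == 0 or i == len(summary):
        new = (v, 1, 0)                       # assumption: new min/max has an exact rank
    else:
        _, g_i, delta_i = summary[i]
        new = (v, 1, g_i + delta_i - 1)       # rule from the slide
    return summary[:i] + [new] + summary[i:]

before = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
after = insert_observation(before, 197)
print(after)
```

For 197 the right neighbour is 201 with {28, 7}, so the new tuple gets Δ = 28 + 7 − 1 = 34, matching the {1, 34} shown on the slide.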

Overview of Summary Data Structure
ε = .01, N = 1801, 2εN = 36.02
Before: 192 {15,2} [501,503]; 197 {1,34} [502,536]; 201 {28,7} [530,537]; 204 {10,1} [540,541]
After: 192 {15,2} [501,503]; 201 {29,7} [530,537]; 204 {10,1} [540,541]
• Goal: always maintain an ε-approximate summary: (rmax(vi+1) − rmin(vi)) = (gi + Δi) ≤ 2εN.
• Insert new observations into the summary.
• Delete all "superfluous" entries: the tuple for 197 is removed and its g is added to the tuple on its right, gi = gi + gi−1, so 201 goes from {28,7} to {29,7}.
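
The deletion step can be sketched as follows: removing a tuple folds its g into the tuple on its right (gi = gi + gi−1), and a pair is treated as mergeable only if the result stays within 2εN. The names `is_mergeable` and `delete_tuple` are hypothetical, and this check ignores the band and subtree conditions discussed on the later tree slides.

```python
def is_mergeable(summary, i, limit):
    """True if deleting tuple i keeps its right neighbour within the limit 2*eps*N."""
    _, g_i, _ = summary[i]
    _, g_next, delta_next = summary[i + 1]
    return g_i + g_next + delta_next <= limit

def delete_tuple(summary, i):
    """Remove tuple i and add its g to the tuple on its right. Returns a new list."""
    _, g_i, _ = summary[i]
    v_next, g_next, delta_next = summary[i + 1]
    merged = (v_next, g_i + g_next, delta_next)
    return summary[:i] + [merged] + summary[i + 2:]

summary = [(192, 15, 2), (197, 1, 34), (201, 28, 7), (204, 10, 1)]
limit = 36.02                       # 2 * eps * N with eps = .01, N = 1801
if is_mergeable(summary, 1, limit):
    summary = delete_tuple(summary, 1)
print(summary)
```

Note that Δ of the surviving tuple is unchanged: only the count g moves, so rmin and rmax of 201 stay at [530, 537].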

Overview of Summary Data Structure
ε = .01, N = 1801, 2εN = 36.02
Summary: 192 {15,2} [501,503]; 201 {29,7} [530,537]; 204 {10,1} [540,541]
• Insert: gnew = 1; Δnew = gi + Δi − 1
• Delete: gi = gi + gi−1

Terminology
• Full tuple: a tuple is full if gi + Δi = ⌊2εN⌋.
• Full tuple pair: a pair of tuples is full if deleting the left-hand tuple would overfill the right one.
• Capacity: the number of observations that can be counted by gi before the tuple becomes full (= ⌊2εN⌋ − Δi).
General strategy: delete tuples with small capacity and preserve tuples with large capacity.
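
These terms translate into a few small predicates. The sketch below is illustrative only: the function names are hypothetical, and it assumes the fullness threshold is floor(2εN), as in the compress rules later in the deck.

```python
import math

def threshold(eps, n):
    """Assumed fullness threshold: floor(2 * eps * N)."""
    return math.floor(2 * eps * n)

def capacity(delta, eps, n):
    """Observations g may still count before the tuple becomes full."""
    return threshold(eps, n) - delta

def is_full(g, delta, eps, n):
    return g + delta == threshold(eps, n)

def is_full_pair(left, right, eps, n):
    """left, right: (v, g, delta). True if deleting left would overfill right."""
    return left[1] + right[1] + right[2] > threshold(eps, n)

eps, n = 0.01, 1801
print(capacity(7, eps, n), capacity(1, eps, n))
print(is_full(29, 7, eps, n))
print(is_full_pair((192, 15, 2), (201, 29, 7), eps, n))
```

A tuple with a small Δ (such as 204 with Δ = 1) has a large capacity and is worth keeping, which is the intuition behind the general strategy.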

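Operations
• Insert(v): find the smallest i such that vi−1 ≤ v < vi, and insert the tuple (v, 1, gi + Δi − 1) between ti−1 and ti.
• Delete(vi): to delete (vi, gi, Δi) from S, replace (vi, gi, Δi) and (vi+1, gi+1, Δi+1) by the new tuple (vi+1, gi + gi+1, Δi+1).
• Compress(): from right to left, merge all mergeable pairs.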

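GK Algorithm
To add the (n+1)st observation, v, to summary S(n):
• Test whether it is time to compress (in the paper, n ≡ 0 mod 1/(2ε)).
• Yes: COMPRESS(), then INSERT(v).
• No: INSERT(v).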
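
Putting the operations together, a simplified single-pass sketch might look like the following. It is not the authors' implementation: the class name `GKSummary` is hypothetical, the compress test n ≡ 0 mod 1/(2ε) is taken from the paper, new minima and maxima get Δ = 0 as an assumption, and COMPRESS merges any pair that stays within floor(2εn) without the band and subtree rules of the tree representation.

```python
import bisect
import math

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                      # [v, g, delta], sorted by v

    def insert(self, v):
        period = max(1, int(1 / (2 * self.eps)))
        if self.n > 0 and self.n % period == 0:
            self.compress()
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        if i == 0 or i == len(self.tuples):
            delta = 0                         # new minimum or maximum
        else:
            delta = self.tuples[i][1] + self.tuples[i][2] - 1
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def compress(self):
        limit = math.floor(2 * self.eps * self.n)
        i = len(self.tuples) - 2              # right to left; keep min and max
        while i >= 1:
            g = self.tuples[i][1]
            right = self.tuples[i + 1]
            if g + right[1] + right[2] <= limit:
                right[1] += g                 # fold g into the right neighbour
                del self.tuples[i]
            i -= 1

    def query(self, phi):
        r = math.ceil(phi * self.n)
        tolerance = self.eps * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if r - rmin <= tolerance and rmin + delta - r <= tolerance:
                return v
        return self.tuples[-1][0]

summary = GKSummary(eps=0.01)
for x in [(i * 7919) % 1000 + 1 for i in range(1000)]:  # 1..1000 in scrambled order
    summary.insert(x)
print(len(summary.tuples), summary.query(0.5))
```

Because every inserted and merged tuple satisfies g + Δ ≤ 2εn, the query loop always finds a tuple within εn of the target rank. The space bound of the full algorithm is not claimed for this simplified compress.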

Tree Representation -range Capacity Band0-7 8-15 38-11 4-7 212-13 2-3 114 1 0 .001, N=7,000 2N=14 • Group tuples with similar capacities into bands • First (least index) node to the right with higher capacity band becomes parent. 0 0 0 3 1 2 1 1 1 0 3 0 1 2 3 1 2 0 1 1 3

Tree Representation -range Capacity Band0-7 8-15 38-11 4-7 212-13 2-3 114 1 0 .001, N=7,000 2N=14 • Group tuples with similar capacities into bands • First (least index) node to the right with higher capacity band becomes parent. 3 3 3 3 0 0 0 1 2 1 1 1 0 0 1 2 1 2 0 1 1

3 3 3 3 2 2 2 0 0 0 1 1 1 1 0 0 1 1 0 1 1 Tree Representation -range Capacity Band0-7 8-15 38-11 4-7 212-13 2-3 114 1 0 .001, N=7,000 • Group tuples with similar capacities into bands • First (least index) node to the right with higher capacity band becomes parent. 2N=14

R 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 0 0 0 0 0 0 Tree Representation -range Capacity Band0-7 8-15 38-11 4-7 212-13 2-3 114 1 0 .001, N=7,000 2N=14 • Group tuples with similar capacities into bands • First (least index) node to the right with higher capacity band becomes parent.
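
The parent rule can be tried out on the band labels from the first frame of this slide. The sketch below assumes that this row (0 0 0 3 1 2 ...) lists the tuples in left-to-right summary order, which the lost figure cannot confirm. The function name `find_parents` is hypothetical, and `None` stands for the root R.

```python
def find_parents(bands):
    """For each node, return the index of the first node to its right with a
    strictly higher band, or None if there is none (child of the root R)."""
    parents = []
    for i, band in enumerate(bands):
        parent = None
        for j in range(i + 1, len(bands)):
            if bands[j] > band:
                parent = j
                break
        parents.append(parent)
    return parents

# Band labels as listed on the slide (assumed to be in summary order)
bands = [0, 0, 0, 3, 1, 2, 1, 1, 1, 0, 3, 0, 1, 2, 3, 1, 2, 0, 1, 1, 3]
parents = find_parents(bands)
for i, (band, parent) in enumerate(zip(bands, parents)):
    print(i, band, "R" if parent is None else parent)
```

Under this assumption the four band-3 tuples have no parent to their right and hang directly off R, which agrees with the final frame of the slide.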

Operation (compress)
General strategy: delete tuples with small capacity and preserve tuples with large capacity.
1) Deletion cannot leave descendants unmerged: it must delete entire subtrees.
2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.
3) Deletion cannot create an over-full tuple (i.e., with g + Δ > floor(2εN)).

Analysis
• Theorem: at any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).
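
To get a feel for the bound, the sketch below evaluates (11/(2ε)) log(2εn), the worst-case tuple count stated in Greenwald and Khanna's paper. The function name `gk_tuple_bound` is hypothetical, and taking the logarithm base 2 is an assumption.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n); base 2 is assumed."""
    return (11 / (2 * eps)) * math.log2(2 * eps * n)

for n in (10**6, 10**9, 10**12):
    print(n, round(gk_tuple_bound(0.01, n)))
```

The bound grows only logarithmically in n, so for large streams it is far below the number of observations. For small n it can exceed n, in which case it says nothing useful.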

Experimental Result
• Measurement:
  • |S|
  • Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
  • Optimal max observed ε
• Compared 3 algorithms:
  • MRL
  • Preallocated (1/3 the number of stored observations as MRL)
  • Adaptive: allocate a new quantile only when the observed error is about to exceed the desired ε

Conclusion
• Better worst-case behavior than previous algorithms.
• It does not require a priori knowledge of the parameter N.

Any questions?
