How to Summarize the Universe: Dynamic Maintenance of Quantiles By: Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss
Quantiles • Median, quartiles, … • The general case: • Uses • Statistics • Estimating result set size • Partitioning • …
Computing static quantiles • Blum, Floyd, Pratt, Rivest & Tarjan • Find the i’th element • Comparison based • Similar to QuickSort • O(n) – worst case time
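As a rough illustration of the selection idea named on this slide, here is a minimal Python sketch of median-of-medians selection. The function name `select` and the group size of five are illustrative choices, and the sketch copies lists for readability rather than working in place.

```python
def select(values, i):
    """Return the i-th smallest element (0-based) of values."""
    values = list(values)
    while True:
        if len(values) <= 5:
            return sorted(values)[i]
        # Median of each group of five, then the median of those medians as pivot.
        groups = [sorted(values[g:g + 5]) for g in range(0, len(values), 5)]
        medians = [grp[len(grp) // 2] for grp in groups]
        pivot = select(medians, len(medians) // 2)
        lower = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if i < len(lower):
            values = lower                      # answer lies below the pivot
        elif i < len(lower) + len(equal):
            return pivot                        # answer is the pivot itself
        else:
            i -= len(lower) + len(equal)        # answer lies above the pivot
            values = [v for v in values if v > pivot]
```

The partition step is what makes it similar to QuickSort; the pivot choice is what gives the worst-case linear bound. Note that it needs the whole input in memory, which is the limitation the following slides address.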
Problems with massive data sets • O(n) time – not good enough… • O(n) space – usually not affordable • Dynamic environment • Cancellations are especially troublesome • Usually recomputed periodically • May be very inaccurate until recomputed • Some kind of approximation is the only choice!
Common approaches • Deterministically chosen sample • Randomization – probability of failure • Maintaining a backing sample • Wavelets • Most of the above approaches work well for the incremental case, but deletions may cause inaccuracy.
GK – Greenwald-Khanna (’01) • Fill the available memory with values. • Maintain rank ranges on the values in memory. • When a new value is inserted, kick a value out of memory. • Insert-only algorithm. • Can be extended to support deletes (“GK2”). • Maintain two instances: one for insertions and one for deletions.
Maintenance of Equi-Depth Histograms (using a backing sample) • Gibbons, Matias, Poosala (’97) • Scan the dataset and choose values for the sample using the “reservoir” method. • Treat insertions as a “continuous” scan. • When a deletion from the sample is necessary, rescan only if the number of items drops below a specified minimum. • Works well for a mostly-insertions environment.
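For readers unfamiliar with the “reservoir” method, the following is a minimal sketch of classic reservoir sampling (Algorithm R) for an insert-only stream. It is not the backing-sample scheme of Gibbons et al. itself, which also handles deletions and rescans; the names `reservoir_sample`, `k` and `rng` are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a one-pass stream."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:                    # happens with probability k/n
                sample[j] = item         # evict a random current member
    return sample
```

A deletion of a sampled value shrinks the reservoir, which is why the slide mentions rescanning once the sample drops below a minimum size.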
The authors’ main result • The RSS algorithm • RSS: Random Subset Sum • Space: polylogarithmic in universe size • Time: proportional to the space used • A priori guarantee of accuracy within a user-specified error ε, with a user-specified probability of failure δ.
Some formalism… • The universe: U = {0, …, |U|-1} • Number of tuples in data set: ||A|| = N • Data set can be thought of as an array: A[i] is the number of tuples with value i • Our goal for computing Φ-quantiles – find a j_k such that:
Some assumptions • The universe’s size is known • Later we’ll throw that assumption away • Update = Delete + Insert
Computing quantiles • Let’s say A[i] is known for every i. • Easy to maintain through updates • Summing up array items ? • Not a very good complexity…
Computing quantiles (cont.) • We need a method of reducing summation overhead. • We should be able to compute any sum of items in A in logarithmic time. • The solution: Keeping computed sums of intervals.
Dyadic intervals - defined • Atomic dyadic interval: a single point. • $I_{j,k} = [\,k \cdot 2^{\log|U| - j},\ (k+1) \cdot 2^{\log|U| - j} - 1\,]$ • j is the resolution level • Example (|U| = 8): level 3 holds the atomic intervals I(3,0), …, I(3,7) for the points 0, …, 7; level 2 holds I(2,0), …, I(2,3); level 1 holds I(1,0), I(1,1); level 0 holds I(0,0), the whole universe.
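The definition can be checked with a few lines of Python. This is only a sketch: the function name `dyadic_interval` and the parameter `log_u` (standing for log|U|, with |U| assumed to be a power of two) are illustrative.

```python
def dyadic_interval(j, k, log_u):
    """Endpoints (inclusive) of the k-th dyadic interval at resolution level j."""
    width = 1 << (log_u - j)             # 2^(log|U| - j) points per interval
    return (k * width, (k + 1) * width - 1)

# The |U| = 8 example: level 0 is the whole universe, level 3 is single points.
levels = {j: [dyadic_interval(j, k, 3) for k in range(1 << j)] for j in range(4)}
```

Level j contains 2^j intervals, so `levels[2]` lists the four intervals of width two.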
Computing an arbitrary interval • Let’s say we have sums for all dyadic intervals as in the above example. • We want to compute A[0,6]. • A[0,6] = I(1,0) + I(2,2) + I(3,6), that is, the sums over [0,3], [4,5] and [6,6].
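One way to find such a decomposition is a greedy left-to-right sweep, sketched below under the assumption that |U| is a power of two. The helper names (`dyadic_cover`, `build_dyadic_sums`, `range_sum`) and the toy `counts` array are hypothetical; the exact sums stored here are what RSS later replaces with estimates.

```python
def dyadic_cover(lo, hi, log_u):
    """Split [lo, hi] (inclusive) into disjoint dyadic intervals, as (j, k) pairs."""
    pieces = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo ...
        size = (lo & -lo) if lo else (1 << log_u)
        # ... shrunk until it fits inside [lo, hi].
        while size > hi - lo + 1:
            size >>= 1
        level = log_u - (size.bit_length() - 1)
        pieces.append((level, lo // size))
        lo += size
    return pieces

def build_dyadic_sums(counts, log_u):
    """Exact sum of counts over every dyadic interval, keyed by (j, k)."""
    sums = {}
    for j in range(log_u + 1):
        width = 1 << (log_u - j)
        for k in range(1 << j):
            sums[(j, k)] = sum(counts[k * width:(k + 1) * width])
    return sums

def range_sum(sums, lo, hi, log_u):
    """Sum of counts over [lo, hi] using only stored dyadic sums."""
    return sum(sums[piece] for piece in dyadic_cover(lo, hi, log_u))

counts = [3, 1, 4, 1, 5, 9, 2, 6]        # toy A[i] for |U| = 8
sums = build_dyadic_sums(counts, 3)
```

For the slide's example, `dyadic_cover(0, 6, 3)` yields the same three intervals, I(1,0), I(2,2) and I(3,6).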
Dyadic intervals - observations • Log(|U|) + 1 resolution levels • 2|U| - 1 dyadic intervals altogether • O(|U|) space needed to keep them all • O(log(|U|)) time needed to compute any arbitrary interval.
Computing quantiles (cont.) • We can now efficiently compute any arbitrary interval in A. • The k-th Φ-quantile can be computed as follows: • We need a j_k s.t.: A[0,j_k) < kΦN < A[0,j_k+1) • Use binary search to find it!
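The binary search can be sketched as follows. The sketch assumes a hypothetical callable `prefix_count(j)` that returns A[0, j); here it is computed exactly from a toy array, whereas in RSS it would be an estimate. It also uses ≥ on the right-hand comparison so that an answer always exists, which is an assumption of this sketch rather than something stated on the slide.

```python
def quantile_index(prefix_count, universe_size, target_rank):
    """Smallest j with prefix_count(j + 1) >= target_rank, by binary search.

    prefix_count(j) must return A[0, j), the number of tuples with value < j.
    """
    lo, hi = 0, universe_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(mid + 1) >= target_rank:
            hi = mid                     # answer is mid or to its left
        else:
            lo = mid + 1                 # answer is strictly right of mid
    return lo

counts = [3, 1, 4, 1, 5, 9, 2, 6]        # toy A[i], |U| = 8
n = sum(counts)                          # N = 31

def prefix(j):
    return sum(counts[:j])               # exact stand-in for interval sums

# Quartiles: phi = 1/4, k = 1, 2, 3.
quartiles = [quantile_index(prefix, len(counts), k * 0.25 * n) for k in (1, 2, 3)]
```

Each query makes O(log|U|) calls to `prefix_count`, and each call is itself a sum over O(log|U|) dyadic intervals.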
But… • Keeping O(|U|) of data presents a real space complexity problem. • We need a way of estimating A[i] on demand. • … And also of estimating any dyadic interval on demand.
Introducing random sets • Let S be a random set of values from U. • Each value has a probability of ½ of being in S. • Expectation of the number of items in S is ½|U|.
Random subset sums • Define ||A_S|| as the number of items in A with values in S. • Expectation of ||A_S|| is ½||A|| = ½N. • Now consider only subsets S containing a certain value i.
Random subset sums (cont.) • Suppose we keep several random sets S, each including every value of U independently with probability ½. • We maintain ||A_S|| for each such set. • Easy to maintain during updates. • How can we now estimate A[i]?
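To make “easy to maintain during updates” concrete, here is a small sketch that keeps ||A|| and one counter ||A_S|| per random set. The class name `SubsetSums` is hypothetical, and the sets are stored as explicit membership tables purely for illustration; the space slides explain why the real algorithm cannot afford that.

```python
import random

class SubsetSums:
    """Maintains ||A|| and ||A_S|| for several random sets S under updates."""

    def __init__(self, universe_size, num_sets, seed=0):
        rng = random.Random(seed)
        # members[s][v] is True when value v belongs to set s (probability 1/2).
        self.members = [[rng.random() < 0.5 for _ in range(universe_size)]
                        for _ in range(num_sets)]
        self.sums = [0] * num_sets       # ||A_S|| for each set
        self.total = 0                   # ||A||

    def update(self, value, delta):
        """delta = +1 for an insertion, -1 for a deletion."""
        self.total += delta
        for s, table in enumerate(self.members):
            if table[value]:
                self.sums[s] += delta
```

Because each counter is a plain sum, a deletion exactly cancels the matching insertion, and the order of updates does not matter.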
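Random subset sums (cont.) • We can estimate A[i] for any i, using a set S that contains i, with: A[i] ≈ 2||A_S|| - ||A|| • Proof: every value other than i is in S with probability ½, so the expectation of ||A_S|| is A[i] + ½(||A|| - A[i]) = ½(||A|| + A[i]); hence the expectation of 2||A_S|| - ||A|| is A[i]. • The authors prove that repeating the process O(1/ε²) times yields the required accuracy.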
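The estimator can be tried out with a quick simulation. This sketch forces i into every sampled set (equivalent to keeping only the sets that happen to contain i) and averages the results; the averaging, the function name `estimate_count` and the toy `counts` array are simplifications chosen for illustration.

```python
import random

def estimate_count(counts, i, repetitions, rng):
    """Average 2*||A_S|| - ||A|| over random sets S that contain i."""
    total = sum(counts)                                  # ||A||
    estimates = []
    for _ in range(repetitions):
        # i is always in S; every other value joins with probability 1/2.
        in_s = [v == i or rng.random() < 0.5 for v in range(len(counts))]
        subset_sum = sum(c for c, keep in zip(counts, in_s) if keep)  # ||A_S||
        estimates.append(2 * subset_sum - total)
    return sum(estimates) / repetitions

counts = [3, 1, 4, 1, 5, 9, 2, 6]        # toy A[i], |U| = 8
estimate = estimate_count(counts, 5, 20000, random.Random(0))
```

A single repetition is very noisy, since each other value v contributes ±A[v] to the error; many repetitions are needed before the average settles near A[5] = 9.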
Random subset sums (cont.) • We can also estimate any dyadic interval Ij,k using the same method. • Improvement: We can compute the sums for dyadic intervals from a certain level. • We can now estimate any arbitrary interval in the universe…
Space Considerations • Keeping a set of expected size ½|U| is still O(|U|). • We need a method of “keeping” a set without actually keeping it… • The technique: instead of sets, keep random seeds of size O(log|U|) bits and compute whether a given i ∈ S on demand.
Extended Hamming Code • Used for generating the random sets. • Provides sufficient “randomness”. • For example: • |U| = 8 • Seed size: log|U|+1 = 4 • G(seed, i) = seed × i-th column of the generator matrix (mod 2)
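A possible way to compute the membership bit is sketched below. It assumes that column i of the generator matrix is a leading 1 followed by the binary expansion of i, which is one common form of the extended Hamming generator matrix; the matrix shown on the original slide is not reproduced in this text, so treat that layout, the function name `in_set` and the bit-packing of the seed as assumptions.

```python
def in_set(seed, i, log_u):
    """Membership bit G(seed, i): dot product over GF(2) of seed and column i.

    seed is an integer holding log_u + 1 seed bits.
    """
    column = (1 << log_u) | i            # assumed layout: 1, then the bits of i
    return bin(seed & column).count("1") % 2 == 1

# |U| = 8, seed of 4 bits: the set S described by seed 0b1011.
members = [i for i in range(8) if in_set(0b1011, i, 3)]
```

With this layout, any seed whose low log|U| bits are not all zero selects exactly half of the universe, and only the seed itself needs to be stored.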
RSS Algorithm Summary • To compute a dyadic interval: • Compute 2||A_S|| - ||A|| for sets containing the given dyadic interval. • To compute an arbitrary interval: • Write it as a disjoint union of dyadic intervals, estimate them and take a median over possible results (simplified). • To compute the quantiles: • Use binary search and compute the intervals until found.
Algorithm Complexity Claim • The RSS algorithm’s space complexity (for t quantile queries): • Time complexity for inserts, deletes and computing each quantile on demand is proportional to the space used.
Proof Outline • Declare random variable • X_k = 2||A_{I_k}|| if I_k is in S and 0 otherwise • X – Sum of all X_k’s in a certain set • Y – Sum of all X’s in a given interval • Z – A number of repetitions of X.
Proof Outline (Cont.) • In a similar fashion to previous slides, show that Y and ||A|| can be used to compute ||A_I||. • Compute the variance. • Use Chebyshev’s and then Chernoff’s inequalities, together with the computed variance, to achieve the required result.
What If U Is Unknown? • In practice, the universe U is not always known. • Predict a range [0, u-1] for U. • Given an inserted (or updated) value i s.t. i > u-1, add another instance of RSS with range [u, u²-1], and so on… • Estimating dyadic intervals can be done in a single instance of RSS. • Increased cost factor: log2log(|U|).
Some RSS Properties • RSS may return as a quantile a value which is not really in the dataset. • Order of insertions and deletions does not affect result and accuracy. • Can be parallelized quite easily (as long as random subsets are pre-agreed).
Experimental Results • Experiments • Static artificial dataset • Dynamic artificial dataset • Dynamic real dataset • Participants • Naïve[l] • RSS[l] • GK • GK2 – an improvement for GK
Static Artificial Dataset • |U| = 2²⁰ • Compute 15 quantiles at position (1/16)k for k = 1,2,…,15. • 3 different distributions • Uniform • Zipf • Normal[m,v] • Algorithm used: RSS[7] (11K footprint).
Dynamic Artificial Dataset • Insert N=104,858 items from uniform dist. D1=Uni[1,U], U=2²⁰. • Insert αN more items from uniform dist. D2=Uni[U/2-U/32, U/2+U/32]. • Delete all values from the first insertion. • Parameter α controls the mass of the second insertion with respect to the first.
Dynamic Real Dataset • Based on true Call Detail Records (CDRs) from AT&T. • Dataset used includes 4.42 million CDRs covering a period of 18 hours. • Objective: find the median length of current calls. • Probe for estimates every 10,000 records. • Algorithm used: RSS[6] (4K footprint).
Conclusions – RSS • Algorithm for maintaining dynamic quantiles. • Works well (within a user-defined precision) both for insertions AND deletions. • Space and time complexity are polylogarithmic in the universe size.