Approximate Medians and other Quantiles in One Pass and with Limited Memory Researchers: G. Singh, S. Rajagopalan & B. Lindsay Lecturer: Eitan Ben Amos, 2003
Lecture Structure • Problem Definition • A Deterministic Algorithm • Proof • Complexity analysis • Comparison to other algorithms • A randomized solution. • Pros & cons of randomized solution.
Problem Definition • Given a large data set of N elements, design an algorithm for computing approximate quantiles (φ) in a single pass. • The approximation guarantee is an input (ε). • The algorithm should be applicable to any distribution of values & arrival order. • Compute multiple quantile values with no extra cost. • Low memory requirements.
Quantiles • Given a stream of N values, the φ-quantile, for φ ∈ [0,1], is the value located in position ⌈φ*N⌉ in the sorted input stream. • When φ=0.5 it is the median. • An element is an ε-approximate φ-quantile if its rank in the sorted input stream is between (φ-ε)*N and (φ+ε)*N. • There can be several values in this range.
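As a concrete point of reference (not part of the original slides), a brute-force version of these definitions might look like the Python sketch below; the function names `exact_quantile` and `is_eps_approximate` are illustrative.

```python
# Brute-force sketch of the definitions above, for intuition only: the
# streaming algorithm that follows avoids sorting (or even storing) all N values.
import math

def exact_quantile(stream, phi):
    """phi-quantile: the element at position ceil(phi * N) of the sorted stream (1-based)."""
    ordered = sorted(stream)
    pos = max(1, math.ceil(phi * len(ordered)))
    return ordered[pos - 1]

def is_eps_approximate(stream, candidate, phi, eps):
    """True if some occurrence of candidate has rank within [(phi-eps)*N, (phi+eps)*N]."""
    ordered = sorted(stream)
    n = len(ordered)
    ranks = [i + 1 for i, v in enumerate(ordered) if v == candidate]
    return any((phi - eps) * n <= r <= (phi + eps) * n for r in ranks)
```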
Database Applications • Used for query optimizations. • Used by parallel DB systems in order to split the inserted data among the servers into approximately equal parts. • Distributed parallel sorting uses quantiles to split the ranges between the machines.
Algorithm Framework • The algorithm is parameterized by 2 integers: b, k. • It uses b buffers, each storing k elements. • Memory usage is b*k plus a small constant. • Every buffer X is associated with a positive integer weight, w(X). • The weight denotes the number of input elements represented by each element in X.
Algorithm Framework (cont’d) • Buffers are labeled either “Empty” or “Full”. • Initially all buffers are “Empty”. • The values of b, k are calculated so that they enforce the approximation guarantee (ε) while minimizing the memory requirement b*k. • The algorithm must be able to process N elements.
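To make the operation sketches in the rest of this write-up concrete, they assume a small Python buffer record like the one below (the record and its field names are illustrative, not from the paper):

```python
# Minimal buffer record assumed by the later sketches. weight == 0 plays the
# role of the "Empty" label, weight >= 1 of "Full".
class Buffer:
    def __init__(self, k):
        self.k = k              # capacity: each buffer stores k elements
        self.elements = []      # the k stored elements
        self.weight = 0         # 0 = "Empty", otherwise the weight w(X)
        self.level = 0          # level l(X); used only by the level-based policy later
```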
Framework Basic Operations (1) NEW • Takes an empty buffer as input and populates it with the next k elements from the input stream. • Assigns the “Full” buffer a weight of 1. • If fewer than k elements remain, an equal number of +∞ & -∞ elements are added to fill the buffer. • The input stream with the additional ±∞ elements is called the “augmented stream”.
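A minimal sketch of NEW under these assumptions (reusing the Buffer record above); the ±∞ padding follows the description in the slide:

```python
# Sketch of NEW: fill an "Empty" buffer with the next k stream elements and
# give it weight 1. If the stream runs short, the tail is padded with +inf
# and -inf sentinels, split as evenly as integer division allows, forming
# the "augmented stream".
import math

def new(buffer, stream_iter, k):
    buffer.elements = [x for _, x in zip(range(k), stream_iter)]
    missing = k - len(buffer.elements)          # > 0 only for the last buffer
    buffer.elements += [math.inf] * (missing - missing // 2)
    buffer.elements += [-math.inf] * (missing // 2)
    buffer.weight = 1
    return buffer
```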
Quantile in augmented stream • Length of the augmented stream is β*N, β ≥ 1 • φ' = (2*φ + β - 1)/(2*β) • The φ-quantile in the original stream is the φ'-quantile in the augmented stream. • Proof: (β-1)*N elements were added, half of which (the -∞ elements) appear before the φ-quantile in the sorted stream. • φ'*β*N = φ*N + (β-1)*N/2 = (N/2)*(2*φ + β - 1)
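The correction, restated as a one-liner with a worked value (illustrative):

```python
def phi_prime(phi, beta):
    """Quantile of the augmented stream (length beta*N, beta >= 1) that
    corresponds to the phi-quantile of the original stream."""
    return (2 * phi + beta - 1) / (2 * beta)

# Example: with 20% padding (beta = 1.2), the 0.9-quantile of the original
# stream becomes the (1.8 + 0.2) / 2.4 = 0.8333...-quantile of the augmented
# stream, while the median (phi = 0.5) stays at 0.5 because the padding is
# symmetric.
```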
Basic Operations (Cont’d) (2) COLLAPSE • Takes c ≥ 2 “Full” input buffers X1,…,Xc & outputs a buffer Y (all of size k). • All input buffers are marked “Empty”, the output buffer Y is marked “Full”. • The weight of Y is the sum of the weights of all input buffers: w(Y) = Σ w(Xi)
Collapsing Buffers • Make w(Xi) copies of each element in Xi • Sort the elements from all buffers together. • The elements of Y are k equally-spaced elements from the sorted sequence. • If w(Y) is odd, these are the elements in positions j*w(Y)+(w(Y)+1)/2, j=0,…,k-1 • If w(Y) is even, they are in positions j*w(Y)+w(Y)/2 or j*w(Y)+(w(Y)+2)/2
Collapsing Buffers (Cont’d) • For 2 successive COLLAPSE operations with even w(Y), alternate between the two choices. • Define offset(Y) = (w(Y)+z)/2, z ∈ {0,1,2} • Y has the elements in positions j*w(Y)+offset(Y) • Collapsing buffers does not require the creation of multiple copies of elements. A single scan of the elements in a manner similar to merge-sort will do.
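A minimal sketch of COLLAPSE under the same assumptions. For readability it materialises the w(Xi) copies; the merge-sort-like single scan mentioned in the slide would produce the same k elements without them:

```python
# Sketch of COLLAPSE: merge the weighted copies of c >= 2 full buffers, keep
# the k elements at positions j*w(Y) + offset(Y) (1-based), and reuse the
# first input buffer to hold the result. The offset alternates between
# w(Y)/2 and (w(Y)+2)/2 on successive even-weight collapses.
_even_toggle = [0]   # module-level state for the alternation

def collapse(buffers):
    k = buffers[0].k
    w_y = sum(b.weight for b in buffers)
    merged = sorted(x for b in buffers for x in b.elements for _ in range(b.weight))
    if w_y % 2 == 1:
        offset = (w_y + 1) // 2
    else:
        offset = w_y // 2 if _even_toggle[0] == 0 else (w_y + 2) // 2
        _even_toggle[0] ^= 1
    out = buffers[0]
    out.elements = [merged[j * w_y + offset - 1] for j in range(k)]
    out.weight = w_y
    for b in buffers[1:]:
        b.elements, b.weight = [], 0      # mark the remaining inputs "Empty"
    return out
```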
Lemma 1 • C = Number of COLLAPSE operations made by the algorithm. • W = Sum of weights of output buffers produced by these COLLAPSE operations. • Lemma: sum of offsets of all COLLAPSE operations is at least (W+C-1)/2
Proof • C = Codd + Ceven (the numbers of COLLAPSE operations with w(Y) odd & even, respectively). • Ceven = Ceven1 + Ceven2 (the numbers of even-weight COLLAPSE operations with offset(Y) equal to w(Y)/2 & (w(Y)+2)/2, respectively). • Summing (w(Y)+1)/2 over the odd COLLAPSEs and w(Y)/2 or (w(Y)+2)/2 over the even ones, the sum of all offsets is (W + Codd + 2*Ceven2)/2
Proof (Cont’d) • Since COLLAPSE alternates between the 2 offset choices for even w(Y): • If Ceven1 = Ceven2 then Ceven = 2*Ceven2 • If Ceven1 = Ceven2 + 1 then Ceven = 2*Ceven2 + 1 • In either case Ceven2 ≥ (Ceven-1)/2 • Therefore the sum of offsets is (W + Codd + 2*Ceven2)/2 ≥ (W + Codd + Ceven - 1)/2 = (W+C-1)/2
Basic Operations (Cont’d) (3) OUTPUT • OUTPUT is performed only once, just before termination. • Takes c ≥ 2 “Full” input buffers X1,…,Xc of size k. • Outputs a single element corresponding to the φ'-quantile of the augmented stream.
OUTPUT (Cont’d) • Makes w(Xi) copies of each element in buffer Xi and sorts all input buffers together. • Outputs the element in position ⌈φ'*k*W⌉ where W = w(X1)+…+w(Xc)
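A minimal sketch of OUTPUT under the same assumptions:

```python
# Sketch of OUTPUT: sort the weighted copies of the final full buffers and
# return the element at position ceil(phi_p * k * W), where W is the sum of
# the buffers' weights.
import math

def output(buffers, phi_p):
    k = buffers[0].k
    total_w = sum(b.weight for b in buffers)
    merged = sorted(x for b in buffers for x in b.elements for _ in range(b.weight))
    pos = max(1, math.ceil(phi_p * k * total_w))
    return merged[pos - 1]
```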
COLLAPSE policies • Different COLLAPSE policies mean different criteria for when to use the NEW/COLLAPSE operations. • Munro & Paterson • Alsabti, Ranka & Singh • New Algorithm.
Munro & Paterson • If there are empty buffers, invoke NEW. Otherwise, invoke COLLAPSE on 2 buffers having the same weight. • Following is an example of operations for b=6.
Alsabti, Ranka & Singh • Fill b/2 “Empty” buffers by invoking NEW & then invoke COLLAPSE on them. • Repeat this b/2 times. • Invoke OUTPUT on b/2 resulting buffers. • Following is an example of operations for b=10.
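A sketch of this schedule, reusing the Buffer/new/collapse/output sketches above (illustrative; note that it consumes exactly (b/2)² * k input elements, so b, k and the padding must be matched to N):

```python
# Sketch of the Alsabti-Ranka-Singh schedule: b/2 rounds, each filling b/2
# empty buffers with NEW and collapsing them into one buffer; OUTPUT is then
# invoked on the b/2 collapsed buffers.
def alsabti_ranka_singh(stream_iter, phi_p, b, k):
    half = b // 2
    collapsed = []
    for _ in range(half):
        batch = [new(Buffer(k), stream_iter, k) for _ in range(half)]
        collapsed.append(collapse(batch))
    return output(collapsed, phi_p)
```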
New Algorithm • Associate with every buffer X an integer l(X) denoting its level. • Let l = minimum among all levels of currently “Full” buffers. • If there’s exactly one “Empty” buffer, invoke NEW & assign it level l. • If there are at least 2 “Empty” buffers, invoke NEW on each & assign them level 0.
New Algorithm (Cont’d) • If there are no “Empty” buffers, invoke COLLAPSE on the set of buffers with level l. Assign the output buffer level (l+1). • Following is an example of operations for b=5, h=4.
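A sketch of the whole level-based policy wired together, reusing the earlier Buffer/new/collapse/output sketches (the driver, its name and its default b, k values are illustrative, not the paper's tuned parameters):

```python
# Illustrative driver: apply NEW / COLLAPSE according to the level-based
# policy until the stream is exhausted, then invoke OUTPUT at the corrected
# quantile phi' of the augmented stream.
def approximate_quantile(values, phi, b=6, k=50):
    n = len(values)
    it = iter(values)
    buffers = [Buffer(k) for _ in range(b)]
    remaining, leaves = n, 0                 # unread elements / NEW invocations

    def fill(buf, level):
        nonlocal remaining, leaves
        new(buf, it, k).level = level
        remaining = max(0, remaining - k)
        leaves += 1

    while remaining > 0:
        full = [x for x in buffers if x.weight > 0]
        empty = [x for x in buffers if x.weight == 0]
        l = min((x.level for x in full), default=0)
        if len(empty) >= 2:                  # NEW on every empty buffer, at level 0
            for x in empty:
                if remaining > 0:
                    fill(x, 0)
        elif len(empty) == 1:                # exactly one empty buffer: level l
            fill(empty[0], l)
        else:                                # no empty buffers: collapse level l
            collapse([x for x in full if x.level == l]).level = l + 1

    beta = leaves * k / n                    # augmented stream has beta*N elements
    phi_p = (2 * phi + beta - 1) / (2 * beta)
    return output([x for x in buffers if x.weight > 0], phi_p)

# e.g. approximate_quantile(list_of_numbers, 0.5) returns an approximate median.
```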
Tree representation • Sequence of operations can be seen as a tree. • Vertices (except root) represent the set of all logical buffers (initial, intermediate, final). • Leaves correspond to initial buffers which are populated from the input stream by the NEW operation.
Tree representation (Cont’d) • An edge is drawn from every input buffer to its output buffer (created by COLLAPSE). • The root corresponds to the final OUTPUT operation. • The children of the root are the final buffers that are produced (by COLLAPSE operations). Broken edges are drawn toward the children of the root.
Definitions • User Specified: • N Size of the input stream • φ Quantile to be computed • ε Approximation guarantee • Others: • b Number of buffers • k Size of each buffer • φ' Quantile in the augmented stream
Definitions (Cont’d) • More Others: • C Number of COLLAPSE operations • W Sum of weights of all COLLAPSE operations • wmax Weight of the heaviest COLLAPSE • L Number of leaves in the tree • h Height of the tree
Approximation Guarantees • We will prove the following: the difference in rank between the true φ-quantile of the original data set & the output of the algorithm is at most wmax + (W-C-1)/2
Lemma 2 • Lemma: The sum of weights of the top buffers (the children of the root) is L, the number of leaves
Proof • Every buffer that is filled by NEW has a weight of 1. • COLLAPSE creates a buffer whose weight is the sum of the weights of its input buffers. • Looking at the tree of operations, the weight of every node therefore equals the sum of the weights of its children. • Applying this recursively from the top buffers towards the leaves, we see that the weight of a top buffer equals the number of leaves in the sub-tree rooted at it.
Definitely Small/Large • Let Q be the output of the algorithm. • An element in the input stream is DS(DL) if it is smaller(larger) than Q. • In order to identify all the DS(DL) elements we will start from the top buffers (children of root) and move towards the leaves. • Mark elements of top buffers as DS(DL) if they are smaller(larger) than Q.
Definitely Small/Large (Cont’d) • When going from a parent to its children, mark as DS (DL) every element in the child buffers that is no larger (no smaller) than the largest DS (smallest DL) element of the parent. • We will pursue a way of counting how many DS (DL) elements exist.
Weighted DS/DL bound • The weight of an element is the weight of the buffer it is in. • The weighted DS (DL) sum adds w(X) for every element in buffer X that is DS (DL). • Let DStop (DLtop) denote the weighted sum of DS (DL) elements among the top buffers.
Lemma 3: φ'kL - wmax ≤ DStop ≤ φ'kL - 1. Right side: OUTPUT returns the element at position ⌈φ'kL⌉ of the weighted sorted sequence of the top buffers (whose total weight is L by Lemma 2), so there are obviously fewer than φ'kL weighted elements smaller than it.
Lemma 3 (Cont’d) Left side: surrounding Q there are w(Xi) - 1 further copies of Q (Q comes from buffer Xi, whose elements are duplicated w(Xi) times). Had we asked for a slightly different quantile, we would simply have obtained a different copy of Q, even though it corresponds to a different element of the input stream. The error can therefore be as large as w(Xi), which is bounded by wmax. Discounting these copies around the position of Q, everything before them is DS for sure.
Lemma 3 (Cont’d) kL - φ'kL - wmax + 1 ≤ DLtop ≤ kL - φ'kL. Right side: there are a total of kL weighted elements among the top buffers. Q is at position φ'kL, so there are kL - φ'kL elements after the position of Q, some of which might be copies of Q.
Lemma 3 (Cont’d) • Left side: there are kL - φ'kL elements after the position of Q. Of these, at most wmax - 1 are copies of Q (wmax including Q itself); all the remaining elements are DL.
Weighted DS • Consider a node Y of the tree corresponding to a COLLAPSE operation. • Let Y have s > 0 DS elements. • Consider the largest among these DS elements. It appears in position (s-1)*w(Y)+offset(Y) in the sorted sequence of the elements of Y's children, with each element duplicated according to the weight of the buffer it originates from.
Weighted DS (Cont’d) • Therefore, the weighted sum of DS elements among the children of Y is at least (s-1)*w(Y) + offset(Y), which equals s*w(Y) - (w(Y) - offset(Y)).
Weighted DL • Similarly, let Y have l > 0 DL elements. • Consider the smallest among these DL elements. Counting positions from the end of the stream towards its beginning, it appears in position (l-1)*w(Y) + [w(Y) - offset(Y) + 1] in the sorted sequence of the elements of Y's children, with each element duplicated according to the weight of the buffer it originates from.
Weighted DL (Cont’d) • Therefore, the weighted sum of DL elements among the children of Y is at least (l-1)*w(Y) + [w(Y) - offset(Y) + 1], which equals l*w(Y) - (offset(Y) - 1); the loss relative to l*w(Y) is thus at most offset(Y) - 1, i.e. w(Y) minus the offset counted from the end.
DS/DL Conclusion • The weighted sum of DS elements among the children of a node Y is smaller by at most w(Y) - offset(Y) than the weighted sum of DS elements in Y itself; for DL elements the loss is at most offset(Y) - 1, which is w(Y) minus the offset counted from the end. • So we can count DS (DL) elements from the top buffers towards the leaves, subtracting the corresponding loss for each COLLAPSE on the way.
How many leaves ? • Let DSleaves (DLleaves) denote the number of definitely-small (definitely-large) elements among the leaf buffers of the operations tree. • Since the weight of a leaf is 1, DSleaves (DLleaves) is, in fact, the number of definitely-small (definitely-large) elements in the augmented stream.
Lemma 4 • DSleaves ≥ DStop - (W-C+1)/2 • DLleaves ≥ DLtop - (W-C+1)/2
Lemma 4 – Proof • Starting at the top buffers, the initial weighted sum of DS (DL) elements is DStop (DLtop). • Each COLLAPSE that creates a node Y diminishes the weighted sum of DS elements by at most w(Y) - offset(Y), and the weighted sum of DL elements by at most offset(Y) - 1 (i.e. w(Y) minus the offset counted from the end). • Travelling down to the leaves, we account for this loss over all COLLAPSE operations.
Lemma 4 – Proof (Cont’d) • The sum of w(Y) over all COLLAPSE operations is W. • The sum of offset(Y) over all COLLAPSE operations is at least (W+C-1)/2 by Lemma 1; the same lower bound holds for the offsets counted from the end, because the proof of Lemma 1 only uses the alternation between the two even-weight choices. • Subtracting the offset sums from W bounds the total loss of either weighted sum by (W-C+1)/2, which yields Lemma 4.
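Spelled out for the DS sum (a restatement of the two bullets above), the total loss over all COLLAPSE operations is

\[
\sum_Y \bigl(w(Y)-\mathrm{offset}(Y)\bigr) \;=\; W-\sum_Y \mathrm{offset}(Y) \;\le\; W-\frac{W+C-1}{2} \;=\; \frac{W-C+1}{2},
\]

and the DL loss obeys the same bound with offset(Y) replaced by the offset counted from the end.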
Lemma 5 The difference in rank between the true φ-quantile of the original input stream & that of the output of the algorithm is at most (W-C-1)/2 + wmax.
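Sketch of how Lemma 5 follows from Lemmas 3 and 4 (this combination step is implicit in the slides): the rank r of the output Q in the augmented stream satisfies DSleaves < r ≤ kL - DLleaves, while the target rank is φ'kL, so using DSleaves ≥ φ'kL - wmax - (W-C+1)/2,

\[
\varphi' kL - r \;\le\; \varphi' kL - DS_{\mathrm{leaves}} - 1 \;\le\; w_{\max} + \frac{W-C+1}{2} - 1 \;=\; w_{\max} + \frac{W-C-1}{2},
\]

and symmetrically r - φ'kL ≤ (kL - DLleaves) - φ'kL ≤ wmax + (W-C-1)/2. The bound carries over to the original stream because the ±∞ padding shifts the rank of every original element, and the target position, by the same amount, (β-1)*N/2.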