How to Summarize the Universe: Dynamic Maintenance of Quantiles. Gilbert, Kotidis, Muthukrishnan, Strauss. Presented by Itay Malinger, December 2003.
Problem Definition • The universe: U = {0, …, |U|−1} • Number of records in the data set: ||A|| = N • The data set can be thought of as an array: A[i] is the number of records with value i • A_S is the number of records with values in S • The φ-quantiles of an ordered sequence of N data items are the values with rank kφN, for k = 1, …, 1/φ • Our goal is to compute ε-approximate φ-quantiles: find a j_k whose rank is within εN of kφN, i.e. $(k\varphi - \varepsilon)N \le \mathrm{rank}(j_k) \le (k\varphi + \varepsilon)N$
Transactions • Insert(i): A[i] ← A[i] + 1 • Delete(i): A[i] ← A[i] − 1 • Let • ASSUME: the universe size |U| is known
The Main Algorithmic Result • The RSS (Random Subset Sums) algorithm • Space complexity • Update: on every transaction, in O(space) time • Estimation: on demand, in O(space) time • One pass over the data
Dyadic Intervals • log(|U|)+1 resolution levels j • 2|U|−1 dyadic intervals • Example for |U| = 8 (values 0 to 7):
Level 3: I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
Level 2: I(2,0) I(2,1) I(2,2) I(2,3)
Level 1: I(1,0) I(1,1)
Level 0: I(0,0)
Arbitrary intervals • Any interval can be written as a disjoint union of at most 2·log(|U|) dyadic intervals • For example, A[0,6] = I(1,0) + I(2,2) + I(3,6) • Intervals starting at 0 never use the same resolution twice, so they need at most log(|U|) dyadic intervals
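The decomposition on this slide can be sketched with a greedy left-to-right sweep. This is only an illustration: the function name `dyadic_cover`, the inclusive endpoints, and the `(level, index)` return format are assumptions, chosen so that the output matches the I(j,k) labels used in the slides.

```python
def dyadic_cover(lo: int, hi: int, log_u: int) -> list[tuple[int, int]]:
    """Return (level, index) pairs I(j,k) whose disjoint union is [lo, hi] (inclusive).

    Level j holds 2**j intervals, each of length |U| / 2**j, with |U| = 2**log_u.
    """
    pieces = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = (1 << log_u) if lo == 0 else (lo & -lo)
        # ... shrunk until it fits inside what is left of [lo, hi].
        while size > hi - lo + 1:
            size >>= 1
        level = log_u - (size.bit_length() - 1)
        pieces.append((level, lo // size))
        lo += size
    return pieces


print(dyadic_cover(0, 6, 3))  # the slide's example with |U| = 8
```

For the slide's example the sweep picks a block of length 4, then 2, then 1, which are exactly I(1,0), I(2,2) and I(3,6).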
Computing quantiles • Given the number of records in each dyadic interval, we can efficiently compute the count of any arbitrary interval of A • To compute the k-th φ-quantile, we need a j_k such that A[0, j_k) < kφN < A[0, j_k + 1) • Use binary search to find it • Keeping a count for every dyadic interval is costly: O(|U|) space
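As a baseline, the exact (and costly) version of this slide can be sketched as follows. The helper names `build_counts`, `prefix_count` and `quantile` are hypothetical, and the search returns the largest j with A[0, j) < kφN, which is one reasonable reading of the condition above when ties occur.

```python
def build_counts(values, log_u):
    """counts[j][k] = number of records falling in dyadic interval I(j,k)."""
    counts = [[0] * (1 << j) for j in range(log_u + 1)]
    for v in values:
        for j in range(log_u + 1):
            counts[j][v >> (log_u - j)] += 1  # high-order j bits of v
    return counts


def prefix_count(counts, j_end, log_u):
    """A[0, j_end): one dyadic interval per set bit of j_end."""
    total, start = 0, 0
    for j in range(log_u + 1):
        size = 1 << (log_u - j)
        if j_end & size:
            total += counts[j][start // size]
            start += size
    return total


def quantile(counts, k, phi, log_u):
    """Binary search for the largest j with A[0, j) < k * phi * N."""
    target = k * phi * counts[0][0]  # counts[0][0] is N
    lo, hi = 0, (1 << log_u) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_count(counts, mid, log_u) < target:
            lo = mid
        else:
            hi = mid - 1
    return lo


counts = build_counts([1, 1, 3, 5, 5, 5, 6], log_u=3)
print(quantile(counts, k=1, phi=0.5, log_u=3))  # median of the seven records
```

The table built here has 2|U|−1 counters, which is the O(|U|) cost the slide objects to; the rest of the talk replaces these exact counters with estimates.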
Random Subset Sums • Consider first the point level, j = log(|U|) • Let S be a random subset of U • Each u ∈ U is in S with probability p = ½ • E(|S|) = ½|U| • Define: $A_S = \sum_{i \in S} A[i]$ • E(|A_S|) = ½||A|| = ½N
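Why a random subset sum is useful becomes clearer with a small exact check. If i is in S, every other value lands in S with probability ½, so the expected value of A_S is A[i] + ½(N − A[i]), and 2·A_S − N has expectation A[i]. The sketch below verifies this by enumerating every subset of a tiny universe; the array contents and the name `average_estimate` are made up for illustration, and the estimator 2·A_S − N is stated here as an assumption about how the sums are used.

```python
from fractions import Fraction
from itertools import product


def average_estimate(A, i):
    """Average of 2*A_S - N over every subset S of the universe that contains i."""
    n = sum(A)
    total, num_sets = 0, 0
    for membership in product([0, 1], repeat=len(A)):
        if not membership[i]:
            continue  # only subsets containing i are used to estimate A[i]
        a_s = sum(a for a, m in zip(A, membership) if m)
        total += 2 * a_s - n
        num_sets += 1
    return Fraction(total, num_sets)


A = [3, 0, 5, 2]  # illustrative counts, |U| = 4, N = 10
print([average_estimate(A, i) for i in range(len(A))])
```

A single subset gives a very noisy estimate; the average is only correct in expectation, which is why many independent copies are kept later on.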
Improvement • Instead of keeping random sets of points only, keep random sets of dyadic intervals at every resolution • We need a way to keep a random set of j-resolution dyadic intervals (keeping it explicitly takes O(|U|) space) • Instead of keeping the sets themselves, keep a small representation of them
Pseudorandom set generator • We need a small representation of a random set S (each i ∈ U is in S with p = ½) • Given a seed of size log(|U|)+1 • Represent a set S of size O(|U|) • Quickly test whether i ∈ S • Use the extended Hamming code
Extended Hamming Code • Given a seed, tells whether i ∈ S • For example: • |U| = 8 • Seed size: log|U|+1 = 4 • G(seed, i) = seed × (i-th column of the generator matrix) mod 2 • Efficient to compute • 3-wise independent
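A minimal sketch of the membership test, assuming the i-th column of the generator matrix is a leading 1 followed by the binary digits of i (the bit ordering is an assumption; the slide only shows the matrix as a picture). The names `in_set` and `members` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def in_set(seed: int, i: int, log_u: int) -> int:
    """G(seed, i): inner product mod 2 of the seed with column i."""
    column = (1 << log_u) | i  # leading 1, then the bits of i
    return bin(seed & column).count("1") % 2


def members(seed: int, log_u: int) -> list[int]:
    """Expand a seed into the explicit set S (only for inspection)."""
    return [i for i in range(1 << log_u) if in_set(seed, i, log_u)]


print(members(0b0001, 3))  # seed selecting the lowest bit of i

# 3-wise independence, checked exhaustively for |U| = 8: over all 16 seeds,
# every triple of distinct points sees each of the 8 membership patterns
# equally often.
for triple in combinations(range(8), 3):
    patterns = Counter(tuple(in_set(s, i, 3) for i in triple) for s in range(16))
    assert len(patterns) == 8 and set(patterns.values()) == {2}
```

The exhaustive check passes because any three columns of this form are linearly independent over GF(2), so a uniformly random seed makes any three membership bits jointly uniform.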
The Data Structure • For each resolution level j, keep num_copies random subsets S of the dyadic intervals at that level (only the seed representing each subset is stored) • Keep ||A_S|| for each subset S • Maintain N = ||A|| • This gives S_1, …, S_num_copies per level
Upon Transactions • Insert(i) / Delete(i) • For each resolution level j: • Locate the single I_{j,k} into which i falls (the high-order bits of i) • Determine all S_ℓ containing I_{j,k} • For each such S_ℓ, increase/decrease ||A_{S_ℓ}|| by 1
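The update procedure can be sketched as a small class. Everything about the layout is an assumption made for illustration: the class name `RSS`, the use of Python's `random` module to draw seeds, and the membership rule (a parity test with a leading 1 bit, in the spirit of the extended Hamming code slide).

```python
import random


class RSS:
    """Per level j: num_copies seeds and one running sum ||A_S|| per seed."""

    def __init__(self, log_u, num_copies, rng=None):
        if rng is None:
            rng = random.Random(0)  # fixed seed so separate sketches agree
        self.log_u = log_u
        self.n = 0  # N = ||A||
        # A level-j interval index has j bits, so its seed has j + 1 bits.
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    @staticmethod
    def _member(seed, k, bits):
        # Parity of seed AND (1 followed by the bits of k).
        return bin(seed & ((1 << bits) | k)).count("1") % 2

    def update(self, i, delta):
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)  # index of the level-j interval holding i
            for copy, seed in enumerate(self.seeds[j]):
                if self._member(seed, k, j):
                    self.sums[j][copy] += delta

    def insert(self, i):
        self.update(i, +1)

    def delete(self, i):
        self.update(i, -1)


sketch = RSS(log_u=3, num_copies=4)
for value in [1, 5, 5, 6]:
    sketch.insert(value)
sketch.delete(5)
print(sketch.n, sketch.sums)
```

Each transaction touches (log|U| + 1) × num_copies counters, and because the counters are plain sums, the order of insertions and deletions does not matter.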
Estimating Quantiles: Dyadic Intervals • Given a dyadic interval I = I_{j,k} • There are num_copies = G·E sets of resolution j • Quickly test each S_ℓ to check whether I ∈ S_ℓ, and if so estimate ||A_I|| as 2||A_{S_ℓ}|| − N • Group all the estimates into G groups of E elements • For each group g, compute the average A_{g,j,k} of its estimates
Estimating Quantiles: Arbitrary intervals • Given an interval I, write it as a disjoint union of O(log(|U|)) dyadic intervals I_{j,k} • Form G groups; for each group g, sum A_{g,j,k} over all I_{j,k} comprising I • Take the median of the G group sums as the final estimate of ||A_I|| • It is more convenient to treat the result as an overestimate: $\|A_I\| \le \widetilde{\|A_I\|} \le \|A_I\| + \varepsilon N$
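Figure: estimating an interval made of 3 dyadic intervals with G = 3 groups and E = 4 elements per group. SUM over the dyadic intervals, AVERAGE within each group, MEDIAN across the groups gives the interval's estimate.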
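The sum, average, median pipeline can be sketched independently of how the individual estimates are produced. The function name `combine` and the input layout (`estimates[d][c]` is the estimate of the d-th dyadic interval from copy c) are assumptions, and the numbers in the example are made up.

```python
from statistics import mean, median


def combine(estimates, G, E):
    """Median over G groups of the average, within a group, of per-copy sums."""
    group_values = []
    for g in range(G):
        copies = range(g * E, (g + 1) * E)
        # SUM: add the estimates of all dyadic intervals, copy by copy.
        sums = [sum(per_interval[c] for per_interval in estimates) for c in copies]
        # AVERAGE: one value per group.
        group_values.append(mean(sums))
    # MEDIAN: the final estimate for the whole interval.
    return median(group_values)


# One dyadic interval, G = 3 groups of E = 4 copies; the last group is an outlier.
print(combine([[0, 0, 0, 0, 10, 10, 10, 10, 100, 100, 100, 100]], G=3, E=4))
```

Averaging within a group reduces the variance of the estimate, while taking the median across groups keeps one badly behaved group from dominating the result.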
Analysis • Lemma: the algorithm estimates each quantile to within εN with probability p > 1 − δ • Proof: • For a fixed resolution level j, let • Then:
Analysis (cont.) • We take G copies of Z and take their median • By the Chernoff inequality, • The binary search looked for a j_k such that • We made log|U| checks in the binary search • By the union bound, the probability that any of them failed is at most log|U| times the per-check failure probability, i.e., δ
RSS Properties • The algorithm may return a quantile value that was never seen in the input • Changing the order of insertions and deletions does not affect the results • The RSSs are composable: U can be split into many disjoint ranges, each summarized separately using pre-agreed common random subsets, and the summaries combined
Extension: U is unknown • Predict a range [0, u−1] for U • Upon insertion of i > u−1, add another instance of RSS with range [u, u²−1], and so on • Because RSS is composable, we only have to combine the results at query time • Increased cost factor: log2log(|U|)
Experiments • What is the median length of all active AT&T calls? • When a call starts: insert its start timestamp • When a call ends: delete its start timestamp • 4 KB used for RSS • Compared: RSS, GK, GK2