
How to Summarize the Universe: Dynamic Maintenance of Quantiles

How to Summarize the Universe: Dynamic Maintenance of Quantiles. Gilbert, Kotidis, Muthukrishnan, Strauss. Presented by Itay Malinger December 2003. Problem Definition. The Universe: U = {0, …, | U | -1} Number of records in data set: ||A||= N


Presentation Transcript


  1. How to Summarize the Universe: Dynamic Maintenance of Quantiles • Gilbert, Kotidis, Muthukrishnan, Strauss • Presented by Itay Malinger, December 2003

  2. Problem Definition • The universe: U = {0, …, |U|−1} • Number of records in the data set: ||A|| = N • The data set can be thought of as an array: A[i] is the number of records with value i • $\|A_S\|$ is the number of records with values in S • The Φ-quantiles of an ordered sequence of N data items are the values with rank kΦN, for k = 1, 2, … • Our goal is to compute ε-approximate Φ-quantiles: find a $j_k$ such that $(k\Phi - \varepsilon)N \le A[0, j_k] \le (k\Phi + \varepsilon)N$

  3. Transactions • Insert(i): A[i] ← A[i] + 1 • Delete(i): A[i] ← A[i] − 1 • Let • Assume: the universe size |U| is known

  4. The Main Algorithmic Result • The RSS (Random Subset Sums) algorithm • Space complexity • Update: on every transaction, in O(space) time • Estimation: on demand, in O(space) time • One pass over the data

  5. Dyadic Intervals • log(|U|)+1 resolution levels j • 2|U|−1 dyadic intervals • (Figure: for |U| = 8, level 3 holds the single points I(3,0), …, I(3,7) over the values 0, …, 7; level 2 holds I(2,0), …, I(2,3); level 1 holds I(1,0) and I(1,1); level 0 holds I(0,0), the whole universe.)

  6. Arbitrary intervals • Any interval can be written as a disjoint union of at most 2·log(|U|) dyadic intervals • For example, A[0,6] = I(1,0) + I(2,2) + I(3,6) • Intervals starting at 0 never use the same resolution level twice, so they need at most log(|U|)+1 dyadic intervals
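The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function name `dyadic_cover` is hypothetical, and it assumes |U| is a power of two and that I(j,k) at level j covers the values k·2^(log|U|−j) through (k+1)·2^(log|U|−j) − 1, which matches the slide's example.

```python
def dyadic_cover(lo, hi, log_u):
    """Return dyadic intervals (j, k) whose disjoint union is [lo, hi].

    Assumes 0 <= lo <= hi < 2**log_u (hypothetical helper, not from the slides).
    """
    cover = []
    while lo <= hi:
        # Grow the block starting at lo while it stays aligned and inside [lo, hi].
        size = 1
        while (lo % (2 * size) == 0
               and lo + 2 * size - 1 <= hi
               and 2 * size <= (1 << log_u)):
            size *= 2
        level = log_u - (size.bit_length() - 1)  # block of 2**m values sits at level log_u - m
        cover.append((level, lo // size))
        lo += size
    return cover


print(dyadic_cover(0, 6, 3))  # the slide's example, A[0,6] with |U| = 8
```

Greedily taking the largest aligned block at each step keeps the pieces disjoint and uses each level at most twice.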

  7. Computing quantiles • Given the number of records in each dyadic interval, we can efficiently compute the count of any arbitrary interval of A • To compute the k-th Φ-quantile, we need a $j_k$ such that $A[0, j_k) < k\Phi N \le A[0, j_k + 1)$ • Use binary search to find it • Keeping exact counts for all dyadic intervals is costly: O(|U|) space
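As a minimal sketch of the binary search, assuming exact counts are available: `prefix_count` below is a hypothetical stand-in that simply sums the array, where the slides would instead add up the counts of the dyadic intervals covering [0, j). The target rank is assumed to lie between 1 and N.

```python
def prefix_count(A, j):
    """A[0, j): number of records with value strictly below j (exact, for illustration)."""
    return sum(A[:j])


def quantile_index(A, target_rank):
    """Return j with A[0, j) < target_rank <= A[0, j+1), found by binary search.

    Assumes 1 <= target_rank <= sum(A).
    """
    lo, hi = 0, len(A) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(A, mid + 1) >= target_rank:
            hi = mid          # the answer is mid or to its left
        else:
            lo = mid + 1      # too few records up to mid; go right
    return lo


A = [2, 0, 3, 1, 0, 4, 0, 0]     # |U| = 8, N = 10
print(quantile_index(A, 5))      # value holding the record of rank 5
```

The search works because prefix counts are non-decreasing in j, so the first j whose inclusive prefix reaches the target rank is well defined.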

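  8. Random Subset Sums • Start with the finest resolution level, j = log(|U|), where each dyadic interval is a single point • Let S be a random subset of U • Each u ∈ U is in S with probability p = ½ • E(|S|) = ½|U| • Define: $\|A_S\| = \sum_{i \in S} A[i]$ • $E(\|A_S\|) = \tfrac{1}{2}\|A\| = \tfrac{1}{2}N$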

  9. Estimating A[i]
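The body of this slide did not survive in the transcript. One natural estimator that follows from the previous slide, offered here as an assumption rather than as the slide's content, is 2·||A_S|| − N for a random subset S that contains i: conditioned on i ∈ S, the sum ||A_S|| is A[i] plus, in expectation, half of the remaining N − A[i] records. The sketch below checks this exactly by enumerating every subset containing i; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def conditional_mean_estimate(A, i):
    """Exact average of 2*||A_S|| - N over every subset S of the universe containing i."""
    n_total = sum(A)
    others = [u for u in range(len(A)) if u != i]
    total, count = Fraction(0), 0
    for r in range(len(others) + 1):
        for rest in combinations(others, r):
            a_s = A[i] + sum(A[u] for u in rest)   # ||A_S|| for S = {i} plus rest
            total += 2 * a_s - n_total
            count += 1
    return total / count


A = [3, 1, 4, 1, 5]
print([conditional_mean_estimate(A, i) for i in range(len(A))])
```

Under this assumption the estimator is unbiased for every i, although a single subset gives a very noisy estimate, which is why many subsets are combined later in the talk.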

  10. Improvement • Instead of keeping random sets of points only, keep random sets of dyadic intervals at every resolution level • We need a way to store a random set of level-j dyadic intervals (storing it explicitly takes O(|U|) space) • Instead of storing the sets, store a small representation of them

  11. Pseudorandom set generator • We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½) • Given a seed of size log(|U|)+1 • Represent a set S of size O(|U|) • Quickly test whether i ∈ S or not • Use the extended Hamming code

  12. Extended Hamming Code • Given a seed, tells whether i ∈ S • For example: |U| = 8, seed size log|U| + 1 = 4 • G(seed, i) = seed × (i-th column of the generator matrix) mod 2 • Efficient to compute • 3-wise independent
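A small sketch of the membership test, under the assumption that column i of the generator matrix is a leading 1 followed by the log|U| binary digits of i (the usual layout of the extended Hamming code; the matrix itself is not in the transcript). The names `in_set` and `outcome_counts` are hypothetical. The second function counts, over all seeds, how often each membership pattern occurs for a chosen group of points.

```python
from collections import Counter
from itertools import combinations


def in_set(seed, i, log_u):
    """G(seed, i): inner product mod 2 of the seed with column i, taken as (1, bits of i)."""
    column = (1 << log_u) | i
    return bin(seed & column).count("1") % 2 == 1


def outcome_counts(points, log_u):
    """Count each membership pattern of `points` across all 2**(log_u + 1) seeds."""
    return Counter(
        tuple(in_set(seed, i, log_u) for i in points)
        for seed in range(1 << (log_u + 1))
    )


# For |U| = 8 there are 16 seeds; every triple of distinct points shows
# each of the 8 possible membership patterns equally often.
for triple in combinations(range(8), 3):
    assert set(outcome_counts(triple, 3).values()) == {2}
```

With this layout, any three distinct columns are linearly independent over GF(2), which is why the three membership bits of any triple are uniformly distributed over a random seed.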

  13. The Data Structure • For each resolution level j, keep num_copies random subsets S of the dyadic intervals at that level (only the seed representing each subset is stored) • Keep $\|A_S\|$ for each subset S • Maintain N = ||A|| • This gives $S_1, \dots, S_{\text{num\_copies}}$ per level

  14. Upon Transactions • Insert(i) / Delete(i) • For each resolution level j: • Locate the single dyadic interval I(j,k) into which i falls (k is given by the high-order bits of i) • Determine all $S_\ell$ containing I(j,k) • For each such $S_\ell$, increase/decrease $\|A_{S_\ell}\|$ by 1
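The update steps above can be sketched as a small Python class. Everything here is illustrative: the class and method names are hypothetical, the seeds come from Python's `random` module rather than from any generator named in the talk, and set membership uses the inner-product rule with column (1, bits of k) as an assumed layout.

```python
import random


def in_set(seed, k, level):
    """Membership of interval k at `level` in the subset encoded by `seed` (assumed layout)."""
    column = (1 << level) | k
    return bin(seed & column).count("1") % 2 == 1


class RandomSubsetSums:
    """Keeps ||A_S|| for num_copies seeded subsets at every resolution level."""

    def __init__(self, log_u, num_copies, rng_seed=0):
        rng = random.Random(rng_seed)
        self.log_u = log_u
        self.n = 0  # N = ||A||
        # Level j has 2**j intervals, so each seed there needs j + 1 bits.
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    def _update(self, i, delta):
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)  # the j high-order bits of i
            for copy, seed in enumerate(self.seeds[j]):
                if in_set(seed, k, j):
                    self.sums[j][copy] += delta

    def insert(self, i):
        self._update(i, +1)

    def delete(self, i):
        self._update(i, -1)


rss = RandomSubsetSums(log_u=3, num_copies=4)
rss.insert(5)
rss.insert(2)
rss.delete(5)
print(rss.n, rss.sums)
```

Because every update only adds or subtracts 1 from fixed counters, the final state depends on the multiset of surviving records and not on the order of transactions.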

  15. Estimating Quantiles: Dyadic Intervals • Given a dyadic interval I = I(j,k) • There are num_copies = G·E sets at resolution level j • Quickly test each $S_\ell$ to check whether I ∈ $S_\ell$, and if so compute an estimate • Group all estimates into G groups of E elements • For each group g, compute the average of its estimates, $A_{g,j,k}$

  16. Estimating Quantiles: Arbitrary intervals • Given an interval I, write it as a disjoint union of at most 2·log(|U|) dyadic intervals I(j,k) • Form G groups and, for each group g, sum the averages $A_{g,j,k}$ over all I(j,k) comprising I • Take the median of the G group sums as the final estimate of $\|A_I\|$ • It is more convenient to treat the result as an overestimate: $\|A_I\| \le \widetilde{\|A_I\|} \le \|A_I\| + \varepsilon N$

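  17. (Figure: the estimation pipeline for an interval made of 3 dyadic intervals, with E = 4 elements per group and G = 3 groups. The diagram's labels are AVERAGE, SUM, and MEDIAN, ending in the interval's estimate.)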
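The average, sum, and median steps can be sketched as follows. This is a simplified illustration with hypothetical names: it assumes every copy yields one raw estimate per dyadic interval and that the number of copies is exactly G·E, whereas the slides only take estimates from subsets that contain the interval.

```python
from statistics import mean, median


def combine_estimates(estimates, groups):
    """Median over groups of (sum over dyadic intervals of (average within the group)).

    estimates[d][c] is the c-th raw estimate for the d-th dyadic interval;
    `groups` is G, and each row is assumed to hold exactly G * E values.
    """
    per_group = []
    size = len(estimates[0]) // groups  # E, the number of elements per group
    for g in range(groups):
        block = slice(g * size, (g + 1) * size)
        per_group.append(sum(mean(row[block]) for row in estimates))
    return median(per_group)


# 3 dyadic intervals, G = 3 groups of E = 4; the last group of the first
# interval is badly off, and the median discards it.
raw = [
    [4, 4, 4, 4, 4, 4, 4, 4, 100, 100, 100, 100],
    [2] * 12,
    [1] * 12,
]
print(combine_estimates(raw, groups=3))
```

Averaging within a group reduces the variance of each estimate, while the median across groups limits the influence of the few groups that still land far from the true value.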

  18. Analysis • Lemma: The algorithm estimates each quantile to within εN with probability greater than 1 − δ • Proof: • For a fixed resolution level j, let • Then:

  19. Analysis (cont.)

  20. Analysis (cont.) • We take G copies of Z and take their median • By the Chernoff inequality, • The binary search looks for a $j_k$ such that • The binary search makes log|U| checks • The probability that any of them fails is at most log|U| times the failure probability of a single check, i.e. δ
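The last step is a union bound. Written out, under the assumption (implied by the slide but not stated in the surviving text) that G is chosen so that each individual check fails with probability at most δ / log|U|:

```latex
\Pr[\text{some check fails}]
  \;\le\; \sum_{t=1}^{\log|U|} \Pr[\text{check } t \text{ fails}]
  \;\le\; \log|U| \cdot \frac{\delta}{\log|U|}
  \;=\; \delta
```

The bound needs no independence between the checks, which matters here because all of them reuse the same random subsets.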

  21. RSS Properties • The algorithm may return a quantile value that was never seen in the input • Changing the order of insertions and deletions does not affect the results • The RSSs are composable: U can be split into many disjoint ranges that are summarized separately, using pre-agreed common random subsets, and the summaries can then be combined

  22. Extension: U is unknown • Predict a range [0, u−1] for U • Upon insertion of an i > u−1, add another RSS instance with range [u, u²−1], and so on • Because RSS is composable, the results only need to be joined at query time • Increased cost factor: log₂ log(|U|)
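A tiny sketch of how an incoming value could be routed to the right instance. It assumes that "and so on" means the upper bound is squared each time, giving the ranges [0, u−1], [u, u²−1], [u², u⁴−1], and so forth; the transcript only shows the first two ranges, so the pattern beyond them and the name `instance_index` are assumptions.

```python
def instance_index(i, u):
    """Index of the RSS instance whose range contains i (0 for [0, u-1]).

    Assumes u >= 2 and that each new range squares the previous upper bound.
    """
    index, bound = 0, u
    while i >= bound:
        bound *= bound   # next range ends just below the square of the current bound
        index += 1
    return index


print([instance_index(i, 4) for i in (3, 4, 15, 16, 255, 256)])
```

Under this squaring pattern the number of instances grows like the logarithm of log|U|, which is consistent with the small extra cost factor quoted on the slide.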

  23. Experiments • What is the median length of all active AT&T calls? • When a call starts: insert its start timestamp • When a call ends: delete its start timestamp • 4 KB used for RSS • Compared: RSS, GK, GK2

  24. Number of Active Phone Calls Over Time

  25. Error in Computation of Median Over Time

  26. Average Error for Last 50 Snapshots, For Deciles

  27. The End
