CS 361 Lecture 5: Approximate Quantiles and Histograms. 9 Oct 2002. Gurmeet Singh Manku (manku@cs.stanford.edu)
Frequency Related Problems ...
• [Figure: a frequency histogram over domain values 1 through 20.]
• Find all elements with frequency > 0.1%
• Top-k most frequent elements
• Find elements that occupy 0.1% of the tail
• What is the frequency of element 3?
• What is the total frequency of elements between 8 and 14?
• How many elements have non-zero frequency?
• Mean + Variance? Median?
Types of Histograms ...
• V-Optimal Histograms. Idea: select buckets to minimize the frequency variance within buckets.
• Equi-Depth Histograms. Idea: select buckets such that counts per bucket are equal (sketch below).
• [Figures: count per bucket over domain values 1 through 20 for each histogram type.]
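A minimal sketch of the equi-depth idea, assuming in-memory data that can be sorted; the function name, bucket count, and data are illustrative, not from the lecture:

```python
def equi_depth_histogram(values, num_buckets):
    """Split sorted values into buckets holding (roughly) equal counts.
    Returns a list of (low, high, count) triples."""
    values = sorted(values)
    n = len(values)
    buckets = []
    for i in range(num_buckets):
        lo = i * n // num_buckets
        hi = (i + 1) * n // num_buckets
        if lo < hi:
            buckets.append((values[lo], values[hi - 1], hi - lo))
    return buckets

# Example: 20 domain values with skewed counts
data = [1]*8 + [2]*2 + [5]*4 + [9]*3 + [14]*2 + [20]
print(equi_depth_histogram(data, 4))   # four buckets of 5 values each
```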
Histograms: Applications • One Dimensional Data • Database Query Optimization [Selinger78] • Selectivity estimation • Parallel Sorting [DNS91] [NowSort97] • Jim Gray’s sorting benchmark • [PIH96] [Poo97] introduced a taxonomy, algorithms, etc. • Multidimensional Data • OLTP: not much use (independent attribute assumption) • OLAP & Mining: yeah
Finding The Median ...
• Exact median in main memory: O(n) [BFPRT 73]
• Exact median in one pass: n/2 memory [Pohl 68]
• Exact median in p passes: O(n^(1/p)) memory [MP 80]; for 2 passes, O(sqrt(n))
• How about an approximate median?
Approximate Medians & Quantiles
• φ-quantile: the element with rank φN, where 0 < φ < 1 (φ = 0.5 means the median).
• ε-approximate φ-quantile: any element with rank in [(φ - ε)N, (φ + ε)N] (sketch below).
• ε-approximate median: typically ε = 0.01 (1%).
• Multiple equi-spaced ε-approximate quantiles = equi-depth histogram.
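To make the definition concrete, a small sketch (the function name and test data are mine) that checks whether a candidate element is an ε-approximate φ-quantile:

```python
def is_approx_quantile(candidate, data, phi, eps):
    """True if candidate's rank in data lies within (phi +/- eps) * N."""
    n = len(data)
    # range of ranks the candidate occupies in the sorted order
    lo_rank = sum(1 for x in data if x < candidate) + 1
    hi_rank = sum(1 for x in data if x <= candidate)
    return lo_rank <= (phi + eps) * n and hi_rank >= (phi - eps) * n

data = list(range(1, 101))                        # ranks 1..100
print(is_approx_quantile(50, data, 0.5, 0.01))    # True: rank 50 is inside [49, 51]
print(is_approx_quantile(60, data, 0.5, 0.01))    # False: rank 60 is outside [49, 51]
```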
Plan for Today ...
• Greenwald-Khanna Algorithm for arbitrary length stream
• Sampling-based Algorithms for arbitrary length stream
• Munro-Paterson Algorithm for fixed N
• Generalization
• Randomized Algorithm for fixed N
• Randomized Algorithm for arbitrary length stream
Data distribution assumptions ... Input sequence of ranks is arbitrary. e.g., warehouse data
Munro-Paterson Algorithm [MP 80]
• Input: N and ε. Use b buffers, each of size k; Memory = bk.
• [Diagram: a collapse tree over b = 4 levels of buffers; leaf buffers are filled from the input and repeatedly collapsed pairwise up the tree.]
• How do we collapse two sorted buffers into one? Merge them, then pick alternate elements (sketch below).
• Minimize bk subject to the following constraints:
• Number of elements covered by the leaves: k · 2^b ≥ N
• Max relative error in rank: b/(2k) < ε
• This gives b ≈ log(εN) and k ≈ (1/ε) log(εN), so Memory = bk ≈ (1/ε) log^2(εN).
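A minimal sketch of the collapse step, assuming two full sorted buffers of equal size; the helper name and the `offset` parameter (used to alternate which positions survive) are mine, not the lecture's code:

```python
import heapq

def collapse(buf_a, buf_b, offset=0):
    """Collapse two sorted buffers of size k into one buffer of size k:
    merge them and keep every other element.
    `offset` (0 or 1) selects which alternate positions survive; alternating
    it across collapses keeps the rank error balanced."""
    merged = list(heapq.merge(buf_a, buf_b))
    return merged[offset::2]

print(collapse([1, 4, 7, 10], [2, 3, 8, 9]))      # [1, 3, 7, 9]
print(collapse([1, 4, 7, 10], [2, 3, 8, 9], 1))   # [2, 4, 8, 10]
```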
Error Propagation ...
• Top-down analysis: classify the stored elements as S (small), L (large), or "?" (uncertain).
• If x "?" elements go into a collapse, at most 2x + 1 "?" elements come out.
• [Diagram: buffers of S, "?", and L elements at depths d and d+1.]
Error Propagation at Depth 0 ...
• [Diagram: worked example of a collapse involving depth-0 and depth-1 buffers; elements labelled S (small), M (the median), L (large); no "?" elements yet.]
Error Propagation at Depth 1 ...
• [Diagram: the same example one level further; "?" elements begin to appear.]
Error propagation at Depth 2 ...
• [Diagram: one more level; the number of "?" elements grows.]
Error Propagation level by level
• Munro-Paterson [1980]: b buffers, each of size k; Memory = bk.
• [Diagram: collapse tree with b = 4 levels of buffers, labelled by depth.]
• Fractional error in rank at depth 0 is 0. The number of elements represented at depth d is k · 2^d.
• Let the number of "?" elements at depth d be X; the fraction of "?" elements is f = X / (k · 2^d).
• At depth d+1 there are at most 2X + 2^d "?" elements, so the fraction is f' ≤ (2X + 2^d) / (k · 2^(d+1)) = f + 1/(2k).
• So the increase in fractional error in rank is at most 1/(2k) per level. Max depth = b, so the total fractional error is ≤ b/(2k).
• Constraint 2: b/(2k) < ε. (A small search for b and k is sketched below.)
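To make the constraint solving concrete, a small illustrative search for the cheapest b and k satisfying k · 2^b ≥ N and b/(2k) ≤ ε; the function name and the example values of N and ε are assumptions, not from the lecture:

```python
import math

def choose_b_k(n, eps):
    """Smallest-memory (b, k) with k * 2**b >= n and b / (2*k) <= eps."""
    best = None
    for b in range(1, 64):
        # k must satisfy both the coverage and the error constraint
        k = max(math.ceil(n / 2**b), math.ceil(b / (2 * eps)))
        if best is None or b * k < best[0] * best[1]:
            best = (b, k)
    return best

b, k = choose_b_k(10**6, 0.01)
# b and k come out near the log(eps*N) and (1/eps)*log(eps*N) scales claimed above
print(b, k, b * k)
```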
Generalized Munro-Paterson [MRL 98]
• Each buffer has a 'weight' associated with it.
• How do we collapse buffers with different weights?
• [Diagram: a collapse tree with b = 5 buffers of differing weights.]
Generalized Collapse ... (k = 5)
• Inputs: a weight-1 buffer {28, 15, 16, 25, 27}, a weight-2 buffer {31, 37, 6, 12, 5}, and a weight-3 buffer {10, 35, 8, 19, 13}.
• Conceptually replicate each element as many times as its buffer's weight and merge into one sorted sequence:
5 5 6 6 8 8 8 10 10 10 12 12 13 13 13 15 16 19 19 19 25 27 28 31 31 35 35 35 37 37
• The output buffer has weight 6 (= 1 + 2 + 3): keep one element from each block of 6 positions, giving 6, 10, 15, 27, 35 (sketch below).
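A sketch of this weighted collapse; the function name is mine, and keeping the middle position of each weight-W block is my reading of the example above:

```python
import heapq

def weighted_collapse(buffers):
    """Collapse sorted (weight, buffer) pairs of equal size k into one buffer
    of size k whose weight W is the sum of the input weights.
    Each element is conceptually replicated `weight` times before merging;
    the output keeps one element per block of W positions."""
    total_w = sum(w for w, _ in buffers)
    k = len(buffers[0][1])
    expanded = list(heapq.merge(*(sorted(buf * w) for w, buf in buffers)))
    # pick the middle position of each block of total_w replicated elements
    return total_w, [expanded[min(i * total_w + total_w // 2, len(expanded) - 1)]
                     for i in range(k)]

# The slide's example: weights 1, 2, 3 and k = 5 collapse into weight 6
print(weighted_collapse([(1, [28, 15, 16, 25, 27]),
                         (2, [31, 37, 6, 12, 5]),
                         (3, [10, 35, 8, 19, 13])]))   # (6, [6, 10, 15, 27, 35])
```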
Analysis of Generalized Munro-Paterson
• Same asymptotic memory bound as Munro-Paterson, O((1/ε) log^2(εN)), but with a smaller constant.
Reservoir Sampling [Vitter 85]
• Maintain a uniform random sample of size s over an input sequence of length N (sketch below).
• Approximate median = median of the sample.
• If s = O((1/ε^2) log(1/δ)), then with probability at least 1 - δ, the answer is an ε-approximate median.
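A minimal sketch of reservoir sampling in the style of Algorithm R (the skip-based optimizations of [Vitter 85] are omitted; names are mine):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s over a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < s:
            sample.append(x)
        else:
            # element i survives with probability s / (i + 1)
            j = random.randrange(i + 1)
            if j < s:
                sample[j] = x
    return sample

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]

# approximate median = median of the sample
print(median(reservoir_sample(range(1_000_000), 10_000)))
```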
"Non-Reservoir" Sampling
• When N is known in advance, choose 1 out of every N/s successive elements (sketch below).
• At the end of the stream, the sample size is s.
• Approximate median = median of the sample.
• If s = O((1/ε^2) log(1/δ)), then with probability at least 1 - δ, the answer is an ε-approximate median.
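A small sketch of this block-based sampler, assuming N is known in advance and divisible by s; the function name is mine:

```python
import random

def block_sample(stream, n, s):
    """Known stream length n: keep one random element from each block of
    n // s successive elements, giving s samples at the end."""
    block = n // s
    samples, pick, seen = [], random.randrange(block), 0
    for x in stream:
        if seen % block == pick:
            samples.append(x)
        seen += 1
        if seen % block == 0:               # new block: pick a fresh offset
            pick = random.randrange(block)
    return samples

print(len(block_sample(range(100_000), 100_000, 100)))   # 100
```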
Non-uniform Sampling ...
• Split the stream into segments of length s, 2s, 4s, 8s, ...; keep s samples from each segment, with weights 1, 2, 4, 8, ... respectively (sketch below).
• At the end of the stream, the sample size is O(s log(N/s)).
• Approximate median = weighted median of the sample.
• If s = O((1/ε^2) log(1/δ)), then with probability at least 1 - δ, the answer is an ε-approximate median.
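A sketch of the non-uniform sampler, with one simplification: each element of epoch j is kept with probability 1/2^j, so each epoch contributes s samples in expectation rather than exactly s; function names are mine:

```python
import random

def nonuniform_sample(stream, s):
    """Split the stream into epochs of length s, 2s, 4s, 8s, ...
    From the epoch of length (2**j) * s keep elements with probability
    1 / 2**j (about s of them), each carrying weight 2**j."""
    samples = []                            # list of (weight, element)
    epoch, epoch_len, seen = 0, s, 0
    for x in stream:
        if seen == epoch_len:               # start the next, twice-as-long epoch
            epoch += 1
            epoch_len *= 2
            seen = 0
        if random.random() < 1.0 / (2 ** epoch):
            samples.append((2 ** epoch, x))
        seen += 1
    return samples

def weighted_median(samples):
    samples = sorted(samples, key=lambda wx: wx[1])
    half = sum(w for w, _ in samples) / 2
    acc = 0
    for w, x in samples:
        acc += w
        if acc >= half:
            return x

print(weighted_median(nonuniform_sample(range(1_000_000), 1_000)))
```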
Sampling + Generalized Munro-Paterson [MRL 98]
• Given ε and δ, draw s = O((1/ε^2) log(1/δ)) samples from the stream:
• Stream of unknown length: Reservoir Sampling maintains the s samples.
• Stream of known length N (advance knowledge of N): "1-in-N/s" Sampling chooses the s samples.
• Then either compute the exact median of the samples (memory s), or run Generalized Munro-Paterson over the samples to compute an ε-approximate median of the samples in memory O((1/ε) log^2(εs)).
• Output is an ε-approximate median with probability at least 1 - δ.
Unknown-N Algorithm [MRL 99]
• Stream of unknown length, with parameters ε and δ.
• Non-uniform Sampling feeds a Modified Deterministic Algorithm for approximate medians.
• Output is an ε-approximate median with probability at least 1 - δ.
• Memory required: bk, with b and k as derived in the error analysis below; the bound depends only on ε and δ, not on N.
Non-uniform Sampling ...
• As before: segments of length s, 2s, 4s, 8s, ... contribute s samples each, carrying weights 1, 2, 4, 8, ...
• At the end of the stream, the sample size is O(s log(N/s)).
• Approximate median = weighted median of the sample.
• If s = O((1/ε^2) log(1/δ)), then with probability at least 1 - δ, the answer is an ε-approximate median.
Modified Deterministic Algorithm ...
• b buffers, each of size k. Compute an approximate median of the weighted samples.
• [Diagram: the weighted sample segments (2s elements with W = 1, then s elements each with W = 2, 4, 8, ..., 2^(L-h)) enter the collapse tree at heights h, h+1, h+2, h+3, ..., L, where L is the highest level and h is the height of the tree.]
Error Analysis ...
• b buffers, each of size k. The increase in fractional error in rank is 1/(2k) per level.
• The collapse structure has total height b + h, where b is the height of the small tree of buffers above the weighted samples, so the total fractional error is at most (b + h)/(2k).
• [Diagram: weighted sample segments (2s elements with W = 1, then s each with W = 2, 4, 8, ...) entering at heights b, b+1, b+2, b+3, ..., b+h.]
Error Analysis contd...
• Minimize bk subject to the following constraints:
• Number of elements covered by the leaves: k · 2^b ≥ s, where s = O((1/ε^2) log(1/δ))
• Max fractional error in rank from the collapses: b/k must be less than the portion of ε left over after the sampling error.
• This gives b = O(log(εs)) and k = O((1/ε) log(εs)), so Memory = bk = O((1/ε) log^2(εs)).
• Almost the same as before, with s in place of N.
Summary of Algorithms ...
• Reservoir Sampling [Vitter 85]: Probabilistic
• Munro-Paterson [MP 80]: Deterministic; requires advance knowledge of N
• Generalized Munro-Paterson [MRL 98]: Deterministic; requires advance knowledge of N
• Sampling + Generalized MP [MRL 98]: Probabilistic; requires advance knowledge of N
• Non-uniform Sampling + GMP [MRL 99]: Probabilistic
• Greenwald & Khanna [GK 01]: Deterministic
List of papers ...
• [Hoeffding63] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables", Journal of the American Statistical Association, pp. 13-30, 1963.
• [MP80] J. I. Munro and M. S. Paterson, "Selection and Sorting with Limited Storage", Theoretical Computer Science, 12:315-323, 1980.
• [Vit85] J. S. Vitter, "Random Sampling with a Reservoir", ACM Trans. on Mathematical Software, 11(1):37-57, 1985.
• [MRL98] G. S. Manku, S. Rajagopalan and B. G. Lindsay, "Approximate Medians and other Quantiles in One Pass and with Limited Memory", ACM SIGMOD 1998, pp. 426-435.
• [MRL99] G. S. Manku, S. Rajagopalan and B. G. Lindsay, "Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets", ACM SIGMOD 1999, pp. 251-262.
• [GK01] M. Greenwald and S. Khanna, "Space-Efficient Online Computation of Quantile Summaries", ACM SIGMOD 2001, pp. 58-66.