Flexible Approximate Counting. Scott A. Mitchell, and David M. Day Sandia National Laboratories Scott – presenter IDEAS’11 15th International Database Engineering & Applications Symposium.
Flexible Approximate Counting • Scott A. Mitchell and David M. Day, Sandia National Laboratories • Scott – presenter • IDEAS’11, 15th International Database Engineering & Applications Symposium • Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Outline • What is approximate counting? • What’s new? • Functional form • Increment decision strategies • Speed it up! • Random number and bit generators • Inverse problem • Find function given how high you want to count (Focus on red since that’s what’s significant)
What is approximate counting? • Approximate counter C • Trade decreased memory for decreased accuracy • Standard (unsigned) integer or bit field, but C represents some bigger number N • Normal integers use log2(N) bits to represent 0..N • Counter C can use log2(log2(N)) bits to represent 1..log2(N) • Accurate to within a factor of 2 • “Count” to 2^(2^8) using 8 bits • N = φ(C) function • Count using only the exponent • (Figure labels: 100, 110, 100110; unary, binary, floating point)
What is approximate counting? • Count occurrences of datastream objects, e.g. pairs of IP addresses • Problem • Object arrives, decide whether to increment • What is N+1 if you only stored C? • C=4, N=16. Choose 16+1 = 16 or 16+1 = 32 • Solution • Coin flipping. 16+1 = 32 with probability p = 1/(32-16) • Flajolet papers, 1985-2004+, prove expected value and error are reasonable • Two sources of error • Unavoidable: intermediate numbers not representable. Constant-factor approximation. • Datastream: can’t view all the data at once, random decisions. Expected error bounds.
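The coin-flipping rule on this slide can be sketched as follows. This is a minimal illustration, assuming φ(C) = 2^C as in the C=4, N=16 example; the function names (`phi`, `observe`, `count_stream`) are hypothetical and not taken from the talk.

```python
import random

def phi(c):
    """Value represented by counter c; phi(C) = 2^C as in the slide's example."""
    return 2 ** c

def increment_probability(c):
    """p = 1 / (phi(c+1) - phi(c)); for c = 4 this is 1/(32-16)."""
    return 1.0 / (phi(c + 1) - phi(c))

def observe(c, rng):
    """Process one arriving object: return c + 1 if the coin flip succeeds, else c."""
    if rng.random() < increment_probability(c):
        return c + 1
    return c

def count_stream(n_events, rng):
    """Feed n_events objects to a counter that starts at 0 and return the final c."""
    c = 0
    for _ in range(n_events):
        c = observe(c, rng)
    return c
```

With this particular φ and a counter starting at 0, the expected value of φ(C) after n arrivals is n + 1, so any single run can be off by a constant factor while the average over many counters is close.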
Motivation • Old idea (memory-accuracy) with some new uses • Morris 1978, one small register on a CPU • Today big data, lots of counters • Data-summarization • Approximate Counting useful by itself, for counting all objects • Database merge • Choose most efficient algorithm, pre-allocate memory • May be combined with other techniques • Bloom filters • Replace 1-bit with a small counter, Van Durme & Lall 2009 • Spread counter into multiple bits of a Bloom filter and vary the number of bits for skewed data, Talbot 2009
Generalize Function: q-ary counting and Floating Point AC • ΔN = 2^C. Why base 2? • p = 2^(-C). Use fast random-bit source for increment decisions • Csűrös 2010 • Treat counter as binary-exponent floating point number • Exponent gives powers-of-two increment probabilities • Significand gives better accuracy than base 2 • Stair-step approximation to “q-ary” counting • I.e. restricted to 9 choices for 8-bit counters • First contribution • Get these advantages… • …without these restrictions • (Figure: counter bits 0100 0110; (8-d)-bit exponent, d-bit significand)
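A floating-point counter of the kind described here can be sketched as below. The slide gives no formulas, so the estimate `(2^d + u) * 2^t - 2^d` is an assumption recalled from the floating-point-counter literature and should be checked against Csűrös 2010; `D = 4` and all function names are hypothetical.

```python
import random

D = 4          # assumed number of significand bits in an 8-bit counter
M = 1 << D     # 2^d

def fp_estimate(x):
    """Estimated count for counter x: exponent t in the high bits, significand u in the low d bits."""
    t, u = x >> D, x & (M - 1)
    return (M + u) * (1 << t) - M

def fp_increment_probability(x):
    """Power-of-two increment probability taken from the exponent bits only."""
    return 2.0 ** -(x >> D)

def fp_observe(x, rng):
    """Process one arriving object: return x + 1 if the coin flip succeeds, else x."""
    if rng.random() < fp_increment_probability(x):
        return x + 1
    return x
```

Under this assumed formula the counter is exact while the exponent is 0, and each later step adds 2^t to the estimate, which is the stair-step behaviour the slide mentions.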
Our Flexible AC • Flexible AC • Perfect counting below a threshold T, then • ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value • a is small (<2) since 255 = log2(5.7e76) • Round ΔN to integer • Still get prior speedups • Round all ΔN to powers-of-two • If speed(RandomBit) < ½ speed(RandomNumber)
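The functional form on this slide can be tabulated once and then used as a lookup. The sketch below assumes increments of 1 below the threshold T and round(a^(C-T)) from T onward; `build_phi` and `flexible_increment_probability` are hypothetical names, and Python's `round` (round-half-to-even) stands in for whatever rounding the authors used.

```python
def build_phi(a, threshold, bits=8):
    """Table phi[c] of represented values for every counter state c."""
    size = 1 << bits
    phi = [0] * size
    for c in range(size - 1):
        if c < threshold:
            delta = 1                                   # perfect counting below T
        else:
            delta = max(1, round(a ** (c - threshold)))  # delta N = a^(C-T), rounded to an integer
        phi[c + 1] = phi[c] + delta
    return phi

def flexible_increment_probability(phi, c):
    """p = 1 / delta N for the step from c to c + 1."""
    return 1.0 / (phi[c + 1] - phi[c])
```

For example, `build_phi(2.0, 3, bits=4)` starts 0, 1, 2, 3, 4, 6, 10, 18: exact up to the threshold, then growing geometrically.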
Random Bit Generator • Many well-tested random number generators • Fewer random bit generators • Knuth vol. 2 eq 10 – very simple (fast!)
    A = 0x0102010081010101   // 64-bit constant
    X = X << 1               // shift left
    if overflow: X = X xor A
    RandomBit = X & 1        // lowest bit of X
• A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003 • Every length-64 bit-sequence occurs once before repetition • Consider accuracy in terms of intended use. What matters for our application: • k one-bits in a row occurs 1 in 2^k times • Verified experimentally: generated 2^47 bits, 42 one-bits in a row occurs 1 in 2^42 times
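The shift-and-xor generator above can be written out in Python roughly as follows. Python integers are unbounded, so the 64-bit overflow is emulated with a mask; the constant is copied from the slide and its primitivity is not re-verified here, and `next_state` and `random_bits` are hypothetical names.

```python
MASK64 = (1 << 64) - 1
A = 0x0102010081010101  # constant as printed on the slide (assumed to be hexadecimal)

def next_state(x):
    """Shift left by one bit; if the top bit fell off, xor in the constant A."""
    overflow = x >> 63
    x = (x << 1) & MASK64
    if overflow:
        x ^= A
    return x

def random_bits(seed, n):
    """Return n successive output bits (lowest bit of the state) from a nonzero seed."""
    x = seed & MASK64
    if x == 0:
        raise ValueError("seed must be nonzero, the zero state never changes")
    bits = []
    for _ in range(n):
        x = next_state(x)
        bits.append(x & 1)
    return bits
```

Because the lowest bit of A is 1, the output bit equals the bit shifted out on that step, so a seed of 1 produces 63 zeros before the first one-bit; a well-mixed seed avoids that warm-up.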
Speed Comparison • If this is embedded in a datastream application, speed may be important. • Random number generator is the bottleneck (goal is incrementing a counter!)
    if RandNumber < p: increment                   // p = 2^(-k)
    if k RandomBits in a row are one: increment
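The two increment decisions compared on this slide can be sketched side by side. Both succeed with probability 2^(-k); the bit version stops at the first zero bit, so it usually consumes only a couple of bits. The names `increment_by_number` and `increment_by_bits` are hypothetical, and `next_bit` is any zero-argument callable returning 0 or 1.

```python
import random

def increment_by_number(k, rng):
    """Decide with one uniform random number: succeed with probability p = 2^(-k)."""
    return rng.random() < 2.0 ** -k

def increment_by_bits(k, next_bit):
    """Decide with random bits: succeed only if k one-bits arrive in a row."""
    for _ in range(k):
        if next_bit() == 0:
            return False   # early exit, remaining bits are not drawn
    return True
```

A bit source can be supplied as, for instance, `lambda: rng.getrandbits(1)` with `rng = random.Random()`.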
Random Countdown Speedup • Why generate a random number every time? • Set countdown counter P • P = number of times in a row RandNumber > p [no increment] • This is the definition of a geometric distribution • Need one countdown counter per counter value (1..255), not per counter (billions) • Calculating P is (relatively) very expensive • Fast on average if P is large, i.e. p is small • Hybrid algorithm • RandNumber < p or RandomBit for small increments • Random Countdown for large increments • “small” means <10 or <22
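A random countdown can be sketched by drawing the number of skipped arrivals directly from a geometric distribution. The slide does not say how P is calculated, so the inverse-transform formula below is an assumption, and `sample_skips` and `RandomCountdown` are hypothetical names.

```python
import math
import random

def sample_skips(p, rng):
    """Number of consecutive non-increments before the next increment (geometric)."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                      # uniform in (0, 1], avoids log(0)
    return int(math.log(u) / math.log1p(-p))   # floor of log(u) / log(1 - p)

class RandomCountdown:
    """One countdown shared by all counters that currently have increment probability p."""

    def __init__(self, p, rng):
        self.p = p
        self.rng = rng
        self.remaining = sample_skips(p, rng)

    def should_increment(self):
        if self.remaining > 0:
            self.remaining -= 1                 # cheap path: just decrement
            return False
        self.remaining = sample_skips(self.p, self.rng)   # expensive path, taken rarely when p is small
        return True
```

The logarithms are only evaluated once per increment, which is why this pays off when p is small and the countdown is long.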
Fixed Countdown Speedup • Why generate a random number at all? • Increment “1 in Δφ” times deterministically • Slightly different value to get correct expected value • Best possible accuracy if only one item • Fastest • Relies on randomness of stream • E.g. alternating items give bad counts
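The dependence on stream order can be seen in a deliberately simplified sketch. It assumes every item sits at a level with the same increment `delta`, uses one shared countdown, and stores the estimate directly instead of a counter state; the expected-value correction mentioned on the slide is not modeled. `fixed_countdown_counts` is a hypothetical name.

```python
def fixed_countdown_counts(stream, delta):
    """Credit `delta` to every delta-th arriving item; no random numbers are drawn."""
    counts = {}
    countdown = delta
    for item in stream:
        countdown -= 1
        if countdown == 0:
            counts[item] = counts.get(item, 0) + delta
            countdown = delta
    return counts
```

With `delta = 2` and a strictly alternating stream A, B, A, B, … every increment lands on B and A is never counted, while `delta = 3` on the same stream happens to split the credit evenly.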
Punchline • Speed: RandomCount vs. FixedCount • RandomCount = 1.5x FixedCount for Δφ=255 • RandomCount = ¼x RandomBit for Δφ=172
How High Do You Want to Count? Inverse problem (David M. Day) • Find a, never discussed in approximate counting literature • For some applications, determine by hand ahead of time • Our run-time solution • Inverse geometric sum = const • Find root > 1 of r(a) • Initial guess depends on s compared to K, i.e. a^(K+1) vs. s·a vs. (s-1) • Tricky case
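One plausible reading of this slide, offered as a reconstruction rather than the authors' exact formulation: assume K is the number of counter steps above the threshold and s is the total those steps must cover. The terms a^(K+1), s·a and (s-1) named on the slide are then exactly the terms of the polynomial obtained by clearing the denominator of the geometric sum:

```latex
\sum_{j=0}^{K} a^{j} \;=\; \frac{a^{K+1}-1}{a-1} \;=\; s
\quad\Longrightarrow\quad
r(a) \;=\; a^{K+1} - s\,a + (s-1) \;=\; 0 .
```

Under this reading a = 1 is always a root of r, since 1 - s + (s-1) = 0, which would explain why the slide asks specifically for the root greater than 1.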
Inverse Problem Alternatives • We’re only approximately counting, so accuracy may not be important • We only calculate function once, so efficiency may not be important (application dependent) • Use the initial guesses • Use binary search or lookup table • Use N=φ(C) function with easier inverse • E.g. exponential + linear function, but increments are too small for small C
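The binary-search alternative can be sketched as a bisection on a. This assumes the inverse problem is to find a > 1 with 1 + a + … + a^K = s, where K (number of steps) and s (target total) are assumed meanings for the slide's symbols; `geometric_sum` and `solve_base` are hypothetical names.

```python
def geometric_sum(a, k):
    """1 + a + ... + a^k, evaluated by Horner's rule."""
    total = 0.0
    for _ in range(k + 1):
        total = total * a + 1.0
    return total

def solve_base(s, k, iterations=200):
    """Bisection for the a > 1 with geometric_sum(a, k) == s."""
    if k < 1 or s <= k + 1:
        raise ValueError("need k >= 1 and s > k + 1 for a root greater than 1")
    lo = 1.0                 # geometric_sum(1, k) = k + 1 < s
    hi = s ** (1.0 / k)      # hi^k = s, so geometric_sum(hi, k) > s
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if geometric_sum(mid, k) < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since the function is computed only once, the fixed iteration count is a simplicity choice; for example, `solve_base(15.0, 3)` converges to a base of about 2.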
Conclusion • Flexible Approximate Counting provides • Customization of functional form • At run-time, for maximum value to count to • Fast decisions of whether to increment • If datastream is sufficiently random • Use fixed countdown • Else • Switch to random countdown for large increments • If speed is more important than accuracy for small increments • Use random bits and power-of-two increments • Random generator accuracy limits • Consider the intended use • RandNumber: min r such that probability(u<r) ≈ r • RandomBit: max k such that probability(k one-bits in a row) ≈ 2^(-k) • Thank you • Have a safe trip home