Flexible Approximate Counting

Scott A. Mitchell and David M. Day, Sandia National Laboratories (presenter: Scott A. Mitchell)
IDEAS’11, 15th International Database Engineering & Applications Symposium
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Outline
• What is approximate counting?
• What’s new?
  • Functional form
  • Increment decision strategies
    • Speed it up!
    • Random number and bit generators
  • Inverse problem
    • Find function given how high you want to count
(Focus on red since that’s what’s significant)

What is approximate counting?
• Approximate counter C
  • Trade decreased memory for decreased accuracy
  • Standard (unsigned) integer or bit field, but C represents some bigger number N
• Normal integers use log2(N) bits to represent 0..N
• Counter C can use log2(log2(N)) bits to represent 1..log2(N)
  • Accurate to within a factor of 2
  • “Count” to 2^(2^8) using 8 bits
• N = φ(C) function
• Count using only the exponent
[Figure: unary, binary, and floating point representations; bit strings 100 110 and 100110]

What is approximate counting?
• Count occurrences of datastream objects, e.g. pairs of IP addresses
• Problem
  • Object arrives, decide whether to increment
  • N+1 = ? if you only stored C?
  • C=4, N=16. Choose 16+1 = 16 or 16+1 = 32
• Solution
  • Coin flipping. 16+1 = 32 with probability p = 1/(32-16)
  • Flajolet papers prove expected value and error are reasonable, 1985-2004+
• Two sources of error
  • Unavoidable: intermediate numbers not representable. Constant-factor approximation.
  • Datastream: can’t view all the data at once, random decisions. Expected error bounds.
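
The coin-flipping rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `phi`, `increment`, and `count_stream` are hypothetical, and φ(C) = 2^C is assumed only because it matches the slide's example (C=4, N=16).

```python
import random

def phi(c):
    """Value N represented by counter c (assumed: phi(C) = 2^C, so C=4 means N=16)."""
    return 2 ** c

def increment(c, rng=random.random):
    """Return the counter after one arrival, using the coin flip from the slide."""
    p = 1.0 / (phi(c + 1) - phi(c))  # e.g. 1/(32-16) when c == 4
    return c + 1 if rng() < p else c

def count_stream(n_items, seed=0):
    """Feed n_items arrivals through one counter; return (C, estimated N)."""
    rng = random.Random(seed).random
    c = 0
    for _ in range(n_items):
        c = increment(c, rng)
    return c, phi(c)
```

Only the small value C is stored per object; the estimate φ(C) is recomputed when needed, which is where the memory saving comes from.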

Motivation
• Old idea (memory-accuracy) with some new uses
  • Morris 1978, one small register on a CPU
  • Today big data, lots of counters
• Data-summarization
  • Approximate Counting useful by itself, for counting all objects
  • Database merge
    • Choose most efficient algorithm, pre-allocate memory
• May be combined with other techniques
  • Bloom filters
    • Replace 1-bit with a small counter, Van Durme & Lall 2009
    • Spread counter into multiple bits of a Bloom filter, Talbot 2009; vary the number of bits for skewed data

Generalize Function: q-ary counting and Floating Point AC
• ΔN = 2^C. Why base 2?
  • p = 2^(-C)
  • Use fast random-bit source for increment decisions
• Csűrös 2010
  • Treat counter as binary-exponent floating point number
  • Exponent gives powers-of-two increment probabilities
  • Significand gives better accuracy than base 2
  • Stair-step approximation to “q-ary” counting
  • I.e. restricted to 9 choices for 8-bit counters
• First contribution: get these advantages without these restrictions
[Figure: 8-bit counter 0100 0110 split into an (8-d)-bit exponent and a d-bit significand]
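
To make the floating-point counter concrete, here is a hedged Python sketch of how such a counter might be decoded. The formula (2^d + u)·2^t − 2^d is my recollection of the floating-point counter formulation attributed to Csűrös 2010 above and is not stated on the slide, so treat it as an assumption; `decode` and `increment_probability` are hypothetical names.

```python
def decode(x, d):
    """Estimate for counter x with a d-bit significand (assumed formula)."""
    m = 1 << d
    t = x >> d         # exponent: the high bits
    u = x & (m - 1)    # significand: the low d bits
    return (m + u) * (1 << t) - m

def increment_probability(x, d):
    """Powers-of-two increment probability 2^(-t), set by the exponent alone."""
    return 1.0 / (1 << (x >> d))
```

Under this assumed formula, d = 8 makes an 8-bit counter exact (`decode(x, 8) == x`), d = 0 gives pure base-2 counting (2^x − 1, offset by one from the 2^C convention used earlier), and the slide's example pattern 0100 0110 with d = 4 decodes to 336. The nine values d = 0..8 are the only accuracy settings available, which is the restriction the slide points out.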

Our Flexible AC
• Flexible AC
  • Perfect counting below a threshold T, then
  • ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value
  • a small (<2) since 255 = log2(5.7e76)
  • Round ΔN to integer
• Still get prior speedups
  • Round all ΔN to powers-of-two
  • If speed(RandomBit) < ½ speed(RandomNumber)
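
A small Python sketch of the flexible functional form follows. It assumes the step from counter value c to c+1 uses the exponent c−T and that φ(0) = 0; the slide does not pin down either indexing choice, and `delta` and `phi_table` are hypothetical names.

```python
def delta(c, a, T):
    """Increment in N when the counter steps from c to c+1 (assumed indexing)."""
    if c < T:
        return 1                         # perfect counting below the threshold
    return max(1, round(a ** (c - T)))   # rounded to an integer, as on the slide

def phi_table(a, T, c_max=255):
    """Table of phi(0..c_max); the increment probability at c is 1/delta(c, a, T)."""
    table = [0]
    for c in range(c_max):
        table.append(table[-1] + delta(c, a, T))
    return table
```

For example, `phi_table(1.1, 16)[-1]` gives the largest count an 8-bit counter could represent with a = 1.1 and T = 16 under these assumptions. The base a stays below 2 in practice because a = 2 with 255 doublings already reaches about 5.7e76.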

Random Bit Generator
• Many well-tested random number generators
• Fewer random bit generators
• Knuth vol. 2 eq 10: very simple (fast!)
    A = 0x0102010081010101    // 64-bit constant
    X = X << 1                // shift left
    if overflow: X = X xor A
    RandomBit = X & 1         // lowest bit of X
• A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003
• Every length-64 bit-sequence occurs once before repetition
• Consider accuracy in terms of intended use. What matters for our application:
  • k one-bits in a row occurs 1 in 2^k times
  • Generated 2^47 bits; 42 one-bits in a row occurs 1 in 2^42 times, verified experimentally
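
The shift-register recipe above translates almost line for line into Python. This is a sketch only: Python integers do not overflow, so the 64-bit register is emulated with a mask, and the names `next_bit` and `bits` are hypothetical. The constant is the one from the slide.

```python
MASK64 = (1 << 64) - 1
A = 0x0102010081010101  # 64-bit constant from the slide (8 one-bits)

def next_bit(state):
    """Advance the register one step; return (new_state, random_bit)."""
    overflow = state >> 63            # the bit that a 64-bit left shift would drop
    state = (state << 1) & MASK64     # X = X << 1
    if overflow:
        state ^= A                    # X = X xor A
    return state, state & 1           # lowest bit of X

def bits(seed, n):
    """Return n successive bits starting from a nonzero 64-bit seed."""
    state = seed & MASK64
    if state == 0:
        raise ValueError("seed must be nonzero; zero is a fixed point")
    out = []
    for _ in range(n):
        state, b = next_bit(state)
        out.append(b)
    return out
```

Because the low bit of A is set, the emitted bit equals the bit shifted out of the top of the register. A seed with few one-bits produces a long run of zeros at the start, so a real application would likely discard an initial stretch of output.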

Speed Comparison
• If this is embedded in a datastream application, speed may be important.
• Random number generator is the bottleneck (goal is incrementing a counter!)
    if RandNumber < p: increment                  // p = 2^(-k)
    if k RandomBits in a row are one: increment
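
The two increment tests compared on this slide can be sketched side by side. The function names are hypothetical, and Python's `random` module stands in for whatever generators the authors timed.

```python
import random

def increment_by_number(p, rng=random.random):
    """Decide with one uniform random number: increment with probability p."""
    return rng() < p

def increment_by_bits(k, next_bit=lambda: random.getrandbits(1)):
    """Decide with random bits: increment only if k bits in a row are one."""
    for _ in range(k):
        if not next_bit():
            return False   # stop at the first zero bit
    return True            # probability 2^(-k)
```

The bit version usually stops early: the first bit is zero half the time, so on average only about two bits are consumed per decision regardless of k.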

Random Countdown Speedup
• Why generate a random number every time?
• Set countdown counter P
  • P = number of times in a row RandNumber > p [no increment]
  • This is the definition of a geometric distribution
• Need one countdown counter per counter value (1..255), not per counter (billions)
• Calculating P is (relatively) very expensive
  • Fast on average if P is large, i.e. p is small
• Hybrid algorithm
  • RandNumber < p? or RandomBit for small increments
  • Random Countdown for large increments
  • “small” means <10 or <22
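
A hedged sketch of the random countdown: rather than testing a random number on every arrival, draw the number of skipped arrivals from a geometric distribution and count down. Inverse-transform sampling with a logarithm is used here as one standard way to draw P; the slides do not say how the authors compute it, and `draw_countdown` and `CountdownSlot` are hypothetical names.

```python
import math
import random

def draw_countdown(p, u):
    """Arrivals to skip before the next increment, from uniform u in (0, 1]."""
    if p >= 1.0:
        return 0
    return int(math.floor(math.log(u) / math.log1p(-p)))

class CountdownSlot:
    """One countdown, intended to be shared by all counters holding the same value."""

    def __init__(self, p, rng=None):
        self.p = p
        self.rng = rng or random.Random()
        self.remaining = self._draw()

    def _draw(self):
        # 1 - random() lies in (0, 1], which keeps log() finite
        return draw_countdown(self.p, 1.0 - self.rng.random())

    def arrive(self):
        """Return True if this arrival should increment the counter."""
        if self.remaining > 0:
            self.remaining -= 1
            return False
        self.remaining = self._draw()
        return True
```

The logarithm is the expensive part, but it is paid only once per increment, so the cost per arrival falls as p gets smaller.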

Fixed Countdown Speedup
• Why generate a random number at all?
• Increment “1 in Δφ” times deterministically
  • Slightly different value to get correct expected value
  • Best possible accuracy if only one item
  • Fastest
  • Relies on randomness of stream
    • E.g. alternating items give bad counts
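
The alternating-items failure is easy to reproduce. The sketch below assumes a single countdown shared by every item at the same counter value, by analogy with the random countdown slide; the function name `fixed_countdown_hits` is hypothetical, and the counter values themselves are left out to keep the effect visible.

```python
from collections import Counter

def fixed_countdown_hits(stream, delta):
    """Count which items receive the deterministic '1 in delta' increments."""
    hits = Counter()
    countdown = delta
    for item in stream:
        countdown -= 1
        if countdown == 0:        # every delta-th arrival gets the increment
            hits[item] += 1
            countdown = delta
    return hits
```

With the stream A, B, A, B, ... and delta = 2, every increment lands on B and A is never counted, although both items occur equally often. With a single repeated item the same rule is exact, which matches the "best possible accuracy if only one item" bullet.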

Punchline
• Speed: RandomCount vs. FixedCount
• RandomCount = 1.5x FixedCount for Δφ=255
• RandomCount = ¼x RandomBit for Δφ=172

How High Do You Want to Count? Inverse problem (David M. Day)
• Find a, never discussed in approximate counting literature
• For some applications, determine by hand ahead of time
• Our run-time solution
  • Inverse geometric sum: given the constant sum s, find a
  • Find root >1 for r(a)
  • Initial guess depends on s compared to K, i.e. a^(K+1) vs. s·a vs. (s-1); tricky case
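
The equations on this slide did not survive extraction, so the following is a reconstruction under stated assumptions rather than the authors' method. The three terms a^(K+1), s·a, and (s-1) are consistent with the geometric sum 1 + a + ... + a^K = s, which rearranges to r(a) = a^(K+1) - s·a + (s-1) = 0 with a trivial root at a = 1. The sketch finds the root above 1 by plain bisection, not by the initial-guess scheme the slide alludes to; `geometric_sum` and `solve_base` are hypothetical names.

```python
def geometric_sum(a, K):
    """1 + a + a^2 + ... + a^K."""
    return sum(a ** i for i in range(K + 1))

def solve_base(s, K, iters=200):
    """Find a > 1 with geometric_sum(a, K) == s (assumed form of the inverse problem)."""
    if K < 1 or s <= K + 1:
        raise ValueError("need K >= 1 and s > K + 1 for a root above 1")

    def r(a):
        return a ** (K + 1) - s * a + (s - 1)

    lo = (s / (K + 1)) ** (1.0 / K)   # minimizer of r, where r < 0
    hi = float(s) ** (1.0 / K)        # a^K alone already equals s here, so r > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if r(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi
```

For instance, `solve_base(7, 2)` converges to 2, since 1 + 2 + 4 = 7. Because r is convex for a > 0 and vanishes at a = 1, there is exactly one root above the minimizer, so the bracket is safe whenever s > K + 1.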

Inverse Problem Alternatives
• We’re only approximately counting,
  • So accuracy may not be important
• We only calculate function once,
  • So efficiency may not be important
• (Application dependent)
• Use the initial guesses
• Use binary search or lookup table
• Use N=φ(C) function with easier inverse
  • E.g. exponential + linear function, but increments are too small for small C

Conclusion
• Flexible Approximate Counting provides
  • Customization of functional form
    • At run-time, for maximum value to count to
  • Fast decisions of whether to increment
    • If datastream is sufficiently random
      • Use fixed countdown
    • Else
      • Switch to random countdown for large increments
      • If speed is more important than accuracy for small increments
        • Use random bits and power-of-two increments
• Random generator accuracy limits
  • Consider the intended use
  • RandNumber: min r such that probability(u<r) ≈ r
  • RandomBit: max k such that probability(k one-bits in a row) ≈ 2^(-k)
• Thank you
• Have a safe trip home
