Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps. Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio State University Hakan Ferhatosmanoglu – The Ohio State University Ali Saman Tosun – University of Texas at San Antonio.
Presentation Outline • Motivation • Goal • Approximate Bitmaps (AB) encoding • AB example • Theoretical analysis • Experiments and Results • Conclusion
Motivation • Bitmap indices • Data warehouses • Scientific data • Visualization applications • Bitwise operations • Bitmap Compression • Run-length encoders • Word Aligned Hybrid (WAH) • Byte-aligned Bitmap Code (BBC)
Motivation • In a compressed bitmap, row numbers no longer correspond to bit positions • Queries over a few particular rows are as expensive as queries asking for all the rows • Commonly, users are only interested in a small subset of the dataset at a time • For example: • A query over the transactions of the last 7 days • Spatial queries over objects in a specific geographical area
Motivation • Visualization applications • Millions of different readings ordered by their geographic location • Users ask range queries over some of the readings for a given area • The answers are highlighted on the screen • Several degrees of resolution make approximate answers acceptable
Our Goal • Enable direct access over any subset of the bitmap • Achieve effective compression • Maintain bitwise operations for query execution • Trade-off efficiency vs. accuracy • No false negatives
The approach • Our solution is inspired by Bloom filters • A $2^m$-bit array indexed using k independent hash functions • A data object is inserted by setting the k positions in the array corresponding to the hash values of the object • False positives can happen, but false negatives cannot
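As a minimal sketch of the Bloom filter idea on this slide (not the authors' implementation), the following toy class keeps a $2^m$-bit array and sets k positions per inserted item. The class name `BloomFilter`, the method names, and the way the k hash values are derived (SHA-1 salted with the hash index) are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over a 2**m-bit array with k hash functions."""

    def __init__(self, m: int, k: int):
        self.size = 1 << m            # 2**m bits
        self.k = k
        self.bits = [0] * self.size

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by salting SHA-1 with an index.
        for idx in range(self.k):
            digest = hashlib.sha1(f"{idx}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def insert(self, item: str) -> None:
        # Set the k positions that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=10, k=3)
for word in ("alpha", "beta", "gamma"):
    bf.insert(word)

# Every inserted item is reported as present: no false negatives.
print(all(bf.might_contain(w) for w in ("alpha", "beta", "gamma")))  # True
```

An item that was never inserted can still be reported as present if other items happened to set all k of its positions, which is the false-positive case the slide refers to.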
Approximate Bitmaps (AB) • A Bloom filter-like structure • Only the set bits are inserted into the AB • Three levels of encoding: • Per table, per attribute, per bitmap column • Parameters: • The hash string mapping function, F • The k hash functions, {H1(x),…,Hk(x)} • The size of the AB, $n = \alpha s = 2^m$ • Precision in terms of α and k: $P \approx 1 - (1 - e^{-k/\alpha})^k$
AB Example • A bitmap table for a dataset with 8 rows and 3 attributes. Each attribute is divided into 3 categories. • Bitmap Table Size: 72 bits • Number of set bits = 24 • F(i,j) = concatenate(i,j) = x • H1(x) = x mod 32 • m = 5 • AB Size: $2^5 = 32$ bits
AB Example - Insertion • Initially all bits in the AB are zero • To insert set bit in (1,1)
AB Example - Insertion • To insert set bit in (1,1) • x = 11 • H(11) = 11 mod 32 = 11 • AB(11) = 1
AB Example - Insertion • To insert set bit in (5,4) • x = 54 • H(54) = 54 mod 32 = 22 • AB(22) = 1
AB Example - Insertion • After all insertions
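The two insertions walked through above can be reproduced with a few lines of Python. This is only a sketch of the slide example: the function names `F`, `H1`, and `insert` follow the slide notation, and treating `concatenate(i,j)` as decimal digit concatenation is an assumption that matches the values shown (11 and 54).

```python
AB_BITS = 32  # 2**5 bits, as in the example


def F(row: int, col: int) -> int:
    # Assumption: concatenate the decimal digits of row and column.
    return int(f"{row}{col}")


def H1(x: int) -> int:
    # The single hash function of the example.
    return x % AB_BITS


def insert(ab: list, row: int, col: int) -> None:
    # Insert one set bit of the bitmap table into the AB.
    ab[H1(F(row, col))] = 1


ab = [0] * AB_BITS      # initially all bits are zero
insert(ab, 1, 1)        # x = 11 -> position 11
insert(ab, 5, 4)        # x = 54 -> position 22

print([pos for pos, bit in enumerate(ab) if bit])  # [11, 22]
```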
AB Example - Analysis • Estimated Precision: • α = AB size / set bits • α = 32/24 = 1.33 • k = 1 • $FP = (1 - e^{-k/\alpha})^k$ • P = 1 − FP • $P = 1 - (1 - e^{-1/1.33})$ • P = 47% • The underlined positions are false positives • Only 8 of the 48 zero cells appear as set in the AB
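The estimate on this slide can be checked numerically. The helper name `estimated_precision` is hypothetical; the formula is the one given in the deck, with α = AB size / set bits.

```python
import math


def estimated_precision(ab_size: int, set_bits: int, k: int) -> float:
    """Estimated precision P = 1 - (1 - e^(-k/alpha))^k."""
    alpha = ab_size / set_bits
    false_positive_rate = (1 - math.exp(-k / alpha)) ** k
    return 1 - false_positive_rate


# Values from the example: 32-bit AB, 24 set bits, one hash function.
p = estimated_precision(ab_size=32, set_bits=24, k=1)
print(f"{p:.2f}")  # 0.47
```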
AB Example - Retrieval • Consider this query, asking for 4 rows • This is a range query over 4 rows, where the third attribute falls into C1 or C2 • Row 4: • (4,7): H(47) = 15 • AB(15) = 0 • (4,8): H(48) = 16 • AB(16) = 1 • Row 5: • (5,7): H(57) = 25 • AB(25) = 1 • Stop
AB Example - Retrieval • Consider this query, asking for 4 rows • Row 6: • (6,7): H(67) = 3 • AB(3) = 1 • Stop • Approx Query Answer: • {1,1,1,0} • Exact Answer: • {0,1,1,0}
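The retrieval steps above can be sketched as a loop that probes the AB once per queried cell and stops at the first hit in a row. Several things here are assumptions, since the full bitmap table is not reproduced in the text: the AB below contains only the positions the slides mention as set (3, 11, 16, 22, 25), the query is taken to cover rows 4 through 7 and columns 7 and 8, and the names `cell_hash` and `query_rows` are hypothetical.

```python
AB_BITS = 32

# Assumption: a partial AB holding only the positions named on the slides.
KNOWN_SET_POSITIONS = {3, 11, 16, 22, 25}
ab = [1 if pos in KNOWN_SET_POSITIONS else 0 for pos in range(AB_BITS)]


def cell_hash(row: int, col: int) -> int:
    # F(i,j) = concatenate(i,j), then H(x) = x mod 32, as in the example.
    return int(f"{row}{col}") % AB_BITS


def query_rows(ab: list, rows, cols) -> list:
    """Return one bit per row: 1 if any queried column may be set."""
    answer = []
    for row in rows:
        hit = 0
        for col in cols:
            if ab[cell_hash(row, col)]:
                hit = 1
                break           # "Stop": one hit is enough for an OR query
        answer.append(hit)
    return answer


print(query_rows(ab, rows=range(4, 8), cols=(7, 8)))  # [1, 1, 1, 0]
```

Only the queried cells are touched, so the cost depends on the number of rows and columns in the query rather than on the size of the table. The first bit of the answer is the false positive from the slide: cell (4,8) hashes to position 16, which was set by a different cell.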
Approximate Bitmaps (AB) – Mapping Function F • F maps each cell in the bitmap table to a unique string (the hashing string) • For one AB per table and one AB per attribute, the bit in row i column j is identified by • F(i,j) = i << w || j, where w is large enough to accommodate all j • For one AB per column, the bit in row i is identified by • F(i,j) = i
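A minimal sketch of the shift-based mapping, reading `i << w || j` as placing the row number in the high bits and the column number in the low w bits (this reading is an interpretation of the slide notation). The helper name `make_mapper` and its `max_col` parameter are hypothetical.

```python
def make_mapper(max_col: int):
    """Return (F, w), where w bits are reserved for the column id."""
    w = max(1, max_col.bit_length())    # smallest w with max_col < 2**w

    def F(row: int, col: int) -> int:
        if not 0 <= col <= max_col:
            raise ValueError("column id out of range")
        return (row << w) | col         # row in high bits, column in low bits

    return F, w


F, w = make_mapper(max_col=9)           # 9 bitmap columns, as in the example
print(w, F(5, 4))                       # 4 84

# Every cell of the 8 x 9 example table gets a distinct hashing string.
keys = {F(i, j) for i in range(1, 9) for j in range(1, 10)}
print(len(keys))                        # 72
```

Reserving a fixed number of bits for the column keeps the mapping unique. Plain decimal concatenation, as used in the small example, would map (1,11) and (11,1) to the same string once column numbers reach two digits.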
Approximate Bitmaps (AB) – Hash Functions • Single Hash Function • Called once and the result is divided into pieces • Each piece is treated as the value of a different hash function • Secure Hash Algorithm (SHA), developed by the National Institute of Standards and Technology (NIST) • Example: a 160-bit SHA output divided into ten 16-bit pieces • H0 = bits 159..144 = 0100100010001010 • H1 = bits 143..128 = 1000010100100001 • H2 = bits 127..112 = 0111100011100010 • ... • H9 = bits 15..0 = 0000010101110011 • Multiple Hash Functions • Independent hash functions • For a large number of hash functions, performance is similar
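The single-hash approach can be sketched with Python's `hashlib`: SHA-1 is called once and its 160-bit digest is cut into 16-bit pieces, with H0 taken from bits 159..144 and H9 from bits 15..0 as in the table above. The function name `split_sha1`, the encoding of the hashing string as decimal text, and the reduction of each piece to an AB index by `mod 2**m` are assumptions, not details given in the slides.

```python
import hashlib


def split_sha1(x: int, k: int, m: int) -> list:
    """Derive k AB positions from one SHA-1 call on hashing string x."""
    if not (1 <= k <= 10 and 1 <= m <= 16):
        raise ValueError("a 160-bit digest yields at most ten 16-bit pieces")
    # Assumption: the hashing string is fed to SHA-1 as decimal text.
    digest = int.from_bytes(hashlib.sha1(str(x).encode()).digest(), "big")
    positions = []
    for idx in range(k):
        shift = 160 - 16 * (idx + 1)        # H0 -> bits 159..144, H9 -> bits 15..0
        piece = (digest >> shift) & 0xFFFF  # one 16-bit piece
        positions.append(piece % (1 << m))  # assumption: keep the low m bits
    return positions


positions = split_sha1(x=54, k=3, m=5)
print(len(positions), all(0 <= p < 32 for p in positions))  # 3 True
```

One digest computation replaces k separate hash evaluations, at the cost of capping k at the number of pieces the digest can supply.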
Approximate Bitmaps (AB) – FP Rate • FP Rate: probability that all k bits probed for a cell have been set by other data objects • n is the size of the AB • s is the number of set bits • n = αs, α = n/s
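The slide's formula is not reproduced in the text. Under the standard Bloom filter analysis, which assumes that each hash function picks positions independently and uniformly (an assumption, not something stated on the slide), the false positive rate can be written in terms of n, s, k, and α as follows. The result agrees with the precision expression listed among the AB parameters.

```latex
\begin{align*}
\Pr[\text{a given bit is still } 0 \text{ after } s \text{ insertions}]
  &= \left(1 - \frac{1}{n}\right)^{ks} \approx e^{-ks/n} = e^{-k/\alpha}, \\
\mathrm{FP}
  &= \left(1 - \left(1 - \frac{1}{n}\right)^{ks}\right)^{k}
   \approx \left(1 - e^{-k/\alpha}\right)^{k}, \\
P &= 1 - \mathrm{FP} \approx 1 - \left(1 - e^{-k/\alpha}\right)^{k}.
\end{align*}
```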
Approximate Bitmaps (AB) – Size • In terms of α: • n = αs • m = ceil(log2(αs)) • One AB per dataset: • s = |A|*N • One AB per attribute: • s = N • One AB per column: • s depends on the data distribution
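The sizing rule can be sketched as a small calculation. The helper name `ab_size_bits` and the dataset dimensions (3 attributes, 1000 rows, α = 4) are hypothetical values chosen for illustration; only the relations m = ceil(log2(αs)), s = |A|·N per dataset, and s = N per attribute come from the slide.

```python
import math


def ab_size_bits(alpha: float, set_bits: int):
    """Return (m, n) with m = ceil(log2(alpha * s)) and n = 2**m."""
    m = math.ceil(math.log2(alpha * set_bits))
    return m, 1 << m


num_attributes, num_rows, alpha = 3, 1000, 4    # hypothetical dataset

# One AB per dataset: every row sets one bit per attribute, s = |A| * N.
print(ab_size_bits(alpha, num_attributes * num_rows))  # (14, 16384)

# One AB per attribute: every row sets exactly one bit, s = N.
print(ab_size_bits(alpha, num_rows))                   # (12, 4096)
```

Because n is rounded up to a power of two, the effective ratio n/s can be somewhat larger than the requested α.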
Experimental Setup • Three datasets: • Query by sampling (randomly selecting the columns queried) • Varying the number of rows queried from 100 to 10K
Experimental Results - Size • Always use the largest α that produces an AB smaller than or comparable in size to the WAH-compressed bitmap
Experimental Results - Precision • As α increases, the precision increases steadily and is very close to 1 for larger α • Precision increases as k increases up to the optimum point • Beyond that point, a larger number of hash functions produces more collisions
Experimental Results – Exec Time • Execution time of the AB depends on the number of rows queried, not on the number of rows in the dataset • For queries over fewer than 10% to 15% of the rows, AB execution is up to 3 orders of magnitude faster than WAH
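The "optimum point" for k can be illustrated with the theoretical precision model from the deck. This sweep uses the formula only, not the experimental data, and α = 8 is a hypothetical value. For a standard Bloom filter the optimum is known to lie near k = α ln 2, which the sweep can be compared against.

```python
import math


def model_precision(alpha: float, k: int) -> float:
    """Theoretical precision P = 1 - (1 - e^(-k/alpha))^k."""
    return 1 - (1 - math.exp(-k / alpha)) ** k


alpha = 8  # hypothetical ratio of AB size to set bits
by_k = {k: model_precision(alpha, k) for k in range(1, 13)}
best_k = max(by_k, key=by_k.get)

for k, p in by_k.items():
    print(f"k={k:2d}  precision={p:.4f}")

# Compare the best integer k with the continuous optimum alpha * ln 2.
print(best_k, round(alpha * math.log(2), 2))
```

Under this model, precision rises with k at first because a false positive needs more simultaneous collisions, then falls as each insertion fills more of the array.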
Conclusion • AB encoding approximates the bitmaps using multiple hashing of the set bits • Allows efficient retrieval of any subset of rows and columns • Trade-off between bitmap size and precision • Three levels of encoding • Approximate query answers are given without database access
Questions and Comments • Thank you! Email: canahuat@cse.ohio-state.edu