Efficient Bitmap Indexing Techniques for Very Large Datasets
Kesheng John Wu, Ekow Otoo, Arie Shoshani
Problem Statement
• Main objective: map logical requests to qualified objects
• A logical request:
  • 20001015<=eventTime & 200<energy<300 …
• Objects:
  • Set of object ids;
  • Set of files containing the objects;
  • Offsets within the files, …
Application: STAR
• A portion of the STAR tag dataset: 3 events with 12 attributes, from millions of events with 502 attributes.
Application: Combustion
• Direct numerical simulation of the auto-ignition process (solution of complex partial differential equations)
• A dozen or more variables are computed at each time step and each grid point
• Number of grid points: 2D 600 X 600 >>> 3D 1000 X 1000 X 1000
• Time steps: 100 >>> 1000s
• Data size: 1 GB >>> 10 TB
• Task: identify features and track them across time steps
  • e.g. find the flame front across time: find “600<temp<700” for 1 billion points per time step, and discover the overlap between time steps
• Use compressed bitmaps to accelerate both feature extraction and feature tracking
Building a Bitmap Index
[Figure: columns of bit vectors for property 1, property 2, …, property n]
• Partition each property into bins (binning)
  • e.g. for 0<NLb<4000, 20 equal size bins: [0, 200) [200, 400) …
• Generate a bit vector for each bin (encoding)
  • Bit i of bit vector j is 1 iff NLb[i] is in bin j
• Compress each bit vector
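The binning and encoding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_bitmap_index`, the sample values, and the use of plain 0/1 lists (with no compression step) are assumptions made for the example. The bins are half-open, so the range is treated as [0, 4000).

```python
def build_bitmap_index(values, lo, hi, num_bins):
    """Return one bit vector (a list of 0/1) per equal-width bin over [lo, hi).

    Bit i of bit vector j is 1 iff values[i] falls in bin j.
    """
    width = (hi - lo) / num_bins
    bitmaps = [[0] * len(values) for _ in range(num_bins)]
    for i, v in enumerate(values):
        if not lo <= v < hi:
            raise ValueError(f"value {v} is outside [{lo}, {hi})")
        # Clamp to the last bin to guard against floating-point rounding.
        j = min(int((v - lo) / width), num_bins - 1)
        bitmaps[j][i] = 1
    return bitmaps


# Hypothetical NLb values for five objects; 20 bins of width 200.
nlb = [150, 200, 3999, 0, 399]
index = build_bitmap_index(nlb, 0, 4000, 20)
print(index[0])   # bit vector for bin [0, 200)
print(index[1])   # bit vector for bin [200, 400)
```

Each object sets exactly one bit across the 20 bit vectors, which is why the individual vectors are sparse and tend to compress well.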
Advantages of Bitmap Index
• Bitmap index: a specialized index that takes advantage of the following
• Read-mostly data: data produced from scientific experiments can be appended in large groups
• Fast operations
  • “Predicate queries” can be performed with bitwise logical operations
  • Predicate ops: =, <, >, <=, >=, range
  • Logical ops: AND, OR, XOR, NOT
  • They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small, and frequently used ones can be cached in memory
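To illustrate how a predicate query reduces to bitwise logical operations, here is a small sketch that uses Python integers as bit vectors (bit i stands for object i). The helper names `bin_bitmaps` and `ids`, the bin edges, and the sample data are all assumptions for the example. The bins are half-open, so the condition evaluated is 200<=energy<300 rather than the strict inequality on the problem-statement slide.

```python
def bin_bitmaps(values, edges):
    """Bitmap k (a Python int) has bit i set iff edges[k] <= values[i] < edges[k+1]."""
    maps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                maps[k] |= 1 << i
                break
    return maps


def ids(bitmap):
    """List the object ids whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical attribute columns for five objects.
energy = [150, 250, 320, 210, 290]
event_time = [20001014, 20001015, 20001016, 20001020, 20001001]

energy_maps = bin_bitmaps(energy, [100, 200, 300, 400])
time_maps = bin_bitmaps(event_time, [20001001, 20001015, 20001101])

# 200 <= energy < 300  AND  eventTime >= 20001015: a single bitwise AND.
hits = energy_maps[1] & time_maps[1]
print(ids(hits))

# 100 <= energy < 300: OR the bitmaps of the two bins that cover the range.
low_energy = energy_maps[0] | energy_maps[1]
print(ids(low_energy))
```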
Operation-efficient Compression Methods
• Best known: byte-aligned bitmap code (BBC)
  • Uses run-length encoding (next slide)
  • Byte alignment, optimized for space efficiency
  • Decoding at the bit level, not optimal for operations
  • Used in Oracle
• We developed a new word-aligned scheme: WAH
  • Uses run-length encoding
  • Word alignment
  • Designed for minimal decoding to gain speed
Operation-efficient Compression Methods
• Based on variations of run-length compression
  • Uncompressed: 0000000000001111000000000 ......0000001000000001111111100000000 .... 000000
  • Compressed: 12, 4, 1000, 1, 8, 1000
• Store very short sequences as-is
• Advantage: can perform AND, OR, COUNT operations on compressed data
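The idea of operating directly on compressed data can be sketched with plain run-length encoding. This is neither BBC nor WAH: it stores every run as a (bit, length) pair with no byte or word alignment and no literal storage of short sequences. The function names and sample bitmaps are assumptions for the example.

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 values as a list of (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]


def rle_and(a, b):
    """AND two run-length encoded bitmaps of equal total length, run by run."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)          # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:     # merge with the previous output run
            out[-1][1] += step
        else:
            out.append([bit, step])
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            if i < len(a):
                rem_a = a[i][1]
        if rem_b == 0:
            j += 1
            if j < len(b):
                rem_b = b[j][1]
    return [tuple(r) for r in out]


def rle_count(runs):
    """COUNT: number of 1 bits, computed from run lengths only."""
    return sum(n for bit, n in runs if bit == 1)


x = [0] * 12 + [1] * 4 + [0] * 1000 + [1] + [0] * 8 + [1] * 8 + [0] * 1000
y = [1] * 14 + [0] * (len(x) - 14)
cx, cy = rle_encode(x), rle_encode(y)
both = rle_and(cx, cy)
print(len(x), len(cx), both, rle_count(both))
```

The cost of `rle_and` grows with the number of runs rather than the number of bits, which is the property the slide highlights.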
Trade-off of Compression Schemes
[Figure: speed versus space for the compression schemes; labeled points: uncompressed, WAH, BBC, gzip, PacBits, ExpGol]
Information About the Test Machines
• Hardware and system
  • Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  • 4 GB RAM
  • VERITAS Volume Manager (striped disk)
• Real application data from STAR
  • More than 2 million objects, 12 attributes
• Synthetic data
  • 100 million objects, 10 attributes
• Terms
  • Compression ratio: ratio of compressed bitmap size to uncompressed bitmap size
  • Times reported are wall-clock times in seconds
Encoding Schemes – Main Idea
[Figure: equality encoding, range encoding, and interval encoding illustrated over 12 bins, numbered 1 to 12]
• Interval, range encoding: operate on 2 bins only!
Total Effect of Compression and Encoding Schemes
• Bottom line on queries
  • Compression scheme determines efficiency of logical operations
  • Encoding scheme determines number of operations
    • Range & interval: only one logical operation over 2 bitmaps
    • Equality: many operations, depending on number of bins
• But space may be a consideration
  • What is the trade-off?
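The difference in operation counts can be made concrete with a sketch that compares equality encoding with range encoding. Interval encoding is not shown. The definitions used here are the textbook ones and are assumptions about what the slides intend: equality bitmap E[j] marks objects in bin j, and range bitmap R[j] marks objects in any bin up to and including j. Bitmaps are Python integers, bins are numbered from 0, and the all-ones last range bitmap is kept for simplicity.

```python
def equality_encode(bin_ids, num_bins):
    """E[j] has bit i set iff object i falls in bin j."""
    maps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        maps[b] |= 1 << i
    return maps


def range_encode(bin_ids, num_bins):
    """R[j] has bit i set iff object i falls in some bin <= j."""
    maps = [0] * num_bins
    for i, b in enumerate(bin_ids):
        for j in range(b, num_bins):
            maps[j] |= 1 << i
    return maps


def query_equality(E, lo, hi):
    """Objects in bins lo..hi (inclusive): OR together hi - lo + 1 bitmaps."""
    result = 0
    for j in range(lo, hi + 1):
        result |= E[j]
    return result, hi - lo + 1            # (answer, bitmaps read)


def query_range(R, lo, hi):
    """Objects in bins lo..hi (inclusive): at most 2 bitmaps, one operation."""
    if lo == 0:
        return R[hi], 1
    # R[lo - 1] is a subset of R[hi], so XOR removes exactly those objects.
    return R[hi] ^ R[lo - 1], 2


# Hypothetical bin assignments for seven objects over 12 bins.
bin_ids = [0, 3, 11, 5, 3, 7, 4]
E = equality_encode(bin_ids, 12)
R = range_encode(bin_ids, 12)

print(query_equality(E, 3, 5))
print(query_range(R, 3, 5))
```

Both queries return the same answer bitmap, but the equality-encoded query touches one bitmap per bin in the range, while the range-encoded query never touches more than two. The cost is space: each object sets a bit in many range bitmaps instead of one.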
Interval Encoding Is Better Overall (WAH Compression)
[Figure: average time for random range queries; points on the graphs represent 10, 20, 30, 50, 100 bins]
Summary
• Compressed bitmap indices are effective for range queries
• Better compression scheme
  • 50% more space, but 12 times faster!
• Among the different encoding schemes
  • Interval encoding is the overall winner
Future Work
• Support NULL values and categorical values
• On-line update: add new data and update the index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology (microarrays)
• Study non-uniform binning strategies
• Study more encoding schemes
• Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end
How Many Bins for Continuous Domains?
[Figure: a two-dimensional query region over Range(x) and Range(y), with the edge bins marked]
• More bins: fewer objects in edge bins
• Searching edge bins: skip-scan over “attribute vertical partition”
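The role of edge bins can be sketched as follows. Bins that lie entirely inside the query range are answered from the bitmaps alone, while objects in bins that only partly overlap the range (the edge bins) must be checked against the raw attribute values. In this sketch a plain Python list stands in for the “attribute vertical partition”, and the function names, bin edges, and data are assumptions for the example.

```python
def bin_bitmaps(values, edges):
    """Bitmap k (a Python int) has bit i set iff edges[k] <= values[i] < edges[k+1]."""
    maps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                maps[k] |= 1 << i
                break
    return maps


def set_bits(bitmap):
    """List the positions of the 1 bits."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def range_query(values, edges, lo, hi):
    """Return (ids with lo < values[i] < hi, number of raw values checked)."""
    bitmaps = bin_bitmaps(values, edges)
    sure = 0          # objects in bins fully inside (lo, hi)
    candidates = 0    # objects in edge bins
    for k, bm in enumerate(bitmaps):
        b_lo, b_hi = edges[k], edges[k + 1]
        if b_hi <= lo or b_lo >= hi:
            continue                      # bin lies entirely outside the range
        if lo < b_lo and b_hi <= hi:
            sure |= bm                    # every value in [b_lo, b_hi) qualifies
        else:
            candidates |= bm              # edge bin: must look at raw values
    checked = set_bits(candidates)
    for i in checked:                     # visit only the candidate rows
        if lo < values[i] < hi:
            sure |= 1 << i
    return set_bits(sure), len(checked)


# Hypothetical temperature column for eight grid points.
temp = [610, 625, 655, 690, 705, 725, 740, 790]

coarse = range_query(temp, [600, 700, 800], 620, 730)
fine = range_query(temp, [600, 640, 680, 720, 760, 800], 620, 730)
print(coarse)
print(fine)
```

With two bins every object falls in an edge bin and all eight raw values are checked; with five bins only four are. The answer is the same in both cases.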