Indexing Scientific Data With FastBit
• Motivating Examples
  • Find the collision events with the most distinct signature of Quark Gluon Plasma
  • Find the ignition kernels in a combustion simulation
  • Track a layer of an exploding supernova
• These are not typical database searches:
  • Large high-dimensional data sets (1000 time steps × 1000 × 1000 × 1000 cells × 100 variables)
  • Most data records are never modified, i.e., append-only data
  • Multi-dimensional queries: 500 < Temp < 1000 && CH3 > 10^-4 && …
  • Large answers (hits on thousands or millions of records)
  • Seek collective features, e.g., regions of interest, not average and sum operations
• New searching technology needed
A Good Candidate: Bitmap Index
• First commercial version
  • Model 204, P. O’Neil, 1987
• Takes less time to build than B-trees
• Efficient for querying: only bitwise logical operations
  • A < 2 → b0 OR b1
  • A > 2 → b3 OR b4 OR b5
• Efficient for multi-dimensional queries
  • Use bitwise operations to combine the partial results
• Size may be large: one bit per distinct value per row
  • Definition: Cardinality == number of distinct values
  • Compact for low cardinality attributes, say, cardinality < 100
  • Worst case: cardinality = N, number of rows; index size: N*N bits

Example: one bitmap per distinct value (bit = 1 where the data value matches)

  Data     b0  b1  b2  b3  b4  b5
  values   =0  =1  =2  =3  =4  =5
    0       1   0   0   0   0   0
    1       0   1   0   0   0   0
    5       0   0   0   0   0   1
    3       0   0   0   1   0   0
    1       0   1   0   0   0   0
    2       0   0   1   0   0   0
    0       1   0   0   0   0   0
    4       0   0   0   0   1   0
    1       0   1   0   0   0   0
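The slide's example can be reproduced in a few lines. The following is a minimal sketch of an uncompressed bitmap index, not FastBit's implementation: it stores each bitmap as a Python integer, and the function names (`build_bitmap_index`, `select`, `rows_of`) and the second attribute `B` are assumptions introduced only for illustration.

```python
def build_bitmap_index(values):
    """Map each distinct value to a bitmap (a Python int); bit i is set iff values[i] == value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def select(index, predicate):
    """OR together the bitmaps of every distinct value that satisfies the predicate."""
    result = 0
    for value, bitmap in index.items():
        if predicate(value):
            result |= bitmap
    return result


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


A = [0, 1, 5, 3, 1, 2, 0, 4, 1]   # data values from the slide
B = [9, 2, 7, 1, 8, 3, 4, 6, 5]   # hypothetical second attribute, for illustration only

index_a = build_bitmap_index(A)
index_b = build_bitmap_index(B)

a_lt_2 = index_a[0] | index_a[1]                 # A < 2  ->  b0 OR b1
a_gt_2 = index_a[3] | index_a[4] | index_a[5]    # A > 2  ->  b3 OR b4 OR b5

# Multi-dimensional query: AND the partial results of the two attributes.
both = select(index_a, lambda v: v < 2) & select(index_b, lambda v: v > 4)

print(rows_of(a_lt_2))   # [0, 1, 4, 6, 8]
print(rows_of(a_gt_2))   # [2, 3, 7]
print(rows_of(both))     # [0, 4, 8]
```

Each one-dimensional condition touches only the bitmaps of the matching values, and combining conditions costs one bitwise AND per extra attribute, which is why the approach suits multi-dimensional queries.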
Compression Makes It Better
Example: 2015 bits
10000000000000000000011100000000000000000000000000000……………….00000000000000000000000000000001111111111111111111111111
Main Idea: Use run-length encoding, but...
• Partition bits into 31-bit groups (not 32-bit) on 32-bit machines
• Merge neighboring groups with identical bits
• Encode each group using one 32-bit word
[Figure: the 2015 bits are split into 31-bit groups (31 bits, 31 bits, … 62 groups skipped …, 31 bits); after merging, the result is stored in 32-bit words: two literal words (leading bit 0, then 31 literal bits) and one fill word (leading bits 1 0, then 31-bit count=63)]
• Name: Word-Aligned Hybrid (WAH) code
• Key features:
  • Compressed indices typically 30% of raw data
  • 10X faster in answering queries than the most competitive bitmap index
  • Worst case index size 4N words, not N*N
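The three steps above (partition, merge, encode) can be sketched as follows. This is a simplified illustration under stated assumptions, not FastBit's code: the word layout follows the slide's figure (literal words start with bit 0; fill words start with bit 1, followed by the fill bit and a group count, assumed here to occupy the low 30 bits), the input length is assumed to be a multiple of 31, and the example bit string only mimics the shape of the slide's 2015-bit example, since the slide elides the exact bit positions.

```python
GROUP = 31                      # bits per group on a 32-bit machine
FILL_FLAG = 1 << 31             # top bit set marks a fill word
COUNT_MASK = (1 << 30) - 1      # assumed: low 30 bits hold the group count
ALL_ONES = (1 << GROUP) - 1


def wah_encode(bits):
    """Encode a list of 0/1 values into 32-bit words (sketch; length must be a multiple of 31)."""
    if len(bits) % GROUP != 0:
        raise ValueError("this sketch assumes the bit count is a multiple of 31")
    words = []
    for start in range(0, len(bits), GROUP):
        literal = 0
        for b in bits[start:start + GROUP]:
            literal = (literal << 1) | b
        if literal in (0, ALL_ONES):
            fill_bit = 1 if literal else 0
            header = FILL_FLAG | (fill_bit << 30)
            # Extend the previous word if it is a fill of the same bit value.
            same_fill = bool(words) and (words[-1] & ~COUNT_MASK) == header
            if same_fill and (words[-1] & COUNT_MASK) < COUNT_MASK:
                words[-1] += 1
            else:
                words.append(header | 1)
        else:
            words.append(literal)   # top bit stays 0: a literal word
    return words


def wah_decode(words):
    """Expand the words back into the original list of bits."""
    bits = []
    for w in words:
        if w & FILL_FLAG:
            fill_bit = (w >> 30) & 1
            bits.extend([fill_bit] * (GROUP * (w & COUNT_MASK)))
        else:
            bits.extend((w >> shift) & 1 for shift in range(GROUP - 1, -1, -1))
    return bits


# Illustrative 2015-bit input: a few set bits, a long run of zeros, a run of ones.
bits = [1] + [0] * 20 + [1] * 3 + [0] * (2015 - 24 - 25) + [1] * 25
words = wah_encode(bits)
print(len(bits) // GROUP, "groups ->", len(words), "words")
print([hex(w) for w in words])
```

With this input the 65 groups collapse into three words: a literal word, one fill word covering the 63 all-zero groups, and a final literal word. Because every word is aligned to the machine word size, logical operations can work on compressed bitmaps one word at a time.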
Handling Collective Features: Regions of Interest
• FastBit has been used in
  • GridCollector for High-Energy Physics Experiment STAR
  • Dexterous Data Explorer (DEX) for query-driven visualization
  • Dynamic histogramming for network traffic analysis
• On the right is an illustration of our region-growing approach
[Figure: diagram with the labels FastBit, Data, Query, Index, Region Growing, Region Tracking]
[Figure: 2-D connected regions identified with line segments (in green). Line segments come out of FastBit compressed bitmaps.]
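One way to picture region growing from line segments is sketched below. The slide does not spell out the algorithm, so everything here is an assumption made for illustration: a query result is taken to be a 2-D 0/1 mask on a regular mesh, each run of consecutive 1s in a row is treated as a line segment, and segments in adjacent rows are joined with a union-find structure when their column ranges overlap (4-connectivity). The names `row_segments` and `label_regions` are hypothetical.

```python
def row_segments(mask):
    """Return (row, col_start, col_end_exclusive) for every run of 1s in each row."""
    segments = []
    for r, row in enumerate(mask):
        c, n = 0, len(row)
        while c < n:
            if row[c]:
                start = c
                while c < n and row[c]:
                    c += 1
                segments.append((r, start, c))
            else:
                c += 1
    return segments


def label_regions(segments):
    """Group segments into connected regions by merging overlapping segments in adjacent rows."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    by_row = {}
    for i, (r, _, _) in enumerate(segments):
        by_row.setdefault(r, []).append(i)

    for i, (r, start, end) in enumerate(segments):
        for j in by_row.get(r - 1, []):
            _, start2, end2 = segments[j]
            if start < end2 and start2 < end:   # column ranges overlap
                parent[find(i)] = find(j)

    regions = {}
    for i, seg in enumerate(segments):
        regions.setdefault(find(i), []).append(seg)
    return list(regions.values())


# Hypothetical query result on a 4 x 5 mesh.
mask = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
segments = row_segments(mask)
regions = label_regions(segments)
sizes = sorted(sum(end - start for _, start, end in region) for region in regions)
print(len(regions), "regions with cell counts", sizes)
```

Working on segments instead of individual cells keeps the work proportional to the number of runs, which matches the run-oriented form of a compressed bitmap.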
Future Plans
• Software development
  • Release FastBit under LGPL (John, March ’07)
  • FastBit integration with ROOT (John, Sept ’07)
  • FastBit integration with HDF5 for Particle Physics (Kurt)
• Finding Regions of Interest
  • Existing work only dealt with data on regular meshes
  • Working on extensions to AMR mesh (Kurt), GTC mesh (John), and tetrahedral mesh (Rishi)
• New features (research)
  • Parallel version
  • Table groups / partitions
  • Range join