Encoded Bitmap Indexing and Compressed Bitmaps
Yashvardhan Sharma, Faculty, CS&IS, BITS-Pilani
Data Warehousing
Outline
• Problems with Simple Bitmap Indexes
• Encoded Bitmap Indexes
• Compression of Bitmaps
• Byte Aligned Bitmap Code (BBC)
• Word Aligned Hybrid Code (WAH)
Pure Bitmap Index
• Consists of a collection of bitmap vectors, each created to represent one distinct value.
• A query with more than one condition can be answered by Boolean operations on the respective bitmaps.
Properties:
• Suited for low-cardinality columns.
• Uses bitwise operations.
• Easy to build and to extend with a new indexed value.
• The whole bitmap segment is locked during an index update.
• Needs less space for storing indexes, so more indexes can be cached in memory.
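The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the helper names (`build_bitmap_index`, `rows_of`) and the sample columns are hypothetical, and each bitmap is held in a plain Python integer whose bit i stands for row i.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Return {value: bitmap}; bit i of a bitmap is set if row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical sample data: two low-cardinality columns of the same table.
color = ["red", "blue", "red", "green", "blue", "red"]
size = ["S", "M", "M", "S", "M", "S"]

color_idx = build_bitmap_index(color)
size_idx = build_bitmap_index(size)

# color = 'red' AND size = 'S' becomes a bitwise AND of two bitmaps.
red_and_small = color_idx["red"] & size_idx["S"]
# color = 'red' OR color = 'green' becomes a bitwise OR.
red_or_green = color_idx["red"] | color_idx["green"]

print(rows_of(red_and_small))  # [0, 5]
print(rows_of(red_or_green))   # [0, 2, 3, 5]
```

One bitmap per distinct value is what makes the index cheap for low-cardinality columns and expensive for high-cardinality ones.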
Problems with Bitmap Indexes
• Space-inefficient for attributes with high cardinality (sparsity of bitmap vectors)
• Increases the complexity of the software
Solution:
• Bitmap Encoding
• Bitmap Compression
Encoded Bitmap Index
• Consists of a set of bitmap vectors, a lookup table, and a set of retrieval Boolean functions.
• Each distinct value of a column is encoded using a number of bits, each of which is stored in a bitmap vector.
• The lookup table stores the mapping between column values and their encoded representation.
Properties:
• Uses space efficiently.
• Efficient with wide range queries.
• Difficult to find a good encoding scheme.
• Inefficient with equality queries.
[Figure: Simple Bitmap Indexing]
Simple Bitmap Indexing: Advantages and Disadvantages
• Dynamic
• Performance and stability under update
• Costs less time and space than B-trees
• Can work efficiently together
• Required bytes
• As cardinality increases, space and time complexity increase rapidly
Encoded Bitmap Indexes
• Why shift from SBI to EBI?
• The restriction of SBIs is that they are not suitable for high-cardinality attributes.
• EBI offers the advantage of a drastic reduction in space requirements.
• The main idea of EBI is to encode the attribute domain.
• Let us see this through an example.
Encoded Bitmap Indexes
• We assume that the attribute domain of table T is {a, b, c}.
• The encoding schema of EBI is stored in a separate table, called the mapping table, which assigns each value of the SBI domain a fixed-length binary code.
• This reduces the number of bitmap vectors. In particular, we use only ⌈log₂ 3⌉ = 2 encoded bitmap vectors instead of 3 simple bitmap vectors.
• This means that 2 bits are used to encode the domain {a, b, c}.
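A minimal sketch of this encoding for the domain {a, b, c} follows. The function name `build_encoded_index`, the sample column, and the choice of codes (values numbered in sorted order) are assumptions made for illustration; the slides do not prescribe a particular code assignment.

```python
def build_encoded_index(column):
    """Return (mapping, vectors); vectors[j] holds bit j of every row's code."""
    domain = sorted(set(column))
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2.
    k = max(1, (len(domain) - 1).bit_length())
    mapping = {value: code for code, value in enumerate(domain)}
    vectors = [0] * k
    for row, value in enumerate(column):
        code = mapping[value]
        for j in range(k):
            if (code >> j) & 1:
                vectors[j] |= 1 << row
    return mapping, vectors

# Hypothetical column over the domain {a, b, c}.
column = ["a", "b", "c", "b", "a", "c"]
mapping, vectors = build_encoded_index(column)

print(mapping)       # {'a': 0, 'b': 1, 'c': 2}
print(len(vectors))  # 2
# Each string shows row 5 on the left and row 0 on the right.
print([format(v, "06b") for v in vectors])  # ['001010', '100100']
```

Two vectors now describe three values, at the cost of needing the mapping table to interpret them.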
[Figure: Encoded Bitmap Indexing]
Encoded Bitmap Indexes
• We assume that we have a fact table SALES with N tuples and a dimension table PRODUCT with 12,000 different products.
• If we build a simple bitmap index (SBI) on PRODUCT, it will require 12,000 bitmap vectors of N bits each.
• However, if we use encoded bitmap indexing (EBI), we need only ⌈log₂ 12,000⌉ = 14 bitmap vectors plus a mapping table, which is a very significant reduction in space complexity.
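The arithmetic can be checked with a short calculation. The value of N below is a hypothetical placeholder, and the size of the mapping table is ignored, so the ratio is only a rough indication.

```python
def sbi_vector_count(cardinality):
    """A simple bitmap index needs one vector per distinct value."""
    return cardinality

def ebi_vector_count(cardinality):
    """An encoded bitmap index needs ceil(log2(cardinality)) vectors."""
    return max(1, (cardinality - 1).bit_length())

products = 12_000
n_rows = 1_000_000  # hypothetical N, chosen only for illustration

sbi_bits = sbi_vector_count(products) * n_rows
ebi_bits = ebi_vector_count(products) * n_rows

print(ebi_vector_count(products))  # 14
print(sbi_bits // ebi_bits)        # 857
```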
Encoded Bitmap Indexing
• Retrieval function: built from k-variable minterms over the k bitmap vectors
• Notation: XY = X AND Y; X + Y = X OR Y; B' = NOT B
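As a sketch of how a retrieval function could be evaluated, the code below builds one minterm per selected value and ORs the minterms together. The mapping and vectors are hard-coded hypothetical values for the column a, b, c, b, a, c with codes a = 00, b = 01, c = 10; the helper names are not from the slides.

```python
def minterm(code, vectors, n_rows):
    """AND together B_j (if bit j of code is 1) or B_j' (if it is 0)."""
    mask = (1 << n_rows) - 1
    result = mask
    for j, b in enumerate(vectors):
        result &= b if (code >> j) & 1 else (~b & mask)
    return result

def retrieve(values, mapping, vectors, n_rows):
    """OR the minterms of all selected values."""
    result = 0
    for value in values:
        result |= minterm(mapping[value], vectors, n_rows)
    return result

# Hypothetical encoded index: vectors[0] is B0 (low bit), vectors[1] is B1.
mapping = {"a": 0, "b": 1, "c": 2}
vectors = [0b001010, 0b100100]
n_rows = 6
mask = (1 << n_rows) - 1

a_rows = retrieve(["a"], mapping, vectors, n_rows)        # B1'B0'
a_or_b = retrieve(["a", "b"], mapping, vectors, n_rows)   # B1'B0' + B1'B0

print(format(a_rows, "06b"))  # 010001
print(format(a_or_b, "06b"))  # 011011
# The sum of the two minterms reduces to B1', so one vector suffices here.
print(a_or_b == (~vectors[1] & mask))  # True
```

The last line hints at why the choice of encoding matters: a good code lets frequent selections reduce to short Boolean expressions.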
Maintenance of Encoded Bitmap Indexes
• Updates without domain expansion
• Updates with domain expansion
Encoded Bitmap Indexing
• NULL values and non-existing tuples: reserve the code zero for non-existing tuples
Applications and Variations of Encoded Indexing
• Hierarchy encoding
• Total order preserving encoding
• Using encoded indexes for range encoding
Compression of Bitmap Indexes
• The cardinality of many queried attributes is very high.
• A basic bitmap index then generates too many bitmaps, and operations take too long.
• The data structures used to represent bitmaps should be designed to provide efficient search operations.
• Compression is the best technique to improve the effectiveness of the basic bitmap index.
• Possible methods include LZ77 (gzip), BBC, WAH, and WBC.
WAH (Word Aligned Hybrid Code)
• Hybrid between run-length encoding (RLE) and the literal scheme.
• Stores compressed data in words.
• The MSB of a word distinguishes a literal word (0) from a fill word (1).
• The lower bits of a literal word contain bit values from the bitmap.
• The second MSB of a fill word is the fill bit, and the lower bits store the fill length.
• Word alignment requires every fill length to be an integer multiple of the number of bits stored in a literal word (31 for 32-bit words).
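The word layout above can be sketched as a small encoder. This is a simplified illustration assuming 32-bit words and 31-bit groups; it pads the bitmap with zeros to a whole number of groups, which is an assumption of this sketch and not necessarily how a real WAH implementation treats the trailing bits. The name `wah_encode` is hypothetical.

```python
WORD_BITS = 32
GROUP = WORD_BITS - 1          # bits carried by one literal word
FILL_FLAG = 1 << 31            # MSB = 1 marks a fill word
FILL_BIT = 1 << 30             # second MSB holds the fill bit
MAX_FILL = (1 << 30) - 1       # largest fill length that fits in the low bits
ALL_ONES = (1 << GROUP) - 1

def wah_encode(bits):
    """Encode a sequence of 0/1 values into a list of 32-bit WAH words."""
    bits = list(bits)
    bits += [0] * (-len(bits) % GROUP)   # assumption: pad with zeros
    words = []
    for start in range(0, len(bits), GROUP):
        value = int("".join(map(str, bits[start:start + GROUP])), 2)
        if value in (0, ALL_ONES):
            fill_bit = FILL_BIT if value else 0
            last = words[-1] if words else 0
            same_fill = (last & FILL_FLAG) and (last & FILL_BIT) == fill_bit
            if same_fill and (last & MAX_FILL) < MAX_FILL:
                words[-1] += 1           # extend the previous fill by one group
            else:
                words.append(FILL_FLAG | fill_bit | 1)
        else:
            words.append(value)          # literal word, MSB stays 0
    return words

# One mixed group, two all-zero groups, one all-one group.
bitmap = [1] + [0] * 30 + [0] * 62 + [1] * 31
words = wah_encode(bitmap)
print([hex(w) for w in words])  # ['0x40000000', '0x80000002', '0xc0000001']
```

The fill lengths count 31-bit groups, not individual bits, which is the word-alignment constraint described above.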
[Figure: WAH Encoding Example]
BBC (Byte Aligned Bitmap Code)
• Based on the idea of run-length encoding, which represents consecutive identical bits (a fill or gap) by their bit value and their length.
• First the bit sequence is divided into bytes, and then the bytes are grouped into runs.
• A run consists of a fill followed by a tail of literal bytes; the fill length is expressed as a number of bytes.
• Byte alignment limits a fill length to an integer multiple of bytes.
BBC
• Based on the basic idea of run-length encoding
• Organizes bits into bytes
• Runs in BBC have the form [fill] [tail]
• A fill is a zero-fill or a one-fill, depending on whether all its bits are zero or one.
• Two types: one-sided or two-sided
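The [fill] [tail] grouping can be sketched as follows. This is a simplified illustration that treats every 00 or FF byte as fill and everything else as tail; it ignores the tail-length limits and the special single-byte tails of the individual run types. The name `split_runs` is hypothetical.

```python
FILL_BYTES = (0x00, 0xFF)

def split_runs(data):
    """Group bytes into (fill_byte, fill_length, tail_bytes) runs."""
    runs = []
    i = 0
    while i < len(data):
        # Collect the fill: a stretch of identical 00 or FF bytes (may be empty).
        fill_byte = data[i] if data[i] in FILL_BYTES else None
        fill_len = 0
        if fill_byte is not None:
            while i < len(data) and data[i] == fill_byte:
                fill_len += 1
                i += 1
        # Collect the tail: literal bytes up to the next fill byte.
        start = i
        while i < len(data) and data[i] not in FILL_BYTES:
            i += 1
        runs.append((fill_byte, fill_len, bytes(data[start:i])))
    return runs

data = bytes.fromhex("00008A37FFFFFF01")
for run in split_runs(data):
    print(run)
# (0, 2, b'\x8a7')
# (255, 3, b'\x01')
```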
BBC Variants
• Type-1 Run
• Type-2 Run
• Type-3 Run
• Type-4 Run
Type-1 Run
• 0-3 bytes in the fill and 0-15 literal bytes in the tail
• The header is a single byte
• The eight header bits are 1 [fill bit] [fill length (2 bits)] [tail length (4 bits)].
• The literal tail follows the header byte
Example of Type-1 Run
Hexadecimal representation of bitmap: 00 00 8A 37
BBC Type-1 run representation: header 1 [0] [10] [0010], tail 8A 37
Hex: A2 8A 37
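The header layout can be checked against this example with a short sketch. The function name `type1_run` is hypothetical; the bit positions follow the layout 1 [fill bit] [fill length (2 bits)] [tail length (4 bits)] given for Type-1 runs.

```python
def type1_run(fill_bit, fill_len, tail):
    """Pack a Type-1 run: one header byte followed by the literal tail."""
    if not 0 <= fill_len <= 3:
        raise ValueError("Type-1 fill length must be 0-3 bytes")
    if len(tail) > 15:
        raise ValueError("Type-1 tail must be 0-15 bytes")
    header = 0x80 | (fill_bit << 6) | (fill_len << 4) | len(tail)
    return bytes([header]) + bytes(tail)

# Bitmap 00 00 8A 37: a zero-fill of 2 bytes, then the literal tail 8A 37.
encoded = type1_run(fill_bit=0, fill_len=2, tail=bytes([0x8A, 0x37]))
print(encoded.hex(" ").upper())  # A2 8A 37
```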
Type-2 Run
• 0-3 bytes in the fill and a single-byte tail with one bit different from the fill bit
• The header is a single byte
• The eight header bits are 0 1 [fill bit] [fill length (2 bits)] [odd bit position (3 bits)].
• The single-byte tail is not stored
Example of Type-2 Run
Hexadecimal representation of bitmap: 00 00 00 02
BBC Type-2 run representation: header 0 1 [0] [11] [001]
Hex: 59
Type-3 Run
• More than 3 bytes in the fill and 0-15 literal bytes in the tail
• A multibyte counter is used to represent the fill length
• The header is a single byte
• The eight header bits are 0 0 1 [fill bit] [tail length (4 bits)].
• The bytes of the multibyte counter follow the header byte
• The literal tail follows the multibyte counter
• Each multibyte counter byte has the form 1 [7 bits of significant information], except the last byte
• The actual number of bytes in the fill is the counter value plus 4.
Example of Type-3 Run
Hexadecimal representation of bitmap: 00 00 00 00 00 00 00 00 00 F3
BBC Type-3 run representation: header 0 0 1 [0] [0001], counter 05, tail F3
Hex: 21 05 F3
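A sketch of the Type-3 packing, checked against this example, is shown below. The function names are hypothetical. The slides state only that every counter byte except the last has its top bit set; emitting the most significant 7-bit group first is an assumption of this sketch.

```python
def multibyte_counter(value):
    """Split value into 7-bit groups; all bytes but the last get the top bit set."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # assumption: most significant group first
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def type3_run(fill_bit, fill_len, tail):
    """Pack a Type-3 run: header, multibyte counter (fill_len - 4), literal tail."""
    if fill_len <= 3:
        raise ValueError("Type-3 fill must be longer than 3 bytes")
    if len(tail) > 15:
        raise ValueError("Type-3 tail must be 0-15 bytes")
    header = 0x20 | (fill_bit << 4) | len(tail)
    return bytes([header]) + multibyte_counter(fill_len - 4) + bytes(tail)

# Nine zero bytes followed by the literal byte F3.
encoded = type3_run(fill_bit=0, fill_len=9, tail=bytes([0xF3]))
print(encoded.hex(" ").upper())  # 21 05 F3
```

Storing the fill length minus 4 avoids wasting counter values on lengths that the shorter run types already cover.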
Type-4 Run
• More than 3 bytes in the fill and a single-byte tail with one bit different from the fill bit
• The single-byte tail is not stored
• The eight header bits are 0 0 0 1 [fill bit] [odd bit position (3 bits)].
• The bytes of the multibyte counter follow the header byte
• Each multibyte counter byte has the same form as in a Type-3 run
Example of Type-4 Run
Hexadecimal representation of bitmap: 00 00 00 00 00 00 00 01
BBC Type-4 run representation: header 0 0 0 1 [0] [001], counter 03
Hex: 11 03
Improvements Achieved and Comparisons
BBC:
• Can perform bitwise logical operations efficiently compared to other compression schemes.
• Compresses almost as well as gzip.
• Most suited for range queries.
• Suitable for OLAP applications.
WAH:
• Performs logical operations about 12 times faster than BBC and uses only 60% more space.
• Compared to the uncompressed scheme, WAH is faster while still using less space.
• All the features of BBC are available in this scheme.
Factors for the Performance Difference between BBC and WAH
• In WAH, one test is sufficient to determine the type of a word, while in BBC more than three tests can be required to decide the run type.
• WAH accesses whole words, while BBC accesses individual bytes and therefore needs more time to load data.
• BBC can encode shorter fills more compactly than WAH, but BBC starts a new run for a short fill.
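The first factor can be illustrated with a sketch of the two classification steps. The function names are hypothetical; the bit tests follow the header layouts of the four BBC run types and the WAH word layout described earlier.

```python
def bbc_run_type(header):
    """Classify a BBC header byte by its leading bits (up to four tests)."""
    if header & 0x80:
        return 1
    if header & 0x40:
        return 2
    if header & 0x20:
        return 3
    if header & 0x10:
        return 4
    raise ValueError("not a valid run header in this sketch")

def wah_is_fill(word):
    """Classify a 32-bit WAH word with a single test on its MSB."""
    return bool(word & 0x80000000)

# Header bytes taken from the four run examples.
print([bbc_run_type(h) for h in (0xA2, 0x59, 0x21, 0x11)])  # [1, 2, 3, 4]
print(wah_is_fill(0x80000002), wah_is_fill(0x40000000))     # True False
```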