350 likes | 530 Views
Variant Indexes. Specialized Indexes?. Data warehouses are large databases with data integrated from many independent sources. Queries are often complex. Updated in a batch fashion. Goal : develop specialized indexes to speed up query evaluation. Bitmap Indexes.
E N D
Specialized Indexes? • Data warehouses are large databases with data integrated from many independent sources. • Queries are often complex. • Updated in a batch fashion. • Goal: develop specialized indexes to speed up query evaluation.
Bitmap Indexes • Value-list index – same row may appear under many key values in index • RID-list – list a key value once at head, which is followed by a long list of RIDs for rows with this key value. • Bitmap – alternate form for representing RID-lists in a Value-list index • Bitmaps are more space efficient than RID-lists when the number of key values is low (#1)
Creating Bitmap Index • For n rows of T = {r1,…,rn}, do 1-1 mapping from T to Z[M}, the 1st M positive integers. • For a given row r with row # j, the table page number accessed to retrieve row r is j/p and the page slot is j%p.
A Bitmap B is a sequence of M bits. Each row r with row # j that has property P, we set bit j in B to 1, all other bits are zero. • A Bitmap index for a column C with values v1,…,vk is a B+ tree with entries having these key values and associated data portions that contain Bitmaps for the properties C=v1,…,C=vk.
Dense Bitmaps – if the proportion of 1-bits in the Bitmap is large. • A Bitmap index for a column with 32 values will have average density 1/32. Space needed for Bitmap is comparable to disk space needed for RID-list index, which requires 32 bits for each RID present. • For a column index with a very small # of values, the Bitmaps will have high densities and disk savings is enormous.
Boolean operations are extremely fast. • Can use a bunch of for loops to execute. For long Bitmaps, take fragments out from disk. (#2) • Loops for Bitmap AND, OR, NOT, or COUNT are extremely fast compared to loop operations on RID lists. (#3)
Segmentation • To optimize Bitmap index access, can be broken into fragments of equal sizes • RIDs can be small # bytes in length, they specify position in segment. At beginning, high order bits can be specified. (#4) • If one of predicate indexes has no pointer to a Bitmap Fragment for a segment, then the segment fragments for the other indexes can be ignored as well, can save a lot of I/O.
Bitmap representations and RID-list representations are interchangeable, when Bitmap representations are dense, Bitmaps are much more efficient than RID-lists, for storage use and Boolean operations. When too sparse, Bitmap index can contain RID-list.
Projection Indexes • Say C is a column of a table T; then the Projection index on C consists of a stored sequence of column values from C, in order by row number in T from which the values are extracted.
Projection index very efficient when column values must be retrieved from all rows of a found set. • For example, say column values are 4 bytes in length, 1000 values will fit on a 4KB page. Say each row is 200 bytes, then using index requires 1/20 I/0s of getting whole rows.
Like vertical partitioning, good when small # of columns are retrieved by most SELECT statements, bad when most queries retrieve most of the columns. (#5)
Bit-Sliced Indexes • A Bit-Sliced index stores a set of “Bitmap slices” which are orthogonal to the data held in a Projection index. Good for calculating aggregates in a found set. • Each individual Bitmap Bi is called a Bit-slice of the column. There is one Bitmap for Bnn for “not null” and one Bitmap for Bn for “null” (#6)
Comparing Indexes for Aggregate Evaluation • SELECT SUM (dollar_sales) FROM SALES WHERE condition
Say found set Bf (2 million rows) is already determined. • SALES table has 100 million rows, which are each 200 bytes in length, stored 20 rows on a 4KB disk page.
Query Plan 1 • Direct access to rows to calculate SUM • Each disk page contains 20 rows, have 5,000,000 disk pages occupied by SALES table (100 millon rows / 20 rows). Since Bf represents only 1/50 rows of SALES table, # disk pages occupies is: 5,000,000(1 – e^-2,000,000/5,000,000) = 1,648,4000.
This means that, if one disk arm retrieves 100 disk pages per second, you will need 4 hours of disk arm use! (1,648,400/100 = 16,484 seconds = 4 hours)
Query Plan 2 • Get SUM with Projection Index • The dollar_sales Projection index has 1000 values per 4KB disk page. Requires 100,000 disk pages. This equals 1000 seconds (17 minutes)
Query Plan 3 • Get SUM with Value-list index on dollar_sales. • Can calculate SUM(dollar_sales) for our Foundset by ranging through all possible values in the index and determining the rows with each value, the determining how many rows with each value are in the Foundset, and finally multiplying that count by the value and adding to the SUM.
IF (COUNT (Bf AND Bnm) == 0) Return null SUM = 0.00 For each non-null value v in index for C { Designate set of rows with value v as Bv SUM += v * COUNT(Bf AND Bv); } return SUM;
Good if entire Bitmaps are in memory. • Value-list index evaluation of SUM ( C ) requires 10,000 Bitmap ANDs and 10,000 COUNTS. • Requires 103,125 I/Os and a loop of length 100,000,000. • Requires 1,340,640 disk pages
Query Plan 4 • Get SUM with Bit-Sliced index • IF (COUNT (Bf AND Bnn) == 0) Return null SUM = 0.00 For I = 0 to N SUM += 2 (i power) * COUNT (Bi AND Bf); return SUM
Perform 21 ANDs and 21 COUNTs of 100,000,000 bit Bitmaps. Each Bitmap is 12.5 Mbytes in length, requiring 3125 I/Os, but we assume that Bf can remain in memory. Thus need to read 22 Bitmaps from disk – 22*3125 = 68,750 I/Os. For CPU, need to AND 21 pairs of Bitmaps, total # of loop passes: 21*(100,000,000/32) = 65,625, 000. Need to perform 21 COUNTs, with 131,250,000 passes.
Comparing Algorithm Performance METHOD I/0 CPU contributions Add from rows 1,342 K I/O + 2M*(25 ins) Projection index 100 K I/O + 2M*(10 ins) Value-List index 103 K I/O + 100M*(10 ins) Bit-Sliced index 69 K I/O + 197M*(1 ins)
Aggregate Functions & Index Types for Evaluation Aggregate Value-List Projection Bit-Sliced COUNT Not needed Not needed Not needed SUM Not Bad Good Best AVG (SUM/COUNT) Not Bad Good Best MAX & MIN Best Slow Slow MEDIAN, N-TILE Best No Sometimes Column-Product Slow Best Slow
Range Predicates • Consider: SELECT target-list FROM T WHERE C-range AND <condition>; Want to further restrict Foundset Bf so that the compound predicate “C-range AND <condition>” holds. This will be represented by BF.
Using Projection Index • Can create BF by accessing each C value in the index corresponding to a row number in BF and testing whether it lies within the specified range.
Value-List Index • Br = the empty set For each entry v in the Index for C that satisfies the range specified Designate the set of rows with the value v as Bv Br = Br OR Bv BF = Bf AND Br (Br needs to be in memory all times for efficiency)
Bit-Sliced Index • The algorithm is efficient, it comprises a bunch of Boolean operations for each bit-slice of a column C.
Range Evaluation Performance by Index Type Range Evaluation Value-List Projection Bit-Sliced Narrow Range Best Good Good Wide Range Not Bad Good Best
Join Indexes • A Join index is an index on one table for a quantity that involves a column value of a different table through a commonly encountered join. • Join indexes can be used to avoid actual joins of tables, or to greatly reduce the volume of data that must be joined, by performing restrictions in advance.
Calculating Groupset Aggregates • Once the Foundset has been computed from the Where clause, the bits in Foundset must be partitioned into Groupsets. Any aggregate functions are then evaluated separately over these different Groupsets. • How to compute Groupset aggregates using our different index types.
Projection Indexes • Assume Projection indexes exist for each of the group-by columns. • For each Foundset, read appropriate Projection indexes. Read values of the columns to be aggregated from index, and aggregate the result into the proper cell of the memory-resident array.
Value-List Indexes • Grouping by columns D1.A, D2.B using Value-List index: For each entry v1 in the Value-List index for D1.A For each entry v2 in the Value-List index for D2.B Bg = Bv1 AND Bv2 AND Bf Evaluate AGG( F.C ) on Bg
Conclusion • For read-only systems, we could use more complex index structures to speed up queries. The Bitmap, Projection, and Bit-Slice indexes serve this purpose.