CS4432: Database Systems II

CS4432: Database Systems II Lecture #10 Professor Elke A. Rundensteiner lecture #10 - indexing & hashing

Chapter 4 – INDEXING Wrap-up • B+-tree Odds and Ends • Hashing (briefly) record ? value lecture #10 - indexing & hashing

B+Tree Example n=3 100 Root 120 150 180 30 3 5 11 120 130 180 200 100 101 110 150 156 179 30 35 lecture #10 - indexing & hashing

Less space, so lookup faster Inserts managed by overflow area Requires temporary restructuring Unpredictable performance ComparisonB-tree vs. indexed seq. file • Consumes more space, so lookup slower • Each insert/delete potentially restructures • Build-in restructuring • Predictable performance lecture #10 - indexing & hashing

B-trees better … • DBA does not know when to reorganize • DBA does not know how full to load pages of new index lecture #10 - indexing & hashing

A la buffering… Is LRU a good policy for B+tree buffers? Of course not!  Should try to keep root in memory at all times (and perhaps some nodes from second level) lecture #10 - indexing & hashing

Interesting problem: For B+tree, how large should n be? … n is number of keys / node lecture #10 - indexing & hashing

assumptions:n children per node and N records in database • Time to read B-Tree node from disk is (tseek + tread*n) msec. • Once in main memory, use binary search to locate key, (a + b log_2 n) msec • Need to search (read) log_n (N) tree nodes • t-search = (tseek + tread*n + (a + b*log_2(n)) * log n (N) lecture #10 - indexing & hashing

Can get:f(n) = time to find a record f(n) noptn  FIND nopt by f’(n) = 0 • What happens to nopt as: • Disk gets faster? CPU get faster? … lecture #10 - indexing & hashing

Bulk Loading of B+ Tree • For large collection of records, create B+ tree. • Method 1: Repeatedly insert records  slow. • Method 2: Bulk Loading more efficient. lecture #10 - indexing & hashing

3* 6* 9* 10* 11* 12* 13* 23* 31* 36* 38* 41* 44* 4* 20* 22* 35* Bulk Loading of B+ Tree • Initialization: • Sort all data entries • Insert pointer to first (leaf) page in new (root) page. Root Sorted pages of data entries; not yet in B+ tree lecture #10 - indexing & hashing

Bulk Loading (Contd.) Root 10 20 • Index entries for leaf pages always entered into right-most index page • When this fills up, it splits. (Split may go up right-most path to root.) Faster than repeated inserts, especially when one considers locking! Data entry pages 6 12 23 35 not yet in B+ tree 3* 6* 9* 10* 11* 12* 13* 23* 31* 36* 38* 41* 44* 4* 20* 22* 35* Root 20 10 Data entry pages 35 not yet in B+ tree 6 23 12 38 lecture #10 - indexing & hashing 3* 6* 9* 10* 11* 12* 13* 23* 31* 36* 38* 41* 44* 4* 20* 22* 35*

Summary of Bulk Loading • Method 1: multiple inserts. • Slow. • Does not give sequential storage of leaves. • Method 2:Bulk Loading • Has advantages for concurrency control. • Fewer I/Os during build. • Leaves will be stored sequentially (and linked) • Can control “fill factor” on pages. lecture #10 - indexing & hashing

Hashing key  h(key) <key> Buckets (typically 1 disk block) . . . lecture #10 - indexing & hashing

Example hash function • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h: add x1 + x2 + ….. xn • compute sum modulo b lecture #10 - indexing & hashing

 This may not be best function …  Read Knuth Vol. 3 if you really need to select a good function. Good hash  Expected number of function: keys/bucket is the same for all buckets lecture #10 - indexing & hashing

Within a bucket: • Do we keep keys sorted? • Yes, if CPU time critical & Inserts/Deletes not too frequent lecture #10 - indexing & hashing

Next: example to illustrate inserts, overflows, deletes h(K) lecture #10 - indexing & hashing

d a e c b EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 3 h(e) = 1 lecture #10 - indexing & hashing

d maybe move “g” up EXAMPLE: deletion Delete:ef 0 1 2 3 a b d c c e f g lecture #10 - indexing & hashing

If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit lecture #10 - indexing & hashing

Extensible hashing • Others … How do we cope with growth? • Overflows and reorganizations • Dynamic hashing lecture #10 - indexing & hashing

Extensible hashing : idea 1 (a) Use i of b bits output by hash function b h(K)  use i grows over time…. 00110101 lecture #10 - indexing & hashing

Extensible hashing : idea 2 (b) Use directory h(K)[i ] to bucket . . . . . . lecture #10 - indexing & hashing

i = 2 00 01 10 11 1 1 2 1010 New directory 2 1100 Example: h(k) is 4 bits; 2 keys/bucket 1 0001 i = 1 0 1 1001 1100 Insert 1010 lecture #10 - indexing & hashing

2 0000 0001 2 0111 2 2 Example continued i = 2 00 01 10 11 1 0001 0111 1001 1010 Insert: 0111 0000 1100 lecture #10 - indexing & hashing

i = 3 000 001 010 011 100 101 110 111 3 1001 1001 2 1001 1010 3 1010 1100 2 Example continued 0000 2 0001 i = 2 00 01 10 11 0111 2 Insert: 1001 lecture #10 - indexing & hashing

Extensible hashing: deletion • Merge blocks and cut directory if possible (Reverse insert procedure) lecture #10 - indexing & hashing

Indirection (Not bad if directory in memory) Directory doubles in size (Now it fits, now it does not) - - Summary Extensible hashing + Can handle growing files - with less wasted space - with no full reorganizations lecture #10 - indexing & hashing

Indexing vs Hashing • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 lecture #10 - indexing & hashing

Indexing vs Hashing • INDEXING (Including B Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 lecture #10 - indexing & hashing

The BIG picture…. • Chapters 2 & 3: Storage, records, blocks... • Chapter 4 & 5: Access Mechanisms - Indexes - B trees - Hashing - Multi key • Chapter 6 & 7: Query Processing lecture #10 - indexing & hashing

CS4432: Database Systems II