Hash-Based Indexes

Hash-Based Indexes R&G Chapter 10 Lecture 18

Administrivia • The new Homework 3 now available • Due 1 week from Sunday • Homework 4 available the week after • Midterm exams available here

Review • Last time discussed Tree Indexes • ISAM – static, efficient if data never changes • B-Trees – dynamic, good for changing data • Both good for range queries, o.k. for equality • Today: Hash Indexes • Useless for range queries • Best for equality queries

Review (2) • As for any index, 3 alternatives for data entries k*: • Data record with key value k • <k, rid of data record with search key value k> • <k, list of rids of data records with search key k> • Choice orthogonal to the indexing technique • Hash-based indexes are best for equalityselections. Cannot support range searches.

Today: Hashing • Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees. • Static Hashing • Good if data never changes (like ISAM) • Extendable Hashing • Uses directory to handle changing data • Linear Hashing • Avoids directory, usually faster

Static Hashing • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. • h(k) mod M = bucket to which data entry withkey k belongs. (M = # of buckets) 0 h(key) mod N 2 key h N-1 Primary bucket pages Overflow pages

Static Hashing (Contd.) • Buckets contain data entries. • Hash fn works on search key field of record r. Must distribute values over range 0 ... M-1. • h(key) = (a * key + b) usually works well. • a and b are constants; lots known about how to tune h. • Long overflow chains can develop and degrade performance. • Extendible and LinearHashing: Dynamic techniques to fix this problem.

Extendible Hashing • Situation: Bucket (primary page) becomes full. • Why not re-organize file by doubling # of buckets? • Reading and writing all pages is expensive! • Idea: Use directory of pointers to buckets, • double # of buckets by doubling the directory, • splitting just the bucket that overflowed! • Directory much smaller than file, so doubling much cheaper. • Nooverflowpages! • Trick lies in how hash function is adjusted!

Extendible Hashing Details • Need directory with pointer to each bucket • Need hash function to incrementally double range • can just use increasingly more LSBs of h(key) • Must keep track of global “depth” • how many times directory doubled so far • Must keep track of local “depth” • how many time each bucket has been split

LOCAL DEPTH 2 Bucket A 16* 4* 12* 32* Example GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 • Directory is array of size 4. • To find bucket for r, take last `global depth’ # bits of h(r); we denote r by h(r). • If h(r) = 5 = binary 101, it is in bucket pointed to by 01. 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES • Insert: If bucket is full, splitit (allocate new page, re-distribute). • If necessary, double the directory. (As we will see, splitting a • bucket does not always require doubling; we can tell by • comparing global depth with local depth for the split bucket.)

Insert h(r)=20 (Causes Doubling) 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 16* 32* 32* 16* GLOBAL DEPTH Bucket A GLOBAL DEPTH 2 2 2 3 Bucket B 5* 21* 13* 1* 00 1* 5* 21* 13* 000 Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* Bucket C 011 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY 12* 20* Bucket A2 4* (`split image' of Bucket A) (`split image' of Bucket A)

Now Insert h(r)=9 (Causes Split Only) 3 3 LOCAL DEPTH LOCAL DEPTH 32* 32* 16* 16* Bucket A Bucket A GLOBAL DEPTH GLOBAL DEPTH 2 3 3 3 1* 5* 21* 13* 1* 9* 000 000 Bucket B Bucket B 001 001 2 2 010 010 10* 10* Bucket C Bucket C 011 011 100 100 2 2 101 101 15* 15* 7* 7* 19* 19* Bucket D Bucket D 110 110 111 111 3 3 DIRECTORY 12* 12* 20* 20* Bucket A2 Bucket A2 4* 4* (`split image' DIRECTORY of Bucket A) 3 5* 21* 13* Bucket B2

Points to Note • 20 = binary 10100. Last 2 bits (00) tell us r belongs in A or A2. Last 3 bits needed to tell which. • Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to. • Local depth of a bucket: # of bits used to determine if splitting bucket will also double directory • When does bucket split cause directory doubling? • If, before insert, local depth of bucket = global depth. • Insert causes local depth to become > global depth; • directory is doubled by copying it over and `fixing’ pointer to split image page. • (Use of least significant bits enables efficient doubling via copying of directory!)

Directory Doubling • Why use least significant bits in directory? • Allows for doubling via copying! 2 2 3 16* 4* 12* 32* 16* 4* 12* 32* 000 2 2 2 001 5* 1* 21* 13* 010 00 5* 1* 21* 13* 011 01 2 2 100 10 10* 10* 11 101 110 2 2 111 15* 7* 19* 15* 7* 19*

Deletion • Delete: If removal of data entry makes bucket empty, can be merged with `split image’. If each directory element points to same bucket as its split image, can halve directory. 2 1 16* 16* 4* 12* 32* 4* 12* 32* 2 2 2 2 Delete 10 00 00 5* 5* 1* 21* 13* 1* 21* 13* 01 01 2 10 10 10* 11 11 2 2 15* 15*

Deletion (cont) • Delete: If removal of data entry makes bucket empty, can be merged with `split image’. If each directory element points to same bucket as its split image, can halve directory. 1 2 16* 4* 12* 32* 00 1 16* 4* 12* 32* 01 Delete 15 1 10 5* 1* 21* 13* 2 2 11 00 5* 1* 21* 13* 01 10 1 1 11 16* 4* 12* 32* 0 1 2 1 15* 5* 1* 21* 13*

Comments on Extendible Hashing • If directory fits in memory, equality search answered with one disk access; else two. • 100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory. • Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large. • Biggest problem: • Multiple entries with same hash value cause problems! • If bucket already full of same hash value, will keep doubling forever!

Linear Hashing • This is another dynamic hashing scheme, an alternative to Extendible Hashing. • LH handles the problem of long overflow chains without using a directory, and handles duplicates. • Idea: Use a family of hash functions h0, h1, h2, ... • hi(key) = h(key) mod(2iN); N = initial # buckets • h is some hash function (range is not 0 to N-1) • If N = 2d0, for some d0, hi consists of applying h and looking at the last di bits, where di = d0 + i. • hi+1 doubles the range of hi (similar to directory doubling)

Linear Hashing Example • Let’s start with N = 4 Buckets • Start at “round” 0, “next” 0, have 2round buckets • Each time any bucket fills, split “next” bucket • If (O ≤ hround(key) < Next), use hround+1(key) instead Start Add 9 Next 16* 16* 4* 12* 32* 32* Add 20 9* 5* 5* 1* 21* 13* Next 1* 21* 13* 16* 32* 10* 10* 15* 15* 9* 5* Next 1* 21* 13* 4* 12* 10* 15* 20* 4* 12*

Linear Hashing Example (cont) • Overflow chains do exist, but eventually get split • Instead of doubling, new buckets added one-at-a-time Add 6 Add 17 16* 16* 32* 32* 9* 5* Next 1* 21* 13* 1* 13* 10* 6* 10* 6* Next 15* 15* 4* 12* 20* 4* 12* 20* 5* 21* 9*

Linear Hashing (Contd.) • Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin. • Splitting proceeds in `rounds’. Round ends when all NRinitial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to NR yet to be split. • Current round number also called Level. • Search:To find bucket for data entry r, findhround(r): • If hround(r) in range `Next to NR’, r belongs here. • Else, r could belong to bucket hround(r) or bucket hround(r) + NR; must apply hround+1(r) to find out.

Overview of LH File • In the middle of a round. Buckets split in this round: Bucket to be split If ( h search key value ) Level Next is in this range, must use h ( search key value ) Level+1 Buckets that existed at the to decide if entry is in beginning of this round: `split image' bucket. this is the range of h Level `split image' buckets: created (through splitting of other buckets) in this round

Linear Hashing (Contd.) • Insert: Find bucket by applying hround / hround+1: • If bucket to insert into is full: • Add overflow page and insert data entry. • (Maybe) Split Next bucket and increment Next. • Can choose any criterion to `trigger’ split. • Since buckets are split round-robin, long overflow chains don’t develop! • Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.

Another Example of Linear Hashing • On split, hLevel+1 is used to re-distribute entries. Round=0, N=4 Round=0 PRIMARY h h OVERFLOW h h PRIMARY PAGES 0 0 1 1 PAGES PAGES Next=0 32* 32* 44* 36* 000 00 000 00 Next=1 Data entry r 9* 5* 9* 5* 25* 25* with h(r)=5 001 001 01 01 30* 30* 10* 10* 14* 18* 14* 18* Primary 10 10 010 010 bucket page 31* 35* 7* 31* 35* 7* 11* 11* 43* 011 011 11 11 (This info is for illustration only!) (The actual contents of the linear hashed file) 100 44* 36* 00

Example: End of a Round Round=1 PRIMARY OVERFLOW h h PAGES 0 1 PAGES Next=0 Round=0 00 000 32* PRIMARY OVERFLOW PAGES h PAGES h 1 0 001 01 9* 25* 32* 000 00 10 010 50* 10* 18* 66* 34* 9* 25* 001 01 011 11 35* 11* 43* 66* 10 18* 10* 34* 010 Next=3 100 00 44* 36* 43* 11* 7* 31* 35* 011 11 101 11 5* 29* 37* 44* 36* 100 00 14* 22* 30* 110 10 5* 37* 29* 101 01 14* 30* 22* 31* 7* 11 111 110 10

LH Described as a Variant of EH • The two schemes are actually quite similar: • Begin with an EH index where directory has N elements. • Use overflow pages, split buckets round-robin. • First split is at bucket 0. (Imagine directory being doubled at this point.) But elements <1,N+1>, <2,N+2>, ... are the same. So, need only create directory element N, which differs from 0, now. • When bucket 1 splits, create directory element N+1, etc. • So, directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding i’th is easy), we actually don’t need a directory! Voila, LH.

Summary • Hash-based indexes: best for equality searches, cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) • Directory to keep track of buckets, doubles periodically. • Can get large with skewed data; additional I/O if this does not fit in main memory.

Summary (Contd.) • Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages. • Overflow pages not likely to be long. • Duplicates handled easily. • Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas. • Can tune criterion for triggering splits to trade-off slightly longer chains for better space utilization. • For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!

Hash-Based Indexes