Hash-Based Indexes Chapter 11 Modified by Donghui Zhang Jan 30, 2006
Introduction • Hash-based indexes are best for equality selections. Cannot support range searches. • Basic idea: static hashing. • Two dynamic hashing techniques: • Extendible Hashing • Linear Hashing
Motivation • B+-tree needs log_B(N) I/Os to get one key. Can we do it faster? • If we use the key as the bucket number, then one I/O is enough! • Problem: too many buckets! • What if we use (key % M)? • That is the idea, but we should apply a hashing function to the key.
[Figure: an array of primary bucket pages numbered 0, 1, …, M-1.]
Static Hashing • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. • h(k) mod M = bucket to which data entry with key k belongs. (M = # of buckets)
[Figure: key → h → h(key) mod M selects one of the primary bucket pages 0 … M-1; full buckets chain to overflow pages.]
Static Hashing (Contd.) • Buckets contain data entries. • Hash fn works on search key field of record r. Must distribute values over range 0 ... M-1. • h(key) = (a * key + b) usually works well. • Long overflow chains can develop and degrade performance. • Extendible and Linear Hashing: dynamic techniques to fix this problem.
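To make the static scheme concrete, here is a minimal Python sketch. The bucket count M, the page capacity, and the constants a and b are illustrative choices, not values from the chapter.

M = 4          # number of primary buckets, fixed for the life of the file
CAPACITY = 4   # data entries per page
A, B = 3, 1    # constants for h(key) = a*key + b

def h(key):
    return A * key + B

class StaticHashFile:
    def __init__(self):
        # Each bucket is a list of pages; page 0 is the primary page,
        # later pages model the overflow chain.
        self.buckets = [[[]] for _ in range(M)]

    def insert(self, key):
        pages = self.buckets[h(key) % M]
        for page in pages:
            if len(page) < CAPACITY:
                page.append(key)
                return
        pages.append([key])  # all pages full: chain a new overflow page

    def search(self, key):
        # Equality search reads the primary page plus any overflow pages.
        return any(key in page for page in self.buckets[h(key) % M])

Because M never changes, a workload that keeps hitting one bucket simply grows that bucket's overflow chain -- exactly the degradation the two dynamic schemes below avoid.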
Extendible Hashing • Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets? • Problem 1: Reading and writing all pages is expensive! • Problem 2: May have many under-utilized pages! • Idea: Use a directory of pointers to buckets; double # of buckets by doubling the directory, splitting just the bucket that overflowed! • Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! • Trick lies in how hash function is adjusted!
Example
[Figure: directory of size 4 (global depth 2) pointing to data pages. Local depth of every bucket is 2. Bucket A: 16*, 4*, 12*, 32*; Bucket B: 5*, 1*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*.]
• Directory is array of size 4. • To find bucket for r, take last `global depth' # bits of h(r); we denote r by h(r). • If h(r) = 5 = binary 101, it is in bucket pointed to by 01. • Insert: If bucket is full, split it (allocate new page, re-distribute). • If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)
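The directory lookup itself is just a bit mask over the hash value. A sketch (the function name is ours):

def bucket_for(hash_value, directory, global_depth):
    # Take the last `global depth' bits of h(r) as the directory index.
    return directory[hash_value & ((1 << global_depth) - 1)]

For h(r) = 5 = binary 101 with global depth 2, the mask keeps 01, so the lookup lands on the entry pointing to Bucket B.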
Example
[Figure: same file; Bucket C now holds 14*, 10*.]
• Insert 14: 14 % 4 = 2, goes to Bucket C. Does not overflow. • Insert 20: 20 % 4 = 0, but Bucket A is full. Overflow! • Re-distribute with one more bit: 32 % 8 = 0, 16 % 8 = 0 -- stay in bucket 000; 4 % 8 = 4, 12 % 8 = 4, 20 % 8 = 4 -- move to bucket 100.
Insert h(r)=20 (Causes Doubling)
[Figure: before and after the doubling. Before: global depth 2, directory entries 00-11. After: global depth 3, directory entries 000-111; Bucket A (local depth 3) holds 32*, 16*; Bucket A2 (local depth 3, `split image' of Bucket A) holds 4*, 12*, 20*; Buckets B (1*, 5*, 21*, 13*), C (10*), D (15*, 7*, 19*) keep local depth 2, each pointed to by two directory entries.]
Points to Note • When does a bucket split cause directory doubling? Before the insert, local depth of the bucket = global depth; the insert causes local depth to become > global depth. • How to implement the doubling of the directory? Copy it over! (Use of least significant bits enables this!) • Then `fix' the pointers to the split image page.
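Putting the split and doubling rules together, here is a compact Python sketch of extendible-hashing insert. Bucket capacity 4 and identity hashing (h(key) = key) are assumptions for illustration.

CAPACITY = 4

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHashFile:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def insert(self, key):
        b = self.directory[key & ((1 << self.global_depth) - 1)]
        if len(b.keys) < CAPACITY:
            b.keys.append(key)
            return
        if b.local_depth == self.global_depth:
            # Doubling = copying the directory over; least significant
            # bits make the new half point at the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket and fix pointers to the split image.
        b.local_depth += 1
        image = Bucket(b.local_depth)
        bit = 1 << (b.local_depth - 1)
        for i, ptr in enumerate(self.directory):
            if ptr is b and i & bit:
                self.directory[i] = image
        pending, b.keys = b.keys + [key], []
        for k in pending:          # re-distribute; may split again
            self.insert(k)

Note the order: the directory is doubled and the split-image pointers are fixed before the entries are re-distributed, so the recursive re-insertions route correctly. If many entries share one hash value the recursion never terminates -- the duplicate problem mentioned on the next slide.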
Comments on Extendible Hashing • Good: no overflow pages. If directory fits in memory, equality search answered with one disk access; else two. • Since directory contains only key + pointer, it should generally be small enough to fit in memory. • Bad: if the distribution of hash values is skewed, directory can grow large. • Multiple entries with same hash value cause problems!
Sample quiz question • Is it possible for an insertion to cause the directory to double more than once? • If yes, show an example.
Deletion in Extendible Hashing • Find the record and delete it. • Rule 1: If removal of a data entry makes a bucket empty, it can be merged with its `split image'. • Rule 2: If each directory element points to the same bucket as its split image, can halve the directory.
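Rule 2 is easy to check in code. A sketch against the ExtendibleHashFile above: at global depth d, the split image of directory entry i is the entry whose index differs only in the top bit, i.e. i + half.

def try_halve_directory(f):
    # Rule 2: if every directory element points to the same bucket as
    # its split image, the directory can be halved.
    half = len(f.directory) // 2
    if all(f.directory[i] is f.directory[i + half] for i in range(half)):
        f.directory = f.directory[:half]
        f.global_depth -= 1
        return True
    return False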
Linear Hashing • This is another dynamic hashing scheme, an alternative to Extendible Hashing. • LH handles the problem of long overflow chains without using a directory, and handles duplicates. • Idea: • Keep a pointer Next that cycles through the buckets. • Whenever a new overflow page is generated, split the bucket Next points to, and add a new bucket at the end of the file. • Observation: the number of buckets is doubled after all buckets have been split!
Example of Linear Hashing • Let there be eight records: • 32, 44, 14, 18 • 9, 25, 31, 35 • Let each disk page hold up to four records. • Consider inserting 10, 30, 36, 7, 5, 11…
Example of Linear Hashing Let hLevel = (a*key+b) % 4, hLevel+1 = (a*key+b) % 8. • On split, hLevel+1 is used to re-distribute entries.
[Figure, left (Level=0, N=4, Next=0): bucket 000: 32*, 44*, 36*; bucket 001: 9*, 25*, 5*; bucket 010: 14*, 18*, 10*, 30*; bucket 011: 31*, 35*, 7*, 11*. Right (Level=0, Next=1, after inserting 43*): bucket 011 gains an overflow page holding 43*, which triggers the split of bucket 000: 32* stays in 000, while 44* and 36* move to the new bucket 100.]
• If we now insert 22 ……
Example of Linear Hashing Let hLevel = (a*key+b) % 4, hLevel+1 = (a*key+b) % 8.
[Figure (Level=0, Next=2, after inserting 22*): bucket 000: 32*; bucket 001: 9*, 25*; bucket 010: 30*, 10*, 14*, 18* with an overflow page holding 22*; bucket 011: 31*, 35*, 7*, 11* with an overflow page holding 43*; bucket 100: 44*, 36*; bucket 101: 5*.]
• Important notice: after buckets 010 and 011 are split, Next moves back to 000!
Example: End of a Round
[Figure, left (Level=0, Next=3): bucket 000: 32*; 001: 9*, 25*; 010: 66*, 18*, 10*, 34* with overflow page 50*; 011: 31*, 35*, 7*, 11* with overflow page 43*; 100: 44*, 36*; 101: 5*, 37*, 29*; 110: 14*, 30*, 22*. Right (Level=1, Next=0): splitting bucket 011 ends the round: 011 now holds 43*, 35*, 11*, the new bucket 111 holds 31*, 7*, Level becomes 1, and Next returns to 000.]
LH File • In the middle of a round:
[Figure: Next points to the bucket to be split; the buckets before Next were already split in this round; the buckets from Next to the end of the round's original file (the range of hLevel) existed at the beginning of the round; the `split image' buckets, created (through splitting of other buckets) in this round, sit at the end of the file.]
• If hLevel(search key value) falls before Next (i.e., that bucket was already split), must use hLevel+1(search key value) to decide if the entry is in the `split image' bucket.
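The search rule compresses to a few lines of Python. A sketch, where n0 is the number of buckets at the start of round 0 (the names are ours):

def bucket_for(key, level, next_, n0=4):
    b = key % (n0 * 2 ** level)                # h_Level
    if b < next_:                              # bucket b was already split
        b = key % (n0 * 2 ** (level + 1))      # h_Level+1: b or its image
    return b

With Level=0 and Next=2 as in the earlier figure, key 5 gives 5 % 4 = 1 < 2, so hLevel+1 applies and 5 % 8 = 5 sends the search to bucket 101.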
Linear Hashing (Contd.) • Insert: Find bucket by applying hLevel / hLevel+1: • If bucket to insert into is full: • Add overflow page and insert data entry. • (Maybe) Split Next bucket and increment Next. • Can choose any criterion to `trigger’ split. • Since buckets are split round-robin, long overflow chains don’t develop! • Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.
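A sketch of the insert path, reusing bucket_for() from above. Page capacity 4 and the `split on any overflow' trigger are assumptions; overflow chains are modeled implicitly by unbounded bucket lists.

CAPACITY = 4

class LinearHashFile:
    def __init__(self, n0=4):
        self.n0, self.level, self.next = n0, 0, 0
        self.buckets = [[] for _ in range(n0)]

    def insert(self, key):
        b = bucket_for(key, self.level, self.next, self.n0)
        overflow = len(self.buckets[b]) >= CAPACITY
        self.buckets[b].append(key)   # may extend an overflow chain
        if overflow:
            self.split_next()

    def split_next(self):
        # Split the bucket Next points to (round-robin, not necessarily
        # the one that overflowed) and add its image at the file's end.
        old = self.buckets[self.next]
        image_no = len(self.buckets)
        self.buckets.append([])
        mod = self.n0 * 2 ** (self.level + 1)
        self.buckets[self.next] = [k for k in old if k % mod != image_no]
        self.buckets[image_no] = [k for k in old if k % mod == image_no]
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:   # end of round
            self.level, self.next = self.level + 1, 0

Splitting the Next bucket rather than the overflowed one is what keeps chains short on average: every bucket is split within one round, with no directory required.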
Summary • Hash-based indexes: best for equality searches, cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) • Directory to keep track of buckets, doubles periodically. • Can get large with skewed data; additional I/O if this does not fit in main memory.
Summary (Contd.) • Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages. • Overflow pages not likely to be long. • Duplicates handled easily. • Space utilization could be lower than Extendible Hashing, since splits not concentrated on `dense’ data areas. • Can tune criterion for triggering splits to trade-off slightly longer chains for better space utilization. • For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
Sample Quiz Question
[Figure: the extendible hashing example after inserting 14. Global depth 2; local depth 2 everywhere. Bucket A: 16*, 4*, 12*, 32*; Bucket B: 5*, 1*, 21*, 13*; Bucket C: 14*, 10*; Bucket D: 15*, 7*, 19*.]
• Min/Max # insertions to increase global depth by 1? • What about by 2?
Sample Quiz Question
[Figure: the linear hashing file with Level=0, Next=2. Bucket 000: 32*; 001: 9*, 25*; 010: 30*, 10*, 14*, 18* with overflow page 22*; 011: 31*, 35*, 7*, 11* with overflow page 43*; 100: 44*, 36*; 101: 5*.]
• Min/Max # insertions to cause Next to point to bucket 011? • What about for Next to point to 000?