220 likes | 233 Views
This chapter introduces hash-based indexes, including static hashing, which uses a fixed number of buckets, and extendible hashing, which dynamically adjusts the number of buckets.
E N D
Hash-Based Indexes Chapter 11
Introduction : Hash-based Indexes • Best for equalityselections. • Cannot support range searches. • Static and dynamic hashing techniques exist: Trade-offs similar to ISAM vs. B+ trees.
Static Hashing • h(k) mod N = bucket to which data entry withkey k belongs. (N = # of buckets) 0 h(key) mod N 2 key h N-1 Primary bucket pages Overflow pages
Static Hashing : h(k) mod N • # primary pages fixed (N = # of buckets) • allocated sequentially • never de-allocated • overflow pages if needed. 0 h(key) mod N 2 key h N-1 Primary bucket pages Overflow pages
Static Hashing • h(k) mod N = bucket to which data entry withkey k belongs with N = # of buckets • Hash function works on search key of record r. • h() must distribute values over range [0 ... N-1]. • For example, h(key) = (a * key + b) • a and b are constants • lots known about how to tune h.
Static Hashing Cons • Primary pages fixed space static structure. • Fixed # buckets is the problem: • Rehashing can be done Not good for search. • In practice, instead use overflow chains. • Long overflow chains degrade performance. • Solution: Employ dynamic techniques : • Extendible hashing, or • LinearHashing
Extendible Hashing • Problem: Bucket (primary page) becomes full. • Solution: Re-organize file by doubling # of buckets? • But Reading and writing all pages is expensive!
Extendible Hashing • Ideas : • Use directory of pointers to buckets instead of buckets • Details : • Double # of buckets by doubling the directory • Split just the bucket that overflowed! • Trick : • How hash function is adjusted!
Example LOCAL DEPTH 2 Bucket A 16* 4* 12* 32* GLOBAL DEPTH • Directory : array[4] • To find bucket for r, take last `global depth’ # bits of function h(r) 2 2 Bucket B 00 5* 1* 21* 13* 01 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES
Example LOCAL DEPTH 2 Bucket A 16* 4* 12* 32* GLOBAL DEPTH • If h(r) = 5 = 101, • it is in bucket pointed to by 01. • If h(r) = 4 = 100, • it is in bucket pointed to by 00. 2 2 Bucket B 00 5* 1* 21* 13* 01 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES
LOCAL DEPTH 2 Insertion Bucket A 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 • Insert: If bucket is full, splitit • (allocate new page, re-distribute content). 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES • Splitting may : • double the directory, or • simply link in a new page. • To tell what to do : • Compare global depth with local depth for split bucket.
Bucket A LOCAL DEPTH 2 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 Bucket C 10 10* 11 Bucket D 2 DIRECTORY 15* 7* 19* DATA PAGES Insert h(r)= 6 = binary 110
Bucket A Bucket A LOCAL DEPTH LOCAL DEPTH 2 2 16* 16* 4* 12* 32* 4* 12* 32* GLOBAL DEPTH GLOBAL DEPTH 2 2 2 2 Bucket B Bucket B 00 00 5* 5* 1* 21* 13* 1* 21* 13* 01 01 2 2 Bucket C Bucket C 10 10 6* 10* 10* 11 11 Bucket D Bucket D 2 2 DIRECTORY DIRECTORY 15* 7* 19* 15* 7* 19* DATA PAGES DATA PAGES Insert h(r)= 6 = binary 110
Bucket A LOCAL DEPTH 2 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 Bucket C 10 10* 11 Bucket D 2 DIRECTORY 15* 7* 19* DATA PAGES Insert h(r)=20 = binary 10100
Bucket A LOCAL DEPTH 2 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 Bucket C 10 10* 11 Bucket D 2 DIRECTORY 15* 7* 19* DATA PAGES Insert h(r)=20 Split Bucket A into two buckets A1 and A2. 3 4* 12* Bucket A2 20* 3 16* 32* Bucket A1
3 4* 12* Bucket A2 20* 3 Bucket A1 32* 16* Insert h(r)=20 Bucket A LOCAL DEPTH 2 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 2 Bucket C 10 10* 11 Bucket D 2 DIRECTORY 15* 7* 19* DATA PAGES
Insert h(r)=20 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 16* 32* 32* 16* GLOBAL DEPTH Bucket A GLOBAL DEPTH 2 2 2 3 Bucket B 5* 21* 13* 1* 00 1* 5* 21* 13* 000 Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* Bucket C 011 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY 12* 20* Bucket A2 4* (`split image' of Bucket A) (`split image' of Bucket A)
Points to Note • 20 = binary 10100. • Last 2 bits (00) tell us if r belongs in A • Last 3 bits needed to tell if r belongs into A1 or A2
More Points to Note Bits: • Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to • Local depth of a bucket: Actual# of bits used to determine if an entry belongs to this bucket. • When does bucket split cause directory doubling? • Before insert, local depth of bucket = global depth. • Insert causes local depth to become > global depth; • Directory is doubled by copying it over and `fixing’ pointer to split the one “over-full page”.
Extendible Hashing : Delete • Delete: • If removal of data entry makes bucket empty, can be merged with `split image’. • If each directory element points to same bucket as its split image, can halve directory.
Comments on Extendible Hashing • If directory fits in memory, then equality search answered with one disk access; else with two. • 100MB file, 100 bytes/rec, 4K pages contain 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory. • Directory grows in spurts. • If the distribution of hash values is skewed, directory can grow large.
Summary • Hash-based indexes: best for equality searches, cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting full bucket when new data to be added • Directory to keep track of buckets, doubles periodically