Database Systems (資料庫系統) November 8, 2004 Lecture #9 By Hao-hua Chu (朱浩華)
Announcement • Midterm exam: November 20 (Sat), 2:30 PM in CSIE 101/103. • Assignment #6 is available on the course homepage. • It is due on 11/24. • It is very difficult. • We suggest you do it before the midterm exam. • Assignment #7 will be available on the course homepage later this afternoon. • It is due on 11/16 (next Tuesday). • It is easy. • It will help you prepare for the midterm exam.
Cool Ubicomp Project: Counter Intelligence (MIT) • Smart kitchen & kitchenware • Talking Spoon • Salty, sweet, hot? • Talking Cutlery • Bacteria? • Smart fridge & counters • RFID tags • Tracking food from the fridge to your mouth
Hash-Based Indexing Chapter 11
Introduction • Recall that hash-based indexes are best for equality selections. • They cannot support range searches. • Equality selections are useful for join operations. • Static and dynamic hashing techniques exist. • Trade-offs are similar to ISAM vs. B+ trees. • Static hashing technique • Two dynamic hashing techniques • Extendible Hashing • Linear Hashing
Static Hashing • # primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed. • h(k) mod N = bucket to which the data entry with key k belongs. (N = # of buckets) [Figure: a key is passed through h, and h(key) mod N selects one of the primary bucket pages 0 .. N-1; each primary page may have a chain of overflow pages.]
Static Hashing (Contd.) • Buckets contain data entries. • The hash function works on the search key field of record r. • Ideally it distributes values uniformly over the range 0 ... N-1. • h(key) = (a * key + b) usually works well. • a and b are constants; a lot is known about how to tune h. • Cost with no overflow chains: • Insert and delete take two disk page I/Os each (read the page, write it back); search takes one. • Long overflow chains can develop and degrade performance. • Why poor performance? Overflow chains must be scanned linearly. • Extendible and Linear Hashing: dynamic techniques that fix this problem.
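The cost of overflow chains can be made concrete with a small in-memory sketch. It is only an illustration: the class name `StaticHashFile`, the page capacity, and the choice to count pages read are assumptions, not part of the lecture.

```python
class StaticHashFile:
    """Static hashing sketch: N primary pages are fixed; overflow pages are chained."""

    def __init__(self, num_buckets=4, page_capacity=2, a=1, b=0):
        self.n = num_buckets
        self.capacity = page_capacity
        self.a, self.b = a, b  # constants of h(key) = a * key + b
        # Each bucket is a chain of pages; page 0 is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def bucket_of(self, key):
        # h(key) mod N selects the bucket.
        return (self.a * key + self.b) % self.n

    def insert(self, key):
        chain = self.buckets[self.bucket_of(key)]
        if len(chain[-1]) >= self.capacity:
            chain.append([])  # last page is full: chain a new overflow page
        chain[-1].append(key)

    def search(self, key):
        """Return (found, pages_read); the chain is scanned page by page."""
        chain = self.buckets[self.bucket_of(key)]
        for pages_read, page in enumerate(chain, start=1):
            if key in page:
                return True, pages_read
        return False, len(chain)


f = StaticHashFile(num_buckets=4, page_capacity=2)
for k in [0, 4, 8, 12, 16]:  # all of these hash to bucket 0
    f.insert(k)
print(f.search(16))  # the key sits on the second overflow page
print(f.search(1))   # bucket 1 has only its (empty) primary page
```

With every key landing in the same bucket, a search has to read the primary page and both overflow pages, which is the linear scan the slide warns about.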
Simple Solution • Avoid creating overflow pages: • When a bucket (primary page) becomes full, double # of buckets & re-organize the file. • What’s wrong with this simple solution? • High cost concern: reading and writing all pages is expensive!
Extendible Hashing • The basic idea (another level of abstraction): • Use a directory of pointers to buckets (hash to the directory entry). • Double the # of buckets by doubling the directory. • Split just the bucket that overflowed! • The directory is much smaller than the file, so doubling it is much cheaper. • Only one page of data entries is split: • The page that overflows is rehashed into two pages. • The trick lies in how the hash function is adjusted! • Before doubling the directory, h(r) -> 0 .. N-1 buckets. • After doubling the directory, h(r) -> 0 .. 2N-1.
Example • Directory is an array of size 4. • To find the bucket for r, take the last (global depth) # bits of h(r). • Example: If h(r) = 5 = binary 101, it is in the bucket pointed to by 01. • Global depth: # of bits used for hashing directory entries. • Local depth of a bucket: # of bits used for hashing entries in that bucket. • When can global depth be different from local depth? [Figure: directory with global depth 2 and four data pages, each with local depth 2. 00 -> Bucket A: 4*, 12*, 32*, 16*; 01 -> Bucket B: 1*, 5*, 21*, 13*; 10 -> Bucket C: 10*; 11 -> Bucket D: 15*, 7*, 19*.]
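The directory lookup in this example amounts to masking off the low-order bits of the hash value. A minimal sketch, where the function name `directory_index` and the dictionary standing in for the directory are illustrative assumptions:

```python
def directory_index(hash_value, global_depth):
    """Return the directory slot: the last `global_depth` bits of the hash value."""
    return hash_value & ((1 << global_depth) - 1)


# Directory from the example (global depth 2), keyed by slot number.
directory = {0b00: "A", 0b01: "B", 0b10: "C", 0b11: "D"}

slot = directory_index(5, 2)              # 5 = binary 101, last two bits are 01
print(format(slot, "02b"), directory[slot])
```

The same mask with one more bit gives the slot after the directory doubles, which is why only the global depth has to change.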
Insert h(r)=20 (Causes Doubling) • Binary hash values: 4 = 0000 0100, 12 = 0000 1100, 20 = 0001 0100, 16 = 0001 0000, 32 = 0010 0000. [Figure, before: global depth 2; Bucket A (local depth 2) holds 4*, 12*, 32*, 16* and is full; Bucket B: 1*, 5*, 21*, 13*; Bucket C: 10*; Bucket D: 15*, 7*, 19*.] [Figure, after: global depth 3, directory entries 000 .. 111; Bucket A (local depth 3): 32*, 16*; Bucket A2 (local depth 3, `split image' of Bucket A): 4*, 12*, 20*; Buckets B, C and D are unchanged with local depth 2.]
Extendible Hashing Insert • Check if the bucket is full. • If not, insert the entry; done! • Otherwise, check if local depth = global depth. • If not, increment the local depth, then rehash the entries and distribute them into two buckets. • If yes, double the directory first, then rehash the entries and distribute them into two buckets. • The directory is doubled by copying it over and `fixing' the pointer to the split image page. • Doubling by copying works only because the directory is indexed by the least significant bits.
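The insert procedure above can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the names `Bucket` and `ExtendibleHash` are invented here, the values inserted are treated as already-hashed keys, and duplicates (which would need overflow pages) are not handled.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity  # data entries per bucket page
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, h):
        # Directory slot = last `global_depth` bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(h)  # bucket not full: done
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory by copying it over.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket: one more bit now distinguishes its entries.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        pending = bucket.entries + [h]
        bucket.entries = []
        # Fix the pointers: slots with the new bit set go to the split image.
        for slot, b in enumerate(self.directory):
            if b is bucket and slot & new_bit:
                self.directory[slot] = image
        # Rehash the old entries plus the new one (may split again if needed;
        # identical hash values would recurse forever, hence no duplicates).
        for value in pending:
            self.insert(value)


eh = ExtendibleHash(global_depth=2, capacity=4)
for h in [16, 4, 12, 32, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(h)
eh.insert(20)  # bucket 00 is full and local depth == global depth: doubling
eh.insert(9)   # bucket 001 is full but local depth < global depth: split only
print(eh.global_depth, eh.directory[1].entries, eh.directory[5].entries)
```

The sample values are the ones used in the lecture's running example, so the resulting buckets can be compared against the figures for inserting 20 and 9.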
Insert 9 • Binary hash values: 1 = 0000 0001, 5 = 0000 0101, 21 = 0001 0101, 13 = 0000 1101, 9 = 0000 1001. [Figure, before: global depth 3; Bucket A (local depth 3): 32*, 16*; Bucket B (local depth 2): 1*, 5*, 21*, 13*, full; Bucket C (local depth 2): 10*; Bucket D (local depth 2): 15*, 7*, 19*; Bucket A2 (local depth 3, `split image' of Bucket A): 4*, 12*, 20*.] [Figure, after: global depth is still 3; Bucket B (local depth 3): 1*, 9*; Bucket B2 (local depth 3, `split image' of Bucket B): 5*, 21*, 13*; all other buckets are unchanged.]
Directory Doubling • Why use least significant bits in directory? • Allows for doubling via copying! [Figure: a directory of global depth 2 doubled to global depth 3, shown for the entry 6* (6 = binary 110), indexed by least significant bits vs. most significant bits.]
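Why least significant bits allow doubling by copying can be seen by comparing the two indexing schemes on a toy directory. In this sketch the function names and the letters standing in for bucket pointers are assumptions made for illustration.

```python
def double_lsb(directory):
    """LSB indexing: slot s and slot s + len(directory) share a bucket,
    so the doubled directory is simply two copies laid back to back."""
    return directory + directory


def double_msb(directory):
    """MSB indexing: slots 2s and 2s + 1 share a bucket,
    so every entry has to be duplicated in place (interleaved)."""
    return [bucket for bucket in directory for _ in range(2)]


old = ["a", "b", "c", "d"]  # stand-ins for bucket pointers, global depth 2

# 6 = binary 110. With 2 bits, LSB indexing uses 10 (slot 2) and
# MSB indexing uses 11 (slot 3); with 3 bits both use slot 110 = 6.
print(double_lsb(old))
print(double_msb(old))
```

In both cases the entry for 6 still reaches its old bucket after doubling, but only the LSB layout leaves the first half of the directory untouched, so the new half can be appended as a plain copy.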
Comments on Extendible Hashing • If the directory fits in memory, an equality search is answered with one disk access; otherwise two. • Example: a 100MB file with 100 bytes/record has 1M data entries. • A 4K page (a bucket) can contain 40 data entries, so you need about 25,000 directory elements; chances are high that the directory will fit in memory. • If the distribution of hash values is skewed (concentrated on a few buckets), the directory can grow large. • Delete: If removing a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, the directory can be halved.
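The directory-size estimate is plain integer arithmetic. The short check below assumes that "100MB" means 100 * 10^6 bytes, that "4K" means 4096 bytes, that a data entry is as large as a record, and that every bucket is full; the variable names are illustrative.

```python
file_bytes = 100 * 10**6   # 100 MB file (decimal megabytes assumed)
record_bytes = 100         # bytes per record, taken as the data entry size
page_bytes = 4096          # "4K" page, assumed to be 4096 bytes

data_entries = file_bytes // record_bytes               # number of data entries
entries_per_page = page_bytes // record_bytes           # entries that fit in a bucket
directory_elements = data_entries // entries_per_page   # one element per full bucket

print(data_entries, entries_per_page, directory_elements)
```

Since real buckets are rarely completely full, the actual number of directory elements would be somewhat higher than this lower bound.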
Linear Hashing (LH) • This is another dynamic hashing scheme, an alternative to Extendible Hashing. • LH fixes the problem of long overflow chains (in static hashing) without using a directory (as in extendible hashing). • Basic idea: Use a family of hash functions h0, h1, h2, ... • Each function’s range is twice that of its predecessor. • Pages are split when overflows occur, but not necessarily the overflowing page. (Splitting occurs in turn, in round-robin fashion.) • Buckets are added gradually (one bucket at a time). • When all the pages at one level (the current hash function) have been split, a new level is applied. • Primary pages are allocated consecutively.
Levels of Linear Hashing • Initial stage. • The initial level distributes entries into N0 buckets. • Call the hash function that does this h0. • Splitting buckets. • If a bucket overflows, its primary page is chained to an overflow page (same as in static hashing). • Also, when a bucket overflows, some bucket is split. • The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows). • The next bucket to be split is the second bucket in the file, and so on until the N0-th bucket has been split. • When buckets are split, their entries (including those in overflow pages) are redistributed using h1. • To access split buckets, the next-level hash function (h1) is applied. • h1 maps entries to 2N0 (= N1) buckets.
Levels of Linear Hashing (Contd.) • Level progression: • Once all N_i buckets of the current level i have been split, the hash function h_i is replaced by h_{i+1}. • The splitting process starts again at the first bucket, and h_{i+2} is applied to find entries in split buckets.
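The splitting and level-progression rules can be put together in a short in-memory sketch. Several things here are assumptions rather than the lecture's definitions: the class name `LinearHashFile`, the choice h_i(key) = key mod (N0 * 2^i), modelling a primary page and its overflow chain as one Python list, and triggering a split on every overflow.

```python
class LinearHashFile:
    def __init__(self, n0=4, capacity=3):
        self.n0 = n0              # number of buckets at level 0
        self.capacity = capacity  # entries that fit on a primary page
        self.level = 0
        self.next = 0             # bucket to be split next (round robin)
        # One list per bucket: primary page plus any overflow entries.
        self.buckets = [[] for _ in range(n0)]

    def _h(self, level, key):
        # Assumed hash family: each function has twice the range of the previous.
        return key % (self.n0 * (1 << level))

    def address(self, key):
        b = self._h(self.level, key)
        if b < self.next:
            # This bucket has already been split in the current round.
            b = self._h(self.level + 1, key)
        return b

    def insert(self, key):
        bucket = self.buckets[self.address(key)]
        overflowed = len(bucket) >= self.capacity
        bucket.append(key)  # beyond capacity this models an overflow page
        if overflowed:
            self._split()

    def _split(self):
        # Split the bucket pointed to by next, not necessarily the full one.
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])  # the split image goes at the end of the file
        for key in old:
            self.buckets[self._h(self.level + 1, key)].append(key)
        self.next += 1
        if self.next == self.n0 * (1 << self.level):
            # Every bucket of this level has been split: move to the next level.
            self.level += 1
            self.next = 0

    def search(self, key):
        return key in self.buckets[self.address(key)]


lh = LinearHashFile(n0=4, capacity=3)
for k in [8, 7, 18, 14, 111, 32, 162, 10, 13, 233]:  # arbitrary sample keys
    lh.insert(k)
print(lh.level, lh.next, lh.buckets)
```

In this run the third bucket overflows, yet the first bucket is the one that gets split, which illustrates why Linear Hashing still needs overflow pages.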
Linear Hashing Example • Initially, the index level is 0 and N0 equals 4 (three entries fit on a page). • h0 maps index entries to one of four buckets. • h0 is used and no buckets have been split. • Now consider what happens when 9 (binary 1001) is inserted (it will not fit in the second bucket). • Note that next indicates which bucket is to be split next (round robin). [Figure: four buckets addressed by the h0 values 00, 01, 10, 11.]
Linear Hashing Example 2 • The page indicated by next is split (the first one). • next is incremented. • An overflow page is chained to the primary page to hold the inserted value. • If h0 maps a value to a bucket between zero and next - 1 (just the first page in this case), h1 must be used to insert the new entry. • Note how the new page falls naturally into the sequence as the fifth page. [Figure: five buckets after the split, addressed 000, 01, 10, 11, 100.]
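The rule for choosing between h0 and h1 after the first split can be checked with a tiny worked example. The function name `lh_address` and the hash family key mod (N0 * 2^level) are assumptions, as are the keys 32 and 44, which are picked only because their last two bits are 00.

```python
def lh_address(key, level, next_bucket, n0=4):
    """Bucket number for key, given the current level and the next pointer."""
    b = key % (n0 << level)            # h_level: n0 * 2**level buckets
    if b < next_bucket:                # bucket b was already split this round
        b = key % (n0 << (level + 1))  # so use h_(level+1) instead
    return b


# State after the first split in the example: level 0, next = 1.
for key in (9, 32, 44):
    print(key, format(lh_address(key, level=0, next_bucket=1), "03b"))
```

Key 9 still goes to bucket 01 under h0, while 32 and 44 both map to the split bucket 00 under h0 and are separated by h1 into buckets 000 and 100.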
Linear Hashing Example 3 • Assume inserts of 8, 7, 18, 14, 111, 32, 162, 10, 13, 233. • After the 2nd split the base level is 1 (N1 = 8); use h1. • Subsequent splits will use h2 for inserts between the first bucket and next - 1.
Linear Hashing vs. Extendible Hashing • What is the similarity? • One full round of round-robin splitting in LH has the same effect as one step of directory doubling in EH. • What are the differences? • Directory: EH has directory overhead; LH has none. • Overflow pages: LH uses them; EH does not. • Growth: LH splits pages gradually; EH doubles the directory in one step. • Page allocation: in order in LH; not in order in EH. • Which page is split: LH may split a non-overflowing page; EH splits the overflowing page.
Summary • Hash-based indexes: best for equality searches, cannot support range searches. • Static Hashing can lead to long overflow chains. • Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.) • A directory keeps track of buckets and doubles periodically. • The directory can get large with skewed data; there is additional I/O if it does not fit in main memory. • A skewed data distribution is one in which the hash values of data entries are not uniformly distributed!
Summary (Contd.) • Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages. • Overflow chains are not likely to be long. • Space utilization could be lower than in Extendible Hashing, since splits are not concentrated on `dense' data areas. • The criterion for triggering splits can be tuned to trade off slightly longer chains for better space utilization.