Yet More on Indexes: Hash Tables • Source: our textbook, slides by Hector Garcia-Molina
Main Memory Hash Tables • A hash function h maps search keys to integers in some range 0 to B-1 • B is the number of buckets • There is a B-element array; each entry holds a pointer to a linked list • A record with key k is put in the linked list that starts at entry h(k) of the array
Example of Hash Table • B = 5, h(k) = k mod 5 • Bucket 0: 15, 10 • Bucket 1: (empty) • Bucket 2: 22 • Bucket 3: (empty) • Bucket 4: 104, 29, 34
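As a rough illustration of the main-memory scheme, the sketch below rebuilds the table from the example slide (B = 5, h(k) = k mod 5). The class name `ChainedHashTable`, its methods, and the `record-…` payload strings are made up for this sketch, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Main-memory hash table: a B-element array of chains (hypothetical sketch)."""

    def __init__(self, num_buckets=5):
        self.B = num_buckets
        # One chain per bucket; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(self.B)]

    def h(self, key):
        # Hash function from the example slide: h(k) = k mod B.
        return key % self.B

    def insert(self, key, record=None):
        # The record goes into the chain that starts at entry h(key).
        self.buckets[self.h(key)].append((key, record))

    def lookup(self, key):
        # Only the chain at entry h(key) has to be scanned.
        return [rec for k, rec in self.buckets[self.h(key)] if k == key]


table = ChainedHashTable(num_buckets=5)
for k in (15, 10, 22, 104, 29, 34):
    table.insert(k, record=f"record-{k}")

for b, chain in enumerate(table.buckets):
    print(b, [k for k, _ in chain])
```

The keys 104, 29 and 34 all land in bucket 4, so a lookup for any of them scans a chain of three entries.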
Changes for Secondary Storage • Bucket array contains blocks, not pointers to linked lists • Records that hash to a certain bucket are put in the corresponding block • If a bucket overflows then start a chain of overflow blocks
Insertion into Static Hash Table • To insert a record with key K: • compute h(K) • insert record into one of the blocks in the chain of blocks for bucket number h(K), adding a new block to the chain if necessary
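Example: insertion (2 records/bucket, buckets 0 to 3) • Insert a, b, c, d with h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0 • Result: bucket 0: d • bucket 1: a, c • bucket 2: b • bucket 3: (empty) • Then insert e with h(e) = 1: the block for bucket 1 is full, so e goes into a new overflow block chained to bucket 1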
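The insertion procedure for a static hash table can be sketched as follows. This is only an in-memory model: each block is a Python list, and `StaticHashTable`, `records_per_block` and the lookup table `EXAMPLE_HASH` (which hard-codes the h values given in the example) are names invented for the sketch.

```python
class StaticHashTable:
    """Static hash table whose buckets are chains of fixed-size blocks (sketch)."""

    def __init__(self, num_buckets, records_per_block, hash_fn):
        self.records_per_block = records_per_block
        self.hash_fn = hash_fn
        # chains[b] is the chain of blocks for bucket b;
        # chains[b][0] is the primary block, later entries are overflow blocks.
        self.chains = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.chains[self.hash_fn(key)]
        for block in chain:
            if len(block) < self.records_per_block:
                block.append(key)      # room in an existing block
                return
        chain.append([key])            # every block is full: add an overflow block


# Hash values taken from the example: h(a)=1, h(b)=2, h(c)=1, h(d)=0, h(e)=1.
EXAMPLE_HASH = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}

static_table = StaticHashTable(num_buckets=4, records_per_block=2,
                               hash_fn=EXAMPLE_HASH.__getitem__)
for key in "abcde":
    static_table.insert(key)

print(static_table.chains)
```

After the five insertions, bucket 1 consists of a full primary block plus one overflow block holding e.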
Deletion from a Static Hash Table • To delete records with key K: • Go to the bucket numbered h(K) • Search for records with key K, deleting any that are found • Possibly condense the chain of overflow blocks for that bucket
Example: deletion • [Figure: four buckets, 0 to 3, holding records a, b, c, d, e, f, g] • Delete: e, f • After a deletion, maybe move "g" up
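A minimal sketch of deletion for one bucket, assuming the chain is modelled as a list of blocks (each a list of keys). The function name `delete_key` and the sample chain are hypothetical. Note that this sketch always condenses the chain, whereas the slide only says "possibly".

```python
def delete_key(chain, key, records_per_block):
    """Delete every record with `key` from one bucket's chain of blocks.

    chain: list of blocks (lists of keys); chain[0] is the primary block.
    Returns a new, condensed chain.
    """
    # Search the whole chain, keeping only the records that survive.
    survivors = [k for block in chain for k in block if k != key]
    # Condense: repack the survivors into as few blocks as possible.
    condensed = [survivors[p:p + records_per_block]
                 for p in range(0, len(survivors), records_per_block)]
    # The primary block always stays, even when it ends up empty.
    return condensed or [[]]


chain = [["a", "c"], ["e"]]          # full primary block plus one overflow block
chain = delete_key(chain, "c", records_per_block=2)
print(chain)
```

Here the record in the overflow block moves up into the primary block, so the overflow block disappears.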
Rule of thumb • Try to keep space utilization between 50% and 80% • Utilization = (# records used) / (total # records that fit) • If < 50%, wasting space • If > 80%, overflows significant (depends on how good the hash function is and on # records/bucket)
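The utilization formula is simple arithmetic; a small sketch, with hypothetical function names and with "total # records that fit" taken to be the number of blocks times the records per block:

```python
def utilization(records_stored, num_blocks, records_per_block):
    """Utilization = (# records used) / (total # records that fit)."""
    return records_stored / (num_blocks * records_per_block)


def rule_of_thumb(u):
    """Classify a utilization value against the 50%-80% rule of thumb."""
    if u < 0.5:
        return "wasting space"
    if u > 0.8:
        return "overflows may be significant"
    return "ok"


# 5 records stored in 5 blocks of 2 records each.
u = utilization(records_stored=5, num_blocks=5, records_per_block=2)
print(u, rule_of_thumb(u))
```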
Efficiency of Static Hash Tables • If the hash table size is large enough and the distribution of keys by the hash function is sufficiently "even", then most buckets have no overflow blocks • In this case lookup typically takes one disk I/O and insertion/deletion take two • Significantly better than sequential indexes and B-trees • (But: hash tables do not support efficient range queries as B-trees do) • What if there are long chains of overflow blocks?
How do we cope with growth? • Overflows and reorganizations • Dynamic hashing: extensible, linear
Extensible Hash Tables • Each bucket in the bucket array contains a pointer to a block, instead of a block itself • Bucket array can grow by doubling in size • Certain buckets can share a block if their records together are few enough to fit in it • Hash function computes a sequence of k bits, but only the first i bits are used at any time to index into the bucket array • Value of i can increase (corresponds to bucket array doubling in size)
(b) Use directory • h(K)[i], the first i bits of h(K), indexes the directory • The directory entry points to the bucket
Inserting into Extensible Hash Table • To insert record with key K: • compute h(K) • go to bucket indexed by first i bits of h(K) • follow the pointer to get to block B • if room in B, insert record • else let j be number of bits of hash value used to determine membership in B
Insertion cont'd • Case 1: j < i. • split block B in two • distribute records in B to the 2 new blocks based on value of their (j+1)-st bit • update header of each new block to j+1 • adjust pointers in bucket array so that entries that used to point to B now point to correct block • if still no room in appropriate block for new record then repeat this process
Insertion cont'd • Case 2: j = i. • increment i by 1 • double length of bucket array • entry for w0 and w1 both point to same block that old entry w pointed to (block is shared) • apply case 1 to split block B
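The two insertion cases can be sketched in memory as follows. This is not a disk-based implementation: `Block`, `ExtensibleHashTable`, the identity hash on k-bit integers, and the error raised when all k bits are used up are assumptions of the sketch (a real system would fall back to overflow blocks).

```python
class Block:
    def __init__(self, depth):
        self.depth = depth   # j: number of hash bits that determine membership
        self.keys = []


class ExtensibleHashTable:
    def __init__(self, k=4, capacity=2):
        self.k = k                          # hash values are k bits long
        self.capacity = capacity            # records per block
        self.i = 1                          # bits currently used by the bucket array
        self.directory = [Block(1), Block(1)]

    def h(self, key):
        # Assumption: keys are small integers and the hash is the key itself.
        return key & ((1 << self.k) - 1)

    def _index(self, key):
        return self.h(key) >> (self.k - self.i)   # first i bits of h(key)

    def insert(self, key):
        while True:
            block = self.directory[self._index(key)]
            if len(block.keys) < self.capacity:
                block.keys.append(key)
                return
            if block.depth == self.i:             # Case 2: j = i
                if self.i == self.k:
                    raise OverflowError("all k bits in use; overflow block needed")
                # Double the array: entries w0 and w1 both point where w pointed.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(block)                    # Case 1: j < i (now guaranteed)

    def _split(self, block):
        j = block.depth
        zero, one = Block(j + 1), Block(j + 1)
        for key in block.keys:
            bit = (self.h(key) >> (self.k - (j + 1))) & 1   # the (j+1)-st bit
            (one if bit else zero).keys.append(key)
        # Redirect every array entry that pointed to the old block.
        for idx, b in enumerate(self.directory):
            if b is block:
                bit = (idx >> (self.i - (j + 1))) & 1
                self.directory[idx] = one if bit else zero


ext = ExtensibleHashTable(k=4, capacity=2)
for key in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    ext.insert(key)

for idx, b in enumerate(ext.directory):
    print(format(idx, "03b"), b.depth, [format(x, "04b") for x in b.keys])
```

The loop in `insert` covers the "repeat this process" step: if the new record still does not fit after a split, the next iteration splits again.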
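Example: h(k) is 4 bits; 2 keys/bucket • Start: i = 1; entry 0 points to a block (1 bit used) holding 0001; entry 1 points to a block (1 bit used) holding 1001, 1100 • Insert 1010: the block for entry 1 is full and j = i = 1, so the directory doubles • New directory: i = 2; entries 00 and 01 share the block (1 bit used) holding 0001; entry 10 points to a block (2 bits used) holding 1001, 1010; entry 11 points to a block (2 bits used) holding 1100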
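Example continued • Before: i = 2; entries 00 and 01 share a block (1 bit used) holding 0001; entry 10: 1001, 1010; entry 11: 1100 • Insert 0111: fits in the shared block, which now holds 0001, 0111 • Insert 0000: the shared block is full and j = 1 < i = 2, so it splits without doubling the directory • After: entry 00 points to a block (2 bits used) holding 0000, 0001; entry 01 points to a block (2 bits used) holding 0111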
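Example continued • Before: i = 2; entry 00: 0000, 0001; entry 01: 0111; entry 10: 1001, 1010; entry 11: 1100 (every block uses 2 bits) • Insert 1001: the block for entry 10 is full and j = i = 2, so the directory doubles to i = 3 (entries 000 to 111) • After: entry 100 points to a block (3 bits used) holding 1001, 1001; entry 101 points to a block (3 bits used) holding 1010; the other blocks still use 2 bits and are each shared by two entries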
Extensible hashing: deletion • Two options: • No merging of blocks • Merge blocks and cut directory if possible (reverse of the insert procedure)
Summary: Extensible hashing • (+) Can handle growing files, with less wasted space and with no full reorganizations • (-) Indirection (not bad if directory in memory) • (-) Directory doubles in size (now it fits, now it does not)
Linear Hash Tables • Number of buckets increases more slowly than with extensible hashing • Number of buckets is such that on average each block is x% full (say 80%) -- threshold • Overflow blocks can occur but average number per bucket << 1 • Use the i low-order bits from the result of the hash function to index into the bucket array
Linear hashing • Another dynamic hashing scheme • Two ideas: • (a) Use the i low-order bits of the b-bit hash value (e.g., 01110101); i grows over time • (b) Bucket array grows linearly
Inserting into Linear Hash Table • To insert record with key K, with last i bits of h(K) being a1a2…ai : • Let m be the integer represented by a1a2…ai in binary • If m < n (number of buckets), then bucket m exists -- put record in that bucket • If m ≥ n, then bucket m does not (yet) exist, so put record in bucket whose index corresponds to 0a2…ai
Inserting cont'd • If no room in indicated bucket, then create an overflow block • Compare # records / # buckets to threshold • If exceeds threshold then add a new bucket and rearrange records • If number of buckets exceeds 2^i, then increment i by 1
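The insertion steps for a linear hash table can be sketched as below. Several things are assumptions of the sketch rather than part of the slides: the names `LinearHashTable`, `h` and `max_avg`, the identity hash on small integers, the threshold value 1.7 records per bucket, and the choice to model a bucket (with any overflow blocks) as one Python list.

```python
def h(key):
    # Assumption: keys are small non-negative integers; the hash is the key itself.
    return key


class LinearHashTable:
    def __init__(self, max_avg=1.7):
        self.i = 1                 # number of low-order hash bits in use
        self.n = 2                 # current number of buckets
        self.r = 0                 # current number of records
        self.max_avg = max_avg     # threshold on (# records / # buckets)
        self.buckets = [[], []]    # a list models a bucket plus its overflow blocks

    def bucket_index(self, key):
        m = h(key) & ((1 << self.i) - 1)         # last i bits of h(key)
        if m < self.n:
            return m                             # bucket m exists
        return m - (1 << (self.i - 1))           # use 0a2...ai instead of 1a2...ai

    def insert(self, key):
        self.buckets[self.bucket_index(key)].append(key)
        self.r += 1
        if self.r / self.n > self.max_avg:
            self._add_bucket()

    def _add_bucket(self):
        if self.n == (1 << self.i):
            self.i += 1            # bucket count is about to exceed 2^i
        new_index = self.n                           # binary 1a2...ai
        old_index = new_index - (1 << (self.i - 1))  # binary 0a2...ai
        mask = (1 << self.i) - 1
        old = self.buckets[old_index]
        # Rearrange: records whose last i bits name the new bucket move there.
        self.buckets[old_index] = [k for k in old if (h(k) & mask) != new_index]
        self.buckets.append([k for k in old if (h(k) & mask) == new_index])
        self.n += 1


lin = LinearHashTable(max_avg=1.7)
for key in (0b0000, 0b1010, 0b1111, 0b0101):
    lin.insert(key)

print(lin.i, lin.n, [[format(k, "04b") for k in b] for b in lin.buckets])
```

In this run the fourth insertion pushes the ratio to 4/2 = 2.0, above the assumed threshold, so bucket 10 is added and 1010 moves there from bucket 0. A key ending in 11 is still directed to bucket 01, because bucket 11 does not exist yet.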
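Example: b = 4 bits, i = 2, 2 keys/bucket • Buckets 00 and 01 are in use, m = 01 (max used block); 10 and 11 are future growth buckets • Bucket 00: 0000, 1010 • Bucket 01: 0101, 1111 • Rule: if h(k)[i] ≤ m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1) • Insert 0101: bucket 01 is full, so the record goes into an overflow block (can have overflow chains!)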
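Example continued: b = 4 bits, i = 2, 2 keys/bucket • Start: bucket 00: 0000, 1010; bucket 01: 0101, 1111 plus overflow 0101; m = 01 (max used block) • m advances to 10: bucket 10 comes into use and 1010 moves there from bucket 00 • m advances to 11: bucket 11 comes into use and 1111 moves there from bucket 01, leaving 0101, 0101 in bucket 01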
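Example continued: how to grow beyond this? • With i = 2 and m = 11 (max used block), all of buckets 00, 01, 10, 11 are in use • Increase i to 3: the existing buckets become 000, 001, 010, 011 and the growth buckets are 100, 101, 110, 111 • m then advances to 100, 101, . . . • When bucket 101 comes into use, the 0101 records move there from bucket 001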
Summary: Linear hashing • (+) Can handle growing files, with less wasted space and with no full reorganizations • (+) No indirection like extensible hashing • (-) Can still have overflow chains
Comparing Index Approaches • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5
Indexing vs Hashing • Sequential indexes and B-trees good for range searches: e.g., SELECT … FROM R WHERE R.A > 5
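The contrast between the two query shapes can be imitated in memory. In this sketch a Python dict plays the role of a hash index on R.A and a sorted list searched with `bisect` plays the role of an ordered index; the sample rows and the function names are invented for illustration.

```python
import bisect

# Hypothetical relation R with attribute A and a payload column.
rows = [(3, "c"), (5, "e"), (7, "g"), (9, "i")]

# Hash-style index: good for equality probes (WHERE R.A = 5).
hash_index = {}
for a, payload in rows:
    hash_index.setdefault(a, []).append(payload)

# Ordered index: good for range searches (WHERE R.A > 5).
sorted_keys = sorted(a for a, _ in rows)


def probe(a):
    """Equality probe: one hash lookup, no scan."""
    return hash_index.get(a, [])


def greater_than(a):
    """Range search: binary search for the start, then read keys in order."""
    return sorted_keys[bisect.bisect_right(sorted_keys, a):]


print(probe(5))
print(greater_than(5))
```

Answering the range query from `hash_index` alone would mean examining every key, since hashing does not preserve key order.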
Index definition in SQL • CREATE INDEX name ON rel (attr) • CREATE UNIQUE INDEX name ON rel (attr) (defines a candidate key) • DROP INDEX name
Note: CANNOT SPECIFY TYPE OF INDEX (e.g., B-tree, hashing, …) OR PARAMETERS (e.g., load factor, size of hash, …), at least in SQL
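To try the three statements, one option is Python's built-in `sqlite3` module with an in-memory database. SQLite is used here only as a convenient engine and is not mentioned in the slides; the table `R`, its columns, and the index names `r_a_idx` and `r_a_unique` are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(3, "x"), (5, "y"), (7, "z")])

# Plain index and unique index on the same attribute.
conn.execute("CREATE INDEX r_a_idx ON R (A)")
conn.execute("CREATE UNIQUE INDEX r_a_unique ON R (A)")

# The unique index makes A behave like a candidate key: duplicates are rejected.
try:
    conn.execute("INSERT INTO R VALUES (5, 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX r_a_idx")

# List the indexes that are left in the schema.
remaining = [row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")]
print(duplicate_rejected, remaining)
```

None of these statements names an index type or a load factor, which matches the note above.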