Yet More on Indexes

Hash Tables • Source: our textbook, slides by Hector Garcia-Molina

Main Memory Hash Tables • A hash function h maps search keys to integers in some range 0 to B-1 • B is the number of buckets • There is a B-element bucket array; each entry holds a pointer to a linked list • A record with key k is put in the linked list that starts at entry h(k) of the bucket array

Example of Hash Table • B = 5, h(k) = k mod 5 • Bucket 0: 15, 10 • Bucket 1: empty • Bucket 2: 22 • Bucket 3: empty • Bucket 4: 104, 29, 34
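The main-memory structure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the textbook or the slides: the class names `Node` and `ChainedHashTable` and the choice to prepend new records to a bucket's list are assumptions made for the example. The keys and h(k) = k mod 5 come from the example slide.

```python
class Node:
    """One cell of a bucket's linked list (hypothetical helper)."""
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, num_buckets=5):
        self.B = num_buckets
        # B-element array; each entry is the head of a linked list (or None).
        self.buckets = [None] * num_buckets

    def h(self, key):
        return key % self.B

    def insert(self, key):
        idx = self.h(key)
        # Assumption: new records are prepended to the list at entry h(key).
        self.buckets[idx] = Node(key, self.buckets[idx])

    def lookup(self, key):
        node = self.buckets[self.h(key)]
        while node is not None:
            if node.key == key:
                return node
            node = node.next
        return None

    def keys_in_bucket(self, idx):
        keys, node = [], self.buckets[idx]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


table = ChainedHashTable(5)
for k in [15, 10, 22, 104, 29, 34]:
    table.insert(k)
```

A lookup only walks the one list selected by h(k), which is why an even distribution of keys over buckets matters.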

Changes for Secondary Storage • Bucket array contains blocks, not pointers to linked lists • Records that hash to a certain bucket are put in the corresponding block • If a bucket overflows then start a chain of overflow blocks

Insertion into Static Hash Table • To insert a record with key K: • compute h(K) • insert record into one of the blocks in the chain of blocks for bucket number h(K), adding a new block to the chain if necessary

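Example: 2 records/bucket, buckets 0 to 3 • Insert a, b, c, d with h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0: bucket 0 holds d, bucket 1 holds a and c, bucket 2 holds b, bucket 3 is empty • Insert e with h(e) = 1: bucket 1 is full, so e goes into a new overflow block chained to bucket 1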
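The insertion procedure for a static hash table on secondary storage might look like the following sketch. Each bucket is modelled as a Python list of blocks (the primary block followed by any overflow blocks); the class name `StaticHashTable` and this in-memory representation are assumptions for illustration, and disk I/O is not modelled. The hash values are the ones given in the example.

```python
class StaticHashTable:
    def __init__(self, num_buckets, hash_fn, capacity=2):
        self.hash_fn = hash_fn
        self.capacity = capacity  # records per block
        # Each bucket is a chain of blocks; it starts with one empty primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.buckets[self.hash_fn(key)]
        for block in chain:
            if len(block) < self.capacity:
                block.append(key)
                return
        # Every block in the chain is full: add a new overflow block.
        chain.append([key])


# Hash values taken from the example slide.
H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}

table = StaticHashTable(num_buckets=4, hash_fn=H.get, capacity=2)
for key in ["a", "b", "c", "d", "e"]:
    table.insert(key)
```

After these five inserts, bucket 1 consists of a full primary block and one overflow block holding e.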

Deletion from a Static Hash Table • To delete records with key K: • Go to the bucket numbered h(K) • Search for records with key K, deleting any that are found • Possibly condense the chain of overflow blocks for that bucket
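Deletion can be sketched in the same style. The function below assumes buckets are lists of blocks and condenses a chain by repacking the surviving records into as few blocks as possible. That repacking policy is only one possible way to "condense" and is an assumption here, as are the function name `delete` and the sample data.

```python
def delete(buckets, hash_fn, key, capacity=2):
    """Delete every record with `key`, then condense that bucket's chain."""
    idx = hash_fn(key)
    # Collect the records that survive, in their original order.
    survivors = [r for block in buckets[idx] for r in block if r != key]
    # Repack into as few blocks as possible (assumed condensing policy).
    chain = [survivors[i:i + capacity]
             for i in range(0, len(survivors), capacity)]
    # Always keep the primary block, even when it ends up empty.
    buckets[idx] = chain or [[]]


# Hypothetical table: 4 buckets, 2 records per block, bucket 1 has an overflow block.
H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
buckets = [[["d"]], [["a", "c"], ["e"]], [["b"]], [[]]]

delete(buckets, H.get, "c")  # e moves up; the overflow block disappears
delete(buckets, H.get, "b")  # bucket 2 keeps an empty primary block
```

A real implementation would weigh the extra I/O of condensing against the space it recovers.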

Example: deletion • Figure: buckets 0 to 3 holding records a, b, c, d, e, f, g • Delete: e, f • Maybe move "g" up

Rule of thumb: • Try to keep space utilization between 50% and 80% • Utilization = (# records used) / (total # records that fit) • If < 50%, wasting space • If > 80%, overflows significant; how significant depends on how good the hash function is and on # records/bucket
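As a small worked example of the rule of thumb, the sketch below computes utilization and classifies it. It assumes that "total # records that fit" counts only the primary blocks (number of buckets times records per block); the function names and the sample numbers are hypothetical.

```python
def utilization(records_used, num_buckets, records_per_bucket):
    # Assumption: capacity counts primary blocks only, not overflow blocks.
    return records_used / (num_buckets * records_per_bucket)


def assess(u, low=0.5, high=0.8):
    if u < low:
        return "wasting space"
    if u > high:
        return "overflows significant"
    return "within rule of thumb"


u = utilization(records_used=5, num_buckets=4, records_per_bucket=2)
verdict = assess(u)
```

Here 5 records in 4 buckets of 2 records each give a utilization of 5/8, inside the 50% to 80% band.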

Efficiency of Static Hash Tables • If the hash table size is large enough and the distribution of keys by the hash function is sufficiently "even", then most buckets have no overflow blocks • In this case lookup typically takes one disk I/O, and insertion and deletion take two • Significantly better than sequential indexes and B-trees • (But: hash tables do not support efficient range queries as B-trees do) • What if there are long chains of overflow blocks?

How do we cope with growth? • Overflows and reorganizations • Dynamic hashing: extensible, linear

Extensible Hash Tables • Each bucket in the bucket array contains a pointer to a block, instead of a block itself • Bucket array can grow by doubling in size • Certain buckets can share a block if they hold few enough records • Hash function computes a sequence of k bits, but only the first i bits are used at any time to index into the bucket array • Value of i can increase (corresponds to bucket array doubling in size)

(b) Use a directory: h(K)[i], the first i bits of h(K), selects a directory entry, which points to the bucket

Inserting into Extensible Hash Table • To insert record with key K: • compute h(K) • go to bucket indexed by first i bits of h(K) • follow the pointer to get to block B • if room in B, insert record • else let j be number of bits of hash value used to determine membership in B

Insertion cont'd • Case 1: j < i. • split block B in two • distribute records in B to the 2 new blocks based on value of their (j+1)-st bit • update header of each new block to j+1 • adjust pointers in bucket array so that entries that used to point to B now point to correct block • if still no room in appropriate block for new record then repeat this process

Insertion cont'd • Case 2: j = i. • increment i by 1 • double length of bucket array • entry for w0 and w1 both point to same block that old entry w pointed to (block is shared) • apply case 1 to split block B
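The two insertion cases can be put together in a short sketch. It is an illustration under stated assumptions rather than the textbook's code: records are represented by their k-bit hash values written as strings, the names `Block` and `ExtensibleHashTable` are hypothetical, and raising `OverflowError` when a block already uses all k bits is a simplification (a real table would add an overflow block).

```python
class Block:
    def __init__(self, depth):
        self.depth = depth   # j: number of hash bits that determine membership
        self.records = []    # records, represented by their k-bit hash strings


class ExtensibleHashTable:
    def __init__(self, k=4, capacity=2):
        self.k = k
        self.capacity = capacity
        self.i = 1
        self.directory = [Block(1), Block(1)]  # 2**i pointers to blocks

    def insert(self, bits):
        while True:
            block = self.directory[int(bits[:self.i], 2)]  # first i bits
            if len(block.records) < self.capacity:
                block.records.append(bits)
                return
            if block.depth == self.k:
                # Simplification: no overflow blocks in this sketch.
                raise OverflowError("all k bits already used for this block")
            if block.depth == self.i:
                # Case 2: double the directory; entries w0 and w1 share w's block.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(block)  # Case 1, repeated by the loop if still no room

    def _split(self, block):
        j = block.depth
        halves = (Block(j + 1), Block(j + 1))
        for r in block.records:
            halves[int(r[j])].records.append(r)  # r[j] is the (j+1)-st bit
        shift = self.i - (j + 1)
        for idx, b in enumerate(self.directory):
            if b is block:
                # The (j+1)-st bit of the directory index picks the new block.
                self.directory[idx] = halves[(idx >> shift) & 1]


table = ExtensibleHashTable(k=4, capacity=2)
for bits in ["0001", "1001", "1100", "1010", "0111", "0000", "1001"]:
    table.insert(bits)
```

With 4-bit hash values and 2 records per block, this insertion sequence doubles the directory twice and ends with i = 3.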

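Example: h(k) is 4 bits; 2 keys/bucket • Start with i = 1: entry 0 points to a block (header 1) holding 0001; entry 1 points to a block (header 1) holding 1001, 1100 • Insert 1010: the block for entry 1 is full and j = i, so the directory doubles • New directory, i = 2, entries 00, 01, 10, 11: entries 00 and 01 share the block (header 1) holding 0001; entry 10 points to a block (header 2) holding 1001, 1010; entry 11 points to a block (header 2) holding 1100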

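Example continued • i = 2, directory entries 00, 01, 10, 11 • Insert 0111: it joins 0001 in the shared block (header 1) • Insert 0000: that block is full and j < i, so it splits without doubling the directory • Result: entry 00 points to a block (header 2) holding 0000, 0001; entry 01 points to a block (header 2) holding 0111; entry 10 still points to 1001, 1010 and entry 11 to 1100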

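Example continued • Insert 1001 with i = 2: the block for entry 10 (header 2, holding 1001, 1010) is full and j = i, so the directory doubles • New directory, i = 3, entries 000 to 111: entries 000 and 001 share the block (header 2) holding 0000, 0001; entries 010 and 011 share the block (header 2) holding 0111; entry 100 points to a block (header 3) holding 1001, 1001; entry 101 points to a block (header 3) holding 1010; entries 110 and 111 share the block (header 2) holding 1100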

Extensible hashing: deletion • Two options: • No merging of blocks • Merge blocks and cut directory if possible (reverse of the insert procedure)

Summary: Extensible hashing • (+) Can handle growing files, with less wasted space and with no full reorganizations • (-) Indirection (not bad if directory in memory) • (-) Directory doubles in size (now it fits, now it does not)

Linear Hash Tables • Number of buckets increases more slowly than with extensible hashing • Number of buckets is such that on average each block is x% full (say 80%) -- threshold • Overflow blocks can occur but average number per bucket << 1 • Use the i low-order bits from the result of the hash function to index into the bucket array

Linear hashing • Another dynamic hashing scheme • Two ideas: (a) Use the i low-order bits of the b-bit hash value (e.g., 01110101); i grows over time (b) Bucket array grows linearly

Inserting into Linear Hash Table • To insert record with key K, with last i bits of h(K) being a1a2…ai : • Let m be the integer represented by a1a2…ai in binary • If m < n (number of buckets), then bucket m exists -- put record in that bucket • If m ≥ n, then bucket m does not (yet) exist, so put record in bucket whose index corresponds to 0a2…ai

Inserting cont'd • If no room in indicated bucket, then create an overflow block for it • Compare # records / # buckets to threshold • If it exceeds the threshold, then add a new bucket and rearrange records • If number of buckets exceeds 2^i, then increment i by 1
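The linear hashing insertion steps can be sketched as follows. Several details are assumptions made for illustration: the class name `LinearHashTable`, the starting state (i = 1, two buckets), and a threshold expressed as a maximum average number of records per bucket with a hypothetical default of 1.7. Records are represented by their integer hash values, and a bucket's list may grow past one block's worth of records, which stands in for overflow blocks.

```python
class LinearHashTable:
    def __init__(self, max_avg=1.7):
        self.max_avg = max_avg   # assumed threshold on # records / # buckets
        self.i = 1               # number of low-order bits in use
        self.n = 2               # current number of buckets
        self.r = 0               # current number of records
        self.buckets = [[], []]

    def _bucket_index(self, h):
        m = h & ((1 << self.i) - 1)   # integer given by the i low-order bits
        if m >= self.n:               # bucket m does not exist yet
            m -= 1 << (self.i - 1)    # use 0a2...ai instead
        return m

    def insert(self, h):
        self.buckets[self._bucket_index(h)].append(h)
        self.r += 1
        if self.r / self.n > self.max_avg:
            self._add_bucket()

    def _add_bucket(self):
        if self.n == 1 << self.i:     # the new bucket would exceed 2**i
            self.i += 1
        new = self.n                       # new bucket 1a2...ai
        old = new - (1 << (self.i - 1))    # bucket 0a2...ai gets split
        self.buckets.append([])
        self.n += 1
        records, self.buckets[old] = self.buckets[old], []
        for h in records:                  # rearrange the split bucket's records
            self.buckets[self._bucket_index(h)].append(h)


table = LinearHashTable()
for h in [0b0000, 0b1010, 0b0101, 0b1111, 0b0101]:
    table.insert(h)
```

Only one bucket is split per growth step, which is why the bucket array grows linearly rather than by doubling.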

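Example: b = 4 bits, i = 2, 2 keys/bucket • Buckets 00 and 01 are in use; 10 and 11 are future growth buckets; m = 01 (max used block) • Bucket 00 holds 0000, 1010; bucket 01 holds 0101, 1111 • Rule: if h(k)[i] ≤ m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1) • Insert 0101: bucket 01 is full, so the record goes into an overflow block (can have overflow chains!)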

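Example continued: b = 4 bits, i = 2, 2 keys/bucket • After inserting 0101, the table grows beyond m = 01 • Bucket 10 comes into use and 1010 moves there from bucket 00 • Bucket 11 comes into use and 1111 moves there from bucket 01 • Buckets now hold 00: 0000; 01: 0101, 0101; 10: 1010; 11: 1111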

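Example continued: how to grow beyond this? • With i = 2 and m = 11 (max used block), all of buckets 00, 01, 10, 11 are in use • Increase i to 3: the existing buckets become 000, 001, 010, 011, and 100, 101, 110, 111 are future growth buckets • As buckets 100 and 101 come into use, the 0101 records move from bucket 001 to bucket 101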

Summary: Linear hashing • (+) Can handle growing files, with less wasted space and with no full reorganizations • (+) No indirection like extensible hashing • (-) Can still have overflow chains

Comparing Index Approaches • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5

Indexing vs Hashing • Sequential indexes and B-trees good for range searches, e.g., SELECT … FROM R WHERE R.A > 5
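The contrast between the two kinds of query can be illustrated with in-memory stand-ins: a Python dict plays the role of a hash index and a sorted list with binary search plays the role of a sequential index or B-tree. The sample rows and function names are hypothetical, and this sketch says nothing about disk I/O costs.

```python
import bisect

# Hypothetical rows of R as (A, payload) pairs.
rows = [(3, "c"), (9, "i"), (5, "e"), (7, "g"), (1, "a")]

# Hash-style index: direct probe on an exact key.
hash_index = {}
for a, payload in rows:
    hash_index.setdefault(a, []).append(payload)

# Ordered index: keys kept sorted, so a range is one contiguous slice.
sorted_rows = sorted(rows)
sorted_keys = [a for a, _ in sorted_rows]


def probe(a):
    """WHERE R.A = a, answered by the hash-style index."""
    return hash_index.get(a, [])


def range_greater_than(a):
    """WHERE R.A > a, answered by the ordered index."""
    start = bisect.bisect_right(sorted_keys, a)
    return [payload for _, payload in sorted_rows[start:]]


equal_5 = probe(5)
greater_5 = range_greater_than(5)
```

Answering the range query from the dict alone would require examining every key, since hashing scatters neighboring key values across buckets.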

Index definition in SQL • CREATE INDEX name ON rel (attr) • CREATE UNIQUE INDEX name ON rel (attr) (defines candidate key) • DROP INDEX name

Note CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL...
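These statements can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The table `R`, its columns, and the index names are hypothetical, and SQLite is just one system that accepts this syntax; other systems may offer vendor-specific extensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")

# Plain index and unique index, following the statement forms above.
conn.execute("CREATE INDEX r_a_idx ON R (A)")
conn.execute("CREATE UNIQUE INDEX r_b_idx ON R (B)")

conn.executemany("INSERT INTO R VALUES (?, ?)", [(5, "x"), (5, "y")])

# The unique index rejects a second row with the same B value.
try:
    conn.execute("INSERT INTO R VALUES (6, 'x')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX r_a_idx")
remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

Duplicate values of A are accepted because only the index on B is declared unique.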
