380 likes | 1.51k Views
Principles of Database Management Systems 4.2: Hashing Techniques. Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom). Hashing?. Locating the storage block of a record by the hash value h(k) of its key k Normally really fast
E N D
Principles of Database Management Systems4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals by Hector Garcia-Molina, Jeff Ullman and Jennifer Widom) Notes 4.2: Hashing
Hashing? • Locating the storage block of a record by the hash valueh(k) of its key k • Normally really fast • records (often) located by a single disk access Notes 4.2: Hashing
Hashing key h(key) <key> Buckets (typically 1 disk block) . . . Notes 4.2: Hashing
(1) Hash value determines the storage block directly Two alternatives . . . records key h(key) . . . • to implement a primary index Notes 4.2: Hashing
Two alternatives (2) Records located indirectly via index buckets record key h(key) key 1 Index • for a secondary index Notes 4.2: Hashing
Example hash function • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h = (x1 + x2 + … + xn) mod b {0, 1, …, b-1} Notes 4.2: Hashing
Good hash Expected number of function: keys/bucket is the same for all buckets This may not be best function … Read Knuth Vol. 3 if you really need to select a good function. Notes 4.2: Hashing
Next: example to illustrate inserts, overflows, deletes h(K) Notes 4.2: Hashing
d a e c b EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 3 h(e) = 1 Notes 4.2: Hashing
d maybe move “g” up EXAMPLE: deletion Delete:ef 0 1 2 3 a b d c c e f g Notes 4.2: Hashing
If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit Notes 4.2: Hashing
Extensible • Linear • also others ... How do we cope with growth? • Overflows and reorganizations • Dynamic hashing: # of buckets may vary Notes 4.2: Hashing
Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(K) use i grows over time…. For example, b=32 00110101 Notes 4.2: Hashing
(b) Use directory h(K)[i ] to bucket . . . . . . Directory contains 2i pointers to buckets, and stores i. Each bucket stores j, indicating #bits used for placing the records in this block (j i) Notes 4.2: Hashing
Extensible Hashing: Insertion • If there's room in bucket h(k)[i], place record there; Otherwise … • If j=i, set i=i+1 and double the directory • If j<i, split the block in two, distribute records among them now using j+1 bits of h(k); (Repeat until some records end up in the new bucket); Update pointers of bucket array • See the next example Notes 4.2: Hashing
i = 2 00 01 10 11 1 1 2 1010 New directory 2 1100 Example: h(k) is 4 bits; 2 keys/block (j) 1 0001 i = 1 1001 1100 Insert 1010 Notes 4.2: Hashing
2 0000 0001 2 0111 2 2 Example continued i = 2 00 01 10 11 1 0001 0111 1001 1010 Insert: 0111 0000 1100 Notes 4.2: Hashing
i = 3 000 001 010 011 100 101 110 111 3 1001 1001 2 1001 1010 3 1010 1100 2 Example continued 0000 2 0001 i = 2 00 01 10 11 0111 2 Insert: 1001 Notes 4.2: Hashing
Extensible hashing: deletion • Reverse insert procedure … • Example: Walk thru insert example in reverse! Notes 4.2: Hashing
Indirection (Not bad if directory in memory) Directory doubles in size (First it fits in memory, then it does not -> sudden performance degradation) - - Summary Extensible hashing + Can handle growing files - without full reorganizations Only one data block examined + Notes 4.2: Hashing
Two ideas: b (a) Use ilow order bits of hash 01110101 grows i (b) File grows linearly Linear hashing: grow # of buckets by one -> No bucket directory needed Notes 4.2: Hashing
Linear Hashing: Parameters • n: number of buckets in use • buckets numbered 0…n-1 • i: number of bits of h(k) used to address buckets • r: number of records in hash table • ratio r/n limited to fit an avg bucket in a block • next example: r 1.7n, and block holds 2 records=> AVG bucket occupancy is 1.7/2 = 0.85 of a block Notes 4.2: Hashing
If h(k)[i ] = (a1 … ai)2< n, then look at bucket h(k)[i ]; else look at bucket h(k)[i ]- 2i -1 = (0a2 … ai)2 Rule Example:2 keys/block, b=4 bits, n=2, i =1 • insert 0101 00 01 • now r=4 >1.7n -> get new bucket 10and distribute keys btw buckets 00 and 10 0000 0101 1010 1111 Notes 4.2: Hashing
1010 -> n=3, i =2; distribute keys btw buckets 00 and 10: 000110 0000 0101 1010 1111 Notes 4.2: Hashing
0001 n=3, i =2; insert 0001: • can have overflow chains! 000110 0000 0101 1010 1111 Notes 4.2: Hashing
n=3, i =2 0001 • insert 0111 0111 000110 • bucket 11 not in use-> redirect to 01 0000 0101 1010 1111 • now r=6 > 1.7n-> get new bucket 11 Notes 4.2: Hashing
1111 0111 0001 n=4, i =2; distribute keys btw 01 and 11 0001 0111 00011011 0000 0101 1010 1111 Notes 4.2: Hashing
3 0101 0101 101 100 0 0 0 0 101 110 111 100 101 Example Continued:How to grow beyond this? i = 2 00 01 10 11 0000 0101 1010 1111 0101 . . . m = 11 (max used block) Notes 4.2: Hashing
Summary Linear Hashing + Can handle growing files - without full reorganizations No indirection directory of extensible hashing Can have overflow chains - but probability of long chains can be kept low by controlling the r/n fill ratio (?) + - Notes 4.2: Hashing
Summary Hashing - How it works - Dynamic hashing - Extensible - Linear Notes 4.2: Hashing
Next: • Indexing vs Hashing • Index definition in SQL Notes 4.2: Hashing
Indexing vs Hashing • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Notes 4.2: Hashing
Indexing vs Hashing • INDEXING (Including B-Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 Notes 4.2: Hashing
Index definition in SQL • Createindex name on rel (attr) • Createuniqueindex name on rel (attr) defines candidate key • Drop INDEX name Notes 4.2: Hashing
Note CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL … Oracle and IBM DB2 UDB provide a PCTFREE clause to inditate the proportion of B-tree blocks initially left unfilled Oracle: “Hash clusters” with built-in or DBA-specified hash function Notes 4.2: Hashing
The BIG picture…. • Chapters 2 & 3: Storage, records, blocks... • Chapter 4: Access Mechanisms - Indexes - B trees - Hashing • Chapters 6 & 7: Query Processing NEXT Notes 4.2: Hashing