620 likes | 753 Views
CS 277: Database System Implementation. Notes 5: Hashing and More. Arthur Keller. Hashing. key h(key). <key>. Buckets (typically 1 disk block). Two alternatives. records. (1) key h(key). Two alternatives. record. (2) key h(key). key 1. Index.
E N D
CS 277: Database System Implementation Notes 5: Hashing and More Arthur Keller Notes 5
Hashing key h(key) <key> Buckets (typically 1 disk block) . . . Notes 5
Two alternatives . . . records (1) key h(key) . . . Notes 5
Two alternatives record (2) key h(key) key 1 Index • Alt (2) for “secondary” search key Notes 5
Example hash function • Key = ‘x1 x2 … xn’ n byte character string • Have b buckets • h: add x1 + x2 + ….. xn • compute sum modulo b Notes 5
This may not be best function … Read Knuth Vol. 3 if you really need to select a good function. Good hash Expected number of function: keys/bucket is the same for all buckets Notes 5
Within a bucket: • Do we keep keys sorted? • Yes, if CPU time critical & Inserts/Deletes not too frequent Notes 5
Next: example to illustrate inserts, overflows, deletes h(K) Notes 5
d a e c b EXAMPLE 2 records/bucket INSERT: h(a) = 1 h(b) = 2 h(c) = 1 h(d) = 0 0 1 2 3 h(e) = 1 Notes 5
d maybe move “g” up EXAMPLE: deletion Delete:ef 0 1 2 3 a b d c c e f g Notes 5
If < 50%, wasting space • If > 80%, overflows significant depends on how good hash function is & on # keys/bucket Rule of thumb: • Try to keep space utilization between 50% and 80% Utilization = # keys used total # keys that fit Notes 5
Extensible • Linear How do we cope with growth? • Overflows and reorganizations • Dynamic hashing Notes 5
Extensible hashing: two ideas (a) Use i of b bits output by hash function b h(K) use i grows over time…. 00110101 Notes 5
(b) Use directory h(K)[i ] to bucket . . . . . . Notes 5
i = 2 00 01 10 11 1 1 2 1010 New directory 2 1100 Example: h(k) is 4 bits; 2 keys/bucket 1 0001 i = 1 1001 1100 Insert 1010 Notes 5
2 0000 0001 2 0111 2 2 Example continued i = 2 00 01 10 11 1 0001 0111 1001 1010 Insert: 0111 0000 1100 Notes 5
i = 3 000 001 010 011 100 101 110 111 3 1001 1001 2 1001 1010 3 1010 1100 2 Example continued 0000 2 0001 i = 2 00 01 10 11 0111 2 Insert: 1001 Notes 5
Extensible hashing: deletion • No merging of blocks • Merge blocks and cut directory if possible (Reverse insert procedure) Notes 5
Deletion example: • Run thru insert example in reverse! Notes 5
Indirection (Not bad if directory in memory) Directory doubles in size (Now it fits, now it does not) - - Summary Extensible hashing + Can handle growing files - with less wasted space - with no full reorganizations Notes 5
Two ideas: b (a) Use ilow order bits of hash 01110101 grows i (b) File grows linearly Linear hashing • Another dynamic hashing scheme Notes 5
0101 • can have overflow chains! If h(k)[i ] m, then look at bucket h(k)[i ] else, look at bucket h(k)[i ]- 2i -1 Rule Exampleb=4 bits, i =2, 2 keys/bucket • insert 0101 00 01 10 11 Future growth buckets 0000 0101 1010 1111 m = 01 (max used block) Notes 5
0101 • insert 0101 1010 1111 0101 10 11 Exampleb=4 bits, i =2, 2 keys/bucket 00 01 10 11 Future growth buckets 0000 0101 1010 1111 m = 01 (max used block) Notes 5
3 0101 0101 101 100 0 0 0 0 100 101 110 111 100 101 Example Continued:How to grow beyond this? i = 2 00 01 10 11 0000 0101 1010 1111 0101 . . . m = 11 (max used block) Notes 5
When do we expand file? • If U > threshold then increase m (and maybe i ) • Keep track of: # used slots total # of slots = U Notes 5
Can still have overflow chains - Summary Linear Hashing + Can handle growing files - with less wasted space - with no full reorganizations No indirection like extensible hashing + Notes 5
Example: BAD CASE Very full Very empty Need to move m here… Would waste space... Notes 5
Summary Hashing - How it works - Dynamic hashing - Extensible - Linear Notes 5
Next: • Indexing vs Hashing • Index definition in SQL • Multiple key access Notes 5
Indexing vs Hashing • Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Notes 5
Indexing vs Hashing • INDEXING (Including B Trees) good for Range Searches: e.g., SELECT FROM R WHERE R.A > 5 Notes 5
Index definition in SQL • Createindex name on rel (attr) • Createuniqueindex name on rel (attr) defines candidate key • Drop INDEX name Notes 5
Note CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...) ... at least in SQL... Notes 5
Note ATTRIBUTE LIST MULTIKEY INDEX (next) e.g., CREATEINDEX foo ON R(A,B,C) Notes 5
Multi-key Index Motivation: Find records where DEPT = “Toy” AND SAL > 50k Notes 5
Strategy I: • Use one index, say Dept. • Get all Dept = “Toy” records and check their salary I1 Notes 5
Strategy II: • Use 2 Indexes; Manipulate Pointers Toy Sal > 50k Notes 5
Strategy III: • Multiple Key Index One idea: I2 I3 I1 Notes 5
Art Sales Toy Example 10k 15k Example Record Dept Index Salary Index 17k 21k Name=Joe DEPT=Sales SAL=15k 12k 15k 15k 19k Notes 5
For which queries is this index good? Find RECs Dept = “Sales” SAL=20k Find RECs Dept = “Sales” SAL > 20k Find RECs Dept = “Sales” Find RECs SAL = 20k Notes 5
Interesting application: • Geographic Data DATA: <X1,Y1, Attributes> <X2,Y2, Attributes> y x . . . Notes 5
Queries: • What city is at <Xi,Yi>? • What is within 5 miles from <Xi,Yi>? • Which is closest point to <Xi,Yi>? Notes 5
40 30 10 20 20 m g b n h o d e a k c f j l i 10 25 15 35 20 10 20 5 d e h i g f c a b 15 15 • Search points near f • Search points near b n o j k m l Example Notes 5
Queries • Find points with Yi > 20 • Find points with Xi < 5 • Find points “close” to i = <12,38> • Find points “close” to b = <7,24> Notes 5
Many types of geographic index structures have been suggested • Quad Trees • R Trees Notes 5
Two more types of multi key indexes • Grid • Partitioned hash Notes 5
Grid Index Key 2 X1 X2 …… Xn V1 V2 Key 1 Vn To records with key1=V3, key2=X2 Notes 5
CLAIM • Can quickly find records with • key 1 = Vi Key 2 = Xj • key 1 = Vi • key 2 = Xj • And also ranges…. • E.g., key 1 Vi key 2 < Xj Notes 5
V1 V2 V3 X1 X1 X1 Like Array... X2 X2 X2 X3 X3 X3 X4 X4 X4 • How is Grid Index stored on disk? But there is a catch with Grid Indexes! Problem: • Need regularity so we can compute position of <Vi,Xj> entry Notes 5
Solution: Use Indirection X1 X2 X3 Buckets V1 V2 V3 *Grid only V4 contains pointers to buckets Buckets -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Notes 5