Hashing (Ch. 14)
• Goal: implement a symbol table or dictionary (insert, delete, search)
• What if you don't need ordered keys (pred, succ, sort, select)?
• Are O(log n) comparisons necessary? (No.)
• Hashing, the basic plan:
  • create a big array for the items to be stored
  • use a function to compute the storage location from the key (a hash function)
  • have a collision resolution scheme for when two keys map to the same location
Hashing Example
• A simple hash function:
  • treat the key as a large integer K
  • h(K) = K mod M, where M is the table size
  • let M be a prime number
• Example (a code sketch follows below):
  • suppose we have 101 buckets in the hash table
  • 'abcd' in hex is 0x61626364; converted to decimal it's 1633837924
  • 1633837924 % 101 = 11, so h('abcd') = 11: store the key at location 11
  • 'dcba' hashes to 57
  • 'abbc' also hashes to 57: a collision. What to do?
• With billions of possible keys and hundreds of buckets, lots of collisions are inevitable!
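A minimal C sketch of the modular hash above, assuming 4-character keys packed into a 32-bit integer as in the example (the name modhash and the packing loop are illustrative):

#include <stdio.h>

/* Treat a 4-character key as a 32-bit integer, then reduce mod M. */
unsigned modhash(const char *key, unsigned M) {
    unsigned k = 0;
    for (int i = 0; i < 4; i++)
        k = (k << 8) | (unsigned char)key[i];  /* "abcd" -> 0x61626364 */
    return k % M;
}

int main(void) {
    printf("%u\n", modhash("abcd", 101));  /* 11 */
    printf("%u\n", modhash("dcba", 101));  /* 57 */
    printf("%u\n", modhash("abbc", 101));  /* 57: collision with "dcba" */
    return 0;
}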
Hashing Strings
• h('aVeryLongVariableName')?
• Instead of dealing with very large numbers, use Horner's method, reducing mod M after each character (radix 256; character codes 'a' = 97, 'V' = 86, 'e' = 101, 'r' = 114):
  • 256 * 97 + 86 = 24918; 24918 % 101 = 72
  • 256 * 72 + 101 = 18533; 18533 % 101 = 50
  • 256 * 50 + 114 = 12914; 12914 % 101 = 87
  • ...
• Scramble by replacing the radix 256 with 117:

int hash(char *v, int M) {
    int h, a = 117;
    for (h = 0; *v != '\0'; v++)
        h = (a*h + *v) % M;   /* Horner step: fold in one character, reduce mod M */
    return h;
}
Collisions
• How likely are collisions?
• Birthday paradox: with M buckets, expect the first collision after about sqrt(πM/2) ≈ 1.25 sqrt(M) random insertions:

  M        1.25 sqrt(M)
  100      12
  1000     40
  10000    125

  [1.25 sqrt(365) ≈ 24: the classic birthday paradox]
• Experiment: generate random numbers in 0..100 (a sketch follows below):
  84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61 ...
  Collision at the 13th number, as predicted.
• What to do about collisions?
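A throwaway C program to reproduce the experiment; the range 0..100 matches the slide, and the use of rand() for the draws is illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Draw random values in 0..100 until one repeats, and report how many
   draws it took (expect about 1.25 * sqrt(101), i.e. 12-13 on average). */
int main(void) {
    int seen[101] = {0};
    srand((unsigned)time(NULL));
    for (int i = 1; ; i++) {
        int r = rand() % 101;
        if (seen[r]) {
            printf("collision at draw %d (value %d)\n", i, r);
            break;
        }
        seen[r] = 1;
    }
    return 0;
}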
Separate Chaining
• Build a linked list for each bucket
• Linear search within a list
• Example table (M = 11; each bucket points to a chain of keys):

  0:  (empty)
  1:  L -> A -> A -> A
  2:  M -> X
  3:  N -> C
  4:  (empty)
  5:  E -> P -> E -> E
  6:  (empty)
  7:  G -> R
  8:  H -> S
  9:  I
  10: (empty)

• Simple, practical, widely used (a minimal C sketch follows below)
• Cuts search time by a factor of M compared to sequential search
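A minimal chaining sketch, assuming the string hash() from the previous slide; the table size, node layout, and names are illustrative:

#include <stdlib.h>
#include <string.h>

int hash(char *, int);          /* the string hash from the previous slide */

#define M 101
typedef struct node { char *key; struct node *next; } node;
node *table[M];                 /* one list head per bucket, initially NULL */

void insert(char *key) {        /* O(1): prepend to the bucket's chain */
    int h = hash(key, M);
    node *n = malloc(sizeof *n);
    n->key = strdup(key);
    n->next = table[h];
    table[h] = n;
}

node *search(char *key) {       /* linear search within a single chain */
    for (node *n = table[hash(key, M)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return n;
    return NULL;                /* miss */
}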
Separate Chaining 2
• Insertion time?
  • O(1)
• Average cost of a successful search?
  • about N/(2M) comparisons
• Average cost of an unsuccessful search?
  • about N/M comparisons
• For large M: constant average search time
• Worst case: N ("probabilistically unlikely")
• Keep the lists sorted?
  • insert time rises to N/(2M)
  • unsuccessful search time drops to N/(2M)
Linear Probing
• Or, keep everything in the same table (a sketch follows below)
• Insert: upon collision, scan forward for a free slot
• Search: the same scan; if you reach a free slot first, the search fails
• Runtime?
  • Still O(1) if the table is sparse
  • But as the table fills, clustering occurs
  • Skipping c slots instead of 1 doesn't help...
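A minimal linear-probing sketch under the same assumptions (hash() from earlier; for simplicity the table is assumed never to fill completely):

#include <string.h>

int hash(char *, int);    /* string hash from earlier */

#define M 101
char *tab[M];             /* all keys live in the table itself; NULL = free slot */

void insert(char *key) {  /* assumes at least one free slot remains */
    int i = hash(key, M);
    while (tab[i] != NULL)        /* collision: try the next slot, */
        i = (i + 1) % M;          /* wrapping around the table end */
    tab[i] = key;
}

char *search(char *key) {
    for (int i = hash(key, M); tab[i] != NULL; i = (i + 1) % M)
        if (strcmp(tab[i], key) == 0)
            return tab[i];
    return NULL;                  /* reached a free slot: the key is absent */
}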
Clustering
• Long clusters tend to get longer
• Precise analysis is difficult
• Theorem (Knuth), with load factor α = N/M:
  • Insert cost: approx. (1 + 1/(1-α)²)/2
    (50% full: 2.5 probes; 80% full: 13 probes)
  • Search (hit) cost: approx. (1 + 1/(1-α))/2
    (50% full: 1.5 probes; 80% full: 3 probes)
  • Search (miss): same as insert
• Too slow once the table gets 70-80% full
• How to reduce/avoid clustering?
Double Hashing
• Use a second hash function to compute the probe increment (a sketch follows below)
• Analysis is extremely difficult, but performance is about like the ideal (random probing)
• Theorem (Guibas-Szemerédi), with α = N/M:
  • Insert: approx. 1/(1-α)
  • Search hit: approx. ln(1/(1-α))/α
  • Search miss: same as insert
• Not too slow until the table is about 90% full
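A sketch of the double-hashing insert, reusing tab, M, and hash() from the linear-probing sketch above; the particular second hash (1 + k mod 97) is a common textbook choice, not anything prescribed by the slide:

/* The increment must be nonzero; since M = 101 is prime, any value in
   1..M-1 is relatively prime to M, so the probe sequence visits every slot. */
int hash2(char *key) {
    unsigned k = 0;
    while (*key)
        k = k * 117 + (unsigned char)*key++;
    return 1 + k % 97;            /* in 1..97, never 0 */
}

void insert2(char *key) {
    int i = hash(key, M);
    int skip = hash2(key);        /* the probe sequence now depends on  */
    while (tab[i] != NULL)        /* the key, so colliding keys follow  */
        i = (i + skip) % M;       /* different paths: no pile-up        */
    tab[i] = key;
}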
Dynamic Hash Tables
• Suppose you are building a symbol table for a compiler: how big should you make the hash table?
• If you don't know in advance how big a table to make, what to do?
• Grow the table when it "fills" (e.g., reaches 50% full):
  • make a new table of twice the size
  • make a new hash function (the old one depends on the old table size)
  • re-hash all of the items into the new table
  • dispose of the old table
• A sketch follows below.
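A grow-and-rehash sketch following the plan above; tab, n, m, and insert() are carried over in spirit from the probing sketches, and the details (growth trigger, table sizing) are illustrative:

#include <stdlib.h>

char **tab;                /* dynamically allocated table of m slots */
int n = 0, m = 16;         /* n items currently stored */

void insert(char *key);    /* ordinary probing insert into tab (increments n) */

void maybe_grow(void) {    /* call before each insert */
    if (2 * n < m) return;            /* grow once the table is 50% full */
    char **old = tab;
    int oldm = m;
    m *= 2;                           /* (a real version would pick a prime near 2m) */
    tab = calloc(m, sizeof *tab);     /* new, empty table */
    n = 0;
    for (int i = 0; i < oldm; i++)    /* re-hash every old item: the hash  */
        if (old[i] != NULL)           /* values change because m changed   */
            insert(old[i]);
    free(old);
}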
Table Growing Analysis
• Worst case insertion: Θ(n), to re-hash all items
• Can we make any better statements?
• Average case?
  • O(1): insertions n+1 through 2n cost O(n) total for the insertions themselves plus O(2n) for the one rehash, i.e., O(n) overall (with 3x the constant)
• Amortized analysis?
  • The result above is really an amortized bound on the rehashing: any sequence of j insertions into an empty table costs O(j) for the insertions plus O(2j) for all rehashes combined
  • Or, think of it as billing 3 time units for each insertion and putting 2 in the bank, to be withdrawn later to pay for rehashing
• The arithmetic is spelled out below.
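The geometric series behind the O(1) bound, written out as a sketch (assuming for simplicity that the table doubles each time it fills):

\text{total cost of } n \text{ insertions}
  \;\le\; \underbrace{n}_{\text{the inserts}}
  \;+\; \underbrace{1 + 2 + 4 + \cdots + n}_{\text{all rehashes}}
  \;<\; n + 2n \;=\; 3n
  \quad\Longrightarrow\quad O(1) \text{ amortized per insertion}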
Separate Chaining vs. Double Hashing
• Assume the same amount of space for keys and links (use pointers for long or variable-length keys)
• Separate chaining:
  • 1M buckets, 4M keys
  • 4M links in the nodes
  • 9M words total; average search time 2
• Double hashing in the same space:
  • 4M items, 9M buckets in the table
  • average search time 1/(1 - 4/9) = 1.8: 10% faster
• Double hashing in the same time:
  • 4M items, average search time 2
  • space needed: 8M words, since 1/(1 - 4/8) = 2 (11% less space)
Deletion
• How to implement delete() with separate chaining?
  • simply unlink the unwanted node
  • runtime? same as search()
• How to implement delete() with linear probing?
  • can't just erase the item: the resulting hole would cut off every key whose probe sequence passed through that slot
  • re-hash the entire cluster after the hole
  • or mark the slot as deleted (a sketch follows below)
• How to delete() with double hashing?
  • re-hashing a cluster doesn't work: which "cluster"? probe sequences differ per key
  • mark the slot as deleted
  • every so often, re-hash the entire table to prune the "dead wood"
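A lazy-deletion sketch for linear probing, again reusing tab, M, and hash() from earlier; the DELETED sentinel and the name del() are illustrative:

char *DELETED = "<deleted>";  /* tombstone: distinct from NULL and from any key */

void del(char *key) {
    for (int i = hash(key, M); tab[i] != NULL; i = (i + 1) % M)
        if (tab[i] != DELETED && strcmp(tab[i], key) == 0) {
            tab[i] = DELETED; /* leave a tombstone so probe sequences   */
            return;           /* that passed through here stay unbroken */
        }
}

char *search(char *key) {     /* a miss now ends only at a truly free (NULL) slot */
    for (int i = hash(key, M); tab[i] != NULL; i = (i + 1) % M)
        if (tab[i] != DELETED && strcmp(tab[i], key) == 0)
            return tab[i];
    return NULL;
}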
Comparisons and Summary
• Separate chaining advantages:
  • idiot-proof (degrades gracefully)
  • no large contiguous chunks of memory needed (but is that actually better?)
• Why use hashing?
  • constant-time search and insert, on average
  • easy to implement
• Why not use hashing?
  • no performance guarantees
  • too much arithmetic on long keys: a high constant factor
  • uses extra space
  • doesn't support pred, succ, sort, etc.: there is no notion of order
• Where did Perl "hashes" get their name?
Hashing Summary
• Separate chaining: easiest to deploy
• Linear probing: fastest (but takes more memory)
• Double hashing: least memory (but takes more time, to compute the second hash function)
• Dynamic (grow): handles any number of inserts
• Curious use of hashing: the early Unix spell checker (back in the days of the 3M machines...)

Timings (RB = red-black tree, Chain = separate chaining, Probe = linear probing, Dbl = double hashing, Grow = dynamic growing table; - = no data):

           Construction               Search miss
  N      RB  Chain Probe  Dbl Grow   RB  Chain Probe  Dbl Grow
  5k      6      1     4    4    3    2      1     0    1    0
  50k    74     18    11   12   22   36     15     8    8    8
  100k  182     35    21   23   47   84     45    23   21   15
  190k    -     79   106   59  155    -    144  2194  261   30
  200k  407     84     -    -  159  186    156     -    -   33