Hashing (Ch. 14)
• Goal: to implement a symbol table or dictionary (insert, delete, search)
• What if you don't need ordered-key operations (pred, succ, sort, select)?
• Are O(log n) comparisons necessary? (no)
• Hashing basic plan:
  • create a big array for the items to be stored
  • use a function to figure out the storage location from the key (the hash function)
  • a collision resolution scheme is necessary
Hashing Example
• Simple hash function:
  • Treat the key as a large integer K
  • h(K) = K mod M, where M is the table size
  • Let M be a prime number
• Example:
  • Suppose we have 101 buckets in the hash table (M = 101)
  • 'abcd' in hex is 0x61626364
  • Converted to decimal it's 1633837924
  • 1633837924 % 101 = 11
  • Thus h('abcd') = 11. Store the key at location 11.
  • "dcba" hashes to 57
  • "abbc" also hashes to 57: a collision. What to do?
• If you have billions of possible keys and only hundreds of buckets, lots of collisions are inevitable!
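The arithmetic above in code form (a minimal sketch; the variable names are mine, not the slides'):

#include <stdio.h>

int main(void) {
    /* treat the 4-byte key "abcd" as a 32-bit integer:
       0x61626364 = 1633837924 */
    unsigned int K = ('a' << 24) | ('b' << 16) | ('c' << 8) | 'd';
    int M = 101;                        /* table size: a prime */
    printf("h('abcd') = %u\n", K % M);  /* prints 11 */
    return 0;
}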
Hashing Strings
• h('aVeryLongVariableName')?
• Instead of dealing with very large numbers, use Horner's method, reducing mod M after each step (97 = 'a', 86 = 'V', 101 = 'e', 114 = 'r', ...):
  • 256 * 97 + 86 = 24918; 24918 % 101 = 72
  • 256 * 72 + 101 = 18533; 18533 % 101 = 50
  • 256 * 50 + 114 = 12914; 12914 % 101 = 87
  • ...and so on through the rest of the string
• Scramble by replacing the multiplier 256 with 117:

int hash(char *v, int M) {
    int h, a = 117;              /* multiplier; 117 scrambles better than 256 */
    for (h = 0; *v; v++)         /* one Horner step per character */
        h = (a*h + *v) % M;      /* reduce mod M so h never overflows */
    return h;
}
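A quick driver for the function above (a sketch; the test string and M = 101 are just for illustration):

#include <stdio.h>

int hash(char *v, int M);        /* as defined above */

int main(void) {
    /* each step keeps the running value below a*M + 128,
       no matter how long the key is */
    printf("%d\n", hash("aVeryLongVariableName", 101));
    return 0;
}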
Collisions
• How likely are collisions?
• Birthday paradox: with M buckets, expect the first collision after about sqrt(pi*M/2) random insertions (about 1.25 sqrt(M)):

  M        sqrt(pi*M/2)
  100      12
  1000     40
  10000    125

  [1.25 sqrt(365) is about 24: the classic birthday paradox]
• Experiment: generate random numbers in 0..100:
  84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61 ...
• Collision at the 13th number, as predicted
• What to do about collisions?
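The experiment is easy to reproduce (a sketch, not from the slides; with M = 101, expect a collision after roughly 13 draws):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    enum { M = 101 };            /* buckets 0..100 */
    int seen[M] = {0};
    srand((unsigned) time(NULL));
    for (int n = 1; ; n++) {
        int r = rand() % M;      /* one random insertion */
        if (seen[r]) {           /* bucket already occupied: collision */
            printf("collision at draw %d\n", n);
            return 0;
        }
        seen[r] = 1;
    }
}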
Separate Chaining
• Build a linked list for each bucket
• Linear search within a list:

  0:
  1: L A A A
  2: M X
  3: N C
  4:
  5: E P E E
  6:
  7: G R
  8: H S
  9: I
  10:

• Simple, practical, widely used
• Cuts search time by a factor of M over sequential search
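A minimal sketch of chaining in code (the struct layout and names are mine; hash() is the string hash from above, and M = 11 matches the picture):

#include <stdlib.h>
#include <string.h>

#define M 11                         /* number of buckets */

typedef struct node {
    char *key;
    struct node *next;
} node;

static node *table[M];               /* one list head per bucket */

int hash(char *v, int m);            /* as defined earlier */

void insert(char *key) {             /* O(1): link at the front of the list */
    node *n = malloc(sizeof *n);
    int i = hash(key, M);
    n->key = key;
    n->next = table[i];
    table[i] = n;
}

node *search(char *key) {            /* linear search within one bucket */
    for (node *n = table[hash(key, M)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return n;
    return NULL;                     /* not found */
}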
Separate Chaining 2
• Insertion time?
  • O(1) (link in at the front of the list)
• Average search cost, successful search?
  • about N/(2M) comparisons
• Average search cost, unsuccessful?
  • about N/M comparisons
• M large: CONSTANT average search time
• Worst case: N ("probabilistically unlikely")
• Keep lists sorted?
  • insert time grows to N/(2M)
  • unsuccessful search time drops to N/(2M)
Linear Probing
• Or, we could keep everything in the same table
• Insert: upon collision, probe forward for a free slot
• Search: probe the same way (if you reach an empty slot, the search fails)
• Runtime?
  • Still O(1) if the table is sparse
  • But: as the table fills, clustering occurs
  • Skipping c slots instead of 1 doesn't help...
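A sketch of linear probing (names are mine; for brevity the key's hash value is passed in, and the table is assumed never to fill completely):

#include <string.h>

#define M 101
static char *table[M];               /* NULL marks an empty slot */

void insert(char *key, int h) {
    int i = h % M;
    while (table[i] != NULL)         /* collision: probe the next slot */
        i = (i + 1) % M;
    table[i] = key;
}

char *search(char *key, int h) {
    for (int i = h % M; table[i] != NULL; i = (i + 1) % M)
        if (strcmp(table[i], key) == 0)
            return table[i];
    return NULL;                     /* hit an empty slot: not present */
}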
Clustering
• Long clusters tend to get longer
• Precise analysis difficult
• Theorem (Knuth):
  • Insert cost: approx. (1 + 1/(1 - N/M)^2) / 2
    (50% full: 2.5 probes; 80% full: 13 probes)
  • Search (hit) cost: approx. (1 + 1/(1 - N/M)) / 2
    (50% full: 1.5 probes; 80% full: 3 probes)
  • Search (miss): same as insert
• Too slow when the table gets 70-80% full
• How to reduce/avoid clustering?
Double Hashing
• Use a second hash function to compute the probe increment sequence
• Analysis extremely difficult
• Behaves about like the ideal (random probing)
• Theorem (Guibas-Szemeredi):
  • Insert: approx. 1/(1 - N/M)
  • Search hit: approx. ln(1/(1 - N/M)) / (N/M)
  • Search miss: same as insert
• Not too slow until the table is about 90% full
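A sketch of the probe loop (the particular second hash, 1 + K mod (M-1), is a common choice, not necessarily the slides'; with M prime, any nonzero increment cycles through every slot):

#include <string.h>

#define M 101                        /* prime table size */
static char *table[M];

int hash2(int k) {                   /* must never return 0 */
    return 1 + (k % (M - 1));
}

char *search(char *key, int k) {     /* k = the key as an integer, k >= 0 */
    int skip = hash2(k);             /* key-dependent increment */
    for (int i = k % M; table[i] != NULL; i = (i + skip) % M)
        if (strcmp(table[i], key) == 0)
            return table[i];
    return NULL;
}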
Dynamic Hash Tables
• Suppose you are making a symbol table for a compiler. How big should you make the hash table?
• If you don't know in advance how big a table to make, what to do?
• Grow the table when it "fills" (e.g. reaches 50% full):
  • Make a new table of twice the size
  • Make a new hash function
  • Re-hash all of the items into the new table
  • Dispose of the old table
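The grow step in code (a sketch over the linear-probing table above; the 50% threshold and the names are illustrative):

#include <stdlib.h>

static char **table;                 /* open-addressing table, m slots */
static int m = 101, n = 0;           /* m slots, n items stored */

int hash(char *v, int mod);          /* string hash from earlier */

static void put(char *key) {         /* linear probing, no growth check */
    int i = hash(key, m);
    while (table[i] != NULL)
        i = (i + 1) % m;
    table[i] = key;
}

void insert(char *key) {
    if (table == NULL)
        table = calloc(m, sizeof *table);
    if (2 * n >= m) {                /* "full": grow at 50% occupancy */
        char **old = table;
        int oldm = m;
        m = 2 * oldm;                /* new table, twice the size; doubling m
                                        also changes the hash function */
        table = calloc(m, sizeof *table);
        for (int i = 0; i < oldm; i++)
            if (old[i] != NULL)
                put(old[i]);         /* re-hash every item */
        free(old);
    }
    put(key);
    n++;
}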
Table Growing Analysis
• Worst case insertion: Θ(n), to re-hash all items
• Can we make any better statements?
• Average case?
  • O(1): insertions n+1 through 2n cost O(n) total for the insertions themselves and O(n) for the one re-hash, so O(n) for n insertions (with about 3x the constant)
• Amortized analysis?
  • The result above is really an amortized bound on the re-hashing.
  • Any sequence of j insertions into an empty table costs O(j) for the insertions plus at most about 2j re-hashing work (j + j/2 + j/4 + ... < 2j).
  • Or, think of it as billing 3 time units for each insertion and storing 2 in the bank; withdraw them later to pay for re-hashing.
Separate Chaining vs. Double Hashing
• Assume the same amount of space for keys and links (use pointers for long or variable-length keys)
• Separate chaining:
  • 1M buckets, 4M keys
  • 4M links in nodes
  • 9M words total; average search time 2
• Double hashing in the same space:
  • 4M items, 9M buckets in the table
  • average search time: 1/(1 - 4/9) = 1.8, i.e. 10% faster
• Double hashing in the same time:
  • 4M items, average search time 2
  • space needed: 8M words (1/(1 - 4/8) = 2), i.e. 11% less space
Deletion
• How to implement delete() with separate chaining?
  • Simply unlink the unwanted item
  • Runtime? Same as search()
• How to implement delete() with linear probing?
  • Can't just erase it. (Why not? An emptied slot would cut off the probe sequences of keys inserted after it, making them unfindable.)
  • Re-hash the entire cluster
  • Or mark the slot as deleted? (see the sketch below)
• How to delete() with double hashing?
  • Re-hashing a cluster doesn't work: which "cluster"?
  • Mark as deleted
  • Every so often, re-hash the entire table to prune the "dead wood"
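Lazy deletion as code (a sketch; the sentinel trick and the names are mine, building on the linear-probing table above):

#include <string.h>

#define M 101
static char *table[M];
static char DELETED[1];              /* unique sentinel address, never a real key */

void del(char *key, int h) {         /* h = the key's hash value */
    for (int i = h % M; table[i] != NULL; i = (i + 1) % M)
        if (table[i] != DELETED && strcmp(table[i], key) == 0) {
            table[i] = DELETED;      /* mark, don't empty: later keys'
                                        probe sequences stay intact, and
                                        insert may reuse the slot */
            return;
        }
}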
Comparisons and Summary
• Separate chaining advantages:
  • idiot-proof (degrades gracefully)
  • no large chunks of memory needed (but is this better?)
• Why use hashing?
  • constant-time search and insert, on average
  • easy to implement
• Why not use hashing?
  • no performance guarantees
  • too much arithmetic on long keys: high constant factor
  • uses extra space
  • doesn't support pred, succ, sort, etc.: no notion of order
• Where did Perl "hashes" get their name?
Hashing Summary
• Separate chaining: easiest to deploy
• Linear probing: fastest (but takes more memory)
• Double hashing: least memory (but takes more time to compute the second hash function)
• Dynamic (grow): handles any number of inserts
• Curious use of hashing: the early Unix spell checker (back in the days of the 3M machines...)

Empirical timings (RB = red-black tree, Chain = separate chaining, Probe = linear probing, Dbl = double hashing, Grow = dynamic growing; '-' = not run):

               Construction               Search miss
  Keys     RB  Chain Probe  Dbl Grow    RB  Chain Probe  Dbl Grow
  5k        6    1     4     4    3      2    1     0     1    0
  50k      74   18    11    12   22     36   15     8     8    8
  100k    182   35    21    23   47     84   45    23    21   15
  190k      -   79   106    59  155      -  144  2194   261   30
  200k    407   84     -     -  159    186  156     -     -   33