Hashing 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 18 January 2005
Plan • Today • Seat assignments • Hash functions • Reading: • For today and next time: Sedgewick Chapter 14 • Reminder: HW0 due on Thursday
Hash Tables An Alternative Representation for Dictionaries
Dictionary Interface An Abstract Data Type that maintains a dynamic set is a Dictionary. Crucial operations: • Insert • Find • Remove Standard operations: create, destroy, copy,…
Dictionary Interface • insert: may or may not allow multiple occurrences • find: membership query, often also retrieves associated information • remove: may defer the actual work to speed things up (amortized running time)
Small Universe • Suppose we have a small universe U = {0,1,2,…,M-1} of items. • We want to maintain a subset A of U. • Easy: use an array of bits (booleans) of size M. • Insert: A[k] = 1 • Find: return A[k] != 0 • Remove: A[k] = 0 Operations are constant time.
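The bit-array idea can be sketched in Java as follows. This is only an illustration; the class name SmallUniverseSet is hypothetical and a boolean array stands in for the array of bits.

```java
// Sketch of a set over the small universe {0, ..., M-1}.
// SmallUniverseSet is a hypothetical name, not from the slides.
final class SmallUniverseSet {
    private final boolean[] present; // present[k] is true iff k is in the set

    SmallUniverseSet(int universeSize) {
        present = new boolean[universeSize]; // all entries start as false
    }

    void insert(int k)  { present[k] = true; }   // A[k] = 1
    boolean find(int k) { return present[k]; }   // return A[k] != 0
    void remove(int k)  { present[k] = false; }  // A[k] = 0
}
```

Each operation is a single array access, which is where the constant running time comes from.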
Direct Access Tables • In most applications we do not store simple items but pairs (key, object). • Use an array of pointers (references to objects). • Insert: A[key] = object • Find: return A[key] • Remove: A[key] = null Again operations are constant time.
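A direct access table for (key, object) pairs might look like the sketch below. The class name DirectAccessTable and the generic parameter V are assumptions made for illustration; keys are assumed to lie in 0 .. M-1.

```java
// Sketch of a direct access table: the key itself is the array index.
// DirectAccessTable is a hypothetical name, not from the slides.
final class DirectAccessTable<V> {
    private final Object[] slots; // slots[key] holds the object, or null

    DirectAccessTable(int universeSize) {
        slots = new Object[universeSize];
    }

    void insert(int key, V object) { slots[key] = object; } // A[key] = object

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }              // return A[key]

    void remove(int key) { slots[key] = null; }             // A[key] = null
}
```

The array needs one slot per possible key, whether or not that key is ever used.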
Large Universe • But what if the universe U of keys is large (and the subset is small)? e.g., names, symbol table of a compiler. • Even when the identifiers are at most 16 characters long there are some 10^28 possibilities.
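The order of magnitude can be checked with a rough count. Assuming (this alphabet is not stated on the slide) that identifiers are built from 62 characters, namely 26 lowercase letters, 26 uppercase letters, and 10 digits, the identifiers of length exactly 16 alone number:

```latex
62^{16} \approx 4.8 \times 10^{28}
```

Shorter identifiers add only a small fraction on top of this, so the estimate of about 10^28 keys holds under that assumption.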
Hashing – the Idea • Map keys into integers in the range 0 .. m-1, where m << M and m is the table size. • Pick a “good” mapping from keys to integers: • Easy to compute • Even distribution into the table [Figure: table indices 0 to 10 and the letters a to z]
Hashing – Terminology • The array in which we store the objects is the hash table. • To enter an object into the table, we compute an index from the key. • The map from the key to the index is a hash function h: h(key) = index
Space-Time Tradeoff • A direct access table has O(1) operations in the worst case, but its space may be prohibitive. • Storing only the keys present and using sequential search minimizes space, but a search then takes linear time. • Hashing balances space and time (on average) by adjusting the size of the hash table.
Problem - Collisions • Fundamental problem: Some keys map to the same location, a collision: h(x) = h(y). • Can we prevent collisions?
Pigeonhole Principle • There is no way to avoid collisions. • Since m << M, there must be at least two keys that map to the same index. • The famous Pigeonhole Principle: If you put more than k items into k bins, then at least one bin contains more than one item.
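The pigeonhole argument can be observed directly. The sketch below uses k mod m as a stand-in hash function (an assumption made only for this demonstration) and reports the first pair of keys that land in the same slot. The names PigeonholeDemo and findCollision are hypothetical.

```java
import java.util.Arrays;

// Sketch: with more keys than slots, some two keys must share a slot.
final class PigeonholeDemo {
    /**
     * Returns the positions {i, j}, i < j, of the first two keys that
     * map to the same index under k mod m, or null if there is none.
     */
    static int[] findCollision(int[] keys, int m) {
        int[] firstSeen = new int[m];   // position of the first key per slot
        Arrays.fill(firstSeen, -1);     // -1 marks an empty slot
        for (int j = 0; j < keys.length; j++) {
            int index = Math.floorMod(keys[j], m);
            if (firstSeen[index] >= 0) {
                return new int[] { firstSeen[index], j };
            }
            firstSeen[index] = j;
        }
        return null; // possible only if keys.length <= m
    }
}
```

For example, the six keys 10, 21, 32, 43, 54, 65 with m = 5 fill all five slots with the first five keys, so the sixth key (65) must collide; here it collides with 10 in slot 0.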
Problem - Hash Function • Second problem: How do we find a suitable hash function? • Ideally, we want to distribute the keys uniformly over the hash table to minimize collisions. • That is, we want h to appear random, as though “hashing” the keys.
Hashing-Efficiency • We also need to make sure h(k) is easy to compute. • Note that k could be a fairly complicated data structure. How do you turn an array of integers into a single integer? Or how about a tree? • Goal: All operations should be constant time. • But things can go badly wrong on rare occasions.
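One common way to turn an array of integers into a single integer is to combine the elements with a polynomial in a small odd multiplier. The sketch below uses the multiplier 31; that choice, and the names ArrayHash and hash, are assumptions for illustration and not something the slides prescribe. A tree could be handled in the same spirit by combining the hash values of its subtrees recursively.

```java
// Sketch: fold an int array into one int, one multiply-add per element.
final class ArrayHash {
    static int hash(int[] a) {
        int h = 1;
        for (int x : a) {
            h = 31 * h + x; // int overflow simply wraps around
        }
        return h;
    }
}
```

The cost is linear in the size of the key, so hashing is constant time only when key sizes are bounded. Because earlier elements are multiplied more often, the order of the elements affects the result: {1, 2, 3} and {3, 2, 1} produce different values.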
Division method • Assume wlog the keys are integers. • A simple hash function is h(k) = k mod m, where m is the table size. • The choice of m is crucial. • Good choice: m prime.
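Why the choice of m matters can be seen with keys that share a common factor. The sketch below counts how many table slots a set of keys actually reaches under h(k) = k mod m. The names DivisionMethod and slotsUsed are hypothetical, and the test keys (multiples of 4) are an assumed example of regular input.

```java
// Sketch of the division method and a helper that measures its spread.
final class DivisionMethod {
    // h(k) = k mod m; floorMod keeps the result in 0 .. m-1 for negative k too.
    static int hash(int k, int m) {
        return Math.floorMod(k, m);
    }

    // Number of distinct table indices hit by the given keys.
    static int slotsUsed(int[] keys, int m) {
        boolean[] used = new boolean[m];
        int count = 0;
        for (int k : keys) {
            int i = hash(k, m);
            if (!used[i]) {
                used[i] = true;
                count++;
            }
        }
        return count;
    }
}
```

With the 100 keys 0, 4, 8, …, 396, a table of size m = 16 uses only 4 of its 16 slots, because every key is a multiple of 4 and 4 divides 16. With the prime m = 17 the same keys reach all 17 slots.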
Division method • Primes are fairly dense, so this is no great restriction on the table size. • In fact, we can nearly double the table size at each step and still use a prime: 31, 61, 127, 251, 509, 1021, 2039, … • Store these values in a table; don’t try to compute them on the fly.
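Storing the primes in a table could look like the following sketch. Only the seven values listed on the slide are included; the names PrimeSizes and nextSize are hypothetical, and what to do beyond the last stored prime is left open here (the sketch simply throws).

```java
// Sketch: precomputed, roughly doubling prime table sizes.
final class PrimeSizes {
    // Values copied from the slide; the list there continues further.
    private static final int[] PRIMES = { 31, 61, 127, 251, 509, 1021, 2039 };

    // Returns the smallest stored prime strictly larger than current.
    static int nextSize(int current) {
        for (int p : PRIMES) {
            if (p > current) {
                return p;
            }
        }
        throw new IllegalStateException("no larger prime stored for " + current);
    }
}
```

A table that fills up at size 31 would then grow to nextSize(31), which is 61.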
Multiplication Method • Another hash function is h(k) = floor( m (k r mod 1) ), where 0 < r < 1 is cleverly chosen. • Advantage: the choice of m is not critical. • Ideally r should be irrational; then the values (i r mod 1), i = 1, 2, ..., M are very evenly distributed over [0,1]. • Of course, there is a little problem here.
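A minimal sketch of the multiplication method follows. The slide does not say which r to use; the value (sqrt(5) - 1) / 2, about 0.618, is an assumption here (it is a frequently suggested choice). Note that a double can only approximate an irrational r. The names MultiplicationMethod and hash are hypothetical.

```java
// Sketch of h(k) = floor(m * (k * r mod 1)) with an assumed constant r.
final class MultiplicationMethod {
    // Assumed choice of r; a double only approximates the irrational value.
    static final double R = (Math.sqrt(5.0) - 1.0) / 2.0;

    static int hash(int k, int m) {
        double product = k * R;
        double frac = product - Math.floor(product); // k * r mod 1, in [0, 1)
        int index = (int) (m * frac);                // floor, since m * frac >= 0
        return Math.min(index, m - 1);               // guard against rounding up to m
    }
}
```

For example, hash(1, 10) is floor(10 * 0.618…) = 6. The table size m only scales the result, which is why it need not be prime here.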
Random Input • Note that good hash functions are easy to come by if the input is random (as a bit pattern). Then we can simply take a few bits from the input (say, the first or last 16 bits). • However, such a method would fail miserably if the input shows some regularity. No good for general use.
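The failure on regular input is easy to reproduce. The sketch below takes the last 16 bits of an int key as its hash value; the name LowBitsHash and the example keys (multiples of 65536) are assumptions chosen to show the worst case.

```java
// Sketch: use the last 16 bits of the key as the hash value.
final class LowBitsHash {
    static int hash(int k) {
        return k & 0xFFFF; // keep only the low 16 bits
    }
}
```

For random bit patterns this spreads keys well, but every key of the form i * 65536 has all-zero low bits, so all such keys collide in slot 0.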
Integer keys? • The assumption that objects in U are integers has to be taken with a grain of salt. • Often we have to massage things a bit to extract numbers. • Of course, in the end everything is just one (possibly huge) number written in binary. This can be used in some languages like C to directly extract hash values from these bits.
Example: Strings
    public int hashCode(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 37 * h + key.charAt(i);   // 37 is a magic number
        h %= m;
        if (h < 0)                        // overflow?
            h += m;
        return h;
    }
This is really an interpretation of the string as a number in base 37 (not ordinary radix notation, though).
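Written out, and ignoring int overflow for the moment, the loop evaluates a polynomial in 37 by Horner's rule. Here c_0, …, c_{n-1} denote the character codes of the key; this notation is introduced only for the illustration.

```latex
h \;=\; \Bigl(\, \sum_{i=0}^{n-1} c_i \cdot 37^{\,n-1-i} \Bigr) \bmod m
```

For the key "ab" (character codes 97 and 98) and m = 31 this gives 37 · 97 + 98 = 3687 and 3687 mod 31 = 29. For long keys the int arithmetic wraps around, so h can be negative before the final adjustment; that is what the h < 0 check handles.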
Hash functions • Desired properties • Approximates a random distribution • Over the range of table index values • Efficient calculation • Approaches • Modular arithmetic • Many • Perfect hashing • When full set of input keys known in advance
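When the full key set is known in advance, one can search for a hash function with no collisions on that set. The sketch below is a deliberately naive illustration, not a practical perfect hashing scheme: it tries table sizes m in increasing order until k mod m is collision-free. It assumes distinct, non-negative int keys; the names PerfectModulus and findModulus are hypothetical.

```java
// Naive sketch: find the smallest m >= n such that k mod m has no
// collisions on a fixed set of n distinct, non-negative keys.
final class PerfectModulus {
    static int findModulus(int[] keys) {
        // Terminates: once m exceeds the largest key, k mod m == k.
        for (int m = Math.max(1, keys.length); ; m++) {
            boolean[] used = new boolean[m];
            boolean collision = false;
            for (int k : keys) {
                int i = k % m;
                if (used[i]) {
                    collision = true;
                    break;
                }
                used[i] = true;
            }
            if (!collision) {
                return m;
            }
        }
    }
}
```

For the keys 10, 20, 35, 47 the search rejects m = 4, 5, 6 and stops at m = 7, where the keys map to the distinct indices 3, 6, 0, 5.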