280 likes | 298 Views
Hashing. General idea: Get a large array Design a hash function h which converts keys into array indices . Insert (key, value) in the array at h(key) position. Can also be used in external memory (addresses instead of array indices). Example. Suppose the size of the hash table is k.
E N D
Hashing General idea: • Get a large array • Design a hash function h which converts keys into array indices . • Insert (key, value) in the array at h(key) position. • Can also be used in external memory (addresses instead of array indices).
Example Suppose the size of the hash table is k. How to map arbitrary large numbers to numbers in the range 0.....k-1? A possible solution: division hashing. index = largeNumber % k (remainder of division by k). However, even this is not very practical (largeNumber is too large in our case). Another problem is that the same index may now be associated with different words.
Requirements for Hash function The following are general requirements for hash functions: • If the hash table has size k, hash function should return numbers in the range 0....k-1. • Hash function must be quick to calculate the value. • Hash function should minimize collisions (when several keys are hashed to the same number). • The last requirement is satisfied if each key is equally likely to map to any of the k indices (uniform distribution). Usually we cannot completely avoid collisions (as in the example with words).
Hashing Methods • Division method • Multiplication method • Folding method • Other hashing methods
Division method • Algorithm: H(x) = x mod m + 1 Where m is some predetermined divisor integer (i.e., te table size), x is the preconditioned key, and mod stands for modulo. Note that adding 1 is only necessary if the table starts at key 1 (if it starts at 0, the algorithm simplifies to (H(x) = x mod m). • So, in other words: given an item, divide the preconditioned key of that item by the table size (+1). The remainder is the hash key.
Division method • Given a hash table with 10 buckets, what is the hash key for 'Cat'? Since 'Cat' = 131130 when converted to ASCII, then x = 131130. We are given the table size (i.e., m = 10, starting at 0 and ending at 9). • H(x) = x mod m H(131130) = 131130 mod 10 = 0 (there is no remainder) • 'Cat' is inserted into the table at address 0.
Multiplication method • The method that is used in the slide is the multiplication method. It multiplies all the individual digits in the key together, and takes the remainder after dividing the resulting number by the table size. In functional notation, the algorithm is: • H(x) = (a * b * c * d *....) mod m • Where: m is the table size, a, b, c, d, etc. are the individual digits of the item, and mod stands for modulo. • Let's apply this algorithm to an example.
Example • Given a hash table of ten buckets (0 through 9), what is the hash key for 'Cat'? Since 'Cat' = 131130 when converted to ASCII, then x = 131130 We are given the table size (i.e., m = 10). The constant can be any number we want, let's use five (i.e., c = 5). H(x) = (a * b * c * d *....) mod m H(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10 = 0 mod 10 = 0 • 'Cat' is inserted into the table at address 0.
Folding method • Algorithm: H(x) = (a + b + c) mod m • Where a, b, and c represent the preconditioned key broken down into three parts, m is the table size, and mod stands for modulo. In other words: the sum of three parts of the preconditioned key is divided by the table size. The remainder is the hash key.
Examples • Fold the key 123456789 into a hash table of ten spaces (0 through 9). • We are given x = 123456789 and the table size (i.e., m = 10). • Since we can break x into three parts any way we want to, we will break it up evenly. Thus a = 123, b = 456, and c = 789. • H(x) = (a + b + c) mod m H(123456789) = (123+456+789) mod 10 = 1368 mod 10 = 8 • 123456789 is inserted into the table at address 8.
Other hashing methods • Now that we have shown, in detail, a few of the many hashing methods which you can use, we will briefly discuss the length-dependent method, • midsquare method, • digit-analysis and • addition method.
Other hashing methods • Midsquare mehod => Multiply x by itself (i.e., x2) and select a number of digits from the middle of the result. How many digits you select will depend on your table size and the size of your hash key. • Addition method =>The sum of the digits of the preconditioned key is divided by the table size. The remainder is the hash key. • Digit analysis method => Certain digits of the preconditioned key are selected in a certain predetermined and consistent pattern.
Perfect Hashing • In exceptional circumstances, it is possible to design a perfect hashing function which avoids collisions. • For example, if all items in the collection are given in advance (e.g. a dictionary) and a function is designed to suit this collection and nothing else.
Handling Collisions There are several methods of solving the problem of collision: • open addressing: probe array for the "next" slot which is still empty. Different methods for finding the "next" slot: • linear probing • quadratic probing • double hashing • separate chaining: each slot in the array contains a reference to a bucket of items with the keys hashed to that slot. Usually a linked list.
Open addressing • Using the open addressing technique, after a collision, new locations are examined until an empty one is found. • The item to be inserted is placed here. • If an item is deleted, then it is replaced with the word "DELETED" . • The sequence in which other locations are examined can be determined in many different ways. • For example, after a collision, new locations could be searched from bottom to top, top to bottom, or by every 2 places.
Open addressing • Compute the index given the key • Go to the index • If it is free, insert the item • If it is not free, go to the next possible slot. • To find the next possible slot, generates a "probing sequence".
Probing Sequence • linear probing: search sequentially for the next free slot. The simplest probing sequence is h(k), h(k) + 1, h(k) + 2, ...but in general We can do h(k), h(k) + c, h(k) + 2c, h(k) + 3c, ... where c is some constant. Problem: clustering.If several keys are mapped to the same index, we get large clusters of occupied cells with empty spaces in between. The larger the cluster, the longer the search. Suppose k1, k2 and k3 are all mapped to the same index. Then searching for/inserting (k1,v1) is OK, inserting (k2,v2) takes an extra step, inserting (k3,v3) takes extra two steps etc.:
K2, V2 K1, V1 K3, V3 Probing Sequence
Linear probing • After first being located to an already occupied table position, the item is sent sequentially down the table until an empty space is found. • If m is the number of possible positions in the table, then the linear probe continues until either an empty spot is found, or until m-1 locations have been searched. • One thing that happens with linear probing however is clustering, or bunching together. • Primary clustering occurs when items are always hashed to the same spot in the table, and the lookup requires searching through 4 or 5 buckets before finding the empty spot.
Clustering • A graphic might help explain this: The opposite of clustering is uniform distribution. Simply, this is when the hash function uniformly distributes the items over the table.
Chaining • A collision has occurred. The position resolved by the resolution technique is already full, therefore the element you wish to insert into the hash table cannot go there without losing data. Separate chaining simply makes a backup list of elements that would have otherwise been inserted into a given position had that position been empty.
Chaining • For example, let us assume that position x in the table has been previously filled by a hash insertion. • Now, let us assume that the element we wish to insert into the hash table resolves to position x as well. • We cannot simply overwrite the element already in position x. • Each hash table entry requires a pointer to a linked list of records that can hold the element data. • This list is only used when a collision occurs. • The element that we are inserting, E, is then appended onto the list associated with the target position, x. The result is a linked list associated with each table position, each list containing only those elements that collided on that table position with an element already in that position.
Chaining • The worst case: all keys hashed to the same slot. Items kept in a linked list of size N (size of the collection), search O(N). • Best case: hash function distributes the keys uniformly, the size of the bucket is N/k.
Probing Sequence • Quadratic probing - the offset is squared at every step. Jumps around much more, but the items with the same keys make exactly the same jumps. • Secondary clustering: each new item with the same hash value requires a longer probing sequence. • Double hashing: offset is calculated by a second hashing function h', which is different from the first one: h(k), h(k) + h'(k), h(k) + 2h'(k), h(k) + 3h'(k), ... Solves the problem of clustering.
Quadratic Probing • h(k,i) = (h(k) + c1i + c2 i2) mod m • Only certain combination of c1, c2, and m use the entire hash table. • h(k1,0) = h(k2,0) implies h(k1,i) = h(k2,i). This leads to secondary clustering. • There are only m (<<m!) distinct probe sequences. • Example: • h(k1,i) = (h(k) + i + i2) mod m, c1 = c2 = 1 • In this example, the probe sequence is • h(k)h(k) + 2h(k) + 6h(k) + 12...
Double hashing • Double hashing is another method of collision resolution, but unlike the linear collision resolution, double hashing uses a second hashing function which normally limits multiple collisions. • The idea is that if two values hash to the same spot in the table, a constant can be calculated from the initial value using the second hashing function which can then be used to change the sequence of locations in the table, but still have access to the entire table.
Useful rules for the table size • Unless it is possible to map each key to a unique index in the array, table size should be a prime number (important if modulo division is used; also for linear probing). • If open addressing is used, table size should be larger than the number of values stored. • Optimal load factor for open addressing is 1/2 (the table twice larger than the number of items). Load factor is the ratio of the number of items to the size of the hash table.
Summary • Hash tables are very useful when the size of collection is known in advance and search, insertion and deletion should be fast. • If the table is large enough and the hash function is chosen well, these operations are almost constant. • Difficult to resize (copy into a larger array and change the hashing function) and to iterate through all items (array contains gaps).