Learn about hash tables, a data structure that stores key-value pairs and supports fast operations such as insertion, deletion, and retrieval. Understand different hash functions and their impact on spread and performance. Discover how to handle non-numeric keys.
Tirgul 11 Notes
• Hash tables
• reminder
• examples
• some new material
Dictionary / Map ADT
• This ADT stores pairs of the form <key, data> (in Java: "value" instead of "data").
• It supports the operations insert(key, data), find(key), and delete(key).
• One way to implement it is with search trees; the standard operations then take O(log n).
• What if we want to be more efficient?
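As a minimal sketch (not the course's actual code; the interface and method names are only illustrative), the ADT can be written in Java as:

interface Dictionary<K, V> {
    void insert(K key, V value);  // add a <key, value> pair
    V find(K key);                // return the value stored under key, or null if absent
    void delete(K key);           // remove the pair with this key, if present
}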
Direct addressing
• Say the keys come from a (large) set U. One way to have fast operations is to allocate an array of size |U| and give every possible key its own slot. This is of course a waste of memory, since most entries in the array will remain empty.
• For example, a Hebrew dictionary (e.g. Even-Shushan) holds fewer than 100,000 words, whereas the number of possible combinations of Hebrew letters is much bigger (22^5 for 5-letter words alone). It is impractical to allocate all this space that will never be used.
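A minimal Java sketch of direct addressing, assuming integer keys drawn from {0, ..., |U|-1} (the class and method names are ours, for illustration only):

class DirectAddressTable<V> {
    private final Object[] slots;   // one slot per possible key -- |U| entries

    DirectAddressTable(int universeSize) { slots = new Object[universeSize]; }

    void insert(int key, V value) { slots[key] = value; }   // O(1)

    @SuppressWarnings("unchecked")
    V find(int key) { return (V) slots[key]; }              // O(1), null if absent

    void delete(int key) { slots[key] = null; }             // O(1)
}

All operations are O(1), but the array occupies |U| slots even when only a handful of keys are ever stored.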
Hash table
• In a hash table, we allocate an array of size m, which is much smaller than |U|, and we use a hash function h() to determine the entry of each key. When we want to insert/delete/find a key k, we look for it in entry h(k) of the array.
• Notice that this way, it is not necessary to have an order among the elements of the table.
• Two questions arise from this: first, what kind of hash functions should we use, and second, what happens when several keys map to the same entry (clearly this can happen, since U is much larger than m).
• But first, an example...
Example
• Take for example the login names of the students in dast. There are about 300 names, so putting them in a search tree creates a tree of height 8, even if it is balanced. Storing them in a hash table of 100 entries, on the other hand, produces the spread of names over entries shown in the histogram (the x-axis is the number of items in an entry, and the y-axis is how many entries have that load).
Example (cont.)
• We used the very simple hash function of taking the key mod the table size.
• Notice that even with a perfect spread of names there would be 3 names per entry, so the actual spread of the hash is quite good.
• Also notice that in a search tree, half of the elements are in the leaves, so it would take 8 operations to find them. In this specific example the worst case is still 8, but the search for most of the elements (about 80% of them) takes about half of that.
How to choose hash functions
• The crucial point: the hash function should "spread" the keys of U equally among all the entries of the array.
• Unfortunately, since we don't know in advance which keys we will get from U, this can be done only approximately.
• Remark: the hash functions below assume that the keys are numbers. We'll discuss later what to do if the keys are not numbers.
The division method
• If we have a table of size m, we can use the hash function h(k) = k mod m.
• Some values of m are better than others:
• Good m's are prime numbers not too close to a power of 2.
• A bad choice is m = 2^p: the function then uses only the p lowest-order bits of the key.
• Likewise, if the keys are decimal numbers, m = 10^p is a bad choice.
• For example, if we have |U| = 2000 and we want each search to take (on average) 3 operations, we can choose m = 701. (Which keys will fall into entry 0?)
• Another example (Cormen, page 226, exercise 12.2-2): suppose m = 9. What are the hash values of the keys 5, 28, 19, 15, 20, 33, 12, 17, 10?
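A small Java sketch of the division method, which also computes the answers to the exercise above (the class name is ours; this is just an illustration, assuming non-negative integer keys):

public class DivisionHash {
    static int hash(int k, int m) {
        return k % m;   // h(k) = k mod m, assuming k >= 0
    }

    public static void main(String[] args) {
        int m = 9;
        int[] keys = {5, 28, 19, 15, 20, 33, 12, 17, 10};
        for (int k : keys)
            System.out.println("h(" + k + ") = " + hash(k, m));
    }
}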
The multiplication method
• One disadvantage of the previous hash function is that it depends on the size of the table: the size we choose for the table affects the performance of the hash function.
• In the multiplication method, we have a constant 0 < A < 1. We take the fractional part of kA and multiply it by m. Formally, h(k) = floor(m * (kA mod 1)).
• For example, if m = 100 and A = 1/3, then for k = 10, h(k) = 33, and for k = 11, h(k) = 66. You can see that this is not a good choice of A, since we'll have only three possible values of h(k)...
• It is argued (as a result of experiments with different values) that A = (sqrt(5) - 1)/2 is a good choice. For example (m = 1000): h(61) = 700, h(62) = 318, h(63) = 936, h(64) = 554.
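A Java sketch of the multiplication method with A = (sqrt(5) - 1)/2, reproducing the example values above (the class and method names are ours):

public class MultiplicationHash {
    static final double A = (Math.sqrt(5) - 1) / 2;   // about 0.618

    static int hash(int k, int m) {
        double frac = (k * A) % 1.0;    // fractional part of k*A
        return (int) (m * frac);        // floor(m * frac)
    }

    public static void main(String[] args) {
        int m = 1000;
        for (int k : new int[]{61, 62, 63, 64})
            System.out.println("h(" + k + ") = " + hash(k, m));   // 700, 318, 936, 554
    }
}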
What if keys are not numbers?
• The hash functions we showed only work for numbers. When keys are not numbers, we should first convert them to numbers.
• For example, a string can be treated as a number in radix 256 (each character is a digit between 0 and 255). Thus, the string "key" is translated to ((int)'k')*256^2 + ((int)'e')*256^1 + ((int)'y')*256^0.
• This may cause a problem for a long string, since we get a very large number. If we use the division method, we can use the facts that (a+b) mod n = (a + (b mod n)) mod n and (a*b) mod n = (a * (b mod n)) mod n.
Translating long strings to numbers (cont.)
• We can write the integer value of "word" as (((256*w + o)*256 + r)*256 + d).
• Using the properties of mod, we get this simple algorithm (here in Java):

int hash(String s, int m) {
    int h = s.charAt(0) % m;
    for (int i = 1; i < s.length(); i++)
        h = ((h * 256) + s.charAt(i)) % m;
    return h;
}

• Notice that h is always smaller than m, so the intermediate values stay small; this avoids overflow and also improves the performance of the algorithm.
How to handle collisions
• Now that we have good hash functions, we should think about the second issue: collisions.
• Collisions are more likely to happen when the hash table is more loaded. We define the "load factor" as α = n/m, where n is the number of keys in the hash table.
• In the lecture you saw two methods to handle collisions.
• The first is chaining: each array entry is actually a linked list that holds all the keys that are mapped to this entry. It was proved in class that in a hash table with chaining, a search operation takes Θ(1 + α) expected time under the simple uniform hashing assumption. (Notice that in the chaining method, the load factor may be greater than one.)
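A minimal Java sketch of a chained hash table, using the division method as the hash function (the class and method names are ours; insert assumes the key is not already present):

import java.util.LinkedList;

class ChainedHashTable<V> {
    private static class Entry<V> {
        final int key; V value;
        Entry(int key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<V>>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) table[i] = new LinkedList<>();
    }

    private int h(int key) { return Math.floorMod(key, table.length); }

    void insert(int key, V value) {                 // O(1): add at the head of the chain
        table[h(key)].addFirst(new Entry<>(key, value));
    }

    V find(int key) {                               // expected O(1 + alpha) under uniform hashing
        for (Entry<V> e : table[h(key)])
            if (e.key == key) return e.value;
        return null;
    }

    void delete(int key) {
        table[h(key)].removeIf(e -> e.key == key);
    }
}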
Open addressing
• In this method, the table itself holds all the keys.
• We change the hash function to receive two parameters: the first is the key and the second is the probe number. We first try h(k,0); if that slot is taken we try h(k,1), and so on. It is required that h(k,0), ..., h(k,m-1) be a permutation of 0, ..., m-1, so that after at most m probes we will definitely find a place to put k (unless the table is full).
• Notice that here, the load factor can be at most one.
• In this method, there is a problem with deleting keys: if we simply empty the slot, then a later search that reaches an empty slot cannot tell whether the key it is looking for does not exist, or whether a key that used to occupy this slot along the probe sequence was deleted. (A common fix is to mark such slots with a special "deleted" value.)
Open addressing (cont.)
• There are several ways to implement this hashing:
• Linear probing: h(k,i) = (h'(k) + i) mod m. Problem (primary clustering): if several consecutive slots are occupied, the free slot right after them has a high probability of being the next one filled. Thus large clusters build up, increasing search time.
• Quadratic probing: h(k,i) = (h'(k) + c1*i + c2*i^2) mod m. Better than linear probing, but secondary clustering can still occur.
• Double hashing: h(k,i) = (h1(k) + i*h2(k)) mod m.
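A Java sketch of the three probe sequences (h1 and h2 here are simple division-method auxiliary functions chosen for illustration; they are not prescribed by the slides):

public class ProbeSequences {
    static int h1(int k, int m) { return k % m; }
    static int h2(int k, int m) { return 1 + (k % (m - 1)); }   // never 0, so the sequence always advances

    // linear probing: h(k,i) = (h'(k) + i) mod m
    static int linear(int k, int i, int m) { return (h1(k, m) + i) % m; }

    // quadratic probing: h(k,i) = (h'(k) + c1*i + c2*i^2) mod m
    static int quadratic(int k, int i, int m, int c1, int c2) {
        return (h1(k, m) + c1 * i + c2 * i * i) % m;
    }

    // double hashing: h(k,i) = (h1(k) + i*h2(k)) mod m
    static int doubleHash(int k, int i, int m) { return (h1(k, m) + i * h2(k, m)) % m; }
}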
Performance (without proofs)
• Insertion into an open-address hash table, and an unsuccessful search, require at most 1/(1 - α) probes on average.
• A successful search: the average number of probes is at most (1/α) ln(1/(1 - α)).
• For example, if the table is 50% full then a successful search will take about 1.4 probes on average. If the table is 90% full then the search will take about 2.6 probes on average.
Example (Cormen, page 240, exercise 12.4-1)
• Consider inserting the keys 10, 22, 31, 4, 15, 28, 17, 88, 59 into a hash table with m = 11 using open addressing, with the primary hash function h1(k) = k mod m. Show the result for linear probing, for quadratic probing with c1 = 1 and c2 = 3, and for double hashing with h2(k) = 1 + (k mod (m - 1)).
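If you want to check your answers, here is a small simulation sketch in Java (the code is ours, not from Cormen). Note that with quadratic probing and these constants the probe sequence is not a full permutation of the slots, but for this particular input every key still finds a free slot within m probes:

import java.util.Arrays;

public class OpenAddressingExercise {
    static final int M = 11;

    static int h1(int k) { return k % M; }
    static int h2(int k) { return 1 + (k % (M - 1)); }

    // the i-th slot probed for key k under the given scheme
    static int probe(int k, int i, String scheme) {
        switch (scheme) {
            case "linear":    return (h1(k) + i) % M;
            case "quadratic": return (h1(k) + i + 3 * i * i) % M;   // c1 = 1, c2 = 3
            default:          return (h1(k) + i * h2(k)) % M;       // double hashing
        }
    }

    static Integer[] insertAll(int[] keys, String scheme) {
        Integer[] table = new Integer[M];
        for (int k : keys) {
            for (int i = 0; i < M; i++) {
                int slot = probe(k, i, scheme);
                if (table[slot] == null) { table[slot] = k; break; }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {10, 22, 31, 4, 15, 28, 17, 88, 59};
        for (String scheme : new String[]{"linear", "quadratic", "double"})
            System.out.println(scheme + ": " + Arrays.toString(insertAll(keys, scheme)));
    }
}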
Summary (or when to use hash tables)
• Hash tables are very useful for implementing dictionaries if we don't have an order on the elements, or if we have an order but need only the standard operations.
• On the other hand, hash tables are less useful if we have an order and need more than just the standard operations, for example last(), or an iterator over all elements (which is problematic if the load factor is very low).
• We should have a good estimate of the number of elements we need to store (for example, the Hebrew University has about 30,000 students each year, but it is still a dynamic database).
• Re-hashing: if we don't know the number of elements a priori, we might need to perform re-hashing, increasing the size of the table and re-assigning all the elements.