Hashing Lecture 10
More Efficient Searching • Question: Can searching be made significantly more efficient than logarithmic, i.e., faster than O(log n)? • Example: a set of students, each represented as a record (id, name, data).
Storing the records in an array • We store the records in an array: each record is stored at an address computed from its id (key) using the function: h(key) = (last 3 digits of key) - 100 • h(10379102) = 102 - 100 = 2, so the student record with id 10379102 is stored at index 2 in the array (the original figure shows the array with indices 0 through 8).
Searching costs constant time • Searching a student with id = k: compute its storing address h(k), access the data at h(k). • Inserting a student with id = k: compute its storing address h(k) and store the record at index h(k); • Deleting a student with id = k: compute its storing address h(k) and remove the record at index h(k); • Complexity: constant time as long as computing h(k) is easy.
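As a minimal sketch of this scheme (class and method names are illustrative; it assumes ids whose last three digits lie in 100..199, as in the example):

    class StudentDirectory {
        private final String[] table = new String[100];   // data indexed by h(id)

        // h(key) = (last 3 digits of key) - 100
        private int h(int id) { return id % 1000 - 100; }

        void insert(int id, String name) { table[h(id)] = name; }   // O(1)
        String search(int id)            { return table[h(id)]; }   // O(1)
        void delete(int id)              { table[h(id)] = null; }   // O(1)
    }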
Hashing • Problem: a set of records of the form (key, data), each with a unique key, needs efficient insertion, deletion and searching.
Hashing • General idea: store the records in an array T of size m using some function h that computes their storing (hash) addresses according to their keys: h : K -> [0..m-1], where K is the set of keys; that is, for each k ∈ K, 0 ≤ h(k) ≤ m-1. h(k) is called the hash address of k. • The function h is called a hash function. • The search table T is called a hash table.
Hashing • Inserting a record (k, d): compute h(k) using the hash function, and store (k, d) in T[h(k)]; • Searching a record with key = k: compute its hash address h(k) and access the data at T[h(k)]; • Deleting a record with key = k: compute its hash address h(k) and remove the data at T[h(k)]; • This method is called hashing: design a hash function and store the records in an array. The array that stores the records and supports insertion, deletion and search is called a hash table. • Complexity: constant time as long as computing h(k) takes constant time.
Hash Functions • The set of keys: {10389099, 10389100, 10389102, 10389103, 10389106, …, 10389134, 10370001, 10370018, 10370052, 10370055, 10370084, …, 10370114}. • h(key) = last four digits. Problem: the array size would be too big, wasting memory. • h(key) = last three digits. Problem: two different keys hash to the same address, e.g. h(10389099) = h(10370099): a collision occurs. • Compilers use hash tables (symbol tables) to keep track of declared variables; the hash function maps the set of possible variable names to a limited range [0..m-1].
Hash Function • With hashing, an element of key k is stored in T[h(k)] • h: hash function • maps the universe U of possible keys into the slots of a hash table T[0,1,...,m-1] • an element of key k hashes to slot h(k) • h(k) is the hash value of key k
Collisions • It appears difficult to design a one-to-one hash function, since the set of possible keys is large. • Collision: two different keys hash to the same value, that is, k1 ≠ k2 but h(k1) = h(k2). • The birthday problem: how many randomly chosen people need to be in a room before it becomes likely that two of them share a birthday? • "The probability reaches 100% when the number of people reaches 367 (including February 29 births), but perhaps counter-intuitively, 99% probability is reached with just 57 people, and 50% probability with 23 people." (from Wikipedia). • The probability that m people all have different birthdays is (365/365)·(364/365)···((365-m+1)/365), which drops below 0.5 when m ≥ 23 (checked numerically below).
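A quick numerical check of the quoted birthday figures (a small sketch; class and method names are illustrative):

    class Birthday {
        // probability that m people all have distinct birthdays
        static double allDistinct(int m) {
            double p = 1.0;
            for (int i = 0; i < m; i++) p *= (365.0 - i) / 365.0;
            return p;
        }
        public static void main(String[] args) {
            System.out.println(allDistinct(23));  // ~0.493, so < 0.5 already at m = 23
            System.out.println(allDistinct(57));  // ~0.010, i.e. 99% chance of a shared birthday
        }
    }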
Tasks of Hashing • Using hashing requires solving the following two problems. • Task 1: Design a good hash function • that is fast to compute and • minimizes the number of collisions, i.e., distributes the hash values evenly. • Task 2: Design a method to resolve collisions when they occur.
Design a Hash Function • A simple and reasonable strategy: h(k) = k mod m • e.g. m = 12, k = 100, h(k) = 4 • Requires only a single division operation (quite fast) • Certain values of m should be avoided • e.g. if m = 2^p, then h(k) is just the p lowest-order bits of k; the hash function does not depend on all the bits • Similarly, if the keys are decimal numbers, m should not be a power of 10 • It is good practice to make the table size m a prime number • Good values for m: primes not too close to exact powers of 2 • e.g. if the hash table is to hold 2000 numbers, choose m = 2003 (a sketch follows).
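A sketch of the division method in Java (Math.floorMod rather than % so that negative keys still land in 0..m-1; the prime 2003 follows the slide's example):

    static final int M = 2003;              // prime, not close to a power of 2
    static int hash(int k) {
        return Math.floorMod(k, M);         // h(k) = k mod m, always in 0..M-1
    }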
Dealing with String-type Keys • Can the keys be strings? • Most hash functions assume that the keys are natural numbers • if keys are not natural numbers, a way must be found to interpret them as natural numbers • Method 1: add up the ASCII values of the characters in the string • Problems: • Different permutations of the same set of characters have the same hash value • If the table size is large, the keys are not distributed well. E.g., suppose m = 10007 (prime) and all keys are eight or fewer characters long. Since each ASCII value is <= 127, the hash function only takes values between 0 and 127*8 = 1016.
• Method 2: use the first 3 characters • h(key) = key[0] + 27·key[1] + 27²·key[2], treating each character as one of 27 values (a,…,z and space) • If the first 3 characters were random and the table size is 10,007 => a reasonably equitable distribution • Problem • English is not random • Only 28 percent of the table can actually be hashed to (assuming a table size of 10,007) • Method 3 • computes, e.g., h(key) = (∑ key[i]·37^i) mod m • involves all characters in the key and can be expected to distribute well (see the sketch below)
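A sketch of Method 3 (base 37 is a common choice for this polynomial hash, e.g. in Weiss's textbook; the Horner's-rule loop computes the same polynomial with the exponents reversed, which distributes equally well):

    static int hash(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = (37 * h + key.charAt(i)) % m;   // reduce mod m at each step to avoid overflow
        return h;
    }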
Multiplication • Multiplication method (see Knuth): • Multiply the key k by a constant 0 < A < 1 and extract the fractional part of kA. • Multiply that fraction by m and take the floor of the result: h(k) = ⌊m(kA mod 1)⌋, where "kA mod 1" means the fractional part of kA, that is, kA - ⌊kA⌋. • Advantages: • The value of m is not critical. • A reasonable choice is A = (sqrt(5)-1)/2 = 0.618… (Knuth). A sketch follows.
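A direct transcription of the multiplication method as a sketch:

    static int hash(int k, int m) {
        final double A = (Math.sqrt(5) - 1) / 2;   // ~0.618..., Knuth's suggestion
        double kA = k * A;
        double frac = kA - Math.floor(kA);         // kA mod 1
        return (int) (m * frac);                   // floor(m * (kA mod 1))
    }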
Example: Hash(k) = ⌊2^10 · (0.618·k mod 1)⌋ mod 17 • Keys: PAL, LAP, PAM, MAP, PAT, PET, SET, SAT, TAT, BAT • Addr: 9, 8, 13, 7, 11, 14, 6, 16, 2, 10 • where a string abc is converted to a + b·29 + c·29², each character being treated as an int. A sketch follows.
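The example can be reproduced with the following sketch, which yields 9 for PAL and 8 for LAP, matching the addresses above:

    static int addr(String s) {                              // assumes a 3-character string
        int k = s.charAt(0) + 29 * s.charAt(1) + 29 * 29 * s.charAt(2);
        double frac = 0.618 * k - Math.floor(0.618 * k);     // 0.618*k mod 1
        return (int) (1024 * frac) % 17;                     // floor(2^10 * frac) mod 17
    }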
Collision Handling: (1) Separate Chaining • Instead of a plain hash table, we use a table of linked lists • keep a linked list of all keys that hash to the same value • Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 • Hash function: h(K) = K mod 10
Separate Chaining Operations • To insert a key K • Compute h(K) to determine which list to traverse • If T[h(K)] contains a null pointer, initialize this entry to point to a linked list that contains K alone. • If T[h(K)] is a non-empty list, add K at the beginning of this list. • To search for a key K: compute h(K), and traverse the list at T[h(K)]. • To delete a key K • compute h(K), then search for K within the list at T[h(K)]. Delete K if it is found. • A sketch of these operations follows.
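A minimal sketch of a chained hash table storing integer keys (it initializes every slot to an empty list up front, which slightly simplifies the null-pointer case on the slide):

    import java.util.LinkedList;

    class ChainedHashTable {
        private final LinkedList<Integer>[] table;
        private final int m;

        @SuppressWarnings("unchecked")
        ChainedHashTable(int m) {
            this.m = m;
            table = new LinkedList[m];
            for (int i = 0; i < m; i++) table[i] = new LinkedList<>();
        }

        private int h(int k) { return Math.floorMod(k, m); }

        void insert(int k)    { table[h(k)].addFirst(k); }          // O(1)
        boolean search(int k) { return table[h(k)].contains(k); }   // traverses one chain
        void delete(int k)    { table[h(k)].remove((Integer) k); }  // search, then unlink
    }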
Analysis of Hashing • We count the number of probes: one probe is one comparison of a key with the target. • This number depends on how full the table is: the load factor. • The load factor of a table is λ = n/t, where n is the number of entries in the table and t is the size of the array. • For example, λ = 1 in the previous example (10 keys, table size 10).
Analysis of Chaining • The average length of the chains is λ, so the expected number of probes for insertion and for an unsuccessful search is λ. • The number of probes made for a successful search is (k+1)/2, assuming sequential search of a chain, where k is the length of the chain containing the target. • The average value of k is 1 plus the average length of the chains over the remaining n-1 records, i.e. 1 + (n-1)/t, so • T(λ) = (k+1)/2 = (1 + (n-1)/t + 1)/2 ≈ 1 + λ/2. • So the running time is at most T(0.8) for λ ≤ 0.8.
Collision Handling:(2) Open Addressing • Open addressing: relocate the key K to be inserted if it collides with an existing key. • We store K at an entry different from T[h(K)]. • Two issues arise • what is the relocation scheme? • how to search for K later? • Three common methods for resolving a collision in open addressing • Linear probing • Quadratic probing • Double hashing
Open Addressing Strategy • To insert a key K, compute hash(K). If T[hash(K)] is empty, insert it there. If collision occurs, probe alternative cells h1(K), h2(K), .... until an empty cell is found. • hi(K) = (hash(K) + f(i)) mod m, i=1,2,… • f: collision resolution strategy • Operation ‘mod’ makes the table a circular array.
Linear Probing • f(i) = i • cells are probed sequentially (with wraparound) • hi(K) = (hash(K) + i) mod m • Insertion: • Let K be the new key to be inserted, compute hash(K) • For i = 0 to m-1 • compute L = (hash(K) + i) mod m • If T[L] is empty, then put K there and stop. • If we cannot find an empty entry to put K, the table is full and we should report an error. A sketch follows.
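A sketch of linear probing insertion (null marks an empty cell):

    static boolean insert(Integer[] table, int m, int k) {
        int home = Math.floorMod(k, m);                 // hash(K)
        for (int i = 0; i < m; i++) {
            int L = (home + i) % m;                     // sequential probing with wraparound
            if (table[L] == null) { table[L] = k; return true; }
        }
        return false;                                   // table is full: report an error
    }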
Linear Probing Example • Keys: 10, 100, 32, 45, 58, 126, 3, 26 • Hash table size: m = 13 • Hash function: h(k) = k % 13 • Collision resolution: linear probing • k: 10, 100, 32, 45, 58, 126, 3, 26 • h(k): 10, 9, 6, 6, 6, 9, 3, 0 • The resulting hash table:
Address:  0    1    2    3    4    5    6    7    8    9   10   11   12
Key:     26    -    -    3    -    -   32   45   58  100   10  126    -
Primary Clustering • We call a block of contiguously occupied table entries a cluster • On the average, when we insert a new key K, we may hit the middle of a cluster. Therefore, the time to insert K would be proportional to half the size of a cluster. That is, the larger the cluster, the slower the performance.
Primary Clustering • Linear probing has the following disadvantages: • Once h(K) falls into a cluster, this cluster will definitely grow in size by one. Thus, this may worsen the performance of insertion in the future. • If two clusters are only separated by one entry, then inserting one key into a cluster can merge the two clusters together. Thus, the cluster size can increase drastically by a single insertion. This means that the performance of insertion can deteriorate drastically after a single insertion. • Large clusters are easy targets for collisions.
Quadratic Probing Example • f(i) = i² • hi(K) = (hash(K) + i²) mod m • Keys: 10, 100, 32, 45, 58, 126, 3, 26 • Hash table size: m = 13 • Hash function: h(k) = k % 13 • Collision resolution: quadratic probing • k: 10, 100, 32, 45, 58, 126, 3, 26 • h(k): 10, 9, 6, 6, 6, 9, 3, 0 • The resulting hash table (45 lands at 7; 58 probes 7, 10, then 2; 126 probes 10, then 0; 26 probes 0, then 1):
Address:  0    1    2    3    4    5    6    7    8    9   10   11   12
Key:    126   26   58    3    -    -   32   45    -  100   10    -    -
Quadratic Probing • Two keys with different home positions have different probe sequences • e.g. m = 101, h(k1) = 30, h(k2) = 29 • probe sequence for k1: 30, 30+1, 30+4, 30+9, … • probe sequence for k2: 29, 29+1, 29+4, 29+9, … • If the table size is prime, a new key can always be inserted if the table is at least half empty (see the proof in the textbook) • Secondary clustering • Keys that hash to the same home position probe the same alternative cells • To avoid secondary clustering, the probe sequence needs to be a function of the original key value, not just the home position. A sketch of the probe sequence follows.
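A sketch of the quadratic probe sequence:

    static int probe(int home, int i, int m) {
        return (home + i * i) % m;      // h_i(K) = (hash(K) + i^2) mod m
    }
    // e.g. home = 6, m = 13 probes 6, 7, 10, 2, ... for i = 0, 1, 2, 3,
    // which is how 58 ends up at index 2 in the example above.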
Double Hashing • To alleviate the problem of clustering, the sequence of probes for a key should be independent of its primary position => use two hash functions: hash() and hash2() • f(i) = i * hash2(K) • E.g. hash2(K) = R - (K mod R), where R is a prime smaller than m
Double Hashing Example • hi(K) = ( hash(K) + f(i) ) mod m; hash(K) = K mod m • f(i) = i * hash2(K); hash2(K) = R - (K mod R), • Example: m=10, R = 7 and insert keys 89, 18, 49, 58, 69 To insert 49, hash2(49)=7, 2nd probe is T[(9+7) mod 10] To insert 58, hash2(58)=5, 2nd probe is T[(8+5) mod 10] To insert 69, hash2(69)=1, 2nd probe is T[(9+1) mod 10]
Choice of hash2() • hash2() must never evaluate to zero • For any key K, hash2(K) must be relatively prime to the table size m. Otherwise, we will only be able to examine a fraction of the table entries. • E.g., if hash(K) = 0 and hash2(K) = m/2, then we can only examine the entries T[0] and T[m/2], and nothing else! • One solution is to make m prime, choose R to be a prime smaller than m, and set hash2(K) = R - (K mod R) • Quadratic probing, in contrast, does not require a second hash function • so it is likely to be simpler and faster in practice. A sketch of double hashing follows.
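A sketch of the double hashing probe sequence with the example slide's parameters (m = 10, R = 7):

    static int probe(int K, int i, int m, int R) {
        int h  = Math.floorMod(K, m);        // hash(K)  = K mod m
        int h2 = R - Math.floorMod(K, R);    // hash2(K) = R - (K mod R), never 0
        return (h + i * h2) % m;
    }
    // probe(49, 0, 10, 7) = 9 and probe(49, 1, 10, 7) = 6, matching the example.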
Deletion in Open Addressing • Actual deletion cannot be performed in open-addressing hash tables • otherwise it would cut off records further down the probe sequence • Hash(k) = k % 13: what happens if we simply delete 45 and then search for 58? In the linear probing table above, 45 sits at index 7 and 58 at index 8; after emptying index 7, a search for 58 probes 6, stops at the now-empty slot 7, and wrongly reports 58 absent. • Solution: add an extra state to each table entry, marking a deleted slot with a special value DELETED (lazy deletion, sketched below).
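A sketch of lazy deletion with linear probing: each slot carries a state, and delete only marks the slot, so a search for 58 still probes past the slot that held 45:

    enum SlotState { EMPTY, OCCUPIED, DELETED }

    static int search(int[] keys, SlotState[] state, int m, int k) {
        int home = Math.floorMod(k, m);
        for (int i = 0; i < m; i++) {
            int L = (home + i) % m;
            if (state[L] == SlotState.EMPTY) return -1;         // k cannot be further on
            if (state[L] == SlotState.OCCUPIED && keys[L] == k) return L;
            // DELETED: keep probing past the tombstone
        }
        return -1;
    }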
Analysis of Open Addressing • Assume that the hash addresses are evenly distributed and probes are independent random events. • The probability of hitting a non-empty cell is λ, and the probability of hitting an empty cell is (1-λ). • The probability of making exactly k probes, where only the last probe hits an empty cell, is λ^(k-1)(1-λ). • The expected number U(λ) of probes in an unsuccessful search is U(λ) = Σ k·λ^(k-1)(1-λ) = (1-λ)·(Σ λ^k)' = (1-λ)·(1/(1-λ))' = 1/(1-λ).
• The number of probes for a successful search equals the number of probes made when the entry was inserted, which is the number of probes of the unsuccessful search performed just before that insertion. • So the expected number of probes for successful searches is the average of U(x) over [0, λ]: S(λ) = (1/λ) ∫₀^λ U(x) dx = (1/λ) ln(1/(1-λ)). • Searching takes constant time: S(λ) is an increasing function, so for example S(λ) ≤ S(0.75) ≈ 1.85 whenever λ ≤ 0.75. The derivation is written out below.
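Written out in LaTeX, the two results above follow from the geometric series (a restatement under the same independence assumption):

    U(\lambda) \;=\; \sum_{k\ge 1} k\,\lambda^{k-1}(1-\lambda)
               \;=\; (1-\lambda)\,\frac{d}{d\lambda}\sum_{k\ge 0}\lambda^{k}
               \;=\; \frac{1-\lambda}{(1-\lambda)^{2}}
               \;=\; \frac{1}{1-\lambda},
    \qquad
    S(\lambda) \;=\; \frac{1}{\lambda}\int_{0}^{\lambda}\frac{dx}{1-x}
               \;=\; \frac{1}{\lambda}\,\ln\frac{1}{1-\lambda}.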
Chaining versus Open Addressing • Advantages of chaining • If the records are large, a chained hash table can save space. • Collision resolution with chaining is simple; no clustering. • The hash table itself can be smaller than the number of records; overflow is no problem. • Deletion is quick and easy in a chained hash table. • Advantages of open addressing • Can be more space-efficient for small records, as no extra links are stored. • No overhead of creating new nodes on insertion. • Easier to serialize, because there are no pointers.
Conclusions • Chaining requires fewer probes than open addressing. • Chaining requires fewer probes for unsuccessful searches than for successful searches. • Quadratic probing is generally better than linear probing. • For successful searches, simple linear probing is not so bad if the table is not almost full. • For unsuccessful searches, clustering quickly causes linear probing to degenerate into a long sequential search. • Finally, the time complexity depends neither on the number of records nor on the size of the array alone; it depends on their ratio, the load factor.
Summary • Hashing gives an efficient data structure for insertion, deletion and searching. The running time depends on the load factor λ, and is constant. For separate chaining, λ should be close to 1; for open addressing, λ should not exceed 0.5. • Hash functions: the division method and the multiplication method for integer keys. In general, "to hash means to chop something up or to make a mess out of it." (Knuth, Vol. 3). • Applications: symbol tables in compilers, on-line spelling checkers, transposition tables in game-playing programs. • Compared with BSTs: a BST supports searches that require ordering, for example finding the minimum, or finding all items in some range.
Exercises • Trakla: Hashing part • Labs: • Implement separate chaining. • Solve “message flood” using hashing, binary search and binary search trees.