Hashing Ch. 12
Motivation • How to retrieve data records with a key value? • Direct Access Searching • The fastest way to access an element with key value k is to make it the kth element in an array • The number of comparisons needed to access an element is zero; all that is required is a single calculation to find the array address • Therefore, access time does not depend on the list size n, but is a constant, O(1) • Limits of Direct Access • Often the number of possible elements in a list far exceeds the amount of contiguous memory that can be spared for an array • Keys may not map to an obvious array element (consider a list of names)
Limits of Direct Access - Example • Consider a situation where we want to make a list of records for students currently doing the CS degree • each student is uniquely identified by a student number • The student numbers currently range from 500000 to 1500000 • therefore an array of 1 million elements would be enough to hold all possible student numbers • Given each student record is at least 100 bytes long, we would require an array of 100 megabytes to do this • However, as there are fewer than 400 students enrolled in CS at present, surely there is a better way? • If the number of possible keys is large but the expected number of actual values in a list is relatively small, key values and records can be stored in a hash table.
The Idea • Hash table • Much smaller than the array that would have been needed to hold all possible values • But it is large enough to hold the expected number of values in the list • Entries of the same type are stored in a hash table, a fixed-size data structure (usually implemented with an array) based on the values of their keys • Given a value V, transform it into a key value Key(V) (usually an integer), then transform Key(V) to access the table • Hash or hashing functions • INSERT, RETRIEVE and DELETE operations
Example • Strings • hashFunction(char * str) { int value = 0; for each letter in str value += toupper(letter) - 'A' + 1; return value % HASHTABLE_SIZE; // mod operator } • With HASHTABLE_SIZE = 11, letters are treated case-insensitively, so A = 1 … Z = 26: • Ray = 18 + 1 + 25 = 44; 44 % 11 = 0 • Sukhen = 19 + 21 + 11 + 8 + 5 + 14 = 78; 78 % 11 = 1 • Wayne = 23 + 1 + 25 + 14 + 5 = 68; 68 % 11 = 2 • Boris = 2 + 15 + 18 + 9 + 19 = 63; 63 % 11 = 8 • Insert Boris, Ray, Sukhen, Wayne
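The pseudocode above can be made runnable. A minimal C sketch, assuming case-insensitive letters and HASHTABLE_SIZE = 11 as in the slide's examples:

```c
#include <ctype.h>

#define HASHTABLE_SIZE 11

/* Sum each letter's alphabet position (A=1 ... Z=26, case-insensitive),
 * then reduce modulo the table size. Non-letters are skipped. */
int hash_string(const char *str) {
    int value = 0;
    for (; *str != '\0'; str++) {
        if (isalpha((unsigned char)*str))
            value += toupper((unsigned char)*str) - 'A' + 1;
    }
    return value % HASHTABLE_SIZE;
}
```

With this function Ray, Sukhen, Wayne and Boris hash to slots 0, 1, 2 and 8 respectively, matching the letter sums above.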
Issues in Hashing • We do not use the key directly as an array subscript. Instead, we have a hash function that converts a key value into a valid array subscript • Important questions : • 1) How do we write a hash function that will produce distinct integer values for a wide range of keys but still fit into the size of our hashtable array? • 2) What happens when two different keys lead to the same hash value (collision resolution)?
Hash Function : H(Key(V)) : V → Key(V) → Int • Must be efficient • the hashing function must run in constant time (it is computed on every access to the hash table) • Perfect hashing function • different keys correspond to different positions in the table • if Key(V1) ≠ Key(V2), then H(Key(V1)) ≠ H(Key(V2)) • Difficult to construct (the actual key values need to be known in advance) • In practice, hashing functions are not perfect • Key(V1) ≠ Key(V2), but H(Key(V1)) = H(Key(V2)) • Two or more elements map to the same bucket - a collision • Collisions need to be resolved • Minimize the number of collisions • Resolve collisions in a way that does not degrade performance
Minimizing The Number Of Collisions • Two most popular strategies • 1) Choose a hashing function that spreads the possible key values evenly across all the different positions in the hash table • H(Key(V)) = (P * Key(V)) mod TABLE_SIZE • where P and TABLE_SIZE are two different prime numbers • 2) Make the hash table larger • by allowing several values to be stored in an array position (the 'bucket' method) • by having more positions available - rehashing • Doubling the size of the table roughly halves the expected number of collisions • The load factor • # of values actually stored in the table / TABLE_SIZE • a small load factor means the chance of collision is small
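Strategy 1 can be sketched in C. Here P = 31 and TABLE_SIZE = 101 are illustrative primes chosen for the sketch, not values fixed by the slide:

```c
/* Spreading strategy: multiply the integer key by a prime P, then
 * reduce modulo a prime table size. P and TABLE_SIZE are assumptions. */
#define P 31u
#define TABLE_SIZE 101u

unsigned int hash_key(unsigned int key) {
    return (P * key) % TABLE_SIZE;
}

/* Load factor = number of values actually stored / TABLE_SIZE. */
double load_factor(unsigned int stored) {
    return (double)stored / TABLE_SIZE;
}
```

A table holding 50 entries out of 101 slots has a load factor just under 0.5, small enough that collisions stay relatively rare.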
Collisions • No matter how good our hash function is, we had better be prepared for collisions • This is due to the birthday paradox: • the probability that your neighbour has the same birthday as you is 1/365 • if you ask 23 people, the probability that at least one of them shares your birthday is still only about 6% • but, if there are 23 people in a room, some two of them share a birthday with probability about 50% • Applying this to hashing yields: • the probability of no collisions after k insertions into an m-element table is (1 - 1/m)(1 - 2/m)…(1 - (k-1)/m) • for m = 365 and k ≥ 50 this probability drops below 3% and rapidly approaches 0
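The no-collision probability above can be computed directly. A small C sketch, assuming the hash spreads keys uniformly over the m slots:

```c
/* Probability of no collisions after k insertions into an m-slot table:
 *   p = (1 - 1/m)(1 - 2/m)...(1 - (k-1)/m) */
double prob_no_collision(int k, int m) {
    double p = 1.0;
    for (int i = 1; i < k; i++)
        p *= 1.0 - (double)i / m;
    return p;
}
```

For m = 365 this gives roughly 0.49 at k = 23 (the birthday-paradox figure) and under 0.03 at k = 50.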
Resolving Collisions • Open Addressing • Linear Probing • Quadratic Probing • Double Hashing • Chaining
Open Addressing • When collisions occur, put entries somewhere else in the hash table • 1) Compute the position at which value V ought to be stored: P = H(Key(V)) • 2) If position P is not OCCUPIED, store V there • 3) If position P is OCCUPIED, compute another position in the table; set P to this value and repeat (2)-(3) • Linear Probing • P = (H(Key(V)) + k) mod TABLE_SIZE , k = 1, 2, … • Simple • Causes clustering - searching the table may be inefficient • To illustrate, consider an example
Example – Linear Probing • Insert Chang = 3 + 8 + 1 + 14 + 7 = 33 % 11 = 0 • Causes clustering • Leads to very inefficient operations • causes the number of collisions to be much greater than it need be • To eliminate primary clustering : use quadratic probing
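A minimal C sketch of linear probing over integer keys, assuming key % TABLE_SIZE as the hash and a table of size 11 (both illustrative):

```c
#define TABLE_SIZE 11
#define EMPTY (-1)

/* Probe slots home, home+1, home+2, ... and store the key in the first
 * free slot. Returns the slot used, or -1 if the table is full. */
int linear_insert(int table[], int key) {
    int home = key % TABLE_SIZE;
    for (int k = 0; k < TABLE_SIZE; k++) {
        int p = (home + k) % TABLE_SIZE;
        if (table[p] == EMPTY) {
            table[p] = key;
            return p;
        }
    }
    return -1;   /* table full */
}
```

Inserting 33, 44 and 55 (which all hash to slot 0) fills slots 0, 1 and 2 in turn, showing how collisions pile up into a contiguous cluster.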
Quadratic Probing • If there is a collision, we first try to insert the element in the next adjacent space (at a distance of +1) • If this is full we try a distance of 4 (2²), then 9 (3²), and so on until we find an empty element • P = (H(Key(V)) + i²) mod TABLE_SIZE , i = 1, 2, … • Not all the locations in a table may be reachable • especially if the table size is a power of 2 • Ideally, each different key should probe the table in a different order
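The quadratic probe sequence can be sketched the same way; key % TABLE_SIZE as the hash and a table of size 11 are again assumptions for the demo:

```c
#define TABLE_SIZE 11
#define EMPTY (-1)

/* Probe at offsets 0, 1, 4, 9, ... from the home slot. Returns the
 * slot used, or -1 if no reachable slot is free. */
int quadratic_insert(int table[], int key) {
    int home = key % TABLE_SIZE;
    if (table[home] == EMPTY) { table[home] = key; return home; }
    for (int i = 1; i < TABLE_SIZE; i++) {
        int p = (home + i * i) % TABLE_SIZE;
        if (table[p] == EMPTY) { table[p] = key; return p; }
    }
    return -1;   /* no reachable free slot */
}
```

With 33, 44 and 55 (all home slot 0), the keys land in slots 0, 1 and 4: the growing offsets spread the overflow instead of forming one contiguous run.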
Double Hashing • Increment P, not by a constant, but by an amount that depends on the Key • Second hashing function, H2(Key(V)) • P = (P + H2(Key(V))) mod TABLE_SIZE • Example : H2(Key(V)) = 1 + (Key(V) mod 7) • Each value probes the array positions in a different order • No clustering: if two keys probe the same position, the next position they probe is different • Of course, there do exist keys that have the same value of H(Key(V)) and the same value of H2(Key(V)) • but these are much rarer than keys that just have the same H(Key(V)) value
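Double hashing can be sketched with the slide's second hash, H2(Key) = 1 + (Key mod 7); the first hash key % TABLE_SIZE and the size 11 are assumptions for the demo:

```c
#define TABLE_SIZE 11
#define EMPTY (-1)

/* The probe step 1 + (key % 7) depends on the key, so keys that share
 * a home slot still probe the table in different orders. */
int double_hash_insert(int table[], int key) {
    int p = key % TABLE_SIZE;
    int step = 1 + key % 7;
    for (int tries = 0; tries < TABLE_SIZE; tries++) {
        if (table[p] == EMPTY) { table[p] = key; return p; }
        p = (p + step) % TABLE_SIZE;
    }
    return -1;   /* table full */
}
```

Keys 33, 44 and 55 all have home slot 0, but their steps (1 + 33%7 = 6, 1 + 44%7 = 3, 1 + 55%7 = 7) send the colliding keys to slots 3 and 7 rather than into one cluster. Because 11 is prime, every step value 1–7 eventually reaches every slot.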
Problems with Open Addressing • Open addressing has the advantage that the amount of memory needed is fixed • Three disadvantages: • It is necessary to distinguish between EMPTY positions (never occupied) and DELETED positions (once occupied, but the value stored there has been deleted) • RETRIEVE is very inefficient when the key does not occur in the table • in the worst case, all positions are marked DELETED: to determine that a value is not in the table we must look at every position • With open addressing, the amount you can store is limited by the size of the table; worse, as the load factor gets large, all the operations degrade to linear time
Chaining • Each position in the table contains a collection of values of unlimited size • e.g., linked implementation of some sort, with dynamically allocated storage • Insert Chang = 3 + 8 + 1 + 14 + 7 = 33 % 11 = 0
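A minimal chaining sketch in C, using singly linked lists and assuming key % TABLE_SIZE as the hash over integer keys:

```c
#include <stdlib.h>

#define TABLE_SIZE 11

/* Each table slot holds the head of a linked chain of colliding keys. */
typedef struct Node {
    int key;
    struct Node *next;
} Node;

/* Prepend the key to the chain at its home slot: O(1), no probing. */
void chain_insert(Node *table[], int key) {
    Node *n = malloc(sizeof(Node));
    n->key = key;
    n->next = table[key % TABLE_SIZE];
    table[key % TABLE_SIZE] = n;
}

/* Search only the chain at the key's home slot. */
int chain_contains(Node *table[], int key) {
    for (Node *n = table[key % TABLE_SIZE]; n != NULL; n = n->next)
        if (n->key == key)
            return 1;
    return 0;
}
```

Colliding keys such as 33 and 44 (both home slot 0) simply share one chain, so no DELETED markers or probe sequences are needed.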
Analysis • Chaining has several advantages over open addressing: • Collision resolution is simple and efficient • The hash table can hold more elements without the large performance deterioration of open addressing • Because the hash table holds only pointers rather than records, it takes up less space initially • Deletion and retrieval are easy - no special key values are necessary • The main cost of chaining is the extra space required for the pointers themselves • If records are small (e.g., simple integers), the space required for chaining is considerably more than for open addressing; however, if space is not a problem then chaining is the preferred method • Chaining adds the time to insert and search in the linked lists • Chained tables still need periodic reorganization • list lengths become long and performance declines, but the performance of chaining declines much more slowly than open addressing's
Conclusions • Other kinds of hashing • Universal Hashing • randomly choose H(Key(V)) from a set of functions H; the set H is universal if, for any two distinct keys, a randomly chosen h in H makes them collide with probability ≤ 1/TABLE_SIZE • Folding Hash Function • The key is divided into several parts; the parts are then combined (i.e., "folded") to generate the index number • shift folding and boundary folding • Multidimensional Hashing • The keys are built from multiple attributes (spatial points or regions) • Hashing Applications • dictionary lookup • spelling checkers • compilers (symbol lookup in a symbol table) • data integrity assurance and data origin authentication • Secure Hash Standard : SHA-1
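Shift folding can be sketched as follows; the 3-digit group size and TABLE_SIZE = 101 are illustrative assumptions, not values fixed by the text:

```c
#define TABLE_SIZE 101   /* illustrative prime size */

/* Shift folding: split the decimal key into 3-digit groups, add the
 * groups, then reduce modulo the table size. */
int fold_hash(unsigned long key) {
    unsigned long sum = 0;
    while (key > 0) {
        sum += key % 1000;   /* take the low 3-digit group */
        key /= 1000;
    }
    return (int)(sum % TABLE_SIZE);
}
```

For example, the key 123456789 folds into 123 + 456 + 789 = 1368 before the final mod; boundary folding would additionally reverse alternate groups before adding.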