
Hashing






Presentation Transcript


  1. Hashing Ch. 12

  2. Motivation • How do we retrieve data records given a key value? • Direct Access Searching • The fastest way to access an element with key value k is to make it the kth element in an array • The number of comparisons needed to access an element is zero; all that is required is a single calculation to find the array address • Therefore, access time does not depend on the list size n but is constant, O(1) • Limits of Direct Access • Often the number of possible elements in a list far exceeds the amount of contiguous memory that can be spared for an array • Keys may not map to an obvious array element (consider a list of names)

  3. Limits of Direct Access - Example • Consider a situation where we want to make a list of records for students currently doing the CS degree • each student is uniquely identified by a student number • The student numbers currently range from 500000 to 1500000 • therefore an array of about 1 million elements would be enough to hold all possible student numbers • Given each student record is at least 100 bytes long, we would require an array of 100 Megabytes to do this • However, as there are fewer than 400 students enrolled in CS at present, surely there is a better way? • If the number of possible keys is large but the expected number of actual values in a list is relatively small, key values and records can be stored in a hash table.

  4. The Idea • Hash table • Much smaller than the array that would have been needed to hold all possible values • But it is large enough to hold the expected number of values in the list • Entries of the same type are stored in a hash table, a fixed-size data structure (usually implemented with an array) based on the values of their keys • Given a value V, transform it into a key value Key(V) (usually an integer), then transform Key(V) to access the table • Hash or hashing functions • INSERT, RETRIEVE and DELETE operations

  5. Example • Strings • hashFunction(char * str) {     int value = 0;     for every letter in the string         value += toupper(letter) - 64 // fold to uppercase: A..Z map to 1..26     return value % HASHTABLE_SIZE // mod operator, HASHTABLE_SIZE = 11 } • Insert Boris, Ray, Sukhen, Wayne • Boris = 2 + 15 + 18 + 9 + 19 = 63 % 11 = 8 • Ray = 18 + 1 + 25 = 44 % 11 = 0 • Sukhen = 19 + 21 + 11 + 8 + 5 + 14 = 78 % 11 = 1 • Wayne = 23 + 1 + 25 + 14 + 5 = 68 % 11 = 2
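The slide's pseudocode can be written as a small C function. This is a sketch following the slide's conventions: letters are folded to uppercase so that A maps to 1 and Z to 26, and HASHTABLE_SIZE is 11 as in the worked examples.

```c
#include <ctype.h>

#define HASHTABLE_SIZE 11  /* table size used in the slide's examples */

/* Sum the alphabet positions of the letters (A/a -> 1 ... Z/z -> 26),
   then reduce modulo the table size. */
int hashFunction(const char *str) {
    int value = 0;
    for (; *str != '\0'; str++)
        value += toupper((unsigned char)*str) - 64;  /* 'A' is ASCII 65 */
    return value % HASHTABLE_SIZE;
}
```

With this function, hashFunction("Ray") is 0, hashFunction("Sukhen") is 1, hashFunction("Wayne") is 2, and hashFunction("Boris") is 8, matching the slide.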

  6. Issues in Hashing • We do not use the key directly as an array subscript. Instead, we have a hash function that converts a key value into a valid array subscript • Important questions: • 1) How do we write a hash function that will produce distinct integer values for a wide range of keys but still fit into the size of our hashtable array? • 2) What happens when two different keys lead to the same hash value (collision resolution)?

  7. Hash Function : H(Key(V)) : V→Key(V)→Int • Must be efficient • the hashing function must run in constant time (it is computed on every access to the hash table) • Perfect hashing function • different keys correspond to different positions in the table • if Key(V1) ≠ Key(V2), then H(Key(V1)) ≠ H(Key(V2)) • Difficult to construct (the actual key values need to be known in advance) • In practice, hashing functions are not perfect • Key(V1) ≠ Key(V2), but H(Key(V1)) = H(Key(V2)) • Two or more elements map to the same bucket – a collision • Collisions need to be resolved • Minimize the number of collisions • Resolve collisions in a way that does not degrade performance

  8. Minimizing The Number Of Collisions • Two most popular strategies • 1) Choose a hashing function that spreads the possible key values evenly across all the different positions in the hash table • H(Key(V)) = (P * Key(V)) mod TABLE_SIZE • Where P and TABLE_SIZE are two different prime numbers • 2) Make the hash table larger • By allowing several values to be stored in an array position (the `bucket' method) • By having more positions available - rehashing • Doubling the size of the table will halve the expected number of collisions • The load factor • # of values actually stored in table / TABLE_SIZE • Small load factor => chance of collision is small
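Strategy (1) and the load factor can be sketched directly in C; P = 31 and TABLE_SIZE = 101 here are illustrative primes, not values taken from the slides.

```c
#define P 31            /* illustrative prime multiplier */
#define TABLE_SIZE 101  /* illustrative prime table size */

/* Spread integer keys across the table: H(key) = (P * key) mod TABLE_SIZE. */
int hash_key(long key) {
    return (int)((P * key) % TABLE_SIZE);
}

/* Load factor: number of values actually stored divided by the table size. */
double load_factor(int stored, int table_size) {
    return (double)stored / table_size;
}
```

For example, a student number such as 500000 from the earlier slide lands in slot (31 × 500000) mod 101 = 35.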

  9. Collisions • No matter how good our hash function is, we had better be prepared for collisions • This is due to the birthday paradox: • the probability that your neighbor has the same birthday as you is 1/365 • if you ask 23 people, this probability rises only to about 6% (1 − (364/365)^23) • but, if there are 23 people in a room, some two of them share a birthday with probability about 50.7% • Applying this to hashing yields: • the probability of no collisions after k insertions into an m-element table is (m/m) · ((m−1)/m) · … · ((m−k+1)/m) • for m = 365 this probability is already below 3% at k = 50 and rapidly approaches 0 beyond that
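The no-collision probability on this slide can be evaluated numerically; a minimal sketch:

```c
/* Probability that k insertions into an m-slot table (assuming uniform
   random hashing) cause no collision:
   (m/m) * ((m-1)/m) * ... * ((m-k+1)/m). */
double prob_no_collision(int m, int k) {
    double p = 1.0;
    for (int i = 0; i < k; i++)
        p *= (double)(m - i) / m;   /* i slots already taken */
    return p;
}
```

For m = 365 this drops just below 50% at k = 23 (the classic birthday paradox) and below 3% by k = 50.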

  10. Resolving Collisions • Open Addressing • Linear Probing • Quadratic Probing • Double Hashing • Chaining

  11. Open Addressing • When collisions occur, put entries somewhere else in the hash table • 1) Compute the position at which value V ought to be stored: P = H(Key(V)) • 2) If position P is not OCCUPIED, store V there • 3) If position P is OCCUPIED, compute another position in the table; set P to this value and repeat (2)-(3) • Linear Probing • P = (P + 1) mod TABLE_SIZE at each step • Simple • Causes clustering – searching the table may become inefficient • To illustrate, consider an example
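Steps (1)–(3) with a linear probe can be sketched as follows; representing keys as ints and EMPTY as −1 are assumptions for illustration:

```c
#define TABLE_SIZE 11
#define EMPTY (-1)

/* Insert key starting at its home position h, stepping forward one
   slot at a time (wrapping around) until a free slot is found.
   Returns the slot used, or -1 if the table is full. */
int insert_linear(int table[], int key, int h) {
    int p = h;
    for (int tries = 0; tries < TABLE_SIZE; tries++) {
        if (table[p] == EMPTY) {
            table[p] = key;
            return p;
        }
        p = (p + 1) % TABLE_SIZE;  /* linear probe */
    }
    return -1;  /* every position occupied */
}
```

Three keys that all hash to 0 end up in slots 0, 1 and 2: a contiguous run, which is exactly the clustering effect.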

  12. Example – Linear Probing • Insert Chang = 3 + 8 + 1 + 14 + 7 = 33 % 11 = 0 • Position 0 is already occupied (Ray also hashed to 0), so Chang is placed in the next free slot • Causes clustering: occupied slots pile up in contiguous runs • Leads to very inefficient operations • causes the number of collisions to be much greater than it needs to be • To eliminate primary clustering: use quadratic probing

  13. Quadratic Probing • If there is a collision, we first try to insert the element in the next adjacent space (at a distance of +1) • If this is full we try a distance of 4 (2²), then 9 (3²), and so on until we find an empty slot • P = (H(Key(V)) + i²) mod TABLE_SIZE , i = 1, 2, … • Not all the locations in a table may be reachable • especially if the table size is a power of 2 (a prime table size guarantees that at least half the slots can be reached) • One weakness remains: keys with the same home position still probe the table in the same order (secondary clustering); ideally, each different key should probe the table in a different order

  14. Double Hashing • Increment P, not by a constant, but by an amount that depends on the Key • Second hashing function, H2(Key(V)) • P = (P + H2(Key(V))) mod TABLE_SIZE • Example : H2(Key(V)) = 1 + (Key(V) mod 7) • Each value probes the array positions in a different order • No clustering: if two keys probe the same position, the next position they probe is different • Of course, there do exist keys that have the same value of H(Key(V)) and the same value of H2(Key(V)) • but these are much rarer than keys that just have the same H(Key(V)) value
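A sketch of the probe step using the slide's second hash H2(Key(V)) = 1 + (Key(V) mod 7):

```c
#define TABLE_SIZE 11

/* Second hash: step size 1 + (key mod 7); never 0, so probing always advances. */
int h2(long key) {
    return 1 + (int)(key % 7);
}

/* One probe step: advance P by the key-dependent step, wrapping around. */
int next_probe(int p, long key) {
    return (p + h2(key)) % TABLE_SIZE;
}
```

Even if two keys such as 12 and 13 landed in the same home slot, their step sizes differ (h2 gives 6 and 7 respectively), so their probe sequences diverge after the first collision.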

  15. Problems with Open Addressing • Open addressing has the advantage that the amount of memory needed is fixed • Three disadvantages: • It is necessary to distinguish between EMPTY positions (never occupied) and DELETED positions (once occupied, but the value stored there has been deleted) • RETRIEVE is very inefficient when the key does not occur in the table • In the worst case, when all positions are marked DELETED, we must look at every position in the table to determine that a value is not there • With open addressing, the amount you can store in the table is limited by the size of the table; and, what is worse, as the load factor gets large, all the operations degrade to linear time

  16. Chaining • Each position in the table contains a collection of values of unlimited size • e.g., linked implementation of some sort, with dynamically allocated storage • Insert Chang = 3 + 8 + 1 + 14 + 7 = 33 % 11 = 0
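A minimal chained insert, assuming int keys and a singly linked list per bucket (the Node type and names are illustrative):

```c
#include <stdlib.h>

#define TABLE_SIZE 11

typedef struct Node {
    int key;
    struct Node *next;
} Node;

/* Prepend the key to the list in bucket h: O(1), and a collision
   simply makes the chain one node longer. */
void chain_insert(Node *table[], int key, int h) {
    Node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[h];
    table[h] = n;
}
```

In the running example, Chang and Ray both hash to 0, so bucket 0 simply holds a two-node chain; no probing of other slots is needed.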

  17. Analysis • Chaining has several advantages over open addressing: • Collision resolution is simple and efficient • The hash table can hold more elements without the large performance deterioration of open addressing • Because the hash table holds only pointers rather than records, it takes up less space initially • Deletion and retrieval are easy - no special marker values are necessary • The main cost of chaining is the extra space required for the pointers themselves • If records are small (e.g., simple integers), the space required for chaining is considerably more than for open addressing. However, if space is not a problem then chaining is the preferred method • Chaining adds the time needed to insert and search in the linked lists • Chained tables still need periodic reorganization • list lengths become long and performance declines, but the performance of chaining declines much more slowly than open addressing

  18. Conclusions • Other kinds of hashing • Universal Hashing • pick h at random from a family H of hash functions; H is universal if, for any two distinct keys, the probability (over the random choice of h) that they collide is ≤ 1/TABLE_SIZE • Folding Hash Function • The key is divided into several parts, which are then combined (i.e., "folded") to generate the index number • shift folding and boundary folding • Multidimensional Hashing • The keys are formed from multiple attributes (spatial points or regions) • Hashing Applications • dictionary lookup • spelling checkers • compilers (symbol lookup in a symbol table) • data integrity assurance and data origin authentication • Secure Hash Standard : SHA-1
