Hashing

1. Def. Hash table - an array in which items are inserted according to a key value (i.e. the key value is used to determine the index of the item).
Ex. Student records stored in an array where each student is assigned an ID number and that number is used as the index.
Are there any problems with this idea?
Gaps will develop if students leave, and insertions of new students are limited by the original size of the array.
Having to know the student ID number in order to find a record is not convenient.
Using the index itself as the key field is therefore not efficient.

2. Def. Hash function - a function used to convert numbers from a large range into numbers in a small range. (The key field is usually the large range and the index of the array is usually the small range.)
Ex. Dictionary of 50,000 words. Use the word itself as the key field, but code it numerically to determine a location to store the word in the array. Let a = 1, b = 2, c = 3, …, z = 26 and let the positions of the letters in the word have power-of-ten values:
Ex. dab = 4 * 10^2 + 1 * 10^1 + 2 * 10^0 = 412
What size array would be needed to store these 50,000 words, if no word is longer than 10 characters?

zzzzzzzzzz would have the code 26 * 1,111,111,111 = 28,888,888,886! (Too big: bigger than the largest int, and no array could be that big.) Also, if locations were chosen this way, there would be a great many empty cells.
What size array should be needed for this dictionary?
100,000 - usually twice as large as the number of items, to allow room for collisions (definition coming up).
A hash function is needed to convert the numeric code to a smaller range.
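The letter coding above can be sketched in Java as follows. The class name WordCode and method name encode are made up for illustration and are not from the slides; the sketch assumes lowercase words containing only the letters a to z, and uses a long because the code for a 10-letter word does not fit in an int.

```java
class WordCode {
    // Encodes a lowercase word with a = 1 ... z = 26, where each letter
    // position carries a power-of-ten weight (rightmost letter = 10^0).
    // Hypothetical helper for illustration; assumes only 'a'..'z' appear.
    static long encode(String word) {
        long code = 0;
        for (int i = 0; i < word.length(); i++) {
            int letter = word.charAt(i) - 'a' + 1; // a -> 1, z -> 26
            code = code * 10 + letter;             // shift by one power of ten, then add
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(encode("dab"));        // 412
        System.out.println(encode("zzzzzzzzzz")); // 28888888886
        System.out.println(encode("zzzzzzzzzz") > Integer.MAX_VALUE); // true
    }
}
```

Multiplying the running total by 10 at each letter gives the same result as summing letter * 10^position, without computing powers explicitly.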

Commonly used hash function: index = largerange % arraysize
Ex. Hash the word gave to find its location in the array dictionary.
7 * 10^3 + 1 * 10^2 + 22 * 10^1 + 5 * 10^0 = 7325, and 7325 % 100,000 = 7325
Ex. Hash the word gaty to find its location in the array dictionary.
7 * 10^3 + 1 * 10^2 + 20 * 10^1 + 25 * 10^0 = 7325, and 7325 % 100,000 = 7325
COLLISION!
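A minimal sketch of the modulo hash function applied to the two example words, assuming the 100,000-cell array suggested for the dictionary. The names ModHash, ARRAY_SIZE, encode and hash are illustrative assumptions, not part of the original slides.

```java
class ModHash {
    // Assumed table size: twice the 50,000 dictionary words.
    static final int ARRAY_SIZE = 100_000;

    // Letter coding from the notes: a = 1 ... z = 26, power-of-ten positions.
    // Hypothetical helper; assumes lowercase 'a'..'z' only.
    static long encode(String word) {
        long code = 0;
        for (int i = 0; i < word.length(); i++) {
            code = code * 10 + (word.charAt(i) - 'a' + 1);
        }
        return code;
    }

    // index = largerange % arraysize
    static int hash(String word) {
        return (int) (encode(word) % ARRAY_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(hash("gave")); // 7325
        System.out.println(hash("gaty")); // 7325, the same cell: a collision
    }
}
```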

3. Def. Collision - occurs when a key hashes to a cell that is already occupied.
4. There are 2 methods to resolve collisions:
Def. Open addressing - in case of collision, search for or store in some other available cell.
Def. Separate chaining - install a linked list at each index of the array and insert all items that hash to an index into the list at that index.

Ex. (open addressing) gaty would be stored in location 7326 if available, otherwise location 7327, or 7328, etc.
5. Types of open addressing:
Linear probe method - if a collision occurs at index x, search locations x+1, x+2, etc. Note: resolves collisions, but primary clusters occur.
Quadratic probe method - search x+1^2, x+2^2, x+3^2, etc. (i.e. x+1, x+4, x+9, ...). Note: resolves primary clusters, but secondary clusters occur.
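The two probe methods can be compared with a small sketch that computes the i-th cell examined after a collision at index x. The class and method names are hypothetical, and wrapping around with % size is an assumption consistent with the wrap-around rule used elsewhere in these notes.

```java
class ProbeSequences {
    // Linear probe: the i-th cell tried after a collision at x is x + i.
    static int linear(int x, int i, int size) {
        return (x + i) % size;
    }

    // Quadratic probe: the i-th cell tried is x + i^2 (x+1, x+4, x+9, ...).
    // long arithmetic avoids int overflow for large i.
    static int quadratic(int x, int i, int size) {
        return (int) ((x + (long) i * i) % size);
    }

    public static void main(String[] args) {
        int size = 100_000;
        for (int i = 1; i <= 3; i++) {
            System.out.println(linear(7325, i, size) + " " + quadratic(7325, i, size));
        }
        // prints: 7326 7326, then 7327 7329, then 7328 7334
    }
}
```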

Rehashing (also called double hashing) - when a collision occurs, determine the step used to search for an available cell by hashing the key value again with a second function.
Ex. step = 5 - key % 5
What steps result? 5, 4, 3, 2, 1
How is this different from the linear & quadratic probe methods? The step is different for different keys.
Note: table size must be prime in order to probe all cells.
Ex. size = 20, step = 5, x = 0: 0, 5, 10, 15, 0, 5, 10, 15, … (only 4 cells are ever probed)
Try size = 19, step = 5, x = 0: 0, 5, 10, 15, 1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14 (all 19 cells are probed)
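The effect of a prime table size can be checked with a short sketch that lists the cells visited before the probe sequence returns to its starting index. The class DoubleHashProbe and the method probeSequence are illustrative names, not from the slides; step uses the example formula above and assumes non-negative keys.

```java
import java.util.ArrayList;
import java.util.List;

class DoubleHashProbe {
    // Second hash function from the example; assumes key >= 0.
    static int step(int key) {
        return 5 - key % 5;
    }

    // Collects every index visited, starting at 'start' and advancing by
    // 'step' with wrap-around, until the sequence returns to 'start'.
    static List<Integer> probeSequence(int start, int step, int size) {
        List<Integer> visited = new ArrayList<>();
        int index = start;
        do {
            visited.add(index);
            index = (index + step) % size;
        } while (index != start);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(probeSequence(0, 5, 20));        // [0, 5, 10, 15]
        System.out.println(probeSequence(0, 5, 19).size()); // 19
    }
}
```

With size 20 the step 5 divides the table size, so the probe cycles through only four cells; with the prime size 19 no step from 1 to 5 shares a factor with the size, so every cell is reached.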

Write code to increase a hash value by step.
hashVal += step
What do we do if a hash value becomes greater than or equal to the size of the array?
Wrap around: hashVal %= arraysize
What do we do about duplicate key values?
They should not be allowed. When the first item with a given key is found, the search stops. A second item with the same key would never be found (unless the code is changed). Select a key value that is unique to the item (ex. Social Security number).

How do we handle deletions?
Replace one field by -1 rather than replacing the entire object by null. Often the object's info may be needed in the future. Ex. Even when an employee leaves, pension & tax info is needed.
However, there is another reason in this code. Something undesirable occurs if the object is replaced by null. Demonstrate what and explain why.
while (hashRay[hashVal] != null && hashRay[hashVal].iData != -1)
What method requires this condition and why?
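One way to explore the deletion questions is the sketch below: a small open-addressing table with double hashing in which delete sets iData to -1. Only hashRay, hashVal, iData and the while condition come from the notes; the classes DataItem and HashTable, the methods hashFunc and stepFunc, and the probe limit in find are assumptions made for illustration. The sketch assumes a prime table size greater than 5, unique non-negative keys, and a table that is never full.

```java
class DataItem {
    int iData; // key field; -1 marks a deleted item
    DataItem(int key) { iData = key; }
}

class HashTable {
    private final DataItem[] hashRay;
    private final int arraySize; // assumed prime and greater than 5

    HashTable(int size) {
        arraySize = size;
        hashRay = new DataItem[size];
    }

    private int hashFunc(int key) { return key % arraySize; }
    private int stepFunc(int key) { return 5 - key % 5; }

    // Probes until a cell is empty (null) or holds a deleted item (-1).
    // Assumes the table is not full.
    void insert(DataItem item) {
        int hashVal = hashFunc(item.iData);
        int step = stepFunc(item.iData);
        while (hashRay[hashVal] != null && hashRay[hashVal].iData != -1) {
            hashVal += step;
            hashVal %= arraySize; // wrap around
        }
        hashRay[hashVal] = item;
    }

    // A null cell ends the search; a deleted (-1) cell does not.
    DataItem find(int key) {
        int hashVal = hashFunc(key);
        int step = stepFunc(key);
        for (int probes = 0; probes < arraySize && hashRay[hashVal] != null; probes++) {
            if (hashRay[hashVal].iData == key) {
                return hashRay[hashVal];
            }
            hashVal += step;
            hashVal %= arraySize;
        }
        return null;
    }

    // Marks the item as deleted instead of setting the cell to null.
    boolean delete(int key) {
        DataItem item = find(key);
        if (item == null) {
            return false;
        }
        item.iData = -1;
        return true;
    }
}
```

For example, in a table of size 19 the keys 19 and 38 both hash to cell 0, so 38 is stored further along its probe path. After delete(19), find(38) still succeeds because cell 0 holds a marked item rather than null; had the cell been set to null, the search for 38 would stop at cell 0 and report the item missing. In this sketch it is insert that uses the condition from the notes, so that cells holding deleted items can be reused.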

6. Def. Load factor - the ratio of the number of items in a hash table to the size of the table (array).
The fuller a table is, the worse clustering becomes. Therefore, hash tables should be designed never to become more than 1/2 to 2/3 full when open addressing is used.
7. When separate chaining is used to resolve collisions, is load factor a concern?
No. n items or more can be placed in a table of size n, and the load factor will be 1 or more (i.e. some locations will hold one or more items in their linked lists).

How do we handle duplicates with separate chaining?
Duplicates are allowed and will be stored in the same list. Note: the search process slows because the list is searched linearly.
How do we handle deletions?
Deletions can be made from a linked list, if appropriate for the application, without empty cell problems resulting.
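A minimal sketch of separate chaining, assuming non-negative int keys and using java.util.LinkedList for the list at each index. The class ChainedHashTable and its method names are illustrative, not from the slides. It shows duplicates landing in the same list, deletion by removing a list node, and a load factor above 1.

```java
import java.util.LinkedList;

class ChainedHashTable {
    private final LinkedList<Integer>[] lists; // one linked list per array index
    private int itemCount = 0;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        lists = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            lists[i] = new LinkedList<>();
        }
    }

    // Assumes key >= 0.
    private int hashFunc(int key) { return key % lists.length; }

    // Duplicates are allowed; they end up in the same list.
    void insert(int key) {
        lists[hashFunc(key)].addFirst(key);
        itemCount++;
    }

    // Linear search of one list only.
    boolean find(int key) {
        return lists[hashFunc(key)].contains(key);
    }

    // Removes one occurrence of the key; no marker value is needed.
    boolean delete(int key) {
        boolean removed = lists[hashFunc(key)].remove(Integer.valueOf(key));
        if (removed) {
            itemCount--;
        }
        return removed;
    }

    // Number of items divided by the size of the array; may exceed 1.
    double loadFactor() {
        return (double) itemCount / lists.length;
    }
}
```

Inserting 7 keys into a table built with size 5 gives a load factor of 1.4, which open addressing could not accommodate at all.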

8. What is the advantage of a hash table?
O(1) complexity to search for or insert an item (i.e. constant time on average, regardless of the number of items, as long as the table is not too full).
9. Disadvantage?
Must know the size of the array needed in advance (in Java, arrays cannot be resized; another, bigger array would be needed). This problem is reduced when separate chaining is used. Also, there is no way to access items in order.
