Hashing

Hashing CS 105

Hashing - Introduction • In a dictionary, if the key can also serve as the index into the array that stores the entries, searching for and inserting items is very fast • Example: Empdata[1000], index = employee ID number • Search for the employee with employee number 500 • Return: Empdata[500] • Running time: O(1)
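The Empdata example can be sketched in Java roughly as follows. This is a minimal illustration, not code from the course: the class name EmployeeDirectory, its method names, and the use of a String as the stored record are all assumptions.

```java
// Direct-address table sketch: the employee ID is used directly as the array index.
// EmployeeDirectory, insert, and find are hypothetical names chosen for illustration.
class EmployeeDirectory {
    // One slot per possible employee ID (valid IDs: 0..999).
    private final String[] empData = new String[1000];

    // Store a record in the slot given by the employee ID: O(1).
    void insert(int id, String record) {
        empData[id] = record;
    }

    // Look up a record by employee ID: O(1). Returns null if the slot is empty.
    String find(int id) {
        return empData[id];
    }
}
```

Both operations are a single array access, which is where the O(1) running time comes from.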

Hash table • Hash table: a data structure, implemented as an array of objects, where the search keys correspond to the array indexes • Insert and find operations involve straightforward array accesses: O(1) time complexity

About hash tables • The first example was relatively easy because the employee number is an integer • Problem 1: possible integer key values might be too large; creating an array big enough to index them directly might be impractical • Need to map large integer values to smaller array indexes • Problem 2: what if the key is a word written in the English alphabet (e.g. a last name)? • Need to map names to integers (indexes)

Large numbers -> small numbers • Hash function - converts a number from a large range into a number from a smaller range (the range of array indices) • Size of array • Rule of thumb: the array size should be about twice the size of the data set (2s for a data set of s items) • For 50,000 words, use an array of 100,000 elements

Hash function and modulo • Simplest hash function - achieved by using the modulo operator (returns the remainder) • For example, 33 % 10 = 3 • General formula: LargeNumber % SmallRange
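A minimal Java sketch of the modulo hash function is shown below. The class and method names are hypothetical. One detail that goes beyond the slide: Java's % operator can return a negative result for a negative key, so this sketch uses Math.floorMod, which always yields a value in 0..tableSize-1 when tableSize is positive.

```java
// Modulo hash sketch: maps an integer key into the range 0..tableSize-1.
// ModuloHash and hash are hypothetical names chosen for illustration.
class ModuloHash {
    static int hash(int key, int tableSize) {
        // Same as key % tableSize for non-negative keys;
        // floorMod also keeps negative keys inside the index range.
        return Math.floorMod(key, tableSize);
    }
}
```

For the slide's example, ModuloHash.hash(33, 10) evaluates to 3.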

Hash functions for names • Sum of Digits Method • Map the letters a-z to the numbers 1 to 26 (a=1, b=2, c=3, etc.) • Add up the values of the letters in the word • For example, “cats” • (c=3, a=1, t=20, s=19) • 3+1+20+19=43 • “cats” will be stored using index = 43 • Can use the modulo operation (%) if you need to map to a smaller array
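The letter-sum method above could be written in Java along these lines. This is a sketch under assumptions: the names LetterSumHash and hash are made up, and skipping characters outside a-z is a choice made here, not something the slide specifies.

```java
// Letter-sum hash sketch: a=1, b=2, ..., z=26, summed over the word.
// LetterSumHash and hash are hypothetical names chosen for illustration.
class LetterSumHash {
    static int hash(String word) {
        int sum = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            // Assumption: characters outside a-z are ignored.
            if (c >= 'a' && c <= 'z') {
                sum += c - 'a' + 1;
            }
        }
        return sum;
    }
}
```

With this sketch, LetterSumHash.hash("cats") gives 3 + 1 + 20 + 19 = 43, matching the worked example.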

Collisions • Problem • Too many words with the same index • “was”, “tin”, “give”, “tend”, “moan”, “tick” and several other words also add up to 43 • These are called collisions (cases where two different search keys hash to the same index value) • Can occur even when dealing with integers • Suppose the size of the hash table is 100 • Keys 158 and 358 hash to the same value when using the modulo hash function (158 % 100 = 358 % 100 = 58)
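To see the collisions concretely, the following Java sketch groups words by their letter-sum index. Everything here is illustrative: CollisionDemo, letterSum, and groupByIndex are hypothetical names, and the letter-sum helper repeats the a=1 ... z=26 scheme from the previous slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Collision sketch: groups words by the index their letter sum produces.
// All names in this class are hypothetical, chosen for illustration.
class CollisionDemo {
    // Letter-sum hash: a=1, b=2, ..., z=26 (non-letters are ignored).
    static int letterSum(String word) {
        int sum = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sum += c - 'a' + 1;
            }
        }
        return sum;
    }

    // Any group holding more than one word is a collision.
    static Map<Integer, List<String>> groupByIndex(String... words) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String w : words) {
            groups.computeIfAbsent(letterSum(w), k -> new ArrayList<>()).add(w);
        }
        return groups;
    }
}
```

Calling groupByIndex with "cats", "was", "tin", "give", "tend", "moan", and "tick" places all seven words in the single group for index 43.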

Collision resolution policy • Need to know what to do when a collision occurs; i.e., during an insert operation, what if the array slot is already occupied? • Most common policy: go to the next available slot • “Wrap around” to the start of the array if necessary • Consequence: when searching, use the hash function to find the starting slot, then check whether the element there is the one you are looking for. If it is not, try the following slots in turn. • How do you know if the element is not in the array?

Probe sequence • The sequence of array indexes (slots) that a key value may be stored in, in the order they are tried • The first index in the probe sequence is the home position, the value of the hash function. The indexes after it are the alternative slots • Example: suppose the array size is 10, and the hash function is h(K) = K % 10. The probe sequence for K=25 is: • 5, 6, 7, 8, 9, 0, 1, 2, 3, 4 • Here, we assume the most common collision resolution policy of going to the next slot: p(K,i) = i • Goal: the probe sequence should visit every array slot
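The probe sequence for the next-slot policy can be generated with a few lines of Java. This is a sketch assuming h(K) = K % M and p(K,i) = i as on the slide, with non-negative keys; ProbeSequence and linearProbe are hypothetical names.

```java
// Probe sequence sketch for the "go to the next slot" policy: p(K, i) = i.
// ProbeSequence and linearProbe are hypothetical names chosen for illustration.
class ProbeSequence {
    // Returns the m slots tried for a non-negative key, starting at the home position.
    static int[] linearProbe(int key, int m) {
        int home = key % m;              // h(K) = K % M
        int[] sequence = new int[m];
        for (int i = 0; i < m; i++) {
            sequence[i] = (home + i) % m; // wrap around at the end of the array
        }
        return sequence;
    }
}
```

For key 25 and array size 10 this produces 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, so every slot is visited exactly once.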

Recap: hash table operations (M is the array size)
• Insert object Obj with key value K
    home <- h(K)
    for i <- 0 to M-1 do
        pos <- (home + p(K,i)) % M
        if HT[pos] is null then
            HT[pos] <- Obj
            return
        else if HT[pos].getKey() = K then
            throw exception “error: duplicate record” // alternative: overwrite
    throw exception “error: table full”
• Finding an object with key value K
    home <- h(K)
    for i <- 0 to M-1 do
        pos <- (home + p(K,i)) % M
        if HT[pos] is null then
            throw exception “not found”
        else if HT[pos].getKey() = K then
            return HT[pos]
    throw exception “not found”
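One way to turn the insert and find operations into Java is sketched below, assuming integer keys, String values, h(K) = K mod M, and the next-slot policy p(K,i) = i. The class name and the parallel key/value arrays are choices made for this sketch. Unlike the slide, find returns null instead of throwing when the key is absent.

```java
// Open-addressing hash table sketch with linear probing (p(K, i) = i).
// LinearProbingTable and its members are hypothetical names for illustration.
class LinearProbingTable {
    private final Integer[] keys;   // null marks an empty slot
    private final String[] values;
    private final int m;            // array size M

    LinearProbingTable(int size) {
        m = size;
        keys = new Integer[m];
        values = new String[m];
    }

    private int h(int key) {
        return Math.floorMod(key, m);
    }

    void insert(int key, String value) {
        int home = h(key);
        for (int i = 0; i < m; i++) {
            int pos = (home + i) % m;
            if (keys[pos] == null) {        // check for an empty slot first
                keys[pos] = key;
                values[pos] = value;
                return;
            }
            if (keys[pos] == key) {         // alternative policy: overwrite
                throw new IllegalArgumentException("duplicate record: " + key);
            }
        }
        throw new IllegalStateException("hash table is full");
    }

    // Returns the stored value, or null if the key is not in the table.
    String find(int key) {
        int home = h(key);
        for (int i = 0; i < m; i++) {
            int pos = (home + i) % m;
            if (keys[pos] == null) {
                return null;                // empty slot: the key cannot be further on
            }
            if (keys[pos] == key) {
                return values[pos];
            }
        }
        return null;                        // every slot probed without a match
    }
}
```

In a table of size 100, inserting keys 158 and 358 puts the first at slot 58 and the second at slot 59, and find(358) reaches it after one extra probe.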

Hash table operations • Note: although insert and find run in O(1) time under typical conditions, the time complexity in the worst case is O(n) • Something to think about: characterize the worst-case scenarios for insert and find

Removing elements • Removing an element from a hash table poses a problem • If we set the corresponding hash table entry to null, then subsequent find operations might not work properly • Recall that in the find algorithm, seeing a null means the target element is not found, but in fact the element might be in a later slot of its probe sequence • Solution: tombstone • Arrange it so that deleted entries seem null when inserting, but don’t seem null when searching • Requires a simple flag on the objects stored
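The tombstone idea might look like this in Java. The sketch stores only integer keys to stay short, and it keeps the flag in a separate state array rather than on the stored objects, which is a variation on what the slide describes. All names are hypothetical.

```java
import java.util.Arrays;

// Tombstone sketch: a linear-probing set of int keys that supports removal.
// TombstoneSet and its members are hypothetical names for illustration.
class TombstoneSet {
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;
    private final int m;

    TombstoneSet(int size) {
        m = size;
        keys = new int[m];
        states = new State[m];
        Arrays.fill(states, State.EMPTY);
    }

    // Searching: a DELETED slot does not stop the probe, only EMPTY does.
    private int indexOf(int key) {
        int home = Math.floorMod(key, m);
        for (int i = 0; i < m; i++) {
            int pos = (home + i) % m;
            if (states[pos] == State.EMPTY) return -1;
            if (states[pos] == State.OCCUPIED && keys[pos] == key) return pos;
        }
        return -1;
    }

    boolean contains(int key) {
        return indexOf(key) >= 0;
    }

    // Inserting: a DELETED slot is treated like an empty one and reused.
    boolean add(int key) {
        if (contains(key)) return false;
        int home = Math.floorMod(key, m);
        for (int i = 0; i < m; i++) {
            int pos = (home + i) % m;
            if (states[pos] != State.OCCUPIED) {
                keys[pos] = key;
                states[pos] = State.OCCUPIED;
                return true;
            }
        }
        throw new IllegalStateException("hash table is full");
    }

    // Removing: leave a tombstone instead of resetting the slot to EMPTY.
    boolean remove(int key) {
        int pos = indexOf(key);
        if (pos < 0) return false;
        states[pos] = State.DELETED;
        return true;
    }
}
```

With size 10, adding 15 and 25 fills slots 5 and 6. After removing 15, a search for 25 still succeeds because the tombstone at slot 5 does not end the probe.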

Hash tables in Java • java.util.Hashtable • Important methods for the Hashtable class • put(Object key, Object entry) • Object get(Object key) • remove(Object key) • boolean containsKey(Object key)
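A short usage sketch of java.util.Hashtable with the methods listed above. It uses the generic form Hashtable<K, V>; the class name HashtableDemo, the method sample, and the sample keys and names are made up for illustration.

```java
import java.util.Hashtable;

// Usage sketch for java.util.Hashtable; the data below is invented sample data.
class HashtableDemo {
    static Hashtable<Integer, String> sample() {
        Hashtable<Integer, String> employees = new Hashtable<>();
        employees.put(500, "Ada");     // insert
        employees.put(158, "Grace");
        employees.remove(158);         // delete by key
        return employees;
    }
}
```

On the returned table, get(500) yields "Ada", containsKey(500) is true, and containsKey(158) is false because that entry was removed.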

Summary • Hash tables implement the dictionary data structure and enable O(1) insert, find, and remove operations under typical conditions • Caveat: O(n) in the worst case because of the possibility of collisions • Requires a hash function (maps keys to array indices) and a collision resolution policy • The probe sequence is the ordered sequence of array slots that an object may occupy, given its key • In Java: use the Hashtable class
