COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV
Topics • How to choose a hash function? • Closed hashing • Linear probing • Quadratic probing • Double hashing
Hash Functions • Good hash function: • Easy & fast to compute • Has a minimal number of clashes • Data items are spread uniformly throughout the array • Hashing problems reduce to the following points: • Finding a hashing method that minimizes collisions • Resolving collisions when they do happen
Hashing Methods • Integer Type • It is sufficient for a hash function to operate on integers • Any arbitrary integer can be converted into an integer within a certain range • The index of the hash table lies within a specific range • Solutions • Digit Selection • Folding • Modulo arithmetic
Hashing Methods • Digit Selection • Choose a group of digits from the number • Use combination of Mod/div operations on the search key • One of the most effective hashing methods
Hashing Methods • Digit Selection • Example • Assume table size = 1000 • Key = 01234567 • Choose 2nd, 4th, & last digits • H(key) = 137 • For a 9-digit key = d1 d2 d3 d4 d5 d6 d7 d8 d9 • Choose leftmost 3 digits • H(key) = key Div 1000000 = d1 d2 d3 • Choose rightmost 3 digits • H(key) = key Mod 1000 = d7 d8 d9
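The digit-selection rules above can be sketched in a few lines of Python. The function names are hypothetical, and the key with a leading zero is kept as a string so that the zero is not lost; selecting the 2nd, 4th, and last digits of 01234567 gives 1, 3, 7.

```python
def leftmost_three(key):
    """Leftmost 3 digits of a 9-digit integer key: key Div 1,000,000."""
    return key // 1_000_000


def rightmost_three(key):
    """Rightmost 3 digits of an integer key: key Mod 1000."""
    return key % 1000


def select_digits(key_text, positions):
    """Pick digits by 1-based position from a key stored as text."""
    return int("".join(key_text[p - 1] for p in positions))


print(leftmost_three(123456789))              # 123
print(rightmost_three(123456789))             # 789
print(select_digits("01234567", [2, 4, 8]))   # 137
```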
Hashing Methods • Digit Selection • Mid-square Method (Multiplication) • First Variant • Key is squared, then some digits of this square are selected to give the index. • Example • k = 54321 • k^2 = 2950771041 • Pick the 3 middle digits: index = 077
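A minimal sketch of the mid-square idea, assuming the "middle" digits are taken by centring a window of `width` digits on the decimal string of the square (rounding toward the left when the centre is not exact). The helper name `mid_square` is illustrative only.

```python
def mid_square(key, width=3):
    """Square the key and return the middle `width` digits as an integer."""
    digits = str(key * key)
    start = (len(digits) - width) // 2   # left edge of the middle window
    return int(digits[start:start + width])


square = 54321 * 54321
print(square)                       # 2950771041
print(f"{mid_square(54321):03d}")   # 077
```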
Hashing Methods • Folding Method • Digits are added together instead of just being selected • Digits can first be grouped, and the groups are then added • Folding can be done more than once on the search key
Hashing Methods • Folding Method • Example: • Key = 1234567 • H(Key) = 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28 • Disadvantage • All values fall in a small range (0 to 63 for a 7-digit key) • Solution • Divide into groups, then fold • Key = 1234567 • Groups: 12 345 67 • Fold: 12 + 345 + 67 = 424 • Hash again to fit into table size
Hashing Methods • Modulo Arithmetic • Choose a prime table size • Take the search key modulo the size of the table • h(x) = x mod TableSize • Items will be distributed over the table • Advantages • Simple • Reduces collisions • Items tend to be distributed more evenly if the table size is a prime number
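Both folding variants can be sketched as follows. The group widths (2, 3, 2) reproduce the grouping 12 | 345 | 67 used in the example; the table size 101 in the last line is an assumed value chosen only to show the final "hash again" step.

```python
def fold_digits(key):
    """Add the individual decimal digits of the key."""
    return sum(int(d) for d in str(key))


def fold_groups(key, widths):
    """Split the key's digits into groups of the given widths and add them."""
    text = str(key)
    total, start = 0, 0
    for width in widths:
        total += int(text[start:start + width])
        start += width
    return total


print(fold_digits(1234567))                     # 28
print(fold_groups(1234567, [2, 3, 2]))          # 424
print(fold_groups(1234567, [2, 3, 2]) % 101)    # 20 (assumed table size 101)
```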
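One way to see why a prime table size can help is to hash keys that share a pattern. The sketch below uses made-up keys (multiples of 10) and two assumed table sizes, 10 and 11; it illustrates the tendency, not a guarantee for every key set.

```python
def h(x, table_size):
    """Modulo hashing: h(x) = x mod TableSize."""
    return x % table_size


keys = list(range(0, 100, 10))           # 0, 10, 20, ..., 90

slots_size_10 = {h(k, 10) for k in keys}  # non-prime size sharing a factor with the keys
slots_size_11 = {h(k, 11) for k in keys}  # prime size

print(len(slots_size_10))   # 1  (every key lands in slot 0)
print(len(slots_size_11))   # 10 (every key gets its own slot)
```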
Hashing Methods • What should be done if the search key is a character string? • Convert the character string into some integer before applying the hash function • How should we do that? • Add up the ASCII codes of the characters: • Can lead to duplication (e.g. NOTE and TONE will result in the same hash value) • Better: write a numeric value for each character in binary • Concatenate the results
Hashing Methods • Example: • Key = NOTE • Numeric code for each character (its position in the alphabet, written in 5 bits) • N = 14 = (01110) • O = 15 = (01111) • T = 20 = (10100) • E = 5 = (00101) • Concatenation • Binary result: • y = (01110 01111 10100 00101) • Equivalent decimal • x = 474,757 • Apply hash function • h(x) = x mod TableSize
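The concatenation step can be sketched by shifting in 5 bits per letter. The function name `encode` and the table size 101 are assumptions made for illustration; the sketch also shows that, unlike a plain sum of character codes, the concatenated value distinguishes NOTE from TONE.

```python
def encode(word):
    """Concatenate the 5-bit alphabet positions (A=1 ... Z=26) of each letter."""
    value = 0
    for ch in word.upper():
        code = ord(ch) - ord("A") + 1    # position of the letter in the alphabet
        value = (value << 5) | code      # append 5 bits
    return value


print(encode("NOTE"))          # 474757
print(encode("NOTE") % 101)    # 57 (assumed table size 101)

# Summing character codes cannot tell anagrams apart; concatenation can.
print(sum(map(ord, "NOTE")) == sum(map(ord, "TONE")))   # True
print(encode("NOTE") == encode("TONE"))                 # False
```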
Closed Hashing (Open Addressing) • No secondary data structure • All the data goes inside the table. • On collision, try alternate cells until an empty cell is found. How? • Bigger table is needed.
Linear Probing • Linear search from position where collision occurred.
Linear Probing • Example: the new record's hash value is [2] • This is called a collision, because there is already another valid record at [2] • When a collision occurs, move forward until you find an empty spot • [5] is empty, so the new record (Number 701466868) goes in that spot • [Figure: array with slots [0] … [100] holding records 281942902, 233667136, 580625685, 506643548, 155778322]
Linear Probing • Move to the next index in the array; when the maximum subscript is reached, return to the first index (wrap around) • Try alternate cells • Cells h_0(x), h_1(x), h_2(x), … are tried until a free cell is found • h_i(x) = ( hash(x) + f(i) ) mod TSIZE • f(0) = 0 • Linear probing • f(i) = i
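The probe formula h_i(x) = (hash(x) + i) mod TSIZE can be sketched as an insertion routine. This is an illustration under assumptions: keys are integers, hash(x) = x mod TSIZE, and `None` marks an empty cell; the table size 7 and the keys are made up.

```python
def linear_probe_insert(table, key):
    """Insert key using linear probing; return the slot used."""
    size = len(table)
    home = key % size                  # hash(x)
    for i in range(size):              # f(i) = i
        slot = (home + i) % size       # wraps around past the last index
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


table = [None] * 7
# 9, 16 and 23 all hash to 2; 6 and 13 both hash to 6, so 13 wraps to 0.
used = [linear_probe_insert(table, k) for k in (9, 16, 23, 6, 13)]
print(used)    # [2, 3, 4, 6, 0]
print(table)   # [13, None, 9, 16, 23, None, 6]
```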
Searching for a Key • The data that's attached to a key can be found fairly quickly. • Calculate the hash value (here [2]). • Check that location of the array for the key. • Keep moving forward until you find the key, or you reach an empty spot. • When the item is found, the information can be copied to the necessary location. • [Figure: the same array, probed from [2] onward ("Not me", "Not me", "Not me", "Yes!") until the key is found]
Deleting a Record • Records may also be deleted from a hash table (here, Number 506643548 is removed). • But the location must not be left as an ordinary "empty spot" since that could interfere with searches. • The location must be marked in some special way so that a search can tell that the spot used to have something in it. • [Figure: the same array with slots [0] … [100] after the deletion]
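Search and deletion with a special "used to be occupied" marker can be sketched as below. The class name, the `DELETED` sentinel, and the sample keys are all hypothetical; the sketch assumes integer keys, hash(x) = x mod size, and that a key is never inserted twice.

```python
EMPTY = None
DELETED = object()   # marker: the spot used to hold a record


class LinearProbingTable:
    def __init__(self, size):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        size = len(self.slots)
        for i in range(size):
            yield (key % size + i) % size

    def insert(self, key):
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY or self.slots[slot] is DELETED:
                self.slots[slot] = key
                return slot
        raise RuntimeError("table is full")

    def find(self, key):
        for slot in self._probe(key):
            entry = self.slots[slot]
            if entry is EMPTY:             # a truly empty spot ends the search
                return None
            if entry is not DELETED and entry == key:
                return slot                # DELETED spots are skipped over
        return None

    def delete(self, key):
        slot = self.find(key)
        if slot is not None:
            self.slots[slot] = DELETED     # do not reset to EMPTY
        return slot


t = LinearProbingTable(7)
for k in (9, 16, 23):        # all three hash to slot 2
    t.insert(k)
t.delete(16)                 # slot 3 becomes DELETED
print(t.find(23))            # 4: the search walks past the marker
print(t.find(16))            # None
```

Had slot 3 been reset to `EMPTY`, the search for 23 would have stopped there and wrongly reported the key as missing.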
Linear Probing • Advantage • Uses less memory than chaining • don’t have to store all the links • Disadvantages • Can be slower than chaining • may have to walk along the table for a long way • Difficult to delete a key and associated record. • has an impact on the search process • Clustering • Primary clustering • Table contains groups of consecutively occupied locations
Quadratic Probing • Linear probing: f(i) = i • Quadratic probing: f(i) = i^2 • Insert 10, 40, 60, 20, 30, 70, 80 into a table of size 10 • Slots after inserting 10 through 70: [0] 10, [1] 40, [4] 60, [5] 70, [6] 30, [9] 20; [2], [3], [7], [8] are empty • Probing for 80 tries offsets 0^2, 1^2, 2^2, 3^2, 4^2, 5^2, 6^2; 6^2 mod 10 = 6
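The insertion sequence above can be replayed with a short sketch, assuming hash(x) = x mod 10 and `None` for an empty cell. The function name is illustrative. With table size 10, the squares mod 10 only ever reach slots 0, 1, 4, 5, 6 and 9, so the last key finds no free cell even though four cells are empty.

```python
def quadratic_insert(table, key):
    """Insert key with f(i) = i*i; return the slot used, or None on failure."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i * i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    return None   # no free cell on this key's probe sequence


table = [None] * 10
slots = [quadratic_insert(table, k) for k in (10, 40, 60, 20, 30, 70, 80)]
print(slots)   # [0, 1, 4, 9, 6, 5, None]
print(table)   # [10, 40, None, None, 60, 70, 30, None, None, 20]
```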
Quadratic Probing • Advantages • Easy to compute • Avoids primary clustering • Disadvantages • Not all cells are probed: an insert might not encounter a free storage location even when some locations are still free • Elements that have the same hash value will probe the same set of alternate cells • Secondary clustering • Not a big problem in practice • Use a good hash function
Double Hashing • Use two hash functions • The first, as before, generates the 'home' position • The second generates an offset from the home position; repeated offsets define the probe sequence • probe = (probe + offset) mod N • If the size of the table is prime and the offset is nonzero, this method will eventually examine every position in the table
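A minimal sketch of the probe rule probe = (probe + offset) mod N. The second hash function used here, 1 + (key mod (N - 1)), is an assumed, commonly used choice that never yields a zero offset; it is not taken from the slides. The table size 7 and the keys are made up.

```python
def probe_sequence(key, n):
    """Yield the n positions visited for key under double hashing."""
    probe = key % n                # first hash: home position
    offset = 1 + key % (n - 1)     # second hash (assumed form), never 0
    for _ in range(n):
        yield probe
        probe = (probe + offset) % n


def double_hash_insert(table, key):
    """Insert key at the first free position on its probe sequence."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


print(list(probe_sequence(16, 7)))   # [2, 0, 5, 3, 1, 6, 4]

table = [None] * 7
# 9, 16 and 23 share home position 2 but use offsets 4, 5 and 6.
print([double_hash_insert(table, k) for k in (9, 16, 23)])   # [2, 0, 1]
```

Because 7 is prime, the sequence for key 16 visits all seven positions exactly once, and keys with the same home position scatter instead of forming one run.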
Problems with Closed Hashing • Table too full • Running time too long • Inserts could fail • Table size must be chosen in advance • Don't know the number of elements • Rehashing • Build a new table that is about twice as big • Hash the elements into the new table • Need to apply the new hash function to every item in the old hash table
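Rehashing can be sketched as follows, assuming linear probing, hash(x) = x mod size, and a new size equal to the first prime at or above twice the old size. The helper names and sample keys are hypothetical.

```python
def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def next_prime(n):
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def insert(table, key):
    """Linear-probing insert; returns the slot used."""
    size = len(table)
    for i in range(size):
        slot = (key % size + i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def rehash(old_table):
    """Build a table about twice as big and re-insert every item."""
    new_table = [None] * next_prime(2 * len(old_table))
    for key in old_table:
        if key is not None:
            insert(new_table, key)   # hash again with the new table size
    return new_table


old = [None] * 7
for k in (9, 16, 23, 6, 13):
    insert(old, k)

new = rehash(old)
print(len(new))                                              # 17
print({k: new.index(k) for k in old if k is not None})
# {13: 13, 9: 9, 16: 16, 23: 6, 6: 7}
```

Items cannot simply be copied slot for slot, because each position depends on the table size; in the larger table most keys here land in their own home slot.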
Summary • Hash tables are specialized for dictionary operations: Insert, Delete, Search • Principle: Turn the key field of the record into a number, which we use as an index for locating the item in an array • O(1) in the ideal case • Problems: finding a good hash function, collisions, wasted space, no support for ordering queries • Implementations: open hashing, closed hashing, dynamic hashing
Review • What is a perfect hash function? • What is a collision? • What is meant by clustering? • How does clustering affect the overall efficiency of hashing? • What is a bucket? • What is the time complexity for insertion, deletion, and search in hashing?