Explore the concepts behind hashing in data structures, from hash functions to resolving collisions for efficient searching. Learn examples, problems, and techniques like linear probing.
Data Structures and Algorithms Hashing Dr. Ken Cosh
Review • Sorting Algorithms • Elementary • Insertion Sort • Selection Sort • Bubble Sort • Efficient • Shell Sort • Heap Sort • Quick Sort • Merge Sort • Radix Sort
Searching so far • We have encountered searching algorithms several times already in the course; • Linked List searches O(n) • Pick a Number • The efficiency of searches has varied, depending on how effectively the data has been arranged. • This week we look at an alternative approach to searching – where the data could be found in constant time O(1)
Searching in Constant Time • In order to find data in constant time, we need to know where to look for it. • Given a ‘key’, which could be in any form (alphanumeric), we need to return an index for some table (or array). • A function which converts a key to an address is known as a hash function. • If every key is given a unique address, it is a perfect hash function.
Hashing Example • Take a student id number (IDNum); • 478603 • A possible hash function could be; • H(IDNum) = IDNum % 1000 • Which would return – what? • This number could then be the array index number.
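A minimal sketch of the function on this slide, for checking the answer. The name hashId and the tableSize parameter are illustrative and do not come from the slides.

```cpp
#include <cassert>

// Hypothetical helper implementing H(IDNum) = IDNum % 1000.
// tableSize defaults to the slide's 1000; the parameter itself is an assumption.
int hashId(int idNum, int tableSize = 1000) {
    assert(idNum >= 0 && tableSize > 0);
    return idNum % tableSize;  // keeps the last three decimal digits when tableSize is 1000
}
```

For the ID on the slide, hashId(478603) gives 603, so the record would be stored at index 603. Any other ID ending in 603 would land on the same index.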
Hashing • If only Hashing was that simple…! • There is a problem with the function: it can return only 1000 different indexes; • What happens when there are more than 1000 students? • When a hash function returns the same index for more than one key, there is a collision. • A hash table needs to contain at least as many positions as the number of elements to be hashed.
Hashing Example 2 • Suppose we need to convert a variable name into a data location. • int ken = 31; • We need a hash function that could return a unique address for each variable name; • H(“ken”) • Consider how many different variable names there could be? • How large should the hash table be?
Hashing Example cont. • Suppose we set the function H() to sum the alphabet positions (a=1, b=2, …, z=26) of the letters in the variable name; • k=11, e=5, n=14; • H(“ken”) = 30. • Therefore we could store the ken data in index 30. • We can use this bad hashing function to highlight some problems that hashing functions should address;
Hashing Problems • If we have a program with 4 variables; • name H(“name”) = 33 • age H(“age”) = 13 • gender H(“gender”) = 53 • mean H(“mean”) = 33 • The data is spread out throughout the table – with many unused wasted cells. • There is a collision between name and mean. • These two problems have to be solved by a simple, efficient algorithm.
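The letter-sum function can be sketched as below to reproduce the four values on this slide. The name letterSumHash is illustrative, and the sketch assumes lowercase names only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: sums the alphabet positions (a=1 ... z=26) of a name.
// Assumes the name contains only lowercase letters a-z.
int letterSumHash(const std::string& name) {
    int sum = 0;
    for (char c : name) {
        assert(c >= 'a' && c <= 'z');
        sum += c - 'a' + 1;
    }
    return sum;
}
```

Because addition ignores order, “name” and “mean” (the same letters rearranged) both sum to 33, which is the collision noted above.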
Good Hash Functions • A good hash function should: - be easy and quick to compute - achieve an even distribution of the key values that actually occur across the index range supported by the table • Typically a hash function will take a key value and: - chop it up into pieces, and - mix the pieces together in some fashion, and - compute an index that will be uniformly distributed across the available range. • Note: hash functions are NOT random in any sense.
Approaches • Truncation: • Ignore a part of the key value and use the remainder as the table index • e.g., 21296876 maps to 976 (for instance by keeping only the 4th, 7th and 8th digits) • Folding: • Partition the key, then combine the parts in some simple manner • e.g., 21296876 maps to 212 + 968 + 76 = 1256, and then 1256 % 1000 = 256 • Modular Arithmetic: • Convert the key to an integer, and then mod that integer by the size of the table • e.g., with a table size of 1000, 21296876 maps to 876
Truncation Caution • It is a good idea for the entire key to have some impact on the hash function; simply truncating a key may lead to many keys returning the same result when hashed. • Consider truncating the following keys to their last 3 letters; • hash, mash, bash, trash. • All four become “ash”.
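Folding and modular arithmetic can be sketched as follows. The function names are illustrative, and splitting the key into three-digit groups from the left is an assumption chosen to match the 212 + 968 + 76 example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Modular arithmetic: reduce the integer key by the table size.
int modHash(long key, int tableSize) {
    assert(key >= 0 && tableSize > 0);
    return static_cast<int>(key % tableSize);
}

// Folding: split the decimal digits into groups of three (from the left),
// add the groups together, then reduce the sum by the table size.
int foldHash(long key, int tableSize) {
    assert(key >= 0 && tableSize > 0);
    const std::string digits = std::to_string(key);
    long sum = 0;
    for (std::size_t pos = 0; pos < digits.size(); pos += 3) {
        sum += std::stol(digits.substr(pos, 3));
    }
    return static_cast<int>(sum % tableSize);
}
```

With a table size of 1000, foldHash(21296876, 1000) gives 256 and modHash(21296876, 1000) gives 876, matching the examples above.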
Hash Function int strHash(string toHash, const int TableSize) { int hashValue = 0; for (unsigned int Pos = 0; Pos < toHash.length(); Pos++) { hashValue = hashValue + int(toHash.at(Pos)); } return (hashValue % TableSize); } Given the key ‘ken’ and a table size of 1000, what would be returned?
Improving the hash function • The hash function given on the previous slide would return the same result if we put either of the following keys in; sham or mash • The hash function didn’t take position into account. • This can easily be remedied with the following change; hashValue = 4*hashValue + int(toHash.at(Pos)); • This is known as Collision Reduction, or rather reducing the chance of collision.
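The two versions can be compared side by side in a short sketch. The name strHashPositional is illustrative, the unsigned accumulator is a choice made here so that long keys wrap around rather than overflow a signed int, and the quoted values assume ASCII character codes.

```cpp
#include <cassert>
#include <string>

// Additive hash from the previous slide: character order has no effect.
int strHash(const std::string& toHash, int tableSize) {
    assert(tableSize > 0);
    int hashValue = 0;
    for (unsigned char c : toHash) {
        hashValue += c;
    }
    return hashValue % tableSize;
}

// Position-sensitive variant using the slide's change (multiply by 4 each step).
int strHashPositional(const std::string& toHash, int tableSize) {
    assert(tableSize > 0);
    unsigned int hashValue = 0;  // unsigned: wraps on overflow instead of being undefined
    for (unsigned char c : toHash) {
        hashValue = 4 * hashValue + c;
    }
    return static_cast<int>(hashValue % tableSize);
}
```

With a table size of 1000, strHash returns 318 for “ken” (107 + 101 + 110) and 425 for both “sham” and “mash”. The positional version separates the two anagrams: 521 for “sham” and 92 for “mash”.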
Resolving Collisions • Even with a sophisticated hashing function it is likely that collisions will still occur, so we need a strategy to deal with collisions. • We first find out about a collision if we try to insert data into a position which is already filled. • In this case we can simply insert the data into a different available position, leaving a record so the data can be retrieved.
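Linear Probing • Linear probing deals with collisions by inserting a new value into the next available space after the space returned by the hash function; • If H(key) is occupied, store the data in H(key)+1; if that is also occupied, try H(key)+2, and so on, wrapping around to the start of the table if necessary.
Linear Probing, problem • Consider the following hash table: a b c d • If c is duplicated, the new value is placed in the successive cell: a b c c d • A further duplicate of c has to probe past the cells that are already occupied: a b c c d c • This leads to clustering, which contradicts one of our key objectives (an even distribution of data across the table).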
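A minimal sketch of linear probing over a table of non-negative integer keys. The names, the EMPTY sentinel, the use of key % size as the hash function, and the wrap-around at the end of the table are all assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // assumed sentinel for an unused cell; keys must be non-negative

// Inserts key at its home index or the next free cell after it.
// Returns the index used, or -1 if the table is full.
int insertLinear(std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int size = static_cast<int>(table.size());
    const int home = key % size;
    for (int step = 0; step < size; ++step) {
        const int idx = (home + step) % size;  // wrap around at the end of the table
        if (table[idx] == EMPTY) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;
}

// Follows the same probe sequence; stops at the first empty cell.
// Returns the index holding key, or -1 if it is not present.
int findLinear(const std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int size = static_cast<int>(table.size());
    const int home = key % size;
    for (int step = 0; step < size; ++step) {
        const int idx = (home + step) % size;
        if (table[idx] == EMPTY) return -1;
        if (table[idx] == key) return idx;
    }
    return -1;
}
```

In a table of 10 cells, inserting 12, 22 and 32 (which all hash to index 2) fills indexes 2, 3 and 4 in turn, so each later search for 32 has to step through the whole run.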
Quadratic Probing • Quadratic Probing is designed to combat the clustering effect of linear probing. • Rather than inserting the data into the next available cell, data is inserted into a cell based on the following sequence; • k • k+1 • k+4 • k+9 • k+16 • While this avoids the clustering seen with linear probing, it introduces a new problem: the probe sequence may not try every slot in the table (if the table size is a prime, then approximately half of the cells will be tested).
General Increment Probing • A general Increment Probing approach will try each cell in a sequence based on a formula; • k • k+s(1) • k+s(2) • k+s(3) • k+s(4) • Here care must be taken that the formula doesn’t return to the first cell too quickly. • What happens if s(i) = i^2 ?
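The coverage problem can be checked with a small counting sketch. The function name is illustrative, and offsets are assumed to be taken modulo the table size.

```cpp
#include <cassert>
#include <set>

// Counts how many distinct cells the sequence k, k+1, k+4, k+9, ...
// reaches (modulo tableSize) over tableSize probes.
int distinctQuadraticSlots(int home, int tableSize) {
    assert(home >= 0 && tableSize > 0);
    std::set<int> visited;
    for (long i = 0; i < tableSize; ++i) {
        visited.insert(static_cast<int>((home + i * i) % tableSize));
    }
    return static_cast<int>(visited.size());
}
```

For a table of 11 cells the sequence reaches only 6 of them, and for 13 cells only 7, whatever the starting index k.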
Key dependent Probing • Another probing strategy could calculate the formula based on some part of the original key – perhaps just adding the value of the first segment of the key. • However this could produce inefficient code.
Deletion • Given this approach to collision resolution, care needs to be taken when deleting data from a cell. • Why? • The tombstone method marks a deleted cell as available for insertion, but also marks it as having had data in it.
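A minimal sketch of tombstone deletion on top of linear probing. The sentinel values, the function names and the key % size hash are assumptions for illustration. The point it shows is that a search must pass over a tombstone rather than stop there, otherwise keys stored further along the same probe sequence would become unreachable.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_CELL = -1;  // cell has never held data
const int TOMBSTONE = -2;   // cell held data that was later deleted

// Returns the index holding key, or -1 if it is not present.
int findKey(const std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int size = static_cast<int>(table.size());
    for (int step = 0; step < size; ++step) {
        const int idx = (key % size + step) % size;
        if (table[idx] == EMPTY_CELL) return -1;  // a never-used cell ends the search
        if (table[idx] == key) return idx;        // tombstones are simply skipped
    }
    return -1;
}

// Stores key in the first never-used or tombstoned cell on its probe sequence.
// Returns the index used, or -1 if the table is full. Does not check for duplicates.
int insertKey(std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int size = static_cast<int>(table.size());
    for (int step = 0; step < size; ++step) {
        const int idx = (key % size + step) % size;
        if (table[idx] == EMPTY_CELL || table[idx] == TOMBSTONE) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;
}

// Replaces key with a tombstone. Returns false if key was not found.
bool removeKey(std::vector<int>& table, int key) {
    const int idx = findKey(table, key);
    if (idx < 0) return false;
    table[idx] = TOMBSTONE;
    return true;
}
```

With 12, 22 and 32 stored at indexes 2, 3 and 4 of a 10-cell table, removing 22 leaves a tombstone at index 3. A search for 32 still succeeds because it steps over the tombstone, and a later insert of 42 reuses index 3.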
Alternative to Probing • An alternative strategy to probing is to use Separate Chaining. • Here more than one piece of data can be associated with the same cell in the hash table. • The cell can contain a pointer to a linked list of data insertions. • This is sometimes known as a bucket.
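Separate chaining might be sketched as below. The struct and member names are illustrative, std::list stands in for the linked list that each cell points to, and key % size is an assumed hash function for non-negative integer keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical chained table: each cell (bucket) owns a linked list of keys.
struct ChainedTable {
    std::vector<std::list<int>> buckets;

    explicit ChainedTable(std::size_t size) : buckets(size) {
        assert(size > 0);
    }

    std::size_t indexOf(int key) const {
        assert(key >= 0);
        return static_cast<std::size_t>(key) % buckets.size();
    }

    // Colliding keys are appended to the same bucket's list.
    void insert(int key) {
        buckets[indexOf(key)].push_back(key);
    }

    // Only the one bucket the key hashes to is searched.
    bool contains(int key) const {
        const std::list<int>& chain = buckets[indexOf(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }
};
```

In a table of 10 buckets, 12 and 22 both end up in bucket 2, and a lookup walks only that bucket's list instead of probing neighbouring cells.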