Understanding Hashing Algorithms in Data Structures

Data Structures and Algorithms Hashing Dr. Ken Cosh

Review • Sorting Algorithms • Elementary • Insertion Sort • Selection Sort • Bubble Sort • Efficient • Shell Sort • Heap Sort • Quick Sort • Merge Sort • Radix Sort

Searching so far • We have encountered searching algorithms several times already in the course; • Linked List searches O(n) • Pick a Number • The efficiency of searches has varied, depending on how effectively the data has been arranged. • This week we look at an alternative approach to searching, where the data can be found in constant time O(1)

Searching in Constant Time • In order to find data in constant time, we need to know where to look for it. • Given a ‘key’, which could be in any form (alphanumeric), we need to return an index for some table (or array). • A function which converts a key to an address is known as a Hash function. • If every key maps to its own unique address, it is a perfect hash function.

Hashing Example • Take a student id number (IDNum); • 478603 • A possible hash function could be; • H(IDNum) = IDNum % 1000 • Which would return – what? • This number could then be the array index number.
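
The hash function on this slide can be sketched in C++ as follows. The name hashId and its parameters are illustrative assumptions, not part of the slides; the table size of 1000 is taken from the example.

```cpp
#include <cstddef>

// Hypothetical helper (not from the slides): maps a student ID to a
// table index by keeping the remainder after division by the table size.
std::size_t hashId(unsigned long idNum, std::size_t tableSize) {
    return idNum % tableSize;
}
```

With a table size of 1000, the remainder is simply the last three digits of the ID, so any two IDs that end in the same three digits map to the same index.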

Hashing • If only Hashing was that simple…! • There is a problem with the function: it can return only 1000 different indexes; • What happens when there are more than 1000 students? • When a hash function returns the same index for more than one key, there is a collision. • A hash table needs to contain at least as many positions as the number of elements to be hashed.

Hashing Example 2 • Suppose we need to convert a variable name into a data location. • int ken = 31; • We need a hash function that could return a unique address for each variable name; • H(“ken”) • Consider how many different variable names there could be? • How large should the hash table be?

Hashing Example cont. • Suppose we set the function H() to sum the values of each letter in the variable name (a=1, b=2, …, z=26); • k=11,e=5,n=14; • H(“ken”) = 30. • Therefore we could store the ken data in index 30. • We can use this bad hashing function to highlight some problems that hashing functions should address;

Hashing Problems • If we have a program with 4 variables; • name H(“name”) = 33 • age H(“age”) = 13 • gender H(“gender”) = 53 • mean H(“mean”) = 33 • The data is spread out throughout the table – with many unused wasted cells. • There is a collision between name and mean. • These two problems have to be solved by a simple, efficient algorithm.
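
A minimal sketch of the letter-sum function H() used in these examples, assuming keys contain only lowercase ASCII letters valued a=1 to z=26. The name letterSumHash is an illustrative assumption.

```cpp
#include <string>

// Sums the alphabet positions of the letters (a=1 ... z=26).
// Assumption: the name contains lowercase ASCII letters only.
int letterSumHash(const std::string& name) {
    int sum = 0;
    for (char c : name) {
        sum += c - 'a' + 1;
    }
    return sum;
}
```

Because addition ignores order, "name" and "mean" contain the same letters and therefore produce the same sum, which is the collision noted above.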

Good Hash Functions • A good hash function should: - be easy and quick to compute - achieve an even distribution of the key values that actually occur across the index range supported by the table • Typically a hash function will take a key value and: - chop it up into pieces, and - mix the pieces together in some fashion, and - compute an index that will be uniformly distributed across the available range. • Note: hash functions are NOT random in any sense.

Approaches • Truncation • Ignore a part of the key value and use the remainder as the table index • e.g., 21296876 maps to 976 • Folding: • Partition the key, then combine the parts in some simple manner • e.g., 21296876 maps to 212 + 968 + 76 = 1256 and then mod to 256 • Modular Arithmetic: • Convert the key to an integer, and then mod that integer by the size of the table • e.g., 21296876 maps to 876
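
The folding and modular arithmetic examples can be sketched as below, assuming a table size of 1000 and three-digit groups taken from the left, as the numbers on the slide suggest. Truncation is left out because the slide does not say which digits are kept. The function names are illustrative assumptions.

```cpp
#include <cstddef>
#include <string>

// Folding: split the decimal digits into groups of three starting from
// the left, add the groups, then reduce modulo the table size.
std::size_t foldHash(unsigned long key, std::size_t tableSize) {
    const std::string digits = std::to_string(key);
    unsigned long sum = 0;
    for (std::size_t i = 0; i < digits.size(); i += 3) {
        sum += std::stoul(digits.substr(i, 3));  // last group may be shorter
    }
    return sum % tableSize;
}

// Modular arithmetic: reduce the integer key modulo the table size.
std::size_t modHash(unsigned long key, std::size_t tableSize) {
    return key % tableSize;
}
```

For the key 21296876, foldHash adds 212 + 968 + 76 = 1256 and reduces it to 256, while modHash keeps the last three digits, 876.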

Truncation Caution • It is a good idea for the entire key to have some impact on the hash function. Simply truncating a key may lead to many keys returning the same result when hashed. • Consider truncating each of the following keys to its last 3 letters; • hash, mash, bash, trash. • All four reduce to “ash”.

Hash Function
#include <string>
using std::string;

// Sums the character codes of the key, then reduces modulo the table size.
int strHash(string toHash, const int TableSize)
{
    int hashValue = 0;
    for (unsigned int Pos = 0; Pos < toHash.length(); Pos++)
    {
        hashValue = hashValue + int(toHash.at(Pos));
    }
    return (hashValue % TableSize);
}
• Given the key ‘ken’ and a table size of 1000, what would be returned?

Improving the hash function • The hash function given on the previous slide would return the same result for either of the following keys; sham or mash • The hash function didn’t take position into account. • This can easily be remedied with the following change; hashValue = 4*hashValue + int(toHash.at(Pos)); • This is known as Collision Reduction, or rather reducing the chance of collision.

Resolving Collisions • Even with a sophisticated hashing function it is likely that collisions will still occur, so we need a strategy to deal with collisions. • We first find out about a collision if we try to insert data into a position which is already filled. • In this case we can simply insert the data into a different available position, leaving a record so the data can be retrieved.

Linear Probing • Linear probing deals with collisions by inserting a new value into the next available space after the space returned by the hash function; • If H(key) is occupied, store the data in H(key)+1; if that is also occupied, try H(key)+2, and so on.

Linear Probing, problem • Consider the following hash table: a b c d • If c is duplicated, the new value is placed in the successive cell: a b c c d • After a further c is inserted: a b c c d c • This leads to clustering, which contradicts one of our key objectives.
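
A minimal sketch of linear-probing insertion, assuming C++17, an integer payload, and a probe sequence that wraps around to the start of the table (the slides do not say what happens at the end of the array). The name linearInsert and the parameter home, standing for H(key), are illustrative assumptions.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Stores key in the first free slot at or after `home`, wrapping around.
// Returns the slot used, or std::nullopt if every slot is occupied.
std::optional<std::size_t> linearInsert(std::vector<std::optional<int>>& table,
                                        int key, std::size_t home) {
    const std::size_t n = table.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t slot = (home + i) % n;
        if (!table[slot]) {          // empty cell found
            table[slot] = key;
            return slot;
        }
    }
    return std::nullopt;             // table is full
}
```

Inserting three keys that all hash to slot 2 fills slots 2, 3 and 4 in turn. A later key whose home is slot 3 then has to travel to slot 5, which shows how a cluster slows down keys that did not collide on their home slot.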

Quadratic Probing • Quadratic Probing is designed to combat the clustering effect of linear probing. • Rather than inserting the data into the next available cell, data is inserted into the first available cell in the following sequence; • k • k+1 • k+4 • k+9 • k+16 • While this solves the problem of clustering, it introduces the problem that the probe sequence may not try every slot in the table (if the table size is a prime, then approximately half of the cells will be tested).
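
The coverage limitation can be checked with a small sketch that collects the distinct slots reached by the sequence k, k+1, k+4, k+9, … taken modulo the table size. The function name quadraticSlots and the wrap-around by modulo are illustrative assumptions.

```cpp
#include <cstddef>
#include <set>

// Returns the distinct slots visited by (home + i*i) % tableSize
// for i = 0 .. tableSize-1.
std::set<std::size_t> quadraticSlots(std::size_t home, std::size_t tableSize) {
    std::set<std::size_t> slots;
    for (std::size_t i = 0; i < tableSize; ++i) {
        slots.insert((home + i * i) % tableSize);
    }
    return slots;
}
```

For a prime table size of 7 the sequence reaches only 4 distinct slots, and for 11 it reaches 6, so an insertion can fail even though free cells remain.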

General Increment Probing • A general Increment Probing approach will try each cell in a sequence based on a formula; • k • k+s(1) • k+s(2) • k+s(3) • k+s(4) • Here care must be taken that the formula doesn’t return to the first cell too quickly. • What happens if s(i) = i² ?

Key dependent Probing • Another probing strategy could calculate the formula based on some part of the original key – perhaps just adding the value of the first segment of the key. • However this could produce inefficient code.

Deletion • Given this approach to collision resolution, care needs to be taken when deleting data from a cell. • Why? • The tombstone method marks a deleted cell as available for insertion, but also marks it as having had data in it.
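
A hedged sketch of the tombstone idea on top of linear probing. It assumes non-negative integer keys, a hash of key modulo table size, and wrap-around probing. All names (State, Slot, home, findSlot, insertKey, eraseKey) are illustrative assumptions, and insertKey does not check for duplicates.

```cpp
#include <cstddef>
#include <vector>

enum class State { Empty, Occupied, Deleted };  // Deleted is the tombstone

struct Slot {
    State state = State::Empty;
    int key = 0;
};

// Hypothetical hash: key modulo table size (non-negative keys assumed).
std::size_t home(int key, std::size_t n) {
    return static_cast<std::size_t>(key) % n;
}

// Returns the slot holding key, or -1. The search stops at a never-used
// cell but continues past tombstones.
long findSlot(const std::vector<Slot>& t, int key) {
    const std::size_t n = t.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t s = (home(key, n) + i) % n;
        if (t[s].state == State::Empty) return -1;
        if (t[s].state == State::Occupied && t[s].key == key) {
            return static_cast<long>(s);
        }
    }
    return -1;
}

// Inserts into the first empty or tombstoned cell on the probe path.
bool insertKey(std::vector<Slot>& t, int key) {
    const std::size_t n = t.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t s = (home(key, n) + i) % n;
        if (t[s].state != State::Occupied) {
            t[s].state = State::Occupied;
            t[s].key = key;
            return true;
        }
    }
    return false;
}

// Marks the cell as a tombstone instead of resetting it to Empty.
bool eraseKey(std::vector<Slot>& t, int key) {
    const long s = findSlot(t, key);
    if (s < 0) return false;
    t[static_cast<std::size_t>(s)].state = State::Deleted;
    return true;
}
```

If eraseKey reset the cell to Empty instead, a later search for a key stored further along the same probe path would stop at the gap and wrongly report the key as missing.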

Alternative to Probing • An alternative strategy to probing is to use Separate Chaining. • Here more than one piece of data can be associated with the same cell in the hash table. • The cell can contain a pointer to a linked list of data insertions. • This is sometimes known as a bucket.
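
A minimal sketch of separate chaining, assuming string keys, integer values, a character-sum hash, and at least one bucket. The class name ChainedTable and its members are illustrative assumptions, and std::list stands in for the linked list held by each cell.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedTable {
public:
    // Assumption: buckets > 0.
    explicit ChainedTable(std::size_t buckets) : buckets_(buckets) {}

    // Adds the pair to the chain for its bucket, or updates an existing key.
    void insert(const std::string& key, int value) {
        auto& chain = buckets_[index(key)];
        for (auto& entry : chain) {
            if (entry.first == key) {
                entry.second = value;
                return;
            }
        }
        chain.emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* find(const std::string& key) const {
        const auto& chain = buckets_[index(key)];
        for (const auto& entry : chain) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

private:
    // Simple character-sum hash, reduced to a bucket index.
    std::size_t index(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char c : key) sum += c;
        return sum % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

Keys that collide, such as "sham" and "mash" under this hash, simply share a chain, so no probing or tombstones are needed. The cost is that a lookup must walk the chain for its bucket.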
