Introduction to Hashing

Introduction to Hashing CS 311 Winter, 2013

Dictionary Structure • A dictionary structure has the form: (Key, Data) • Dictionary structures are organized in a manner that optimizes search time for the key. • Hashing stores dictionary objects in a table where each location has an address.

Key to Address • Hashing is called a Key to Address system because the address of a dictionary object is computed directly from the key using a function called the Hash Function. • A good hash function should be easy to calculate, distribute the objects throughout the table with equal probability, and minimize collisions.

A Simple Hash Function • An example of a simple hash function for a table of size M (locations 0 to M-1) is: int hash( int key ) { return key % M; } • This assumes non-negative keys; in C, key % M can be negative when key is negative. • With a good hash function, the average search time is O(1).
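
As a concrete illustration of the function above, the following C sketch fixes M at 11 and shifts negative remainders back into range. The table size and the handling of negative keys are assumptions added for illustration; the slide itself only gives key % M.

```c
#include <assert.h>

#define M 11  /* table size: an assumed value for illustration */

/* Map an integer key to a slot in 0..M-1.
   In C, key % M is negative when key is negative,
   so the remainder is shifted back into range. */
int hash(int key)
{
    int h = key % M;
    if (h < 0)
        h += M;
    assert(h >= 0 && h < M);  /* sanity check: result is a valid slot */
    return h;
}
```

With M = 11, the keys 14 and 25 both map to slot 3, which is exactly the kind of collision the following slides deal with.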

Collision Resolution • A collision occurs when two keys result in the same address. • When this happens, we must be able to store the second object in a location that can be quickly found starting from the original hash location. • The two basic approaches to collision resolution are called open hashing (or Separate Chaining) and closed hashing (or Open Addressing).

Open Hashing • Open Hashing means that collisions are resolved by storing the colliding object in a separate area. • In essence, the objects that collide form linked lists, where the head of the list is the original hash location. Hence the name Separate Chaining. • One variation of open hashing is called Bucket Hashing.
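
A minimal sketch of separate chaining in C, assuming integer keys and data, a table of size 7, and insertion at the head of each list. The names chain_put and chain_get are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdlib.h>

#define M 7  /* assumed table size */

struct node {
    int key;
    int data;
    struct node *next;
};

static struct node *table[M];  /* list heads; all NULL at program start */

static int hash(int key)
{
    int h = key % M;
    return (h < 0) ? h + M : h;
}

/* Insert (key, data), or update data if key is already present.
   Returns 0 on success, -1 if memory allocation fails. */
int chain_put(int key, int data)
{
    int slot = hash(key);
    assert(slot >= 0 && slot < M);
    for (struct node *p = table[slot]; p != NULL; p = p->next) {
        if (p->key == key) {
            p->data = data;
            return 0;
        }
    }
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->data = data;
    n->next = table[slot];  /* new node becomes the head of the chain */
    table[slot] = n;
    return 0;
}

/* Return a pointer to the data stored for key, or NULL if key is absent. */
int *chain_get(int key)
{
    for (struct node *p = table[hash(key)]; p != NULL; p = p->next)
        if (p->key == key)
            return &p->data;
    return NULL;
}
```

Keys 10 and 17 both hash to slot 3 here, so they end up on the same list and a lookup walks that list until the key matches.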

Closed Hashing • In closed hashing, objects that collide are stored within the hash table itself. • This can create an additional problem called a Secondary Collision. • Two general methods to resolve collisions in closed hashing are called Probing and Double Hashing.

Probing • In probing, the hash function becomes: ( hash( key ) + p( i ) ) % M, where i is an iteration value and p(0) = 0. • The simplest form of probing is linear probing, where p( i ) = i, for i = 0, 1, 2, … • A problem with linear probing, however, is that it can cause clustering: runs of occupied slots that grow and merge, lengthening later searches.
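
Linear probing could be sketched in C as follows. The table size of 11, the EMPTY sentinel, the restriction to non-negative keys, and the names lp_init, lp_insert and lp_search are all assumptions made for this example.

```c
#include <assert.h>

#define M 11        /* assumed table size */
#define EMPTY (-1)  /* assumed sentinel; keys are taken to be non-negative */

static int table[M];

static int hash(int key) { return key % M; }

/* Mark every slot as unused. Must be called before the first insert. */
void lp_init(void)
{
    for (int i = 0; i < M; i++)
        table[i] = EMPTY;
}

/* Insert key; returns the slot used, or -1 if the table is full. */
int lp_insert(int key)
{
    assert(key >= 0);
    for (int i = 0; i < M; i++) {
        int slot = (hash(key) + i) % M;  /* p(i) = i */
        if (table[slot] == EMPTY || table[slot] == key) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}

/* Return the slot holding key, or -1 if key is not in the table. */
int lp_search(int key)
{
    assert(key >= 0);
    for (int i = 0; i < M; i++) {
        int slot = (hash(key) + i) % M;
        if (table[slot] == EMPTY)
            return -1;  /* an unused slot ends the probe sequence */
        if (table[slot] == key)
            return slot;
    }
    return -1;
}
```

Inserting 3, 14 and 25 (all of which hash to 3) fills slots 3, 4 and 5. Inserting 4 afterwards, which hashes to 4, lands in slot 6: the cluster has pushed a key with a different home slot further down the table.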

Probing II • Another common approach, which avoids the clustering caused by linear probing, is called quadratic probing. • In quadratic probing, p(i) = i^2, for i = 0, 1, 2, … • However, if the table is more than half full or if the table size is not a prime number, quadratic probing may fail to find an open slot even when one exists.
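
The limitation in the last bullet can be checked numerically. The hypothetical helper below, written only for this illustration, counts how many distinct slots the sequence (home + i^2) % m can ever reach; the bound MAX_M is an arbitrary assumption.

```c
#include <assert.h>

#define MAX_M 64  /* assumed upper bound on the table sizes tried */

/* Count the distinct slots visited by (home + i*i) % m for i = 0..m-1. */
int quad_reach(int home, int m)
{
    assert(m > 0 && m <= MAX_M);
    assert(home >= 0 && home < m);
    int seen[MAX_M] = {0};
    int count = 0;
    for (int i = 0; i < m; i++) {
        int slot = (home + i * i) % m;
        if (!seen[slot]) {
            seen[slot] = 1;
            count++;
        }
    }
    return count;
}
```

For the prime size 7 the sequence reaches 4 of the 7 slots, and for 11 it reaches 6 of 11, so a free slot is always found while the table is at most half full. For size 8 it reaches only 3 slots, and for size 16 only 4, so an insert can fail in a table that is mostly empty.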

Double Hashing • A problem with probing is that the probe sequence is the same for all keys that hash to the same location. • An alternative to probing is double hashing. In this case the hash function is ( hash1( key ) + i * hash2( key ) ) % M • If the table size is a prime number M and if R is a prime number less than M, then a good choice for hash2 is: hash2( key ) = R - ( key % R ), which is never 0 for non-negative keys.
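
A small C sketch of the probe sequence under double hashing, assuming M = 11, R = 7, hash1( key ) = key % M, and non-negative keys. The function name probe is hypothetical.

```c
#include <assert.h>

#define M 11  /* prime table size (assumed value) */
#define R 7   /* prime smaller than M (assumed value) */

int hash1(int key) { return key % M; }

/* For non-negative keys the step is in 1..R, so it is never 0. */
int hash2(int key) { return R - (key % R); }

/* Slot examined on the i-th probe for key (i = 0 is the home slot). */
int probe(int key, int i)
{
    assert(key >= 0 && i >= 0);
    return (hash1(key) + i * hash2(key)) % M;
}
```

Keys 14 and 25 share the home slot 3, but their step sizes differ (7 and 3), so their sequences diverge: 3, 10, 6, … for 14 and 3, 6, 9, … for 25.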

Load Factor • The load factor is defined to be N/M, where N is the number of objects in the table and M is the size of the table. • For open hashing, we want the load factor to be close to 1. • For closed hashing, we want the load factor to be less than 0.5.

Deletions • When deleting an object from a hash table, there are two important considerations: deleting an object must not hinder later searches (that is, it must not cut off a chain used for probing), and a slot freed by a deletion must remain usable. • One solution is to use a tombstone.

Tombstones • A tombstone is a special marker stating that a slot is free but was once part of a chain. • A search encountering a tombstone keeps going. • When inserting and encountering a tombstone, we must continue to the end of the chain before reusing the tombstone, to prevent inserting a duplicate value.
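
The three rules above might look like this in C on top of linear probing. The sentinels EMPTY and TOMB, the table size, the restriction to non-negative keys, and the ts_ function names are assumptions made for the sketch.

```c
#include <assert.h>

#define M 11        /* assumed table size */
#define EMPTY (-1)  /* never used */
#define TOMB  (-2)  /* free now, but once part of a chain */

static int table[M];

static int hash(int key) { return key % M; }

void ts_init(void)
{
    for (int i = 0; i < M; i++)
        table[i] = EMPTY;
}

/* Return the slot holding key, or -1. Tombstones do not stop the search. */
int ts_search(int key)
{
    assert(key >= 0);
    for (int i = 0; i < M; i++) {
        int slot = (hash(key) + i) % M;
        if (table[slot] == EMPTY)
            return -1;  /* only a never-used slot ends the chain */
        if (table[slot] == key)
            return slot;
    }
    return -1;
}

/* Insert key; returns its slot, or -1 if there is no room.
   The first tombstone seen is remembered, but reused only after
   the rest of the chain has been checked for a duplicate. */
int ts_insert(int key)
{
    assert(key >= 0);
    int first_tomb = -1;
    for (int i = 0; i < M; i++) {
        int slot = (hash(key) + i) % M;
        if (table[slot] == key)
            return slot;  /* already present */
        if (table[slot] == TOMB && first_tomb < 0)
            first_tomb = slot;
        if (table[slot] == EMPTY) {
            if (first_tomb >= 0)
                slot = first_tomb;
            table[slot] = key;
            return slot;
        }
    }
    if (first_tomb >= 0) {
        table[first_tomb] = key;
        return first_tomb;
    }
    return -1;
}

/* Replace key with a tombstone; returns the slot it occupied, or -1. */
int ts_delete(int key)
{
    int slot = ts_search(key);
    if (slot >= 0)
        table[slot] = TOMB;
    return slot;
}
```

After inserting 3, 14 and 25 and then deleting 14, a search for 25 still succeeds because it steps over the tombstone in slot 4. Inserting 36 afterwards reuses slot 4.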

Tombstones II • Tombstones do lengthen chains. • An alternative to a tombstone is the following. When a value is removed, continue down the chain, swapping the free slot with the next value in the chain. (A value should only be moved if the free slot lies on its own probe path; otherwise a later search for it would fail.) • This shortens the chain by one slot and always puts the freed slot at the end of the chain.
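
One way to realize the swapping idea is the sketch below, which assumes linear probing specifically. It adds a check the slide does not spell out: an entry is moved into the free slot only when that slot lies between the entry's home position and its current position. All names, the table size and the EMPTY sentinel are assumptions for this example.

```c
#include <assert.h>

#define M 11        /* assumed table size */
#define EMPTY (-1)  /* assumed sentinel; keys are taken to be non-negative */

static int table[M];

static int hash(int key) { return key % M; }

void bs_init(void)
{
    for (int i = 0; i < M; i++)
        table[i] = EMPTY;
}

/* Linear-probing insert; returns the slot used, or -1 if the table is full. */
int bs_insert(int key)
{
    assert(key >= 0);
    for (int i = 0; i < M; i++) {
        int slot = (hash(key) + i) % M;
        if (table[slot] == EMPTY || table[slot] == key) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}

/* Return the slot holding key, or -1 if key is absent. */
int bs_search(int key)
{
    assert(key >= 0);
    for (int i = 0; i < M; i++) {
        int slot = (hash(key) + i) % M;
        if (table[slot] == EMPTY)
            return -1;
        if (table[slot] == key)
            return slot;
    }
    return -1;
}

/* Delete key and close the gap it leaves.
   Returns the slot the key occupied, or -1 if it was not found. */
int bs_delete(int key)
{
    int hole = bs_search(key);
    if (hole < 0)
        return -1;
    int removed = hole;
    int j = hole;
    for (int steps = 1; steps < M; steps++) {
        j = (j + 1) % M;
        if (table[j] == EMPTY)
            break;  /* end of the chain */
        int displaced = (j - hash(table[j]) + M) % M;  /* entry's distance from home */
        int gap = (j - hole + M) % M;                  /* entry's distance from the hole */
        if (displaced >= gap) {  /* the hole is on this entry's probe path */
            table[hole] = table[j];
            hole = j;
        }
    }
    table[hole] = EMPTY;
    return removed;
}
```

With 3, 14, 25 and 4 stored in slots 3 to 6, deleting 14 moves 25 into slot 4 and 4 into slot 5, leaving slot 6 empty. An entry sitting in its own home slot is never moved.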

Rehashing • When a table gets too full or when chains get too long, Rehashing creates another table at least twice as big as the original. • This also requires a new hash function, since the table size M has changed. • Then, starting from slot 0, each value in the original table is hashed (using the new function) into the new table.
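
The procedure above can be sketched in C as follows, assuming linear probing, non-negative integer keys, a growth trigger at a load factor of 0.5, and a new size equal to the next prime at or above twice the old size. The struct layout and all function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define EMPTY (-1)  /* assumed sentinel; keys are taken to be non-negative */

struct table {
    int *slots;
    int size;   /* M */
    int count;  /* N */
};

int table_init(struct table *t, int size)
{
    assert(size > 0);
    t->slots = malloc((size_t)size * sizeof *t->slots);
    if (t->slots == NULL)
        return -1;
    for (int i = 0; i < size; i++)
        t->slots[i] = EMPTY;
    t->size = size;
    t->count = 0;
    return 0;
}

/* Linear-probing insert without resizing. The hash function is
   key % size, so it changes whenever the table size changes. */
static int place(struct table *t, int key)
{
    for (int i = 0; i < t->size; i++) {
        int slot = (key % t->size + i) % t->size;
        if (t->slots[slot] == key)
            return slot;
        if (t->slots[slot] == EMPTY) {
            t->slots[slot] = key;
            t->count++;
            return slot;
        }
    }
    return -1;
}

static int is_prime(int n)
{
    if (n < 2)
        return 0;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Build a table at least twice as big and re-insert every value,
   scanning the old table from slot 0. Returns 0 on success. */
int rehash(struct table *t)
{
    int new_size = 2 * t->size;
    while (!is_prime(new_size))
        new_size++;
    struct table bigger;
    if (table_init(&bigger, new_size) != 0)
        return -1;
    for (int i = 0; i < t->size; i++)
        if (t->slots[i] != EMPTY)
            place(&bigger, t->slots[i]);
    free(t->slots);
    *t = bigger;
    return 0;
}

/* Insert key, growing first if the load factor would exceed 0.5. */
int table_insert(struct table *t, int key)
{
    assert(key >= 0);
    if (2 * (t->count + 1) > t->size && rehash(t) != 0)
        return -1;
    return place(t, key);
}
```

Starting from a table of size 5, the third insert would push the load factor past 0.5, so the table is rebuilt with 11 slots before the key is placed.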
