Programming, Data Structures and Algorithms (Hashing)

Programming, Data Structures and Algorithms (Hashing) Anton Biasizzo

Hash table ADT
• Search tree ADT
  • Supports various operations on a set of elements.
  • Find runs in O(log n) time.
  • Insert and Delete rely on the find procedure, so both also take O(log n) time.
• Hash table ADT
  • Supports only a subset of the search tree ADT operations (insert, delete, and find).
  • Very fast operations (close to constant time, O(1)).
  • Does not provide ordering information.
  • Implementations are referred to as hashing.

General idea
• A hash table is an array of fixed size.
• The array contains keys (e.g. a string with an associated value).
• The table size (TableSize) is part of the hash data structure.
• Each key is mapped to a number in the range [0, TableSize-1] and stored in the corresponding cell.
• The mapping is called the hash function.
• The hash function should be simple to implement.

General idea
• The returned values are called hash values, hash codes, hash sums, or hashes.
• Ideally, distinct keys should have distinct hash values.
• This is impossible in general: the number of cells (i.e. hash values) is finite, while the supply of keys is inexhaustible.
• The hash function should distribute the keys evenly among the cells.
• When two or more keys map to the same hash value, a collision occurs.
• Hash table implementation:
  • choose a hash function,
  • manage collisions,
  • determine the table size.

Hash function
• If the input keys are integers, the hash function is typically Key mod TableSize:
  • This works well unless the keys have some undesirable property (e.g. the table size is 10 and all keys end in zero).
  • Collisions can be reduced by choosing a prime table size.
  • When the keys are random integers, they are distributed evenly.
• Keys are usually strings:
  • The hash function has to be chosen carefully.
  • One option is to sum the ASCII values of the characters in the string.
  • A second option is to use only the first few characters of the key.
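The effect of the table size on integer keys can be checked with a small sketch. The helper names `hash_int` and `count_used_cells` are illustrative and do not come from the slides, and the 64-cell limit is only there to keep the example free of dynamic allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the Key mod TableSize hash for integer keys. */
unsigned int hash_int(unsigned int key, unsigned int table_size)
{
    assert(table_size > 0);
    return key % table_size;
}

/* Hypothetical helper: counts how many distinct cells a set of keys maps to.
   Assumes table_size <= 64 so that a fixed local array is enough. */
unsigned int count_used_cells(const unsigned int *keys, size_t n,
                              unsigned int table_size)
{
    unsigned char used[64] = {0};
    unsigned int count = 0;
    size_t i;

    assert(table_size > 0 && table_size <= 64);
    for (i = 0; i < n; i++) {
        unsigned int h = hash_int(keys[i], table_size);
        if (!used[h]) {
            used[h] = 1;
            count++;
        }
    }
    return count;
}
```

For the keys 10, 20, …, 100, a table of size 10 sends every key to cell 0, while a table of size 11 (a prime) spreads them over ten different cells.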

Hash function
• Sum of ASCII codes:

    typedef unsigned int INDEX;

    /* Hash a string by summing the codes of its characters. */
    INDEX hash( const char *key, unsigned int H_SIZE )
    {
        unsigned int hash_val = 0;

        while ( *key != '\0' )
            hash_val += (unsigned char) *key++;
        return ( hash_val % H_SIZE );
    }

• It is a simple and fast hash function.
• If the table size is large, it does not distribute the keys well:
  • ASCII codes are at most 127, so for keys with eight or fewer characters the hash is between 0 and 8 × 127 = 1016; such keys never reach the cells above 1016.

Hash function
• First three characters:

    typedef unsigned int INDEX;

    /* Hash a string using only its first three characters. */
    INDEX hash( const char *key, unsigned int H_SIZE )
    {
        return ( ( key[0] + 27 * key[1] + 729 * key[2] ) % H_SIZE );
    }

• Assumes that the key has at least three characters.
• 27 is the number of letters in the English alphabet plus the blank, and 729 = 27².
• This is a good hash function if the characters are random, which is not the case for any natural language.

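Hash function
• Use all characters in the key:

    typedef unsigned int INDEX;

    /* Hash a string by folding in every character; << 5 multiplies by 32. */
    INDEX hash( const char *key, unsigned int H_SIZE )
    {
        unsigned int hash_val = 0;

        while ( *key != '\0' )
            hash_val = ( hash_val << 5 ) + (unsigned char) *key++;
        return ( hash_val % H_SIZE );
    }

• Multiplication by 32 (a shift) is used instead of multiplication by 27.
• It is a simple and fast hash function, provided that overflow is allowed (unsigned arithmetic simply wraps around).
• If the keys are very long:
  • it might be too time consuming,
  • the first characters are shifted out and no longer influence the result.
• In that case, use only some of the characters (those at odd positions, characters from different fields, …).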

Collision resolution
• Collision: a newly inserted element hashes to the same value as an already inserted element.
• Strategies to resolve collisions:
  • open hashing,
  • closed hashing.

Open hashing
• Open hashing is also called separate chaining.
• Keep a list of all elements that hash to the same value.
• The ADT operations (find, insert, …) must be adapted.
• In the example, the lists have headers.
• The hash function is hash(X) = X mod 10.
• Assume that the keys are the first 10 squares.

Open hashing type declaration
• Type declaration (element_type is assumed to be defined elsewhere):

    typedef struct list_node *node_ptr;

    struct list_node
    {
        element_type element;
        node_ptr next;
    };

    typedef node_ptr LIST;
    typedef node_ptr position;

    struct hash_tbl
    {
        unsigned int table_size;
        LIST *the_lists;    /* array of list headers */
    };

    typedef struct hash_tbl *HASH_TABLE;

Open hashing operations
• Initialization (requires <stdlib.h> for malloc and free):

    HASH_TABLE initialize_table( unsigned int table_size )
    {
        HASH_TABLE H;
        unsigned int i;

        /* Allocate table */
        H = (HASH_TABLE) malloc( sizeof (struct hash_tbl) );
        if ( H == NULL )
            return NULL;
        H->table_size = table_size;

        /* Allocate list pointers */
        H->the_lists = (LIST *) malloc( sizeof (LIST) * H->table_size );
        if ( H->the_lists == NULL )
        {
            free( H );
            return NULL;
        }

        /* Allocate list headers (allocation checks omitted for brevity) */
        for ( i = 0; i < H->table_size; i++ )
        {
            H->the_lists[i] = (LIST) malloc( sizeof (struct list_node) );
            H->the_lists[i]->next = NULL;
        }
        return H;
    }

Open hashing operations
• Find operation
  • If the keys are strings or complex structures, an appropriate comparison function must be used instead of the != operator.

    position find( element_type key, HASH_TABLE H )
    {
        position p;
        LIST L;

        L = H->the_lists[ hash( key, H->table_size ) ];
        p = L->next;    /* skip the list header */
        while ( (p != NULL) && (p->element != key) )
            p = p->next;
        return p;       /* NULL if the key is not in the table */
    }

Open hashing operations
• Insert operation (no duplicates)

    void insert( element_type key, HASH_TABLE H )
    {
        position pos, new_cell;
        LIST L;

        pos = find( key, H );
        if ( pos == NULL )    /* key is not in the table yet */
        {
            new_cell = (position) malloc( sizeof (struct list_node) );
            if ( new_cell == NULL )
                return;
            L = H->the_lists[ hash( key, H->table_size ) ];
            new_cell->next = L->next;    /* insert at the front of the list */
            new_cell->element = key;
            L->next = new_cell;
        }
    }

• This implementation computes the hash value twice (once in find and once in insert).
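The open hashing example (hash function mod 10, keys the first 10 squares) can be reproduced with a compact, self-contained sketch. It is a simplified variant, not the code from the slides: the keys are plain non-negative integers, the lists have no header nodes, and all `chain_*` names are made up for this illustration. The first 10 squares are taken here to be 0², 1², …, 9².

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative types; the names are not the ones used on the slides. */
struct chain_node {
    int key;
    struct chain_node *next;
};

struct chain_table {
    unsigned int table_size;
    struct chain_node **cells;   /* cells[i] is the first node of list i, or NULL */
};

/* Key mod TableSize; assumes non-negative keys. */
static unsigned int chain_hash(int key, unsigned int table_size)
{
    return (unsigned int)key % table_size;
}

struct chain_table *chain_create(unsigned int table_size)
{
    struct chain_table *t;
    unsigned int i;

    assert(table_size > 0);
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->cells = malloc(table_size * sizeof *t->cells);
    if (t->cells == NULL) {
        free(t);
        return NULL;
    }
    for (i = 0; i < table_size; i++)
        t->cells[i] = NULL;      /* every list starts empty */
    t->table_size = table_size;
    return t;
}

struct chain_node *chain_find(const struct chain_table *t, int key)
{
    struct chain_node *p = t->cells[chain_hash(key, t->table_size)];

    while (p != NULL && p->key != key)
        p = p->next;
    return p;
}

/* Inserts at the front of the list; duplicates are ignored.
   Returns 1 if the key was added, 0 otherwise. */
int chain_insert(struct chain_table *t, int key)
{
    unsigned int h = chain_hash(key, t->table_size);
    struct chain_node *n;

    if (chain_find(t, key) != NULL)
        return 0;
    n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->key = key;
    n->next = t->cells[h];
    t->cells[h] = n;
    return 1;
}

/* Number of elements stored in one cell's list. */
unsigned int chain_length(const struct chain_table *t, unsigned int cell)
{
    unsigned int len = 0;
    const struct chain_node *p;

    for (p = t->cells[cell]; p != NULL; p = p->next)
        len++;
    return len;
}
```

After inserting 0, 1, 4, …, 81 into a table of size 10, cells 1, 4, 6 and 9 each hold two keys (for example 1 and 81 in cell 1), cells 0 and 5 hold one key, and the remaining cells stay empty.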

Open hashing
• Any scheme could be used instead of linked lists to resolve collisions (trees, another hash table, …).
• We expect that if the table is large, the lists are short.
• The load factor λ is the ratio of the number of elements in the hash table to the table size.
• The average length of a list is λ.
• The effort to perform a search is the constant time needed to compute the hash value plus the time to traverse the list.
• In an unsuccessful search, the number of links to traverse is λ on average.
• The general rule for open hashing is to make the table size about as large as the number of elements expected (λ ≈ 1).

Closed hashing
• Open hashing has the disadvantage of requiring lists or another data structure.
• Closed hashing, or open addressing, is an alternative to resolving collisions with linked lists.
• If a collision occurs, alternate cells are tried until an empty cell is found.
• Formally: cells h_0(X), h_1(X), h_2(X), … are tried in succession, where h_i(X) = (hash(X) + F(i)) mod TableSize.
• The function F is the collision resolution strategy (F(0) = 0).
• Closed hashing needs bigger tables than open hashing.
• In general, the load factor should be kept below λ = 0.5.

Linear probing
• The collision resolution function F is a linear function (typically F(i) = i).
• Cells are tried sequentially, with wraparound, in search of an empty cell.
• As long as the table is big enough, a free cell can always be found.
• The time to find an empty cell can get quite large.
• Even when the table is relatively empty, blocks of occupied cells start forming. This effect is called primary clustering.

Example of linear probing • Example of inserting keys {89, 18, 49, 58, 69}
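The insertion sequence can be traced with a minimal sketch of linear probing. The slide does not state the table size or the hash function; the sketch assumes a table of 10 cells and hash(X) = X mod 10. The names `linear_init`, `linear_insert` and `LINEAR_EMPTY` are illustrative.

```c
#include <assert.h>

#define LINEAR_EMPTY (-1)   /* marker for an unused cell; keys are assumed non-negative */

/* Marks every cell of the table as empty. */
void linear_init(int *table, int table_size)
{
    int i;

    for (i = 0; i < table_size; i++)
        table[i] = LINEAR_EMPTY;
}

/* Inserts key with linear probing, F(i) = i:
   cell_i = (key mod table_size + i) mod table_size.
   Returns the cell used, or -1 if the table is full.
   Duplicate keys are not detected in this sketch. */
int linear_insert(int *table, int table_size, int key)
{
    int i;

    assert(table_size > 0 && key >= 0);
    for (i = 0; i < table_size; i++) {
        int cell = (key % table_size + i) % table_size;
        if (table[cell] == LINEAR_EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Under these assumptions, 89 and 18 go to cells 9 and 8 without collision, 49 wraps around to cell 0, 58 needs three extra probes and lands in cell 1, and 69 lands in cell 2. The occupied cells 8, 9, 0, 1, 2 form one contiguous block, which is the primary clustering described above.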

Quadratic probing
• The collision resolution function F is a quadratic function (typically F(i) = i²).
• It eliminates primary clustering.
• Linear probing performs badly when the table gets almost full.
• In quadratic probing, at most half of the table can be used as alternate locations.
• With quadratic probing there is no guarantee of finding an empty cell once the table gets more than half full.
• If the table size is not prime, an empty cell might not be found even when the table is less than half full.

Example of quadratic probing • Example of inserting keys {89, 18, 49, 58, 69}
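A minimal sketch of quadratic probing for the same keys follows. As before, a table of 10 cells and hash(X) = X mod 10 are assumptions, since the slide does not state them, and the names `quad_init`, `quad_insert` and `QUAD_EMPTY` are illustrative.

```c
#include <assert.h>

#define QUAD_EMPTY (-1)   /* marker for an unused cell; keys are assumed non-negative */

/* Marks every cell of the table as empty. */
void quad_init(int *table, int table_size)
{
    int i;

    for (i = 0; i < table_size; i++)
        table[i] = QUAD_EMPTY;
}

/* Inserts key with quadratic probing, F(i) = i * i:
   cell_i = (key mod table_size + i * i) mod table_size.
   Returns the cell used, or -1 if no empty cell is reached.
   Only table_size probes are tried, because i * i mod table_size
   repeats with period table_size. */
int quad_insert(int *table, int table_size, int key)
{
    int i;

    assert(table_size > 0 && key >= 0);
    for (i = 0; i < table_size; i++) {
        int cell = (key % table_size + i * i) % table_size;
        if (table[cell] == QUAD_EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Under these assumptions, 89 and 18 go to cells 9 and 8, 49 goes to cell 0 (one step past 9), 58 tries cells 8 and 9 and then lands in cell (8 + 4) mod 10 = 2, and 69 tries cells 9 and 0 and then lands in cell (9 + 4) mod 10 = 3.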

Double hashing
• The collision resolution function F includes a second hash function: F(i) = i · hash2(X).
• We probe at distances hash2(X), 2 · hash2(X), 3 · hash2(X), …
• A good second hash function is essential.
• The second hash function must never evaluate to zero!
• The second hash function must be chosen such that all cells can be probed (use a prime table size).

Example of double hashing
• hash2(X) = R - (X mod R), where R = 7
• Example of inserting keys {89, 18, 49, 58, 69}
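The double hashing example can be traced with the sketch below. The second hash function R - (X mod R) is taken from the slide; the table of 10 cells and the first hash function X mod 10 are assumptions. The names `double_init`, `double_hash2`, `double_insert` and `DOUBLE_EMPTY` are illustrative.

```c
#include <assert.h>

#define DOUBLE_EMPTY (-1)   /* marker for an unused cell; keys are assumed non-negative */

/* Marks every cell of the table as empty. */
void double_init(int *table, int table_size)
{
    int i;

    for (i = 0; i < table_size; i++)
        table[i] = DOUBLE_EMPTY;
}

/* Second hash function from the slide: R - (X mod R).
   The result lies in [1, r], so it is never zero. */
int double_hash2(int key, int r)
{
    return r - (key % r);
}

/* Inserts key with double hashing, F(i) = i * hash2(key):
   cell_i = (key mod table_size + i * hash2(key)) mod table_size.
   Returns the cell used, or -1 if no empty cell is reached.
   With a non-prime table size some probe sequences do not visit
   every cell, so the insertion can fail although empty cells remain. */
int double_insert(int *table, int table_size, int r, int key)
{
    int step, i;

    assert(table_size > 0 && r > 0 && key >= 0);
    step = double_hash2(key, r);
    for (i = 0; i < table_size; i++) {
        int cell = (key % table_size + i * step) % table_size;
        if (table[cell] == DOUBLE_EMPTY) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Under these assumptions, 89 and 18 go to cells 9 and 8. For 49, hash2 = 7 - 0 = 7, so it lands in cell (9 + 7) mod 10 = 6. For 58, hash2 = 7 - 2 = 5, giving cell (8 + 5) mod 10 = 3. For 69, hash2 = 7 - 6 = 1, giving cell (9 + 1) mod 10 = 0.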

Problems with closed hash table
• Standard deletion cannot be performed: the cell might have caused a collision, so that another key was placed past it, and emptying the cell would make that key unreachable.
• Closed hash tables require lazy deletion: an additional field is added to each element to tag it as deleted.
• If the table gets too full, the operations get slower and insertion might even fail.
• This can happen when many deletions are intermixed with insertions.
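Lazy deletion can be sketched as follows, assuming linear probing, non-negative integer keys and hash(X) = X mod TableSize. The three-state tag and all `lazy_*` names are illustrative choices, not taken from the slides.

```c
#include <assert.h>

/* CELL_EMPTY is 0, so a zero-initialized table starts out empty. */
enum cell_state { CELL_EMPTY, CELL_OCCUPIED, CELL_DELETED };

struct lazy_cell {
    int key;
    enum cell_state state;   /* the additional field that tags deleted cells */
};

/* Returns the cell holding key, or -1 if the key is absent.
   Deleted cells do not stop the search, because a later key
   may have been placed past them. */
int lazy_find(const struct lazy_cell *table, int size, int key)
{
    int i;

    assert(size > 0 && key >= 0);
    for (i = 0; i < size; i++) {
        int cell = (key % size + i) % size;
        if (table[cell].state == CELL_EMPTY)
            return -1;
        if (table[cell].state == CELL_OCCUPIED && table[cell].key == key)
            return cell;
    }
    return -1;
}

/* Places key in the first empty or deleted cell on its probe path.
   Returns the cell used, or -1 if the table is full.
   Assumes the key is not already present. */
int lazy_insert(struct lazy_cell *table, int size, int key)
{
    int i;

    assert(size > 0 && key >= 0);
    for (i = 0; i < size; i++) {
        int cell = (key % size + i) % size;
        if (table[cell].state != CELL_OCCUPIED) {
            table[cell].key = key;
            table[cell].state = CELL_OCCUPIED;
            return cell;
        }
    }
    return -1;
}

/* Tags the cell as deleted instead of emptying it.
   Returns 1 if the key was found, 0 otherwise. */
int lazy_delete(struct lazy_cell *table, int size, int key)
{
    int cell = lazy_find(table, size, key);

    if (cell < 0)
        return 0;
    table[cell].state = CELL_DELETED;
    return 1;
}
```

With a table of 10 cells holding 89 (cell 9), 18 (cell 8) and 49 (cell 0), deleting 89 only tags cell 9. A later search for 49 still starts at cell 9, skips the tagged cell and finds 49 in cell 0; had cell 9 been emptied, the search would have stopped there.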

Rehashing
• The solution is rehashing:
  • Build another table that is twice as big, with a new hash function.
  • Scan the original hash table.
  • Insert all non-deleted elements into the new hash table.
• It is an expensive operation.
• It happens infrequently.
• Several strategies:
  • rehash when the table is half full,
  • rehash only when an insertion fails,
  • rehash when a certain load factor is reached.
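The rehashing steps can be sketched for a closed hash table with linear probing. Everything here is an illustrative assumption: integer keys, hash(X) = X mod size, no deletions, and the first strategy from the list (rehash once the table would become more than half full). The new size is exactly twice the old one, as on the slide; a real implementation might round it up to a prime.

```c
#include <assert.h>
#include <stdlib.h>

#define REHASH_EMPTY (-1)   /* marker for an unused cell; keys are assumed non-negative */

struct probe_table {
    int size;    /* number of cells */
    int count;   /* number of stored keys */
    int *cells;
};

struct probe_table *probe_create(int size)
{
    struct probe_table *t;
    int i;

    assert(size > 0);
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->cells = malloc((size_t)size * sizeof *t->cells);
    if (t->cells == NULL) {
        free(t);
        return NULL;
    }
    for (i = 0; i < size; i++)
        t->cells[i] = REHASH_EMPTY;
    t->size = size;
    t->count = 0;
    return t;
}

/* Linear probing insert without growth. Returns 1 on success, 0 if full. */
static int probe_place(struct probe_table *t, int key)
{
    int i;

    for (i = 0; i < t->size; i++) {
        int cell = (key % t->size + i) % t->size;
        if (t->cells[cell] == REHASH_EMPTY) {
            t->cells[cell] = key;
            t->count++;
            return 1;
        }
    }
    return 0;
}

/* Builds a table twice as big, reinserts every stored key (which uses the
   new hash function key mod new size), and releases the old storage.
   Returns 1 on success, 0 if the new table could not be allocated. */
int probe_rehash(struct probe_table *t)
{
    struct probe_table *bigger = probe_create(2 * t->size);
    int i;

    if (bigger == NULL)
        return 0;
    for (i = 0; i < t->size; i++)
        if (t->cells[i] != REHASH_EMPTY)
            probe_place(bigger, t->cells[i]);
    free(t->cells);
    *t = *bigger;    /* take over size, count and cells of the new table */
    free(bigger);
    return 1;
}

/* Insert that rehashes first if the load factor would exceed 0.5. */
int probe_insert(struct probe_table *t, int key)
{
    assert(key >= 0);
    if (2 * (t->count + 1) > t->size && !probe_rehash(t))
        return 0;
    return probe_place(t, key);
}
```

Starting from 10 cells, the keys 89, 18, 49, 58 and 69 fit without rehashing (load factor 0.5). Inserting a sixth key triggers a rehash into 20 cells before the key is placed.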

Hash tables
• Hash tables are used to implement the Insert and Find operations in constant average time.
• Hash table usage:
  • compilers, to keep track of declared variables (symbol table),
  • graph theory problems where nodes have names instead of numbers,
  • game-playing programs, to record positions (transposition table),
  • dictionary implementation (spell checkers, search engines, …),
  • database implementation.
