CS 261 – Data Structures Hash Tables Part 1. Open Address Hashing
Can we do better than O(log n) ? • We have seen how skip lists and AVL trees can reduce the time to perform operations from O(n) to O(log n) • Can we do better? Can we find a structure that will provide O(1) operations? • Yes. No. Well, Maybe….
Hash Tables • Hash tables are similar to Arrays except… • Elements can be indexed by values other than integers • A single position may hold more than one element • Arbitrary values (hash keys) map to integers by means of a hash function • Computing a hash function is usually a two-step process: • Transform the value (or key) to an integer • Map that integer to a valid hash table index • Example: storing names • Compute an integer from a name • Map the integer to an index in a table (i.e., a vector, array, etc.)
Hash Tables • Say we're storing names: Angie, Joe, Abigail, Linda, Mark, Max, Robert, John • The hash function maps each name to a table index: • 0: Angie, Robert • 1: Linda • 2: Joe, Max, John • 3: (empty) • 4: Abigail, Mark
Hash Function: Transforming to an Integer • Mapping: Map (a part of) the key into an integer • Example: a letter to its position in the alphabet • Folding: key partitioned into parts which are then combined using efficient operations (such as add, multiply, shift, XOR, etc.) • Example: summing the values of each character in a string • Shifting: get rid of high- or low-order bits that are not random • Example: if keys are always even, shift off the low order bit • Casts: converting a numeric type into an integer • Example: casting a character to an int to get its ASCII value
Hash Function: Combinations • Another use for shifting: in combination with folding when the fold operator is commutative. A commutative fold (such as summing characters) produces the same value for any rearrangement of the key, so anagrams like "cat" and "act" would collide; shifting the intermediate result between folds makes the hash depend on character order.
Hash Function: Mapping to a Valid Index • Almost always use the modulus operator (%) with the table size: • Example: idx = hash(val) % data.size() • Must be sure the final result is positive: • Use only positive arithmetic, or take the absolute value • Watch out for the smallest negative number (negating it overflows an int); consider using longs • To get a good distribution of indices, prime numbers make the best table sizes: • Example: if you have 1000 elements, a table size of 997 or 1009 is preferable
Hash Functions: some ideas • Here are some typical hash functions: • Character: the char value cast to an int, i.e., its ASCII value • Date: a value associated with the current time • Double: a value generated by its bitwise representation • Integer: the int value itself • String: a folded sum of the character values • URL: the hash code of the host name
Hash Tables: Collisions • Ideally, we want a perfect hash function where each data element hashes to a unique hash index • However, unless the data is known in advance, this is usually not possible • A collision is when two or more different keys result in the same hash table index
Example: perfect hashing • Alfred, Alessia, Amina, Amy, Andy and Anne have a club. Amy needs to store information in a six-element array. Amy discovers she can convert the 3rd letter of each name (with 'a' = 0, reduced mod 6) to a distinct index:
Indexing is faster than searching • We can convert a name (e.g. Alessia) into a number (e.g. 4) in constant time. • Even faster than searching. • Allows for O(1) time operations. • Of course, things get more complicated when the input values change (Alan wants to join the club, but his third letter 'a' = 0 maps to the same index as Amy; or worse yet Al, who doesn't have a third letter!)
Hash Tables: Resolving Collisions There are several general approaches to resolving collisions: • Open address hashing: if a spot is full, probe for the next empty spot • Chaining (or buckets): keep a collection at each table entry • Caching: save the most recently accessed value, fall back to a slow search otherwise Today we will examine open address hashing
Open Address Hashing • All values are stored in an array. • The hash value is used to find the initial index to try. • If that position is filled, the next position is examined, then the next, and so on until an empty position is found. • The process of looking for an empty position is termed probing, specifically linear probing. • There are other probing algorithms, but we won't consider them.
Example • Eight-element table using Amy's hash function (3rd letter, 'a' = 0, mod 8). Each slot is labeled with the letters that map to it: • 0-aiqy: Amina • 1-bjrz: (empty) • 2-cks: (empty) • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: (empty) • 7-hpx: Aspen
Now Suppose Anne wants to Join • Anne's index position (5) is filled by Alfred, so we probe to find the next free location, slot 6. • 0-aiqy: Amina • 1-bjrz: (empty) • 2-cks: (empty) • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: Anne • 7-hpx: Aspen
Next comes Agnes • Her position, 5, is filled by Alfred, so we once more probe. Slots 6 and 7 are taken too; when we get to the end of the array, start again at the beginning. Slot 0 is also filled, so we eventually find position 1. • 0-aiqy: Amina • 1-bjrz: Agnes • 2-cks: (empty) • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: Anne • 7-hpx: Aspen
Finally comes Alan • Lastly Alan wants to join. His location, 0, is filled by Amina. Probing finds the last free location, slot 2. The collection is now completely filled. (More on this later) • 0-aiqy: Amina • 1-bjrz: Agnes • 2-cks: Alan • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: Anne • 7-hpx: Aspen
Next operation: contains test • Hash to find the initial index, then move forward examining each location until the value is found, or an empty location is found. • Search for Amina, search for Anne, search for Albert. • Notice that search time is not uniform. • 0-aiqy: Amina • 1-bjrz: (empty) • 2-cks: (empty) • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: Anne • 7-hpx: Aspen
Final Operation: Remove • Remove is tricky: we can't just replace the entry with null. What happens if we delete Agnes, then search for Alan? Alan hashes to 0, and the probe stops at the now-empty slot 1 before it ever reaches him at slot 2. • 0-aiqy: Amina • 1-bjrz: (empty) • 2-cks: Alan • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: Anne • 7-hpx: Aspen
How to handle remove • Simple solution: just don't do it. (we will do this one) • Better: create a tombstone: • A value that marks a deleted entry • Can be replaced with a new entry • But doesn't halt a search • 0-aiqy: Amina • 1-bjrz: TOMBSTONE • 2-cks: Alan • 3-dlt: Andy • 4-emu: Alessia • 5-fnv: Alfred • 6-gow: Anne • 7-hpx: Aspen
Hash Table Size - Load Factor • Load factor: λ = n / m, where n = number of elements and m = size of table • So the load factor represents the average number of elements at each table entry • For open address hashing, the load factor is between 0 and 1 (often kept somewhere between 0.5 and 0.75) • For chaining, the load factor can be greater than 1 • Want the load factor to remain small
What to do with a large load factor • Common solution: when the load factor becomes too large (say, bigger than 0.75), reorganize: • Create a new table with twice the number of positions • Copy each element, rehashing using the new table size, placing elements in the new table • Then delete the old table • Exactly like you did with the dynamic array, only this time using hashing.
Hash Tables: Algorithmic Complexity • Assumptions: • Time to compute the hash function is constant • Worst case analysis: all values hash to the same position • Best case analysis: the hash function uniformly distributes the values (all buckets have the same number of objects in them) • Find element operation: • Worst case for open addressing: O(n) • Best case for open addressing: O(1)
Hash Tables: Average Case • What about the average case? • Turns out, the expected number of probes is roughly 1/(1 − λ), where λ is the load factor • So keeping the load factor small is very important
Your turn • Complete the implementation of the hash table • Use hashfun(value) to get the hash value • Don't do remove. • Do add and the contains test first, then do the internal reorganize method