Hashing • Joe Meehean
Motivation • BST • easy to implement • average-case times O(LogN) • worst-case times O(N) • AVL Trees • harder to implement • worst case times O(LogN) • Can we do better in the average-case?
Concept • “Dictionary” ADT • average-case time O(1) for lookup, insert, and delete • Idea • stores keys (and associated values) in an array • compute each key’s array index as a function of its value • take advantage of array’s fast random access • Alternative implementation for sets and maps
Example • Goal • Store info about a company’s 50 employees • Each employee has a unique employee ID • in range 100-200 • Approach • use an array of size 101 (range of IDs) • store employee E’s info in array[E-100] • Result • insert, lookup, delete each O(1) • Wasted space, 51 locations
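The employee example above can be sketched in a few lines of C++. This is only an illustration: the `Employee` record, the `EmployeeTable` class, and its member names are hypothetical (the slide only says "info"), and `std::optional` (C++17) is one possible way to mark an unused cell.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical record type; the slides do not say what "info" contains.
struct Employee {
    int id;
    std::string name;
};

// Direct addressing: the array index is computed from the key itself.
class EmployeeTable {
    static constexpr int kMinId = 100;
    static constexpr int kMaxId = 200;
    // 101 cells, one per possible ID; an empty optional marks an unused cell.
    std::array<std::optional<Employee>, kMaxId - kMinId + 1> slots_;

    // hash(x) = x - 100, as in the slide.
    static std::size_t index(int id) {
        assert(id >= kMinId && id <= kMaxId);
        return static_cast<std::size_t>(id - kMinId);
    }

public:
    void insert(const Employee& e) { slots_[index(e.id)] = e; }
    const std::optional<Employee>& lookup(int id) const { return slots_[index(id)]; }
    void remove(int id) { slots_[index(id)].reset(); }
};
```

Each operation is a single array access, which is where the O(1) bound comes from. With 50 employees, 51 of the 101 cells stay empty.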
Drawbacks • Less functionality than trees • Hash tables cannot efficiently • find min • find max • print entire table in sorted order • Must be very careful how we use them
Terminology • Hashtable • the underlying array • Hash function • function that converts a key to an index • in example: hash(x) = x – 100 • TableSize • size of underlying array or vector • Bucket • single cell of a hash table array • Collision • when two keys hash to the same bucket
Assumptions • Keys we are using have a hash function • or we can define good hash functions for them • Keys overload the following operators • == • !=
Resolving Obvious Problems How do we make a good hash function? What should we do about collisions? How large should we make our hash table?
Hash Function Goals • Hash function should be fast • Keys should be evenly distributed • different keys should have different hash values • Should reduce space needed • e.g., student IDs are 10 digits • do not need an array size of 10,000,000,000 • there are only ~3,000 students
Hash Functions Approach • Convert key to an int n • scramble up the data • ensure the data spreads over the entire integer space • Return n % TableSize • ensures that n doesn’t fall off the end of the table
Example: Converting Strings • Method 1 • convert each char to an int • sum them • return sum % TableSize • Advantages • simple • time is O(key length)
Example: Converting Strings • Method 1 • convert each char to an int • sum them • return sum % TableSize • Problems • short keys may not reach end of table • sum of characters < TableSize (by a lot) • maps all permutations to same hash • hash(“able”) = hash(“bale”) • Time is O(key length)
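A minimal sketch of Method 1, to make both problems concrete. The function name `sumHash` and the table size 10007 used in the note below are assumptions chosen for illustration, not values from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Method 1: add up the character codes, then reduce modulo the table size.
std::size_t sumHash(const std::string& key, std::size_t tableSize) {
    assert(tableSize > 0);
    std::size_t sum = 0;
    for (unsigned char c : key) {
        sum += c;  // each char contributes its integer code
    }
    return sum % tableSize;
}
```

With ASCII keys, "able" and "bale" both sum to 97 + 98 + 108 + 101 = 404, so they land in the same bucket for any table size. An 8-character ASCII key can sum to at most 8 * 127 = 1016, so in a table of 10007 cells most of the array would never be reached by short keys.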
Example: Converting Strings • Method 2 • Multiply individual chars by different values • Then sum • $a[0] \cdot 37^{n} + a[1] \cdot 37^{n-1} + \dots + a[n-1] \cdot 37^{1}$ • that is, $\sum_{i=0}^{n-1} a[i] \cdot 37^{n-i}$ • Advantages • produces big range of values • permutations hash to different values
Example: Converting Strings • Method 2 • Multiply individual chars by different values • Then sum • Disadvantages • relies on integer overflow • need to worry about negative hashes • Handling negative hash • hash = hash % TableSize • if(hash < 0) hash += TableSize
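Method 2 and the negative-hash fix can be sketched as follows. Two choices here are assumptions rather than something the slides prescribe: the sum is evaluated with Horner's rule, and the arithmetic is done in `std::uint32_t` so that overflow wraps around in a well-defined way (overflow of a signed `int` is undefined behavior in C++). The wrapped value is then viewed as a signed 32-bit integer, which is where a negative hash can appear. The names `wrapToTable` and `polyHash` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// The slide's fix: in C++, % keeps the sign of the left operand,
// so a negative hash gives a negative remainder that must be shifted up.
int wrapToTable(int hash, int tableSize) {
    assert(tableSize > 0);
    hash = hash % tableSize;
    if (hash < 0) hash += tableSize;
    return hash;
}

// Method 2 via Horner's rule: a[0]*37^n + a[1]*37^(n-1) + ... + a[n-1]*37.
int polyHash(const std::string& key, int tableSize) {
    std::uint32_t h = 0;
    for (unsigned char c : key) {
        h = (h + c) * 37u;  // unsigned arithmetic wraps modulo 2^32
    }
    // Viewing the bits as signed may yield a negative number for long keys.
    return wrapToTable(static_cast<std::int32_t>(h), tableSize);
}
```

Unlike the plain character sum, "able" and "bale" now receive different values because each position is weighted by a different power of 37. Keeping the hash unsigned throughout would avoid the negative case entirely, at the cost of departing from the slide's signed formulation.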
Hash Function Tradeoffs • Fast hash vs. evenly distributed hash • often faster leads to less evenly distributed • even distribution leads to slower • String example • could use only some of the characters • faster, but more collisions likely
Resolving Obvious Problems How do we make a good hash function? What should we do about collisions? How large should we make our hash table?
Handling collisions • What if two keys hash to the same bucket (array entry)? • Array entries are linked lists (or trees) • different keys with same hash value stored in same list (or tree) • commonly called chained bucket hashing, or just chaining
Handling Collisions Example • TableSize = 10 • keys: 10 digit student IDs • hashfn = sum of digits % TableSize
Example • [Figure: chained hash table with buckets 0-9; entries A, B, C, D, and E hang off the buckets as linked lists]
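The hash function from this example is small enough to write out. The sketch below assumes IDs fit in a 64-bit unsigned integer; the name `digitSumHash` and the sample IDs in the note are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// hashfn = sum of digits % TableSize
int digitSumHash(std::uint64_t id, int tableSize) {
    assert(tableSize > 0);
    int sum = 0;
    while (id > 0) {
        sum += static_cast<int>(id % 10);  // peel off the last digit
        id /= 10;
    }
    return sum % tableSize;
}
```

For instance, the hypothetical IDs 1234567890 and 9876543210 both have digit sum 45, so with TableSize = 10 both hash to bucket 5 and would be stored in the same list.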
Handling collisions • During a lookup • How can we tell which value we want if there are > 1 entries in the bucket? • Compare the keys • buckets store keys and values
Resolving Obvious Problems How do we make a good hash function? What should we do about collisions? How large should we make our hash table?
Hash Table Size • Related to load factor λ • ratio of items in hash table to TableSize • average length of bucket list is λ • Goal is to keep λ around 1
Hash Table Size • Related to hashing function • Some hashing functions lead to data clustered together • Using a prime TableSize helps resolve this issue • hash values are not likely to share a factor with a prime table size
Hash Table Size • If number of keys known in advance • make the hash table a little larger • prime near 1.25 * the number of keys • a little room to avoid collisions • trades space for potentially faster lookup • If number of keys not known in advance • plan to expand array as needed • coming up in another lecture
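One way to turn the sizing rule into code is sketched below. The slide says "prime near 1.25 * the number of keys"; picking the first prime at or above that value is an assumption, and the helper names are hypothetical. Trial division is used for simplicity since it runs once per table.

```cpp
#include <cassert>

// Trial division; adequate for a one-off table-size computation.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime that is >= n.
int nextPrime(int n) {
    while (!isPrime(n)) ++n;
    return n;
}

// First prime at or above 1.25 * expectedKeys (integer arithmetic).
int tableSizeFor(int expectedKeys) {
    assert(expectedKeys > 0);
    return nextPrime(expectedKeys + expectedKeys / 4);
}
```

For 8 expected keys this yields 11, giving a load factor of about 0.73 once all keys are inserted.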
Hash Table operations • Lookup Key k • compute h = hash(k) • see if k is in the list in hashtable[h] • Insert Key k • Compute h = hash(k) • Make sure k is not already in hashtable[h] • Add k to the list in hashtable[h] • Delete Key k • Compute h = hash(k) • Remove k from list in hashtable[h]
HashSet Class
template<class K, class Hash>
class HashSet {
private:
    vector< list<K> > table;
    int currentSize;
    Hash hashfn;
public:
    …
    bool contains(const K&) const;
    void insert(const K&);
    void remove(const K&);
};
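The three operations could be filled in roughly as follows. This is a sketch, not the course's implementation: the class is named `ChainedHashSet` to keep it distinct from the slide's `HashSet`, the default of `std::hash<K>`, the default table size of 101, and the `size()` accessor are all assumptions, and there is no resizing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

template <class K, class Hash = std::hash<K>>
class ChainedHashSet {
    std::vector<std::list<K>> table_;  // one linked list per bucket
    int currentSize_ = 0;
    Hash hashfn_;

    // h = hash(k), reduced to a valid bucket index.
    std::size_t bucketOf(const K& k) const { return hashfn_(k) % table_.size(); }

public:
    explicit ChainedHashSet(std::size_t tableSize = 101) : table_(tableSize) {
        assert(tableSize > 0);
    }

    // Lookup: scan only the list in bucket h, comparing keys with ==.
    bool contains(const K& k) const {
        const std::list<K>& bucket = table_[bucketOf(k)];
        return std::find(bucket.begin(), bucket.end(), k) != bucket.end();
    }

    // Insert: make sure k is not already present, then append to the list.
    void insert(const K& k) {
        if (contains(k)) return;
        table_[bucketOf(k)].push_back(k);
        ++currentSize_;
    }

    // Delete: remove k from the list in bucket h, if it is there.
    void remove(const K& k) {
        std::list<K>& bucket = table_[bucketOf(k)];
        auto it = std::find(bucket.begin(), bucket.end(), k);
        if (it != bucket.end()) {
            bucket.erase(it);
            --currentSize_;
        }
    }

    int size() const { return currentSize_; }
};
```

A set created with `ChainedHashSet<std::string> s(7);` accepts `s.insert("able")` and `s.insert("bale")` as two distinct elements, and a second `s.insert("able")` leaves the size unchanged. Every operation costs one hash computation plus a scan of a single bucket list, so the expected cost is proportional to the load factor.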
Alternative to chaining • Recall chaining hash tables • array cells stored linked lists • 2 keys with same hash end up in same list • Chaining hash tables • require 2 data structures • hash table and linked list • Can we solve collisions with more hashing? • use just one data structure
Probing Hash Tables • No linked list in array cells • Collisions handled using alternative hash • try cells $h_0(x)$, $h_1(x)$, $h_2(x)$, … • until an empty cell is found • $h_i(x) = \mathrm{hash}(x) + f(i)$ • f(i) is the collision resolution strategy • Probing • looking for alternative hash locations
Probing Hash Tables • All data goes directly into table • instead of into lists in the table • Need a bigger table • load factor λ ≈ 0.5 (half full) • More wasted space • Marginally less complexity
Linear probing • f(i) is a linear function • often f(i) = i • If a collision occurs, look in the next cell • hash(x) + 1 • keep looking until an empty cell is found • hash(x) + 2, hash(x) + 3, … • use modulus to wrap around table • Should eventually find an empty cell • if the table is not full
Linear probing example • Simple hash: h(x) = x % TableSize, with TableSize = 10 (cells 0-9)
Linear probing • Insert 89 • $h_0(x)$ = 9, cell is empty • 89 stored in cell 9
Linear probing • Insert 18 • $h_0(x)$ = 8, cell is empty • 18 stored in cell 8
Linear probing • Insert 49 • $h_0(x)$ = 9, collision with 89 • $h_1(x)$ = 0 (wraps around), cell is empty • 49 stored in cell 0
Linear probing • Insert 58 • $h_0(x)$ = 8, collision with 18 • $h_1(x)$ = 9, collision with 89 • $h_2(x)$ = 0, collision with 49 • $h_3(x)$ = 1, cell is empty • 58 stored in cell 1
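The insertion sequence 89, 18, 49, 58 can be reproduced with a short function. This sketch assumes non-negative `int` keys and uses an empty `std::optional` (C++17) to mark an empty cell; the name `linearProbeInsert` is hypothetical, and deletion is deliberately left out.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Stores key in the first free cell along the probe sequence.
// Returns the cell index used, or -1 if the table is full.
int linearProbeInsert(std::vector<std::optional<int>>& table, int key) {
    assert(!table.empty() && key >= 0);
    const int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        // h_i(x) = (hash(x) + f(i)) % TableSize with hash(x) = x % TableSize, f(i) = i
        const int cell = (key % n + i) % n;
        if (!table[cell].has_value()) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;  // every cell was probed and none was empty
}
```

On a table of 10 empty cells, inserting 89, 18, 49, and 58 in that order places them in cells 9, 8, 0, and 1. The run of occupied cells 8, 9, 0, 1 is a small instance of primary clustering: any new key hashing into that run must walk to its end.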
Linear probing • Advantages • no need for list • collision resolution function is fast • Disadvantages • requires more book keeping • primary clustering
Probing Extra Book Keeping • What if an entry is deleted and we try to look up another entry that collided with it? • Table holds 49 (cell 0), 58 (cell 1), 18 (cell 8), 89 (cell 9) • Delete 89 • $h_0(x)$ = 9, cell 9 is emptied • Lookup 49 • $h_0(x)$ = 9, cell is empty, so the search stops • Not Found, even though 49 is still in cell 0
Probing Extra Book Keeping • Need extra information per cell • Differentiate between states • ACTIVE: cell contains a valid key • EMPTY: cell never contained a valid key • DELETED: previously contained a valid key • All cells start EMPTY • Lookup • keep looking until you find key or EMPTY cell
Probing Extra Book Keeping • Cell states before the delete: 0 = A (49), 1 = A (58), 8 = A (18), 9 = A (89), all other cells E • Delete 89 • $h_0(x)$ = 9 • cell 9 changes from A to D • Lookup 49 • $h_0(x)$ = 9, cell is DELETED rather than EMPTY, so treat it as a collision and keep probing • $h_1(x)$ = 0, cell is ACTIVE and holds 49 • Found
Probing HashSet Class
template<class K>
class HashEntry {
public:
    enum EntryType { ACTIVE, EMPTY, DELETED };
private:
    K element;
    EntryType info;
    // HashSet is a template, so every instantiation is befriended
    template<class, class> friend class HashSet;
};

template<class K, class Hash>
class HashSet {
private:
    vector< HashEntry<K> > table;
    int currentSize;
    ...
};
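To show how the three states interact, here is a simplified, non-template sketch restricted to non-negative `int` keys with hash(x) = x % TableSize and linear probing. The class name `ProbingIntSet` and its members are hypothetical. One design choice is an assumption not stated in the slides: `insert` reuses the first DELETED cell it meets, which is safe here only because it first checks that the key is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class ProbingIntSet {
    enum EntryType { ACTIVE, EMPTY, DELETED };
    struct Entry {
        int element = 0;
        EntryType info = EMPTY;  // all cells start EMPTY
    };
    std::vector<Entry> table_;

    // i-th probe location for key: (hash(x) + i) % TableSize.
    int probe(int key, int i) const {
        const int n = static_cast<int>(table_.size());
        return (key % n + i) % n;
    }

    // Index of the ACTIVE cell holding key, or -1 if absent.
    int find(int key) const {
        const int n = static_cast<int>(table_.size());
        for (int i = 0; i < n; ++i) {
            const Entry& e = table_[probe(key, i)];
            if (e.info == EMPTY) return -1;  // key was never pushed past this cell
            if (e.info == ACTIVE && e.element == key) return probe(key, i);
            // DELETED, or ACTIVE with a different key: keep probing
        }
        return -1;
    }

public:
    explicit ProbingIntSet(int tableSize)
        : table_(static_cast<std::size_t>(tableSize)) {
        assert(tableSize > 0);
    }

    bool contains(int key) const { return find(key) >= 0; }

    // Returns false if key is already present or the table is full.
    bool insert(int key) {
        assert(key >= 0);
        if (contains(key)) return false;
        const int n = static_cast<int>(table_.size());
        for (int i = 0; i < n; ++i) {
            Entry& e = table_[probe(key, i)];
            if (e.info != ACTIVE) {  // EMPTY or DELETED cells can be reused
                e.element = key;
                e.info = ACTIVE;
                return true;
            }
        }
        return false;
    }

    // Lazy deletion: mark the cell instead of emptying it.
    void remove(int key) {
        const int cell = find(key);
        if (cell >= 0) table_[cell].info = DELETED;
    }
};
```

After inserting 89, 18, 49, and 58 into a table of size 10 and then removing 89, a lookup of 49 still succeeds: the probe passes over cell 9 because it is DELETED rather than EMPTY and finds 49 in cell 0.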
Linear Probing Hashing Recall • No more bucket lists • Use collision resolution strategy • $h_i(x) = \mathrm{hash}(x) + f(i)$ • If collision occurs, try the next cell • f(i) = i • repeat until you find an empty cell • Need extra book keeping • ACTIVE, EMPTY, DELETED