Hashing

Presentation Transcript


  1. Hashing • Joe Meehean

  2. Motivation • BST • easy to implement • average-case times O(LogN) • worst-case times O(N) • AVL Trees • harder to implement • worst case times O(LogN) • Can we do better in the average-case?

  3. Concept • “Dictionary” ADT • average-case time O(1) for lookup, insert, and delete • Idea • stores keys (and associated values) in an array • compute each key’s array index as a function of its value • take advantage of array’s fast random access • Alternative implementation for sets and maps

  4. Example • Goal • Store info about a company's 50 employees • Each employee has a unique employee ID • in range 100-200 • Approach • use an array of size 101 (range of IDs) • store employee E's info in array[E-100] • Result • insert, lookup, delete each O(1) • Wasted space: 51 of the 101 locations are unused
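
The direct-indexing idea on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation: the `Employee` record, the `EmployeeTable` class, and the constants `kMinId` and `kMaxId` are hypothetical names chosen for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical ID range, taken from the slide's example (100 to 200).
const int kMinId = 100;
const int kMaxId = 200;

// Hypothetical record type; the slides do not define one.
struct Employee {
    int id;
    std::string name;
};

// Direct-address table: the slot for an employee is simply id - kMinId.
class EmployeeTable {
public:
    EmployeeTable()
        : slots_(kMaxId - kMinId + 1), used_(kMaxId - kMinId + 1, false) {}

    // Returns false if the ID is outside the supported range.
    bool insert(const Employee& e) {
        if (!inRange(e.id)) return false;
        slots_[e.id - kMinId] = e;
        used_[e.id - kMinId] = true;
        return true;
    }

    // Returns nullptr if no employee with this ID is stored.
    const Employee* lookup(int id) const {
        if (!inRange(id) || !used_[id - kMinId]) return nullptr;
        return &slots_[id - kMinId];
    }

    void remove(int id) {
        if (inRange(id)) used_[id - kMinId] = false;
    }

private:
    static bool inRange(int id) { return id >= kMinId && id <= kMaxId; }

    std::vector<Employee> slots_;  // 101 slots, one per possible ID
    std::vector<bool> used_;       // marks which slots hold an employee
};
```

Each operation touches exactly one slot, which is where the O(1) bound comes from; the cost is that all 101 slots exist even though only 50 are ever used.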

  5. Drawbacks • Less functionality than trees • Hash tables cannot efficiently • find min • find max • print entire table in sorted order • Must be very careful how we use them

  6. Terminology • Hashtable • the underlying array • Hash function • function that converts a key to an index • in example: hash(x) = x – 100 • TableSize • size of underlying array or vector • Bucket • single cell of a hash table array • Collision • when two keys hash to the same bucket

  7. Assumptions • Keys we are using have a hash function • or we can define good hash functions for them • Keys overload the following operators • == • !=

  8. Resolving Obvious Problems How do we make a good hash function? What should we do about collisions? How large should we make our hash table?

  9. Hash Function Goals • Hash function should be fast • Keys should be evenly distributed • different keys should have different hash values • Should reduce space needed • e.g., student IDs are 10 digits • do not need an array size of 10,000,000,000 • there are only ~3,000 students

  10. Hash Functions Approach • Convert key to an int n • scramble up the data • ensure the data spreads over the entire integer space • Return n % TableSize • ensures that n doesn’t fall off the end of the table

  11. Example: Converting Strings • Method 1 • convert each char to an int • sum them • return sum % TableSize • Advantages • simple • time is O(key length)

  12. Example: Converting Strings • Method 1 • convert each char to an int • sum them • return sum % TableSize • Problems • short keys may not reach end of table • sum of characters < TableSize (by a lot) • maps all permutations to same hash • hash(“able”) = hash(“bale”) • Time is O(key length)
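
A minimal sketch of Method 1, to make the permutation problem concrete. The function name `hashSum` is an assumption of this example, and the table size is assumed to be positive.

```cpp
#include <string>

// Method 1: add up the character codes, then reduce modulo the table size.
// Assumes tableSize > 0.
int hashSum(const std::string& key, int tableSize) {
    unsigned long sum = 0;
    for (unsigned char c : key) {
        sum += c;  // each char contributes its integer code
    }
    return static_cast<int>(sum % tableSize);
}
```

Because addition is commutative, `hashSum("able", n)` and `hashSum("bale", n)` are equal for every `n`. With ASCII codes, a four-letter lowercase key sums to at most 488, so a table with thousands of buckets would leave most of them unreachable for short keys.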

  13. Example: Converting Strings • Method 2 • Multiply individual chars by different values • Then sum • $a[0] \cdot 37^{n} + a[1] \cdot 37^{n-1} + \dots + a[n-1] \cdot 37^{1}$ • that is, $\sum_{i=0}^{n-1} a[i] \cdot 37^{n-i}$ • Advantages • produces big range of values • permutations hash to different values

  14. Example: Converting Strings • Method 2 • Multiply individual chars by different values • Then sum • Disadvantages • relies on integer overflow • need to worry about negative hashes • Handling negative hash • hash = hash % TableSize • if(hash < 0) hash += TableSize
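
Method 2 might be sketched as below. Two things differ from the slides and are assumptions of this example: the sum is evaluated with Horner's rule (so every term carries one fewer factor of 37 than in the slide's formula, which does not change how well keys are spread), and the arithmetic is done in `unsigned int`, where overflow wraps around in a well-defined way in C++. The names `hashPoly` and `toIndex` are hypothetical. `toIndex` shows the slide's fix for negative hashes, which is needed when the hash is held in a signed `int`.

```cpp
#include <string>

// Polynomial string hash with multiplier 37, evaluated with Horner's rule.
// Unsigned arithmetic wraps on overflow instead of invoking undefined behavior.
unsigned int hashPoly(const std::string& key) {
    unsigned int h = 0;
    for (unsigned char c : key) {
        h = 37u * h + c;
    }
    return h;
}

// Maps a possibly negative signed hash into the range [0, tableSize).
// Assumes tableSize > 0.
int toIndex(int hash, int tableSize) {
    int index = hash % tableSize;  // may be negative when hash is negative
    if (index < 0) {
        index += tableSize;
    }
    return index;
}
```

Unlike the plain sum, this hash is position-sensitive: "able" and "bale" produce different values.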

  15. Hash Function Tradeoffs • Fast hash vs. evenly distributed hash • often faster leads to less evenly distributed • even distribution leads to slower • String example • could use only some of the characters • faster, but more collisions likely

  16. Resolving Obvious Problems How do we make a good hash function? What should we do about collisions? How large should we make our hash table?

  17. Handling collisions • What if two keys hash to the same bucket (array entry)? • Array entries are linked lists (or trees) • different keys with same hash value stored in same list (or tree) • commonly called chained bucket hashing, or just chaining

  18. Handling Collisions Example • TableSize = 10 • keys: 10 digit student IDs • hashfn = sum of digits % TableSize

  19. Example • [Figure: chained hash table with buckets 0 to 9 storing entries A, B, C, D, and E]

  20. Handling collisions • During a lookup • How can we tell which value we want if there are > 1 entries in the bucket? • Compare the keys • buckets store keys and values

  21. Resolving Obvious Problems How do we make a good hash function? What should we do about collisions? How large should we make our hash table?

  22. Hash Table Size • Related to load factor λ • ratio of items in hash table to TableSize • average length of bucket list is λ • Goal is to keep λ around 1

  23. Hash Table Size • Related to hashing function • Some hashing functions lead to data clustered together • Using a prime TableSize helps resolve this issue • hash values are less likely to share a factor with a prime table size

  24. Hash Table Size • If number of keys known in advance • make the hash table a little larger • prime near 1.25 * the number of keys • a little room to avoid collisions • trades space for potentially faster lookup • If number of keys not known in advance • plan to expand array as needed • coming up in another lecture
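
The sizing rule for a known number of keys can be sketched like this. The helper names (`isPrime`, `chooseTableSize`, `loadFactor`) are hypothetical, and "prime near 1.25 times the number of keys" is interpreted here as the smallest prime at or above that value; the slides do not say which nearby prime to pick.

```cpp
#include <cstddef>

// Trial-division primality test; adequate for table sizes.
bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime at or above 1.25 * numKeys (computed in integer arithmetic).
std::size_t chooseTableSize(std::size_t numKeys) {
    std::size_t candidate = numKeys + numKeys / 4;
    while (!isPrime(candidate)) {
        ++candidate;
    }
    return candidate;
}

// Load factor: items stored divided by TableSize. Assumes tableSize > 0.
double loadFactor(std::size_t numItems, std::size_t tableSize) {
    return static_cast<double>(numItems) / static_cast<double>(tableSize);
}
```

For the 50-employee example, `chooseTableSize(50)` gives 67, for a load factor of roughly 0.75.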

  25. Hash Table operations • Lookup Key k • compute h = hash(k) • see if k is in the list in hashtable[h] • Insert Key k • Compute h = hash(k) • Make sure k is not already in hashtable[h] • Add k to the list in hashtable[h] • Delete Key k • Compute h = hash(k) • Remove k from list in hashtable[h]

  26. HashSet Class • template<class K, class Hash> class HashSet{ private: vector< list<K> > table; int currentSize; Hash hashfn; public: … bool contains(const K&) const; void insert(const K&); void remove(const K&); };
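
One way the declared operations could be filled in, following the lookup, insert, and delete steps from the previous slide. This is a sketch, not the presentation's implementation: the default `std::hash<K>` functor, the constructor with a default table size of 101, and the `size()` and `indexOf()` helpers are assumptions added so the example is self-contained. The table size is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

template <class K, class Hash = std::hash<K> >
class HashSet {
public:
    explicit HashSet(std::size_t tableSize = 101)
        : table(tableSize), currentSize(0) {}

    // Lookup: hash to a bucket, then compare keys within that bucket's list.
    bool contains(const K& key) const {
        const std::list<K>& bucket = table[indexOf(key)];
        return std::find(bucket.begin(), bucket.end(), key) != bucket.end();
    }

    // Insert: add the key only if it is not already in its bucket.
    void insert(const K& key) {
        std::list<K>& bucket = table[indexOf(key)];
        if (std::find(bucket.begin(), bucket.end(), key) == bucket.end()) {
            bucket.push_back(key);
            ++currentSize;
        }
    }

    // Delete: remove the key from its bucket if present.
    void remove(const K& key) {
        std::list<K>& bucket = table[indexOf(key)];
        typename std::list<K>::iterator it =
            std::find(bucket.begin(), bucket.end(), key);
        if (it != bucket.end()) {
            bucket.erase(it);
            --currentSize;
        }
    }

    int size() const { return currentSize; }

private:
    // Reduces the hash value to a valid bucket index.
    std::size_t indexOf(const K& key) const {
        return hashfn(key) % table.size();
    }

    std::vector<std::list<K> > table;
    int currentSize;
    Hash hashfn;
};
```

Every operation scans only one bucket's list, so with a load factor near 1 the expected work per operation stays constant.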

  27. Questions?

  28. Alternative to chaining • Recall chaining hash tables • array cells stored linked lists • 2 keys with same hash end up in same list • Chaining hash tables • require 2 data structures • hash table and linked list • Can we solve collisions with more hashing? • use just one data structure

  29. Probing Hash Tables • No linked list in array cells • Collisions handled using alternative hash • try cells h0(x), h1(x), h2(x), … • until an empty cell is found • hi(x) = hash(x) + f(i) • f(i) is collision resolution strategy • Probing • looking for alternative hash locations

  30. Probing Hash Tables • All data goes directly into table • instead of into lists in the table • Need a bigger table • load factor λ ≈ 0.5 (half full) • More wasted space • Marginally less complexity

  31. Linear probing • f(i) is a linear function • often f(i) = i • If a collision occurs, look in the next cell • hash(x) + 1 • keep looking until an empty cell is found • hash(x) + 2, hash(x) + 3, … • use modulus to wrap around table • Should eventually find an empty cell • if the table is not full

  32. Linear probing • Insert 89 • Simple hash: h(x) = x % TableSize • [Figure: table with cells 0 to 9; h0(x) = 9 is empty, so 89 goes in cell 9]

  33. Linear probing • Insert 18 • Simple hash: h(x) = x % TableSize • [Figure: h0(x) = 8 is empty, so 18 goes in cell 8]

  34. Linear probing • Insert 49 • Simple hash: h(x) = x % TableSize • [Figure: h0(x) = 9 collides with 89]

  35. Linear probing • Insert 49 • Simple hash: h(x) = x % TableSize • [Figure: h1(x) = 0 is empty, so 49 goes in cell 0]

  36. Linear probing • Insert 58 • Simple hash: h(x) = x % TableSize • [Figure: h0(x) = 8 collides with 18]

  37. Linear probing • Insert 58 • Simple hash: h(x) = x % TableSize • [Figure: h1(x) = 9 collides with 89]

  38. Linear probing • Insert 58 • Simple hash: h(x) = x % TableSize • [Figure: h2(x) = 0 collides with 49]

  39. Linear probing • Insert 58 • Simple hash: h(x) = x % TableSize • [Figure: h3(x) = 1 is empty, so 58 goes in cell 1]
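
The insertions traced on slides 32 to 39 can be reproduced with a short sketch. It assumes non-negative integer keys, the simple hash h(x) = x % TableSize, and f(i) = i; the sentinel `kEmptyCell` and the function name `linearProbeInsert` are hypothetical, and duplicate keys are not checked.

```cpp
#include <vector>

// Hypothetical marker for an unused cell; assumes all keys are non-negative.
const int kEmptyCell = -1;

// Inserts key using linear probing, trying h_i(x) = (x % TableSize + i) % TableSize
// for i = 0, 1, 2, ... Returns the cell used, or -1 if the table is full.
int linearProbeInsert(std::vector<int>& table, int key) {
    const int tableSize = static_cast<int>(table.size());
    for (int i = 0; i < tableSize; ++i) {
        int cell = (key % tableSize + i) % tableSize;  // modulus wraps around
        if (table[cell] == kEmptyCell) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;  // every cell was occupied
}
```

Starting from a table of 10 empty cells, inserting 89, 18, 49, and 58 in that order places them in cells 9, 8, 0, and 1, matching the slides.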

  40. Linear probing • Advantages • no need for list • collision resolution function is fast • Disadvantages • requires more book keeping • primary clustering

  41. Probing Extra Book Keeping • Delete 89 • [Figure: table holding 49, 58, 18, and 89; h0(x) points to the cell holding 89] • What if an entry is deleted and we try to look up another entry that collided with it?

  42. Probing Extra Book Keeping • Lookup 49 • [Figure: table holding 49, 58, and 18; h0(x) points to the now-empty cell 9, so the lookup reports Not Found] • What if an entry is deleted and we try to look up another entry that collided with it?

  43. Probing Extra Book Keeping • Need extra information per cell • Differentiate between states • ACTIVE: cell contains a valid key • EMPTY: cell never contained a valid key • DELETED: previously contained a valid key • All cells start EMPTY • Lookup • keep looking until you find key or EMPTY cell

  44. Probing Extra Book Keeping • Delete 89 • [Figure: table holding 49, 58, 18, and 89; cells 0, 1, 8, and 9 are marked A (ACTIVE), all other cells E (EMPTY); h0(x) points to cell 9]

  45. Probing Extra Book Keeping • Delete 89 • [Figure: cell 9 changes from A to D (DELETED); all other cells are unchanged]

  46. Probing Extra Book Keeping • Lookup 49 • [Figure: h0(x) points to cell 9, marked D; this counts as a collision, so probing continues]

  47. Probing Extra Book Keeping • Lookup 49 • [Figure: h1(x) points to cell 0, marked A, where 49 is found]

  48. Probing HashSet Class • template<class K, class Hash> class HashSet{ private: vector<HashEntry> table; int currentSize; ... }; class HashEntry{ public: enum EntryType{ACTIVE, EMPTY, DELETED}; private: K element; EntryType info; friend class HashSet; };
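
A possible fleshed-out version of the probing set with the three cell states, assuming linear probing with f(i) = i. Several details are assumptions of this sketch rather than the presentation's design: the class is named `ProbingHashSet` to avoid clashing with the chained version, `HashEntry` is a template struct with public members instead of a friend class, `IdentityHash` is a hypothetical functor that makes integer keys hash to themselves, and the table is never resized. The table size is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

enum EntryType { ACTIVE, EMPTY, DELETED };

template <class K>
struct HashEntry {
    K element;
    EntryType info;
    HashEntry() : element(), info(EMPTY) {}  // all cells start EMPTY
};

// Hypothetical functor so that int keys hash to themselves, as on the slides.
struct IdentityHash {
    std::size_t operator()(int x) const { return static_cast<std::size_t>(x); }
};

template <class K, class Hash = std::hash<K> >
class ProbingHashSet {
public:
    explicit ProbingHashSet(std::size_t tableSize = 11)
        : table(tableSize), currentSize(0) {}

    bool contains(const K& key) const {
        return findActive(key) != table.size();
    }

    // Returns false if the key is already present or no free cell exists.
    bool insert(const K& key) {
        if (contains(key)) return false;
        for (std::size_t i = 0; i < table.size(); ++i) {
            HashEntry<K>& cell = table[probe(key, i)];
            if (cell.info != ACTIVE) {  // EMPTY and DELETED cells are reusable
                cell.element = key;
                cell.info = ACTIVE;
                ++currentSize;
                return true;
            }
        }
        return false;
    }

    // Lazy deletion: mark the cell DELETED so later lookups keep probing.
    void remove(const K& key) {
        std::size_t pos = findActive(key);
        if (pos != table.size()) {
            table[pos].info = DELETED;
            --currentSize;
        }
    }

private:
    // i-th probe location: h_i(x) = (hash(x) + i) % TableSize.
    std::size_t probe(const K& key, std::size_t i) const {
        return (hashfn(key) % table.size() + i) % table.size();
    }

    // Index of the ACTIVE cell holding key, or table.size() if absent.
    std::size_t findActive(const K& key) const {
        for (std::size_t i = 0; i < table.size(); ++i) {
            std::size_t pos = probe(key, i);
            const HashEntry<K>& cell = table[pos];
            if (cell.info == EMPTY) break;  // key was never stored past here
            if (cell.info == ACTIVE && cell.element == key) return pos;
        }
        return table.size();
    }

    std::vector<HashEntry<K> > table;
    int currentSize;
    Hash hashfn;
};
```

With `ProbingHashSet<int, IdentityHash> s(10)` and the keys 89, 18, 49, 58 inserted in that order, removing 89 leaves cell 9 marked DELETED, so a later lookup of 49 skips past it and still finds 49 in cell 0.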

  49. Questions?

  50. Linear Probing Hashing Recall • No more bucket lists • Use collision resolution strategy • hi(x) = hash(x) + f(i) • If collision occurs, try the next cell • f(i) = i • repeat until you find an empty cell • Need extra book keeping • ACTIVE, EMPTY, DELETED
