Data Structures & Algorithms Hash Tables

Data Structures & Algorithms Hash Tables Richard Newman based on slides by S. Sahni and book by R. Sedgewick

Dictionary • get(theKey) • find if an item with theKey is in dictionary • put(theKey, theElement) • add an item to the dictionary • remove(theKey) • delete (an item) with theKey from dictionary

a c e d b Unsorted Array • get(theKey) • O(N) time • put(theKey, theElement) • O(N) time to find duplicate, O(1) to add • remove(theKey) • O(N) time.

a b c d e Sorted Array • get(theKey) • O(lg N) time • put(theKey, theElement) • O(lg N) time to find duplicate, O(N) to add • remove(theKey) • O(N) time.
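
The sorted-array costs above can be made concrete with a small sketch. The class name `SortedArrayDict` and the choice to overwrite the element when a duplicate key is found are assumptions for illustration, not taken from the slides.

```python
import bisect


class SortedArrayDict:
    """Dictionary stored as parallel lists kept sorted by key (hypothetical class)."""

    def __init__(self):
        self.keys = []
        self.elements = []

    def _locate(self, key):
        # Binary search: O(lg N). Returns (index, found?).
        i = bisect.bisect_left(self.keys, key)
        return i, i < len(self.keys) and self.keys[i] == key

    def get(self, key):
        i, found = self._locate(key)
        return self.elements[i] if found else None

    def put(self, key, element):
        i, found = self._locate(key)
        if found:
            self.elements[i] = element        # duplicate key: replace element (assumption)
        else:
            self.keys.insert(i, key)          # O(N): shifts later entries right
            self.elements.insert(i, element)

    def remove(self, key):
        i, found = self._locate(key)
        if found:
            del self.keys[i]                  # O(N): shifts later entries left
            del self.elements[i]
        return found


d = SortedArrayDict()
for k in "acedb":
    d.put(k, k.upper())
```

After the loop, `d.keys` holds the keys in sorted order regardless of insertion order, which is what makes the binary search in `get` possible.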

firstNode null a c e d b Unsorted Chain • get(theKey) • O(N) time • put(theKey, theElement) • O(N) time to verify duplicate, O(1) to add • remove(theKey) • O(N) time.

firstNode null a b c d e Sorted Chain • get(theKey) • O(N) time • put(theKey, theElement) • O(N) time to verify duplicate, O(1) to add • remove(theKey) • O(N) time.

Costs of Insertion and Search N = number of items, M = size of container

Binary Search Trees k <=k >=k • get(theKey) • O(N) time worst case – O(lg N) time average • put(theKey, theElement) O(N) (wc) – O(lg N) time (avg), O(1) to add • remove(theKey) • O(N) time worst case – O(lg N) time average.

Other BSTs Randomized – recursively make new node root of subtree as search to insert with uniform prob. AVL Tree – rotate subtrees to maintain balance 2-3-4 Tree – adjust number of children so depth is maintained Red-Black Tree – rotate to maintain same number of black edges in all paths to leaves, with no two consecutive reds

Other BSTs Splay trees – good for non-uniform searching, or for searching that displays temporal correlations (i.e., search for something, likely to search for it again soon) ... skip lists – use linked list structure but with extra links to allow tree-like speeds but with simpler structures We will explore them now...

Skip Lists Skip lists are linked lists… except with extra links that allow the ADT to skip over large portions of a list at a time during search. Defn 13.5: A skip list is an ordered linked list where each node contains a variable number of links, with the ith links in the nodes implementing singly linked lists that skip nodes with fewer than i links.

Skip Lists A A C E E G H I L M N P R Linked list – search for N: how many steps? Ans: 11 links followed Skip list with jumps of 3 – search for N: steps? Ans: 6 More links can help – how many steps now? Ans: 4

Skip List Search Algo Sketch: Start at highest level Search linked list at that level If find item – great! If find end of list or larger key Then drop down a level and resume Until there are no more levels

Skip List Insert Algo Sketch: Find where new node should go Build node with j links Connect each previous node in j lists to new node Connect new node to successor (if any) in j lists But what should j be? Want about one node in every t^j to have at least j+1 links, so that each link skips about t nodes of the level below – use random fcn to decide
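
The search and insert sketches above might look roughly like this in code. This is a minimal sketch, not the book's implementation: the names `SkipNode`, `SkipList`, `max_levels`, and `seed` are assumptions, and duplicate keys are simply inserted next to each other.

```python
import random


class SkipNode:
    def __init__(self, key, levels):
        self.key = key
        self.next = [None] * levels      # next[i] is this node's link in list i


class SkipList:
    def __init__(self, t=2, max_levels=16, seed=None):
        self.t = t
        self.max_levels = max_levels     # cap on links per node (assumption)
        self.head = SkipNode(None, max_levels)   # sentinel before the first key
        self.levels = 1                          # number of lists currently in use
        self.rng = random.Random(seed)

    def _random_links(self):
        # Give a node at least j+1 links with probability 1/t^j.
        j = 1
        while j < self.max_levels and self.rng.random() < 1.0 / self.t:
            j += 1
        return j

    def search(self, key):
        node = self.head
        for level in range(self.levels - 1, -1, -1):     # start at highest level
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            nxt = node.next[level]
            if nxt is not None and nxt.key == key:
                return True                              # found the item
            # end of list or larger key: drop down a level and resume
        return False

    def insert(self, key):
        j = self._random_links()
        self.levels = max(self.levels, j)
        new = SkipNode(key, j)
        node = self.head
        for level in range(self.levels - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            if level < j:
                new.next[level] = node.next[level]   # connect new node to successor
                node.next[level] = new               # connect predecessor to new node


sl = SkipList(t=2, seed=1)
for k in "LANGCHIEMPEAR":
    sl.insert(k)
```

Whatever the random link counts turn out to be, the bottom-level list always contains every key in sorted order; the upper levels only affect speed.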

Skip List Time Prop 13.10: Search and insertion in a randomized skip list with parameter t take about (t log_t N)/2 = (t/(2 lg t)) lg N comparisons, on the average. We expect about log_t N levels, and that about t nodes were skipped on the previous level by each link, and we go through about half the links on each level.

Skip List Space Prop 13.11: Skip lists with parameter t have about (t/(t-1))N links on the average. There are N links on the bottom, N/t on the next level, about N/t^2 on the next, and so on. The total number of links is about N(1 + 1/t + 1/t^2 + …) = N/(1 – 1/t) = (t/(t-1))N

Skip List Tradeoff Picking the parameter t gives a time/space trade-off. When t = 2, skip lists need about lg N comparisons and 2N links on average, like the best BST types. Larger t gives longer search and insert times, but uses less space. The choice t = e (base of natural log) minimizes the expected number of comparisons (differentiate eq. in 13.10)
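
The differentiation mentioned on this slide can be worked as follows. This is a sketch that treats t as a continuous variable and writes C(t) for the comparison count of Prop 13.10; the symbol C is introduced here for illustration only.

```latex
C(t) = \frac{t}{2\lg t}\,\lg N = \frac{t}{2\ln t}\,\ln N,
\qquad
\frac{dC}{dt} = \frac{\ln N}{2}\cdot\frac{\ln t - 1}{(\ln t)^2} = 0
\;\Longrightarrow\; \ln t = 1
\;\Longrightarrow\; t = e .
```

The derivative is negative for t < e and positive for t > e, so t = e is a minimum. There C(e) = (e/2) ln N ≈ 0.94 lg N, only slightly better than the lg N obtained with t = 2, while the link count drops from 2N to (e/(e-1))N ≈ 1.58N.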

Skip List Other Functions Remove, join, and select functions are straight-forward extensions.

Hash Tables So skip lists can also give good performance, similar to the best tree structures. Can we do better? Key-indexed searching has great performance – O(1) time for almost everything But big constraints: • Array size = key range • No duplicate keys

Hash Tables Expected time for insert, find, remove is constant, but worst case is O(N) Idea is to squeeze big key space into smaller key space, so all keys fit into fairly small table Challenge is to avoid duplicate keys … … and to deal with them if they occur anyway So here they are ...

Ideal Hash Tables • Uses a 1D array (or table) table[0:b-1]. • Each position of this array is a bucket. • A bucket can normally hold only one dictionary pair. • Uses a hash function f that converts each key k into an index in the range [0, b-1]. • f(k) is the home bucket for key k. • Every dictionary pair (key, element) is stored in its home bucket table[f(key)].

Ideal Hash Tables [0] [1] [2] [3] [4] [5] [6] [7] Pairs are: (22,a),(33,c),(3,d),(73,e),(85,f). Hash table is table[0:7], b = 8. Hash function is key/11. Pairs are stored in table as below: 22/11=2, 33/11=3, 3/11=0, 73/11=6, 85/11=7 Everything is fast – constant time! What could possibly go wrong? (3,d) (22,a) (33,c) (73,e) (85,f)

What Could Go Wrong [0] [1] [2] [3] [4] [5] [6] [7] Where to put (26,g)? (22,a) is already in the 2 slot! Keys with the same home bucket are called synonyms. 22 and 26 are synonyms under /11 hash function (3,d) (22,a) (33,c) (73,e) (85,f) (26,g)
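
The ideal table and the collision above can be replayed in a few lines. This is only a sketch of the slide example: the names `b`, `table`, and `f` follow the slides, while `collides` is a made-up variable, and `key // 11` only yields a valid index for keys below 88.

```python
b = 8
table = [None] * b


def f(key):
    # Hash function from the example: integer division by 11.
    return key // 11


pairs = [(22, "a"), (33, "c"), (3, "d"), (73, "e"), (85, "f")]
for key, element in pairs:
    home = f(key)
    assert table[home] is None       # ideal case: every home bucket is free
    table[home] = (key, element)

# (26, "g") has home bucket 2, which already holds (22, "a"): a collision.
collides = table[f(26)] is not None
```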

What Could Go Wrong A collision occurs when two items with different keys have the same home bucket A bucket may be able to store more than one item... If bucket is full, then we have an overflow If buckets are of size 1, then overflows occur on every collision We must deal with these somehow!

Hash Table Issues What is size of table? Want it to be small for efficiency … big enough to reduce collisions What is hash function? Want it fast (so time const is small) … but “random” to avoid collisions How do we deal with overflows?

Hash Function First – convert to integer if not already Char to int, e.g. Repeat and combine for string Use imagination for other objects Next – reduce space of integers to table size Divide by some number (example /11) More often – take modulo table size
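
The two steps above (convert to an integer, then reduce to the table size) could be sketched like this. The function names and the radix 31 used to combine characters are assumptions chosen for illustration; the slides do not prescribe a particular way to combine characters.

```python
def string_to_int(s, radix=31):
    # Convert each char to an int and combine, treating the string
    # as a number written in base `radix` (an illustrative choice).
    h = 0
    for ch in s:
        h = h * radix + ord(ch)
    return h


def home_bucket(key, b):
    # Step 1: convert to an integer if not one already.
    if isinstance(key, str):
        key = string_to_int(key)
    # Step 2: reduce to the range [0, b-1] by taking the key modulo the table size.
    return key % b
```

For example, `home_bucket(45, 17)` is 11, the home bucket used for key 45 in the linear probing slides.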

Hash Function Let KeySpace be the set of all possible keys Could be unbounded (with distribution) Uniform Hash Function maps keys over all of KeySpace to buckets so that the number of keys per bucket is about the same for all buckets Equivalently: any random key maps to any given bucket with probability 1/b, where b is the number of buckets

Hash Function Uniform Hash Functions make collisions (hence overflows) unlikely when keys are randomly chosen For any table size b, if keys are drawn uniformly from the 32-bit integers, k % b distributes them very nearly uniformly over the buckets (exactly so when b divides 2^32) In practice, keys tend to be non-uniform and correlated So want hash function to help break up correlations What effect does modulus b have???

Selecting the Modulus The modulus is the table size If modulus b is even, Then even keys will always map to even buckets, Odd keys will always map to odd buckets Not good Bias in keys leads to bias in buckets!

Selecting the Modulus If modulus b is odd, Then even keys will map to even and odd buckets, Odd keys will map to odd and even buckets Odd/even bias in keys does NOT lead to bias in buckets! So pick odd b!
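
A quick experiment illustrates the even/odd effect. It is a sketch with made-up data (the even keys below 1000) and a hypothetical helper `bucket_counts`; it compares an even modulus (16) with an odd one (17).

```python
from collections import Counter


def bucket_counts(keys, b):
    # Count how many keys land in each bucket under key % b.
    return Counter(k % b for k in keys)


even_keys = range(0, 1000, 2)                 # a key set with a strong even bias
used_even_b = bucket_counts(even_keys, 16)    # even modulus
used_odd_b = bucket_counts(even_keys, 17)     # odd modulus

# With b = 16 only the even buckets are ever used; with b = 17 all are.
buckets_hit_even_b = sorted(used_even_b)
buckets_hit_odd_b = sorted(used_odd_b)
```

With the even modulus, half of the table stays empty no matter how many even keys arrive, so the occupied buckets carry twice the intended load.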

Selecting the Modulus Similar effects are seen with moduli that are multiples of small primes (3, 5, 7, ...) Effect diminishes as prime size grows Ideally, pick b to be a prime number! Or at least avoid any prime factors less than 20 in b For convenient resizing, may end up with just odd numbers (b → 2b + 1)

Table Size Typically want table of size about twice the number of entries Depends on how much space you are willing to “waste” on empty buckets Depends also on how expensive it is to deal with collisions/overflows Also, subject to avoiding bias using b Which also depends on the hash function itself (if it maps pretty randomly, then may not worry about bias)

Collisions and Overflows Can handle collisions by making bucket larger Allow it to hold multiple pairs Array Linked list (hash chain) Or, may allow probing into table on overflow Linear probing Quadratic probing Random probing
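
As a sketch of the first option (larger buckets), each bucket below is a Python list standing in for the hash chain. The class name `ChainedHashTable`, integer keys, and overwrite-on-duplicate are assumptions for illustration.

```python
class ChainedHashTable:
    def __init__(self, b=17):
        self.b = b
        self.buckets = [[] for _ in range(b)]    # one chain per bucket

    def _chain(self, key):
        return self.buckets[key % self.b]        # chain at the key's home bucket

    def put(self, key, element):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, element)        # duplicate key: replace element
                return
        chain.append((key, element))             # synonyms share the chain

    def get(self, key):
        for k, element in self._chain(key):
            if k == key:
                return element
        return None

    def remove(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


t = ChainedHashTable(17)
for key in [12, 29, 28, 11]:
    t.put(key, f"elem{key}")
```

Here 12 and 29 are synonyms (home bucket 12), as are 28 and 11 (home bucket 11); each pair ends up in one chain, so no key ever spills into another key's bucket.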

Linear Probing If collision, then overflow Walk through table until find empty bucket Place item in bucket To find, must not only look at bucket, but keep walking through table until … Find item, or … Find empty bucket Remove must take care to preserve linear probe search

Linear Probing Example modulus = b (number of buckets) = 17. Home bucket = key % 17. Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45: 6→6; 12→12; 34→0; 29→12, then 13; 28→11; 11→11, 12, 13, 14; 23→6, then 7; 7→7, 8; 0→0, 1; 33→16; 30→13, 14, 15; 45→11, 12, 13, 14, 15, 16, 0, 1, 2. Resulting table (bucket: key): 0: 34, 1: 0, 2: 45, 6: 6, 7: 23, 8: 7, 11: 28, 12: 12, 13: 29, 14: 11, 15: 30, 16: 33; buckets 3–5 and 9–10 are empty.

Linear Probing Example modulus = b (number of buckets) = 17. Home bucket = key % 17. Find pairs whose keys are 26, 18, 45 in the table (bucket: key) 0: 34, 1: 0, 2: 45, 6: 6, 7: 23, 8: 7, 11: 28, 12: 12, 13: 29, 14: 11, 15: 30, 16: 33. 26→9: empty, hence a miss. 18→1: filled, but key is not 18, so try 2; 2: filled, but key is not 18, so try 3; 3: empty, hence a miss. 45→11, 12, 13, 14, 15, 16, 0, 1: all filled, none holds 45; 2: filled, and key is 45 – found it!
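
The put and find walk-throughs above can be reproduced with a short sketch. The class name `LinearProbingTable`, the helper `_probe`, and the placeholder elements `elem<key>` are assumptions for illustration; keys are assumed to be non-negative integers.

```python
class LinearProbingTable:
    def __init__(self, b=17):
        self.b = b
        self.slots = [None] * b          # each slot is None or a (key, element) pair

    def _probe(self, key):
        """Walk from the home bucket; return the index holding key, the first
        empty index reached, or None if the table is full and key is absent."""
        i = key % self.b
        for _ in range(self.b):
            if self.slots[i] is None or self.slots[i][0] == key:
                return i
            i = (i + 1) % self.b         # wrap around the end of the table
        return None

    def put(self, key, element):
        i = self._probe(key)
        if i is None:
            raise RuntimeError("table is full")
        self.slots[i] = (key, element)

    def get(self, key):
        i = self._probe(key)
        if i is None or self.slots[i] is None:
            return None                  # reached an empty bucket: a miss
        return self.slots[i][1]


table = LinearProbingTable(17)
for key in [6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45]:
    table.put(key, f"elem{key}")
layout = [slot[0] if slot else None for slot in table.slots]
```

`layout` lists the key stored in each bucket (or `None`), so it can be compared directly with the table drawn in the example.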

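Linear Probing Example modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove pair whose key is 0 from the table (bucket: key) 0: 34, 1: 0, 2: 45, 6: 6, 7: 23, 8: 7, 11: 28, 12: 12, 13: 29, 14: 11, 15: 30, 16: 33. 0→0: filled, but key is not 0, so try 1. 1: filled, and key is 0, so delete. But now a search for 45 would find the hole – a miss! Search rest of cluster for replacement. 2: key is 45, home bucket 11, which is at or before the hole at 1 in 45's probe sequence (11, 12, …, 16, 0, 1, 2), so move 45 to replace the 0 item. 3: empty – done.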

Linear Probing Example modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove pair whose key is 29 from the table (bucket: key) 0: 34, 1: 0, 2: 45, 6: 6, 7: 23, 8: 7, 11: 28, 12: 12, 13: 29, 14: 11, 15: 30, 16: 33. 29→12: filled, but key is not 29, so try 13. 13: filled, and key is 29, so delete. But now a search for 11 would find the hole – a miss! Search rest of cluster for replacement. 14: key is 11, home bucket 11, which comes before the hole at 13, so move 11 to replace the 29 item.

Linear Probing Example modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove pair whose key is 29 (continued). Table now (bucket: key): 0: 34, 1: 0, 2: 45, 6: 6, 7: 23, 8: 7, 11: 28, 12: 12, 13: 11, 14: (hole), 15: 30, 16: 33. Can we stop? No – continue to search cluster. 15: key is 30→13, so shift left into the hole at 14 and continue. 16: key is 33→16, what do we do? We can't shift it. Are we done?

Linear Probing Example modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove pair whose key is 29 (continued). The hole is now at 15. No – continue to search cluster, wrapping around to bucket 0. 0: key is 34→0, can't shift; continue. 1: key is 0→0, what do we do? Can't shift it past its home bucket 0, so it stays. Are we done? Not yet – still non-empty buckets. 2: key is 45→11, so shift it into the hole at 15! Are we done? 3: empty – done!
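
The removal walk-through above can be checked with a sketch that stores bare keys in a list. The function names and the cyclic-distance test are one way to express the "can this item move into the hole?" rule, not necessarily how the slides' author coded it; the sketch assumes non-negative integer keys, no duplicates, and at least one empty bucket.

```python
B = 17  # number of buckets, as in the example


def insert(slots, key):
    # Assumes the table is not full and key is not already present.
    i = key % B
    while slots[i] is not None:
        i = (i + 1) % B
    slots[i] = key


def remove(slots, key):
    # Assumes at least one empty bucket, so both loops terminate.
    i = key % B
    while slots[i] is not None and slots[i] != key:
        i = (i + 1) % B
    if slots[i] is None:
        return False                     # miss: nothing to remove
    slots[i] = None
    hole = i
    j = (hole + 1) % B
    while slots[j] is not None:          # scan the rest of the cluster
        home = slots[j] % B
        # The item at j may move into the hole only if its home bucket is
        # not cyclically inside (hole, j]; otherwise a search starting at
        # home would no longer reach it.
        if (j - home) % B >= (j - hole) % B:
            slots[hole] = slots[j]
            slots[j] = None
            hole = j
        j = (j + 1) % B
    return True


def build():
    slots = [None] * B
    for key in [6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45]:
        insert(slots, key)
    return slots


after_29 = build()
remove(after_29, 29)

after_0 = build()
remove(after_0, 0)
```

`after_29` ends with 11, 30, 45 in buckets 13, 14, 15 and bucket 2 empty, matching the shifts in the walk-through; `after_0` has 45 moved into bucket 1.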

Linear Probing Performance Worst case for insert/find/remove: Θ(N) where N is number of items When does this happen? All items in one cluster, e.g., all with the same home bucket! Observations: insertion of a key with one hash value can make search for a key with a different hash value take much longer!!! Clustering!!!

Linear Probing Expected Performance (large N) Loading density α = #items / #buckets α = 12/17 in example S_N = # buckets examined on hit U_N = # buckets examined on miss Insertion and removal governed by U_N

Linear Probing Loading density α = #items / #buckets α = 12/17 in example α < .75 recommended S_N ≈ (½)(1 + 1/(1-α)) U_N ≈ (½)(1 + 1/(1-α)^2)
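
To get a feel for the two approximations on this slide, they can be evaluated directly. The function names are made up for this sketch, `alpha` stands for the loading density, and the formulas are the large-N estimates, so the numbers are only rough for a 17-bucket table.

```python
def expected_probes_hit(alpha):
    # S_N: about (1/2)(1 + 1/(1 - alpha)) buckets examined on a hit
    return 0.5 * (1 + 1 / (1 - alpha))


def expected_probes_miss(alpha):
    # U_N: about (1/2)(1 + 1/(1 - alpha)^2) buckets examined on a miss
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


alpha = 12 / 17                          # loading density of the example table
s_example = expected_probes_hit(alpha)   # about 2.2
u_example = expected_probes_miss(alpha)  # about 6.28
```

At the recommended limit of 0.75 the same formulas give 2.5 buckets on a hit and 8.5 on a miss, which shows how quickly misses become expensive as the table fills.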

Linear Probing Design Suppose you want at most 10 compares on a hit, And at most 13 compares on a miss What is the most your load density should be? S_N ≈ (½)(1 + 1/(1-α)) <= 10 U_N ≈ (½)(1 + 1/(1-α)^2) <= 13 Work it out. Left half do hits, right do misses.

Linear Probing Design S_N ≈ (½)(1 + 1/(1-α)) <= 10 • 1/(1-α) <= 19 • 1/19 <= (1-α) • α <= 18/19 U_N ≈ (½)(1 + 1/(1-α)^2) <= 13 • 1/(1-α)^2 <= 25 • 1/(1-α) <= 5 • α <= 4/5 Take smaller of two, so α <= 4/5
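
The same algebra can be packaged as a small calculator for other probe budgets. This is a sketch: the function names are assumptions, and it simply inverts the two approximations for a budget greater than 1.

```python
import math


def max_load_for_hits(max_compares):
    # (1/2)(1 + 1/(1 - a)) <= s   =>   a <= 1 - 1/(2s - 1)
    return 1 - 1 / (2 * max_compares - 1)


def max_load_for_misses(max_compares):
    # (1/2)(1 + 1/(1 - a)^2) <= u   =>   a <= 1 - 1/sqrt(2u - 1)
    return 1 - 1 / math.sqrt(2 * max_compares - 1)


def max_load(hit_budget, miss_budget):
    # Both constraints must hold, so take the smaller bound.
    return min(max_load_for_hits(hit_budget), max_load_for_misses(miss_budget))


limit = max_load(10, 13)   # 18/19 for hits, 4/5 for misses, so about 0.8
```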

Linear Probing Design Suppose you want at most 10 compares on a hit, And at most 13 compares on a miss Your load density should be <= 4/5 So if you know there will be at most 1000 entries, design table of size..... b = 1000 * 5/4 b = 1250, ... but maybe better choice... Might pick 1259 as the smallest b >= 1250 that has no prime factors < 20
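
The size calculation above is easy to automate. In this sketch the function names and the `load_num`/`load_den` parameters (the load limit written as a fraction) are assumptions; the rule itself, stepping up from the minimum size until no prime below 20 divides b, is the one on the slide.

```python
SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19)


def min_buckets(max_entries, load_num=4, load_den=5):
    # Smallest b with max_entries / b <= load_num / load_den (integer ceiling).
    return -(-max_entries * load_den // load_num)


def pick_table_size(max_entries, load_num=4, load_den=5):
    b = min_buckets(max_entries, load_num, load_den)
    # Step upward until no prime below 20 divides b
    # (a tiny b that is itself one of those primes is accepted).
    while any(b % p == 0 and b != p for p in SMALL_PRIMES):
        b += 1
    return b


size = pick_table_size(1000)   # minimum is 1250; the search stops at 1259
```

The candidates 1250 through 1258 are all divisible by 2, 3, 5, or 7, which is why the search lands on 1259 (which also happens to be prime).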

Linear Probing Design Suppose you want at most 10 compares on a hit, And at most 13 compares on a miss Your load density should be <= 4/5 If you don't know how many entries there will be – then what? Start out with some “reasonable” size, And “double” table if load > 4/5 Easy to monitor load....
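
One way the grow-on-demand idea might look for a linear probing table is sketched below. The class name `GrowingTable`, the starting size 17, and the growth rule b → 2b + 1 (borrowed from the modulus slides) are assumptions for illustration; keys are assumed to be non-negative integers.

```python
class GrowingTable:
    def __init__(self, b=17):
        self.b = b                       # number of buckets
        self.n = 0                       # number of items, to monitor the load
        self.slots = [None] * b

    def _find_slot(self, key):
        # Terminates because the load is kept at or below 4/5,
        # so there is always an empty bucket.
        i = key % self.b
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.b
        return i

    def put(self, key, element):
        i = self._find_slot(key)
        if self.slots[i] is None:
            self.n += 1
        self.slots[i] = (key, element)
        if 5 * self.n > 4 * self.b:      # load > 4/5, in integer arithmetic
            self._grow()

    def get(self, key):
        slot = self.slots[self._find_slot(key)]
        return None if slot is None else slot[1]

    def _grow(self):
        old = [slot for slot in self.slots if slot is not None]
        self.b = 2 * self.b + 1          # "double", keeping b odd
        self.slots = [None] * self.b
        for key, element in old:         # every item gets a new home bucket
            self.slots[self._find_slot(key)] = (key, element)


t = GrowingTable(17)
for k in range(0, 1000, 10):             # 100 keys
    t.put(k, k * k)
```

Each growth step rehashes every item, which costs O(N), but because the table roughly doubles each time, that cost is spread thinly over the insertions that preceded it.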
