Lecture 12: Collisions

CSC 213 – Large Scale Programming Lecture 12: Collisions

Today’s Goal • Review when & why we need search ADTs • Why Sequence-based approach causes problems • How hash can help solve these problems • What is inappropriate and incorrect about hash jokes • Discover hash’s problems & what must be done • What would happen if keys hashed to same index • Ways of handling situation so that hash still works • To remove data, using null may not be best option • Dark secrets of hashing, exposed at lecture’s end

Map Performance • In many situations can be matter of life-or-death • 911 operators immediately need addresses • Google’s search performance in TB/s • O(log n) time too slow for these uses • Would love to use arrays • Convert key to int with hash function • With result of hash, have index in table to examine • put, remove & get take only O(1) time
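The "convert key to int, then use it as an index" step can be sketched as below. This is a minimal illustration, not code from the lecture: the class name HashIndex and the method indexFor are hypothetical, and it assumes the key's own hashCode() is the hash function.

```java
// Hypothetical helper (not from the lecture): maps any key to a table index.
class HashIndex {
    // hashCode() may be negative, so floorMod keeps the result in [0, capacity).
    static int indexFor(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // An Integer's hashCode() is its own value, so this is 44 mod 13.
        System.out.println(indexFor(44, 13));
        // A negative hash still lands inside the table.
        System.out.println(indexFor(-7, 13));
    }
}
```

Once the index is known, put, remove & get each look at one array slot, which is where the O(1) time comes from when no other key shares that slot.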

Hash Table • Array locations either: • null • Reference to Entry • Marker value* • Table will contain gaps • Better when spread out • Hash key to index • Always start with hash • After hash, move to array

Ideal World • key hashed to unique index • Hash and done, Entry is there And then… You wake up

Collisions • Occurs when 2 keys hash to same index • Ideal hash spreads keys out evenly across table • As nice side effect, this limits collisions • Small table size important also, since RAM limited • Unfortunately, no such thing as ideal hash • Must handle collisions to get O(1) efficiency buzz

Bad Hash • Perfect hash does not exist • Cannot know all keys beforehand • Keys may cluster around a few indices • Or find all keys hashed to same index • Handling bad hash is a necessity • Even given Entry, always check key • Store multiple Entrys with same hash • (Shot of adrenaline restarts heart)

Bucket Arrays • Make hash table an array of linked list Nodes • First node aliased by the array location • Whenever we have collision, we “chain” Entrys • Create new Node to store the Entry • The linked list will have new Node at its front

Bucket Arrays • But what if have really bad hash? • Hashes to same index in every situation • All Entrys now found in single linked list • O(n) execution times would now be required • (Also get bad case of the munchies)
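A bucket array with chaining might look like the sketch below. It is an illustration under assumptions, not the course's implementation: keys are plain ints, values are Strings, and the names ChainedTable and Node are hypothetical.

```java
// Minimal sketch of a bucket array (chaining) with int keys and String values.
class ChainedTable {
    // Hypothetical node type: one key/value pair plus a link to the next node.
    private static class Node {
        final int key;
        String value;
        Node next;

        Node(int key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedTable(int capacity) {
        buckets = new Node[capacity];
    }

    private int hash(int key) {
        return Math.floorMod(key, buckets.length);
    }

    // Returns the previous value for key, or null if the key is new.
    String put(int key, String value) {
        int i = hash(key);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.key == key) {
                String old = n.value;
                n.value = value;
                return old;
            }
        }
        buckets[i] = new Node(key, value, buckets[i]); // new Node goes at the front
        return null;
    }

    String get(int key) {
        for (Node n = buckets[hash(key)]; n != null; n = n.next) {
            if (n.key == key) { // same index is not enough: always check the key
                return n.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(13);
        table.put(18, "first");  // 18 mod 13 = 5
        table.put(31, "second"); // 31 mod 13 = 5, chained in the same bucket
        System.out.println(table.get(18) + " " + table.get(31));
    }
}
```

With a really bad hash every key walks the same list, so the loops in put and get visit all n nodes.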

Collisions • Normally, table holds one Entry per index • Need to be smarter when keys collide • Efficiency matters • If we do not care, use Sequence-based approach • Several common schemes used to provide speed • Each of these schemes has strengths & weaknesses • Silver bullets do not exist in CSC, must balance needs • If all-powerful answers desired, try Religious Studies

Linear Probing • Musical chairs uses this algorithm • At index where key hashed, examine Entry • Circle through array until empty index found • Algorithm is very simple • But creates clusters of Entrys

Linear Probe Example • h(x) = x mod 13 • Now add: • 44, h(44) = 5 • 20, h(20) = 7 • 22, h(22) = 9 • 31, h(31) = 5 • [Figure: hash table with indices 0–12 holding 15, 18, 32, 76 and the added keys 44, 20, 22, 31]
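The linear probe example can be traced with the sketch below. The slot positions in the original figure did not survive, so the code ASSUMES that 15, 18, 32 and 76 start at their hashed indices (2, 5, 6 and 11); the class and method names are hypothetical.

```java
// Sketch only: linear probing over Integer keys, where null means an empty slot.
class LinearProbeDemo {
    // Places key in the first empty slot at or after its hashed index.
    // Returns the index used, or -1 if every slot is taken.
    static int add(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n); // h(x) = x mod 13 when n is 13
        for (int j = 0; j < n; j++) {
            int i = (home + j) % n;       // circle through the array
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
        }
        return -1;
    }

    // ASSUMPTION: before the adds, 15, 18, 32 and 76 sit at their hashed indices.
    static Integer[] startingTable() {
        Integer[] table = new Integer[13];
        for (int key : new int[] {15, 18, 32, 76}) {
            add(table, key);
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] table = startingTable();
        for (int key : new int[] {44, 20, 22, 31}) {
            System.out.println(key + " -> index " + add(table, key));
        }
    }
}
```

Under that assumption 44 lands at 7, 20 at 8, 22 at 9 and 31 at 10. The run of filled slots from 5 through 11 is the cluster that linear probing tends to build.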

Probing Reaction • Oh, **** • Adding to hash table still O(n)

Quadratic Probe • Avoids primary clustering problems • But does create secondary clustering (no one cares) • Quadratic probe still simple (like linear probe) • Examine Entry at index k, hashed value of key • Check (k + j^2) % length: k+1, k+4, k+9, k+16, … • Continue probing until unused array slot found • Guaranteed to find an unused slot when: • Table size is prime number (needed for probes to get around the table) • Under 50% full so many open slots exist

Quadratic Probe Example • h(x) = x mod 13 • Now add: • 44, h(44) = 5 • 20, h(20) = 7 • 22, h(22) = 9 • 31, h(31) = 5 • [Figure: hash table with indices 0–12 holding 15, 18, 32, 76 and the added keys 44, 20, 22, 31]
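The quadratic probe example can be traced the same way. As before, this is a sketch that ASSUMES 15, 18, 32 and 76 start at their hashed indices (2, 5, 6 and 11), since the figure's layout was lost; the names are hypothetical.

```java
// Sketch only: quadratic probing over Integer keys, where null means an empty slot.
class QuadraticProbeDemo {
    // Tries (k + j^2) % length for j = 0, 1, 2, ... where k is the hashed index.
    // Returns the index used, or -1 if no probed slot was empty.
    static int add(Integer[] table, int key) {
        int n = table.length;
        int home = Math.floorMod(key, n);
        for (int j = 0; j < n; j++) {
            int i = (home + j * j) % n;
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
        }
        return -1;
    }

    // ASSUMPTION: before the adds, 15, 18, 32 and 76 sit at their hashed indices.
    static Integer[] startingTable() {
        Integer[] table = new Integer[13];
        for (int key : new int[] {15, 18, 32, 76}) {
            table[key % 13] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] table = startingTable();
        for (int key : new int[] {44, 20, 22, 31}) {
            System.out.println(key + " -> index " + add(table, key));
        }
    }
}
```

Under that assumption 44 probes 5, 6, 9 and lands at 9; 20 lands at 7; 22 probes 9, 10 and lands at 10; 31 probes 5, 6, 9, then (5 + 9) % 13 = 1 and lands at 1.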

Quadratic Probing Reaction • Darn it to heck. • Adding to hash table still O(n)

Double Hashing • Solve bad hash with even more hash • Use 2nd hash function very different from first • 2nd hash function not allowed to return zero • Re-hash key using 2nd function after the collision • Check index equal to sum of two hash functions (% length) • Re-add 2nd hash to this sum to continue probing • Guaranteed to work when: • Table size is prime number (probes still must get around the table)

Double Hash Example • h(x) = x mod 13 • h2(x) = 5 - (x mod 5) • Now add: • 44, h(44) = 5 • 20, h(20) = 7 • 22, h(22) = 9 • 31, h(31) = 5 • [Figure: hash table with indices 0–12 holding 15, 18, 32, 76 and the added keys 44, 20, 22, 31]
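The double hash example can be traced with the sketch below. It ASSUMES nonnegative keys and that 15, 18, 32 and 76 start at their hashed indices (2, 5, 6 and 11), because the figure's layout was lost; the class and method names are hypothetical.

```java
// Sketch only: double hashing over nonnegative Integer keys, null = empty slot.
class DoubleHashDemo {
    // Second hash from the example: h2(x) = 5 - (x mod 5), always in 1..5, never zero.
    static int h2(int key) {
        return 5 - (key % 5);
    }

    // Tries (h(key) + j * h2(key)) % length for j = 0, 1, 2, ...
    // Returns the index used, or -1 if no probed slot was empty.
    static int add(Integer[] table, int key) {
        int n = table.length;
        int home = key % n;      // h(x) = x mod 13 when n is 13
        int stride = h2(key);
        for (int j = 0; j < n; j++) {
            int i = (home + j * stride) % n;
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
        }
        return -1;
    }

    // ASSUMPTION: before the adds, 15, 18, 32 and 76 sit at their hashed indices.
    static Integer[] startingTable() {
        Integer[] table = new Integer[13];
        for (int key : new int[] {15, 18, 32, 76}) {
            table[key % 13] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] table = startingTable();
        for (int key : new int[] {44, 20, 22, 31}) {
            System.out.println(key + " -> index " + add(table, key));
        }
    }
}
```

Under that assumption 44 (stride 1) lands at 7, 20 (stride 5) probes 7, 12 and lands at 12, 22 lands at 9, and 31 (stride 4) probes 5, 9, then 13 % 13 = 0 and lands at 0. Keys that collide at index 5 follow different strides, which is what breaks up the clusters.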

Double Probing Reaction • Sweet! • Double hashing keeps put O(n)

Probing and Searching • Search index where key hashed • If cannot place Entry at index • The array must keep being probed • Stop only at usable index • May need to probe every index! • Searching takes O(n) even with hash • May need to reallocate & rehash table • Worst case O(n) put even with perfect hash

Post-Removal Operations • What happens when we remove an Entry? • Set index to null in most structures • Consider if we call remove(44) • [Figure: hash table with indices 0–12 holding 15, 18, 44, 20, 32, 76, 22, 31]

Post-Removal Operations • What happens when we remove an Entry? • Set index to null in most structures • Consider if we call remove(44) • get(31) called, what would happen? • First check index it is hashed to • Checks first probe index… & stops at null • Search ends before reaching 31, even though 31 is still in the table • [Figure: hash table with indices 0–12 holding 15, 18, 20, 32, 76, 22, 31; the slot that held 44 is now null]

*Marker Value Explained • Mark cleared indices in hash table • Since collision could have happened, continue search • Index can be used to store new Entry • Ways to show that array index is clear • Entry with null key could be used if one is careful • Could try and make key which is never used • Use static final field of type Entry
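The static final marker approach might be sketched as follows. This is an illustration under assumptions rather than the course's code: it uses linear probing, stores bare int keys, and the names TombstoneTable, DELETED and find are hypothetical.

```java
// Sketch only: linear probing with a marker value left behind by remove().
class TombstoneTable {
    static final class Entry {
        final int key;

        Entry(int key) {
            this.key = key;
        }
    }

    // Marker for cleared slots. Compared by reference (==), so its key is never read.
    private static final Entry DELETED = new Entry(0);

    private final Entry[] slots;

    TombstoneTable(int capacity) {
        slots = new Entry[capacity];
    }

    private int home(int key) {
        return Math.floorMod(key, slots.length);
    }

    // Index holding key, or -1. Skips markers and stops only at a true null.
    private int find(int key) {
        for (int j = 0; j < slots.length; j++) {
            int i = (home(key) + j) % slots.length;
            if (slots[i] == null) {
                return -1;
            }
            if (slots[i] != DELETED && slots[i].key == key) {
                return i;
            }
        }
        return -1;
    }

    boolean contains(int key) {
        return find(key) >= 0;
    }

    // Stores key in the first null or marked slot; false if present or table full.
    boolean add(int key) {
        if (contains(key)) {
            return false;
        }
        for (int j = 0; j < slots.length; j++) {
            int i = (home(key) + j) % slots.length;
            if (slots[i] == null || slots[i] == DELETED) {
                slots[i] = new Entry(key);
                return true;
            }
        }
        return false;
    }

    // Leaves the marker instead of null so later searches keep probing past it.
    boolean remove(int key) {
        int i = find(key);
        if (i < 0) {
            return false;
        }
        slots[i] = DELETED;
        return true;
    }

    public static void main(String[] args) {
        TombstoneTable table = new TombstoneTable(13);
        for (int key : new int[] {15, 18, 32, 76, 44, 20, 22, 31}) {
            table.add(key);
        }
        table.remove(44);
        System.out.println(table.contains(31)); // search probes past the marker
    }
}
```

Setting the slot to null in remove would cut the probe path short; the marker keeps the search going while still letting add reuse the slot.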

Why Use Hash Table & Probes? • Hash tables can require O(n) complexity • Provide O(1) time if you are really good • Ultimately depends on hash function used • Choose wisely and be rich

Before Next Lecture… • Get updated lab project into SVN directory • No need to e-mail, I will collect directories at 5PM • Finish working on week #4 assignment • Due at usual time tomorrow afternoon/evening • Start thinking of your design for the project • Due Tuesday a preliminary design & javadoc • Review sections for Map & Dictionary Quiz • ADTs, hash, probing and other ideas covered • Initially work on your own, groups get harder questions
