620 likes | 745 Views
Hashing. Amihood Amir Bar Ilan University 2014. Direct Addressing. In old days: LD 1,1 LD 2,2 AD 1,2 ST 1,3 Today: C <- A+B. Direct Addressing. Compiler keeps track of: C <- A+B A in location 1 B in location 2 C in location 3 How?
E N D
Hashing Amihood Amir Bar Ilan University 2014
Direct Addressing In old days: LD 1,1 LD 2,2 AD 1,2 ST 1,3 Today: C <- A+B
Direct Addressing Compiler keeps track of: C <- A+B A in location 1 B in location 2 C in location 3 How? Tables? Linked lists? Skip lists? AVL trees? B-trees?
Encode? Consider ASCII code of alphabet: (American Standard Code for Information Interchange)
Encode In ASCII code: A to Z are 1000001 to 1011010
Encode In ASCII code: A to Z are 1000001 to 1011010 In decimal: 65 to 90. So if we subtract 64 from the ASCII We will get locations 1 to 26.
Map In general: Consider a function h: U {0,…,m}. Where U is the universe of keys, m is number of keys. When we want to access the record of key kєU just go to location h(k) and it will point to the location of the record.
Problems: • What happens if keys are not consecutive? E.g. parameters: pointer, variable, idnumber • What happens if not all keys are used? E.g. ID numbers of students: 211101793, 31264992, 001684421 3. What happens if h(k) is not bijective?
Hash Tables: If U is larger than m, h(k) can not be bijective, thus we may encounter collisions: k1≠k2 where h(k1)= h(k2) What do we do then? Solution 1:chaining.
Chaining Chaining puts elements that hash to the same slot in a linked list: T —— U(universe of keys) k1 k4 —— —— k1 —— k4 K(actualkeys) k5 —— k7 k5 k2 k7 —— —— k3 k2 k3 —— k8 k6 k8 k6 —— ——
Hash operations: CHAINED-HASH-INSERT(T,x) Insert x at end of list T[h(key[x])] CHAINED-HASH-SEARCH(T,k) Search for an element with key k in list T[h(k)] CHAINED-HASH-DELETE(T,x) Delete x from list T[h(key[x])]
Time: Depends on chains. We assume:simple uniform hashing. h hashes into any slot with equal uniform likelihood.
Load Factor: We will analyze average search time. This depends on how large is the hash table relative to the number of keys. Let n= number of keys, m= hash table size n/m is the load factor. Write n/m =α.
Average Complexity of Search Notation: length of list T[j] is nj. Σ nj = n. Average value of nj=E(nj)=n/m =α Assumptions: h(k) computed in constant time. h(k) simple uniform hash function.
Searching for new key Compute h(k) in time O(1). Get to list T[h(k)] (uniformly distributed) whose length is nj. Need to search list T[h(k)]. Time for search: nj. E(nj)=α. Conclude:Average time to search for new key is Θ(1+α). What is worst case time?
Searching for Existing Key Do we gain anything by searching for a key that is already in list? Note: Expected number of elements in successful search for k = Expected length of listT[h(k)] when k was inserted + 1. (since we insert elements at end)
Searching for Existing Key Let k1, … , kn be the keys in order of their insertion. When ki is inserted, the expected list length is (i-1)/m. So the expected length of a successful search is the average of all such lists:
Calculate: = = =
Conclude: Average time for searching in a hash table with chaining is Θ(1+α) Note: Generally m is chosen as O(n) Thus the average operation on a hash table with chaining takes constant time.
Choosing a Hash Function Assume keys are natural numbers: {0, 1, 2, … } A natural mapping of natural numbers to {0, 1, … , m} is by dividing by m+1 and taking the remainder, i.e. h(k) = k mod (m+1)
The Division Method What are good m’s? In ASCII example: A to Z are 1000001 to 1011010 In decimal: 65 to 90. We subtracted 64 from the ASCII and got locations 1 to 26. If we choose m=32 we achieve the same result.
What does it Mean? 32 = 25 Dividing by 32 and taking the remainder means taking the 5 LSBs in the binaryrepresentation. A is1000001= 1 Z is1011010 = 26 Is this good? – Here, yes. In general?
Want Even Distribution We want even distribution of the bits. If m=2bthen only the bleast significant bits participate in the hashing. We would like all bits to participate. Solution: Use for m a prime number not too close to a power of 2. Always good idea: Check distribution of real application data.
The Multiplication Method Choose: 0 < σ < 1 h(k) = Meaning: 1. multiply key by σ and take fraction. 2. then multiply by m and take floor.
Multiplication Method Example Choose:σ= 0.618 m=32 (Knuth recommends (√5 -1)/2 ) k=2391 h(k) = 2391 x 0.618 = 1477.638 1477.638-1477 = 0.638 0.638 x 32 = 20.416 So h(2391)=20.
Why does this involve all bits? Assume: kuses w bits, which is a computer word. Word operations take constant time. Assume:m = 2p. Easy Implementation ofh(k):
Implementing h(k) Easy Implementation ofh(k): w bits k X σ2w p bits Getting rid of this Means dividing by 2w. h(k)
Worst Case Problem: Malicious adversary can make sure all keys hash to same entry. Usual solution: randomize. But then how do we get to correct hash table entry upon search?
Universal Hashing We want to make sure that: Not every time we hash keys k1, k2, … , kn They hash to same table entry. How can we make sure of that? Use a number of different hash functions and employ them at random.
Universal Hashing A collection H of hash functions from universe U to {0,…,m} is universal if for every pair of keys x,yєU,where x≠y the number of hash functions hєH for which h(x)=h(y) is |H|/m. This means: If we choose a function hєH at random the chance of collision between x and y where x≠y is1/m.
Constructing a Universal Collection of Hash Functions Choose m to be prime. Decompose a key k as follows: k = [k0, k1, … , kr ] log m bits value < m r+1 pieces
Universal Hashing Construction Choose randomly mr+1 sequences of length r+1 of elements from {0,…,m-1}. Each sequence is a=[a0, … , ar], aiє{0,…,m-1}. Each sequence defines a hash function The universal class is: H = U{ha} a
This is a Universal Class Need to show that the probability that x≠y collide is 1/m. Because x≠y then there is some i for which xi≠yi . Wlog assume i=0. Since m is prime then for any fixed [a1,…,ar] there is exactly one value that a0 can get that satisfies h(x)=h(y). Why?
Proof Continuation… h(x)-h(y)=0 means: But x0-y0 is non-zero. For primem there is a unique multiplicative inverse modulo m. Thus there is only one value between 0 and m-1 that a0 can get.
End of Proof Conclude: Each pair x≠y may collide once for each [a1,…,ar]. But there are mr possible values for [a1,…,ar]. However: Since there are mr+1 different hash functions, the keys x and y collide with probability:
Open Addressing Solution 2to the collision problem. We don’t want to use chaining. (save space and complexity of pointers.) Where should we put the collision? Inside hash table, in an empty slot. Problem: Which empty slot? -- We assume a probing hashing function.
Idea behind Probing Hashing function of the form: h : U x {0,…,m} {0,…,m} The initial hashing function: h(k,0). h(k,i) for i=1,…,m, gives subsequent values. We try locations h(k,0), h(k,1), etc. until an open slot is found in the hash table. If none found we report an overflow error.
Average Complexity of Probing Assume uniform hashing – i.e. A probe sequence for key k is equally likely to be any permutation of {0,…,m}. What is the expected number of probes with load factor α? Note: since this is open address hashing, α = n/m <1
Probing Complexity – Case I Case I : key not in hash table. Key not in table every probe, except for last, is to an occupied slot. Let pi=Pr{exactly i probes access occupied slot} Then The expected number of probes is
Probing Complexity – Case I Note: For i>n, pi=0. Claim: , Where qi=Pr{at least i probes access occupied slots} Why?
Probing Complexity – Case I q2 q3 q4
Probing Complexity – Case I What is: qi? q1 = n/m there are n elements in table, the probability of having something in the first slot accessed is therefore n/m q2 = n/m ((n-1)/(m-1))
Probing Complexity – Case I Conclude: The expected complexity of insering an element into the hash table is: Example: If the hash table is half full, the expected number of probes is 2. If it is 90% full, the expected number is 10.
Probing Complexity – Case II Case II : key in hash table. As in the chaining case. Expected search time for key in table = Expected insertion time. We have shown: Expected time to insert key i+1 is:
Probing Complexity – Case II Conclude: Expected key insertion time: Need to compute
Probing Complexity – Case II In General: What is?
Probing Complexity – Case II Consider any monotonic function f(x)>0. Approximate By the area under the function.
Probing Complexity – Case II Consider any monotonic function f(x)>0. f(x) a-1 a b b+1
Probing Complexity – Case II Conclude: Approximate By