
DATA STRUCTURES & FILES

Learn about hashing, a technique used for efficient data storage and retrieval. This article discusses the basic functionality of hashing, handling collisions, and choosing a hash function. It also explores factors affecting hashing performance and provides examples of hash functions.


Presentation Transcript


  1. DATA STRUCTURES & FILES: HASHING TECHNIQUES
     ΔΠΘ-ΜΠΔ: DATA STRUCTURES & FILES

  2. HASHING
     An efficient way to:
     • store data
     • retrieve data
     Goal: search in O(1) time.

  3. Basic operation
     • We store the item with key k at position h(k) of the hash table.
     • Hash function h:
       • Even when k is not an integer, h(k) is.
       • It takes values from 0 to N - 1.
     • A collision occurs when k1 ≠ k2 and h(k1) = h(k2):
       • Different keys are mapped to the same position of the hash table.

  4. Hash function
     • It performs two tasks:
       • it converts the key value to an integer,
       • it restricts the integer computed in the previous step to the range [0..N-1].
     • The most obvious method is the division method (mod N).
     • An alternative method is h(k) = (a*k + b) mod N (MAD: Multiply, Add and Divide), where N is a prime number and a mod N ≠ 0.
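The two compression methods on this slide can be sketched in C++ as follows. This is only an illustration: the function names `hash_division` and `hash_mad` are hypothetical, keys are assumed to be already converted to unsigned integers, and the caller is assumed to supply a prime N for the MAD variant.

```cpp
#include <cassert>
#include <cstdint>

// Division method: h(k) = k mod N.
std::uint64_t hash_division(std::uint64_t k, std::uint64_t N) {
    assert(N > 0);
    return k % N;
}

// MAD method: h(k) = (a*k + b) mod N, where N should be prime and a mod N != 0.
// Operands are reduced mod N first, so the product stays below N*N;
// this avoids overflow only while N is at most 2^32.
std::uint64_t hash_mad(std::uint64_t k, std::uint64_t a, std::uint64_t b,
                       std::uint64_t N) {
    assert(N > 0 && a % N != 0);
    return ((a % N) * (k % N) + (b % N)) % N;
}
```

For instance, `hash_division(1234567, 10000)` keeps the last four digits, and `hash_mad(4, 2, 1, 11)` evaluates (2*4 + 1) mod 11.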

  5. Hash table
     Let N be the size of the table; here N = 10000.
     The hash function used is "take the last 4 digits", i.e. h(k) = k mod N.

     Position   Stored key
     0          (empty)
     1          025-612-0001
     2          (empty)
     3          981-101-0003
     4          451-229-0004
     …          …
     9997       (empty)
     9998       200-751-9998
     9999       (empty)

  6. Handling collisions
     • As long as there are no collisions, hashing runs in O(1) time.
     • Definition: load factor α (or λ) = n / N
       • n = number of items that have been inserted
       • N = size of the table
     • When collisions occur, the items can be kept:
       • in another data structure outside the table
       • in alternative positions of the hash table

  7. Questions to Ask When Analyzing Resolution Schemes
     • Are we guaranteed to find an empty cell if there is one?
     • Are we guaranteed we won't be checking the same cell twice during one insertion?
     • What should the load factor be to obtain O(1) average-case insert, search, and delete?

  8. Three factors affecting the performance of hashing
     • The hash function
       • Ideally, it should distribute keys and entries evenly throughout the table
       • It should minimise collisions, where the position given by the hash function is already occupied
     • The size of the table
       • Too big will waste memory; too small will increase collisions and may eventually force rehashing (copying into a larger table)
       • Should be appropriate for the hash function used, and a prime number is best
     • The collision resolution strategy
       • Separate chaining: chain together several keys/entries in each position
       • Open addressing: store the key/entry in a different position

  9. 1. Choosing a hash function: turning a key into a table position
     • Truncation
       • Ignore part of the key and use the rest as the array index (converting non-numeric parts)
       • A fast technique, but check for an even distribution throughout the table
     • Folding
       • Partition the key into several parts and then combine them in any convenient way
       • Unlike truncation, uses information from the whole key
     • Modular arithmetic (used by truncation and folding, and on its own)
       • To keep the calculated table position within the table, divide the position by the size of the table, and take the remainder as the new position

  10. Examples of hash functions
      • Truncation: If students have a 9-digit identification number, take the last 3 digits as the table position
        • e.g. 925371622 becomes 622
      • Folding: Split a 9-digit number into three 3-digit numbers, and add them
        • e.g. 925371622 becomes 925 + 371 + 622 = 1918
      • Modular arithmetic: If the table size is 1000, the first example always stays within the table range, but the second example does not (it should be reduced mod 1000)
        • e.g. 1918 mod 1000 = 918 (in C++: 1918 % 1000)
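The three examples above can be sketched in C++ as follows. The function names are hypothetical, and the sketch assumes the 9-digit ID fits in a 32-bit unsigned integer. Note that for the ID 925371622 the folded sum is 925 + 371 + 622 = 1918, which reduces to 918 for a table of size 1000.

```cpp
#include <cassert>
#include <cstdint>

// Truncation: ignore everything except the last three digits.
std::uint32_t truncate_last3(std::uint32_t id) {
    return id % 1000;
}

// Folding: split the number into 3-digit groups (from the right) and add them.
std::uint32_t fold3(std::uint32_t id) {
    std::uint32_t sum = 0;
    while (id > 0) {
        sum += id % 1000;  // take the lowest 3-digit group
        id /= 1000;        // drop it
    }
    return sum;
}

// Folding followed by modular arithmetic, so the result is a valid table position.
std::uint32_t fold3_mod(std::uint32_t id, std::uint32_t table_size) {
    assert(table_size > 0);
    return fold3(id) % table_size;
}
```

Truncation never needs the final reduction when the table has 1000 positions, whereas the folded sum can exceed the table size and must be reduced.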

  11. Using a telephone number as a key
      • The area code is not random, so it will not spread the keys/entries evenly through the table (many collisions)
      • The last 3 digits are more random
      Using a name as a key
      • Use the full name rather than the surname (surnames are not particularly random)
      • Assign numbers to the characters (e.g. a = 1, b = 2; or use Unicode values)
      • Strategy 1: Add the resulting numbers. Bad for a large table size, because the sums stay small and fill only a small part of the table.
      • Strategy 2: Call the number of possible characters c (e.g. c = 54 for the alphabet in upper and lower case, plus space and hyphen). Then multiply each character in the name by increasing powers of c, and add the results together.
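Strategy 2 can be sketched in C++ roughly as below. The character numbering in `char_code` is an assumption made for illustration (the slide only fixes a = 1, b = 2 and c = 54), the names `char_code` and `hash_name` are hypothetical, and the sketch reduces modulo the table size N at every step so that the powers of c never overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed numbering: a..z -> 1..26, A..Z -> 27..52, space -> 53,
// hyphen (and any other character) -> 0. Assumes an ASCII-compatible charset.
int char_code(char ch) {
    if (ch >= 'a' && ch <= 'z') return ch - 'a' + 1;
    if (ch >= 'A' && ch <= 'Z') return ch - 'A' + 27;
    if (ch == ' ') return 53;
    return 0;
}

// Sum of code(i) * c^i over the characters of the name, reduced mod N.
std::uint64_t hash_name(const std::string& name, std::uint64_t N) {
    assert(N > 0);
    const std::uint64_t c = 54;   // number of possible characters
    std::uint64_t power = 1 % N;  // c^0 mod N
    std::uint64_t sum = 0;
    for (char ch : name) {
        std::uint64_t code = static_cast<std::uint64_t>(char_code(ch)) % N;
        sum = (sum + code * power) % N;
        power = (power * c) % N;  // next power of c
    }
    return sum;
}
```

Unlike Strategy 1, the position of each character matters: "ab" gives 1 + 2*54 = 109 while "ba" gives 2 + 1*54 = 56, whereas a plain sum would map both to 3.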

  12. Modular arithmetic: Division
      • The key is subject to modular (remainder) division by an integer, which is usually prime.
      • This integer should be almost equal to the desired size of the array. The result of the division, i.e. the remainder, determines which array location is used.
      • This is the most commonly used method, often in combination with others.

  13. Modular arithmetic: Midsquare
      • The key is squared and the digits in the middle are retained for the address. This works better with smaller hash values (sizes less than 10000).
      EXAMPLE: for the number 9876, 9876² = 97535376, and the middle digits of this square are retained.
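A minimal C++ sketch of the midsquare idea is given below. The slide does not say how many middle digits are kept, so this sketch assumes a key of at most four digits and keeps the middle four digits of the (up to eight-digit) square; the name `mid_square4` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Midsquare for keys of at most 4 digits: square the key, drop the two
// lowest digits, and keep the next four (an assumed choice of "middle").
std::uint32_t mid_square4(std::uint32_t key) {
    assert(key <= 9999);
    std::uint32_t square = key * key;  // at most 99980001, fits in 32 bits
    return (square / 100) % 10000;
}
```

With the key 9876 the square is 97535376, so under this assumption the retained digits are 5353.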

  14. Modular arithmetic: Folding / Boundary Folding
      • Social Security Number: 387-58-1505
      • Folding: hash as the sum of three integers:
        • 387 + 58 + 1505 = 1950
      • Boundary folding: hash as the sum of three integers, with the middle one reversed:
        • 387 + 85 (this number is reversed) + 1505 = 1977

  15. Bar Coding
      • A bar code consists of 10 digits: 1 234 567 890
      • Store bar codes in a hash table
      • Suppose the total number of bar codes is less than 10,000
      • Where and how do I store the codes?
      • What is the size of the hash table?

  16. Modular arithmetic: Folding
      • The key is divided into several parts, which are combined and processed to give an address.
      For example, if the bar code is 70662 11001:
      • The hash table has 15000 entries
      • Group into pairs: 70 66 21 10 01
      • Multiply the first three pairs together: 70 x 66 x 21 = 97020
      • Add this number to the last two pairs: 97020 + 10 + 01 = 97031
      • Find the remainder of division by 14987 (15000 - 13): 97031 % 14987 = 7109

  17. Bar code: 66702 10110
      • Group into pairs: 66 70 21 01 10
      • 66 x 70 x 21 = 97020
      • 97020 + 1 + 10 = 97031
      • 97031 % 14987 = 7109
      OOPS… same value as the last bar code!
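The folding scheme from the two bar code slides can be sketched in C++ as follows, which also reproduces the collision. The function name `hash_barcode` is hypothetical; the modulus 14987 is the one used on the slides, and the input is assumed to contain exactly ten digits (spaces are ignored).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Fold a 10-digit bar code: multiply the first three digit pairs,
// add the last two pairs, and take the remainder mod 14987.
std::uint32_t hash_barcode(const std::string& code) {
    std::string digits;
    for (char ch : code) {
        if (ch >= '0' && ch <= '9') digits += ch;  // skip spaces
    }
    assert(digits.size() == 10);

    std::uint32_t pair[5];
    for (int i = 0; i < 5; ++i) {
        pair[i] = static_cast<std::uint32_t>((digits[2 * i] - '0') * 10 +
                                             (digits[2 * i + 1] - '0'));
    }
    std::uint32_t folded = pair[0] * pair[1] * pair[2] + pair[3] + pair[4];
    return folded % 14987;
}
```

Both "70662 11001" and "66702 10110" give 7109, because multiplication and addition ignore the order of the pairs.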

  18. Real problems
      • Suppose we are storing the numeric IDs of customers, perhaps 100,000 of them
      • Now we want to check whether a person is delinquent; usually there are fewer than 400 such people
      • Use an array of size 1000 for the delinquents
      • Put each ID in at position id mod tableSize
        • id = 987567 gives table index = 567
      • Clearly fast for searching
      • But what happens if entries collide?

  19. Suppose we are storing students by social security number
      • How many students?
      • How big should the table be?
      • How do I store the students?

  20. 2. Choosing the table size to minimize collisions
      • As the number of elements in the table increases, the likelihood of a collision increases, so make the table as large as practical
      • If the table size is 100, and all the hashed keys are divisible by 10, there will be many collisions!
        • Particularly bad if the table size is a power of a small integer such as 2 or 10
      • More generally, collisions may be more frequent if:
        • Greatest Common Divisor (hashed keys, table size) > 1
      • Therefore, make the table size a prime number (GCD = 1)
      • An excess of approximately 30% is typical.
        • This means that if s is the number of slots in the table and e is the number of elements, then s = a prime number >= 4/3 * e
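The sizing rule "s = a prime number >= 4/3 * e" can be sketched in C++ as below. The names `is_prime` and `table_size_for` are hypothetical, and trial division is used only because it is short; it is adequate for typical table sizes but is not the fastest primality test.

```cpp
#include <cassert>
#include <cstdint>

// Trial-division primality test.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime s with s >= 4/3 * e, for e expected elements.
std::uint64_t table_size_for(std::uint64_t e) {
    assert(e > 0);
    std::uint64_t s = (4 * e + 2) / 3;  // ceiling of 4e/3
    while (!is_prime(s)) ++s;
    return s;
}
```

For 400 expected elements this starts at 534 and moves up to the next prime, 541.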

  21. 3. The collision resolution strategy
      • Collisions may still happen, so we need a collision resolution strategy.
      • Two principal strategies (techniques):
        • Separate chaining
        • Open addressing

  22. Open Addressing Strategy
      • To insert a key K, compute h_0(K). If that location of the hash array, say T[h_0(K)], is empty, insert the key there. If a collision occurs, probe the alternative cells h_1(K), h_2(K), … until an empty cell is found.
        h_i(K) = (hash(K) + f(i)) mod m, with f(0) = 0
      • f: the collision resolution strategy

  23. Probing
      If the table position given by the hashed key is already occupied, increase the position by some amount until an empty position is found.
      • Linear probing: increase by 1 each time (mod table size!)
      • Quadratic probing: to the original position, add 1, 4, 9, 16, …
      • Double hashing: result of linear probing × result of another hash function
      Use the collision resolution strategy both when inserting and when finding (ensure that the search key and the found key match).
      With open addressing, the table size should be double the expected number of elements.

  24. Linear Probing
      • f(i) = i
      • cells are probed sequentially (with wraparound)
      • h_i(K) = (hash(K) + i) mod m
      Quadratic Probing
      • f(i) = i²
      • h_i(K) = (hash(K) + i²) mod m
      Double Hashing
      • f(i) = i * hash2(K)
      • e.g. hash2(K) = R - (K mod R), where R is a prime smaller than m
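The three probe functions above translate almost directly into C++. The following is a sketch with hypothetical function names; `home` stands for hash(K) already reduced to the range 0..m-1, and keys are assumed to be non-negative integers.

```cpp
#include <cassert>
#include <cstddef>

// Linear probing: f(i) = i.
std::size_t probe_linear(std::size_t home, std::size_t i, std::size_t m) {
    assert(m > 0);
    return (home + i) % m;
}

// Quadratic probing: f(i) = i * i.
std::size_t probe_quadratic(std::size_t home, std::size_t i, std::size_t m) {
    assert(m > 0);
    return (home + i * i) % m;
}

// Second hash function from the slide: hash2(K) = R - (K mod R).
// The result is always in 1..R, so the step is never zero.
std::size_t hash2(std::size_t key, std::size_t R) {
    assert(R > 0);
    return R - (key % R);
}

// Double hashing: f(i) = i * hash2(K).
std::size_t probe_double(std::size_t home, std::size_t i, std::size_t key,
                         std::size_t R, std::size_t m) {
    assert(m > 0);
    return (home + i * hash2(key, R)) % m;
}
```

In all three cases i = 0 returns the home position, matching f(0) = 0.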

  25. Linear Probing
      [Diagram: table positions 1 to 8]
      • For a table of size n, if the table is empty, the probability of the next entry going to any particular place is 1/n
      • In the diagram, the probability of position 2 getting filled next is 2/n (either a hash to 1 or to 2 fills it)
      • Once 2 is full, the probability of 4 being filled next is 4/n and then of 7 is 7/n (i.e. the probability of getting long strings steadily increases)
      • Linear probing suffers from primary clustering
      • If the table is fairly empty but has many collisions, linear probing may cluster (group) keys/entries
      • This increases the time to insert and to find

  26. Primary Clustering
      • We call a block of contiguously occupied table entries a cluster
      • On average, when we insert a new key K, we may hit the middle of a cluster. Therefore, the time to insert K would be proportional to half the size of a cluster. That is, the larger the cluster, the slower the performance.
      • Linear probing has the following disadvantages:
        • Once h(K) falls into a cluster, this cluster will definitely grow in size by one. Thus, this may worsen the performance of future insertions.
        • If two clusters are separated by only one entry, then inserting one key into a cluster can merge the two clusters together. Thus, the cluster size can increase drastically, and insertion performance can deteriorate drastically, after a single insertion.
        • Large clusters are easy targets for collisions.

  27. Primary Clustering
      Consider inserting the following entries: 81, 70, 97, 63, 76, 38, 85, 68, 21, 9, 55, 73, 57, 60, 72, 74, 85, 16, 61, 7, 49
      Use the number modulo 25 to determine which bin it should occupy.
      • The first five don't cause any collisions

  28. Primary Clustering
      • Inserting 38 causes a collision in bin 13
      • The next seven do not cause any further collisions

  29. Primary Clustering
      The next four insertions cause collisions:
      • 60 (bin 10)
      • 72 (bin 22)
      • 74 (bin 24)
      • 85 (bin 10)
      We can safely insert 16 into bin 16.

  30. Primary Clustering
      The remaining insertions all cause collisions:
      • 61 (bin 11)
      • 7 (bin 7)
      • 49 (bin 24)
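The insertion sequence from the last four slides can be replayed with a small C++ sketch of linear probing. The function name `insert_linear` is hypothetical, keys are assumed to be non-negative, and the function simply reports the bin in which each key ends up (or -1 if the table is full).

```cpp
#include <cassert>
#include <vector>

// Insert each key into a table of M bins using linear probing
// (home bin = key mod M) and return the bin where each key lands.
std::vector<int> insert_linear(const std::vector<int>& keys, int M) {
    assert(M > 0);
    std::vector<bool> used(M, false);
    std::vector<int> placed;
    for (int key : keys) {
        int bin = -1;
        for (int i = 0; i < M; ++i) {
            int candidate = (key % M + i) % M;  // wrap around the table
            if (!used[candidate]) {
                used[candidate] = true;
                bin = candidate;
                break;
            }
        }
        placed.push_back(bin);
    }
    return placed;
}
```

With M = 25, the key 38 is pushed from bin 13 to bin 14, and the later keys drift further from home: 61 hashes to bin 11 but only finds a free bin at 15, and 49 hashes to bin 24 but wraps around to bin 2.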

  31. Primary Clustering
      • The length of these chains will affect the number of probes required to perform insertions, accesses, or removals
      • It is possible to estimate the average number of probes for a successful search, where λ is the load factor:
        (1/2) (1 + 1/(1 - λ))
      • For example: if λ = 0.5, we require 1.5 probes on average

  32. Primary Clustering
      • The number of probes for an unsuccessful search or for an insertion is higher:
        (1/2) (1 + 1/(1 - λ)²)
      • For 0 ≤ λ < 1 we have (1 - λ)² ≤ 1 - λ, and therefore the reciprocal will be larger
      • Again, if λ = 0.5 then we require 2.5 probes on average

  33. Primary Clustering
      [Plot: number of required probes as the load factor increases]

  34. Primary Clustering
      • Our goal was to keep all operations O(1)
      • Unfortunately, as λ grows, so does the run time
      • One solution is to keep the load factor under a given bound
      • If we choose λ = 2/3, then the number of probes for a successful and for an unsuccessful search is 2 and 5, respectively
      • Therefore, we have three choices:
        • Choose M large enough so that we will not pass this load factor
        • Double the number of bins if the chosen load factor is reached
        • Choose a different strategy from linear probing
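The probe counts quoted on these slides (1.5 and 2.5 at λ = 0.5, 2 and 5 at λ = 2/3) agree with the standard linear-probing estimates (1/2)(1 + 1/(1 - λ)) for a successful search and (1/2)(1 + 1/(1 - λ)²) for an unsuccessful search or insertion. A small C++ sketch, with hypothetical function names, evaluates them:

```cpp
#include <cassert>

// Estimated average probes for a successful search with linear probing.
double probes_successful(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}

// Estimated average probes for an unsuccessful search or an insertion.
double probes_unsuccessful(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    double free_fraction = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (free_fraction * free_fraction));
}
```

Both estimates grow without bound as λ approaches 1, which is why the slides suggest keeping the load factor under a fixed bound.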

  35. Primary Clustering
      • The first solution (choose M sufficiently large) is most useful if we know all the possible entries
      • The second (doubling) is only useful if we have an environment where we can dynamically allocate memory
      • For the third, we will look at quadratic probing and double hashing. Quadratic probing is the most common technique to avoid clustering.

  36. Quadratic probing
      • Quadratic probing is a solution to the primary clustering problem
        • Linear probing adds 1, 2, 3, etc. to the original hashed key
        • Quadratic probing adds 1², 2², 3², etc. to the original hashed key
      • However, whereas linear probing guarantees that all empty positions will be examined if necessary, quadratic probing does not.
      • Two keys with different home positions will have different probe sequences
        • e.g. m = 101, h(k1) = 30, h(k2) = 29
        • probe sequence for k1: 30, 30+1, 30+4, 30+9
        • probe sequence for k2: 29, 29+1, 29+4, 29+9

  37. Example (1) - Quadratic Probing
      Use quadratic probing to insert the following numbers into an initially empty hash table with 11 bins, where the hash value of a number is its least-significant digit: 81, 70, 34, 49, 50, 64
      Starting with an initially empty table, we insert 81 in bin 1,

  38. 70 in bin 0, 34 in bin 4, and 49 in bin 9.

  39. Inserting 50, we note that bin 0 is occupied and therefore we check (mod 11):
      • 0 + 1 ≡ 1 which is occupied,
      • 0 + 4 ≡ 4 which is occupied,
      • 0 + 9 ≡ 9 which is occupied, and
      • 0 + 16 ≡ 5 which is unoccupied.
      • Thus, 50 goes into bin 5.

  40. Inserting 64, we note that bin 4 is occupied, and therefore we check:
      • bin 4 + 1 ≡ 5 which is occupied, and
      • bin 4 + 4 ≡ 8 which is unoccupied.
      • Thus, 64 goes into bin 8.

  41. Example (2) - Quadratic Probing
      From the hash table in Example 1, search for the elements 60, 61, 62, 63, 64, 65, 66, 67, 68, and 69.
      • 60: we check 0, 0 + 1 ≡ 1, 0 + 4 ≡ 4, 0 + 9 ≡ 9, 0 + 16 ≡ 5, 0 + 25 ≡ 3, and 3 is empty; therefore 60 is not in the hash table.
      • 61: we check 1 and 1 + 1 ≡ 2, and 2 is empty; therefore 61 is not in the hash table.
      • 62: 2 is empty, therefore 62 is not in the hash table.
      • 63: 3 is empty, therefore 63 is not in the hash table.
      • 64: we check 4, 4 + 1 ≡ 5, and 4 + 4 ≡ 8, and 64 is located in 8; therefore 64 is in the hash table.
      • 65: we check 5 and 5 + 1 ≡ 6, and 6 is empty; therefore 65 is not in the hash table.
      • 66: 6 is empty, therefore 66 is not in the hash table.
      • 67: 7 is empty, therefore 67 is not in the hash table.
      • 68: we check 8, 8 + 1 ≡ 9, 8 + 4 ≡ 1, and 8 + 9 ≡ 6, and 6 is empty; therefore 68 is not in the hash table.
      • 69: we check 9 and 9 + 1 ≡ 10, and 10 is empty; therefore 69 is not in the hash table.
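Examples 1 and 2 can be replayed with the following C++ sketch of a quadratic-probing table. The type name `QuadTable` and the sentinel `EMPTY` are hypothetical, keys are assumed to be non-negative integers, and the home bin is the least-significant digit, as in the examples. Deletion is not handled.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel marking an unoccupied bin

struct QuadTable {
    std::vector<int> bins;

    explicit QuadTable(int m) : bins(m, EMPTY) { assert(m > 0); }

    // Home bin: the least-significant decimal digit of the key.
    int home(int key) const { return key % 10; }

    // Insert with quadratic probing; returns the bin used, or -1 on failure.
    int insert(int key) {
        int m = static_cast<int>(bins.size());
        for (int i = 0; i < m; ++i) {
            int b = (home(key) + i * i) % m;
            if (bins[b] == EMPTY) {
                bins[b] = key;
                return b;
            }
        }
        return -1;
    }

    // Search along the same probe sequence; an empty bin ends the search.
    int find(int key) const {
        int m = static_cast<int>(bins.size());
        for (int i = 0; i < m; ++i) {
            int b = (home(key) + i * i) % m;
            if (bins[b] == EMPTY) return -1;
            if (bins[b] == key) return b;
        }
        return -1;
    }
};
```

Inserting 81, 70, 34, 49, 50, 64 into an 11-bin table places them in bins 1, 0, 4, 9, 5 and 8; searching for 60 then visits bins 0, 1, 4, 9, 5 and stops at the empty bin 3.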

  42. Example (3) - Quadratic Probing
      • Suppose an element is to be inserted in bin 23 of a hash table with 31 bins
      • The sequence in which the bins would be checked is: 23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0
      • Even if two bins are initially close, the sequence in which subsequent bins are checked varies greatly
      • Again, with M = 31 bins, compare the first 16 bins which are checked starting with 22 and 23:
        Start 22: 22, 23, 26, 0, 7, 16, 27, 9, 24, 10, 29, 19, 11, 5, 1, 30
        Start 23: 23, 24, 27, 1, 8, 17, 28, 10, 25, 11, 30, 20, 12, 6, 2, 0

  43. Quadratic Probing: Properties
      • Thus, quadratic probing solves the problem of primary clustering
      • Unfortunately, there is a second problem which must be dealt with
      • Suppose we have M = 8 bins: 1² ≡ 1, 2² ≡ 4, 3² ≡ 1 (mod 8)
      • In this case, we are checking bin h + 1 twice having checked only one other bin

  44. Disadvantage of this method:
      • After a number of probes the sequence of steps repeats itself (remember that the step is (probe number)² mod the size of the hash table). This repetition occurs when the probe number is roughly half the size of the hash table.
        • e.g. table size 16 and original hashed key 3 gives the sequence: 3, 4, 7, 12, 3, 12, 7, 4, …
      • More generally, with quadratic probing, insertion may be impossible if the table is more than half full!
      • Need to rehash
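The probe sequences quoted in the quadratic probing slides can be regenerated with a short C++ sketch. The function name `quadratic_sequence` is hypothetical; it lists the first `count` bins visited from a given home position in a table of m bins.

```cpp
#include <cassert>
#include <vector>

// First `count` bins visited by quadratic probing: (home + i*i) mod m.
std::vector<int> quadratic_sequence(int home, int m, int count) {
    assert(m > 0 && home >= 0 && count >= 0);
    std::vector<int> sequence;
    for (int i = 0; i < count; ++i) {
        sequence.push_back((home + i * i) % m);
    }
    return sequence;
}
```

With a table of size 16 and home position 3, only the bins 3, 4, 7 and 12 are ever visited, whereas with the prime size 31 the first 16 probes from bin 23 are all distinct.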

  45. Quadratic probing: summary
      • For any λ < 0.5, quadratic probing will find an empty slot; for bigger λ, quadratic probing may find a slot
      • If the table size is prime, then a new key can always be inserted if the table is at least half empty
      • Quadratic probing does not suffer from primary clustering: keys hashing to the same area are not a problem
      • Keys that hash to the same home position will probe the same alternative cells
        • This is secondary clustering; it is not obvious from looking at the table
        • Simulation results suggest that it generally causes less than an extra half probe per search

  46. Secondary Clustering
      • The phenomenon of primary clustering will not occur with quadratic probing
      • Quadratic probing suffers from a milder form of clustering called secondary clustering (if multiple items all hash to the same initial bin, the same sequence of bins will be followed). The effect is less significant than that of primary clustering.
      • As with linear probing, if two keys have the same initial probe position, then their probe sequences are the same, since h(k1,0) = h(k2,0) implies h(k1,1) = h(k2,1). So only m distinct probe sequences are used.
      • Clustering can occur around the probe sequences.

  47. Secondary Clustering
      • Secondary clustering may be a problem if the hash function does not produce an even distribution of entries
      • To avoid secondary clustering, the probe sequence needs to be a function of the original key value, not of the home position
      • One solution to secondary clustering is double hashing: associating with each element an initial bin (defined by one hash function) and a skip (defined by a second hash function)

  48. Review
      Linear probing:
      • Look at bins k, k + 1, k + 2, k + 3, k + 4, …
      • Primary clustering
      Quadratic probing:
      • Look at bins k, k + 1, k + 4, k + 9, k + 16, …
      • Secondary clustering (dangerous for poor hash functions)
      • Expensive:
        • Prime-sized arrays
        • Euclidean algorithm for calculating remainders

  49. Double Hashing
      An alternate solution
      • Give each object (with high probability) a different jump size
      • Associate with each object an initial bin and a jump size:
        for ( int k = 0; k < M; ++k ) {
            bin = (initial + k*jump) % M;
            // examine this bin
        }
      • The jump size and the number of bins M must be relatively prime

  50. Problem:
      • Will initial + k*jump step through all of the bins?
      • The output of:
        int M = 16;
        int initial = 5;
        int jump = 12;
        for ( int k = 0; k < M; ++k ) {
            cout << (initial + k*jump) % M << ' ';
        }
        is
        5 1 13 9 5 1 13 9 5 1 13 9 5 1 13 9
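The output above visits only 4 of the 16 bins because the jump size 12 and M = 16 share the common factor 4. A C++ sketch that checks this is given below; the names `greatest_common_divisor` and `bins_visited` are hypothetical, and all arguments are assumed to be non-negative.

```cpp
#include <cassert>
#include <vector>

// Euclidean algorithm.
int greatest_common_divisor(int a, int b) {
    while (b != 0) {
        int remainder = a % b;
        a = b;
        b = remainder;
    }
    return a;
}

// Number of distinct bins reached by (initial + k*jump) % M for k = 0..M-1.
int bins_visited(int initial, int jump, int M) {
    assert(M > 0 && initial >= 0 && jump >= 0);
    std::vector<bool> seen(M, false);
    int count = 0;
    for (int k = 0; k < M; ++k) {
        int bin = (initial + k * jump) % M;
        if (!seen[bin]) {
            seen[bin] = true;
            ++count;
        }
    }
    return count;
}
```

With jump = 12 the sequence reaches 16 / 4 = 4 bins, while a jump that is relatively prime to M, such as 13, reaches all 16.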
