
1431227-3 File Organization and Processing

Learn about the benefits and drawbacks of using hash tables, a data structure that offers fast insertion and searching. Discover how hash functions and arrays are utilized to transform key values into array indices. Explore real-world examples and understand the importance of handling collisions and overflows in hash tables.


Presentation Transcript


  1. 1431227-3 File Organization and Processing: Hash Tables

  2. Introduction A hash table is a data structure that offers very fast insertion and searching, and often fast deletion as well. Computer programs typically use hash tables when they need to look up tens of thousands of items in less than a second (spelling checkers, for example). Hash tables are significantly faster than trees, offering O(1) access. Not only are they fast, hash tables are also relatively easy to program.

  3. Hash Tables Disadvantages: They're based on arrays, and arrays are difficult to expand once they've been created. For some kinds of hash tables, performance may degrade catastrophically when the table becomes too full, so the programmer needs to have a fairly accurate idea of how many data items will need to be stored (or be prepared to periodically transfer data to a larger hash table, a time-consuming process). There's no convenient way to visit the items in a hash table in any kind of order (such as from smallest to largest). If you need this capability, you'll need to look elsewhere.

  4. Hash Tables If you don't need to visit items in order, and you can predict in advance the size of your database, hash tables are unparalleled in speed and convenience. One important concept is how a range of key values is transformed into a range of array index values. In a hash table this is accomplished with a hash function. However, for certain kinds of keys, no hash function is necessary; the key values can be used directly as array indices.

  5. Examples – Employee ID as a Key Suppose you're writing a program to access employee records for a small company with, say, 1,000 employees. Each employee record requires 1,000 bytes of storage. Thus you can store the entire database in only 1 megabyte, which will easily fit in your computer's memory. The company's personnel director has specified that she wants the fastest possible access to any individual record. Also, every employee has been given a number from 1 (for the founder) to 1,000 (for the most recently hired worker). These employee numbers can be used as keys to access the records. What sort of data structure should you use in this situation?
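In this situation the key can be used directly as an array index, with no hash function at all. A minimal Python sketch of this direct-addressing idea; the record contents are illustrative:

```python
# Direct addressing: employee numbers 1..1000 are used as array indices,
# so no hash function is needed. Record contents here are illustrative.

NUM_EMPLOYEES = 1000

# Index 0 is unused so that employee number k lives at records[k].
records = [None] * (NUM_EMPLOYEES + 1)

def put(emp_no, record):
    records[emp_no] = record      # O(1) store

def get(emp_no):
    return records[emp_no]        # O(1) lookup

put(1, {"name": "Founder"})
put(1000, {"name": "Newest hire"})
```

Every access is a single array subscript, which is as fast as the personnel director could ask for.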

  6. Dictionary Problem A phone company wants to provide caller ID capability: given a phone number, return the caller's name. Phone numbers are in the range 0 <= r <= R, where here R = 10^7 − 1. They want to do this as efficiently as possible. A few suboptimal ways to design this dictionary: an array indexed by the key: O(1) time, O(n + R) space – a huge amount of wasted space; a linked list: O(n) time, O(n) space; a balanced binary tree: O(log n) time, O(n) space.

  7. HASHING • Using balanced trees (2-3, 2-3-4, red-black, and AVL trees) we can implement table operations (retrieval, insertion and deletion) efficiently → O(log₂ n). • Can we find a data structure that performs these table operations better than balanced search trees, in O(1)? YES → HASH TABLES. • In hash tables, we have an array (index: 0..n-1) and an address calculator (hash function) which maps a search key into an array index between 0 and n-1.

  8. Hash Function – Address Calculator [Figure: search keys such as integers, names, telephone numbers, and locations are fed through the hash function to produce positions in the hash table.]

  9. A Better Solution We can do better with a hash table – O(1) expected time, O(n + N) space, where N is the size of the hash table. We need a function to map the large range of keys into a smaller range of table indices, e.g., h(K) = K mod N. Insert {402-3045, "CAJ (w)"} into a hashed array with, say, N = 5 slots: 4023045 mod 5 = 0, so {402-3045, "CAJ (w)"} goes in slot 0 of the hash table. A lookup uses the same process: hash the query key, then check the array at that slot. Insert {428-7971, "CAJ (h)"}: 4287971 mod 5 = 1, so it goes in slot 1.
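A minimal Python sketch of this hashed array (collision handling is deliberately omitted here, as on the slide; the function names are my own):

```python
N = 5                      # number of slots in this toy table
table = [None] * N

def h(key):
    return key % N         # map the large phone-number range to 0..N-1

def insert(key, value):
    table[h(key)] = (key, value)   # ignores collisions in this sketch

def lookup(key):
    entry = table[h(key)]
    return entry[1] if entry and entry[0] == key else None

insert(4023045, "CAJ (w)")   # 4023045 mod 5 = 0 -> slot 0
insert(4287971, "CAJ (h)")   # 4287971 mod 5 = 1 -> slot 1
```

Both insert and lookup do one hash computation and one array access, which is where the O(1) expected time comes from.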

  10. Ideal Hashing Uses a 1D array (or table) table[0:b-1]. Each position of this array is a bucket. A bucket can normally hold only one dictionary pair. Uses a hash function f that converts each key k into an index in the range [0, b-1]. f(k) is the home bucket for key k. Every dictionary pair (key, element) is stored in its home bucket table[f(key)].

  11. Ideal Hashing Example Pairs are: (22,a), (33,c), (3,d), (73,e), (85,f). Hash table is table[0:7], b = 8. Hash function is key/11 (integer division). Pairs are stored in the table as below: [0] (3,d), [1] empty, [2] (22,a), [3] (33,c), [4] empty, [5] empty, [6] (73,e), [7] (85,f). • get, put, and remove take O(1) time.
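The example can be reproduced in a few lines of Python, with integer division playing the role of the slide's key/11 hash function:

```python
b = 8
table = [None] * b

def f(key):
    return key // 11     # the slide's hash function: integer division by 11

# store each pair in its home bucket table[f(key)]
for key, elem in [(22, 'a'), (33, 'c'), (3, 'd'), (73, 'e'), (85, 'f')]:
    table[f(key)] = (key, elem)
```

Because every key here hashes to a distinct bucket, no collisions occur and each operation touches exactly one bucket.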

  12. What Can Go Wrong? Where does (26,g) go? Keys that have the same home bucket are synonyms; 22 and 26 are synonyms with respect to the hash function that is in use. The home bucket for (26,g), bucket [2], is already occupied by (22,a).

  13. What Can Go Wrong? A collision occurs when the home bucket for a new pair is occupied by a pair with a different key. An overflow occurs when there is no space in the home bucket for the new pair. When a bucket can hold only one pair, collisions and overflows occur together. We need a method to handle overflows.

  14. Hashing • A hash function tells us where to place an item in an array of size n called a hash table. This method is known as hashing. • A hash function maps a search key into an integer between 0 and n-1. • We can have different hash functions. • Ex. h(x) = x mod n if x is an integer • The hash function is designed for the search keys depending on the data types of these search keys (int, string, ...)

  15. Hashing • Collisions occur when the hash function maps more than one item into the same array location. • We have to resolve these collisions using some mechanism. • A perfect hash function maps each search key into a unique location of the hash table. • A perfect hash function is possible if we know all the search keys in advance. • In practice (we do not know all the search keys), a hash function can map more than one key into the same location (collision).

  16. Hash Function • We can design different hash functions. • But a good hash function should: • be easy and fast to compute, • place items evenly throughout the hash table. • We will consider only hash functions that operate on integers. • If the key is not an integer, we map it into an integer first, and apply the hash function.

  17. Hash Function: Digit-Selection • If the search keys are big integers (e.g., nine-digit numbers), we can select certain digits and combine them to create the address. • h(033475678) = 37, selecting the 2nd and 5th digits (table size is 100) • h(023455678) = 25 • Digit-selection is not a good hash function because it does not place items evenly throughout the hash table.

  18. Hash Functions – Folding • Folding: select all digits and add them • h(033475678) = 0 + 3 + 3 + 4 + 7 + 5 + 6 + 7 + 8 = 43, so 0 ≤ h(nine-digit search key) ≤ 81 • Folding does not provide a uniformly distributed hash key. • We can also select groups of digits and add these groups, e.g. 3-digit groups → h(033475678) = 033 + 475 + 678 = 1186.
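Both folding variants are easy to express in Python; the helper names below are my own, not from the slides:

```python
def fold_digits(key):
    # fold a nine-digit key by adding its individual digits
    return sum(int(d) for d in f"{key:09d}")

def fold_groups(key, width=3):
    # fold by splitting the nine digits into groups and adding the groups
    s = f"{key:09d}"
    return sum(int(s[i:i + width]) for i in range(0, len(s), width))
```

For the slide's key 033475678 these give 43 and 1186 respectively; the `:09d` format restores the leading zero that a plain integer drops.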

  19. Hash Functions – Modulo Arithmetic • Modulo arithmetic provides a simple and effective hash function: h(x) = x mod tableSize

  20. Hash Functions – Converting a Character String into an Integer • If our search keys are strings, first we have to convert the string into an integer, then apply a hash function designed to operate on integers to compute the address. • We can use ASCII codes of characters in the conversion, or assign 1 (00001) to 'A', 2 to 'B', and so on. • Consider the string "NOTE": N is 14 (01110), O is 15 (01111), T is 20 (10100), E is 5 (00101). • Concatenate the four 5-bit binary numbers to get a new binary number: 01110011111010000101 → 474,757 • Then apply x mod tableSize.
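A possible Python rendering of this conversion, treating each uppercase letter as a 5-bit chunk ('A' = 1); multiplying by 32 shifts the accumulated value left by five bits, which is exactly the concatenation the slide describes:

```python
def string_to_int(s):
    # 'A' -> 1, 'B' -> 2, ...; each letter contributes a 5-bit chunk
    value = 0
    for ch in s:
        value = value * 32 + (ord(ch) - ord('A') + 1)
    return value

def h(s, table_size):
    # hash a string key: convert to an integer, then apply mod
    return string_to_int(s) % table_size

# string_to_int("NOTE") == 474757, matching the slide
```

The table size 101 used below in the test is an arbitrary choice for illustration.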

  21. Collision Resolution • There are two general approaches to collision resolution in hash tables: • Open Addressing – each entry holds one item • Chaining – each entry can hold more than one item • Buckets – each entry holds a certain number of items

  22. A Collision

  23. Open Addressing • During an attempt to insert a new item into a table, if the hash function indicates a location in the hash table that is already occupied, we probe for some other empty (or open) location in which to place the item. The sequence of locations that we examine is called the probe sequence. A scheme that uses this approach is said to use open addressing. • There are different open-addressing schemes: • Linear Probing • Quadratic Probing • Double Hashing

  24. Open Addressing – Linear Probing • In linear probing, we search the hash table sequentially starting from the original hash location. • If a location is occupied, we check the next location • We wrap around from the last table location to the first table location if necessary.

  25. Linear Probing Linear probing is the simplest of the open addressing policies. If the current slot is already being used, just try the next slot.
  Algorithm linearProbingInsert(k, e)
  Input: key k, element e
      if table is full then error
      probe ← h(k)
      while table[probe] is occupied
          probe ← (probe + 1) mod N
      table[probe] ← (k, e)
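The pseudocode above translates almost line for line into Python (the name lp_insert is my own); running it on slide 26's keys reproduces the slots computed there:

```python
def lp_insert(table, key, elem):
    # linearProbingInsert(k, e) with h(k) = k mod N
    N = len(table)
    if all(slot is not None for slot in table):
        raise RuntimeError("table is full")
    probe = key % N
    while table[probe] is not None:
        probe = (probe + 1) % N   # try the next slot, wrapping around
    table[probe] = (key, elem)
    return probe

# Slide 26's example: table size 11, h(x) = x mod 11
table = [None] * 11
slots = [lp_insert(table, k, None) for k in [20, 30, 2, 13, 25, 24, 10, 9]]
# slots == [9, 8, 2, 3, 4, 5, 10, 0], matching the slide
```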

  26. Linear Probing – Example • Table size is 11 (0..10) • Hash function: h(x) = x mod 11 • Insert keys: • 20 mod 11 = 9 • 30 mod 11 = 8 • 2 mod 11 = 2 • 13 mod 11 = 2 → 2+1 = 3 • 25 mod 11 = 3 → 3+1 = 4 • 24 mod 11 = 2 → 2+1, 2+2, 2+3 = 5 • 10 mod 11 = 10 • 9 mod 11 = 9 → 9+1, (9+2) mod 11 = 0

  27. Linear Probing – Example Table size N = 13. h(k) = k mod 13. Insert the keys: 18, 41, 22, 44, 59, 32, 31, 73.

  28. Linear Probing – Get and Put divisor = b (number of buckets) = 17. Home bucket = key % 17. • Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45. The resulting table: [0] 34, [1] 0, [2] 45, [6] 6, [7] 23, [8] 7, [11] 28, [12] 12, [13] 29, [14] 11, [15] 30, [16] 33 (remaining buckets empty).

  29. Linear Probing – Remove remove(0) • Search the cluster for a pair (if any) to fill the vacated bucket. Removing 0 vacates bucket [1]; 45 (home bucket [11]) moves up from bucket [2] into bucket [1], leaving bucket [2] empty.

  30. Linear Probing – remove(34) Search the cluster for a pair (if any) to fill the vacated bucket. Removing 34 vacates bucket [0]; 0 (home bucket [0]) moves up from bucket [1] into bucket [0], then 45 (home bucket [11]) moves from bucket [2] into bucket [1], leaving bucket [2] empty.

  31. Linear Probing – remove(29) Search the cluster for a pair (if any) to fill the vacated bucket. Removing 29 vacates bucket [13]; 11 moves from bucket [14] into [13], 30 moves from [15] into [14], and 45 (home bucket [11]) moves from bucket [2] into [15], leaving bucket [2] empty.
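The cluster-repair rule used in slides 29–31 can be sketched as follows. The move test — a pair may fill the hole only if the hole lies on the probe path from its home bucket — is my own formulation of the slides' rule; keys only (elements omitted) reproduce slide 28's table:

```python
# Slide 28's table after putting 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45
# with b = 17 buckets and home bucket = key % 17.
E = None
table = [34, 0, 45, E, E, E, 6, 23, 7, E, E, 28, 12, 29, 11, 30, 33]

def lp_remove(table, key):
    N = len(table)
    i = key % N
    while table[i] is not None and table[i] != key:
        i = (i + 1) % N           # probe for the key
    if table[i] is None:
        return                    # key not present
    table[i] = None               # vacate the bucket
    # scan the rest of the cluster; a later key may fill the hole only if
    # the hole lies on the probe path from its home bucket to its slot
    hole, j = i, (i + 1) % N
    while table[j] is not None:
        home = table[j] % N
        if (j - home) % N >= (j - hole) % N:
            table[hole], table[j] = table[j], None
            hole = j              # the moved key leaves a new hole
        j = (j + 1) % N

lp_remove(table, 34)   # slide 30: 0 then 45 shift up, bucket [2] empties
```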

  32. Linear Probing – Clustering Problem • One of the problems with linear probing is that table items tend to cluster together in the hash table. • This means that the table contains groups of consecutively occupied locations. • This effect is called (primary) clustering. • Clusters can get close to one another and merge into a larger cluster. • Thus, one part of the table might be quite dense, even though another part has relatively few items. • (Primary) clustering causes long probe searches and therefore decreases the overall efficiency.

  33. Linear Probing – Other Step-Sizes • The next location to be probed is determined by the so-called step-size; step-sizes other than one are possible. • The step-size should be relatively prime to the table size, i.e. their greatest common divisor should be equal to 1. • If we choose the table size to be a prime number, then any step-size is relatively prime to the table size. • Clustering cannot be avoided by larger step-sizes.

  34. Open Addressing – Quadratic Probing • The primary clustering problem can be almost eliminated if we use a quadratic probing scheme. • In quadratic probing, • We start from the original hash location i • If a location is occupied, we check the locations i+1², i+2², i+3², i+4², ... • We wrap around from the last table location to the first table location if necessary.

  35. Quadratic Probing – Example • Table size is 11 (0..10) • Hash function: h(x) = x mod 11 • Insert keys: • 20 mod 11 = 9 • 30 mod 11 = 8 • 2 mod 11 = 2 • 13 mod 11 = 2 → 2+1² = 3 • 25 mod 11 = 3 → 3+1² = 4 • 24 mod 11 = 2 → 2+1², 2+2² = 6 • 10 mod 11 = 10 • 9 mod 11 = 9 → 9+1², (9+2²) mod 11, (9+3²) mod 11 = 7
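A small Python sketch of quadratic probing (function name mine); note that, unlike linear probing, the probe loop is not guaranteed to terminate when the table is nearly full:

```python
def qp_insert(table, key):
    # quadratic probing: try home, home+1^2, home+2^2, ... (mod N)
    N = len(table)
    home = key % N
    i = 0
    while table[(home + i * i) % N] is not None:
        i += 1            # may fail to terminate if the table is too full
    slot = (home + i * i) % N
    table[slot] = key
    return slot

# The slide's example: table size 11, h(x) = x mod 11
table = [None] * 11
slots = [qp_insert(table, k) for k in [20, 30, 2, 13, 25, 24, 10, 9]]
# slots == [9, 8, 2, 3, 4, 6, 10, 7], matching the slide
```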

  36. Quadratic Probing – Clustering Problem • Even though (primary) clustering is avoided by quadratic probing, so-called secondary clustering may occur. • Secondary clustering is caused by multiple search keys mapped to the same hash key. • The probing sequence for such search keys is prolonged by repeated conflicts along the probing sequence. • Both linear and quadratic probing use a probing sequence that is independent of the search key.

  37. Open Addressing – Double Hashing • Double hashing reduces clustering in a better way. • The increments for the probing sequence are computed by using a second hash function h2, which should satisfy h2(key) ≠ 0 and h2 ≠ h1. • We first probe the location h1(key). • If the location is occupied, we probe the locations h1(key)+h2(key), h1(key)+2*h2(key), ...

  38. Double Hashing Uses two hash functions, h1(k) and h2(k). Typically h1(k) = k mod N. Typically h2(k) = q – (k mod q), where q is a prime number and q < N. If N is prime, all slots in the table will eventually be examined. Many of the same (dis)advantages as linear probing: space-efficient but slow compared with separate chaining, though it distributes keys more uniformly than linear probing.
  Algorithm doubleHashInsert(k, e)
  Input: key k, element e
      if table is full then error
      probe ← h1(k)
      offset ← h2(k)
      while table[probe] is occupied
          probe ← (probe + offset) mod N
      table[probe] ← (k, e)
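The double-hashing pseudocode, rendered as a Python sketch (function name mine) and checked against slide 39's example with q = 7 and N = 11:

```python
def dh_insert(table, key, q=7):
    # h1(k) = k mod N, h2(k) = q - (k mod q); q prime and q < N
    N = len(table)
    probe = key % N
    offset = q - (key % q)          # the key-dependent step size
    while table[probe] is not None:
        probe = (probe + offset) % N
    table[probe] = key
    return probe

# Slide 39's example: N = 11, h2(x) = 7 - (x mod 7)
table = [None] * 11
slots = [dh_insert(table, k) for k in [58, 14, 91, 25]]
# slots == [3, 10, 6, 9], matching the slide
```

Because the offset depends on the key, two synonyms such as 14 and 25 follow different probe sequences, which is what breaks up clustering.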

  39. Double Hashing – Example • Table size is 11 (0..10) • Hash functions: h1(x) = x mod 11, h2(x) = 7 – (x mod 7) • Insert keys: • 58 mod 11 = 3 • 14 mod 11 = 3 → 3+7 = 10 • 91 mod 11 = 3 → 3+7, (3+2*7) mod 11 = 6 • 25 mod 11 = 3 → 3+3, 3+2*3 = 9

  40. Double Hashing – Example Table size N = 13. h1(k) = k mod 13, h2(k) = 7 – (k mod 7). Keys to be inserted: 18, 41, 22, 44, 59, 32, 31, 73.

  41. Theoretical Results The load factor α is the average number of keys per array index: α = n/N. The analysis is probabilistic rather than worst-case. Expected number of probes in a search: • Chaining: found ≈ 1 + α/2, not found ≈ α • Linear probing: found ≈ ½(1 + 1/(1−α)), not found ≈ ½(1 + 1/(1−α)²) • Double hashing: found ≈ (1/α) ln(1/(1−α)), not found ≈ 1/(1−α)

  42. Expected Number of Probes in a Search vs. Load Factor [graph omitted]

  43. Hash Table Design Performance requirements are given; determine the maximum permissible loading density. We want a successful search to make no more than 10 compares (expected): Sn ≈ ½(1 + 1/(1 − α)) → α ≤ 18/19. We want an unsuccessful search to make no more than 13 compares (expected): Un ≈ ½(1 + 1/(1 − α)²) → α ≤ 4/5. So α ≤ min{18/19, 4/5} = 4/5.
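These bounds can be sanity-checked numerically with the linear-probing formulas themselves (a small sketch, not part of the slides):

```python
def Sn(a):
    # expected compares for a successful search under linear probing
    return 0.5 * (1 + 1 / (1 - a))

def Un(a):
    # expected compares for an unsuccessful search under linear probing
    return 0.5 * (1 + 1 / (1 - a) ** 2)

# At the derived bounds the formulas meet the requirements exactly:
# Sn(18/19) == 10 compares, Un(4/5) == 13 compares
```

Since both requirements must hold, the tighter bound α ≤ 4/5 governs the design.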

  44. Hash Table Design Dynamic resizing of table. Whenever loading density exceeds threshold (4/5 in our example), rehash into a table of approximately twice the current size. Fixed table size. Know maximum number of pairs. No more than 1000 pairs. Loading density <= 4/5 => b >= 5/4*1000 = 1250. Pick b (equal to divisor) to be a prime number or an odd number with no prime divisors smaller than 20.

  45. Linear List of Synonyms Each bucket keeps a linear list of all pairs for which it is the home bucket. The linear list may or may not be sorted by key. The linear list may be an array linear list or a chain.

  46. Sorted Chains • Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45 • Home bucket = key % 17. Each home bucket keeps a sorted chain: [0] 0 → 34, [6] 6 → 23, [7] 7, [11] 11 → 28 → 45, [12] 12 → 29, [13] 30, [16] 33.
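A minimal chaining sketch, with Python lists standing in for the sorted chains, reproduces these buckets:

```python
b = 17
chains = [[] for _ in range(b)]   # one (sorted) chain per bucket

def chain_put(key):
    bucket = chains[key % b]      # home bucket = key % 17
    bucket.append(key)
    bucket.sort()                 # keep each chain sorted by key

for k in [6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45]:
    chain_put(k)
```

A real implementation would insert into a sorted linked list in place rather than re-sorting, but the resulting bucket contents are the same.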

  47. Expected Performance Note that α ≥ 0. Expected chain length is α. Sn ≈ 1 + α/2. Un ≤ α when α < 1; Un ≈ 1 + α/2 when α ≥ 1.
