Hashing Indirect Address Translation

akamu

Presentation Transcript


  1. Hashing: Indirect Address Translation • Chapter 11

  2. Indirect Address Translation • Direct translation • When the Primary Key (PK) and the relative record position (RRP) are the same, we say there is a direct translation. • Simple direct access file systems use this technique.

  3. Indirect Address Translation • Direct translation - problems • The PKs may not be numeric. • Names • Alphanumeric IDs

  4. Indirect Address Translation • Direct translation - problems • Only a small percentage of the possible range of PKs may actually have records assigned to them: • Consider an employee file whose key field is a 9-digit ID number (e.g. Social Security Number). • The company has 200 employees. • Since the IDs may take any of 10^9 values, the file would have to be huge (10^9 records!). Thus the file will have a packing density of: 200 records used / 10^9 records allocated = 2 / 10^7 = 0.00002%
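The packing-density arithmetic for the employee file can be checked with a few lines of Python. This is only a sketch of the calculation; the variable names are illustrative and not part of the original slides.

```python
# Packing density = records stored / addresses allocated.
records_used = 200
addresses_allocated = 10**9  # one slot for every possible 9-digit key

packing_density = records_used / addresses_allocated

print(f"{packing_density:.7f}")          # density as a fraction
print(f"{packing_density * 100:.5f}%")   # the same density as a percentage
```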

  5. Indirect Address Translation • Hashing • A common technique of indirect translation is hashing. • A solution in which the broad range of PK values is transformed into the smaller range of RRP values. • Hashing uses a hashing function to translate the key values into the smaller range of RRP values.

  6. Indirect Address Translation • Hashing Algorithms • Development of a hashing function requires careful attention. • The algorithm should distribute the keys as evenly as possible across the range of addresses. • Because there are more possible keys than addresses, some different keys must necessarily map to the same address.

  7. Key Transformation Algorithms • 3 general steps to convert a key to a RRP address: 1) If key is not numeric, convert it into a numeric form, without losing information. 2) Operate on the numeric key using an algorithm which converts the keys into a spread of numbers of the order of magnitude of the address numbers required. 3) The resulting numbers are multiplied by a constant which compresses the address into the precise range of addresses.

  8. Key Transformation Algorithms • Example: • Key is a 9 Digit Number. • Destination file has 7000 records • Step 1 - Not needed (already a number) • Step 2 - Divide Key by 10000 to get remainder between 0 - 9999 • Step 3 - we multiply the value from 2 by .7 to put number within the range 0000 to 6999.
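The example above (steps 2 and 3 for a 7000-record file) might be sketched as follows. The function name and the sample key are hypothetical; integer arithmetic is used in place of multiplying by .7 so that the result is exactly the truncated product.

```python
def three_step_rrp(key: int, file_size: int = 7000) -> int:
    """Map a numeric key to an RRP in 0..file_size-1 (illustrative sketch)."""
    # Step 1 is skipped: the key is already numeric.
    spread = key % 10_000               # step 2: remainder in 0..9999
    # Step 3: scale by file_size / 10000 (0.7 for 7000 records), truncating.
    return spread * file_size // 10_000


print(three_step_rrp(123456789))  # hypothetical 9-digit key
```

The largest possible remainder, 9999, maps to 6999, so every result stays inside the 0000 to 6999 range.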

  9. Key Transformation Algorithms • Example: • What would happen if we skipped step 2 and simply compressed the number from step 1? • What about clustered insertions? (Keys with contiguous values.)

  10. Key Transformation Algorithms - Division • The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP. • A prime number or number with no small factors is used.

  11. Key Transformation Algorithms - Division • Example: • Records have a 6-digit key, 5000 RRPs desired. • Divide by 4997 and use the remainder. • Consider key: 142536 • 142536 / 4997 = 28 remainder 2620. • Use 2620 as RRP. • How do you suppose this method would work with clustered insertions?
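A minimal sketch of the division method, using the divisor 4997 and the key 142536 from the example. The function name is an assumption made for illustration.

```python
def division_hash(key: int, divisor: int = 4997) -> int:
    """Return the remainder of key / divisor as the RRP."""
    return key % divisor


quotient, rrp = divmod(142536, 4997)
print(quotient, rrp)

# Clustered insertions: consecutive keys give consecutive remainders,
# so a run of contiguous keys lands on distinct neighbouring addresses.
cluster = [division_hash(k) for k in range(142536, 142540)]
print(cluster)
```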

  12. Key Transformation Algorithms - Extraction • Select digits from different parts of the key. • Example: • Records with 10-digit key, 5000 RRPs desired. • Choose 3rd, 5th, 8th and 9th digits (counting from the left): • Consider key = 3865324567, which gives 6356. • Compress into RRP range: INT(6356 * .5) = 3178. Use 3178 as RRP.
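Digit extraction might be sketched as below, assuming digit positions are counted from the left starting at 1. The function name and the sample key 1234567890 are hypothetical, chosen so the selected digits are easy to verify by eye.

```python
def extraction_hash(key: str, positions=(3, 5, 8, 9)) -> int:
    """Pick the given 1-based digit positions, then halve into 0..4999."""
    digits = "".join(key[p - 1] for p in positions)
    return int(digits) // 2   # same as INT(value * .5)


# Digits 3, 5, 8 and 9 of 1234567890 are 3, 5, 8 and 9, giving 3589.
print(extraction_hash("1234567890"))
```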

  13. Key Transformation Algorithms - Folding • Digits in the key are folded inward like folding paper. Then the digits are added. • Folding tends to be more appropriate for large keys.

  14. Key Transformation Algorithms - Folding • Example • Let key be 142537. • Fold left at 4th digit, right at 3rd digit: • Results in 4137 and 735 • Add the two resulting values: 4137 + 735 = 4872 • Compress into RRP range: • 4872 x .5 = 2436. Use 2436 as RRP.

  15. Key Transformation Algorithms - Mid-square method • Square the key, and use the central digits of the result. • Example: • Let records have a 6-digit key, and 5000 RRPs desired. • Key value of 142536. • 142536^2 --> 020316511296 • 1651 - central digits • Compress into RRP range: • INT(1651 x .5) = 825. Use 825 as RRP.
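The mid-square example can be reproduced with a short sketch. The function name and parameters are assumptions; the square is zero-padded to twice the key length so that the central digits are well defined.

```python
def mid_square_hash(key: int, key_digits: int = 6, mid_digits: int = 4) -> int:
    """Square the key, take the central digits, halve into 0..4999."""
    squared = str(key * key).zfill(2 * key_digits)   # pad to 12 digits
    start = (len(squared) - mid_digits) // 2
    central = int(squared[start:start + mid_digits])
    return central // 2                              # INT(central * .5)


print(mid_square_hash(142536))
```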

  16. Key Transformation Algorithms - Selection • The best way to choose a transform is to take the key set for the file and simulate using different transforms. • Choose the one which distributes the records most evenly. • The division method seems to be the best general transform.

  17. Important hashing considerations • When designing a practical hashing scheme, several important issues must be addressed: • record distribution • A hashing function needs to be picked which will evenly distribute the records throughout the RRP range. • Different key sets will have different distribution patterns. • Thus the hashing function chosen will depend on the patterns of keys in the data set.

  18. Important hashing considerations • synonyms • Two or more PKs which transform to the same RRP address. • The goal is to devise a hashing function for a given set of keys which will minimize synonyms. • It is, however, statistically beyond reason to totally avoid synonyms. • Not only would all keys need to be known in advance, but only one algorithm in 10^12000 will work!

  19. Important hashing considerations • collisions • A collision occurs when a new record hashes to an address already in use by another record. • The new record and the existing record are called synonyms. • The result is called an overflow. • A scheme must be devised to handle overflows efficiently.

  20. Important hashing considerations • packing density • Ratio of records stored in a file to addresses available in the file. • Typically the best packing density is 80-90%. • The larger the file, the lower the probability of an overflow. • There is thus a trade-off between space and efficiency.

  21. Techniques for handling collisions • Strategies for collision resolution: 1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). Called Composite Records or buckets. 2. Develop algorithms for relocating records which collide.

  22. Composite Records or buckets • Reduce the number of RRPs, but increase the size of each to hold several records. • Each RRP (called a bucket) now holds several logical records. (Figure: addresses 1-12 regrouped as buckets 1-4.)

  23. Composite Records or buckets • Buckets are arrays of logical records. • Bucket size - number of records per bucket. • There is now room for several synonyms in each bucket. • Probability of overflow is reduced. • Overflow now only occurs when a bucket is full. • Overall file size need not increase: if the bucket size is 5, reduce the number of physical records by a factor of 5.

  24. Composite Records or buckets • May be implemented by having each file record be an array of logical records. • Example: Consider two half-full files. (Figure: one file with addresses 1-12 and one with buckets 1-4, each partly filled with records.) • Probability of overflow?

  25. Composite Records or buckets • Trade-offs • As bucket size increases, the probability of an overflow is greatly reduced. • As bucket size increases, the time to read in and scan a bucket increases. • Typical bucket sizes range from 5 to 30. • The ideal bucket size is often a multiple of the disk sector or track size. • What is the extreme case of having the longest possible bucket?

  26. Handling overflows • Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with. • Many algorithms exist for handling overflows, including: 1. Progressive overflow 2. Separate overflow area 3. Chained progressive overflow

  27. Progressive overflow • Adding new record • If home address is full, try the next address. • If next address is full, try the next, and so on. • If at end of file, wrap around to record 0. • If the search continues until the home address is reached again, the file is full.

  28. Progressive overflow • Finding a record • If in home bucket, success! • Else if home bucket not full, search fails. • Else if home bucket full, go search next bucket. • Keep searching successive buckets until either found, or a non-full bucket is searched.

  29. Progressive overflow • Finding a record • Note that as file fills, search length will increase. • What are some enhancements? • Each bucket has flag indicating if bucket has really overflowed

  30. Progressive overflow • Delete record • Can't simply remove, or find may not work correctly • Must mark each record as used, unused, or deleted.
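The add, find and delete rules for progressive overflow can be sketched together. This is a simplified illustration, not the slides' own implementation: it assumes a bucket size of 1 (so a "non-full bucket" is simply an empty slot), integer keys, division hashing for the home address, and unique keys. The class and marker names are hypothetical.

```python
EMPTY = object()     # slot has never been used
DELETED = object()   # slot held a record that was later deleted


class ProgressiveOverflowFile:
    def __init__(self, size: int):
        self.slots = [EMPTY] * size

    def _home(self, key: int) -> int:
        return key % len(self.slots)      # assumed hashing function

    def add(self, key: int) -> int:
        n = len(self.slots)
        home = self._home(key)
        for step in range(n):
            addr = (home + step) % n      # wrap around to record 0
            if self.slots[addr] is EMPTY or self.slots[addr] is DELETED:
                self.slots[addr] = key
                return addr
        raise OverflowError("file full")  # back at the home address

    def find(self, key: int):
        n = len(self.slots)
        home = self._home(key)
        for step in range(n):
            addr = (home + step) % n
            slot = self.slots[addr]
            if slot is EMPTY:             # never-used slot ends the search
                return None
            if slot is not DELETED and slot == key:
                return addr
        return None

    def delete(self, key: int) -> bool:
        addr = self.find(key)
        if addr is None:
            return False
        self.slots[addr] = DELETED        # marker keeps later searches working
        return True


f = ProgressiveOverflowFile(7)
placed = [f.add(k) for k in (10, 17, 24)]   # all three hash to address 3
f.delete(17)
print(placed, f.find(24), f.find(31))
```

Because the deleted slot is marked rather than emptied, the search for 24 continues past it instead of stopping early.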

  31. Progressive overflow • Evaluation • simple • robust • searches may get very long • clustering

  32. Progressive overflow • Alternate version - skip x records each time, where x is relatively prime to the number of records. • Reduces the problem of record clustering.
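A short sketch of why the skip distance should share no common factor with the file size: in that case the probe sequence visits every address exactly once before returning home. The function name and the sample values are illustrative assumptions.

```python
from math import gcd


def probe_sequence(home: int, step: int, size: int) -> list[int]:
    """Addresses visited when skipping `step` records at a time."""
    if gcd(step, size) != 1:
        raise ValueError("step must be relatively prime to size")
    return [(home + i * step) % size for i in range(size)]


seq = probe_sequence(home=3, step=3, size=10)
print(seq)
```

With a step that does share a factor with the size (for example 5 and 10), the sequence would revisit the same few addresses and never reach the rest of the file, which is why the sketch rejects it.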

  33. Separate overflow area • Buckets contain pointers which may point to a record in a special overflow area. • Records (or buckets) are linked together in the overflow area as a linked list. • What happens if there are a lot of synonyms for a few home addresses?
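One way to picture a separate overflow area is sketched below, assuming a bucket size of 1, integer keys and division hashing. Each home slot holds a key plus a pointer (an index) into an overflow list, and synonyms are chained through that list. All names are hypothetical.

```python
class SeparateOverflowFile:
    def __init__(self, size: int):
        self.home = [None] * size   # each entry: [key, next_index] or None
        self.overflow = []          # overflow area: entries [key, next_index]

    def add(self, key: int) -> None:
        addr = key % len(self.home)
        if self.home[addr] is None:
            self.home[addr] = [key, None]
            return
        node = self.home[addr]
        while node[1] is not None:          # walk to the end of the chain
            node = self.overflow[node[1]]
        self.overflow.append([key, None])
        node[1] = len(self.overflow) - 1    # link the new record in

    def find(self, key: int):
        """Return the number of records examined, or None if absent."""
        node = self.home[key % len(self.home)]
        probes = 0
        while node is not None:
            probes += 1
            if node[0] == key:
                return probes
            node = self.overflow[node[1]] if node[1] is not None else None
        return None


f2 = SeparateOverflowFile(5)
for k in (3, 8, 13, 4):      # 3, 8 and 13 are synonyms for address 3
    f2.add(k)
print(f2.find(13), f2.find(4), f2.find(18))
```

The probe counts hint at the answer to the question above: many synonyms for one home address produce a long chain, and the search cost grows with the chain length.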

  34. Separate overflow area

  35. Chained progressive overflow • Similar to progressive overflow, but pointers link synonyms together for quicker searches.
