Hash File Considerations
Hashing - Hash File Considerations • Statistical Considerations • Record Distribution is important • Ideal - one record per location • Load Factor - How full the file is • Load Factor = r / (b * m) • r - number of records stored • b - bucket size • m - number of addresses
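As a quick worked example of the load factor, the sketch below reads the slide's formula as r divided by (b × m), i.e. records stored over total record slots. The function name and the sample file dimensions are assumptions chosen for illustration.

```python
def load_factor(r: int, b: int, m: int) -> float:
    """Fraction of record slots in use: r / (b * m)."""
    if b <= 0 or m <= 0:
        raise ValueError("bucket size and number of addresses must be positive")
    return r / (b * m)


# Hypothetical file: 750 records, buckets of 5 records, 200 addresses.
print(load_factor(750, 5, 200))  # 0.75
```

A value near 1.0 means the file is nearly full; lower values leave more free slots and usually fewer collisions.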
Hashing - Statistical Considerations • Graphing Record Distribution • Frequency Distribution Graph • Y axis - records per address • X axis - RRP • Alternate Frequency Distribution Graph • Y axis - Number of addresses with x records • X axis - x records assigned • Example - (x DIV 5) MOD 4 • Data: 22, 1, 14, 56, 25, 13, 43, 62, 11
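The two frequency distributions can be tabulated directly for the slide's example. This is a minimal sketch that takes DIV to mean integer division; the variable names are assumptions, and the text-mode "graph" stands in for the slide's charts.

```python
from collections import Counter

DATA = [22, 1, 14, 56, 25, 13, 43, 62, 11]
NUM_ADDRESSES = 4


def h(x: int) -> int:
    """The slide's example hash: (x DIV 5) MOD 4."""
    return (x // 5) % NUM_ADDRESSES


# First graph: records per address (Y) against address (X).
counts = Counter(h(x) for x in DATA)
per_address = [counts.get(a, 0) for a in range(NUM_ADDRESSES)]

# Alternate graph: number of addresses holding x records (Y) against x (X).
addresses_with_x = Counter(per_address)

for address, n in enumerate(per_address):
    print(f"address {address}: {'*' * n} ({n})")
for x in range(max(per_address) + 1):
    print(f"{addresses_with_x.get(x, 0)} address(es) hold {x} record(s)")
```

For this data the records per address are 4, 1, 3, 1, so the distribution is far from the ideal of an even spread across the four addresses.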
Hashing - Overall Guidelines • If possible, use uniformly distributed keys • Use a carefully designed hashing scheme • Do statistical studies if possible • Monitor performance • Should be computationally efficient • Tailor bucket size and load factor to the particular I/O device
Hashing - Advantages • Flexibility • Adaptable to a variety of situations • Useful both for disk and memory based retrieval • Efficiency of record access • Can achieve O(1) access times
Hashing - Disadvantages • No ordered record access by PK • Data (key set) dependency • Must be specifically tailored for each key distribution and form • If characteristics change, hashing scheme may need to change • Fixed upper limit on file size • Size determined at creation time • Must "rehash" to larger file if expansion needed • May need to redesign hash algorithm as well
Hashing Considerations • Static vs. Dynamic Files • Static files • fixed key data • entire domain of keys known a priori (key set) • By experimentation, may be able to find a collision-free solution • Examples • Assembler OP code table • FAX group three compression table
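Finding a collision-free scheme by experimentation can be as simple as a brute-force search over hash parameters, since the whole key set is known in advance. The sketch below is one possible approach, not a prescribed method: the op-code mnemonics, the polynomial string hash, and the search bounds are all assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical static key set (a toy op-code table).
KEYS = ["ADD", "SUB", "MUL", "DIV", "LOAD", "STORE", "JMP", "HALT"]


def h(key: str, multiplier: int, size: int) -> int:
    """Simple polynomial string hash reduced modulo the table size."""
    value = 0
    for ch in key:
        value = value * multiplier + ord(ch)
    return value % size


def find_collision_free(
    keys: Sequence[str], max_size: int = 64, max_multiplier: int = 64
) -> Optional[Tuple[int, int]]:
    """Try small tables first; return (multiplier, size) or None if none found."""
    for size in range(len(keys), max_size + 1):
        for multiplier in range(1, max_multiplier + 1):
            addresses = {h(k, multiplier, size) for k in keys}
            if len(addresses) == len(keys):  # every key got its own address
                return multiplier, size
    return None


result = find_collision_free(KEYS)
print(result)
```

The search returns None when no parameters within the bounds work, in which case the bounds or the hash form would need to change.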
Hashing Considerations • Static vs. Dynamic Files • Dynamic files • Key set not known in advance • Patterns/samples of keys may be known • Collision-free solution not generally possible • Experimentation may be used to find a good hash algorithm and configuration. • Hash Algorithm technique • File size • bucket size • Overflow strategy
Hashing Considerations • Static vs. Dynamic Hashing • Static Hashing • file size fixed over life of file • must rebuild to make larger • Dynamic Hashing • file may expand and contract over time • called extendible hashing
Hashing Considerations • Distribution of keys • May know some information about key distribution in advance • Complete set • patterns are predictable • completely unpredictable
Hashing Considerations • Files versus arrays • Hashing is suitable for retrieval from both primary (memory) and secondary (disk) storage. • Primary memory based systems • I/O time not a consideration • buckets not really helpful • Other factors gain in importance • Hash algorithm complexity • overflow technique
Hashing Considerations • Hash Algorithms - general forms • Division • Division-remainder scheme is an example. • Choice of divisor is important • Should be prime relative to the file size. • Should not be a power of two. • Bad choices result in simple truncation, thus part of the key is simply discarded.
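The truncation effect of a power-of-two divisor is easy to demonstrate. In this sketch the key values and the two divisors (16 and 17) are assumptions picked so that the keys differ only in their high-order bits.

```python
def division_hash(key: int, divisor: int) -> int:
    """Division-remainder hashing: the home address is key MOD divisor."""
    return key % divisor


# Hypothetical keys that all share the same low-order four bits.
keys = [16, 32, 48, 64, 80, 96, 112, 128]

# Divisor 16 (a power of two) keeps only the low four bits of each key.
pow2_addresses = [division_hash(k, 16) for k in keys]

# Divisor 17 (a prime) lets every bit of the key influence the address.
prime_addresses = [division_hash(k, 17) for k in keys]

print(pow2_addresses)   # [0, 0, 0, 0, 0, 0, 0, 0]
print(prime_addresses)  # [16, 15, 14, 13, 12, 11, 10, 9]
```

With divisor 16 the result equals `key & 15`, so the high-order part of the key is discarded and every key here collides at address 0.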
Hashing Considerations • Hash Algorithms - general forms • Multiplication • Multiplicative techniques tend to use ALL of the information in the key (no truncation) • Mid-square technique is an example. • Compression, extraction, folding • Useful for large keys
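Mid-square and folding can each be sketched in a few lines. Both functions below work on decimal digits, which is one common textbook formulation and an assumption here; the function names, chunk width, and sample keys are illustrative.

```python
def mid_square(key: int, digits: int) -> int:
    """Square the key and take `digits` digits from the middle of the result."""
    squared = str(key * key).zfill(digits)  # pad so short squares still work
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


def fold(key: int, chunk: int, size: int) -> int:
    """Split the key's decimal digits into chunks, add them, reduce modulo size."""
    text = str(key)
    total = sum(int(text[i:i + chunk]) for i in range(0, len(text), chunk))
    return total % size


print(mid_square(4537, 3))        # 4537 squared is 20584369, middle digits 584
print(fold(123456789, 3, 1000))   # 123 + 456 + 789 = 1368, and 1368 MOD 1000 = 368
```

In both cases every digit of the key contributes to the address, which is the property the slide attributes to multiplicative and folding techniques.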
Hashing Considerations • Hash Algorithms - general forms • Double Hashing • Rather than progressive overflow on collision, use a secondary hash function to generate a step length for the next probe • Helps reduce secondary clustering of linear probing with step size greater than one. • Non-linear, or random probing
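A minimal in-memory sketch of double hashing follows. The two hash functions (`key MOD size` for the home address, `1 + key MOD (size - 1)` for the step) are a common textbook pairing assumed here, and the table size is assumed to be prime so that every step length eventually visits every slot.

```python
from typing import List, Optional


def probe_sequence(key: int, size: int) -> List[int]:
    """Slots to try, in order; `size` is assumed to be prime."""
    home = key % size
    step = 1 + key % (size - 1)  # secondary hash, never zero
    return [(home + i * step) % size for i in range(size)]


def insert(table: List[Optional[int]], key: int) -> int:
    """Place the key in the first free slot on its probe sequence."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def search(table: List[Optional[int]], key: int) -> Optional[int]:
    """Return the key's slot, or None once an empty slot is reached."""
    for slot in probe_sequence(key, len(table)):
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


table: List[Optional[int]] = [None] * 7
# 10, 17 and 24 all have home address 3 but different step lengths (5, 6, 1).
print([insert(table, k) for k in (10, 17, 24)])  # [3, 2, 4]
```

Because the three colliding keys follow different probe paths, they do not pile up along a single shared overflow chain as they would with a fixed step.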
Hashing Considerations • Hash Algorithms - general forms • Multi-Attribute hashing • Base the calculation for home address on more than the primary key attribute. • Useful if the primary key exhibits certain bad hashing attributes (clustering, etc.) • Example - use part number (PK) and distributor fields. • Extendible Hashing • See text
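Multi-attribute hashing for the part number and distributor example might look like the sketch below. The separator byte, the use of `zlib.crc32` as the underlying hash, and the sample field values are all assumptions made for illustration; any hash that mixes both fields would serve.

```python
import zlib


def home_address(part_number: str, distributor: str, size: int) -> int:
    """Hash the primary key together with a second attribute."""
    # A separator keeps ("AB", "C") distinct from ("A", "BC") before hashing.
    combined = f"{part_number}\x1f{distributor}".encode("utf-8")
    return zlib.crc32(combined) % size


# Hypothetical records: same part number, different distributors.
for distributor in ("Acme", "Globex", "Initech"):
    print(distributor, home_address("PN-1000", distributor, 97))
```

Retrieval then needs both attribute values to recompute the home address, which is the trade-off for the better spread.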