CpSc 3220 File and Database Processing. Hashing.
Exercise – Build a B+-Tree • Construct an order-4 B+-tree for the following set of key values: (2, 3, 5, 7, 11, 17, 9, 6, 29, and 4) • Assume the tree is initially empty and values are added in ascending order. • Now delete keys 2, 5, and 17
Objectives • Survey Hashing Concepts • Investigate Hashing Algorithms • Study Collision Reduction • Analyze Performance • Investigate File Deterioration • Look at Patterns of Access
Schematic View of Hash File
[Figure: each key is passed through the hash function, which maps it to the address of a record slot in the file; occupied slots are flagged 1 and hold the record for that key, empty slots are flagged 0]
Basic Hashing Concepts • A hash file contains a fixed number of record spaces • Each record space is of a fixed size • A hash function determines the address of the record space for a given key • A hash function may give the same address for two or more different keys • Mapping different keys to a single address is called a collision • Different keys that hash to identical addresses are called synonyms • A hash function that produces no collisions is called a perfect hash function
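For example, with the simple division hash h(key) = key mod 11 (an illustrative function, not one from the slides), the keys 22 and 33 both hash to address 0: they are synonyms, and inserting the second of them causes a collision.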
Objectives for a Hash File Package • Keep collisions ‘low’ • Spread out (distribute) records over address space • Use extra memory (increase address space) • Put more than one record per address • Handle collisions efficiently
Outline for a Simple Hashing Algorithm • Put the key in numerical form • Fold and add to reduce the numerical form to 'integer' size • Divide by the size of the address space and use the remainder as the RRN (relative record number) address of the key
Simple Hash Function (when the key is an alphanumeric string)

int Hash(std::string key) {
    int sum = 0;
    if (key.length() % 2 == 1)
        key += ' ';                         // pad so the length is even
    for (size_t j = 0; j < key.length(); j += 2)
        // fold two characters into one value, add it to the running sum,
        // and keep the sum within the address space
        sum = (sum + 256 * key[j] + key[j + 1]) % FILE_SIZE;   // FILE_SIZE = number of addresses
    return sum;                             // RRN in the range 0 .. FILE_SIZE-1
}
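A minimal usage sketch, assuming the Hash function above and a FILE_SIZE of 1000 record addresses; both the constant value and the sample keys are illustrative only:

#include <iostream>
#include <string>

const int FILE_SIZE = 1000;    // assumed number of record slots

int Hash(std::string key);     // the function defined above

int main() {
    // Different keys may still map to the same RRN, since the address
    // space is much smaller than the space of possible keys.
    std::cout << Hash("LOWELL") << "\n";
    std::cout << Hash("LOCK")   << "\n";
    return 0;
}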
Hash Function Distribution • Uniform (perfect) • Random • Worse than random — we will focus on random distributions
Predicting Record Distribution If r records are distributed randomly into N addresses, the probability that a given address will have exactly x records assigned to it is the binomial probability p(x) = ( r! / ((r - x)! x!) ) * (1/N)^x * (1 - 1/N)^(r - x) p(0) – probability that an address is not used p(1) – probability that an address holds exactly one record (no collision at that address) p(2) – probability that an address holds two records (one collision) etc. These terms are difficult to compute directly for large values of r and N.
Poisson’s Function For large values of r and N, p(x) can be approximated by the function p(x) = ( (r/N)^x * e^(-(r/N)) ) / x! The value r/N is the ratio of the number of records to the number of addresses. If only one record can be placed in each address, it is a measure of the fraction of the storage space that will be used (the packing density).
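A small sketch (not from the slides) of how the Poisson function can be evaluated to estimate how addresses fill up; the values of r and N are illustrative only:

#include <cmath>
#include <cstdio>

// Poisson approximation: probability that an address receives exactly x
// of the r records when they are spread randomly over N addresses.
double poisson(double r, double N, int x) {
    double lambda = r / N;                    // packing density r/N
    double p = std::exp(-lambda);
    for (int k = 1; k <= x; ++k)              // build lambda^x / x! incrementally
        p *= lambda / k;
    return p;
}

int main() {
    double r = 10000, N = 10000;              // illustrative: packing density 1.0
    for (int x = 0; x <= 4; ++x)
        std::printf("p(%d) = %.4f\n", x, poisson(r, N, x));
    // p(0) estimates the fraction of unused addresses;
    // p(2), p(3), ... estimate how many addresses overflow a one-record slot.
    return 0;
}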
From Page 484 of File Structures by Folk, Zoellick, and Riccardi
Collision Resolution Using Progressive Overflow (Linear Probing)
• Probe sequence: H_i = (hash(key) + i) mod TableSize, for i = 0, 1, 2, ...
[Figure: a key whose home slot is already occupied is stored in the next free slot found by probing forward through the file]
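A minimal in-memory sketch of progressive overflow, assuming a generic string hash; a real hash file would probe fixed-size record slots on disk rather than a vector, and the Table type and its methods are illustrative, not from the course:

#include <functional>
#include <optional>
#include <string>
#include <vector>

// In-memory sketch of progressive overflow (linear probing).
struct Table {
    std::vector<std::optional<std::string>> slots;
    explicit Table(size_t n) : slots(n) {}

    size_t home(const std::string& key) const {        // any hash function works here
        return std::hash<std::string>{}(key) % slots.size();
    }
    bool insert(const std::string& key) {
        for (size_t i = 0; i < slots.size(); ++i) {     // H_i = (hash(key)+i) mod N
            size_t addr = (home(key) + i) % slots.size();
            if (!slots[addr]) { slots[addr] = key; return true; }
        }
        return false;                                   // file is full
    }
    int search(const std::string& key) const {
        for (size_t i = 0; i < slots.size(); ++i) {
            size_t addr = (home(key) + i) % slots.size();
            if (!slots[addr]) return -1;                // empty slot ends the probe chain
            if (*slots[addr] == key) return static_cast<int>(addr);
        }
        return -1;
    }
};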
Address Spaces Can Hold More Than One Record
• Each address is a bucket holding up to b records
• Packing Density = r / (bN)
• Address Density = r / N
[Figure: a file of buckets, each holding up to two keys, with a count of the records stored at each address]
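For example (illustrative numbers, not from the slides): storing r = 750 records in N = 500 buckets that each hold b = 2 records gives an address density of 750/500 = 1.5 and a packing density of 750/(2 * 500) = 0.75, so the file is only 75 percent full even though there are more records than addresses.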
Implementation Issues • Loading a Hash File • Deletions • Tombstones • Performance Effects
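As an illustration of the deletion issue (the SlotState values and the remove_key helper are my own, not from the slides), a tombstone marks a deleted slot so that probe chains running through it are not broken:

#include <string>
#include <vector>

// Deletion with tombstones: a deleted slot is marked rather than emptied,
// so later searches keep probing past it instead of stopping early.
enum class SlotState { Empty, Occupied, Tombstone };

struct Slot {
    SlotState state = SlotState::Empty;
    std::string key;
};

bool remove_key(std::vector<Slot>& slots, size_t homeAddr, const std::string& key) {
    for (size_t i = 0; i < slots.size(); ++i) {
        size_t addr = (homeAddr + i) % slots.size();
        if (slots[addr].state == SlotState::Empty)
            return false;                               // end of probe chain: not found
        if (slots[addr].state == SlotState::Occupied && slots[addr].key == key) {
            slots[addr].state = SlotState::Tombstone;   // mark, do not break the chain
            return true;
        }
    }
    return false;
}
// Inserts may reuse Tombstone slots; too many tombstones lengthen searches,
// which is why deletions degrade performance until the file is reorganized.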
Other Collision Resolution Techniques
• Quadratic Hashing: H_i = (hash(key) + i^2) mod TableSize
• Double Hashing: H_i = (hash(key) + f(i)) mod TableSize, where f(i) = i * hash2(key)
• Note that hash2(key) must never be zero
• Separate Overflow Area
• Chained Overflow with Separate Overflow Area
• Scatter Tables
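A minimal sketch of a double-hashing probe sequence; hash1, hash2, and TABLE_SIZE are illustrative choices, not the course's definitions:

#include <string>

// Double hashing: the step size between probes comes from a second hash
// function, so synonyms under the primary hash follow different probe chains.
const size_t TABLE_SIZE = 101;                  // illustrative prime table size

size_t hash1(const std::string& key) {
    size_t h = 0;
    for (char c : key) h = (h * 31 + static_cast<unsigned char>(c)) % TABLE_SIZE;
    return h;
}

size_t hash2(const std::string& key) {
    size_t h = 0;
    for (char c : key) h = h * 17 + static_cast<unsigned char>(c);
    return 1 + h % (TABLE_SIZE - 1);            // never zero, as the slide requires
}

// Address probed on the i-th attempt for this key.
size_t probe(const std::string& key, size_t i) {
    return (hash1(key) + i * hash2(key)) % TABLE_SIZE;
}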
Patterns of Record Access • Typically, 20 percent of the records account for 80 percent of the accesses • The most active records should sit at their home addresses, or performance deteriorates
Summary • Hashing provides O(1) direct access performance. • If the hash function produces collisions, the average search length (ASL) increases. • Collisions can be reduced by: • Spreading out records (choosing a better hash function) • Using extra memory • Using buckets • The Poisson distribution lets us analyze hash file performance • Better overflow handling can reduce the ASL • Record deletion requires special handling • Consider record access patterns • Hashing does not provide efficient sequential access • Hashing requires that we fix the file size in advance