Chapter 8 Hashing Part II
Introduction • Suppose we want to perform insertion, search, and deletion on a dictionary (symbol table). Is it possible to perform these operations in O(1) time?
Introduction • If we can find a mapping from a key to an index, then we can locate a record quickly according to its key and access it directly (random access). [Figure: records S1, S2, S3, … stored at indices 0, 1, 2, …]
Introduction • This mapping can be illustrated as follows: [Figure: Key → h → i] • Hashing: define a function h so that h(Key) = i, where h is called a hash function. • Two kinds: static hashing and dynamic hashing.
Definition • In static hashing, identifiers/keys are stored in a table with a fixed size, called a hash table. • Bucket: each bucket has its own address and is capable of holding a key. [Figure: the hash function h maps an identifier x to a bucket address h(x) among Bucket 0, Bucket 1, Bucket 2, …, Bucket n; each bucket is drawn with slots slot1 and slot2.]
Definition • Slot: each bucket may consist of s slots, each of which holds one synonym. • i1 and i2 are synonyms if h(i1) = h(i2). • Distinct synonyms are stored in the same bucket as long as the bucket has slots available.
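Example • Number of buckets: 26 (Bucket 0 to Bucket 25). • Number of slots for each bucket: 2 (slot1 and slot2). • Define the hash function f(x) = i, where i is the alphabetical position of the first letter of x, counting from 0 (A → 0, B → 1, …, Z → 25). • A and A2 are synonyms. • GA and GB are synonyms. • If “Doll” enters, it will be put at bucket _______ (according to the hash function). [Figure: hash table with Bucket 0 to Bucket 25 and two slots per bucket; Bucket 0 holds A and A2, Bucket 3 holds D, and GA and GB share a bucket.]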
Overflow and Collision • Overflow occurs when a new identifier is mapped into a full bucket. • Collision occurs when two non-identical identifiers are hashed into the same bucket. • If the number of slots per bucket is 1, then overflow and collision occur simultaneously. • Example: Bucket 0 has two slots holding A and A2. If A3 enters Bucket 0, A3 collides with A and A2, and the bucket overflows as well.
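The bucket-and-slot layout and the difference between a collision and an overflow can be sketched as follows. This is only a minimal illustration, assuming the first-letter hash function from the example slide and identifiers that start with a letter; the names `first_letter_hash`, `SlotTable`, and the returned status strings are hypothetical and not part of the lecture.

```python
def first_letter_hash(identifier):
    """Alphabetical position of the first letter: A -> 0, ..., Z -> 25."""
    return ord(identifier[0].upper()) - ord("A")


class SlotTable:
    """Static hash table with a fixed number of buckets and slots per bucket."""

    def __init__(self, num_buckets=26, slots_per_bucket=2):
        self.slots_per_bucket = slots_per_bucket
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, identifier):
        bucket = self.buckets[first_letter_hash(identifier)]
        if identifier in bucket:
            return "duplicate"
        if len(bucket) == self.slots_per_bucket:
            return "overflow"  # home bucket is full (this is also a collision)
        collided = len(bucket) > 0  # a different identifier is already here
        bucket.append(identifier)
        return "collision" if collided else "stored"


table = SlotTable()
results = [table.insert(name) for name in ["A", "A2", "A3", "D", "GA", "GB"]]
print(results)
# ['stored', 'collision', 'overflow', 'stored', 'stored', 'collision']
```

With two slots per bucket, A2 collides with A but still fits; A3 both collides and overflows. With `slots_per_bucket=1`, every collision would be reported as an overflow.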
8.2.2 Hash Functions • Ideally, we want a hash function that is one-to-one and easy to compute. • The hash function f(x) that maps x to the alphabetical position of its first letter can result in a lot of collisions because it only considers the initial character. • Key point: use every character in the identifier where possible.
Common Approaches • Division • Mid-square • Folding • Digit Analysis
Division • The most widely used hash function • The key k is divided by some number D, and the remainder is used as the bucket address. h(k) = k % D • Since the bucket address is from 0 to b-1 if there are b buckets, D is usually selected as the number of buckets.
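As a small sketch of the division method, the snippet below computes h(k) = k % D for integer keys. To apply it to identifiers it first combines every character into one integer; that conversion (`identifier_to_int`, treating characters as base-256 digits) is an assumption made for illustration and is not specified in the slides.

```python
def division_hash(key, divisor):
    """Home bucket of a non-negative integer key: the remainder key % divisor."""
    return key % divisor


def identifier_to_int(identifier):
    """Hypothetical helper: combine every character into one integer (base 256)."""
    value = 0
    for code in identifier.encode("ascii"):
        value = value * 256 + code
    return value


D = 17  # divisor, chosen equal to the number of buckets
print([division_hash(k, D) for k in (6, 12, 34, 29)])   # [6, 12, 0, 12]
print(division_hash(identifier_to_int("GA"), D))        # 0
print(division_hash(identifier_to_int("GB"), D))        # 1
```

Because all characters contribute to the integer, GA and GB no longer have to share a bucket as they did under the first-letter function.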
Selecting The Divisor • When the divisor is an even number, odd integers hash into odd home buckets and even integers into even home buckets. • 20%14 = 6, 30%14 = 2, 8%14 = 8 • 15%14 = 1, 3%14 = 3, 23%14 = 9 • When the divisor is an odd number, odd (even) integers may hash into any home bucket. • 20%15 = 5, 30%15 = 0, 8%15 = 8 • 15%15 = 0, 3%15 = 3, 23%15 = 8 • With an odd divisor, a bias in the keys toward odd or even values does not result in a bias toward either the odd or the even home buckets. • This gives a better chance of uniformly distributed home buckets. • So do not use an even divisor.
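The parity effect can be checked directly. The sketch below reuses the six keys from the slide; the helper name `home_buckets` is hypothetical.

```python
def home_buckets(keys, divisor):
    """Home bucket of each key under the division method."""
    return [key % divisor for key in keys]


even_keys = [20, 30, 8]
odd_keys = [15, 3, 23]

print(home_buckets(even_keys, 14), home_buckets(odd_keys, 14))
# [6, 2, 8] [1, 3, 9]   even divisor: parity of the key is preserved
print(home_buckets(even_keys, 15), home_buckets(odd_keys, 15))
# [5, 0, 8] [0, 3, 8]   odd divisor: parity is mixed

# With an even divisor, key and home bucket always have the same parity.
parity_kept = all(k % 2 == (k % 14) % 2 for k in range(1000))
print(parity_kept)  # True
```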
Selecting The Divisor • In practice, a similarly biased distribution of home buckets is seen when the divisor is a multiple of a small prime number such as 3, 5, 7, … • The effect of each prime divisor p of b decreases as p gets larger. • Ideally, choose b so that it is a prime number. • Alternatively, choose b so that it has no prime factor smaller than 20.
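One way to test a candidate divisor against these two rules is trial division. This is a rough sketch for b ≥ 2; the function names and the `limit` parameter are hypothetical, and the threshold of 20 is simply the value quoted on the slide.

```python
def smallest_prime_factor(n):
    """Smallest prime factor of an integer n >= 2, found by trial division."""
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            return factor
        factor += 1
    return n  # no factor found, so n itself is prime


def is_reasonable_divisor(b, limit=20):
    """True if b is prime, or if b has no prime factor smaller than limit."""
    spf = smallest_prime_factor(b)
    return spf == b or spf >= limit


for b in (14, 15, 17, 529, 667):
    print(b, is_reasonable_divisor(b))
# 14 False, 15 False, 17 True (prime), 529 True (23 * 23), 667 True (23 * 29)
```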
Mid-square • Square the key and then use an appropriate number of bits from the middle of the square as the bucket address. • Example: suppose a character is represented in 6 bits and the number of buckets is 2^r. [Figure: the identifier A1 is encoded as A = 01 and 1 = 34 (octal), giving the key 0134 (octal) = 92; 92 × 92 = 8464 = 010000100010000 in binary, and r bits from the middle of this square are used.]
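A bit-level version of mid-square might look like the sketch below. The slide does not say how wide the square is taken to be or what r is, so both are parameters here; the demo assumes the 15-bit pattern shown in the figure and picks r = 3 purely for illustration.

```python
def mid_square_bits(key, r, square_bits):
    """Take r bits from the middle of key*key, viewed as square_bits wide."""
    square = key * key
    dropped_low_bits = (square_bits - r) // 2
    return (square >> dropped_low_bits) & ((1 << r) - 1)


key = 0o134                       # identifier A1 as two 6-bit codes, i.e. 92
print(key, key * key)             # 92 8464
print(format(key * key, "015b"))  # 010000100010000
print(mid_square_bits(key, r=3, square_bits=15))  # 4 (middle bits 100)
```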
Mid-square • Example • Key = 113586 and m = 10000, so 9999 is the largest bucket address. • Squaring the key gives 12901779396; taking four digits from the middle gives h(x) = 1779.
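The decimal example can be reproduced with a few lines. The square here has 11 digits, so four middle digits cannot be perfectly centred; the sketch assumes the extra digit is dropped on the left, which is the choice that matches the slide's result. The function name is hypothetical.

```python
def mid_square_digits(key, num_digits):
    """Take num_digits decimal digits from the middle of key*key."""
    square = str(key * key)
    # Assumption: when the split is uneven, drop one more digit on the left.
    start = (len(square) - num_digits + 1) // 2
    return int(square[start:start + num_digits])


print(113586 * 113586)               # 12901779396
print(mid_square_digits(113586, 4))  # 1779
```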
Folding • The key k is partitioned into several parts, all of the same length except possibly the last. These parts are then added together to obtain the hash address of k. • Two schemes: shift folding and folding at the boundaries. • Example: k = 12320324111220 is partitioned into P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20.
Folding • Shift folding: P1 + P2 + P3 + P4 + P5 = 123 + 203 + 241 + 112 + 20 = 699. • Folding at the boundaries (P2 and P4 are reversed before adding): 123 + 302 + 241 + 211 + 20 = 897.
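Both folding schemes can be sketched on the digit string of the key. The function names are hypothetical, and reversing every second part is an assumption that reproduces the boundary-folding sum in the example; the slides do not say what to do when a short last part falls on a reversed position.

```python
def split_parts(digits, part_len):
    """Cut the digit string into parts of part_len (the last may be shorter)."""
    return [digits[i:i + part_len] for i in range(0, len(digits), part_len)]


def shift_fold(digits, part_len):
    """Shift folding: add the parts as they are."""
    return sum(int(part) for part in split_parts(digits, part_len))


def boundary_fold(digits, part_len):
    """Folding at the boundaries: reverse every second part, then add."""
    total = 0
    for index, part in enumerate(split_parts(digits, part_len)):
        if index % 2 == 1:
            part = part[::-1]
        total += int(part)
    return total


key = "12320324111220"
print(split_parts(key, 3))    # ['123', '203', '241', '112', '20']
print(shift_fold(key, 3))     # 699
print(boundary_fold(key, 3))  # 897
```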
Overflow Handling • An overflow occurs when the home bucket for a new pair (key, element) is full. • We may handle overflows by either of the following: • Searching the hash table in some systematic fashion for a bucket that is not full, using linear probing (linear open addressing), quadratic probing, or rehashing. • Eliminating overflows by permitting each bucket to keep a list of all pairs for which it is the home bucket, stored as an array linear list or a chain.
Linear Probing • Also called linear open addressing. • Search the buckets one by one until an empty slot is found. • Procedure: suppose b denotes the number of buckets. • Compute h(k). • Examine the hash table buckets in the order ht[h(k)], ht[(h(k)+1)%b], …, ht[(h(k)+j)%b] until one of the following happens: • ht[(h(k)+j)%b] has a pair whose key is k; k is found. • ht[(h(k)+j)%b] is empty; k is not in the table. • The search returns to ht[h(k)]; the table is full.
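Linear Probing • divisor = b (number of buckets) = 17. • Bucket address = key % 17. • Insert pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45. • Resulting table: [0] 34, [1] 0, [2] 45, [6] 6, [7] 23, [8] 7, [11] 28, [12] 12, [13] 29, [14] 11, [15] 30, [16] 33; buckets 3 to 5, 9, and 10 are empty.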
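The linear probing procedure can be sketched with one key per bucket, as in this example. This is a minimal illustration, not the lecture's code: the names `lp_insert`, `lp_search`, and `EMPTY` are hypothetical, and only keys (not key/element pairs) are stored.

```python
EMPTY = None


def lp_insert(table, key):
    """Insert key with linear probing; return the bucket index used."""
    b = len(table)
    home = key % b
    for j in range(b):
        index = (home + j) % b
        if table[index] is EMPTY or table[index] == key:
            table[index] = key
            return index
    raise OverflowError("hash table is full")


def lp_search(table, key):
    """Return (index or None, number of buckets examined)."""
    b = len(table)
    home = key % b
    for j in range(b):
        index = (home + j) % b
        if table[index] is EMPTY:
            return None, j + 1  # empty bucket reached: key is not in the table
        if table[index] == key:
            return index, j + 1
    return None, b  # wrapped around to the home bucket: table is full


table = [EMPTY] * 17
for key in (6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45):
    lp_insert(table, key)

print(table)
# [34, 0, 45, None, None, None, 6, 23, 7, None, None, 28, 12, 29, 11, 30, 33]
print(lp_search(table, 45))  # (2, 9): home bucket 11, found after wrapping around
```

Key 45 ends up nine probes away from its home bucket, which shows how quickly a cluster lengthens searches.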
Linear Probing • Consider: when 51 enters the table built from the keys above, how many comparisons are required? • Linear open addressing tends to create clusters. • These clusters become larger as more synonyms enter.
Quadratic Probing • Suppose i is used as the increment. • When overflow occurs, the search is carried out by examining h(x), (h(x)+i^2)%b, and (h(x)-i^2)%b for 1 ≤ i ≤ (b-1)/2, where b is a prime number of the form 4j+3. • For example, b = 3, 7, 11, …, 43, 59, …
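The probe order described above can be generated as in the sketch below; the function name is hypothetical. When b is a prime of the form 4j+3, the sequence visits every bucket exactly once, which the last line checks for a few such values of b.

```python
def quadratic_probe_sequence(home, b):
    """Buckets examined: home, then home + i^2 and home - i^2 (mod b)."""
    sequence = [home % b]
    for i in range(1, (b - 1) // 2 + 1):
        sequence.append((home + i * i) % b)
        sequence.append((home - i * i) % b)  # Python's % is non-negative for b > 0
    return sequence


print(quadratic_probe_sequence(3, 7))  # [3, 4, 2, 0, 6, 5, 1]

covers_all = all(
    sorted(quadratic_probe_sequence(0, b)) == list(range(b))
    for b in (3, 7, 11, 19, 43, 59)
)
print(covers_all)  # True
```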
Rehashing • Use a series of hash functions h_1, h_2, …, h_m to find an empty bucket. • If overflow occurs at h_i(x), then try h_{i+1}(x). [Figure: x is passed through h_1, h_2, …, h_m in turn until an empty bucket, such as h_m(x), is found.]
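Rehashing can be sketched as trying each hash function in turn. The slides do not name specific functions, so the three used here (`h1`, `h2`, `h3`) are invented for illustration only, as is the helper `rehash_insert`.

```python
def rehash_insert(table, key, hash_functions):
    """Try h_1, h_2, ..., h_m in order; store key in the first empty bucket."""
    for h in hash_functions:
        index = h(key) % len(table)
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("every hash function led to an occupied bucket")


# Hypothetical series of hash functions for a table with 17 buckets.
def h1(k):
    return k % 17


def h2(k):
    return k // 17


def h3(k):
    return 7 * k + 3


table = [None] * 17
used = [rehash_insert(table, key, (h1, h2, h3)) for key in (6, 23, 1, 18)]
print(used)  # [6, 1, 0, 10]
```

Key 23 overflows at h1 and is placed by h2; key 18 overflows at both h1 and h2 and is placed by h3.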
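Chaining • Disadvantage of linear probing: identifiers with different hash values are compared with each other. • Use a linked list to connect the identifiers with the same hash value and to increase the capacity of a bucket. • Example (b = 17, same keys as in the linear probing example): [0] 0 → 34, [6] 6 → 23, [7] 7, [11] 11 → 28 → 45, [12] 12 → 29, [13] 30, [16] 33; all other chains are empty.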
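A chained table can be sketched with one list per bucket. Python lists stand in for linked lists here, and keeping each chain in ascending order is an assumption made to match the figure; the names `build_chains` and `chain_search` are hypothetical.

```python
import bisect


def build_chains(keys, b):
    """One chain per bucket; each chain holds the keys with that hash value."""
    chains = [[] for _ in range(b)]
    for key in keys:
        bisect.insort(chains[key % b], key)  # keep the chain sorted
    return chains


def chain_search(chains, key):
    """Return (found, key comparisons), looking only in the key's own chain."""
    chain = chains[key % len(chains)]
    for comparisons, stored in enumerate(chain, start=1):
        if stored == key:
            return True, comparisons
    return False, len(chain)


chains = build_chains((6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45), 17)
print({i: c for i, c in enumerate(chains) if c})
# {0: [0, 34], 6: [6, 23], 7: [7], 11: [11, 28, 45], 12: [12, 29], 13: [30], 16: [33]}
print(chain_search(chains, 45))  # (True, 3)
```

Searching for 45 now costs three comparisons, all against keys with the same hash value 11.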