Chapter 7 Skip Lists and Hashing Part 2: Hashing
Sorted Linear Lists • For formula-based implementation • Insert: O(n) comps & data moves • Delete: O(n) comps & data moves • Search: O(log(n)) comps • For chained implementation: • Insert: O(n) comps • Delete: O(n) comps • Search: O(n) comps
Dictionary • A dictionary is a collection of elements; each element has a field called its key. • The key is unique for each element • Operations: • Insert an element with a specified key value • Search the dictionary for an element with a specified key value • Delete an element with a specified key value • The access mode for elements in a dictionary is random access (or direct access): any element may be retrieved by performing a search on its key.
Ideal hashing • Hash table: table used to store elements • Hash function: function to map keys to positions: k => f(k) • Search for an element with key k: if position f(k) is not empty, found; otherwise, failed • Insert: position f(k) must be empty • Delete: position f(k) cannot be empty
Example: Student record dictionary • Use the student ID (a 6-digit number) as the key • ID range: 951000 to 952000 • f(k) = k - 951000 • Table size: 1001, i.e. ht[0..1000] • ht[i].key = 0 indicates an empty entry
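The student-record example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BASE`, `SIZE`, `make_table`, `insert`, `search` and `delete` are made up here, and each entry is modelled as a `(key, record)` pair with key 0 marking an empty entry, as on the slide.

```python
BASE = 951000   # smallest student ID in the example
SIZE = 1001     # table is ht[0..1000]
EMPTY = 0       # key 0 marks an empty entry


def f(k):
    """Ideal hash function: every key gets its own position."""
    if not BASE <= k < BASE + SIZE:
        raise ValueError(f"key {k} is outside the ID range")
    return k - BASE


def make_table():
    # Initialization touches every entry, hence Theta(b).
    return [(EMPTY, None)] * SIZE


def insert(ht, key, record):
    pos = f(key)
    if ht[pos][0] != EMPTY:          # position f(k) must be empty
        raise KeyError(f"duplicate key {key}")
    ht[pos] = (key, record)


def search(ht, key):
    stored_key, record = ht[f(key)]
    return record if stored_key != EMPTY else None


def delete(ht, key):
    pos = f(key)
    if ht[pos][0] == EMPTY:          # position f(k) cannot be empty
        raise KeyError(f"no element with key {key}")
    record = ht[pos][1]
    ht[pos] = (EMPTY, None)
    return record
```

Each operation computes one index and touches one entry, which is why search, insert and delete run in constant time.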
Evaluation: Ideal Hashing • Initialize an empty dictionary: Θ(b), where b is the size of the table • Search, insert, and delete: Θ(1) • Property: 1 key <=> 1 position • Problem: the range of the keys may be very large, resulting in a very large hash table; e.g., if the key is a 9-digit integer (such as an SSN), the size of the table will be 10^9
Hashing with linear open addressing • Used when the size of the hash table (D) is smaller than the key range • f(k) = k % D • Positions in the hash table are indexed 0..D-1 • Bucket: a position in a hash table • If key values are not of an integral type, they must be converted to integers first • Collision: two keys k1 and k2 map into the same bucket, i.e., f(k1) = f(k2) • Home bucket: the position numbered f(k) is the home bucket for k • In general, a bucket may contain space for more than one element • An overflow occurs if there is no room in the home bucket for the new element • If a bucket has space for only one element, collision and overflow are the same
Collision, overflow and linear open addressing • 80, 58, and 35 map into home bucket ht[3] • In case of collision, insert in the next available bucket in sequence
Search • To search for an element with key k, begin at bucket f(k) and continue through successive buckets, regarding the table as circular, until: • a bucket containing an element with key k is found (successful) • an empty bucket is reached (unsuccessful) • the search returns to the home bucket (unsuccessful)
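The search rule, and the matching insert, might look as follows in Python. This is a minimal sketch, not the course's implementation: the class name `LinearProbingTable` is invented, each bucket holds a bare integer key, and `None` stands for an empty bucket.

```python
class LinearProbingTable:
    """Hash table with linear open addressing; one key per bucket."""

    def __init__(self, D):
        self.D = D                 # number of buckets and divisor in f(k) = k % D
        self.ht = [None] * D       # None marks an empty bucket

    def _probe(self, k):
        """Return the bucket where a search for k stops: the bucket
        holding k, the first empty bucket, or the home bucket again
        when the table is full and k is absent."""
        home = k % self.D
        j = home
        while self.ht[j] is not None and self.ht[j] != k:
            j = (j + 1) % self.D   # treat the table as circular
            if j == home:
                break
        return j

    def search(self, k):
        return self.ht[self._probe(k)] == k

    def insert(self, k):
        j = self._probe(k)
        if self.ht[j] == k:
            raise KeyError(f"duplicate key {k}")
        if self.ht[j] is not None:
            raise OverflowError("table is full")
        self.ht[j] = k
        return j                   # bucket actually used
```

As a usage example under the assumption D = 11: inserting 80, 58 and 47 (47 is a made-up key) places them in buckets 3, 4 and 5, because all three have home bucket 3. A search for 69, whose home bucket is also 3, examines buckets 3 to 6 and fails at the empty bucket 6.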
Deletion • After a deletion, successive elements must be moved until: • an empty bucket is reached, or • the scan returns to the bucket from which the deletion took place • To improve performance, use a NeverUsed field. The table may need reorganization when many buckets have their NeverUsed field set to false
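Deletion by moving elements is the subtle part, so here is one possible sketch. It is an assumption-laden variant rather than the slides' own code: the table is a plain Python list of integer keys with `None` for empty buckets, the helper names are invented, and the NeverUsed optimization is not modelled. The scan stops at the first empty bucket; because the deletion itself creates an empty bucket, the scan always terminates, at the latest when it comes back around to the current hole.

```python
def insert(ht, k):
    """Linear-probing insert, used here only to build example tables."""
    D = len(ht)
    j = k % D
    while ht[j] is not None:       # assumes the table is not full
        j = (j + 1) % D
    ht[j] = k


def _in_cyclic_range(h, i, j):
    """True if h lies in the circular interval (i, j]."""
    if i < j:
        return i < h <= j
    return h > i or h <= j


def delete(ht, k):
    """Remove key k, moving later elements so searches still work."""
    D = len(ht)
    home = k % D
    i = home
    while ht[i] != k:              # locate k, probing as in search
        if ht[i] is None:
            return False
        i = (i + 1) % D
        if i == home:
            return False
    ht[i] = None                   # bucket i is now the hole
    j = i
    while True:
        j = (j + 1) % D
        if ht[j] is None:          # empty bucket (possibly the hole itself)
            return True
        h = ht[j] % D              # home bucket of the element at j
        # The element may stay only if its home lies strictly after the
        # hole; otherwise its probe path crosses the hole, so move it.
        if not _in_cyclic_range(h, i, j):
            ht[i], ht[j] = ht[j], None
            i = j
```

With an 11-bucket table holding 24, 80, 58, 47, 35 in buckets 2 to 6 (47 is a made-up key), deleting 80 shifts 58, 47 and 35 one bucket toward their home buckets and leaves bucket 6 empty.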
Performance analysis • b - the number of buckets in the hash table, b = D • Initialization - Θ(b) • Worst-case insert and search - Θ(n), where n is the number of elements in the table • The worst case happens when all n keys have the same home bucket
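Performance analysis (continued) Average performance • Let α = n/b denote the loading factor • Let Un and Sn be the average number of buckets examined during an unsuccessful and a successful search, respectively; then $U_n \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$ and $S_n \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$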
Performance analysis (continued) • The performance of hashing with linear open addressing is superior: • when α = 0.5 (table is half full), Un = 2.5 and Sn = 1.5 • when α = 0.9 (table is 90% full), Un = 50.5 and Sn = 5.5
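The figures on this slide follow from the standard approximations for linear probing, Un ≈ ½(1 + 1/(1-α)²) and Sn ≈ ½(1 + 1/(1-α)). A small Python sketch (the function names are illustrative) that evaluates them for any loading factor below 1:

```python
def unsuccessful_probes(alpha):
    """Approximate Un for linear open addressing, 0 <= alpha < 1."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


def successful_probes(alpha):
    """Approximate Sn for linear open addressing, 0 <= alpha < 1."""
    return 0.5 * (1 + 1 / (1 - alpha))


for alpha in (0.5, 0.9):
    print(alpha,
          round(unsuccessful_probes(alpha), 2),
          round(successful_probes(alpha), 2))
```

The steep growth of Un as α approaches 1 is the reason the table is normally kept well short of full.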
Determining D • D should be either a prime number or have no prime factors less than 20 • Two methods. First method: • Begin with the largest possible value for b • Then find the largest D (<= b) that is either a prime or has no factors smaller than 20 • e.g., when b = 530, D = 23*23 = 529
Determining D Second method: • Determine the acceptable Un and Sn • Estimate n • Determine α from them • Determine the smallest b for this α • Determine the smallest integer D >= b that is either prime or has no factor smaller than 20
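Both methods end with a search for a suitable divisor, which is easy to automate. The following Python sketch is only one way to do it; the function names are invented, and "suitable" is taken to mean exactly what the slides state: prime, or no prime factor below 20.

```python
SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19)   # all primes below 20


def is_suitable(d):
    """True if d is prime or has no prime factor smaller than 20."""
    if d < 2:
        return False
    if d in SMALL_PRIMES:
        return True
    return all(d % p != 0 for p in SMALL_PRIMES)


def largest_suitable_at_most(b):
    """First method: the largest suitable D with D <= b."""
    d = b
    while d >= 2 and not is_suitable(d):
        d -= 1
    if d < 2:
        raise ValueError("no suitable divisor")
    return d


def smallest_suitable_at_least(b):
    """Second method: the smallest suitable D with D >= b."""
    d = max(b, 2)
    while not is_suitable(d):
        d += 1
    return d
```

For example, `largest_suitable_at_most(530)` returns 529, matching the first method's example.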
Determining D: example • n = 1000 • Require Sn <= 4 and Un <= 50.5 • Sn = 4 ==> α = 6/7 • Un = 50.5 ==> α = 0.9 • α = min(6/7, 0.9) = 6/7 • b = n/α = 7000/6, rounded up to 1167 • note: 1171 is prime, whereas 23*51 = 1173 = 3*17*23 has factors smaller than 20 • ==> select D = b = 1171
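The load-factor step of this example can be checked by inverting the standard linear-probing approximations Sn ≈ ½(1 + 1/(1-α)) and Un ≈ ½(1 + 1/(1-α)²). The sketch below assumes those two formulas; the function names are made up for illustration.

```python
import math


def max_alpha_for_successful(s_max):
    """Largest alpha with Sn <= s_max: 1/(1-alpha) <= 2*s_max - 1."""
    return 1 - 1 / (2 * s_max - 1)


def max_alpha_for_unsuccessful(u_max):
    """Largest alpha with Un <= u_max: 1/(1-alpha)^2 <= 2*u_max - 1."""
    return 1 - 1 / math.sqrt(2 * u_max - 1)


def min_buckets(n, s_max, u_max):
    """Smallest b such that alpha = n/b meets both targets."""
    alpha = min(max_alpha_for_successful(s_max),
                max_alpha_for_unsuccessful(u_max))
    return math.ceil(n / alpha)


print(max_alpha_for_successful(4))      # about 0.857, i.e. 6/7
print(max_alpha_for_unsuccessful(50.5))
print(min_buckets(1000, 4, 50.5))
```

The tighter of the two limits (the smaller α) decides the table size, since both targets must hold at once.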
Comparison with Linear Open Addressing • Space complexity • Let s be the space, in bytes, required by an element • Let b and n denote the number of buckets and the number of elements, respectively • Linear open addressing: b(s+2) bytes (2 bytes for each element of the array empty) • Chaining: 2b+2n+ns bytes • Since 2b+2n+ns < b(s+2) exactly when n(s+2) < bs, chaining takes less space when n < bs/(s+2)
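A quick numeric check of the space comparison, using the two byte counts from the slide. The sample values s = 10 and b = 1200 are arbitrary choices for illustration, and the function names are invented.

```python
def open_addressing_bytes(b, s):
    """b buckets, each holding an element (s bytes) plus 2 bytes."""
    return b * (s + 2)


def chaining_bytes(b, n, s):
    """2 bytes per chain head, plus a 2-byte link and s bytes per node."""
    return 2 * b + 2 * n + n * s


s, b = 10, 1200
threshold = b * s / (s + 2)            # chaining wins when n < threshold
for n in (999, 1000, 1001):
    print(n, chaining_bytes(b, n, s), open_addressing_bytes(b, s))
```

With these values the threshold is n = 1000: below it chaining uses fewer bytes, at it the two are equal, and above it linear open addressing is smaller.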
Search time complexity • Worst-case time complexity = n; it occurs when all elements map to the same bucket (equal to that of linear open addressing) • Average • The average length of a chain is α = n/b • Average number of nodes examined in an unsuccessful search: if a chain has i nodes, the search may take 1, 2, 3, …, i examinations. Assuming equal probability, the average search time = $\frac{1}{i}\sum_{j=1}^{i} j = \frac{i+1}{2}$
Search time complexity (continued) • If α = 0, Un = 0 • If α < 1, Un <= α • If α >= 1, $U_n \approx \frac{\alpha+1}{2}$
Average time complexity for successful search • Need to know the expected distance of each of the n elements from the head of its chain • Without loss of generality, we assume elements are inserted into the chains in increasing order • When the ith element is inserted, the expected length of its chain is (i-1)/b, and the ith element is added at the end of the chain • A search for this element will require examination of 1+(i-1)/b nodes • Assuming the n elements are searched for with equal probability, then $S_n = \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{b}\right) = 1 + \frac{n-1}{2b} \approx 1 + \frac{\alpha}{2}$
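The averaging argument above can be verified numerically. The sketch below sums 1 + (i-1)/b over the n elements with exact fractions and compares the result with the closed form 1 + (n-1)/(2b), which is approximately 1 + α/2. The function names and the sample sizes n = 900, b = 1000 are assumptions chosen to give α = 0.9.

```python
from fractions import Fraction


def successful_search_exact(n, b):
    """Average of 1 + (i-1)/b over the n inserted elements."""
    total = sum(1 + Fraction(i - 1, b) for i in range(1, n + 1))
    return total / n


def successful_search_closed(n, b):
    """Closed form of the same average: 1 + (n-1)/(2b)."""
    return 1 + Fraction(n - 1, 2 * b)


n, b = 900, 1000                      # alpha = n/b = 0.9
exact = successful_search_exact(n, b)
print(float(exact), float(successful_search_closed(n, b)), 1 + (n / b) / 2)
```

Using `Fraction` keeps the comparison exact, so any difference between the sum and the closed form would be a real discrepancy rather than rounding noise.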
Comparison with linear open addressing • The expected performance of chaining is superior, e.g., • when α=0.9 • Chaining: Un=0.9, Sn=1.45 • Linear open addressing: Un=50.5, Sn=5.5
Figure: a sorted chain with head and tail nodes (elements 20, 24, 30, 40, 60, 75, 80); a pointer to the middle node is added
Figure: the same chain with pointers to every second node
An application • Text compression • Compressor: encodes a file • Run-length coding: 1000 x's followed by 2000 y's => 1000x2000y • Space needed: 3002 bytes (2 bytes for string ends) => 12 bytes • Decompressor: decodes the file • LZW compression (Lempel, Ziv, and Welch)
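The run-length coding example can be reproduced with a short Python sketch. It is a simplified illustration, not the scheme the course necessarily uses: it assumes the text contains no digit characters (so counts and symbols cannot be confused) and that the decoder receives well-formed input. The function names are invented.

```python
from itertools import groupby


def rle_encode(text):
    """Replace each run of a repeated character by <count><char>."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))


def rle_decode(coded):
    """Inverse of rle_encode for digit-free source text."""
    out = []
    count = ""
    for ch in coded:
        if ch.isdigit():
            count += ch               # still reading the run length
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


original = "x" * 1000 + "y" * 2000
coded = rle_encode(original)
print(coded, len(original), len(coded))
```

Run-length coding only pays off when the input has long runs; text with few repeats grows under this scheme, which is one motivation for dictionary methods such as LZW.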