Chapter 5: Hashing
• Hash Table ADT
• Hash Functions
• Collision Resolution
• Rehashing
CS 340 Page 89
Hashing
Hashing is a technique for performing searches, insertions, and deletions in constant average time. A particular component of each data element being stored is used as a key, which a hash function maps to a particular cell in a hash table. Problems arise when collisions occur, i.e., when multiple data elements are mapped to the same cell.
The term “hash” was coined to illustrate the analogy between hashing and the culinary practice of chopping and mixing ingredients to make a hash. Essentially, the input domain is “chopped” into several subdomains, which are then “mixed” into the output range to improve the uniformity of their distribution.
The Hash Table Abstract Data Type
A hash table is a list of keys, mapped to particular cells via a hash function.
• The table is implemented as a fixed-size array.
• The table size and hash function are strategically chosen to avoid collisions.
Example: A hash table to hold the CS Department faculty and staff
• Hash function: length of last name
  Table size: 11; collision-free keys: 9; average comparisons per name: 1.50
• Hash function: ((Sum of office room # digits) * (# of vowels in last name) + (Last 2 digits of office phone #)) % 25
  Table size: 25; collision-free keys: 24; average comparisons per name: 1.08
• Hash function: (Office room #) % 15
  Table size: 15; collision-free keys: 12; average comparisons per name: 1.42
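As a rough sketch of the first scheme above (hash = length of last name, table size 11), the following counts how many keys end up alone in their cell. The surnames are made-up placeholders rather than the actual department roster, and `name_length_hash` and `count_collision_free` are hypothetical helper names.

```python
from collections import Counter

TABLE_SIZE = 11  # table size from the first scheme on the slide


def name_length_hash(last_name: str) -> int:
    """Map a surname to a cell index using its length."""
    return len(last_name) % TABLE_SIZE


def count_collision_free(keys) -> int:
    """Count the keys whose cell is not shared with any other key."""
    cells = Counter(name_length_hash(k) for k in keys)
    return sum(1 for k in keys if cells[name_length_hash(k)] == 1)


# Placeholder surnames, assumed purely for illustration.
names = ["Ng", "Lee", "Park", "Smith", "Garcia", "Johnson", "Kim"]
print(count_collision_free(names))  # 5: "Lee" and "Kim" share cell 3
```

A length-based hash is cheap to compute, but names of equal length always collide, which is why the more elaborate second scheme on the slide does better.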
Choosing The Hash Table Size
Define the load factor, λ, of a hash table to be the ratio of the number of elements in the hash table to the table size.
• If λ > 1, then collisions are inevitable, so it is wise to choose a table size greater than the number of anticipated elements.
• If λ << 1, then there will be a very large number of empty slots, lessening the probability of a collision, but wasting a lot of memory.
Example: A hash table to hold the 2011 holidays
• Table size: 12; hash function: Month #; load factor: 1.5
• Table size: 43; hash function: (Month #) + (Day #); load factor: 0.419
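The two load factors in the holiday example can be reproduced with a one-line ratio. This is only a sketch: `load_factor` is a hypothetical helper, and the element count of 18 is inferred from the slide's own figures (1.5 × 12 = 18).

```python
def load_factor(num_elements: int, table_size: int) -> float:
    """Return the ratio of stored elements to table size."""
    if table_size <= 0:
        raise ValueError("table_size must be positive")
    return num_elements / table_size


NUM_HOLIDAYS = 18  # inferred from the slide: load factor 1.5 at table size 12

print(load_factor(NUM_HOLIDAYS, 12))            # 1.5
print(round(load_factor(NUM_HOLIDAYS, 43), 3))  # 0.419
```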
Choosing The Hash Function
Given a particular hash table size and a particular type of data, the hash function should be chosen to minimize the number of collisions. This usually requires an in-depth analysis of the keys expected to go in the table.
Example: A hash table to hold CS undergraduate course enrollment statistics
There are 24 active courses: 108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340, 390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, and 499.
• Summing the digits yields: 9, 5, 10, 6, 6, 14, 6, 8, 6, 10, 6, 7, 12, 9, 11, 11, 15, 15, 13, 15, 14, 13, 18, and 22. (Table size: 18; 6 non-collisions.)
• Summing (one’s digit) + 2*(ten’s digit) + 4*(100’s digit) yields: 12, 12, 17, 14, 16, 27, 16, 18, 17, 21, 18, 20, 30, 23, 25, 26, 30, 31, 30, 32, 34, 34, 39, and 43. (Table size: 32; 11 non-collisions.)
• Summing (one’s digit) + 3*(ten’s digit) + 9*(100’s digit) yields: 17, 21, 26, 24, 30, 44, 32, 34, 34, 38, 36, 39, 54, 45, 47, 49, 53, 55, 55, 57, 62, 63, 68, and 72. (Table size: 56; 20 non-collisions.)
• Summing (100’s digit) + 3*(ten’s digit) + 9*(one’s digit) yields: 73, 13, 58, 16, 14, 68, 24, 42, 18, 54, 12, 15, 30, 37, 55, 49, 85, 79, 55, 73, 46, 31, 76, and 112. (Table size: 100; 20 non-collisions.)
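The comparison of candidate hash functions above can be automated. The sketch below assumes each candidate is a weighted sum of the three digits of the course number; `weighted_digit_hash` and `non_collisions` are hypothetical names, and the weights are passed in hundreds, tens, ones order.

```python
from collections import Counter

COURSES = [108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340,
           390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, 499]


def weighted_digit_hash(course: int, w100: int, w10: int, w1: int) -> int:
    """Weighted sum of the hundreds, tens, and ones digits of a course number."""
    hundreds, rest = divmod(course, 100)
    tens, ones = divmod(rest, 10)
    return w100 * hundreds + w10 * tens + w1 * ones


def non_collisions(keys, w100: int, w10: int, w1: int) -> int:
    """Count the keys whose hash value is shared with no other key."""
    values = [weighted_digit_hash(k, w100, w10, w1) for k in keys]
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1)


# (one's digit) + 3*(ten's digit) + 9*(100's digit)
print(non_collisions(COURSES, 9, 3, 1))  # 20
```

Trying other weight triples against the real key set is a quick way to see how strongly the collision count depends on which digit carries the most weight.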
Collision Resolution
What should be done when a collision does occur? There are two main strategies: separate chaining and probing.
Separate Chaining
With separate chaining, the hash table is an array of linked lists, with each linked list containing all of the elements that map to the same cell.
[Figure: table entries: New Year’s Day, Martin Luther King, Jr. Day, Lincoln’s Birthday, Valentine’s Day, Washington’s Birthday, St. Patrick’s Day, Easter Sunday, Mother’s Day, Memorial Day, Flag Day, Father’s Day, Independence Day, Labor Day, Columbus Day, Halloween, Veterans Day, Thanksgiving Day, Christmas Day]
Disadvantages (λ is the load factor, n the number of elements):
• Average successful search: 1 + (λ/2) comparisons
• Average unsuccessful search: λ comparisons
• Worst case search: n comparisons (for a bad hash function)
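A minimal sketch of separate chaining follows. It makes two assumptions that are not from the slide: Python lists stand in for the linked lists, and the hash function is simply key mod table size. `ChainedHashTable` is a hypothetical class name, and the sample keys are the years used in the probing examples.

```python
class ChainedHashTable:
    """Hash table in which each cell holds a chain of the keys that map to it."""

    def __init__(self, size: int = 10):
        # One empty chain per cell; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(size)]

    def _index(self, key: int) -> int:
        return key % len(self.buckets)

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._index(key)]
        if key not in bucket:  # ignore duplicates
            bucket.append(key)

    def contains(self, key: int) -> bool:
        # Only the chain for the key's own cell is scanned.
        return key in self.buckets[self._index(key)]

    def remove(self, key: int) -> bool:
        bucket = self.buckets[self._index(key)]
        if key in bucket:
            bucket.remove(key)
            return True
        return False


table = ChainedHashTable(10)
for year in (1492, 1776, 1812, 1945, 1968, 1992):
    table.insert(year)
print(table.buckets[2])  # [1492, 1812, 1992]
```

Colliding keys simply share a chain, so an insertion never fails, but a search costs time proportional to the length of the chain it lands in.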
Collision Resolution (Continued)
Probing
With probing, the hash table is an array of values, with a whole series of cells probed until no collision occurs (i.e., cells h_0(x), h_1(x), h_2(x), … are tried, where h_i(x) = (Hash(x) + f(i)) mod tablesize, with f(0) = 0).
Linear Probing: f(i) is a linear function
Example: f(i) = 3i and Hash(x) = x
• insert 1492 (slot 2)
• insert 1776 (slot 6)
• insert 1812 (slot 2 → slot 5)
• insert 1945 (slot 5 → slot 8)
• insert 1968 (slot 8 → slot 1)
• insert 1992 (slot 2 → slot 5 → slot 8 → slot 1 → slot 4)
Problems With Linear Probing:
• Coefficient and table size must be relatively prime or free cells may not be found.
• Bad tendency to experience primary clustering, resulting in many collisions.
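The linear probing trace can be reproduced with a short sketch. The table size of 10 is an assumption inferred from the slot numbers in the example (1492 lands in slot 2, 1776 in slot 6), and `linear_probe_insert` is a hypothetical helper name.

```python
TABLE_SIZE = 10  # assumed: inferred from the slot numbers in the example


def linear_probe_insert(table, key, step=3):
    """Insert key by trying (key + step*i) mod len(table) for i = 0, 1, 2, ...

    Returns the list of slots probed; the last one is where the key was stored.
    """
    probed = []
    for i in range(len(table)):
        slot = (key + step * i) % len(table)
        probed.append(slot)
        if table[slot] is None:
            table[slot] = key
            return probed
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * TABLE_SIZE
for year in (1492, 1776, 1812, 1945, 1968, 1992):
    print(year, linear_probe_insert(table, year))
# The last line printed is: 1992 [2, 5, 8, 1, 4]
```

Because 3 and 10 are relatively prime, the probe sequence visits every slot before repeating, so a free cell is always found if one exists.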
Collision Resolution (Continued)
Quadratic Probing: f(i) is a quadratic function
Example: f(i) = 2i^2 and Hash(x) = x
• insert 1492 (slot 2)
• insert 1776 (slot 6)
• insert 1812 (slot 2 → slot 4)
• insert 1945 (slot 5)
• insert 1968 (slot 8)
• insert 1992 (slot 2 → slot 4 → slot 0)
Problems With Quadratic Probing:
• Coefficient and table size must be carefully chosen or free cells may be ignored.
• Bad tendency to experience secondary clustering, since keys with the same original hashed value will follow the same sequence of cells through the table.
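A comparable sketch for the quadratic example, again assuming a table size of 10 (inferred from the slot numbers) and using a hypothetical helper name, `quadratic_probe_insert`:

```python
TABLE_SIZE = 10  # assumed: inferred from the slot numbers in the example


def quadratic_probe_insert(table, key, coeff=2):
    """Insert key by trying (key + coeff*i*i) mod len(table) for i = 0, 1, 2, ...

    Returns the list of slots probed; the last one is where the key was stored.
    """
    probed = []
    for i in range(len(table)):
        slot = (key + coeff * i * i) % len(table)
        probed.append(slot)
        if table[slot] is None:
            table[slot] = key
            return probed
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * TABLE_SIZE
for year in (1492, 1776, 1812, 1945, 1968, 1992):
    print(year, quadratic_probe_insert(table, year))
# The last line printed is: 1992 [2, 4, 0]
```

With these parameters, 2i^2 mod 10 only ever takes the values 0, 2, and 8, so a key whose home slot is 2 can reach only slots 2, 4, and 0. Once those are taken, a further such key (say, a hypothetical 2002) cannot be placed even though other cells are free, which illustrates the first problem listed above.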
Collision Resolution (Continued)
Double Hashing: f(i) is a second hash function, multiplied by an iterative value
Example: f(i) = i * Hash2(x), where Hash2(x) = 7 - (x mod 7), and Hash(x) = x
• insert 1492 (slot 2)
• insert 1776 (slot 6)
• insert 1812 (slot 2 → slot 3)
• insert 1945 (slot 5)
• insert 1968 (slot 8)
• insert 1992 (slot 2 → slot 5 → slot 8 → slot 1)
Problems With Double Hashing:
• A strategic choice must be made for both hashing functions.
• Calculation will be much more expensive in the event of a collision.
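The double hashing example can be sketched the same way. As before, the table size of 10 is inferred from the slot numbers rather than stated on the slide, and `hash2` and `double_hash_insert` are hypothetical names.

```python
TABLE_SIZE = 10  # assumed: inferred from the slot numbers in the example


def hash2(key: int) -> int:
    """Second hash function from the example; always between 1 and 7."""
    return 7 - (key % 7)


def double_hash_insert(table, key):
    """Insert key by trying (key + i*hash2(key)) mod len(table) for i = 0, 1, ...

    Returns the list of slots probed; the last one is where the key was stored.
    """
    step = hash2(key)
    probed = []
    for i in range(len(table)):
        slot = (key + i * step) % len(table)
        probed.append(slot)
        if table[slot] is None:
            table[slot] = key
            return probed
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * TABLE_SIZE
for year in (1492, 1776, 1812, 1945, 1968, 1992):
    print(year, double_hash_insert(table, year))
# The last line printed is: 1992 [2, 5, 8, 1]
```

Unlike the linear and quadratic cases, 1812 and 1992 share home slot 2 but follow different probe sequences (steps of 1 and 3), because the step depends on the key itself.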
Rehashing
When a hash table starts getting too full, with many delays caused by repeated collisions, rehashing the values into a new, larger table with a new hash function may alleviate the problem.
Original table: Hash(president) = first_year_in_office mod 11
Insert, in order: Bush2 (2001), Clinton (1993), Bush (1989), Reagan (1981), Carter (1977), Ford (1974), Nixon (1969), Johnson (1963), Kennedy (1961), Eisenhower (1953)
Inserting Obama (2009) would cause a collision in slot 7, so… REHASH
New table: Hash(president) = first_year_in_office mod 23
[Figure: contents of the original 11-cell table and of the rehashed 23-cell table]
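The rehashing step might look like the sketch below. The slide does not say which collision strategy the table uses; linear probing with a step of 1 is assumed here because it is consistent with slot 7 already being occupied when Obama (2009) arrives. `insert` and `rehash` are hypothetical helper names, and each president is represented only by the first year in office.

```python
def insert(table, key):
    """Store key at (key mod size), stepping forward one cell at a time on a
    collision (linear probing with step 1, an assumption). Returns the slot used."""
    size = len(table)
    for i in range(size):
        slot = (key + i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")


def rehash(old_table, new_size):
    """Build a larger table and re-insert every stored key with the new modulus."""
    new_table = [None] * new_size
    for key in old_table:
        if key is not None:
            insert(new_table, key)
    return new_table


# First years in office, in the insertion order given on the slide.
years = [2001, 1993, 1989, 1981, 1977, 1974, 1969, 1963, 1961, 1953]

table = [None] * 11
for year in years:
    insert(table, year)
print(table[2009 % 11] is not None)  # True: slot 7 is already taken

table = rehash(table, 23)
insert(table, 2009)
print(sum(cell is not None for cell in table))  # 11 keys in 23 cells
```

Every key has to be re-inserted because its cell depends on the table size, so a rehash costs time proportional to the number of stored elements.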
Associative Arrays
Hashing is used extensively in modern programming, particularly in database management, network security, and operating systems. Consequently, most modern programming languages have built-in mechanisms for implementing associative arrays, i.e., dictionaries based on the key-value concept of hash tables.

C++ “map”:
    #include <map>
    #include <string>
    using namespace std;

    int main()
    {
        // Note: std::map is an ordered, tree-based container;
        // std::unordered_map is the hash-based counterpart.
        map<string, string> phone_book;
        phone_book["Sally Smart"] = "555-9999";
        phone_book["John Doe"] = "555-1212";
        phone_book["J. Random Hacker"] = "553-1337";
        return 0;
    }

Java “map”:
    import java.util.HashMap;
    import java.util.Map;

    Map<String, String> phoneBook = new HashMap<String, String>();
    phoneBook.put("Sally Smart", "555-9999");
    phoneBook.put("John Doe", "555-1212");
    phoneBook.put("J. Random Hacker", "555-1337");

Lua “table”:
    phone_book = {
        ["Sally Smart"] = "555-9999",
        ["John Doe"] = "555-1212",
        ["J. Random Hacker"] = "553-1337", -- Trailing comma is OK
    }

    aTable = {
        -- Table as value
        subTable = { 5, 7.5, k = true }, -- key is "subTable"
        -- Function as value
        ['John Doe'] = function(age) if age < 18 then return "Young" else return "Old!" end end,
        -- Table and function (and other types) can also be used as keys
    }

Python “dictionary”:
    phonebook = {
        'Sally Smart' : '555-9999',
        'John Doe' : '555-1212',
        'J. Random Hacker' : '553-1337'
    }

Perl “hash”:
    %phone_book = (
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337',
    );

Ruby “hash”:
    phonebook = {
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337'
    }
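To tie the associative-array syntax back to the three operations named at the start of the chapter, here is a small sketch of searching, inserting, and deleting with a Python dictionary (which CPython implements as a hash table). The entry "Jane Roe" is a made-up addition for illustration.

```python
phonebook = {
    "Sally Smart": "555-9999",
    "John Doe": "555-1212",
    "J. Random Hacker": "553-1337",
}

# Search: look up a value by its key.
print(phonebook["John Doe"])  # 555-1212

# Insert: assigning to a new key adds an entry (hypothetical entry).
phonebook["Jane Roe"] = "555-0000"

# Delete: remove an entry by key.
del phonebook["Sally Smart"]
print("Sally Smart" in phonebook)  # False

# A missing key can be handled with a default instead of an exception.
print(phonebook.get("Nobody", "not listed"))  # not listed
```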