WEEK 1 Hashing Part I CE222 – Data Structures & Algorithms II Chapter 5.1-5.3 (based on the book by M. A. Weiss, Data Structures and Algorithm Analysis in C++, 3rd edition, 2006)
GOAL
• Develop a structure that allows users to insert / delete / find records in constant average time, i.e. O(1).
• The structure will be a (relatively small) table.
• The table is completely contained in memory.
• It is implemented by an array.
• It capitalizes on the ability to access any element of the array in constant time.
CE 222-Data Structures & Algorithms II, Izmir University of Economics
General Idea
• A stored item needs to have a data member, called the key, that is used to compute the index value for the item.
  • The key could be an integer, a string, etc.
  • e.g. a name or ID that is part of a large employee structure.
• If the size of the array is N, the items stored in the hash table are indexed by values from 0 to N - 1.
• Each key is mapped into some number in the range 0 to N - 1.
• The mapping is called a hash function.
Example Hash Table
[Figure: a hash function maps the key of each item to a cell of a table indexed 0 to 9. Items shown: linda 25000, joe 31250, dave 27500, mary 28200.]
Hash Function
• Determines the position of keys in the array (maps items to cells in the array).
• The hash function:
  • must be simple to compute.
  • must distribute the keys evenly among the cells.
• If all the keys are known in advance, it is possible to write a perfect (collision-free) hash function; in general the keys are not known in advance, so this is not possible.
An example 1/2
• Assume that keys are non-negative integers between 0 and MAX_INT and that the table size N is 5.
• The hashing function takes a key x and returns the index hash(x):
  hash(x) = x mod N
  hash(x) = x % 5
An example 2/2
hash(x) = x mod N
hash(x) = x % 5
Assume that the keys are 23, 14, 25, 46, 82, inserted in that order.
Steps:
1. hash(23) = 23 % 5 = 3
2. hash(14) = 14 % 5 = 4
3. hash(25) = 25 % 5 = 0
4. hash(46) = 46 % 5 = 1
5. hash(82) = 82 % 5 = 2
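The five steps above can be reproduced with a few lines of C++. This is only a minimal sketch: the names `hashMod` and `placeKeys` are illustrative and do not come from the slides, empty cells are marked with -1, and collisions are deliberately not handled.

```cpp
#include <cassert>
#include <vector>

// Division-method hash from the slide: hash(x) = x mod N.
// hashMod is an illustrative name; x must be non-negative, tableSize positive.
int hashMod(int x, int tableSize) {
    assert(x >= 0 && tableSize > 0);
    return x % tableSize;
}

// Places each key in the cell chosen by hashMod and returns the table.
// Empty cells hold -1. Collisions are NOT handled in this sketch:
// a later key would simply overwrite an earlier one.
std::vector<int> placeKeys(const std::vector<int>& keys, int tableSize) {
    std::vector<int> table(tableSize, -1);
    for (int key : keys)
        table[hashMod(key, tableSize)] = key;
    return table;
}
```

Calling `placeKeys({23, 14, 25, 46, 82}, 5)` fills the cells 0 to 4 with 25, 46, 82, 23, 14, which matches the indices computed in the steps.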
Hash Functions
Problems:
• Keys may not be numeric.
• The number of possible keys is much larger than the space available in the table.
• Different keys may map into the same location. (What happens if the keys are 25, 30, and 40 in the previous example? All three map to cell 0.)
• The hash function is not one-to-one => collision.
• If there are too many collisions, the performance of the hash table will suffer dramatically.
Hash Functions
• If the input keys are integers, then simply computing key mod TableSize is a general strategy,
• unless the keys happen to have some undesirable properties. For example, if the table size is 10 and every key is a multiple of 10 (key = 10*i), all keys hash to cell 0.
• Make the table size a prime!
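The effect of the table size on keys of the form 10*i can be checked directly. The sketch below counts how many distinct cells a set of keys occupies under key % tableSize; `distinctCells` and `multiplesOfTen` are illustrative names, not part of the slides.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Returns how many distinct cells the keys occupy under key % tableSize.
// Keys are assumed to be non-negative.
int distinctCells(const std::vector<int>& keys, int tableSize) {
    assert(tableSize > 0);
    std::set<int> cells;
    for (int key : keys)
        cells.insert(key % tableSize);
    return static_cast<int>(cells.size());
}

// Builds the keys 0, 10, 20, ..., 10 * (count - 1).
std::vector<int> multiplesOfTen(int count) {
    std::vector<int> keys;
    for (int i = 0; i < count; ++i)
        keys.push_back(10 * i);
    return keys;
}
```

With the ten keys 0, 10, ..., 90, a table of size 10 uses a single cell, while a table of prime size 11 spreads them over ten different cells.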
Hash Functions
• If the input keys are strings, then the hash function needs to convert the keys into a numeric value.
• How to convert a string to a numeric value?
• Use the ASCII codes of the characters (values from 0 to 127).
Hash Function for Strings 1
• Add up the ASCII values of all characters of the key.
Example: tableSize = N and key = "john"
hashVal = 106 + 111 + 104 + 110 = 431
index = 431 % N

    int hash(const string &key, int tableSize)
    {
        int hashVal = 0;                        // running sum of character codes
        for (int i = 0; i < key.length(); i++)
            hashVal += key[i];
        return hashVal % tableSize;
    }
Hash Function for Strings 1
• The additive function above is easy to implement!
• However, if the table size is large, the function does not distribute the keys well.
• e.g. with table size = 10000 and key length <= 8, the hash function can assume values only between 0 and 8 * 127 = 1016.
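The limited range of the additive function can be confirmed with a small, self-contained version. This is a sketch under the assumption that keys are 7-bit ASCII; `additiveHash` and `maxAdditiveSum` are illustrative names chosen here.

```cpp
#include <cassert>
#include <string>

// Additive string hash: sums the character codes, then reduces mod tableSize.
// additiveHash is an illustrative name; keys are assumed to be 7-bit ASCII.
int additiveHash(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    int hashVal = 0;
    for (unsigned char ch : key)
        hashVal += ch;
    return hashVal % tableSize;
}

// Largest sum an ASCII key of the given length can produce (127 per character).
int maxAdditiveSum(int keyLength) {
    return 127 * keyLength;
}
```

For the key "john" and a table of size 10000 the function returns 431, and no key of up to 8 characters can ever reach a cell above 1016, so most of the table stays unused.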
Hash Function for Strings 2
• Examine only the first 3 characters of the key.
• English has 26 different letters; counting the blank gives 27 characters, hence the multiplier 27.

    int hash(const string &key, int tableSize)
    {
        return (key[0] + 27 * key[1] + 729 * key[2]) % tableSize;   // 729 = 27 * 27
    }

• In theory, 26 * 26 * 26 = 17576 different combinations (ignoring blanks) can be generated. However, English is not random: only 2851 different combinations actually occur.
• Thus this function, although easily computable, is also not appropriate if the hash table is reasonably large.
• e.g. with TableSize = 10007, even if none of these combinations collide, only about 28.5% (2851/10007) of the table can be hashed to.
Hash Function for Strings 3

    int hash(const string &key, int tableSize)
    {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++)
            hashVal = 37 * hashVal + key[i];   // polynomial in 37, evaluated by Horner's rule
        hashVal %= tableSize;
        if (hashVal < 0)                       /* in case overflow occurs */
            hashVal += tableSize;
        return hashVal;
    }
Hash function for Strings 3
key = "ali", KeySize = 3, TableSize = 10007

    i        0    1    2
    key[i]   a    l    i
    ASCII    97   108  105

    // hashVal = 0;
    // for (int i = 0; i < key.length(); i++)
    //     hashVal = 37 * hashVal + key[i];

    hashVal = 0;
    hashVal = 37 * 0 + key[0];                // 0 + 97
    hashVal = 37 * 97 + key[1];               // 37*97 + 108
    hashVal = 37 * (37*97 + 108) + key[2];    // 37*37*97 + 37*108 + 105

hash("ali") = (105 * 1 + 108 * 37 + 97 * 37^2) % 10007 = 136894 % 10007 = 6803
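The trace can be checked with a runnable variant of the third hash function. This sketch deviates from the slide code in one labeled way: it uses unsigned arithmetic, which wraps around on overflow instead of becoming negative, so the sign fix-up is not needed. The name `hashPoly` is illustrative. With the standard ASCII codes 'a' = 97, 'l' = 108, 'i' = 105 it yields 6803 for "ali" and a table size of 10007.

```cpp
#include <cassert>
#include <string>

// Polynomial string hash with multiplier 37, evaluated by Horner's rule.
// Assumption of this sketch: unsigned arithmetic is used, so overflow wraps
// around (well defined) and the result never needs a sign correction.
unsigned int hashPoly(const std::string& key, unsigned int tableSize) {
    assert(tableSize > 0);
    unsigned int hashVal = 0;
    for (unsigned char ch : key)
        hashVal = 37 * hashVal + ch;
    return hashVal % tableSize;
}
```

Unlike the first two string functions, every character of the key contributes, and the result can reach any cell of a large table.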
Hash Function : Collision
• Let hash(x) = x % 15.
• Then:

    x        25   129   35   2501   47   36
    hash(x)  10     9    5     11    2    6

• Storing the keys in the array is straightforward: each key goes into the cell given by hash(x).
• Thus delete and find can be done in O(1), and so can insert, except…
Hash Function : Collision
• What happens when you try to insert x = 65?
• hash(65) = 65 % 15 = 5, but cell 5 already holds 35.
• If, when an element is inserted, it hashes to the same value as an already inserted element, this is called a collision.
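The collision at x = 65 can be detected mechanically. The following sketch inserts keys into a plain array using x % tableSize and reports every key whose cell is already occupied; `collidingKeys` is an illustrative name, empty cells are marked with -1, and keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <vector>

// Inserts the keys in order into a table of the given size using x % tableSize
// and returns the keys that could not be stored because their cell was taken.
// No collision resolution is attempted; rejected keys are only reported.
std::vector<int> collidingKeys(const std::vector<int>& keys, int tableSize) {
    assert(tableSize > 0);
    std::vector<int> table(tableSize, -1);   // -1 marks an empty cell
    std::vector<int> rejected;
    for (int key : keys) {
        int index = key % tableSize;
        if (table[index] != -1)
            rejected.push_back(key);         // collision: cell already in use
        else
            table[index] = key;
    }
    return rejected;
}
```

For the keys 25, 129, 35, 2501, 47, 36 and table size 15 nothing is rejected; appending 65 makes it the only rejected key, because it lands in the cell already used by 35.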
Handling Collisions
• Separate Chaining
• Open Addressing
  • Linear Probing
  • Quadratic Probing
  • Double Hashing
Separate Chaining
• The idea is to keep a list of all elements that hash to the same value.
• The array elements are pointers to the first nodes of the lists.
• A new item is inserted at the front of the list.
• Advantages:
  • Better space utilization for large items.
  • Simple collision handling: searching a linked list.
  • Overflow: we can store more items than the hash table size.
  • Deletion is quick and easy: deletion from the linked list.
Separate Chaining Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10

    0: 0
    1: 81 → 1
    2:
    3:
    4: 64 → 4
    5: 25
    6: 36 → 16
    7:
    8:
    9: 49 → 9
Separate Chaining : Operations
• Initialization: all entries are set to NULL.
• Find:
  • Locate the cell using the hash function.
  • Do a sequential search on the linked list in that cell.
• Insertion:
  • Locate the cell using the hash function.
  • If the item does not already exist, insert it as the first item in the list.
• Deletion:
  • Locate the cell using the hash function.
  • Delete the item from the linked list.
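The three operations above can be sketched compactly with standard containers. This is not the implementation used later in the slides: it swaps the hand-written linked list for `std::list`, the class name `ChainedHashTable` and the accessor `chainAt` are hypothetical, and keys are assumed to be non-negative integers hashed with key % tableSize.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Separate chaining with one std::list per cell (illustrative sketch).
class ChainedHashTable {
public:
    explicit ChainedHashTable(int tableSize) : table_(tableSize) {
        assert(tableSize > 0);
    }

    // Find: locate the cell, then search its chain sequentially.
    bool contains(int x) const {
        const std::list<int>& chain = table_[cell(x)];
        return std::find(chain.begin(), chain.end(), x) != chain.end();
    }

    // Insertion: if x is not present, put it at the front of its chain.
    bool insert(int x) {
        if (contains(x)) return false;
        table_[cell(x)].push_front(x);
        return true;
    }

    // Deletion: locate the cell and erase x from its chain if present.
    bool remove(int x) {
        std::list<int>& chain = table_[cell(x)];
        auto it = std::find(chain.begin(), chain.end(), x);
        if (it == chain.end()) return false;
        chain.erase(it);
        return true;
    }

    // Read-only access to one chain, used here only for inspection.
    const std::list<int>& chainAt(int index) const { return table_[index]; }

private:
    int cell(int x) const {
        assert(x >= 0);   // assumption: non-negative keys
        return x % static_cast<int>(table_.size());
    }
    std::vector<std::list<int>> table_;
};
```

Inserting the keys 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 into a table of size 10 leaves cell 1 with the chain 81, 1 and cell 6 with the chain 36, 16, because new items go to the front.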
Separate Chaining: Disadvantages
• Parts of the array might never be used.
• As chains get longer, search time increases to O(N) in the worst case.
• Constructing new chain nodes is relatively expensive (still constant time, but the constant is high).
• Is there a way to use the "unused" space in the array instead of using chains to make more space? (Later!)
Analysis of Separate Chaining
• Collisions are very likely.
• How likely, and what is the average length of the lists?
• Load factor λ, definition:
  • The ratio of the number of elements (N) in a hash table to the hash TableSize,
  • i.e. λ = N/TableSize.
• The average length of a list is also λ.
• For chaining, λ is not bounded by 1; it can be > 1.
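The claim that the average list length equals the load factor can be checked numerically. The helper names `loadFactor` and `averageChainLength` below are illustrative; the chain lengths used in the usage note are those of the example with keys 0, 1, 4, ..., 81 and hash(key) = key % 10.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Load factor: number of stored elements divided by the table size.
double loadFactor(int numElements, int tableSize) {
    assert(tableSize > 0);
    return static_cast<double>(numElements) / tableSize;
}

// Average length of the chains, given the length of the chain in each cell.
double averageChainLength(const std::vector<int>& chainLengths) {
    assert(!chainLengths.empty());
    int total = std::accumulate(chainLengths.begin(), chainLengths.end(), 0);
    return static_cast<double>(total) / chainLengths.size();
}
```

For that example the chain lengths per cell are 1, 2, 0, 0, 2, 1, 2, 0, 0, 2, so both `loadFactor(10, 10)` and `averageChainLength` give 1.0; storing 20 elements in the same table would give a load factor of 2.0.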
Cost of searching 1/4
Search time (or cost) = time to evaluate the hash function + time to traverse the list
• A search can be either unsuccessful or successful.
Cost of searching 2/4
Unsuccessful search:
• We have to traverse the entire list, so we need to compare λ nodes on average.
Cost of searching 3/4
Successful search:
• Time to traverse the list = the node searched for + half the expected number of other nodes in the list.
• N = number of elements; M = number of lists (TableSize).
• Expected number of other nodes = (N - 1)/M = λ - 1/M (which is essentially λ, since M is presumed large).
• On average, we need to check half of the other nodes while searching for a certain element.
• Thus the average search cost = 1 + λ/2.
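Written out as formulas, the two search costs follow from the definitions on these slides. The symbols C_u and C_s for the unsuccessful and successful cost are introduced here only for illustration; costs are counted in nodes examined, ignoring the time to evaluate the hash function.

```latex
\lambda = \frac{N}{M}, \qquad
C_u = \lambda, \qquad
C_s = 1 + \frac{1}{2}\cdot\frac{N-1}{M}
    = 1 + \frac{\lambda}{2} - \frac{1}{2M}
    \approx 1 + \frac{\lambda}{2} \quad (M \text{ large})
```

For example, with λ = 1 an unsuccessful search examines about 1 node and a successful search about 1.5 nodes on average.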
Cost of searching 4/4
Observation: the table size is not important, but the load factor is.
For separate chaining, make λ ≈ 1.
How to implement hashing? EXAMPLE
Implementation : Example p1/3

    class Node {
    public:                        // for simplicity, all members are public
        int key;
        Node *next;
        Node(int a) { key = a; next = NULL; }
    };

    class List {
    public:
        Node *head;
        List() { head = NULL; }

        // Returns true if x is stored in the list.
        bool searchList(int x) {
            Node *p = head;        // declared outside the loop so it can be tested afterwards
            while (p != NULL && p->key != x)
                p = p->next;
            return p != NULL;
        }

        // Inserts x at the front of the list.
        void insertList(int x) {
            Node *p = new Node(x);
            p->next = head;        // also correct when the list is empty (head == NULL)
            head = p;
        }
    };
Implementation : Example p2/3

    const int TABLE_SIZE = 5;
    int hash(int x);   // hash function: returns an index between 0 and TABLE_SIZE - 1

    class HashTable {
    public:
        HashTable();
        void makeEmpty();      // remove all entries in the table
        void insert(int x);    // insert x into the table
        void remove(int x);    // remove x from the table
    private:
        List table[TABLE_SIZE];
    };
Implementation : Example p3/3

    void HashTable::insert(int x)
    {
        int value = hash(x);   // table[value] is the list for the cell x hashes to
        if (table[value].searchList(x) == false)
            table[value].insertList(x);
    }

    void HashTable::remove(int x)
    {
        int value = hash(x);   // table[value] is the list for the cell x hashes to
        if (table[value].searchList(x) == false)
            cout << "cannot remove";
        else
            ;                  // What goes here?
    }
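One possible answer to the open question in `remove` is to give the list a deletion routine and call it in the else branch. The sketch below is self-contained and only an assumption about how this could be done: `removeList` is a hypothetical member that does not appear on the slides, and the destructor and deleted copy operations are added here so the example manages its memory safely.

```cpp
#include <cassert>

struct Node {
    int key;
    Node* next;
    explicit Node(int a) : key(a), next(nullptr) {}
};

class List {
public:
    Node* head = nullptr;

    List() = default;
    List(const List&) = delete;              // copying is disabled in this sketch
    List& operator=(const List&) = delete;
    ~List() {
        while (head != nullptr) {            // free every remaining node
            Node* doomed = head;
            head = head->next;
            delete doomed;
        }
    }

    bool searchList(int x) const {
        const Node* p = head;
        while (p != nullptr && p->key != x)
            p = p->next;
        return p != nullptr;
    }

    void insertList(int x) {                 // insert at the front
        Node* p = new Node(x);
        p->next = head;
        head = p;
    }

    // Hypothetical helper: unlinks and deletes the first node holding x.
    // Returns false if x is not in the list.
    bool removeList(int x) {
        Node** link = &head;                 // pointer to the link that may need rewiring
        while (*link != nullptr && (*link)->key != x)
            link = &(*link)->next;
        if (*link == nullptr)
            return false;
        Node* doomed = *link;
        *link = doomed->next;                // bypass the node, works for head and interior nodes
        delete doomed;
        return true;
    }
};
```

With such a helper, the else branch of `HashTable::remove` could simply call `table[value].removeList(x);`. Since `removeList` already reports whether the key was found, the preceding `searchList` call could also be dropped to avoid traversing the list twice.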