300 likes | 540 Views
CSE331- Hashing. 1. Main Index. Contents. Chapter 12: Advanced Associative Containers BST vs Hash Table Hash Function Collision Coping with Collisions Open addressing & linear probing Chaining with separate lists. Hash Functions What works C++ Function Objects Hash Iterators
E N D
CSE331- Hashing 1 Main Index Contents • Chapter 12: Advanced Associative Containers • BST vs Hash Table • Hash Function • Collision • Coping with Collisions • Open addressing & linear probing • Chaining with separate lists • Hash Functions • What works • C++ Function Objects • Hash Iterators • Efficiency of Hash Methods • Summary Slides
BST vs Hash Table • Both used to implement Sets & Maps • Binary Search Tree – ordered associative container • Order (log N) access (average & worst) • Hash Table – unordered associative container • Order(1) access (average case)
Hash Function • A hash function converts a key into a numeric (unsigned int) table index • Ideal hash functions uniformly distribute keys to all available indices • When two keys hash to the same index a collision occurs • Keys are not in any particular order (numeric, alphabetical, ...) within the table
4 Main Index Contents Example Hash Function
5 Main Index Contents Example Hash Function hf(36)=36 ---- 36%7 = 1
Coping with Collisions • Three primary methods exist for coping with collisions • Rehashing: use same key but different hash function • Linear Probing: examine succesive locations (index, index+1, index+2, ...) • Chaining: implement table with separate list at each table[index] location • Note: Except for the last case, the table is a fixed size.
7 Main Index Contents Hash Table Using Linear Probing – Open Addressing
Linear Probing PseudoCode // insert item into table of size n using hashFunc() to // calculate index. this assumes no duplicate keys, and some // method of indicating that a hash table location is empty int index = hashFunc(item) % n; int origIndex = index; do { if table[index] is empty insert item as table[index] and return else if table[index] matches item return index = (index+1) % n; // this is next location to probe } while (index != origIndex); throw overflowError; // if we get here, table is full & does // not contain item
Problems with Linear Probing • Clustering of items occurs & degrades performance as number of items approaches size of table • Colliding items fill in gaps between other entries • This forms runs or clusters within the table • Items in the cluster are a mix of items that hash to different indices, so long sequences of repeated probes are required to find what we seek
Chaining – Uses Lists or Buckets • Implement the hash table as a vector of lists • Each list (bucket, chain, ...) contains all items that hash to the associated table location • Buckets are not mixed like clusters in linear probing • Table size can grow easily by expanding individual buckets as necessary • The number of buckets stays constant • Within a bucket, items are unordered and must be searched linearly
C++ Function Objects • Function object is an instance of a class that contain only a single function – operator() • Function objects are easily passed as parameters to other functions • Commonly used to implement hash functions and comparison operations template <typename T> class greaterThan { public: bool operator() (const T& x, const T& y) const { return x > y; } };
Reasonable Hash Functions • Integer keys • Identity function is good if key number is random or a portion of it is random class hfIntKey { public: bool operator() (int key) const { return key; } };
Reasonable Hash Functions • Integer keys • Midsquare technique (extracts middle two bytes of 4 byte square of key) -- works if key is not random class hfMidSq { public: bool operator() (int key) const { unsigned int n = key; return ((n*n)/256) % 65536; // 0 .. 2^16-1 } };
Reasonable Hash Functions • String keys • Simple function uses ASCII codes for the string characters to build n-digit unsigned integers out of n-digit strings class hfString { public: bool operator() (string key) const { unsigned int prime = 2049982463; int n(0); for (int i=0; i < key.length(); i++) n = n*8 + item[i]; return (n > o ? (n % prime) : (-n % prime) ); } };
Reasonable Hash Functions • String keys • Folding uses substrings as numbers and combines them by addition or multiplication or … • Example: use 3 character substrings of SSN class hfSSN { public: bool operator() (string ssn) const { return ( atoi(ssn.substr(0,3).cstr()) + atoi(ssn.substr(3,3).cstr()) + atoi(ssn.substr(6,3).cstr()) ); } };
Hash Class – not in STL • See headers in textbook include folder • d_hash.h – for the hash table using buckets • d_hashf.h – for hash function object • d_uset.h – for unordered set based on hash class • d_hiter.h – for hash class iterator and const_iterator
Hash Class – Not in STL template <typename T, typename HashFunc> class hash { public : hash (int nbuckets, const HashFunc& hfunc = HashFunc()); hash (T *first, T *last, int nbuckets, const HashFunc& hfunc = HashFunc()); bool empty() const; int size() const; iterator find(const T& item); pair<iterator,bool> insert(const T& item); int erase(const T& item); void erase(iterator pos); void erase(iterator first, iterator last); iterator begin(); const_iterator begin() const; iterator end(); const_iterator end() const; private: int numBuckets; // number of buckets vector<list<T> > bucket; // table is vector of lists HashFunc hf; // hash function int hashtableSize; // number of elements };
Hash::find(item) template <typename T, typename HashFunc> hash<T,HashFunc>::iterator hash<T,HashFunc>::find( const T& item) { int hashIndex = int(hf(item) % numBuckets); list<T>& myBucket = bucket[hashIndex]; list<T>::iterator bucketIter; // traverse list and look for a match with item bucketIter = myBucket.begin(); while(bucketIter != myBucket.end()) { if (*bucketIter == item) // return iterator to found item return iterator(this, hashIndex, bucketIter); bucketIter++; } // did not find item, so return iterator to table end return end(); }
Hash::insert(item) template <typename T, typename HashFunc> pair<hash<T, HashFunc>::iterator,bool> hash<T, HashFunc>::insert(const T& item) { int hashIndex = int(hf(item) % numBuckets); list<T>& myBucket = bucket[hashIndex]; list<T>::iterator bucketIter; bool success; bucketIter = myBucket.begin(); while (bucketIter != myBucket.end()) if (*bucketIter == item) break; // found the item already in bucket else bucketIter++; if (bucketIter == myBucket.end()) { bucketIter = myBucket.insert(bucketIter, item); success = true; hashtableSize++; } else success = false; // item already in table return pair<iterator,bool> (iterator(this, hashIndex, bucketIter), success); }
21 Main Index Contents Hash Iterator hIter Referencing Element 22 in Table ht
Determining Performance • Load factor (m = size of table, n = items in table) • Measures table density • Linear addressing (m = size of vector, maxitems) • Chaining (m = number of buckets) • Worst case • (all items hash to same table location or bucket) • Linear search is O(n) • Making table size prime helps prevent nonuniform distribution causing this worst case
Average Case - Chaining • Finding bucket is O(1) – using hash function • Uniform hashing implies each bucket has n/m items • Assuming uniform hash distribution • The ith item was inserted at the end of its bucket when the previous (i-1) items were spread evenly over the m buckets • To find this item takes 1+(i-1)/m comparisons since there are (on average) (i-1)/m items ahead of it in its bucket • Average performance of search for an arbitrary item is the average of the number of comparisons required to find each item in the list
Hash table size = m, Number of elements in hash table = n, Load factor = Average Probes for Successful Search Average Probes for Unsuccessful Search Open Probe Chaining 24 Main Index Contents Efficiency of Hash Methods
25 Main Index Contents Summary Slide 1 §- Hash Table - simulates the fastest searching technique, knowing the index of the required value in a vector and array and apply the index to access the value, by applying a hash function that converts the data to an integer - After obtaining an index by dividing the value from the hash function by the table size and taking the remainder, access the table. Normally, the number of elements in the table is much smaller than the number of distinct data values, so collisions occur. - To handle collisions, we must place a value that collides with an existing table element into the table in such a way that we can efficiently access it later.
26 Main Index Contents Summary Slide 2 §- Hash Table (Cont…) - average running time for a search of a hash table is O(1) - the worst case is O(n)
27 Main Index Contents Summary Slide 3 §- Collision Resolution - Two types: 1) linear open probe addressing - the table is a vector or array of static size - After using the hash function to compute a table index, look up the entry in the table. - If the values match, perform an update if necessary. - If the table entry is empty, insert the value in the table.
28 Main Index Contents Summary Slide 4 §- Collision Resolution (Cont…) - Two types: 1) linear open probe addressing - Otherwise, probe forward circularly, looking for a match or an empty table slot. - If the probe returns to the original starting point, the table is full. - you can search table items that hashed to different table locations. - Deleting an item difficult.
29 Main Index Contents Summary Slide 5 §- Collision Resolution (Cont…) 2) chaining with separate lists. - the hash table is a vector of list objects - Each list is a sequence of colliding items. - After applying the hash function to compute the table index, search the list for the data value. - If it is found, update its value; otherwise, insert the value at the back of the list. - you search only items that collided at the same table location
30 Main Index Contents Summary Slide 6 §- Collision Resolution (Cont…) - there is no limitation on the number of values in the table, and deleting an item from the table involves only erasing it from its corresponding list