Data Structures CSCI 132, Spring 2014 Lecture 34 Analyzing Hash Tables


Recall Hash Tables
• Hash tables use an index function (the hash function) that can map many possible keys to a single location.
• If the table is sparse, then most of the time only one key will go to each location.
• If two records are assigned to the same location (a collision), we use a method for reassigning the second record (collision resolution).
[Figure: A hash table]

The C++ Hash Table Specification
const int hash_size = 997;   // a prime number of appropriate size
class Hash_table {
public:
   Hash_table( );
   void clear( );
   Error_code insert(const Record &new_entry);
   Error_code retrieve(const Key &target, Record &found) const;
private:
   Record table[hash_size];
};

Implementation of insert( )
Error_code Hash_table :: insert(const Record &new_entry)
{
   Error_code result = success;
   int probe_count,   // Counter to be sure that table is not full.
       increment,     // Increment used for quadratic probing.
       probe;         // Position currently probed in the hash table.
   Key null;          // Null key for comparison purposes.
   null.make_blank( );
   probe = hash(new_entry);   // Find location to insert new_entry.
   probe_count = 0;
   increment = 1;

insert( ) continued
   while (table[probe] != null                    // Is the location empty?
          && table[probe] != new_entry            // Duplicate key?
          && probe_count < (hash_size + 1)/2) {   // Has overflow occurred?
      probe_count++;
      probe = (probe + increment)%hash_size;
      increment += 2;   // Prepare increment for next iteration.
   }
   if (table[probe] == null) table[probe] = new_entry;   // Insert new entry.
   else if (table[probe] == new_entry) result = duplicate_error;
   else result = overflow;   // The table is full.
   return result;
}
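The probing loop above can be tried out in isolation. The following is a minimal sketch, not the lecture's class: it assumes non-negative int keys in place of Record and Key, uses -1 as the empty marker, and uses a table of size 11 so the probe sequence is easy to follow by hand. The names IntHashTable, InsertResult, kTableSize and kEmpty are hypothetical.

```cpp
#include <array>

// Hypothetical stand-ins for the lecture's types (assumptions, not the
// course code): keys are non-negative ints and -1 marks an empty slot.
constexpr int kTableSize = 11;   // small prime, for illustration only
constexpr int kEmpty = -1;

enum class InsertResult { success, duplicate_error, overflow };

struct IntHashTable {
    std::array<int, kTableSize> table;

    IntHashTable() { table.fill(kEmpty); }

    // Simple modular hash; assumes key >= 0.
    static int hash(int key) { return key % kTableSize; }

    // Same control flow as the lecture's insert(): quadratic probing with
    // increments 1, 3, 5, ... so the offsets from the home slot are 1, 4, 9, ...
    InsertResult insert(int key) {
        int probe = hash(key);
        int probe_count = 0;
        int increment = 1;
        while (table[probe] != kEmpty                   // slot occupied?
               && table[probe] != key                   // by a different key?
               && probe_count < (kTableSize + 1) / 2) { // probes remaining?
            ++probe_count;
            probe = (probe + increment) % kTableSize;
            increment += 2;
        }
        if (table[probe] == kEmpty) {
            table[probe] = key;
            return InsertResult::success;
        }
        if (table[probe] == key) return InsertResult::duplicate_error;
        return InsertResult::overflow;
    }
};
```

Inserting 3, 14 and 25 (all of which hash to slot 3) places them in slots 3, 4 and 7, that is, at offsets 0, 1 and 4 from the home slot. Inserting 14 a second time returns duplicate_error.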

Likelihood of collisions
• How many people have to be in a room before the probability that two of them share a birthday exceeds 50%?
• P = 1 - (364/365)(363/365)(362/365) ... ((365 - m + 1)/365) > 0.5 when m >= 23.
• The calculation for the probability of a collision in a hash table is similar.
• The table does not have to be very full for the probability of a collision to reach at least 50%.
• Therefore: Collisions happen! We must handle them efficiently.
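The birthday calculation can be checked numerically. This is a small sketch under the same assumption as the slide (each of the t locations is equally likely); the function names collision_probability and smallest_m_for_half are hypothetical.

```cpp
// Probability that at least two of m keys land in the same one of t
// equally likely locations (the birthday problem with t "days").
double collision_probability(int m, int t) {
    double p_no_collision = 1.0;
    for (int i = 0; i < m; ++i) {
        // The (i+1)-th key must avoid the i locations already taken.
        p_no_collision *= static_cast<double>(t - i) / t;
    }
    return 1.0 - p_no_collision;
}

// Smallest m for which the collision probability exceeds 0.5.
int smallest_m_for_half(int t) {
    int m = 1;
    while (collision_probability(m, t) <= 0.5) {
        ++m;
    }
    return m;
}
```

Calling smallest_m_for_half(365) returns 23, matching the slide. Passing hash_size as t gives the corresponding number of entries for a hash table.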

Counting Probes
• We can analyze the running time of hash tables by counting comparisons.
• Comparisons take place when "probing" an entry: looking at an entry and comparing its key to a target.
• The number of probes done depends on how full the table is.
• n = number of entries in the table
• t = total number of positions in the table (= hash_size)
• λ = n/t = load factor
• λ = 0 means no entries in the table
• λ = 0.5 means the table is half full
• λ <= 1 for a contiguous table without chaining (open addressing)
• λ can be greater than 1 if using chaining

Number of comparisons for chaining
• Unsuccessful searches:
   • If entries are distributed evenly over the table, then the expected number of entries in each chain is n/t = λ (the load factor).
   • For an unsuccessful search, we must do one probe for each entry in the chain, so the average number of probes (or comparisons) is λ.
• Successful searches:
   • The average number of comparisons for a sequential search of a list with k items is (k + 1)/2.
   • The node we are looking for is in our chain, and the other n - 1 nodes are distributed evenly over the table, so the average chain length is k = (n - 1)/t + 1 ≈ n/t + 1 = λ + 1.
   • The average number of comparisons is therefore (λ + 1 + 1)/2 = λ/2 + 1.
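The claim that an unsuccessful search probes every entry in a chain can be seen in a small chained table that counts comparisons. This is a sketch only, assuming non-negative int keys and a modular hash; ChainedTable and its members are hypothetical names, not the course's classes.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Minimal chained hash table for non-negative int keys (an assumption made
// for illustration). Each table position holds a linked list of keys.
struct ChainedTable {
    std::vector<std::list<int>> chains;

    explicit ChainedTable(std::size_t t) : chains(t) {}

    std::size_t hash(int key) const {
        return static_cast<std::size_t>(key) % chains.size();
    }

    void insert(int key) { chains[hash(key)].push_back(key); }

    // Sequentially searches the chain for key. Returns the number of key
    // comparisons made and reports through found whether key was present.
    int search(int key, bool &found) const {
        int comparisons = 0;
        found = false;
        for (int stored : chains[hash(key)]) {
            ++comparisons;
            if (stored == key) {
                found = true;
                break;
            }
        }
        return comparisons;
    }
};
```

With a table of size 7 holding 0, 7 and 14 (all in chain 0), an unsuccessful search for 21 makes 3 comparisons, one per entry in the chain, while successful searches for 0, 7 and 14 make 1, 2 and 3 comparisons, which averages to (3 + 1)/2 = 2.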

Open addressing (without chaining)
Approximate number of comparisons, assuming evenly distributed entries (λ = load factor):
Random probing:
   Successful case: (1/λ) ln(1/(1 - λ))
   Unsuccessful case: 1/(1 - λ)
Linear probing:
   Successful case: 0.5 (1 + 1/(1 - λ))
   Unsuccessful case: 0.5 (1 + 1/(1 - λ)^2)
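To get a feel for how quickly these estimates grow as the table fills, the four open-addressing formulas can be written as functions of the load factor. This is a sketch for experimentation; the function names are hypothetical, and each function assumes 0 < lambda < 1.

```cpp
#include <cmath>

// Approximate expected number of comparisons for open addressing, as
// functions of the load factor lambda. All assume 0 < lambda < 1.

double random_probing_successful(double lambda) {
    return (1.0 / lambda) * std::log(1.0 / (1.0 - lambda));
}

double random_probing_unsuccessful(double lambda) {
    return 1.0 / (1.0 - lambda);
}

double linear_probing_successful(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}

double linear_probing_unsuccessful(double lambda) {
    const double empty_fraction = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (empty_fraction * empty_fraction));
}
```

At a load factor of 0.5 these give about 1.39 and 2 comparisons for random probing, and 1.5 and 2.5 for linear probing. At 0.9, linear probing rises to 5.5 for successful and 50.5 for unsuccessful searches, which is why open-addressed tables are usually kept well below full.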

Theoretical and empirical results

Hash Tables vs. Other Methods
• Speed of retrieval from a hash table does not depend on the total number of entries, but on the ratio of entries to table size (the load factor, λ).
• A table of size 40 with 20 entries has the same performance as a table of size 4000 with 2000 entries.
   Sequential Search: Θ(n)
   Binary Search: Θ(lg n)
   Hash Table retrieval: O(1) for small λ.
• Read section 9.8 on choosing a method for storage and retrieval of data.

Radix sort Radix sort creates a table of queues. Each queue corresponds to a letter of the alphabet. Sort from least significant letter to most significant letter.

Implementation of Radix Sort
const int key_size = 5;
const int max_chars = 28;
template <class Record>
void Sortable_list<Record> :: radix_sort( )
{
   Record data;
   Queue queues[max_chars];
   for (int position = key_size - 1; position >= 0; position--) {
      // Loop from the least to the most significant position.
      while (remove(0, data) == success) {
         int queue_number = alphabetic_order(data.key_letter(position));
         queues[queue_number].append(data);   // Queue operation.
      }
      rethread(queues);   // Reassemble the list.
   }
}
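The same idea can be sketched with standard-library containers in place of the course's Sortable_list and Queue classes. The version below is an illustration under stated assumptions: keys are std::string values padded with blanks on the right, and alphabetic_order is assumed to map a blank to 0, letters to 1 through 26 (ignoring case), and anything else to 27, which is one mapping consistent with 28 queues. The emptying loop plays the role of rethread.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

constexpr int kMaxChars = 28;   // one queue per character class (assumed mapping)

// Assumed mapping: blank -> 0, letters -> 1..26 (case-insensitive), other -> 27.
int alphabetic_order(char c) {
    if (c == ' ') return 0;
    if (c >= 'a' && c <= 'z') return c - 'a' + 1;
    if (c >= 'A' && c <= 'Z') return c - 'A' + 1;
    return 27;
}

// Least-significant-position-first radix sort on keys of at most key_size
// characters. Shorter keys are treated as if padded with blanks.
void radix_sort(std::vector<std::string> &words, std::size_t key_size) {
    std::vector<std::queue<std::string>> queues(kMaxChars);
    for (std::size_t position = key_size; position-- > 0; ) {
        // Distribute each word into the queue for its letter at this position.
        for (const std::string &word : words) {
            char c = position < word.size() ? word[position] : ' ';
            queues[alphabetic_order(c)].push(word);
        }
        // Reassemble the list by emptying the queues in order.
        words.clear();
        for (std::queue<std::string> &q : queues) {
            while (!q.empty()) {
                words.push_back(q.front());
                q.pop();
            }
        }
    }
}
```

Because each queue preserves the order in which words arrive, every pass is stable, which is what lets the sort work from the least significant letter to the most significant one.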
