241-423 Advanced Data Structures and Algorithms
Semester 2, 2013-2014
10. Hashing
Objectives: introduce hashing, hash functions, hash tables, collisions, linear probing, double hashing, bucket hashing, JDK hash classes

Contents
1. Searching an Array
2. Hashing
3. Creating a Hash Function
4. Solution #1: Linear Probing
5. Solution #2: Double Hashing
6. Solution #3: Bucket Hashing
7. Table Resizing
8. Java's hashCode()
9. Hash Tables in Java
10. Efficiency
11. Ordered/Unordered Sets and Maps
1. Searching an Array
• If the array is not sorted, a search requires O(n) time.
• If the array is sorted, a binary search requires O(log n) time.
• If the array is organized using hashing, then it is possible to have constant time search: O(1).
2. Hashing • A hash function takes a search item and returns its array location (its index position). • The array + hash function is called a hash table. • Hash tables support efficient insertion and search in O(1) time • but the hash function must be carefully chosen
A Simple Hash Function
• The simple hash function maps each item to its own cell in a table with indices 0-9:
hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2
[figure: a ten-cell table with each fruit stored at its hash index]
A Table with Key-Value Pairs
• A hash table for a map storing (ID, Name) items:
• the ID key is a nine-digit integer
• the hash table is an array of size N = 10,000
• the hash function is hashCode(ID) = last four digits of the ID key
• e.g. (025610001, jim) is stored at index 1, (981100002, ad) at index 2, (451220004, tim) at index 4, and (200759998, tim) at index 9998
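This ID hash can be sketched in a few lines of Java. The class name IdHash is illustrative; note that the leading zero of an ID such as 025610001 is dropped when the key is held as an int.

```java
public class IdHash {
    // hashCode(ID) = last four digits of the nine-digit ID key
    static int hashCode(int id) {
        return id % 10000;
    }

    public static void main(String[] args) {
        System.out.println(hashCode(25610001));  // (025610001, jim)  -> index 1
        System.out.println(hashCode(451220004)); // (451220004, tim)  -> index 4
        System.out.println(hashCode(200759998)); // (200759998, tim)  -> index 9998
    }
}
```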
Applications of Hash Tables • Small databases • Compilers • Web Browser caches
3. Creating a Hash Function • The hash function should return the table index where an item is to be placed • but it's not possible to write a perfect hash function • the best we can usually do is to write a hash function that tells us where to start looking for a location to place the item
A Real Hash Function
• A more realistic hash function produces:
hash("apple") = 5
hash("watermelon") = 3
hash("grapes") = 8
hash("cantaloupe") = 7
hash("kiwi") = 0
hash("strawberry") = 9
hash("mango") = 6
hash("banana") = 2
hash("honeydew") = 6
• Now what? "honeydew" collides with "mango" at cell 6.
Collisions
• A collision occurs when two items hash to the same array location
• e.g. "mango" and "honeydew" both hash to 6
• Where should we place the second and later items that hash to this same location?
• Three popular solutions:
• linear probing, double hashing, bucket hashing (chaining)
4. Solution #1: Linear Probing
• Linear probing handles collisions by placing the colliding item in the next empty table cell (perhaps after cycling around to the start of the table)
• Each inspected table cell is called a probe
• e.g. to add key 32 with hash(32) = 6: cells 6 and 7 are occupied and cell 8 is empty, so three probes are needed to insert 32
Example 1
• A hash table with N = 13 and h(k) = k mod 13
• Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
• Resulting table (index: key): 2: 41, 5: 18, 6: 44, 7: 59, 8: 32, 9: 22, 10: 31, 11: 73
• Total number of probes: 19
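Example 1 can be reproduced with a short Java sketch. The class and field names here are illustrative, not part of any library; each probe (an inspected cell, including the final empty one) is counted.

```java
import java.util.Arrays;

// Minimal linear-probing insert for Example 1: N = 13, h(k) = k mod 13.
public class LinearProbe {
    static final int N = 13;
    static Integer[] table = new Integer[N];
    static int probes = 0;

    static void insert(int k) {
        int i = k % N;                // h(k)
        while (table[i] != null) {    // occupied cell: one probe, move on
            probes++;
            i = (i + 1) % N;
        }
        probes++;                     // the final probe finds an empty cell
        table[i] = k;
    }

    public static void main(String[] args) {
        for (int k : new int[]{18, 41, 22, 44, 59, 32, 31, 73}) insert(k);
        System.out.println(Arrays.toString(table));
        System.out.println("probes = " + probes);  // prints "probes = 19"
    }
}
```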
Example 2: Insertion
• A table whose cells 141-148 already hold robin, sparrow, hawk, bluejay, and owl. Suppose we want to add "seagull":
• hash(seagull) = 143
• 3 probes: table[143] is not empty; table[144] is not empty; table[145] is empty
• put seagull at location 145
Searching
• Look up "seagull":
• hash(seagull) = 143
• 3 probes: table[143] != seagull; table[144] != seagull; table[145] == seagull
• found seagull at location 145
Searching Again
• Look up "cow":
• hash(cow) = 144
• 3 probes: table[144] != cow; table[145] != cow; table[146] is empty
• "cow" is not in the table, since the probes reached an empty cell
Insertion Again
• Add "hawk":
• hash(hawk) = 143
• 2 probes: table[143] != hawk; table[144] == hawk
• hawk is already in the table, so do nothing
Insertion Again
• Add "cardinal":
• hash(cardinal) = 147
• 3 or more probes: 147 and 148 are occupied, so probing wraps around and "cardinal" goes in location 0 (or 1, or 2, or ...)
Search with Linear Probing
• Search hash table A with get(k):
• start at cell h(k)
• probe consecutive locations until:
• an item with key k is found, or
• an empty cell is found, or
• N cells have been unsuccessfully probed (N is the table size)

Algorithm get(k):
    i ← h(k)
    p ← 0                      // count num of probes
    repeat
        c ← A[i]
        if c = ∅               // empty cell
            return null
        else if c.key() = k
            return c.element()
        else                   // linear probing
            i ← (i + 1) mod N
            p ← p + 1
    until p = N
    return null
Lazy Deletion • Deletions are done by marking a table cell as deleted, rather than emptying the cell. • Deleted locations are treated as empty when inserting and as occupied during a search.
Updates with Lazy Deletion
• delete(k):
• start at cell h(k) and probe consecutive cells until:
• a cell with key k is found: put DELETED in the cell and return true
• or an empty cell is found: return false
• or N cells have been probed: return false
• insert(k, v):
• start at cell h(k) and probe consecutive cells until:
• a cell is found that is either empty or contains DELETED: put (k, v) in the cell and return true
• or N cells have been probed: return false
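A minimal sketch of these lazy-deletion rules in Java (the class name LazyTable and the DELETED sentinel are illustrative): DELETED is a unique String object compared by reference, so it can never collide with a real key. As on the slide, insert() reuses the first empty or DELETED cell without checking for duplicates.

```java
// Linear probing with lazy deletion: a DELETED marker lets searches
// probe past removed cells, while inserts may reuse those cells.
public class LazyTable {
    static final String DELETED = new String("<deleted>"); // unique sentinel
    final String[] keys;
    final int N;

    LazyTable(int n) { N = n; keys = new String[n]; }

    int hash(String k) { return (k.hashCode() & Integer.MAX_VALUE) % N; }

    boolean insert(String k) {
        int i = hash(k);
        for (int p = 0; p < N; p++) {
            if (keys[i] == null || keys[i] == DELETED) {  // empty or DELETED
                keys[i] = k;
                return true;
            }
            i = (i + 1) % N;
        }
        return false;                  // N cells probed: table is full
    }

    boolean contains(String k) {
        int i = hash(k);
        for (int p = 0; p < N; p++) {
            if (keys[i] == null) return false;            // empty cell stops search
            if (keys[i] != DELETED && keys[i].equals(k)) return true;
            i = (i + 1) % N;           // DELETED cells are probed past
        }
        return false;
    }

    boolean delete(String k) {
        int i = hash(k);
        for (int p = 0; p < N; p++) {
            if (keys[i] == null) return false;
            if (keys[i] != DELETED && keys[i].equals(k)) {
                keys[i] = DELETED;     // mark rather than empty the cell
                return true;
            }
            i = (i + 1) % N;
        }
        return false;
    }
}
```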
Clustering
• Linear probing tends to form "clusters".
• a cluster is a sequence of non-empty array locations with no empty cells among them
• e.g. the cluster in Example 1 above
• The bigger a cluster gets, the more likely it is that new items will hash into that cluster, making it ever bigger.
• Clusters reduce hash table efficiency
• searching becomes sequential: O(n)
• If the size of the table is large relative to the number of items, linear probing is fast: O(1)
• a good hash function generates indices that are evenly distributed over the table range, so collisions will be minimal
• As the ratio of the number of items (n) to the table size (N) approaches 1, hashing slows down to the speed of a sequential search: O(n)
5. Solution #2: Double Hashing
• In the event of a collision, compute a second 'offset' hash function d(k)
• this is used as an offset from the collision location
• Linear probing always uses an offset of 1, which contributes to clustering. Double hashing makes the offset depend on the key, so colliding items follow different probe sequences.
Example of Double Hashing
• A hash table with N = 13, h(k) = k mod 13, and d(k) = 7 - k mod 7
• Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
• Resulting table (index: key): 0: 31, 2: 41, 5: 18, 6: 32, 7: 59, 8: 73, 9: 22, 10: 44
• Total number of probes: 11
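The double-hashing example can be sketched in Java as well (class name illustrative). Because d(k) = 7 - k mod 7 is never 0 and N = 13 is prime, the probe sequence always advances and can visit every cell.

```java
// Double hashing for the example above: h(k) = k mod 13, d(k) = 7 - k mod 7.
public class DoubleHash {
    static final int N = 13;
    static Integer[] table = new Integer[N];
    static int probes = 0;

    static void insert(int k) {
        int i = k % N;            // h(k): first probe location
        int step = 7 - k % 7;     // d(k): offset used after a collision
        probes++;
        while (table[i] != null) {
            i = (i + step) % N;   // jump by the key-specific offset
            probes++;
        }
        table[i] = k;
    }

    public static void main(String[] args) {
        for (int k : new int[]{18, 41, 22, 44, 59, 32, 31, 73}) insert(k);
        System.out.println("probes = " + probes);  // prints "probes = 11"
    }
}
```

Compare with linear probing's 19 probes on the same keys: the varying offset breaks up the cluster around cells 5-11.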
6. Solution #3: Bucket Hashing
• The previous solutions use open addressing:
• all items go into the array itself
• In bucket hashing, an array cell points to a linked list of the items that all hash to that cell
• also called chaining
Chaining is generally faster than linear probing: • searching only examines items that hash to the same table location • With linear probing and double hashing, the number of table items (n) is limited to the table size (N), whereas the linked lists in chaining can keep growing. • To delete an element, just erase it from its list.
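A minimal chained table in Java (the class name ChainedTable is illustrative): each cell holds a linked list, so the number of items n can exceed the table size N, and deletion is just list removal.

```java
import java.util.LinkedList;

// Bucket hashing (chaining): each array cell points to a linked list
// of all the items that hash to that cell.
public class ChainedTable {
    static final int N = 13;
    @SuppressWarnings("unchecked")
    static LinkedList<String>[] buckets = new LinkedList[N];

    static int hash(String k) { return (k.hashCode() & Integer.MAX_VALUE) % N; }

    static void insert(String k) {
        int i = hash(k);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        if (!buckets[i].contains(k)) buckets[i].addFirst(k);
    }

    static boolean contains(String k) {
        // only items that hash to the same cell are examined
        LinkedList<String> b = buckets[hash(k)];
        return b != null && b.contains(k);
    }

    static boolean delete(String k) {
        // to delete an element, just erase it from its list
        LinkedList<String> b = buckets[hash(k)];
        return b != null && b.remove(k);
    }
}
```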
7. Table Resizing
• As the number of items in the hash table increases, search speed goes down.
• So increase the hash table size when the number of items in the table reaches a specified percentage of its size. This works for chaining as well as open addressing.
Create a new table with the specified size and cycle through the items in the original table. • For each item, use the hash() value modulo the new table size to hash to a new index. • Insert the item at the front of the linked list.
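A sketch of that resize procedure for a chained table (class and method names illustrative): every item in the old table is re-hashed modulo the new size and inserted at the front of its new list.

```java
import java.util.LinkedList;

// Resize a chained hash table: cycle through the old table and re-hash
// each item into a new, larger bucket array.
public class Rehash {
    static int hash(int k, int size) { return (k & Integer.MAX_VALUE) % size; }

    static void insert(LinkedList<Integer>[] table, int k) {
        int i = hash(k, table.length);
        if (table[i] == null) table[i] = new LinkedList<>();
        table[i].addFirst(k);       // insert at the front of the linked list
    }

    static boolean contains(LinkedList<Integer>[] table, int k) {
        LinkedList<Integer> b = table[hash(k, table.length)];
        return b != null && b.contains(k);
    }

    static LinkedList<Integer>[] resize(LinkedList<Integer>[] old, int newSize) {
        @SuppressWarnings("unchecked")
        LinkedList<Integer>[] fresh = new LinkedList[newSize];
        for (LinkedList<Integer> bucket : old)      // cycle through old table
            if (bucket != null)
                for (int k : bucket)
                    insert(fresh, k);               // hash modulo the new size
        return fresh;
    }
}
```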
8. Java's hashCode()
• public int hashCode() is defined in Object
• by default it returns a value based on the object's memory address
• hashCode() does not know the size of the hash table
• the returned value must be adjusted, e.g. hashCode() % N
• hashCode() can be overridden in your classes
Coding your own hashCode()
• Your hashCode() must:
• always return the same value for the same item
• it can't use random numbers, or the time of day
• always return the same value for equal items
• if o1.equals(o2) is true, then o1.hashCode() and o2.hashCode() must return the same number
A good hashCode() should: • be fast to evaluate • produce uniformly distributed hash values • this spreads the hash table indices around the table, which helps minimize collisions • not assign similar hash values to similar items
String Hash Function
• In the majority of hash table applications, the key is a string.
• combine the string's characters to form an integer, e.g. for a class with a String field s:

public int hashCode() {
    int hash = 0;
    for (int i = 0; i < s.length(); i++)
        hash = 31*hash + s.charAt(i);
    return hash;
}
String strA = "and";
String strB = "uncharacteristically";
String strC = "algorithm";

hashVal = strA.hashCode();   // hashVal = 96727
hashVal = strB.hashCode();   // hashVal = -2112884372
hashVal = strC.hashCode();   // hashVal = 225490031

• A hash function might overflow and return a negative number. The following code ensures that the table index is nonnegative:

tableIndex = (hashVal & Integer.MAX_VALUE) % tableSize;
Time24 Hash Function
• For the Time24 class, the hash value for an object is its time converted to minutes.
• Since each hour is 60 minutes more than the last, and minute is in the range 0-59, each valid time gets a unique hash.

public int hashCode() {
    return hour*60 + minute;
}
9. Hash Tables in Java • Java provides HashSet, Hashtable and HashMap in java.util • HashSet is a set • Hashtable and HashMap are maps
• Hashtable is synchronized; it can be accessed safely from multiple threads, and it has a rehash() method for resizing the table
• HashMap is newer, faster, and usually better, but it is not synchronized
• Both use bucket hashing (chaining) internally, and both have a remove() method
Hash Table Operations • HashSet, Hashtable and HashMap have no-argument constructors, and constructors that take an integer table size. • HashSet has add(), contains(), remove(), iterator(), etc.
Hashtable and HashMap include: • public Object put(Object key, Object value) • returns the previous value for this key, or null • public Object get(Object key) • public void clear() • public Set keySet() • dynamically reflects changes in the hash table • many others
Using HashMap
• A HashMap with Strings as keys and values can implement a telephone book:
"Charles Nguyen" → "(531) 9392 4587"
"Lisa Jones" → "(402) 4536 4674"
"William H. Smith" → "(998) 5488 0123"
Coding a Map

HashMap<String, String> phoneBook = new HashMap<String, String>();
phoneBook.put("Charles Nguyen", "(531) 9392 4587");
phoneBook.put("Lisa Jones", "(402) 4536 4674");
phoneBook.put("William H. Smith", "(998) 5488 0123");

String phoneNumber = phoneBook.get("Lisa Jones");
System.out.println(phoneNumber);    // prints: (402) 4536 4674
HashMap<String, String> h =
    new HashMap<String, String>(100 /* capacity */, 0.75f /* load factor */);
h.put("WA", "Washington");
h.put("NY", "New York");
h.put("RI", "Rhode Island");
h.put("BC", "British Columbia");
Capacities and Load Factors • HashMaps round capacities up to powers of two • e.g. 100 --> 128 • default capacity is 16; load factor is 0.75 • The load factor is used to decide when it is time to double the size of the table • just after you have added 96 elements in this example • 128 * 0.75 == 96 • Hashtables work best with capacities that are prime numbers.
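The capacity rounding and resize threshold can be demonstrated with a small sketch. The class name CapacityDemo is illustrative; roundUp mimics the power-of-two rounding that HashMap performs internally, and the threshold is capacity × load factor.

```java
// Demo: HashMap-style capacity rounding and the resize threshold.
public class CapacityDemo {
    // Round a requested capacity up to the next power of two
    static int roundUp(int cap) {
        int n = 1;
        while (n < cap) n <<= 1;
        return n;
    }

    public static void main(String[] args) {
        int requested = 100;
        int capacity = roundUp(requested);            // 100 -> 128
        float loadFactor = 0.75f;
        int threshold = (int) (capacity * loadFactor); // 128 * 0.75 = 96
        System.out.println(capacity + " " + threshold); // prints: 128 96
    }
}
```

So a map created with capacity 100 actually gets 128 cells, and the table doubles just after the 96th element is added.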
10. Efficiency
• Hash tables are efficient
• until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3
• the cost of insertion and access is O(1) on average
• Even when the table is fairly full (leading to longer searches), efficiency usually remains quite high.
• Hash tables work best when the table size (N) is a prime number; HashMaps instead use powers of 2 for N.
• In the worst case, searches, insertions and removals on a hash table take O(n) time (n = number of items)
• the worst case occurs when all the keys inserted into the map collide
• The load factor λ = n/N affects the performance of a hash table (N = table size)
• for linear probing, 0 ≤ λ ≤ 1
• for chaining with lists, it is possible that λ > 1
• Assume that the hash function uniformly distributes indices around the hash table:
• we can expect λ = n/N elements in each cell
• on average, an unsuccessful search makes λ comparisons before arriving at the end of a list and returning failure
• mathematical analysis shows that the average number of probes for a successful search is approximately 1 + λ/2
• so keep λ small!
11. Ordered/Unordered Sets and Maps • Use an ordered set or map if an iteration should return items in order • average search time: O(log2n) • Use an unordered set or map with hashing when fast access and updates are needed without caring about the ordering of items • average search time: O(1)
Timing Tests • SearchComp.java: • read a file of 25025 randomly ordered words and insert each word into a TreeSet and a HashSet. • report the amount of time required to build both data structures • shuffle the file input and time a search of the TreeSet and HashSet for each shuffled word • report the total time required for both search techniques
• Ford & Topp's HashSet build and search times are much better than TreeSet's.
SearchJComp.java
• replaces Ford & Topp's TreeSet and HashSet with the ones in the JDK
• the JDK HashSet and TreeSet are much the same speed for searching