Hash Tables and Sets

Hash Tables and Sets Lecture 3

Sets • A set is simply a collection of elements • Unlike lists, elements are not ordered • Very abstract, general concept with broad usefulness: • The set of all Google search queries from the past 24 hours • The set of all photos with your face in them • The set of all files in a folder • How are sets represented in computers? • Consider the following problem: • We want to store a large set of approx. 10 million random numbers • The following operations are happening constantly: • Add – inserting a new number into the set • Delete – removing an existing element from the set • Lookup – checking if a new random number is in the set

Representing Sets • Suppose we use an ArrayList for this heavily churning set: • Add, Delete, and Lookup are all O(n) (Add must first search the list to avoid storing a duplicate) • Suppose the ArrayList is sorted: • Lookup is O(log n) using binary search • Add/Delete are still O(n), because elements must be shifted • Cleverer algorithms: • Self-balancing trees: • Lookup, Add, and Delete are guaranteed O(log n) • Hash tables: • Lookup, Add, and Delete are worst-case O(n) • … but on average O(1)
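The sorted-ArrayList option can be sketched as follows. This is only an illustration: the class name SortedListSet and its method names are made up for this example. It relies on the standard Collections.binarySearch, which returns a non-negative index when the key is found and (-(insertion point) - 1) otherwise.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a set of ints kept in a sorted ArrayList.
final class SortedListSet {
    private final List<Integer> items = new ArrayList<>();

    // O(log n): binary search over the sorted list.
    boolean lookup(int x) {
        return Collections.binarySearch(items, x) >= 0;
    }

    // O(n): finding the position is fast, but inserting shifts later elements.
    boolean add(int x) {
        int pos = Collections.binarySearch(items, x);
        if (pos >= 0) {
            return false; // already present, sets hold no duplicates
        }
        items.add(-pos - 1, x); // decode the insertion point
        return true;
    }

    // O(n): removing by index also shifts later elements.
    boolean delete(int x) {
        int pos = Collections.binarySearch(items, x);
        if (pos < 0) {
            return false;
        }
        items.remove(pos); // pos is an int, so this is remove(int index)
        return true;
    }
}
```

The search is cheap, but every Add and Delete still pays for moving elements, which is why this approach does not suit a heavily churning set.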

Using Buckets • Let’s go back to ArrayLists, but use a different approach: • Create 2 ArrayLists • Even numbers go in the first list • Odd numbers go in the second list • Now, Add/Delete/Lookup only take half the work: • Check if the number is even or odd • Get the right ArrayList • Search through about 5 million entries instead of 10 million • This is promising! • … but still O(n)

Using Buckets • Yet another approach: • Instead of two different ArrayLists, let’s use 4 • Multiples of 4 go in the first list • Multiples of 4 have the property (x % 4) == 0 • If (x % 4) == 1, then x goes in the second list • If (x % 4) == 2, then x goes in the third list • If (x % 4) == 3, then x goes in the fourth list • Now, Add/Delete/Lookup only take ¼ as much work: • Calculate the number mod 4 • Find the right list • Search through 2.5 million elements instead of 10 million • This is even better! • … but still O(n)

Using Buckets • Yet another approach: use 10 million buckets! • If the numbers are truly randomly distributed, then: • Some buckets may be empty • Some buckets may have 2 or even 100 elements • On average, each bucket has close to 1 element • Suddenly, Add/Delete/Lookup become very cheap – O(1) on average • As long as we scale up the number of buckets to match the amount of data, we can maintain average O(1) lookup • This is a hash table!
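The progression from 2 to 4 to many buckets is the same code with a different bucket count. Below is a minimal sketch of that idea; the class name ModBuckets and its methods are hypothetical, and Math.floorMod is used in place of % so that negative numbers still map to a valid bucket.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a set of ints split across a fixed number of ArrayLists.
final class ModBuckets {
    private final List<List<Integer>> buckets = new ArrayList<>();

    ModBuckets(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Pick the list for x: x mod (number of buckets), never negative.
    private List<Integer> bucketFor(int x) {
        return buckets.get(Math.floorMod(x, buckets.size()));
    }

    // Only one bucket is searched, not the whole data set.
    boolean lookup(int x) {
        return bucketFor(x).contains(x);
    }

    boolean add(int x) {
        if (lookup(x)) {
            return false; // already in the set
        }
        bucketFor(x).add(x);
        return true;
    }

    boolean delete(int x) {
        // Integer.valueOf forces remove(Object) rather than remove(int index).
        return bucketFor(x).remove(Integer.valueOf(x));
    }

    int bucketSize(int index) {
        return buckets.get(index).size();
    }
}
```

With new ModBuckets(2) this is the even/odd split, with new ModBuckets(4) it is the mod-4 split, and with a bucket count close to the number of elements each search touches about one element on average.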

Hash Functions • In our example, we were only storing integers • We can use the same approach to store arbitrary data, as long as one thing is provided: • A hash function • What is a hash function? • A function that converts any data into an integer • This integer determines which bucket the data is stored in • The hash function must ensure fairly even distribution in the table. More on this later.

Example Hash Function • Suppose we wish to store a set of strings instead of integers • We need a hash function • Here’s a simple one: • ‘a’ = 1, ‘b’ = 2, ‘c’ = 3, …, ‘z’ = 26 • Sum the value of each letter • “asdf”.hashCode() (under this scheme, not Java’s built-in String.hashCode()) • = ‘a’ + ‘s’ + ‘d’ + ‘f’ • = 1 + 19 + 4 + 6 • = 30 • “asdf” goes in the 30th bucket
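The letter-sum scheme above could be written as below. The class and method names are invented for this sketch, and it assumes the input contains only lowercase letters ‘a’ to ‘z’, rejecting anything else.

```java
// Hypothetical sketch of the slide's letter-sum hash function.
final class LetterSumHash {
    // 'a' = 1, 'b' = 2, ..., 'z' = 26; the hash is the sum over all letters.
    static int hash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + c);
            }
            sum += c - 'a' + 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("asdf")); // 1 + 19 + 4 + 6 = 30
    }
}
```

In a table with fewer buckets than possible hash values, the bucket index would be the hash taken modulo the number of buckets, as in the earlier integer examples.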

Hash Collisions • This hash function has some problems: • It only deals with English letters • We can solve this by using the ASCII or Unicode value of the character instead of its index in the English alphabet • It is prone to collisions • A hash collision is when two or more distinct values have the same hash code • In the example hash function, all anagrams collide: • “least” → 12 + 5 + 1 + 19 + 20 = 57 • “steal” → 19 + 20 + 5 + 1 + 12 = 57 • “stale” → 19 + 20 + 1 + 12 + 5 = 57 • Therefore, this hash table would be very bad for storing sets of anagrams! • It would degenerate into a single ArrayList, since only one bucket would be used.
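One common way to stop anagrams from colliding is to make the hash depend on character order, for example by multiplying the running total by a constant before adding each character. The sketch below uses the multiplier 31, which matches the formula documented for Java's String.hashCode(); the class name is hypothetical, and this is one possible improvement rather than the method the lecture prescribes.

```java
// Hypothetical sketch: an order-sensitive hash using Unicode character values.
final class OrderSensitiveHash {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // Earlier characters are multiplied by 31 more times than later
            // ones, so rearranging the letters changes the result.
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("ab")); // 97 * 31 + 98 = 3105
        System.out.println(hash("ba")); // 98 * 31 + 97 = 3135
        // Collisions are reduced, not eliminated: "Aa" and "BB" both give 2112.
        System.out.println(hash("Aa") == hash("BB"));
    }
}
```

Because no hash function can rule out collisions entirely, the table still needs a way to hold several elements in one bucket.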

Generalizing • What exactly is a hash table? • Given elements that have a hash function, hash tables are just arrays! • Each array element is an ArrayList in order to resolve collisions • Number of buckets is proportional to number of elements in the set • Exploit a time-memory tradeoff to get quick lookup times • Array is resized when hash table gets too “full” • Load factor: The ratio of filled hash table slots to total slots • Load factor is 0.0 when the hash table is empty and 1.0 when every bucket has at least one element • When load factor reaches a certain value, 0.75 in our case, the array gets larger to maintain sparseness • Hash tables can get much more complicated than this, but the fundamentals remain the same.
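The pieces described above (an array of ArrayList buckets, a load factor, and resizing at 0.75) fit together roughly as in the sketch below. It is not the lab's SimpleHashTable: the class name ChainedSet, the starting size of 8, and the doubling strategy are all assumptions made for illustration. The load factor follows the definition given here, filled buckets divided by total buckets, and null elements are not supported.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a hash set with one ArrayList per bucket.
final class ChainedSet<T> {
    private static final double MAX_LOAD = 0.75;   // threshold from the text
    private List<List<T>> buckets = newBuckets(8); // starting size is arbitrary
    private int usedBuckets = 0;                   // buckets with >= 1 element
    private int size = 0;

    private static <E> List<List<E>> newBuckets(int n) {
        List<List<E>> b = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            b.add(new ArrayList<>());
        }
        return b;
    }

    // floorMod keeps the index valid even when hashCode() is negative.
    private List<T> bucketFor(T element, List<List<T>> table) {
        return table.get(Math.floorMod(element.hashCode(), table.size()));
    }

    boolean contains(T element) {
        return bucketFor(element, buckets).contains(element);
    }

    boolean add(T element) {
        List<T> bucket = bucketFor(element, buckets);
        if (bucket.contains(element)) {
            return false;
        }
        if (bucket.isEmpty()) {
            usedBuckets++;
        }
        bucket.add(element);
        size++;
        if ((double) usedBuckets / buckets.size() >= MAX_LOAD) {
            resize();
        }
        return true;
    }

    // Double the table and re-place every element, since indexes change.
    private void resize() {
        List<List<T>> bigger = newBuckets(buckets.size() * 2);
        int used = 0;
        for (List<T> bucket : buckets) {
            for (T element : bucket) {
                List<T> target = bucketFor(element, bigger);
                if (target.isEmpty()) {
                    used++;
                }
                target.add(element);
            }
        }
        buckets = bigger;
        usedBuckets = used;
    }

    int size() { return size; }
    int bucketCount() { return buckets.size(); }
}
```

Every element has to be re-placed during a resize because its bucket index depends on the table length. Many real implementations measure load as elements per bucket instead; the filled-bucket ratio is used here only to match the definition in this lecture.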

The Lab • In this lab, we have implemented a very simple hash table • SimpleHashTable.java • It is so simple that it cannot handle collisions! • Each bucket isn’t an ArrayList – it’s just a single element when full, or “null” if empty • Your task is to modify the code and implement collision resolution • This means that each array slot should be an ArrayList instead of merely an Object

Java Generics • You will see some strange angle-bracket notation: • ArrayList<T>, SimpleHashTable<T> • If parentheses indicate function arguments, then angle brackets indicate type arguments • Type arguments are a way of specifying data structures that work on various types: • ArrayList<String> has: • boolean add(String arg0) • String get(int index) • SimpleHashSet<Integer> has: • void add(Integer arg0) • boolean contains(Integer arg0)
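A short illustration of type arguments in use. The class GenericsDemo and the method lastOf are hypothetical names for this example; ArrayList and its add and get methods are the standard library ones.

```java
import java.util.ArrayList;

final class GenericsDemo {
    // Hypothetical generic method: T is determined by the list passed in.
    static <T> T lastOf(ArrayList<T> list) {
        return list.get(list.size() - 1);
    }

    public static void main(String[] args) {
        ArrayList<String> words = new ArrayList<>();
        words.add("hash");   // here add takes a String
        words.add("table");
        String w = lastOf(words); // T is String, so no cast is needed

        ArrayList<Integer> numbers = new ArrayList<>();
        numbers.add(42);     // here add takes an Integer (the int is autoboxed)
        Integer n = lastOf(numbers);

        System.out.println(w + " " + n);
        // words.add(42); // would not compile: 42 is not a String
    }
}
```

The same ArrayList code serves both element types, and the compiler rejects an element of the wrong type before the program ever runs.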

Operations to Implement • SimpleHashSet.java: • public void add(T element) • public boolean contains(T element) • public boolean remove(T element) • public void clear() • public boolean isEmpty() • public int size() • Some of these may remain unchanged • You will also have to edit the private members and reimplement some private methods
