Sets, Maps and Hash Tables


Sets
• We have learned that different data structures have different advantages – and drawbacks
• Choosing the proper data structure depends on typical usage patterns
• Array- and list-oriented data structures are appropriate when the order of elements matters – but that is not always the case
RHS – SWC

Sets
• A Set is a data structure which can hold an unordered collection of elements
• Not having to worry about ordering can improve performance of other operations
• On a Set, we want to be able to
  • Insert an element
  • Delete an element
  • Check if a given element is in the Set

Sets
    public interface Set<T> {
        void add(T element);
        void remove(T element);
        boolean contains(T element);
        Iterator<T> iterator();
    }

Sets
• It turns out that insertion, deletion and check for containment can be done in O(log(n)), or even faster!
• Depends on the underlying implementation of the interface
• In Java, implementation is either
  • HashSet (based on Hash Tables)
  • TreeSet (based on Trees)
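
The three operations can be tried against both library implementations. The sketch below is only an illustration: the class name SetBasics and the method exercise are made up for this example, and it uses java.util.Set rather than the simplified interface on the previous slide.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class SetBasics {
    // Runs insert, containment check and delete on whichever Set is passed in.
    // Returns { containedBeforeRemove, containedAfterRemove, exactlyOneElementLeft }.
    static boolean[] exercise(Set<String> set) {
        set.add("Volvo");
        set.add("Fiat");
        set.add("Volvo");                        // already present, so the Set is unchanged
        boolean before = set.contains("Fiat");
        set.remove("Fiat");
        boolean after = set.contains("Fiat");
        return new boolean[] { before, after, set.size() == 1 };
    }

    public static void main(String[] args) {
        // The calling code is identical for both implementations.
        System.out.println(Arrays.toString(exercise(new HashSet<>())));
        System.out.println(Arrays.toString(exercise(new TreeSet<>())));
    }
}
```

Because exercise only refers to the Set interface, it does not need to know which implementation it is given.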

Sets
• A Set iterator is "simpler" than e.g. a List iterator
• Elements will occur in "random" order
• No add method – we just call add on the Set itself
• No previous method – does not make sense
• The Set iterator does however have a remove method (why?)
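
One possible answer to the "why?" is that removing through the iterator is the safe way to delete elements while a traversal is in progress. A minimal sketch, assuming java.util.Set and a hypothetical helper named dropShort:

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class IteratorRemoval {
    // Removes every word shorter than minLength while iterating over the Set.
    static Set<String> dropShort(Set<String> words, int minLength) {
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            String w = it.next();
            if (w.length() < minLength) {
                // Removes the element most recently returned by next().
                // Calling words.remove(w) here instead would typically make
                // the iterator fail with a ConcurrentModificationException.
                it.remove();
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>();
        words.add("a");
        words.add("hash");
        words.add("of");
        words.add("table");
        System.out.println(dropShort(words, 3));   // element order is not guaranteed
    }
}
```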

Sets – Quality tip
• When using a Set, we must choose a specific implementation (HashSet or TreeSet)
• However, the definition should look like:
    Set<Car> cars = new HashSet<Car>();

Sets – Quality tip
    Set<Car> cars = new HashSet<Car>();
• Why…? We should in general only refer to the interface, not the implementation
• Easy to switch implementation!

Maps
• A Map is a data structure which stores associations between
  • A collection of keys
  • A collection of values
• All keys map to a value
• Keys are unique (values are not)

Maps
[Figure: keys K1, K2, K3 and K4 mapped to values V1, V2 and V3]

Map
    public interface Map<K,V> {
        void put(K key, V value);
        V get(K key);
        void remove(K key);
        Set<K> keySet();
    }

Map
• The keySet method returns a Set containing all keys in the Map
• You can then iterate through this Set, and look up each key, in order to get all values stored in the Map

Map
    Map<String,Car> carMap = new HashMap<String,Car>();
    ...
    Set<String> regNumbers = carMap.keySet();
    for (String regNo : regNumbers) {
        Car aCar = carMap.get(regNo);
        ... // Do something with the Car object
    }
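
A runnable variant of the same pattern is sketched below. To keep it self-contained it stores an Integer mileage instead of a Car object; the class and method names are made up for this example. The second method uses entrySet from java.util.Map, which is not part of the simplified interface shown earlier but avoids a separate lookup per key.

```java
import java.util.HashMap;
import java.util.Map;

class MapIteration {
    // Visits all values by iterating over the keys and looking each one up.
    static int totalViaKeySet(Map<String, Integer> mileageByRegNo) {
        int total = 0;
        for (String regNo : mileageByRegNo.keySet()) {
            total += mileageByRegNo.get(regNo);
        }
        return total;
    }

    // Same result, but iterates over key/value pairs directly.
    static int totalViaEntrySet(Map<String, Integer> mileageByRegNo) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : mileageByRegNo.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> mileage = new HashMap<>();
        mileage.put("AB12345", 1000);
        mileage.put("CD67890", 2500);
        System.out.println(totalViaKeySet(mileage));    // 3500
        System.out.println(totalViaEntrySet(mileage));  // 3500
    }
}
```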

Hash Tables
• A Set and a Map are both abstract data types – we need a concrete implementation in order to use them
• In the Java library, two implementations of each are available:
  • Sets: HashSet, TreeSet
  • Maps: HashMap, TreeMap

Hash Tables
• The implementations HashSet and HashMap are based on a Hash Table
• A Hash Table is based on the below ideas:
  • Create an array of length N, which can store objects of some type T
  • Find a mapping from T to the interval [0; N-1] (a Hash Function f)
  • Store an object t of type T in the position f(t)

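Hash Tables
[Figure: array with positions 0 to 4. f(Car1) = 3, f(Car2) = 0, f(Car3) = 2, so Car2 is stored at position 0, Car3 at position 2 and Car1 at position 3; positions 1 and 4 are empty]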

Hash Tables
• A Hash Table is thus "almost" an array
• Instead of having an index directly available, we must calculate it
• If calculation can be done in constant time, then all basic operations (insert, delete, lookup) can be done in constant time!
• Better than tree-based implementations, which have O(log(N))

Hash Tables
• However, there are some issues:
  • How do we define a good mapping from the objects to [0; N-1]?
  • What happens if we try to store two objects at the same position?

Hash Functions
• Before finding a good mapping – i.e. a good hash function – we must consider the size of the array
• For good performance, the array should be at least as large as the maximal number of objects stored
• Rule of thumb is about 30 % larger
• Size should be a prime number (???)

Hash Functions
• What if the expected number of objects is unknown in advance?
• We can expand a hash table dynamically
• If the hash table is running out of space, double the capacity
• Start out with a reasonably large array (space is cheap…)

Hash Functions
• Having handled the choice of N, how do we define a proper hash function?
• Properties of a hash function:
  • Must map all objects of type T to the interval [0; N-1]
  • Should map objects as uniformly as possible to the interval [0; N-1]

Hash Functions
• We can enforce the mapping to [0; N-1] by using the modulo operator: f(t) = g(t) % N
• g(t) can then produce any integer value (in Java, % gives a negative result for a negative g(t), so Math.floorMod(g(t), N) is the safer choice)
• How do we achieve a uniform distribution?
• Theory for this is complicated, but there are some general rules to follow
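
The modulo step can be sketched as below. BucketIndex and indexFor are hypothetical names used only for this illustration; Math.floorMod is a standard library method that always returns a value in [0; N-1] for a positive N, whereas the % operator keeps the sign of its left operand.

```java
class BucketIndex {
    // Maps any object to an array position in [0; n-1], assuming n > 0.
    static int indexFor(Object t, int n) {
        return Math.floorMod(t.hashCode(), n);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);                 // -2, not a valid array index
        System.out.println(Math.floorMod(-7, 5));   // 3
        System.out.println(indexFor("AB12345", 13));
    }
}
```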

Hash Functions
• A good hash function should be "almost random", but deterministic
• "Almost random" – values are well distributed in the interval
• Deterministic – always produce the same output for the same input

Hash Functions
• In Java, all objects have a hashCode method
  • Defined in Object class
  • Can be overridden
  • Returns an integer (the Hash Code)
• We must use modulo on the value ourselves

Hash Functions
• Hash function for integers:
  • The number itself…
• Hash function for strings:
    final int HASH_MULTIPLIER = 31;
    int h = 0;
    for (int i = 0; i < s.length(); i++)
        h = (HASH_MULTIPLIER * h) + s.charAt(i);
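
The string algorithm can be wrapped in a method and checked by hand. StringHash is a made-up class name for this sketch. For "abc" the steps are h = 97, then 97 * 31 + 98 = 3105, then 3105 * 31 + 99 = 96354, which matches what the library method String.hashCode returns, since it is documented to use the same formula.

```java
class StringHash {
    // Combines the characters of s, multiplying by 31 before adding each one.
    static int hash(String s) {
        final int HASH_MULTIPLIER = 31;
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (HASH_MULTIPLIER * h) + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc"));        // 96354
        System.out.println("abc".hashCode());   // 96354
    }
}
```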

Hash Functions
• Hash code for an object can be calculated by combining hash codes for instance fields
• Combine values in a way similar to the algorithm used to find string hash codes

Hash Functions
    public int hashCode() {
        final int MULTIPLIER = 31;
        int h1 = regNo.hashCode();
        int h2 = mileage;
        int h3 = model.hashCode();
        int h = h1*MULTIPLIER + h2;
        h = h*MULTIPLIER + h3;
        return h;
    }

Hash Functions
• But wait…what about numeric overflow?
• We multiply a "random" integer value by a number…?
• Does not really matter…
• As long as the algorithm is deterministic, overflow is not a problem
• Just helps "scrambling" the value

Hash Functions
• Common pitfalls:
  • Remember to define a hashCode method
  • If you forget, the hashCode implementation in Object is used
  • It is based on object identity (typically derived from the memory location), not on field values
  • Two distinct objects with the same values in their instance fields will then generally produce different hash codes…

Hash Functions
• Common pitfalls:
  • The hashCode method must be "compatible" with your equals method
  • If a.equals(b) it must hold that a.hashCode() == b.hashCode()
  • If not, a hash-based Set may store both objects, so duplicates are allowed!
  • The reverse condition is not required; two different objects may have the same hash code

Hash Functions
• In general, you must remember to:
  • Either define both the hashCode and the equals method
  • Or define neither of them!
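
A class that defines both methods consistently might look like the sketch below. The fields regNo, mileage and model come from the earlier hashCode slide; the constructor, the equals body and the CarSetDemo class are assumptions made for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

class Car {
    private final String regNo;
    private final int mileage;
    private final String model;

    Car(String regNo, int mileage, String model) {
        this.regNo = regNo;
        this.mileage = mileage;
        this.model = model;
    }

    // Two cars are equal when all three fields are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Car)) return false;
        Car c = (Car) other;
        return regNo.equals(c.regNo) && mileage == c.mileage && model.equals(c.model);
    }

    // Uses exactly the fields that equals compares, so equal cars get equal hash codes.
    @Override
    public int hashCode() {
        final int MULTIPLIER = 31;
        int h = regNo.hashCode();
        h = h * MULTIPLIER + mileage;
        h = h * MULTIPLIER + model.hashCode();
        return h;
    }
}

class CarSetDemo {
    // Adds two equal cars; the Set should keep only one of them.
    static int distinctCount() {
        Set<Car> cars = new HashSet<>();
        cars.add(new Car("AB12345", 42000, "Golf"));
        cars.add(new Car("AB12345", 42000, "Golf"));
        return cars.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctCount());   // 1
    }
}
```

If the hashCode method were left out, the two cars would most likely land in different positions and the Set would report a size of 2.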

Handling collisions
• Even with a good hash function, we will still experience collisions
• Collision: two different objects t1 and t2 have the same hash code
• We will then try to store both objects in the same position in the array
• Now what…?

Handling collisions
• What we store in each position in the array is not the objects themselves, but a linked list of objects
• Objects with the same hash code h are stored in the linked list in position h
• With a good hash function, the average length of non-empty lists is less than 2

Handling collisions
[Figure: array with positions 0 to 4, each position holding a linked list; the objects Car1 to Car6 are distributed over these lists]

Handling collisions
• Basic operations (insert, delete, lookup) follow this structure:
  • Calculate hash code for the object
  • Find the corresponding position in the array
  • Insert: Insert element at the end of list
  • Delete/Lookup: Iterate through list until element is found, or end of list is reached
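
The steps above can be sketched as a small hash set with one linked list per array position. This is a teaching sketch and not how java.util.HashSet is implemented; the class name ChainedHashSet, the use of an ArrayList of LinkedLists in place of a raw array, and the rule of doubling once there are more elements than positions are all assumptions made here.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedHashSet<T> {
    private List<LinkedList<T>> buckets;   // one linked list per position
    private int size;

    // capacity is assumed to be at least 1
    ChainedHashSet(int capacity) {
        buckets = newBuckets(capacity);
    }

    private static <E> List<LinkedList<E>> newBuckets(int n) {
        List<LinkedList<E>> b = new ArrayList<>(n);
        for (int i = 0; i < n; i++) b.add(new LinkedList<>());
        return b;
    }

    // Calculate the hash code and find the corresponding position.
    private LinkedList<T> bucketFor(T element) {
        return buckets.get(Math.floorMod(element.hashCode(), buckets.size()));
    }

    // Lookup: walk the list at that position until found or end of list.
    boolean contains(T element) {
        return bucketFor(element).contains(element);
    }

    // Insert: append at the end of the list, unless an equal element is already there.
    boolean add(T element) {
        LinkedList<T> bucket = bucketFor(element);
        if (bucket.contains(element)) return false;
        bucket.addLast(element);
        size++;
        if (size > buckets.size()) grow();
        return true;
    }

    // Delete: walk the list and unlink the element if present.
    boolean remove(T element) {
        boolean removed = bucketFor(element).remove(element);
        if (removed) size--;
        return removed;
    }

    int size() {
        return size;
    }

    // Doubles the number of positions and re-inserts every element,
    // since positions depend on the table length.
    private void grow() {
        List<LinkedList<T>> old = buckets;
        buckets = newBuckets(old.size() * 2);
        for (LinkedList<T> bucket : old) {
            for (T e : bucket) bucketFor(e).addLast(e);
        }
    }
}
```

The duplicate check in add relies on equals, and the choice of list relies on hashCode, which is why the two methods have to agree.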

Handling collisions
• Basic operations are thus not done in truly constant time
• However, if a proper hash function is used, running time is constant in practice
• Use hash-based implementations unless special circumstances apply
  • Hard to define hash/equals function
  • More functionality required
