Hashing & HashMaps
CS-2851
Dr. Mark L. Hornick
Let’s review the worst-case performance characteristics of previously covered data structures
ArrayList (JCF class)
• get() – O(1); constant-time access
• add() – O(n); due to shifting
• contains() – O(n); sequential search
SortedArrayList (uses binary searching)
• get() – O(1); constant-time access
• add() – O(log n + n) → O(n); first find where to add (binary search), then shift elements to the right
• contains() – O(log n); search based on the Splitting Rule (binary search)
LinkedList (JCF class)
• get() – O(n); sequential access
• add() – O(1) once the insertion point is known, but really O(n), because that is how long it takes to find the location to insert
• contains() – O(n); sequential search
BinaryTree
• get() – not supported due to lack of indexing (but do we always need it?)
• add() – O(log n) as long as the tree stays balanced, due to the sorting built into the tree structure; an unbalanced tree can degrade to O(n)
• contains() – O(log n) under the same condition
What about memory usage?
Is there anything faster at everything?
Map definition
• A map is a collection in which each Entry element has two parts
  • a unique key part
  • a value part (which may not be unique)
• Each unique key “maps” to a corresponding value
• Example: Morse code map – each character maps to a (unique) sequence of dots and dashes
• Example: a map of Students, in which each key is the (unique) student ID, and each (non-unique?) value is a reference to the Student object itself
• Example: a phonebook, where each number (each key) maps to a person
[Diagram: an Entry consists of a key and a value]
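The Morse code example can be sketched with the JCF's own java.util.HashMap. The class name MorseMapDemo and the three letters chosen are illustrative assumptions, not part of the original slides.

```java
import java.util.HashMap;
import java.util.Map;

class MorseMapDemo {
    // Builds a small map from characters (unique keys) to Morse sequences (values).
    static Map<Character, String> buildMorseMap() {
        Map<Character, String> morse = new HashMap<>();
        morse.put('S', "...");
        morse.put('O', "---");
        morse.put('E', ".");
        return morse;
    }

    public static void main(String[] args) {
        Map<Character, String> morse = buildMorseMap();
        // Look up the value for each key.
        System.out.println(morse.get('S') + " " + morse.get('O') + " " + morse.get('S'));
        // Putting an existing key again replaces its value; the key stays unique.
        morse.put('E', ".");
        System.out.println(morse.size());
    }
}
```

Each put() stores one Entry; calling put() with a key that is already present does not add a second Entry.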
What is a Key?
• A key is just something that uniquely identifies a particular instance of a value/object
• A key can be a number, a string, or an object, so long as it is unique
• If two values/objects have the same key, then they are (theoretically) equal
  • Only one ID per MSOE student, so if the IDs match, it must (by definition) be the same student
  • If the equals() method comparing two keys returns true, then the objects are equal too, by definition
What if an object doesn’t possess a specific unique attribute?
• Scenario: pretend MSOE IDs didn’t exist
• Can any of the attributes of a student, taken together, be unique?
  • …even though any individual attribute may not exhibit this uniqueness?
• Exercise
A key can be generated from a unique combination of non-unique attributes
• All of an object’s attributes can be used to generate the key
  • That is, the object itself is the key
• Or the key can be generated from just a subset of an object’s attributes
  • Provided that subset is unique
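A minimal sketch of a key built from a subset of attributes. The class StudentKey and its three fields are hypothetical, and the sketch simply assumes that the combination of name, birth date, and hometown is unique, which the slides leave as an exercise.

```java
import java.util.Objects;

// Hypothetical composite key: no single field is unique, the combination is assumed to be.
final class StudentKey {
    private final String name;
    private final String birthDate;  // e.g. "2001-04-17"
    private final String hometown;

    StudentKey(String name, String birthDate, String hometown) {
        this.name = name;
        this.birthDate = birthDate;
        this.hometown = hometown;
    }

    // Two keys are equal when all three attributes match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof StudentKey)) return false;
        StudentKey that = (StudentKey) other;
        return name.equals(that.name)
                && birthDate.equals(that.birthDate)
                && hometown.equals(that.hometown);
    }

    // The hashcode is computed from the same attributes that equals() compares.
    @Override
    public int hashCode() {
        return Objects.hash(name, birthDate, hometown);
    }

    public static void main(String[] args) {
        StudentKey a = new StudentKey("Ann Lee", "2001-04-17", "Milwaukee");
        StudentKey b = new StudentKey("Ann Lee", "2001-04-17", "Milwaukee");
        System.out.println(a.equals(b));
        System.out.println(a.hashCode() == b.hashCode());
    }
}
```

Using the same fields in equals() and hashCode() keeps the two consistent: keys that are equal always produce the same hashcode.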
OK, so what role do keys play in making a faster data structure?
• What if each unique key corresponded to a unique index within an array of Entries?
[Diagram: a key maps to an index; the Entry (key, value) is stored at that index]
Hash definition
• A hash is a transformation of a key into a numeric value that maps to the index of an array (or table)
• This is done in two steps:
  • generate a numeric hashcode from the key (which is not necessarily numeric)
    • If the key is already numeric and unique (like an ID), then the key can be used as the hashcode
  • transform the hashcode into an array index
[Diagram: Key → hashcode → index]
HashMap definition
• A HashMap<E> is an array-based collection of Entry<E> elements, each with
  • a value part (which could be anything)
  • a unique key part (somehow derived from the value)
• Each Entry is at a specific index in the array, where the index is determined from the hashcode of the key
• Example: a map of Students, in which each key is the (unique) student ID, and each (non-unique?) value is a reference to the Student object itself
[Diagram: an Entry<E> consists of a key and a value of type E]
How do you generate a hashcode?
• In Java, all classes have a built-in hashCode() method defined in the Object class
[Diagram: Key → hashcode]
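Classes that don’t override hashCode() inherit the Object class’s hashCode() method
• Which typically returns a value derived from the object’s identity (historically, its memory address), not from its contents
• Is this a repeatable hashcode? No! Two objects with identical contents generally get different hashcodes, and the value can change from one program run to the next
[Diagram: Object memory address → hashcode]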
A given key should always generate the same hashcode
• So that the hashcode computation can be repeated at any time, and always result in the same value
  • …and therefore, the same index
• Q: If keys are unique, does this guarantee that the hashcodes generated from the keys are also unique?
[Diagram: Key → hashcode → index]
Exercise
• Generate a hashcode from a String of characters
• What approach should you use?
How do you generate a hashcode?
• In Java, many classes override Object’s hashCode() method in order to generate hashcodes that are repeatable and as unique as possible
• Integer class
  • Integer’s hashCode() method simply returns the underlying int value
• String class
  • Look at the javadoc for String.hashCode
[Diagram: Key → hashcode]
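The String.hashCode javadoc describes the hashcode as the polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed with int arithmetic. A minimal sketch of that formula follows; the class and method names (StringHashDemo, polynomialHash) are made up for illustration.

```java
class StringHashDemo {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] by Horner's rule.
    // int overflow simply wraps around, which is fine for a hashcode.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // The hand-written version should agree with the built-in one.
        System.out.println(polynomialHash("Anne") == "Anne".hashCode());
        // Integer's hashcode is just the wrapped int value.
        System.out.println(Integer.valueOf(42).hashCode());
    }
}
```

Because the result depends only on the characters, the same String always yields the same hashcode, which is exactly the repeatability the earlier slides ask for.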
Writing your own hashCode()
• A key should uniquely identify an object
• Hashcodes generated from keys should be as unique as possible
  • to avoid collisions
• Depending on the hashcode algorithm, different keys can generate the same hashcode
[Diagram: Key → hashcode → index]
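As a concrete instance of the last bullet, the two different Strings "Aa" and "BB" produce the same value from String.hashCode(): 65*31 + 97 and 66*31 + 66 both equal 2112. The helper name sameHashcode is an illustrative assumption.

```java
class SameHashcodeDemo {
    // True when two keys produce the same hashcode, whether or not the keys are equal.
    static boolean sameHashcode(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".equals("BB"));        // the keys differ
        System.out.println(sameHashcode("Aa", "BB")); // yet the hashcodes match
        System.out.println("Aa".hashCode());
    }
}
```

So unique keys do not guarantee unique hashcodes; a good hashCode() only makes such matches rare.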
How do you transform a hashcode into an array index?
• Assume you have an array with length = 1024
• An array index in the range 0…1023 can be computed as follows using modulo arithmetic:
  int index = hashCode(123456789) % 1024;
• The resulting index = 933
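A hedged sketch of the modulo step. It assumes the hashcode is the int key itself (as Integer.hashCode() returns); under that assumption 123456789 lands at index 277, so the 933 on the slide presumably comes from a different hashCode() function that the slides do not specify. The sketch also uses Math.floorMod rather than %, because Java's % gives a negative result for a negative hashcode. The names IndexDemo and indexFor are illustrative.

```java
class IndexDemo {
    // Maps any int hashcode to an index in the range 0..tableLength-1.
    // Math.floorMod never returns a negative value for a positive tableLength.
    static int indexFor(int hashcode, int tableLength) {
        return Math.floorMod(hashcode, tableLength);
    }

    public static void main(String[] args) {
        int hashcode = Integer.valueOf(123456789).hashCode(); // the key itself
        System.out.println(indexFor(hashcode, 1024)); // 277
        System.out.println(-1 % 1024);                // -1, not a valid index
        System.out.println(indexFor(-1, 1024));       // 1023
    }
}
```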
More hashing examples (for a table 1024 in length)
• 123456789 indexes to 933
• 428671256 indexes to 500
• 884739816 indexes to 234
Exercise
• What are the index values xxx, yyy, and zzz?
[Figure: a hash table with indices 0 to 1023 and size = 3; the entries Anne, Susan, and Ed are stored at indices xxx, yyy, and zzz; all other slots are null]
Hashing can result in Collisions
• 123456789 indexes to 933
• 428671256 indexes to 500
• 884739816 indexes to 234
• 403578063 also indexes to 933
• When two different keys yield the same index (even from different hashcodes), that is called a collision
• Keys that yield the same index are called synonyms
• Special handling is required
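To see synonyms appear, the sketch below groups a few keys by the index they hash to. The keys 5, 1029, 2053, and 7 are hypothetical (not the ones on the slide), the hashcode is assumed to be the int key itself, and the names SynonymDemo and groupByIndex are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SynonymDemo {
    // Groups keys by table index; any group with more than one key holds synonyms.
    static Map<Integer, List<Integer>> groupByIndex(int[] keys, int tableLength) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int key : keys) {
            // Assumption: the hashcode of an int key is the key itself.
            int index = Math.floorMod(Integer.hashCode(key), tableLength);
            groups.computeIfAbsent(index, i -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 5, 1029 (= 1024 + 5) and 2053 (= 2048 + 5) have different hashcodes
        // but share index 5 in a table of length 1024.
        System.out.println(groupByIndex(new int[] {5, 1029, 2053, 7}, 1024));
    }
}
```

All three synonyms have distinct hashcodes, which shows that a collision is about the final index, not only about equal hashcodes.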
Hashing is inefficient when there are a lot of collisions
• Ideally, we want the hashing algorithm to generate indices “sprinkled” randomly throughout the underlying table
• The Uniform Hashing Assumption assumes
  • Each key is equally likely to hash to any one of the table addresses, independently of where the other keys have hashed
Even if this assumption is true, collisions still occur
• This is due to the finite set of indices in a table
• The bigger the table, the less likely a collision is to occur
• But tables cannot be made infinitely large
  • An infinite number of keys cannot be mapped into a finite set of indices
• So collision handlers have to be implemented