CS305/503, Spring 2009: Hash Tables. Michael Barnathan
Here’s what we’ll be learning: • Review of Assignment 3. • Theory: • Keys and values. • What constitutes a good hash function? • Data Structures: • Hash Tables. • Collision Resolution: • Chaining • Open Addressing / Linear Probing • Perfect Hashing • Cuckoo Hashing
Assignment 3 Grades • Probably the most difficult assignment of the course. • I always implement the assignments myself prior to handing them out to ensure that I’m not assigning something overwhelming. • This time, one person somehow accessed my solution and attempted to submit it back to me. • But I can recognize my own code. • That person received a 0 on the assignment (not shown on graph).
Assignment 3 • This assignment tested several real-world development skills. • It required learning a new class: Map. • But you had everything you needed to do so already, and have since you learned how to use vectors. • Today we’re just going over the theory behind hashing. This is how maps work, not how to use them. • From your perspective, Maps are just arrays that can use things other than numbers as indices. • .get() and .put() are otherwise the same. • Since the keys aren’t contiguous anymore, you need a means of getting a complete list of them: .keySet() • Learning how to use new libraries and classes is a vital development skill. (“You are being prepared to solve problems that do not exist yet”) • Or how will you come in and work on code that others have been developing for years? • Could you work on, say, the next version of Windows if you can’t learn how to use new libraries? Do you think they’re only using the standard classes? • It required basic encapsulation to complete easily, but not a whole architecture. • If you tried to stuff each entry into a string, you needed to do extra work parsing the string so you could display it in the proper format. • But if you created a class for a Word, you could keep the part of speech and definition separate, then print them out as you wished. • Aside from that, it required making tough design decisions. • To sort or search, and how to do each? • TreeMap or HashMap? • Fast range queries or fast individual word lookup? • Finally, it required following a detailed spec, such as your clients will give you (but with more hints). • Ignoring parts of the spec., such as “next” or “-hash”, cost most of the lost points. • Incorrect implementations of these didn’t cost nearly as much as lack of implementation altogether.
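To make the Map mechanics above concrete, here is a minimal sketch of the .put(), .get(), and .keySet() operations the assignment relied on. The Word class and the dictionary entries are made up for illustration; TreeMap is used here so the keys come back in sorted order, but HashMap offers the same interface.

```java
import java.util.Map;
import java.util.TreeMap;

public class DictionaryDemo {
    // A hypothetical record class, like the Word class suggested above.
    static class Word {
        String partOfSpeech;
        String definition;
        Word(String partOfSpeech, String definition) {
            this.partOfSpeech = partOfSpeech;
            this.definition = definition;
        }
    }

    public static void main(String[] args) {
        // TreeMap keeps keys sorted; HashMap would give faster individual lookups.
        Map<String, Word> dictionary = new TreeMap<String, Word>();

        // .put() plays the role of "array[index] = value", but with a String key.
        dictionary.put("apple", new Word("noun", "a round fruit"));
        dictionary.put("run", new Word("verb", "to move quickly on foot"));

        // .get() plays the role of "array[index]".
        Word w = dictionary.get("apple");
        System.out.println("apple (" + w.partOfSpeech + "): " + w.definition);

        // Keys are no longer contiguous integers, so .keySet() lists them all.
        for (String key : dictionary.keySet())
            System.out.println(key + ": " + dictionary.get(key).definition);
    }
}
```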
Assignment 3 – The Bright Side • If this wasn’t a challenge, you’re ready to go out there and write production code. • If it was, then you just may need more experience. • It does get easier over time. • If you haven’t been developing for a long time, you’re not behind the curve. • It took me a year to learn how to use functions! • (But I didn’t have any formal instruction then.) • The important thing is that you stay at it. • This is only 6% of your final grade. Don’t stress out too much over it.
Experience • Ability arises mostly from years of borderline-obsessive work. You can’t acquire it overnight. • Who here has been coding for at least: • 1 year? • 2 years? • 5 years? • 10 years? • 15 years? • Who has coded anything on the side?
Assignment 3 Code • Let’s review the solution. • Questions?
Review: Arrays and Random Access • Let’s review arrays for a moment: • A size n array is indexed by a contiguous set of integers from 0 to n-1. • Because the array is contiguous in memory, accessing any element of it can be performed in constant-time. This is random access. • If the index actually represents something about the dataset, we can use this to access desired elements in constant-time. • For example, asking “who is the 4th person up to bat?” in a baseball roster. • Answer: roster[3] (remember, they start at 0). • This is an O(1) operation. I’m fourth! (Worst team ever.)
Keys and Values • An index is an example of a numeric key into the array. • A key is an attribute or combination of attributes by which each record is identified. • Arr[3] identifies the fourth element in the array. In this case, the key is simply an element's position in the array. • But we can also identify records by attributes such as employee names and salaries. • These don't map too well to array indices. • The value of an element is the data accessed by the key. • For example, if Arr[3] was an Employee, "3" is the key and the resulting Employee object is the value. • A container that maps directly between keys and values is called a Map (surprise!) or an associative array.
Arrays' Shortcomings • Arrays work well if keys are contiguous integers. • Years in a calendar, for example. • However, what if we have a non-numeric key? • In every data structure we've discussed so far, we have no choice but to search for it, which is an Ω(log n) operation. • (Figure: searching an array of names (Bob, Alice, Eve, Charlie, John, Trudy, Mallory) for "John".)
Mapping Data • Idea: What if we could map the word “John” to an array index somehow? • “John” -> 5. Arr[5] = … • Then finding “John” becomes equivalent to mapping “John” to 5 and accessing Arr[5]. • Arrays are random-access, so this is O(1). • Obvious question: How do we turn “John” into 5? Why 5 and not 6? • Less obvious question: What if “Bob” also maps to 5? What happens then?
Maps and Mathematical Functions • Go waaay back and think about the first time you heard the word "function". • It was something that took input and transformed it into output. • For example, f(x) = 2x maps 4 to 8, 3 to 6, and 2 to 4.
Maps and Mathematical Functions • So if we can do that, why not this? • (Figure: a "black box" function h(x) maps John to 2, Bob to 1, and Alice to 0.)
Hashing: The Idea • We call the process of transforming input with a function and using the result as an index hashing. • This allows us to use strings or other objects as keys. • (Figure: h(x) maps John, Bob, and Alice to indices 2, 1, and 0 of a double[] Salaries array, so that Salaries["John"] = 75000, Salaries["Bob"] = 25000, and Salaries["Alice"] = 50000.)
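Here is a minimal sketch of that idea in Java, assuming a made-up hash function (summing character codes) purely for illustration; a real table would use the key's hashCode() and would have to deal with collisions, which this sketch ignores.

```java
// Hash a name to an index, then use ordinary array indexing.
public class SalaryTableSketch {
    static final int SIZE = 11;               // small table for illustration
    static String[] names = new String[SIZE];
    static double[] salaries = new double[SIZE];

    // Toy hash: sum of character codes, wrapped to the table size.
    static int h(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++)
            sum += key.charAt(i);
        return sum % SIZE;
    }

    static void put(String name, double salary) {
        int index = h(name);                  // "John" -> some slot
        names[index] = name;                  // (collisions ignored in this sketch)
        salaries[index] = salary;
    }

    static double get(String name) {
        return salaries[h(name)];             // O(1): hash, then random access
    }

    public static void main(String[] args) {
        put("Alice", 50000);
        put("Bob", 25000);
        put("John", 75000);
        System.out.println("John earns " + get("John"));
    }
}
```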
h(x) The Hash Function • We call h(x) a hash function. • Any function that maps the input type to something suitable for indexing may be used. • In Java, this means we are mapping from Object to int. • In fact, every Java class has a built-in function called: int hashCode() • This function is defined in the Object class, which means every object has a default one. • It also means you can override it in your own objects.
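A sketch of what overriding hashCode() in your own class might look like. The Employee class and its fields are hypothetical; the key rule is that whenever you override hashCode(), you should override equals() as well, so that equal objects always produce the same hash code.

```java
public class Employee {
    private final String name;
    private final int id;

    public Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Employee))
            return false;
        Employee e = (Employee) other;
        return id == e.id && name.equals(e.name);
    }

    @Override
    public int hashCode() {
        // Combine the fields; 31 is a conventional multiplier.
        return 31 * name.hashCode() + id;
    }
}
```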
Good Hash Functions • A hash function must be deterministic: it must always return the same value for the same input. • Good hash functions distribute their output as uniformly as possible to minimize the number of "collisions": two different input values that hash to the same output. • If every distinct input value is mapped to a distinct output value, the function is called injective, or one-to-one. This is the ideal. • If the space of possible inputs is larger than the space of possible outputs, however, an injective hash function is impossible (due to the pigeonhole principle: if you put n+1 objects in n holes, at least one hole must have more than one object in it). • Because the hash function is computed on every access of the hash table, good hash functions execute very quickly.
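As a quick illustration of a hash function that is deterministic, fast, and reasonably well distributed, here is a polynomial string hash, similar in spirit to the scheme Java's String.hashCode() uses. The class name and the indexFor() helper are made up for this sketch.

```java
public class StringHashSketch {
    // Deterministic: the same string always produces the same hash.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);   // multiply by a small prime, add each character
        return h;
    }

    // Wrap the hash to a table index; mask off the sign bit first,
    // since the running hash may overflow into a negative int.
    static int indexFor(String s, int tableSize) {
        return (stringHash(s) & 0x7fffffff) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(indexFor("John", 8));   // some slot between 0 and 7
        System.out.println(indexFor("John", 8));   // the same slot every time
    }
}
```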
The Birthday Paradox • If the range of possible inputs is larger than the range of possible outputs, it is impossible to obtain an ideal hash function due to the pigeonhole principle. • However, even if this is not the case, it is still unlikely that a uniform hash function will avoid collisions. • This is due to the birthday paradox: • This just refers to the counterintuitive notion that it is highly likely that two people in a relatively small group share the same birthday. • Assuming a uniform distribution: • In a group of 23 people, the probability that 2 share a birthday is about 50%. • In a group of 50 people, the probability is about 97%. • The probability does not reach 100% until 366 people are in the room (ignoring leap years). • "Having the same birthday" -> "Hashing to the same value".
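You can verify those birthday figures directly; this small sketch multiplies out the probability that n people all have distinct birthdays (assuming 365 equally likely days) and takes the complement.

```java
public class BirthdayParadox {
    static double collisionProbability(int n) {
        double noCollision = 1.0;
        for (int k = 0; k < n; k++)
            noCollision *= (365.0 - k) / 365.0;   // person k avoids the first k birthdays
        return 1.0 - noCollision;
    }

    public static void main(String[] args) {
        System.out.println("23 people: " + collisionProbability(23));  // about 0.507
        System.out.println("50 people: " + collisionProbability(50));  // about 0.970
    }
}
```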
The Birthday Paradox (Wikipedia)
Popular Hash Functions • MD5 • MD4 • SHA1 • SHA2 • SHA3 • CRC32 • Tiger • (Aside: Many hash functions are used for cryptography as well. Should you use one to hash passwords, combine the data with an extra string, called a salt, to avoid "rainbow table" attacks.)
Hash Tables • The hash table is the array that the hash function provides an index into. • Like other arrays, it begins with a fixed capacity, and strategies must be employed to maintain it as the hash table grows. • Because performance degrades as the hash table begins to fill, the size of a hash table is usually increased when capacity passes a certain load factor. • For example, a table with a load factor of 0.75 would increase in size when it is 75% full. • 0.75 is the default in Java's Hashtable, HashSet, and HashMap classes. • Collisions, mappings of distinct objects to the same position in the array, must also be handled. • They become more of a problem as the hash table fills.
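A minimal sketch of the load-factor rule, assuming a simple table of keys only (no values, no deletion) and a made-up doubling strategy: once the number of entries exceeds loadFactor * capacity, every key is re-hashed into a larger array, since the indices change when the capacity changes.

```java
public class ResizingTableSketch {
    private String[] slots = new String[8];          // initial capacity
    private int size = 0;
    private static final double LOAD_FACTOR = 0.75;  // the same default Java uses

    public void add(String key) {
        if (size + 1 > LOAD_FACTOR * slots.length)
            resize(slots.length * 2);                // grow before we get too full
        insert(slots, key);
        size++;
    }

    // Linear-probing insert into the given array.
    private static void insert(String[] table, String key) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null)
            i = (i + 1) % table.length;
        table[i] = key;
    }

    // Every key must be re-hashed, because index = hash % capacity.
    private void resize(int newCapacity) {
        String[] bigger = new String[newCapacity];
        for (String key : slots)
            if (key != null)
                insert(bigger, key);
        slots = bigger;
    }
}
```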
Collision Resolution 1 • What if element B hashes to a location already filled by element A? • We have a collision. • There are two strategies for handling this scenario: • Linear Probing. • Chaining. • Or, to put it in intuitive terms: • This spot's taken. Store the new element somewhere else, or: • Cram both elements into the same spot. • (Figure: Alice and Bob both hash to the same slot.)
Linear Probing • Let element B hash to the location h(B). • Suppose h(B) is already filled by element A. • A linear probing strategy simply stores B in the next available space. • If h(B) + 1 is available, this is where it is stored. • If not, we move to h(B) + 2 and check whether it is available. • And so on. • If we hit the end of the table, we wrap around to the beginning (modular arithmetic). • It is also possible to use an arbitrary offset k. • Then we check h(B) + k, h(B) + 2k, etc. • Again, everything is (mod n), the size of the table, so we wrap. • The same strategy is used for access: • If the hashed element is not the same as the one we're looking up, move down the hash table and check the next element. Repeat until the elements match or an empty space is reached.
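Here is a minimal sketch of linear probing with a step size of 1, under the simplifying assumption that the table never fills completely (a real implementation would resize, as described earlier). Lookup follows the same rule as insertion: probe until the key matches or an empty slot is reached.

```java
public class LinearProbingSketch {
    private final String[] keys;
    private final double[] values;

    public LinearProbingSketch(int capacity) {
        keys = new String[capacity];
        values = new double[capacity];
    }

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % keys.length;
    }

    public void put(String key, double value) {
        int i = hash(key);
        // Walk forward until we find the key itself or an empty slot.
        while (keys[i] != null && !keys[i].equals(key))
            i = (i + 1) % keys.length;        // wrap around at the end of the table
        keys[i] = key;
        values[i] = value;
    }

    public Double get(String key) {
        int i = hash(key);
        while (keys[i] != null) {
            if (keys[i].equals(key))
                return values[i];             // found it
            i = (i + 1) % keys.length;        // keep probing
        }
        return null;                          // hit an empty slot: not present
    }
}
```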
Linear Probing Example • Insert "Mallory". Suppose Mallory hashes to John's spot, which is already filled.
Linear Probing Example • We check the next spot. It's filled.
Linear Probing Example • We check the spot after that. It's also filled.
Linear Probing Example • When we find an empty spot, Mallory is placed there.
Advantages and Disadvantages • Advantages: • Very space-efficient; values are stored in the hash table itself. • Simple; no extra structures needed. • Works fairly well when load factor is low. • However, a low load factor wastes space. • Because colliding elements remain adjacent in memory, caching behavior is exceptional. • Disadvantages: • Performance swiftly degrades when load factor exceeds 0.8. • Collisions may cluster, and this requires traversing the hash table one element at a time to find the next available space. This may slow insertion.
Chaining • Let element B hash to the location h(B). • Suppose h(B) is already filled by element A. • A chaining strategy stores a linked list at each slot and appends the new element to that slot's list. • When we wish to access the element again, we perform a linear search on the list.
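A minimal sketch of chaining, using java.util.LinkedList for the chains; the Entry class is made up for this example. Insertion appends to one short list; lookup is a linear search of that list only.

```java
import java.util.LinkedList;

public class ChainingSketch {
    static class Entry {
        final String key;
        double value;
        Entry(String key, double value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainingSketch(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++)
            buckets[i] = new LinkedList<Entry>();
    }

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(String key, double value) {
        for (Entry e : buckets[hash(key)])
            if (e.key.equals(key)) { e.value = value; return; }  // key already present: overwrite
        buckets[hash(key)].add(new Entry(key, value));            // otherwise append in O(1)
    }

    public Double get(String key) {
        for (Entry e : buckets[hash(key)])                        // linear search of one chain
            if (e.key.equals(key))
                return e.value;
        return null;
    }
}
```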
Chaining Example • Insert "Mallory". Suppose Mallory hashes to John's spot. We then append Mallory to a linked list in that same spot.
Advantages and Disadvantages • Advantages: • Intuitive; the location we hash at is always the one returned by the hash function. • New elements can be added to the list in constant-time; linear probing requires a linear scan. • Performance degrades only linearly as the table fills. • More elements may be stored in the table than there are available slots using this method. • You can quickly discover how many keys collide with a given key. • Disadvantages: • Storing the data in adjacent memory locations, as in linear probing, has very good caching behavior. Linked lists in general do not.
Performance (Wikipedia)
Perfect Hashing • If all n keys are known prior to hashing, it is possible to construct a function that maps these keys to a hash table of size n without collisions. • This function is known as a perfect hash function. • There is a generalized procedure for discovering perfect hash functions described at http://cmph.sourceforge.net/papers/chm92.pdf. • But since this is a difficult paper to understand, just be aware that it is possible.
Cuckoo Hashing • This is a strategy that uses two hash functions to insert. • If a collision occurs using the first hash function, the existing element is pushed out of its space (replaced by the new element) and hashed using the second function. • This can potentially push another element out. If a loop occurs, the hash table is rebuilt using a different set of hash functions. • However, a collision on both hash functions is unlikely until the table begins to fill. • The table "begins to fill", in this sense, at a lower load factor than in the other two strategies: • Using two hash functions, an appropriate load factor is .5. • However, using three, the appropriate load factor jumps to .91. • This strategy was generally found superior to both chaining and probing. However, it is still not widely known. • Fortunately for you, I have some very esoteric areas of interest.
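A simplified sketch of the insertion step, assuming a single table, a made-up second hash function, and a fixed bound on evictions; the usual formulation uses two separate tables and rebuilds with fresh hash functions when a cycle is detected, which this sketch only signals by returning false.

```java
public class CuckooSketch {
    private final String[] table;

    public CuckooSketch(int capacity) {
        table = new String[capacity];
    }

    private int h1(String key) {
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    private int h2(String key) {
        // A second, different hash function (made up for this sketch).
        return ((key.hashCode() * 31 + 17) & 0x7fffffff) % table.length;
    }

    public boolean insert(String key) {
        int pos = h1(key);
        for (int kicks = 0; kicks < table.length; kicks++) {
            if (table[pos] == null) {          // empty slot: done
                table[pos] = key;
                return true;
            }
            String evicted = table[pos];       // push the resident element out
            table[pos] = key;
            key = evicted;
            // Re-insert the evicted key at its other possible position.
            pos = (pos == h1(key)) ? h2(key) : h1(key);
        }
        return false;  // probable cycle: a real table would rebuild with new hash functions
    }

    public boolean contains(String key) {
        // A key can only ever live at one of its two positions.
        return key.equals(table[h1(key)]) || key.equals(table[h2(key)]);
    }
}
```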
Unsorted Associative Containers • Java has excellent built-in support for hashing. • In particular, the unsorted associative containers utilize hash tables: • HashMap, which you have used: • Similar functions to TreeMap. • Usually faster for random-access queries. • As you saw in Assignment 3, performing range queries or sequential access is a pain (you had to sort). • HashSet. • Hashtable (which is very much like HashMap). • Why are they unsorted? • The point of a hash function is to turn keys into integers. In general, sorted order cannot be maintained through this conversion.
Keeping a Hash Table Ordered • It's possible to make traversal of a hash table predictable with some extra structure. • Specifically, what if we stored a linked list within the table that pointed from each element to the next one inserted? • This incurs no extra asymptotic cost on insertion, access, or deletion. • Java has a class that implements this idea, LinkedHashMap. It iterates in insertion order (or, optionally, access order); note that this is not sorted key order, so for sorted keys and range queries you still want TreeMap.
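A small demonstration of LinkedHashMap's iteration order; the names and values are made up. A plain HashMap would print these entries in whatever order the hash function scatters them; LinkedHashMap returns them in the order they were inserted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new LinkedHashMap<String, Integer>();
        ages.put("Charlie", 30);
        ages.put("Alice", 25);
        ages.put("Bob", 28);

        // Prints Charlie, Alice, Bob: the order in which they were inserted.
        for (Map.Entry<String, Integer> entry : ages.entrySet())
            System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}
```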
Hashing in Other Languages • Java: HashMap • C++: hash_map • C#: Hashtable • Perl: $var{'key'} = "value" • PHP: $var['key'] = "value" • Ruby: v = { 'key' => 'value' }
Performance • What is the complexity of insertion in a hash table if there are no collisions? • What if there are collisions? • If you choose your table size appropriately, collisions are rather rare. The average size of your chains usually ends up around 2 or 3. • Do hash tables need to use any extra space?
CRUD: Hash Tables • Insertion (average): O(1). • Access (average): O(1). • Deletion (average): O(1). • Insertion (worst): O(n). • Access (worst): O(n). • Deletion (worst): O(n). • Since collisions are not very common with a good hash function and an appropriate load factor, hash tables very often yield constant-time insertion, access, and deletion. • The amount of space used depends on the load factor, but remains O(n). • They are incredibly useful structures! • They allow you to index data by a generalized key rather than a numeric ID, and are therefore used extensively in databases and distributed queries. Google's MapReduce framework, for example, relies on hashing to partition keys across machines.
Access on Demand • This was our discussion of hashing. • Next time, we will discuss amortized analysis and Java’s “Set” classes. • The lesson: • An unlikely event actually has a very high probability given enough repetitions (birthday paradox).