Chapter 9

Chapter 9

Topics • Maps • Hash tables • Hash functions and hash code • Compression functions and collisions • Ordered maps • List-based Dictionary • Hash table Dictionary • Ordered search tables and binary search • Skip list • Dictionaries

Maps

Maps • A map models a searchable collection of key-value entries • Used when entries need to be located quickly • Each element typical stores additional information besides the key-value • The main operations of a map are for searching, inserting, and deleting items • Keys must be unique so the association of keys to values defines a mapping • Applications: • Student IDs, social security numbers, airline reservation system

Maps - 2 • An entry stores a key-value pair (k,v) • Key can be viewed as an address for the object • Called associate stores or associate containers • Determines location within the data structure

Maps Entry ADT (key, value) • Methods: • key(): return the associated key • value(): return the associated value • setKey(k): set the key to k • setValue(v): set the value to v • Data members • key • value

The Map ADT • Stores a collection of key-value pairs • An iterator is used to traverse the Map, M • Methods • size(): returns the number entries in M • empty(): test whether M is empty • find(k): if the map M has an entry with key k, return and iterator to it; else, return special iterator end • put(k, v): if there is no entry with key k, insert entry (k, v), and otherwise set its value to v. Return an iterator to the new/modified entry

The Map ADT -2 • erase(k): if the map M has an entry with key k, remove it from M • erase(p): if the map M has an entry reference by iterator p, remove it from M • begin(): return an iterator to the first entry of m • end(): return an iterators to a position just beyond the end of M

Map ADT Example Operation Output Map empty() true Ø put(5,A) [(5,A)](5,A) put(7,B) [(7,B)] (5,A),(7,B) put(2,C) [(2,C)] (5,A),(7,B),(2,C) put(8,D) [(8,D)] (5,A),(7,B),(2,C),(8,D) put(2,E) [(2,E)] (5,A),(7,B),(2,E),(8,D) find(7) [(7,B)] (5,A),(7,B),(2,E),(8,D) find(4) end(5,A),(7,B),(2,E),(8,D) find(2) [(2,E)] (5,A),(7,B),(2,E),(8,D) size() 4 (5,A),(7,B),(2,E),(8,D) erase(5) — (7,B),(2,E),(8,D) erase(2) — (7,B),(8,D) find(2) end(7,B),(8,D) empty() false (7,B),(8,D)

Map ADT C++ Interface template <typename K, typename V> class Map { // map interface public: class Entry; // a (key,value) pair class Iterator; // an iterator (and position) int size() const; // number of entries in the map bool empty() const; // is the map empty? Iterator find(const K& k) const // find entry with key k Iterator put(const K& k, const V& v); // insert/replace pair (k,v) void erase(const K& k) // remove entry with key k throw(NonexistentElement); void erase(const Iterator& p); // erase entry at p Iterator begin(); // iterator to first entry Iterator end(); // iterator to end entry }; 9.2

STL Containers • Container • Data structure that stores a collection of objects • STLContainers • <vector> - defines a template class for implementing a container (resizable array) • <stack> - container with the last-in, first-out access • <queue> - container with the first-in, first-out access • <deque> - double ended queue • <list> - doubly linked list • <priority_queue> - queue order by value • <set> - set • <map> - associative array (dictionary) • <iterator> - defines a template for defining manipulating iterators

The STL Map • http://en.cppreference.com/w/cpp/container/map • The class list is a STL container class • The header file <map> • #include <map> • using namespace std; • map <class key, class value)> myMap

The STL Map Methods -1 • size() : • Returns the number elements in the map • empty() : • Returns if the map is empty • find(k) : • Finds the entry with key k and returns an iterator to it ; if key doesn’t exist, return end • operator(k) : • Produce a reference to the value of k, if no such key exists, create a new entry for key k

The STL List Methods - 2 • insert (pair (k,v)) • Insert pair (k,v) returning an iterator to its position • erase(k) • Remove element with key k • erase(p) • Remove element referenced by iterator p • begin() • return an iterator to the first entry of the map • end() • return an iterators to a position just beyond the end of the map

Map STL Example map<string, int> myMap; // a (string,int) map map<string, int>::iterator p; // an iterator to the map myMap.insert(pair<string, int>("Rob", 28));// insert ("Rob",28) myMap["Joe"] = 38; // insert("Joe",38) myMap["Joe"] = 50; // change to ("Joe",50) myMap["Sue"] = 75; // insert("Sue",75) p = myMap.find("Joe"); // *p = ("Joe",50) myMap.erase(p); // remove ("Joe",50) myMap.erase("Sue"); // remove ("Sue",75) p = myMap.find("Joe"); if (p == myMap.end()) cout << "nonexistent\n"; // outputs: "nonexistent" for (p = myMap.begin(); p != myMap.end(); ++p) {// print all entries cout << "(" << p->first << "," << p->second << ")\n"; } 9.3

Maps We can efficiently implement a map using an unsorted list We store the items of the map in a list S (based on a doubly-linked list), in arbitrary order A Simple List-Based Map trailer nodes/positions header c c 5 8 c c 9 6 entries

Maps The find Algorithm Algorithm find(k): for each p in [S.begin(), S.end()) do if pkey() = k then return p return S.end(){there is no entry with key equal to k} We use pkey() as ashortcut for (*p).key()

Maps The put Algorithm Algorithm put(k,v): for each p in [S.begin(), S.end()) do if pkey() = k then psetValue(v) return p p = S.insertBack((k,v)){there is no entry with key k} n = n + 1 {increment number of entries} return p

Maps The erase Algorithm Algorithm erase(k): for each p in [S.begin(), S.end()) do if p.key() = k then S.erase(p) n = n – 1 {decrement number of entries}

Maps Performance: put takes O(n) time since we need to determine whether it is already in the sequence find and erase take O(n) time since in the worst case (the item is not found) we traverse the entire sequence to look for an item with the given key The unsorted list implementation is effective only for maps of small size or for maps in which puts are the most common operations, while searches and removals are rarely performed (e.g., historical record of logins to a workstation) Performance of a List-Based Map

Hash Tables 0  1 025-612-0001 2 981-101-0002 3  4 451-229-0004 Hash Tables

Hash Tables • Keys associated with values in a map can be thought of as addresses • Hash tables are an efficient way to implement this kind of map • Can usually perform operations in O(1)time • Worse case is O(n) • Consists of two major components • Bucket array • Hash function

Bucket Arrays • An array of size N for a hash table • Each cell can be thought of as a bucket • Collection of a key-value pairs • Goal • Well-distributed keys in the range [0,N-1] 0 1 2 3 4 5 (1,D) (3,D) (3,A) (5,F)

Hash Functions (1) • A hash function h maps keys of a given type to integers in a fixed interval [0, N - 1] • Example:h(x) = x mod N is a hash function for integer keys • The integer h(x) is called the hash value of key x • The entry (k,v) is stored in the bucket A(h(x)) where A is the bucket array

Hash Functions (2) • If the keys are unique integers within [0,N-1] • Each bucket holds at most one entry • Searches, insertions, and removal take O(1) time • Two possible problems with this scheme • N can be much larger than needed • Keys must be integers

Hash Tables • A hash table for a given key type consists of • Hash function h • Bucket array (called table) of size N • If two or more keys have the same hash value, a collision occurs

0  1 025-61-0001 2 981-10-0002 3  4 451-22-0004 … 9997  9998 200-75-9998 9999  Hash Example • We design a hash table for a map storing entries as (SSN, name), where SSN (social security number) is a nine-digit positive integer • Our hash table uses an array of sizeN= 10,000 and the hash functionh(x) = last four digits of x • Collisions occur using this scheme

Evaluation of Hash Functions • A hash function is usually a composition of two functions • First function (hash code) maps a key to an integer (…-3,-2,-1,0,1,2,3,…) • h1: keys  integers • Second function (compression function) maps the hash code to an integer within the range [0, N - 1] • h2: integers  [0, N - 1]

Hash Function Evaluation • The hash code is applied first, and the compression function is applied next on the result • h(x) = h2(h1(x)) • The goal of the hash function is to “disperse” the keys in an apparently random way • Avoid collisions

Integer Hash Code • One can reinterpret the memory address of the key object as an integer • Suitable for keys of length less than or equal to the number of bits of the integer type (e.g., char, short, and int in C++) • Cast char and short to an int • Good in general, except for numeric and string keys

Component Sum Hash Code • One can partition the bits of the key into components of fixed length (e.g., 16 or 32 bits) and we sum the components (ignoring overflows) • Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type (e.g., long and double in C++) • Not a good choice for character strings or other variable length objects • stop, tops, pots, spot would generate the same hash code causing collisions • Same characters

Float Hash Code • One can sum its mantissa and exponent as long integers and then apply the long integer to the results

Polynomial Hash Codes • To address the problem (stop, tops, pots, spot generating the same hash code) one needs to address to positions of the characters • Partition the bits of the key into a sequence of components (e.g characters) of fixed length (e.g., 8, 16 or 32 bits)a0 ,a1 , … , an-1 • Evaluate the polynomial p(z)= a0+a1 z+a2 z2+ … +an-1zn-1 for a non-zero constant z (ignoring integer overflows) • Especially suitable for strings (e.g., the choice z =33gives at most 6 collisions on a set of 50,000 English words)

Cyclic Shift Hash Code • Variant of the polynomial hash code • Replacing multiplication by a with a cyclic shift of a partial sum by a certain number of bits inthashCode (const char* p, intlen) { unsigned int h=0; For (int i = 0; i < len ; i++ { h = (h << 5 | h >> 27) // 5 bit cyclic shift h +=(unsigned int)p[i]; } return hashCode(int(h))

Compression Functions • Hash codes for keys will typically not be suitable for immediate use within a bucket array • Range will exceed the range of legal indices • Need to map integers from the hash code to the legal indices [0,N-1] • Good hash code is when the probability of two different keys getting hashed to the same bucket is 1/N (number of buckets) • A good compression function will minimize the number of collisions

Division Compression Function • h2 (y) = y mod N • The size N of the hash table is usually chosen to be a prime • For example if N=100 • If we insert keys with hash codes • {205,210,215,…,600} • Each hash code will collide with three other hash codes • For example if N=101 • There will be no collisions • The division compression with a prime number is not always sufficient

MAD Compression Function • Multiply, Add and Divide (MAD) • Helps eliminate collisions compared to the division function • h2 (y) =[(ay + b) mod p] mod N • N is the size of the bucket array • p is a prime number larger than N • a and b are random nonnegative integers from the interval [0,p-1]

Hash Tables Collisions occur when different elements are mapped to the same cell Separate Chaining: let each cell in the table point to a linked list of entries that map there Separate chaining is simple, but requires additional memory outside the table 0  1 025-62-0001 2  3  4 451-29-0004 981-11-0004 Collision Handling

Hash Tables Map with Separate Chaining Delegate operations to a list-based map at each cell: Algorithm find(k): return A[h(k)].find(k) Algorithm put(k,v): t = A[h(k)].put(k,v) if t = null then {k is a new key} n = n + 1 return t Algorithm erase(k): t = A[h(k)].erase(k) if t ≠null then {k was found} n = n - 1 return t

24 33 49 24 12 57 42 18 54 41 Load Factor 0 1 2 3 4 5 6 7 8 9 10 11 12

Open addressing: the colliding item is placed in a different cell of the table Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell Each table cell inspected is referred to as a “probe” Colliding items lump together, causing future collisions to cause a longer sequence of probes Example: h(x) = x mod13 Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order Linear Probing 0 1 2 3 4 5 6 7 8 9 10 11 12 41 18 44 59 32 22 31 73 0 1 2 3 4 5 6 7 8 9 10 11 12

Consider a hash table A that uses linear probing find(k) We start at cell h(k) We probe consecutive locations until one of the following occurs An item with key k is found, or An empty cell is found, or N cells have been unsuccessfully probed Search with Linear Probing Algorithmfind(k) i h(k) p0 repeat c A[i] if c= returnnull else if c.key() =k returnc.value() else i(i+1)mod N p p+1 untilp=N returnnull

Hash Tables To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements erase(k) We search for an entry with key k If such an entry (k, o) is found, we replace it with the special item AVAILABLE and we return element o Else, we return null Updates with Linear Probing - 1

Hash Tables put(k, o) We throw an exception if the table is full We start at cell h(k) We probe consecutive cells until one of the following occurs A cell i is found that is either empty or stores AVAILABLE, or N cells have been unsuccessfully probed We store (k, o) in cell i Updates with Linear Probing - 2

Double Hashing • Double hashing uses a secondary hash function h’(k) and handles collisions by placing an item in the first available cell of the series (i+jh’(k)) mod Nfor j= 0, 1, … , N where i=h(k) • The secondary hash function h’(k) cannot have zero values • The table size N must be a prime to allow probing of all the cells

Double Hashing Example Function • Common choice of compression function for the secondary hash function: h’(k) =q-k mod q where • q<N • q is a prime • The possible values for h’(k) are1, 2, … , q

Example of Double Hashing • Consider a hash table storing integer keys that handles collision with double hashing • N= 13 • h(k) = k mod13 • h’(k) =7 - k mod7 • Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order 0 1 2 3 4 5 6 7 8 9 10 11 12 (h(k)+ jh’(k)) mod N for j= 0, 1, … , N – 1 31 41 18 32 59 73 22 44 Key:44 (5+5)=10 Key:31 (5+2(4))=13-->0 0 1 2 3 4 5 6 7 8 9 10 11 12

Load Factors • The load factor = n/N (for n entries with a bucket size N) should be between 0.0 and 1.0 • Entries well distributed within the bucket array • The load factor should be maintained where < 0.5 for open addressing schemes and < 0.9 for separate chaining • To avoid “bouncing” around

If the load factor exceeds the recommended values, the bucket array needs to be resized The new array should be at least twice as big as the original Need to define a new hash function Need to move every item from the old bucket array to the new bucket array Should scatter the elements Called rehashing Rehashing into a New Table

In the worst case, searches, insertions and removals on a hash table take O(n) time The worst case occurs when all the keys inserted into the map collide The load factor a=n/N affects the performance of a hash table Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is1/ (1 -a) Performance of Hashing - 1

Chapter 9

Chapter 9

Presentation Transcript

Chapter 9

CHAPTER 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

Chapter 9

CHAPTER 9

Chapter 9

Chapter 9

Chapter 9