Dictionaries, Tries and Hash Maps 15-211 Fundamental Structures of Computer Science Jan. 30, 2003 Ananda Guna Based on lectures given by Peter Lee, Avrim Blum, Danny Sleator, William Scherlis, Ananda Guna & Klaus Sutner
Dictionaries • A data structure that supports lookups and updates • Computational complexity: • The time and space needed to build and maintain the dictionary, i.e. the cost of creating and updating it • The time needed to answer a membership query • Implementing a Dictionary • A sorted array? • A sorted linked list? • A BST? • Array of linked lists? • A trie?
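As a hedged sketch, the lookup/update operations above might look like this in Java; the names `Dictionary` and `SimpleDictionary` are illustrative, not from the lecture, and the backing store here happens to be one of the candidate implementations (a hash table, via Java's HashMap covered later in this lecture):

```java
import java.util.HashMap;

// Illustrative dictionary ADT: lookups (membership queries) and updates.
interface Dictionary<K, V> {
    V lookup(K key);              // membership query; null if absent
    void insert(K key, V value);  // add or replace a binding
    void delete(K key);           // remove a binding
}

// One possible implementation from the slide's list, backed by a hash table.
class SimpleDictionary<K, V> implements Dictionary<K, V> {
    private final HashMap<K, V> table = new HashMap<>();
    public V lookup(K key)           { return table.get(key); }
    public void insert(K key, V val) { table.put(key, val); }
    public void delete(K key)        { table.remove(key); }
}
```

Any of the other candidates (sorted array, BST, trie) could implement the same interface with different cost trade-offs.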
What is a Trie? • A trie of size (h, b) is a tree of height h and branching factor at most b • All keys can be regarded as integers in the range [0, b^h) • Each key K can be represented as an h-digit number in base b: K1K2K3…Kh • Keys are stored at the leaf level; the path from the root spells out the decomposition of the key into digits

[Diagram: a height-2 trie storing the keys 20, 22, 24, 31, 32, 42, 43; edges from the root carry the first digit (2, 3, 4) and edges to the leaves carry the second]
Motivation for tries: An Application of Trees
Sample problem: How to do instant messaging on a telephone keypad?
Typing “I love you”: 444 * 555 666 888 33 * 999 666 88 # What is the worst-case number of keystrokes? What might be the average number of keystrokes?
Better approach: 4* 5 6 8 3* 9 6 8# Can be done by using tries.
Tries • A trie is a data structure that stores its information in the path from the root to each node, rather than in the node itself • A tree structure that encodes the possible sequences of symbols in a dictionary • Invented by Fredkin in 1960 • For example, suppose we have the alphabet {a,b}, and want to store the sequences • aaa, aab, baa, bbab
Tries aaa, aab, baa, bbab No information stored in nodes per se. Shape of trie determines the items.
Tries Nodes can be quite large: N pointers to children, where N is the size of the alphabet. Operations are very fast: search, insert, delete are all O(k) where k is the length of the sequence in question.
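A minimal sketch of such a trie with N = 26 child pointers per node (the alphabet is fixed to lowercase a–z here; class and method names are illustrative, not from the lecture):

```java
// Trie node with N child pointers, where N is the alphabet size.
class ArrayTrie {
    private static final int N = 26;                  // alphabet size
    private final ArrayTrie[] child = new ArrayTrie[N];
    private boolean isWord;                           // marks end of a stored sequence

    // O(k) insert, where k is the length of the sequence
    void insert(String word) {
        ArrayTrie cur = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (cur.child[i] == null) cur.child[i] = new ArrayTrie();
            cur = cur.child[i];
        }
        cur.isWord = true;
    }

    // O(k) search
    boolean contains(String word) {
        ArrayTrie cur = this;
        for (char c : word.toCharArray()) {
            cur = cur.child[c - 'a'];
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}
```

Delete is the same O(k) walk, clearing the `isWord` flag at the final node.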
Keypad IM Trie

[Diagram: a trie keyed by digits. The root's children are 4, 5, and 9; the path 4 spells "I", 5-4-5-3 spells "like", 5-6-8-3 spells "love", and 9-6-8 spells "you"]
Implementing a Trie • How do we implement a trie? • How about a tree of arrays? • You can also use HashMaps to implement a trie (later)
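The HashMap variant replaces each node's fixed pointer array with a map holding only the children actually present, which saves space for sparse alphabets. A sketch, with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Trie node whose children live in a HashMap rather than a fixed array.
class MapTrie {
    private final Map<Character, MapTrie> children = new HashMap<>();
    private boolean isWord;

    void insert(String word) {
        MapTrie cur = this;
        for (char c : word.toCharArray())
            cur = cur.children.computeIfAbsent(c, k -> new MapTrie()); // add child lazily
        cur.isWord = true;
    }

    boolean contains(String word) {
        MapTrie cur = this;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}
```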
Why Hashing? • Suppose we need a better way to maintain a table that is easy to insert into and search. • With a sorted array you can do binary search in O(log n) time, but insertion still costs O(n) to shift elements. • So is there an alternative way to handle operations such as insert, search, and delete? • Yes: hashing.
Big Idea • Suppose we have M items we need to put into a table of size N. • Can we find a map H such that • H[ith item] ∈ [0..N-1]? • Assume that N = 5 and the values we need to insert are: cab, bea, bad etc. • Assume that we assign values to letters: a=0, b=1, c=2, etc
Big Idea Ctd.. • Define H such that • H[data] = (Σ of the letter values) Mod N • H[cab] = (0+1+2) Mod 5 = 3 • H[bea] = (1+4+0) Mod 5 = 0 • H[bad] = (1+0+3) Mod 5 = 4

Resulting table (indices 0–4): bea at 0, cab at 3, bad at 4
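The letter-sum hash above is small enough to code directly (class name illustrative):

```java
// The slide's toy hash: letter values a=0, b=1, c=2, ..., summed mod N.
class ToyHash {
    static int h(String s, int n) {
        int sum = 0;
        for (char c : s.toCharArray()) sum += c - 'a'; // letter value
        return sum % n;
    }
}
```

Note that any two anagrams get the same sum, which is exactly the collision problem discussed next.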
Problems • Collisions • What if the values we need to insert are “abc”, “cba”, “bca” etc… • They all map to the same location based on our map H (obviously H is not a good hash map) • This is called “Collision” • One way to deal with collisions is “separate chaining” • The idea is to maintain an array of linked lists • More on collisions later
Running Time • We need to make sure that H is easy to compute (constant time) • The cost of lookups and deletes from the hash table depends on H • Assume M = Θ(N) • So what is a "bad" H? • Suppose we hash strings by simply adding up the letters and taking the sum mod the table size. Is that good or bad? • Homework: Think of hashing 1000 5-letter words into a table of size 1000 using the map H. What would the key distribution look like?
What is a good H? • If H behaves like a random function, each of the other M−1 keys collides with a given key with equal probability 1/N. • Therefore E(collisions with a given key) = (M−1)/N • If M = Θ(N) then this value is a constant. This is great. But life is not fair. • So what is a good hash function? • Let's consider hashing a set of strings Si, each of some length ℓi. Consider • H(Si) = (Σj Si[j]·d^j) Mod N, where d is some large number and N is the table size. • Is this function hard to calculate?
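A sketch of this polynomial hash, evaluated with Horner's rule so no large power of d is ever formed; d = 31 is an illustrative choice, and with Horner the digits end up weighted by d raised to (length−1−j) rather than d^j, which is the same idea in reverse order:

```java
// Polynomial string hash via Horner's rule, reducing mod the table size
// at each step to avoid overflow.
class PolyHash {
    static int hash(String s, int d, int tableSize) {
        int h = 0;
        for (int j = 0; j < s.length(); j++)
            h = (h * d + (s.charAt(j) - 'a')) % tableSize;
        return h;
    }
}
```

Unlike the letter-sum hash, anagrams now hash differently, so the keys spread out much better.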
Collisions • Hash functions can be many-to-one • They can map different search keys to the same hash key: hash('a') == 9 == hash('w') • Must compare the search key with the record found • If the match fails, there is a collision
Collision strategies • Separate chaining • Open addressing • Linear • Quadratic • Double • Etc. • The perfect hash
Linear Probing • The idea: • Table remains a simple array • On insert, if the cell is full, find another by sequentially searching for the next available slot • On find, if the cell doesn’t match, look elsewhere. • Eg: Consider H(key) = key Mod 6 (assume N=6) • H(11)=5, H(10)=4, H(17)=5, H(16)=4,H(23)=5 • Draw the Hash table
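The example above can be worked in code (names illustrative); this is one way to "draw" the table:

```java
// Working the slide's example: H(key) = key mod 6, inserting 11, 10, 17, 16, 23
// with linear probing (wrapping around at the end of the table).
class LinearProbe {
    static Integer[] insertAll(int[] keys, int size) {
        Integer[] table = new Integer[size];
        for (int key : keys) {
            int i = key % size;
            while (table[i] != null) i = (i + 1) % size; // probe next slot
            table[i] = key;
        }
        return table;
    }
}
```

Running it on the keys above yields index 0 → 17, 1 → 16, 2 → 23, 3 → empty, 4 → 10, 5 → 11: 17 and 16 collide with 11 and 10 and wrap around to the front, and 23 probes past them to slot 2.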
Linear Probe ctd.. • How about deleting items? • An item in a probed hash table is connected to others through its probe sequence (much as nodes are connected in a BST), so simply emptying its slot can break searches for other items • "Lazy delete" – just mark the item as deleted (inactive) rather than removing it
More on Delete • Naïve removal can leave gaps! ("3 f" means search key f with hash key 3)

Start: 0 a | 2 b | 3 c | 3 e | 5 d | 8 j | 8 u | 10 g | 8 s
Insert f: 0 a | 2 b | 3 c | 3 e | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Remove e: 0 a | 2 b | 3 c | (empty) | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Find f: the probe for f starts at its hash key and stops at the empty slot, so f is missed
More about delete ctd.. • Clever removal marks the slot as "gone" instead of emptying it ("3 f" means search key f with hash key 3)

Start: 0 a | 2 b | 3 c | 3 e | 5 d | 8 j | 8 u | 10 g | 8 s
Insert f: 0 a | 2 b | 3 c | 3 e | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Remove e: 0 a | 2 b | 3 c | gone | 5 d | 3 f | 8 j | 8 u | 10 g | 8 s
Find f: the probe passes the "gone" marker and still finds f
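Lazy deletion with a "gone" marker (a tombstone) can be sketched like this; class and field names are illustrative:

```java
// Lazy deletion: a tombstone keeps probe chains intact, so finds that pass
// through a removed slot still succeed.
class LazyTable {
    private static final Object TOMB = new Object(); // the "gone" marker
    private final Object[] slots;
    LazyTable(int size) { slots = new Object[size]; }

    void insert(int key) {
        int i = key % slots.length;
        // a tombstoned slot may be reused on insert
        while (slots[i] != null && slots[i] != TOMB) i = (i + 1) % slots.length;
        slots[i] = key;
    }
    boolean find(int key) {
        int i = key % slots.length;
        while (slots[i] != null) {                 // stop only at a truly empty slot
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }
    void remove(int key) {
        int i = key % slots.length;
        while (slots[i] != null) {
            if (slots[i].equals(key)) { slots[i] = TOMB; return; } // mark, don't clear
            i = (i + 1) % slots.length;
        }
    }
}
```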
Performance of linear probing • Average numbers of probes, for load factor λ: • Unsuccessful search and insert: ½ (1 + 1/(1−λ)²) • Successful search: ½ (1 + 1/(1−λ)) • When λ is low: • Probe counts are close to 1 • When λ is high: • E.g., when λ = 0.75, unsuccessful probes: ½ (1 + 1/(1−λ)²) = ½ (1 + 16) = 8.5 probes • E.g., when λ = 0.5, unsuccessful probes: ½ (1 + 1/(1−λ)²) = ½ (1 + 4) = 2.5 probes
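The 8.5 and 2.5 figures above are plain arithmetic and can be checked directly (names illustrative):

```java
// Average probe counts for linear probing at load factor lambda.
class ProbeCost {
    static double unsuccessful(double lambda) {
        return 0.5 * (1 + 1 / ((1 - lambda) * (1 - lambda)));
    }
    static double successful(double lambda) {
        return 0.5 * (1 + 1 / (1 - lambda));
    }
}
```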
Quadratic probing • Resolve collisions by examining certain cells away from the original probe point • Collision policy: • Define h0(k), h1(k), h2(k), h3(k), … where hi(k) = (hash(k) + i²) mod size • Caveat: • May not find a vacant cell! • Table must be less than half full (λ < ½) • (Linear probing always finds a cell.)
Quadratic probing • Another issue • Suppose the table size is 16. • Probe offsets that will be tried:
1 mod 16 = 1
4 mod 16 = 4
9 mod 16 = 9
16 mod 16 = 0
25 mod 16 = 9 ← only four different values!
36 mod 16 = 4
49 mod 16 = 1
64 mod 16 = 0
81 mod 16 = 1
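This collapse can be checked mechanically by collecting the distinct values of i² mod 16 (names illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Distinct quadratic probe offsets i*i mod tableSize over the first `probes` probes.
class QuadOffsets {
    static Set<Integer> distinct(int tableSize, int probes) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= probes; i++) seen.add((i * i) % tableSize);
        return seen;
    }
}
```

For table size 16 only the four offsets {0, 1, 4, 9} ever occur, which is one reason power-of-two sizes are a poor fit for quadratic probing.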
Quadratic probing • Table size must be prime • Load factor must be less than ½
Rehash • Scaling up • What makes λ grow too large? • Too much data • Too many removals (tombstones still occupy slots) • Rehash! • Do it when an insert fails or the load factor grows • Build a new table, twice the size or more • Scan the existing table and insert each live item into the new table • Adds only constant average cost
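A sketch of the rehash step for a linear-probing table, assuming integer keys hashed by key mod table size (names illustrative):

```java
// Rehash: allocate a table twice as large and re-insert every live key,
// hashing with the NEW table length.
class Rehasher {
    static Integer[] rehash(Integer[] old) {
        Integer[] bigger = new Integer[old.length * 2];
        for (Integer key : old) {
            if (key == null) continue;               // skip empty (and tombstoned) slots
            int i = key % bigger.length;             // hash with the new size
            while (bigger[i] != null) i = (i + 1) % bigger.length;
            bigger[i] = key;
        }
        return bigger;
    }
}
```

Each key may land in a completely different slot than before, since the modulus changed.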
Double Hashing • Collision policy • Define h0(k), h1(k), h2(k), h3(k), … where hi(k) = (hash(k) + i·hash2(k)) mod size • Caveats • hash2(k) must never be zero • Table size must be prime • If hash2(k) shares a common factor with the table size, fewer alternative cells will be tried • Quadratic probing may be faster/easier in practice
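A sketch of the probe sequence hi(k) = (hash(k) + i·hash2(k)) mod size; the second hash hash2(k) = R − (k mod R) with a prime R smaller than the table size is a common textbook choice assumed here (it is never zero), not something the slide specifies:

```java
// First `count` probe positions under double hashing, with hash(k) = k mod size
// and hash2(k) = R - (k mod R), R = 5 (illustrative; always in 1..5).
class DoubleHash {
    static int[] probes(int key, int size, int count) {
        int h = key % size;
        int h2 = 5 - (key % 5);
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) seq[i] = (h + i * h2) % size;
        return seq;
    }
}
```

With a prime table size, the stride h2 is coprime to the size, so the sequence eventually visits every cell.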
About HashMap Class • This implements the Map interface • HashMap permits null values and null keys • Constant time performance for get and put operations (assuming the hash function disperses the keys well) • HashMap has two parameters that affect its performance: initial capacity and load factor • Capacity – the number of buckets in the hash table • Load factor – how full the hash table is allowed to get before the capacity is automatically increased by rehashing
More on Java HashMaps • HashMap(int initialCapacity) • put(Object key, Object value) • get(Object key) – returns Object • keySet() Returns a set view of the keys contained in this map. • values() Returns a collection view of the values contained in this map.
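Exercising the methods listed above; the keys and values here are illustrative (keypad codes from the trie example earlier in the lecture):

```java
import java.util.HashMap;
import java.util.Map;

// Demonstrates the HashMap constructor and the put/get/keySet/values methods.
class HashMapDemo {
    static Map<String, Integer> keypadCodes() {
        Map<String, Integer> m = new HashMap<>(16); // HashMap(int initialCapacity)
        m.put("I", 4);                              // put(key, value)
        m.put("love", 5683);
        m.put("you", 968);
        return m;
    }
}
```

`keySet()` and `values()` return live views of the map, not copies, so changes to the map show through them.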
Next Week • HW2 is due Monday – you need to start early • We will discuss priority queues and recap what we have done so far • Read more about hash tables in Chapter 20 • See you Tuesday