Hashing Alan, Tam Siu Lung 96397999 Tam@SiuLung.com 99967891
Prerequisites • List ADT • Linked List • Table ADT • Array • Mathematics • Modular Arithmetic • Computer Organization • ASCII • Algorithm • Order Analysis
Abstract Data Types (ADT) • Stack<v> • Can add and remove in LIFO order • Queue<v> • Can add and remove in FIFO order • Priority Queue<v> • Can add. Removes the largest element first, so v must be comparable.
Data Structure • An ADT, implemented by a Data Type • E.g. • ArrayList, using an array to implement a List ADT • ArrayHeap, using an array to implement a Heap (which may in turn implement a PQ)
Dictionary<k, v> ADT • Add(k, v) • Add a key-value pair • Remove(k) • Remove a key-value pair given the key • Search(k) : v • Search for the value given the key • A Table ADT differs only in that the key is an integer in a fixed range.
Direct Addressing • Use the Table ADT • The key is the location • Efficient: O(1) for all operations • Infeasible: if the key can range from 1 to 20000000000, if the key is not numeric ...
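A minimal sketch of direct addressing, assuming integer keys in a small fixed range. The class name `DirectAddressTable` and the use of `None` as the empty marker are illustrative choices, not from the slides.

```python
class DirectAddressTable:
    """Table ADT: an integer key in range(size) is used directly as the array index."""

    def __init__(self, size):
        self.slots = [None] * size   # None marks an empty slot (assumption)

    def add(self, key, value):
        self.slots[key] = value      # O(1): the key is the location

    def remove(self, key):
        self.slots[key] = None       # O(1)

    def search(self, key):
        return self.slots[key]       # O(1); None means "not present"


table = DirectAddressTable(10)
table.add(3, "D")
found = table.search(3)
table.remove(3)
missing = table.search(3)
```

The array needs one slot per possible key, which is why a key range such as 1 to 20000000000 makes this approach infeasible.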
Time Complexity Note: For sorted array and BST, keys have to be ordered.
Hash Function • Hash Function: h_m(k) • Map all keys into an integer domain, e.g. 0 to m - 1 • E.g. CRC32 hashes strings into a 32-bit integer (i.e. m = 2^32) • Alan: 1598313570 • Max: 3452409927 • Man: 943766770 • On: 2246271074 • Note: We won’t use such a big m in our programs!
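As a rough illustration, Python's standard `zlib.crc32` maps a byte string to an integer in 0 to 2^32 - 1. Whether it reproduces the exact values listed above depends on the CRC variant and text encoding the slide used, which has not been checked here. The helper names and the reduction with `% m` are assumptions added for the sketch.

```python
import zlib


def crc32_hash(key: str) -> int:
    """Hash a string to an integer in [0, 2**32) using CRC32 over its ASCII bytes."""
    return zlib.crc32(key.encode("ascii"))


def small_hash(key: str, m: int) -> int:
    """Reduce the 32-bit hash to a table index in [0, m) (hypothetical helper)."""
    return crc32_hash(key) % m


for name in ("Alan", "Max", "Man", "On"):
    print(name, crc32_hash(name), small_hash(name, 8))
```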
Hash Table • Use a Table<int, v> ADT of size m • Use h(k) as the key • All operations can be done like using Table • Solved except • Collision: What to do if two different k have same h(k) • How to find a suitable hash function
Hash Functions • If k is an integer, use h(k) = k mod m • More advanced: floor(m*frac(k*A)) for some 0 < A < 1 • If k is a string, convert it to an integer, e.g. • h(‘Alan’) = [ASC(‘A’)*256^3 + ASC(‘l’)*256^2 + ASC(‘a’)*256 + ASC(‘n’)] mod m • If k is another data type, try to combine all features of the type
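The three hash functions above could be sketched as follows. The function names are illustrative, and the default A = 0.6180339887 (the fractional part of the golden ratio, a commonly suggested constant) is an assumption, since the slide only requires 0 < A < 1.

```python
import math


def hash_division(k: int, m: int) -> int:
    """h(k) = k mod m."""
    return k % m


def hash_multiplication(k: int, m: int, a: float = 0.6180339887) -> int:
    """h(k) = floor(m * frac(k * A)) for some 0 < A < 1."""
    frac = (k * a) % 1.0          # fractional part of k * A
    return math.floor(m * frac)


def hash_string(s: str, m: int) -> int:
    """Treat the string as a base-256 number and reduce it mod m.

    Reducing after every character (Horner's rule) gives the same result
    as computing the full number first, but keeps intermediate values small.
    """
    value = 0
    for ch in s:
        value = (value * 256 + ord(ch)) % m
    return value
```

For example, `hash_string("Alan", m)` equals `(65*256**3 + 108*256**2 + 97*256 + 110) % m`, since the ASCII codes of A, l, a, n are 65, 108, 97, 110.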
Chaining (a.k.a. Open Hashing) • Use Table<int, List<v> > instead • When there are multiple k’s with the same h(k), add each to the list (usually a linked list) • When searching or removing, scan the list at h(k) • Order: O(length of the list at h(k))
Chaining Samples • h(‘Alan’) = h(‘Man’) = h(‘On’) = 0, h(‘Max’) = 5 • Operations: • Add <Alan, D> • Add <Max, Z> • Add <Man, X> • Add <On, Y> • Search for Max • Remove Man
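The sample operations above can be traced with a small chaining table. This is a sketch under assumptions: the table size m = 8 is chosen only because it must exceed 5, a Python list stands in for the linked list, and the hash values are supplied through a lookup dictionary to match the slide.

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds a list of (key, value) pairs."""

    def __init__(self, m, hash_fn):
        self.m = m
        self.h = hash_fn
        self.buckets = [[] for _ in range(m)]

    def add(self, key, value):
        chain = self.buckets[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                 # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self.h(key)]:
            if k == key:
                return v
        return None                      # not found

    def remove(self, key):
        chain = self.buckets[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


# Hash values taken from the slide; m = 8 is an assumption.
SAMPLE_HASH = {"Alan": 0, "Man": 0, "On": 0, "Max": 5}
table = ChainedHashTable(8, SAMPLE_HASH.__getitem__)
table.add("Alan", "D")
table.add("Max", "Z")
table.add("Man", "X")
table.add("On", "Y")
found = table.search("Max")
table.remove("Man")
```

After these steps slot 0 holds the chain Alan, On and slot 5 holds Max.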
Chaining (Optional) • Note that the Table can be Table<int, Container<v> > for any Container supporting Add, Remove and Search. • Why not consider other things, say another hash table? A BST?
Open Addressing (a.k.a. Closed Hashing) • During a collision, find another slot for the entry • E.g. if slot h(k) is not empty, try h(k)+1, h(k)+2, etc. • Define the probe sequence <h(k, 0), h(k, 1), ..., h(k, m – 1)> as the sequence of slots to try (it should be a permutation of <0, 1, ..., m – 1>) • Both add and search then try the same sequence, so a search must find the pair <k, v> before it reaches an empty slot • How about delete? Search and mark it empty? (That would cut off entries stored later in the same probe sequence, so mark the slot as deleted instead) • Order: O(length of probe sequence)
Open Addressing Samples Add Man Add Max
Open Addressing Samples Add Man Search for Max
Open Addressing Samples Delete Man Search for Max
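The add, search and delete steps in these samples can be sketched with linear probing and a "deleted" marker. The original diagrams are not available, so the table size m = 8, the shared hash value 2 for Man and Max, and the stored values are assumptions made only for illustration.

```python
EMPTY = object()     # slot never used
DELETED = object()   # slot used earlier, entry since removed


class LinearProbingTable:
    def __init__(self, m, hash_fn):
        self.m = m
        self.h = hash_fn
        self.slots = [EMPTY] * m

    def _probe(self, key):
        """Yield h(k), h(k)+1, ... (mod m): a permutation of all slots."""
        start = self.h(key)
        for i in range(self.m):
            yield (start + i) % self.m

    def add(self, key, value):
        first_free = None
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is EMPTY:
                if first_free is None:
                    first_free = j
                break                          # key cannot be further along
            if slot is DELETED:
                if first_free is None:
                    first_free = j             # reusable, but keep looking for key
            elif slot[0] == key:
                self.slots[j] = (key, value)   # overwrite existing entry
                return
        if first_free is None:
            raise OverflowError("hash table is full")
        self.slots[first_free] = (key, value)

    def search(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is EMPTY:
                return None                    # a truly empty slot ends the search
            if slot is not DELETED and slot[0] == key:
                return slot[1]
        return None

    def remove(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot[0] == key:
                self.slots[j] = DELETED        # do not mark EMPTY
                return True
        return False


ASSUMED_HASH = {"Man": 2, "Max": 2}            # assumed collision
table = LinearProbingTable(8, ASSUMED_HASH.__getitem__)
table.add("Man", "X")                          # lands in slot 2
table.add("Max", "Z")                          # slot 2 taken, lands in slot 3
table.remove("Man")
found = table.search("Max")                    # skips the DELETED slot 2
```

Had `remove` set the slot back to `EMPTY`, the final search would stop at slot 2 and miss Max.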
Collision Resolution • The method outlined above is called linear probing • In general, h(k, i) = (h(k) + c i) mod m • Forms Primary Clustering • There is also quadratic probing • In general, h(k, i) = (h(k) + c_1 i^2 + c_2 i) mod m • Still forms Secondary Clustering
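To make the two probe sequences concrete, the sketch below lists them for a starting slot h(k) = 3 in a table of size m = 8. The constants are assumptions: c = 1 for linear probing, and c_1 = c_2 = 1/2 for quadratic probing, a choice that visits every slot when m is a power of two.

```python
def linear_probe(h0, m, c=1):
    """Probe sequence h(k, i) = (h(k) + c*i) mod m for i = 0 .. m-1."""
    return [(h0 + c * i) % m for i in range(m)]


def quadratic_probe(h0, m):
    """Probe sequence h(k, i) = (h(k) + i^2/2 + i/2) mod m.

    i*i + i is always even, so integer division by 2 is exact.
    """
    return [(h0 + (i * i + i) // 2) % m for i in range(m)]


lin = linear_probe(3, 8)      # [3, 4, 5, 6, 7, 0, 1, 2]
quad = quadratic_probe(3, 8)  # [3, 4, 6, 1, 5, 2, 0, 7]
```

Both sequences are permutations of 0..7. Two keys with the same h(k) still follow the identical sequence in either scheme, which is the source of secondary clustering.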
Double Hashing (Optional) • h(k, i) = ( h(k) + i h’(k) ) mod m • Note: h’(k) cannot be 0 • Meaningful h’(k) should be in [1, m) • E.g. m – k mod (m – 1)
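A small sketch of a double-hashing probe sequence for integer keys. Two assumptions are swapped in here and are not from the slide: m is prime, so any step in [1, m) visits every slot, and the second hash is the common choice h’(k) = 1 + (k mod (m – 1)), which always falls in [1, m – 1].

```python
def double_hash_probe(k, m):
    """Probe sequence h(k, i) = (h(k) + i * h2(k)) mod m for i = 0 .. m-1.

    Assumes m is prime; h and h2 below are illustrative choices.
    """
    h = k % m
    h2 = 1 + (k % (m - 1))        # never 0, always in [1, m - 1]
    return [(h + i * h2) % m for i in range(m)]


seq = double_hash_probe(10, 7)    # h = 3, h2 = 5 -> [3, 1, 6, 4, 2, 0, 5]
```

Unlike linear or quadratic probing, two keys that share h(k) usually get different step sizes, so their probe sequences diverge after the first slot.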
How good is Hashing? • Nearly constant time if very short list or very low probing rate • So we need • A uniform hash function (your job) • A larger hash table (trade it off with memory limit)
Size too small? (Optional) • Create a new hash table and re-hash all entries (not useful for OI use) • If using open addressing, re-hashing is needed anyway to remove the deleted items
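Re-hashing into a larger table might look like the sketch below for a chained table stored as a list of lists. The function names and the `key % m` hash are assumptions for illustration.

```python
def rehash(buckets, new_m, hash_fn):
    """Build a new chained table of size new_m and re-insert every entry."""
    new_buckets = [[] for _ in range(new_m)]
    for chain in buckets:
        for key, value in chain:
            new_buckets[hash_fn(key, new_m)].append((key, value))
    return new_buckets


def h(key, m):
    return key % m                 # illustrative hash for integer keys


# With m = 4, keys 4, 8 and 12 all collide in slot 0.
old = [[(4, "a"), (8, "b"), (12, "c")], [], [], []]
new = rehash(old, 8, h)            # with m = 8 they spread over slots 4, 0 and 4
```

Every entry has to be re-inserted because h depends on m, so the cost is proportional to the number of stored entries plus the table size.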
Extensible Hashing (Optional) • Use Table<int, Ptr> (Ptr is like the list in chaining) • The size m = 2^k • Given any uniform hash function h(k), g(k) = last k bits of h(k) • Ptr points to an array of size r, each storing an entry • The problem: what to do when the array is full
Extensible Hashing (Optional) h(‘Alan’) = 0, h(‘Man’) = 4, h(‘On’) = 12, h(‘Ben’) = 5, h(‘Max’)=5
Extensible Hashing (Optional) Add Si where h(‘Si’) = 9, i.e. g(‘Si’) = 01
Extensible Hashing (Optional) Add Unu where h(‘Unu’) = 4, i.e. g(‘Unu’) = 100 • The first array will be split according to h(k) • Still need to chain?
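The three extensible hashing slides can be replayed with the sketch below. Several details are assumptions, since the diagrams are not available: the array size r = 3, an initial directory of 2^2 slots, the class and method names, and the `max_depth` guard. The guard raises an error instead of chaining when more than r keys share the same hash bits, which is the situation the closing question hints at.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth    # number of low hash bits shared by all keys here
        self.items = {}       # key -> value, at most r entries


class ExtensibleHash:
    def __init__(self, hash_fn, r=3, depth=2, max_depth=16):
        self.h = hash_fn
        self.r = r
        self.depth = depth                    # directory uses the last `depth` bits
        self.max_depth = max_depth
        self.directory = [Bucket(depth) for _ in range(1 << depth)]

    def _g(self, key):
        """g(k): the last `depth` bits of h(k)."""
        return self.h(key) & ((1 << self.depth) - 1)

    def search(self, key):
        return self.directory[self._g(key)].items.get(key)

    def add(self, key, value):
        while True:
            b = self.directory[self._g(key)]
            if key in b.items or len(b.items) < self.r:
                b.items[key] = value
                return
            if b.depth == self.max_depth:
                raise OverflowError("too many keys share the same hash bits")
            if b.depth == self.depth:
                # Double the directory: slots i and i + 2^depth share a bucket.
                self.directory = self.directory + self.directory
                self.depth += 1
            self._split(b)

    def _split(self, b):
        bit = 1 << b.depth                    # the next hash bit to look at
        b.depth += 1
        new = Bucket(b.depth)
        for k in list(b.items):
            if self.h(k) & bit:
                new.items[k] = b.items.pop(k)
        for i in range(len(self.directory)):
            if self.directory[i] is b and i & bit:
                self.directory[i] = new


# Hash values taken from the slides.
H = {"Alan": 0, "Man": 4, "On": 12, "Ben": 5, "Max": 5, "Si": 9, "Unu": 4}
t = ExtensibleHash(H.__getitem__, r=3, depth=2)
for name in ("Alan", "Man", "On", "Ben", "Max", "Si", "Unu"):
    t.add(name, name.upper())
```

Adding Unu finds the array for g = 00 full, so the directory doubles to 8 slots and that array splits on the third bit: Alan stays under 000 while Man, On and Unu move to 100. The array holding Ben, Max and Si is not split and is simply shared by slots 001 and 101.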