CSCI 2342 Data Structures and Algorithms II, Dr. Tami Meredith. Lecture 4: Dictionaries (Chapter 18)
Abstraction in Program Design • Problems can be viewed at many levels of abstraction • In particular, the data structures we have seen are not necessarily unrelated • Some "data structures" we have seen are actually just concepts for data management • E.g., a priority queue is a data management concept implemented using a heap • The textbook uses the term ADT for both data storage structures and data management abstractions
Abstraction Layers • Examples: • Priority queue with an array-based heap in C++ • Dictionary using a linked BST with C pointers • Table using an array-based BST Index in a file • Set using a linked-list with Java objects and references
Key Concepts • High-level data management techniques such as queues, tables, or dictionaries are not data structures but abstraction concepts • Data structures are a program design concept for storing and retrieving data • Implementation is a programming team concern • The squeeze ...
Data Organisation • Data is almost always organised into records; e.g., all the data for a person, order, meeting, ... • Records are broken into fields; e.g., parts • One (or more) field(s) are used as a key (see Chapter 11) • Keys are used to sort and retrieve records • Often must search a collection of data for specific information • Example: Assignment 1 • Record: Athlete • Fields: name, income, salary, endorsements, sport • Key: name (this is how we sorted and looked it up)
Dictionaries • A data management concept where data manipulation is performed based on the key • Sometimes called a map (i.e., it maps a key to its record) • May also be called a table, though some would dispute that usage • Dictionary behavior changes significantly if keys are not unique (i.e., duplicates can exist) • Many systems require unique keys • Searching on anything other than the key is inefficient
FIGURE 18-3 A dictionary entry Dictionary Entries • Note: The data item can be broken up into fields – it is not necessarily a single field; it is shown as one because the ADT never manipulates the data in any way and treats it as a single entity
FIGURE 18-1 A collection of data about certain cities Example Dictionary (Key = City) Consider the need for searches through this data based on other than the name of the city
Operations: Dictionary ADT • Insert new item into dictionary • Remove item with given search key from dictionary • Find item with a given search key in dictionary • Traverse items in dictionary in sorted search-key order • Test whether dictionary is empty • Get number of items in dictionary • Remove all items from dictionary • Test whether dictionary contains an item with given search key (same as search really)
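The operations above can be sketched as a minimal sorted array-based dictionary in C++. This is only an illustration: string keys and int values are assumptions for concreteness, and the textbook's actual interface uses different names and templates.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Minimal sorted-array dictionary sketch covering the ADT operations above.
// Key/value types (string -> int) are illustrative assumptions.
class Dictionary {
    std::vector<std::pair<std::string, int>> items;  // kept sorted by key

    // Position of first entry with key >= k (binary search).
    std::size_t lowerBound(const std::string& k) const {
        return std::lower_bound(items.begin(), items.end(), k,
            [](const std::pair<std::string,int>& p, const std::string& key)
            { return p.first < key; }) - items.begin();
    }
public:
    bool isEmpty() const { return items.empty(); }
    std::size_t size() const { return items.size(); }
    void clear() { items.clear(); }                  // remove all items

    bool contains(const std::string& k) const {      // membership test
        std::size_t i = lowerBound(k);
        return i < items.size() && items[i].first == k;
    }
    bool insert(const std::string& k, int v) {       // unique keys assumed
        std::size_t i = lowerBound(k);
        if (i < items.size() && items[i].first == k) return false;
        items.insert(items.begin() + i, {k, v});
        return true;
    }
    bool remove(const std::string& k) {              // remove by search key
        std::size_t i = lowerBound(k);
        if (i == items.size() || items[i].first != k) return false;
        items.erase(items.begin() + i);
        return true;
    }
    bool find(const std::string& k, int& out) const { // find by search key
        std::size_t i = lowerBound(k);
        if (i == items.size() || items[i].first != k) return false;
        out = items[i].second;
        return true;
    }
};
```

Because the vector stays sorted by key, an in-order traversal is simply a left-to-right walk over `items`.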
High-Level Implementation • Have choices in our selection of data structures to implement a dictionary • Some choices are high level: sorted vs. unsorted • Some choices are low level: array vs. linked • Sorted (by search key), array-based • Sorted (by search key), link-based • Unsorted, array-based • Unsorted, link-based
Options for Implementation • Sorted Linked List • Unsorted Linked List • Sorted Array (could use STL Vector for array) • Unsorted Array • Unsorted Binary Tree • Binary Search Tree • ... others we haven't learned yet ...
FIGURE 18-4 The data members for two sorted linear implementations of the ADT dictionary for the data in Figure 18-1 (a) array based (b) link based Possible Implementations
FIGURE 18-5 The data members for a binary search tree implementation of the ADT dictionary for the data in Figure 18-1 Possible Implementations
Selecting an Implementation • Reasons for considering linear implementations • Perspective • Efficiency • Motivation • Questions to ask • What operations are needed? • How often is each operation required?
Selecting an Implementation • Amount of Data • Frequency of Change in Data (insert/delete) – static data can be better optimized • Frequency of Search • Availability of existing data structures (reuse) • Memory usage and availability • Cost of implementation and testing • Time available for implementation and testing Efficiency, Memory Use, Implementation Factors
Selecting an Implementation • Three Scenarios • Insertion and traversal in no particular order • Not a dictionary, just a list of stuff • Retrieval Only (Static data)– consider: • Binary search of a sorted array is equivalent to retrieval from a BST: O(log2n) • Is there enough data to justify sorting? • Is hashing possible? • Insertion, removal, retrieval, and traversal in sorted order (Dynamic data) • How much data? • How critical is this data to the application?
FIGURE 18-6 Insertion for unsorted linear implementations (a) array based (b) link based Selecting an Implementation
FIGURE 18-7 Insertion for sorted linear implementations (a) array based (b) pointer based Selecting an Implementation
FIGURE 18-8 The average-case order of dictionary operations for various implementations Selecting an Implementation Note: Ignore traversal (it's always linear)
Can we do better? • Idea: If we had 100 students at SMU, we could: • give them all a student number from 00 to 99 • store their data in an array • use the student number as an array offset • If we knew the student number, search = insertion = deletion = O(1) • Would work for SMU, but using A-numbers as array indices would need an array of 100,000,000 entries – fast but inefficient, lots of wasted space
Hashing • Need a different strategy to locate an item • Consider a "magic box" that acts as an address calculator • Place/retrieve the item at that address in an array • Ideally each key maps to a unique number • FIGURE 18-9 Address calculator
Hashing • The "Address Calculator" turns the key (e.g., an A-number) into a smaller number so we can use a smaller array • We call an address calculator a hash function: index = hash(key)
Hashing • Pseudocode for getItem
Hashing • Pseudocode for remove
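The textbook's pseudocode for getItem and remove appears in figures not reproduced here. The C++ sketch below captures the idea, assuming for now a collision-free hash (collision resolution comes later in these slides); the table size, key type, and function names are illustrative.

```cpp
#include <string>

const int TABLE_SIZE = 101;   // illustrative prime table size

// Illustrative hash function: modulo of an integer key.
int hashIndex(int key) { return key % TABLE_SIZE; }

struct Entry { int key; std::string data; bool occupied; };
Entry table[TABLE_SIZE];      // all slots start unoccupied (zero-initialized)

// insert: place the item at its hashed index.
bool insertItem(int key, const std::string& data) {
    int i = hashIndex(key);
    if (table[i].occupied) return false;   // collision: not handled here
    table[i] = {key, data, true};
    return true;
}

// getItem: compute the index from the key and read that slot.
bool getItem(int key, std::string& out) {
    int i = hashIndex(key);
    if (table[i].occupied && table[i].key == key) {
        out = table[i].data;
        return true;
    }
    return false;
}

// remove: compute the index and mark the slot unoccupied.
bool removeItem(int key) {
    int i = hashIndex(key);
    if (table[i].occupied && table[i].key == key) {
        table[i].occupied = false;
        return true;
    }
    return false;
}
```

Note that all three operations do a constant amount of work: one hash computation and one array access.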
Properties • Would like unique array indices – having two students hash to the same array slot is a problem • Would like to use as small an array as possible • Situation: 9000 students at SMU with an 8-digit A-number • Problem: Need a function that turns these 9000 A-numbers into unique numbers in the range [0..n−1], with n only slightly larger than 9000
Hash Functions • Possible algorithms • Selecting digits • Folding • Modulo arithmetic • Converting a character string to an integer • Use ASCII values • Factor the results, Horner’s rule
Selecting Digits • Very simple and easy • Leads to a poor distribution across the hash table • Not unique • Example: 3rd and 7th digit • 12345678 → 37 • 98454562 → 46 • 15477869 → 46
Folding • Simply add the digits • Note that for an 8-digit key, 0 ≤ hash(key) ≤ 72 • Limited range of results, but this can be improved by adding in different ways • Example: • 12345678 → 1+2+3+4+5+6+7+8 = 36 • 98454562 → 9+8+4+5+4+5+6+2 = 43 • 15477869 → 1+5+4+7+7+8+9+6 = 47
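Folding by digit addition can be sketched as a short function (the name is illustrative):

```cpp
// Folding: hash an integer key by summing its decimal digits.
int foldDigits(long key) {
    int sum = 0;
    for (long k = key; k > 0; k /= 10)
        sum += k % 10;   // peel off the low-order digit and add it
    return sum;
}
```

The examples above become calls like `foldDigits(12345678)`, which returns 36.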
Modulo Arithmetic • Simple and effective • Generally yields good results when the table size is prime: index = key % p • where p is the table size and is prime • Example: where p = 101 • 12345678 → 12345678 % 101 = 44 • 98454562 → 98454562 % 101 = 65 • 15477869 → 15477869 % 101 = 23
Finding Primes (Fast and Easy)
// Sieve of Eratosthenes
const int N = 10000000;
static bool isPrime[N];                            // static: too large for the stack
for (int i = 0; i < N; i++) isPrime[i] = true;     // initialize isPrime[]
isPrime[0] = isPrime[1] = false;                   // 0 and 1 are not prime
for (int i = 2; i * i < N; i++) {
    if (isPrime[i]) {
        for (int j = 2 * i; j < N; j += i) {
            isPrime[j] = false;                    // mark every multiple of i
        }
    }
}
// isPrime[i] is now true exactly when i is prime, for all i < N
Working with Characters • Could use numbers: a=1, b=2, ... • Similarly, could use the ASCII values (or parts thereof) • Yields very large numbers quickly • Adding, folding, or modulo methods can reduce these
Horner's Rule • Let a=1, b=2, and so on • Thus "note" = 14 15 20 5 • In binary, we have: 01110 01111 10100 00101 • We can concatenate the binary to make a single number: 01110011111010000101 • This is 14·32³ + 15·32² + 20·32¹ + 5·32⁰ • Horner noticed that this is more efficiently calculated as: (((14 * 32) + 15) * 32 + 20) * 32 + 5
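A sketch of a Horner's-rule string hash in C++. Reducing modulo the table size at every step is an added assumption (standard practice, not stated on the slide) that keeps intermediate values from overflowing on long strings; for short strings like "note" it gives the same result as computing the full number first and reducing once.

```cpp
#include <string>

const int HORNER_TABLE_SIZE = 101;   // illustrative prime table size

// Horner's-rule hash for a lowercase string, as on the slide:
// a=1, b=2, ..., accumulated in base 32, reduced mod the table size
// at each step to avoid overflow.
int hornerHash(const std::string& s) {
    long h = 0;
    for (char c : s)
        h = (h * 32 + (c - 'a' + 1)) % HORNER_TABLE_SIZE;
    return (int)h;
}
```

For "note", the full Horner value (((14·32)+15)·32+20)·32+5 = 474757, and 474757 % 101 = 57, which is what the stepwise version returns.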
Perfect Hashing • If SMU has 9876 students, the ideal would be a hash function that produces an integer n ∈ [0..9875] with all n unique (no duplicates) • Such a function is called a perfect hash function • Perfect hash functions are possible if the data are known in advance • See gperf – http://www.gnu.org/software/gperf/ • generates C or C++ perfect hash functions and tables for a given set of strings
Collisions • When the keys are not known in advance, a perfect hash function is not possible • A collision occurs when hash(k1) = hash(k2) but k1 ≠ k2 • We have two choices • Use a table so big that collisions can't occur, wasting a lot of space • Do a bit of extra work to find an empty space in a smaller table – "collision resolution"
FIGURE 18-10 A collision Collisions
Resolving Collisions • Approach 1: Open addressing • Probe (search) for another available location • Can be done linearly or quadratically • Wrap around at the end of the table • Removal requires recording the state of each slot • Occupied, empty, or removed • This is because a lookup stops when it finds an empty slot during probing • Clustering is a problem, and clusters can merge (YUCK!) • Hard to tell when to stop with quadratic probing (when have all slots been examined?) • Double hashing can reduce clustering
Probing Techniques if hash(key) = n, check: • Linear probing: n, n+1, n+2, n+3, ... • Quadratic probing: n, n+1², n+2², n+3², ... • Double hashing: n+hash2(key), n+2·hash2(key), n+3·hash2(key), ...
FIGURE 18-11 Linear probing with h ( x ) = x mod 101 Linear Probing
FIGURE 18-12 Quadratic probing with h ( x ) = x mod 101 Quadratic Probing
FIGURE 18-13 Double hashing during the insertion of 58, 14, and 91 Double Hashing
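The three probe sequences above can be written as index formulas. In this sketch the table size and the secondary hash function are illustrative choices; `7 - (key % 7)` is a common textbook form for a secondary hash because it can never return 0 (a step of 0 would probe the same slot forever).

```cpp
const int TSIZE = 101;   // illustrative prime table size

int hash1(int key) { return key % TSIZE; }          // primary hash
int hash2(int key) { return 7 - (key % 7); }        // secondary hash, never 0

// Index for the i-th probe (i = 0, 1, 2, ...), wrapping around the table.
int linearProbe(int key, int i)     { return (hash1(key) + i) % TSIZE; }
int quadraticProbe(int key, int i)  { return (hash1(key) + i * i) % TSIZE; }
int doubleHashProbe(int key, int i) { return (hash1(key) + i * hash2(key)) % TSIZE; }
```

All three agree on probe 0 (the home slot); they differ only in the step taken after a collision.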
Resolving Collisions • Approach 2: Restructuring the hash table • Each hash location can accommodate more than one item • Each location is a “bucket” or an array itself • Table size is known and fixed, lots of wasted space, no dynamic memory management • Alternatively, design the hash table as an array of linked chains – called “separate chaining”
FIGURE 18-14 Separate chaining Resolving Collisions
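Separate chaining, as in the figure, can be sketched as an array of linked chains, one per table slot (int keys and values are illustrative assumptions):

```cpp
#include <list>
#include <utility>
#include <vector>

// Separate-chaining hash table sketch: each slot holds a chain of
// (key, value) pairs, so collisions simply extend the chain.
class ChainedTable {
    std::vector<std::list<std::pair<int,int>>> buckets;
public:
    explicit ChainedTable(int size) : buckets(size) {}

    void insert(int key, int value) {
        buckets[key % buckets.size()].push_back({key, value});
    }
    bool find(int key, int& out) const {
        for (const auto& p : buckets[key % buckets.size()])
            if (p.first == key) { out = p.second; return true; }
        return false;
    }
    bool remove(int key) {
        auto& chain = buckets[key % buckets.size()];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); return true; }
        return false;
    }
};
```

Unlike open addressing, removal needs no "removed" marker: deleting a node from the chain cannot break anyone else's probe sequence.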
The Efficiency of Hashing • Hashing is O(1) when the hash function is perfect; collisions reduce this efficiency • Efficiency involves the load factor α = (number of items) / (table size) • Can change efficiency by changing the table size • Best values for α are less than ⅔
The Efficiency of Hashing • Linear probing – average number of comparisons for load factor α: • Successful search: ½[1 + 1/(1−α)] • Unsuccessful search: ½[1 + 1/(1−α)²]
The Efficiency of Hashing • Quadratic probing and double hashing – average number of comparisons for load factor α: • Successful search: −ln(1−α)/α • Unsuccessful search: 1/(1−α)
The Efficiency of Hashing • Separate chaining – average number of comparisons for load factor α: • Successful search: 1 + α/2 • Unsuccessful search: α
Maintaining Hashing Performance • Insertions (collisions and their resolution) cause the load factor α to increase • To maintain efficiency, restrict the size of α • α < 0.5 for open addressing • α < 1.0 for separate chaining • If the load factor exceeds these limits • Increase the size of the hash table • Rehash with a new hashing function
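The load-factor check and the rehashing step can be sketched as follows for a separate-chaining table. The sizes here are illustrative; a real implementation would pick the new size to be a prime roughly twice the old one.

```cpp
#include <list>
#include <vector>

// Load factor: items stored divided by table size.
double loadFactor(int numItems, int tableSize) {
    return (double)numItems / tableSize;
}

// Rehash a separate-chaining table of int keys into a larger one:
// every key is re-inserted using the new table size.
std::vector<std::list<int>> rehash(const std::vector<std::list<int>>& old,
                                   int newSize) {
    std::vector<std::list<int>> bigger(newSize);
    for (const auto& chain : old)
        for (int key : chain)
            bigger[key % newSize].push_back(key);  // key may land in a new slot
    return bigger;
}
```

Rehashing is O(n), but done rarely (each time the table grows), so its cost amortizes across the many O(1) insertions in between.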