CSCE350 Algorithms and Data Structure Lecture 16 Jianjun Hu Department of Computer Science and Engineering University of South Carolina 2009.11.
Chapter 7: Space-Time Tradeoffs • For many problems, some extra space really pays off: • extra space in tables (breathing room?) • hashing • non-comparison-based sorting • input enhancement • indexing schemes (e.g., B-trees) • auxiliary tables (shift tables for pattern matching) • tables of information that do all the work • dynamic programming
String Matching • pattern: a string of m characters to search for • text: a (long) string of n characters to search in • Brute-force algorithm: • Align the pattern at the beginning of the text • Moving from left to right, compare each character of the pattern to the corresponding character in the text until • all characters are found to match (successful search); or • a mismatch is detected • While the pattern is not found and the text is not yet exhausted, realign the pattern one position to the right and repeat the comparison step. What is the complexity of brute-force string matching?
String Searching - History • 1970: Cook shows (using finite-state machines) that the problem can be solved in time proportional to n + m • 1976: Knuth and Pratt find an algorithm based on Cook’s idea; Morris independently discovers the same algorithm in an attempt to avoid “backing up” over the text • At about the same time, Boyer and Moore find an algorithm that examines only a fraction of the text in most cases (by comparing characters in pattern and text from right to left, instead of left to right) • 1980: Another algorithm, proposed by Rabin and Karp, virtually always runs in time proportional to n + m, extends easily to two-dimensional pattern matching, and is almost as simple as the brute-force method
Horspool’s Algorithm • A simplified version of Boyer-Moore algorithm that retains key insights: • compare pattern characters to text from right to left • given a pattern, create a shift table that determines how much to shift the pattern when a mismatch occurs (input enhancement)
Consider the Problem • Search for the pattern BARBER in some text • Compare the pattern against the current text position from right to left • If the whole pattern matches, done. • Otherwise, decide the shift distance of the pattern (how far to move it to the right) • Let ‘c’ be the text character aligned with the last character of the pattern: there are four cases!
Shift Distance -- Case 1: • There is no ‘c’ anywhere in the pattern. Shift by m, the length of the pattern
Shift Distance -- Case 2: • ‘c’ occurs in the pattern, but not as its last character. Shift to align the rightmost occurrence of ‘c’ in the pattern with the ‘c’ in the text
Shift Distance -- Case 3: • ‘c’ matches the last character of the pattern, but there is no ‘c’ among the other m – 1 characters. As in Case 1, shift by m
Shift Distance -- Case 4: • ‘c’ matches the last character of the pattern, and there are other ‘c’s among the first m – 1 characters. As in Case 2, align the rightmost of those occurrences
We can precompute the shift distance for every possible character ‘c’ (given a pattern) • Shift table for the pattern BARBER (m = 6): • A → 4, B → 2, E → 1, R → 3, every other character → 6
Example • See Section 7.2 for the pseudocode of the shift-table construction algorithm and of Horspool’s algorithm • Example: find the pattern BARBER in the following text
Algorithm Efficiency • The worst-case complexity is Θ(nm) • On average, it is Θ(n) • It is usually much faster than the brute-force algorithm • A simple exercise: create the shift table over the 26 letters and the space character for the pattern BAOBAB
Boyer-Moore algorithm • Based on the same two ideas: • compare pattern characters to text from right to left • given a pattern, create a shift table that determines how much to shift the pattern when a mismatch occurs (input enhancement) • Uses an additional shift table built with the same idea applied to the number of matched characters • The worst-case efficiency of the Boyer-Moore algorithm is linear. See Section 7.2 of the textbook for details
Boyer-Moore algorithm Bad-symbol shift
Space and Time Tradeoffs: Hashing • A very efficient method for implementing a dictionary, i.e., a set with the operations: • insert • find • delete • Applications: • databases • symbol tables • Arrays address by index number; hash tables address by content
Example Application: How to store student records in a data structure? • Store student records keyed by SSN: • xxx-xx-6453 Jeffrey ….. Los Angeles …. • xxx-xx-2038 • xxx-xx-0913 • xxx-xx-4382 • xxx-xx-9084 • xxx-xx-2498
Hash tables and hash functions • Hash table: an array with indices that correspond to buckets • Hash function: determines the bucket for each record • Example: student records, key=SSN. Hash function: h(k) = k mod m (k is a key and m is the number of buckets) • if m = 1000, where is record with SSN= 315-17-4251 stored? • Hash function must: • be easy to compute • distribute keys evenly throughout the table
Collisions • If h(k1) = h(k2), then there is a collision • Good hash functions result in fewer collisions • Collisions can never be completely eliminated • The two types of hashing handle collisions differently: • Open hashing: each bucket points to a linked list of all keys hashing to it • Closed hashing: • one key per bucket • in case of collision, find another bucket for one of the keys (needs a collision-resolution strategy) • linear probing: use the next bucket • double hashing: use a second hash function to compute the increment
Example of Open Hashing • Store the student records in 10 buckets using the hash function • h(SSN) = SSN mod 10 • xxx-xx-6453 • xxx-xx-2038 • xxx-xx-0913 • xxx-xx-4382 • xxx-xx-9084 • xxx-xx-2498 • Resulting buckets: 2 → 4382; 3 → 6453, 0913; 4 → 9084; 8 → 2038, 2498
Open hashing • If hash function distributes keys uniformly, average length of linked list will be α =n/m (load factor) • Average number of probes = 1+α/2 • Worst-case is still linear! • Open hashing still works if n>m.
Example: Closed Hashing (Linear Probing) • xxx-xx-6453 • xxx-xx-2038 • xxx-xx-0913 • xxx-xx-4382 • xxx-xx-9084 • xxx-xx-2498
Hash function for strings • h(SSN) = SSN mod 10 requires SSN to be an integer • What if the key is a string? • Example: key = “ABADGUYISMOREWELCOME” ….. Bill ….
Hash function for strings • h(SSN) = SSN mod 10 requires SSN to be an integer • What if the key is a string? • A common string hash (sdbm), here in ANSI C:

static unsigned long sdbm(const unsigned char *str) {
    unsigned long hash = 0;
    int c;
    while ((c = *str++) != 0)
        hash = c + (hash << 6) + (hash << 16) - hash;
    return hash;
}
Closed Hashing (Linear Probing) • Avoids pointers. • Does not work if n>m. • Deletions are not straightforward. • Number of probes to insert/find/delete a key depends on load factor α = n/m (hash table density) • successful search: (½) (1+ 1/(1- α)) • unsuccessful search: (½) (1+ 1/(1- α)²) • As the table gets filled (α approaches 1), number of probes increases dramatically:
B-Trees • Organize data for fast queries • Provide an index for fast search • For datasets of structured records, B-tree indexing is commonly used
Motivation (cont.) • Assume that we use an AVL tree to store about 20 million records • We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so a search takes about 24 disk accesses, or about 0.2 seconds • We can’t improve on the log2 n lower bound on search in a binary tree • But the solution is to use more branches and thus reduce the height of the tree! • As branching increases, depth decreases
Constructing a B-tree • Example tree: root [17]; second level [3 8] (keys < 17) and [28 48] (keys > 17); leaves [1 2], [6 7], [12 14 16] under [3 8], and [25 26], [29 45], [52 53 55 68] under [28 48]
Definition of a B-tree • A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which: 1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree 2. all leaves are on the same level 3. all non-leaf nodes except the root have at least ⌈m/2⌉ children 4. the root is either a leaf node, or it has from two to m children 5. a leaf node contains no more than m – 1 keys • The number m should always be odd
Comparing Trees • Binary trees • Can become unbalanced and lose their good time complexity (big O) • AVL trees are strict binary trees that overcome the balance problem • Heaps remain balanced but only prioritise (rather than fully order) the keys • Multi-way trees • B-Trees are m-way: they can have any (odd) number of children • One B-Tree, the 2-3 (or 3-way) B-Tree, approximates a permanently balanced binary tree, exchanging the AVL tree’s balancing operations for insertion and (more complex) deletion operations
Announcement • Midterm Exam 2 Statistics: • Points Possible: 100 • Class Average: 85 (expected) • High Score: 100