CS 3343: Analysis of Algorithms Lecture 26: String Matching Algorithms
Definitions • Text: a longer string T (length m) • Pattern: a shorter string P (length n) • Exact matching: find all occurrences of P in T
The naïve algorithm • Align P (length n) at each position of T (length m); compare char by char until a mismatch or a complete match, then shift P right by one.
Time complexity • Worst case: O(mn) • Best case: O(m) • e.g. aaaaaaaaaaaaaa vs. baaaaaaa • Average case? • Alphabet size = k • Assume all characters are equally likely • How many chars do you need to compare before finding a mismatch? • On average: k / (k-1) • Therefore average-case complexity: mk / (k-1) • For a large alphabet, ~ m • Not as bad as you thought, huh?
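As a concrete baseline, the naïve algorithm can be sketched in a few lines of Python (a minimal 0-indexed illustration; the function name is mine, not from the slides):

```python
def naive_search(T, P):
    """Return all start positions of P in T by trying every alignment.

    Worst case O(mn): m - n + 1 alignments, up to n comparisons each.
    """
    m, n = len(T), len(P)
    hits = []
    for i in range(m - n + 1):           # each possible alignment of P in T
        j = 0
        while j < n and T[i + j] == P[j]:
            j += 1
        if j == n:                       # all n chars matched
            hits.append(i)
    return hits
```

Overlapping occurrences are reported as well, e.g. `naive_search("aaaa", "aa")` returns `[0, 1, 2]`.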
Real strings are not random • T: aaaaaaaaaaaaaaaaaaaaaaaaa • P: aaaab • Plus: O(m) average case is still bad for long strings! • Smarter algorithms: • O(m + n) in the worst case • sub-linear in practice • how is this possible?
How to speed up? • Pre-process T or P • Why can pre-processing save us time? • It uncovers the structure of T or P • Determines when we can skip ahead without missing anything • Determines when we can infer the result of character comparisons without actually doing them. • Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA
Cost for exact string matching • Total cost = cost(preprocessing) + cost(comparison) + cost(output) • preprocessing: overhead; comparison: minimize; output: constant • Hope: gain > overhead
String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking • One fixed T, many P • Search a completed genome for a short sequence • Two (or many) T's for common patterns • Would you preprocess P or T? • Always pre-process the shorter sequence, or the one that is repeatedly used
Pattern pre-processing algorithms • Karp-Rabin algorithm • Small alphabet and small pattern • Boyer-Moore algorithm • The choice for most cases • Typically sub-linear time • Knuth-Morris-Pratt algorithm (KMP) • Aho-Corasick algorithm • The algorithm behind the unix utility fgrep • Suffix tree • One of the most useful preprocessing techniques • Many applications
Algorithm KMP • Not the fastest, but the best known • Good for "real-time matching" • i.e. the text comes one char at a time • No memory of previous chars • Idea • Left-to-right comparison • Shift P by more than one char whenever possible
Intuitive example 1 • P = abcxabcde; the first 7 chars of P matched T (…abcxabc…), then a mismatch occurred when comparing P[8] with T[i]. • Observation: by reasoning on the pattern alone, we can determine that we can shift P by four chars and compare P[4] with T[i], without missing any possible matches. • The naïve approach would instead shift P by one and re-compare from the start. • Number of comparisons saved: 6
Intuitive example 2 • P = abcxabcde; the first 6 chars of P matched T (…abcxab…), then a mismatch occurred between P[7] and T[j], so T[j] is not a c. • Observation: by reasoning on the pattern alone, we can determine that we can shift P by six chars and compare T[j] with P[1] without missing any possible matches: shifting by only four would align T[j] with P[3] = c, which cannot match either, since T[j] is not a c. • Number of comparisons saved: 7
KMP algorithm: pre-processing • Key: the reasoning is done without even knowing what string T is. • Only the location of the mismatch in P must be known. • Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix t = P[j..i] that matches a prefix t' of P, such that the char after t (y = P[i+1]) is different from the char after t' (z). • For each i, let sp(i) = length(t)
KMP algorithm: shift rule • Shift rule: when a mismatch occurs between P[i+1] and a char x in T, shift P to the right by i − sp(i) chars, so that the prefix t' lines up with the matched suffix t, and compare x with z = P[sp(i)+1]. • This shift rule can be implicitly represented by creating a failure link from y to z. • Meaning: when a mismatch occurs between x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
Failure Link Example • P: aataac • P[i]: a a t a a c • sp(i): 0 1 0 0 2 0 • If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= sp(5) + 1 = 2 + 1)
Another example • P: abababc • P[i]: a b a b a b c • sp(i): 0 0 0 0 0 4 0 • If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= sp(6) + 1 = 4 + 1)
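The sp tables in the two examples above can be computed in O(n) time from the classic prefix function. A minimal sketch (my own 0-indexed code, so the slide's sp(i) appears here as sp[i-1]; the function name is mine):

```python
def strong_sp(P):
    """sp[i] = length of the longest proper suffix of P[0..i] that is also a
    prefix of P, with the extra KMP condition that the char after the prefix
    differs from the char after the suffix (the y != z rule on the slides).
    """
    n = len(P)
    pi = [0] * n                       # classic prefix function (no y != z rule)
    for i in range(1, n):
        k = pi[i - 1]
        while k and P[i] != P[k]:
            k = pi[k - 1]
        if P[i] == P[k]:
            k += 1
        pi[i] = k
    sp = [0] * n
    for i in range(n):
        k = pi[i]
        # walk down the border chain until the next chars actually differ
        while k and i + 1 < n and P[k] == P[i + 1]:
            k = pi[k - 1]
        sp[i] = k
    return sp
```

This reproduces both slides: `strong_sp("aataac")` gives `[0, 1, 0, 0, 2, 0]` and `strong_sp("abababc")` gives `[0, 0, 0, 0, 0, 4, 0]`.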
KMP example using failure links • P: aataac, T: aacaataaaaataaccttacta • In the animation, ^ marks an explicit comparison, * a mismatch, and . a char whose match is implied by a failure link (the alignments proceed ^^*, then .*, then ^^^^^*, then ..*, …, ending with .^^^^^ at the occurrence). • Time complexity analysis: • Each char in T may be compared up to n times. A lousy analysis gives O(mn) time. • More careful analysis: the comparisons can be broken into two phases: • Comparison phase: the first time a char in T is compared to P. Total is exactly m. • Shift phase: the first comparison made after each shift. Total is at most m. • Time complexity: O(2m) = O(m)
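A runnable KMP search along these lines (a sketch using the classic failure function rather than the strong sp(i) rule; the 2m comparison bound is the same, and the names are mine):

```python
def kmp_search(T, P):
    """Find all occurrences of a nonempty P in T in O(m + n) time."""
    n = len(P)
    pi = [0] * n                       # pi[i] = longest proper border of P[0..i]
    for i in range(1, n):
        k = pi[i - 1]
        while k and P[i] != P[k]:
            k = pi[k - 1]
        if P[i] == P[k]:
            k += 1
        pi[i] = k
    hits, k = [], 0                    # k = number of pattern chars matched
    for j, c in enumerate(T):
        while k and c != P[k]:         # follow failure links on a mismatch
            k = pi[k - 1]
        if c == P[k]:
            k += 1
        if k == n:                     # full match ending at position j
            hits.append(j - n + 1)
            k = pi[k - 1]
    return hits
```

On the slide's example, `kmp_search("aacaataaaaataaccttacta", "aataac")` returns `[9]` (0-indexed, i.e. position 10 in the slide's 1-indexed convention).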
KMP algorithm using a DFA (Deterministic Finite Automaton) • P: aataac • States 0–6; being in state i means the last i chars of T read so far match P[1..i]. • The failure link "if a char in T fails to match at pos 6, re-compare it with the char at pos 3" becomes an explicit transition: if the next char in T is t after matching 5 chars, go to state 3. • All other inputs go to state 0.
DFA Example • T: aacaataataataaccttacta • State after each char: 1201234534534560001001 • Each char in T is examined exactly once, so exactly m comparisons are made. • But pre-processing takes longer, and more space is needed to store the DFA.
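The DFA can be built in O(n|Σ|) time with a standard restart-state trick: keep a state x that tracks where the automaton would be after reading P[2..j], so mismatch transitions can be copied from it. A sketch (my own 0-indexed code; a state counts matched chars, and the accepting state's outgoing edges let the scan continue past a match):

```python
def build_dfa(P, alphabet):
    """dfa[c][s] = state after reading char c in state s (state = #chars matched)."""
    n = len(P)
    dfa = {c: [0] * (n + 1) for c in alphabet}
    dfa[P[0]][0] = 1
    x = 0                              # restart state: simulates P without its first char
    for j in range(1, n):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]      # on a mismatch, behave like the restart state
        dfa[P[j]][j] = j + 1           # on a match, advance
        x = dfa[P[j]][x]
    for c in alphabet:                 # continue scanning after a full match
        dfa[c][n] = dfa[c][x]
    return dfa

def dfa_search(T, P, alphabet):
    dfa, n, state, hits = build_dfa(P, alphabet), len(P), 0, []
    for i, c in enumerate(T):
        state = dfa[c][state]
        if state == n:                 # reached the accepting state
            hits.append(i - n + 1)
    return hits
```

For P = aataac this reproduces the slide's transitions (state 5 on input t goes to state 3), and scanning T = aacaataataataaccttacta visits exactly the state sequence 1201234534534560001001.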
Difference between Failure Link and DFA • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m (at least m) • DFA • Preprocessing time and space are O(n|Σ|) • May be a problem for a very large alphabet • For example, when each "char" is a big integer, or for Chinese characters • Comparison time is always m.
The set matching problem • Find all occurrences of a set of patterns in T • First idea: run KMP or BM for each P • O(km + n) • k: number of patterns • m: length of text • n: total length of patterns • Better idea: combine all patterns together and search in one run
A simpler problem: spell-checking • A dictionary contains five words: • potato • poetry • pottery • science • school • Given a document, check if any word is (not) in the dictionary • Words in document are separated by special chars. • Relatively easy.
Keyword tree for spell checking • Example document: "This version of the potato gun was inspired by the Weird Science team out of Illinois" • Build a trie of the five dictionary words (leaves numbered 1–5). • O(n) time to construct. n: total length of patterns. • Search time: O(m). m: length of text. • A common prefix only needs to be compared once. • What if there is no space between words?
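A keyword-tree (trie) spell checker along these lines fits in a few lines of Python (a minimal sketch; the helper names are mine):

```python
def build_trie(words):
    """Nested-dict trie; '$' marks the end of a dictionary word."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})  # shared prefixes reuse existing nodes
        node["$"] = True
    return root

def in_dictionary(trie, word):
    node = trie
    for c in word:
        if c not in node:
            return False
        node = node[c]
    return "$" in node                     # must end exactly at a word boundary

words = ["potato", "poetry", "pottery", "science", "school"]
trie = build_trie(words)
doc = ("This version of the potato gun was inspired by the "
       "Weird Science team out of Illinois")
unknown = [w for w in doc.lower().split() if not in_dictionary(trie, w)]
```

Here `unknown` collects every word of the document except potato and science; this relies on words being separated by spaces, which is exactly the assumption the last bullet questions.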
Aho-Corasick algorithm • Basis of the fgrep algorithm • Generalizing KMP • Using failure links • Example: given the following 4 patterns: • potato • tattoo • theater • other
Keyword tree • A trie of the four patterns (potato, tattoo, theater, other), with leaves numbered 1–4 and the root labeled 0.
Keyword tree • T: potherotathxythopotattooattoo • Searching with the tree alone restarts from the root at each position of T • O(mn) • m: length of text. n: length of the longest pattern
Keyword Tree with a failure link • T: potherotathxythopotattooattoo • As in KMP, a failure link points from a node to the node spelling the longest proper suffix of its string that is also a prefix of some pattern.
Keyword Tree with all failure links
Example • T: potherotathxythopotattooattoo • Scan T once, following tree edges and failure links.
Aho-Corasick algorithm • O(n) preprocessing, and O(m + k) searching • n: total length of patterns • m: length of text • k: number of occurrences • Can create a DFA as in KMP: • Requires more space • Preprocessing time depends on alphabet size • Search time is constant per character
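The whole algorithm — keyword tree, failure links computed by breadth-first search, then a single scan of T — can be sketched compactly (my own illustration; states are integers, with 0 the root):

```python
from collections import deque

def aho_corasick(text, patterns):
    """Scan text once; report (position, pattern) for every occurrence."""
    # Build the keyword tree: goto[s] maps a char to a child state.
    goto, out, fail = [{}], [[]], [0]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); out.append([]); fail.append(0)
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(p)
    # BFS to set failure links (children of the root fail to the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t] += out[fail[t]]     # inherit patterns ending at the link target
    # Scan the text.
    hits, s = [], 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

On the slides' text it reports other at position 1 and tattoo at position 18 (0-indexed).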
Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Karp-Rabin: small pattern, small alphabet • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes
Suffix tree • T: xabxac • Suffixes: xabxac, abxac, bxac, xac, ac, c • Difference from a keyword tree: create an internal node only when there is a branch • Naïve construction: O(m²) using Aho-Corasick • Smarter: O(m). Very technical, with a big constant factor
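A naïve O(m²) construction following the rule above (internal nodes only at branch points) can be sketched as follows. For readability this illustration stores edge labels as explicit strings rather than index pairs, and the names are mine:

```python
def suffix_tree(T):
    """Naive O(m^2) suffix tree: insert every suffix of T + '$' into a
    compressed trie. A node is a dict mapping an edge's first char to a
    (edge_label, child_node) pair."""
    T += "$"                                   # unique end marker
    root = {}
    for i in range(len(T)):
        node, s = root, T[i:]
        while True:
            c = s[0]
            if c not in node:                  # no edge starts with c: new leaf
                node[c] = (s, {})
                break
            label, child = node[c]
            j = 0                              # longest common prefix of s and label
            while j < len(label) and j < len(s) and label[j] == s[j]:
                j += 1
            if j == len(label):                # s continues below this edge
                node, s = child, s[j:]
            else:                              # s diverges mid-edge: split the edge
                mid = {label[j]: (label[j:], child)}
                node[c] = (label[:j], mid)
                node, s = mid, s[j:]           # '$' guarantees s[j:] is nonempty
    return root

def has_substring(root, P):
    """Every substring of T is a prefix of some suffix, i.e. a root path."""
    node, i = root, 0
    while i < len(P):
        c = P[i]
        if c not in node:
            return False
        label, child = node[c]
        k = 0
        while k < len(label) and i < len(P):
            if label[k] != P[i]:
                return False
            k += 1
            i += 1
        node = child
    return True
```

For T = xabxac this builds the same shape as the slide's figure: the shared prefix xa of xabxac and xac becomes one internal node, as does the shared a of abxac and ac.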
Suffix tree implementation • Explicitly label the sequence end: T: xabxa → T: xabxa$ • Without the terminal $, a suffix that is a prefix of another suffix (e.g. xa, a) would end in the middle of an edge instead of at a leaf.
Suffix tree implementation • Implicitly label the edges with index pairs into T instead of explicit substrings: e.g. the edge xa becomes 1:2, and an edge running to the end of the string becomes 3:$ • This keeps total space O(m), even though the edge labels sum to O(m²) characters.
Suffix links • Similar to failure links in a keyword tree • Only internal nodes (those with branches) are linked: the node spelling xα links to the node spelling α
Suffix tree construction • Example: T = acatgacatt (positions 1–10), with a terminal $ • Insert the suffixes left to right; the slides show the tree after each insertion, with edge labels written as index pairs (1:$, 2:$, 4:$, 5:$, …) • Each step adds one new leaf; when a new suffix diverges in the middle of an existing edge (e.g. suffix acatt against the earlier acatgacatt), that edge is split with a new internal node