CS5263 Bioinformatics

CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Definitions • Text: a longer string T • Pattern: a shorter string P • Exact matching: find all occurrence of P in T length m T P length n

The naïve algorithm

Time complexity • Worst case: O(mn) • Best case: O(m) • aaaaaaaaaaaaaa vs baaaaaaa • Average case? • Alphabet A, C, G, T • Assume both P and T are random • Equal probability • How many chars do you need to compare before moving to the next position?

Average case time complexity P(mismatch at 1st position): ¾ P(mismatch at 2nd position): ¼ * ¾ P(mismatch at 3nd position): (¼)2 * ¾ P(mismatch at kth position): (¼)k-1 * ¾ Expected number of comparison per position: p = 1/4 k (1-p) p(k-1) k = (1-p) / p * k pk k = 1/(1-p) = 4/3 Average complexity: 4m/3 Not as bad as you thought it might be

Biological sequences are not random T: aaaaaaaaaaaaaaaaaaaaaaaaa P: aaaab Plus: 4m/3 average case is still bad for long genomic sequences! Especially if P is not in T… Smarter algorithms: O(m + n) in worst case sub-linear in practice

String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking • One fixed T, many P • Search a completed genome for a short sequence • Two (or many) T’s for common patterns

How to speedup? • Pre-processing T or P • Why pre-processing can save us time? • Uncovers the structure of T or P • Determines when we can skip ahead without missing anything • Determines when we can infer the result of character comparisons without doing them. ACGTAXACXTAXACGXAX ACGTACA

Cost for exact string matching Total cost = cost (preprocessing) + cost(comparison) + cost(output) Overhead Minimize Constant Hope: gain > overhead

Which string to preprocess? • One T and one P • Preprocessing P? • One T and many P all at once • Preprocessing P or T? • One fixed T, many P (unknown) • Preprocessing T? • Two (or many) T’s for common patterns • ???

Pattern pre-processing algs • Karp – Rabin algorithm • Small alphabet and small pattern • Boyer – Moore algorithm • the choice of most cases • Typically sub-linear time • Knuth-Morris-Pratt algorithm (KMP) • grep • Aho-Corasick algorithm • fgrep

Karp – Rabin Algorithm • Let’s say we are dealing with binary numbers Text: 01010001011001010101001 Pattern: 101100 • Convert pattern to integer 101100 = 2^5 + 2^3 + 2^2 = 44

Karp – Rabin algorithm Text: 01010001011001010101001 Pattern: 101100 = 44 decimal 10111011001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46 10111011001010101001 = 46 * 2 – 64 + 1 = 29 10111011001010101001 = 29 * 2 - 0 + 1 = 59 10111011001010101001 = 59 * 2 - 64 + 0 = 54 10111011001010101001 = 54 * 2 - 64 + 0 = 44

Karp – Rabin algorithm What if the pattern is too long to fit into a single integer? Pattern: 101100. But our machine only has 5 bits Basic idea: hashing. 44 % 13 = 5 10111011001010101001 = 46 (% 13 = 7) 10111011001010101001 = 46 * 2 – 64 + 1 = 29 (% 13 = 3) 10111011001010101001 = 29 * 2 - 0 + 1 = 59 (% 13 = 7) 10111011001010101001 = 59 * 2 - 64 + 0 = 54 (% 13 = 2) 10111011001010101001 = 54 * 2 - 64 + 0 = 44 (% 13 = 5)

Boyer – Moore algorithm • Three ideas: • Right-to-left comparison • Bad character rule • Good suffix rule

Boyer – Moore algorithm • Right to left comparison x y Skip some chars without missing any occurrence. y But how?

Bad character rule 0 1 12345678901234567 T:xpbctbxabpqqaabpq P: tpabxab *^^^^ What would you do now?

Bad character rule 0 1 12345678901234567 T:xpbctbxabpqqaabpq P: tpabxab *^^^^ P: tpabxab

Bad character rule 0 1 123456789012345678 T:xpbctbxabpqqaabpqz P: tpabxab *^^^^ P: tpabxab * P: tpabxab

Basic bad character rule tpabxab Pre-processing: O(n)

Basic bad character rule k T: xpbctbxabpqqaabpqz P: tpabxab *^^^^ When rightmost T(k) in P is left to i, shift pattern P to align T(k) with the rightmost T(k) in P Shift 3 – 1 = 2 i = 3 P: tpabxab

Basic bad character rule k T: xpbctbxabpqqaabpqz P: tpabxab * When T(k) is not in P, shift left end of P to align with T(k+1) i = 7 Shift 7 – 0 = 7 P: tpabxab

Basic bad character rule k T: xpbctbxabpqqaabpqz P: tpabxab *^^ When rightmost T(k) in P is right to i, shift pattern P one pos i = 5 5 – 6 < 0. so shift 1 P: tpabxab

Extended bad character rule k T: xpbctbxabpqqaabpqz P: tpabxab *^^ Find T(k) in P that is immediately left to i, shift P to align T(k) with that position i = 5 5 – 3 = 2. so shift 2 P: tpabxab Preprocessing still O(n)

Extended bad character rule • Best possible: m / n comparisons • Works better for large alphabet size • In some cases the extended bad character rule is sufficiently good • Worst-case: O(mn)

0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab According to extended bad character rule

(weak) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab

(Weak) good suffix rule x t T In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, t ≠ t’ y P t’ t y P t t’

(Strong) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^

(Strong) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab

(Strong) good suffix rule • Pre-processing can be done in linear time • If P in T, may take O(mn) • If P not in T, worst-case O(m+n) x t T In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, t ≠ t’, and the char left to t ≠ the char left to t’ z y P t’ t z y P t t’

Lessons From B-M • Sub-linear time is possible • But we still need to read T from disk! • Bad cases require periodicity in P or T • matching random P with T is easy! • Large alphabets mean large shifts • Small alphabets make complicated shift data-structures possible • B-M better for “english” and amino-acids than for DNA.

Algorithm KMP • Not the fastest • Best known • Good for multiple pattern matching and real-time matching • Idea • Left-to-right comparison • Shift P more chars when possible

Basic idea x t T z y P t t’ z y P t t’ In pre-processing: for any position i in P, find the longest proper suffix of P, t = P[j+1..i], such that t matches to a prefix of P, t’, and the next char of t is different from the next char of t’, i.e., P[i+1] != P[i-j+1]. Sp’(i) = length(t)

Example P: aataac a a t a a c Sp’(i) 0 1 0 0 2 0 aaat aataac

Failure link P: aataac If a char in T fails to match at pos 6, re-compare it with the char at pos 3 a a t a a c Sp’(i) 0 1 0 0 2 0 aaat aataac

FSA If the next char in T is t, we go to state 3 P: aataac t a a t a a c a 0 1 2 3 4 5 6 Sp’(i) 0 1 0 0 2 0 aaat aataac All other input goes to state 0

Another example P: abababc a b a b a b c Sp’(i) 0 0 0 0 0 4 0 abab abababab ababaababc

Failure link P: abababc If a char in T fails to match at pos 7, re-compare it with the char at pos 5 a b a b a b c Sp’(i) 0 0 0 0 0 4 0 ababaababc

FSA P: abababc If the next char in T is a, go to state 5 a a a b b a b c 0 1 2 3 4 5 6 7 Sp’(i) 0 0 0 0 0 4 0 ababaababc All other input goes to state 0

Difference between Failure Link and FSA? • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m • FSA • Preprocessing time and space are O(n ||) • May be a problem for very large alphabet size • Comparison time is always m.

Failure link P: aataac If a char in T fails to match at pos 6, re-compare it with the char at pos 3 a a t a a c Sp’(i) 0 1 0 0 2 0 aaat aataac

Example a a t a a c T: aacaataaaaataaccttacta aataac ^^* aataac .* Each char in T may be compared multiple times. Up to n. Time complexity: O(2m). Comparison phase and shift phase. Comparison is bounded by m, shift is also bounded by m. aataac ^^^^^* aataac ..* aataac .^^^^^

Example t a a t a a c a 0 1 2 3 4 5 6 T: aacaataaaaataaccttacta 1201234501234560001001 Each char in T will be examined exactly once. Therefore, exact m comparisons are needed. Takes longer to do pre-processing.

How to do pre-processing?

CS5263 Bioinformatics