Learn about pair-wise alignment, multiple alignment, and exact string matching algorithms for efficient comparisons without errors. Discover the importance of exact matching in bioinformatics and strategies to improve performance. Dive into Karp-Rabin and Knuth-Morris-Pratt algorithms for advanced string matching.
CS5263 Bioinformatics Lecture 9-10 Exact String Matching Algorithms
Overview • Pair-wise alignment • Multiple alignment • Commonality: allowing errors when comparing strings • Two sub-problems: • How to score an alignment with errors • How to find an alignment with the best score • Today: exact string matching • Do not allow any errors • Efficiency becomes the sole consideration
Why exact string matching? • The most fundamental string comparison problem • Word processors • Information retrieval • DNA sequence retrieval • Many, many more • Is it still an interesting research problem? • Yes, if the database is large • Exact string matching is often the core of more complex string comparison algorithms • E.g., BLAST • Often repeatedly called by other methods • Usually the most time-consuming part • A small improvement could improve overall efficiency considerably
Definitions • Text: a longer string T (length m) • Pattern: a shorter string P (length n) • Exact matching: find all occurrences of P in T
Time complexity • Worst case: O(mn) • Best case: O(m), e.g. T = aaaaaaaaaaaaaa vs P = baaaaaaa (every alignment fails on the first comparison) • Average case? • Alphabet A, C, G, T • Assume both P and T are random, with equal probability for each char • On average, how many chars do you need to compare before giving up at each position?
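As a baseline, the brute-force matcher these bounds refer to can be sketched as follows (a minimal illustration; the function name and the comparison counter are our own additions, not from the lecture):

```python
def naive_match(text, pattern):
    """Slide P along T one position at a time; report all match positions.

    Returns (occurrences, comparisons) so the O(mn) worst case and the
    O(m) best case are easy to observe directly.
    """
    m, n = len(text), len(pattern)
    occurrences, comparisons = [], 0
    for start in range(m - n + 1):
        k = 0
        while k < n:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break           # mismatch: give up on this alignment
            k += 1
        else:
            occurrences.append(start)   # whole pattern matched
    return occurrences, comparisons
```

With a pattern that mismatches only at its last char against a text of all a's, every alignment costs n comparisons (the O(mn) worst case); with a pattern whose first char never occurs in the text, every alignment costs one comparison (the O(m) best case).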
Average case time complexity
P(mismatch at 1st position): ¾
P(mismatch at 2nd position): ¼ · ¾
P(mismatch at 3rd position): (¼)² · ¾
P(mismatch at kth position): (¼)^(k−1) · ¾
Expected number of comparisons per position, with p = ¼ (the probability that a single char matches):
E = Σ_k k · (1−p) · p^(k−1) = 1/(1−p) = 4/3
Average complexity: 4m/3. Not as bad as you thought it might be.
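The 4/3 figure is easy to check empirically. The sketch below (our own illustration; the names and trial counts are arbitrary) generates random windows over {A, C, G, T} and counts how many chars the naive matcher examines at a single alignment:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
ALPHABET = "ACGT"

def comparisons_at(text, pattern, start):
    """Chars the naive matcher examines at one alignment before giving up."""
    count = 0
    for k in range(len(pattern)):
        count += 1
        if text[start + k] != pattern[k]:
            break
    return count

n_trials = 100_000
pattern = "".join(random.choice(ALPHABET) for _ in range(10))
total = 0
for _ in range(n_trials):
    window = "".join(random.choice(ALPHABET) for _ in range(10))
    total += comparisons_at(window, pattern, 0)
avg = total / n_trials
print(avg)  # close to 4/3 ≈ 1.333
```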
Biological sequences are not random • T: aaaaaaaaaaaaaaaaaaaaaaaaa • P: aaaab • Plus: the 4m/3 average case is still bad for long genomic sequences! Especially if the search has to be done again and again • Smarter algorithms: O(m + n) in the worst case; sub-linear in practice
How to speed up? • Pre-process T or P • Why can pre-processing save us time? • It uncovers the structure of T or P • It determines when we can skip ahead without missing anything • It determines when we can infer the result of character comparisons without doing them. • Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA
Cost for exact string matching • Total cost = cost(preprocessing) + cost(comparison) + cost(output) • Preprocessing: overhead • Comparison: minimize • Output: constant • Hope: gain > overhead
String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking (fixed P) • One fixed T, many P • Search a completed genome for short sequences • Two (or many) T’s for common patterns • Q: Which one to pre-process? • A: Always pre-process the shorter seq, or the one that is repeatedly used
Pre-processing algorithms • Pattern preprocessing • Karp-Rabin algorithm • Small alphabet and short patterns • Knuth-Morris-Pratt algorithm (KMP) • Aho-Corasick algorithm • Multiple patterns • Boyer-Moore algorithm • The choice in most cases • Typically sub-linear time • Text preprocessing • Suffix tree • Very useful for many purposes
Karp-Rabin Algorithm • Let's say we are dealing with binary numbers Text: 01010001011001010101001 Pattern: 101100 • Convert the pattern to an integer: 101100 = 2^5 + 2^3 + 2^2 = 44
Karp-Rabin algorithm
Text: 10111011001010101001 Pattern: 101100 = 44 decimal
A 6-bit window slides along the text; its value is updated in O(1): new = 2 · old − 64 · (bit leaving) + (bit entering)
[101110]11001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46
1[011101]1001010101001 = 46 · 2 − 64 + 1 = 29
10[111011]001010101001 = 29 · 2 − 0 + 1 = 59
101[110110]01010101001 = 59 · 2 − 64 + 0 = 54
1011[101100]1010101001 = 54 · 2 − 64 + 0 = 44 → match!
Karp-Rabin algorithm
What if the pattern is too long to fit into a single integer? Pattern: 101100, but our machine only has 5 bits.
Basic idea: hashing. Keep every value mod a prime: 44 % 13 = 5
[101110]11001010101001 = 46 (% 13 = 7)
1[011101]1001010101001 = 46 · 2 − 64 + 1 = 29 (% 13 = 3)
10[111011]001010101001 = 29 · 2 − 0 + 1 = 59 (% 13 = 7)
101[110110]01010101001 = 59 · 2 − 64 + 0 = 54 (% 13 = 2)
1011[101100]1010101001 = 54 · 2 − 64 + 0 = 44 (% 13 = 5) → hash matches; verify the window directly, since different windows can share a remainder
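A sketch of the full algorithm for binary strings, using the same rolling update as the slides (the function name and the q = 13 default are our own illustrative choices):

```python
def karp_rabin_binary(text, pattern, q=13):
    """Karp-Rabin over a binary alphabet: roll a window hash mod q, and
    verify each candidate directly to rule out hash collisions."""
    m, n = len(text), len(pattern)
    if n > m:
        return []
    high = pow(2, n - 1, q)             # weight of the outgoing bit, mod q
    p_hash = int(pattern, 2) % q
    w_hash = int(text[:n], 2) % q
    occurrences = []
    for start in range(m - n + 1):
        # equal hashes only *suggest* a match; confirm by direct comparison
        if w_hash == p_hash and text[start:start + n] == pattern:
            occurrences.append(start)
        if start + n < m:
            # drop the leading bit, shift left, append the next bit (mod q)
            w_hash = (2 * (w_hash - int(text[start]) * high)
                      + int(text[start + n])) % q
    return occurrences
```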
Algorithm KMP • Not the fastest • Best known • Good for “real-time matching” • i.e. text comes one char at a time • No memory of previous chars • Idea • Left-to-right comparison • Shift P more than one char whenever possible
Intuitive example 1 • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches. • Number of comparisons saved: 6
T:  ......abcxabc?...
P:        abcxabcde          ← mismatch when comparing P[8] with T[i] (the ?)
P:            abcxabcde      ← shift by 4; P[1..3] = abc realigns under T's last abc; resume at P[4] vs T[i]
Naïve approach: shift P by one char and restart the comparison from P[1].
Intuitive example 2 • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches. • Number of comparisons saved: 7
T:  ......abcxab?...
P:        abcxabcde          ← mismatch when comparing P[7] with T[j] (the ?)
P:              abcxabcde    ← shift by 6; resume comparing P[1] with T[j]
Why six? The matched region abcxab ends in ab, which is also a prefix of P, but the char after that prefix is P[3] = c, and T[j] cannot be c (it just mismatched P[7] = c). So no shorter shift can succeed.
Naïve approach: shift P by one char and restart the comparison from P[1].
KMP algorithm: pre-processing • Key: the reasoning is done without even knowing what string T is. Only the location of the mismatch in P must be known. • Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix t = P[j..i] that matches a prefix t′ of P, such that the char y following t (i.e., P[i+1]) differs from the char z following t′. • For each i, let sp(i) = length(t).
KMP algorithm: shift rule • Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) chars, so that the prefix t′ lines up under the just-matched suffix t, and compare the text char x = T[k] with z = P[sp(i)+1]. • This shift rule can be implicitly represented by creating a failure link from y to z. Meaning: when a mismatch occurs between x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
Failure Link Example • P: aataac • If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= sp(5) + 1 = 2 + 1)
P:     a a t a a c
sp(i): 0 1 0 0 2 0
Another example • P: abababc • If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= sp(6) + 1 = 4 + 1)
P:     a b a b a b c
sp(i): 0 0 0 0 0 4 0
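The sp tables in these two examples can be computed from the classic prefix function. The sketch below (our own; 0-based indexing, so sp[i] here is the slides' sp(i+1)) first finds the longest border of every prefix, then enforces the "next chars must differ" condition; for the last position there is no next char, so sp equals the plain border length:

```python
def strong_sp(p):
    """Compute the 'strong' sp values used by the KMP shift rule."""
    n = len(p)
    # classic prefix function: pi[i] = length of the longest proper
    # border (suffix that is also a prefix) of p[:i+1]
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    # strong version: the char after the border must differ from p[i+1];
    # otherwise fall back to the next shorter border
    sp = [0] * n
    for i in range(n):
        k = pi[i]
        while k and i + 1 < n and p[k] == p[i + 1]:
            k = pi[k - 1]
        sp[i] = k
    return sp
```

Running it on the two slide patterns reproduces both tables.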
KMP Example using Failure Link • P: aataac (^ = explicit comparison, * = mismatch, . = implicit comparison whose outcome is already known; some intermediate attempts are omitted)
T: aacaataaaaataaccttacta
   aataac
   ^^*
    aataac
    .*
      aataac
      ^^^^^*
         aataac
         ..*
            aataac
            .^^^^^  ← match
• Time complexity analysis: • Each char in T may be compared up to n times. A lousy analysis gives O(mn) time. • More careful analysis: the comparisons can be broken into two phases: • Comparison phase: the first time a char in T is compared to P. Total is exactly m. • Shift phase: comparisons made right after a shift. Total is at most m. • Time complexity: O(2m)
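A compact KMP search along these lines (a standard prefix-function formulation, equivalent to the sp-based shifts; our own sketch, 0-based positions):

```python
def kmp_search(text, pattern):
    """Find all occurrences of pattern in text in O(m + n) time
    using prefix-function failure links (0-based start positions)."""
    n = len(pattern)
    # pi[i]: length of the longest proper border of pattern[:i+1]
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    occurrences, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = pi[k - 1]          # follow failure links instead of restarting
        if c == pattern[k]:
            k += 1
        if k == n:                 # full match ending at position i
            occurrences.append(i - n + 1)
            k = pi[k - 1]          # keep going: overlaps are allowed
    return occurrences
```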
KMP algorithm using a DFA (Deterministic Finite Automaton) • P: aataac • Failure link view: if a char in T fails to match at pos 6, re-compare it with the char at pos 3 • DFA view: if the next char in T is t after matching 5 chars, go directly to state 3
DFA transition table (state i = "the last i chars of T match P[1..i]"; state 6 = match):
state: 0 1 2 3 4 5
on a:  1 2 2 4 5 2
on t:  0 0 3 0 0 3
on c:  0 0 0 0 0 6
All other inputs go to state 0.
DFA Example • P: aataac
T:      aacaataataataaccttacta
States: 1201234534534560001001
Each char in T is examined exactly once. Therefore, exactly m comparisons (table lookups) are made. But it takes longer to do pre-processing, and needs more space to store the FSA.
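The DFA can be built in O(n·|Σ|) time by carrying along a "restart" state (this is the classic construction popularized by Sedgewick, adapted here as a sketch; names and the alphabet parameter are our own):

```python
def build_dfa(pattern, alphabet):
    """KMP DFA: dfa[c][j] = next state from state j on char c.
    State j means 'the last j chars of the text match pattern[:j]'."""
    m = len(pattern)
    dfa = {c: [0] * m for c in alphabet}
    dfa[pattern[0]][0] = 1
    x = 0                              # state reached by pattern[1:j]
    for j in range(1, m):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]      # mismatch: copy the restart state's moves
        dfa[pattern[j]][j] = j + 1     # match: advance
        x = dfa[pattern[j]][x]
    return dfa, x                      # x = fallback state after a full match

def dfa_search(text, pattern, alphabet):
    dfa, restart = build_dfa(pattern, alphabet)
    state, occurrences = 0, []
    for i, c in enumerate(text):
        state = dfa[c][state] if c in dfa else 0   # one lookup per text char
        if state == len(pattern):
            occurrences.append(i - len(pattern) + 1)
            state = restart            # continue so overlapping matches are found
    return occurrences
```

On P = aataac and the slide's text, the automaton visits exactly the state sequence shown above and reports one occurrence.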
Difference between Failure Link and DFA • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m (at least m) • DFA • Preprocessing time and space are O(n·|Σ|) • May be a problem for a very large alphabet • For example, when each “char” is a big integer • Chinese characters • Comparison time is always m.
Boyer-Moore algorithm • Often the algorithm of choice in many cases • One T and one P • We will talk about it later if we have time • The original version does not guarantee linear time • Later modifications achieve it • In practice sub-linear
The set matching problem • Find all occurrences of a set of patterns in T • First idea: run KMP or BM for each P • O(km + n) • k: number of patterns • m: length of text • n: total length of patterns • Better idea: combine all patterns together and search in one run
A simpler problem: spell-checking • A dictionary contains five words: • potato • poetry • pottery • science • school • Given a document, check if any word is (not) in the dictionary • Words in document are separated by special chars. • Relatively easy.
Keyword tree for spell checking • Example document: “This version of the potato gun was inspired by the Weird Science team out of Illinois” • O(n) time to construct. n: total length of patterns. • Search time: O(m). m: length of text • A common prefix only needs to be compared once. • What if there is no space between words? (figure: keyword tree of the five dictionary words; leaves numbered 1–5)
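A minimal keyword-tree (trie) spell checker along these lines (a sketch using nested dicts; the END marker and the function names are our own choices, not from the lecture):

```python
import re

END = "$"  # marker meaning "a dictionary word ends at this node"

def build_keyword_tree(words):
    """Nested-dict trie: a common prefix is stored (and compared) only once."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def check_document(text, tree):
    """Return the document words that are NOT in the dictionary.
    Words are separated by non-letter chars, as on the slide."""
    unknown = []
    for word in re.findall(r"[a-z]+", text.lower()):
        node = tree
        for ch in word:
            if ch not in node:
                node = None        # fell off the trie: unknown word
                break
            node = node[ch]
        if node is None or END not in node:
            unknown.append(word)
    return unknown
```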
Aho-Corasick algorithm • Basis of the fgrep algorithm • Generalizing KMP • Using failure links • Example: given the following 4 patterns: • potato • tattoo • theater • other
Keyword tree (figure: trie of potato, tattoo, theater, other; root labeled 0, leaves numbered 1–4)
Keyword tree search • T: potherotathxythopotattooattoo • Using the tree alone (restart from the root after every failure): O(mn). m: length of text. n: length of longest pattern
Keyword Tree with a failure link • T: potherotathxythopotattooattoo • (figure: the keyword tree with one failure link added; e.g. after matching “pot” and failing, follow the link to the node spelling “ot” and continue matching “other” from there)
Keyword Tree with all failure links (figure: every node links to the node whose root path is the longest proper suffix of its own root path)
Example • Searching T: potherotathxythopotattooattoo with the full failure-link tree
Aho-Corasick algorithm • O(n) preprocessing, and O(m+k) searching. • n: total length of patterns. • m: length of text • k: number of occurrences. • Can create a DFA similar to the one in KMP. • Requires more space • Preprocessing time depends on alphabet size • Search time is constant per char • Q: Where can this algorithm be used in previous topics? • A: BLAST • Given a query sequence, we generate many seed sequences (k-mers) • Search for exact matches to these seed sequences • Extend exact matches into longer inexact matches
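A compact Aho-Corasick sketch (our own illustration: failure links computed by BFS, output sets merged along them, 0-based match positions):

```python
from collections import deque

def aho_corasick(patterns):
    """Build the automaton: keyword tree (goto), failure links, outputs."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(p)
    # BFS from the root's children to set failure links level by level
    q = deque(goto[0].values())
    while q:
        u = q.popleft()
        for ch, v in goto[u].items():
            q.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f][ch] if ch in goto[f] and goto[f][ch] != v else 0
            out[v] |= out[fail[v]]   # inherit matches ending at the fallback node
    return goto, fail, out

def search(text, patterns):
    """All (pattern, 0-based start) pairs, in one left-to-right scan."""
    goto, fail, out = aho_corasick(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            hits.append((p, i - len(p) + 1))
    return hits
```

Run on the slides' example text, it finds the two patterns that actually occur (“other” and “tattoo”).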
Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Karp-Rabin: small pattern, small alphabet • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes
Suffix tree • T: xabxac • Suffixes: xabxac, abxac, bxac, xac, ac, c • (figure: suffix tree of xabxac; leaves numbered 1–6 by suffix start position) • Naïve construction: O(m²) using Aho-Corasick. • Smarter: O(m). Very technical, with a big constant factor. • Difference from a keyword tree: create an internal node only when there is a branch
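To illustrate the queries a suffix tree supports, here is an uncompressed suffix trie (note: this is the simpler O(m²)-space structure, not the compressed O(m) suffix tree of the slides; positions are 1-based to match the leaf labels):

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie: insert every suffix, and record at each
    node the 1-based start positions of the suffixes passing through it."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
            node.setdefault("positions", []).append(start + 1)
    return root

def find_occurrences(trie, pattern):
    """All 1-based start positions of pattern in the indexed text:
    walk the pattern down from the root and read off the positions."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return node["positions"]
```

Once T is indexed, each query costs only O(|P|) plus the output size, which is the point of pre-processing T when T is fixed and the patterns vary.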
Suffix tree implementation • Explicitly labeling the sequence end • T: xabxa$ • Appending a terminal char $ guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf (figure: suffix trees of xabxa without and with the $ terminator)
Suffix tree implementation • Implicitly labeling edges • T: xabxa$ • Each edge label is stored as a pair of indices into T (e.g. 3:$ denotes the substring from position 3 to the end), so the tree takes O(m) space even though the labels spelled out would take O(m²) chars (figure: the same tree with index-pair edge labels such as 1:2, 2:2, 3:$)
Suffix links • Similar to the failure links in a keyword tree • Only internal nodes (nodes with branches) are linked • The node whose path spells xα links to the node whose path spells α (figure: suffix-link example)
Suffix tree construction • T: acatgacatt (positions 1–10) • Insert suffix 1 (acatgacatt): a single edge labeled 1:$ leading to leaf 1
Suffix tree construction • T: acatgacatt • Insert suffix 2 (catgacatt): a second edge labeled 2:$ leading to leaf 2