CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms
Overview • Sequence alignment: two sub-problems: • How to score an alignment with errors • How to find an alignment with the best score • Today: exact string matching • Does not allow any errors • Efficiency becomes the sole consideration • Time and space
Why exact string matching? • The most fundamental string comparison problem • Often the core of more complex string comparison algorithms • E.g., BLAST • Often called repeatedly by other methods • Usually the most time-consuming part • A small improvement can improve overall efficiency considerably
Definitions • Text: a longer string T (length m) • Pattern: a shorter string P (length n) • Exact matching: find all occurrences of P in T
Time complexity • Worst case: O(mn) • Best case: O(m), e.g. aaaaaaaaaaaaaa vs baaaaaaa • Average case? • Alphabet A, C, G, T • Assume both P and T are random, with equal probability for each char • On average, how many chars do we need to compare before giving up?
Average case time complexity
P(mismatch at 1st position): 3/4
P(mismatch at 2nd position): 1/4 × 3/4
P(mismatch at 3rd position): (1/4)^2 × 3/4
P(mismatch at kth position): (1/4)^(k-1) × 3/4
Expected number of comparisons per alignment position, with match probability q = 1/4:
E = Σ_{k≥1} k · q^(k-1) · (1 − q) = 1/(1 − q) = 4/3
Average complexity: 4m/3 comparisons
Not as bad as you might have thought
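The geometric-series step above can be checked numerically. A small sketch (not from the lecture) that truncates the infinite sum:

```python
# Sketch: verify the expected-comparisons derivation numerically.
# With a 4-letter alphabet, a random text char matches a random
# pattern char with probability q = 1/4, so the number of
# comparisons at one alignment position is geometric.

q = 0.25  # P(single-char match) for a uniform 4-letter alphabet

# E = sum over k of k * q^(k-1) * (1-q), truncated (tail is negligible)
expected = sum(k * q**(k - 1) * (1 - q) for k in range(1, 200))

print(expected)  # ~ 1.3333... = 4/3
```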
Biological sequences are not random
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: the 4m/3 average case is still bad for long genomic sequences! • Especially if this has to be done again and again • Smarter algorithms: • O(m + n) in the worst case • sub-linear in practice
How to speed up? • Pre-process T or P • Why can pre-processing save time? • It uncovers the structure of T or P • It determines when we can skip ahead without missing anything • It determines when we can infer the result of character comparisons without doing them
T: ACGTAXACXTAXACGXAX
P: ACGTACA
Cost for exact string matching
Total cost = cost(preprocessing) + cost(comparison) + cost(output)
• preprocessing: overhead • comparison: minimize • output: constant
Hope: gain > overhead
String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking (fixed P) • One fixed T, many P • Search a completed genome for short sequences • Two (or many) T’s for common patterns • Q: Which one to pre-process? • A: Always pre-process the shorter seq, or the one that is repeatedly used
Pre-processing algorithms • Pattern preprocessing • Knuth-Morris-Pratt algorithm (KMP) • Aho-Corasick algorithm • for multiple patterns • Boyer-Moore algorithm (discussed only if time permits) • The choice in most cases • Typically sub-linear time • Text preprocessing • Suffix tree • Very useful for many purposes
Algorithm KMP: Intuitive example 1 • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches • Number of comparisons saved: 6 • [diagram: T contains ...abcxabc..., P = abcxabcde; after the mismatch, P shifts right by 4, versus the naïve approach of shifting by 1 and re-comparing]
Intuitive example 2 • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches • Number of comparisons saved: 7 • [diagram: T contains ...abcxabc..., P = abcxabcde; the mismatched text char T[j] cannot be c, so no shorter shift can yield a match]
KMP algorithm: pre-processing • Key: the reasoning is done without even knowing what string T is • Only the location of the mismatch in P must be known • [diagram: suffix t = P[j..i] matches prefix t' of P; y is the char after t, z is the char after t', x is the text char aligned with y]
Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix t = P[j..i] such that t matches a prefix t' of P, and the next char of t differs from the next char of t' (i.e., y ≠ z)
For each i, let sp(i) = length(t)
KMP algorithm: shift rule • [diagram: after the shift, prefix t' of P lines up under the occurrence of t in T, so z = P[sp(i)+1] is aligned with the text char x]
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) chars and compare x with z. This shift rule can be implicitly represented by creating a failure link between y and z. Meaning: when a mismatch occurs between x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
Failure Link Example
P: aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)
P: a a t a a c
sp(i): 0 1 0 0 2 0
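The sp values on these slides are the "strong" failure values: the char after the matching suffix must differ from the char after the matching prefix. They can be computed from the ordinary KMP border table. A minimal sketch (not the lecture's code; `strong_sp` is an illustrative name):

```python
def strong_sp(P):
    """sp'(i): length of the longest proper suffix of P[:i+1] that
    matches a prefix of P AND whose following pattern char differs
    from P[i+1]. Returned 0-indexed, one value per position."""
    n = len(P)
    # f[i] = ordinary border length of P[:i+1] (classic KMP table)
    f = [0] * n
    for i in range(1, n):
        b = f[i - 1]
        while b > 0 and P[i] != P[b]:
            b = f[b - 1]
        if P[i] == P[b]:
            b += 1
        f[i] = b
    sp = [0] * n
    for i in range(n):
        b = f[i]
        if i < n - 1:
            # walk down the border chain until the next chars differ
            while b > 0 and P[b] == P[i + 1]:
                b = f[b - 1]
        sp[i] = b
    return sp

print(strong_sp("aataac"))   # [0, 1, 0, 0, 2, 0] as on the slide
print(strong_sp("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```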
Another example
P: abababc
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)
P: a b a b a b c
sp(i): 0 0 0 0 0 4 0
KMP Example using Failure Link
T: aacaataaaaataaccttacta
P: aataac
[trace: ^ = explicit comparison, . = implicit comparison inferred via the failure link, * = mismatch]
• Time complexity analysis: • Each char in T may be compared up to n times; a lousy analysis gives O(mn) time • More careful analysis: the comparisons break into two phases: • Comparison phase: the first time a char in T is compared to P; total is exactly m • Shift phase: first comparisons made after a shift; total is at most m • Time complexity: O(2m)
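For comparison, a self-contained classic KMP search over the slide's example. It uses the ordinary border table rather than the strong sp values, which changes only the constant factor, not the O(m + n) bound (a sketch; `kmp_search` is an assumed name):

```python
def kmp_search(T, P):
    """Classic KMP: return 0-indexed start positions of P in T."""
    n = len(P)
    # border table: f[i] = longest proper border of P[:i+1]
    f = [0] * n
    for i in range(1, n):
        b = f[i - 1]
        while b > 0 and P[i] != P[b]:
            b = f[b - 1]
        if P[i] == P[b]:
            b += 1
        f[i] = b
    hits, q = [], 0  # q = number of pattern chars currently matched
    for i, c in enumerate(T):
        while q > 0 and c != P[q]:
            q = f[q - 1]      # follow failure links instead of re-scanning T
        if c == P[q]:
            q += 1
        if q == n:            # full match ends at position i
            hits.append(i - n + 1)
            q = f[q - 1]
    return hits

print(kmp_search("aacaataaaaataaccttacta", "aataac"))  # [9]
```

Each text character is pushed past at most once and each failure-link step consumes matched length, giving the 2m comparison bound from the slide.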
KMP algorithm using DFA (Deterministic Finite Automaton)
P: aataac
Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3
DFA: if the next char in T is t after matching 5 chars, go directly to state 3
[diagram: states 0 to 6; a advances 0→1→2, t takes 2→3, a takes 3→4→5, c takes 5→6, with extra edges derived from the failure links]
All other inputs go to state 0.
DFA Example
[DFA diagram for P = aataac, states 0 to 6]
T: aacaataataataaccttacta
State sequence: 1201234534534560001001
Each char in T is examined exactly once, so exactly m comparisons are made. But the DFA takes longer to pre-process and needs more space to store.
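The state sequence on this slide can be reproduced with a naively built KMP automaton. A sketch (the quadratic construction below is chosen for clarity; linear-time constructions exist):

```python
def kmp_dfa(P, alphabet):
    """Build the KMP automaton: dfa[s][c] = length of the longest
    prefix of P that is a suffix of (P[:s] + c).
    Preprocessing is O(n * |alphabet|) states/entries."""
    n = len(P)
    dfa = [dict() for _ in range(n + 1)]
    for s in range(n + 1):
        for c in alphabet:
            w = P[:s] + c
            # longest suffix of w that is a prefix of P (naive but clear)
            t = min(len(w), n)
            while t > 0 and w[-t:] != P[:t]:
                t -= 1
            dfa[s][c] = t
    return dfa

P = "aataac"
dfa = kmp_dfa(P, "atc")
s, trace = 0, []
for c in "aacaataataataaccttacta":
    s = dfa[s][c]          # one table lookup per text char
    trace.append(s)
print("".join(map(str, trace)))  # 1201234534534560001001, as on the slide
```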
Difference between Failure Link and DFA • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m (at least m) • DFA • Preprocessing time and space are O(n·|Σ|), where |Σ| is the alphabet size • May be a problem for a very large alphabet • For example, when each “char” is a big integer • Or Chinese characters • Comparison time is always m
Boyer-Moore algorithm • Often the algorithm of choice in many cases • One T and one P • We will talk about it later if time permits • Sub-linear in practice
The set matching problem • Find all occurrences of a set of patterns in T • First idea: run KMP or BM for each P • O(km + n) • k: number of patterns • m: length of text • n: total length of patterns • Better idea: combine all patterns together and search in one run
A simpler problem: spell-checking • A dictionary contains five words: • potato • poetry • pottery • science • school • Given a document, check if any word is (not) in the dictionary • Words in document are separated by special chars. • Relatively easy.
Keyword tree for spell checking
Example document: “This version of the potato gun was inspired by the Weird Science team out of Illinois”
• O(n) time to construct (n: total length of patterns) • Search time: O(m) (m: length of text) • A common prefix only needs to be compared once • What if there is no space between words?
[keyword tree diagram for potato, poetry, pottery, science, school; leaves numbered 1 to 5]
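A minimal keyword-tree (trie) spell checker sketching the idea above. The dictionary and test sentence echo the slide; all function names are illustrative, not the lecture's code:

```python
def build_trie(words):
    """Keyword tree as nested dicts; shared prefixes are stored once."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w  # end-of-word marker
    return root

def in_dict(trie, word):
    """Walk the tree; O(len(word)) regardless of dictionary size."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie(["potato", "poetry", "pottery", "science", "school"])
doc = "the weird science potato gun"
unknown = [w for w in doc.split() if not in_dict(trie, w)]
print(unknown)  # ['the', 'weird', 'gun']
```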
Aho-Corasick algorithm • Basis of the fgrep algorithm • Generalizing KMP • Using failure links • Example: given the following 4 patterns: • potato • tattoo • theater • other
Keyword tree
[diagram: keyword tree for the patterns potato, tattoo, theater, other; leaves numbered 1 to 4]
Keyword tree
[diagram: keyword tree as above]
T: potherotathxythopotattooattoo
Searching with the plain keyword tree: O(mn), m: length of text, n: length of the longest pattern
Keyword Tree with a failure link
[diagram: keyword tree with one failure link added]
T: potherotathxythopotattooattoo
Keyword Tree with all failure links
[diagram: keyword tree with all failure links added]
Example
[diagram: searching T using the keyword tree with failure links]
T: potherotathxythopotattooattoo
Aho-Corasick algorithm • O(n) preprocessing and O(m + k) searching • n: total length of patterns • m: length of text • k: number of occurrences • Can create a DFA similar to that in KMP • Requires more space • Preprocessing time depends on alphabet size • Search time is constant per char • Q: Where can this algorithm be used in previous topics? • A: BLAST • Given a query sequence, we generate many seed sequences (k-mers) • Search for exact matches to these seed sequences • Extend exact matches into longer inexact matches
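A compact Aho-Corasick sketch (illustrative, not the lecture's code): build the keyword tree, set failure links by BFS, inherit outputs along failure links, then scan the text once. The example uses the slide's patterns and text:

```python
from collections import deque

def aho_corasick(patterns):
    """Keyword tree with failure links; nodes are indices into lists."""
    goto, fail, out = [{}], [0], [[]]  # node 0 is the root
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # BFS to set failure links (depth-1 nodes already fail to the root)
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]  # inherit outputs via the failure link
    return goto, fail, out

def search(text, patterns):
    goto, fail, out = aho_corasick(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]           # follow failure links on mismatch
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))  # 0-indexed start position
    return hits

print(search("potherotathxythopotattooattoo",
             ["potato", "tattoo", "theater", "other"]))
# [(1, 'other'), (18, 'tattoo')]
```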
Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes
Suffix tree
T: xabxac
Suffixes: • xabxac • abxac • bxac • xac • ac • c
[suffix tree diagram for xabxac, leaves numbered 1 to 6]
Naïve construction: O(m^2) using Aho-Corasick. Smarter: O(m), but very technical, with a big constant factor.
Difference from a keyword tree: create an internal node only when there is a branch
Suffix tree implementation • Explicitly labeling the sequence end • T: xabxa$
[diagram: without the $ terminator the suffix xa ends in the middle of an edge; with $ every suffix ends at its own leaf]
• One-to-one correspondence of leaves and suffixes • |T| leaves, hence < |T| internal nodes
Suffix tree implementation • Implicitly labeling edges: store each edge label as a pair of indices into T (e.g. 1:2 for xa, 3:$ for bxa$) • T: xabxa$
[diagram: the same tree with edge labels replaced by index pairs]
• |Tree(T)| = O(|T| + size(edge labels)) = O(|T|) once labels are index pairs
Suffix links • Similar to failure links in a keyword tree • Only internal nodes having branches are linked
[diagram: suffix links illustrated on P = xabcf]
ST Application 1: pattern matching • Find all occurrences of P = xa in T • Find the node v in the ST whose path from the root spells P • Traverse the subtree rooted at v to get the locations
[suffix tree diagram for T: xabxac]
• O(m) to construct the ST (large constant factor) • O(n) to find v; linear in the length of P instead of T! • O(k) to get all leaves, k: the number of occurrences • Asymptotic time is the same as KMP. ST wins if T is fixed; KMP wins otherwise.
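A naive stand-in for the suffix tree (a sketch with O(m^2) construction, not the linear-time algorithm): insert every suffix into an uncompressed trie, recording start positions at each node, then answer the slide's query by walking P and reading the positions below:

```python
def build_suffix_trie(T):
    """Uncompressed trie of all suffixes of T; each node stores the
    start positions of the suffixes passing through it."""
    root = {"starts": []}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {"starts": []})
            node["starts"].append(i)
    return root

def occurrences(trie, P):
    """Walk n characters, then report all occurrences: O(n + k) query."""
    node = trie
    for ch in P:
        if ch not in node:
            return []
        node = node[ch]
    return node["starts"]

trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))  # [0, 3] (0-indexed starts in xabxac)
```

The query cost is linear in the pattern, not the text, which is exactly the trade-off the slide highlights; only the construction here is asymptotically worse than the real suffix tree.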
ST Application 2: set matching • Find all occurrences of a set of patterns in T • Build an ST from T • Match each P against the ST
[suffix tree diagram for T: xabxac; P: xab]
• O(m) to construct the ST (large constant factor) • O(n) to find v; linear in the total length of the P's • O(k) to get all leaves, k: the number of occurrences • Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed; AC wins if the P's are fixed; otherwise it depends on their relative sizes.
ST application 3: repeat finding • Genomes contain many repeated DNA sequences • Repeat lengths vary from 1 nucleotide to millions • Genes may have multiple copies (50 to 10,000) • Highly repetitive DNA in some non-coding regions • 6 to 10 bp repeated 100,000 to 1,000,000 times • Problem: find all repeats that are at least k residues long and appear at least p times in the genome
Repeat finding • At least k residues long, appearing at least p times in the sequence • Phase 1: top-down, compute for each node the label length L from the root • Phase 2: bottom-up, count the number of leaves N descended from each internal node • For each node with L >= k and N >= p, print all its leaves • O(m) to traverse the tree
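The two-phase tree traversal can be imitated naively by counting substring occurrences directly. An O(m^2)-ish sketch for small inputs (not the O(m) suffix-tree method), using the example string from the next slide:

```python
from collections import defaultdict

def repeats(s, k, p):
    """Return every substring of length >= k that occurs at least p
    times, mapped to its 0-indexed start positions. Quadratic in |s|,
    so only a conceptual stand-in for the suffix-tree traversal."""
    counts = defaultdict(list)
    m = len(s)
    for i in range(m):
        for j in range(i + k, m + 1):
            counts[s[i:j]].append(i)
    return {sub: pos for sub, pos in counts.items() if len(pos) >= p}

print(repeats("acatgacatt", 3, 2))
# {'aca': [0, 5], 'acat': [0, 5], 'cat': [1, 6]}
```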
Maximal repeat finding • Right-maximal repeat: • S[i+1..i+k] = S[j+1..j+k] • but S[i+k+1] != S[j+k+1] • Left-maximal repeat: • S[i+1..i+k] = S[j+1..j+k] • but S[i] != S[j] • Maximal repeat: • S[i+1..i+k] = S[j+1..j+k] • but S[i] != S[j] and S[i+k+1] != S[j+k+1] • Example (S = acatgacatt): • cat • aca • acat
Maximal repeat finding • Find repeats of at least 3 bases with at least 2 occurrences in S = acatgacatt • right-maximal: cat • maximal: acat • left-maximal: aca
[suffix tree diagram for acatgacatt$, with index-pair edge labels and numbered leaves]