CS 6293 Advanced Topics: Current Bioinformatics

Presentation Transcript


  1. CS 6293 Advanced Topics: Current Bioinformatics Lecture 5 Exact String Matching Algorithms

  2. Overview • Sequence alignment: two sub-problems: • How to score an alignment with errors • How to find an alignment with the best score • Today: exact string matching • Does not allow any errors • Efficiency becomes the sole consideration • Time and space

  3. Why exact string matching? • The most fundamental string comparison problem • Often the core of more complex string comparison algorithms • E.g., BLAST • Often repeatedly called by other methods • Usually the most time consuming part • Small improvement could improve overall efficiency considerably

  4. Definitions • Text: a longer string T (length m) • Pattern: a shorter string P (length n) • Exact matching: find all occurrences of P in T [Figure: T (length m) drawn above the shorter P (length n)]

  5. The naïve algorithm

  6. Time complexity • Worst case: O(mn) • Best case: O(m), e.g. aaaaaaaaaaaaaa vs baaaaaaa • Average case? • Alphabet A, C, G, T • Assume both P and T are random • Equal probability • On average, how many chars do you need to compare before giving up?

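  7. Average case time complexity • $P(\text{mismatch at 1st position}) = \frac{3}{4}$ • $P(\text{mismatch at 2nd position}) = \frac{1}{4} \cdot \frac{3}{4}$ • $P(\text{mismatch at 3rd position}) = \left(\frac{1}{4}\right)^{2} \cdot \frac{3}{4}$ • $P(\text{mismatch at } k\text{th position}) = \left(\frac{1}{4}\right)^{k-1} \cdot \frac{3}{4}$ • Expected number of comparisons per position, with $p = 1/4$: $\sum_{k} (1-p)\, p^{k-1}\, k = \frac{1-p}{p} \sum_{k} p^{k}\, k = \frac{1}{1-p} = \frac{4}{3}$, using $\sum_{k \ge 1} k\, p^{k} = \frac{p}{(1-p)^{2}}$ • Average complexity: 4m/3 • Not as bad as you thought it might be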
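
As a quick numerical check of the 4/3 figure, the series can be summed directly. This is a sketch under the slide's assumptions (uniform random A, C, G, T, so a single comparison matches with probability p = 1/4); the function name and the cut-off of 40 terms are arbitrary choices, not part of the lecture.

```python
from fractions import Fraction


def expected_comparisons(p, terms):
    """Partial sum of sum_k k * (1 - p) * p**(k - 1).

    p is the probability that a single character comparison matches;
    the k-th term is the chance that the first mismatch is at position k.
    """
    return sum(k * (1 - p) * p ** (k - 1) for k in range(1, terms + 1))


p = Fraction(1, 4)
approx = expected_comparisons(p, 40)
closed_form = 1 / (1 - p)  # the closed form 1 / (1 - p), i.e. 4/3
```

The partial sum ignores the fact that a real pattern has finite length n, as does the derivation on the slide; for p = 1/4 the tail is negligible after a few terms.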

  8. Biological sequences are not random • T: aaaaaaaaaaaaaaaaaaaaaaaaa • P: aaaab • Plus: 4m/3 average case is still bad for long genomic sequences! • Especially if this has to be done again and again • Smarter algorithms: O(m + n) in the worst case, sub-linear in practice

  9. How to speed up? • Pre-processing T or P • Why can pre-processing save us time? • Uncovers the structure of T or P • Determines when we can skip ahead without missing anything • Determines when we can infer the result of character comparisons without doing them • Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA

  10. Cost for exact string matching • Total cost = cost(preprocessing) + cost(comparison) + cost(output) • cost(preprocessing): overhead • cost(comparison): minimize • cost(output): constant • Hope: gain > overhead

  11. String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking (fixed P) • One fixed T, many P • Search a completed genome for short sequences • Two (or many) T’s for common patterns • Q: Which one to pre-process? • A: Always pre-process the shorter seq, or the one that is repeatedly used

  12. Pre-processing algorithms • Pattern preprocessing • Knuth-Morris-Pratt algorithm (KMP) • Aho-Corasick algorithm • Multiple patterns • Boyer-Moore algorithm (discussed only if time permits) • The method of choice in most cases • Typically sub-linear time • Text preprocessing • Suffix tree • Very useful for many purposes

  13. Algorithm KMP: Intuitive example 1 • P: abcxabcde; the first seven chars (abcxabc) match T, then a mismatch occurs • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened when comparing P[8] with T[i], we can shift P by four chars, and compare P[4] with T[i], without missing any possible matches. • Number of comparisons saved: 6 [Figure: T containing abcxabc with P = abcxabcde aligned below it, at the mismatch and after the shift; the naïve approach would instead shift P by one char]

  14. Intuitive example 2 • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches • Number of comparisons saved: 7 [Figure: P = abcxabcde aligned under T at successive shifts; the mismatched char of T is annotated "Should not be a c"]

  15. KMP algorithm: pre-processing • Key: the reasoning is done without even knowing what string T is. • Only the location of the mismatch in P must be known. • Pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j..i], such that t matches a prefix of P, t’, and the char following t differs from the char following t’ (i.e., y ≠ z). • For each i, let sp(i) = length(t). [Figure: T contains t followed by x; in P, the suffix t = P[j..i] is followed by y and the prefix t’ is followed by z]

  16. KMP algorithm: shift rule • Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z. • This shift rule can be implicitly represented by creating a failure link between y and z. • Meaning: when a mismatch occurs between x in T and P[i+1], resume comparison between x and P[sp(i)+1]. [Figure: P before and after the shift; after the shift, the prefix t’ = P[1..sp(i)] lies under the copy of t in T, and z is aligned with x]

  17. Failure Link Example • P: aataac • i: 1 2 3 4 5 6 • P[i]: a a t a a c • sp(i): 0 1 0 0 2 0 • If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1) [Figure: failure links drawn on P]

  18. Another example • P: abababc • i: 1 2 3 4 5 6 7 • P[i]: a b a b a b c • sp(i): 0 0 0 0 0 4 0 • If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1) [Figure: failure links drawn on P]
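
The sp(i) values in the two examples above can be reproduced with a short sketch. It first computes the ordinary border lengths (longest proper suffix of P[1..i] that is also a prefix of P) and then applies the extra "next chars differ" condition from the pre-processing slide. The function name `strong_sp` is hypothetical, and for the last position, where no next char exists, the sketch assumes the plain border length is used.

```python
def strong_sp(pattern):
    """Return [sp(1), ..., sp(n)] with the 'next chars differ' condition.

    border[i] is the length of the longest proper suffix of pattern[:i+1]
    that is also a prefix of the pattern (the classic prefix function).
    """
    n = len(pattern)
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k

    sp = [0] * n
    for i in range(n):
        b = border[i]
        if b > 0 and i + 1 < n and pattern[b] == pattern[i + 1]:
            # Same next char: shifting to this border would fail the same
            # way, so fall back to the best value for the shorter border.
            sp[i] = sp[b - 1]
        else:
            sp[i] = b
    return sp


sp_aataac = strong_sp("aataac")
sp_abababc = strong_sp("abababc")
```

Both passes run in O(n) time, independent of the alphabet size.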

  19. KMP Example using Failure Link • P: aataac • T: aacaataaaaataaccttacta [Figure: successive alignments of P under T; ^ marks an explicit match, * a mismatch, and . an implicit comparison that is skipped] • Time complexity analysis: • Each char in T may be compared up to n times. A loose analysis gives O(mn) time. • More careful analysis: the comparisons can be split into two phases: • Comparison phase: the first time a char in T is compared to P. Total is exactly m. • Shift phase: the first comparison made after each shift. Total is at most m. • Time complexity: at most 2m comparisons, i.e. O(m)
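
The search itself can be sketched as follows. For brevity this version uses the plain border table rather than the stricter sp(i) with the "next chars differ" condition, so it may perform a few more comparisons than the slides' version while keeping the same 2m bound. Function names, 0-based positions and the comparison counter are illustrative assumptions, and the pattern is assumed to be non-empty.

```python
def border_table(pattern):
    """fail[i] = length of the longest proper suffix of pattern[:i+1]
    that is also a prefix of the pattern."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return (hits, comparisons) with 0-based start positions."""
    fail = border_table(pattern)
    hits = []
    comparisons = 0
    k = 0  # number of pattern chars currently matched
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1
                break
            if k == 0:
                break  # mismatch at the first pattern char: move on in T
            k = fail[k - 1]  # follow the failure link, re-compare ch
        if k == len(pattern):
            hits.append(i - len(pattern) + 1)
            k = fail[k - 1]
    return hits, comparisons


hits, comparisons = kmp_search("aacaataaaaataaccttacta", "aataac")
```

Each comparison either advances in T or follows a failure link, and links can be followed at most as often as k was increased, which is where the bound of 2m comparisons comes from.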

  20. KMP algorithm using DFA (Deterministic Finite Automaton) • P: aataac • Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3 • DFA: if the next char in T is t after matching 5 chars, go to state 3 • All other inputs go to state 0. [Figure: failure links on P, and a DFA with states 0 to 6 whose forward edges are labeled a, a, t, a, a, c]

  21. DFA Example • T: aacaataataataaccttacta • States: 1201234534534560001001 • Each char in T will be examined exactly once. Therefore, exactly m comparisons are made. • But it takes longer to do pre-processing, and needs more space to store the DFA. [Figure: the DFA for aataac with states 0 to 6]
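
A sketch of one common way to build such a DFA, assuming the alphabet is known in advance (here `"acgt"`). The construction copies the transitions of a "restart" state into each new state, which is a standard textbook technique and not necessarily the one used in the lecture; function names are illustrative and the pattern is assumed to be non-empty.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][char] -> next state; state j means j chars are matched."""
    n = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state reached by feeding pattern[1:j] into the DFA
    for j in range(1, n + 1):
        for c in alphabet:
            dfa[j][c] = dfa[restart][c]  # on mismatch, behave like restart
        if j < n:
            dfa[j][pattern[j]] = j + 1  # on match, advance
            restart = dfa[restart][pattern[j]]
    return dfa


def run_dfa(dfa, text):
    """Return the state reached after each char of text."""
    state = 0
    states = []
    for ch in text:
        state = dfa[state][ch]  # exactly one table lookup per char
        states.append(state)
    return states


dfa = build_dfa("aataac", "acgt")
trace = "".join(str(s) for s in run_dfa(dfa, "aacaataataataaccttacta"))
```

The table has (n + 1) rows and one column per alphabet symbol, which is the extra space cost compared with failure links.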

  22. Difference between Failure Link and DFA • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m (at least m) • DFA • Preprocessing time and space are O(n |Σ|), where Σ is the alphabet • May be a problem for a very large alphabet • For example, each “char” is a big integer • Chinese characters • Comparison time is always m.

  23. Boyer – Moore algorithm • Often the algorithm of choice • One T and one P • We will talk about it later if we have time • In practice sub-linear

  24. The set matching problem • Find all occurrences of a set of patterns in T • First idea: run KMP or BM for each P • O(km + n) • k: number of patterns • m: length of text • n: total length of patterns • Better idea: combine all patterns together and search in one run

  25. A simpler problem: spell-checking • A dictionary contains five words: • potato • poetry • pottery • science • school • Given a document, check if any word is (not) in the dictionary • Words in document are separated by special chars. • Relatively easy.

  26. Keyword tree for spell checking • Document: This version of the potato gun was inspired by the Weird Science team out of Illinois • O(n) time to construct. n: total length of patterns. • Search time: O(m). m: length of text • Common prefixes need to be compared only once. • What if there is no space between words? [Figure: keyword tree for potato, poetry, pottery, science, school, with leaves numbered 1 to 5]
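
A keyword tree for the spell-checking case can be sketched with nested dictionaries. The end-of-word marker `"#"`, the function names, and the lower-casing and splitting on non-letters are all assumptions of this sketch, not details from the lecture.

```python
import re


def build_keyword_tree(words):
    """Nested-dict trie; the key '#' marks the end of a dictionary word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # shared prefixes share nodes
        node["#"] = True
    return root


def in_dictionary(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "#" in node


def known_words(root, document):
    """Split the document on non-letters and keep the dictionary words."""
    words = re.findall(r"[a-z]+", document.lower())
    return [w for w in words if in_dictionary(root, w)]


tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
found = known_words(
    tree,
    "This version of the potato gun was inspired by "
    "the Weird Science team out of Illinois",
)
```

Because words are delimited, each lookup restarts at the root and no failure links are needed; that changes once patterns may start anywhere in the text.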

  27. Aho-Corasick algorithm • Basis of the fgrep algorithm • Generalizing KMP • Using failure links • Example: given the following 4 patterns: • potato • tattoo • theater • other

  28. Keyword tree [Figure: keyword tree rooted at node 0 for the patterns potato, tattoo, theater, other; leaves numbered 1 to 4]

  29. Keyword tree [Figure: the same keyword tree] • Text: potherotathxythopotattooattoo

  30. Keyword tree [Figure: the same keyword tree] • Text: potherotathxythopotattooattoo • O(mn) • m: length of text • n: length of longest pattern

  31. Keyword Tree with a failure link [Figure: the same keyword tree with one failure link added] • Text: potherotathxythopotattooattoo

  32. Keyword Tree with a failure link [Figure: the same keyword tree with one failure link added] • Text: potherotathxythopotattooattoo

  33. Keyword Tree with all failure links [Figure: the same keyword tree with all failure links added]

  34-38. Example [Figure: five animation frames stepping through the keyword tree with failure links] • Text: potherotathxythopotattooattoo

  39. Aho-Corasick algorithm • O(n) preprocessing, and O(m+k) searching. • n: total length of patterns. • m: length of text • k: number of occurrences. • Can create a DFA similar to the one in KMP. • Requires more space • Preprocessing time depends on alphabet size • Search time per char of T is constant • Q: Where can this algorithm be used in previous topics? • A: BLAST • Given a query sequence, we generate many seed sequences (k-mers) • Search for exact matches to these seed sequences • Extend exact matches into longer inexact matches
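
The keyword tree with failure links can be sketched as below, using the four example patterns and the example text from the preceding slides. This is a compact illustration with hypothetical function names, 0-based positions, and per-node output lists that are merged along failure links; it is not meant as a tuned implementation.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, out) arrays indexed by node id; node 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)
    # Breadth-first pass: a node's failure link points to a shallower node,
    # so that node is already final when the link is computed.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (start, pattern) pairs in order of the position where each ends."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]  # follow failure links instead of restarting
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits


hits = search("potherotathxythopotattooattoo",
              ["potato", "tattoo", "theater", "other"])
```

In this text only `other` and `tattoo` occur; `potato` and `theater` are each abandoned partway through, and the failure links let the scan continue without backing up in T.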

  40. Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes

  41. Suffix tree • T: xabxac • Suffixes: • 1: xabxac • 2: abxac • 3: bxac • 4: xac • 5: ac • 6: c [Figure: suffix tree of xabxac with leaves numbered 1 to 6] • Naïve construction: O(m^2) using Aho-Corasick. • Smarter: O(m). Very technical, with a big constant factor. • Difference from a keyword tree: create an internal node only when there is a branch

  42. Suffix tree implementation • Explicitly labeling sequence end • T: xabxa$ [Figure: suffix tree of xabxa drawn without and with the terminator $; with $, every suffix ends at its own leaf] • One-to-one correspondence of leaves and suffixes • |T| leaves, hence < |T| internal nodes

  43. Suffix tree implementation • Implicitly labeling edges • T: xabxa$ [Figure: the same tree with each edge label replaced by a pair of indices into T, e.g. 1:2, 2:2, 3:$] • |Tree(T)| = O(|T| + size(edge labels))

  44. Suffix links • Similar to failure link in a keyword tree • Only link internal nodes having branches • P: xabcf [Figure: two branches of a suffix tree joined by suffix links]

  45. ST Application 1: pattern matching • Find all occurrences of P = xa in T • Find node v in the ST that matches P • Traverse the subtree rooted at v to get the locations • T: xabxac [Figure: suffix tree of xabxac with leaves numbered 1 to 6] • O(m) to construct ST (large constant factor) • O(n) to find v: linear in the length of P instead of T! • O(k) to get all leaves, where k is the number of occurrences. • Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
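
The two steps (walk down to v, then collect the leaves below v) can be illustrated with an uncompressed suffix trie. Note the swap: this naïve structure takes O(m^2) time and space to build, unlike the linear-time suffix tree the slide refers to, and it is only meant to show the query logic. The `"$"` leaf key, the function names and the 1-based positions are assumptions of the sketch, and the text is assumed not to contain `$`.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie.

    The key '$' stores the 1-based start position of the suffix that
    ends at that node, playing the role of a leaf label.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1
    return root


def find_occurrences(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:  # step 1: find node v, O(n)
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:  # step 2: collect all leaf labels below v
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)


trie = build_suffix_trie("xabxac")
positions = find_occurrences(trie, "xa")
```

Once the structure is built, each query depends on the length of P and the size of the subtree below v, which is why a fixed T favours this approach.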

  46. ST Application 2: set matching • Find all occurrences of a set of patterns in T • Build a ST from T • Match each P to the ST • T: xabxac • P: xab [Figure: suffix tree of xabxac with leaves numbered 1 to 6] • O(m) to construct ST (large constant factor) • O(n) to find v: linear in the total length of the P’s • O(k) to get all leaves, where k is the number of occurrences. • Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed. AC wins if the P’s are fixed. Otherwise it depends on their relative sizes.

  47. ST application 3: repeats finding • Genome contains many repeated DNA sequences • Repeat sequence length: varies from 1 nucleotide to millions • Genes may have multiple copies (50 to 10,000) • Highly repetitive DNA in some non-coding regions • 6 to 10 bp x 100,000 to 1,000,000 times • Problem: find all repeats that are at least k residues long and appear at least p times in the genome

  48. Repeats finding • At least k residues long and appearing at least p times in the seq • Phase 1: top-down, count label lengths (L) from root to each node • Phase 2: bottom-up, count the number of leaves (N) descended from each internal node • For each node with L >= k and N >= p, print all leaves • O(m) to traverse the tree [Figure: tree nodes annotated with (L, N)]
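
The (L, N) idea can be sketched on an uncompressed suffix trie, where the depth of a node plays the role of L and the number of suffixes passing through it plays the role of N. This is a simplification with O(m^2) cost rather than the O(m) suffix-tree traversal on the slide; it records start positions while inserting instead of counting leaves bottom-up, and the class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.starts = []  # 1-based starts of the suffixes passing through


def find_repeats(seq, k, p):
    """Map each substring of length >= k occurring >= p times to its starts."""
    root = Node()
    for start in range(len(seq)):
        node = root
        for ch in seq[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start + 1)

    repeats = {}
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        # len(label) is L, len(node.starts) is N for this node
        if len(label) >= k and len(node.starts) >= p:
            repeats[label] = node.starts
        for ch, child in node.children.items():
            stack.append((child, label + ch))
    return repeats


repeats = find_repeats("acatgacatt", 3, 2)
```

For the sequence `acatgacatt` with k = 3 and p = 2, this reports `aca`, `cat` and `acat`.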

  49. Maximal repeats finding • Right-maximal repeat • S[i+1..i+k] = S[j+1..j+k], • but S[i+k+1] != S[j+k+1] • Left-maximal repeat • S[i+1..i+k] = S[j+1..j+k] • But S[i] != S[j] • Maximal repeat • S[i+1..i+k] = S[j+1..j+k] • But S[i] != S[j], and S[i+k+1] != S[j+k+1] • Example: S = acatgacatt, with repeats cat, aca, acat

  50. Maximal repeats finding • Find repeats with at least 3 bases and 2 occurrences • Sequence: acatgacatt (positions 1234567890) • Right-maximal: cat • Maximal: acat • Left-maximal: aca [Figure: suffix tree of acatgacatt with index-pair edge labels such as 5:e and leaves numbered 1 to 10]
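
The three labels above can be checked by brute force, directly from the definitions on the previous slide. The sketch below looks at the characters immediately to the left and right of every occurrence; it treats the sequence boundaries as distinct sentinel characters and judges left and right context over all occurrences together, which coincides with the pairwise definition when a repeat occurs exactly twice. The function name and the returned strings are illustrative.

```python
def classify_repeat(seq, sub):
    """Classify sub as 'maximal', 'left-maximal', 'right-maximal',
    'not maximal', or 'not a repeat' within seq."""
    starts = [i for i in range(len(seq) - len(sub) + 1)
              if seq.startswith(sub, i)]
    if len(starts) < 2:
        return "not a repeat"
    # Characters flanking each occurrence; '^' and '$' stand for the
    # sequence boundaries (assumed absent from seq itself).
    lefts = {seq[i - 1] if i > 0 else "^" for i in starts}
    rights = {seq[i + len(sub)] if i + len(sub) < len(seq) else "$"
              for i in starts}
    left_max = len(lefts) > 1    # cannot be extended to the left
    right_max = len(rights) > 1  # cannot be extended to the right
    if left_max and right_max:
        return "maximal"
    if left_max:
        return "left-maximal"
    if right_max:
        return "right-maximal"
    return "not maximal"


labels = {sub: classify_repeat("acatgacatt", sub)
          for sub in ("cat", "aca", "acat")}
```

In a suffix tree the right-maximal repeats correspond to internal nodes with a branch, so only the left context needs this kind of explicit check.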
