
CS5263 Bioinformatics

Presentation Transcript


  1. CS5263 Bioinformatics Lecture 9-10 Exact String Matching Algorithms

  2. Overview • Pair-wise alignment • Multiple alignment • Commonality: allowing errors when comparing strings • Two sub-problems: • How to score an alignment with errors • How to find an alignment with the best score • Today: exact string matching • Do not allow any errors • Efficiency becomes the sole consideration

  3. Why exact string matching? • The most fundamental string comparison problem • Word processors • Information retrieval • DNA sequence retrieval • Many, many more • Is it still an interesting research problem? • Yes, if the database is large • Exact string matching is often the core of more complex string comparison algorithms • E.g., BLAST • Often repeatedly called by other methods • Usually the most time-consuming part • A small improvement could improve overall efficiency considerably

  4. Definitions • Text: a longer string T (length m) • Pattern: a shorter string P (length n) • Exact matching: find all occurrences of P in T

  5. The naïve algorithm
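The naïve algorithm can be sketched in a few lines (a minimal Python sketch; the function name `naive_match` is illustrative):

```python
def naive_match(T, P):
    """Return all start positions (0-based) where P occurs in T."""
    m, n = len(T), len(P)
    occurrences = []
    for i in range(m - n + 1):       # try every alignment of P against T
        for j in range(n):
            if T[i + j] != P[j]:     # mismatch: give up, shift P by one
                break
        else:                        # inner loop ran to completion
            occurrences.append(i)
    return occurrences

print(naive_match("aacaataaaaataaccttacta", "aataac"))  # → [9]
```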

  6. Time complexity • Worst case: O(mn) • Best case: O(m) e.g. aaaaaaaaaaaaaa vs baaaaaaa • Average case? • Alphabet A, C, G, T • Assume both P and T are random • Equal probability • On average, how many chars do you need to compare before giving up?

  7. Average case time complexity
P(mismatch at 1st position): ¾
P(mismatch at 2nd position): ¼ · ¾
P(mismatch at 3rd position): (¼)² · ¾
P(mismatch at kth position): (¼)^(k-1) · ¾
Expected number of comparisons per position, with p = ¼:
Σ_k k · (1-p) · p^(k-1) = 1/(1-p) = 4/3
Average complexity: 4m/3
Not as bad as you might have thought
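The 4/3 figure is easy to check empirically. The sketch below (function name is illustrative) repeatedly draws random text and pattern characters from {A, C, G, T} and counts comparisons until the first mismatch, assuming the pattern is long enough that the scan never runs off its end:

```python
import random

def avg_comparisons(trials=100_000, alphabet="ACGT", seed=1):
    """Estimate E[# char comparisons at one alignment position]."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        k = 1                        # the k-th comparison
        # keep comparing a random text char vs a random pattern char
        while rng.choice(alphabet) == rng.choice(alphabet):
            k += 1
        total += k
    return total / trials

print(avg_comparisons())  # close to 4/3 ≈ 1.333
```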

  8. Biological sequences are not random T: aaaaaaaaaaaaaaaaaaaaaaaaa P: aaaab Plus: 4m/3 average case is still bad for long genomic sequences! Especially if this has to be done again and again Smarter algorithms: O(m + n) in worst case sub-linear in practice

  9. How to speedup? • Pre-processing T or P • Why can pre-processing save us time? • Uncovers the structure of T or P • Determines when we can skip ahead without missing anything • Determines when we can infer the result of character comparisons without doing them. • Example: T: ACGTAXACXTAXACGXAX P: ACGTACA

  10. Cost for exact string matching Total cost = cost(preprocessing) + cost(comparison) + cost(output) • cost(preprocessing): overhead • cost(comparison): minimize • cost(output): constant • Hope: gain > overhead

  11. String matching scenarios • One T and one P • Search a word in a document • One T and many P all at once • Search a set of words in a document • Spell checking (fixed P) • One fixed T, many P • Search a completed genome for short sequences • Two (or many) T’s for common patterns • Q: Which one to pre-process? • A: Always pre-process the shorter seq, or the one that is repeatedly used

  12. Pre-processing algs • Pattern preprocessing • Karp-Rabin algorithm • Small alphabet and short patterns • Knuth-Morris-Pratt algorithm (KMP) • Aho-Corasick algorithm • Multiple patterns • Boyer-Moore algorithm • The choice in most cases • Typically sub-linear time • Text preprocessing • Suffix tree • Very useful for many purposes

  13. Karp – Rabin Algorithm • Let’s say we are dealing with binary numbers Text: 01010001011001010101001 Pattern: 101100 • Convert pattern to integer 101100 = 2^5 + 2^3 + 2^2 = 44

  14. Karp-Rabin algorithm
Text: 10111011001010101001 Pattern: 101100 = 44 decimal
[101110]11001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46
1[011101]1001010101001 = 46 * 2 - 64 + 1 = 29
10[111011]001010101001 = 29 * 2 - 0 + 1 = 59
101[110110]01010101001 = 59 * 2 - 64 + 0 = 54
1011[101100]1010101001 = 54 * 2 - 64 + 0 = 44 → match

  15. Karp-Rabin algorithm
What if the pattern is too long to fit into a single integer? Pattern: 101100, but our machine only has 5 bits
Basic idea: hashing. 44 % 13 = 5
[101110]11001010101001 = 46 (% 13 = 7)
1[011101]1001010101001 = 46 * 2 - 64 + 1 = 29 (% 13 = 3)
10[111011]001010101001 = 29 * 2 - 0 + 1 = 59 (% 13 = 7)
101[110110]01010101001 = 59 * 2 - 64 + 0 = 54 (% 13 = 2)
1011[101100]1010101001 = 54 * 2 - 64 + 0 = 44 (% 13 = 5) → hash match; verify with a direct comparison
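Putting the rolling hash together, a sketch for the binary example above (the base and modulus are the slide's toy values, not production choices; the function name is illustrative):

```python
def rabin_karp(T, P, base=2, mod=13):
    """Karp-Rabin search over a binary string with a rolling hash.
    Equal hashes are verified by direct comparison (collisions happen)."""
    m, n = len(T), len(P)
    if n > m:
        return []
    hp = ht = 0
    for i in range(n):               # hash P and the first window of T
        hp = (hp * base + int(P[i])) % mod
        ht = (ht * base + int(T[i])) % mod
    high = pow(base, n - 1, mod)     # weight of the outgoing character
    hits = []
    for i in range(m - n + 1):
        if ht == hp and T[i:i + n] == P:
            hits.append(i)
        if i + n < m:                # roll: drop T[i], append T[i+n]
            ht = ((ht - int(T[i]) * high) * base + int(T[i + n])) % mod
    return hits

print(rabin_karp("10111011001010101001", "101100"))  # → [4]
```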

  16. Algorithm KMP • Not the fastest • Best known • Good for “real-time matching” • i.e. text comes one char at a time • No memory of previous chars • Idea • Left-to-right comparison • Shift P more than one char whenever possible

  17. Intuitive example 1 • P: abcxabcde • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened when comparing P[8] with T[i], we can shift P by four chars, and compare P[4] with T[i], without missing any possible matches. • The naïve approach would shift P by one char and restart the comparison. • Number of comparisons saved: 6

  18. Intuitive example 2 • P: abcxabcde • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], the mismatched char in T cannot be c, so we can shift P by six chars and compare T[j] with P[1] without missing any possible matches. • The naïve approach would shift P by one char and restart the comparison. • Number of comparisons saved: 7

  19. KMP algorithm: pre-processing • Key: the reasoning is done without even knowing what string T is. • Only the location of the mismatch in P must be known. • Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix, t = P[j..i], such that t matches a prefix of P, t', and the next char of t is different from the next char of t' (i.e., y ≠ z) • For each i, let sp(i) = length(t)

  20. KMP algorithm: shift rule • Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z. • This shift rule can be implicitly represented by creating a failure link between y and z. • Meaning: when a mismatch occurs between x on T and P[i+1], resume comparison between x and P[sp(i)+1].

  21. Failure Link Example P: aataac If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)
P:     a a t a a c
sp(i): 0 1 0 0 2 0

  22. Another example P: abababc If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)
P:     a b a b a b c
sp(i): 0 0 0 0 0 4 0

  23. KMP example using failure links P: aataac T: aacaataaaaataaccttacta (P is shifted along T; comparisons already implied by a failure link are implicit and not repeated) • Time complexity analysis: • Each char in T may be compared up to n times. A lousy analysis gives O(mn) time. • More careful analysis: the number of comparisons can be broken into two phases: • Comparison phase: the first time a char in T is compared to P. Total is exactly m. • Shift phase: first comparisons made after a shift. Total is at most m. • Time complexity: O(2m)
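A compact KMP implementation follows the idea on these slides. Note it uses the classic failure function (longest border of each prefix) rather than the stricter sp′ variant above, which only changes how far some shifts go, not correctness or the O(m + n) bound; function names are illustrative:

```python
def kmp_search(T, P):
    """Find all occurrences of P in T in O(m + n) time."""
    n = len(P)
    fail = [0] * (n + 1)             # fail[i]: longest border of P[:i]
    k = 0
    for i in range(1, n):            # preprocessing: O(n)
        while k > 0 and P[i] != P[k]:
            k = fail[k]
        if P[i] == P[k]:
            k += 1
        fail[i + 1] = k
    hits, k = [], 0
    for j, c in enumerate(T):        # scanning: j never moves backwards
        while k > 0 and c != P[k]:
            k = fail[k]              # follow failure links on mismatch
        if c == P[k]:
            k += 1
        if k == n:                   # full match ending at position j
            hits.append(j - n + 1)
            k = fail[k]
    return hits

print(kmp_search("aacaataaaaataaccttacta", "aataac"))  # → [9]
```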

  24. KMP algorithm using DFA (Deterministic Finite Automata) P: aataac • Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3 • DFA: if the next char in T is t after matching 5 chars, go to state 3 • States 0-6; all other inputs go to state 0.

  25. DFA Example
T:      aacaataataataaccttacta
states: 1201234534534560001001
Each char in T will be examined exactly once. Therefore, exactly m comparisons are made. But it takes longer to do the pre-processing, and more space is needed to store the DFA.
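The DFA can be built without constructing failure links first, using the standard simulation trick: an auxiliary state x tracks where the automaton would be after reading P[1..j], and mismatch transitions are copied from that state's row. A sketch with illustrative names:

```python
def build_dfa(P, alphabet):
    """dfa[c][j]: state after reading char c while j chars are matched."""
    n = len(P)
    dfa = {c: [0] * n for c in alphabet}
    dfa[P[0]][0] = 1
    x = 0                            # state reached by simulating P[1:j]
    for j in range(1, n):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]    # mismatch: copy the restart state's row
        dfa[P[j]][j] = j + 1         # match: advance to the next state
        x = dfa[P[j]][x]
    return dfa, x                    # x: state to resume in after a match

def dfa_search(T, P, alphabet):
    dfa, restart = build_dfa(P, alphabet)
    state, n, hits = 0, len(P), []
    for i, c in enumerate(T):        # each char of T examined exactly once
        state = dfa[c][state]
        if state == n:               # accept state reached
            hits.append(i - n + 1)
            state = restart          # allows overlapping matches
    return hits

print(dfa_search("aacaataataataaccttacta", "aataac", "act"))  # → [9]
```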

  26. Difference between Failure Link and DFA • Failure link • Preprocessing time and space are O(n), regardless of alphabet size • Comparison time is at most 2m (at least m) • DFA • Preprocessing time and space are O(n|Σ|) • May be a problem for a very large alphabet • For example, each “char” is a big integer • Chinese characters • Comparison time is always m.

  27. Boyer-Moore algorithm • Often the algorithm of choice for many cases • One T and one P • We will talk about it later if we have time • In its original version it does not guarantee linear time • Some modifications did • In practice sub-linear

  28. The set matching problem • Find all occurrences of a set of patterns in T • First idea: run KMP or BM for each P • O(km + n) • k: number of patterns • m: length of text • n: total length of patterns • Better idea: combine all patterns together and search in one run

  29. A simpler problem: spell-checking • A dictionary contains five words: • potato • poetry • pottery • science • school • Given a document, check if any word is (not) in the dictionary • Words in document are separated by special chars. • Relatively easy.

  30. Keyword tree for spell checking Example document: This version of the potato gun was inspired by the Weird Science team out of Illinois • O(n) time to construct. n: total length of patterns. • Search time: O(m). m: length of text • A common prefix only needs to be compared once. • What if there is no space between words?
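A keyword tree is just a trie over the dictionary words. A minimal sketch using nested dicts (the helper names are made up for illustration):

```python
def build_trie(words):
    """Keyword tree as nested dicts; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def unknown_words(text, trie):
    """Spell check: return the words of text not in the dictionary."""
    bad = []
    for word in text.lower().split():
        node = trie
        for ch in word:
            node = node.get(ch)
            if node is None:
                break
        if node is None or "$" not in node:
            bad.append(word)
    return bad

trie = build_trie(["potato", "poetry", "pottery", "science", "school"])
print(unknown_words("potato science pottery gun", trie))  # → ['gun']
```

Shared prefixes (po-, pot-, sc-) are stored, and compared, only once.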

  31. Aho-Corasick algorithm • Basis of the fgrep algorithm • Generalizing KMP • Using failure links • Example: given the following 4 patterns: • potato • tattoo • theater • other

  32. Keyword tree [keyword tree for potato, tattoo, theater, other; leaves numbered 1-4]

  33. Keyword tree Searching T: potherotathxythopotattooattoo without failure links: walk down from the root at each starting position of T

  34. Keyword tree Without failure links the search costs O(mn). m: length of text; n: length of longest pattern

  35. Keyword tree with a failure link [following a failure link during the search avoids restarting at the root]

  37. Keyword tree with all failure links

  38. Example Searching T: potherotathxythopotattooattoo using the failure links

  43. Aho-Corasick algorithm • O(n) preprocessing, and O(m+k) searching. • n: total length of patterns. • m: length of text • k: # of occurrences. • Can create a DFA similar to the one in KMP. • Requires more space • Preprocessing time depends on alphabet size • Search time is constant per char • Q: Where can this algorithm be used in previous topics? • A: BLAST • Given a query sequence, we generate many seed sequences (k-mers) • Search for exact matches to these seed sequences • Extend exact matches into longer inexact matches
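A compact Aho-Corasick sketch: the goto/fail/output tables are built with a BFS over the keyword tree (one node per char here, rather than compressed edges; all names are illustrative):

```python
from collections import deque

def ac_search(text, patterns):
    """Find all (position, pattern) occurrences in one pass over text."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                    # phase 1: build the keyword tree
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())       # phase 2: failure links by BFS
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]               # climb the failure links
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]        # inherit patterns ending here
    s, hits = 0, []                       # phase 3: scan the text once
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

print(ac_search("potherotathxythopotattooattoo",
                ["potato", "tattoo", "theater", "other"]))
```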

  44. Suffix Tree • All algorithms we talked about so far preprocess pattern(s) • Karp-Rabin: small pattern, small alphabet • Boyer-Moore: fastest in practice. O(m) worst case. • KMP: O(m) • Aho-Corasick: O(m) • In some cases we may prefer to pre-process T • Fixed T, varying P • Suffix tree: basically a keyword tree of all suffixes

  45. Suffix tree • T: xabxac • Suffixes: • xabxac • abxac • bxac • xac • ac • c • Naïve construction: O(m²) using Aho-Corasick. • Smarter: O(m). Very technical, big constant factor. • Difference from a keyword tree: create an internal node only when there is a branch
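A minimal sketch of the idea using an uncompressed suffix trie rather than the compressed suffix tree of the slides (so O(m²) space; a real suffix tree merges non-branching paths, and Ukkonen's algorithm builds it in O(m)). Names are illustrative:

```python
def build_suffix_trie(T):
    """Insert every suffix of T$ into a trie of nested dicts."""
    T += "$"                          # unique terminator (see next slide)
    root = {}
    for i in range(len(T)):           # O(m^2) construction
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {})
        node["start"] = i             # leaf: where this suffix begins
    return root

def find_occurrences(trie, P):
    """Walk down P, then collect every leaf in the subtree below."""
    node = trie
    for ch in P:
        if ch not in node:
            return []                 # P does not occur in T
        node = node[ch]
    stack, hits = [node], []
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "start":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)

trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))   # → [0, 3]
```

Once T is preprocessed, any pattern P is matched in time proportional to |P| plus the number of occurrences, which is the point of preprocessing T instead of P.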

  46. Suffix tree implementation • Explicitly labeling the sequence end • T: xabxa$ • Without the terminal $, suffixes such as xa and a end inside other suffixes and get no leaf; appending $ guarantees one leaf per suffix

  47. Suffix tree implementation • Implicitly labeling edges • T: xabxa$ • Each edge label is stored as a pair of indices into T (e.g. 1:2 for xa, 3:$ for bxa$), so the tree takes O(m) space

  48. Suffix links • Similar to failure links in a keyword tree • Only internal nodes with branches are linked

  49. Suffix tree construction T: acatgacatt (positions 1-10) Insert suffix 1: a single edge labeled 1:$, leaf 1

  50. Suffix tree construction T: acatgacatt Insert suffix 2: another edge from the root labeled 2:$, leaf 2
