
Exact String Search

Exact String Search. Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005. Boyer-Moore. Method of choice for exact string search, for a single pattern Typically, examines fewer than m characters of the text (sublinear time)


Presentation Transcript


  1. Exact String Search Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005

  2. Boyer-Moore • Method of choice for exact string search, for a single pattern • Typically, examines fewer than m characters of the text (sublinear time) • Linear worst case running time • Conceptually very similar to K-M-P, but with a more complicated running time proof • Empirically, better for English text than DNA sequence

  3. Boyer-Moore • Three key ideas • Right to left scan • Bad character rule • (Strong) good suffix rule • The combination of these ideas can produce large pattern shifts. • Provable O(n+m) running time when pattern is not in the text • Needs an extension for the case when the pattern is in the text to achieve linear running time.

  4. Right to left scan / bad character rule

         0        1
         12345678901234567
      T: xpbctbxabpqxctbpq
      P:   tpabxab
             *^^^^

  5. Right to left scan / bad character rule

         0        1
         12345678901234567
      T: xpbctbxabpqxctbpq
      P:   tpabxab
             *^^^^
      P:     tpabxab
                   *

  6. Right to left scan / bad character rule

         0        1
         123456789012345678
      T: xpbctbxabpqxctbpqz
      P:   tpabxab
             *^^^^
      P:     tpabxab
                   *
      P:            tpabxab

  7. Bad character rule Comparing r-to-l, mismatch at i of P, k of T: • If T(k) is absent from P, shift left end of P to k+1 of T • If right-most T(k) in P is to left of i, shift pattern to align T(k) characters • Otherwise shift pattern 1 position

  8. Right to left scan / bad character rule

         0        1
         12345678901234567
      T: xpbctbaabpqxctbpq
      P:   tpabxab
               *^^

  9. Right to left scan / bad character rule

         0        1
         12345678901234567
      T: xpbctbaabpqxctbpq
      P:   tpabxab
               *^^

  10. Extended bad character rule Comparing r-to-l, mismatch at i of P, k of T: • If T(k) is absent from P[1…i-1], shift left end of P to k+1 of T • For right-most T(k) in P to left of i, shift pattern to align T(k) characters • Otherwise shift pattern 1 position

  11. Right to left scan / extended bad character rule

         0        1
         12345678901234567
      T: xpbctbaabpqxctbpq
      P:   tpabxab
               *^^

  12. Right to left scan / extended bad character rule

         0        1
         12345678901234567
      T: xpbctbaabpqxctbpq
      P:     tpabxab

  13. (Extended) bad character rule • For all x in Σ, R(x) is the position of the right-most occurrence of x in P. R(x) is zero if x is absent from P. • Comp. r-to-l, mismatch i of P, k of T: shift P right max[1,i-R(T(k))] positions • For extended bad character rule, need to lookup R(x,i)
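The two shift rules on slide 13 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, positions are 1-based as on the slides, and the extended lookup R(x,i) is done with a direct scan rather than the precomputed per-character position lists a real implementation would use.

```python
def rightmost_table(pattern):
    """R(x): 1-based position of the right-most occurrence of x in pattern.
    Characters absent from the pattern are missing from the dict (treated as 0)."""
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = pos  # later occurrences overwrite earlier ones
    return table


def bad_char_shift(pattern, i, text_char):
    """Simple rule: shift by max(1, i - R(T(k))) for a mismatch at P(i)."""
    r = rightmost_table(pattern).get(text_char, 0)
    return max(1, i - r)


def extended_bad_char_shift(pattern, i, text_char):
    """Extended rule: use R(x, i), the right-most occurrence of x in P[1..i-1]."""
    # pattern[0:i-1] is P[1..i-1]; rfind returns -1 when absent, giving r = 0.
    r = pattern.rfind(text_char, 0, i - 1) + 1
    return i - r  # always at least 1 because r <= i - 1


# Slide 8: P = tpabxab, mismatch at i = 5 against T(7) = 'a'.
print(bad_char_shift("tpabxab", 5, "a"))           # simple rule
print(extended_bad_char_shift("tpabxab", 5, "a"))  # extended rule
```

On the slide 8 example the simple rule can only shift by one position, because the right-most 'a' in P lies to the right of the mismatch, while the extended rule aligns the 'a' at position 3 and shifts by two.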

  14. (Strong) good suffix rule

         0        1
         123456789012345678
      T: prstabstubabvqxrst
      P:   qcabdabdab
                  *

  15. (Strong) good suffix rule

         0        1
         123456789012345678
      T: prstabstubabvqxrst
      P:   qcabdabdab
                  *^^
      P:         qcabdabdab

  16. (Strong) good suffix rule

         0        1
         123456789012345678
      T: prstabstudabvqxrst
      P:     abdubdab
                 *^^^

  17. (Strong) good suffix rule

         0        1
         123456789012345678
      T: prstabstudabvqxrst
      P:     abdubdab
                 *^^^
      P:           abdabdab

  18. (Strong) good suffix rule Substring t of T matches suffix of P: • Find the right-most copy t’ of t in P s.t. t’ is not a suffix of P and char to left of t’ in P ≠ char to left of t in P; shift P to align t’ in P with t in T • If no such t’, shift P by the least amount so that a prefix of P aligns with a matching suffix of t in T (the longest prefix of P that is also a suffix of t); if no prefix qualifies, shift P past t entirely
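As a check on the rule above, the shift can be computed by brute force straight from its wording. This is a sketch for illustration, not the preprocessing the lecture goes on to develop: the function name is hypothetical, i is the 1-based mismatch position in P, and at least one character is assumed to have matched (1 ≤ i < n).

```python
def strong_good_suffix_shift(pattern, i):
    """Shift after a mismatch at P(i) (1-based) once P[i+1..n] has matched.

    Brute force from the statement of the rule; assumes 1 <= i < n.
    """
    n = len(pattern)
    t = pattern[i:]        # matched suffix P[i+1..n]
    bad = pattern[i - 1]   # P(i), the character to the left of t in P
    # Case 1: right-most copy t' of t that is not a suffix of P and whose
    # preceding character (if any) differs from P(i).
    for start in range(n - len(t) - 1, -1, -1):  # 0-based start of t'
        if pattern[start:start + len(t)] == t:
            if start == 0 or pattern[start - 1] != bad:
                return (n - len(t)) - start
    # Case 2: longest prefix of P that is also a suffix of t.
    for length in range(len(t), 0, -1):
        if pattern[:length] == t[-length:]:
            return n - length
    # Neither case applies: move P completely past t.
    return n


print(strong_good_suffix_shift("qcabdabdab", 8))  # slide 15 example
print(strong_good_suffix_shift("abdubdab", 5))    # slide 17 example
```

Both slide examples give a shift of six positions: the first through a copy t’ preceded by a different character, the second through the prefix "ab" because no suitable copy of "dab" exists.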

  19. (Strong) good suffix rule Definitions: L(i) – max j < n such that P[i…n] matches suffix of P[1…j], 0 if no such j. L’(i) – max j < n such that P[i…n] matches suffix of P[1…j] and char. before suffix ≠ P(i-1), 0 if no such j. Weak and strong shifts for first part of good suffix rule.

  20. Computing L’(i) Definition: N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P. Compare with: Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S.

  21. Computing L’(i) Definition: N_j(P) is the length of the longest suffix of P[1…j] that is also a suffix of P. (!) Compare with: Z_i(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S. Compute N_j(P) as Z_{n-j+1}(reverse(P)).
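The reduction from N values to Z values on slide 21 might look like the following in Python. It is a sketch: `z_array` is a standard linear-time Z computation, both function names are hypothetical, and the Z array is 0-based, so the slide's Z_{n-j+1} becomes index n-j.

```python
def z_array(s):
    """z[i] (0-based) = length of the longest prefix of s[i:] that is a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # s[left:right] is the right-most Z-box found so far
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position in the Z-box.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern):
    """N[j] for j = 1..n (index 0 unused): longest suffix of P[1..j]
    that is also a suffix of P."""
    n = len(pattern)
    z = z_array(pattern[::-1])
    # N_j(P) = Z_{n-j+1}(reverse(P)); 1-based index n-j+1 is 0-based index n-j.
    return [0] + [z[n - j] for j in range(1, n + 1)]


print(n_array("qcabdabdab"))
```

For P = qcabdabdab the nonzero entries are N_4 = 2, N_7 = 5 and N_10 = 10, matching the suffixes "ab" and "abdab" that recur inside the pattern.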

  22. Computing L’(i) • L’(i) – max j < n s.t. N_j(P) = |P[i…n]| = (n – i + 1)

  23. (Strong) good suffix rule Definition: l’(i) – length of the longest prefix of P that is also a suffix of P[i…n], 0 if no such prefix exists. l’(i) – max j ≤ |P[i…n]| = (n – i + 1) s.t. N_j(P) = j
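Given the N values, both tables follow from one pass each, as the characterisations on slides 22 and 23 suggest. The sketch below is self-contained, so it computes N by direct comparison (quadratic, for illustration only) rather than through the Z-based reduction; the function names and the 1-based list layout are assumptions made for this example.

```python
def suffix_lengths(pattern):
    """N[j], j = 1..n (index 0 unused), by direct comparison."""
    n = len(pattern)
    table = [0] * (n + 1)
    for j in range(1, n + 1):
        length = 0
        while length < j and pattern[j - 1 - length] == pattern[n - 1 - length]:
            length += 1
        table[j] = length
    return table


def good_suffix_tables(pattern):
    """Return (L', l') as 1-based lists of length n + 2 (index 0 unused)."""
    n = len(pattern)
    N = suffix_lengths(pattern)
    big_l = [0] * (n + 2)
    for j in range(1, n):          # j < n, increasing, so the largest j wins
        if N[j] > 0:
            i = n - N[j] + 1       # the i with |P[i..n]| = N_j(P)
            big_l[i] = j
    small_l = [0] * (n + 2)
    longest = 0
    for j in range(1, n + 1):      # prefixes P[1..j] that are also suffixes of P
        if N[j] == j:
            longest = j
        small_l[n - j + 1] = longest   # largest such j' <= j = |P[n-j+1..n]|
    return big_l, small_l


big_l, small_l = good_suffix_tables("qcabdabdab")
print(big_l[9])   # L'(9), used after a mismatch at i = 8
```

With L’(9) = 4 and n = 10, a mismatch at position 8 shifts the pattern by n – L’(9) = 6, in agreement with the slide 15 example.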

  24. Boyer-Moore pseudocode (n = |P|, m = |T|)

      Compute L’(i), l’(i), and R(x) for x in Σ.
      k = n
      while k ≤ m
          i = n, h = k
          while i > 0 and P(i) = T(h)
              i--; h--
          if i = 0
              report occurrence of P in T ending at position k
              k = k + n – l’(2)
          else
              if i = n, λ = n – 1   (nothing matched: good suffix shift is 1)
              else if L’(i+1) > 0, λ = L’(i+1)
              else λ = l’(i+1)
              k = k + max{ 1, i – R(T(h)), n – λ }
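The pseudocode translates almost line for line into Python. What follows is a sketch under a few assumptions, not a tuned implementation: the function name is hypothetical, the N values are computed by direct comparison instead of the Z-based method, the simple (not extended) bad character rule is used, occurrences are reported as 0-based start indices, and Galil's extension is not included.

```python
def boyer_moore(pattern, text):
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []

    # R(x): 1-based right-most position of x in P (absent characters count as 0).
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}

    # N[j]: longest suffix of P[1..j] that is also a suffix of P.
    # Direct comparison for brevity (quadratic in n).
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        while N[j] < j and pattern[j - 1 - N[j]] == pattern[n - 1 - N[j]]:
            N[j] += 1

    # L'(i) and l'(i), 1-based, as defined on the slides.
    big_l = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            big_l[n - N[j] + 1] = j
    small_l = [0] * (n + 2)
    longest = 0
    for j in range(1, n + 1):
        if N[j] == j:
            longest = j
        small_l[n - j + 1] = longest

    hits = []
    k = n  # 1-based position in T aligned with P(n)
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n)  # 0-based start of the occurrence
            k += n - small_l[2]
        else:
            if i == n:
                good = 1  # nothing matched, so the good suffix rule shifts by 1
            elif big_l[i + 1] > 0:
                good = n - big_l[i + 1]
            else:
                good = n - small_l[i + 1]
            bad = i - R.get(text[h - 1], 0)
            k += max(1, bad, good)
    return hits


print(boyer_moore("ab", "abcabab"))
```

Keeping k and h 1-based mirrors the slide notation; only the indexing into the Python strings subtracts one.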

  25. Running time analysis • Notice that unlike K-M-P, we might re-compare text characters that matched in a previous iteration. • Worst instance does Θ(nm) total comparisons, but only if P is in T • If P is not in T, O(n+m) running time • complicated proof! • What goes wrong when P is in T?

  26. Worst case instance, P in T

         0        1
         12345678901234567
      T: aaaaaaaaaaaaaaaaa
      P: aaaaaaa
         ^^^^^^^
      P:  aaaaaaa
          ^^^^^^^

  27. Galil’s Extension • Comparing r-to-l, n of P aligned to k of T, matched down to character s of T: if a shift moves pos 1 of P past s, then the prefix of P is known to match T up to pos k. • Skip these comparisons • Sufficient for linear time bound, whether or not P is in T.

  28. Worst case instance, P in T

         0        1
         12345678901234567
      T: aaaaaaaaaaaaaaaaa
      P: aaaaaaa
         ^^^^^^^
      P:  aaaaaaa
                ^
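The effect of skipping the known prefix can be seen by counting comparisons on this worst-case instance. The sketch below is deliberately simplified and is not full Boyer-Moore: mismatches shift by one position, occurrences shift by n – l’(2) as in the pseudocode, and Galil's rule is applied only after an occurrence. The function name and the `use_galil` flag are hypothetical.

```python
def count_comparisons(pattern, text, use_galil):
    """Count character comparisons made while finding all occurrences.

    Simplified for illustration: mismatches shift by 1, occurrences shift
    by n - l'(2). Returns (comparisons, 0-based start indices).
    """
    n, m = len(pattern), len(text)
    # l'(2): longest proper prefix of P that is also a suffix of P.
    border = next((b for b in range(n - 1, 0, -1)
                   if pattern[:b] == pattern[n - b:]), 0)
    comparisons = 0
    hits = []
    k = n       # 1-based text position aligned with P(n)
    known = 0   # text positions <= known already match the prefix of P
    while k <= m:
        i, h = n, k
        matched = True
        while i > 0 and h > known:
            comparisons += 1
            if pattern[i - 1] != text[h - 1]:
                matched = False
                break
            i -= 1
            h -= 1
        if matched:
            hits.append(k - n)
            # After the shift, the first l'(2) characters of P sit on text
            # that was just matched, so they need not be compared again.
            known = k if use_galil else 0
            k += n - border
        else:
            known = 0
            k += 1
    return comparisons, hits


print(count_comparisons("aaaaaaa", "a" * 17, use_galil=False)[0])
print(count_comparisons("aaaaaaa", "a" * 17, use_galil=True)[0])
```

Without the rule every one of the 11 alignments re-compares all 7 characters; with it, each alignment after the first needs a single new comparison.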

  29. Galil’s Extension

         0        1
         123456789012345678
      T: prstabstudabvqxrst
      P:     abdubdab
                 *^^^
      P:           abdabdab

  30. Lessons From B-M • Sub-linear time is possible • But we still need to read T from disk! • Bad cases require periodicity in P or T • Matching random P with T is easy! • Large alphabets mean large shifts • Small alphabets make complicated shift data-structures possible • B-M better for “English” and amino acids than for DNA.
