300 likes | 336 Views
Exact String Search. Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005. Boyer-Moore. Method of choice for exact string search, for a single pattern Typically, examines fewer than m characters of the text (sublinear time)
E N D
Exact String Search Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005
Boyer-Moore • Method of choice for exact string search, for a single pattern • Typically, examines fewer than m characters of the text (sublinear time) • Linear worst case running time • Conceptually very similar to K-M-P, but more complicated to running time proof • Empirically, better for english text than DNA sequence
Boyer-Moore • Three key ideas • Right to left scan • Bad character rule • (Strong) good suffix rule • The combination of these ideas can produce large pattern shifts. • Provable O(n+m) running time when pattern is not in the text • need extension for case when pattern is in the text to achieve linear running time.
Right to left scan / bad character rule 0 1 12345678901234567 T:xpbctbxabpqxctbpq P: tpabxab *^^^^
Right to left scan / bad character rule 0 1 12345678901234567 T:xpbctbxabpqxctbpq P: tpabxab *^^^^ P: tpabxab *
Right to left scan / bad character rule 0 1 123456789012345678 T:xpbctbxabpqxctbpqz P: tpabxab *^^^^ P: tpabxab * P: tpabxab
Bad character rule Comparing r-to-l, mismatch at i of P, k of T: If T(k) is absent from Pshift left end of P to k+1 of T If right-most T(k) in P is to left of i shift pattern to align T(k) characters Otherwise shift pattern 1 position
Right to left scan / bad character rule 0 1 12345678901234567 T:xpbctbaabpqxctbpq P: tpabxab *^^
Right to left scan / bad character rule 0 1 12345678901234567 T:xpbctbaabpqxctbpq P: tpabxab *^^
Extended bad character rule Comparing r-to-l, mismatch at i of P, k of T: If T(k) is absent from P[1…i-1]shift left end of P to k+1 of T For right-most T(k) in P to left of i shift pattern to align T(k) characters Otherwise shift pattern 1 position
Right to left scan / extended bad character rule 0 1 12345678901234567 T:xpbctbaabpqxctbpq P: tpabxab *^^
Right to left scan / extended bad character rule 0 1 12345678901234567 T:xpbctbaabpqxctbpq P: tpabxab
(Extended) bad character rule • For all x in Σ, R(x) is the position of the right-most occurrence of x in P. R(x) is zero if x is absent from P. • Comp. r-to-l, mismatch i of P, k of T: shift P right max[1,i-R(T(k))] positions • For extended bad character rule, need to lookup R(x,i)
(Strong) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *
(Strong) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab
(Strong) good suffix rule 0 1 123456789012345678 T:prstabstudabvqxrst P: abdubdab *^^^
(Strong) good suffix rule 0 1 123456789012345678 T:prstabstudabvqxrst P: abdubdab *^^^ P: abdabdab
(Strong) good suffix rule Substring t of T matches suffix of P: • Find the right-most copy t’ in Ps.t. t’ is not a suffix of P andchar to left of t’ in P ≠ char to left of t in Pshift P to align t’ in P with t in T • If no such t’ shift P so that the longest proper prefix of P aligns with suffix of P
(Stong) good suffix rule Definitions: L(i) – max j < n such that P[i…n] matches suffix of P[1…j], 0 if no such j. L’(i) – max j < n such that P[i…n] matches suffix of P[1…j] and char. before suffix ≠ P(i-1), 0 if no such j. Weak and strong shifts for first part of good suffix rule.
Computing L’(i) Definition: Nj(P) is the length of the longest suffix of P[1…j] that is also a suffix of P. compare with: Zi(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S.
Computing L’(i) Definition: Nj(P) is the length of the longest suffix of P[1…j] that is also a suffix of P. (!) compare with: Zi(S) is the length of the longest prefix of S[i…|S|] that is also a prefix of S. Compute Nj(P) as Zn-j+1(reverse(P)).
Computing L’(i) • L’(i) – max j < n s.t. Nj(P) = |P[i…n]| = (n – i +1)
(Strong) good suffix rule Definition: l’(i) – length of the longest prefix of P that is also a suffix of P[i…n], 0 if no such prefix exists. l’(i) – max j < (n – i + 1) s.t. Nj(P) = j
Boyer-Moore psuedo code Compute L’(i), l’(i), and R(x) for x in Σ. k = n while k ≤ n i = n, h = k while i > 0 and P(i) = T(h) i--; h-- if i = 0 occurrence of P in T k = k + n – l’(2) else If L’(i+1) > 0, λ = L’(i+1), λ = l’(i+1) k = k + max{ 1, i - R(T(h)), n – λ }
Running time analysis • Notice that unlike K-M-P, we might re-compare text characters that matched in a previous iteration. • Worst instance does Θ(nm) total comparisons, but only if P is in T • If P is not in T, O(n+m) running time • complicated proof! • What goes wrong when P is in T?
Worst case instance, P in T 0 1 12345678901234567 T:aaaaaaaaaaaaaaaaa P: aaaaaaa ^^^^^^^ P: aaaaaaa ^^^^^^^
Galil’s Extention • Comparing r-to-l, n of P aligned to k of T, matched at character s of T: If pos 1 of P shifts past s, thenprefix of P matches in T up to pos k. • skip these comparisons • Sufficient for linear time bound, whether or not P is in T or not.
Worst case instance, P in T 0 1 12345678901234567 T:aaaaaaaaaaaaaaaaa P: aaaaaaa ^^^^^^^ P: aaaaaaa ^
Galil’s Extention 0 1 123456789012345678 T:prstabstudabvqxrst P: abdubdab *^^^ P: abdabdab
Lessons From B-M • Sub-linear time is possible • But we still need to read T from disk! • Bad cases require periodicity in P or T • matching random P with T is easy! • Large alphabets mean large shifts • Small alphabets make complicated shift data-structures possible • B-M better for “english” and amino-acids than for DNA.