1 / 46

CS 5263 Bioinformatics

CS 5263 Bioinformatics. Lecture 5: Affine Gap Penalties. Last lecture. Local Sequence Alignment Bounded Dynamic Programming Linear Space Sequence Alignment. The Smith-Waterman algorithm. Initialization : F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d

gloryj
Download Presentation

CS 5263 Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties

  2. Last lecture • Local Sequence Alignment • Bounded Dynamic Programming • Linear Space Sequence Alignment

  3. The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max

  4. The Smith-Waterman algorithm Termination: • If we want the best local alignment… FOPT = maxi,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

  5. Bounded Dynamic Programming • O(kM) time • O(kN) memory x1 ………………………… xM yN ………………………… y1 k

  6. Linear-space alignment • O(M+N) memory • 2MN time M/2 k* M/2 N-k*

  7. Homework Problem 5 hints Dot matrix for visualizing seq similarities • Seq1: x[1..m] • Seq2: y[1..n] A(i, j) = 1 if (xi, yj) = 1 A(i, j) = 1 if k=1:10((xi+k, yj+k)) > 7 A dot matrix does not do any alignment (global or local). It helps to detect strongly conserved regions. A(i, j) = 1 if k=1:20((xi+k, yj+k)) > 15

  8. Seq1 Seq2

  9. Today • How to model gaps more accurately? • Statistics of alignments • Where does (xi, yj)come from? • Are two aligned sequences actually related? – not today

  10. GACGCCGAACG |||| | | || GACG-C-A-CG What’s a better alignment? GACGCCGAACG ||||| ||| GACGC---ACG Score = 8 x m – 3 x d Score = 8 x m – 3 x d • However, gaps usually occur in bunches. • During evolution, chunks of DNA may be lost entirely • Aligning genomic sequences vs. cDNAs (reverse complimentary to mRNAs)

  11. Model gaps more accurately • Current model: • Gap of length n incurs penalty nd • General: • Convex function • E.g. (n) = c * sqrt (n)  n  n

  12. General gap dynamic programming Initialization: same Iteration: F(i-1, j-1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k,j) – (i-k) maxk=0…j-1F(i,k) – (j-k) Termination: same Running Time: O((M+N)MN) (cubic) Space: O(NM) (linear-space algorithm not applicable)

  13. Compromise: affine gaps (n) (n) = d + (n – 1)e | | gap gap open extension e d Match: 2 Gap open: -5 Gap extension: -1 GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG 8x2-5-2 = 9 8x2-3x5 = 1 • We want to find the optimal alignment with affine gap penalty in • O(MN) time • O(MN) or better O(M+N) memory

  14. Allowing affine gap penalties • Still three cases • xi aligned with yj • Xi aligns to a gap • Are we continuing a gap in x? (if no, start is more expensive) • Yj aligns to a gap • Are we continuing a gap in y? (if no, start is more expensive) • We can use a finite state machine to represent the three cases as three states • The machine has two heads, reading the chars on the two strings separately • At every step, each head reads 0 or 1 char from each sequence • Depending on what it reads, goes to a different state, and produces different scores

  15. Finite State Machine Input Output ? / ? ? / ? Ix ? / ? ? / ? F ? / ? Iy ? / ? State ? / ? F: have just read 1 char from each seq (xi aligned to yj ) Ix: have read 0 char from x. (yj aligned to a gap) Iy: have read 0 char from y (xi aligned to a gap)

  16. Input Output (-, yj) / e (xi,yj) /  Ix (xi,yj) /  (-, yj) / d F (xi,-) / d Iy (xi,-) / e Start state (xi,yj) / 

  17. (-, yj) / e (xi,yj) /  Ix (xi,yj) /  (-, yj) / d F (xi,-) / d Iy (xi,-) / e start state (xi,yj) /  F-F-Iy-F-Ix F-F-F-F F-Iy-F-F-Ix AAC ||| ACT AAC- | | A-CT AAC- || -ACT AAC ACT Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. Optimal alignment: find a state path to read the two sequences such that the total output score is the highest

  18. Dynamic programming • We encode this information in three different matrices • For each element (i,j) we use three variables • F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj • Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap • Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap xi xi xi yj yj yj Iy(i, j) Ix(i, j) F(i, j)

  19. (-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i-1, j-1) + (xi, yj) F(i, j) = max Ix(i-1, j-1) + (xi, yj) Iy(i-1, j-1) + (xi, yj) xi yj

  20. (-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i, j-1) + d Ix(i, j) = max Ix(i, j-1) + e xi yj Ix(i, j)

  21. (-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i-1, j) + d Iy(i, j) = max Iy(i-1, j) + e xi yj Iy(i, j)

  22. F(i – 1, j – 1) F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) F(i, j – 1) + d Ix(i, j) = max Ix(i, j – 1) + e F(i – 1, j) + d Iy(i, j) = max Iy(i – 1, j) + e Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

  23. Data dependency F i Iy Ix j i-1 i-1 j-1 j-1

  24. Data dependency F i Ix Iy j i i j j

  25. Data dependency • If we stack all three matrices • No cyclic dependency • Therefore, we can fill in all three matrices in order

  26. Algorithm • for i = 1:m • for j = 1:n • Fill in F(i, j), Ix(i, j), Iy(i, j) • end end • F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N)) • Time: O(MN) • Space: O(MN) or O(N) when combine with the linear-space algorithm

  27. Exercise • x = GCAC • y = GCC • m = 2 • s = -2 • d = -5 • e = -1

  28. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F: aligned on both Iy: Insertion on y y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix: Insertion on x

  29. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

  30. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

  31. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

  32. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

  33. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

  34. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1,j) G C A C F(i-1,j) e=-1 d=-5 Iy(i,j) Ix

  35. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

  36. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

  37. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

  38. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

  39. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1,j) G C A C F(i-1,j) e=-1 d=-5 Iy(i,j) Ix

  40. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix

  41. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix

  42. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1,j) G C A C F(i-1,j) e=-1 d=-5 Iy(i,j) Ix

  43. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1,j) G C A C F(i-1,j) e=-1 d=-5 Iy(i,j) Ix

  44. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix

  45. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix

  46. y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C x GCAC || | GC-C y F Iy y = G C C y = G C C x = x = G C A C G C A C Ix

More Related