1 / 36

Dynamic Programming and Biological Sequence Comparison

Dynamic Programming and Biological Sequence Comparison. Part I. Topic II – Biological Sequence Alignment and Database Search. Part I (Topic-2a): Dynamic programming and Sequence comparison Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment

Download Presentation

Dynamic Programming and Biological Sequence Comparison

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Dynamic Programming and Biological Sequence Comparison Part I

  2. Topic II – Biological Sequence Alignment and Database Search • Part I (Topic-2a): Dynamic programming and Sequence comparison • Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment • Part III (Topic-2c): Multiple sequence alignment \course\eleg667-01-f\Topic-2a.ppt

  3. Outline • Concept of alignment • Two algorithm design techniques; • Dynamic Programming: Examples • Applying DP to Sequence Comparison; • The database search problem • Heuristic algorithms to database search \course\eleg667-01-f\Topic-2a.ppt

  4. Alignment • The two sequences will have the same length (after possible insertions of spaces on either or both of them) • No space in one sequence can be aligned with a space in the other • Spaces can be inserted at the beginning or end of the sequences \course\eleg667-01-f\Topic-2a.ppt

  5. Biological Sequence Alignment and Database Search • We have two sequences over the same alphabet, both about the same length (tens of thousands of characters) and the sequences are almost equal. The average frequency of these differences is low, say, one each hundred characters. We want to find the places where the differences occur. • We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to suffix of the other. \course\eleg667-01-f\Topic-2a.ppt

  6. (cont’d) 3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity. 4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar. 5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others. \course\eleg667-01-f\Topic-2a.ppt

  7. Breaking Problems Down: Two Related Algorithm Design Techniques • Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem. • Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved. \course\eleg667-01-f\Topic-2a.ppt

  8. 9 1 25 4 15 becomes 4 1 9 25 15 4 1 25 15 becomes becomes 1 4 15 25 15 1 4 25 Divide and Conquer – Example 1 Quick Sort \course\eleg667-01-f\Topic-2a.ppt

  9. Divide and Conquer – Example 2 The Fibonacci numbers F1 = 1, F2 = 1 Fn = Fn-1 + Fn-2 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … Fib(n) { if (n < 2) return 1; else return Fib(n-1)+Fib(n-2); } \course\eleg667-01-f\Topic-2a.ppt

  10. F(7) + F(6) + F(5) F(5) F(3) F(3) F(3) F(3) F(3) F(4) F(4) F(4) + + + + + + + + + + T(n)  #Internal_nodes = #leaves - 1 F(1) F(1) F(1) F(1) F(1) F(2) F(2) F(2) F(2) F(2) F(2) F(2) F(2) but #leaves = Fn Fn 1.6n, n >> 1 Fn /Fn-1  1.6 n 1 2 3 4 5 6 7 8 9 10 11 … Fn 1 1 2 3 5 8 13 21 34 55 89 … T(n) = O(1.6n) Divide and Conquer – Example 2 F1 = 1, F2 = 1 Fn = Fn-1 + Fn-2 Exponential Time! \course\eleg667-01-f\Topic-2a.ppt

  11. How to Compute Fib Function Using Dynamic Programming Method? \course\eleg667-01-f\Topic-2a.ppt

  12. tab Extra memory to store intermediate values 1 1 2 3 5 8 13 21 34 55 89 …. Dynamic Programming–Example 1 Fib(n) { int tab[n]; tab[1] = 1; tab[2] = 1; for (j = 3; j <= n; j++) tab[j]=tab[j-1] + tab[j-2]; return tab[n]; } Start by solving the smallest problems Use the partial solutions to solve bigger and bigger problems Linear Time! Space-Time Tradeoff T(n) = O(n) \course\eleg667-01-f\Topic-2a.ppt

  13. Sequence Comparison code full name A alanine C cysteine D aspartate E glutamate F phenylalanine G glycine H histidine I isoleucine K lysine L leucine M methionine N aspartamine P proline Q glutamine R arginine S serine T threonine V valine W tryptophan Y tyrosine Molecular sequence data are at the heart of Computational Biology • DNA sequences • RNA sequences • Protein sequences We can think of these sequences as strings of letters • DNA & RNA: alphabet of 4 letters (A,T,C,G) • Protein: alphabet of 20 letters \course\eleg667-01-f\Topic-2a.ppt

  14. Sequence Comparison – (Cont.) Why compare sequences? • Find similar genes/proteins • Allows to predict function & structure • Locate common subsequences in genes/proteins • Identify common recurrent patterns • Locate sequences that might overlap • Help in sequence assembly \course\eleg667-01-f\Topic-2a.ppt

  15. Sequence X = A T A A G T Sequence Y = A T G C A G T Total = 2 Sequence Comparison – (Cont.) To compare the sequences we need to quantify the similariy matches = 1 mismatches = 0 Score 1 1 0 0 0 0 0 \course\eleg667-01-f\Topic-2a.ppt

  16. Sequence X = A T A A G T Sequence Y = A T G C A G T Total = 3 Sequence Comparison – (Cont.) Taking positions of the letters into account Sequence X = A T A A G T matches = 1 mismatches = 0 Score 0 0 0 0 1 1 1 \course\eleg667-01-f\Topic-2a.ppt

  17. Sequence X = A T A A G T Sequence Y = A T G C A G T Score 1 1 0 –1 1 1 1 Total = 4 Sequence Comparison – (Cont.) How to take possible mutations into account? Sequence X = A T A - A G T matches = 1 mismatches = 0 matches = 1 mismatches = 0 gap = -1 \course\eleg667-01-f\Topic-2a.ppt

  18. G - - A -1 -1 GA - - G - - A G A - G A - - - AG -2 -2 0 -2 -2 GA - - - A GA - A G - A - A - G - - - AG GA A - G - AG - GA A - - - G - A -G - G AG - - G AG - -3 -1 -1 -3 -3 -3 0 -3 -3 0 GA - - - - AG GA - - AG G - A - - A -G G - A - AG G - - A - AG - GA - A -G GA AG G - A AG - - GA - A - -G - GA A -G - G - A A -G - - GA AG - - - GA AG - - -4 -1 -4 -2 -4 -2 0 -2 -4 -4 -1 -4 -2 choose the best score, i.e max(-3, 0, -1) choose the best score, i.e max(-2, 0, -2) choose the best score, i.e max(-1, 0, -3) choose the best score, i.e max(-1, 0, -1) Applying DP to Sequence Comparison Sequence X = GA Sequence Y = AG scores T(n,n) = O(kn) Exponential Time! total score = 0 \course\eleg667-01-f\Topic-2a.ppt

  19. 0 -1 -2 -1 -2 GA - - G - - A G A - G A - - - AG GA - - - A GA - A G - A - A - G - - - AG GA A - G - AG - GA A - - - G - A -G - G AG - - G AG - GA - - - - AG GA - - AG G - A - - A -G G - A - AG G - - A - AG - GA - A -G GA AG G - A AG - - GA - A - -G - GA A -G - G - A A -G - - GA AG - - - GA AG - - Applying DP to Sequence Comparison G A Sequence X = GA Sequence Y = AG T(n,n) = O(n2) A G 0 0 Polynomial Time! 0 0 G - - A -1 -1 -2 -2 0 -2 -2 -3 -1 -1 -3 -3 -3 0 -3 -3 0 -4 -1 -4 -2 -4 -2 0 -2 -4 -4 -1 -4 -2 \course\eleg667-01-f\Topic-2a.ppt

  20. Question: from 1 to 9 how many paths? G A 0 -1 -2 A -1 0 0 G -2 0 0 1 2 4 1 3 5 7 2 3 5 5 6 4 8 8 7 7 8 7 6 8 9 8 7 9 9 9 9 9 9 9 5 9 9 9 8 7 9 9 9 Questions • Queston: when DP comparison ends – how many possible distinct paths have been explored in total for this example? • Answer: Let us count Total = 13 \course\eleg667-01-f\Topic-2a.ppt

  21. DP algorithm for Sequence Comparison Extra memory to store intermediate values sb[i,j] - Substitution Matrix int S[m,n] m = length(X) n = length(Y) for i = 0 to m do S[i,0] = i . g for j = 0 to n do S[j,0] = j . g for i = 1 to m do for j = 1 to n do S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g ) return S[m,n] A T C G 1 0 0 0 A T C G Start by solving the smallest problems 0 1 0 0 0 0 1 0 0 0 0 1 Use the partial solutions to solve bigger and bigger problems \course\eleg667-01-f\Topic-2a.ppt

  22. The Substitution Matrix A B C D E F G H I K L M N P Q R S T V W Y Z A 2 0 -2 0 0 -4 1 -1 -1 -1 -2 -1 0 1 0 -2 1 1 0 -6 -3 0 B 0 2 -4 3 2 -5 0 1 -2 1 -3 -2 2 -1 1 -1 0 0 -2 -5 -3 2 C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4 0 -2 -2 -8 0 -5 D 0 3 -5 4 3 -6 1 1 -2 0 -4 -3 2 -1 2 -1 0 0 -2 -7 -4 3 E 0 2 -5 3 4 -5 0 1 -2 0 -3 -2 1 -1 2 -1 0 0 -2 -7 -4 3 F -4 -5 -4 -6 -5 9 -5 -2 1 -5 2 0 -4 -5 -5 -4 -3 -3 -1 0 7 -5 G 1 0 -3 1 0 -5 5 -2 -3 -2 -4 -3 0 -1 -1 -3 1 0 -1 -7 -5 -1 H -1 1 -3 1 1 -2 -2 6 -2 0 -2 -2 2 0 3 2 -1 -1 -2 -3 0 2 I -1 -2 -2 -2 -2 1 -3 -2 5 -2 2 2 -2 -2 -2 -2 -1 0 4 -5 -1 -2 K -1 1 -5 0 0 -5 -2 0 -2 5 -3 0 1 -1 1 3 0 0 -2 -3 -4 0 L -2 -3 -6 -4 -3 2 -4 -2 2 -3 6 4 -3 -3 -2 -3 -3 -2 2 -2 -1 -3 M -1 -2 -5 -3 -2 0 -3 -2 2 0 4 6 -2 -2 -1 0 -2 -1 2 -4 -2 -2 N 0 2 -4 2 1 -4 0 2 -2 1 -3 -2 2 -1 1 0 1 0 -2 -4 -2 1 P 1 -1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6 0 0 1 0 -1 -6 -5 0 Q 0 1 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4 1 -1 -1 -2 -5 -4 3 R -2 -1 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6 0 -1 -2 2 -4 0 S 1 0 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2 1 -1 -2 -3 0 T 1 0 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3 0 -5 -3 -1 V 0 -2 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4 -6 -2 -2 W -6 -5 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17 0 -6 Y -3 -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10 -4 Z 0 2 -5 3 3 -5 -1 2 -2 0 -3 -2 1 0 3 0 0 -1 -2 -6 -4 3 • For DNA we usually use identity matrices; For proteins more sensitive matrices, derived empirically, are used; A T C G 1 0 0 0 A T C G 0 1 0 0 0 0 1 0 0 0 0 1 \course\eleg667-01-f\Topic-2a.ppt

  23. 0 -1 -2 -3 -4 -5 -6 -7 -1 0 2 1 0 -1 -2 -3 -2 -1 1 2 1 1 0 -1 -3 -2 0 1 2 2 1 0 -4 -3 -1 1 1 2 3 2 -5 -4 -2 0 1 1 2 4 -6 -4 + (-1) -3 + ( 0 ) -1 + (-1) -5 + (-1) -4 + (+1) -2 + (-1) -2 + (-1) -1 + ( 0 ) 1 + (-1) -7 + (-1) -6 + ( 0 ) -4 + (-1) -3 + (-1) -2 + ( 0 ) 0 + (-1) -1 + (-1) 0 + (+1) -1 + (-1) -6 + (-1) -5 + ( 0 ) -3 + (-1) -3 -2 -1 -5 0 1 -4 Sequence Comparison revisited A T G C A G T int S[m,n] m = length(X) n = length(Y) for i = 0 to m do S[i,0] = i . g for j = 0 to n do S[j,0] = j . g for i = 1 to m do for j = 1 to n do S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g ) return S[m,n] -1 -2 -3 -4 -5 A T A A G T 1 0 Similarity Matrix \course\eleg667-01-f\Topic-2a.ppt

  24. What To Do Next? Answer: Finding alignments But, How? \course\eleg667-01-f\Topic-2a.ppt

  25. -1 + (-1) 0 + (+1) -1 + (-1) 0 + (-1) 1 + (+1) 0 + (-1) 2 + (-1) 3 + (+1) 2 + (-1) 1 + (-1) 2 + ( 0 ) 1 + (-1) 1 + (-1) 2 + (+1) 2 + (-1) 1 + (-1) 1 + (+1) 2 + (-1) 0 + (-1) 1 + ( 0 ) 2 + (-1) -1 + (-1) 0 + ( 0 ) 2 + (-1) 0 -1 -2 -3 -4 -5 -6 -7 1 4 3 2 1 1 2 2 1 0 -1 -2 -3 -4 -5 -1 0 2 1 0 -1 -2 -3 -2 -1 1 2 1 1 0 -1 -3 -2 0 1 2 2 1 0 -4 -3 -1 1 1 2 3 2 -5 -4 -2 0 1 1 2 4 -6 T G C A G T T - A A G T G C A G T - A A G T A T G C A G T A T - A A G T A G T A G T C A G T A A G T G T G T T T C A G T - A G T G C A G T A - A G T A T G C A G T A T A - A G T T G C A G T T A - A G T Finding the Alignment(s) A T G C A G T A T A A G T Similarity Matrix Global Alignments \course\eleg667-01-f\Topic-2a.ppt

  26. How to Break a Tie? • Should one report all? • Or, report only one? \course\eleg667-01-f\Topic-2a.ppt

  27. Advantage of DP Alignment Algorithms • Build up the solution by determining all similarities between arbitrary prefixes of the two sequences • Starting with the shorter prefixes and use previously computed results to solve for larger prefixes \course\eleg667-01-f\Topic-2a.ppt

  28. The Complexity of the DP Alignment Algorithm? • Construction of the similarity matrix: O (m • n) • Find an optimal alignment O (m + n) \course\eleg667-01-f\Topic-2a.ppt

  29. Global versus Local Alignments • A global alignment attempts to match all of one sequence against all of another LGPSTKQFGKGSSSRIWDN | |||| | | LNQIERSFGKGAIMRLGDA • A local alignment attempts to match subsequences of the two sequences; -------FGKG-------- |||| -------FGKG-------- \course\eleg667-01-f\Topic-2a.ppt

  30. How to Compute Local Alignment? \course\eleg667-01-f\Topic-2a.ppt

  31. Similarity Matrix Computation: a[i,j-1]+g a[i,j]= max a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 0 0 0 0 0 .. 0 0 0 .. a[i,0]= 0 ; for i= 0…m a[0,j]= 0 ; for j= 0…n Applying DP to Local Alignment Don’t penalize gaps on left and right ends! If the best alignment up to some point has a negative score, it’s better to start a new one, rather than extend the old one. \course\eleg667-01-f\Topic-2a.ppt

  32. Criteria of Finding a Local Alignment • Find the entries with maximum values in the simularity matrix • For each of such entries, construct an local alignment • See next example • We may also be interested in near-optimal alignments \course\eleg667-01-f\Topic-2a.ppt

  33. 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 2 1 0 0 1 1 0 1 1 2 1 1 0 1 0 1 1 1 2 2 1 0 0 0 0 2 1 2 3 2 0 0 1 1 2 1 2 4 0 A T G C A A G T A T G C A G T A T A - A G T A T G C A G T A T - A A G T Applying DP to Local Alignment A T G C A G T Similarity Matrix Computation: a[i,j-1]+g a[i,j]= max a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 A T A A G T Similarity Matrix \course\eleg667-01-f\Topic-2a.ppt

  34. 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 a[i,j-1]+g a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 0 0 0 2 0 0 0 2 0 0 0 0 1 0 0 3 1 0 0 a[i,j]= max 1 0 1 0 0 0 1 1 2 0 1 0 0 0 0 0 1 0 0 2 3 1 2 1 0 0 0 1 0 0 1 3 1 2 3 1 T G A T G G A G G T A G G T G A T - G G A G G T G A T A G G 0 + (-2) 0 + (+1) 0 + (-2) 0 0 + (-2) 0 + (-1) 0 + (-2) 0 T G A T G G A G G T G A T A G T G A T G G A G G T G A T 1 0 Local Alignment using DP T G A T G G A G G T 0 1 G A T A G G g = -2 A T C G -1 1 -1 -1 A T C G -1 -1 1 -1 -1 -1 -1 1 -1 -1 -1 1 \course\eleg667-01-f\Topic-2a.ppt

  35. How to Break a Tie? • Should one report all? • Or, report only one? \course\eleg667-01-f\Topic-2a.ppt

  36. Extension to the Basic DP Method • Improving space complexity • Introduce general gap functions • That is, the probability of a sequence of consecutive spaces is more likely than individual spaces • Affine gap functions: w(k) = h + gk \course\eleg667-01-f\Topic-2a.ppt

More Related