Dynamic Programming and Biological Sequence Comparison

Dynamic Programming and Biological Sequence Comparison Part I

Topic II – Biological Sequence Alignment and Database Search • Part I (Topic-2a): Dynamic programming and Sequence comparison • Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment • Part III (Topic-2c): Multiple sequence alignment \course\eleg667-01-f\Topic-2a.ppt

Outline • Concept of alignment • Two algorithm design techniques; • Dynamic Programming: Examples • Applying DP to Sequence Comparison; • The database search problem • Heuristic algorithms to database search \course\eleg667-01-f\Topic-2a.ppt

Alignment • The two sequences will have the same length (after possible insertions of spaces on either or both of them) • No space in one sequence can be aligned with a space in the other • Spaces can be inserted at the beginning or end of the sequences \course\eleg667-01-f\Topic-2a.ppt

Biological Sequence Alignment and Database Search • We have two sequences over the same alphabet, both about the same length (tens of thousands of characters) and the sequences are almost equal. The average frequency of these differences is low, say, one each hundred characters. We want to find the places where the differences occur. • We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to suffix of the other. \course\eleg667-01-f\Topic-2a.ppt

(cont’d) 3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity. 4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar. 5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others. \course\eleg667-01-f\Topic-2a.ppt

Breaking Problems Down: Two Related Algorithm Design Techniques • Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem. • Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved. \course\eleg667-01-f\Topic-2a.ppt

9 1 25 4 15 becomes 4 1 9 25 15 4 1 25 15 becomes becomes 1 4 15 25 15 1 4 25 Divide and Conquer – Example 1 Quick Sort \course\eleg667-01-f\Topic-2a.ppt

Divide and Conquer – Example 2 The Fibonacci numbers F1 = 1, F2 = 1 Fn = Fn-1 + Fn-2 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … Fib(n) { if (n < 2) return 1; else return Fib(n-1)+Fib(n-2); } \course\eleg667-01-f\Topic-2a.ppt

F(7) + F(6) + F(5) F(5) F(3) F(3) F(3) F(3) F(3) F(4) F(4) F(4) + + + + + + + + + + T(n)  #Internal_nodes = #leaves - 1 F(1) F(1) F(1) F(1) F(1) F(2) F(2) F(2) F(2) F(2) F(2) F(2) F(2) but #leaves = Fn Fn 1.6n, n >> 1 Fn /Fn-1  1.6 n 1 2 3 4 5 6 7 8 9 10 11 … Fn 1 1 2 3 5 8 13 21 34 55 89 … T(n) = O(1.6n) Divide and Conquer – Example 2 F1 = 1, F2 = 1 Fn = Fn-1 + Fn-2 Exponential Time! \course\eleg667-01-f\Topic-2a.ppt

How to Compute Fib Function Using Dynamic Programming Method? \course\eleg667-01-f\Topic-2a.ppt

tab Extra memory to store intermediate values 1 1 2 3 5 8 13 21 34 55 89 …. Dynamic Programming–Example 1 Fib(n) { int tab[n]; tab[1] = 1; tab[2] = 1; for (j = 3; j <= n; j++) tab[j]=tab[j-1] + tab[j-2]; return tab[n]; } Start by solving the smallest problems Use the partial solutions to solve bigger and bigger problems Linear Time! Space-Time Tradeoff T(n) = O(n) \course\eleg667-01-f\Topic-2a.ppt

Sequence Comparison code full name A alanine C cysteine D aspartate E glutamate F phenylalanine G glycine H histidine I isoleucine K lysine L leucine M methionine N aspartamine P proline Q glutamine R arginine S serine T threonine V valine W tryptophan Y tyrosine Molecular sequence data are at the heart of Computational Biology • DNA sequences • RNA sequences • Protein sequences We can think of these sequences as strings of letters • DNA & RNA: alphabet of 4 letters (A,T,C,G) • Protein: alphabet of 20 letters \course\eleg667-01-f\Topic-2a.ppt

Sequence Comparison – (Cont.) Why compare sequences? • Find similar genes/proteins • Allows to predict function & structure • Locate common subsequences in genes/proteins • Identify common recurrent patterns • Locate sequences that might overlap • Help in sequence assembly \course\eleg667-01-f\Topic-2a.ppt

Sequence X = A T A A G T Sequence Y = A T G C A G T Total = 2 Sequence Comparison – (Cont.) To compare the sequences we need to quantify the similariy matches = 1 mismatches = 0 Score 1 1 0 0 0 0 0 \course\eleg667-01-f\Topic-2a.ppt

Sequence X = A T A A G T Sequence Y = A T G C A G T Total = 3 Sequence Comparison – (Cont.) Taking positions of the letters into account Sequence X = A T A A G T matches = 1 mismatches = 0 Score 0 0 0 0 1 1 1 \course\eleg667-01-f\Topic-2a.ppt

Sequence X = A T A A G T Sequence Y = A T G C A G T Score 1 1 0 –1 1 1 1 Total = 4 Sequence Comparison – (Cont.) How to take possible mutations into account? Sequence X = A T A - A G T matches = 1 mismatches = 0 matches = 1 mismatches = 0 gap = -1 \course\eleg667-01-f\Topic-2a.ppt

G - - A -1 -1 GA - - G - - A G A - G A - - - AG -2 -2 0 -2 -2 GA - - - A GA - A G - A - A - G - - - AG GA A - G - AG - GA A - - - G - A -G - G AG - - G AG - -3 -1 -1 -3 -3 -3 0 -3 -3 0 GA - - - - AG GA - - AG G - A - - A -G G - A - AG G - - A - AG - GA - A -G GA AG G - A AG - - GA - A - -G - GA A -G - G - A A -G - - GA AG - - - GA AG - - -4 -1 -4 -2 -4 -2 0 -2 -4 -4 -1 -4 -2 choose the best score, i.e max(-3, 0, -1) choose the best score, i.e max(-2, 0, -2) choose the best score, i.e max(-1, 0, -3) choose the best score, i.e max(-1, 0, -1) Applying DP to Sequence Comparison Sequence X = GA Sequence Y = AG scores T(n,n) = O(kn) Exponential Time! total score = 0 \course\eleg667-01-f\Topic-2a.ppt

0 -1 -2 -1 -2 GA - - G - - A G A - G A - - - AG GA - - - A GA - A G - A - A - G - - - AG GA A - G - AG - GA A - - - G - A -G - G AG - - G AG - GA - - - - AG GA - - AG G - A - - A -G G - A - AG G - - A - AG - GA - A -G GA AG G - A AG - - GA - A - -G - GA A -G - G - A A -G - - GA AG - - - GA AG - - Applying DP to Sequence Comparison G A Sequence X = GA Sequence Y = AG T(n,n) = O(n2) A G 0 0 Polynomial Time! 0 0 G - - A -1 -1 -2 -2 0 -2 -2 -3 -1 -1 -3 -3 -3 0 -3 -3 0 -4 -1 -4 -2 -4 -2 0 -2 -4 -4 -1 -4 -2 \course\eleg667-01-f\Topic-2a.ppt

Question: from 1 to 9 how many paths? G A 0 -1 -2 A -1 0 0 G -2 0 0 1 2 4 1 3 5 7 2 3 5 5 6 4 8 8 7 7 8 7 6 8 9 8 7 9 9 9 9 9 9 9 5 9 9 9 8 7 9 9 9 Questions • Queston: when DP comparison ends – how many possible distinct paths have been explored in total for this example? • Answer: Let us count Total = 13 \course\eleg667-01-f\Topic-2a.ppt

DP algorithm for Sequence Comparison Extra memory to store intermediate values sb[i,j] - Substitution Matrix int S[m,n] m = length(X) n = length(Y) for i = 0 to m do S[i,0] = i . g for j = 0 to n do S[j,0] = j . g for i = 1 to m do for j = 1 to n do S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g ) return S[m,n] A T C G 1 0 0 0 A T C G Start by solving the smallest problems 0 1 0 0 0 0 1 0 0 0 0 1 Use the partial solutions to solve bigger and bigger problems \course\eleg667-01-f\Topic-2a.ppt

The Substitution Matrix A B C D E F G H I K L M N P Q R S T V W Y Z A 2 0 -2 0 0 -4 1 -1 -1 -1 -2 -1 0 1 0 -2 1 1 0 -6 -3 0 B 0 2 -4 3 2 -5 0 1 -2 1 -3 -2 2 -1 1 -1 0 0 -2 -5 -3 2 C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4 0 -2 -2 -8 0 -5 D 0 3 -5 4 3 -6 1 1 -2 0 -4 -3 2 -1 2 -1 0 0 -2 -7 -4 3 E 0 2 -5 3 4 -5 0 1 -2 0 -3 -2 1 -1 2 -1 0 0 -2 -7 -4 3 F -4 -5 -4 -6 -5 9 -5 -2 1 -5 2 0 -4 -5 -5 -4 -3 -3 -1 0 7 -5 G 1 0 -3 1 0 -5 5 -2 -3 -2 -4 -3 0 -1 -1 -3 1 0 -1 -7 -5 -1 H -1 1 -3 1 1 -2 -2 6 -2 0 -2 -2 2 0 3 2 -1 -1 -2 -3 0 2 I -1 -2 -2 -2 -2 1 -3 -2 5 -2 2 2 -2 -2 -2 -2 -1 0 4 -5 -1 -2 K -1 1 -5 0 0 -5 -2 0 -2 5 -3 0 1 -1 1 3 0 0 -2 -3 -4 0 L -2 -3 -6 -4 -3 2 -4 -2 2 -3 6 4 -3 -3 -2 -3 -3 -2 2 -2 -1 -3 M -1 -2 -5 -3 -2 0 -3 -2 2 0 4 6 -2 -2 -1 0 -2 -1 2 -4 -2 -2 N 0 2 -4 2 1 -4 0 2 -2 1 -3 -2 2 -1 1 0 1 0 -2 -4 -2 1 P 1 -1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6 0 0 1 0 -1 -6 -5 0 Q 0 1 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4 1 -1 -1 -2 -5 -4 3 R -2 -1 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6 0 -1 -2 2 -4 0 S 1 0 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2 1 -1 -2 -3 0 T 1 0 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3 0 -5 -3 -1 V 0 -2 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4 -6 -2 -2 W -6 -5 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17 0 -6 Y -3 -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10 -4 Z 0 2 -5 3 3 -5 -1 2 -2 0 -3 -2 1 0 3 0 0 -1 -2 -6 -4 3 • For DNA we usually use identity matrices; For proteins more sensitive matrices, derived empirically, are used; A T C G 1 0 0 0 A T C G 0 1 0 0 0 0 1 0 0 0 0 1 \course\eleg667-01-f\Topic-2a.ppt

0 -1 -2 -3 -4 -5 -6 -7 -1 0 2 1 0 -1 -2 -3 -2 -1 1 2 1 1 0 -1 -3 -2 0 1 2 2 1 0 -4 -3 -1 1 1 2 3 2 -5 -4 -2 0 1 1 2 4 -6 -4 + (-1) -3 + ( 0 ) -1 + (-1) -5 + (-1) -4 + (+1) -2 + (-1) -2 + (-1) -1 + ( 0 ) 1 + (-1) -7 + (-1) -6 + ( 0 ) -4 + (-1) -3 + (-1) -2 + ( 0 ) 0 + (-1) -1 + (-1) 0 + (+1) -1 + (-1) -6 + (-1) -5 + ( 0 ) -3 + (-1) -3 -2 -1 -5 0 1 -4 Sequence Comparison revisited A T G C A G T int S[m,n] m = length(X) n = length(Y) for i = 0 to m do S[i,0] = i . g for j = 0 to n do S[j,0] = j . g for i = 1 to m do for j = 1 to n do S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g ) return S[m,n] -1 -2 -3 -4 -5 A T A A G T 1 0 Similarity Matrix \course\eleg667-01-f\Topic-2a.ppt

What To Do Next? Answer: Finding alignments But, How? \course\eleg667-01-f\Topic-2a.ppt

-1 + (-1) 0 + (+1) -1 + (-1) 0 + (-1) 1 + (+1) 0 + (-1) 2 + (-1) 3 + (+1) 2 + (-1) 1 + (-1) 2 + ( 0 ) 1 + (-1) 1 + (-1) 2 + (+1) 2 + (-1) 1 + (-1) 1 + (+1) 2 + (-1) 0 + (-1) 1 + ( 0 ) 2 + (-1) -1 + (-1) 0 + ( 0 ) 2 + (-1) 0 -1 -2 -3 -4 -5 -6 -7 1 4 3 2 1 1 2 2 1 0 -1 -2 -3 -4 -5 -1 0 2 1 0 -1 -2 -3 -2 -1 1 2 1 1 0 -1 -3 -2 0 1 2 2 1 0 -4 -3 -1 1 1 2 3 2 -5 -4 -2 0 1 1 2 4 -6 T G C A G T T - A A G T G C A G T - A A G T A T G C A G T A T - A A G T A G T A G T C A G T A A G T G T G T T T C A G T - A G T G C A G T A - A G T A T G C A G T A T A - A G T T G C A G T T A - A G T Finding the Alignment(s) A T G C A G T A T A A G T Similarity Matrix Global Alignments \course\eleg667-01-f\Topic-2a.ppt

How to Break a Tie? • Should one report all? • Or, report only one? \course\eleg667-01-f\Topic-2a.ppt

Advantage of DP Alignment Algorithms • Build up the solution by determining all similarities between arbitrary prefixes of the two sequences • Starting with the shorter prefixes and use previously computed results to solve for larger prefixes \course\eleg667-01-f\Topic-2a.ppt

The Complexity of the DP Alignment Algorithm? • Construction of the similarity matrix: O (m • n) • Find an optimal alignment O (m + n) \course\eleg667-01-f\Topic-2a.ppt

Global versus Local Alignments • A global alignment attempts to match all of one sequence against all of another LGPSTKQFGKGSSSRIWDN | |||| | | LNQIERSFGKGAIMRLGDA • A local alignment attempts to match subsequences of the two sequences; -------FGKG-------- |||| -------FGKG-------- \course\eleg667-01-f\Topic-2a.ppt

How to Compute Local Alignment? \course\eleg667-01-f\Topic-2a.ppt

Similarity Matrix Computation: a[i,j-1]+g a[i,j]= max a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 0 0 0 0 0 .. 0 0 0 .. a[i,0]= 0 ; for i= 0…m a[0,j]= 0 ; for j= 0…n Applying DP to Local Alignment Don’t penalize gaps on left and right ends! If the best alignment up to some point has a negative score, it’s better to start a new one, rather than extend the old one. \course\eleg667-01-f\Topic-2a.ppt

Criteria of Finding a Local Alignment • Find the entries with maximum values in the simularity matrix • For each of such entries, construct an local alignment • See next example • We may also be interested in near-optimal alignments \course\eleg667-01-f\Topic-2a.ppt

0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 2 1 0 0 1 1 0 1 1 2 1 1 0 1 0 1 1 1 2 2 1 0 0 0 0 2 1 2 3 2 0 0 1 1 2 1 2 4 0 A T G C A A G T A T G C A G T A T A - A G T A T G C A G T A T - A A G T Applying DP to Local Alignment A T G C A G T Similarity Matrix Computation: a[i,j-1]+g a[i,j]= max a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 A T A A G T Similarity Matrix \course\eleg667-01-f\Topic-2a.ppt

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 a[i,j-1]+g a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 0 0 0 2 0 0 0 2 0 0 0 0 1 0 0 3 1 0 0 a[i,j]= max 1 0 1 0 0 0 1 1 2 0 1 0 0 0 0 0 1 0 0 2 3 1 2 1 0 0 0 1 0 0 1 3 1 2 3 1 T G A T G G A G G T A G G T G A T - G G A G G T G A T A G G 0 + (-2) 0 + (+1) 0 + (-2) 0 0 + (-2) 0 + (-1) 0 + (-2) 0 T G A T G G A G G T G A T A G T G A T G G A G G T G A T 1 0 Local Alignment using DP T G A T G G A G G T 0 1 G A T A G G g = -2 A T C G -1 1 -1 -1 A T C G -1 -1 1 -1 -1 -1 -1 1 -1 -1 -1 1 \course\eleg667-01-f\Topic-2a.ppt

How to Break a Tie? • Should one report all? • Or, report only one? \course\eleg667-01-f\Topic-2a.ppt

Extension to the Basic DP Method • Improving space complexity • Introduce general gap functions • That is, the probability of a sequence of consecutive spaces is more likely than individual spaces • Affine gap functions: w(k) = h + gk \course\eleg667-01-f\Topic-2a.ppt

Dynamic Programming and Biological Sequence Comparison

Dynamic Programming and Biological Sequence Comparison

Presentation Transcript

Dynamic Programming: Sequence alignment

Sequence Alignments and Dynamic Programming

Sequence comparison: More dynamic programming

Sequence Comparison

Sequence Comparison

Sequence comparison

Biological Sequence Comparison and Alignment

Sequence Alignments and Dynamic Programming

Biological Sequence Comparison / Database Homology Searching

Sequence Comparison

Biological Sequence Comparison / Database Homology Searching

Sequence Comparison

Dynamic Programming for Sequence alignment

Sequence Alignment and Dynamic Programming

Improved tools for biological sequence comparison

Global Sequence Alignment by Dynamic Programming

Sequence comparison: Dynamic programming

Sequence Alignment by Dynamic Programming

Dynamic Programming: Sequence alignment

Sequence comparison: More dynamic programming

Sequence comparison: Dynamic programming