360 likes | 460 Views
Dynamic Programming and Biological Sequence Comparison. Part I. Topic II – Biological Sequence Alignment and Database Search. Part I (Topic-2a): Dynamic programming and Sequence comparison Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment
E N D
Dynamic Programming and Biological Sequence Comparison Part I
Topic II – Biological Sequence Alignment and Database Search • Part I (Topic-2a): Dynamic programming and Sequence comparison • Part II (Topic-2b): Heuristic and Database Search (e.g. FAST, BLAST) sequence alignment • Part III (Topic-2c): Multiple sequence alignment \course\eleg667-01-f\Topic-2a.ppt
Outline • Concept of alignment • Two algorithm design techniques; • Dynamic Programming: Examples • Applying DP to Sequence Comparison; • The database search problem • Heuristic algorithms to database search \course\eleg667-01-f\Topic-2a.ppt
Alignment • The two sequences will have the same length (after possible insertions of spaces on either or both of them) • No space in one sequence can be aligned with a space in the other • Spaces can be inserted at the beginning or end of the sequences \course\eleg667-01-f\Topic-2a.ppt
Biological Sequence Alignment and Database Search • We have two sequences over the same alphabet, both about the same length (tens of thousands of characters) and the sequences are almost equal. The average frequency of these differences is low, say, one each hundred characters. We want to find the places where the differences occur. • We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to suffix of the other. \course\eleg667-01-f\Topic-2a.ppt
(cont’d) 3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity. 4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar. 5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others. \course\eleg667-01-f\Topic-2a.ppt
Breaking Problems Down: Two Related Algorithm Design Techniques • Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively and combine the partial solutions into a solution to the original problem. • Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved. \course\eleg667-01-f\Topic-2a.ppt
9 1 25 4 15 becomes 4 1 9 25 15 4 1 25 15 becomes becomes 1 4 15 25 15 1 4 25 Divide and Conquer – Example 1 Quick Sort \course\eleg667-01-f\Topic-2a.ppt
Divide and Conquer – Example 2 The Fibonacci numbers F1 = 1, F2 = 1 Fn = Fn-1 + Fn-2 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … Fib(n) { if (n < 2) return 1; else return Fib(n-1)+Fib(n-2); } \course\eleg667-01-f\Topic-2a.ppt
F(7) + F(6) + F(5) F(5) F(3) F(3) F(3) F(3) F(3) F(4) F(4) F(4) + + + + + + + + + + T(n) #Internal_nodes = #leaves - 1 F(1) F(1) F(1) F(1) F(1) F(2) F(2) F(2) F(2) F(2) F(2) F(2) F(2) but #leaves = Fn Fn 1.6n, n >> 1 Fn /Fn-1 1.6 n 1 2 3 4 5 6 7 8 9 10 11 … Fn 1 1 2 3 5 8 13 21 34 55 89 … T(n) = O(1.6n) Divide and Conquer – Example 2 F1 = 1, F2 = 1 Fn = Fn-1 + Fn-2 Exponential Time! \course\eleg667-01-f\Topic-2a.ppt
How to Compute Fib Function Using Dynamic Programming Method? \course\eleg667-01-f\Topic-2a.ppt
tab Extra memory to store intermediate values 1 1 2 3 5 8 13 21 34 55 89 …. Dynamic Programming–Example 1 Fib(n) { int tab[n]; tab[1] = 1; tab[2] = 1; for (j = 3; j <= n; j++) tab[j]=tab[j-1] + tab[j-2]; return tab[n]; } Start by solving the smallest problems Use the partial solutions to solve bigger and bigger problems Linear Time! Space-Time Tradeoff T(n) = O(n) \course\eleg667-01-f\Topic-2a.ppt
Sequence Comparison code full name A alanine C cysteine D aspartate E glutamate F phenylalanine G glycine H histidine I isoleucine K lysine L leucine M methionine N aspartamine P proline Q glutamine R arginine S serine T threonine V valine W tryptophan Y tyrosine Molecular sequence data are at the heart of Computational Biology • DNA sequences • RNA sequences • Protein sequences We can think of these sequences as strings of letters • DNA & RNA: alphabet of 4 letters (A,T,C,G) • Protein: alphabet of 20 letters \course\eleg667-01-f\Topic-2a.ppt
Sequence Comparison – (Cont.) Why compare sequences? • Find similar genes/proteins • Allows to predict function & structure • Locate common subsequences in genes/proteins • Identify common recurrent patterns • Locate sequences that might overlap • Help in sequence assembly \course\eleg667-01-f\Topic-2a.ppt
Sequence X = A T A A G T Sequence Y = A T G C A G T Total = 2 Sequence Comparison – (Cont.) To compare the sequences we need to quantify the similariy matches = 1 mismatches = 0 Score 1 1 0 0 0 0 0 \course\eleg667-01-f\Topic-2a.ppt
Sequence X = A T A A G T Sequence Y = A T G C A G T Total = 3 Sequence Comparison – (Cont.) Taking positions of the letters into account Sequence X = A T A A G T matches = 1 mismatches = 0 Score 0 0 0 0 1 1 1 \course\eleg667-01-f\Topic-2a.ppt
Sequence X = A T A A G T Sequence Y = A T G C A G T Score 1 1 0 –1 1 1 1 Total = 4 Sequence Comparison – (Cont.) How to take possible mutations into account? Sequence X = A T A - A G T matches = 1 mismatches = 0 matches = 1 mismatches = 0 gap = -1 \course\eleg667-01-f\Topic-2a.ppt
G - - A -1 -1 GA - - G - - A G A - G A - - - AG -2 -2 0 -2 -2 GA - - - A GA - A G - A - A - G - - - AG GA A - G - AG - GA A - - - G - A -G - G AG - - G AG - -3 -1 -1 -3 -3 -3 0 -3 -3 0 GA - - - - AG GA - - AG G - A - - A -G G - A - AG G - - A - AG - GA - A -G GA AG G - A AG - - GA - A - -G - GA A -G - G - A A -G - - GA AG - - - GA AG - - -4 -1 -4 -2 -4 -2 0 -2 -4 -4 -1 -4 -2 choose the best score, i.e max(-3, 0, -1) choose the best score, i.e max(-2, 0, -2) choose the best score, i.e max(-1, 0, -3) choose the best score, i.e max(-1, 0, -1) Applying DP to Sequence Comparison Sequence X = GA Sequence Y = AG scores T(n,n) = O(kn) Exponential Time! total score = 0 \course\eleg667-01-f\Topic-2a.ppt
0 -1 -2 -1 -2 GA - - G - - A G A - G A - - - AG GA - - - A GA - A G - A - A - G - - - AG GA A - G - AG - GA A - - - G - A -G - G AG - - G AG - GA - - - - AG GA - - AG G - A - - A -G G - A - AG G - - A - AG - GA - A -G GA AG G - A AG - - GA - A - -G - GA A -G - G - A A -G - - GA AG - - - GA AG - - Applying DP to Sequence Comparison G A Sequence X = GA Sequence Y = AG T(n,n) = O(n2) A G 0 0 Polynomial Time! 0 0 G - - A -1 -1 -2 -2 0 -2 -2 -3 -1 -1 -3 -3 -3 0 -3 -3 0 -4 -1 -4 -2 -4 -2 0 -2 -4 -4 -1 -4 -2 \course\eleg667-01-f\Topic-2a.ppt
Question: from 1 to 9 how many paths? G A 0 -1 -2 A -1 0 0 G -2 0 0 1 2 4 1 3 5 7 2 3 5 5 6 4 8 8 7 7 8 7 6 8 9 8 7 9 9 9 9 9 9 9 5 9 9 9 8 7 9 9 9 Questions • Queston: when DP comparison ends – how many possible distinct paths have been explored in total for this example? • Answer: Let us count Total = 13 \course\eleg667-01-f\Topic-2a.ppt
DP algorithm for Sequence Comparison Extra memory to store intermediate values sb[i,j] - Substitution Matrix int S[m,n] m = length(X) n = length(Y) for i = 0 to m do S[i,0] = i . g for j = 0 to n do S[j,0] = j . g for i = 1 to m do for j = 1 to n do S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g ) return S[m,n] A T C G 1 0 0 0 A T C G Start by solving the smallest problems 0 1 0 0 0 0 1 0 0 0 0 1 Use the partial solutions to solve bigger and bigger problems \course\eleg667-01-f\Topic-2a.ppt
The Substitution Matrix A B C D E F G H I K L M N P Q R S T V W Y Z A 2 0 -2 0 0 -4 1 -1 -1 -1 -2 -1 0 1 0 -2 1 1 0 -6 -3 0 B 0 2 -4 3 2 -5 0 1 -2 1 -3 -2 2 -1 1 -1 0 0 -2 -5 -3 2 C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4 0 -2 -2 -8 0 -5 D 0 3 -5 4 3 -6 1 1 -2 0 -4 -3 2 -1 2 -1 0 0 -2 -7 -4 3 E 0 2 -5 3 4 -5 0 1 -2 0 -3 -2 1 -1 2 -1 0 0 -2 -7 -4 3 F -4 -5 -4 -6 -5 9 -5 -2 1 -5 2 0 -4 -5 -5 -4 -3 -3 -1 0 7 -5 G 1 0 -3 1 0 -5 5 -2 -3 -2 -4 -3 0 -1 -1 -3 1 0 -1 -7 -5 -1 H -1 1 -3 1 1 -2 -2 6 -2 0 -2 -2 2 0 3 2 -1 -1 -2 -3 0 2 I -1 -2 -2 -2 -2 1 -3 -2 5 -2 2 2 -2 -2 -2 -2 -1 0 4 -5 -1 -2 K -1 1 -5 0 0 -5 -2 0 -2 5 -3 0 1 -1 1 3 0 0 -2 -3 -4 0 L -2 -3 -6 -4 -3 2 -4 -2 2 -3 6 4 -3 -3 -2 -3 -3 -2 2 -2 -1 -3 M -1 -2 -5 -3 -2 0 -3 -2 2 0 4 6 -2 -2 -1 0 -2 -1 2 -4 -2 -2 N 0 2 -4 2 1 -4 0 2 -2 1 -3 -2 2 -1 1 0 1 0 -2 -4 -2 1 P 1 -1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6 0 0 1 0 -1 -6 -5 0 Q 0 1 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4 1 -1 -1 -2 -5 -4 3 R -2 -1 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6 0 -1 -2 2 -4 0 S 1 0 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2 1 -1 -2 -3 0 T 1 0 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3 0 -5 -3 -1 V 0 -2 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4 -6 -2 -2 W -6 -5 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17 0 -6 Y -3 -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10 -4 Z 0 2 -5 3 3 -5 -1 2 -2 0 -3 -2 1 0 3 0 0 -1 -2 -6 -4 3 • For DNA we usually use identity matrices; For proteins more sensitive matrices, derived empirically, are used; A T C G 1 0 0 0 A T C G 0 1 0 0 0 0 1 0 0 0 0 1 \course\eleg667-01-f\Topic-2a.ppt
0 -1 -2 -3 -4 -5 -6 -7 -1 0 2 1 0 -1 -2 -3 -2 -1 1 2 1 1 0 -1 -3 -2 0 1 2 2 1 0 -4 -3 -1 1 1 2 3 2 -5 -4 -2 0 1 1 2 4 -6 -4 + (-1) -3 + ( 0 ) -1 + (-1) -5 + (-1) -4 + (+1) -2 + (-1) -2 + (-1) -1 + ( 0 ) 1 + (-1) -7 + (-1) -6 + ( 0 ) -4 + (-1) -3 + (-1) -2 + ( 0 ) 0 + (-1) -1 + (-1) 0 + (+1) -1 + (-1) -6 + (-1) -5 + ( 0 ) -3 + (-1) -3 -2 -1 -5 0 1 -4 Sequence Comparison revisited A T G C A G T int S[m,n] m = length(X) n = length(Y) for i = 0 to m do S[i,0] = i . g for j = 0 to n do S[j,0] = j . g for i = 1 to m do for j = 1 to n do S[i,j] = max( S[i-1,j]+g, S[i-1,j-1]+sb[i,j], S[i,j-1]+g ) return S[m,n] -1 -2 -3 -4 -5 A T A A G T 1 0 Similarity Matrix \course\eleg667-01-f\Topic-2a.ppt
What To Do Next? Answer: Finding alignments But, How? \course\eleg667-01-f\Topic-2a.ppt
-1 + (-1) 0 + (+1) -1 + (-1) 0 + (-1) 1 + (+1) 0 + (-1) 2 + (-1) 3 + (+1) 2 + (-1) 1 + (-1) 2 + ( 0 ) 1 + (-1) 1 + (-1) 2 + (+1) 2 + (-1) 1 + (-1) 1 + (+1) 2 + (-1) 0 + (-1) 1 + ( 0 ) 2 + (-1) -1 + (-1) 0 + ( 0 ) 2 + (-1) 0 -1 -2 -3 -4 -5 -6 -7 1 4 3 2 1 1 2 2 1 0 -1 -2 -3 -4 -5 -1 0 2 1 0 -1 -2 -3 -2 -1 1 2 1 1 0 -1 -3 -2 0 1 2 2 1 0 -4 -3 -1 1 1 2 3 2 -5 -4 -2 0 1 1 2 4 -6 T G C A G T T - A A G T G C A G T - A A G T A T G C A G T A T - A A G T A G T A G T C A G T A A G T G T G T T T C A G T - A G T G C A G T A - A G T A T G C A G T A T A - A G T T G C A G T T A - A G T Finding the Alignment(s) A T G C A G T A T A A G T Similarity Matrix Global Alignments \course\eleg667-01-f\Topic-2a.ppt
How to Break a Tie? • Should one report all? • Or, report only one? \course\eleg667-01-f\Topic-2a.ppt
Advantage of DP Alignment Algorithms • Build up the solution by determining all similarities between arbitrary prefixes of the two sequences • Starting with the shorter prefixes and use previously computed results to solve for larger prefixes \course\eleg667-01-f\Topic-2a.ppt
The Complexity of the DP Alignment Algorithm? • Construction of the similarity matrix: O (m • n) • Find an optimal alignment O (m + n) \course\eleg667-01-f\Topic-2a.ppt
Global versus Local Alignments • A global alignment attempts to match all of one sequence against all of another LGPSTKQFGKGSSSRIWDN | |||| | | LNQIERSFGKGAIMRLGDA • A local alignment attempts to match subsequences of the two sequences; -------FGKG-------- |||| -------FGKG-------- \course\eleg667-01-f\Topic-2a.ppt
How to Compute Local Alignment? \course\eleg667-01-f\Topic-2a.ppt
Similarity Matrix Computation: a[i,j-1]+g a[i,j]= max a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 0 0 0 0 0 .. 0 0 0 .. a[i,0]= 0 ; for i= 0…m a[0,j]= 0 ; for j= 0…n Applying DP to Local Alignment Don’t penalize gaps on left and right ends! If the best alignment up to some point has a negative score, it’s better to start a new one, rather than extend the old one. \course\eleg667-01-f\Topic-2a.ppt
Criteria of Finding a Local Alignment • Find the entries with maximum values in the simularity matrix • For each of such entries, construct an local alignment • See next example • We may also be interested in near-optimal alignments \course\eleg667-01-f\Topic-2a.ppt
0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 2 1 0 0 1 1 0 1 1 2 1 1 0 1 0 1 1 1 2 2 1 0 0 0 0 2 1 2 3 2 0 0 1 1 2 1 2 4 0 A T G C A A G T A T G C A G T A T A - A G T A T G C A G T A T - A A G T Applying DP to Local Alignment A T G C A G T Similarity Matrix Computation: a[i,j-1]+g a[i,j]= max a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 A T A A G T Similarity Matrix \course\eleg667-01-f\Topic-2a.ppt
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 a[i,j-1]+g a[i-1,j-1]+sb(i,j) a[i-1,j]+g 0 0 0 0 2 0 0 0 2 0 0 0 0 1 0 0 3 1 0 0 a[i,j]= max 1 0 1 0 0 0 1 1 2 0 1 0 0 0 0 0 1 0 0 2 3 1 2 1 0 0 0 1 0 0 1 3 1 2 3 1 T G A T G G A G G T A G G T G A T - G G A G G T G A T A G G 0 + (-2) 0 + (+1) 0 + (-2) 0 0 + (-2) 0 + (-1) 0 + (-2) 0 T G A T G G A G G T G A T A G T G A T G G A G G T G A T 1 0 Local Alignment using DP T G A T G G A G G T 0 1 G A T A G G g = -2 A T C G -1 1 -1 -1 A T C G -1 -1 1 -1 -1 -1 -1 1 -1 -1 -1 1 \course\eleg667-01-f\Topic-2a.ppt
How to Break a Tie? • Should one report all? • Or, report only one? \course\eleg667-01-f\Topic-2a.ppt
Extension to the Basic DP Method • Improving space complexity • Introduce general gap functions • That is, the probability of a sequence of consecutive spaces is more likely than individual spaces • Affine gap functions: w(k) = h + gk \course\eleg667-01-f\Topic-2a.ppt