Dynamic Programming Prof. Navneet Goyal Department of Computer Science & Information Systems BITS, Pilani
Sequence Comparison
code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine
Molecular sequence data are at the heart of Computational Biology • DNA sequences • RNA sequences • Protein sequences We can think of these sequences as strings of letters • DNA & RNA: alphabets of 4 letters, (A,T,C,G) for DNA and (A,U,C,G) for RNA • Protein: alphabet of 20 letters
Dynamic Programming • How to get the optimal alignment? • Brute force: introduce one or more gaps at every position and compute an alignment score for each candidate • Computational overhead grows exponentially with sequence length • Solution: Dynamic Programming Algorithm
Dynamic Programming • Good News: The method is guaranteed to give a global optimum for the chosen parameters (the scoring matrix and gap penalty), with no approximation • Bad News: Many alignments may give the same optimal score, and none of them is guaranteed to be the biologically correct alignment
Dynamic Programming • Comparing the α- & β-chains of chicken haemoglobin, Fitch & Smith found 17 optimal alignments, only one of which is biologically correct (1317 alignments were within 5% of the optimal score) • More bad news: The time required to align two sequences of lengths n & m is proportional to n × m • This makes DP unsuitable for searching a sequence DB for a match to a probe sequence
Dynamic Programming • Recursive Approach • Intermediate results are saved in a matrix to be used later • DP is processor and RAM intensive • Still computationally feasible, unlike exhaustive enumeration
Dynamic Programming • General optimization method • Proposed by Richard Bellman (a Princeton PhD, then at the RAND Corporation) in the 1950s • Extensively used in sequence alignment and other computational problems • Applied to biological seqs. by Needleman and Wunsch (1970)
Dynamic Programming • The original problem is broken into smaller subproblems, which are then solved • Pieces of the larger problem have a sequential dependency • The 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on…
Dynamic Programming • First solve all the subproblems • Store each intermediate solution in a table along with a score • Uses an (m+1) × (n+1) matrix of scores, where m and n are the lengths of the seqs. being aligned (the extra row and column represent gaps) • Can be used for • Local Alignment (Smith-Waterman Algorithm) • Global Alignment (Needleman-Wunsch Algorithm)
Dynamic Programming New best alignment = previous best + local best [Figure: Sequence A against Sequence B, with the best previous alignment marked]
Example Greater simplification is possible by systematically re-subdividing the problem • The best path from start to finish is the best path from start to A followed by the best path from A to finish [Figure: paths from START through A to FINISH] • HINT: The choice of the best path from A to finish is independent of the choice of the path from start to A • 6 paths from start to A • 6 paths from A to finish • Total paths from start to finish = 36 • Do we need to check all 36 paths to find the optimal path? NO • Not more than 12 paths need to be examined
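The path argument can be checked numerically. The sketch below uses made-up costs for the six paths on each leg (the slide gives no numbers) and assumes that a lower cost is better. It compares checking all 36 complete paths with checking each leg once.

```python
from itertools import product

# Hypothetical costs, invented for illustration: the slide gives no numbers.
start_to_a = [7, 3, 9, 4, 8, 6]     # six paths from START to A
a_to_finish = [5, 2, 8, 6, 4, 9]    # six paths from A to FINISH

# Brute force: examine every one of the 6 * 6 = 36 complete paths.
brute_force_best = min(x + y for x, y in product(start_to_a, a_to_finish))

# Decomposition: the best complete path is the best first leg followed by
# the best second leg, so only 6 + 6 = 12 paths are examined.
decomposed_best = min(start_to_a) + min(a_to_finish)
paths_examined = len(start_to_a) + len(a_to_finish)

print(brute_force_best, decomposed_best, paths_examined)  # 5 5 12
```

Both approaches agree on the best total cost because the choice on the second leg does not depend on the choice made on the first.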
Optimal Alignment of Sequences • Suppose X & Z are two seqs with lengths m and n • If gaps are allowed, then the length of each sequence could be m+n • 2^{m+n} subsequences with spaces for the sequence X and the same for Z • 4^{m+n} comparisons using the brute force method
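To get a feel for these numbers, the sketch below plugs in two illustrative lengths, m = 11 and n = 7 (an assumption made here; any lengths would do). It compares the brute-force estimates with the size of a score table that has one cell per pair of prefixes, including the empty prefix. The variable names are made up for the example.

```python
# Illustrative lengths, assumed for this example.
m, n = 11, 7

gapped_variants = 2 ** (m + n)          # gapped versions of one sequence, per the estimate above
brute_force_comparisons = 4 ** (m + n)  # every gapped X against every gapped Z
table_cells = (m + 1) * (n + 1)         # one cell per pair of prefixes

print(gapped_variants)          # 262144
print(brute_force_comparisons)  # 68719476736
print(table_cells)              # 96
```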
Dynamic Programming • Sequence alignment has an optimal-substructure property • As a result, DP can account for all possible alignments without enumerating them • DP algorithms solve optimization problems by dividing the problem into overlapping subproblems. • Each subproblem is solved only once, and the answer is stored in a table, thus avoiding the work of recomputing the solution.
Dynamic Programming • With sequence alignment, the subproblems can be thought of as the alignment of the “prefixes” of the two sequences to a certain point. • DP matrix is computed. • The optimal alignment score for any particular point in the matrix is built upon the optimal alignment that has been computed to that point.
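The prefix view can be sketched as a memoized recursion. This is a minimal illustration, not the deck's own code. The function and parameter names are made up, and the default scores (+5 match, -3 mismatch, -4 gap) are borrowed from the deck's scoring scheme. Each call `solve(i, j)` is the subproblem "align the first i letters of one sequence with the first j letters of the other", and the cache ensures it is computed once.

```python
from functools import lru_cache

def best_score(a: str, b: str, match: int = 5, mismatch: int = -3, gap: int = -4):
    """Best global alignment score of a and b via memoized recursion on prefixes."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # solve(i, j) = best score for aligning a[:i] with b[:j].
        if i == 0:
            return j * gap          # empty prefix of a: b[:j] is aligned to gaps
        if j == 0:
            return i * gap          # empty prefix of b: a[:i] is aligned to gaps
        pair = match if a[i - 1] == b[j - 1] else mismatch
        return max(
            solve(i - 1, j - 1) + pair,  # align a[i-1] with b[j-1]
            solve(i - 1, j) + gap,       # a[i-1] against a gap
            solve(i, j - 1) + gap,       # b[j-1] against a gap
        )

    score = solve(len(a), len(b))
    # Cache misses count the distinct subproblems that were actually solved.
    return score, solve.cache_info().misses

score, subproblems = best_score("GAT", "GT")
print(score, subproblems)  # 6 12
```

For "GAT" and "GT" there are 4 × 3 = 12 prefix pairs, and the cache records exactly 12 computed subproblems. Plain recursion is only practical for short inputs because of Python's recursion limit, which is one reason the table is usually filled iteratively.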
Dynamic Programming Steps Involved • Initialization • Matrix Fill (scoring) • Traceback (alignment)
Dynamic Programming Consider the following 2 seqs: GAATTCAGTTA (length 11) GGATCGA (length 7) • Create an 8 × 12 matrix • Row 0 & Column 0 represent gaps • Rows 1-7 are labeled with the corresponding residues of the sequence GGATCGA, while columns 1-11 are labeled with the corresponding residues of the sequence GAATTCAGTTA
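The initialization step might look like the sketch below. The slide does not say what values go into row 0 and column 0. The code assumes the usual global-alignment convention that cell k of the gap row or column holds k times the gap penalty, with the penalty of -4 taken from the deck's scoring scheme. The variable names are invented for the example.

```python
seq_cols = "GAATTCAGTTA"   # labels columns 1-11
seq_rows = "GGATCGA"       # labels rows 1-7
gap = -4                   # gap penalty w (assumed to apply to row 0 and column 0)

n_rows, n_cols = len(seq_rows) + 1, len(seq_cols) + 1   # 8 x 12

# Start with an all-zero table, then fill the gap row and the gap column.
S = [[0] * n_cols for _ in range(n_rows)]
for j in range(n_cols):
    S[0][j] = j * gap      # first j residues of seq_cols aligned to gaps
for i in range(n_rows):
    S[i][0] = i * gap      # first i residues of seq_rows aligned to gaps

print(n_rows, n_cols)      # 8 12
print(S[0])
```

Some presentations initialize row 0 and column 0 to zeros instead, which leaves leading gaps unpenalized.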
Scoring Scheme • s(a_i, b_j) = +5 if a_i = b_j (match score) • s(a_i, b_j) = -3 if a_i ≠ b_j (mismatch score) • w = -4 (gap penalty)
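As a small sketch, the scoring scheme can be written as a function and used to score an alignment that has already been laid out. The helper `score_alignment` and the short test strings are invented for illustration. Only the three numeric values come from the slide.

```python
MATCH, MISMATCH, GAP = 5, -3, -4   # values from the scoring scheme

def s(a: str, b: str) -> int:
    """Substitution score for one pair of residues."""
    return MATCH if a == b else MISMATCH

def score_alignment(top: str, bottom: str) -> int:
    """Sum the column scores of two equal-length gapped strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        # A column containing a gap costs the gap penalty; otherwise use s().
        total += GAP if "-" in (x, y) else s(x, y)
    return total

print(s("G", "G"), s("G", "A"))          # 5 -3
print(score_alignment("GATT", "GA-C"))   # 5 + 5 - 4 - 3 = 3
```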
Global Alignment: Needleman-Wunsch Algorithm In global sequence alignment, an attempt is made to align the entirety of two different sequences, up to and including the ends of the sequences. Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment.
Global Alignment: Needleman-Wunsch Algorithm For each position, S_{i,j} is defined to be the maximum score at position (i,j), i.e.
S_{i,j} = MAXIMUM[ S_{i-1,j-1} + s(a_i, b_j)  (match/mismatch in the diagonal),
                   S_{i,j-1} + w              (gap in sequence #1),
                   S_{i-1,j} + w              (gap in sequence #2) ]
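The matrix-fill step can be sketched directly from this recurrence. The function name `nw_fill` and its parameters are hypothetical. The default scores follow the deck's scoring scheme, and row 0 and column 0 are assumed to hold multiples of the gap penalty.

```python
def nw_fill(cols: str, rows: str, match: int = 5, mismatch: int = -3, w: int = -4):
    """Return the filled (len(rows)+1) x (len(cols)+1) score matrix S."""
    S = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    # Assumed initialization: k leading gaps cost k * w.
    for j in range(1, len(cols) + 1):
        S[0][j] = j * w
    for i in range(1, len(rows) + 1):
        S[i][0] = i * w
    # Each cell takes the best of its three predecessors.
    for i in range(1, len(rows) + 1):
        for j in range(1, len(cols) + 1):
            pair = match if rows[i - 1] == cols[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # diagonal: match/mismatch
                S[i][j - 1] + w,         # horizontal move: gap
                S[i - 1][j] + w,         # vertical move: gap
            )
    return S

S = nw_fill("GAATTCAGTTA", "GGATCGA")
print(S[-1][-1])   # 11
```

The bottom-right cell holds the score of the best global alignment. Under these assumptions it is 11 for the two example sequences.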
The Alignment
G A A T T C A G T T A
G G A - T C - G - - A
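A traceback that recovers such an alignment might be sketched as follows. This is an illustration under stated assumptions, not the deck's own code. The function name is made up, row 0 and column 0 are assumed to hold multiples of the gap penalty, and ties are broken by preferring the diagonal move, then the horizontal one. Other tie-breaking rules can return different alignments with the same score.

```python
def needleman_wunsch(cols: str, rows: str, match: int = 5, mismatch: int = -3, w: int = -4):
    """Return (score, aligned_cols, aligned_rows) for one optimal global alignment."""
    n_rows, n_cols = len(rows) + 1, len(cols) + 1

    def s(i: int, j: int) -> int:
        return match if rows[i - 1] == cols[j - 1] else mismatch

    # Initialization and matrix fill.
    S = [[0] * n_cols for _ in range(n_rows)]
    for j in range(1, n_cols):
        S[0][j] = j * w
    for i in range(1, n_rows):
        S[i][0] = i * w
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            S[i][j] = max(S[i - 1][j - 1] + s(i, j), S[i][j - 1] + w, S[i - 1][j] + w)

    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = len(rows), len(cols)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(i, j):
            top.append(cols[j - 1])      # diagonal: pair the two residues
            bottom.append(rows[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and S[i][j] == S[i][j - 1] + w:
            top.append(cols[j - 1])      # horizontal: gap in the row sequence
            bottom.append("-")
            j -= 1
        else:
            top.append("-")              # vertical: gap in the column sequence
            bottom.append(rows[i - 1])
            i -= 1
    return S[len(rows)][len(cols)], "".join(reversed(top)), "".join(reversed(bottom))

score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)
print(top)
print(bottom)
```

With this tie-breaking order, the sketch returns a score of 11 with GAATTCAGTTA aligned against GGA-TC-G--A.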