Dynamic Programming for Sequence alignment

Neha Jain, Lecturer, School of Biotechnology, Devi Ahilya University, Indore

Sequence alignment • Sequence alignment is the procedure of comparing two sequences (pairwise alignment) or more than two sequences (multiple sequence alignment) by searching for a series of individual characters or patterns that are in the same order in the sequences. • There are two types of alignment: local and global. • In global alignment, an attempt is made to align the entire sequence. Two sequences that have approximately the same length and are quite similar are suitable for global alignment. • Local alignment concentrates on finding stretches of sequence with a high density of matches.

Interpretation of sequence alignment • Sequence alignment is useful for discovering structural, functional and evolutionary information. • Sequences that are very much alike may have similar secondary and 3D structure, similar function and likely a common ancestral sequence. It is extremely unlikely that such sequences acquired their similarity by chance. • Large-scale genome studies have revealed horizontal transfer of genes and other sequences between species, which may cause similarity between some sequences in very distant species.

Methods of sequence alignment • Dot matrix analysis: Sequence A is written across the top of the page and sequence B down the side. Starting from the first character of B, one moves across the first row and places a dot in every column where the character in A is the same. The process is continued row by row until all possible comparisons between the two sequences are made. Any region of similarity is revealed by a diagonal row of dots. • The dynamic programming (DP) algorithm: The method compares every pair of characters in the two sequences and generates an alignment that is optimal under the chosen scoring scheme. • Word or k-tuple methods: BLAST is the best-known example of a k-tuple method.
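The dot matrix procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `dot_matrix` is hypothetical, the sequences are the ones used later in the worked example, and `*` stands for a dot while `.` marks an empty cell.

```python
def dot_matrix(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per character of seq_b.

    seq_a runs across the top; '*' marks a cell where the two
    characters are identical, '.' marks every other cell.
    """
    return [
        "".join("*" if a == b else "." for a in seq_a)
        for b in seq_b
    ]


rows = dot_matrix("TACT", "AATC")
print("  TACT")
for b, row in zip("AATC", rows):
    print(b, row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences match without gaps.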

Pairwise Sequence Alignment The aim: given two sequences and a scoring system, find the best alignment. Points to remember: 1) All possible pairings should be considered 2) Take the best score found 3) There may be more than one best alignment

Finding the best alignment is hard! • How do we get the optimal alignment? • The number of possible alignments is large. • If both sequences have the same length, there is only one possible complete alignment with no gaps. • The problem becomes more complicated when gaps are allowed. • It is not practical to go over all alignments one by one. • Solution: Dynamic Programming Algorithm
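To get a feel for how quickly the number of alignments grows, the sketch below counts them under one common assumption: every path through the alignment matrix built from diagonal, horizontal and vertical moves is counted as a distinct alignment. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count gapped global alignments of sequences of lengths m and n.

    Assumption: each column of an alignment is either a residue pair,
    a gap in the first sequence, or a gap in the second sequence.
    """
    if m == 0 or n == 0:
        return 1  # only gaps remain on one side
    return (
        count_alignments(m - 1, n - 1)  # residue aligned with residue
        + count_alignments(m - 1, n)    # residue aligned with a gap
        + count_alignments(m, n - 1)    # gap aligned with a residue
    )


print(count_alignments(4, 4))    # 321 for two sequences of length 4
print(count_alignments(20, 20))  # already far too many to enumerate
```

Even for two sequences of length 4 there are hundreds of candidates, which is why the alignments are not enumerated directly.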

Dynamic Programming • General optimization method • Proposed by Richard Bellman in the 1950s, while he was working at the RAND Corporation. The word dynamic was chosen by Bellman to capture the time-varying aspect of the problems, and because it sounded impressive. The word programming referred to the use of the method to find an optimal program • Extensively used in sequence alignment and other computational problems • Applied to biological sequences by Needleman and Wunsch

Dynamic Programming • Original problem is broken into smaller sub problems and then solved • Pieces of larger problem have a sequential dependency • 4th piece can be solved using solution of the 3rd piece, the 3rd piece can be solved by using solution of the 2nd piece and so on…

Dynamic Programming • First solve all the subproblems • Store each intermediate solution in a table along with a score • Uses an (m+1) x (n+1) matrix of scores, where m and n are the lengths of the sequences being aligned and the extra row and column stand for the empty prefix. • Can be used for • Local Alignment (Smith-Waterman Algorithm) • Global Alignment (Needleman-Wunsch Algorithm)

Formal description of dynamic programming algorithm • Figure: the score $S_{i,j}$ can be reached by a diagonal move, $S_{i-1,j-1} + s(a_i, b_j)$, by a move within column $j$, $S_{i-x,j} - w_x$, or by a move within row $i$, $S_{i,j-y} - w_y$, where $w_x$ and $w_y$ are the penalties for gaps of length $x$ and $y$. • This diagram indicates the moves that are possible to reach a certain position (i,j), starting from the previous row and column at position (i-1, j-1) or from any position in the same row or column • A diagonal move carries no gap penalty; a move from any other position in column j or row i carries a gap penalty that depends on the size of the gap

Dynamic Programming • Sequence alignment has an optimal-substructure property • As a result, DP makes it feasible to consider all possible alignments without enumerating them • DP algorithms solve optimization problems by dividing the problem into overlapping subproblems. • Each subproblem is then solved only once, and the answer is stored in a table, thus avoiding the work of recomputing the solution.

Dynamic Programming • With sequence alignment, the subproblems can be thought of as the alignment of the “prefixes” of the two sequences to a certain point. • DP matrix is computed. • The optimal alignment score for any particular point in the matrix is built upon the optimal alignment that has been computed to that point.

Dynamic Programming • Advantage: The method is guaranteed to give a global optimum for the chosen parameters (the scoring matrix and gap penalty), with no approximation • A disadvantage: Many alignments may give the same optimal score, and none of them is guaranteed to correspond to the biologically correct alignment

Dynamic Programming • In a comparison of the α- and β-chains of chicken hemoglobin, Fitch & Smith found 17 optimal alignments, only one of which was biologically correct (1317 alignments were within 5% of the optimal score) • More bad news: The time required to align two sequences of lengths n and m is proportional to n x m. • This makes DP too slow for searching a large sequence database for a match to a probe sequence

Dynamic Programming Steps Involved • Initialization • Matrix Fill (scoring) • Traceback (alignment)

Gap Penalties • Gaps are due to insertion or deletion mutations in the genes. • Penalties are given for the gaps. • Through empirical studies of globular proteins, a set of penalty values has been developed that appears to suit most alignment purposes. • These values are normally implemented as defaults in most alignment programs.

Gap Penalties: Caution • Penalty too low: gaps become numerous, and even unrelated pairs will be aligned. • Penalty too high: it becomes difficult to align even related sequences. • Another factor to consider is the cost difference between opening a gap and extending an existing gap. It is known that it is easier to extend a gap that has already been started. Thus, gap opening should have a much higher penalty than gap extension. • This is based on the rationale that if insertions and deletions ever occur, several adjacent residues are likely to have been inserted or deleted together. • Affine gap penalties: the gap opening penalty is higher than the gap extension penalty. • Linear gap penalty (described on some slides as constant): the gap opening and gap extension penalties are the same, so every gap position costs the same
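The difference between the two gap schemes can be made concrete with a small sketch. The function names and the numeric defaults below are illustrative assumptions, not values from the slides, and the affine formula used here (opening cost for the first position, extension cost for each further position) is one of several conventions in use.

```python
def linear_gap(length: int, per_position: float = 2.0) -> float:
    """Same penalty for every gap position (opening == extension)."""
    return per_position * length


def affine_gap(length: int, gap_open: float = 10.0,
               gap_extend: float = 0.5) -> float:
    """Opening penalty for the first position, a smaller extension
    penalty for each additional position (assumed convention)."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


for k in (1, 2, 5):
    print(k, linear_gap(k), affine_gap(k))
```

Under the affine scheme a single gap of length 5 costs far less than five separate gaps of length 1, which reflects the rationale that adjacent residues tend to be inserted or deleted together.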

Global Alignment: Needleman-Wunsch Algorithm In global sequence alignment, an attempt to align the entirety of two different sequences is made, up to and including the ends of the sequence. Needleman and Wunsch (1970) were among the first to describe a dynamic programming algorithm for global sequence alignment.

Example: • Two sequences: TACT, AATC • Scoring system: • Match: 3 • Mismatch: -1 • Gap: -2

Initialize entry (0,0) = 0 • Fill the matrix from top left to bottom right • The score in each entry (i,j) is calculated from the values of its three neighbouring entries (left, above and diagonal) • The global alignment score is the value in the bottom right cell • There may be more than one optimal alignment

Construct a matrix: one sequence (TACT) at the top, the other sequence (AATC) at the left • Entry (i,j): • i for column, j for row • alignment of the first i letters of the top sequence • with the first j letters of the left sequence

Initialization: entry (0,0) = 0 Fill the matrix from top left to bottom right

entry (1,0) = entry(0,0) + gap score = 0 + (-2) = -2 • Alignment: T over - • Horizontal line = gap in the left sequence

entry (2,0) = entry(1,0) + gap score = -2 + (-2) = -4 • Alignment: TA over --

entry (3,0) = entry(2,0) + gap score = -4 + (-2) = -6 • Alignment: TAC over ---

entry (4,0) = entry(3,0) + gap score = -6 + (-2) = -8 • Alignment: TACT over ----

First column • Alignment: ---- over AATC • Vertical line = gap in the top sequence
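The initialization step walked through above can be sketched as follows. The helper name `init_matrix` is hypothetical. Note that the slides write entry (column, row), whereas the Python list is indexed as `S[row][column]`.

```python
def init_matrix(top: str, left: str, gap: int = -2) -> list[list[int]]:
    """Create the score matrix and fill the first row and column.

    Entry (i, 0) aligns the first i letters of `top` with gaps only;
    entry (0, j) does the same for the first j letters of `left`.
    """
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i] = S[0][i - 1] + gap   # horizontal move
    for j in range(1, rows):
        S[j][0] = S[j - 1][0] + gap   # vertical move
    return S


S = init_matrix("TACT", "AATC")
print(S[0])                    # first row:    [0, -2, -4, -6, -8]
print([row[0] for row in S])   # first column: [0, -2, -4, -6, -8]
```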

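Global Alignment: Needleman-Wunsch Algorithm For each position, $S_{i,j}$ is defined to be the maximum score at position $(i,j)$, i.e. $S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i, b_j),\; S_{i,j-1} + w,\; S_{i-1,j} + w \,\}$, where the first term is a match/mismatch on the diagonal, the second is a gap in sequence #1, the third is a gap in sequence #2, and $w$ is the gap score (a negative number, -2 in the example)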
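A minimal sketch of the matrix fill defined by this recurrence, using the example's scoring system (match 3, mismatch -1, gap -2). The function name `nw_fill` is hypothetical, and the matrix is stored as `S[row][column]`, so the slide's entry (i,j) is `S[j][i]`.

```python
def nw_fill(top: str, left: str, match: int = 3,
            mismatch: int = -1, gap: int = -2) -> list[list[int]]:
    """Fill the Needleman-Wunsch score matrix for two sequences."""
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i] = S[0][i - 1] + gap
    for j in range(1, rows):
        S[j][0] = S[j - 1][0] + gap
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(
                S[j - 1][i - 1] + s,   # diagonal: match or mismatch
                S[j - 1][i] + gap,     # vertical: gap in the top sequence
                S[j][i - 1] + gap,     # horizontal: gap in the left sequence
            )
    return S


S = nw_fill("TACT", "AATC")
for row in S:
    print(row)
print("global alignment score:", S[-1][-1])  # 1
```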

Three options for entry (1,1)

First option: Entry(0,0) + mismatch score = 0 + (-1) = -1 • Alignment: T over A

Second option: Entry(1,0) + gap score = -2 + (-2) = -4 • Alignment: T- over -A

Third option: Entry(0,1) + gap score = -2 + (-2) = -4 • Alignment: -T over A-

Choosing the option with the maximal score: -1 • Alignment: T over A

First option for entry (2,1): Entry(1,0) + match score = -2 + 3 = +1 • Alignment: TA over -A

Second option: Entry(2,0) + gap score = -4 + (-2) = -6 • Alignment: TA- over --A

Third option: Entry(1,1) + gap score = -1 + (-2) = -3 • Alignment: TA over A-

Choosing the option with the maximal score: +1 • Alignment: TA over -A
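The two worked cells can be checked with a small helper that lists the three candidate scores for one entry. The name `cell_options` and its argument names are hypothetical: `above` is the entry in the same column one row up, `left_cell` is the entry in the same row one column to the left.

```python
def cell_options(diag: int, above: int, left_cell: int, a: str, b: str,
                 match: int = 3, mismatch: int = -1, gap: int = -2) -> dict:
    """Return the three candidate scores for one matrix entry."""
    s = match if a == b else mismatch
    return {
        "diagonal": diag + s,            # align a with b
        "from_above": above + gap,       # gap in the top sequence
        "from_left": left_cell + gap,    # gap in the left sequence
    }


# Entry (1,1): T against A, neighbours (0,0)=0, (1,0)=-2, (0,1)=-2
opts_11 = cell_options(diag=0, above=-2, left_cell=-2, a="T", b="A")
# Entry (2,1): A against A, neighbours (1,0)=-2, (2,0)=-4, (1,1)=-1
opts_21 = cell_options(diag=-2, above=-4, left_cell=-1, a="A", b="A")

print(opts_11, max(opts_11.values()))  # best is -1 (diagonal)
print(opts_21, max(opts_21.values()))  # best is 1 (diagonal)
```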

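Entry (4,1) • Alignment: TACT over -A--

Entry (1,2), two options tie • Alignments: T- over AA, and -T over AA

Entry (3,2), two options tie • Alignments: TAC over -AA, and TAC over AA-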

Three possible optimal alignments (each scores 1)

TACT- over -AATC

TA-CT over AATC-

TACT- over AA-TC
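The traceback step that produces these alignments can be sketched as below. This is an illustrative implementation, not taken from the slides: the function name `nw_all_alignments` is hypothetical, and it follows every move that reproduces a cell's score, so that all co-optimal alignments are returned rather than just one.

```python
def nw_all_alignments(top: str, left: str, match: int = 3,
                      mismatch: int = -1, gap: int = -2):
    """Return the optimal global score and all optimal alignments."""
    cols, rows = len(top) + 1, len(left) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i] = i * gap
    for j in range(1, rows):
        S[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            S[j][i] = max(S[j - 1][i - 1] + s,
                          S[j - 1][i] + gap,
                          S[j][i - 1] + gap)

    def walk(i: int, j: int):
        """Yield (top_row, left_row) pairs for every optimal path to (i, j)."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            s = match if top[i - 1] == left[j - 1] else mismatch
            if S[j][i] == S[j - 1][i - 1] + s:        # diagonal move
                for a, b in walk(i - 1, j - 1):
                    yield a + top[i - 1], b + left[j - 1]
        if i > 0 and S[j][i] == S[j][i - 1] + gap:    # horizontal move
            for a, b in walk(i - 1, j):
                yield a + top[i - 1], b + "-"
        if j > 0 and S[j][i] == S[j - 1][i] + gap:    # vertical move
            for a, b in walk(i, j - 1):
                yield a + "-", b + left[j - 1]

    return S[-1][-1], sorted(walk(cols - 1, rows - 1))


score, alignments = nw_all_alignments("TACT", "AATC")
print("score:", score)
for a, b in alignments:
    print(a)
    print(b)
    print()
```

For long sequences the number of co-optimal alignments can be very large, so practical tools usually report only one of them.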

Local Alignment Algorithm • Algorithm of Smith & Waterman (1981) • Makes an optimal alignment of the best segment of similarity between two sequences • Suited to sequences that are not highly similar as a whole but contain regions that are highly similar • Use when one sequence is short and the other is very long (e.g. a database sequence) • Can return a number of highly similar aligned segments
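A minimal sketch of local alignment in the style of Smith and Waterman, reusing the scoring system from the global example as an assumption. The slides do not spell out the recurrence; the sketch relies on the standard formulation, in which scores are not allowed to fall below zero and the traceback starts at the highest-scoring cell and stops at the first zero. The function name `sw_local` is hypothetical, and only one best segment is returned.

```python
def sw_local(top: str, left: str, match: int = 3,
             mismatch: int = -1, gap: int = -2):
    """Return (score, top_segment, left_segment) of one best local alignment."""
    cols, rows = len(top) + 1, len(left) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for j in range(1, rows):
        for i in range(1, cols):
            s = match if top[i - 1] == left[j - 1] else mismatch
            H[j][i] = max(0,                      # start a new segment
                          H[j - 1][i - 1] + s,    # diagonal
                          H[j][i - 1] + gap,      # horizontal
                          H[j - 1][i] + gap)      # vertical
            if H[j][i] > best:
                best, best_i, best_j = H[j][i], i, j

    # Trace back from the best cell until a zero is reached.
    i, j = best_i, best_j
    a, b = [], []
    while i > 0 and j > 0 and H[j][i] > 0:
        s = match if top[i - 1] == left[j - 1] else mismatch
        if H[j][i] == H[j - 1][i - 1] + s:
            a.append(top[i - 1]); b.append(left[j - 1]); i -= 1; j -= 1
        elif H[j][i] == H[j][i - 1] + gap:
            a.append(top[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(left[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(sw_local("TACT", "AATC"))  # (5, 'ACT', 'AAT')
```

With the same two sequences, the local score (5) is higher than the global score (1) because the unaligned ends are not penalized.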
