Learn about inexact string matching, dynamic programming algorithms for alignment, limitations, and heuristic approaches in bioinformatics. Additional resources and tutorials are available. Master alignment techniques for genetic analysis.
Introduction to Bioinformatics: Lecture V. Alignment Counting and Alignment Algorithms
Jarek Meller, Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org
Outline of the lecture
• Complexity of inexact string matching: exercises in counting alignments with gaps
• The dynamic programming algorithm for sequence alignment: how it works
• The dynamic programming algorithm for sequence alignment: why it works
• Limitations and faster heuristic approaches: BLAST
Web watch: “Genes and Disease” and other NCBI resources
Genes (proteins) work in herds. Being co-localized may imply co-expression and interactions. Always check the neighbors of your favorite gene!
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowTOC&rid=gnd.TOC&depth=2
Additional reading materials regarding sequence alignment: check out the web site for the course …
How many alignments with gaps are there?
All the possible alignments (with gaps, but without the unnecessary alignment of two gaps against each other) may be represented as walks on a grid in which only three steps (South, East, Southeast) are allowed, i.e., there is a bijection between the set of walks on such a grid and the set of alignments. The number of walks reaching each node of the grid:

    0   1   1   1
    1   3   5   7
    1   5  13  25
    1   7  25  63

Continuing along the diagonal: 321, 1683.
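The counts in the grid can be reproduced with a short recurrence: a node is reached from its North, West, or Northwest neighbor, so the number of walks is the sum of the three. The sketch below is a minimal illustration; the function name `count_alignments` is made up for this example, and it assigns 1 to the origin (the empty walk), whereas the slide shows 0 in the corner.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of South/East/Southeast walks from (0, 0) to (n, m)."""
    if n == 0 or m == 0:
        # Along an edge of the grid only one walk exists (all gaps).
        # Assumption: the origin itself counts as 1 (the empty walk).
        return 1
    return (count_alignments(n - 1, m)        # South step
            + count_alignments(n, m - 1)      # East step
            + count_alignments(n - 1, m - 1)) # Southeast step


if __name__ == "__main__":
    for i in range(4):
        print([count_alignments(i, j) for j in range(4)])
    print(count_alignments(4, 4), count_alignments(5, 5))
```

The rapid growth along the diagonal (63, 321, 1683, ...) is the reason exhaustive enumeration of alignments is impractical for sequences of realistic length.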
Time for the main idea of the algorithm …
• Suppose we knew the best alignments (and their scores) up to the nodes which delineate the yellow part of the DP graph. What would be the best extension, given that we have three choices:
  • Align the last characters in each string (diagonal extension) and add the score for this pair, e.g., s(a3,b3) = -5
  • Align a3 to a gap (horizontal extension), s(a3,-) = -8
  • Align b3 to a gap (vertical extension), s(-,b3) = -8
Note well, the score for an extension does not depend on the alignment up to this point.
[Figure: DP graph; node scores shown: 5, 12, 4, -3]
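The three-way choice above can be written as a tiny function. This is only a sketch: the function name and argument names are hypothetical, and the assignment of the figure's node scores to particular predecessor nodes (5 on the diagonal, -3 before the gap aligned to a3, 12 before the gap aligned to b3) is an assumption made for illustration, since the figure itself is not reproduced here.

```python
def best_extension(f_diag, f_before_a_gap, f_before_b_gap,
                   s_pair, s_a_gap, s_b_gap):
    """Pick the best of the three one-step extensions into a node."""
    candidates = {
        "align a3 with b3": f_diag + s_pair,
        "align a3 to a gap": f_before_a_gap + s_a_gap,
        "align b3 to a gap": f_before_b_gap + s_b_gap,
    }
    move = max(candidates, key=candidates.get)
    return move, candidates[move]


if __name__ == "__main__":
    # Predecessor scores are ASSUMED placements of the numbers in the figure.
    move, score = best_extension(5, -3, 12, s_pair=-5, s_a_gap=-8, s_b_gap=-8)
    print(move, score)
```

Under these assumed placements the best extension comes from the node with score 12, giving 12 - 8 = 4, which is consistent with the value 4 that appears in the figure.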
Tracing back optimal local extensions given the best alignment up to the previous node in the graph …
Pair scores used in the example: s(a1,b1) = -5, s(a2,b1) = 10, s(a3,b2) = 10
[Figure: DP graph with node scores and traceback pointers]
This conceptual step may be reversed to obtain the best score and alignment up to a given point
The score is a sum of independent piecewise scores; in particular, the score up to a certain point is the best score up to the point one step before plus the incremental score of the new step:
• Global alignment (Needleman-Wunsch):
  F(0,0) = 0; F(k,0) = F(0,k) = -k d;
  F(i,j) = max { F(i-1,j-1) + s(a_i,b_j) ; F(i-1,j) - d ; F(i,j-1) - d }
• Local alignment (Smith-Waterman):
  F(0,0) = 0; F(k,0) = F(0,k) = 0;
  F(i,j) = max { 0 ; F(i-1,j-1) + s(a_i,b_j) ; F(i-1,j) - d ; F(i,j-1) - d }
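The two recurrences translate almost line by line into code. The sketch below computes only the optimal scores (no traceback). The scoring function `simple_score` (+2 for a match, -1 for a mismatch) and the gap penalty d = 2 are assumptions chosen for illustration, not the substitution matrix used in the lecture.

```python
def simple_score(x: str, y: str) -> int:
    """Assumed toy scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def nw_score(a: str, b: str, s=simple_score, d: int = 2) -> int:
    """Global alignment score: F(k,0) = F(0,k) = -k*d, answer in F(n,m)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n][m]


def sw_score(a: str, b: str, s=simple_score, d: int = 2) -> int:
    """Local alignment score: borders are 0, cells are floored at 0,
    and the answer is the largest value anywhere in the table."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            best = max(best, F[i][j])
    return best


if __name__ == "__main__":
    print(nw_score("ACGT", "AGT"))
    print(sw_score("xxACGTyy", "zzACGTww"))
```

The only differences between the two functions are the border initialization, the extra 0 in the maximum, and where the final score is read from.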
The general scheme of the NW algorithm
• Use the recurrence relations, starting from the upper left corner (by convention).
• Find the highest score in the DP table (for global alignment this is, by definition, the last, bottom-right cell).
• Trace back the alignment using the pointers in the DP graph that show how the best local steps led to the best overall alignment.
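The three steps of the scheme can be sketched as follows: fill the table while recording a pointer per cell, read the score from the bottom-right cell, and follow the pointers back to the origin. The scoring (+2 match, -1 mismatch, d = 2), the pointer labels, and the tie-breaking order (diagonal first) are assumptions made for this illustration.

```python
def needleman_wunsch(a: str, b: str, s, d: int):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"

    # Step 1: fill the table, remembering which step produced each score.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (F[i - 1][j - 1] + s(a[i - 1], b[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),    # a[i-1] aligned to a gap
                (F[i][j - 1] - d, "left"),  # b[j-1] aligned to a gap
            ]
            # max() keeps the first maximal entry, so ties prefer "diag".
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])

    # Steps 2 and 3: start at the bottom-right cell and follow the pointers.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = ptr[i][j]
        if step == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif step == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    match_mismatch = lambda x, y: 2 if x == y else -1  # assumed toy scoring
    score, top, bottom = needleman_wunsch("ACGT", "AGT", match_mismatch, 2)
    print(score)
    print(top)
    print(bottom)
```

When several alignments share the optimal score, the traceback returns just one of them; which one depends on the tie-breaking order.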
Examples of pairwise scores from the BLOSUM50 matrix
An example of DP table for global alignment
HEAGAWGHE
--P-AW-HE
An example of DP table for local alignment
AWGHE
AW-HE
Why does it work?
• All the possible alignments (with gaps) are represented in the DP table (graph).
• The score is a sum of independent piecewise scores; in particular, the score up to a certain point is the best score up to the point one step before plus the incremental score of the new step.
• Once the best score in the DP table is found, the traceback procedure generates the alignment, since only the best “past” leading to the present score is represented by the pointers between the cells.
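For very short strings these claims can be checked empirically: enumerate every alignment explicitly, score each one as a sum of independent column scores, and compare the maximum with the value produced by the DP recurrence. The sketch below does this; the helper names and the toy scoring (+2 match, -1 mismatch, d = 2) are assumptions for illustration, and the enumeration is exponential, so it is only usable for tiny inputs.

```python
def all_alignments(a: str, b: str):
    """Yield every alignment as a list of columns (x, y); '-' marks a gap."""
    if not a and not b:
        yield []
        return
    if a and b:
        for rest in all_alignments(a[:-1], b[:-1]):
            yield rest + [(a[-1], b[-1])]
    if a:
        for rest in all_alignments(a[:-1], b):
            yield rest + [(a[-1], "-")]
    if b:
        for rest in all_alignments(a, b[:-1]):
            yield rest + [("-", b[-1])]


def alignment_score(columns, s, d):
    """Sum of independent per-column scores."""
    return sum(-d if "-" in (x, y) else s(x, y) for x, y in columns)


def dp_score(a: str, b: str, s, d):
    """Global alignment score from the NW recurrence."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n][m]


if __name__ == "__main__":
    s = lambda x, y: 2 if x == y else -1  # assumed toy scoring
    alignments = list(all_alignments("ACG", "AGT"))
    print(len(alignments))  # number of walks on the 3 x 3 grid
    print(max(alignment_score(c, s, 2) for c in alignments),
          dp_score("ACG", "AGT", s, 2))
```

Agreement on small cases is of course not a proof; it only illustrates that the table covers all alignments and that the recurrence picks the best one.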
Why does it work?
Formally, one needs to show that the walk (alignment) found using the NW DP recurrence relations and the traceback procedure is indeed optimal, i.e., it maximizes the alignment score.
An argument instead of a proof
In the case of global alignment, each path starts at cell (0,0) and must end at cell (n,m). Consider the latter cell and the immediate past that led to it through one of the 3 neighboring cells, namely the one that is most favorable once the cost of the incremental step is included. Changing the last step (e.g., from the initially chosen, optimal diagonal step) to an alternative one does not affect the scores at the preceding cells, which represent the best trajectories up to that point. Hence, if optimal solutions have been found in the previous steps, any alternative last step can only give a solution that is no better. Formalizing this argument gives a proof by induction with reductio ad absurdum.
Problem: How to modify the NW algorithm for suffix-prefix matches?
Problem: What is the meaning of the cut-off threshold for the SW algorithm?
Approximate, heuristic solutions may be nearly as good and much faster: BLAST algorithm
• BLAST approach: gapless seeds (High-scoring Segment Pairs, HSPs, with well-defined confidence measures), followed by DP extensions.
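The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the BLAST implementation: it assumes exact word matches of length k as seeds, a +1/-1 match/mismatch score, and a simple drop-off rule (`x_drop`) to stop the gapless extension. All function and parameter names are made up for this example, and the statistical assessment of the resulting pairs is omitted.

```python
def find_seeds(query: str, subject: str, k: int):
    """Positions (i, j) where the same k-letter word occurs in both strings."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_ungapped(query, subject, i, j, k, s, x_drop):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment.
    Extension stops once the running score falls more than x_drop below
    the best score seen so far.
    """
    best = run = sum(s(query[i + t], subject[j + t]) for t in range(k))
    right = 0
    t = k
    while i + t < len(query) and j + t < len(subject):
        run += s(query[i + t], subject[j + t])
        if run > best:
            best, right = run, t - k + 1
        if best - run > x_drop:
            break
        t += 1
    run, left, t = best, 0, 1
    while i - t >= 0 and j - t >= 0:
        run += s(query[i - t], subject[j - t])
        if run > best:
            best, left = run, t
        if best - run > x_drop:
            break
        t += 1
    return best, i - left, j - left, k + left + right


if __name__ == "__main__":
    s = lambda x, y: 1 if x == y else -1  # assumed toy scoring
    q, t = "TTACGTACGG", "CCACGTACAA"
    seeds = find_seeds(q, t, 4)
    print(seeds)
    print(extend_ungapped(q, t, *seeds[0], 4, s, x_drop=2))
```

Because only diagonals containing a seed are examined, most of the DP table is never visited, which is where the speed-up comes from; the price is that alignments without a sufficiently good gapless seed can be missed.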