320 likes | 473 Views
Pairwise Sequence Alignment (I). (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign. Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm.
E N D
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm
Comparing Genes in Two Genomes • Small islands of similarity corresponding to similarities between exons • Such comparisons are quite common in biology research
Alignment of sequences is one of the most basic and most important problems in bioinformatics…
Outline • Defining the problem of alignment • The longest common subsequence problem • Dynamic programming algorithms for alignment
Aligning Two Strings Given the strings: • v = ATGTTAT • w = ATCGTAC One possible alignment of the strings: AT_GTTAT_ ATCGT_A_C 1st row – string v with with space symbols “-” inserted 2nd row – string w with with space symbols “-” inserted
Aligning Two Strings (cont’d) Another way to represent each row shows the number of symbols of the sequence present up to a given position. For example the above sequences can be represented as: 0 1 2 2 3 4 5 6 7 7 AT_GTTAT_ ATCGT_A_C 0 1 2 3 4 5 5 6 6 7
0 1 2 2 3 4 5 6 7 7 AT_GTTAT_ ATCGT_A_C 0 1 2 3 4 5 5 6 6 7 Alignment Matrix Both rows of the alignment can be represented in the resulting matrix: 0 1 2 2 3 4 5 6 7 7 0 1 2 3 4 5 5 6 6 7 Each column in this matrix is a coordinate in a two-dimensional nxm grid
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1)
A T C G T A C w v A T G T T A T Alignment as a Path in the Edit Graph 1 2 3 4 5 6 7 0 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2) 0 1 2 3 4 5 6 7
A T C G T A C w v A T G T T A T Alignment as a Path in the Edit Graph 1 2 3 4 5 6 7 0 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4) 0 1 2 3 4 5 6 7
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) - End Result -
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment:
How to Score an Alignment? • Simplest • Every match scores 1 • Every mismatch scores 0 • An alignment is scored based on the number of common symbols • Lead to the longest common subsequence problem • More sophisticated • ? • ? • To be covered later
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignments in Edit Graph (cont’d) • and represent indels in v and w • Score 0. • represent exact matches. • Score 1.
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignments in Edit Graph (cont’d) The score of the alignment path in the graph is 5.
The Longest Common Subsequence (LCS) Problem • Find the longest subsequence common to two strings. Input: Two strings, v and w. Output: The longest common subsequence of v and w. v = ATGTTAT w = ATCGTAC v = AT GTTAT | | | | | “ATGTA” w = ATCGT AC A subsequence is not necessarily consecutive Longest common subsequence Best alignment
Brute Force Approach • Enumerate all the sequences up to length min(|v|,|w|) • For each one, check to see if it is a subsequence of v and w • Very expensive…. (How many sequences do we have to enumerate? )
The Idea of Dynamic Programming • Think of an alignment as a path in an edit graph • We only need to keep track of the best alignment (i.e., the longest common subsequence) • Score a longer alignment based on shorter alignments
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 G 3 T 4 T 5 A 6 T 7 Alignment as a Path in the Edit Graph 0122345677 v= AT_GTTAT_ w= ATCGT_A_C 0123455667 (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),(7,7) Use each cell to store the best alignment so far…
Use this scoring algorithm si,j = si-1, j-1+1 if vi = wj max si-1, j si, j-1 { Alignment: Dynamic Programming
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 0 G 3 0 T 4 0 T 5 0 A 6 0 T 7 0 0 Dynamic Programming Example • There are no matches in the beginning of the sequence • Label column i=1 to be all zero, and row j=1 to be all zero 0 0 0 0 0 0 0 0
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 0 G 3 0 T 4 0 T 5 0 Si,j = Si-1, j-1 max Si-1, j Si, j-1 A 6 0 { T 7 0 0 Dynamic Programming Example 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 value from NW +1, if vi = wj value from North (top) value from West (left) 1 1 1 1 Keep track of the best alignment score and the path contributing to it 1 1
Alignment: Backtracking Arrows show where the score originated from. if from the top if from the left if vi = wj
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 T 2 0 G 3 0 T 4 0 T 5 0 A 6 0 T 7 0 0 Dynamic Programming Example 0 0 0 0 0 0 0 0 Continuing with the scoring algorithm gives this result. 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 3 3 3 3 1 2 2 3 4 4 4 1 2 2 3 4 4 4 1 2 2 3 4 5 5 1 2 2 3 4 5 5
LCS(v,w) for i 1 to n Si,0 0 for j 1 to m S0,j 0 for i 1 to n for j 1 to m si-1,j si,j max si,j-1 si-1,j-1 + 1, if vi = wj “ “ if si,j = si-1,j bi,j “ “ if si,j = si,j-1 “ “ if si,j = si-1,j-1 + 1 return (sn,m, b) LCS Algorithm { {
A T C G T A C w 1 2 3 4 5 6 7 0 0 v A 1 0 0 0 0 0 0 0 0 T 2 0 1 1 1 1 1 1 1 G 3 1 0 2 2 2 2 2 2 T 1 4 2 2 3 3 3 3 0 1 2 T 2 3 4 4 4 5 0 1 2 2 3 4 4 4 A 6 0 1 2 2 3 4 5 5 T 1 2 2 3 4 7 0 5 5 0 Now What? • LCS(v,w) created the alignment grid • Now we need a way to read the best alignment of v and w • Follow the arrows backwards from the (|v|,|w|) cell
LCS Runtime • To create the nxm matrix of best scores from vertex (0,0) to all other vertices, it takes O(nm) time. • Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix.
How do we improve the scoring of alignments? Can we still find an alignment efficiently? We’ll talk about these later…
Matching score Insertion/deletion score The LCS Recurrence Revisited • The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty of indels are 0: si-1, j-1+1 if vi = wj si,j = max si-1, j + 0 si, j-1 + 0
What You Should Know • How an alignment corresponds to a path in an edit graph • How the LCS problem corresponds to alignment with a simple scoring method • How the dynamic programming algorithm solves the LCS problem (= simple alignment)