Sequence Alignment II CIS 667 Spring 2004
Optimal Alignments • So we know how to compute the similarity between two sequences • How do we construct an alignment that gives that similarity? • We will use the (already computed) array from the previous algorithm • Start at entry (m, n) and repeat the choices made to get the similarity score • Note that sometimes we had more than one choice giving the same optimal score
Optimal Alignments • Each choice gives one column of the alignment • If we have two or three choices, we systematically choose one of them • We will use a recursive algorithm • The algorithm will produce two arrays - align-s and align-t • The elements of these arrays are either spaces or symbols from the sequences
Algorithm Align
input: indices i, j, array a given by algorithm Similarity
output: alignment in align-s, align-t, and length in len
if i = 0 and j = 0 then
    len ← 0
else if i > 0 and a[i, j] = a[i - 1, j] + g then
    Align(i - 1, j, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← -
else if i > 0 and j > 0 and a[i, j] = a[i - 1, j - 1] + p(i, j) then
    Align(i - 1, j - 1, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← t[j]
else // j > 0 and a[i, j] = a[i, j - 1] + g
    Align(i, j - 1, len)
    len ← len + 1
    align-s[len] ← -
    align-t[len] ← t[j]
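The Align procedure above can be sketched in Python as follows. This is a minimal sketch, not the course's own code: it replays the choices with a loop instead of recursion (so long sequences do not hit Python's recursion limit), it rebuilds the array a with a hypothetical `similarity` helper, and the scores (match 1, mismatch -1, gap g = -2) are assumed values. Python strings are 0-indexed, so s[i] in the pseudocode becomes `s[i - 1]`.

```python
def p(x, y, match=1, mismatch=-1):
    """Score for aligning symbols x and y (assumed values, not from the slides)."""
    return match if x == y else mismatch


def similarity(s, t, g=-2):
    """Fill the (m+1) x (n+1) array a used by Align (global alignment)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: s[1..i] against spaces
        a[i][0] = i * g
    for j in range(1, n + 1):          # first row: t[1..j] against spaces
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i - 1][j] + g,
                          a[i - 1][j - 1] + p(s[i - 1], t[j - 1]),
                          a[i][j - 1] + g)
    return a


def align(s, t, a, g=-2):
    """Replay the choices from a[m][n] back to a[0][0], in Align's priority order."""
    i, j = len(s), len(t)
    align_s, align_t = [], []
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            align_s.append(s[i - 1])   # symbol of s against a space
            align_t.append("-")
            i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(s[i - 1], t[j - 1]):
            align_s.append(s[i - 1])   # symbol of s against symbol of t
            align_t.append(t[j - 1])
            i, j = i - 1, j - 1
        else:                          # j > 0 and a[i][j] == a[i][j - 1] + g
            align_s.append("-")        # space against a symbol of t
            align_t.append(t[j - 1])
            j -= 1
    # Columns were collected from the last to the first, so reverse them.
    return "".join(reversed(align_s)), "".join(reversed(align_t))
```

With these assumed scores, `similarity("AAAC", "AGC")` has -1 in its bottom-right entry, and `align("AAAC", "AGC", a)` returns `("AAAC", "AG-C")`. Ties are broken in the same order as the pseudocode: the space in t first, then the diagonal, then the space in s.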
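Algorithm Complexity • The first algorithm (Similarity) has four loops: two initialization loops, O(m) and O(n), and two nested loops that fill the array, O(mn) • So its complexity is O(m) + O(n) + O(mn) = O(mn), which is O(n^2) when both sequences have length about n • The second algorithm (Align) is O(len) = O(m + n)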
Local Comparison • A local alignment between s and t is an alignment between a substring of s and a substring of t • We want to find the highest scoring local alignment between two sequences • Modify the original algorithm so that each entry (i, j) of the matrix will hold the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]
Local Comparison • First row and column initialized to 0 • We now fill in the other elements of a as before, but choosing the maximum of four values • We have the previous three choices, plus a fourth choice: 0 • The choice 0 is always available, by aligning the two empty suffixes • To find the alignment, start at the largest value in the array and trace back in the same way as before, stopping when we reach an entry with value zero
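The local comparison described above can be sketched as follows. This is a hedged illustration, not the course's own code: the function name `local_align` and the scores (match 1, mismatch -1, gap g = -2) are assumptions, and when several entries share the largest value the sketch simply keeps the first one it meets.

```python
def local_align(s, t, match=1, mismatch=-1, g=-2):
    """Return (score, piece_of_s, piece_of_t) for one best local alignment."""
    m, n = len(s), len(t)
    # First row and first column stay 0; every entry is at least 0.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g,      # space in t
                          a[i - 1][j - 1] + p,  # symbol against symbol
                          a[i][j - 1] + g,      # space in s
                          0)                    # the two empty suffixes
            if a[i][j] > best:
                best, bi, bj = a[i][j], i, j

    # Trace back from the largest entry and stop at the first zero.
    # A nonzero entry is never in row 0 or column 0, so i, j > 0 here.
    i, j = bi, bj
    out_s, out_t = [], []
    while a[i][j] != 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if a[i][j] == a[i - 1][j] + g:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        elif a[i][j] == a[i - 1][j - 1] + p:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))
```

For example, `local_align("GGACGTGG", "CCACGTCC")` returns `(4, "ACGT", "ACGT")` under these assumed scores: the shared substring is found even though the flanking symbols do not match.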
Semiglobal Comparisons • The basic algorithm compares two sequences in their entirety • Gap penalty assessed whether in middle or at end of one or more sequences • Not always desirable • Suppose we want to search for the short sequence ACGT within the longer sequence AAACACGTGTCC
AAACACGTGTCC
----ACGT----
Semiglobal Comparisons • We don’t want to penalize the gaps at the end as we do those in middle since they don’t have biological significance • Usually result from incomplete data acquisition • This approach is known as semiglobal alignment • We can modify the basic algorithm for this type of alignment
Semiglobal Comparisons • Suppose we don’t want to charge for spaces after the last character of s • Consider an optimal alignment • Spaces after the end of s are matched with a suffix of t • Removing final part of alignment, we have an alignment between s and a prefix of t • So find optimal alignment between s and a prefix of t - but these are already computed in last row of a! • So take max value from last row of a
Semiglobal Comparisons • Suppose we don’t want to charge for spaces after the last character of t • Consider an optimal alignment • Spaces after the end of t are matched with a suffix of s • Removing final part of alignment, we have an alignment between t and a prefix of s • So find optimal alignment between t and a prefix of s - but these are already computed in last column of a! • So take max value from last column of a
Semiglobal Comparisons • What about spaces at the beginning of s and t? • These are represented by the values in the first row and column of a • So, if we don’t want to charge for them, just initialize this row and column to be all 0 • So the changes to the basic algorithm are: • Initialize the first row and first column to zero • Look for the maximum in the last row or last column
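The two changes can be sketched as follows, for the case where end spaces are free on both sides of both sequences. This is an illustrative sketch only: the function name `alignment_score`, the `free_ends` flag, and the scores (match 1, mismatch -1, gap g = -2) are assumptions. With `free_ends=False` the function falls back to the basic global score, so the two can be compared.

```python
def alignment_score(s, t, free_ends=True, match=1, mismatch=-1, g=-2):
    """Semiglobal score if free_ends is True, basic global score otherwise."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    if not free_ends:
        # Basic algorithm: leading spaces are charged.
        for i in range(1, m + 1):
            a[i][0] = i * g
        for j in range(1, n + 1):
            a[0][j] = j * g
    # Otherwise the first row and first column stay 0: leading spaces are free.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + g)
    if not free_ends:
        return a[m][n]
    # Trailing spaces are free: take the best entry of the last row or column.
    return max(max(a[m]), max(row[n] for row in a))
```

For the earlier example, `alignment_score("AAACACGTGTCC", "ACGT")` gives 4 (four matches, end spaces free), while `alignment_score("AAACACGTGTCC", "ACGT", free_ends=False)` gives -12 (four matches minus eight charged spaces at g = -2).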