290 likes | 417 Views
Comp. Genomics. Recitation 2 12/3/09 Slides by Igor Ulitsky. Outline. Alignment re-cap End-space free alignment Affine gap alignment algorithm and proof Bounded gap/spaces alignments. Dynamic programming. Useful in many string-related settings Will be repeatedly used in the course
E N D
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky
Outline • Alignment re-cap • End-space free alignment • Affine gap alignment algorithm and proof • Bounded gap/spaces alignments
Dynamic programming • Useful in many string-related settings • Will be repeatedly used in the course • General idea • Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial
Dynamic programming for shortest paths • Finding the shortest path from X to Y using the Floyd Warshall • Idea: if we know what is the shortest path using intermediate vertices {1,…, k-1}, computing shortest paths using {1,…, k} is easy wij if k=0 • dij(k)= min{dij(k-1), dik(k-1)+dkj(k-1)} otherwise
Something1|G Something2|C Alignment reminder Something1|G Something1|C Something1|G Something2|C Something1|G Something1|C Somethin g1|G Something2C|- Something1|G Something1G|- Something1|C Somethin g2|C
Global alignment • Input: S1,S2 • Output: Minimum cost alignment • V(k,l) – score of aligning S1[1..k] with S2[1..l] • Base conditions: • V(i,0) = k=0..i(sk,-) • V(0,j) = k=0..j(-,tk) • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj)
Alignment reminder • Global alignment • All of S1 has to be aligned with all of S2 • Every gap is “payed for” • Solution equals V(n,m) Traceback all the way Alignment score here
Local alignment • Local alignment • Subset of S1 aligned with a subset of S2 • Gaps outside subsets “costless” • Solution equals the maximum score cell in the DP matrix • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) 0
Ends-free alignment • Something between global and local • Consider aligning a gene to a (bacterial) genome • Gaps in the beginning and end of S and T are costless • But all of S,T should be aligned • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) • The optimal solution is found at the last row/column (not necessarily at bottom right corner)
Something1|G Something2|C Handling weird gaps • Affine gap: different cost for a “new” and “old” gaps Something1|G Something1|C Something1|G Something2|C Something1|G Something1|C Somethin g1|G Something2C|- Two new things to keep track Two additional matrices Now we care if there were gaps here Something1|G Something1G|- Something1|C Somethin g2|C
G(i,j) S.....i T.....j Alignment with Affine Gap Penalty • Base Conditions: • V(i, 0) = F(i, 0) = Wg + iWs • V(0, j) = E(0, j) = Wg + jWs • Recursive Computation: • V(i, j) = max{ E(i, j), F(i, j), G(i, j)} • where: • G(i, j) = V(i-1, j-1) + (si, tj) • E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws } • F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws } S.....i------ T..............j E(i,j) S...............i T.....j------- • Time complexity O(nm) - compute 4 matrices instead of one. • Space complexity O(nm) - saving 3 (Why?) matrices. O(n+m) w/ Hir.
When do constant and affine gap costs differ? AGAGACTGACGCTTA ATATTA • Consider: AGAGACTGACGCTTA ATA---------TTA AGAGACTGACGCTTA ----A-T-A---TTA Constant penalty: Mismatch: -5 Gap: -1 -14 -9 Affine penalty: Mismatch: -5 Gap open: -3 Gap extend: -0.5 -12 -14.5
Bounding the number of gaps • Lets say we are allowed to have at most K gaps • (Gaps ≠ Spaces Gap can contain many spaces) • Now we keep track of the number of gaps we opened so far • Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)
Bounding the number of gaps • A “multi-layer” DP matrix • Actually separate functions – V,E,F, on every layer, keeping track of layer no. • Every time we open or close a gap we “jump” to the next layer • Where to look for the solution? (not only at last layer!) • What is the complexity?
Bounding the number of spaces • Let’s say that no gap can exceed k spaces • Of course now cannot also bound number of gaps as well (why?) • How many matrices do we need now? • Here, no monotone notion of layer like before • What’s the complexity?
What about arbitrary gap functions? • If the gap cost is an arbitrary function of its length f(k) • Thus, when computing Dij, we need to look at j places “back” and i places “up”: • Complexity? Something1|G Something1|C min
Special cases • How about a logarithmic penalty? Wg+Ws*log(k) • This is a special case of a convex penalty, which is solvable in O(mn*log(m)) • The logarithmic case can be done in O(mn) • For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))
Supersequence • Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A. • e.g., YABADABADU is a non-contigous supersequence of BABU (YABADABADU) • Given SandT, find their shortest common supersequence
Reminder: LCS • Longest common non-contigous subsequence: • Adjust global alignment with similarity scores • 1 for match • 0 for gaps • -∞ for mismatches
Supersequence • Find the longest common sub-sequence of S,T • Generate the string as follows: • for every column in the alignment • Match – add the matching character (once!) • Gap – add the character aligned against the gap
Supersequence • For S=“Pride” T=“Parade”: • P-R-IDE • PARA-DE • PARAIDE – Shortest common supersequence
Exercise: Finding repeats • Basic objective: find a pair of subsequences within S with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?) • But using local alignment is still a good idea
Variant #1 • Specific requirement: the two sequences may overlap • Solution: Change the local alignment algorithm: • Compute only the upper triangular submatrix (V(i,j), where j>i). • Set diagonal values to 0 • Complexity: O(n2) time and O(n) space
Variant #2 • Specific requirement: the two sequences may not overlap • Solution: Absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n] • Check local alignments between S[1..k] and S[k+1..n] for any 1<=k<n • Pick the highest-scoring alignment • Complexity: O(n3) time and O(n) space
Variant #3 • Specific requirement: the two sequences must be consequtive (tandem repeat) • Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], • No penalties for gaps in the beginning of S[1..k] • No penalties for gaps in the end of S[k+1..n] • Complexity: O(n3) time and O(n) space
Variant #4 • Specific requirement: the two sequences must be consequtive and the similarity is measured between the first sequence and the reverse complement of the second - SRC (inverted repeat) • Tempting (albeit wrong) to use something in the spirit of variant #3 – will give complexity O(n3)
Variant #4 • Solution: Compute the local alignment between S and SRC • Look for results on the diagonal i+j=n • AGCTAACGCGTTCGAA (n=16) • Complexity:O(n2) time, O(n) space Index 8 Index 8