1 / 29

Comp. Genomics

Comp. Genomics. Recitation 2 12/3/09 Slides by Igor Ulitsky. Outline. Alignment re-cap End-space free alignment Affine gap alignment algorithm and proof Bounded gap/spaces alignments. Dynamic programming. Useful in many string-related settings Will be repeatedly used in the course

dandre
Download Presentation

Comp. Genomics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky

  2. Outline • Alignment re-cap • End-space free alignment • Affine gap alignment algorithm and proof • Bounded gap/spaces alignments

  3. Dynamic programming • Useful in many string-related settings • Will be repeatedly used in the course • General idea • Confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial

  4. Dynamic programming for shortest paths • Finding the shortest path from X to Y using the Floyd Warshall • Idea: if we know what is the shortest path using intermediate vertices {1,…, k-1}, computing shortest paths using {1,…, k} is easy wij if k=0 • dij(k)= min{dij(k-1), dik(k-1)+dkj(k-1)} otherwise

  5. Something1|G Something2|C Alignment reminder Something1|G Something1|C Something1|G Something2|C Something1|G Something1|C Somethin g1|G Something2C|- Something1|G Something1G|- Something1|C Somethin g2|C

  6. Global alignment • Input: S1,S2 • Output: Minimum cost alignment • V(k,l) – score of aligning S1[1..k] with S2[1..l] • Base conditions: • V(i,0) = k=0..i(sk,-) • V(0,j) = k=0..j(-,tk) • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj)

  7. Alignment reminder • Global alignment • All of S1 has to be aligned with all of S2 • Every gap is “payed for” • Solution equals V(n,m) Traceback all the way Alignment score here

  8. Local alignment • Local alignment • Subset of S1 aligned with a subset of S2 • Gaps outside subsets “costless” • Solution equals the maximum score cell in the DP matrix • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) 0

  9. Ends-free alignment • Something between global and local • Consider aligning a gene to a (bacterial) genome • Gaps in the beginning and end of S and T are costless • But all of S,T should be aligned • Base conditions: • V(i,0) = 0 • V(0,j) = 0 • Recurrence relation: V(i-1,j-1) + (si,tj) 1in, 1jm: V(i,j) = maxV(i-1,j) + (si,-) V(i,j-1) + (-,tj) • The optimal solution is found at the last row/column (not necessarily at bottom right corner)

  10. Something1|G Something2|C Handling weird gaps • Affine gap: different cost for a “new” and “old” gaps Something1|G Something1|C Something1|G Something2|C Something1|G Something1|C Somethin g1|G Something2C|- Two new things to keep track  Two additional matrices Now we care if there were gaps here Something1|G Something1G|- Something1|C Somethin g2|C

  11. G(i,j) S.....i T.....j Alignment with Affine Gap Penalty • Base Conditions: • V(i, 0) = F(i, 0) = Wg + iWs • V(0, j) = E(0, j) = Wg + jWs • Recursive Computation: • V(i, j) = max{ E(i, j), F(i, j), G(i, j)} • where: • G(i, j) = V(i-1, j-1) + (si, tj) • E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws } • F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws } S.....i------ T..............j E(i,j) S...............i T.....j------- • Time complexity O(nm) - compute 4 matrices instead of one. • Space complexity O(nm) - saving 3 (Why?) matrices. O(n+m) w/ Hir.

  12. When do constant and affine gap costs differ? AGAGACTGACGCTTA ATATTA • Consider: AGAGACTGACGCTTA ATA---------TTA AGAGACTGACGCTTA ----A-T-A---TTA Constant penalty: Mismatch: -5 Gap: -1 -14 -9 Affine penalty: Mismatch: -5 Gap open: -3 Gap extend: -0.5 -12 -14.5

  13. Bounding the number of gaps • Lets say we are allowed to have at most K gaps • (Gaps ≠ Spaces  Gap can contain many spaces) • Now we keep track of the number of gaps we opened so far • Also still need to keep track of whether a gap is currently open in S or T (E/F matrices)

  14. Bounding the number of gaps • A “multi-layer” DP matrix • Actually separate functions – V,E,F, on every layer, keeping track of layer no. • Every time we open or close a gap we “jump” to the next layer • Where to look for the solution? (not only at last layer!) • What is the complexity?

  15. Bounding the number of spaces • Let’s say that no gap can exceed k spaces • Of course now cannot also bound number of gaps as well (why?) • How many matrices do we need now? • Here, no monotone notion of layer like before • What’s the complexity?

  16. What about arbitrary gap functions? • If the gap cost is an arbitrary function of its length f(k) • Thus, when computing Dij, we need to look at j places “back” and i places “up”: • Complexity? Something1|G Something1|C min

  17. Special cases • How about a logarithmic penalty? Wg+Ws*log(k) • This is a special case of a convex penalty, which is solvable in O(mn*log(m)) • The logarithmic case can be done in O(mn) • For a piece-wise linear gap function made of K lines, DP can be done in O(mn*log(K))

  18. Supersequence • Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A. • e.g., YABADABADU is a non-contigous supersequence of BABU (YABADABADU) • Given SandT, find their shortest common supersequence

  19. Reminder: LCS • Longest common non-contigous subsequence: • Adjust global alignment with similarity scores • 1 for match • 0 for gaps • -∞ for mismatches

  20. Supersequence • Find the longest common sub-sequence of S,T • Generate the string as follows: • for every column in the alignment • Match – add the matching character (once!) • Gap – add the character aligned against the gap

  21. Supersequence • For S=“Pride” T=“Parade”: • P-R-IDE • PARA-DE • PARAIDE – Shortest common supersequence

  22. Exercise: Finding repeats • Basic objective: find a pair of subsequences within S with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?) • But using local alignment is still a good idea

  23. Variant #1 • Specific requirement: the two sequences may overlap • Solution: Change the local alignment algorithm: • Compute only the upper triangular submatrix (V(i,j), where j>i). • Set diagonal values to 0 • Complexity: O(n2) time and O(n) space

  24. Variant #2 • Specific requirement: the two sequences may not overlap • Solution: Absence of overlap means that k exists such that one string is in S[1..k] and another in S[k+1..n] • Check local alignments between S[1..k] and S[k+1..n] for any 1<=k<n • Pick the highest-scoring alignment • Complexity: O(n3) time and O(n) space

  25. Variant #2

  26. Variant #3 • Specific requirement: the two sequences must be consequtive (tandem repeat) • Solution: Similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], • No penalties for gaps in the beginning of S[1..k] • No penalties for gaps in the end of S[k+1..n] • Complexity: O(n3) time and O(n) space

  27. Variant #3

  28. Variant #4 • Specific requirement: the two sequences must be consequtive and the similarity is measured between the first sequence and the reverse complement of the second - SRC (inverted repeat) • Tempting (albeit wrong) to use something in the spirit of variant #3 – will give complexity O(n3)

  29. Variant #4 • Solution: Compute the local alignment between S and SRC • Look for results on the diagonal i+j=n • AGCTAACGCGTTCGAA (n=16) • Complexity:O(n2) time, O(n) space Index 8 Index 8 

More Related