Comp. Genomics Recitation 1
Outline • Sequence alignment • End-space free alignment • Alignment with gaps
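Alignment basic step
• The last column of an alignment of $x[1..i]$ (ending in $x_i$ = G) with $y[1..j]$ (ending in $y_j$ = C) is one of three cases:
• $x_i$ aligned with $y_j$ (G over C)
• $x_i$ aligned with a gap (G over -)
• A gap aligned with $y_j$ (- over C)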
Global alignment
• All of x has to be aligned with all of y
• Therefore, every gap is “paid for”
• The solution score is found in one cell: the alignment score is in the bottom-right cell F(n,m), and the traceback goes all the way back to F(0,0)
Global alignment
• Input: sequences x, y
• Output: maximum-score alignment
• $F(i,j)$: score of aligning $x[1..i]$ with $y[1..j]$
• Base conditions:
$$F(i,0) = \sum_{k=1}^{i} \sigma(x_k, -), \qquad F(0,j) = \sum_{k=1}^{j} \sigma(-, y_k)$$
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$
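The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function names, the +1/-1 substitution score, and the single linear gap score `gap` (standing in for both $\sigma(x_i,-)$ and $\sigma(-,y_j)$) are assumptions made for the example.

```python
def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def global_align(x, y, gap=-2):
    """Fill F for global alignment, then trace back from F[n][m] to F[0][0]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base condition F(i,0)
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):          # base condition F(0,j)
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),  # x_i over y_j
                F[i - 1][j] + gap,                            # x_i over a gap
                F[i][j - 1] + gap,                            # a gap over y_j
            )
    # Traceback "all the way": recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `global_align("AC", "A")` with these assumed scores returns the score together with the two gapped strings. The traceback compares integer scores for equality, so it should be adapted if floating-point scores are used.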
Local alignment
• A substring of x is aligned with a substring of y
• Gaps outside the aligned substrings are “costless”
• The solution is the maximum-score cell in the DP matrix
• Base conditions:
$$F(i,0) = 0, \qquad F(0,j) = 0$$
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \\ 0 \end{cases}$$
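A score-only sketch of the local recurrence, assuming a toy +1/-1 substitution score and a linear gap score (both are illustrative choices, as is the function name):

```python
def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def local_alignment_score(x, y, gap=-2):
    """Return the maximum cell of the local-alignment DP matrix."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                            # start a new alignment here
                F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
            best = max(best, F[i][j])
    return best
```

The only differences from the global version are the zero base conditions, the extra `0` option in the maximum, and reading the answer from the best cell rather than the corner.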
Local alignment example
• Best local alignment:
AWGHE
AW-HE
• Scoring: BLOSUM50 for matches and mismatches; gap: -8
Overlap matches (end-space free alignment)
• Something between global and local
• Consider aligning a gene x to a (bacterial) genome y
• Gaps at the beginning and end of x and y are costless
• But all of x should be aligned
• Base conditions:
$$F(i,0) = 0, \qquad F(0,j) = 0$$
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$
• The optimal solution is found in the last row or last column (not necessarily at the bottom-right corner)
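A score-only sketch of the end-space free variant, under the same assumed toy scoring (+1/-1 substitution, linear gap score); the function name is hypothetical:

```python
def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def overlap_alignment_score(x, y, gap=-2):
    """End-space free alignment: leading and trailing gaps are not charged."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # F(i,0) = F(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    last_row = max(F[n])                            # x fully consumed
    last_col = max(F[i][m] for i in range(n + 1))   # y fully consumed
    return max(last_row, last_col)
```

Unlike the local version there is no `0` option inside the recurrence, so gaps in the middle of the alignment are still paid for; only the ends are free.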
Handling weird gaps
• Affine gap: different costs for “new” and “old” gaps (opening a gap vs. extending one)
• When choosing the last column ($x_i$ over $y_j$, $x_i$ over a gap, or a gap over $y_j$), we now care whether the previous column already ended in a gap
• Two new things to keep track of: two additional matrices
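Alignment with Affine Gap Penalty
• $M(i,j)$: best score of aligning $x[1..i]$ with $y[1..j]$ when $x_i$ is aligned with $y_j$
• $I_x(i,j)$: best score when $x_i$ is aligned with a gap
• $I_y(i,j)$: best score when $y_j$ is aligned with a gap
• Base conditions:
$$M(0,0) = 0, \qquad M(i,0) = I_x(i,0) = W_g + iW_s, \qquad M(0,j) = I_y(0,j) = W_g + jW_s$$
• Recursive computation, with $W_g, W_s < 0$:
$$M(i,j) = \max \begin{cases} M(i-1,j-1) + \sigma(x_i, y_j) \\ I_x(i-1,j-1) + \sigma(x_i, y_j) \\ I_y(i-1,j-1) + \sigma(x_i, y_j) \end{cases}$$
$$I_x(i,j) = \max \begin{cases} M(i-1,j) + W_g + W_s \\ I_x(i-1,j) + W_s \end{cases}$$
• $I_y(i,j)$ is computed symmetrically from $M(i,j-1)$ and $I_y(i,j-1)$
• The optimal solution is the maximum of the relevant cells in the three matrices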
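The three-matrix scheme can be sketched as below for the global case. The sketch follows the slide's convention that a gap of length k costs $W_g + kW_s$; the symmetric $I_y$ update, the toy +1/-1 substitution score, the default penalties, and the function name are assumptions for illustration.

```python
NEG_INF = float("-inf")

def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def affine_global_score(x, y, w_open=-3, w_ext=-1):
    """Global alignment score with affine gaps: a gap of length k costs w_open + k * w_ext."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned with y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned with a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned with a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        M[i][0] = Ix[i][0] = w_open + i * w_ext
    for j in range(1, m + 1):
        M[0][j] = Iy[0][j] = w_open + j * w_ext
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sigma(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] + w_open + w_ext,  # open a new gap
                           Ix[i - 1][j] + w_ext)          # extend an old gap
            Iy[i][j] = max(M[i][j - 1] + w_open + w_ext,
                           Iy[i][j - 1] + w_ext)
    return max(M[n][m], Ix[n][m], Iy[n][m])
```

For example, aligning `"AAAGGGTTT"` with `"AAATTT"` under the default penalties pays for one gap of length 3 (cost -6) against six matches.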
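When do constant and affine gap costs differ?
• Consider x = AGAGACTGACGCTTA and y = ATATTA, and two candidate alignments:
Alignment 1:
AGAGACTGACGCTTA
ATA---------TTA
Alignment 2:
AGAGACTGACGCTTA
----A-T-A---TTA
• Constant penalty (mismatch: -5, gap: -1): alignment 1 scores -14, alignment 2 scores -9
• Affine penalty (mismatch: -5, gap open: -3, gap extend: -0.5): alignment 1 scores -12, alignment 2 scores -14.5
• So the constant penalty prefers alignment 2 (many short gaps), while the affine penalty prefers alignment 1 (one long gap)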
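The four scores in this example can be rechecked with a short script. Two conventions are inferred from the slide's numbers rather than stated on it: matches score 0, and the gap-open cost covers the first position of a gap, so a gap of length k costs open + (k-1)·extend. The function names are made up for the sketch.

```python
def constant_score(a, b, mismatch=-5, gap=-1):
    """Score a given alignment with a constant per-position gap penalty (matches score 0)."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            score += gap
        elif p != q:
            score += mismatch
    return score

def affine_score(a, b, mismatch=-5, gap_open=-3, gap_extend=-0.5):
    """Score a given alignment; the first position of a gap pays gap_open, later ones gap_extend."""
    score = 0
    prev_gap_row = None  # row ("a" or "b") holding the gap in the previous column, if any
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            row = "a" if p == "-" else "b"
            score += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            if p != q:
                score += mismatch
            prev_gap_row = None
    return score

x = "AGAGACTGACGCTTA"
alignment_1 = "ATA---------TTA"
alignment_2 = "----A-T-A---TTA"
for name, y in (("alignment 1", alignment_1), ("alignment 2", alignment_2)):
    print(name, constant_score(x, y), affine_score(x, y))
```

The script only scores the two alignments shown; it does not search for the optimal alignment under either penalty.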
Question • Given two sequences x and y, the fragmentation number of x,y is the minimal k such that x and y can be broken into substrings x1,x2,...,xk ; y1,y2,...,yk and every xi is a substring of the corresponding yi • Suggest an algorithm for finding the fragmentation number of two sequences
Solution
• Global alignment with the following modifications:
• No penalty for gaps at the ends of y
• Gaps are only allowed in x (characters of x may not be skipped)
• Mismatches are not allowed (score -∞)
• Affine gap score, with open cost 1 and extension cost 0
• The fragmentation number is then the minimal total gap cost plus 1 (each opened gap separates two consecutive fragments)
Question • How do we align two sequences with a bound k on the maximal number of gaps? • Analyze the complexity
Solution
• We will divide every cell in the alignment matrix into 2k sub-cells
• The meaning of a sub-cell is as follows:
• k cells with superscript 1:
• k cells with superscript 2:
Solution • The update rule for sub-cells with superscript 1: • The update rule for sub-cells with superscript 2:
What about arbitrary gap functions?
• Suppose the gap cost is an arbitrary function of the gap length k, γ(k)
• When computing F(i,j), we need to look at all possible gap lengths “back” from the current cell
Alignment with arbitrary gap functions
• Recursive computation:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ \max_{k=0,\dots,i-1} F(k,j) + \gamma(i-k) \\ \max_{k=0,\dots,j-1} F(i,k) + \gamma(j-k) \end{cases}$$
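Complexity
• Suppose the two sequences are of length n
• There are $O(n^2)$ cells and each takes $O(n)$ time to compute, so the total running time is $O(n^3)$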
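A direct sketch of the arbitrary-gap recurrence, to make the extra inner loops visible. The toy +1/-1 substitution score, the function name, and passing the gap function as a Python callable `gamma` are assumptions for illustration; the boundary cells are filled by the same recurrence starting from F(0,0) = 0.

```python
NEG_INF = float("-inf")

def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def general_gap_score(x, y, gamma):
    """Global alignment score where a gap of length k costs gamma(k) (a negative number)."""
    n, m = len(x), len(y)
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                best = F[i - 1][j - 1] + sigma(x[i - 1], y[j - 1])
            for k in range(i):   # gap of length i-k ending at x_i
                best = max(best, F[k][j] + gamma(i - k))
            for k in range(j):   # gap of length j-k ending at y_j
                best = max(best, F[i][k] + gamma(j - k))
            F[i][j] = best
    return F[n][m]
```

With a linear `gamma`, such as `lambda k: -2 * k`, this reduces to ordinary global alignment; the two loops over `k` are what add the extra factor of n to the running time.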
LCS
• Longest common non-contiguous subsequence:
• Use global alignment with similarity scores:
• +1 for match
• 0 for indel
• -∞ for mismatches
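The scoring scheme above can be plugged straight into the global-alignment recurrence. A minimal sketch (the function name is hypothetical):

```python
NEG_INF = float("-inf")

def lcs_length(a, b):
    """Length of the longest common subsequence via global alignment scores."""
    def sigma(p, q):
        return 1 if p == q else NEG_INF   # +1 for a match, mismatches forbidden

    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # indels cost 0, so the borders stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]),
                F[i - 1][j],      # indel, score 0
                F[i][j - 1],      # indel, score 0
            )
    return F[n][m]
```

Because mismatched columns score -∞, every aligned pair in an optimal alignment is a match, and the score counts the matched characters.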
Exercise: Shortest common supersequence
• A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A
• e.g., YABADABADU is a non-contiguous supersequence of BABU
• Problem: Given A and B, find their shortest common supersequence
Solution
• For A=“PRIDE”, B=“PARADE”:
• Compute the LCS using global alignment:
A = P-R-IDE
B = PARA-DE
• Reading off one character per column gives PARAIDE, the shortest common supersequence
• Notice that PRDE is the longest common subsequence of A and B
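One way to turn this into code is to fill the LCS table and emit one character per alignment column during the traceback. This is a sketch with a hypothetical function name; when several optimal alignments exist, the tie-breaking may produce a different supersequence of the same length than the one on the slide.

```python
def shortest_common_supersequence(a, b):
    """Build a shortest common supersequence from the LCS table of a and b."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []  # built back to front
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:          # matched column: emit the shared character once
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:  # column with a gap in b: emit a's character
            out.append(a[i - 1]); i -= 1
        else:                             # column with a gap in a: emit b's character
            out.append(b[j - 1]); j -= 1
    out.extend(reversed(a[:i]))           # whatever prefix is left over
    out.extend(reversed(b[:j]))
    return "".join(reversed(out))
```

The result always has length |A| + |B| - |LCS(A,B)|, which for PRIDE and PARADE is 5 + 6 - 4 = 7.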
Exercise: Finding repeats • Basic objective: find a pair of subsequences within a string x with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of x with itself! (Why is this wrong?) • But using local alignment is still a good idea
Variant #1
• First variant: the two sequences may not overlap
• Solution: Absence of overlap means that there exists an index k such that one substring is in x[1..k] and the other in x[k+1..n]
• Check local alignments between x[1..k] and x[k+1..n] for all 1 <= k < n
• Pick the highest-scoring alignment
• Complexity: $O(n^3)$ time and $O(n)$ space
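A score-only sketch of this procedure. The local alignment keeps just two rows, which is where the linear space bound comes from; the toy +1/-1 substitution score, the linear gap score, and the function names are assumptions for illustration.

```python
def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def local_score(x, y, gap=-2):
    """Local alignment score using only two rows of the DP matrix."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            cur[j] = max(0,
                         prev[j - 1] + sigma(x[i - 1], y[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def best_nonoverlapping_repeat(x, gap=-2):
    """Try every split point k and keep the best local alignment of x[:k] against x[k:]."""
    best_score, best_k = 0, None
    for k in range(1, len(x)):
        s = local_score(x[:k], x[k:], gap)
        if s > best_score:
            best_score, best_k = s, k
    return best_score, best_k
```

The function returns the best score and the first split point that achieves it; recovering the two substrings themselves would need a traceback, which requires more than two rows (or a divide-and-conquer scheme).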
Variant #2
• Second variant: the two sequences must be consecutive (tandem repeat)
• Solution: Similar to variant #1, but somewhat “ends-free”: seek a global alignment between x[1..k] and x[k+1..n], with
• No penalties for gaps at the beginning of x[1..k]
• No penalties for gaps at the end of x[k+1..n]
• Complexity: $O(n^3)$ time and $O(n)$ space
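A score-only sketch of the tandem-repeat variant. It reads the two free-gap rules as "a suffix of x[1..k] is aligned with a prefix of x[k+1..n]", which is an interpretation of the slide; the toy +1/-1 substitution score, the linear gap score, and the function names are likewise assumptions.

```python
def sigma(a, b):
    # Assumed toy substitution score: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def suffix_prefix_score(a, b, gap=-2):
    """Best score of aligning some suffix of a with some prefix of b (two-row DP)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: gaps against the start of b are paid
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)                 # cur[0] = 0: skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            cur[j] = max(prev[j - 1] + sigma(a[i - 1], b[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
        prev = cur
    return max(prev)  # last row: a is consumed to its end, the rest of b is free

def best_tandem_repeat(x, gap=-2):
    """Try every split point k; the two copies must meet exactly at position k."""
    best_score, best_k = 0, None
    for k in range(1, len(x)):
        s = suffix_prefix_score(x[:k], x[k:], gap)
        if s > best_score:
            best_score, best_k = s, k
    return best_score, best_k
```

Forcing the alignment to end at the last character of the first part and start at the first character of the second part is what makes the two copies adjacent.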