Comp. Genomics

Comp. Genomics Recitation 1

Outline • Sequence alignment • End-space free alignment • Alignment with gaps

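Alignment basic step • The last column of an alignment of x[1..i] (ending in xi = G) with y[1..j] (ending in yj = C) is one of three cases: • xi aligned with yj (G over C) • xi aligned with a gap (G over -) • a gap aligned with yj (- over C)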

Global alignment • All of x has to be aligned with all of y • Therefore, every gap is “paid for” • The alignment score is found in one cell, F(n,m) • The traceback goes all the way back to F(0,0)

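Global alignment • Input: sequences x, y of lengths n, m • Output: maximum score alignment • σ(a,b): score of aligning character a with character b (either may be a gap, -) • F(i,j): score of aligning x[1..i] with y[1..j] • Base conditions: • $F(i,0) = \sum_{k=1}^{i} \sigma(x_k, -)$ • $F(0,j) = \sum_{k=1}^{j} \sigma(-, y_k)$ • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + \sigma(x_i, y_j) \\
F(i-1,j) + \sigma(x_i, -) \\
F(i,j-1) + \sigma(-, y_j)
\end{cases}
\]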
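The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, and only the score F(n,m) is returned (no traceback).

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Return F(n, m) for global alignment with a constant per-character gap score.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    # Recurrence: the last column is (xi, yj), (xi, -) or (-, yj).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F[n][m]
```

For instance, aligning ACGT with AGT under these assumed scores gives three matches and one gap.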

Local alignment • A subset of x is aligned with a subset of y • Gaps outside the subsets are “costless” • The solution equals the maximum score cell in the DP matrix • Base conditions: • F(i,0) = 0 • F(0,j) = 0 • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + \sigma(x_i, y_j) \\
F(i-1,j) + \sigma(x_i, -) \\
F(i,j-1) + \sigma(-, y_j) \\
0
\end{cases}
\]
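A short sketch of the local variant follows. It differs from the global case only in the zero base conditions, the extra 0 option in the max, and in reporting the best cell anywhere in the matrix. The name `local_alignment_score` and the scoring values are assumptions for illustration.

```python
def local_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Return the maximum cell of the local-alignment DP matrix.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # F(i,0) = F(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                      # restart the alignment here
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best
```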

Local alignment example • Scoring: BLOSUM50 for matches and mismatches, gap: -8 • Resulting local alignment:
AWGHE
AW-HE

Overlap matches (end-space free alignment) • Something between global and local • Consider aligning a gene x to a (bacterial) genome y • Gaps at the beginning and end of x and y are costless • But all of x should be aligned • Base conditions: • F(i,0) = 0 • F(0,j) = 0 • Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + \sigma(x_i, y_j) \\
F(i-1,j) + \sigma(x_i, -) \\
F(i,j-1) + \sigma(-, y_j)
\end{cases}
\]

• The optimal solution is found in the last row or the last column (not necessarily at the bottom-right corner)
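The end-space free variant can be sketched the same way: zero base conditions, the global recurrence without the 0 option, and the answer taken as the maximum over the last row and last column. The function name and scoring values below are assumptions for the example.

```python
def overlap_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Return the best end-space free alignment score.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # leading gaps are free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trailing gaps are free: scan the last row and the last column.
    last_row = max(F[n])
    last_col = max(F[i][m] for i in range(n + 1))
    return max(last_row, last_col)
```

With a short "gene" ACG and a "genome" TTACGTT, the best cell lies in the last row but not in the bottom-right corner.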

Handling weird gaps • Affine gap: different costs for a “new” gap and an “old” (extended) gap • When adding the last column (G over C, G over -, or - over C) to an alignment of the preceding prefixes, we now care whether that alignment ended in a gap • Two new things to keep track of, so two additional matrices

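Alignment with Affine Gap Penalty • Three matrices over x[1..i] and y[1..j]: • M(i,j): xi is aligned with yj • Ix(i,j): xi is aligned with a gap (y ends in gaps: x 1...i over y 1...j---) • Iy(i,j): yj is aligned with a gap (x ends in gaps: x 1...i--- over y 1...j) • Base conditions: • M(0,0) = 0 • M(i,0) = Ix(i,0) = Wg + i·Ws • M(0,j) = Iy(0,j) = Wg + j·Ws • Recursive computation, with $W_g, W_s < 0$:

\[
M(i,j) = \max
\begin{cases}
M(i-1,j-1) + \sigma(x_i, y_j) \\
I_x(i-1,j-1) + \sigma(x_i, y_j) \\
I_y(i-1,j-1) + \sigma(x_i, y_j)
\end{cases}
\]

\[
I_x(i,j) = \max
\begin{cases}
M(i-1,j) + W_g + W_s \\
I_x(i-1,j) + W_s
\end{cases}
\]

and, symmetrically,

\[
I_y(i,j) = \max
\begin{cases}
M(i,j-1) + W_g + W_s \\
I_y(i,j-1) + W_s
\end{cases}
\]

• The optimal solution is the maximum of the relevant cells in the three matrices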
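As a rough sketch, the three-matrix scheme can be written as below. A gap of length g costs Wg + g·Ws, following the base conditions above. The function name, the match/mismatch values and the defaults Wg = -3, Ws = -1 are assumptions for illustration, and the Iy update is the mirror image of the Ix update.

```python
NEG_INF = float("-inf")

def affine_alignment_score(x, y, match=1, mismatch=-1, Wg=-3, Ws=-1):
    """Global alignment score with affine gaps (a gap of length g costs Wg + g*Ws).

    All numeric scores are illustrative assumptions.
    """
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        M[i][0] = Ix[i][0] = Wg + i * Ws
    for j in range(1, m + 1):
        M[0][j] = Iy[0][j] = Wg + j * Ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # xi aligned with yj, whatever state the previous column was in.
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # xi aligned with a gap: open a new gap or extend an old one.
            Ix[i][j] = max(M[i - 1][j] + Wg + Ws, Ix[i - 1][j] + Ws)
            # yj aligned with a gap (mirror image of Ix).
            Iy[i][j] = max(M[i][j - 1] + Wg + Ws, Iy[i][j - 1] + Ws)
    return max(M[n][m], Ix[n][m], Iy[n][m])
```

For AAAA against AA, one gap of length 2 is preferred over two gaps of length 1, which is exactly the behaviour affine scoring is meant to encourage.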

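When do constant and affine gap costs differ? • Consider two alignments of AGAGACTGACGCTTA with ATATTA:

Alignment 1:
AGAGACTGACGCTTA
ATA---------TTA

Alignment 2:
AGAGACTGACGCTTA
----A-T-A---TTA

• Constant penalty (mismatch: -5, gap: -1): alignment 1 scores -14, alignment 2 scores -9 • Affine penalty (mismatch: -5, gap open: -3, gap extend: -0.5): alignment 1 scores -12, alignment 2 scores -14.5 • These totals count matches as 0, and charge the open penalty for the first position of a gap and the extend penalty for each further position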
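The four totals on this slide can be checked with a small scoring routine. This is a sketch under the convention implied by the numbers: matches score 0, the first position of a gap costs the open penalty, and every further position costs the extend penalty. The helper name `score_alignment` is an assumption.

```python
def score_alignment(top, bottom, mismatch, gap_open, gap_extend, match=0):
    """Score a given alignment column by column.

    A constant gap penalty is the special case gap_open == gap_extend.
    """
    assert len(top) == len(bottom)
    score = 0
    prev_gap_row = None  # which row had '-' in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension; otherwise it opens.
            score += gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score

x = "AGAGACTGACGCTTA"
for aligned_y in ("ATA---------TTA", "----A-T-A---TTA"):
    constant = score_alignment(x, aligned_y, -5, -1, -1)
    affine = score_alignment(x, aligned_y, -5, -3, -0.5)
    print(aligned_y, constant, affine)
```

Under the constant penalty the many short gaps win; under the affine penalty the single long gap with one mismatch wins.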

Question • Given two sequences x and y, the fragmentation number of x,y is the minimal k such that x and y can be broken into substrings x1,x2,...,xk ; y1,y2,...,yk and every xi is a substring of the corresponding yi • Suggest an algorithm for finding the fragmentation number of two sequences

Solution • Global alignment with the following modifications: • No penalty for gaps at the ends of y • Gaps are only allowed in x (characters of x may not be skipped) • Mismatches are not allowed (score -∞) • Affine gap score, with open cost 1 and extension cost 0

Question • How do we align two sequences with a bound k on the maximal number of gaps? • Analyze the complexity

Solution • We will divide every cell in the alignment matrix into 2k sub-cells. The meaning of a sub-cell is as follows: k cells with superscript 1: k cells with superscript 2:

Solution • The update rule for sub-cells with superscript 1: • The update rule for sub-cells with superscript 2:

What about arbitrary gap functions? • Suppose the gap cost is an arbitrary function of the gap length, γ(k) • When computing F(i,j), we need to look “back” at all possible gap lengths, in both the row and the column of the cell

Alignment with arbitrary gap functions • Recursive computation:

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + \sigma(x_i, y_j) \\
\max_{k=0,\dots,i-1} \; F(k,j) + \gamma(i-k) \\
\max_{k=0,\dots,j-1} \; F(i,k) + \gamma(j-k)
\end{cases}
\]
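A direct sketch of this recurrence is given below. The slide does not state base conditions, so the code assumes F(i,0) = γ(i) and F(0,j) = γ(j) (a single gap covering the whole prefix). The function name and the match/mismatch values are also assumptions. The two inner loops over k are what make each cell cost linear rather than constant time.

```python
def general_gap_alignment_score(x, y, gamma, match=1, mismatch=-1):
    """Global alignment score where a gap of length g scores gamma(g).

    Base conditions F(i,0) = gamma(i), F(0,j) = gamma(j) are an assumption.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = gamma(i)
    for j in range(1, m + 1):
        F[0][j] = gamma(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = F[i - 1][j - 1] + s
            # x[k+1..i] aligned against one gap of length i-k.
            for k in range(i):
                best = max(best, F[k][j] + gamma(i - k))
            # y[k+1..j] aligned against one gap of length j-k.
            for k in range(j):
                best = max(best, F[i][k] + gamma(j - k))
            F[i][j] = best
    return F[n][m]
```

Passing `lambda g: -2 * g` gives a constant per-character gap penalty, and `lambda g: -3 - g` gives an affine one, so the simpler schemes are special cases of this one.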

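Complexity • Suppose the two sequences are of length n • The matrix has O(n²) cells, and each cell looks back at up to O(n) cells in its row and column • Total: O(n³) time and O(n²) space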

LCS • Longest common non-contiguous subsequence: • Use global alignment with similarity scores • +1 for match • 0 for indel • -∞ for mismatches
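The scoring scheme above translates directly into code. The sketch below uses the global alignment recurrence with +1 for a match, 0 for an indel and -∞ for a mismatch; the function name `lcs_length` is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common (non-contiguous) subsequence of a and b."""
    NEG_INF = float("-inf")
    n, m = len(a), len(b)
    # Indels score 0, so the base conditions are all zero.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else NEG_INF  # mismatches are forbidden
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j],      # indel: skip a[i-1]
                          F[i][j - 1])      # indel: skip b[j-1]
    return F[n][m]
```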

Exercise: Shortest common supersequence • A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A. • e.g., YABADABADU is a non-contiguous supersequence of BABU • Problem: Given A and B, find their shortest common supersequence

Solution • For A=“PRIDE” B=“PARADE”: • Compute LCS using global align: A=P-R-IDE B=PARA-DE • PARAIDE – Shortest common supersequence • Notice that PRDE is the longest common subsequence of A and B.
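One way to turn this idea into code is to fill the LCS table and then trace back, emitting matched characters once and unmatched characters from either sequence as they are passed. This is a sketch; the function name is an assumption, and when several shortest supersequences exist it returns just one of them, depending on how ties are broken.

```python
def shortest_common_supersequence(a, b):
    """Return one shortest common supersequence of a and b via an LCS traceback."""
    n, m = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:          # matched column: emit once
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:  # a[i-1] aligned with a gap
            out.append(a[i - 1]); i -= 1
        else:                             # b[j-1] aligned with a gap
            out.append(b[j - 1]); j -= 1
    out.extend(reversed(a[:i]))           # leftover prefix of a, if any
    out.extend(reversed(b[:j]))           # leftover prefix of b, if any
    return "".join(reversed(out))
```

The result has length |A| + |B| - |LCS(A,B)|, which is 5 + 6 - 4 = 7 for PRIDE and PARADE.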

Exercise: Finding repeats • Basic objective: find a pair of subsequences within a string x with maximum similarity • Simple (albeit wrong) idea: Find an optimal alignment of x with itself! (Why is this wrong?) • But using local alignment is still a good idea

Variant #1 • First variant: the two sequences may not overlap • Solution: Absence of overlap means that there exists an index k such that one substring is in x[1..k] and the other in x[k+1..n] • Check local alignments between x[1..k] and x[k+1..n] for all 1<=k<n • Pick the highest-scoring alignment • Complexity: O(n³) time and O(n) space
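The split-and-align idea can be sketched as follows. The local alignment keeps only two rows of the matrix, which is what gives O(n) space when only the score is needed; the outer loop over k gives the O(n³) time. Function names and scoring values are assumptions for illustration.

```python
def local_score_two_rows(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment score using two rows of the DP matrix (O(len(y)) space)."""
    m = len(y)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def best_nonoverlapping_repeat_score(x):
    """Best local alignment score between x[1..k] and x[k+1..n] over all splits k."""
    best = 0
    for k in range(1, len(x)):
        best = max(best, local_score_two_rows(x[:k], x[k:]))
    return best
```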

Variant #1, Pictorially

Variant #2 • Second variant: the two sequences must be consecutive (tandem repeat) • Solution: Similar to variant #1, but somewhat “ends-free”: seek a global alignment between x[1..k] and x[k+1..n], with • No penalties for gaps at the beginning of x[1..k] • No penalties for gaps at the end of x[k+1..n] • Complexity: O(n³) time and O(n) space
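A sketch of the tandem-repeat variant is shown below. For each split k it aligns a suffix of x[1..k] with a prefix of x[k+1..n]: skipping the start of the left part is free (first column is zero), skipping the end of the right part is free (the answer is the maximum over the last row), and everything in between is scored as in global alignment. Names and scoring values are assumptions.

```python
def suffix_prefix_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of aligning some suffix of a with some prefix of b."""
    m = len(b)
    prev = [j * gap for j in range(m + 1)]  # F(0,j): b's prefix must be paid for
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)                 # F(i,0) = 0: skipping a's start is free
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return max(prev)                        # last row: the end of b is free

def best_tandem_repeat_score(x):
    """Best suffix/prefix score between x[1..k] and x[k+1..n] over all splits k."""
    best = 0
    for k in range(1, len(x)):
        best = max(best, suffix_prefix_score(x[:k], x[k:]))
    return best
```

In TTACGACGTT, for example, the split after the first ACG aligns the two adjacent copies against each other.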

Variant #2, Pictorially
