CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties

Last lecture • Local Sequence Alignment • Bounded Dynamic Programming • Linear Space Sequence Alignment

The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max

The Smith-Waterman algorithm Termination: • If we want the best local alignment… FOPT = maxi,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

Bounded Dynamic Programming • O(kM) time • O(kN) memory x1 ………………………… xM yN ………………………… y1 k

Linear-space alignment • O(M+N) memory • 2MN time M/2 k* M/2 N-k*

Homework Problem 5 hints Dot matrix for visualizing seq similarities • Seq1: x[1..m] • Seq2: y[1..n] A(i, j) = 1 if (xi, yj) = 1 A(i, j) = 1 if k=1:10((xi+k, yj+k)) > 7 A dot matrix does not do any alignment (global or local). It helps to detect strongly conserved regions. A(i, j) = 1 if k=1:20((xi+k, yj+k)) > 15

Seq1 Seq2

Today • How to model gaps more accurately? • Statistics of alignments • Where does (xi, yj)come from? • Are two aligned sequences actually related? – not today

GACGCCGAACG |||| | | || GACG-C-A-CG What’s a better alignment? GACGCCGAACG ||||| ||| GACGC---ACG Score = 8 x m – 3 x d Score = 8 x m – 3 x d • However, gaps usually occur in bunches. • During evolution, chunks of DNA may be lost entirely • Aligning genomic sequences vs. cDNAs (reverse complimentary to mRNAs)

Model gaps more accurately • Current model: • Gap of length n incurs penalty nd • General: • Convex function • E.g. (n) = c * sqrt (n)  n  n

General gap dynamic programming Initialization: same Iteration: F(i-1, j-1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k,j) – (i-k) maxk=0…j-1F(i,k) – (j-k) Termination: same Running Time: O((M+N)MN) (cubic) Space: O(NM) (linear-space algorithm not applicable)

Compromise: affine gaps (n) (n) = d + (n – 1)e | | gap gap open extension e d Match: 2 Gap open: -5 Gap extension: -1 GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG 8x2-5-2 = 9 8x2-3x5 = 1 • We want to find the optimal alignment with affine gap penalty in • O(MN) time • O(MN) or better O(M+N) memory

Allowing affine gap penalties • Still three cases • xi aligned with yj • Xi aligns to a gap • Are we continuing a gap in x? (if no, start is more expensive) • Yj aligns to a gap • Are we continuing a gap in y? (if no, start is more expensive) • We can use a finite state machine to represent the three cases as three states • The machine has two heads, reading the chars on the two strings separately • At every step, each head reads 0 or 1 char from each sequence • Depending on what it reads, goes to a different state, and produces different scores

Finite State Machine Input Output ? / ? ? / ? Ix ? / ? ? / ? F ? / ? Iy ? / ? State ? / ? F: have just read 1 char from each seq (xi aligned to yj ) Ix: have read 0 char from x. (yj aligned to a gap) Iy: have read 0 char from y (xi aligned to a gap)

Input Output (-, yj) / e (xi,yj) /  Ix (xi,yj) /  (-, yj) / d F (xi,-) / d Iy (xi,-) / e Start state (xi,yj) / 

(-, yj) / e (xi,yj) /  Ix (xi,yj) /  (-, yj) / d F (xi,-) / d Iy (xi,-) / e start state (xi,yj) /  F-F-Iy-F-Ix F-F-F-F F-Iy-F-F-Ix AAC ||| ACT AAC- | | A-CT AAC- || -ACT AAC ACT Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. Optimal alignment: find a state path to read the two sequences such that the total output score is the highest

Dynamic programming • We encode this information in three different matrices • For each element (i,j) we use three variables • F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj • Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap • Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap xi xi xi yj yj yj Iy(i, j) Ix(i, j) F(i, j)

(-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i-1, j-1) + (xi, yj) F(i, j) = max Ix(i-1, j-1) + (xi, yj) Iy(i-1, j-1) + (xi, yj) xi yj

(-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i, j-1) + d Ix(i, j) = max Ix(i, j-1) + e xi yj Ix(i, j)

(-, yj)/e (xi,yj) / Ix (xi,yj) / (-, yj) /d F (xi,-) /d Iy (xi,-)/e (xi,yj) / F(i-1, j) + d Iy(i, j) = max Iy(i-1, j) + e xi yj Iy(i, j)

F(i – 1, j – 1) F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) F(i, j – 1) + d Ix(i, j) = max Ix(i, j – 1) + e F(i – 1, j) + d Iy(i, j) = max Iy(i – 1, j) + e Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

Data dependency F i Iy Ix j i-1 i-1 j-1 j-1

Data dependency F i Ix Iy j i i j j

Data dependency • If we stack all three matrices • No cyclic dependency • Therefore, we can fill in all three matrices in order

Algorithm • for i = 1:m • for j = 1:n • Fill in F(i, j), Ix(i, j), Iy(i, j) • end end • F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N)) • Time: O(MN) • Space: O(MN) or O(N) when combine with the linear-space algorithm

Exercise • x = GCAC • y = GCC • m = 2 • s = -2 • d = -5 • e = -1

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F: aligned on both Iy: Insertion on y y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix: Insertion on x

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1,j) G C A C F(i-1,j) e=-1 d=-5 Iy(i,j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = -2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = Iy(i-1, j-1) F(i-1, j-1) G C A C (xi, yj) = 2 Ix(i-1, j-1) F(i, j) Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C x = G C A C F(i,j-1) d = -5 Ix(i,j) Ix(i,j-1) e = -1 Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C F Iy y = G C C Iy(i-1, j-1) F(i-1, j-1) Iy(i-1,j) x = F(i-1,j) (xi, yj) G C A C e Ix(i-1, j-1) d F(i, j) F(i,j-1) Iy(i,j) d Ix(i,j) Ix(i,j-1) e Ix

y = y = G C C G C C x = x = m = 2 s = -2 d = -5 e = -1 G C A C G C A C x GCAC || | GC-C y F Iy y = G C C y = G C C x = x = G C A C G C A C Ix

CS 5263 Bioinformatics