CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 5: Local Sequence Alignment Algorithms

Poll • Who have learned and still remember Finite State Machine/Automata, regular grammar, and context-free grammar?

Roadmap • Review of last lecture • Local Sequence Alignment • Statistics of sequence alignment • Substitution matrix • Significance of alignment

Bounded Dynamic Programming • O(kM) time • O(kM) memory • Possibly O(M+k) x1 ………………………… xM yN ………………………… y1 k

Linear-space alignment • O(M+N) memory • 2MN time M/2 k* M/2 N-k*

Graph representation of seq alignment (0,0) -1 -1 -1 -1 -1 -1 1 1 1 1 (3,4) An optimal alignment is a longest path from (0, 0) to (m,n) on the alignment graph

Question • If I change the scoring scheme, will it change the alignment? • Match = 1, mismatch = gap = -2 || v • Match = 2, mismatch = gap = -1? • Answer: Yes

Proof • Let F1 be the score of an optimal alignment under the scoring scheme • Match = m > 0 • Mismatch = s < 0 • Gap = d < 0 • Let a1, b1, c1 be the number of matches, mismatches, and gaps in the alignment • F1 = a1m + b1s + c1d

Proof (cont’) • Let F2 be the score of a sub-optimal alignment under the same scoring scheme • Let a2, b2, c2 be the number of matches, mismatches, and gaps in the alignment • F2 = a2m + b2s + c2d • Let F1 = F2 + k, where k > 0

Proof (cont’) • Now we change the scoring scheme, so that • Match = m + 1 • Mismatch = s + 1 • Gap = d + 1

length of alignment 1 length of alignment 2 Proof (cont’) • The new scores for the two alignments become: F1’= a1 * (m+1) + b1 * (s + 1) + c1 * (d + 1) = a1m + b1s + c1d + (a1+b1+c1) = F1 + (a1+b1+c1) = F1 + L1 F2’ = a2 * (m+1) + b2 * (s + 1) + c2 * (d + 1) = F2 + (a2+b2+c2) = F2 + L2

Proof (cont’) • F1’ – F2’ = F1 – F2 + (a1+b1+c1) – (a2+b2+c2) = k + (a1+b1+c1) – (a2+b2+c2) = k + L1 – L2 In order for F1’ < F2’, we need to have: k + L1 – L2 < 0, i.e. L2 – L1 > k Length of alignment 1 Length of alignment 2

Proof (cont’) • This means, if under the original scoring scheme, F1 is greater than F2 by k, but the length of alignment 2 is at least (k+1) greater than that of alignment 1, F2’ will be greater than F1’ under the new scoring scheme. • We only need to show one example that it is possible to find such two alignments

d F1 = 2m + 3s F2 = 3m + 4d m m s d m d m s s d

d F1 = 2m + 3s F2 = 3m + 4d m m s d m = 1, s = d = –2 F1 = 2 – 6 = –4 F2 = 3 – 8 = –5 F1 > F2 m d m s s d

d F1 = 2m + 3s F2 = 3m + 4d m m s d m = 2, s = d = – 1 F1’ = 4 – 3 = 1 F2’ = 6 – 4 = 2 F2’ > F1’ m d m s s d

A A C G A AACAG | | ATCGT F1 = 2x1-3x2 = -4 F1’ = 2x2 – 3x1 = 1 m m A T m C AA-CAG- | | | -ATC-GT F2 = 3x1 – 4x2 = -5 F2’ = 3x2 – 4x1 = 2 m G T

On the other hand, if we had doubled our scores, such that m’ = 2m, s’ = 2s d’ = 2d • F1’ = 2F1 • F2’ = 2F2 • Our alignment won’t be changed

Today • How to model gaps more accurately? • Local sequence alignment • Statistics of alignment

GACGCCGAACG |||| | | || GACG-C-A-CG What’s a better alignment? GACGCCGAACG ||||| ||| GACGC---ACG Score = 8 x m – 3 x d Score = 8 x m – 3 x d • However, gaps usually occur in bunches. • During evolution, chunks of DNA may be lost entirely • Aligning genomic sequence vs. cDNA (reverse complimentary to mRNA)

Model gaps more accurately • Current model: • Gap of length n incurs penalty nd • General: • Convex function • E.g. (n) = c * sqrt (n)  n  n

General gap dynamic programming Initialization: same Iteration: F(i-1, j-1) + s(xi, yj) F(i, j) = max maxk=0…i-1F(k,j) – (i-k) maxk=0…j-1F(i,k) – (j-k) Termination: same Running Time: O(N2M) (cubic) Space: O(NM) (linear-space algorithm not applicable)

Compromise: affine gaps (n) (n) = d + (n – 1)e | | gap gap open extension e d Match: 2 Gap open: 5 Gap extension: 1 GACGCCGAACG ||||| ||| GACGC---ACG GACGCCGAACG |||| | | || GACG-C-A-CG 8x2-5-2 = 9 8x2-3x5 = 1

Additional states • The amount of state needed increases • In scoring a single entry in our matrix, we need remember an extra piece of information • Are we continuing a gap in x? (if no, start is more expensive) • Are we continuing a gap in y? (if no, start is more expensive) • Are we continuing from a match between xi and yj?

Finite State Automaton e Xi aligned to a gap  d Xi and Yj aligned  d Yj aligned to a gap e 

Dynamic programming • We encode this information in three different matrices • For each element (i,j) we use three variables • F(i,j): best alignment of x1..xi & y1..yj if xi aligns to yj • Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap • Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap

d F(i – 1, j – 1) F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) F(i, j – 1) – d Ix(i, j) = max Iy(i, j – 1) – d Ix(i, j – 1) – e F(i – 1, j) – d Iy(i, j) = max Ix(i – 1, j) – d Iy(i – 1, j) – e Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

F Iy Ix

F Ix Iy

If we stack all three matrices • No cyclic dependency • We can fill in all three matrices in order

Algorithm • for i = 1:m • for j = 1:n • Fill in F(i, j), Ix(i, j), Iy(i, j) • end end • F(M, N) = max (F(M, N), Ix(M, N), Iy(M, N)) • Time: O(MN) • Space: O(MN) or O(N) when combine with the linear-space algorithm

To simplify F(i – 1, j – 1) + (xi, yj) F(i, j) = max I(i – 1, j – 1) + (xi, yj) F(i, j – 1) – d I(i, j) = max I(i, j – 1) – e F(i – 1, j) – d I(i – 1, j) – e I(i, j): best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap This is possible because no alternating gaps allowed

To summarize • Global alignment • Basic algorithm: Needleman-Wunsch • Variants: • Overlapping detection • Longest common subsequences • Achieved by varying initial conditions or scoring • Bounded DP (pruning search space) • Linear space (divide-and-conquer) • Affine gap penalty

Local alignment

The local alignment problem Given two strings X = x1……xM, Y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de x y

Why local alignment • Conserved regions may be a small part of the whole • “Active site” of a protein • Scattered genes or exons among “junks” • Don’t have whole sequence • Global alignment might miss them if flanking “junk” outweighs similar regions

Genes are shuffled between genomes C D B A D B A C B C A D B D C A

Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total.

Reminder • The overlap detection algorithm • We do not give penalty to gaps in the ends Free gap Free gap

Similar here • We are free of penalty for the unaligned regions

The big idea • Whenever we get to some bad region (negative score), we ignore the previous alignment • Reset score to zero

The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max

The Smith-Waterman algorithm Termination: • If we want the best local alignment… FOPT = maxi,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

The correctness of the algorithm can be proved by induction using the alignment graph 0 -10 0 100

Match: 2 Mismatch: -1 Gap: -1

CS 5263 Bioinformatics