CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 4: Local Sequence Alignment, More Efficient Sequence Alignment Algorithms

Roadmap • Review of last lecture • Local sequence alignment • More efficient sequence alignment algorithms

Given a scoring scheme, • Match: m • Mismatch: -s • Gap: -d • We can easily compute an optimal alignment by dynamic programming

Look at any column of an alignment between two sequences X = x1x2…xM, Y = y1y1…yN • Only three cases: • xi is aligned to yj • xi is aligned to a gap • yj is aligned to a gap F(i-1, j-1) +  (xi, yj) F(i, j) = max F(i-1, j) - d F(i, j-1) - d

Example F(i,j) j = 0 1 2 3 4 i = 0 A A A A G - G - T T T T A A A A 1 2 3

Equivalent graph problem S1 = G A T A (0,0) : a gap in the 2nd sequence : a gap in the 1st sequence : match / mismatch 1 1 S2 = A 1 T Value on vertical/horizontal line: -d Value on diagonal: m or -s 1 1 A (3,4) • Number of steps: length of the alignment • Path length: alignment score • Optimal alignment: find the longest path from (0, 0) to (3, 4) • General longest path problem cannot be found with DP. Longest path on this graph can be found by DP since no cycle is possible.

Variants of Needleman-Wunsch alg • LCS: longest common subsequence • No penalty for gaps or mutations • Change: score function • Overlapping variants • No penalty for starting/ending gaps • Change: initial / termination step • Other variants • cDNA-genome alignment

Local alignment

The local alignment problem Given two strings X = x1……xM, Y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum e.g. X = abcxdex X’ = cxde Y = xxxcde Y’ = c-de x y

Why local alignment • Conserved regions may be a small part of the whole • Global alignment might miss them if flanking “junk” outweighs similar regions • Genes are shuffled between genomes C D B A D B A C

Naïve algorithm for all substrings X’ of X and Y’ of Y Align X’ & Y’ via dynamic programming Retain pair with max value end ; Output the retained pair • Time: O(n2) choices for A, O(m2) for B, O(nm) for DP, so O(n3m3 ) total.

Reminder • The overlap detection algorithm • We do not give penalty to gaps in the ends Free gap Free gap

The local alignment idea • Do not penalize the unaligned regions (gaps or mismatches) • The alignment can start anywhere and ends anywhere • Strategy: whenever we get to some low similarity region (negative score), we restart a new alignment • By resetting alignment score to zero

The Smith-Waterman algorithm Initialization: F(0, j) = F(i, 0) = 0 0 F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + (xi, yj) Iteration: F(i, j) = max

The Smith-Waterman algorithm Termination: • If we want the best local alignment… FOPT = maxi,j F(i, j) • If we want all local alignments scoring > t For all i, j find F(i, j) > t, and trace back

Match: 2 Mismatch: -1 Gap: -1

Trace back Match: 2 Mismatch: -1 Gap: -1

Trace back Match: 2 Mismatch: -1 Gap: -1 cxde | || c-de x-de | || xcde

No negative values in local alignment DP array • Optimal local alignment will never have a gap on either end • Local alignment: “Smith-Waterman” • Global alignment: “Needleman-Wunsch”

Analysis • Time: • O(MN) for finding the best alignment • Time to report all alignments depends on the number of sub-opt alignments • Memory: • O(MN) • O(M+N) possible

More efficient alignment algorithms

Given two sequences of length M, N • Time: O(MN) • ok • Space: O(MN) • bad • 1Mb seq x 1Mb seq = 1000G memory • Can we do better?

Bounded alignment Good alignment should appear near the diagonal

Bounded Dynamic Programming If we know that x and y are very similar Assumption: # gaps(x, y) < k xi Then, | implies | i – j | < k yj

Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ (xi, yj) F(i, j) = max F(i, j – 1) – d, if j > i – k F(i – 1, j) – d, if j < i + k Termination: same x1 ………………………… xM yN ………………………… y1 k

Analysis • Time: O(kM) << O(MN) • Space: O(kM) with some tricks => M M 2k 2k

Given two sequences of length M, N • Time: O(MN) • ok • Space: O(MN) • bad • 1mb seq x 1mb seq = 1000G memory • Can we do better?

Linear space algorithm • If all we need is the alignment score but not the alignment, easy! We only need to keep two rows (You only need one row, with a little trick) But how do we get the alignment?

Linear space algorithm • When we finish, we know how we have aligned the ends of the sequences XM YN Naïve idea: Repeat on the smaller subproblem F(M-1, N-1) Time complexity: O((M+N)(MN))

(0, 0) M/2 (M, N) Key observation: optimal alignment (longest path) must use an intermediate point on the M/2-th row. Call it (M/2, k), where k is unknown.

(0,0) • Longest path from (0, 0) to (6, 6) is max_k (LP(0,0,3,k) + LP(6,6,3,k)) (3,2) (3,0) (3,4) (3,6) (6,6)

Hirschberg’s idea • Divide and conquer! Y X Forward algorithm Align x1x2…xM/2 with Y M/2 F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

Backward Algorithm Y X Backward algorithm Align reverse(xM/2+1…xM) with reverse(Y) M/2 B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(ykyk+1…yN )

Linear-space alignment Using 2 (4) rows of space, we can compute for k = 1…N, F(M/2, k), B(M/2, k) M/2

Linear-space alignment Now, we can find k* maximizing F(M/2, k) + B(M/2, k) Also, we can trace the path exiting column M/2 from k* Conclusion: In O(NM) time, O(N) space, we found optimal alignment path at row M/2

Linear-space alignment • Iterate this procedure to the two sub-problems! M/2 k* M/2 N-k*

Analysis • Memory: O(N) for computation, O(N+M) to store the optimal alignment • Time: • MN for first iteration • k M/2 + (N-k) M/2 = MN/2 for second • … k M/2 M/2 N-k

MN MN/2 MN/4 • MN + MN/2 + MN/4 + MN/8 + … • = MN (1 + ½ + ¼ + 1/8 + 1/16 + …) • = 2MN = O(MN) MN/8

CS 5263 Bioinformatics