Pair-wise Sequence Alignment in Bioinformatics: Methods and Tools for Analysis

CS 4233& 5263 Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment

Outline • What’s the biological problem • Algorithm issues • Intro to dynamic programming • Global sequence alignment • Local sequence alignment • More efficient algorithms • Interpretation issues • Model gaps more accurately • Alignment statistics • Practical tools: BLAST

Evolution at the DNA level C …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… DNA evolutionary events (sequence edits): Mutation, deletion, insertion

Sequence conservation implies function next generation OK OK OK X X Still OK?

Why sequence alignment? • Conserved regions are more likely to be functional • Can be used for finding genes, regulatory elements, etc. • Similar sequences often have similar origin and function • Can be used to predict functions for new genes / proteins • Sequence alignment is one of the most widely used computational tools in biology

Global Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC S T -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC S’ T’ • Definition • An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t. • |S’| = |T’|, and (|S| = “length of S”) • removing all spaces in S’, T’ leaves S, T

What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”?

S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • The scoreof aligning (characters or spaces) x & y is σ (x,y). • Scoreof an alignment: • An optimal alignment: one with max score

Scoring Function • Sequence edits: AGGCCTC • Mutations AGGACTC • Insertions AGGGCCTC • Deletions AGG-CTC Scoring Function: Match: +m ~~~AAC~~~ Mismatch: -s ~~~A-A~~~ Gap (indel): -d

Match = 2, mismatch = -1, gap = -1 • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3

More complex scoring function • Substitution matrix • Similarity score of matching two letters a, b should reflect the probability of a, b derived from the same ancestor • It is usually defined by log likelihood ratio • Active research area. Especially for proteins. • Commonly used: PAM, BLOSUM

An example substitution matrix

How to find an optimal alignment? • A naïve algorithm: for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end output the retained alignment S = abcd A = cd T = wxyz B = xz -abc-d a-bc-d w--xyz -w-xyz

Analysis • Assume |S| = |T| = n • Cost of evaluating one alignment: ≥n • How many alignments are there: • pick n chars of S,T together • say k of them are in S • match these k to the k unpicked chars of T • Total time: • E.g., for n = 20, time is > 240 >1012 operations

Intro to Dynamic Programming

Dynamic programming • What is dynamic programming? • A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure • Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly • Two simple examples: • Computing Fibonacci numbers • Find the special shortest path in a grid

Example 1: Fibonacci numbers • 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … F(0) = 1; F(1) = 1; F(n) = F(n-1) + F(n-2) • How to compute F(n)?

F(9) F(8) F(7) F(6) F(5) F(7) F(6) F(4) F(6) F(5) F(5) F(5) F(4) F(4) F(3) A recursive algorithm function fib(n) if (n == 0 or n == 1) return 1; else return fib(n-1) + fib(n-2);

n/2 • Time complexity: • Between 2n/2 and 2n • O(1.62n), i.e. exponential • Why recursive Fib algorithm is inefficient? • Overlapping subproblems n

An iterative algorithm function fib(n) F[0] = 1; F[1] = 1; for i = 2 to n F[i] = F[i-1] + F[i-2]; Return F[n]; Time complexity: Time: O(n), space: O(n)

Example 2: shortest path in a grid S m G n Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length

Optimal substructures • Naïve algorithm: enumerate all possible paths and compare costs • Exponential number of paths • Key observation: • If a path P(S, G) is the shortest from S to G, any of its sub-path P(S,x), where x is on P(S,G), is the shortest from S to x

Proof • Proof by contradiction • If the path between P(S,x) is not the shortest, i.e., P’(S,x) < P(S,x) • Construct a new path P’(S,G) = P’(S,x) + P(x, G) • P’(S,G) < P(S,G) => P(S,G) is not the shortest • Contradiction • Therefore, P(S, x) is the shortest S x G

Recursive solution (0,0) • Index each intersection by two indices, (i, j) • Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the shortest path we wanted. • To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1) m (m, n) n F(m-1, n) + length((m-1, n), (m, n)) F(m, n) = min F(m, n-1) + length((m, n-1), (m, n))

Recursive solution F(i-1, j) + length((i-1, j), (i, j)) F(i, j) = min F(i, j-1) + length((i, j-1), (i, j)) • But: if we use recursive call, many subpaths will be recomputed for many times • Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?) (0,0) (i-1, j) (i, j-1) (i, j) m (m, n) n

Dynamic programming illustration S 9 1 2 3 0 3 12 13 15 5 3 3 3 3 2 5 2 3 5 6 8 13 15 2 3 3 9 3 4 2 3 2 7 9 11 13 16 6 2 3 7 4 6 3 3 3 13 11 14 17 20 4 6 3 1 3 2 3 2 1 17 17 17 18 20 G F(i-1, j) + length(i-1, j, i, j) F(i, j) = min F(i, j-1) + length(i, j-1, i, j)

Trackback 9 1 2 3 0 3 12 13 15 5 3 3 3 3 2 5 2 3 5 6 8 13 15 2 3 3 9 3 4 2 3 2 7 9 11 13 16 6 2 3 7 4 6 3 3 3 13 11 14 17 20 4 6 3 1 3 2 3 2 1 17 17 17 18 20

Elements of dynamic programming • Optimal sub-structures • Optimal solutions to the original problem contains optimal solutions to sub-problems • Overlapping sub-problems • Some sub-problems appear in many solutions • Memorization and reuse • Carefully choose the order that sub-problems are solved

Dynamic Programming for sequence alignment Suppose we wish to align x1……xM y1……yN Let F(i,j) = optimal score of aligning x1……xi y1……yj Scoring Function: Match: +m Mismatch: -s Gap (indel): -d

Elements of dynamic programming • Optimal sub-structures • Optimal solutions to the original problem contains optimal solutions to sub-problems • Overlapping sub-problems • Some sub-problems appear in many solutions • Memorization and reuse • Carefully choose the order that sub-problems are solved

... 1 2 i M x: j 1 2 N ... y: Optimal substructure • If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then • The alignment between x[1..i] and y[1..j] is also optimal • Easy to prove by contradiction

max Recursive solution Notice three possible cases: • xM aligns to yN ~~~~~~~ xM ~~~~~~~ yN 2. xM aligns to a gap ~~~~~~~ xM ~~~~~~~  • yN aligns to a gap ~~~~~~~  ~~~~~~~ yN m, if xM = yN F(M,N) = F(M-1, N-1) + -s, if not F(M,N) = F(M-1, N) - d F(M,N) = F(M, N-1) - d

Recursive solution • Generalize: F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d (Xi,Yj) = m if Xi = Yj, and –s otherwise • Boundary conditions: • F(0, 0) = 0. • F(0, j) = ? • F(i, 0) = ? -jd: y[1..j] aligned to gaps. -id: x[1..i] aligned to gaps.

F(i-1, j-1) F(i-1, j) 1 1 2 F(i, j-1) F(i, j) 3 What order to fill? i j

What order to fill?

Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 Optimal Alignment: F(4,3) = 2 j = 0 1 2 3

Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 Optimal Alignment: F(4,3) = 2 This only tells us the best score 1 2 3

Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 Optimal Alignment: F(4,3) = 2 AGTA ATA j = 0 1 2 3

Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

Pair-wise Sequence Alignment in Bioinformatics: Methods and Tools for Analysis