1 / 195

CS 4233& 5263 Bioinformatics

CS 4233& 5263 Bioinformatics. Lectures 3-6: Pair-wise Sequence Alignment. Outline. What’s the biological problem Algorithm issues Intro to dynamic programming Global sequence alignment Local sequence alignment More efficient algorithms Interpretation issues Model gaps more accurately

duaneh
Download Presentation

CS 4233& 5263 Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 4233& 5263 Bioinformatics Lectures 3-6: Pair-wise Sequence Alignment

  2. Outline • What’s the biological problem • Algorithm issues • Intro to dynamic programming • Global sequence alignment • Local sequence alignment • More efficient algorithms • Interpretation issues • Model gaps more accurately • Alignment statistics • Practical tools: BLAST

  3. Evolution at the DNA level C …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… DNA evolutionary events (sequence edits): Mutation, deletion, insertion

  4. Sequence conservation implies function next generation OK OK OK X X Still OK?

  5. Why sequence alignment? • Conserved regions are more likely to be functional • Can be used for finding genes, regulatory elements, etc. • Similar sequences often have similar origin and function • Can be used to predict functions for new genes / proteins • Sequence alignment is one of the most widely used computational tools in biology

  6. Global Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC S T -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC S’ T’ • Definition • An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t. • |S’| = |T’|, and (|S| = “length of S”) • removing all spaces in S’, T’ leaves S, T

  7. What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”?

  8. S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC • The scoreof aligning (characters or spaces) x & y is σ (x,y). • Scoreof an alignment: • An optimal alignment: one with max score

  9. Scoring Function • Sequence edits: AGGCCTC • Mutations AGGACTC • Insertions AGGGCCTC • Deletions AGG-CTC Scoring Function: Match: +m ~~~AAC~~~ Mismatch: -s ~~~A-A~~~ Gap (indel): -d

  10. Match = 2, mismatch = -1, gap = -1 • Score = 3 x 2 – 2 x 1 – 1 x 1 = 3

  11. More complex scoring function • Substitution matrix • Similarity score of matching two letters a, b should reflect the probability of a, b derived from the same ancestor • It is usually defined by log likelihood ratio • Active research area. Especially for proteins. • Commonly used: PAM, BLOSUM

  12. An example substitution matrix

  13. How to find an optimal alignment? • A naïve algorithm: for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end output the retained alignment S = abcd A = cd T = wxyz B = xz -abc-d a-bc-d w--xyz -w-xyz

  14. Analysis • Assume |S| = |T| = n • Cost of evaluating one alignment: ≥n • How many alignments are there: • pick n chars of S,T together • say k of them are in S • match these k to the k unpicked chars of T • Total time: • E.g., for n = 20, time is > 240 >1012 operations

  15. Intro to Dynamic Programming

  16. Dynamic programming • What is dynamic programming? • A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure • Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly • Two simple examples: • Computing Fibonacci numbers • Find the special shortest path in a grid

  17. Example 1: Fibonacci numbers • 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … F(0) = 1; F(1) = 1; F(n) = F(n-1) + F(n-2) • How to compute F(n)?

  18. F(9) F(8) F(7) F(6) F(5) F(7) F(6) F(4) F(6) F(5) F(5) F(5) F(4) F(4) F(3) A recursive algorithm function fib(n) if (n == 0 or n == 1) return 1; else return fib(n-1) + fib(n-2);

  19. n/2 • Time complexity: • Between 2n/2 and 2n • O(1.62n), i.e. exponential • Why recursive Fib algorithm is inefficient? • Overlapping subproblems n

  20. An iterative algorithm function fib(n) F[0] = 1; F[1] = 1; for i = 2 to n F[i] = F[i-1] + F[i-2]; Return F[n]; Time complexity: Time: O(n), space: O(n)

  21. Example 2: shortest path in a grid S m G n Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length

  22. Optimal substructures • Naïve algorithm: enumerate all possible paths and compare costs • Exponential number of paths • Key observation: • If a path P(S, G) is the shortest from S to G, any of its sub-path P(S,x), where x is on P(S,G), is the shortest from S to x

  23. Proof • Proof by contradiction • If the path between P(S,x) is not the shortest, i.e., P’(S,x) < P(S,x) • Construct a new path P’(S,G) = P’(S,x) + P(x, G) • P’(S,G) < P(S,G) => P(S,G) is not the shortest • Contradiction • Therefore, P(S, x) is the shortest S x G

  24. Recursive solution (0,0) • Index each intersection by two indices, (i, j) • Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the shortest path we wanted. • To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1) m (m, n) n F(m-1, n) + length((m-1, n), (m, n)) F(m, n) = min F(m, n-1) + length((m, n-1), (m, n))

  25. Recursive solution F(i-1, j) + length((i-1, j), (i, j)) F(i, j) = min F(i, j-1) + length((i, j-1), (i, j)) • But: if we use recursive call, many subpaths will be recomputed for many times • Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?) (0,0) (i-1, j) (i, j-1) (i, j) m (m, n) n

  26. Dynamic programming illustration S 9 1 2 3 0 3 12 13 15 5 3 3 3 3 2 5 2 3 5 6 8 13 15 2 3 3 9 3 4 2 3 2 7 9 11 13 16 6 2 3 7 4 6 3 3 3 13 11 14 17 20 4 6 3 1 3 2 3 2 1 17 17 17 18 20 G F(i-1, j) + length(i-1, j, i, j) F(i, j) = min F(i, j-1) + length(i, j-1, i, j)

  27. Trackback 9 1 2 3 0 3 12 13 15 5 3 3 3 3 2 5 2 3 5 6 8 13 15 2 3 3 9 3 4 2 3 2 7 9 11 13 16 6 2 3 7 4 6 3 3 3 13 11 14 17 20 4 6 3 1 3 2 3 2 1 17 17 17 18 20

  28. Elements of dynamic programming • Optimal sub-structures • Optimal solutions to the original problem contains optimal solutions to sub-problems • Overlapping sub-problems • Some sub-problems appear in many solutions • Memorization and reuse • Carefully choose the order that sub-problems are solved

  29. Dynamic Programming for sequence alignment Suppose we wish to align x1……xM y1……yN Let F(i,j) = optimal score of aligning x1……xi y1……yj Scoring Function: Match: +m Mismatch: -s Gap (indel): -d

  30. Elements of dynamic programming • Optimal sub-structures • Optimal solutions to the original problem contains optimal solutions to sub-problems • Overlapping sub-problems • Some sub-problems appear in many solutions • Memorization and reuse • Carefully choose the order that sub-problems are solved

  31. ... 1 2 i M x: j 1 2 N ... y: Optimal substructure • If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then • The alignment between x[1..i] and y[1..j] is also optimal • Easy to prove by contradiction

  32. max Recursive solution Notice three possible cases: • xM aligns to yN ~~~~~~~ xM ~~~~~~~ yN 2. xM aligns to a gap ~~~~~~~ xM ~~~~~~~  • yN aligns to a gap ~~~~~~~  ~~~~~~~ yN m, if xM = yN F(M,N) = F(M-1, N-1) + -s, if not F(M,N) = F(M-1, N) - d F(M,N) = F(M, N-1) - d

  33. Recursive solution • Generalize: F(i-1, j-1) + (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d (Xi,Yj) = m if Xi = Yj, and –s otherwise • Boundary conditions: • F(0, 0) = 0. • F(0, j) = ? • F(i, 0) = ? -jd: y[1..j] aligned to gaps. -id: x[1..i] aligned to gaps.

  34. F(i-1, j-1) F(i-1, j) 1 1 2 F(i, j-1) F(i, j) 3 What order to fill? i j

  35. What order to fill?

  36. Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  37. Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  38. Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  39. Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  40. Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 Optimal Alignment: F(4,3) = 2 j = 0 1 2 3

  41. Example x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 Optimal Alignment: F(4,3) = 2 This only tells us the best score 1 2 3

  42. Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  43. Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  44. Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  45. Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  46. Trace-back x = AGTA m = 1 y = ATA s = 1 d = 1 • F(i-1, j-1) + (Xi,Yj) • F(i,j) = max F(i-1, j) – d • F(i, j-1) – d F(i,j) i = 0 1 2 3 4 Optimal Alignment: F(4,3) = 2 AGTA ATA j = 0 1 2 3

  47. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  48. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  49. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

  50. Using trace-back pointers x = AGTA m = 1 y = ATA s = 1 d = 1 F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

More Related