
CS 5263 Bioinformatics


hao

Presentation Transcript


  1. CS 5263 Bioinformatics Lecture 8: Multiple Sequence Alignment

  2. Roadmap • Homework? • Review of last lecture • Multiple sequence alignment

  3. Homework • #1: dsDNA => mRNA => protein [Figure: coding strand and template strand of the dsDNA, the mRNA, the protein, and the genetic code]

  4. Problem #2 • For two strings of lengths m and n, the number of alignments is equal to the number of paths from (0, 0) to (m, n) • The number of ways to reach (i, j) depends on the number of ways to reach its preceding neighbors
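The path-counting idea can be sketched in a few lines. This is a minimal illustration, assuming the three usual predecessors of a cell, (i-1, j), (i, j-1) and (i-1, j-1); the function name `count_alignments` is made up for this example.

```python
def count_alignments(m: int, n: int) -> int:
    """Count paths from (0, 0) to (m, n) using down, right, and diagonal steps."""
    # ways[i][j] = number of paths that reach cell (i, j)
    ways = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                # Along the border only gaps are possible: exactly one path
                ways[i][j] = 1
            else:
                # Sum over the three preceding neighbors
                ways[i][j] = ways[i - 1][j] + ways[i][j - 1] + ways[i - 1][j - 1]
    return ways[m][n]


print(count_alignments(2, 2))  # 13
```

The count grows very quickly with m and n, which is why alignments are found by dynamic programming rather than enumeration.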

  5. Problem #3 • Similar to problem #2 • But there are some limitations on certain paths • (i-1, j-1)→(i-1, j)→(i, j) is illegal • So is (i-1, j-1)→(i, j-1)→(i, j) • How many ways to get to (i-1, j) without using (i-1, j-1)→(i-1, j)? • How many ways to get to (i, j-1) without using (i-1, j-1)→(i, j-1)?
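One way to handle the restriction is to remember the type of the last step. The sketch below is one possible reading of the problem, not necessarily the intended homework solution: it assumes that a step in j may never be followed directly by a step in i, and vice versa, which is exactly what the two illegal patterns say. The name `count_restricted` is hypothetical.

```python
def count_restricted(m: int, n: int) -> int:
    """Count paths to (m, n) where a single step in i never directly follows
    a single step in j, and vice versa."""
    # diag/vert/horiz[i][j]: legal paths to (i, j) whose last step has that type
    diag = [[0] * (n + 1) for _ in range(m + 1)]
    vert = [[0] * (n + 1) for _ in range(m + 1)]   # last step (i-1, j) -> (i, j)
    horiz = [[0] * (n + 1) for _ in range(m + 1)]  # last step (i, j-1) -> (i, j)
    diag[0][0] = 1  # empty path, stored as "diagonal" so any first step is allowed
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                # A diagonal step may follow any kind of step
                diag[i][j] = diag[i - 1][j - 1] + vert[i - 1][j - 1] + horiz[i - 1][j - 1]
            if i > 0:
                # A vertical step may not follow a horizontal one
                vert[i][j] = diag[i - 1][j] + vert[i - 1][j]
            if j > 0:
                # A horizontal step may not follow a vertical one
                horiz[i][j] = diag[i][j - 1] + horiz[i][j - 1]
    return diag[m][n] + vert[m][n] + horiz[m][n]


print(count_restricted(2, 2))  # 3
```

For (1, 1) only the diagonal path survives, since both two-step detours are illegal.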

  6. Problem #4 • Implementation is easy • Histogram: how you bin it may affect your results • bin for each discrete value you observed in your scores • Scores related to base frequency? • Scores differ between global and local alignments? • Score distribution?

  7. BLAST • Main idea: • Construct a dictionary of all the words in the query • Initiate alignments between words with alignment score ≥ T • Alignment: ungapped extensions until the score falls below a statistical threshold • Output: all local alignments with score > statistical threshold [Figure: the query's words are scanned against the DB]

  8. BLAST • Example: k = 4, T = 4 • Query: ACGAAGTAAGGTCCAGT • The matching word GGTC initiates an alignment • Extension to the left and right with no gaps until the alignment falls below 50% • Output:
      GTAAGGTCC
      GTTAGGTCC
    [Figure: dot matrix of the query against the database sequence]
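The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not BLAST itself. It assumes match = +1 and mismatch = -1, so an exact 4-letter word reaches T = 4. It also replaces the slide's 50% stopping rule with a hypothetical drop-off parameter `x_drop`. The database string is likewise an assumption, chosen so that it contains the slide's hit.

```python
def seed_and_extend(query, db, k=4, x_drop=2):
    """Exact k-mer seeding followed by ungapped extension (match +1, mismatch -1)."""
    # Dictionary of every k-letter word in the query -> its start positions
    words = {}
    for q in range(len(query) - k + 1):
        words.setdefault(query[q:q + k], []).append(q)

    def extend(q, d, step):
        # Walk in one direction and return the length of the best-scoring extension
        score = best = best_len = n = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += 1 if query[q] == db[d] else -1
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break  # fell too far below the best score seen so far
            q += step
            d += step
        return best_len

    hits = set()
    for d in range(len(db) - k + 1):            # scan the database
        for q in words.get(db[d:d + k], []):    # every query word that matches here
            left = extend(q - 1, d - 1, -1)
            right = extend(q + k, d + k, +1)
            hits.add((query[q - left:q + k + right], db[d - left:d + k + right]))
    return hits


print(seed_and_extend("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTTCCC"))
# {('GTAAGGTCC', 'GTTAGGTCC')}
```

Three overlapping words (AGGT, GGTC, GTCC) seed the same diagonal here, and all of them extend to the same ungapped alignment, so the set holds a single hit.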

  9. Gapped BLAST • Query: ACGAAGTAAGGTCCAGT • Added features: • Pairs of words can initiate alignment • Extensions with gaps in a band around anchor • Output:
      GTAAGGTCCAGT
      GTTAGGTC-AGT
    [Figure: dot matrix of the query against the database sequence, with a band around the anchor]

  10. Advantages • Fast!!!! • A few minutes to search a database of $10^{11}$ bases • Disadvantages • Sensitivity may be low • Often misses weak homologies

  11. New improvement • Make it even faster • But even less sensitive • Mainly for aligning very similar sequences or really long sequences • E.g. whole genome vs whole genome • Make it more sensitive • PSI-BLAST: iteratively add more homologous sequences • PatternHunter: discontinuous seeds

  12. Things we’ve covered so far • Global alignment • Needleman-Wunsch and variants • Improvement on space and time • Local Alignment • Smith-Waterman • Heuristic algorithms • BLAST families • Statistics for sequence alignment • Extreme value distribution

  13. Commonality • They all deal with aligning two sequences • Pair-wise sequence alignment

  14. Today • Aligning multiple sequences all together • Multiple sequence alignment

  15. Motivation • A faint similarity between two sequences becomes very significant if present in many • Protein domains • Motifs responsible for gene regulation

  16. Definition • Given N sequences x1, x2,…, xN: • Insert gaps (-) in each sequence xi, such that • All sequences have the same length L • Score of the global map is maximum • Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences • Same for a multiple alignment!

  17. Scoring Function • Ideally: • Find the alignment that maximizes the probability that the sequences evolved from a common ancestor [Figure: phylogenetic tree (evolution tree) with nodes x, y, z, w, v and an unknown ancestor "?"]

  18. Scoring Function (cont’d) • Unfortunately: too many parameters • Compromises: • Ignore phylogenetic tree • Compute from pair-wise scores • Based on sum of all pair-wise scores • Based on scores with a consensus sequence

  19. First assumption • Columns are independent • Similar in pair-wise alignment • Therefore, the score of an alignment is the sum of all columns • Need to decide how to score a single column

  20. Scoring Function: Sum Of Pairs • Definition: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment • Example:
      x: AC-GCGG-C
      y: AC-GC-GAG
      z: GCCGC-GAG
    Induces:
      x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
      y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

  21. Sum Of Pairs (cont’d) • The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments: $S(m) = \sum_{k<l} s(m_k, m_l)$ • $s(m_k, m_l)$: score of induced alignment (k, l)

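  22. Example:
      x: AC-GCGG-C
      y: AC-GC-GAG
      z: GCCGC-GAG
    • Column 1: (A,A) + (A,G) x 2 = -1 • Column 2: (C,C) x 3 = 3 • Column 8: (-,A) x 2 + (A,A) = -1 • Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5 • (The arithmetic implies match = +1, mismatch = -1, letter vs. gap = -1, gap vs. gap = 0)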
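The column-by-column arithmetic can be checked with a short script. The scoring values are an assumption inferred from the numbers on the slide (match +1, mismatch -1, letter against gap -1, gap against gap 0), and the function names are made up for this sketch.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; two gaps facing each other cost nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows):
    """Sum the pair scores over every column and every pair of rows."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(alignment))  # 5
```

Because columns are scored independently, summing over columns and then over pairs gives the same result as summing the three induced pairwise alignments.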

  23. Sum Of Pairs (cont’d) • Drawback: no evolutionary characterization • Every sequence derived from all others • Heuristic way to incorporate evolution tree • Weighted Sum of Pairs: • $S(m) = \sum_{k<l} w_{kl}\, s(m_k, m_l)$ • $w_{kl}$: weight decreasing with distance [Figure: evolution tree over Human, Mouse, Duck, Chicken]

  24. Consensus score
      -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
      TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
      CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
      CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
    Consensus sequence: • Find optimal consensus string m* to maximize $S(m) = \sum_i s(m^*, m_i)$ • $s(m_k, m_l)$: score of pairwise alignment (k, l)
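If the multiple alignment is held fixed and columns are scored independently, the best consensus symbol can be picked one column at a time. The sketch below makes those two assumptions, plus a hypothetical scoring scheme (match +1, mismatch -1, letter against gap -1, gap against gap 0). It is not a general consensus-alignment algorithm.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; two gaps facing each other cost nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def consensus(rows, alphabet="ACGT-"):
    """For each column, pick the symbol whose total score against that column is highest.
    Ties go to the symbol that appears first in `alphabet`."""
    return "".join(
        max(alphabet, key=lambda c: sum(pair_score(c, x) for x in column))
        for column in zip(*rows)
    )


rows = [
    "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---",
    "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC",
    "CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC",
    "CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC",
]
print(consensus(rows))
```

Under this scheme the consensus may itself contain gap symbols wherever most rows have a gap.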

  25. Multiple Sequence Alignments Algorithms

  26. Multidimensional Dynamic Programming (MDP) • Generalization of Needleman-Wunsch: • Find the longest path in a high-dimensional cube • As opposed to a two-dimensional grid • Uses an N-dimensional matrix • As opposed to a two-dimensional array • Entry $F(i_1, \ldots, i_N)$ represents the score of the optimal alignment for $s_1[1..i_1], \ldots, s_N[1..i_N]$ • $F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, with the max taken over all neighbors of the cell

  27. Multidimensional Dynamic Programming (MDP) • Example: in 3D (three sequences): • $2^3 - 1 = 7$ neighbors/cell
$$F(i,j,k) = \max \begin{cases} F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\ F(i-1,j-1,k) + S(x_i, x_j, -) \\ F(i-1,j,k-1) + S(x_i, -, x_k) \\ F(i,j-1,k-1) + S(-, x_j, x_k) \\ F(i-1,j,k) + S(x_i, -, -) \\ F(i,j-1,k) + S(-, x_j, -) \\ F(i,j,k-1) + S(-, -, x_k) \end{cases}$$
    [Figure: cube with corners (i-1,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i-1,j,k), (i,j,k-1), (i,j-1,k-1), (i,j-1,k), (i,j,k)]
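The three-sequence recurrence translates fairly directly into code. The sketch below assumes that the column score S is the sum-of-pairs score with match +1, mismatch -1, letter against gap -1 and gap against gap 0. It returns only the optimal score, without traceback, and the function names are hypothetical.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; two gaps facing each other cost nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def align3_score(x, y, z):
    """Optimal sum-of-pairs score for three sequences by 3-D dynamic programming."""
    neg_inf = float("-inf")
    F = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 predecessor offsets: every 0/1 triple except (0, 0, 0)
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor lies outside the cube
                    # A sequence that does not advance contributes a gap
                    column = (x[i - 1] if di else "-",
                              y[j - 1] if dj else "-",
                              z[k - 1] if dk else "-")
                    s = sum(pair_score(a, b) for a, b in combinations(column, 2))
                    F[i][j][k] = max(F[i][j][k], F[pi][pj][pk] + s)
    return F[len(x)][len(y)][len(z)]


print(align3_score("AC", "AC", "AC"))  # 6
```

Every cell looks at 7 predecessors, which is where the exponential factor in the running time comes from.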

  28. Multidimensional Dynamic Programming (MDP) • Running Time: • Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences • Neighbors/cell: $2^N - 1$ • Therefore: $O(2^N L^N)$

  29. Faster MDP • Carrillo & Lipman, 1988 • Branch and bound • Other heuristics • Practical for about 6 sequences of length about 200-300.

  30. Progressive Alignment • Multiple Alignment is NP-hard • Most used heuristic: Progressive Alignment • Algorithm: • Align two of the sequences xi, xj • Fix that alignment • Align a third sequence xk to the alignment xi,xj • Repeat until all sequences are aligned • Running Time: $O(NL^2)$ • Each alignment takes $O(L^2)$ • Repeat N times

  31. Progressive Alignment • When evolutionary tree is known: • Align closest first, in the order of the tree • Example: Order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) [Figure: evolutionary tree with leaves x, y, z, w]

  32. Progressive Alignment: CLUSTALW • CLUSTALW: most popular multiple protein alignment program • Algorithm: • Find all dij: alignment distance between xi and xj • High alignment score => short distance • Construct a tree (neighbor-joining hierarchical clustering; will discuss in future) • Align nodes in order of decreasing similarity • Plus a large number of heuristics

  33. CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD

  34. CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Distance matrix
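The values of the distance matrix are not reproduced in this transcript. As a rough stand-in, the sketch below computes global alignment scores for all pairs with a plain Needleman-Wunsch recurrence. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption, and CLUSTALW's own distance measure and substitution matrices are not modeled.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score with a linear gap penalty, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


seqs = {"S1": "ALSK", "S2": "TNSD", "S3": "NASK", "S4": "NTSD"}
scores = {(p, q): nw_score(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

Under these assumed scores the two highest-scoring pairs are (S1, S3) and (S2, S4), so a tree built on "high score means short distance" would join those pairs first.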

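  35. CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD [Figure: guide tree pairing s1 with s3 and s2 with s4]

  36. CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Alignment of s1 and s3:
      -ALSK
      NA-SK
    [Figure: guide tree pairing s1 with s3 and s2 with s4]

  37. CLUSTALW example • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Alignment of s1 and s3:
      -ALSK
      NA-SK
    • Alignment of s2 and s4:
      -TNSD
      NT-SD
    [Figure: guide tree pairing s1 with s3 and s2 with s4]

  38. CLUSTALW example • Question: how do you align two alignments? • S1: ALSK • S2: TNSD • S3: NASK • S4: NTSD • Alignment of s1 and s3:
      -ALSK
      NA-SK
    • Alignment of s2 and s4:
      -TNSD
      NT-SD
    • Combined alignment:
      -ALSK
      -TNSD
      NA-SK
      NT-SD
    [Figure: guide tree pairing s1 with s3 and s2 with s4]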

  39. Aligning two alignments • You can treat each column in an alignment as a single letter • Remember in the case of the gene finder, we aligned three nucleic acids at a time • How do we score it? • Naïve: compute the Sum of Pairs • Better: only compute the cross terms • We already have (K, K) and (D, D) • Need to add 4 x (K, D): each of the two K's pairs with each of the two D's • The two alignments:
      -ALSK
      NA-SK

      -TNSD
      NT-SD
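The cross-term shortcut can be illustrated on a single pair of columns. The sketch assumes the same hypothetical scoring used for the sum-of-pairs example (match +1, mismatch -1, letter against gap -1, gap against gap 0). For a column holding two K's and a column holding two D's there are 2 x 2 = 4 cross pairs.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols; two gaps facing each other cost nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum-of-pairs score of one column, over all pairs within it."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def cross_score(col_a, col_b):
    """Score only the pairs that take one symbol from each column."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)


# Last columns of the two alignments: (K, K) and (D, D)
merged = sp_column("KKDD")
shortcut = sp_column("KK") + sp_column("DD") + cross_score("KK", "DD")
print(merged, shortcut)  # -2 -2
```

Since the within-alignment terms do not change when two fixed alignments are merged, a dynamic program over columns only needs `cross_score` to compare candidate column pairings.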

  40. CLUSTALW & the CINEMA viewer

  41. Iterative Refinement • Problems with progressive alignment: • Depends on pair-wise alignments • If sequences are very distantly related, much higher likelihood of errors • Initial alignments are “frozen” even when new evidence comes • Example:
      x: GAAGTT
      y: GAC-TT
      z: GAACTG
      w: GTACTG
    • Frozen! Now clear: correct y should be GA-CTT

  42. Iterative Refinement • Algorithm (Barton-Sternberg): 1. Align most similar xi, xj 2. Align xk most similar to (xixj) 3. Repeat 2 until (x1…xN) are aligned 4. For j = 1 to N, remove xj and realign it to x1…x(j-1) x(j+1)…xN 5. Repeat 4 until convergence • Note: guaranteed to converge • Running time: $O(kNL^2)$, k: number of iterations

  43. Iterative Refinement (cont’d) • For each sequence y: • Remove y • Realign y (while rest fixed) [Figure: axes x, y, z; y is allowed to vary while the projection onto x, z stays fixed]

  44. Iterative Refinement • Example: align (x,y), (z,w), (xy, zw):
      x: GAAGTTA
      y: GAC-TTA
      z: GAACTGA
      w: GTACTGA
    • After realigning y (+ 3 matches):
      x: GAAGTTA
      y: G-ACTTA
      z: GAACTGA
      w: GTACTGA

  45. Iterative Refinement • Example not handled well:
      x:  GAAGTTA
      y1: GAC-TTA
      y2: GAC-TTA
      y3: GAC-TTA
      z:  GAACTGA
      w:  GTACTGA
    • Realigning any single yi changes nothing

  46. Restricted MDP • Similar to bounded DP in pair-wise alignment • Construct progressive multiple alignment m • Run MDP, restricted to radius R from m • Running Time: $O(2^N R^{N-1} L)$ [Figure: band of radius R around m in the space of sequences x, y, z]

  47. Restricted MDP
      x:  GAAGTTA
      y1: GAC-TTA
      y2: GAC-TTA
      y3: GAC-TTA
      z:  GAACTGA
      w:  GTACTGA
    • Within radius 1 of the optimal ⇒ Restricted MDP will fix it.

  48. Other approaches • Profile Hidden Markov Models • Statistical learning methods • Will discuss in future

  49. Multiple alignment tools • Clustal W (Thompson, 1994) • Most popular • PRRP (Gotoh, 1993) • HMMT (Eddy, 1995) • DIALIGN (Morgenstern, 1998) • T-Coffee (Notredame, 2000) • MUSCLE (Edgar, 2004) • Align-m (Walle, 2004) • PROBCONS (Do, 2004)
