1 / 61

Multiple alignment by aligning alignments

Multiple alignment by aligning alignments. Travis Wheeler John Kececioglu Department of Computer Science University of Arizona. Extended from talk given at ISMB ‘07. Multiple sequence alignment. Sequence alignment central to computational biology: Functional conservation

tulia
Download Presentation

Multiple alignment by aligning alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multiple alignment by aligning alignments Travis Wheeler John Kececioglu Department of Computer Science University of Arizona Extended from talk given at ISMB ‘07

  2. Multiple sequence alignment Sequence alignment central to computational biology: • Functional conservation • Phylogenetic analysis • Signals of selection • Prediction of structure • Comparative genomics • and many others …

  3. Aligning two sequences A two-sequencealignment, with affine gap-costs, is scored, ) +l( ) )+ ( ( number of gaps substitution score total gap length columns g + 11 l ... ... g + 7 l g + 3 l

  4. Scoring a multiple alignment Sum-of-pairs:  wi,j score( i, j’ ) score( i, j” ) score( i, j ) i,j ij i j’ i j” ... Optimal alignment of multiple sequences is NP-complete Carrillo, Lipman 1988

  5. Form-and-polish strategy • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  6. Form-and-polish strategy • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  7. Constructing merge tree gatac gacagg aacctacg aagccgtt aattt aatccgtt catccgtt Feng, Doolittle 1987

  8. Merging alignments gatac gacagg aacctacg aagccgtt aattt aatccgtt catccgtt Feng, Doolittle 1987

  9. gatac- gacagg aagccgtt aa---ttt aatccgtt catccgtt Merging alignments gatac gacagg aacctacg aagccgtt aattt aatccgtt catccgtt Feng, Doolittle 1987

  10. Merging alignments

  11. aagccgtt aagccgtt aagccgtt aagccgtt aa---ttt aa---ttt aa---ttt aa---ttt aatccgtt aatccgtt catccgtt catccgtt aatccgtt catccgtt Merging alignments

  12. A’ B’ Polishing the alignment • Split alignment into groups • Realign groups • Repeat A C B Berger, Munson 1991

  13. Summary of main stages • Construct tree • Merge alignments • Polish

  14. ? Alignment Quality Computed alignment Correct alignment # substitutions recovered SPS : # substitutions in correct alignment # columns correctly recovered TC : # columns in correct alignment

  15. Benchmark datasets • Benchmark suites • BAliBase [Thompson et al. 1999; Bahr et al. 2001] • PALI [Balaji et al 2001] • SABmark [Van Walle et al 2004] • All based on structural alignment of proteins • Characteristics • 900 alignments • 10 sequences per alignment, on average • 400 columns per alignment, on average • Core blocks • Regions with high support

  16. Form-and-polish review • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  17. Grouping methods • Neighbor joining (NJ) [Saitou, Nei 1987] • Unweighted-pair group method with arithmetic mean (UPGMA) [Sneath, Sokal 1973] • Minimum spanning tree (MST) • Dynamic alignment distance (DAD)

  18. Grouping sequences gacagg gatac aacctacg aatccgtt aagccgtt catccgtt aattt • Methods differ in measuring distances for new groups

  19. Comparing grouping methods • Best grouping method ≠ best phylogeny method

  20. Form-and-polish review • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  21. AHDHHSSQ AHDHHSSQ AHDHHSSQ AHDHHSSQ ANEH--TR ANEH--TR ANEH--TR ANEH--TR Measuring distances Percent identity Compressed identity Normalized alignment cost

  22. Comparing distance methods • Normalized cost is very simple, and gives greatest gains

  23. Form-and-polish review • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  24. ) +l( )+ ( ( ) gap length substitution score  columns ) ( wi,j i,j Aligning alignments gap count gap count

  25. Merging methods Exact gap counts [Gotoh 1993 ; Kececioglu, Starrett 2004] • k sequences, n columns • O(5kn2) worst case • O(k2n2) time in practice Pessimistic gap counts [Altschul 1989; Kececioglu, Zhang 1998] • Overestimates gap startups • O(kn + n2) worst case • 100-fold speedup with 20 sequences per alignment

  26. Comparing merging methods • Pessimistic heuristic may be sufficient for large inputs

  27. Form-and-polish review • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  28. Polishing methods Two-cut method • Random partition [Probcons Do et al. 2005] • Tree-based partition

  29. Polishing methods Two-cut method • Random partition [Probcons Do et al. 2005] • Tree-based partition • Randomly cut edges [MAFFT Katoh et al. 2005] • Exhaustively cut edges [Muscle Edgar 2004] • Can’t fix two small misaligned groups

  30. Polishing methods Two-cut method • Random partition [Probcons Do et al. 2005] • Tree-based partition • Randomly cut edges [MAFFT Katoh et al. 2005] • Exhaustively cut edges [Muscle Edgar 2004] • Can’t fix two small misaligned groups

  31. Polishing methods Two-cut method • Random partition [Probcons Do et al. 2005] • Tree-based partition • Randomly cut edges [MAFFT Katoh et al. 2005] • Exhaustively cut edges [Muscle Edgar 2004] • Can’t fix two small misaligned groups

  32. Polishing methods Two-cut method • Random partition [Probcons Do et al. 2005] • Tree-based partition • Randomly cut edges [MAFFT Katoh et al. 2005] • Exhaustively cut edges [Muscle Edgar 2004] • Can’t fix two small misaligned groups Three-cut method • Tree-based, random

  33. Polishing methods Two-cut method • Random partition [Probcons Do et al. 2005] • Tree-based partition • Randomly cut edges [MAFFT Katoh et al. 2005] • Exhaustively cut edges [Muscle Edgar 2004] • Can’t fix two small misaligned groups Three-cut method • Tree-based, random On-the-fly method [Subbiah, Harrison 1989]

  34. Comparing polishing methods • 3-cut achieves 2-cut quality in less time • On-the-fly speeds up 2-cut convergence

  35. Form-and-polish review • Choosing parameters • Constructing the merge tree • Grouping sequences • Measuring distances • Weighting sequence pairs • Merging alignments • Polishing the alignment

  36. Weighting sequence pairs A B C 20 seqs 2 seqs 50 seqs

  37. Weighting sequence pairs A 40 A-B pair alignments (20) 1000 A-C pair alignments B (2) 100 B-C pair alignments C (50)

  38. Weighting methods Covariance weights [Altschul, et al. 1989] • Based on correlation between paths • Approximated in practice [Gotoh 1995] • Used in MAFFT Division weights [Thompson, et al. 1994] • Edge lengths divided among leaves • Used in ClustalW, Muscle

  39. Weighting anomalies Covariance weights i if lvx 2 lvy = lvz then wyz 2wxz ≈ even for very long lvz wyz wxz (expect: ) ≈ v lvy lvx y z x

  40. Weighting anomalies Division weights i if lvx lvy << lvz = lvz then wxz wyz 2wxy = ≈ wxy wxz wyz (expect: = ) ≈ 2lvz v lvx lvy x y z

  41. Weighting methods Covariance weights [Altschul, et al. 1989] • Based on correlation between paths • Approximated in practice [Gotoh 1995] • Used in MAFFT Division weights [Thompson, et al. 1994] • Edge lengths divided among leaves • Used in ClustalW, Muscle Influence weights • Based on the influence of leaf j on i, i,j

  42. Influence weights B A C j i

  43. Influence weights i A’ A” B C j

  44. Influence weights i x y z

  45. Sx(y) : l(x,y) + ( l(e) ) e T(y) Sx(y) Nx(y) = Hx(y) Hx(y) Influence weights T(y) : tree under y (subtree) i (size) L(y) : set of leaves under y (leaf set) Hx(y) : avg path length from x to L(y) (height) x (effective # sequences) y z

  46. Sx(y) : l(x,y) + ( l(e) ) e T(y) Sx(y) y Nx(y) = Hx(y) y Influence weights T(y) : tree under y (subtree) i (size) L(y) : set of leaves under y (leaf set) Hx(y) : avg path length from x to L(y) (height) x (effective # sequences) z Nx(y) ≈ 1 Nx(y) ≈ k k seqs

  47. Sx(y) : l(x,y) + ( l(e) ) e T(y) Sx(y) Nx(y) = Hx(y) wy Nx(y) Hi(z) = wz Nx(z) Hi(y) Influence weights T(y) : tree under y (subtree) i (size) L(y) : set of leaves under y (leaf set) Hx(y) : avg path length from x to L(y) (height) wx x (effective # sequences) y z Split wx = wy + wz according to the ratio:

  48. Sx(y) : l(x,y) + ( l(e) ) e T(y) 7 1 Sx(y) wx wx y Nx(y) = 8 8 Hx(y) wy Nx(y) Hi(z) = z wz Nx(z) Hi(y) Influence weights T(y) : tree under y (subtree) i (size) L(y) : set of leaves under y (leaf set) Hx(y) : avg path length from x to L(y) (height) wx x Hi(y) ≈ Hi(z) (effective # sequences) Split wx = wy + wz according to the ratio: Nx(z) ≈ 1 Nx(y) ≈ 7

  49. Sx(y) : l(x,y) + ( l(e) ) e T(y) 2 1 Sx(y) wx wx Nx(y) = 3 3 Hx(y) wy Nx(y) Hi(z) = wz Nx(z) Hi(y) Nx(y) ≈ Nx(z) Influence weights T(y) : tree under y (subtree) i (size) wx L(y) : set of leaves under y (leaf set) x Hx(y) : avg path length from x to L(y) (height) y z (effective # sequences) Hi(y) = 5 Hi(z) = 10 Split wx = wy + wz according to the ratio:

  50. Define wij =  w(i,j) w(j,i)  wi,j score( i, j ) i,j Influence weights • Influence w(i,j) is the weight wj SP score =

More Related