
Guide Trees and Progressive Multiple Sequence Alignment

James A. Foster and Luke Sheneman. 1 October 2008. Initiative for Bioinformatics and Evolutionary Studies (IBEST).

Presentation Transcript


  1. Guide Trees and Progressive Multiple Sequence Alignment • James A. Foster and Luke Sheneman • 1 October 2008 • Initiative for Bioinformatics and Evolutionary Studies (IBEST)

  2. Multiple Sequence Alignment • Abstract representation of sequence homology • Homologous molecular characters (nucleotides/residues) organized in columns • Gaps (-) represent sequence indels

  3. Multiple Sequence Alignment • Many bioinformatics analyses depend on MSA. • First step in inferring phylogenetic trees • MSA technique is at least as important as inference method and model parameters (Morrison & Ellis, 1997) • Structural and functional sequence analyses

  4. Progressive Alignment • Idea: align “closely related” sequences first, two at a time, with “optimal” subalignments (dynamic programming) • Problem: once a gap, always a gap • Advantage: fast
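
The idea on this slide can be sketched in a few lines of Python. This is a toy illustration, not the implementation used in the study: the scoring constants, the nested-tuple guide tree format, and the function names `column_score`, `align_profiles`, and `progressive_align` are all assumptions made for the example. Each internal node of the guide tree aligns the two sub-alignments below it with global dynamic programming, and gaps already placed in a sub-alignment are never revisited ("once a gap, always a gap").

```python
from itertools import product

# Assumed toy scoring scheme (not taken from the slides).
MATCH, MISMATCH, GAP = 1, -1, -2

def column_score(col_a, col_b):
    """Average pairwise score between two alignment columns."""
    total = 0
    for x, y in product(col_a, col_b):
        if x == "-" or y == "-":
            total += 0 if x == y else GAP
        else:
            total += MATCH if x == y else MISMATCH
    return total / (len(col_a) * len(col_b))

def align_profiles(a, b):
    """Globally align two profiles (lists of equal-length gapped strings)."""
    n, m = len(a[0]), len(b[0])
    cols_a = [[s[i] for s in a] for i in range(n)]
    cols_b = [[s[j] for s in b] for j in range(m)]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], ptr[i][0] = i * GAP, "up"
    for j in range(1, m + 1):
        score[0][j], ptr[0][j] = j * GAP, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]), "diag"),
                (score[i - 1][j] + GAP, "up"),
                (score[i][j - 1] + GAP, "left"),
            ]
            # Ties go to the first maximal choice (diag, then up, then left).
            score[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    # Trace back, emitting whole columns so existing gaps are preserved.
    rows_a, rows_b = [[] for _ in a], [[] for _ in b]
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move in ("diag", "up"):
            i -= 1
        if move in ("diag", "left"):
            j -= 1
        for row, s in zip(rows_a, a):
            row.append(s[i] if move != "left" else "-")
        for row, s in zip(rows_b, b):
            row.append(s[j] if move != "up" else "-")
    return ["".join(reversed(r)) for r in rows_a + rows_b]

def progressive_align(tree, seqs):
    """tree: nested 2-tuples of sequence names; seqs: dict name -> sequence."""
    if isinstance(tree, str):
        return [tree], [seqs[tree]]
    names_l, prof_l = progressive_align(tree[0], seqs)
    names_r, prof_r = progressive_align(tree[1], seqs)
    return names_l + names_r, align_profiles(prof_l, prof_r)

names, rows = progressive_align(
    (("a", "b"), "c"), {"a": "ACGT", "b": "ACGT", "c": "AGT"}
)
for name, row in zip(names, rows):
    print(name, row)
```

The tie-breaking rule in `max` is one of the "hidden parameters" of any progressive aligner: a different rule can yield a different alignment from the same guide tree.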

  5. Guide Trees and Alignment Quality • How important is it to find “good” guide trees? • How much time should be spent looking for “better” guide trees?

  6. Hypothesis • Guide trees that are closer to the true phylogeny lead to better sequence alignments • Guide trees that are further from the true tree produce less accurate alignments. • The effect is measurable. • The correlation is significant.

  7. Previous Work • Folk wisdom, intuition: it matters, a lot • Basis for Clustal, and most other pMSA implementations • Nelesen et al. (PSB ’08): doesn’t matter, much • No strong correlation • No large effect • Edgar (2004): bad trees are sometimes better • UPGMA guide trees are ultrametric, yet outperform NJ guide trees

  8. Experimental Design: strategy • For both natural data and simulation data, with reliable alignments and phylogenies: • Explore the space of possible guide trees, moving outward from the “true tree” • Use each tree as a guide tree, perform pMSA • Compare quality of resulting alignment with known optimal value

  9. Experimental Design: Naturally Evolved Case

  10. Experimental Design: Degrading Guide Trees • Random Nearest Neighbor Interchange (NNI) • Swaps two neighboring subtrees across an internal branch • Random Tree Bisect/Reconnect (TBR) • Randomly bisect the tree • Randomly reconnect the two resulting subtrees • (Images: hyphy.org)
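
A random NNI step can be sketched as follows, assuming a rooted binary guide tree stored as nested 2-tuples of leaf names. The representation and the names `internal_edges`, `nni_at`, `degrade`, and `leaves` are hypothetical, chosen only for this example; the study's own tooling is not shown in the slides. Each step picks an internal branch at random and swaps one subtree below it with the subtree on the other side.

```python
import random

def internal_edges(tree, path=()):
    """Yield (path, side) for each edge whose lower end is an internal node."""
    if isinstance(tree, str):
        return
    for side in (0, 1):
        if not isinstance(tree[side], str):
            yield path, side
            yield from internal_edges(tree[side], path + (side,))

def nni_at(tree, path, side, which):
    """Swap child `which` of the node at tree[path][side] with its sibling."""
    if path:
        kids = list(tree)
        kids[path[0]] = nni_at(tree[path[0]], path[1:], side, which)
        return tuple(kids)
    lower = list(tree[side])
    moved, lower[which] = lower[which], tree[1 - side]
    result = [None, None]
    result[side], result[1 - side] = tuple(lower), moved
    return tuple(result)

def degrade(tree, steps, seed=None):
    """Apply `steps` random NNI moves; `steps` bounds the NNI distance from above."""
    rng = random.Random(seed)
    for _ in range(steps):
        edges = list(internal_edges(tree))
        if not edges:
            break  # trees with no internal branch cannot be rearranged
        path, side = rng.choice(edges)
        tree = nni_at(tree, path, side, rng.choice((0, 1)))
    return tree

def leaves(tree):
    """Return the leaf names of a nested-tuple tree, left to right."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

true_tree = ((("A", "B"), "C"), ("D", "E"))
print(degrade(true_tree, steps=3, seed=1))
```

Because a later move can undo an earlier one, the number of moves applied is only an upper bound on how far the degraded tree is from the starting tree.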

  11. TreeBASE (“natural”) Input Datasets

  12. Experimental Design: Simulated Evolution Case

  13. Conclusions • Statistically significant correlation between guide tree quality and alignment quality • Independent of tree transformation operator • Independent of alignment distance metric • But very small absolute change in quality • Non-linear / logarithmic • Largest alignment quality effect 5-10 steps from phylogeny • The lesson: improving an already very good guide tree helps noticeably; otherwise improving the guide tree helps, but only a little

  14. Acknowledgements • Dr. Luke Sheneman (mostly his slides!) • Faculty, staff, and students of BCB • Jason Evans • Darin Rokyta • Funding sources: • NIH P20 RR16454 • NIH NCRR 1P20 RR16448 • NSF EPS 00809035

  15. Experimental Design: metrics • Â = pmsa(S, T) • where S is the set of input sequences • where T is the guide tree • (hidden parameters: pairwise algorithm, tie-breaking strategy) • AQ = CompareAlignments(A*, Â), where A* is the reference alignment • QSCORE(A*, Â) -> TC-error, SP-error • Nelesen had a nicer metric: error of estimated phylogeny • Tdist = TreeDistance(T*, T), where T* is the true tree • Upper bound estimate of edit distance via NNI or TBR
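
The two alignment errors can be illustrated with a small sketch. It assumes the usual definitions: the SP score is the fraction of residue pairs aligned in A* that are also aligned in Â, the TC score is the fraction of A* columns reproduced exactly in Â, and each error is one minus the score. This is not the QSCORE program itself, and the function names are made up for the example.

```python
from itertools import combinations

def columns(aln):
    """Turn {name: gapped string} into a list of columns of (name, residue index)."""
    names = sorted(aln)
    position = dict.fromkeys(names, 0)
    cols = []
    for c in range(len(aln[names[0]])):
        col = set()
        for name in names:
            if aln[name][c] != "-":
                col.add((name, position[name]))
                position[name] += 1
        cols.append(frozenset(col))
    return cols

def aligned_pairs(cols):
    """All residue pairs that share a column."""
    return {pair for col in cols for pair in combinations(sorted(col), 2)}

def sp_tc_error(reference, test):
    """Return (SP-error, TC-error) of `test` measured against `reference`."""
    # Only reference columns with at least two residues carry homology claims.
    ref_cols = [c for c in columns(reference) if len(c) > 1]
    test_cols = set(columns(test))
    ref_pairs = aligned_pairs(ref_cols)
    test_pairs = aligned_pairs(test_cols)
    sp = len(ref_pairs & test_pairs) / len(ref_pairs)
    tc = sum(c in test_cols for c in ref_cols) / len(ref_cols)
    return 1 - sp, 1 - tc

reference = {"a": "ACGT", "b": "ACGT", "c": "A-GT"}
estimate = {"a": "ACGT", "b": "ACGT", "c": "AG-T"}
print(sp_tc_error(reference, estimate))
```

In this toy case a single misplaced gap costs two of the ten reference pairs but half of the reference columns, which shows why TC is the stricter of the two measures.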

  16. Alternative Scoring Metric • Idea: “quality” of an alignment is the distance from the phylogeny it produces to the “true” phylogeny • AQ = KTreeDist(ML_est(A*), ML_est(Â)) • ML_est(A): maximum likelihood estimate of the phylogeny behind MSA A (we used RAxML) • KTreeDist(T1, T2): scales T2 to T1, then measures Branch Length Distance (Soria-Carrasco et al. 2007; Kuhner & Felsenstein 1994) • Data sets: L1 sequences from mammals, bats, and humans, with hand-aligned A*
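
The scale-then-measure idea can be sketched as below. The sketch assumes each tree has already been reduced to a mapping from bipartition (split) to branch length, with a split that is absent from a tree counted as length 0, and that the scale factor is the least-squares value that minimizes the branch length distance. It illustrates the idea only and is not the Ktreedist program; `k_tree_score` and the split labels are hypothetical.

```python
from math import sqrt

def k_tree_score(ref, other):
    """Scale `other` to `ref`, then return (scale factor, branch length distance).

    ref, other: dicts mapping a split (any hashable label) to its branch length.
    """
    splits = set(ref) | set(other)
    numerator = sum(ref.get(s, 0.0) * other.get(s, 0.0) for s in splits)
    denominator = sum(other.get(s, 0.0) ** 2 for s in splits)
    k = numerator / denominator  # least-squares scale factor
    distance = sqrt(
        sum((ref.get(s, 0.0) - k * other.get(s, 0.0)) ** 2 for s in splits)
    )
    return k, distance

# Same topology, branch lengths doubled: scaling removes the difference.
t1 = {frozenset("AB"): 1.0, frozenset("ABC"): 2.0}
t2 = {frozenset("AB"): 2.0, frozenset("ABC"): 4.0}
print(k_tree_score(t1, t2))
```

Scaling first means two trees that differ only in overall evolutionary rate score as identical, so the distance reflects differences in topology and relative branch lengths.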

  17. All methods are pretty good
