Guide Trees and Progressive Multiple Sequence Alignment

James A. Foster and Luke Sheneman • 1 October 2008 • Initiative for Bioinformatics and Evolutionary Studies (IBEST)

Multiple Sequence Alignment • Abstract representation of sequence homology • Homologous molecular characters (nucleotides/residues) organized in columns • Gaps (-) represent sequence indels

Multiple Sequence Alignment • Many bioinformatics analyses depend on MSA. • First step in inferring phylogenetic trees • MSA technique is at least as important as inference method and model parameters (Morrison & Ellis, 1997) • Structural and functional sequence analyses

Progressive Alignment • Idea: align “closely related” sequences first, two at a time, with “optimal” subalignments (dynamic programming) • Problem: once a gap, always a gap • Advantage: fast
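The “optimal” pairwise step can be illustrated with a minimal Needleman-Wunsch sketch. The scoring values (match = 1, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for illustration; the slides do not state which pairwise algorithm or scoring scheme was used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b).

    Scoring parameters are illustrative assumptions, not values from the talk.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

In a progressive aligner the same recurrence is applied to profiles (subalignments) rather than single sequences, and gaps placed at one merge are never revisited, which is the “once a gap, always a gap” problem.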

Guide Trees and Alignment Quality • How important is it to find “good” guide trees? • How much time should be spent looking for “better” guide trees?

Hypothesis • Guide trees that are closer to the true phylogeny lead to better sequence alignments • Guide trees that are further from the true tree produce less accurate alignments. • The effect is measurable. • The correlation is significant.

Previous Work • Folk wisdom, intuition: it matters, a lot • Basis for Clustal, and most other pMSA implementations • Nelesen et al. (PSB ’08): doesn’t matter, much • No strong correlation • No large effect • Edgar (2004): bad trees are sometimes better • UPGMA guide trees are ultrametric but outperform NJ

Experimental Design: strategy • For both natural data and simulation data, with reliable alignments and phylogenies: • Explore the space of possible guide trees, moving outward from the “true tree” • Use each tree as a guide tree, perform pMSA • Compare quality of resulting alignment with known optimal value

Experimental Design: Naturally Evolved Case

Experimental Design: Degrading Guide Trees • Random Nearest Neighbor Interchange (NNI) • Swaps two neighboring internal branches • Random Tree Bisect/Reconnect (TBR) • Randomly bisect tree • Randomly reconnect two trees • Images: hyphy.org
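A rough sketch of how repeated random NNI moves might degrade a guide tree follows. The nested-list tree representation (rooted, strictly binary, leaves as strings) and the names `random_nni` and `degrade` are assumptions made for illustration; the slides do not describe the actual implementation, and TBR is not covered here.

```python
import copy
import random

def leaves(tree):
    """Return the leaf labels of a nested-list tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def internal_edges(tree, found=None):
    """Collect (parent, index) pairs where parent[index] is an internal node."""
    if found is None:
        found = []
    if isinstance(tree, str):
        return found
    for i, child in enumerate(tree):
        if not isinstance(child, str):
            found.append((tree, i))
            internal_edges(child, found)
    return found

def random_nni(tree, rng):
    """Return a copy of `tree` after one random NNI move.

    An internal edge (parent -> child) is picked at random, and one subtree
    below `child` is exchanged with the sibling subtree of `child`.
    """
    new_tree = copy.deepcopy(tree)      # never mutate the caller's tree
    edges = internal_edges(new_tree)
    if not edges:                       # too small to rearrange
        return new_tree
    parent, i = rng.choice(edges)
    child = parent[i]
    sibling = 1 - i                     # binary tree assumed
    k = rng.randrange(2)
    parent[sibling], child[k] = child[k], parent[sibling]
    return new_tree

def degrade(tree, steps, seed=0):
    """Apply `steps` random NNI moves, walking away from the starting tree."""
    rng = random.Random(seed)
    for _ in range(steps):
        tree = random_nni(tree, rng)
    return tree
```

Because moves are random, the number of steps applied is only an upper bound on the true edit distance from the starting tree: later moves can undo earlier ones.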

TreeBASE (“natural”) Input Datasets

Experimental Design: Simulated Evolution Case

Conclusions • Statistically significant correlation between guide tree quality and alignment quality • Independent of tree transformation operator • Independent of alignment distance metric • But very small absolute change in quality • Non-linear / logarithmic • Largest alignment quality effect 5-10 steps from the true phylogeny • The lesson: improving an already very good guide tree helps; improving any other guide tree also helps, but only a little

Acknowledgements • Dr. Luke Sheneman (mostly his slides!) • Faculty, staff, and students of BCB • Jason Evans • Darin Rokyta • Funding sources: • NIH P20 RR16454 • NIH NCRR 1P20 RR16448 • NSF EPS 00809035

Experimental Design: metrics • Â = pmsa(S, T) • where S is the set of input sequences • where T is the guide tree • (hidden parameters: pairwise algorithm, tie-breaking strategy) • AQ = CompareAlignments(A*, Â) • QSCORE(A*, Â) -> TC-error, SP-error • Nelesen had a nicer metric: error of estimated phylogeny • Tdist = TreeDistance(T*, T) • Upper bound estimate of edit distance via NNI or TBR
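The alignment comparison above can be sketched as follows. This is not QSCORE's own code: it is a minimal illustration of the usual sum-of-pairs (SP) and total-column (TC) definitions, with errors taken as one minus the score. The function names, and the choice to count only reference columns with at least two residues for TC, are assumptions.

```python
from itertools import combinations

def column_indices(alignment):
    """For each column, give every row's residue index (None for a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        columns.append(tuple(col))
    return columns

def aligned_pairs(alignment):
    """Set of residue pairs ((row, idx), (row, idx)) placed in the same column."""
    pairs = set()
    for col in column_indices(alignment):
        present = [(r, idx) for r, idx in enumerate(col) if idx is not None]
        pairs.update(combinations(present, 2))
    return pairs

def sp_error(reference, test):
    """1 - fraction of reference residue pairs that the test alignment keeps."""
    ref_pairs = aligned_pairs(reference)
    return 1.0 - len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

def tc_error(reference, test):
    """1 - fraction of reference columns reproduced exactly in the test."""
    ref_cols = [c for c in column_indices(reference)
                if sum(idx is not None for idx in c) >= 2]
    test_cols = set(column_indices(test))
    return 1.0 - sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

Both functions expect the two alignments to contain the same sequences in the same row order; here `reference` plays the role of A* and `test` the role of Â.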

Alternative Scoring Metric • Idea: “quality” of an alignment is the distance from the phylogeny it produces to the “true” phylogeny • AQ = KTreeDist(ML_est(A*), ML_est(Â)) • ML_est(A): maximum likelihood estimate of the phylogeny behind MSA A (we used RAxML) • KTreeDist(T1, T2): scales T2 to T1, measures Branch Length Distance (Soria-Carrasco et al. 07; Kuhner & Felsenstein 94) • Data sets: from L1 sequences in mammals, bats, humans, hand-aligned A*
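A hedged sketch of the scaled branch length distance follows. It assumes each tree has already been reduced to a dictionary mapping a bipartition (a frozenset of leaf names on one side of a branch) to that branch's length, and it assumes the scale factor K is the least-squares value that minimizes the distance after scaling. The function name `k_score` is illustrative and is not the KTreeDist program's interface.

```python
import math

def k_score(reference, other):
    """Scale `other` onto `reference`, then return (K, branch length distance).

    Each tree is a dict: bipartition (frozenset of leaf names) -> branch length.
    A branch missing from one tree counts as length 0 in that tree.
    """
    splits = set(reference) | set(other)
    a = [reference.get(s, 0.0) for s in splits]
    b = [other.get(s, 0.0) for s in splits]
    denom = sum(y * y for y in b)
    # Least-squares scale factor: minimizes sum((a_i - K * b_i) ** 2).
    k = sum(x * y for x, y in zip(a, b)) / denom if denom else 1.0
    dist = math.sqrt(sum((x - k * y) ** 2 for x, y in zip(a, b)))
    return k, dist
```

Scaling first means that two trees with the same shape and proportional branch lengths score a distance of zero, so the metric responds to differences in topology and relative branch lengths rather than overall rate.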

All methods are pretty good
