New Approaches for Inferring the Tree of Life. Tandy Warnow, Associate Professor, Department of Computer Sciences; Graduate Program in Ecology, Evolution, and Behavior; Co-Director, The Center for Computational Biology and Bioinformatics; The University of Texas at Austin.
Packard Proposal 1996 • I observed that DNA and RNA sequences, as currently analyzed, are low in phylogenetic signal. • I proposed to seek out and model new sources of significant phylogenetic signal, and then to develop efficient algorithms to extract that signal, so that evolutionary history could be inferred with greater accuracy.
What I did instead • Developed methods for use with biomolecular sequences that recover the true tree with high probability from polynomial length sequences. • (Last two years): Developed methods for reconstructing phylogenies from gene order and content within whole genomes. • (Last year): Started looking at inferring non-tree models of evolution.
DNA Sequence Evolution • [Figure: a tree of sequences evolving from the ancestor AAGACTT (3 million years ago) through AAGGCCT and TGGACTT (2 million years ago) and AGGGCAT, TAGCCCT, and AGCACTT (1 million years ago) to the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT]
Major Phylogenetic Reconstruction Methods • Polynomial-time distance-based methods (neighbor joining is perhaps the most popular) • NP-hard sequence-based methods (Maximum Parsimony and Maximum Likelihood) that can take years on real datasets • Heated debates over the relative performance of these methods
Quantifying Error • FN: false negative (missing edge) • FP: false positive (incorrect edge) • [Figure: a true tree and an estimated tree with one FN edge and one FP edge marked, giving a 50% error rate]
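The FN and FP rates above can be sketched in code by treating each internal edge as the bipartition of the taxa it induces. This is a minimal illustration, not the software used in the talk: the nested-tuple tree encoding and the names `bipartitions` and `error_rates` are assumptions made for this example, and the two five-taxon trees are made up so that the rates come out to 50%.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a tree is a leaf name or a tuple of subtrees.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> FrozenSet[str]:
    """All leaf names below (and including) this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Internal edges of the tree, each stored as the side of its
    bipartition that does not contain a fixed reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Keep only nontrivial splits: at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def error_rates(true_tree: Tree, est_tree: Tree) -> Tuple[float, float]:
    """Return (FN rate, FP rate) of est_tree with respect to true_tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(est_tree)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp


# Made-up example: each tree has two internal edges and they share one.
true_tree = (("A", "B"), "C", ("D", "E"))
est_tree = (("A", "B"), "D", ("C", "E"))
print(error_rates(true_tree, est_tree))  # (0.5, 0.5)
```

Storing each edge as the side that excludes a reference leaf makes the comparison independent of where the tuple encoding happens to be rooted.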
DCM-Boosting [Warnow et al. 2001] • DCM+SQS is a two-phase procedure that reduces the sequence length requirement of methods: applied to an exponentially converging method, DCM followed by SQS yields an absolute fast converging method. • We modify the second phase to improve the empirical performance, replacing SQS with ML (maximum likelihood) or MP (maximum parsimony).
DCMNJ+ML vs. other methods on a fixed model tree • 500-taxon rbcL tree • K2P+ model (=2, =1) • Avg. branch length = 0.278 • Relative performance is typical in our studies
Comparison of methods on random trees as a function of number of taxa • K2P+ model (=2, =1) • Avg. branch length = 0.05 • Seq. length = 1000
Summary • These are the first polynomial-time methods that improve upon NJ (with respect to topological accuracy) and are never worse than NJ. • The advantage obtained with DCMNJ+MP and DCMNJ+ML increases with the number of taxa. • In practice these new methods are slower than NJ (minutes vs. seconds), but still much faster than MP and ML (which can take days). • Conjecture: DCMNJ+ML is AFC.
II. Whole-Genome Phylogeny • [Figure: tree diagrams with leaves labeled A through F and internal nodes labeled W, X, Y, Z]
Genomes As Signed Permutations • 1 -5 3 4 -2 -6 or 6 2 -4 -3 5 -1, etc.
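The two permutations on this slide describe the same circular genome: reading the circle in the opposite direction reverses the gene order and flips every sign, and a circle has no distinguished starting gene. A minimal sketch of that equivalence follows; the function names (`reverse_complement`, `canonical`, `same_circular_genome`) are illustrative choices for this example, not taken from the talk.

```python
from typing import Sequence, Tuple


def reverse_complement(genome: Sequence[int]) -> Tuple[int, ...]:
    """Read the genome from the other strand: reverse order, flip signs."""
    return tuple(-gene for gene in reversed(genome))


def canonical(genome: Sequence[int]) -> Tuple[int, ...]:
    """Smallest tuple among all rotations of the genome and of its
    reverse complement; equal for two encodings of one circular genome."""
    if not genome:
        return ()
    candidates = []
    for reading in (tuple(genome), reverse_complement(genome)):
        for start in range(len(reading)):
            candidates.append(reading[start:] + reading[:start])
    return min(candidates)


def same_circular_genome(a: Sequence[int], b: Sequence[int]) -> bool:
    return canonical(a) == canonical(b)


# The two encodings from the slide.
g1 = (1, -5, 3, 4, -2, -6)
g2 = (6, 2, -4, -3, 5, -1)
print(same_circular_genome(g1, g2))  # True
```

Picking the minimum over all 2n readings is quadratic in the number of genes, which is fine for illustration but is not meant as an efficient representation.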
Genomes Evolve by Rearrangements • Original: 1 2 3 4 5 6 7 8 9 10 • Inversion: 1 2 3 -8 -7 -6 -5 -4 9 10 • Transposition: 1 2 3 9 4 5 6 7 8 10 • Inverted Transposition: 1 2 3 9 -8 -7 -6 -5 -4 10
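The three rearrangements above can be written as list operations on a signed gene order. This is a sketch for illustration only: it treats the genome as a linear list with half-open index ranges, and the function names and index convention are assumptions of this example. It reproduces the slide's three results from the gene order 1 through 10.

```python
from typing import List


def inversion(genome: List[int], i: int, j: int) -> List[int]:
    """Reverse the segment genome[i:j] and flip the sign of each gene."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome: List[int], i: int, j: int, k: int) -> List[int]:
    """Cut the segment genome[i:j] and reinsert it just before the gene
    at index k of the original list (requires i < j <= k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome: List[int], i: int, j: int, k: int) -> List[int]:
    """Like transposition, but the moved segment is also inverted."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10

# Genes 4..8 occupy indices 3..7, so the segment is genome[3:8].
print(inversion(genome, 3, 8))
print(transposition(genome, 3, 8, 9))
print(inverted_transposition(genome, 3, 8, 9))
```

Each function returns a new list and leaves its input unchanged, so the same starting genome can be reused for all three operations.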
Genome Rearrangement Has A Huge State Space • DNA sequences: 4 states per site • Signed circular genomes with n genes: 2^(n-1) (n-1)! states, 1 site • Circular genomes (1 site) with 37 genes: 2.56 × 10^52 states • Circular genomes (1 site) with 120 genes: 3.70 × 10^232 states
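The size of this state space can be checked with a few lines of arithmetic. The sketch below assumes the standard count for signed circular genomes: 2^n n! signed linear orders, divided by n rotations and by 2 for the two reading directions, giving 2^(n-1) (n-1)!. The helper names `signed_circular_genome_count` and `sci` are hypothetical names chosen for this example.

```python
from math import factorial


def signed_circular_genome_count(n: int) -> int:
    """Number of distinct signed circular genomes on n genes, assuming
    rotations and the two reading directions are identified."""
    return 2 ** (n - 1) * factorial(n - 1)


def sci(x: int) -> str:
    """Format a large positive integer as 'm.mme<exponent>'."""
    exponent = len(str(x)) - 1
    mantissa = x / 10 ** exponent  # exact big-int division, then rounded
    return f"{mantissa:.2f}e{exponent}"


for n in (37, 120):
    print(n, sci(signed_circular_genome_count(n)))

# For comparison: a DNA site has only 4 states.
print(4 ** 1)
```

The contrast with sequence data is that a DNA alignment has many sites with 4 states each, whereas a whole genome under this model is a single site with an astronomically large number of states.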
Our Approaches • Statistically-based genomic distance estimators so that NJ analyses are more accurate, recovering 90% of the edges even for datasets close to saturation. • Improved bounds for tree length. • GRAPPA: high performance implementation for the maximum parsimony problems for rearranged genomes, achieving up to 200,000-fold speedup.
Accuracy of Neighbor Joining Using Distance Estimators • 120 genes • Inversion-only evolution (other models of evolution show the same relative performance) • 10, 20, 40, 80, and 160 genomes
Consensus of 216 MP Trees for the Campanulaceae dataset • Taxa: Trachelium, Campanula, Adenophora, Symphyandra, Legousia, Asyneuma, Triodanis, Wahlenbergia, Merciera, Codonopsis, Cyananthus, Platycodon, Tobacco • Strict consensus of 216 trees; 6 out of 10 internal edges recovered.
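A strict consensus keeps only the internal edges (bipartitions) that appear in every input tree. A fully resolved unrooted tree on n taxa has n - 3 internal edges, so the 13 taxa here allow at most 10, of which the consensus retains 6. The sketch below illustrates the idea on two made-up five-taxon trees; the nested-tuple encoding and the names `bipartitions` and `strict_consensus` are assumptions of this example, not the tools used for the Campanulaceae analysis.

```python
from typing import FrozenSet, List, Set, Union

# Hypothetical encoding: a tree is a leaf name or a tuple of subtrees.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> FrozenSet[str]:
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Internal edges, each stored as the side of its bipartition that
    excludes a fixed reference leaf."""
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    splits: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def strict_consensus(trees: List[Tree]) -> Set[FrozenSet[str]]:
    """Bipartitions shared by every tree (all trees must be on the same taxa)."""
    if not trees:
        return set()
    return set.intersection(*(bipartitions(tree) for tree in trees))


# Made-up example: two trees on five taxa that agree on one internal edge.
trees = [(("A", "B"), "C", ("D", "E")), (("A", "B"), "D", ("C", "E"))]
shared = strict_consensus(trees)
max_edges = len(leaves(trees[0])) - 3
print(f"{len(shared)} out of {max_edges} internal edges recovered")
```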
Future Work • New focus on Rare Genomic Changes • New data • New models • New methods • New techniques for large-scale analyses • Divide-and-conquer methods • Non-tree models • Visualization of large trees and large sets of trees
Acknowledgements • Funding: The David and Lucile Packard Foundation, The National Science Foundation, and Paul Angello • Collaborators: Robert Jansen (U. Texas); Bernard Moret, David Bader, Mi-Yan (U. New Mexico); Daniel Huson (Celera); Katherine St. John (CUNY); Linda Raubeson (Central Washington U.); Luay Nakhleh, Usman Roshan, Jerry Sun, Li-San Wang, Stacia Wyman (Phylolab, U. Texas)
Phylolab, U. Texas • Please visit us at http://www.cs.utexas.edu/users/phylo/