Disk-Covering Methods for phylogeny reconstruction

Presentation Transcript


  1. Disk-Covering Methods for phylogeny reconstruction Tandy Warnow The University of Texas at Austin

  2. Phylogeny From the Tree of Life Website, University of Arizona. [Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human.]

  3. Reconstructing the “Tree” of Life • Handling large datasets: millions of species • NSF funds many projects towards this goal, under the Assembling the Tree of Life (ATOL) program

  4. DNA Sequence Evolution [Figure: an ancestral sequence AAGACTT at -3 million years evolves through intermediates such as AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, and AGCACTT into the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT.]

  5. Phylogeny Problem [Figure: given the sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, and TGCGCTT at the taxa U, V, W, X, Y, infer the unrooted tree relating them.]

  6. Steps in a phylogenetic analysis • Gather data • Align sequences • Estimate phylogeny on the multiple alignment • Estimate the reliable aspects of the evolutionary history (using bootstrapping, consensus trees, or other methods) • Perform post-tree analyses.

  7. Phylogenetic reconstruction methods [Figure: search landscape over phylogenetic trees, with cost on the vertical axis and local and global optima marked.] • Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) • Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc. • Bayesian methods

  8. Performance criteria • Running time. • Space. • Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution. • “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation. • Accuracy with respect to a particular criterion (e.g. tree length or likelihood score), on real data.

  9. Markov models of site evolution Simplest (Jukes-Cantor): • The model tree is a pair (T, {p(e)}), where T is a rooted binary tree and p(e) is the probability of a substitution on the edge e • The state at the root is random • If a site changes on an edge, it changes with equal probability to each of the remaining states • The evolutionary process is Markovian More complex models (such as the General Markov model) are also considered, with little change to the theory. Variation between different sites is either prohibited or minimized, in order to ensure identifiability of the model.
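A minimal sketch of the Jukes-Cantor process just described, simulating sites down a model tree. The tree encoding (child-to-parent dict), the root name, and the per-edge substitution probabilities are illustrative assumptions, not taken from the slides.

```python
import random

STATES = "ACGT"

def evolve_site(parent_state, p_change):
    """With probability p_change the site changes, moving to one of the three
    other states chosen uniformly; otherwise it keeps the parent's state."""
    if random.random() < p_change:
        return random.choice([s for s in STATES if s != parent_state])
    return parent_state

def simulate_jc(tree, p_edge, root, seq_len):
    """tree: dict mapping each non-root node to its parent.
    p_edge: dict mapping each non-root node to the substitution probability
    on the edge above it. Returns a dict of simulated sequences."""
    def depth(v):
        return 0 if v == root else 1 + depth(tree[v])
    order = sorted(set(tree) | {root}, key=depth)     # parents before children
    seqs = {root: [random.choice(STATES) for _ in range(seq_len)]}  # random root state
    for v in order:
        if v != root:
            seqs[v] = [evolve_site(s, p_edge[v]) for s in seqs[tree[v]]]
    return {v: "".join(s) for v, s in seqs.items()}

# Hypothetical example: a four-leaf rooted binary tree ((A,B),(C,D)).
tree = {"X": "root", "Y": "root", "A": "X", "B": "X", "C": "Y", "D": "Y"}
p_edge = {v: 0.1 for v in tree}
print(simulate_jc(tree, p_edge, "root", 20))
```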

  10. Distance-based Phylogenetic Methods

  11. Maximum Parsimony • Input: Set S of n aligned sequences of length k • Output: • A phylogenetic tree T leaf-labeled by sequences in S • additional sequences of length k labeling the internal nodes of T such that Σ_{(i,j)∈E(T)} H(i,j) is minimized, where H(i,j) denotes the Hamming distance between the sequences at nodes i and j
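A small sketch of the criterion just defined: once every node of the tree (leaves and internal nodes) carries a sequence, the parsimony score is the sum of Hamming distances across the edges. The node names and edge list below are invented for illustration; the labeling matches the four-sequence worked example that appears later in the talk.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def parsimony_score(edges, seq):
    """edges: list of (node_i, node_j) pairs; seq: dict node -> sequence."""
    return sum(hamming(seq[i], seq[j]) for i, j in edges)

# Leaves ACT, ACA, GTT, GTA with internal labels ACA and GTA give a score of 4.
seq = {"1": "ACT", "2": "ACA", "3": "GTT", "4": "GTA", "a": "ACA", "b": "GTA"}
edges = [("1", "a"), ("2", "a"), ("a", "b"), ("b", "3"), ("b", "4")]
print(parsimony_score(edges, seq))  # 4
```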

  12. Maximum Likelihood • Input: Set S of n aligned sequences of length k, and a specified parametric model • Output: • A phylogenetic tree T leaf-labeled by sequences in S • With additional model parameters (e.g. edge “lengths”) such that Pr[S|(T, params)] is maximized.

  13. Approaches for “solving” MP/ML [Figure: search landscape over phylogenetic trees, with cost on the vertical axis and local and global optima marked.] • Hill-climbing heuristics (which can get stuck in local optima) • Randomized algorithms for getting out of local optima • Approximation algorithms for MP (based upon Steiner Tree approximation algorithms).

  14. Theoretical results • Neighbor Joining is polynomial time, and statistically consistent under typical models of evolution. • Maximum Parsimony is NP-hard, and even exact solutions are not statistically consistent under typical models. • Maximum Likelihood is NP-hard and statistically consistent under typical models.

  15. Theoretical convergence rates • Atteson: Let T be a General Markov model tree defining additive matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences of length at least O(lg n · e^(max_ij Dij)). • Proof: Show NJ is accurate on any input matrix d such that max{|Dij - dij|} < f/2, for f equal to the minimum edge “length”.
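Since Neighbor Joining is the running example here, a compact textbook-style sketch of the algorithm may help. The dict-of-dicts input format and the invented internal node labels are our choices; this is not code from the talk.

```python
def neighbor_joining(D):
    """D: dict-of-dicts distance matrix on the taxa.
    Returns the NJ tree as a list of (node, node, branch_length) edges."""
    D = {i: dict(D[i]) for i in D}                 # work on a copy
    nodes = list(D)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(D[i][k] for k in nodes if k != i) for i in nodes}
        # pick the pair (i, j) minimizing the Q-criterion
        i, j = min(((a, b) for a in nodes for b in nodes if a < b),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"internal_{next_id}"; next_id += 1     # invented internal node name
        li = 0.5 * D[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = D[i][j] - li
        edges += [(u, i, li), (u, j, lj)]
        D[u] = {}
        for k in nodes:
            if k in (i, j):
                continue
            duk = 0.5 * (D[i][k] + D[j][k] - D[i][j])
            D[u][k] = duk
            D[k][u] = duk
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, D[a][b]))                   # join the last two nodes
    return edges
```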

  16. Problems with NJ • Theory: The convergence rate is exponential: the number of sites needed to obtain an accurate reconstruction of the tree with high probability grows exponentially in the evolutionary diameter. • Empirical: NJ has poor performance on datasets with some large leaf-to-leaf distances.

  17. Quantifying Error • FN: false negative (missing edge) • FP: false positive (incorrect edge) [Figure: a true tree and an estimated tree compared edge by edge; the example shown has a 50% error rate.]
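A minimal sketch of how the FN and FP rates on this slide can be computed, using the bipartitions (splits) induced by the internal edges of the true and estimated trees. The split sets below are written out by hand and are purely illustrative.

```python
def error_rates(true_splits, estimated_splits):
    """Each split is a frozenset of the taxa on one side of an internal edge
    (the side not containing some fixed reference taxon)."""
    fn = len(true_splits - estimated_splits)    # true-tree edges that were missed
    fp = len(estimated_splits - true_splits)    # estimated edges that are wrong
    return fn / len(true_splits), fp / len(estimated_splits)

# Hypothetical example on six taxa: the trees agree only on the split {a, b}.
true_splits = {frozenset("ab"), frozenset("abc"), frozenset("ef")}
est_splits  = {frozenset("ab"), frozenset("abd"), frozenset("df")}
print(error_rates(true_splits, est_splits))     # (0.667, 0.667): 2 of 3 edges wrong
```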

  18. Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect the proportion of incorrect edges in inferred trees. [Figure: NJ error rate vs. number of taxa (0 to 1600); the error rate climbs toward 0.8 as the number of taxa grows.]

  19. Other standard polynomial time methods don’t improve substantially on NJ (and have the same problem with large diameter datasets). • What about trying to “solve” maximum parsimony or maximum likelihood?

  20. Solving NP-hard problems exactly is … unlikely • Number of (unrooted) binary trees on n leaves is (2n-5)!! • If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
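As a sanity check on the counting argument above, here is a small sketch that computes (2n-5)!! and repeats the slide’s thought experiment of spending 0.001 seconds per tree; the choice of 20 taxa in the last line is ours, purely for illustration.

```python
def num_trees(n):
    """Number of distinct unrooted binary trees on n leaves:
    (2n-5)!! = 1 * 3 * 5 * ... * (2n-5), for n >= 3."""
    result = 1
    for k in range(3, 2 * n - 4, 2):
        result *= k
    return result

print(num_trees(5))                        # 15 trees on 5 leaves
print(num_trees(10))                       # 2,027,025 trees on 10 leaves
seconds = num_trees(20) * 0.001            # exhaustive search is hopeless already at 20 taxa
print(f"{seconds / (3600 * 24 * 365):.2e} years for 20 taxa")
```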

  21. How good an MP analysis do we need? • Our research shows that we need to get within 0.01% of optimal (or even better, on large datasets) to return reasonable estimates of the true tree’s “topology”

  22. Problems with current techniques for MP Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%. [Figure: performance of TNT over time.]

  23. Empirical problems with existing methods • Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets (take too long!) – we need new heuristics for MP/ML that can analyze large datasets • Polynomial time methods have poor topological accuracy on large diameter datasets – we need better polynomial time methods

  24. Using divide-and-conquer • Conjecture: better (more accurate) solutions will be found if we analyze a small number of smaller subsets and then combine solutions • Note: different “base” methods will need potentially different decompositions. • Alert: the subtree compatibility problem is NP-complete!

  27. DCMs: Divide-and-conquer for improving phylogeny reconstruction

  28. Strict Consensus Merger (SCM)

  29. “Boosting” phylogeny reconstruction methods • DCMs “boost” the performance of phylogeny reconstruction methods. [Figure: a base method M is fed into a DCM, producing the boosted method DCM-M.]

  30. DCMs (Disk-Covering Methods) • DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution • DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation)

  31. Absolute fast convergence vs. exponential convergence

  32. DCM-Boosting [Warnow et al. 2001] • DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods. [Figure: an exponentially converging method is passed through DCM and then SQS to produce an absolute fast converging method.]

  33. DCM1 Decompositions Input: Set S of sequences, distance matrix d, a threshold value 1. Compute the threshold graph 2. Perform a minimum weight triangulation DCM1 decomposition: compute the maximal cliques of the triangulated graph
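A rough sketch of the three steps listed on this slide, using networkx. One loud caveat: DCM1 calls for a minimum weight triangulation, while complete_to_chordal_graph only produces some chordal completion via a fill-in heuristic, so step 2 here is a stand-in, not the real procedure. The example distance matrix and threshold are invented.

```python
import networkx as nx

def dcm1_subproblems(taxa, d, threshold):
    # Step 1: threshold graph -- connect taxa whose distance is at most the threshold.
    g = nx.Graph()
    g.add_nodes_from(taxa)
    for i in taxa:
        for j in taxa:
            if i < j and d[i][j] <= threshold:
                g.add_edge(i, j)
    # Step 2: triangulate (heuristic stand-in for a minimum weight triangulation).
    chordal, _ = nx.complete_to_chordal_graph(g)
    # Step 3: the subproblems are the maximal cliques of the triangulated graph.
    return [set(c) for c in nx.chordal_graph_cliques(chordal)]

# Hypothetical five-taxon distance matrix.
taxa = ["A", "B", "C", "D", "E"]
d = {i: {} for i in taxa}
dist = {("A","B"): 1, ("A","C"): 2, ("A","D"): 5, ("A","E"): 6,
        ("B","C"): 1, ("B","D"): 4, ("B","E"): 5,
        ("C","D"): 3, ("C","E"): 4, ("D","E"): 1}
for (i, j), v in dist.items():
    d[i][j] = d[j][i] = v
print(dcm1_subproblems(taxa, d, threshold=3))   # e.g. {A,B,C}, {C,D}, {D,E}
```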

  34. DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] • DCM1-boosting makes distance-based methods more accurate • Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences [Figure: error rate vs. number of taxa (0 to 1600); NJ’s error rate climbs toward 0.8 while DCM1-NJ’s stays low.]

  35. Major challenge: MP and ML • Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists • The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets

  36. Maximum Parsimony • Input: Set S of n aligned sequences of length k • Output: A phylogenetic tree T • leaf-labeled by sequences in S • additional sequences of length k labeling the internal nodes of T such that Σ_{(i,j)∈E(T)} H(i,j) is minimized.

  37. Maximum parsimony (example) • Input: Four sequences • ACT • ACA • GTT • GTA • Question: which of the three trees has the best MP score?

  38. Maximum Parsimony [Figure: the three possible unrooted tree topologies on the four sequences ACT, ACA, GTT, GTA.]

  39. Maximum Parsimony [Figure: the three candidate trees with optimal internal labelings. Two of the trees have MP scores 7 and 5; the tree that pairs ACT with ACA and GTT with GTA (internal labels ACA and GTA, edge costs 2, 1, 1) has MP score 4 and is the optimal MP tree.]

  40. Maximum Parsimony: computational complexity • The optimal labeling of a fixed tree can be computed in linear time, O(nk). [Figure: the optimal tree from the previous slide, with internal labels ACA and GTA and MP score 4.] • Finding the optimal MP tree is NP-hard.
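The linear-time labeling claim on this slide is usually credited to Fitch's algorithm; below is a minimal sketch of it on one alignment column at a time. The tree is rooted arbitrarily (rooting an unrooted binary tree does not change the parsimony score), and the node names and tree structure are those of the four-sequence example above.

```python
def fitch_score(node, children, leaf_seq, site):
    """Return (state_set, cost) for one alignment column in the subtree under node."""
    if node not in children:                          # leaf
        return {leaf_seq[node][site]}, 0
    left, right = children[node]
    s1, c1 = fitch_score(left, children, leaf_seq, site)
    s2, c2 = fitch_score(right, children, leaf_seq, site)
    inter = s1 & s2
    if inter:                                         # children agree: no extra change
        return inter, c1 + c2
    return s1 | s2, c1 + c2 + 1                       # disagreement: one substitution

def mp_score(root, children, leaf_seq, k):
    return sum(fitch_score(root, children, leaf_seq, site)[1] for site in range(k))

# The optimal tree from the example slides: ((ACT, ACA), (GTT, GTA)).
leaf_seq = {"1": "ACT", "2": "ACA", "3": "GTT", "4": "GTA"}
children = {"root": ("a", "b"), "a": ("1", "2"), "b": ("3", "4")}
print(mp_score("root", children, leaf_seq, k=3))      # 4
```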

  41. Problems with current techniques for MP Best methods are a combination of simulated annealing, divide-and-conquer and genetic algorithms, as implemented in the software package TNT. However, they do not reach 0.01% of optimal on large datasets in 24 hours. [Figure: performance of TNT over time.]

  42. Observations • The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets. • Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. • Apparent convergence can be misleading.

  43. Our objective: speed up the best MP heuristics [Figure: a schematic (“fake study”) plot of the MP score of the best trees found vs. time, contrasting the performance of a hill-climbing heuristic with the desired performance.]

  44. Divide-and-conquer technique for speeding up MP/ML searches

  45. DCM Decompositions Input: Set S of sequences, distance matrix d, a threshold value 1. Compute the threshold graph 2. Perform a minimum weight triangulation • DCM1 decomposition: the maximal cliques of the triangulated graph • DCM2 decomposition: a clique separator plus the components it separates
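A rough sketch of the “clique-separator plus component” idea named on this slide, reusing the hypothetical threshold-graph-plus-chordal-completion setup from the earlier DCM1 sketch. How DCM2 actually chooses its separator is not spelled out here, so this version simply takes the first maximal clique whose removal disconnects the graph; it is purely illustrative.

```python
import networkx as nx

def dcm2_subproblems(g):
    """g: a connected threshold graph (nx.Graph). Returns a list of node sets,
    each being a clique separator plus one component of the remaining graph."""
    chordal, _ = nx.complete_to_chordal_graph(g)
    for clique in nx.chordal_graph_cliques(chordal):
        rest = chordal.copy()
        rest.remove_nodes_from(clique)
        components = list(nx.connected_components(rest))
        if len(components) > 1:                 # this clique separates the graph
            return [set(clique) | comp for comp in components]
    return [set(chordal.nodes)]                 # no separating clique found: one subproblem
```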

  46. Empirical observation • DCM1 is not as good as DCM2 for MP • DCM2 decompositions are too large and too slow to compute • Neither improved on the best MP heuristics.

  47. How can we improve upon existing techniques?

  48. Tree Bisection and Reconnection (TBR)

  49. Tree Bisection and Reconnection (TBR) [Figure: delete an edge, bisecting the tree into two subtrees.]

  50. Tree Bisection and Reconnection (TBR)
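The three slides above only name the TBR move; as a concrete illustration, here is a rough sketch of a single TBR move on an unrooted binary tree stored as a networkx graph. The helper names, the assumption that the bisected edge is internal, and the way the two reconnection edges are supplied by the caller are all our choices, not part of the slides.

```python
import networkx as nx

def suppress_degree_two(tree):
    """Splice out the degree-2 nodes left behind when an edge is deleted."""
    for node in [n for n in tree.nodes if tree.degree(n) == 2]:
        a, b = tree.neighbors(node)
        tree.remove_node(node)
        tree.add_edge(a, b)

def tbr_move(tree, bisect_edge, reconnect_edge_1, reconnect_edge_2):
    """Delete bisect_edge (bisecting the tree), then reconnect the two pieces
    by subdividing one edge in each piece and joining the two new nodes."""
    t = tree.copy()
    t.remove_edge(*bisect_edge)
    suppress_degree_two(t)
    new_nodes = []
    for k, (u, v) in enumerate([reconnect_edge_1, reconnect_edge_2]):
        x = f"attach{k}"                       # invented names for the new nodes
        t.remove_edge(u, v)
        t.add_edge(u, x)
        t.add_edge(x, v)
        new_nodes.append(x)
    t.add_edge(*new_nodes)
    return t

# Hypothetical example on six leaves: bisect the central edge of a caterpillar
# tree and reattach the two halves along different edges.
g = nx.Graph([("a", "p"), ("b", "p"), ("p", "q"), ("q", "r"), ("c", "q"),
              ("r", "s"), ("d", "r"), ("e", "s"), ("f", "s")])
new_tree = tbr_move(g, ("q", "r"), ("a", "p"), ("e", "s"))
print(sorted(new_tree.edges))
```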
