1 / 48

Mathematical and Computational Challenges in Reconstruction Evolution

This talk explores models of evolution, identifying challenges in phylogenetic signal and heterogeneity, and discusses open questions and future directions in the field. It also covers the statistical consistency of estimation methods and the impact of genome-scale data and incomplete lineage sorting (ILS) on phylogenomic analysis.

ccates
Download Presentation

Mathematical and Computational Challenges in Reconstruction Evolution

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mathematical and Computational Challenges in Reconstruction Evolution Tandy Warnow The University of Illinois

  2. Phylogeny(evolutionary tree) Orangutan Human Gorilla Chimpanzee From the Tree of the Life Website,University of Arizona

  3. “Nothing in biology makes sense except in the light of evolution” • Theodosius Dobzhansky, 1973 essay in the American Biology Teacher, vol. 35, pp 125-129 • “…... nothing in evolution makes sense except in the light of phylogeny ...” • Society of Systematic Biologists, http://systbio.org/teachevolution.html

  4. This Talk • Models of evolution, identifiability, statistical consistency • The challenges of heterogeneity and phylogenetic signal • Open questions • Future directions

  5. -3 mil yrs AAGACTT AAGACTT -2 mil yrs AAGGCCT AAGGCCT AAGGCCT AAGGCCT TGGACTT TGGACTT TGGACTT TGGACTT -1 mil yrs AGGGCAT AGGGCAT AGGGCAT TAGCCCT TAGCCCT TAGCCCT AGCACTT AGCACTT AGCACTT today AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT DNA Sequence Evolution (Idealized)

  6. Markov Models of Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).

  7. Markov Models of Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution). Simplest site evolution model (Jukes-Cantor, 1969): • The model tree T is binary and has substitution probabilities p(e) on each edge e. • The state at the root is randomly drawn from {A,C,T,G} (nucleotides) • If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. • The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.

  8. Statistical Consistency error Data

  9. Questions • Is the model tree identifiable? • Which estimation methods are statistically consistent under this model? • How much data does the method need to estimate the model tree correctly (with high probability)?

  10. Answers? • We know a lot about which site evolution models are identifiable, and which methods are statistically consistent. • We know a little bit about the sequence length requirements for standard methods. Take home message: need to limit (or not allow) heterogeneity to get identifiability!

  11. Genome-scale data? error Data

  12. Phylogenomics Phylogeny + genomics = genome-scale phylogeny estimation .

  13. Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

  14. Gene trees inside the species tree (Coalescent Process) Past Present Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

  15. Gene trees inside the species tree (Coalescent Process) Deep coalescence = INCOMPLETE LINEAGE SORTING (ILS): gene tree can be different from the species tree Past Present Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

  16. 1KP: Thousand Transcriptome Project T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant • 103 plant transcriptomes, 400-800 single copy “genes” • Next phase will be much bigger • Wickett, Mirarab et al., PNAS 2014 • Major Challenge: • Massive gene tree heterogeneity consistent with ILS

  17. Major challenge: • Massive gene tree heterogeneity consistent with ILS.

  18. Statistically consistent methods • Coalescent-based summary methods: Estimate gene trees, and then combine together (ASTRAL, ASTRID, MP-EST, NJst, and others) • Co-estimation methods: Co-estimate gene trees and species trees (TOO EXPENSIVE) • Site-based methods: estimate the species tree from the concatenated alignment, and do not estimate gene trees (NOT WELL STUDIED)

  19. Main competing approaches gene 1gene 2 . . . gene k . . . Analyze separately . . . Summary Method Species Concatenation

  20. What about summary methods? . . .

  21. What about summary methods? . . . Techniques: Most frequent gene tree? Consensus of gene trees? Other?

  22. Species tree estimation from unrooted gene trees Theorem (Allman et al.): Under the multi-species coalescent model, for any four taxa A, B, C, and D, the most probable unrooted gene tree on {A,B,C,D} is identical to the unrooted species tree induced on {A,B,C,D}.

  23. Species tree estimation from unrooted gene trees Theorem (Allman et al.): Under the multi-species coalescent model, for any four taxa A, B, C, and D, the most probable unrooted gene tree on {A,B,C,D} is identical to the unrooted species tree induced on {A,B,C,D}. Proof: For every four species, select most frequently observed tree as the species tree. Then combine quartet trees!

  24. Species tree estimation from unrooted gene trees Theorem (Allman et al.): Under the multi-species coalescent model, for any four taxa A, B, C, and D, the most probable unrooted gene tree on {A,B,C,D} is identical to the unrooted species tree induced on {A,B,C,D}. Proof: For every four species, select most frequently observed tree as the species tree. Then combine quartet trees!

  25. Impact of Gene Tree Estimation Error(from Molloy and Warnow 2017) Error is fraction of bipartitions that are not recovered Summary Methods Site-based Method

  26. Impact of Gene Tree Estimation Error(from Molloy and Warnow 2017) Note: Summary methods better than CA-ML for low GTEE, then worse! Error is fraction of bipartitions that are not recovered Summary Methods Site-based Method

  27. Statistical Consistency for summary methods error Data Data are gene trees, presumed to be randomly sampled true gene trees.

  28. Gene tree estimation error: key issue in the debate • Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. • Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. • Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

  29. What about performance on bounded number of sites? • Question: Do any summary methods converge to the species tree as the number of loci increase, but where each locus has only a constant number of sites? • Answers: Roch& Warnow, SystBiol, March 2015: • Strict molecular clock: Yes for some new methods, even for a single site per locus • No clock: Unknown for all methods, including MP-EST, ASTRAL, etc. S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015

  30. Theoretical questions • Are any summary methods statistically consistent with bounded numbers of sites per locus (and no strict molecular clock)? • What are the data requirements for the different summary methods (assuming true gene trees)? • Is SVDquartets statistically consistent? • Is the underlying species tree identifiable when ILS and horizontal gene transfer (HGT) are both present?

  31. Theoretical questions • Are any summary methods statistically consistent with bounded numbers of sites per locus (and no strict molecular clock)? • What is the sample complexity for the different summary methods (assuming true gene trees)? • Is SVDquartets statistically consistent? • Is the underlying species tree identifiable when ILS and horizontal gene transfer (HGT) are both present?

  32. Theoretical questions • Are any summary methods statistically consistent with bounded numbers of sites per locus (and no strict molecular clock)? • What is the sample complexity for the different summary methods (assuming true gene trees)? • Is SVDquartets statistically consistent? • Is the underlying species tree identifiable when ILS and horizontal gene transfer (HGT) are both present?

  33. Theoretical questions • Are any summary methods statistically consistent with bounded numbers of sites per locus (and no strict molecular clock)? • What is the sample complexity for the different summary methods (assuming true gene trees)? • Is SVDquartets statistically consistent? • Is the underlying species tree identifiable when ILS and horizontal gene transfer (HGT) are both present?

  34. Indo-European Tree95% of the characters are compatible

  35. “Perfect Phylogenetic Network”(all characters compatible) L. Nakhleh, D. Ringe, and T. Warnow, LANGUAGE, 2005

  36. Language Evolution • Warnow, Evans, Ringe, and Nakhleh model of language evolution (Phylogenetic Methods and the Prehistory of Languages, MacDonald Institute Press, 2006) • Three types of characters: lexical, phonological, and morphological • Limited homoplasy for morphological and phonological characters enables identifiability in the presence of unlimited character evolution heterogeneity down a tree.

  37. Open Questions for Phylogenetics of Languages • What is the sample complexity of statistically consistent methods under language evolution models? • Are phylogenetic networks identifiable under language evolution models?

  38. Open Questions for Phylogenetics of Languages • What is the sample complexity of statistically consistent methods under language evolution models? • Are phylogenetic networks identifiable under language evolution models?

  39. Phylogenetic Inference • NP-hard optimization problems and large datasets • Statistical estimation under stochastic models of evolution • Probabilistic analysis of algorithms • Graph-theoretic divide-and-conquer • Chordal graph theory • Combinatorial optimization

  40. Acknowledgments Roch and Warnow, Systematic Biology 2014 Mirarab and Warnow, Bioinformatics 2015 Molloy and Warnow, Systematic Biology 2017 Warnow, Evans, Ringe, and Nakhleh, Phylogenetic Methods and the Prehistory of Languages, MacDonald Institute Press, 2006 Papers available at http://tandy.cs.illinois.edu/papers.html Funding: NSF, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), Grainger Foundation, and HHMI (to SM).

  41. Overall Themes • Heterogeneity across the genome is evident, at multiple levels (different trees, and different GTR model parameters even for the same tree), yet models of evolution constrain heterogeneity to enable identifiability. • Gene tree estimation error is common, typically due to reduced phylogenetic signal. • Proofs of statistical consistency assume error-free gene trees and/or constrain heterogeneity.

  42. Overall Themes • Heterogeneity across the genome is evident, at multiple levels (different trees, and different GTR model parameters even for the same tree), yet models of evolution constrain heterogeneity to enable identifiability. • Gene tree estimation error is common, typically due to reduced phylogenetic signal. • Proofs of statistical consistency assume error-free gene trees and/or constrain heterogeneity.

  43. Overall Themes • Heterogeneity across the genome is evident, at multiple levels (different trees, and different GTR model parameters even for the same tree), yet models of evolution constrain heterogeneity to enable identifiability. • Gene tree estimation error is common, typically due to reduced phylogenetic signal. • Proofs of statistical consistency assume error-free gene trees and/or constrain heterogeneity.

  44. Future Directions • Better coalescent-based summary methods (that are more robust to gene tree estimation error) • Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees • Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods • Better “single site” methods

  45. Future Directions • Better coalescent-based summary methods (that are more robust to gene tree estimation error) • Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees • Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods • Better “single site” methods

  46. Future Directions • Better coalescent-based summary methods (that are more robust to gene tree estimation error) • Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees • Better theory about robustness to gene tree estimation error (or lack thereof) for coalescent-based summary methods • Better “single site” methods

  47. Future Directions • Better coalescent-based summary methods (that are more robust to gene tree estimation error) • Better techniques for estimating gene trees given multi-locus data, or for co-estimating gene trees and species trees • Better “single site” methods

More Related