New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois
Phylogeny (evolutionary tree) • Taxa shown: Orangutan, Human, Gorilla, Chimpanzee • From the Tree of Life Website, University of Arizona
Sampling multiple genes from multiple species • Taxa shown: Orangutan, Human, Gorilla, Chimpanzee • From the Tree of Life Website, University of Arizona
Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity
Gene trees inside the species tree (Coalescent Process) • [Figure, courtesy James Degnan: a gene tree drawn inside the species tree, with time running from past to present] • Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.
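A minimal sketch of how ILS produces discordant gene trees, using the standard three-taxon result under the multi-species coalescent rather than the four-taxon figure above. For a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units, the matching gene tree topology has probability 1 - (2/3)e^(-t) and each of the two discordant topologies has probability (1/3)e^(-t). The function name and taxon labels are illustrative assumptions, not from the talk.

```python
import math


def rooted_triplet_probabilities(t):
    """Gene tree topology probabilities under the MSC for species tree ((A,B),C).

    t is the internal branch length in coalescent units (illustrative
    parameter name). Returns a dict mapping each rooted topology to its
    probability.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # The two lineages from A and B fail to coalesce on the internal branch
    # with probability exp(-t); the three topologies are then equally likely.
    p_discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * p_discordant,
        "((A,C),B)": p_discordant,
        "((B,C),A)": p_discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 0.5, 2.0):
        print(t, rooted_triplet_probabilities(t))
```

Shorter internal branches push all three probabilities toward 1/3, which is the high-ILS regime discussed in the rest of the talk.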
1KP: Thousand Transcriptome Project • T. Warnow (UT-Austin), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant) • 103 plant transcriptomes, 400-800 single copy “genes” • Next phase will be much bigger • Wickett, Mirarab et al., PNAS 2014 • Challenge: • Massive gene tree heterogeneity consistent with ILS
Avian Phylogenomics Project • MTP Gilbert (Copenhagen), T. Warnow (UT-Austin), G. Zhang (BGI), E. Jarvis (HHMI), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), plus many other people • Approx. 50 species, whole genomes, 14,000 loci • Jarvis, Mirarab, et al., Science 2014 • Major challenge: • Massive gene tree heterogeneity consistent with ILS
This talk • Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC) • Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error • ASTRAL (Bioinformatics 2014, 2015): coalescent-based species tree estimation method that has high accuracy on large datasets (1000 species and genes) • “Statistical binning” (Science 2014) – improving gene tree estimation, and hence species tree estimation • Open questions
This talk • Gene tree heterogeneity due to incomplete lineage sorting, modelled by the multi-species coalescent (MSC) • Statistically consistent estimation of species trees under the MSC, and the impact of gene tree estimation error • ASTRAL (Bioinformatics 2014, 2015): coalescent-based species tree estimation method that has high accuracy on large datasets (1000 species and genes) • “Statistical binning” (Science 2014) – improving gene tree estimation, and hence species tree estimation • Open questions • Controversial!
Incomplete Lineage Sorting (ILS) • Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc. • There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.
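Statistical Consistency • [Figure: error plotted against the amount of data]
Two competing approaches • Input: alignments for gene 1, gene 2, ..., gene k across the species • Concatenation: combine all genes into one alignment and estimate the species tree from it • Summary method: analyze each gene separately, then combine the resulting gene trees into a species tree
What about summary methods? • Techniques: Most frequent gene tree? Consensus of gene trees? Other?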
Statistically consistent under ILS? • Coalescent-based summary methods: • MP-EST (Liu et al. 2010): maximum pseudo-likelihood estimation of rooted species tree based on rooted triplet tree distribution – YES • NJst (Liu and Yu, 2011) - YES • And others, including some newer methods (BUCKy-pop, ASTRAL, ASTRID, etc.) - YES • Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees – YES • Single-site methods (SVDquartets, METAL, SNAPP, and others)
1KP: Thousand Transcriptome Project • 103 plant transcriptomes, 400-800 single copy “genes” • Next phase will be much bigger • Wickett, Mirarab et al., PNAS 2014 • Challenges: • Massive gene tree heterogeneity consistent with ILS • Could not use MP-EST due to missing data (many gene trees could not be rooted) and the large number of species
1KP: Thousand Transcriptome Project • 103 plant transcriptomes, 400-800 single copy “genes” • Next phase will be much bigger • Wickett, Mirarab et al., PNAS 2014 • Solution: • New coalescent-based method ASTRAL • ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.
ASTRAL and ASTRAL-2 • Estimates the species tree from gene trees by finding the species tree that has the maximum quartet support, using dynamic programming • Theorem: ASTRAL is statistically consistent under the MSC, even when solved in constrained mode (drawing bipartitions from the input gene trees) • The constrained version of ASTRAL runs in polynomial time • Open source software at https://github.com/smirarab • Published in ECCB/Bioinformatics 2014 (Mirarab et al.) and ISMB/Bioinformatics 2015 (Mirarab and Warnow) • Used in Wickett, Mirarab et al. (PNAS 2014) and Prum, Berv et al. (Nature 2015) (and in many other papers)
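The quartet support criterion can be illustrated with a brute-force sketch. This is not ASTRAL's dynamic programming algorithm: it simply counts, for one candidate species tree, how many quartet topologies induced by the gene trees agree with the candidate. Trees are written as nested tuples of taxon names, and all function names here are illustrative assumptions.

```python
from itertools import combinations


def leaf_sets(tree):
    """Return (leaves, clades) for a tree given as nested tuples of names.

    clades lists the leaf set below every node; each clade and its
    complement define one bipartition of the unrooted tree.
    """
    if isinstance(tree, str):
        single = frozenset([tree])
        return single, [single]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def quartet_topology(leaves, clades, quartet):
    """Return the 2|2 split the tree induces on quartet, or None.

    None means the quartet is unresolved or has a taxon missing from the tree.
    """
    q = frozenset(quartet)
    if not q <= leaves:
        return None
    for clade in clades:
        side = clade & q
        if len(side) == 2:
            return frozenset([side, q - side])
    return None


def quartet_support(candidate, gene_trees):
    """Count (gene tree, quartet) pairs whose topology matches the candidate."""
    cand_leaves, cand_clades = leaf_sets(candidate)
    genes = [leaf_sets(g) for g in gene_trees]
    score = 0
    for quartet in combinations(sorted(cand_leaves), 4):
        topo = quartet_topology(cand_leaves, cand_clades, quartet)
        if topo is None:
            continue
        for g_leaves, g_clades in genes:
            if quartet_topology(g_leaves, g_clades, quartet) == topo:
                score += 1
    return score


if __name__ == "__main__":
    genes = [
        (("Human", "Chimp"), ("Gorilla", "Orangutan")),
        ((("Chimp", "Human"), "Gorilla"), "Orangutan"),
        (("Human", "Gorilla"), ("Chimp", "Orangutan")),
    ]
    print(quartet_support(((("Human", "Chimp"), "Gorilla"), "Orangutan"), genes))
    print(quartet_support((("Human", "Gorilla"), ("Chimp", "Orangutan")), genes))
```

Enumerating all quartets costs time proportional to n^4 per gene tree, so this sketch is only practical for small examples; it is meant to show the optimization criterion, not a scalable method.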
Avian Phylogenomics Project • Approx. 50 species, whole genomes, 14,000 loci • Jarvis, Mirarab, et al., Science 2014 • Major challenge: • Massive gene tree heterogeneity consistent with incomplete lineage sorting • Very poor resolution in the 14,000 gene trees (average bootstrap support 25%) • Standard coalescent-based species tree estimation methods contradicted concatenation analysis and prior studies
Statistical Consistency for summary methods • [Figure: error plotted against the amount of data] • Data are gene trees, presumed to be randomly sampled true gene trees.
TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees • Summary methods combine estimated gene trees, not true gene trees. • Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. • Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. • Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.
Gene tree estimation error: key issue in the debate • Summary methods combine estimated gene trees, not true gene trees. • Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. • Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. • Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.
Avian Phylogenomics Project • Approx. 50 species, whole genomes, 14,000 loci • Published Science 2014 • Most gene trees had very low bootstrap support, suggestive of gene tree estimation error
Avian Phylogenomics Project • Approx. 50 species, whole genomes, 14,000 loci • Solution: Statistical Binning • Improves coalescent-based species tree estimation by improving gene trees (Mirarab, Bayzid, Boussau, and Warnow, Science 2014) • Avian species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)
Ideas behind statistical binning • “Gene tree” error tends to decrease with the number of sites in the alignment • Concatenation (even if not statistically consistent) tends to be reasonably accurate when there is not too much gene tree heterogeneity • [Figure: x-axis is the number of sites in an alignment]
Note: Supergene trees are computed using fully partitioned maximum likelihood. Vertex coloring of a graph with balanced color classes is NP-hard; we used a heuristic.
Statistical binning vs. unbinned • Datasets: 11-taxon strongILS datasets with 50 genes from Chung and Ané, Systematic Biology • Binning produces bins with approximately 5 to 7 genes each
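A minimal sketch of one possible heuristic for the balanced coloring step, assuming genes are vertices and an edge joins two genes whose trees should not share a bin. This is a generic greedy scheme written for illustration; the heuristic used in the published pipeline may differ, and the function and parameter names are hypothetical.

```python
def balanced_greedy_coloring(num_vertices, edges):
    """Greedily partition vertices 0..num_vertices-1 into conflict-free bins.

    edges is an iterable of (u, v) pairs marking genes that must not share
    a bin. Each vertex goes into the smallest existing bin that contains
    none of its neighbors, which keeps bin sizes roughly balanced; a new bin
    is opened only when no existing bin is compatible.
    """
    neighbors = {v: set() for v in range(num_vertices)}
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    bins = []
    # Place the most constrained (highest degree) vertices first.
    order = sorted(neighbors, key=lambda v: len(neighbors[v]), reverse=True)
    for v in order:
        compatible = [b for b in bins if not (b & neighbors[v])]
        if compatible:
            min(compatible, key=len).add(v)
        else:
            bins.append({v})
    return bins


if __name__ == "__main__":
    # Genes 0, 1, 2 conflict pairwise; genes 3, 4, 5 conflict with nothing.
    print(balanced_greedy_coloring(6, [(0, 1), (1, 2), (2, 0)]))
```

Each resulting bin is an independent set in the conflict graph, so the genes in a bin can be concatenated into one supergene alignment.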
Theorem 3 (PLOS One, Bayzid et al. 2015): Unweighted statistical binning pipelines are not statistically consistent under GTR+MSC. As the number of sites per locus increases: • All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014) • For each bin, with probability converging to 1, the genes in the bin have the same tree topology (but can have different numeric parameters), and there is only one bin for any given tree topology • For each bin, a fully partitioned maximum likelihood (ML) analysis of its supergene alignment converges to a tree with the common gene tree topology. As the number of loci increases: • Every gene tree topology appears with probability converging to 1. Hence, as both the number of loci and the number of sites per locus increase, with probability converging to 1, every gene tree topology appears exactly once in the set of supergene trees. It is impossible to infer the species tree from the flat distribution of gene trees!
Fig 1. Pipeline for unbinned analyses, unweighted statistical binning, and weighted statistical binning. Bayzid MS, Mirarab S, Boussau B, Warnow T (2015) Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses. PLoS ONE 10(6): e0129183. doi:10.1371/journal.pone.0129183
Theorem 2 (PLOS One, Bayzid et al. 2015): WSB pipelines are statistically consistent under GTR+MSC. Easy proof: As the number of sites per locus increases: • All estimated gene trees converge to the true gene tree and have bootstrap support that converges to 1 (Steel 2014) • For every bin, with probability converging to 1, the genes in the bin have the same tree topology • Fully partitioned GTR ML analysis of each bin converges to a tree with the common topology of the genes in the bin. Hence, as the number of sites per locus and the number of loci both increase, WSB followed by a statistically consistent summary method will converge in probability to the true species tree. Q.E.D.
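A minimal sketch of the weighting step that separates the two theorems, assuming (as in the weighted pipeline of Bayzid et al. 2015) that each supergene tree is passed to the summary method once per gene in its bin. This keeps the tree distribution proportional to gene counts instead of flattening it to one tree per bin. The function name and the toy Newick strings are illustrative.

```python
from collections import Counter


def weighted_supergene_trees(bins, supergene_trees):
    """Replicate each supergene tree once for every gene in its bin.

    bins is a list of gene-name lists; supergene_trees[i] is the tree
    estimated from the concatenated alignment of bins[i].
    """
    if len(bins) != len(supergene_trees):
        raise ValueError("need exactly one supergene tree per bin")
    weighted = []
    for genes, tree in zip(bins, supergene_trees):
        weighted.extend([tree] * len(genes))
    return weighted


if __name__ == "__main__":
    bins = [["g1", "g2", "g3"], ["g4"]]
    trees = ["((A,B),C);", "((A,C),B);"]
    # Unweighted binning would hand the summary method one copy of each tree.
    print(Counter(trees))
    # Weighted binning restores the 3:1 ratio seen among the original genes.
    print(Counter(weighted_supergene_trees(bins, trees)))
```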
Table 1. Model trees used in the Weighted Statistical Binning study. We show number of taxa, species tree branch length (relative to base model), and average topological discordance between true gene trees and true species tree. Columns: Dataset | Species tree branch length scaling | Average Discordance (%). doi:10.1371/journal.pone.0129183.t001
Binning can improve species tree topology estimation Species tree estimation error for MP-EST and ASTRAL, and also concatenation using ML, on avian simulated datasets: 48 taxa, moderately high ILS (AD=47%), 1000 genes, and varying gene sequence length. Bayzid et al., (2015). PLoS ONE 10(6): e0129183
Binning can reduce incidence of high support false positive edges Cumulative distribution of the bootstrap support values of true positive (left) and false positive (right) edges. If a curve for method X is above the curve for method Y, then X has higher BS for true positives and lower BS for false positives. Values in the shaded area indicate false positive branches with support at 75% or higher. Results are shown for 1000 genes with 500bp, on the avian simulated datasets. Bayzid et al., (2015). PLoS ONE 10(6): e0129183
Weighted Statistical Binning: empirical WSB generally benign to highly beneficial for moderate to large datasets: • Improves gene tree estimation • Improves species tree topology • Improves species tree branch length • Reduces incidence of highly supported false positive branches
Weighted Statistical Binning: empirical However, WSB can reduce accuracy under some conditions. Current simulations have only established this for model conditions that simultaneously have: • Very small numbers of species (at most 10) • Very high ILS (AD > 80%) • Low bootstrap support for gene trees Most likely there are other conditions as well.
Species tree estimation error for MP-EST and ASTRAL on 10-taxon datasets • Simphy Model Tree • 200 genes with 100bp (GTRGAMMA) • 10 replicates per condition • Notes: • Moderate ILS: binning neutral or beneficial using BS=50% • Very high ILS: binning neutral for BS=50%, but increases MP-EST error with BS=75% AD=40% AD=84% Bayzid MS, Mirarab S, Boussau B, Warnow T (2015). PLoS ONE 10(6): e0129183
Liu and Edwards, Comment in Science, October 2015 • Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus: • The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML • Simulation study • 5-taxon, strict molecular clock, very high ILS (AD=82%) • Performed WSB using unpartitioned ML instead of fully partitioned ML • Erroneous (ectopic) data in supergene alignments, biasing against WSB • Our re-analysis of their data produced better results than they reported, but WSB did reduce accuracy on their data.
Liu and Edwards, Comment in Science, October 2015 • Attempted proof that WSB pipelines are statistically inconsistent for bounded number of sites per locus: • The proof fails for multiple reasons, including the use of unpartitioned ML instead of fully partitioned ML • Simulation study • 5-taxon, strict molecular clock, very high ILS (AD=82%) • Our re-analysis of their data produced better results for statistical binning (both weighted and unweighted) than they reported. • They performed WSB using unpartitioned ML instead of fully partitioned ML (biasing against statistical binning). • They had erroneous (ectopic) data in their supergene alignments, biasing against statistical binning. • [Figure of model tree from L&E, Science 9 October 2015: 171] • This model tree fits into the category of conditions described in Bayzid et al. PLOS One 2015, in which WSB reduced accuracy (very small numbers of taxa, very high ILS).