Phylogenetic supertrees: seeing the data for the trees

Olaf R. P. Bininda-Emonds, Technische Universität München

Outline • the fundamental issue: characters versus trees • open questions: are trees data? (loss of contact with primary character data; loss of information; “novel” solutions; data duplication; the nature of supertrees; analytical issues) • conclusions: are supertrees a valid phylogenetic technique?

The fundamental issues

The basic distinction • Conventional studies: source data are measurable attributes of an organism; the basic unit is the character, which can be viewed as a putative statement of relationship • Supertrees: source data are phylogenies; the basic unit is a membership criterion / statement of relationship, which at best can be viewed as a proxy for a shared derived character

The fundamental issue • supertrees combine trees, not “real data” • has led to many criticisms of supertree construction • but also lends advantages to the approach

Supertree construction [Figure: source trees with overlapping taxon sets (taxa A to L) are combined into a supertree either directly, by consensus-like techniques, or indirectly, by a coding technique followed by an optimization criterion.]

Supertree methods • Direct: strict consensus supertrees; MinCutSupertree (and variants); semi-strict supertrees; Lanyon (1993); Goloboff and Pol (2002) • Indirect: most matrix representation (MR) supertrees, namely parsimony (MRP and variants), compatibility (MRC), minimum flip supertrees (MRF), and average consensus (MRD); gene tree parsimony

Are trees data?

Open questions • loss of contact with raw (character) data • loss of information • “novel” solutions • data duplication • the nature of supertrees: consensus or phylogenetic hypothesis? • analytical issues

Loss of information [Figure: a binary character matrix for taxa A to D and the tree derived from it.] • a tree is a graphical representation of the “primary signal” in a character-based data set • strength of primary signal can be measured (e.g., bootstrap frequencies) • but information regarding the nature of any conflicting “subsignals” is lost

Potential problems • all trees and clades on them have equal support a priori • prevents “signal enhancement” (sensu de Queiroz et al., 1995) in combined data sets • coherent subsignals in different data partitions, when combined, outweigh conflicting primary signals • “throwing away of information” should cause a supertree analysis to be less accurate than a total evidence one, where primary data are combined

No loss of accuracy • simulation studies indicate loss of information is not detrimental • MRP (and variants) (Bininda-Emonds and Sanderson, 2001) • average consensus (Lapointe and Levasseur, 2001) • both methods perform about on a par with total evidence analyses of primary character data • and show similar behaviour to total evidence analyses

Maximizing contact • weighting according to evidential support in source trees • possible for all MR methods, average consensus, and MinCutSupertree (and gene tree parsimony?) • causes MRP to outperform total evidence analyses of primary character data in simulation (Bininda-Emonds and Sanderson, 2001) • bootstrapping of primary character data • both non-parametric (Moore et al., in prep) and parametric versions (Huelsenbeck et al., in prep)

Non-parametric bootstrapped supertrees [Figure: workflow from the original data to bootstrapped source trees, then to bootstrapped supertrees, and finally to a consensus of supertrees.]
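The resampling step of this workflow can be sketched as follows. This is a generic non-parametric bootstrap of character columns, not the procedure of Moore et al. (which the slides do not describe); the function names and the toy alignment are hypothetical, and tree inference and supertree construction on each replicate are deliberately left out.

```python
import random

def bootstrap_columns(alignment, rng):
    """Resample the columns (characters) of an alignment with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Every taxon keeps the same resampled columns, so characters stay aligned.
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}

def bootstrap_replicates(alignment, n_reps, seed=0):
    """Return n_reps pseudoreplicate alignments of the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_columns(alignment, rng) for _ in range(n_reps)]

# Hypothetical toy data set for one source study.
alignment = {"A": "0011", "B": "0101", "C": "1111", "D": "1110"}
replicates = bootstrap_replicates(alignment, n_reps=3)
```

Each replicate would then be analyzed to give one bootstrapped source tree per data set, and the source trees from the same round combined into one bootstrapped supertree.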

Novel clades [Figure: source trees on partially overlapping subsets of taxa A to E are combined (+) into a supertree containing all five taxa.] • all supertree methods have the potential to yield novel statements • relationships between taxa that do not co-exist on any single source tree (sensu Sanderson et al., 1998) • defining characteristic of method

Unsupported clades • some supertree methods have the potential to make statements that are not only novel, but also contradicted (unsupported) by every source tree • violation of a weaker form of co-Pareto property • co-Pareto = relationship of a given kind in the consensus is present in at least one input tree
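A minimal sketch of how unsupported clades might be detected, under one assumed definition: a supertree clade counts as supported by a source tree if, once restricted to that tree's taxa, it is still informative and appears as a clade there. Other definitions are possible; the nested-tuple tree encoding, the function names, and the example trees are all hypothetical.

```python
def clusters(tree):
    """Return (all leaves, leaf sets of every internal node) for a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    return walk(tree), found

def supports(source_tree, clade):
    """True if the clade, restricted to the source tree's taxa, is a clade there."""
    leaves, found = clusters(source_tree)
    restricted = clade & leaves
    # Restrictions with fewer than two taxa, or with every taxon, say nothing.
    informative = 2 <= len(restricted) < len(leaves)
    return informative and restricted in found

def unsupported_clades(supertree, source_trees):
    """List supertree clades (root excluded) that no source tree supports."""
    all_taxa, clades = clusters(supertree)
    return sorted(
        sorted(c) for c in clades
        if c != all_taxa and not any(supports(s, c) for s in source_trees)
    )

source_trees = [(("A", "B"), ("C", "D"))]
supertree = ((("A", "B"), "C"), "D")
result = unsupported_clades(supertree, source_trees)
```

Here the clade A+B is present in the source tree, whereas A+B+C is not, so only the latter is reported.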

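Example from Goloboff and Pol (2002): matrix representation of two source trees on taxa A to F (columns 1 to 4 code the first tree, columns 5 to 8 the second)

     1 2 3 4 5 6 7 8
  A  0 0 0 0 1 1 0 0
  B  1 0 0 0 1 0 0 0
  C  1 1 0 0 1 1 1 1
  D  1 1 1 0 1 1 1 1
  E  1 1 1 1 1 1 1 0
  F  1 1 1 1 0 0 0 0

[Figure: the two source trees, with leaf orders A B C D E F and C D E A B F, combined (+) into a supertree on A B C D E F.]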
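The coding step behind a matrix like this can be sketched as follows. Each internal node of a source tree becomes one column: taxa inside the clade score 1, other taxa on that tree score 0, and taxa absent from the tree score "?". The two pectinate trees below are the ones implied by the columns of the matrix above rather than copied from the original figure, and the nested-tuple encoding and function names are hypothetical.

```python
def clusters(tree):
    """Return (all leaves, clades below the root) for a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    all_leaves = walk(tree)
    found.pop()  # the root clade contains every taxon and carries no information
    return all_leaves, found

def mr_matrix(trees):
    """Code each clade of each source tree as one binary column."""
    taxa = sorted(set().union(*(clusters(t)[0] for t in trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in trees:
        leaves, internal = clusters(tree)
        # Largest clades first, only so the column order matches the slide.
        for clade in sorted(internal, key=len, reverse=True):
            for taxon in taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")  # taxon missing from this source tree
                else:
                    rows[taxon].append("1" if taxon in clade else "0")
    return rows

tree1 = ("A", ("B", ("C", ("D", ("E", "F")))))
tree2 = ("F", ("B", ("A", ("E", ("C", "D")))))
matrix = mr_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, " ".join(row))
```

In MRP the resulting matrix is then analyzed with parsimony; the other MR methods apply a different optimization criterion to the same kind of matrix.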

Comparing supertree methods • MRP (and variants); MRF?; average consensus?; MinCutSupertree (and variants)?; gene tree parsimony? • strict consensus supertrees; semi-strict supertrees; MRC • indirect, optimization-based methods seem more prone to producing unsupported clades

Questions: unsupported clades [Figure: two source trees on taxa A to F (+) and their supertree.] • how should they be treated? • how common are they?

Appropriateness • Conventional studies: unsupported clades (at the level of the resulting trees) arise via signal enhancement and have direct character support in the combined matrix • Supertrees: subsignals are invisible; unsupported clades lack any support among source trees and should therefore be regarded as spurious (Pisani and Wilkinson, 2002); not equivalent to signal enhancement

Incidence of unsupported clades • circumstantial evidence hints that they are rare • only a few reported in the literature • theoretical: Goloboff and Pol (2002); Wilkinson et al. (2001) • empirical: Bininda-Emonds and Bryant (1998); Wilkinson et al. (2001) • estimated that 8 of the 198 clades in the carnivore MRP supertree (~ 4%) had no support among the source trees (Bininda-Emonds et al., 1999) • dinosaur MRP supertree (Pisani et al., 2002) has no unsupported clades

Unsupported clades are very rare • simulation results (MRP only): they occur most often with source trees that are few in number (n ≤ 5), large in size (up to 50 taxa), and identical in their taxon sets (“consensus setting”); “most often” means < 0.21% of all simulated clades; overall incidence was 131 of 282 137 clades (< 0.05%) • empirical results: both the carnivore and lagomorph MRP supertrees have no unsupported clades whatsoever

Data duplication • character data are often recycled between phylogenetic analyses (e.g., total evidence analyses, molecular studies of the same gene) • the same character data may contribute to more than one source tree • such data are then overrepresented in a supertree analysis: data duplication • also violates the assumption of data independence

[Figure: data duplication among cetartiodactyl source trees in the Liu et al. (2001) mammal order MRP supertree; from Gatesy et al. (2002).]

Minimizing duplication • data duplication is a potential problem for all supertree methods • a tree does not directly reveal the source of its underlying data set • but duplication can be minimized or avoided with careful data collection protocols (e.g., the supertrees of Daubin et al. (2001) and Kennedy and Page (2002) lack data duplication)

Is data duplication unavoidable? • no phylogenies are independent given a single Tree of Life • all characters and data sources have been subject to the same evolutionary processes and history • want to combine phylogenetic hypotheses that can reasonably be viewed as being independent

Is the problem overrated? • supertrees combine phylogenetic hypotheses • emergent property composed of more than their raw character data • manipulation of data (weighting, alignment, recoding) • method and assumptions of analysis • for example: • strongly conflicting molecular phylogenies for whales can be explained largely by the choice of outgroup (Messenger and McGuire, 1998) • alignment and weighting of primary data also important

Is data duplication overrated? • data duplication is often only partial • most combined data sets represent unique combinations of individual data sets • easy to deal with data sets that are supersets of others • signal enhancement means that each unique combination could justifiably be viewed as an independent hypothesis • also independent from constituent data sets

Are supertrees unfairly singled out? • data duplication also exists in conventional studies (but less obviously so and to a lesser known extent): • morphological: single features often described by multiple characters • molecular: secondary structure (e.g., stems in tRNA, protein folding) and codon structure mean primary mutations may require secondary compensatory ones • total evidence: mixing of phenotypic and genotypic data must represent data duplication at some level

The nature of supertrees • is the supertree itself a legitimate phylogenetic hypothesis? • many would say “no”, arguing instead that they are a: • form of consensus • historical summary of systematic effort • therefore, supertrees should not be used to answer biological questions

Supertrees as consensus • association derives from: • similar methodology (combining trees rather than data) • both containing polytomies • resulting topologies may be suboptimal given underlying data • why are consensus trees not valid phylogenetic hypotheses? • especially if polytomies viewed as soft rather than hard

Dealing with incongruence • all supertree methods must somehow deal with incongruence among source trees • ignore it: strict consensus, semi-strict, MinCutSupertree, MRC • “fix” it: MRF • explain it biologically: gene tree parsimony • optimize it: average consensus and MRP

Incongruence as homoplasy • a repeated criticism of MRP is that inferred homoplasy on supertree has no biological meaning • convergence and reversals meaningless with respect to a membership criterion • but why is MRP singled out? • similar arguments should apply at least to average consensus

Parsimony and parsimony • Principle of parsimony: a criterion for deciding among scientific theories or explanations; “Plurality should not be posited without necessity”, i.e., choose the simplest explanation of a phenomenon • Cladistic parsimony: a specific application of the principle of parsimony; prefer the tree with the fewest evolutionary steps (i.e., character state changes); additional changes over the minimum number represent homoplasy
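Counting steps under cladistic parsimony can be illustrated with a minimal sketch of Fitch's algorithm for a single unordered character on a rooted binary tree. The nested-tuple trees and the binary character below (state 1 in taxa C and D only) are hypothetical illustrations; steps beyond the minimum of one are what the slide calls homoplasy.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree."""
    steps = 0

    def walk(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (walk(child) for child in node)
        common = left & right
        if common:
            return common       # children agree: no change needed at this node
        steps += 1              # children disagree: one change somewhere below
        return left | right

    walk(tree)
    return steps

tree1 = ("A", ("B", ("C", ("D", ("E", "F")))))
tree2 = ("F", ("B", ("A", ("E", ("C", "D")))))
character = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0, "F": 0}

steps_on_tree1 = fitch_steps(tree1, character)
steps_on_tree2 = fitch_steps(tree2, character)
```

The character fits the second tree with a single change but needs one extra step on the first, where C and D do not form a clade. On an MRP supertree, the same extra steps simply record incongruence with a source tree.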

Homoplasy and supertrees • notions of homoplasy, convergence, and reversals have nothing to do with parsimony per se • or really even with cladistic parsimony • post hoc biological interpretation of incongruence • incongruence on an MRP supertree is simply incongruence • idea of homoplasy in this context is epistemologically, not biologically meaningless

Limitations of total evidence • analytical limitations of combined primary data sets also result in a loss of information • data must be compatible • use of a single optimization criterion (usually MP, but ML now also possible) • some data still not analyzable under either framework (e.g., DNA-DNA hybridization, morphometric data) • use of simplistic models of evolution (MP: differential weighting, including ti:tv ratio; ML: same model for every partition) • alignment problems

Advantages to supertrees • no loss of information: all phylogenetic hypotheses can be combined • even those that aren’t based on any data • process amounts to partitioned analyses • each partition can be analyzed according to the most appropriate model of evolution and optimization criterion • can be done in parallel • results then combined with little loss of accuracy • or, hopefully, less than the loss of information that a total evidence analysis entails

A phylogeny of mammals • The “superteam” have complete supertrees for Carnivora, Chiroptera, Insectivora, Lagomorpha, Marsupialia, and Primates: a total of 1923 species (41.5%) • Molecular data: Murphy et al. (2001a), 9779 bp from 18 genes for 64 species; Madsen et al. (2001), 8655 bp from 4 genes for 82 species; Murphy et al. (2001b), 16 397 bp from 22 genes for 44 species (< 1%)

Summary

Whither supertrees? • criticisms of supertree construction have been launched at two levels • at the supertree approach as a whole • at individual supertree methods

Of approaches … [Figure: two source trees on overlapping subsets of taxa A to D (+), with an undetermined (?) supertree.] • supertree problem inherently difficult because of missing data • results in the lack of a single right answer

Of approaches … • trees are data • potential loss of information not detrimental • key is to think in terms of phylogenetic hypotheses • still awaiting a response from the cladistic community …

… and methods • any method will go astray if its assumptions are violated (e.g., parsimony and long-branch attraction, likelihood and wrong model, regression and data non-independence) • for supertrees, key is to try and establish: • what each method’s boundary conditions are • how robust each method is to violations of its assumptions • what the properties of each method are (in relation to our desired objective)
