The bootstrap, consenus-trees, and super-trees

Phylogenetics Workhop, 16-18 August 2006 The bootstrap,consenus-trees, and super-trees Barbara Holland

What is the bootstrap? • Like in many other areas where statistical inference is applied, in phylogenetics it is not just of interest to get a point estimate of the phylogenetic tree. • We would also like some measure of confidence in our point estimate. • Is our tree likely to change if we got more data, or if we had used slightly different data? • How robust is our result to sampling error? • The bootstrap is a useful tool for answering these sorts of questions.

Assessing confidence in trees • In 1985 Felsenstein introduced the idea of the bootstrap to phylogenetics. • For each boostrap sample • Create a new alignment by resampling the columns of the observed alignment • Construct a tree for the ‘bootstrap’ alignment • Can be applied to any method that starts from a sequence alignment, e.g., parsimony, likelihood, clustering methods if the distances are derived from an alignment… • The bootstrap support for each edge is the number of bootstrap trees that edge appears in.

1234567 a ATATAAA b ATTATAA c TAAAATA d TATAAAT 1224567 a ATTTAAA b ATTATAA c TAAAATA d TAAAAAT 1334567 a AAATAAA b ATTATAA c TAAAATA d TTTAAAT 1234567 a ATATAAA b ATTATAA c TAAAATA d TATAAAT 1244567 a ATTTAAA b ATAATAA c TAAAATA d TAAAAAT c a c a a c a b d b c d d b b d c a 0.75 b d

0.01 0.2 a b c d Example where the bootstrap is useful • Simulate data on the four taxon tree below (JC model) • Use sequence lengths of 100, 1000, and 10000

0.05 0.05 0.1 0.1 d d a a b c c b Example where the bootstrap is not so useful • Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences • Use total sequence lengths of 100, 1000, and 10000 55% 45%

Consensus trees • Consensus trees attempt to summarise the information contained in a set of trees, where each tree in the set is on the same taxa. • Some consensus tree methods are specific to rooted trees.

Why are consensus methods required? • Many phylogenetic methods produce a collection of trees rather than a single best tree. • Monte Carlo Markov Chain (MCMC) • Bootstrapping. • Equally parsimonious trees • Sometimes trees for different genes produce a collection of trees.

Terminology: Splits and clades • Each edge in an unrooted tree corresponds to a split or bipartition of the taxa set. • Each edge in a rooted tree corresponds to a clade.

Splits mouse dog turtle cat, dog, mouse, parrot | turtle parrot cat dog, cat | mouse, turtle, parrot cat, dog, mouse | turtle, parrot

Clades dog cat mouse parrot turtle

Strict Consensus • The strict consensus tree contains only those splits/clades that appear in all trees mouse mouse turtle dog dog mouse dog turtle turtle parrot parrot cat cat cat parrot mouse dog turtle parrot cat

Semi-strict • The semi-strict consensus tree also contains those splits/clades that don’t conflict with any of the input trees mouse mouse dog dog turtle turtle parrot cat cat parrot mouse dog turtle cat parrot

Majority-Rule • The majority-rule consensus tree contains only those splits/clades that appear in more than 50% of the input trees dog mouse mouse turtle turtle dog mouse dog turtle parrot cat parrot cat cat parrot turtle dog mouse parrot cat

dog cat mouse parrot turtle Terminolgy: 3-taxon statements • 3-taxon statements are triples of three species that show two species to be more closely related than is the third. • E.g. the tree below displays the 3-taxon statements ((dog,cat),mouse) ((dog,mouse),parrot) ((mouse,parrot),turtle) …and others…

Terminology: Rooted trees, hierarchies, clusters, and partitions Hierarchy of clusters Partitions abcd | ef {a,b,c,d} a | bcd | ef {b,c,d} a | b | cd | ef {e,f} {c,d} a b c d e f {a} {b} {c} {d} {e} {f}

Products of partitions • Given k partitions p1, p2, p3,…, pk of the same set of taxa, the product of these partitions is the partition where a and b are in the same block if and only if the are in the same block for each pi • Example: The product of abc|de and ad|bce is a|bc|d|e

Adams Consensus • Adams consensus method only applies to rooted trees. • It preserves all the 3-taxon statements that are common to all of the input trees. • Recursive algorithm that looks at the product of the maximal partitions of each of the input trees

AdamsTree algorithm (from Bryant 2003) Procedure AdamsTree(T1,…Tk) ifT1 contains only 1 leaf then returnT1 else construct the product of the maximal partitions of the input trees For each block B in the partition do construct AdamsTree(T1|B, …Tk|B) Attach the roots of these trees to a new node v return this tree end

e b c d a f a b c d e f Adams consensus example Maximal partition abcd | ef Maximal partition bcde | af Product of maximal partitions a|bcd|e|f {f} {a} {b,c,d} {e}

Adams consensus example cont. Restrict to b,c,d e b c d a f a b c d e f Maximal partition b | cd Maximal partition b | cd Product of maximal partitions b | cd {f} {a} {b,c,d} {e} {b} {c,d}

Adams consensus example cont. Restrict to c,d e b c d a f a b c d e f Maximal partition c | d Maximal partition c | d {f} {a} {b,c,d} {e} Product of maximal partitions c | d {b} {c,d} {c} {d}

What about an “Adams” like method for unrooted trees? • Instead of triples we would need to consider statements about quartets of taxa. • If a quartet ((a,b),(c,d)) appeared in all the input trees it should be displayed in the output. • Easy enough?

Three requirements (Steel, Dress and Böcker 2000) • Relabelling of the species at the tip of the tree should yeild the same answer relabelled in the appropriate way • The input order of the trees should not matter • A quartet that appears in all the input trees should appear in the output tree

No method can satisfy these 3 requirements • Counter example f a b a b c e f c d d e

Supertree methods • Super-tree methods take a set of trees on overlapping taxa sets and return a tree (or sometimes a ‘fail’ message) • Biological relevance • Not all genes are present in all species • Not all genes are easy to sequence for all species • Assembling the Tree of Life • Computationally impossible to try and build a tree for all taxa • Use a divide and conquer approach • And then use supertree methods to piece the Tree of Life together

Concept: Refinement c c b b d refines d a a e e The trees below are also refinements d e b b c d a a e c

Concept: Restriction T c b d e a f h g The label set X = {a,b,c,d,e,f,g,h} We can restrict T to any subset of the labels X’

Concept: Restriction E.g. The restriction to {a,c,e,g} T c c e b d e a a g f h g Find the subtree and then supress the degree two vertices

Concept: Displaying A tree T (on label set X) displays a tree T’ (on label set X’ subset of X) if Trestricted to the labels X’ is a refinement of T’ E.g. d c c e d b a f displays and d b a e f a f

Concept: Displaying BUT d c c e d b a f Does not display or b d a e f a c

The BUILD algorithm • Polynomial-time algorithm due to Aho et al (1981) • Takes a set of rooted input trees and either outputs a supertree that displays all of the input trees or returns a fail message.

BUILD algorithm • Recursive algorithm, at each step it constructs a graph associated with the triples displayed by the input trees. • Depending on whether this associated graph is connected or disconnected the algorithm either terminates or subdivides the problem. • What is this associated graph?

The associated graph • Nodes of the graph are the complete label set, i.e. all the labels that appear in any of the input trees • Put an edge between two nodes a and b if there is at least one input tree that displays the rooted triple ((a,b),c) for some c. • If this graph is connected stop and report a fail message • Otherwise call the algorithm again once for each connected component, restricting the input to the labels in that component.

BUILD Example (from Semple and Steel) d a b c e c b e a b f d b c d a {a,b,c,f} {d,e} e f

BUILD example continued Subproblem 1: Restrict input to {a,b,c,f} a b c c b a b f b c a {a,b,c,f} {d,e} f {f} {a,b} {c}

BUILD example continued Subproblem 2 and 3 on {d,e}, and {a,b} are trivial so the final tree is {a,b,c,f} {d,e} {f} {e} {a,b} {d} {c} {a} {b} a b c f d e

What if the trees don’t agree? • If the input trees are not compatible BUILD will return a fail message. • It is also of interest to have methods that will return some output even if the input trees cannot all be displayed by a single supertree. • Matrix representation with parsimony (MRP) is one such method…

Matrix Representation with Parsimony (MRP) • Supertree method invented independently by Baum and Ragan (1992). • Recode the input trees as a character matrix where each edge in each input tree defines a character. • Do a parsimony analysis of the resulting character matrix. • Take the strict consensus of the most parsimonious trees.

c d e b f a d c e b g a g d e h MRP example 4 6 2 4 2 8 4 6 2 8 3 5 7 3 1 9 3 5 7 5 1 9 1 123456789 123456789 12345 a 101010100 101010000 ????? b 011010100 011010000 ????? c 000110100 000001000 ????? d 000001100 000110000 01100 e 000000010 000000110 10100 f 000000001 ????????? ????? g ????????? 000000101 00010 h ????????? ????????? 00001

c d e b f a d c e b g a g d e h MRP example 10 most parsimonious trees Strict consensus: e c b f g a d h

The bootstrap, consenus-trees, and super-trees