Processing & Testing Phylogenetic Trees

Rooting

1. Outgroup Rooting: Based on external information.
2. Midpoint Rooting: Direct a posteriori use of the ultrametricity assumption.
3. Largest-Genetic-Variability-Group Rooting: Indirect a posteriori use of the ultrametricity assumption.

[Figure: Rooting with an outgroup. An unrooted tree of animal and plant sequences is rooted with a bacterial outgroup, giving a rooted tree in which the animals and the plants each form a monophyletic group.]

Midpoint rooting
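As a rough illustration, midpoint rooting can be sketched as placing the root halfway along the longest tip-to-tip path. The edge-list representation, the function name midpoint_root and the example branch lengths are assumptions made for this sketch; they do not come from the slides.

```python
from collections import defaultdict


def midpoint_root(edges):
    """Return (u, v, offset): the root lies on edge u-v, `offset` away from u.

    `edges` is a list of (node, node, branch_length) tuples describing an
    unrooted tree (hypothetical input format chosen for this sketch).
    """
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    tips = [n for n in adj if len(adj[n]) == 1]

    def paths_from(start):
        # Walk the tree, recording distance from `start` and each predecessor.
        dist, prev, stack = {start: 0.0}, {start: None}, [start]
        while stack:
            node = stack.pop()
            for nbr, length in adj[node].items():
                if nbr not in dist:
                    dist[nbr] = dist[node] + length
                    prev[nbr] = node
                    stack.append(nbr)
        return dist, prev

    # Longest tip-to-tip path (brute force over tips; fine for small trees).
    best = None
    for a in tips:
        dist, prev = paths_from(a)
        b = max(tips, key=lambda t: dist[t])
        if best is None or dist[b] > best[0]:
            best = (dist[b], dist, prev, b)
    total, dist, prev, node = best

    # Walk back from the far tip until the midpoint of the path is crossed.
    half = total / 2.0
    while dist[prev[node]] > half:
        node = prev[node]
    parent = prev[node]
    return parent, node, half - dist[parent]


# Toy unrooted tree: tips A-D, internal nodes X and Y.
edges = [("A", "X", 1.0), ("B", "X", 3.0), ("X", "Y", 2.0),
         ("C", "Y", 1.0), ("D", "Y", 6.0)]
root_edge = midpoint_root(edges)
```

In this toy tree the longest path runs from B to D (length 11), so the root falls on the branch leading to D, 0.5 units from node Y. If evolutionary rates differ strongly among lineages, the ultrametricity assumption fails and the midpoint need not coincide with the true root.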

Largest variation = Most ancient

Estimating Branch Length
From pairwise distances to branch lengths: maximum likelihood, least squares, etc.

Estimating Divergence Times

Topological comparisons

Penny and Hendy's topological distance (dT)
A commonly used measure of dissimilarity between two tree topologies. The measure is based on tree partitioning.
dT = 2c
where c is the number of partitions resulting in different divisions of the OTUs in the two tree topologies under consideration.
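A minimal sketch of the calculation, assuming trees are written as nested tuples of OTU names and taking c to be the number of partitions found in one tree but not in the other. The function names are hypothetical. For two fully bifurcating trees on the same OTUs, the number of partitions unique to either tree equals 2c, which is what the sketch counts.

```python
def partitions(tree):
    """Non-trivial partitions (splits) of a tree given as nested tuples."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    taxa = leaves(tree)
    ref = min(taxa)          # reference OTU used to name each split uniquely
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if ref in side:      # always store the side that excludes `ref`
            side = taxa - side
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def topological_distance(tree_a, tree_b):
    """Count partitions present in exactly one of the two trees."""
    return len(partitions(tree_a) ^ partitions(tree_b))


tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
d_t = topological_distance(tree_1, tree_2)
```

The two toy trees share the partition separating D and E from the rest and differ in one partition each, so c = 1 and the distance is 2.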

Trees inferred from the analysis of a particular data set are called fundamental trees, i.e., they summarize the phylogenetic information in a data set.

Sometimes we have many fundamental trees pertaining to the same question. For example, we may have trees derived from different genes for the same taxa, trees derived through different methods, or trees from different runs in a simulation. In these cases we need to be able to summarize the data.

Consensus trees are trees that summarize the phylogenetic information in a set of fundamental trees.

In a strict consensus tree, all conflicting branching patterns are collapsed into multifurcations. In an X% majority-rule consensus tree, a branching pattern that occurs with a frequency of X% or more is adopted. When X = 100%, the majority-rule consensus tree is identical to the strict consensus tree.
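The selection rule can be sketched as follows, assuming each fundamental tree has already been reduced to the set of groups (branching patterns) it contains. The function name consensus_groups and the toy input are hypothetical.

```python
from collections import Counter


def consensus_groups(trees, threshold):
    """Return the groups found in at least `threshold` (a fraction) of the trees.

    Each tree is a set of frozensets of taxon names. A threshold of 1.0
    gives the strict consensus. Thresholds above 0.5 guarantee that the
    retained groups are mutually compatible; at exactly 0.5 two conflicting
    groups could both pass.
    """
    counts = Counter(group for tree in trees for group in set(tree))
    n_trees = len(trees)
    return {g for g, c in counts.items() if c / n_trees >= threshold}


g = frozenset
fundamental_trees = [
    {g("AB"), g("CD")},
    {g("AB"), g("CD")},
    {g("AB"), g("CE")},
]
majority = consensus_groups(fundamental_trees, 0.6)
strict = consensus_groups(fundamental_trees, 1.0)
```

Here the group A+B occurs in all three trees and C+D in two of three, so a 60% majority-rule consensus keeps both, while the strict consensus keeps only A+B and leaves C, D and E unresolved.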

A tree is an evolutionary hypothesis

How do we know that the inferred tree is correct?

Joseph H. Camin (1922-1979)

Assessing tree reliability
Phylogenetic reconstruction is a problem of statistical inference. One must assess the reliability of the inferred phylogeny and its component parts.
Questions:
(1) How reliable is the tree?
(2) Which parts of the tree are reliable?
(3) Is this tree significantly better than another one?

Bootstrapping
• A statistical technique that uses intensive random resampling of data to estimate a statistic whose underlying distribution is unknown.

Bootstrapping
• Characters are resampled with replacement to create many bootstrap replicate data sets (pseudosamples).
• Each bootstrap replicate data set is analyzed.
• The frequency of occurrence of a group (its bootstrap proportion) is a measure of support for the group.
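The three steps above can be sketched as a short resampling loop. The function bootstrap_proportions and the toy grouping rule closest_pair are hypothetical; closest_pair merely stands in for a real tree-building method (parsimony, distance, likelihood) and reports only the single most similar pair of taxa.

```python
import random
from collections import Counter
from itertools import combinations


def bootstrap_proportions(alignment, infer_groups, n_replicates=100, seed=0):
    """Percentage of bootstrap replicates in which each group is recovered.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    infer_groups: callable that takes an alignment and returns an iterable
    of groups (frozensets of taxon names).
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    counts = Counter()
    for _ in range(n_replicates):
        # Resample alignment columns (characters) with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        pseudo = {taxon: "".join(seq[i] for i in cols)
                  for taxon, seq in alignment.items()}
        # Analyze the pseudosample and tally the groups it supports.
        counts.update(set(infer_groups(pseudo)))
    return {group: 100.0 * c / n_replicates for group, c in counts.items()}


def closest_pair(alignment):
    """Toy stand-in for tree inference: the pair with the fewest differences."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    pair = min(combinations(sorted(alignment), 2), key=lambda p: hamming(*p))
    return [frozenset(pair)]


toy_alignment = {"A": "AAAAAAAA", "B": "AAAAAAAA",
                 "C": "CCCCCCCC", "D": "GGGGGGGG"}
support = bootstrap_proportions(toy_alignment, closest_pair, n_replicates=50)
```

Because A and B are identical at every site in this toy alignment, every pseudosample groups them together and the pair receives a bootstrap proportion of 100.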

Bootstrapping - an example
Partition Table: Ciliate SSUrDNA - parsimony bootstrap

123456789    Freq
-----------------
.**......  100.00
...**....  100.00
.....**..  100.00
...****..  100.00
...******   95.50
.......**   84.33
...****.*   11.83
...*****.    3.83
.*******.    2.50
.**....*.    1.00
.**.....*    1.00

Taxa: (1) Ochromonas, (2) Symbiodinium, (3) Prorocentrum, (4) Loxodes, (5) Tracheloraphis, (6) Spirostomum, (7) Gruberia, (8) Euplotes, (9) Tetrahymena

[Figure: bootstrap consensus tree of the nine taxa, with bootstrap proportions of 100, 84, 96, 100, 100 and 100 on its internal branches.]

Reduction of a phylogenetic tree by the collapsing of internal branches associated with bootstrap values that are lower than a critical value (C). (a) Gene tree for α-tubulin. (b) C = 50%. (c) C = 90%.
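A small sketch of this reduction, applied to the partition table of the ciliate example rather than to the tubulin tree in the figure. The function name collapse is hypothetical; each partition is written in the same dot-and-asterisk notation as the table.

```python
def collapse(support, critical):
    """Keep only the partitions whose bootstrap value is at least `critical`."""
    return {part: freq for part, freq in support.items() if freq >= critical}


# Bootstrap frequencies (percent) from the ciliate SSUrDNA partition table.
support = {
    ".**......": 100.00,
    "...**....": 100.00,
    ".....**..": 100.00,
    "...****..": 100.00,
    "...******": 95.50,
    ".......**": 84.33,
    "...****.*": 11.83,
    "...*****.": 3.83,
    ".*******.": 2.50,
    ".**....*.": 1.00,
    ".**.....*": 1.00,
}

kept_50 = collapse(support, 50)
kept_90 = collapse(support, 90)
```

With C = 50% six partitions survive. Raising C to 90% also collapses the branch uniting taxa 8 and 9 (84.33%), leaving five.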

Tests for two competing trees
• All these tests use the null hypothesis that the differences between two trees (A and B) are no greater than expected from sampling error.

[Figure: distribution of the differences at each site, centered on 0; values on one side favor tree A, values on the other favor tree B.]
Under the null hypothesis, the mean of the differences in parsimony steps at each site is expected to be zero.

Tests for two competing trees
A parametric test for comparing two trees under the assumption that all nucleotide sites are independent and equivalent.
Di = difference in the minimum number of substitutions between the two trees at the ith informative site
D = ΣDi
n = number of informative sites
V(D) = sample variance of D

The null hypothesis, D = 0, is tested with Student's paired t-test with n − 1 degrees of freedom.
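The test statistic itself is not reproduced in the text. The sketch below uses the standard paired t statistic and assumes that V(D) is the variance of the sum D, estimated as n/(n − 1) times the sum of squared deviations of the Di from their mean. Under that assumption t = D / sqrt(V(D)), which is the same as the mean difference divided by its standard error. The function name and the example differences are hypothetical.

```python
import math


def paired_site_test(differences):
    """Return (D, V(D), t, degrees of freedom) for per-site differences Di.

    Assumes at least two informative sites and that the Di are not all
    equal (otherwise V(D) is zero and t is undefined).
    """
    n = len(differences)
    d_total = sum(differences)                    # D = sum of Di
    mean = d_total / n
    # Sample variance of a single Di, then scaled to the variance of the sum.
    s2 = sum((d - mean) ** 2 for d in differences) / (n - 1)
    var_d = n * s2                                # assumed meaning of V(D)
    t = d_total / math.sqrt(var_d)
    return d_total, var_d, t, n - 1


# Hypothetical differences in parsimony steps at eight informative sites.
site_differences = [1, 0, 1, -1, 2, 0, 1, 0]
D, VD, t_stat, dof = paired_site_test(site_differences)
```

The resulting t is then compared with the critical value of Student's t distribution for n − 1 degrees of freedom.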

Likelihood Ratio Test
• Likelihood of Hypothesis 1 = L1
• Likelihood of Hypothesis 2 = L2
• Δ = 2(ln L1 − ln L2)
• Compare Δ to a χ² distribution or to a simulated distribution.
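A minimal sketch of the comparison with a chi-square distribution. It assumes the two hypotheses are nested and that the degrees of freedom equal the difference in the number of free parameters; neither assumption is stated in the slides. The function names and log-likelihood values are hypothetical, and the upper-tail probability is computed from the closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= x / (2 * k)
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail term plus a finite series in chi = sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total


def likelihood_ratio_test(ln_l1, ln_l2, df):
    """Return the statistic 2(ln L1 - ln L2) and its chi-square P value."""
    statistic = 2.0 * (ln_l1 - ln_l2)
    return statistic, chi2_upper_tail(statistic, df)


# Hypothetical log-likelihoods; hypothesis 1 has two extra free parameters.
statistic, p_value = likelihood_ratio_test(-1000.0, -1003.0, df=2)
```

When the chi-square approximation is doubtful, for example when comparing tree topologies rather than nested models, the statistic is instead compared with a distribution obtained by simulation.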

Reliability of Phylogenetic Methods
• Phylogenetic methods can also be evaluated in terms of their general performance, particularly their:
• consistency - approach the truth with more data
• efficiency - how quickly can they handle how much data
• robustness - how sensitive to violations of assumptions
• Studies of these properties can be analytical or by simulation.

Problems with long branches
With long branches, most methods may yield erroneous trees. For example, the maximum-parsimony method tends to cluster long branches together. This phenomenon is called long-branch attraction, and the range of branch lengths in which it occurs is known as the Felsenstein zone.

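[Figure: four-taxon trees for A, B, C and D, with two long branches of length p and three short branches of length q, where p >> q. Panels: TRUE TREE, WRONG TREE.]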

[Figure: Chaperonin maximum likelihood tree (Roger et al. 1998. PNAS 95: 229), with the longest branches marked.]

Trees: Pectinate (a) versus Symmetrical (b)

Recommendations

Avoid the “Black Box”
• Researchers invest considerable resources in producing molecular sequence data.
• They should also invest the time and effort needed to get the most out of their data.
• Modern phylogenetic software makes it easy to produce trees from aligned sequences, but phylogenetic inference should not be treated as a “black box.”

Choices are Unavoidable
• There are many phylogenetic methods.
• Thus, the investigator is confronted with unavoidable choices.
• Not all methods are equally good for all data.
• An understanding of the basic properties of the various phylogenetic methods is essential for informed choice of method and interpretation of results.

Data are not Perfect
• Most data include misleading evidence, so we need a cautious attitude toward the quality of both data and trees.
• Data may have both systematic biases and unbiased noise that affect our chances of getting the correct tree.
• Different methods may be more or less sensitive to some problems.

Alignment
• The data determine the results.
• The alignment determines the data.
• Be aware of alignment artefacts.
• If using multiple alignment software, explore the sensitivity of the alignment to the parameters used.
• Eliminate regions that cannot be aligned with confidence.

Models
• The data should fit the assumptions of the model.
• Explore the data for potential biases and deviations from the assumptions of the model.

Choice of Models
• Complex models may better approximate the evolution of the sequences and, therefore, might be expected to give more accurate results.
• More complex models require the estimation of more parameters, each of which is subject to some error.
• There is a trade-off between more realistic and complex models and their power to discriminate between alternative hypotheses.

Not all methods are good for all problems.
