Phylogenetic analysis • Selecting sequences • Outgroup sequences • Alignment • Choice of method • Example using one method
Three most important choices • Which sequences to include • Outgroup sequences • Alignment
“Outgroup” sequences should be included • The best outgroup sequences are clearly outside the group being studied, but not too far out. • Multiple outgroup sequences should be chosen. • The outgroup sequences are included in the data matrix just like the other sequences. • They will be used to root the tree.
Methods of phylogenetic analysis • Parsimony (Cladistics) • Maximum likelihood • Bayesian • Genetic distance (Neighbor-joining, etc.)
Parsimony (Cladistics) • Willi Hennig. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. • 1966. Phylogenetic systematics. • Evidence comes from characters • Goal: build most parsimonious tree
Finding the most parsimonious tree • Goal: fewest evolutionary steps (the optimality criterion) • Fewest amino acid changes • Fewest base changes • Many tree topologies are tested and the best is chosen. • The tree is unrooted at this stage. • Rooting the tree comes later.
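The step-counting idea can be sketched in a few lines. The example below is a minimal illustration, not the method of any particular program: it assumes Fitch's algorithm for counting changes, a made-up four-taxon DNA alignment, and trees written as nested tuples with an arbitrary root (the count does not depend on where the root is placed). All names (`fitch_steps`, `parsimony_length`, `candidate_trees`) are hypothetical.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    `tree` is either a leaf name (str) or a (left, right) tuple.
    `states` maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra change needed.
        return common, left_steps + right_steps
    # No shared state: one change must have happened here.
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch step counts over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_steps(tree, column)[1]
    return total


# Toy data (an assumption for illustration only).
alignment = {"A": "AAGT", "B": "AAGC", "C": "ACTC", "D": "ACTT"}

# The three possible unrooted topologies for four taxa.
candidate_trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_length(t, alignment) for name, t in candidate_trees.items()}
best = min(scores, key=scores.get)
print(scores, "-> most parsimonious:", best)
```

With only four taxa every topology can be scored. For realistic data sets the number of topologies is far too large for that, so programs search heuristically.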
Rooting the tree • The outgroup taxa are included in the data matrix just like the other taxa. • Once the best tree is found, it is “rooted” along the branch connecting the outgroup and ingroup taxa.
What to do in case of a tie: consensus • A “strict” consensus tree is one in which the branches not present on all trees are collapsed, resulting in polytomies. • A “50% majority rule” consensus tree is one in which the branches not present on more than 50% of the trees are collapsed, resulting in polytomies. • Trees with many polytomies are said to be less resolved than trees with few or no polytomies.
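The two consensus rules differ only in the threshold a branch must pass. The sketch below assumes each tree is represented simply as the set of clades (groups of taxa) it contains. The three equally good trees are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees each clade appears."""
    counts = Counter()
    for clades in trees:
        counts.update(clades)
    return counts


def strict_consensus(trees):
    """Keep only clades present in every tree."""
    counts = clade_counts(trees)
    return {clade for clade, n in counts.items() if n == len(trees)}


def majority_rule_consensus(trees):
    """Keep clades present in more than 50% of the trees."""
    counts = clade_counts(trees)
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


# Three tied trees on taxa A-E (made-up example).
# Two contain the clade (A,B); the third contains (A,C) instead.
trees = [
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AC"), frozenset("ABC"), frozenset("DE")},
]

strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
print("strict:  ", sorted("".join(sorted(c)) for c in strict))
print("majority:", sorted("".join(sorted(c)) for c in majority))
```

In the strict consensus, A, B and C collapse into a polytomy because the trees disagree about which two are closest. The majority-rule tree keeps (A,B) because two of the three trees support it.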
Why are Maximum Likelihood and Bayesian methods considered an improvement over parsimony? • + They allow for a model of molecular evolution to be specified. • Not all changes from one base to another (or from one a.a. to another) are equally likely. • Not all positions have the same probability of change. • - They require that the correct model be specified.
What is Maximum Likelihood (ML)? • Just like parsimony, ML examines lots of trees and picks the best one. • However, the optimality criteria differ. • Parsimony -- fewest changes. • ML -- maximizes the probability of observing the data (aligned sequences), given a model of molecular evolution.
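To make the ML criterion concrete, here is a minimal sketch that computes the probability of an alignment given a tree and a model, then picks the tree with the highest value. It rests on several simplifying assumptions that are not from the slides: the Jukes-Cantor (JC69) model, every branch fixed at the same length `t`, a made-up four-taxon alignment, and Felsenstein's pruning algorithm for the calculation. Real ML programs also optimize branch lengths and model parameters.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability that base `a` becomes base `b` along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partials(tree, column, t):
    """Pruning step: P(data below this node | node has base b), for each base b."""
    if isinstance(tree, str):
        return {b: 1.0 if b == column[tree] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child in tree:
        child_partials = partials(child, column, t)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * child_partials[c] for c in BASES)
    return result


def log_likelihood(tree, alignment, t=0.1):
    """Log of P(alignment | tree, JC69), with every branch of length t."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        root = partials(tree, column, t)
        # Each base is equally likely at the root under JC69.
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total


# Toy data and candidate topologies (assumptions for illustration only).
alignment = {"A": "AAGT", "B": "AAGC", "C": "ACTC", "D": "ACTT"}
candidate_trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

log_liks = {name: log_likelihood(t, alignment) for name, t in candidate_trees.items()}
ml_tree = max(log_liks, key=log_liks.get)
print(log_liks, "-> ML tree:", ml_tree)
```

The search loop is the same as in parsimony (score many trees, keep the best). Only the scoring function changes.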
Models of molecular evolution • Substitution matrix • For proteins, this is the (observed) probability of one amino acid changing to another. • For DNA, it is the probability of one base changing to another. • Site-to-site variation in rate of change • Some sites don’t vary. • Among those that do, they vary at different rates.
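Both ingredients of a model can be illustrated with a small DNA example. The sketch below assumes Kimura's two-parameter (K80) model, in which transitions (A<->G, C<->T) occur at rate `alpha` and transversions at rate `beta`. The parameter values and the three site-rate multipliers are arbitrary choices for illustration, not values from the slides.

```python
import math

PURINES = {"A", "G"}
BASES = "ACGT"


def k80_prob(a, b, t, alpha=2.0, beta=0.5):
    """K80 probability that base `a` is replaced by base `b` after time t."""
    e_tv = math.exp(-4.0 * beta * t)
    e_ts = math.exp(-2.0 * (alpha + beta) * t)
    if a == b:
        return 0.25 + 0.25 * e_tv + 0.5 * e_ts
    if (a in PURINES) == (b in PURINES):
        # Purine <-> purine or pyrimidine <-> pyrimidine: a transition.
        return 0.25 + 0.25 * e_tv - 0.5 * e_ts
    # Purine <-> pyrimidine: a transversion.
    return 0.25 - 0.25 * e_tv


# Substitution probabilities: not all changes are equally likely.
t = 0.1
p_transition = k80_prob("A", "G", t)
p_transversion = k80_prob("A", "C", t)
row_sum = sum(k80_prob("A", b, t) for b in BASES)
print("A->G:", round(p_transition, 4), " A->C:", round(p_transversion, 4))

# Site-to-site rate variation: scale the time by a per-site rate.
# Rate 0.0 is an invariant site; the other rates are arbitrary examples.
site_rates = [0.0, 0.5, 2.0]
p_change_by_rate = [1.0 - k80_prob("A", "A", t * r) for r in site_rates]
print("P(any change) by site rate:", [round(p, 4) for p in p_change_by_rate])
```

For proteins the same idea applies with a 20 x 20 matrix estimated from observed substitutions. That matrix is the kind of model a tool such as ProtTest chooses among.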
Why is using a correct model of molecular evolution better than using parsimony? • Under some conditions, parsimony chooses the wrong tree (long branch attraction). • Methods using a model are more precise and result in fewer exact ties, generally. • For example, changes between two chemically similar a.a.’s can be used as “similarity”. Under parsimony all differences are simply “different”. • Models usually choose a single best tree, whereas parsimony usually chooses a large set of most parsimonious trees. • Branch length estimates are more accurate with a model.
What is Bayesian phylogenetic analysis? • Just like ML, we search for the best trees that are consistent with both the model and the data. • Optimality criterion: maximizes the posterior probability of the tree, given the data (aligned sequences) and the model of molecular evolution. • Bayesian analysis is the only one that automatically provides confidence estimates (similar to bootstrap values) for each node.
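The difference from ML is the quantity being reported: the probability of the tree given the data, obtained with Bayes' rule. The toy calculation below assumes only three candidate trees, a flat prior, and made-up log-likelihood values. Real programs such as MrBayes do not enumerate trees this way: they sample trees, branch lengths and model parameters with MCMC.

```python
import math


def posterior_probabilities(log_likelihoods, priors=None):
    """Apply Bayes' rule to a small, fully enumerated set of trees."""
    names = list(log_likelihoods)
    if priors is None:
        # Flat prior: every tree equally probable before seeing the data.
        priors = {name: 1.0 / len(names) for name in names}
    log_joint = {n: log_likelihoods[n] + math.log(priors[n]) for n in names}
    # Subtract the maximum before exponentiating to avoid underflow.
    largest = max(log_joint.values())
    weights = {n: math.exp(v - largest) for n, v in log_joint.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


# Hypothetical log-likelihoods and the clades each tree contains.
log_likelihoods = {"T1": -100.0, "T2": -102.0, "T3": -105.0}
clades_in_tree = {
    "T1": {frozenset("AB")},
    "T2": {frozenset("AB")},
    "T3": {frozenset("AC")},
}

posterior = posterior_probabilities(log_likelihoods)

# Node support: total posterior probability of the trees containing a clade.
support_ab = sum(p for name, p in posterior.items() if frozenset("AB") in clades_in_tree[name])
print({n: round(p, 4) for n, p in posterior.items()}, "support for (A,B):", round(support_ab, 4))
```

The last step shows why node confidence comes for free: the support for a clade is the summed posterior probability of every tree that contains it.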
Example - Bayesian analysis of signal transduction proteins • Using ProtTest to find out how the sequences are evolving • Informing MrBayes of the model of molecular evolution • Using MrBayes to get the phylogeny • Making a figure
MrBayes doesn’t know when it has run long enough -- you decide. • A common stopping rule: keep running until the average standard deviation of split frequencies falls below 0.01.
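The diagnostic compares how often each split (clade) appears in two independent runs. The sketch below is a simplified reading of it, not MrBayes's own code. It assumes two runs, the sample standard deviation across runs, and a hypothetical `min_freq` cutoff that ignores very rare splits. The sampled trees are invented.

```python
import math


def split_frequencies(sampled_trees):
    """Fraction of sampled trees that contain each split."""
    counts = {}
    for splits in sampled_trees:
        for s in splits:
            counts[s] = counts.get(s, 0) + 1
    n = len(sampled_trees)
    return {s: c / n for s, c in counts.items()}


def average_sd_of_split_frequencies(run_a, run_b, min_freq=0.1):
    """Average, over splits, of the standard deviation of their frequency across two runs."""
    freq_a, freq_b = split_frequencies(run_a), split_frequencies(run_b)
    sds = []
    for s in set(freq_a) | set(freq_b):
        a, b = freq_a.get(s, 0.0), freq_b.get(s, 0.0)
        if max(a, b) < min_freq:
            continue  # assumption: skip splits that are rare in both runs
        mean = (a + b) / 2.0
        # Sample standard deviation of two values (divide by n - 1 = 1).
        sds.append(math.sqrt((a - mean) ** 2 + (b - mean) ** 2))
    return sum(sds) / len(sds) if sds else 0.0


# Made-up samples: each tree is reduced to the one split that distinguishes it.
AB, AC = frozenset("AB"), frozenset("AC")
run_a = [{AB}] * 8 + [{AC}] * 2
run_b = [{AB}] * 6 + [{AC}] * 4

asdsf = average_sd_of_split_frequencies(run_a, run_b)
same = average_sd_of_split_frequencies(run_a, run_a)
print("ASDSF:", round(asdsf, 4), "converged" if asdsf < 0.01 else "keep running")
```

When both runs have settled on the same distribution of trees, the split frequencies agree and the value approaches zero.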
What is Neighbor-joining (NJ)? • NJ is an algorithm for building a tree. • There is no optimality criterion. • First, a matrix of distances between all pairs of sequences is computed. • A substitution matrix is needed to do this. • Then, one pair is chosen from among all possible pairs, because combining them best minimizes the length of the tree.
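The pair-selection step can be shown on its own. This sketch assumes the standard NJ criterion Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), with the pair of lowest Q joined first. The five-taxon distance matrix is a small made-up example, and `nj_pick_pair` is a hypothetical helper covering only the first step, not a full NJ implementation.

```python
from itertools import combinations


def nj_pick_pair(dist):
    """Return the pair of taxa NJ would join first, and its Q value."""
    taxa = sorted(dist)
    n = len(taxa)
    # Total distance from each taxon to all the others.
    row_sum = {i: sum(dist[i][k] for k in taxa if k != i) for i in taxa}
    q = {
        (i, j): (n - 2) * dist[i][j] - row_sum[i] - row_sum[j]
        for i, j in combinations(taxa, 2)
    }
    best = min(q, key=q.get)
    return best, q[best]


# Pairwise distances (an assumption for illustration only).
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {t: {} for t in "abcde"}
for (i, j), d in pairs.items():
    dist[i][j] = d
    dist[j][i] = d

pair, q_value = nj_pick_pair(dist)
print("join first:", pair, "Q =", q_value)
```

Note that the chosen pair (a, b) is not the pair with the smallest raw distance (d, e). The correction terms account for how far each taxon is from everything else. After the pair is joined, it is replaced by a new node and the step repeats on the smaller matrix.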
Neighbor-joining • NJ is very fast. • There is no optimality criterion. • This means there is no way to assess its success. • There is also no way to say whether a “best” tree is significantly better than a set of “next best” trees. (mt Eve) • The tree it chooses is not always the shortest. Distances are estimated from noisy data and early mistakes in NJ can’t be revisited.
Large data sets • If you have over 50 sequences, or if you have very long sequences (hundreds of proteins) ProtTest and MrBayes may take more than a couple of days to finish. • Parsimony is much faster. • It allows node support (bootstrap values) to be calculated. • It doesn’t require a model of molecular evolution. • PAUP* can read nexus files. • NJ is faster still. Sometimes it is the only method that is fast enough. • A default model of molecular evolution must be used.
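Bootstrap support is computed by resampling alignment columns with replacement, rebuilding the tree each time, and counting how often a clade reappears. The sketch below shows only that loop. The tree builder `closest_pair_clades` is a deliberately crude, hypothetical stand-in (it reports just the most similar pair of sequences as a clade) and would be replaced by a real parsimony or NJ search. The alignment is invented.

```python
import random
from itertools import combinations


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def closest_pair_clades(alignment):
    """Toy 'tree builder': the single clade made of the two most similar sequences."""
    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))
    best = min(combinations(sorted(alignment), 2), key=differences)
    return {frozenset(best)}


def clade_support(alignment, build_clades, clade, n_replicates=100, seed=0):
    """Fraction of bootstrap replicates whose tree contains `clade`."""
    rng = random.Random(seed)
    hits = sum(
        clade in build_clades(bootstrap_replicate(alignment, rng))
        for _ in range(n_replicates)
    )
    return hits / n_replicates


# Made-up alignment for illustration.
alignment = {
    "A": "AAGTCCGTAA",
    "B": "AAGTCCGTAC",
    "C": "ACTTCAGTTC",
    "D": "ACTTGAGATC",
}

replicate = bootstrap_replicate(alignment, random.Random(1))
support = clade_support(alignment, closest_pair_clades, frozenset("AB"))
print("bootstrap support for (A,B):", support)
```

Because the tree must be rebuilt for every replicate, bootstrapping is practical mainly with fast methods, which is one reason it pairs naturally with parsimony and NJ on large data sets.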
DNA sequences should be used when sequences are highly similar • Use a very similar procedure. • Use MrModelTest instead of ProtTest.
Summary • Three most important choices • Which sequences to include • Outgroup sequences • Alignment • Choice of method - Bayesian • Example - Look on Ned’s Computational Corner for more details.