Phylogenetic Analysis

Phylogenetic Analysis Introduction to bioinformatics Stinus Lindgreen stinus@binf.ku.dk Bioinformatics Centre, University of Copenhagen

Outline of the lecture • What is a phylogeny? • Why and how to interpret them • Programs: PHYLIP, PAUP* and BioEdit • Building a tree 1: Multiple alignment • Building a tree 2: The model • Building a tree 3: Construction • Building a tree 4: Evaluation

Nothing in Biology Makes Sense Except in the Light of Evolution Theodosius Dobzhansky (1900-1975)

Phylogeny • Phylogenetic inference predicts a tree based on characters (of some sort) • Some variation needed • Group together similar species/genes • Connect to most common ancestor • Unrooted tree: Just show connections • Rooted tree: Direction of evolution • Branch lengths can show divergence

Before sequences • Phylogenetic trees show evolutionary relationships • Existed longer than sequencing methods • Previously based on morphological characters • Still partly today – at least for checking • Mainly based on biological sequences • DNA or protein • Base phylogeny on mutations

Morphological tree

Modern tree A A G C G X X

Some pitfalls • Determining phylogeny is important for understanding biology • But also a very difficult problem • Beware of incorrect trees • Important to understand models and methods • The programs are helpful tools The result is only as good as the alignment

Assumptions Basic concepts of evolutionary theory • Relation to common ancestor • Phylogenetics represented by bifurcating tree • Mutations occur over evolutionary time Necessary to make phylogenetic inference possible

Tree of Life

Interpretation • Know your model • Both evolutionary and for tree construction • Know the assumptions of the model • Evolution independent? Identical between sites? The same for all sequences? • Are the sequences correct? • And are they representative? • And are they homologous? • Is the multiple alignment correct? What you get out is no better than what you put in

Some biological pitfalls Don’t make hasty conclusions! • Does your tree contradict common sense? • Then it’s probably wrong! • Differentiate between the homologs • Orthologs • Speciation, common ancestor, similar function • Paralogs • Gene duplication, within 1 organism, differing functions • Xenologs • Horizontal gene transfer – hard to tell, similar function

Software Today we’ll look at the programs before the methods Some programs for phylogenetic analysis • A multiple alignment program: Clustal, T-Coffee, MAFFT, Muscle… • A phylogenetic program: Phylip, PAUP*, MacClade, BioEdit… • Visualizing the tree: TreeView, NJplot

PAUP* • Commercial package • Apparently good • Many different methods and analysis methods • But since we don’t own a copy… • Similarly: MacClade only works on Macintosh…

PHYLIP • Free package • Many programs • Both distance and character based • Bootstrapping possible • But: • It can be a little difficult • No graphical user interface • And you will need to run many programs

BioEdit • Has phylogeny methods built in • Can call Phylip routines • No need for you to learn the command line • But no bootstrapping… (as far as I know) • Point and click: • Select the sequences in the alignment • Choose the wanted phylogeny • Voila!

PhyloWin • Another free program • Simple, not many possibilities • But you can make bootstrapping

Getting the software • Install BioEdit, PHYLIP, PhyloWin and NJplot • Links on the wiki

Constructing a tree To make a phylogenetic tree, four steps are needed: • Perform multiple alignment • Choose your model • Build the tree • Evaluate the quality A brief note: Ideally: Parallel alignment and phylogenetic inference • Very difficult – but it has been pursued

1) The multiple alignment Already discussed Some notes: • Recall that MA programs are not exact • Some manual editing often necessary • Consider the algorithm used • Does it consider the phylogeny of the data? • Clustal’s guide tree: Not correct phylogeny • What parameters are used? • Solve ambiguities, remove near-identical sequences • Gappy regions, identical sequences can bias the result

2) The model The model describes the data • Evolutionary events • Overall mutability • Evolutionary model? • Crucial – both for alignment and tree building • Are you looking at nucleotides or amino acids? • Where do we get most information? • Know the basis for the chosen model

Nucleotide models • Create 4×4 matrix • Either fixed cost • Character state • Or rate matrices • Probabilities • Used for different kinds of tree estimations • Include site specific information • Third codon position more variable

Nucleotide model 1 • Fixed cost for transitions and transversion • E.g. transversions are twice as costly as transitions • For a tree: Count the number of transitions/transversions • Calculate cost • Tends to minimize number of transversion • Cluster transitions

Nucleotide model 2 • Simple substitution rate matrix • Assume same rates AB and BA • Assume all mutations equally likely: Rate α • The Jukes-Cantor model

Nucleotide model 3 • More advanced rate matrix • Include transitions/tranversions • Rates α1 and α2 • The Kimura 2-parameter model

Amino acid models • A 20×20 substitution matrix • The BLOSUM matrices • Fixed cost matrices • Or the PAM matrices • Rate matrices • Described last week

3) Building the tree We have the sequences, the alignment and the model • Find the best tree • What is the best tree? • Two main strategies: • Distance based • Look at dissimilarities (=distances) • Character based • Look at the data

Problems with trees • The number of possible trees grows exponentially • For 15 taxa: 2.13·1014 possibilities… • How to search? • Branch and Bound • Branch swapping • Rooting the tree • Not a simple problem • All the following methods produce unrooted trees • Use an outgroup • Midpoint of longest branch

Distance methods • Some sequences more similar than others • Closely related sequences should be close in the tree • Abstract view on the data • Loss of information is usually a bad sign • Only use the distances between sequences • Recall Clustal • All methods start with a distance matrix

Distance methods • Can we get the correct answer? • Yes, if all mutation events were present • But: After one mutation, the site is ”saturated” • Additional mutations do not give additional info A B C: Distance 2 A C: Distance 1 • And mutations back will fool the method A B A: Distance 2 A A: Distance 0

UPGMA Unweighted Pair Group Method with Arithmetic Mean • Unweighted: The distances are used as they are • Pair: Find the two closest elements • Group: Put them together in a new group • Arithmetic Mean: Gives distances from the new group • Correct tree assuming a molecular clock • Evolutionary divergence time can be found from mutations • Mutation rates are constant

UPGMA illustrated • Find two closest: A and D • Create a new group [A+D] • Update distances: • Repeat for all sequences • Next time: Connect [A+D] with E

Trying UPGMA • Go to the wiki and do the UPGMA exercise

Neighbour joining • A little like UPGMA • Difference: NJ does not assume a molecular clock • But it assumes an additive tree • Distance between two leaves is the sum of the edges • Find the closest pair that is most apart from the rest of the tree • Connect pair and update distances • A little advanced: Take the overall distance to the rest of the tree into account • Corrects for varying mutation • Fast and can give good results

Fitch-Margoliash FM method • We have the pairwise distances • Each branch in the tree has a length • The length of all paths can be found • Optimize tree by moving internal nodes around • The best fit minimizes the overall error • The minimum squared deviation

Minimum Evolution The ME method • Find the shortest tree • Count number of changes • Similar to FM but only looks at branches FM B ME B A A

Trying NJ • Go to the wiki and do the NJ exercise

Character methods • Use the data (the actual characters) • All information at hand • More advanced, slower, but also more accurate • Maximum Parsimony (MP) • Occam’s razor: Simplest explanation • Maximum Likelihood (ML) • Advanced statistical method • Most probable tree given the data and the model

Maximum parsimony • How does evolution work? • Assumption: Path of least resistance • True evolution gives rise to fewest changes • The tree we want: • Describe the given sequences by fewest changes • The ancestral nodes must be as similar as possible • Predict a tree • Count the number of changes needed

MP illustrated A C G G C {A,C} {G} {A,C,G} {C}

MP illustrated A C G G C {A,C} {G} {A,C,G} {C} X X Cost: 2 changes

MP illustrated A C G G C C G C C CA CG

Maximum Likelihood • Given the data, predict the most probable model • Can optimize both tree and substitution model • We know the sequences • What is the most likely substitution rates? • Estimate from the alignment (and the phylogeny) • And what is the most likely tree? • Estimate from alignment and substitution rates • Computationally heavy and rather slow • Normally good results

Maximum Likelihood • General practice: Optimize model then tree • Calculate probability for each alignment column • Combine to probability for entire alignment • Averages over low and high probability sites • Likelihood of column given tree A A C A A A A C C A A A C G A L=P +P +P +…

Maximum likelihood • Then repeat this for all possible tree topologies • And all possible assignments to internal nodes • And then choose the combination that gives the highest probability… • Clearly very difficult

MP and ML exercise • Go to the wiki and do the MP and ML exercises

Summary of methods

The differences • Sometimes the differences can seem minimal • They affect the tree – but the same result is possible UPGMA and NJ • Minimize the overall length of the tree Maximum parsimony • Finds tree with fewest changes Maximum likelihood • Maximizes the probability of the tree given the data

4) Evaluating trees How good is the predicted tree? Some sequence variation needed • Is the signal strong enough? There are so many possible trees • Are there many trees similar to the prediction? • Which one to choose? • Is the tree robust? • Does it change much when e.g. removing a sequence?

Randomization • Is it possible that tree is just random? • Permute the columns of the alignment • i.e. shuffle the characters in a column • Build a new tree • Is it (partly) identical? • If the tree is just as likely to be random, then don’t put too much faith in it

Phylogenetic Analysis