Wellcome Trust Workshop Working with Pathogen Genomes Module 5 Phylogenetics

Homology • Owen’s definition of homology: • Homology: the same organ under every variety of form and function (true or essential correspondence) • Analogy: superficial or misleading similarity • Richard Owen (1843)

Some Important Definitions • Homology vs Homoplasy: • Homology describes similarity due to common inheritance from an ancestor. Homologous characters provide useful similarity. • Homoplasy describes similarity due to independent acquisition of the same or a superficially similar character state. Homoplastic characters provide a misleading picture of phylogeny. • [Figure: presence or absence of hair and of a tail in dog, frog, human and lizard]

Phylogenetic Systematics • Phylogenetics aims to reconstruct the ancestry of biological lineages • It regards homology as evidence of common ancestry • Relationships are usually portrayed on tree diagrams • Monophyletic groups (clades) contain taxa that are more closely related to each other than to any taxon outside the group • Greater distance between taxa reflects fewer shared, homologous characters

Cladograms and Phylograms • Cladograms show branching order; branch lengths are meaningless (relative time) • Phylograms show branching order and branch lengths (absolute ‘time’, i.e. divergence) • [Figure: the same tree of Bacterium 1-3 and Eukaryote 1-4 drawn as a cladogram and as a phylogram]

Rooted and Unrooted trees • The root defines common ancestry • Unrooted tree: [Figure: unrooted tree of Archaea 1-3 and Eukaryote 1-4] • Tree rooted by outgroup: [Figure: the same taxa rooted with a bacterial outgroup; Archaea 1-3 form one monophyletic group and Eukaryote 1-4 another]

Some Tree Terms and Facts • Branches • Nodes • Leaves / Tips / OTUs (operational taxonomic units) • Nodes can be freely rotated without changing the relationships shown • [Figure: labelled tree of Archaea 1-3 and Eukaryote 1-4]

Some Tree Terms and Facts • Nodes can be freely rotated without changing the relationships shown • Only horizontal distances indicate divergence • Total distance = [illustrated on the figure] • [Figure: the same tree drawn twice, once with the Eukaryote clade on top and once with the Archaea clade on top]

Building a Phylogenetic Tree • Identify protein, DNA or RNA sequences of interest (FASTA format file of concatenated sequences) • Multiple sequence alignment (ClustalX/MUSCLE) • Construct phylogeny (PhyML) • View and edit tree (FigTree) • Note: there are many (many) other programs for alignment, tree building and tree viewing
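
The first step of the workflow above, collecting the sequences into one FASTA file, can be sketched in a few lines. This is a minimal illustration only: the taxon names and the short sequences are toy data, and `write_fasta` / `read_fasta` are hypothetical helper names, not part of any of the programs listed.

```python
def write_fasta(records, line_width=60):
    """Return FASTA text for a dict of {name: sequence}."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")  # header line starts with '>'
        for i in range(0, len(seq), line_width):
            lines.append(seq[i:i + line_width])  # wrap long sequences
    return "\n".join(lines) + "\n"


def read_fasta(text):
    """Parse FASTA text back into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header is the name
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Toy sequences (assumed for illustration only).
sequences = {
    "taxon_A": "ACCTGATGC",
    "taxon_B": "AGCTGGTTC",
    "taxon_C": "AGCAGATGG",
}
fasta_text = write_fasta(sequences)
print(fasta_text)
```

A file written this way is the kind of input that alignment programs generally accept; the exact options of each program should be checked in its own documentation.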

Multiple Alignments • An alignment is a hypothesis of positional homology between bases/amino acids of different sequences • Phylogeny is meaningless unless it is based on a well-made alignment

Multiple Alignment • Alignment can be easy or difficult • Easy: [Figure: example alignment] • Difficult due to insertions or deletions (indels): [Figure: example alignment]

Multiple Alignment: CLUSTAL • Quick pairwise alignment: calculate distance matrix • Neighbor-joining tree (guide tree) • Progressive alignment following guide tree
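
To make the "distance matrix" step concrete, here is a minimal sketch that computes the proportion of differing sites (the p-distance) between every pair of sequences. It assumes the sequences are already the same length, which is a simplification: CLUSTAL derives its distances from its own quick pairwise alignments, and this is not its actual code. The sequences are toy data.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of compared sites that differ, ignoring gap positions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue  # skip sites where either sequence has a gap
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(alignment):
    """Return a symmetric nested dict of pairwise p-distances."""
    names = list(alignment)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = p_distance(alignment[n], alignment[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


# Toy aligned sequences (assumed for illustration only).
alignment = {"A": "ACCTGATGC", "B": "AGCTGGTTC", "D": "TCCTCGTGC"}
matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, [round(row[other], 3) for other in alignment])
```

A matrix like this is what a neighbor-joining step would take as input when building the guide tree.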

Building a Phylogenetic Tree: Choices are Unavoidable! • There are many different phylogenetic methods • So, you will be confronted with unavoidable choices • Not all methods are equally good for all data • Although we do not need to understand all the details of the various phylogenetic methods, an understanding of their basic properties is essential for an informed choice of method and interpretation of results

Phylogenetic Methods

Hill Climbing • Imagine tree ‘space’ is a hill • Better trees (measured by parsimony or likelihood) are higher • We can find the best tree using a robot with a simple program: • Accept uphill moves • Reject downhill moves • [Figure: a hill with ‘better’ trees towards the top]

Hill Climbing • Local maxima are a problem for methods that use hill-climbing algorithms to find the best tree • One way to reduce the probability of being stuck on a local maximum is to repeat the analysis from different starting points • i.e. beam in a number of robots to different starting positions
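
The robot's program and the effect of several starting points can be sketched on a toy landscape. Everything here is an assumption made for illustration: real tree space is not a line of numbers, and the `HEIGHTS` list simply stands in for tree scores, with a small local peak at position 2 and the highest peak at position 8.

```python
# Hypothetical 1-D "tree space": each position is a tree, each height its score.
HEIGHTS = [1, 2, 3, 2, 1, 2, 4, 6, 9, 6, 3]


def hill_climb(heights, start):
    """Move to the better neighbour until no neighbour is higher."""
    position = start
    while True:
        neighbours = [p for p in (position - 1, position + 1)
                      if 0 <= p < len(heights)]
        best = max(neighbours, key=lambda p: heights[p])
        if heights[best] <= heights[position]:
            return position  # no uphill move left: a (possibly local) peak
        position = best


def climb_with_restarts(heights, starts):
    """Run one climber per starting point and keep the highest peak found."""
    peaks = [hill_climb(heights, s) for s in starts]
    return max(peaks, key=lambda p: heights[p])


stuck = hill_climb(HEIGHTS, start=0)                      # one robot
best = climb_with_restarts(HEIGHTS, starts=[0, 5, 10])    # several robots
print("single start ends at position", stuck)
print("best of three starts is position", best)
```

A single robot started on the left stops on the small peak, while at least one of several robots reaches the higher one. More starting points reduce, but do not remove, the chance of missing the best tree.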

Maximum parsimony • Method: • Searches through tree topologies in ‘tree-space’ using a ‘hill-climbing’ algorithm. • Applies an optimising criterion: maximum parsimony. • Scores trees on their ‘length’, i.e., the number of character state changes required to explain the distribution of characters on a given tree topology. • Selects the topology with the fewest character changes overall.
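
Scoring a tree by its ‘length’ can be illustrated with Fitch's algorithm, one standard way of counting the minimum number of state changes on a fixed topology. The slides do not say which algorithm a given program uses, so treat this as a sketch under that assumption. Trees are written as nested tuples of taxon names, and the five short sequences are toy data.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one character.

    `tree` is either a taxon name (a leaf) or a (left, right) tuple.
    `states` maps each taxon name to its state at this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, changes  # children agree: no extra change needed
    return left_set | right_set, changes + 1  # disagreement costs one change


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all characters."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


# Toy data (assumed for illustration only).
alignment = {
    "A": "ACCTGATGC",
    "B": "AGCTGGTTC",
    "C": "AGCAGATGG",
    "D": "TCCTCGTGC",
    "E": "TCTTAATGC",
}
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "D"), "B"), ("C", "E"))
print("tree 1 length:", tree_length(tree_1, alignment))
print("tree 2 length:", tree_length(tree_2, alignment))
```

Under the parsimony criterion the topology with the smaller length is preferred. A real search would compare many more topologies than two.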

Likelihood • ‘The Idiot’s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed’ • A gentle introduction, for those of us who are small of brain, to the calculation of the likelihood of molecular sequences • http://www.bmnh.org/web_users/pf/idiots.pdf

Likelihood • We know that the process of sequence evolution isn’t as simple as parsimony assumes • There may be multiple substitutions at a single site • Not all changes between bases/amino acids are equally likely • Some bases may be essential for the correct function of a gene, so are less likely to change than others (or may not change at all) • Likelihood methods allow us to incorporate such knowledge into complex models of evolution • Ideally we would like to be able to calculate the probability of a tree given our data and model • Unfortunately this is not possible • However, we can calculate the likelihood of our data given our model (and the tree)

Likelihood • Imagine tossing a coin and getting a head. What is the probability (likelihood) of that result? • If our model is that the coin is fair, the probability is 0.5 • If our model is a double headed coin, the probability is 1 • The model you choose can have a big effect on the likelihood
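
The coin example can be written out directly. The sketch below treats a model as nothing more than a probability of heads, which is an assumption made for illustration; the function name `likelihood` is hypothetical.

```python
def likelihood(tosses, p_head):
    """Probability of the observed tosses given a coin with P(head) = p_head."""
    result = 1.0
    for toss in tosses:
        result *= p_head if toss == "H" else 1.0 - p_head
    return result


fair = likelihood("H", p_head=0.5)           # model 1: a fair coin
double_headed = likelihood("H", p_head=1.0)  # model 2: a double-headed coin
print("fair coin:", fair)
print("double-headed coin:", double_headed)

# More data can change the picture: a single tail rules out model 2.
print("H then T, double-headed:", likelihood("HT", p_head=1.0))
```

The same data receive different likelihoods under different models, which is the point of the slide.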

Maximum likelihood • Methods: • A (complex) model of DNA or protein sequence evolution is used to estimate parameters for specific substitutions and other qualities of molecular sequences, usually including: • Composition: the frequency of each base • Rates: the rates of substitution between each pair of bases
Composition: • Imagine a city where people have cars that are red, blue, green or yellow • In this city is a busy car park where people park for varying times and then leave and are immediately replaced by another car • Over time the composition of car colours in the car park will reflect the composition of car colours in the city as a whole • Even if the car park started completely full of blue cars, over time it will still tend towards the city composition • To correctly model this process we need to know the composition of car colours in the city
Rates: • We also need to know the relative rates of change between the character states (car colours or bases) • There are many models available for this, from very simple: JC (Jukes-Cantor) assumes all changes occur at the same rate • Through intermediate: we know that transitions occur more frequently than transversions, so we can give the model a ratio for this difference. Better still, we can estimate this ratio from our dataset • ...to very complex: GTR (general time-reversible) allows each type of change to occur at its own rate, which is estimated during the analysis
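
The simplest of these models, JC, has a closed-form expression for the probability that a site shows the same or a different base after a given amount of evolution. The sketch below uses the standard Jukes-Cantor formula, with distance measured in expected substitutions per site; the function name is hypothetical.

```python
import math


def jc69_probability(same, distance):
    """Jukes-Cantor probability of ending on the same base (or on one
    particular different base) after `distance` expected substitutions
    per site."""
    decay = math.exp(-4.0 * distance / 3.0)
    if same:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


for d in (0.0, 0.1, 1.0, 10.0):
    p_same = jc69_probability(True, d)
    p_diff = jc69_probability(False, d)
    print(f"d={d:5.1f}  P(same)={p_same:.3f}  P(one other base)={p_diff:.3f}")
```

As the distance grows, both probabilities approach 0.25. This is the car park settling to the city's composition, which under JC is assumed to be equal for all four bases. More complex models replace the single rate and the equal composition with estimated parameters.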

Maximum likelihood • Various models accommodate sources of molecular homoplasy that might result in the wrong tree: • ‘Multiple hits’ (substitutional saturation) • Rate convergence • Rate heterogeneity • Base composition bias • Codon usage bias • Secondary structure • Covariance

Bootstrapping • Bootstrapping is a way to produce a measure of confidence in the relationships found in a phylogenetic analysis • Characters (sites/amino acids) are resampled with replacement to produce a set of replicate data sets • Each replicate is analysed (e.g. with parsimony/distance/maximum likelihood) • The frequency with which groups occur in the results of these analyses is a measure of support for those groups • Bootstrap proportions (BPs) are often shown as a number on each branch of a tree, indicating how often that relationship occurred in the replicate analyses

Original data (taxa by characters):
Taxa   1 2 3 4 5 6 7 8 9
A      A C C T G A T G C
B      A G C T G G T T C
C      A G C A G A T G G
D      T C C T C G T G C
E      T C T T A A T G C

Random number generator draws (characters to sample): 2 5 9 2 7 7 2 1 6

Resampled replicate:
Taxa   2 5 9 2 7 7 2 1 6
A      C G C C T T C A A
B      G G C G T T G A G
C      G G G G T T G A A
D      C C C C T T C T G
E      C A C C T T C T A
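
The resampling step can be sketched as follows, using the character matrix and the random draws shown on the slide. The function names are hypothetical, and the analysis of each replicate (tree building) is left out.

```python
import random

# Character matrix from the slide: five taxa, nine characters.
alignment = {
    "A": "ACCTGATGC",
    "B": "AGCTGGTTC",
    "C": "AGCAGATGG",
    "D": "TCCTCGTGC",
    "E": "TCTTAATGC",
}


def resample_columns(alignment, columns):
    """Build a replicate from the given 0-based column indices."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return resample_columns(alignment, columns)


# The draws shown on the slide are 1-based character numbers.
slide_draws = [2, 5, 9, 2, 7, 7, 2, 1, 6]
slide_replicate = resample_columns(alignment, [d - 1 for d in slide_draws])
for name, seq in slide_replicate.items():
    print(name, seq)

# A fresh random replicate (the seed is arbitrary).
random_replicate = bootstrap_replicate(alignment, random.Random(42))
```

Some characters appear several times in a replicate and others not at all, while every replicate keeps the original number of characters. In a full analysis this would be repeated many times, with a tree built from each replicate.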

Maximum likelihood: effect of rate matrices • [Figure: PHYML trees built with the MtRev matrix (value shown: 91) and with the WAG matrix (value shown: 92)] • Keane et al. 2006

Bayesian inference • Method: • Maximum likelihood tries to find the best values for the branch lengths and model parameters • Bayesian inference, on the other hand, allows these parameters to have some uncertainty • The result is not a single tree with specific parameters, but a distribution • Maximum likelihood expresses its result as the probability of the data given the model (including the tree) • Bayesian inference expresses its result as the probability of the model (including the tree) given the data (= posterior probability) • Bayesian inference requires a prior probability to be set for each parameter

Bayesian inference: an example with rare diseases and imperfect tests • Imagine there is a disease called bad spelling disease that we know is suffered by 1% of the population • We have a test that is quite accurate: • it detects the disease 90% of the time in patients with the disease • But, it will give positive results 10% of the time in patients without the disease • If you have a patient that tests positive, what is the probability that they actually have the disease?

Bayesian inference: an example with rare diseases and imperfect tests • It’s easier to explain if we imagine giving the test to 1000 patients • As we know 1% of people suffer from BSD, we know that by chance: • 10 will have the disease • 9 of those will test positive • 990 will not have the disease • 99 of those will test positive • So 108 tests give positive results • But if you test positive your probability of having the disease is only 9/108 ≈ 8%

Bayesian inference: an example with rare diseases and imperfect tests • Before the test we believed that each patient had a probability of 1% of having the disease (= our prior probability) • If they tested positive we can adjust this probability to 8% (= our posterior probability) • But we want to be more certain, so we can give the 108 patients who tested positive a second test • This time our prior is 8% • ~9 will have the disease (8% of 108) • ~8 of those will test positive • ~99 will not have the disease • ~10 of those will test positive • This time ~18 tests give positive results • And if you test positive your probability of having the disease is now about 8/18 ≈ 45% • The more tests we do, the more our initial prior probability is overwhelmed by the data (the test results)
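
The repeated-test calculation can be checked with a few lines of code. This sketch uses exact probabilities rather than whole patients, so its results may differ by a point or two from figures obtained by rounding to whole people. The function name `update` is hypothetical; the 90% detection rate and 10% false-positive rate come from the example.

```python
def update(prior, sensitivity=0.9, false_positive_rate=0.1):
    """Posterior probability of disease after one positive test."""
    true_positive = prior * sensitivity
    false_positive = (1.0 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)


probability = 0.01  # prior: 1% of the population has the disease
history = [probability]
for test_number in range(1, 5):
    probability = update(probability)  # yesterday's posterior is today's prior
    history.append(probability)
    print(f"after positive test {test_number}: {probability:.1%}")
```

Each posterior becomes the prior for the next test, and after a few positive results the starting value of 1% has little influence on the answer.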

Bayesian inference: the maths bit (you don’t need to remember this… I don’t)
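
The formula from this slide did not survive in the transcript. It was presumably Bayes' theorem, which is what the disease example applies. Written in symbols chosen here for illustration ($D$ for having the disease, $+$ for a positive test, $\theta$ for the model parameters), a sketch of it is:

```latex
\[
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}
            = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.1 \times 0.99} \approx 0.083
\]
\[
P(\text{tree}, \theta \mid \text{data})
  = \frac{P(\text{data} \mid \text{tree}, \theta)\,P(\text{tree}, \theta)}{P(\text{data})}
\]
```

The first line reproduces the 8% from the first test. The second is the same relationship for phylogenetics: posterior = likelihood × prior / probability of the data.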

Bayesian inference • Method (cont.): • In practice it is impossible to do Bayesian calculations for phylogenetic applications analytically • Rather, we use an MCMC process to search through tree-space • An MCMC can handle more parameters (i.e. more complex models) than ML

MCMC (Markov Chain Monte Carlo) • MCMC searches allow both uphill and downhill moves • It has a few simple rules: • Uphill steps are always accepted • Slightly downhill steps are often accepted • Drastic downhill steps are almost never accepted • Using these rules the robot tends to find and stay near the top of the hill • They also allow crossing of valleys between local maxima
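
These rules correspond to the Metropolis acceptance rule, one common form of MCMC. The sketch below applies it to a toy one-dimensional landscape. It is an illustration under stated assumptions, not how any particular phylogenetics program works: the `HEIGHTS` list stands in for unnormalised posterior scores of trees, and the step count and seed are arbitrary.

```python
import random

# Hypothetical 1-D "tree space": position 2 is a local peak, 8 the highest.
HEIGHTS = [1, 2, 3, 2, 1, 2, 4, 6, 9, 6, 3]


def acceptance_probability(current_height, proposed_height):
    """Uphill: always accept. Downhill: accept with the ratio of heights."""
    return min(1.0, proposed_height / current_height)


def run_chain(heights, start, n_steps, rng):
    """Return how many steps the chain spent at each position."""
    position = start
    visits = [0] * len(heights)
    for _ in range(n_steps):
        proposal = position + rng.choice((-1, 1))  # step left or right
        if 0 <= proposal < len(heights):           # off the edge: stay put
            accept = acceptance_probability(heights[position],
                                            heights[proposal])
            if rng.random() < accept:
                position = proposal
        visits[position] += 1
    return visits


visits = run_chain(HEIGHTS, start=0, n_steps=200_000, rng=random.Random(1))
for position, count in enumerate(visits):
    print(position, HEIGHTS[position], f"{count / sum(visits):.3f}")
```

Started on the left, the chain has to cross the valley at position 4 to reach the highest peak, which a pure hill climber could not do. The time it spends at each position ends up roughly proportional to that position's height.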

MCMC • An MCMC has no end point (it does not search for the ‘best’ tree like ML) • Instead it explores tree space • The rules mean it spends most of its time exploring trees that fit the data well • Because it has no ultimate goal we must tell it when to stop

MCMC • An MCMC will not find a single tree • Instead, every so often during the MCMC search we save the current tree • The first few trees saved are from the beginning of the search when the MCMC is not sampling ‘good’ trees • Trees in this ‘burn-in’ region are disposed of • This gives us a set of ‘good’ trees
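
Discarding the burn-in and summarising what is left can be sketched as below. The 25% burn-in fraction and the tiny list of saved trees are assumptions made for illustration. Each tree is represented only by the set of groups (clades) it contains, and counting how often a group appears among the kept trees is one standard way to summarise the sample.

```python
from collections import Counter


def discard_burn_in(samples, burn_in_fraction=0.25):
    """Drop the first part of the saved samples."""
    n_discard = int(len(samples) * burn_in_fraction)
    return samples[n_discard:]


def clade_support(trees):
    """Fraction of trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: count / len(trees) for clade, count in counts.items()}


# Two hypothetical topologies, each written as a set of clades.
AB, AC, DE = frozenset("AB"), frozenset("AC"), frozenset("DE")
tree_ab = frozenset({AB, DE})
tree_ac = frozenset({AC, DE})

# Hypothetical saved trees, in the order the chain sampled them.
saved = [tree_ac, tree_ac, tree_ab, tree_ab, tree_ac, tree_ab, tree_ab, tree_ab]

kept = discard_burn_in(saved)
support = clade_support(kept)
for clade, value in support.items():
    print(sorted(clade), round(value, 2))
```

The support values play the role of posterior probabilities for each group. How much burn-in to discard depends on the run and is usually judged by inspecting how the chain behaved.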

Bayesian methods can allow very complex models

Further details
Textbooks: • Hall, Phylogenetic Trees Made Easy. Sinauer Associates. • Page & Holmes, Molecular Evolution: A Phylogenetic Approach. Blackwell Science. • Felsenstein, Inferring Phylogenies. Sinauer Associates.
Software: • PhyML: http://atgc.lirmm.fr/phyml/ • PAUP* (NJ, MP, ML): http://paup.csit.fdsu.edu • PHYLIP (NJ, MP, ML): http://evolution.genetics.washington.edu/phylip.html • MrBayes (Bayesian): http://mrbayes.csit.fdsu.edu • Splitstree (Networks): http://www.splitstree.org • FindModel (Model Test): http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html • SeaView (contains Clustal, Muscle, PHYLIP and PhyML, plus a simple tree viewer): http://pbil.univ-lyon1.fr/software/seaview.html
Websites: • MultiPhyl (ML via email): http://distributed.cs.nuim.ie/multiphyl.php • Phylogeny.fr (Robust Phylogenetic Analysis For The Non-Specialist): http://www.phylogeny.fr/ • Felsenstein’s Phylogeny program page (links to available software): http://evolution.genetics.washington.edu/phylip/software.html
