
Wellcome Trust Workshop Working with Pathogen Genomes Module 6 Phylogeny



  1. Wellcome Trust Workshop Working with Pathogen Genomes Module 6 Phylogeny

  2. Phylogeny • Phylogeny refers to the ancestry of a biological lineage, but is also synonymous with phylogenetic tree • Taxonomy began by grouping taxa together based on morphology at various structural levels • Phylogeny is tree-like, or dichotomous • Phylogeny provides the historical basis to the comparative method

  3. Principle of phylogenetics • Inferring relationships is about similarity. • Homology describes similarity due to common inheritance from an ancestor. Homologous characters are the informative kind of similarity. • Homoplasy describes similarity due to independent acquisition of the same or a superficially similar character state. Homoplasious characters provide a misleading picture of phylogeny. • Increasing distance in a phylogenetic tree reflects a decreasing number of shared, homologous characters (assuming that evolution maximises homology).

  4. Phylogenetic trees in biology • Tool for understanding biological processes • Examination of phylogeny to determine distance to characterized molecules • draw conclusions regarding biological functions not otherwise apparent • multiple alignments vs. pairwise homology • Genomes are historical entities • their structure and function reflect the past

  5. Applications to genome biology • Gene family evolution • orthology vs paralogy • gene duplications and losses can be inferred through comparisons of ‘gene’ and ‘species’ trees • the placement of a gene in the ‘wrong’ position within a phylogeny is used to support horizontal gene transfer. • Microarray data analysis • Comparative genome hybridization (CGH) distance matrix • Phylogenomics • gene order, gene content and concatenated sequences can be used to infer phylogeny • Recombination • tests for recombination and gene conversion use phylogenetic profiles to detect breakpoints

  6. Building a phylogenetic tree • Identify protein, DNA or RNA sequences of interest • FASTA-format file of concatenated sequences • Multiple sequence alignment • ClustalX • Construct phylogeny • PhyML • View and edit tree • ATV
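
The first step of this workflow, collecting sequences into a FASTA file, can be sketched as follows. This is a minimal reader, not the parser used by ClustalX or PhyML; the function name `read_fasta` and the example records are illustrative assumptions.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}.

    The name is taken as the first word after '>' (an assumption of this
    sketch); sequence lines are joined until the next header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record file, written inline so the example is self-contained.
example = """>seq1 first example record
ACGT
ACGT
>seq2
ACGA
"""
sequences = read_fasta(example)
```

The resulting dictionary is the kind of input that the alignment and tree-building steps would then work from.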

  7. Overview of ClustalX Procedure (CLUSTAL W) • Quick pairwise alignment: calculate distance matrix • Neighbor-joining tree (guide tree) • Progressive alignment following guide tree
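
The distance-matrix step can be illustrated with a small sketch. It assumes the pairs of sequences have already been aligned to equal length and uses the simple proportion of differing sites (p-distance), so it glosses over the quick pairwise alignment itself and is not the scoring ClustalX applies internally. The names `p_distance` and `distance_matrix` are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Build a symmetric dict-of-dicts distance matrix from {name: aligned sequence}."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = p_distance(seqs[a], seqs[b])
    return dist


# Toy aligned sequences (illustrative only).
toy = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
matrix = distance_matrix(toy)
```

A matrix like this is what a guide tree would be built from before the progressive alignment stage.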

  8. Creating multiple alignments • Phylogeny is meaningless unless it is based on a well-done alignment • Issues to consider • Alignment parameters • Weight matrix parameters • Gap penalties • Truncated sequences • Non homologous sequences

  9. Multiple alignments: parameters

  10. Multiple alignments: Gap penalties • High gap penalties • Default gap penalties • Low gap penalties
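
The effect of the gap penalty can be seen in a small pairwise example. The sketch below is a plain Needleman-Wunsch global alignment with a single linear gap penalty; that is a simplification, since ClustalX uses separate gap opening and gap extension penalties, and the scoring values here are arbitrary assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Two sequences offset by one position.
low = global_align("ACGTACGT", "CGTACGTA", gap=-1)    # gaps are cheap
high = global_align("ACGTACGT", "CGTACGTA", gap=-10)  # gaps are expensive
```

With the low penalty the aligner opens a gap at each end to line up the shared residues; with the high penalty it accepts eight mismatches rather than pay for any gap.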

  11. Multiple alignments: truncated sequences

  12. Multiple alignments: non-homologous sequences

  13. Constructing phylogenies • Stages in constructing phylogenies: • Data scoring; producing genetic distances or character states (‘distance’ or ‘discrete’ data). • Tree sorting; processes for searching ‘tree-space’, e.g., hill-climbing or MCMC. • Estimation; identifying the most acceptable tree topology and model parameters using a variety of methods (‘clustering’ or ‘optimising’ methods) • Phylogenetic methods: • Algorithmic • Neighbor-joining • UPGMA • Tree-searching • Maximum parsimony • Maximum likelihood • Bayesian inference • No one method is best for all circumstances

  14. Neighbor Joining (NJ) • Principles: • Tree topology and branch lengths are estimated from a genetic distance matrix. • Advantages: • A single tree is estimated by minimising genetic distance, in a short time and with little computational expenditure. • Disadvantages: • The method lacks accuracy because there is no attempt to correct for potential bias (homoplasy). • The method lacks precision because the outcome is partly contingent on the tree with which the search process begins.
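
How a tree topology and branch lengths are obtained from a distance matrix can be sketched with the standard neighbor-joining recurrences. This is an illustrative implementation, not the code used by PHYLIP or PAUP*; the function name, the `nodeN` labels for internal nodes, and the five-taxon matrix are assumptions of the example.

```python
def neighbor_joining(names, dist):
    """Neighbor joining on a symmetric dict-of-dicts distance matrix.

    Returns an unrooted tree as a list of (node, node, branch_length) edges.
    """
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair that minimises the Q criterion.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((u, a, len_a))
        edges.append((u, b, len_b))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Small additive example matrix (illustrative values).
taxa = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
matrix = {x: dict(zip(taxa, row)) for x, row in zip(taxa, rows)}
tree_edges = neighbor_joining(taxa, matrix)
```

Each pass joins one pair, so a tree for n taxa is finished after n - 2 joins, which is why the method is fast.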

  15. Maximum parsimony (MP) • Principles: • Searches through tree topologies in ‘tree-space’ using a ‘hill-climbing’ algorithm. • Scores trees on their ‘length’, i.e., the number of character state changes required to explain the distribution of characters on a given tree topology. • Looks for the tree with the minimum number of changes overall. • Advantages: • Generally accurate method with few assumptions. • Phylogenetic hypotheses can be statistically tested by comparing the lengths of different trees. • Tree estimation is relatively fast and undemanding. • Disadvantages: • There are typically several shortest trees, resulting in a potentially ambiguous consensus topology. • There is no explicit model of evolution and so the method is prone to error under certain circumstances, e.g., long-branch attraction (homoplasy).
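
Tree ‘length’ can be made concrete with a short sketch. It scores a fixed topology with Fitch's algorithm, one common way to count the minimum number of state changes; it does not perform the tree search itself. Trees are written as nested pairs of taxon names, and the four-taxon alignment is an invented example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Invented alignment of four taxa and two sites.
alignment = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
length_1 = tree_length(tree_1, alignment)
length_2 = tree_length(tree_2, alignment)
```

Here the first topology needs two changes and the second needs three, so parsimony prefers the first.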

  16. Maximum likelihood (ML) • Principles: • Looks for the tree that, under a given model of evolution, maximizes the likelihood of the observed data • Applies a complex model of DNA or protein sequence evolution that estimates parameters for specific substitutions and other qualities of molecular sequences • Locates the most likely tree topology through a hill-climbing algorithm • Various models accommodate sources of molecular homoplasy that might result in the wrong tree: • ‘Multiple hits’ (substitutional saturation) • Rate convergence • Rate heterogeneity • Base composition bias • Codon usage bias • Secondary structure • Covariance

  17. Maximum Likelihood • Advantages: • Highly accurate because considerable biological realism is introduced through the substitution model. This allows various forms of homoplasy to be corrected for. • Phylogenetic estimation within the likelihood framework provides a robust statistical context in which to evaluate specific hypotheses. • A single tree is produced that is generally precise. • Disadvantages: • The complexity of the estimation process means that it is slow and computationally demanding. • The hill-climbing algorithm is susceptible to local optima and so is not guaranteed to return the optimal solution.
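
A minimal worked case shows what maximising the likelihood under a substitution model means, and how a model corrects for ‘multiple hits’. The sketch assumes the Jukes-Cantor (JC69) model, which the slides do not name, and considers only two aligned sequences, so the single parameter being estimated is the distance between them rather than a tree topology.

```python
import math


def jc69_distance(p):
    """JC69-corrected distance (substitutions per site) from an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of n_diff differing sites out of n_sites at distance d > 0."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    # Each site also carries the stationary base frequency of 1/4.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


# Illustrative data: 30 differences observed across 100 aligned sites.
observed = 30 / 100
d_hat = jc69_distance(observed)
```

The corrected distance exceeds the observed proportion of differences because some sites are expected to have changed more than once, and it is also the value of d at which the log-likelihood above peaks.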

  18. Bootstrapping a tree • Statistical estimate of the reliability of groupings • Sites in the alignment are resampled with replacement to produce pseudo-replicate alignments, and a tree is built from each • Process is iterated multiple times (100-1000 times) • Agreement among the resulting trees is summarized with a majority-rule consensus tree
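
The resampling step can be sketched as follows. Alignment columns are drawn with replacement to build each replicate; to keep the example short, tree building is replaced by a toy stand-in, `closest_pair`, that only reports which two taxa are most similar. That stand-in, the function names and the toy alignment are assumptions of the sketch, not part of any real bootstrap program.

```python
import random
from collections import Counter
from itertools import combinations


def resample_sites(alignment, rng):
    """Draw alignment columns with replacement to make one bootstrap replicate."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for tree building: the pair of taxa with the fewest mismatches."""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    pairs = combinations(sorted(alignment), 2)
    return frozenset(min(pairs, key=lambda pair: mismatches(*pair)))


def bootstrap_support(alignment, replicates=100, seed=1):
    """Fraction of replicates in which each grouping is recovered."""
    rng = random.Random(seed)
    counts = Counter(
        closest_pair(resample_sites(alignment, rng)) for _ in range(replicates)
    )
    return {group: count / replicates for group, count in counts.items()}


# Toy alignment in which A and B are identical and C, D are very different.
toy = {"A": "AAAAAA", "B": "AAAAAA", "C": "CCCCCC", "D": "GGGGGG"}
support = bootstrap_support(toy)
```

In a real analysis the stand-in would be a full tree estimate, and the per-group fractions are what a majority-rule consensus tree reports on its branches.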

  19. Bayesian • Principles: • Based on the notion of posterior probabilities: probabilities that are estimated, based on some model (prior expectations), after learning something about the data. • Uses an MCMC process to search through tree-space. • Selects the tree-topology with the highest probability, given the data. • Advantages: • Intuitive • Potential for any complex model. • Provides both parameter estimates (i.e., tree) and their probabilities in a single analysis. • Many different hypotheses can be evaluated in a single analysis. • The MCMC algorithm makes integrating over all parameter values fast and accurate; MCMCs are able to break out of local optima.

  20. Bayesian inference (BI) • Disadvantages: • An evolutionary model must be specified a priori, in the form of prior probabilities (‘priors’). Is there sufficient knowledge of these probabilities? • The MCMC must be run long enough for variation in the parameter estimates to smooth out, i.e. to reach ‘convergence’. The time required is never certain. • Posterior probabilities describe the absolute probability of particular nodes and branch lengths; these can be overestimated.
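
The MCMC idea can be illustrated on a deliberately small problem. The sketch below runs a Metropolis sampler over a single parameter, the distance between two sequences, rather than over tree-space as MrBayes does. It assumes the Jukes-Cantor model and an exponential prior, both chosen for the example, and all names and settings are hypothetical.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalised log posterior of distance d under JC69 with an exponential prior."""
    if d <= 0:
        return -math.inf
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    if p_diff <= 0:
        return -math.inf
    log_likelihood = (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)
    log_prior = -prior_rate * d
    return log_likelihood + log_prior


def metropolis(n_sites, n_diff, steps=20000, step_size=0.05, seed=1):
    """Sample the posterior of d; the first quarter of the chain is discarded as burn-in."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_sites, n_diff)
    samples = []
    for _ in range(steps):
        proposal = d + rng.gauss(0, step_size)
        candidate = log_posterior(proposal, n_sites, n_diff)
        delta = candidate - current
        # Always accept uphill moves; accept downhill moves with probability exp(delta).
        if delta >= 0 or rng.random() < math.exp(delta):
            d, current = proposal, candidate
        samples.append(d)
    return samples[steps // 4:]


# Illustrative data: 30 differences across 100 aligned sites.
samples = metropolis(100, 30)
posterior_mean = sum(samples) / len(samples)
```

Because downhill moves are sometimes accepted, the chain can leave a local optimum, and the retained samples describe the spread of plausible values as well as a point estimate. Judging how many steps are enough remains the practical difficulty noted above.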

  21. Remember: all trees are wrong

  22. Cladograms and phylograms • Cladograms show branching order; branch lengths are meaningless • Phylograms show branch order and branch lengths [Figure: the same seven taxa (Bacterium 1-3, Eukaryote 1-4) drawn first as a cladogram and then as a phylogram]
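
The difference is easy to see in Newick notation, the text format most tree programs read and write. In the sketch below a phylogram carries a length after each colon, and removing those lengths leaves a cladogram with the same branching order. The tree and its branch lengths are invented for illustration, and the helper name is hypothetical.

```python
import re


def strip_branch_lengths(newick):
    """Remove ':length' annotations, keeping only the branching order."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


# Invented four-taxon phylogram with branch lengths.
phylogram = (
    "((Bacterium_1:0.10,Bacterium_2:0.12):0.30,"
    "(Eukaryote_1:0.05,Eukaryote_2:0.07):0.25);"
)
cladogram = strip_branch_lengths(phylogram)
```

Going the other way is not possible: branch lengths cannot be recovered from a cladogram.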

  23. Rooting using an outgroup • Unrooted tree • The root defines common ancestry • Rooted by outgroup [Figure: an unrooted tree of eukaryote, archaea and bacteria taxa, and the same tree rooted using bacteria as the outgroup; after rooting, the archaea and the eukaryotes each form a monophyletic group]
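
Once a tree is rooted, whether a set of taxa is monophyletic can be checked mechanically: the set must match exactly the leaves below some node. The sketch below represents a rooted tree as nested tuples; the taxon names and the topology are illustrative assumptions modelled on the slide.

```python
def leaves(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Every group of leaves that shares a node not shared with any other leaf."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    groups = {leaves(tree)}
    for child in tree:
        groups |= clades(child)
    return groups


def is_monophyletic(tree, taxa):
    """True if taxa are exactly the leaves below some node of the rooted tree."""
    return frozenset(taxa) in clades(tree)


# Hypothetical tree rooted with bacteria as the outgroup.
rooted = ("bacteria", (("archaea_1", "archaea_2"), ("eukaryote_1", "eukaryote_2")))
archaea_ok = is_monophyletic(rooted, {"archaea_1", "archaea_2"})
mixed_ok = is_monophyletic(rooted, {"bacteria", "archaea_1"})
```

The answer depends on where the root is placed, which is why the choice of outgroup matters.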

  24. Further details • Textbooks: • Page & Holmes, Molecular Evolution: A Phylogenetic Approach. Blackwell Science. • Felsenstein, Inferring Phylogenies. Sinauer Associates. • Hall, Phylogenetic Trees Made Easy. Sinauer Associates. • Software: • PhyML: http://atgc.lirmm.fr/phyml/ • PAUP* (NJ, MP, ML): http://paup.csit.fdsu.edu • PHYLIP (NJ, MP, ML): http://evolution.genetics.washington.edu/phylip.html • MrBayes (Bayesian): http://mrbayes.csit.fdsu.edu • SplitsTree (networks): http://www.splitstree.org • FindModel (model test): http://www.hiv.lanl.gov/content/sequence/findmodel/findmodel.html • Websites: • MultiPhyl (ML via email): http://distributed.cs.nuim.ie/multiphyl.php • Felsenstein’s phylogeny program page (links to available software): http://evolution.genetics.washington.edu/phylip/software.html
