
Molecular Evolution and Phylogeny Using AgentSheets, Excel, R and MEGA


Presentation Transcript


    1. Molecular Evolution and Phylogeny Using AgentSheets, Excel, R and MEGA. Jeff Krause (Shodor) and Ananth Kalyanaraman (Washington State University). SC|09 Education, Nov 15, 2009.

    2. Phylogenetics Basics

    3. Tree

    4. Graph

    5. Phylogenetics Basics. The goal is to infer evolutionary relationships among biological entities based on quantifiable similarities and differences. Phylogenetic trees: trees, branches and nodes correspond to graphs, edges and vertices.

    6. Biology on trees

    7. Biology on trees

    8. Tree topology depicts relatedness and timing

    9. Branch lengths may represent timing. Different types of trees can be used to represent the same inferred phylogeny. Depending on the method of phylogeny estimation and the assumptions made, the branch lengths may represent the timing of events, either relative or absolute.

    10. Phylogenetics Basics. The goal is to infer evolutionary relationships among biological entities based on quantifiable similarities and differences. Phylogenetic trees: trees, branches and nodes correspond to graphs, edges and vertices. Topology and branch length reflect relatedness and timing. Trees require a quantitative metric of similarity/difference: morphology or molecular sequence data.

    11. The Tree of Life: The Ultimate Phylogeny. CIPRES aims to establish the cyberinfrastructure (platform, software, database) required to attempt a reconstruction of the Tree of Life (10-100M organisms).

    12. Types of Phylogenies. Relationships between taxa: species trees, gene trees. Data: morphological, nuclear genome, organelle genome. Tree of Life Web (Maddison/Maddison): http://tolweb.org/

    13. Example Phylogenies

    14. Molecular Evolution and Phylogenetics. Biology basics: the central dogma (DNA -> RNA -> protein); DNA replication and processing between cell divisions can lead to changes in DNA composition. Metrics of distance: observed substitution probabilities (“How often do we see A replaced with C?”); distance based on an evolutionary model (“How many events separate these two sequences?”). Markov models of sequence evolution: in a Markov process the future state depends only on the current state, not on how it got there. Molecular genetic mechanisms act at multiple scales with distinct probabilities: single-site events within sequences, and events at larger scales. Notes: Non-biologists need to understand the central dogma, and that DNA is replicated and otherwise processed within a cell between cell divisions. Many of these molecular events can lead to changes in the composition of the DNA. Distance metrics are either phenomenological (how often do we see A replaced with C?) or model-based (given a particular model of sequence evolution, how many events separate these two sequences, and what are their probabilities?).

    15. Nucleotide substitution: Jukes-Cantor model. [Figure: nucleotide substitution diagram.] Think about a single position in a nucleotide sequence. If we were able to keep track of a particular position and look at it after some amount of evolutionary time, we would see that in some portion of daughter sequences the position has maintained its original identity. In other daughter sequences it will have changed to another nucleotide. In a Markov model we assume that the rates at which these events occur depend only on the current state of the system, not on how this state was achieved. The Jukes-Cantor model is the simplest nucleotide substitution model because it assumes that all of the substitution rates are equal. So if we are considering a position that is currently an “A”, then among the daughter sequences in which it has changed at some time in the future we would expect to see “C”, “G” and “T” at that position in equal proportions.
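
The single-site picture above can be sketched in Python. This is a minimal illustration, not the workshop's AgentSheets, Excel or R model: it assumes a discrete-time approximation in which the site changes with a small probability `mu` per generation, and all names and parameter values (`evolve_site`, `simulate_daughters`, `mu`, the seed) are hypothetical.

```python
import random
from collections import Counter

NUCLEOTIDES = "ACGT"

def evolve_site(start, generations, mu, rng):
    """Follow one site through `generations` steps of a discrete-time Jukes-Cantor chain."""
    state = start
    for _ in range(generations):
        if rng.random() < mu:
            # Jukes-Cantor assumption: the three other nucleotides are equally likely.
            state = rng.choice([n for n in NUCLEOTIDES if n != state])
    return state

def simulate_daughters(start="A", n_daughters=20000, generations=50, mu=0.01, seed=1):
    """Evolve many independent copies of the same site and tally the final states."""
    rng = random.Random(seed)
    return Counter(evolve_site(start, generations, mu, rng) for _ in range(n_daughters))

if __name__ == "__main__":
    counts = simulate_daughters()
    total = sum(counts.values())
    for nucleotide in NUCLEOTIDES:
        print(nucleotide, round(counts[nucleotide] / total, 3))
```

With these assumed settings most daughters still carry the original "A", and the daughters that changed are split roughly evenly among "C", "G" and "T".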

    16. Simulating Jukes-Cantor sequence evolution. Each sequence position holds one nucleotide. Simulating change as a finite difference using the rate equation would give fractional abundances at each position (as for a population). The matrix of rates $M$ therefore needs to be converted to transition probabilities: $P(t) = \{p_{ij}(t)\} = e^{Mt}$.
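
The conversion from rates to transition probabilities can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the Jukes-Cantor rate matrix to have the same rate `alpha` (an assumed symbol) for every specific substitution, and it evaluates the matrix exponential with a plain Taylor series plus scaling and squaring in pure Python. The function names are hypothetical; a numerical library would normally be used instead.

```python
def jukes_cantor_rate_matrix(alpha):
    """4x4 rate matrix M: rate alpha for each substitution, rows summing to zero."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_exp(m, t, squarings=10, terms=20):
    """Approximate exp(M t): Taylor series on M t / 2**squarings, then repeated squaring."""
    n = len(m)
    scaled = [[m[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        # term becomes scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_probabilities(alpha, t):
    """P(t) = exp(M t) for the Jukes-Cantor rate matrix."""
    return mat_exp(jukes_cantor_rate_matrix(alpha), t)

if __name__ == "__main__":
    for row in transition_probabilities(alpha=0.1, t=2.0):
        print([round(p, 4) for p in row])
```

Each row of the result sums to 1, so a row can be used directly to draw the next nucleotide at a position. For large `t` every entry approaches 1/4.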

    17. Simulating Jukes-Cantor sequence evolution. $P(t) =$ [matrix not preserved in this transcript]
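
The matrix shown on this slide did not survive extraction. For reference, the standard closed form for the Jukes-Cantor model is given below, assuming that every specific substitution occurs at the same rate $\alpha$ (an assumed symbol; the slide may have used a different parameterization).

$$
M = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix},
\qquad
p_{ij}(t) =
\begin{cases}
\tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t}, & i = j, \\
\tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t}, & i \neq j.
\end{cases}
$$

At $t = 0$ this gives the identity matrix, and as $t \to \infty$ every entry tends to $1/4$.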

    18. Jukes-Cantor models. AgentSheets: cell lineage tree. Excel: cell lineage tree, two-sequence distance, probability vs. time. R: cell lineage tree vs. phylogenetic reconstruction.

    19. Commercial Aspects of Phylogeny Reconstruction. Identification of microorganisms: public health entomology; sequence motifs for groups are patented (example: differentiating tuberculosis strains). Dynamics of microbial communities: pesticide exposure (identify and quantify microbes in soil). Vaccine development: variants of a cell wall or protein coat component; porcine reproductive and respiratory syndrome virus isolates from the US and Europe were separate populations; HIV studied through DNA markers. Biochemical pathways: antibacterials and herbicides; glyphosate (Roundup®, Rodeo®, and Pondmaster®) was the first herbicide targeted at a pathway not present in mammals; the phylogenetic distribution of a pathway is studied by the pharmaceutical industry before a drug is developed. Pharmaceutical industry: predicting the natural ligands for cell surface receptors that are potential drug targets; a single family, G protein-coupled receptors (GPCRs), contains 40% of the targets of most pharmaceutical companies.

    20. Algorithmic Modeling for Phylogenetics

    21. Techniques for Phylogeny Reconstruction: distance-based methods (neighbor joining); Maximum Parsimony (MP); Maximum Likelihood (ML); Markov Chain Monte Carlo (MCMC).

    22. Neighbor-Joining. Iterative algorithm: (1) maintain a pairwise distance matrix; (2) find the closest two taxa; (3) collapse them into one row (an internal node) and recompute the distance from the merged row to every other row; (4) loop to step 2. Build the tree as you go. (+) Polynomial-time algorithm, and hence fast. (-) Represents a “greedy solution”: real-world optimality could be missed.
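
The loop above can be sketched in Python. One caveat: the slide summarizes step 2 as "find the closest two taxa", whereas the standard neighbor-joining algorithm (Saitou and Nei) picks the pair that minimizes a rate-corrected criterion, usually written Q, rather than the raw distance. The sketch below uses that Q criterion. The function name, the edge-list output format and the `nodeK` labels are assumptions made for illustration.

```python
def neighbor_joining(names, matrix):
    """Return the edges (child, parent, branch_length) of an unrooted NJ tree."""
    d = {a: {b: float(matrix[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Step 2: choose the pair minimizing Q(a, b) = (n - 2) d(a, b) - total(a) - total(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Step 3: collapse the pair into a new internal node u.
        u = f"node{next_id}"
        next_id += 1
        length_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        length_b = d[a][b] - length_a
        edges.append((a, u, length_a))
        edges.append((b, u, length_b))
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    # Two nodes remain: connect them with the remaining distance.
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges

if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    distances = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    for child, parent, length in neighbor_joining(taxa, distances):
        print(f"{child} -- {parent}: {length}")
```

Each pass through the loop scans all pairs, so the sketch runs in time cubic in the number of taxa, which is the polynomial cost the slide refers to.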

    23. Maximum Parsimony. Compute the tree with the lowest total edge cost, i.e., the tree that minimizes the sum of edge costs. Based on Occam's razor: the simplest explanation for evolution is the one that minimizes the total number of evolutionary events along the tree branches.

    24. Maximum Parsimony. Algorithm: enumerate all possible tree topologies (with taxa at the leaves); for each tree, score all edges, with $\mathrm{Score}(\text{tree}) = \sum \text{edge scores}$; report the minimum-cost tree. (+) More realistic. (-) NP-hard and combinatorially explosive.
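
The scoring step for one candidate tree can be sketched with Fitch's small-parsimony algorithm, a standard way to count the minimum number of substitutions on a fixed tree. The slide does not say which scoring procedure was used, so treat this as one possible instance. Trees are written as nested tuples of leaf names, and the taxa `t1` to `t4` and their sequences are made-up illustration data.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, substitution count) for one alignment column."""
    if isinstance(tree, str):               # leaf: its observed nucleotide, no cost
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                              # children agree: no extra substitution
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all columns of an alignment {taxon: sequence}."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

if __name__ == "__main__":
    alignment = {"t1": "AAC", "t2": "AAC", "t3": "GAT", "t4": "GAT"}
    # The three possible topologies for four taxa, each written with an arbitrary root.
    candidates = [
        (("t1", "t2"), ("t3", "t4")),
        (("t1", "t3"), ("t2", "t4")),
        (("t1", "t4"), ("t2", "t3")),
    ]
    scores = {tree: parsimony_score(tree, alignment) for tree in candidates}
    best = min(scores, key=scores.get)
    print(scores)
    print("most parsimonious:", best)
```

Exhaustively scoring every topology, as in the last few lines, is only feasible for a handful of taxa, which is the combinatorial explosion noted on the slide.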

    25. Maximum Likelihood. Compute a likelihood score for each possible tree under a given parametric model. Examine multiple parametric models. Report a tree with the highest likelihood score across all parametric models. (+) Statistically consistent, and most biologically relevant. (-) NP-hard, and typically harder than MP methods: takes weeks to months for even tens of taxa.

    26. MCMC: Markov Chain Monte Carlo. Similar to ML in that it uses probabilistic models, but MCMC does not solve the ML problem. It performs random walks through model trees. Output: a probability distribution on trees (which correspond to the likely evolutionary history). (-) This approach is also time-consuming.

    27. Heuristic Techniques used for MP, ML. Branch-and-bound: explore all possibilities, but compute a lower bound (LB) score for a given path; if LB > best solution seen so far, prune (discontinue) the path and look elsewhere. Saves a lot of time in practice without sacrificing optimality. Similar to the Traveling Salesman Problem.
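
Since the slide draws the analogy to the Traveling Salesman Problem, the pruning rule can be illustrated on a small TSP instance rather than on trees. This is a generic sketch, not code from any phylogenetics package. The lower bound used here (cost so far plus the cheapest outgoing edge of the current city and of every unvisited city) is one simple admissible choice, and the distance matrix is made up.

```python
import math

def tsp_branch_and_bound(dist):
    """Return (cost, tour) of a shortest round trip that starts and ends at city 0."""
    n = len(dist)
    best = {"cost": math.inf, "tour": None}
    # Cheapest edge leaving each city: every remaining city must still be left once.
    min_out = [min(dist[i][j] for j in range(n) if j != i) for i in range(n)]

    def extend(path, cost, unvisited):
        if not unvisited:
            total = cost + dist[path[-1]][path[0]]   # close the tour
            if total < best["cost"]:
                best["cost"], best["tour"] = total, path[:]
            return
        lower_bound = cost + min_out[path[-1]] + sum(min_out[c] for c in unvisited)
        if lower_bound >= best["cost"]:
            return                                   # prune: cannot beat the best tour so far
        for city in sorted(unvisited, key=lambda c: dist[path[-1]][c]):
            path.append(city)
            extend(path, cost + dist[path[-2]][city], unvisited - {city})
            path.pop()

    extend([0], 0, frozenset(range(1, n)))
    return best["cost"], best["tour"]

if __name__ == "__main__":
    distances = [
        [0, 2, 9, 10, 7],
        [2, 0, 6, 4, 3],
        [9, 6, 0, 8, 5],
        [10, 4, 8, 0, 6],
        [7, 3, 5, 6, 0],
    ]
    print(tsp_branch_and_bound(distances))
```

Because the bound never overestimates the cost of completing a path, pruning discards only paths that cannot improve on the best solution, so the result is still optimal. The same idea applies to tree search, where the bound is a minimum score for any tree that extends a partial tree.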

    28. HPC for Phylogenetics

    29. www.phylo.org: a community project, funded by an $11.6M NSF Information Technology Research grant.

    30. Exploiting data about gene content and gene order has proved extremely challenging from a computational perspective: tasks that can easily be carried out in linear time for DNA data have required entirely new theories (such as the computation of inversion distance) or appear to be NP-hard. The focus has thus been on simple genomes, preferably genomes consisting of a single chromosome, and where evolution can reasonably be assumed to have been driven mostly through gene order changes.

    31. GRAPPA: Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms (http://www.cc.gatech.edu/~bader/code.html). Freely available, open source, GNU GPL. Already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institute, Aventis, GlaxoSmithKline, PharmCos. Gene-order phylogeny reconstruction: breakpoint median, inversion median. Over one-billion-fold speedup from previous codes. Parallelism scales linearly with the number of processors.

    32. Using GRAPPA to solve Campanulaceae Phylogeny. On the 512-processor IBM Linux cluster (256 IBM Netfinity 4500R nodes of dual 733MHz Intel Pentium III processors, interconnected with Myrinet 2000), we ran the full analysis (all 14 billion trees) in under 1.5 hours, a 1,000,000-fold speedup (and using true inversion distance) compared with the best previous code, BPanalysis. The current release of GRAPPA (v. 1.6) now takes minutes to solve the same problem on several processors.
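
The "14 billion trees" figure matches the number of distinct unrooted binary tree topologies on 13 labeled taxa, which is the double factorial (2n - 5)!! for n taxa. The short sketch below computes that count; the function name is hypothetical, and reading the slide's figure as this count is an inference from the numbers, not something the slide states.

```python
def count_unrooted_binary_trees(n_taxa):
    """Number of unrooted binary topologies on n labeled taxa: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

if __name__ == "__main__":
    for n in (4, 8, 13, 20):
        print(n, count_unrooted_binary_trees(n))
```

For 13 taxa this gives 13,749,310,575 topologies. Evaluating all of them in under 1.5 hours implies a rate of more than 2.5 million trees per second across the cluster.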

    33. Campanulaceae

    34. Gene Order Phylogeny. Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss). Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion.
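
A gene order can be modeled as a list of signed integers, one per gene, with the sign giving the strand. The sketch below shows an inversion on such a list and counts breakpoints between two circular genomes with equal gene content. The representation and function names are assumptions for illustration and are not taken from GRAPPA or BPanalysis.

```python
def invert(genome, i, j):
    """Inversion: reverse the segment genome[i:j] and flip the strand of each gene."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def adjacencies(genome):
    """Adjacencies of a circular signed genome, independent of reading direction."""
    n = len(genome)
    result = set()
    for k in range(n):
        a, b = genome[k], genome[(k + 1) % n]
        # Read from the other strand, the adjacency (a, b) appears as (-b, -a).
        result.add(max((a, b), (-b, -a)))
    return result

def breakpoint_distance(genome1, genome2):
    """Number of adjacencies of genome1 that are absent from genome2."""
    return len(adjacencies(genome1) - adjacencies(genome2))

if __name__ == "__main__":
    ancestor = [1, 2, 3, 4, 5, 6]
    descendant = invert(ancestor, 1, 4)
    print(descendant)                                  # [1, -4, -3, -2, 5, 6]
    print(breakpoint_distance(ancestor, descendant))   # 2
```

A single inversion disturbs at most the two adjacencies at the ends of the inverted segment, which is why the example reports two breakpoints.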

    35. Breakpoint Analysis (Sankoff & Blanchette 1998)
        For each tree topology do
            somehow assign initial genomes to the internal nodes
            repeat
                for each internal node do
                    compute a new genome that minimizes the distances to its three neighbors
                    replace old genome by new if distance is reduced
            until no change
        Sankoff & Blanchette implemented this in a C++ package. This is NP-hard, even for just three taxa!

    36. Reconstruction Software: single chromosome, organellar size (< 200 genes)
        1998, BP Analysis (Sankoff): 8 taxa -> 1 day; 13 taxa -> 250 years
        2000, GRAPPA: 13 taxa -> 1 day (512 proc. cluster) (200 serial, 100,000 parallel)
        2001, GRAPPA: 13 taxa -> 1 hour (laptop) (2,000,000 serial); 20 taxa -> 3 million years
        2003, DCM-GRAPPA: 1,000 taxa -> 2 days (effectively unbounded speedup)
        2004, DCM-GRAPPA: handles unequal gene content (first method with this capability)

    37. Cyberinfrastructure Challenges. Current HPC systems are designed for physics-based simulations that use: floating-point, linear algebra (the Top 500 List measures Linpack!); regular operations with high degrees of locality (e.g., matrices, FFT, CG); low-order polynomial-time algorithms. Focus of current HPC systems: dense linear algebra; sparse linear algebra; FFT or multi-grid; global scatter-gather operations; dynamically evolving coordinate grids; dynamic load-balancing; particle-based or lattice-gas algorithms; continuum equation solvers. Computational biology and bioinformatics often require: integer performance; strings, trees, graphs; combinatorics; optimization, LP; computational geometry; irregular data accesses; heuristics and solutions to NP-hard problems. Next-generation cyberinfrastructure must take biology into consideration.

    38. Parsimony Codes
        Phylip (Felsenstein): http://evolution.genetics.washington.edu/phylip.html
        Hennig86 (Farris): http://www.cladistics.org/
        Nona (Goloboff) and TNT (Goloboff, Farris, Nixon): http://www.cladistics.com/
        PAUP* (Swofford): http://paup.csit.fsu.edu/
        MEGA (Kumar, Tamura, Jakobsen, Nei): http://www.megasoftware.net/
        GRAPPA (Bader, Moret, Warnow): http://www.phylo.unm.edu/

    39. Likelihood Codes
        Phylip (Felsenstein): http://evolution.genetics.washington.edu/phylip.html
        PAUP* (Swofford): http://paup.csit.fsu.edu/
        PAML (Yang): http://abacus.gene.ucl.ac.uk/software/paml.html
        FastDNAml (Olsen, Matsuda, Hagstrom, Overbeek): http://geta.life.uiuc.edu/~gary/programs/fastDNAml.html
        Felsenstein's List of Software: http://evolution.genetics.washington.edu/phylip/software.html

    40. Some Useful References
        Felsenstein, J. (2003). Inferring Phylogenies (2nd ed.). Sinauer Associates.
        Nielsen, R. (2005). Statistical Methods in Molecular Evolution (1st ed.). Springer.
        Yang, Z. (2006). Computational Molecular Evolution (p. 357). Oxford University Press.
        Linder, C.R. & Warnow, T. (2005). “Chapter 19. Phylogenetics”, in Handbook of Computational Molecular Biology, Ed. S. Aluru. Chapman & Hall/CRC Computer and Information Science Series.
