1. Molecular Evolution and Phylogeny Using AgentSheets, Excel, R and MEGA
Mol. Evolution & Phylogeny, SC|09 Education, Nov 15, 2009
Jeff Krause, Shodor
Ananth Kalyanaraman, Washington State University
2. Phylogenetics Basics
3. Tree
4. Graph
5. Phylogenetics Basics
The goal is to infer evolutionary relationships among biological entities based on quantifiable similarities and differences
Phylogenetic Trees
Trees, branches and nodes == graphs, edges and vertices
6. Biology on trees
7. Biology on trees
8. Tree topology depicts relatedness and timing
9. Branch lengths may represent timing
Different types of trees can be used to represent the same inferred phylogeny
Depending on the method of phylogeny estimation and the assumptions made, the branch lengths may represent the timing of events, either relative or absolute
10. Phylogenetics Basics
The goal is to infer evolutionary relationships among biological entities based on quantifiable similarities and differences
Phylogenetic Trees
Trees, branches and nodes == graphs, edges and vertices
Topology and branch length reflect relatedness and timing
Require a quantitative metric of similarity/difference
Morphology
Molecular sequence data
11. The Tree of Life: The Ultimate Phylogeny
CIPRES aims to establish the cyberinfrastructure (platform, software, database) required to attempt a reconstruction of the Tree of Life (10-100M organisms)
12. Types of Phylogenies
Relationships between taxa
Species Trees
Gene Trees
Data
Morphological
Tree of Life Web (Maddison/Maddison): http://tolweb.org/
Nuclear Genome
Organelle Genome
13. Example Phylogenies
14. Molecular Evolution and Phylogenetics
Biology basics
Central dogma: DNA -> RNA -> protein
DNA replication and processing between cell divisions can lead to changes in DNA composition
Metrics of distance
Observed substitution probabilities
“How often do we see A replaced with C”
Distance based on evolutionary model
“How many events separate these two sequences”
Markov Models of Sequence Evolution
Markov process – future state only depends on current state, not how it got there
Molecular genetic mechanisms at multiple scales with distinct probabilities
Single site events – sequences
Events at larger scales
Biology basics
Non-biologists need to understand the central dogma, and that DNA is replicated and otherwise processed within a cell between cell divisions
Many of these molecular events can lead to changes in the composition of the DNA
Distance metrics
Phenomenological: how often do we see A replaced with C
Model based: based on a particular model for sequence evolution, how many events separate these two sequences, and what are their probabilities
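The phenomenological metric ("how often do we see A replaced with C") can be sketched by tallying nucleotide pairs at aligned positions. This is a minimal illustration, not the workshop's own code; the function names and the two toy sequences are made up for the example, and the sequences are assumed to be already aligned and of equal length.

```python
from collections import Counter

def substitution_counts(seq_a, seq_b):
    """Count (from, to) nucleotide pairs at each aligned position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(zip(seq_a, seq_b))

def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences differ."""
    counts = substitution_counts(seq_a, seq_b)
    differing = sum(n for (x, y), n in counts.items() if x != y)
    return differing / len(seq_a)

# Hypothetical toy sequences (not from the slides).
ancestor = "ACGTACGTAC"
descendant = "ACCTACGTTC"

counts = substitution_counts(ancestor, descendant)
print(counts[("G", "C")])                 # how often G was seen replaced by C
print(p_distance(ancestor, descendant))   # observed proportion of differing sites
```

The observed proportion undercounts real events when a site changes more than once, which is the motivation for the model-based distances on the following slides.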
15. Nucleotide substitution: Jukes-Cantor model
Think about a single position in a nucleotide sequence
If we were able to keep track of a particular position and look at it after some amount of evolutionary time, we would see that in some portion of daughter sequences the position will have maintained its original identity
In other daughter sequences it will have changed to another nucleotide
In a Markov model we assume that the rates at which these events occur depend only on the current state of the system, not on how this state was achieved
The Jukes-Cantor model is the simplest nucleotide substitution model because it assumes that all of the substitution rates are equal
So if we’re considering a position that is currently an “A”, then at some time in the future we would expect to see a “C”, “G” or “T” at that position in equal proportions
16. Simulating Jukes-Cantor sequence evolution
One nucleotide per sequence position
Simulating change as a finite difference using the rate equation would give fractional abundances at each position (a population view)
Need to convert the matrix of rates to transition probabilities
$P(t) = \{p_{ij}(t)\} = e^{Mt}$
17. Simulating Jukes-Cantor sequence evolution
P(t) =
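The matrix shown on this slide did not survive in the transcript. As a hedged sketch, assume the common Jukes-Cantor parameterization in which every off-diagonal entry of the rate matrix M is a single rate `alpha` and every diagonal entry is `-3 * alpha` (the symbol `alpha` is an assumption, not taken from the slides). Under that assumption the matrix exponential has a closed form: a site keeps its state with probability 1/4 + (3/4)e^(-4·alpha·t) and ends in each other state with probability 1/4 - (1/4)e^(-4·alpha·t). The code below compares that closed form with a truncated Taylor series for e^(Mt).

```python
import math

def jc_rate_matrix(alpha):
    """Jukes-Cantor rate matrix: every substitution has the same rate alpha."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_exp(m, t, terms=40):
    """Approximate e^(M t) with a truncated Taylor series (adequate for small 4x4 cases)."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]          # current series term, starts at the identity
    mt = [[x * t for x in row] for row in m]
    for n in range(1, terms):
        term = [[x / n for x in row] for row in mat_mul(term, mt)]
        result = [[result[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return result

def jc_closed_form(alpha, t):
    """Closed-form Jukes-Cantor transition probabilities under the assumed parameterization."""
    decay = math.exp(-4 * alpha * t)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

P = mat_exp(jc_rate_matrix(0.1), 2.0)
for row in P:
    print(["%.4f" % p for p in row])
```

Each row of P sums to 1, and as t grows every entry approaches 1/4, so the starting nucleotide is eventually forgotten.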
18. Jukes-Cantor models
AgentSheets
Cell lineage tree
Excel
Cell lineage tree
two sequence distance
Probability vs. time
R
Cell lineage tree vs. phylogenetic reconstruction
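The workshop builds these models in AgentSheets, Excel and R; the Python sketch below is only an illustration of the same two ideas (a cell lineage with two daughters, and the distance between two sequences). It assumes the Jukes-Cantor parameterization with a per-substitution rate `alpha`, and the sequence length, rate, time and seed are arbitrary choices for the example.

```python
import math
import random

NUCS = "ACGT"

def evolve(seq, alpha, t, rng):
    """Return a descendant of seq after time t under Jukes-Cantor."""
    # Probability that a site ends in a state different from where it started.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * alpha * t))
    out = []
    for nuc in seq:
        if rng.random() < p_change:
            # All three other nucleotides are equally likely.
            out.append(rng.choice([n for n in NUCS if n != nuc]))
        else:
            out.append(nuc)
    return "".join(out)

def jc_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance: expected substitutions per site."""
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

rng = random.Random(2009)
root = "".join(rng.choice(NUCS) for _ in range(5000))
left = evolve(root, alpha=0.1, t=1.0, rng=rng)    # first daughter lineage
right = evolve(root, alpha=0.1, t=1.0, rng=rng)   # second daughter lineage

# Two branches of length 3 * alpha * t = 0.3 each, so the estimate should be near 0.6.
print(jc_distance(left, right))
```

Comparing the estimated distance with the known simulated value is the "cell lineage tree vs. phylogenetic reconstruction" check in miniature.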
19. Commercial Aspects of Phylogeny Reconstruction
Identification of microorganisms
public health entomology
sequence motifs for groups are patented
example: differentiating tuberculosis strains
Dynamics of microbial communities
pesticide exposure: identify and quantify microbes in soil
Vaccine development
variants of a cell wall or protein coat component
porcine reproductive and respiratory syndrome virus isolates from US and Europe were separate populations
HIV studied through DNA markers
Biochemical pathways
antibacterials and herbicides
Glyphosate (Roundup®, Rodeo®, and Pondmaster®): first herbicide targeted at a pathway not present in mammals
phylogenetic distribution of a pathway is studied by the pharmaceutical industry before a drug is developed
Pharmaceutical industry
predicting the natural ligands for cell surface receptors which are potential drug targets
a single family, G protein-coupled receptors (GPCRs), contains 40% of the targets of most pharmaceutical companies
20. Algorithmic Modeling for Phylogenetics
21. Techniques for Phylogeny Reconstruction
Distance-based methods
Neighbor joining
Maximum Parsimony (MP)
Maximum Likelihood (ML)
Markov Chain Monte Carlo (MCMC)
22. Neighbor-Joining
Iterative algorithm
1. Maintain a pairwise distance matrix
2. Find the closest two taxa
3. Collapse them into one row (internal node) and recompute the distance from the merged row to every other row
4. Loop to 2
Build the tree as you go
(+) Polynomial-time algorithm, and hence fast
(-) Represents a “greedy solution”
Real-world optimality could be missed
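The loop above can be sketched in a few dozen lines. One caveat: the slide says "closest two taxa", while the standard neighbor-joining criterion (Saitou and Nei) joins the pair that minimizes Q(i, j) = (n - 2)·d(i, j) - r_i - r_j, where r_i is the sum of distances from i to all other nodes. The sketch below uses the Q criterion; the function name, the edge-list output format and the five-taxon distance table are illustrative choices, not taken from the slides.

```python
def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, branch_length) edges.

    names: list of taxon labels; dist: dict of dicts of pairwise distances.
    """
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the new internal node to the joined pair.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(u, a, len_a), (u, b, len_b)]
        # Collapse the pair: distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges

# Hypothetical additive distances among five taxa.
TAXA = ["a", "b", "c", "d", "e"]
ROWS = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
DIST = {x: {y: ROWS[i][j] for j, y in enumerate(TAXA) if i != j}
        for i, x in enumerate(TAXA)}

for parent, child, length in neighbor_joining(TAXA, DIST):
    print(parent, child, length)
```

Each pass scans all pairs and the loop runs once per join, which is where the polynomial running time comes from.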
23. Maximum Parsimony
Compute the tree whose edges have the lowest total cost
i.e., the sum of edge costs is minimized
Based on Occam’s razor:
the simplest explanation for evolution minimizes the sum of the number of evolutionary events along the tree branches
24. Maximum Parsimony
Algorithm:
Enumerate all possible tree topologies (with taxa at the leaves)
For each tree:
Score all edges
$\mathrm{Score}(\mathrm{tree}) = \sum \text{edge scores}$
Report min cost tree
(+) More realistic
(-) NP-Hard
combinatorially explosive
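A small sketch can make both halves of this slide concrete. The slides do not say how edges are scored; the code assumes Fitch's small-parsimony procedure for a single site on a fixed binary tree, which is one standard choice. It also counts unrooted binary topologies with the (2n - 5)!! formula to show the combinatorial explosion. The nested-tuple tree encoding and the four-taxon data are hypothetical.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted binary tree topologies on n_taxa leaves: (2n - 5)!!."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

def fitch_score(tree, states):
    """Minimum number of changes at one site on a binary tree (Fitch).

    tree: a leaf name (str) or a pair (left, right); states: leaf -> nucleotide.
    """
    def walk(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = walk(node[0]), walk(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # Disjoint state sets force one change on this part of the tree.
        return left_set | right_set, left_cost + right_cost + 1
    return walk(tree)[1]

# Hypothetical single-site data for four taxa and the three possible topologies.
SITE = {"A": "A", "B": "A", "C": "G", "D": "G"}
TOPOLOGIES = [(("A", "B"), ("C", "D")),
              (("A", "C"), ("B", "D")),
              (("A", "D"), ("B", "C"))]

scores = {topology: fitch_score(topology, SITE) for topology in TOPOLOGIES}
best = min(scores, key=scores.get)
print(best, scores[best])

for n in (4, 8, 13, 20):
    print(n, num_unrooted_trees(n))
```

For 13 taxa the count is 13,749,310,575, which is the order of magnitude behind the "14 billion trees" figure quoted later for the Campanulaceae analysis.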
25. Maximum Likelihood
Compute a likelihood score for each possible tree, under a given parametric model
Examine multiple parametric models
Report a tree with the highest likelihood score across all parametric models
(+) Statistically consistent, and most biologically relevant
(-) NP-Hard, and typically harder than MP methods
Takes weeks to months for even tens of taxa
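A full ML tree search is far beyond a short example, so the sketch below shrinks the problem to its smallest instance: two aligned sequences joined by one branch, scored under Jukes-Cantor. It is an illustration of "compute a likelihood under a parametric model and keep the best", not the method of any particular package. The distance `d` is measured in expected substitutions per site, and the grid search and its step size are arbitrary choices.

```python
import math

def jc_log_likelihood(n_sites, n_diff, d):
    """Log-likelihood of two aligned sequences separated by d substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site shows the same nucleotide
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different nucleotide
    # Each site contributes (equilibrium frequency 1/4) * (transition probability).
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def ml_distance(n_sites, n_diff, grid_step=0.001, d_max=3.0):
    """Grid search for the branch length with the highest likelihood."""
    best_d, best_ll = None, -math.inf
    for k in range(1, int(d_max / grid_step) + 1):
        d = k * grid_step
        ll = jc_log_likelihood(n_sites, n_diff, d)
        if ll > best_ll:
            best_d, best_ll = d, ll
    return best_d

# Hypothetical data: 1000 aligned sites, 200 of which differ.
estimate = ml_distance(1000, 200)
analytic = -0.75 * math.log(1.0 - 4.0 * 0.2 / 3.0)
print(estimate, analytic)
```

In this one-branch case the likelihood peak coincides with the Jukes-Cantor corrected distance. On a real tree the same kind of score has to be optimized over every branch length and every topology, which is where the cost comes from.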
26. MCMC
Markov Chain Monte Carlo
Similar to ML, in that it uses probabilistic models
But MCMC doesn’t solve the ML problem
Instead, it performs random walks through model trees
Output: a probability distribution on trees (which correspond to the likely evolutionary history)
(-) This version is also time consuming
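The random walk can be illustrated with a toy Metropolis sampler. Everything specific here is an assumption made for the example: the state space is just the three unrooted topologies on four taxa, the log-likelihood values are invented, the prior over topologies is uniform, and branch lengths and model parameters are ignored. Real samplers walk through a vastly larger space.

```python
import math
import random
from collections import Counter

# Hypothetical log-likelihoods for the three unrooted topologies on four taxa.
LOG_LIK = {
    "((A,B),(C,D))": -1000.0,
    "((A,C),(B,D))": -1001.0,
    "((A,D),(B,C))": -1002.5,
}

def metropolis(log_lik, n_steps, rng):
    """Random walk over topologies; returns the fraction of steps spent at each."""
    trees = list(log_lik)
    current = rng.choice(trees)
    visits = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: jump to one of the other topologies at random.
        proposal = rng.choice([t for t in trees if t != current])
        log_ratio = log_lik[proposal] - log_lik[current]
        # Always accept uphill moves; accept downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        visits[current] += 1
    return {t: visits[t] / n_steps for t in trees}

posterior = metropolis(LOG_LIK, 100_000, random.Random(1))
for tree, prob in posterior.items():
    print(tree, round(prob, 3))
```

The output is a distribution over trees rather than a single best tree: with a uniform prior, the visit frequencies approach the normalized likelihoods.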
27. Heuristic Techniques used for MP, ML
Branch-and-bound
Explore all possibilities but:
Compute a lower bound (LB) score for a given path
If LB > best solution seen so far, then prune/discontinue the path, and look elsewhere
Saves a lot of time in practice, without sacrificing optimality
Similar to the Traveling Salesman Problem
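Since the slide draws the analogy to the Traveling Salesman Problem, the pruning rule is sketched here on a tiny TSP instance rather than on trees. The lower bound used is the simplest valid one, the cost of the partial path so far, which works because all costs are non-negative; tighter bounds prune more. The cost matrix is hypothetical, and a brute-force search is included only to check that pruning does not change the optimum.

```python
import itertools
import math

def tsp_branch_and_bound(cost):
    """Exact shortest tour from city 0, pruning partial paths that cannot win."""
    n = len(cost)
    best = {"length": math.inf, "tour": None}

    def extend(path, length, unvisited):
        # Lower bound on any completion: the cost so far (costs are non-negative).
        if length >= best["length"]:
            return  # prune: this path cannot beat the best tour seen so far
        if not unvisited:
            total = length + cost[path[-1]][0]
            if total < best["length"]:
                best["length"], best["tour"] = total, path + [0]
            return
        # Try nearer cities first so a good tour is found early.
        for city in sorted(unvisited, key=lambda c: cost[path[-1]][c]):
            extend(path + [city], length + cost[path[-1]][city], unvisited - {city})

    extend([0], 0, frozenset(range(1, n)))
    return best["length"], best["tour"]

def tsp_brute_force(cost):
    """Length of the shortest tour found by checking every ordering."""
    n = len(cost)
    return min(
        sum(cost[a][b] for a, b in zip((0,) + p, p + (0,)))
        for p in itertools.permutations(range(1, n))
    )

# Hypothetical symmetric cost matrix for five cities.
COST = [[0, 2, 9, 10, 7],
        [2, 0, 6, 4, 3],
        [9, 6, 0, 8, 5],
        [10, 4, 8, 0, 6],
        [7, 3, 5, 6, 0]]

length, tour = tsp_branch_and_bound(COST)
print(length, tour)
```

In phylogeny the same pattern applies with partial trees in place of partial tours and a parsimony or likelihood bound in place of path length.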
28. HPC for Phylogenetics
29. www.phylo.org
A community project, funded by an $11.6M NSF Information Technology Research grant
30. Exploiting data about gene content and gene order has proved extremely challenging from a computational perspective
Tasks that can easily be carried out in linear time for DNA data have required entirely new theories (such as the computation of inversion distance) or appear to be NP-hard
The focus has thus been on simple genomes, preferably genomes
consisting of a single chromosome, and
where evolution can reasonably be assumed to have been driven mostly through gene order changes.
31. GRAPPA: Genome Rearrangements Analysis
Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
http://www.cc.gatech.edu/~bader/code.html
Freely-available, open-source, GNU GPL
Already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institution, Aventis, GlaxoSmithKline, PharmCos.
Gene-order Phylogeny Reconstruction
Breakpoint Median
Inversion Median
over one-billion fold speedup from previous codes
Parallelism scales linearly with the number of processors
32. Using GRAPPA to solve Campanulaceae Phylogeny
On the 512-processor IBM Linux cluster, we ran the full analysis (all 14 billion trees) in under 1.5 hours, a 1,000,000-fold speedup (and using true inversion distance) compared with the best previous code, BPanalysis
256 IBM Netfinity 4500R nodes of dual 733MHz Intel Pentium III processors, interconnected with Myrinet 2000
Current release of GRAPPA (v. 1.6) now takes minutes to solve the same problem on several processors
33. Campanulaceae
34. Gene Order Phylogeny
Many organelles appear to evolve mostly through processes that simply rearrange gene ordering (inversion, transposition) and perhaps alter gene content (duplication, loss).
Chloroplasts have a single, typically circular, chromosome and appear to evolve mostly through inversion.
35. Breakpoint Analysis (Sankoff & Blanchette 1998)
For each tree topology do
  somehow assign initial genomes to the internal nodes
  repeat
    for each internal node do
      compute a new genome that minimizes the distances to its three neighbors
      replace old genome by new if distance is reduced
  until no change
Sankoff & Blanchette implemented this in a C++ package
This is NP-hard, even for just three taxa!
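The distance being minimized in that loop can be sketched for a pair of genomes. The code below assumes circular, signed gene orders with the same gene content, encoded as lists of non-zero integers whose sign marks the strand; that encoding, the function names and the five-gene example are illustrative and not taken from GRAPPA or BPAnalysis. It computes only the pairwise breakpoint distance, not the NP-hard median of three genomes.

```python
def adjacencies(genome):
    """Set of oriented adjacencies of a circular signed gene order."""
    adj = set()
    for i, a in enumerate(genome):
        b = genome[(i + 1) % len(genome)]
        adj.add((a, b))
        adj.add((-b, -a))   # the same adjacency read from the other strand
    return adj

def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not preserved in g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum((g1[i], g1[(i + 1) % n]) not in adj2 for i in range(n))

def invert(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

# Hypothetical five-gene circular genome and one inversion of genes 2 and 3.
original = [1, 2, 3, 4, 5]
inverted = invert(original, 1, 3)

print(inverted)
print(breakpoint_distance(original, inverted))
```

A single inversion breaks at most two adjacencies, one at each end of the inverted segment, which is why breakpoint counts serve as a cheap stand-in for the true inversion distance.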
36. Reconstruction Software: single chromosome, organellar size (< 200 genes)
1998 BP Analysis (Sankoff)
8 taxa -> 1 day
13 taxa -> 250 years
2000 GRAPPA
13 taxa -> 1 day (512 proc. cluster)
(speedup: 200 serial, 100,000 parallel)
2001 GRAPPA
13 taxa -> 1 hour (laptop)
(speedup: 2,000,000 serial)
20 taxa -> 3 million years
2003 DCM-GRAPPA
1,000 taxa -> 2 days
(effectively unbounded speedup)
2004 DCM-GRAPPA
Handles unequal gene content
(first method with this capability)
37. Cyberinfrastructure Challenges
Current HPC systems are designed for physics-based simulations that use
Floating-point, linear algebra
Top 500 List measures Linpack!
Regular operations (high-degrees of locality)
e.g., Matrices, FFT, CG
Low-order polynomial-time algorithms
Focus of current HPC systems:
Dense linear algebra
Sparse linear algebra
FFT or multi-grid
Global scatter-gather operations
Dynamically evolving coordinate grids
Dynamic load-balancing
Particle-based or lattice-gas algorithms
Continuum equation solvers
Computational biology and bioinformatics often require
Integer performance
Strings, trees, graphs
Combinatorics
Optimization, LP
Computational geometry
Irregular data accesses
Heuristics and solutions to NP-hard problems
Next-generation cyberinfrastructure must take Biology into consideration
38. Parsimony Codes
Phylip (Felsenstein)
http://evolution.genetics.washington.edu/phylip.html
Hennig86 (Farris)
http://www.cladistics.org/
Nona (Goloboff) and TNT (Goloboff, Farris, Nixon)
http://www.cladistics.com/
PAUP* (Swofford)
http://paup.csit.fsu.edu/
MEGA (Kumar, Tamura, Jakobsen, Nei)
http://www.megasoftware.net/
GRAPPA (Bader, Moret, Warnow)
http://www.phylo.unm.edu/
39. Likelihood Codes
Phylip (Felsenstein)
http://evolution.genetics.washington.edu/phylip.html
PAUP* (Swofford)
http://paup.csit.fsu.edu/
PAML (Yang)
http://abacus.gene.ucl.ac.uk/software/paml.html
FastDNAml (Olsen, Matsuda, Hagstrom, Overbeek)
http://geta.life.uiuc.edu/~gary/programs/fastDNAml.html
Felsenstein’s List of Software:
http://evolution.genetics.washington.edu/phylip/software.html
40. Some Useful References
Felsenstein, J. (2003). Inferring Phylogenies (2nd ed.). Sinauer Associates.
Nielsen, R. (2005). Statistical Methods in Molecular Evolution (1st ed.). Springer.
Yang, Z. (2006). Computational molecular evolution (p. 357). Oxford University Press.
C.R. Linder & T. Warnow (2005). “Chapter 19. Phylogenetics”, in the Handbook of Computational Molecular Biology, Ed. S. Aluru, Chapman & Hall/CRC Computer and Information Science Series.