
Molecular phylogenetics

Molecular phylogenetics. Xuhua Xia, xxia@uottawa.ca, http://dambe.bio.uottawa.ca. Major research themes in XiaLab: molecular phylogenetics; optimization at the cellular and molecular level (genome replication, transcription, translation), i.e. how to accomplish these processes efficiently.


Presentation Transcript


  1. Molecular phylogenetics Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca

  2. Major Research Themes in XiaLab • Molecular phylogenetics • Optimization at cellular and molecular level • Genome replication, Transcription, Translation • How to accomplish these processes efficiently? • Software development in bioinformatics and molecular evolution • DAMBE • AMIADA • goldMINER Slide 2

  3. Biodiversity Slide 3

  4. Convergent Evolution Placental mammals Marsupials Slide 4

  5. The Story of the German Farmer • German Farmer: strong and robust • The elder son of the German Farmer: strong and robust • The younger son of the German Farmer: weak and unmanly • Immunological & electrophoretic diagnosis Slide 5

  6. Page & Holmes, p.3 Slide 6

  7. Three Kingdoms of Life [Figure: universal tree with three groups] Bacteria: Thermotoga, Chloroflexus, Escherichia, Bacillus, Rhodocyclus, Mitochondria, Rickettsia, Chloroplasts, Anacystis. Eucarya: Dictyostelium, Zea, Oxytricha, Saccharomyces, Human, Xenopus, Trypanosoma, Euglena. Archaea: Thermococcus, Methanococcus, Methanobacterium, Methanospirillum, Sulfolobus, Halococcus, Thermoproteus, Haloferax. Slide 7

  8. Where have all the whales gone? • Facts: • North Atlantic minke whales have not been taken for commercial purposes under IWC resolutions since 1986 • Fin whales have not been hunted legally since 1986 • Hunting of humpback whales has been prohibited since 1966 • Birth rate was found to be higher than death rate • Why not more whales? • Illegal hunting? • Forensics [Figure: tree of market samples and reference species; labels in order: Minke whale (North Atlantic), Sample #19a, Sample #9, Sample #15, Sample #19b, Humpback whale, Sample #41, Sample #3, Sample #11, Sample WS4, Fin whale] Slide 8

  9. Where have all the turtles gone? [Figure: nine rookeries and the adult feeding grounds] Slide 9

  10. Conservation of the Green Turtle (a) Rookeries demographically independent Adult Feeding Grounds Rookery 1 Rookery 2 Rookery 3 (b) Rookeries demographically dependent Adult Feeding Grounds From Avise (1994, p 372) Slide 10

  11. Mitochondrial DNA Variation Ind1 Rookery 1 Ind2 Ind3 Ind4 Ind5 Ind6 Rookery 2 Ind7 Ind8 Ind9 Ind10 Ind11 Rookery 3 Ind12 Ind13 Ind14 Ind15 Ind16 Rookery 4 Ind17 Ind18 (The original data set is far more extensive and complicated) Slide 11

  12. Rooted vs unrooted trees Fig. 5.2 • Root = common ancestor of all entities being studied • A rooted tree has a particular node (the root) from which a unique path leads to any other node • How many rooted vs. unrooted trees are possible for 3 OTUs? For 4 OTUs…? (Fig. 5.5) Slide 12
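
The counting question on this slide can be checked with the standard formulas for strictly bifurcating trees: (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n OTUs. The sketch below is only an illustration; the function names are made up here and do not come from the slides.

```python
def double_factorial_odd(k):
    """Product of the odd integers 1 * 3 * ... * k (returns 1 when k <= 1)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def n_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    return double_factorial_odd(2 * n_otus - 5)


def n_rooted_trees(n_otus):
    """Number of rooted bifurcating trees: (2n - 3)!! for n >= 2."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return double_factorial_odd(2 * n_otus - 3)


# 3 OTUs: 1 unrooted, 3 rooted; 4 OTUs: 3 unrooted, 15 rooted.
for n in (3, 4, 5, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

The rapid growth of these counts is the reason exhaustive tree search becomes impractical beyond a modest number of OTUs.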

  13. Scaled vs unscaled branches Fig. 5.3 Slide 13

  14. Gene tree vs species tree • Genetic polymorphisms may be present in a population before it splits into two distinct populations • The divergence time between two gene sequences may predate the divergence time between the two species • Changes in DNA sequences can occur before or after speciation • The gene tree may therefore not always reflect the species tree Fig. 5.6 Slide 14

  15. True tree and inferred tree [Figure: trees T1, T2 and T3 relating Mouse, Rat, Chimp and Monkey] Slide 15

  16. Maximum Parsimony (MP) Method • Mapping character state changes to alternative topologies • Apply the maximum parsimony criterion to choose the best tree. • Efficient dynamic programming algorithm developed by Walter Fitch and David Sankoff • The only method with branch-and-bound search • Problems • Long-branch attraction • Failure to account for multiple substitutions Slide 16

  17. Maximum parsimony method [Fig. 5.14: the three unrooted topologies for OTUs 1, 2, 3 and 4] Dot = nucleotide substitution inferred on that branch Slide 17

  18. Maximum parsimony method Fig. 5.14 After analyzing all informative sites, add up all dots; the tree with the fewest dots is the favoured tree Slide 18

  19. Computing N1 • Each node is represented by a set of characters, with the terminal nodes (leaves) each represented by a set containing a single character. • The MP method traverses through each internal node, starting from the node closest to the leaves. • If two sets of the two daughter nodes have an empty intersection, then the node will be represented by the union of the two daughter sets, otherwise the node will be represented by the intersection. • Once the operation reaches the root, then the number of union operations is the minimum number of changes needed to map the site to the tree. Slide 19
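
The set operations described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a rooted binary tree written as nested pairs of leaf names; the toy alignment and the names `fitch_site` and `tree_length` are hypothetical, not taken from the slides.

```python
def fitch_site(tree, states):
    """Return (state set at this node, number of union operations at or below it).

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its character at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0          # a leaf holds a single character
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                            # non-empty intersection: no change
        return shared, left_changes + right_changes
    # empty intersection: take the union and count one change
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical toy alignment, used only to exercise the functions.
alignment = {"sp1": "AAG", "sp2": "AAC", "sp3": "GTC", "sp4": "GTG"}
tree_a = (("sp1", "sp2"), ("sp3", "sp4"))
tree_b = (("sp1", "sp4"), ("sp2", "sp3"))
print(tree_length(tree_a, alignment), tree_length(tree_b, alignment))
```

Under the maximum parsimony criterion the topology with the smaller tree length would be preferred, here `tree_a`.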

  20. Tree Length • Site 1 requires four union operations • Sites 3, 5, and 8 each require only one union operation • Sites 6 and 7, which are polymorphic with two nucleotide states but not informative, will require one change for any topology. • The tree length for the topology above: 4+(1+1+1)+(1+1) = 9 Slide 20

  21. Inconsistency (Felsenstein, 1978) [Figure: model tree and MP tree for taxa A, B, C and D; rates or branch lengths p and q with p >> q; the MP tree is labelled "Wrong"] • With more data, the certainty that parsimony will give the wrong tree increases, so parsimony is statistically inconsistent • It is now recognised that long-branch attraction is one of the most serious problems in phylogenetic inference Slide 21

  22. Nucleotide Substitutions [Figure, from WHL: an ancestral sequence evolving into two observed daughter sequences, with substitutions labelled single, multiple, coincidental, parallel, convergent and back; sequences shown: ACACTCGGATTAGGCT, ATACTCAGGTTAAGCT, ACAATCCGGTTAAGCT, AGACTCGGATTAGGCT] Actual number of changes during the evolution of the two daughter sequences: 12. Observed number of differences between the two daughter sequences: 3. Correcting for multiple substitutions aims to estimate the true number of changes, i.e., 12. Slide 22

  23. Genetic distance: JC69 model
     The two sequences are separated by a total time of 2t. p is the observed proportion of sites that differ between the two sequences. Genetic distance is defined as K = 2µt, where µ is the rate of substitution; under the JC69 model µ = 3α.
     Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
     Sp2: AAT CTC CGG GGC CTC TAT TTT TTT
     p = 0.25, K = 0.304099
     Slide 23
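
The numbers on this slide can be reproduced with the standard JC69 correction, K = -(3/4) ln(1 - 4p/3). The formula itself is not preserved in the transcript, so it is supplied here from the usual statement of the model; the function names are illustrative.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)


def jc69_distance(p):
    """JC69 distance K = -(3/4) ln(1 - 4p/3); undefined once p reaches 3/4."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The two sequences shown on the slide (spaces only separate codons).
sp1 = "AAG CCT CGG GGC CCT TAT TTT TTG".replace(" ", "")
sp2 = "AAT CTC CGG GGC CTC TAT TTT TTT".replace(" ", "")

p = p_distance(sp1, sp2)
k = jc69_distance(p)
print(p, round(k, 6))
```

Six of the 24 sites differ, so p = 0.25, and the corrected distance is larger than p because it allows for multiple substitutions at the same site.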

  24. A UPGMA Tree Human Chimp Gorilla Orang Gibbon 20 MY 0.6 0.5 0.3 0.2 Slide 24

  25. Distance-based method • Distance matrix • Tree-building algorithms • UPGMA • Neighbor-joining • FastME • Fitch-Margoliash • Criterion-based methods • Branch-length estimation • Tree-selection criterion Slide 25

  26. Branch Length Estimation • For three OTUs, the branch lengths can be estimated directly • For more than three OTUs, there are two commonly used methods for estimating branch lengths • The least-square method • Fitch-Margoliash method • Don’t confuse the Fitch-Margoliash method of branch length estimation with the Fitch-Margoliash criterion of tree selection • Illustration of the least-square method of branch length estimation Slide 26

  27. For three OTUs
     [Figure: star tree with OTUs 1, 2 and 3 on branches x1, x2 and x3]
     Distance matrix:
              1      2      3
         1  0.000  0.092  0.179
         2         0.000  0.179
         3                0.000
     In symbols:
              1      2      3
         1  0.000  d12    d13
         2         0.000  d23
         3                0.000
     d12 = x1 + x2
     d13 = x1 + x3
     d23 = x2 + x3
     Slide 27
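
Solving the three equations on this slide gives each branch length directly, for example x1 = (d12 + d13 - d23)/2. A short sketch using the distances from the slide's matrix (the function name is illustrative):

```python
def three_otu_branches(d12, d13, d23):
    """Solve d12 = x1 + x2, d13 = x1 + x3, d23 = x2 + x3 for x1, x2, x3."""
    x1 = (d12 + d13 - d23) / 2
    x2 = (d12 + d23 - d13) / 2
    x3 = (d13 + d23 - d12) / 2
    return x1, x2, x3


# Distances from the slide: d12 = 0.092, d13 = 0.179, d23 = 0.179.
x1, x2, x3 = three_otu_branches(0.092, 0.179, 0.179)
print(round(x1, 3), round(x2, 3), round(x3, 3))
```

With three OTUs there are three distances and three unknowns, so the fit is exact and no least-squares step is needed.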

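  28. Least-squares method
     [Figure: unrooted tree in which OTUs 1 and 2 (branches x1, x2) are joined to OTUs 3 and 4 (branches x3, x4) by the interior branch x5]
     Distance matrix (lower triangle):
         4
         Sp1
         Sp2  0.3
         Sp3  0.4  0.5
         Sp4  0.4  0.6  0.6
     In symbols:
         4
         Sp1
         Sp2  d12
         Sp3  d13  d23
         Sp4  d14  d24  d34
     Slide 28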

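  29. Least-squares method
     [Figure: unrooted tree in which OTUs 1 and 2 (branches x1, x2) are joined to OTUs 3 and 4 (branches x3, x4) by the interior branch x5]
     Least-squares method: find the $x_i$ values that minimize SS.
     Distances implied by the tree:
     \[
     \begin{aligned}
     d'_{12} &= x_1 + x_2 \\
     d'_{13} &= x_1 + x_5 + x_3 \\
     d'_{14} &= x_1 + x_5 + x_4 \\
     d'_{23} &= x_2 + x_5 + x_3 \\
     d'_{24} &= x_2 + x_5 + x_4 \\
     d'_{34} &= x_3 + x_4
     \end{aligned}
     \]
     Squared deviations:
     \[
     \begin{aligned}
     (d_{12} - d'_{12})^2 &= [d_{12} - (x_1 + x_2)]^2 \\
     (d_{13} - d'_{13})^2 &= [d_{13} - (x_1 + x_5 + x_3)]^2 \\
     (d_{14} - d'_{14})^2 &= [d_{14} - (x_1 + x_5 + x_4)]^2 \\
     (d_{23} - d'_{23})^2 &= [d_{23} - (x_2 + x_5 + x_3)]^2 \\
     (d_{24} - d'_{24})^2 &= [d_{24} - (x_2 + x_5 + x_4)]^2 \\
     (d_{34} - d'_{34})^2 &= [d_{34} - (x_3 + x_4)]^2
     \end{aligned}
     \]
     Slide 29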

  30. Least-squares method
     \[
     \begin{aligned}
     SS ={}& [d_{12} - (x_1 + x_2)]^2 + [d_{13} - (x_1 + x_5 + x_3)]^2 + [d_{14} - (x_1 + x_5 + x_4)]^2 \\
          &+ [d_{23} - (x_2 + x_5 + x_3)]^2 + [d_{24} - (x_2 + x_5 + x_4)]^2 + [d_{34} - (x_3 + x_4)]^2
     \end{aligned}
     \]
     Taking the partial derivative of SS with respect to each $x_i$, we have
     \[
     \begin{aligned}
     \partial SS/\partial x_1 &= -2d_{12} + 6x_1 + 2x_2 - 2d_{13} + 4x_5 + 2x_3 - 2d_{14} + 2x_4 \\
     \partial SS/\partial x_2 &= -2d_{12} + 2x_1 + 6x_2 - 2d_{23} + 4x_5 + 2x_3 - 2d_{24} + 2x_4 \\
     \partial SS/\partial x_3 &= -2d_{13} + 2x_1 + 4x_5 + 6x_3 - 2d_{23} + 2x_2 - 2d_{34} + 2x_4 \\
     \partial SS/\partial x_4 &= -2d_{14} + 2x_1 + 4x_5 + 6x_4 - 2d_{24} + 2x_2 - 2d_{34} + 2x_3 \\
     \partial SS/\partial x_5 &= -2d_{13} + 4x_1 + 8x_5 + 4x_3 - 2d_{14} + 4x_4 - 2d_{23} + 4x_2 - 2d_{24}
     \end{aligned}
     \]
     Setting these partial derivatives to 0 and solving for the $x_i$, we have
     \[
     \begin{aligned}
     x_1 &= d_{13}/4 + d_{12}/2 - d_{23}/4 + d_{14}/4 - d_{24}/4 \\
     x_2 &= d_{12}/2 - d_{13}/4 + d_{23}/4 - d_{14}/4 + d_{24}/4 \\
     x_3 &= d_{13}/4 + d_{23}/4 + d_{34}/2 - d_{14}/4 - d_{24}/4 \\
     x_4 &= d_{14}/4 - d_{13}/4 - d_{23}/4 + d_{34}/2 + d_{24}/4 \\
     x_5 &= -d_{12}/2 + d_{23}/4 - d_{34}/2 + d_{14}/4 + d_{24}/4 + d_{13}/4
     \end{aligned}
     \]
     Slide 30

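  31. Least-squares method
     [Figure: unrooted tree in which OTUs 1 and 2 (branches x1, x2) are joined to OTUs 3 and 4 (branches x3, x4) by the interior branch x5]
     \[
     \begin{aligned}
     x_1 &= d_{13}/4 + d_{12}/2 - d_{23}/4 + d_{14}/4 - d_{24}/4 \\
     x_2 &= d_{12}/2 - d_{13}/4 + d_{23}/4 - d_{14}/4 + d_{24}/4 \\
     x_3 &= d_{13}/4 + d_{23}/4 + d_{34}/2 - d_{14}/4 - d_{24}/4 \\
     x_4 &= d_{14}/4 - d_{13}/4 - d_{23}/4 + d_{34}/2 + d_{24}/4 \\
     x_5 &= -d_{12}/2 + d_{23}/4 - d_{34}/2 + d_{14}/4 + d_{24}/4 + d_{13}/4
     \end{aligned}
     \]
     Distance matrix (lower triangle):
         4
         Sp1
         Sp2  0.3
         Sp3  0.4  0.5
         Sp4  0.4  0.6  0.6
     Result: x1 = 0.075, x2 = 0.225, x3 = 0.275, x4 = 0.325, x5 = 0.025
     Slide 31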
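
The same branch lengths can be obtained numerically by solving the least-squares normal equations rather than using the closed-form expressions. The sketch below uses plain Python; the path matrix encodes which branches lie between each pair of OTUs on the slide's topology, and the small linear solver is a generic helper written for this illustration, not part of any library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


# For each pair of OTUs: 1 if branch x1..x5 lies on the path between them.
paths = {
    (1, 2): [1, 1, 0, 0, 0],
    (1, 3): [1, 0, 1, 0, 1],
    (1, 4): [1, 0, 0, 1, 1],
    (2, 3): [0, 1, 1, 0, 1],
    (2, 4): [0, 1, 0, 1, 1],
    (3, 4): [0, 0, 1, 1, 0],
}
# Observed distances from the slide's matrix.
d = {(1, 2): 0.3, (1, 3): 0.4, (1, 4): 0.4,
     (2, 3): 0.5, (2, 4): 0.6, (3, 4): 0.6}

pairs = sorted(paths)
# Normal equations (A^T A) x = A^T d, where A is the path matrix.
ata = [[sum(paths[p][i] * paths[p][j] for p in pairs) for j in range(5)]
       for i in range(5)]
atd = [sum(paths[p][i] * d[p] for p in pairs) for i in range(5)]
x = solve_linear(ata, atd)
print([round(v, 3) for v in x])
```

For larger trees the closed-form solution is no longer practical to write out, which is why a matrix formulation of this kind is the more general route.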

  32. Minimum Evolution Criterion [Figure: the three unrooted topologies for OTUs 1, 2, 3 and 4, each with branches x1, x2, x3, x4 and interior branch x5] The minimum evolution (ME) criterion: the tree with the shortest TreeLen (the sum of its branch lengths) is the best tree. Slide 32
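
As an illustration of the criterion, the tree length of each four-OTU topology can be computed from the least-squares branch lengths. Summing the slides' closed-form solutions for x1 to x5 on the topology ((a,b),(c,d)) gives TreeLen = (dab + dcd)/2 + (dac + dad + dbc + dbd)/4; the sketch below applies that expression to the distance matrix used in the least-squares slides. The helper names are illustrative.

```python
# Observed distances from the least-squares slides.
D = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.4, frozenset((1, 4)): 0.4,
    frozenset((2, 3)): 0.5, frozenset((2, 4)): 0.6, frozenset((3, 4)): 0.6,
}


def dist(i, j):
    """Distance between OTUs i and j, independent of argument order."""
    return D[frozenset((i, j))]


def ls_tree_length(a, b, c, d):
    """Sum of the least-squares branch lengths for topology ((a,b),(c,d))."""
    within = (dist(a, b) + dist(c, d)) / 2
    between = (dist(a, c) + dist(a, d) + dist(b, c) + dist(b, d)) / 4
    return within + between


topologies = {
    "((1,2),(3,4))": (1, 2, 3, 4),
    "((1,3),(2,4))": (1, 3, 2, 4),
    "((1,4),(2,3))": (1, 4, 2, 3),
}
lengths = {name: ls_tree_length(*otus) for name, otus in topologies.items()}
for name, length in lengths.items():
    print(name, round(length, 3))
```

On this particular matrix the first and third topologies come out at the same length (0.925) and the second is longer (0.95), so these distances alone do not separate the two shortest trees.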

  33. Maximum likelihood Method • The likelihood L of a tree is the probability of observing the data given the tree: L = P(data|tree) • Find the tree with the highest L value • Results depend on the model of nucleotide substitution Slide 33

  34. An example of a tree: four sequences [Figure: the three unrooted topologies (Tree1, Tree2, Tree3) for Sp1, Sp2, Sp3 and Sp4, with tips 1 to 4 and interior nodes 5 and 6] Numbers 5 and 6 stand for the two interior nodes, whose nucleotides could each be A, C, G or T. Slide 34

  35. Likelihood Method
     • The likelihood function for a nucleotide site (the 6th site) is given by p6 = Prob + Prob + … + Prob (16 terms, one for each combination of nucleotides at interior nodes 5 and 6)
         Site   1  2  3  4  5  6  7  8  9  10
         Sp1    A  C  C  A  T  G  G  T  A  A
         Sp2    A  C  A  G  T  G  C  T  A  G
         Sp3    G  C  A  G  T  C  G  T  A  G
         Sp4    G  C  A  A  C  T  C  C  A  A
         Prob.  p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
     [Figure: Tree1 with tips 1(G), 2(G), 3(C), 4(T), interior nodes 5 and 6, and branch lengths t1 to t5]
     Slide 35

  36. Calculation of Likelihood [Figure: Tree1 with tips 1(G), 2(G), 3(C), 4(T), interior nodes 5 and 6, and branch lengths t1 to t5] In the site likelihood, P(A), P(T), P(C), P(G) are empirical nucleotide frequencies satisfying P(A)+P(T)+P(C)+P(G)=1, and Pii(t) and Pij(t) are given by JC69. • lnLTree1 = ln(p1)+ln(p2)+…+ln(p10) • Calculate lnLTree2 and lnLTree3 similarly. • Choose the tree with the highest lnL, i.e. Max(lnLTree1, lnLTree2, lnLTree3). Slide 36
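
A minimal sketch of this calculation follows, with several assumptions that are not from the slides: nucleotide frequencies are fixed at 1/4 (the JC69 equilibrium values) instead of empirical frequencies, branch lengths are measured in expected substitutions per site, and all five branch lengths are set to an arbitrary 0.1 rather than being optimized as a real ML program would do. The JC69 transition probabilities used are the standard ones, Pii = 1/4 + (3/4)e^(-4d/3) and Pij = 1/4 - (1/4)e^(-4d/3).

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(i, j, d):
    """JC69 probability of base i changing to base j along a branch of
    length d (expected substitutions per site)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(tips, t):
    """Likelihood of one site on the unrooted tree ((1,2),(3,4)).

    tips -- bases at tips 1..4 (tips 1, 2 attach to node 5; tips 3, 4 to node 6)
    t    -- branch lengths (t1, t2, t3, t4, t5); t5 is the interior branch
    """
    s1, s2, s3, s4 = tips
    t1, t2, t3, t4, t5 = t
    total = 0.0
    for n5, n6 in product(BASES, repeat=2):      # the 16 interior-state pairs
        total += (0.25                            # assumed frequency of n5
                  * jc69_prob(n5, s1, t1) * jc69_prob(n5, s2, t2)
                  * jc69_prob(n5, n6, t5)
                  * jc69_prob(n6, s3, t3) * jc69_prob(n6, s4, t4))
    return total


def log_likelihood(seqs, order, t):
    """lnL = sum over sites of ln(p_site), with sequences placed on tips 1..4
    in the given order."""
    rows = [seqs[name] for name in order]
    return sum(math.log(site_likelihood(col, t)) for col in zip(*rows))


# Alignment from the slides; branch lengths are hypothetical.
seqs = {"Sp1": "ACCATGGTAA", "Sp2": "ACAGTGCTAG",
        "Sp3": "GCAGTCGTAG", "Sp4": "GCAACTCCAA"}
t = (0.1, 0.1, 0.1, 0.1, 0.1)
for order in (("Sp1", "Sp2", "Sp3", "Sp4"),
              ("Sp1", "Sp3", "Sp2", "Sp4"),
              ("Sp1", "Sp4", "Sp2", "Sp3")):
    print(order, round(log_likelihood(seqs, order, t), 4))
```

The three orderings correspond to the three unrooted topologies for four sequences; because the branch lengths here are fixed rather than estimated, the printed values only illustrate the mechanics and should not be read as a tree comparison.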

  37. Problems with ML method The ML method is strictly data-based. If we sampled 6 fish all being males, then our estimation of p is 6/6 = 1. Slide 37

  38. Bayesian inference Fig. 13-8 in Xia, X. 2007. Bioinformatics and the cell: modern computational approaches in genomics, proteomics and transcriptomics, Springer. Slide 38

  39. Tree quality assessment • Re-sampling methods for subtree support • Bootstrap • Jackknife (delete-half jackknife) • Monte-Carlo method • Significance tests for alternative trees • Distance-based method • Maximum parsimony method • Maximum likelihood method Slide 39

  40. Bootstrapping Figure 5.26 Slide 40
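
The resampling step of the bootstrap can be sketched as drawing alignment columns with replacement until the pseudo-alignment has the original length. The example below uses the four-species alignment from the likelihood slides; the function name and the seed are arbitrary choices for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


# Alignment taken from the likelihood slides.
alignment = {"Sp1": "ACCATGGTAA", "Sp2": "ACAGTGCTAG",
             "Sp3": "GCAGTCGTAG", "Sp4": "GCAACTCCAA"}

rng = random.Random(1)   # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree would then be built from each replicate with the same method used for the original data, and the bootstrap value of a group is the percentage of replicate trees in which that group appears.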

  41. Bootstrap example Bootstrap values > 50% are shown Gene tree for α-tubulin Fig. 5.27 Slide 41

  42. Trees with bootstrap values Branches with bootstrap values < 50% are collapsed Bootstrap values < 90% collapsed Fig. 5.27 Slide 42

  43. Phylogenetic Hypothesis Testing: MP
     Variable sites only:
         Frog      AAGGT
         Pigeon    GTGGC
         Eagle     GTGGT
         Elephant  GAAAC
         Lion      GAAAT
     [Figure: Tree 1 and Tree 2 for Frog, Lion, Eagle, Elephant and Pigeon, with reconstructed ancestral states at the interior nodes]
     Full alignment (variable sites numbered 1 to 5 in the header row):
                   1   2  34   5
         Frog      ACCCAAAGGCCTT
         Eagle     GCCCTAAGGCCTT
         Pigeon    GCCCTAAGGCCTC
         Lion      GCCCAAAAACCTT
         Elephant  GCCCAAAAACCTC
     Number of changes required at variable sites 1 to 5:
         Tree 1    1 1 1 1 2
         Tree 2    1 2 2 2 1
     Slide 43
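
The per-site change counts on this slide can be recomputed with the Fitch set operations described earlier. The two topologies below are an assumption: the figure is not preserved in the transcript, so Tree 1 is taken to group the two birds and the two mammals, and Tree 2 to pair Lion with Eagle and Elephant with Pigeon, which is the arrangement that reproduces the slide's counts.

```python
def fitch_changes(tree, states):
    """Return (state set, number of changes) for one site on a nested-pair tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_n = fitch_changes(tree[0], states)
    right_set, right_n = fitch_changes(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_n + right_n
    return left_set | right_set, left_n + right_n + 1


def per_site_changes(tree, seqs):
    """Minimum number of changes at each alignment site for the given tree."""
    n_sites = len(next(iter(seqs.values())))
    return [fitch_changes(tree, {k: s[i] for k, s in seqs.items()})[1]
            for i in range(n_sites)]


# Alignment from the slide.
seqs = {
    "Frog":     "ACCCAAAGGCCTT",
    "Eagle":    "GCCCTAAGGCCTT",
    "Pigeon":   "GCCCTAAGGCCTC",
    "Lion":     "GCCCAAAAACCTT",
    "Elephant": "GCCCAAAAACCTC",
}
# Assumed topologies (see the note above), rooted on Frog.
tree1 = ("Frog", (("Eagle", "Pigeon"), ("Lion", "Elephant")))
tree2 = ("Frog", (("Lion", "Eagle"), ("Elephant", "Pigeon")))

steps1 = per_site_changes(tree1, seqs)
steps2 = per_site_changes(tree2, seqs)
variable = [i for i in range(len(steps1)) if steps1[i] or steps2[i]]
print([steps1[i] for i in variable], sum(steps1))
print([steps2[i] for i in variable], sum(steps2))
```

The per-site differences between the two trees are the raw material for a paired comparison across sites, which is how a parsimony-based test of the two hypotheses would typically proceed.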

  44. Test Phylogenetic Hypotheses: ML • Maximum likelihood-based method • Kishino-Hasegawa’s RELL test • lnL1 from Tree 1 • lnL2 from Tree 2 • D = lnL1 – lnL2 • Var(D) obtained by resampling • DNLML test Slide 44
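
The resampling idea behind the RELL approach can be sketched as follows: per-site log-likelihoods under each tree are computed once, and sites are then resampled with replacement to estimate Var(D) without re-estimating any parameters. The per-site values below are hypothetical numbers chosen for illustration, and the function is a simplified sketch rather than a full implementation of the published test.

```python
import math
import random


def rell_variance(site_lnl_1, site_lnl_2, n_boot=1000, seed=0):
    """Return (D, Var(D), D / sqrt(Var(D))) using resampled site log-likelihoods."""
    diffs = [a - b for a, b in zip(site_lnl_1, site_lnl_2)]
    n = len(diffs)
    d_obs = sum(diffs)                     # D = lnL1 - lnL2
    rng = random.Random(seed)
    boot = [sum(diffs[rng.randrange(n)] for _ in range(n))
            for _ in range(n_boot)]        # D recomputed on resampled sites
    mean_b = sum(boot) / n_boot
    var_d = sum((b - mean_b) ** 2 for b in boot) / (n_boot - 1)
    return d_obs, var_d, d_obs / math.sqrt(var_d)


# Hypothetical per-site log-likelihoods under Tree 1 and Tree 2.
lnl_tree1 = [-2.1, -1.3, -3.0, -2.8, -1.2, -4.1, -3.3, -1.9, -1.1, -2.7]
lnl_tree2 = [-2.3, -1.3, -3.4, -2.6, -1.2, -4.6, -3.1, -2.0, -1.1, -3.0]

d_obs, var_d, z = rell_variance(lnl_tree1, lnl_tree2)
print(round(d_obs, 3), round(var_d, 3), round(z, 3))
```

D is then judged against its estimated standard error; a value of D that is small relative to sqrt(Var(D)) means the data do not clearly favour one tree over the other.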

  45. Things to consider • Data: • Mixture of paralogous and orthologous genes • Gene conversion • Convergent evolution • Horizontal gene transfer • Why is mtDNA popular in molecular phylogenetics? • Too little or too much variation (substitution saturation): choose rapidly evolving sequences for recently diverged taxa and highly conserved genes for resolving deep phylogenies. • Sequence alignment (circularity between alignment and phylogenetics) • Substitution models • Tree selection criteria • Tree search algorithms A phylogenetic tree represents a hypothesized phylogenetic relationship among ingroup species, and published trees often contain errors. (P. 217) Slide 45

  46. Recent Advances • More sophisticated substitution models • Better distance estimation: independent estimation versus simultaneous estimation • More thorough search algorithms • More efficient methods for evaluating alternative topologies • More efficient computational implementations, e.g., Markov chain Monte Carlo for Bayesian inference. Slide 46

  47. Ancient DNA • 8 clones of PCR-amplified mtDNA from 26,000-year-old cave-bear bone: “Note that direct sequencing would lead to ambiguous results at least at two positions (arrows)” • Forms of DNA damage likely to affect ancient DNA: C/G to T/A changes due to deamination of C residues? • “Assuming neutral pH, 15°C … take about 100,000 years for hydrolytic damage to destroy all DNA…. Some environmental conditions could extend this time limit…” • “ … to consider amplification of DNA molecules older than one million years of age is overly optimistic.” (Svante Pääbo lab) Hofreiter, Nat. Rev. Genet. 2:353, 2001 Slide 47

  48. Tasmanian wolf Classified as relative of South American marsupials (based on morphology) … but PCR of mtDNA from museum specimen … Tasmanian wolf Aus. tiger cat Tasmanian devil Fig. 5.35 Slide 49

  49. Dispersal of modern human population “Out-of-Africa mitochondrial Eve” hypothesis Hartl & Jones Fig. 14.30 Slide 50
