Molecular phylogenetics. Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca. Major Research Themes in XiaLab. Molecular phylogenetics Optimization at cellular and molecular level Genome replication, Transcription, Translation How to accomplish these processes efficiently?
Molecular phylogenetics Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca
Major Research Themes in XiaLab • Molecular phylogenetics • Optimization at cellular and molecular level • Genome replication, Transcription, Translation • How to accomplish these processes efficiently? • Software development in bioinformatics and molecular evolution • DAMBE • AMIADA • goldMINER Slide 2
Biodiversity Slide 3
Convergent Evolution Placental mammals Marsupials Slide 4
The Story of the German Farmer The elder son of the German Farmer: Strong and Robust Immunological & Electrophoretic Diagnosis German Farmer: Strong and Robust The younger son of the German Farmer: Weak and unmanly Slide 5
Page & Holmes, p.3 Slide 6
Three Kingdoms of Life [Figure: universal tree with three groups] • Bacteria: Thermotoga, Chloroflexus, Escherichia, Bacillus, Rhodocyclus, Rickettsia, Anacystis, Mitochondria, Chloroplasts • Eucarya: Dictyostelium, Zea, Oxytricha, Saccharomyces, Human, Xenopus, Trypanosoma, Euglena • Archaea: Thermococcus, Methanococcus, Methanobacterium, Methanospirillum, Sulfolobus, Halococcus, Thermoproteus, Haloferax Slide 7
Where have all the whales gone? • Facts: • North Atlantic minke whales have not been taken for commercial purposes under IWC resolutions since 1986 • Fin whales have not been hunted legally since 1986 • Hunting of humpback whales has been prohibited since 1966 • Birth rate was found to be higher than death rate • Why not more whales? • Illegal hunting? • Forensics [Figure labels: Minke whale (North Atlantic), Sample #19a, Sample #9, Sample #15, Sample #19b, Humpback whale, Sample #41, Sample #3, Sample #11, Sample WS4, Fin whale] Slide 8
Where have all the turtles gone? [Figure: nine rookeries and the adult feeding grounds] Slide 9
Conservation of the Green Turtle (a) Rookeries demographically independent Adult Feeding Grounds Rookery 1 Rookery 2 Rookery 3 (b) Rookeries demographically dependent Adult Feeding Grounds From Avise (1994, p 372) Slide 10
Mitochondrial DNA Variation Ind1 Rookery 1 Ind2 Ind3 Ind4 Ind5 Ind6 Rookery 2 Ind7 Ind8 Ind9 Ind10 Ind11 Rookery 3 Ind12 Ind13 Ind14 Ind15 Ind16 Rookery 4 Ind17 Ind18 (The original data set is far more extensive and complicated) Slide 11
Rooted vs unrooted trees Fig. 5.2 • Root = common ancestor of all entities being studied • A rooted tree has a particular node (the root) from which a unique path leads to any other node • How many rooted vs. unrooted trees are possible for 3 OTUs? For 4 OTUs? (Fig. 5.5) Slide 12
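The counting question on this slide can be checked with a short sketch. It assumes strictly bifurcating trees, for which the standard counts are (2n-5)!! unrooted and (2n-3)!! rooted topologies for n OTUs; the function names are illustrative and not taken from the slides.

```python
def odd_double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * ... * 1 for odd k, and 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_otus: int) -> int:
    """Number of unrooted bifurcating topologies for n_otus >= 3."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs for an unrooted tree")
    return odd_double_factorial(2 * n_otus - 5)


def n_rooted_trees(n_otus: int) -> int:
    """Number of rooted bifurcating topologies for n_otus >= 2."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs for a rooted tree")
    return odd_double_factorial(2 * n_otus - 3)


for n in (3, 4, 5, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

For 3 OTUs this gives 1 unrooted and 3 rooted trees; for 4 OTUs, 3 unrooted and 15 rooted trees. The rapid growth for larger n is why exhaustive tree search quickly becomes impractical.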
Scaled vs unscaled branches Fig. 5.3 Slide 13
Gene tree vs species tree • Genetic polymorphisms may be present in a population before it splits into two distinct populations • The divergence time between two gene sequences may therefore predate the divergence time between the two species • Changes in DNA sequences can occur before or after speciation • A gene tree may not always reflect the species tree Fig. 5.6 Slide 14
True tree and inferred tree [Figure: alternative topologies T1, T2 and T3 for Mouse, Rat, Chimp and Monkey] Slide 15
Maximum Parsimony (MP) Method • Mapping character state changes to alternative topologies • Apply the maximum parsimony criterion to choose the best tree. • Efficient dynamic programming algorithm developed by Walter Fitch and David Sankoff • The only method with branch-and-bound search • Problems • Long-branch attraction • Failure to account for multiple substitutions Slide 16
Maximum parsimony method [Fig. 5.14: alternative unrooted trees for four sequences, 1 to 4] Dot = nucleotide substitution inferred on that branch Slide 17
Maximum parsimony method Fig. 5.14 After analyzing all informative sites, add up all dots - tree with fewest is favoured tree Slide 18
Computing N1 • Each node is represented by a set of characters, with the terminal nodes (leaves) each represented by a set containing a single character. • The MP method traverses through each internal node, starting from the node closest to the leaves. • If two sets of the two daughter nodes have an empty intersection, then the node will be represented by the union of the two daughter sets, otherwise the node will be represented by the intersection. • Once the operation reaches the root, then the number of union operations is the minimum number of changes needed to map the site to the tree. Slide 19
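The set-based procedure described on this slide can be sketched as follows for a single site. The nested-tuple tree encoding and the example site are assumptions made for illustration; they are not data from the slides.

```python
def fitch(node):
    """Return (state_set, n_changes) for the subtree rooted at `node`.

    A leaf is a one-letter string (its observed state); an internal node
    is a tuple (left, right) of its two daughter subtrees.
    """
    if isinstance(node, str):
        return {node}, 0          # leaf: set containing a single character
    left, right = node
    left_set, left_changes = fitch(left)
    right_set, right_changes = fitch(right)
    changes = left_changes + right_changes
    shared = left_set & right_set
    if shared:
        return shared, changes    # non-empty intersection: no extra change
    return left_set | right_set, changes + 1  # union operation: one change


# Hypothetical site on a four-leaf tree ((A, C), (A, G)).
site = (("A", "C"), ("A", "G"))
root_set, n_changes = fitch(site)
print(sorted(root_set), n_changes)
```

Each union operation counts one change, so the second return value at the root is the minimum number of changes needed to map the site onto that topology.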
Tree Length • Site 1 requires four union operations • Sites 3, 5, and 8 each require only one union operation • Sites 6 and 7, which are polymorphic with two nucleotide states but not informative, will require one change for any topology. • The tree length for the topology above: 4+(1+1+1)+(1+1) = 9 Slide 20
Inconsistency (Felsenstein, 1978) [Figure: model tree for taxa A, B, C and D with rates (or branch lengths) p and q, where p >> q, shown beside the wrong tree recovered by MP] • With more data the certainty that parsimony will give the wrong tree increases, so parsimony is statistically inconsistent • It is now recognised that long-branch attraction is one of the most serious problems in phylogenetic inference Slide 21
Nucleotide Substitutions [Figure, from WHL: sequences ACACTCGGATTAGGCT, AGACTCGGATTAGGCT, ATACTCAGGTTAAGCT and ACAATCCGGTTAAGCT, with substitutions labelled single, multiple, coincidental, parallel, convergent and back] Observed sequences: ATACTCAGGTTAAGCT and ACAATCCGGTTAAGCT • Actual number of changes during the evolution of the two daughter sequences: 12 • Observed number of differences between the two daughter sequences: 3 • We need to correct for multiple substitutions to estimate the true number of changes, i.e., 12. Slide 22
Genetic distance: JC69 model • The time separating the two sequences is 2t • p is the observed proportion of sites that differ between the two sequences • Genetic distance is defined as K = 2µt, where µ is the rate of substitution; under the JC69 model µ = 3α, and K is estimated as $K = -\frac{3}{4}\ln\left(1 - \frac{4p}{3}\right)$
Sp1: AAG CCT CGG GGC CCT TAT TTT TTG
Sp2: AAT CTC CGG GGC CTC TAT TTT TTT
p = 6/24 = 0.25, K = 0.304099 Slide 23
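The numbers on this slide can be reproduced with a small sketch. It assumes the usual JC69 estimator K = -(3/4) ln(1 - 4p/3), which is not written out on the slide but is consistent with the reported p and K; the function names are illustrative.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(a != b for a, b in zip(seq1, seq2))
    return n_diff / len(seq1)


def jc69_distance(p: float) -> float:
    """JC69 distance K = -(3/4) ln(1 - 4p/3); undefined for p >= 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The two sequences from the slide, with codon spacing removed.
sp1 = "AAGCCTCGGGGCCCTTATTTTTTG"
sp2 = "AATCTCCGGGGCCTCTATTTTTTT"

p = p_distance(sp1, sp2)
k = jc69_distance(p)
print(f"p = {p:.2f}, K = {k:.6f}")
```

K exceeds p because the correction adds back substitutions hidden by multiple hits at the same site.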
A UPGMA Tree [Figure: UPGMA tree of Human, Chimp, Gorilla, Orang and Gibbon; labels on the figure: 20 MY, 0.6, 0.5, 0.3, 0.2] Slide 24
Distance-based method • Distance matrix • Tree-building algorithms • UPGMA • Neighbor-joining • FastME • Fitch-Margoliash • Criterion-based methods • Branch-length estimation • Tree-selection criterion Slide 25
Branch Length Estimation • For three OTUs, the branch lengths can be estimated directly • For more than three OTUs, there are two commonly used methods for estimating branch lengths • The least-squares method • Fitch-Margoliash method • Don’t confuse the Fitch-Margoliash method of branch length estimation with the Fitch-Margoliash criterion of tree selection • Illustration of the least-squares method of branch length estimation Slide 26
For three OTUs [Figure: tree joining OTUs 1, 2 and 3 through branches x1, x2 and x3]
Observed distances:
      1      2      3
1   0.000  0.092  0.179
2          0.000  0.179
3                 0.000
In symbols:
      1      2      3
1   0.000  d12    d13
2          0.000  d23
3                 0.000
d12 = x1 + x2, d13 = x1 + x3, d23 = x2 + x3 Slide 27
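With three OTUs there are three equations and three unknowns, so the branch lengths follow directly. The sketch below solves the system by hand (for example, adding the first two equations and subtracting the third gives 2·x1); the distances are read from the slide's matrix as d12 = 0.092 and d13 = d23 = 0.179, which is an interpretation of the flattened table.

```python
def three_otu_branch_lengths(d12: float, d13: float, d23: float):
    """Solve d12 = x1 + x2, d13 = x1 + x3, d23 = x2 + x3 for (x1, x2, x3)."""
    x1 = (d12 + d13 - d23) / 2.0
    x2 = (d12 + d23 - d13) / 2.0
    x3 = (d13 + d23 - d12) / 2.0
    return x1, x2, x3


x1, x2, x3 = three_otu_branch_lengths(0.092, 0.179, 0.179)
print(f"x1 = {x1:.3f}, x2 = {x2:.3f}, x3 = {x3:.3f}")
```

Because the system is exactly determined, the estimated branch lengths reproduce the observed distances without error; with four or more OTUs there are more distances than branches, which is where least squares comes in.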
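Least-squares method [Figure: unrooted tree in which OTUs 1 and 2 (branches x1, x2) are separated from OTUs 3 and 4 (branches x3, x4) by the internal branch x5]
Observed distances:
4
Sp1
Sp2 0.3
Sp3 0.4 0.5
Sp4 0.4 0.6 0.6
In symbols:
4
Sp1
Sp2 d12
Sp3 d13 d23
Sp4 d14 d24 d34
Slide 28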
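Least-squares method [Figure: unrooted tree in which OTUs 1 and 2 (branches x1, x2) are separated from OTUs 3 and 4 (branches x3, x4) by the internal branch x5]
Least-squares method: find the $x_i$ values that minimize SS.
Path-length distances on the tree:
$d'_{12} = x_1 + x_2$
$d'_{13} = x_1 + x_5 + x_3$
$d'_{14} = x_1 + x_5 + x_4$
$d'_{23} = x_2 + x_5 + x_3$
$d'_{24} = x_2 + x_5 + x_4$
$d'_{34} = x_3 + x_4$
Squared deviations from the observed distances:
$(d_{12} - d'_{12})^2 = [d_{12} - (x_1 + x_2)]^2$
$(d_{13} - d'_{13})^2 = [d_{13} - (x_1 + x_5 + x_3)]^2$
$(d_{14} - d'_{14})^2 = [d_{14} - (x_1 + x_5 + x_4)]^2$
$(d_{23} - d'_{23})^2 = [d_{23} - (x_2 + x_5 + x_3)]^2$
$(d_{24} - d'_{24})^2 = [d_{24} - (x_2 + x_5 + x_4)]^2$
$(d_{34} - d'_{34})^2 = [d_{34} - (x_3 + x_4)]^2$
Slide 29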
Least-squares method
$SS = [d_{12} - (x_1 + x_2)]^2 + [d_{13} - (x_1 + x_5 + x_3)]^2 + [d_{14} - (x_1 + x_5 + x_4)]^2 + [d_{23} - (x_2 + x_5 + x_3)]^2 + [d_{24} - (x_2 + x_5 + x_4)]^2 + [d_{34} - (x_3 + x_4)]^2$
Taking the partial derivative of SS with respect to each $x_i$, we have
$\partial SS/\partial x_1 = -2 d_{12} + 6 x_1 + 2 x_2 - 2 d_{13} + 4 x_5 + 2 x_3 - 2 d_{14} + 2 x_4$
$\partial SS/\partial x_2 = -2 d_{12} + 2 x_1 + 6 x_2 - 2 d_{23} + 4 x_5 + 2 x_3 - 2 d_{24} + 2 x_4$
$\partial SS/\partial x_3 = -2 d_{13} + 2 x_1 + 4 x_5 + 6 x_3 - 2 d_{23} + 2 x_2 - 2 d_{34} + 2 x_4$
$\partial SS/\partial x_4 = -2 d_{14} + 2 x_1 + 4 x_5 + 6 x_4 - 2 d_{24} + 2 x_2 - 2 d_{34} + 2 x_3$
$\partial SS/\partial x_5 = -2 d_{13} + 4 x_1 + 8 x_5 + 4 x_3 - 2 d_{14} + 4 x_4 - 2 d_{23} + 4 x_2 - 2 d_{24}$
Setting these partial derivatives to 0 and solving for the $x_i$, we have
$x_1 = d_{13}/4 + d_{12}/2 - d_{23}/4 + d_{14}/4 - d_{24}/4$
$x_2 = d_{12}/2 - d_{13}/4 + d_{23}/4 - d_{14}/4 + d_{24}/4$
$x_3 = d_{13}/4 + d_{23}/4 + d_{34}/2 - d_{14}/4 - d_{24}/4$
$x_4 = d_{14}/4 - d_{13}/4 - d_{23}/4 + d_{34}/2 + d_{24}/4$
$x_5 = -d_{12}/2 + d_{23}/4 - d_{34}/2 + d_{14}/4 + d_{24}/4 + d_{13}/4$
Slide 30
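Least-squares method [Figure: unrooted tree in which OTUs 1 and 2 (branches x1, x2) are separated from OTUs 3 and 4 (branches x3, x4) by the internal branch x5]
$x_1 = d_{13}/4 + d_{12}/2 - d_{23}/4 + d_{14}/4 - d_{24}/4$
$x_2 = d_{12}/2 - d_{13}/4 + d_{23}/4 - d_{14}/4 + d_{24}/4$
$x_3 = d_{13}/4 + d_{23}/4 + d_{34}/2 - d_{14}/4 - d_{24}/4$
$x_4 = d_{14}/4 - d_{13}/4 - d_{23}/4 + d_{34}/2 + d_{24}/4$
$x_5 = -d_{12}/2 + d_{23}/4 - d_{34}/2 + d_{14}/4 + d_{24}/4 + d_{13}/4$
Observed distances:
4
Sp1
Sp2 0.3
Sp3 0.4 0.5
Sp4 0.4 0.6 0.6
x1 = 0.075, x2 = 0.225, x3 = 0.275, x4 = 0.325, x5 = 0.025 Slide 31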
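As a cross-check on the closed-form solution, the same branch lengths can be obtained numerically. The sketch below builds the design matrix for the tree with OTUs 1 and 2 on one side of the internal branch x5 and OTUs 3 and 4 on the other, forms the normal equations (AᵀA)x = Aᵀd, and solves them with a small Gauss-Jordan routine. The helper names are illustrative; only the distance matrix comes from the slide.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Rows: d12, d13, d14, d23, d24, d34. Columns: x1, x2, x3, x4, x5.
# A 1 marks a branch lying on the path between the two OTUs.
DESIGN = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
OBSERVED = [0.3, 0.4, 0.4, 0.5, 0.6, 0.6]   # d12, d13, d14, d23, d24, d34


def least_squares_branch_lengths(design, observed):
    """Minimize the sum of squared (observed - path length) differences."""
    k = len(design[0])
    ata = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    atd = [sum(row[i] * d for row, d in zip(design, observed))
           for i in range(k)]
    return solve_linear(ata, atd)


x = least_squares_branch_lengths(DESIGN, OBSERVED)
print([round(v, 3) for v in x])
```

The numerical solution agrees with the values on the slide (0.075, 0.225, 0.275, 0.325, 0.025). The same routine works for larger trees once the design matrix is written down, whereas the closed-form expressions are specific to four OTUs.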
Minimum Evolution Criterion [Figure: three alternative unrooted topologies for OTUs 1 to 4, each with branch lengths x1 to x5] The minimum evolution (ME) criterion: the tree with the shortest TreeLen (the sum of its branch lengths) is the best tree. Slide 32
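A minimal sketch of the ME criterion for four OTUs: apply the closed-form least-squares solution to each of the three possible topologies by relabelling the OTUs, sum the branch lengths, and compare. The topology labels T1 to T3 and the use of a plain (signed) sum as TreeLen are assumptions for illustration; the distance matrix is the one from the least-squares slides.

```python
# Observed distances from the least-squares example (d12, d13, ...).
D = {(1, 2): 0.3, (1, 3): 0.4, (2, 3): 0.5,
     (1, 4): 0.4, (2, 4): 0.6, (3, 4): 0.6}


def dist(i, j):
    """Look up the observed distance between OTUs i and j."""
    return D[(min(i, j), max(i, j))]


def ls_branch_lengths(a, b, c, d):
    """Closed-form least-squares lengths for the topology ((a,b),(c,d)).

    Returns (xa, xb, xc, xd, x_internal); with (a,b,c,d) = (1,2,3,4) these
    are x1..x5 from the least-squares slides.
    """
    cross_a = dist(a, c) + dist(a, d)
    cross_b = dist(b, c) + dist(b, d)
    cross_c = dist(a, c) + dist(b, c)
    cross_d = dist(a, d) + dist(b, d)
    xa = dist(a, b) / 2 + (cross_a - cross_b) / 4
    xb = dist(a, b) / 2 + (cross_b - cross_a) / 4
    xc = dist(c, d) / 2 + (cross_c - cross_d) / 4
    xd = dist(c, d) / 2 + (cross_d - cross_c) / 4
    x_internal = (cross_a + cross_b) / 4 - (dist(a, b) + dist(c, d)) / 2
    return xa, xb, xc, xd, x_internal


TOPOLOGIES = {"T1": (1, 2, 3, 4), "T2": (1, 3, 2, 4), "T3": (1, 4, 2, 3)}
tree_len = {name: sum(ls_branch_lengths(*otus))
            for name, otus in TOPOLOGIES.items()}

for name, length in tree_len.items():
    print(name, round(length, 3))
```

For this toy matrix the lengths are 0.925, 0.950 and 0.925, so ME rejects the second topology but does not separate the first and third; real data sets with more sites and taxa rarely tie exactly.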
Maximum likelihood Method • The likelihood L of a tree is the probability of observing the data given the tree: L = P(data|tree) • Find the tree with the highest L value • Results depend on the model of nucleotide substitution Slide 33
An example of Tree: Four sequences [Figure: the three possible unrooted trees (Tree1, Tree2, Tree3) for Sp1, Sp2, Sp3 and Sp4; each terminal carries one of A, C, G, T] Numbers 5 and 6 stand for the two interior nodes, whose nucleotides could each be A, C, G or T. Slide 34
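Likelihood Method • The likelihood function for a nucleotide site (here the 6th site) is a sum over the 16 possible nucleotide assignments to the interior nodes 5 and 6: p6 = Prob(5=A, 6=A) + Prob(5=A, 6=C) + … + Prob(5=T, 6=T)
Site    1  2  3  4  5  6  7  8  9  10
Sp1     A  C  C  A  T  G  G  T  A  A
Sp2     A  C  A  G  T  G  C  T  A  G
Sp3     G  C  A  G  T  C  G  T  A  G
Sp4     G  C  A  A  C  T  C  C  A  A
Prob.:  p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
[Figure: Tree1 at site 6, with terminals 1(G) and 2(G) joined to node 5 by branches t1 and t2, terminals 3(C) and 4(T) joined to node 6 by branches t3 and t4, and nodes 5 and 6 joined by branch t5] Slide 35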
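Calculation of Likelihood [Figure: Tree1 at site 6, with terminals 1(G) and 2(G) joined to node 5 by branches t1 and t2, terminals 3(C) and 4(T) joined to node 6 by branches t3 and t4, and nodes 5 and 6 joined by branch t5]
$p_6 = \sum_{x}\sum_{y} P(x)\,P_{xG}(t_1)\,P_{xG}(t_2)\,P_{xy}(t_5)\,P_{yC}(t_3)\,P_{yT}(t_4)$, with x and y the nucleotides at nodes 5 and 6,
where P(A), P(T), P(C), P(G) are empirical nucleotide frequencies satisfying P(A)+P(T)+P(C)+P(G)=1, and $P_{ii}(t)$ and $P_{ij}(t)$ are given by JC69: $P_{ii}(t) = \frac{1}{4} + \frac{3}{4}e^{-4\alpha t}$, and $P_{ij}(t) = \frac{1}{4} - \frac{1}{4}e^{-4\alpha t}$ when $i \neq j$ (α is the rate of change to each of the other three nucleotides).
• lnLTree1 = ln(p1) + ln(p2) + … + ln(p10) • Calculate lnLTree2 and lnLTree3 similarly. • We choose the tree which has the highest lnL, i.e. Max(lnLTree1, lnLTree2, lnLTree3). Slide 36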
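The site-likelihood calculation can be sketched by brute force for four sequences: sum over the 16 state combinations at interior nodes 5 and 6, then add log-likelihoods across sites. Several things below are assumptions made for illustration: branch lengths are fixed at a hypothetical 0.1 expected substitutions per site rather than optimized, equal base frequencies of 1/4 are used (the JC69 equilibrium, whereas the slide mentions empirical frequencies), and the assignment of species pairings to the labels Tree2 and Tree3 is a guess.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(i: str, j: str, d: float) -> float:
    """JC69 transition probability over a branch of length d.

    d is in expected substitutions per site (d = 3 * alpha * t).
    """
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(s1, s2, s3, s4, t):
    """Likelihood of one site on the tree ((1,2),(3,4)); t = (t1, ..., t5)."""
    t1, t2, t3, t4, t5 = t
    total = 0.0
    for x, y in product(BASES, repeat=2):   # 16 combinations for nodes 5, 6
        total += (0.25
                  * jc69_prob(x, s1, t1) * jc69_prob(x, s2, t2)
                  * jc69_prob(x, y, t5)
                  * jc69_prob(y, s3, t3) * jc69_prob(y, s4, t4))
    return total


# The 10-site alignment from the slide.
ALIGNMENT = {
    "Sp1": "ACCATGGTAA",
    "Sp2": "ACAGTGCTAG",
    "Sp3": "GCAGTCGTAG",
    "Sp4": "GCAACTCCAA",
}


def log_likelihood(order, t):
    """lnL of the topology ((a,b),(c,d)) for order = (a, b, c, d)."""
    seqs = [ALIGNMENT[name] for name in order]
    return sum(math.log(site_likelihood(*column, t)) for column in zip(*seqs))


T = (0.1, 0.1, 0.1, 0.1, 0.1)   # hypothetical, not optimized
TREES = {
    "Tree1": ("Sp1", "Sp2", "Sp3", "Sp4"),
    "Tree2": ("Sp1", "Sp3", "Sp2", "Sp4"),
    "Tree3": ("Sp1", "Sp4", "Sp2", "Sp3"),
}
lnl = {name: log_likelihood(order, T) for name, order in TREES.items()}
best = max(lnl, key=lnl.get)
print(lnl, best)
```

A real ML analysis would optimize the branch lengths (and any other model parameters) separately for each topology before comparing lnL values, and would use the pruning algorithm rather than explicit enumeration, since the number of interior-state combinations grows exponentially with the number of taxa.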
Problems with ML method The ML method is strictly data-based. If we sampled 6 fish all being males, then our estimation of p is 6/6 = 1. Slide 37
Bayesian inference Fig. 13-8 in Xia, X. 2007. Bioinformatics and the cell: modern computational approaches in genomics, proteomics and transcriptomics, Springer. Slide 38
Tree quality assessment • Re-sampling methods for subtree support • Bootstrap • Jackknife (delete-half jackknife) • Monte-Carlo method • Significance tests for alternative trees • Distance-based method • Maximum parsimony method • Maximum likelihood method Slide 39
Bootstrapping Figure 5.26 Slide 40
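The resampling step behind bootstrap and delete-half jackknife support values can be sketched as below. The toy alignment and function names are hypothetical; in practice a tree is rebuilt from each pseudo-replicate and the support for a subtree is the fraction of replicates in which it appears.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng):
    """Delete-half jackknife: keep a random half of the columns, in order."""
    n_sites = len(next(iter(alignment.values())))
    keep = sorted(rng.sample(range(n_sites), n_sites // 2))
    return {name: "".join(seq[c] for c in keep)
            for name, seq in alignment.items()}


# Hypothetical toy alignment, for illustration only.
ALN = {
    "TaxonA": "ACGTACGTAC",
    "TaxonB": "ACGTTCGTAA",
    "TaxonC": "ACCTACGAAC",
}

rng = random.Random(42)   # fixed seed so the sketch is repeatable
replicate = bootstrap_alignment(ALN, rng)
half = jackknife_alignment(ALN, rng)
print(replicate)
print(half)
```

Columns are resampled as whole units so that the pattern across taxa at each site is preserved; only the weighting of sites changes between replicates.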
Bootstrap example Bootstrap values > 50% are shown Gene tree for α-tubulin Fig. 5.27 Slide 41
Trees with bootstrap values Branches with bootstrap values < 50% are collapsed Bootstrap values < 90% collapsed Fig. 5.27 Slide 42
Phylogenetic Hypothesis Testing: MP
Variable sites only:
Frog      AAGGT
Pigeon    GTGGC
Eagle     GTGGT
Elephant  GAAAC
Lion      GAAAT
[Figure: Tree 1 and Tree 2 for the five species, with reconstructed sequences shown at the internal nodes]
Full sequences (the variable sites are numbered 1 to 5):
Frog      ACCCAAAGGCCTT
Eagle     GCCCTAAGGCCTT
Pigeon    GCCCTAAGGCCTC
Lion      GCCCAAAAACCTT
Elephant  GCCCAAAAACCTC
Number of changes required at each variable site:
Site    1  2  3  4  5
Tree 1  1  1  1  1  2
Tree 2  1  2  2  2  1
Slide 43
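The per-site change counts on this slide can be reproduced with Fitch's set operations. The two topologies below are assumptions inferred from the counts and the node labels in the figure residue: Tree 1 pairs Eagle with Pigeon and Lion with Elephant, Tree 2 pairs Eagle with Lion and Pigeon with Elephant, with Frog as the outgroup in both.

```python
# Variable sites 1-5 from the slide.
SEQS = {
    "Frog": "AAGGT",
    "Pigeon": "GTGGC",
    "Eagle": "GTGGT",
    "Elephant": "GAAAC",
    "Lion": "GAAAT",
}

# Assumed topologies (nested tuples; leaves are species names).
TREE1 = ("Frog", (("Eagle", "Pigeon"), ("Lion", "Elephant")))
TREE2 = ("Frog", (("Eagle", "Lion"), ("Pigeon", "Elephant")))


def fitch_site(node, states):
    """Return (state_set, n_changes) for one site on the subtree at `node`."""
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_changes = fitch_site(node[0], states)
    right_set, right_changes = fitch_site(node[1], states)
    changes = left_changes + right_changes
    shared = left_set & right_set
    if shared:
        return shared, changes
    return left_set | right_set, changes + 1   # union operation: one change


def changes_per_site(tree):
    """Minimum number of changes at each site for the given topology."""
    n_sites = len(SEQS["Frog"])
    return [fitch_site(tree, {sp: seq[i] for sp, seq in SEQS.items()})[1]
            for i in range(n_sites)]


counts1 = changes_per_site(TREE1)
counts2 = changes_per_site(TREE2)
diffs = [a - b for a, b in zip(counts1, counts2)]
print(counts1, sum(counts1))
print(counts2, sum(counts2))
print(diffs)
```

Under these assumed topologies Tree 1 needs 6 changes and Tree 2 needs 8. The site-by-site differences are the raw material for a paired-sites test of whether one tree is significantly more parsimonious than the other.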
Test Phylogenetic Hypotheses: ML • Maximum likelihood-based method • Kishino-Hasegawa’s RELL test • lnL1 from Tree 1 • lnL2 from Tree 2 • D = lnL1 – lnL2 • Var(D) obtained by resampling • DNLML test Slide 44
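A minimal sketch of the resampling idea on this slide: D is the difference in total log-likelihood between the two trees, and Var(D) is estimated by resampling per-site log-likelihood differences with replacement (the RELL shortcut, which avoids re-optimizing parameters for each replicate). The per-site values below are hypothetical, and the z-like ratio is only one simple way to summarize the result; this is not presented as the exact procedure used in any particular program.

```python
import math
import random


def rell_test(site_lnl1, site_lnl2, n_boot=1000, seed=1):
    """Return (D, Var(D), D / sqrt(Var(D))) from per-site log-likelihoods."""
    if len(site_lnl1) != len(site_lnl2):
        raise ValueError("both trees need one log-likelihood per site")
    n = len(site_lnl1)
    diffs = [a - b for a, b in zip(site_lnl1, site_lnl2)]
    d_obs = sum(diffs)                      # D = lnL1 - lnL2
    rng = random.Random(seed)
    boot = [sum(diffs[rng.randrange(n)] for _ in range(n))
            for _ in range(n_boot)]         # resampled values of D
    mean = sum(boot) / n_boot
    var = sum((b - mean) ** 2 for b in boot) / (n_boot - 1)
    ratio = d_obs / math.sqrt(var) if var > 0 else float("inf")
    return d_obs, var, ratio


# Hypothetical per-site log-likelihoods for two candidate trees.
lnl_tree1 = [-1.2, -0.8, -2.5, -1.1, -3.0, -0.9, -1.7, -2.2, -1.0, -1.4]
lnl_tree2 = [-1.3, -0.8, -2.9, -1.0, -3.4, -0.9, -1.6, -2.6, -1.1, -1.5]

d_obs, var_d, ratio = rell_test(lnl_tree1, lnl_tree2)
print(f"D = {d_obs:.3f}, Var(D) = {var_d:.4f}, D/SE = {ratio:.2f}")
```

A large ratio of D to its standard error suggests that the two trees differ in fit by more than sampling variation among sites would explain.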
Things to consider • Data: • Mixture of paralogous and orthologous genes • Gene conversion • Convergent evolution • Horizontal gene transfer • Why is mtDNA popular in molecular phylogenetics? • Too little or too much variation (substitution saturation): choose rapidly evolving sequences for recently diverged taxa and highly conserved genes for resolving deep phylogenies. • Sequence alignment (circularity between alignment and phylogenetics) • Substitution models • Tree selection criteria • Tree search algorithms A phylogenetic tree represents a hypothesized phylogenetic relationship among ingroup species, and published trees often contain errors. (P. 217) Slide 45
Recent Advances • More sophisticated substitution models • Better distance estimation: independent estimation versus simultaneous estimation • More thorough search algorithms • More efficient methods for evaluating alternative topologies • More efficient computational implementations, e.g., Markov chain Monte Carlo for Bayesian inference. Slide 46
Ancient DNA 8 clones of PCR-amplified mtDNA from 26,000 year old cave-bear bone “Note that direct sequencing would lead to ambiguous results at least at two positions (arrows)” Forms of DNA damage likely to affect ancient DNA C/G to T/A changes due to deamination of C residues? “Assuming neutral pH, 15°C … take about 100,000 years for hydrolytic damage to destroy all DNA…. Some environmental conditions could extend this time limit…” “ … to consider amplification of DNA molecules older than one million years of age is overly optimistic.” (Svante Pääbo lab) Slide 47 Hofreiter, Nat.Rev.Genet. 2:353, 2001
Tasmanian wolf Classified as relative of South American marsupials (based on morphology) … but PCR of mtDNA from museum specimen … Tasmanian wolf Aus. tiger cat Tasmanian devil Fig. 5.35 Slide 49
Dispersal of modern human population “Out-of-Africa mitochondrial Eve” hypothesis Hartl & Jones Fig. 14.30 Slide 50