Combining genes in phylogeny And How to test phylogeny methods… Tal Pupko Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University talp@post.tau.ac.il
Multiple sequence alignment (vWF) Rat QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK Rabbit QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN Gorilla QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR Cat REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR
From sequences to a phylogenetic tree VWF Rat QEPGGLVVPPTDA Rabbit QEPGGMVVPPTDA Gorilla QEPGGLVVPPTDA Cat REPGGLVVPPTEG
Multiple multiple sequence alignments Rat QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK Rabbit QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN Gorilla QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR Cat REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR (the slide repeats this alignment block several times)
Phylogenetic studies are now based on the analysis of multiple genes. Murphy et al. (2001b): 19 nuclear genes + 3 mitochondrial genes (16,400 bp)
Consensus tree [Figure: three input trees on the taxa a, b, c, d, e] A consensus tree summarizes information common to two or more trees.
Strict consensus [Figure: three input trees on the taxa a, b, c, d, e and their strict consensus tree] The strict consensus includes only those groups that occur in all the trees being considered.
Strict consensus [Figure: three input trees on the taxa a, b, c, d, e and their strict consensus tree] Problem: the split {ab} is found in 2 out of 3 trees, and this is not shown in the strict consensus.
Majority-rule consensus [Figure: three input trees on the taxa a, b, c, d, e and their majority-rule consensus tree] Majority-rule consensus: splits that are found in the majority of the trees are shown.
Majority-rule consensus [Figure: three input trees on the taxa a, b, c, d, e and their majority-rule consensus tree, with split supports of 67, 100 and 67] The percentage of trees supporting each split is indicated.
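The strict and majority-rule rules described on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three input trees are hypothetical (the slide figures are not available), chosen so that the supports come out as 67, 100 and 67; trees are treated as rooted and written as nested tuples; the function names are made up for this sketch.

```python
from collections import Counter

def clades(tree):
    """Return the non-trivial clades of a rooted nested-tuple tree as frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    found.discard(leaves(tree))  # the clade holding every leaf carries no information
    return found

def clade_support(trees):
    """Fraction of the input trees in which each clade occurs."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}

def strict_consensus(trees):
    """Clades present in every tree."""
    return {clade for clade, s in clade_support(trees).items() if s == 1.0}

def majority_rule_consensus(trees):
    """Clades present in more than half of the trees, with their support."""
    return {clade: s for clade, s in clade_support(trees).items() if s > 0.5}

# Hypothetical input trees on the taxa a..e (assumption, not the trees from the slide).
trees = [
    ((("a", "b"), "c"), ("d", "e")),
    (("a", ("b", "c")), ("d", "e")),
    (((("a", "b"), "c"), "d"), "e"),
]

strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
for clade, support in sorted(majority.items(), key=lambda item: sorted(item[0])):
    print("".join(sorted(clade)), round(100 * support))
```

With these inputs the strict consensus keeps only the group {a, b, c}, while the majority-rule consensus also shows {a, b} and {d, e}, each found in 2 of the 3 trees.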
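Problem with majority-rule consensus [Figure: two input trees, one with the taxa in the order a b c d e and one in the order e b c d a; their majority-rule consensus and strict consensus are the same unresolved tree on a, b, c, d, e] However, if we consider only {b, c, d}, then in both trees b is closer to c than either b or c is to d.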
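The point of this slide can be checked by pruning both trees down to {b, c, d} and comparing what is left. The sketch below assumes the two input trees are the ladder-shaped trees suggested by the leaf orders on the slide; that shape is a guess, and the helper names are hypothetical. It shows the idea that motivates looking at shared subtrees, not the Adams consensus algorithm itself.

```python
def restrict(tree, keep):
    """Prune a rooted nested-tuple tree down to the leaves in `keep`."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (restrict(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse a node that is left with a single child
    return tuple(kept)

def clades(tree):
    """Non-trivial clades of a rooted nested-tuple tree as frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    found.discard(leaves(tree))
    return found

# Assumed shapes of the two input trees (a and e swap positions).
tree1 = (((("a", "b"), "c"), "d"), "e")
tree2 = (((("e", "b"), "c"), "d"), "a")

shared_full = clades(tree1) & clades(tree2)
subset = {"b", "c", "d"}
shared_bcd = clades(restrict(tree1, subset)) & clades(restrict(tree2, subset))

print(shared_full)  # no clade is shared, so both consensus trees are unresolved
print(shared_bcd)   # the group {b, c} is shared once a and e are set aside
```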
Adams consensus [Figure: two input trees on the taxa a, b, c, d, e and their Adams consensus] The Adams consensus gives the subtrees that are common to all trees. It is useful when one or more sequences have unclear positions but a subset of the sequences is placed consistently in all trees.
Networks [Figure: network on the taxa a, b, c, d, e] A network is sometimes used to represent a phylogeny in which recombination occurred.
Maximum Likelihood [Figure: tree with branch lengths t1, t2, t3 and the labels A, S, X, C]
Multiple genes analysis: concatenate analysis
Gene 1 + Gene 2 + Gene 3
Sp1: TCTGT…AACTCTTT…GAATCGTT…GCC
Sp2: TCTGC…GACTCGCT…GGAACGCT…CCC
Sp3: CTTAT…GATCTATT…GGAATATT…CGA
Sp4: CCTAT…GATCCATT…GGACCATT…CCA
[Figure: one tree of Sp1, Sp2, Sp3, Sp4 with a single evolutionary model] e.g., Murphy et al. (2001)
Multiple genes analysis: concatenate analysis
Gene 1: Sp1: TCTGT…AAC Sp2: TCTGC…GAC Sp3: CTTAT…GAT Sp4: CCTAT…GAT
Gene 2: Sp1: TCTTT…GAA Sp2: TCGCT…GGA Sp3: CTATT…GGA Sp4: CCATT…GGA
Gene 3: Sp1: TCGTT…GCC Sp2: ACGCT…CCC Sp3: ATATT…CGA Sp4: CCATT…CCA
[Figure: three copies of the tree of Sp1, Sp2, Sp3, Sp4, one per gene, each labelled "Evolutionary model"] e.g., Murphy et al. (2001)
What are branch lengths? Branch lengths correspond to evolutionary distance: d = AA replacements / site = [AA replacements / (site × year)] × years = evolutionary rate × time
Multiple genes analysis: separate analysis
Gene 1: Sp1: TCTGT…AAC Sp2: TCTGC…GAC Sp3: CTTAT…GAT Sp4: CCTAT…GAT
Gene 2: Sp1: TCTTT…GAA Sp2: TCGCT…GGA Sp3: CTATT…GGA Sp4: CCATT…GGA
Gene 3: Sp1: TCGTT…GCC Sp2: ACGCT…CCC Sp3: ATATT…CGA Sp4: CCATT…CCA
[Figure: three trees of Sp1, Sp2, Sp3, Sp4, one per gene, each with its own evolutionary model (models 1, 2 and 3)] e.g., Nikaido et al. (2001)
Multiple genes analysis: number of parameters
Number of species = n; number of genes = g; number of parameters in the model = m
Concatenate analysis: m + (2n - 3) parameters
Separate analysis: g(m + (2n - 3)) parameters
Example: n = 44, g = 22, m = 0 gives 85 parameters (concatenate) and 1870 parameters (separate)
Multiple genes analysis: number of parameters. Both an oversimplified model and over-parameterization may lead to wrong phylogenetic conclusions.
Multiple genes analysis: proportional analysis
Gene 1: Sp1: TCTGT…AAC Sp2: TCTGC…GAC Sp3: CTTAT…GAT Sp4: CCTAT…GAT
Gene 2: Sp1: TCTTT…GAA Sp2: TCGCT…GGA Sp3: CTATT…GGA Sp4: CCATT…GGA
Gene 3: Sp1: TCGTT…GCC Sp2: ACGCT…CCC Sp3: ATATT…CGA Sp4: CCATT…CCA
[Figure: three trees of Sp1, Sp2, Sp3, Sp4, one per gene, each with its own evolutionary model (models 1, 2 and 3) and its own relative rate: Rate = 1, Rate = 0.5, Rate = 1.5]
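One way to read the proportional analysis is that all genes share one set of branch lengths, and each gene stretches or shrinks that set by its own relative rate. The sketch below illustrates this reading; the branch lengths are invented, and the assignment of the rates 1, 0.5 and 1.5 to Gene 1, Gene 2 and Gene 3 is an assumption, since the slide does not say which gene has which rate.

```python
# Hypothetical shared branch lengths (substitutions per site) for an unrooted
# 4-species tree: four terminal branches plus one internal branch (2n - 3 = 5).
shared_branches = {"Sp1": 0.10, "Sp2": 0.12, "Sp3": 0.20, "Sp4": 0.18, "internal": 0.05}

# Relative rate per gene; the gene-to-rate mapping is assumed for illustration.
gene_rates = {"Gene 1": 1.0, "Gene 2": 0.5, "Gene 3": 1.5}

def gene_branch_lengths(shared, rate):
    """Branch lengths of one gene: every shared branch scaled by the gene's rate."""
    return {branch: rate * length for branch, length in shared.items()}

per_gene = {gene: gene_branch_lengths(shared_branches, rate)
            for gene, rate in gene_rates.items()}

# Free parameters when the substitution model itself has none (m = 0):
# the shared branch lengths plus one rate per gene, with one rate fixed as reference.
free_parameters = len(shared_branches) + (len(gene_rates) - 1)

for gene, lengths in per_gene.items():
    print(gene, {branch: round(length, 3) for branch, length in lengths.items()})
print("free parameters:", free_parameters)
```

Only the ratios between branches are shared across genes, which is why this analysis needs far fewer parameters than estimating a full set of branch lengths per gene.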
Multiple genes analysis: number of parameters
Number of species = n; number of genes = g; number of parameters in the model = m
Concatenate analysis: m + (2n - 3) parameters
Separate analysis: g(m + (2n - 3)) parameters
Proportional analysis: g - 1 + gm + (2n - 3) parameters
Example: n = 44, g = 22, m = 0 gives 85 parameters (concatenate), 1870 parameters (separate) and 106 parameters (proportional)
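The three formulas on this slide are easy to check numerically. In the sketch below, 2n - 3 is the number of branches of an unrooted binary tree with n species; the function names are hypothetical, and the example values n = 44, g = 22, m = 0 are the ones given on the slide.

```python
def branch_count(n_species):
    """Number of branches in an unrooted binary tree with n_species leaves."""
    return 2 * n_species - 3

def concatenate_params(n, g, m):
    """One model and one set of branch lengths for all genes."""
    return m + branch_count(n)

def separate_params(n, g, m):
    """An independent model and set of branch lengths for each gene."""
    return g * (m + branch_count(n))

def proportional_params(n, g, m):
    """One shared set of branch lengths, g - 1 relative rates, and a model per gene."""
    return (g - 1) + g * m + branch_count(n)

n, g, m = 44, 22, 0
print(concatenate_params(n, g, m))   # 85
print(separate_params(n, g, m))      # 1870
print(proportional_params(n, g, m))  # 106
```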
Aims of our study
To compare 3 types of multiple-gene analysis: concatenate analysis, separate analysis, proportional analysis.
3 protein datasets:
Mitochondrial dataset [56 species, 12 genes]
Nuclear dataset (“short genes”) [46 species, 6 genes] (based on the Murphy dataset)
Nuclear dataset (“long genes”) [28 species, 4 genes]
Comparing topologies: morphological topology (based on McKenna and Bell, 1997)
[Figure: tree with the clade labels Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra]
Taxa, in the order shown: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus
Aims of our study: mitochondrial topology
[Figure: tree with the clade labels Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, Hedgehogs]
Taxa, in the order shown: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus
Aims of our study: nuclear topology (Madsenl tree)
[Figure: tree with the clade labels Chiroptera, Eulipotyphla, Pholidota, Cetartiodactyla, Carnivora, Perissodactyla, Glires, Scandentia + Dermoptera, Primate, Xenarthra, Afrotheria, and four groups numbered 1 to 4]
Taxa, in the order shown: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo
Comparing different models using the Akaike Information Criterion (AIC). The model that minimizes the AIC is considered to be the most appropriate model.
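The slide does not spell out the formula, so the sketch below uses the standard definition, AIC = -2 ln(L) + 2 df, which reproduces the AIC values reported in this talk to within rounding. The numbers are the mitochondrial results (N-Gamma rate model); the pairing of each df value with an analysis is inferred from the parameter-count formulas for 56 species and 12 genes, so treat it as an assumption.

```python
def aic(log_likelihood, df):
    """Akaike Information Criterion: -2 ln(L) + 2 * (number of free parameters)."""
    return -2.0 * log_likelihood + 2.0 * df

# (ln L, df) per analysis; df-to-analysis pairing is inferred, not stated on the slide.
results = {
    "separate": (-89921.78, 1320),
    "concatenate": (-91188.71, 121),
    "proportional": (-90999.30, 132),
}

scores = {name: aic(lnl, df) for name, (lnl, df) in results.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:12s} AIC = {score:.2f}")
print("lowest AIC:", best)
```

The separate analysis has the highest likelihood, but its 1320 parameters cost more than the gain in fit, so the proportional analysis ends up with the lowest AIC for this dataset.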
Results: the best multiple gene analysis
The proportional analysis is the best for the mitochondrial dataset (mitochondrial tree, N-Gamma rate model)
Separate analysis: df = 1320, Ln(L) = -89921.78, AIC = 182483.55
Concatenate analysis: df = 121, Ln(L) = -91188.71, AIC = 182619.42
Proportional analysis: df = 132, Ln(L) = -90999.30, AIC = 182262.60
Results: the best multiple gene method
The proportional analysis is the best for the nuclear dataset (“short genes”) (Murphy dataset, Madsenl tree, N-Gamma rate model)
Separate analysis: df = 540, Ln(L) = -11192.12, AIC = 23464.23
Concatenate analysis: df = 95, Ln(L) = -11618.67, AIC = 23427.33
Proportional analysis: df = 100, Ln(L) = -11543.87, AIC = 23287.74
Results: the best multiple gene method
The separate analysis is the best for the nuclear dataset (“long genes”) (Madsen dataset, Murphyl tree, N-Gamma rate model)
Separate analysis: df = 216, Ln(L) = -31153.28, AIC = 62738.56
Concatenate analysis: df = 57, Ln(L) = -31519.10, AIC = 63152.21
Proportional analysis: df = 60, Ln(L) = -31406.81, AIC = 62933.63
Conclusion: the best multiple gene method
1. The concatenate model is never the best way to analyze multiple genes; it has the highest AIC for the mitochondrial and long-gene datasets.
2. The choice between the separate analysis and the proportional analysis depends on the data considered: the proportional model is better suited to short genes, the separate model to longer sequences.
Results: mammalian phylogeny • The morphological tree is always rejected • P(K-H test) < 0.05 • whatever the model used • whatever the dataset
Results: mammalian phylogeny • The mitochondrial tree is the best tree for the mitochondrial dataset. But we cannot reject the nuclear tree. • The nuclear tree is the best for the nuclear datasets, and we can reject the mitochondrial tree. Conclusion (Topology): It seems that the nuclear tree is the best tree among the 3 alternative trees.
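Modeling site-rate variation: the gamma distribution
Homogeneous model: $F(t+x) = F(t)\,P(x)$
Gamma model: $F(t+x) = \sum_{n=1}^{c} \frac{1}{c}\,F(t)\,P(x R_n)$, where c is the number of rate categories and $R_n$ is the rate of category n
[Figure: gamma distribution of rates; x-axis: substitution rates (R), y-axis: site proportions f(r)]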
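The discrete form of the gamma model averages the likelihood over a handful of equally weighted rate categories. The sketch below illustrates that averaging for one aligned site of two sequences. Two things are assumptions made for the example: the substitution model is Jukes-Cantor (JC69) on nucleotides, which the slide does not specify, and the four category rates are invented values with mean 1 rather than rates derived from a fitted gamma distribution.

```python
import math

STATES = "ACGT"

def jc69_prob(same, distance):
    """JC69 probability of observing the same (or one specific different) state after `distance`."""
    decay = math.exp(-4.0 * distance / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def site_likelihood(x, y, distance, rates=(1.0,)):
    """Likelihood of states x and y at one site, averaged over equally weighted rate categories."""
    per_category = [0.25 * jc69_prob(x == y, distance * rate) for rate in rates]
    return sum(per_category) / len(rates)

# Hypothetical category rates (mean 1.0); a real analysis would derive them from the gamma shape.
rates = (0.2, 0.6, 1.2, 2.0)

homogeneous = site_likelihood("A", "G", 0.3)             # a single rate for all sites
with_rate_variation = site_likelihood("A", "G", 0.3, rates)

print(homogeneous, with_rate_variation)
```

Passing a single rate of 1.0 recovers the homogeneous model, so the same function covers both cases on the slide.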
Likelihoods with rate variation [Figure: two copies of a three-leaf tree with leaves A, G, C and branch lengths d1, d2, d3, one labelled "Continuous" and one labelled "Discrete"]
Results: the best site-rate variation model
Mitochondrial dataset (mitochondrial tree, proportional analysis)
Homogeneous model: df = 120, Ln(L) = -98998.68, AIC = 198237.37
1-Gamma model: df = 121, Ln(L) = -91094.30, AIC = 182430.61
N-Gamma model: df = 132, Ln(L) = -90999.30, AIC = 182262.60
Conclusion: the best site-rate variation model The N-Gamma model is always the best site-rate variation model.
Combining Multiple Genes: collaborations. Dorothee Huchon (Florida State University), Masami Hasegawa (Institute of Statistical Mathematics), Norihiro Okada (Tokyo Institute of Technology), Ying Cao (ISM).
Known phylogenies The best way to test different methods of phylogenetic reconstruction is on trees that are known to be true from other sources. Problem: known phylogenies are very rare. Known phylogenies: laboratory animals, crop plants (and even those are often suspect). Also, their evolutionary rate is very small.
Known phylogenies David Hillis and colleagues have created “experimental” phylogenies in the lab.
Known phylogenies They used bacteriophage T7 and subdivided cultures of it in the presence of a mutagen. They then sequenced a marker gene from the final cultures and gave the sequences as input to several phylogenetic methods. The output of the tree-building methods was compared to the true tree.
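Comparing an inferred tree with the true tree can be done by comparing their splits. The sketch below computes the Robinson-Foulds distance, one common topology-comparison measure; the talk does not say which measure was used, so this is an illustration only. The 8-taxon trees are hypothetical and are written as nested tuples.

```python
def splits(tree):
    """Non-trivial bipartitions of a nested-tuple tree, ignoring where the root is drawn."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    everything = leaves(tree)
    reference = min(everything)
    result = set()
    for side in found:
        if reference in side:
            side = everything - side  # always store the side without the reference leaf
        if 1 < len(side) < len(everything) - 1:
            result.add(side)  # keep only splits with at least two leaves on each side
    return result

def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Hypothetical "true" tree and two hypothetical method outputs.
true_tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
same_topology = ((("G", "H"), ("F", "E")), (("D", "C"), ("B", "A")))
wrong_topology = ((("A", "C"), ("B", "D")), (("E", "F"), ("G", "H")))

print(robinson_foulds(true_tree, same_topology))   # identical topology, drawn differently
print(robinson_foulds(true_tree, wrong_topology))  # A/B and C/D pairings are broken
```

A distance of 0 means the method recovered the true topology; each misplaced grouping adds to the count.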
Known phylogenies In fact, they used restriction-site data to infer the phylogeny, using MP, NJ, UPGMA and other methods. All methods reconstructed the true tree.
Known phylogenies They also compared outputs of ancestral sequence reconstruction, using MP. 97.3% of the ancestral states were correctly reconstructed. Encouraging!
Known phylogenies Criticism: (1) The true tree was very easy to infer, because it was well balanced and all nodes were accompanied by numerous changes. (2) Mutations induced by a single mutagen do not reflect reality.