What has genomics brought to animal phylogeny? Hervé Philippe, Département de Biochimie, Centre Robert Cedergren, Université de Montréal, Succursale Centre-Ville, Montréal, Québec H3C3J7, Canada.
Cambrian explosion: a paleontological perspective
[Figure: Cambrian fossils Marrella, Aysheaia, Halkieria and Pikaia.]
Cambrian explosion: a molecular perspective
[Figure: time-scaled tree of Choanoflagellata, Porifera, Cnidaria, Ecdysozoa, Lophotrochozoa and Deuterostomia; time axis from 700 to 0 MYa, with the Cambrian explosion marked.]
Molecular phylogenies should resolve a series of speciation events separated by a few million years.
Lack of resolution in molecular phylogenetics
(1) Inadequate selection of sequences (non-orthologous, saturated, etc.)
(2) Inadequate tree reconstruction method
(3) Inadequate taxon sampling
(4) Rapid diversification of species
• Points (1), (2) and (3) are always mixed:
• A (simplistically) theoretical overview
• Analyses of several case studies
• A molecular dating approach
Cambrian explosion: a molecular perspective
[Figure: time-scaled tree of Choanoflagellata, Porifera, Cnidaria, Ecdysozoa, Lophotrochozoa and Deuterostomia; time axis from 700 to 0 MYa, with the Cambrian explosion marked; T is the duration of an internal branch.]
Bootstrap support of 95% requires 3 substitutions on the corresponding branch (Felsenstein, 1985).
• 18S ribosomal RNA (~1000 positions): ~100 substitutions over 500 MY, hence resolution for branches with T of about 15 MY or more
• 50 genes (Rokas et al. 2005, 12,060 positions): ~2400 substitutions over 500 MY, hence resolution for branches with T of about 0.7 MY or more
• 146 genes (Delsuc et al. 2006, 33,800 positions): ~7000 substitutions over 500 MY, hence resolution for branches with T of about 0.25 MY or more
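The arithmetic behind these thresholds can be sketched as follows. This is a minimal illustration, assuming a constant substitution rate over the 500 MY span and assuming that a bootstrap replicate recovers a branch as soon as it samples at least one of k uncontradicted supporting sites (the usual reading of Felsenstein's argument). The function names are hypothetical, and the exact results differ slightly from the rounded figures quoted on the slide.

```python
SUBS_NEEDED = 3  # substitutions quoted on the slide for ~95% bootstrap support


def bootstrap_support(k, n_sites):
    """Approximate probability that a bootstrap replicate of n_sites columns
    contains at least one of k uncontradicted supporting sites (assumption)."""
    return 1.0 - (1.0 - k / n_sites) ** n_sites


def min_branch_duration(total_subs, span_my=500.0, needed=SUBS_NEEDED):
    """Shortest branch (in MY) expected to accumulate `needed` substitutions,
    assuming substitutions accumulate at a constant rate over span_my."""
    rate_per_my = total_subs / span_my
    return needed / rate_per_my


# Substitution totals as quoted on the slide.
datasets = {
    "18S rRNA": 100,
    "Rokas et al. 2005": 2400,
    "Delsuc et al. 2006": 7000,
}

for name, subs in datasets.items():
    print(f"{name}: T of at least {min_branch_duration(subs):.2f} MY")

print(f"support with 3 sites out of 1000: {bootstrap_support(3, 1000):.3f}")
```

With three supporting sites the approximate support is close to 1 - exp(-3), which is where the 95% figure comes from under this reading.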
Phylogenetic signal
[Figure: true history with nodes 1, 2 and 3, and the trees inferred from the phylogenetic signal. With 1000 positions, support is 80% for node 1, 25% for node 2 and 10% for node 3; with 12000 positions, support is 100% for all three nodes.]
Comparison of ML SSU and LSU trees (A and B, respectively). Medina M. et al. PNAS 2001;98:9707-9712
50 genes (12,060 amino acid positions), ML RtREV+I+Γ / MP bootstrap support. Rokas et al. (2005) Animal evolution and the molecular signature of radiations compressed in time. Science, 310:1933-1938
Non-phylogenetic signal
Sequences evolve according to a very complex and heterogeneous process, which tree reconstruction methods approximate as best they can using elaborate models of sequence evolution.
Real complexities:
• the mutation process is not homogeneous over time and across the genome
• population structure is not homogeneous over time
• selective pressures are not homogeneous over time and across the genome
• nucleotide compositions are heterogeneous across species
• evolutionary rate is heterogeneous across positions and over time (heterotachy)
• the substitution process is heterogeneous across positions and over time
• positions are inter-dependent, etc.
All the complexities that are not adequately handled by our oversimplified models of sequence evolution can cause systematic biases, which are referred to here as non-phylogenetic signal.
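Phylogenetic signal and non-phylogenetic signal
[Figure: the apparent signal combines the phylogenetic signal of the true history (nodes 1, 2 and 3) with a conflicting non-phylogenetic signal. Inferred trees: with 1000 positions, support is 70% for node 1, 5% for node 2 and 10% for node 3; with 12000 positions, support is 100% for node 1, 5% for node 2 and 100% for node 3.]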
Systematic errors (inconsistency)
[Figure: four-taxon trees for A, B, C and D with branch change probabilities p and q, and the condition p < q²; the true tree and the tree inferred when the long branches are grouped together. LONG BRANCH ATTRACTION (Felsenstein, 1978).]
Systematic error: the error in phylogenetic estimates that is due to the failure of the reconstruction method to account fully for multiple substitutions (in a probabilistic framework, for the properties of the data).
Systematic errors will not disappear with phylogenomics, and may indeed become more apparent.
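Long branch attraction can be illustrated numerically. The sketch below assumes a two-state symmetric model on the true tree ((A,B),(C,D)), with A and C on long branches (change probability q) and B, D and the internal branch short (change probability p). These assignments and the function names are assumptions made for illustration, not details taken from the slide. The code computes the expected frequency of the parsimony-informative site patterns supporting each of the three groupings; parsimony becomes inconsistent once the pattern grouping the two long branches is more frequent than the pattern supporting the true tree.

```python
from itertools import product


def t(x, y, r):
    """Probability of observing state y at the end of a branch that starts
    in state x, when the branch has change probability r."""
    return r if x != y else 1.0 - r


def pattern_prob(a, b, c, d, p, q):
    """Probability of tip states (a, b, c, d) on the true tree ((A,B),(C,D)).
    Assumption: A and C are long (q); B, D and the internal branch are short (p)."""
    total = 0.0
    for u, v in product((0, 1), repeat=2):  # states at the two internal nodes
        total += (0.5 * t(u, a, q) * t(u, b, p)
                  * t(u, v, p) * t(v, c, q) * t(v, d, p))
    return total


def informative_support(p, q):
    """Expected frequency of informative patterns for each grouping
    (factor 2 accounts for the complementary 0/1 pattern)."""
    return {
        "AB|CD": 2 * pattern_prob(0, 0, 1, 1, p, q),  # true tree
        "AC|BD": 2 * pattern_prob(0, 1, 0, 1, p, q),  # long branches together
        "AD|BC": 2 * pattern_prob(0, 1, 1, 0, p, q),
    }


for p, q in [(0.05, 0.10), (0.05, 0.40)]:
    support = informative_support(p, q)
    best = max(support, key=support.get)
    print(f"p={p}, q={q}: favoured grouping {best}")
```

In the second setting more data only strengthens the wrong grouping, which is the sense in which the error is systematic.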
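Multiple substitutions at the same position
[Figure: a position undergoing successive substitutions between A, C and G along the branches of a tree, so that unrelated lineages end up sharing the same nucleotide: a tree building artefact.]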
50 genes (12,060 amino acid positions), ML RtREV+I+Γ / MP bootstrap support. Rokas et al. (2005) Science, 310:1933-1938
[Figure: tree with ML/MP bootstrap support on the nodes: 99/56, 94/55, 97/51, 72/36, 84/54, 100/75, 43/74.]
Phylogenetic signal and non-phylogenetic signal
[Figure: true history (nodes 1, 2 and 3), phylogenetic, non-phylogenetic and apparent signal, and the trees inferred from 12000 positions. ML support: 100% for node 1, 99% for node 2, 84% for node 3. MP support: 75% for node 1, 56% for node 2, 54% for node 3.]
Phylogenomics yields incongruent results
[Figure: conflicting phylogenies published in PLoS Biology, Nature and Current Biology.]
Single gene phylogeny of Schierwater et al. (2009)
[Figure: single-gene trees (scale bars 0.1). Mitochondrial ATP synthase F0 subunit 6: NON-HOMOLOGOUS sequences. HSP70: PARALOGOUS sequences, mixing ER HSP70 and cytosolic HSP70.]
Single gene phylogeny of Schierwater et al. (2009)
[Figure: single-gene trees (scale bars 0.1). Small RAS-like GTPase: mixes CDC42 and RAC1. GTP-binding nuclear protein Ran.]
Single gene phylogeny of Schierwater et al. (2009)
[Figure: single-gene trees (scale bars 0.1). Paired box domain protein PAX-B. DNA directed RNA polymerase: mixes Pol II and Pol III.]
Contaminated dataset: Schierwater et al. (2009) PLoS Biol 7(1): e1000020
Clean dataset: Philippe et al. (2011) PLoS Biol, in press
[Figure: two trees (scale bars 0.1) relating Excavata, Ciliophora, Amoebozoa, Ascomycota, Basidiomycota, Choanoflagellata, Placozoa, Porifera (Calcarea, Demospongiae, Hexactinellida), Ctenophora, Cnidaria and Bilateria, with support values on the nodes.]
Dunn et al.: 150 genes, 24,708 positions
Contaminations:
• Symsagittifera: 13 genes (including 6 Chlorophyta, 2 Ciliophora, 2 Bacteria)
• Neochildia: 4 (Microsporidia)
• Saccoglossus: 2 (Mus)
• Acanthoscurria: 2 (angiosperm)
• Hydra: 2 (Artemia)
• Oscarella: 1 (Pseudomonas)
• Asterina: 1 (Bacteria)
• Dugesia: 1 (Gallus)
• Xiphinema: 1 (Lumbricus)
• Monosiga: 1 (Rhizopus)
• Macrostomum: 1
• Trichinella: 2
• Priapulus: 2
• Branchiostoma: 1
Dunn et al.: 150 genes, 24,708 positions
Frameshifts: 63 species concerned
Drosophila 2, Paraplanocera 3, Echinoderes 4, Xenoturbella 4, Chaetopterus 5, Cyanea 5, Cristatella 6,
Platynereis 6, Spinochordodes 6, Cryptococcus 8, Spadella 8, Mnemiopsis 9, Bugula 10, Gnathostomula 10,
Hydra 10, Sphaeroforma 10, Turbanella 10, Chaetoderma 15, Myzostoma 15, Scutigera 16, Carcinus 18,
Lumbricus 20, Ptychodera 20, Euperipatoides 21, Carcinoscorpius 22, Symsagittifera 22, Chaetopleura 23, Homo 25,
Boophilus 30, Hypsibius 30, Richtersius 30, Daphnia 32, Asterina 35, Anoplodactylus 40, Argopecten 43,
Xiphinema 43, Acropora 45, Dugesia 46, Brachionus 50, Ciona 50, Branchiostoma 52, Hydractinia 53,
Haementeria 54, Flaccisagitta 55, Strongylocentrotus 55, Acanthoscurria 58, Aplysia 58, Saccoglossus 60, Capsaspora 68,
Gallus 73, Phoronis 87, Capitella 93, Echinococcus 100, Fenneropenaeus 112, Monosiga 118, Schmidtea 129,
Oscarella 141, Mytilus 151, Euprymna 201, Trichinella 281, Crassostrea 296, Macrostomum 382, Biomphalaria 384
• Frameshifts: 3868 "invented" amino acids
• 5 introns: Anoplodactylus, Chaetopterus, Ciona, Themiste, Trichinella
• Many single point errors: a total of 970 errors (in large part due to the use of an erroneous mitochondrial genetic code!)
• Several genes with paralogy issues: 2-5 intractable problems, 10-20 tractable problems
DUNN: 150 genes, 21,152 positions, 55.6% of missing data
UPDUNN: 150 genes, 18,463 positions, 35.6% of missing data
Clean Dunn et al. dataset: CAT+Γ model, 150 genes, 18,463 positions, 35.6% of missing data
[Figure: tree of the full taxon sample, from the outgroups (Saccharomyces, Cryptococcus, Sphaeroforma, Amoebidium, Capsaspora, Monosiga) to Porifera, Ctenophora, Cnidaria and Bilateria; scale bar 0.2; node symbols distinguish BS=100% from 70<BS<100.]
Model of sequence evolution: the WAG matrix
• 20 stationary probabilities (π_i)
• 190 relative rates (ρ_ij = ρ_ji)
[Figure: 20 x 20 amino acid replacement matrix, rows and columns labelled A C D E F G H I K L M N P Q R S T V W Y.]
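How the two parameter sets combine into a rate matrix can be sketched as below. This is a minimal illustration assuming the usual reversible construction Q_ij = ρ_ij π_j for i ≠ j, with the diagonal set so that each row sums to zero. The numerical values are toy placeholders, not the WAG estimates, and the function names are hypothetical.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_relative_rates(n_states):
    """Number of symmetric relative rates rho_ij = rho_ji with i < j."""
    return n_states * (n_states - 1) // 2


def build_rate_matrix(pi, rho):
    """Reversible rate matrix: Q[i][j] = rho[(i, j)] * pi[j] for i != j,
    with each diagonal entry set so that the row sums to zero."""
    n = len(pi)
    q = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        q[i][j] = rho[(i, j)] * pi[j]
        q[j][i] = rho[(i, j)] * pi[i]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums off-diagonals
    return q


# Toy parameters for illustration only (NOT the WAG values).
raw = [float(i + 1) for i in range(len(AMINO_ACIDS))]
pi = [x / sum(raw) for x in raw]
rho = {(i, j): 1.0 + (i * j) % 5 for i, j in combinations(range(20), 2)}

Q = build_rate_matrix(pi, rho)
print(n_relative_rates(len(AMINO_ACIDS)), "relative rates")
```

The symmetry of ρ is what makes the process reversible: π_i Q_ij equals π_j Q_ji for every pair of amino acids.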
The CAT model of sequence evolution
Man         M A E I G R L I E F S A M V D F W Q N R C
Frog        M A E I G R L V E Y S A M V D F W Q N R C
Zebrafish   M A D L G K L I D Y S A L V D F W Q N R C
Fly         M S D I G K L V E F S P M V E F W Q Q K C
Yeast       M S E I G R L V E F T P M V E F W Q N R C
Amoeba      L S E L G R L V D F T A M V D F W N N R C
Paramecium  L A E L G K L V E Y A P M I D F W Q A R C
Green alga  L S D L G K L I D F S A M I N F W Q N K C
• Homogeneous (WAG) model: 1 substitution matrix
• Heterogeneous (CAT) model: K distinct amino acid profiles (ACD...VWY), one per category (mode) 1, 2, 3, …, K
Lartillot & Philippe (2004) Mol Biol Evol. 21:1095-1109
The CAT model of sequence evolution
To keep the number of parameters low, a category is only defined by a set of stationary probabilities (the relative rates are uniform), and the number of categories is inferred from the alignment.
• 20 stationary probabilities (π_i) per category
• uniform relative rates (ρ_ij = ρ_ji)
[Figure: amino acid profiles over A C D E F G H I K L M N P Q R S T V W Y, one per category.]
Lartillot & Philippe (2004) Mol Biol Evol. 21:1095-1109
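With uniform relative rates, the substitution process of a category has a simple closed form, sketched below. The sketch assumes all relative rates equal 1 and leaves the rate matrix unnormalised, so that P(i → j | t) = exp(-t)·[i = j] + (1 - exp(-t))·π_j. The two profiles and the function names are hypothetical examples, not categories inferred from any dataset.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_profile(weights):
    """Normalise a dict of amino acid weights into a 20-entry stationary profile."""
    total = sum(weights.values())
    return {aa: weights.get(aa, 0.0) / total for aa in AMINO_ACIDS}


def transition_prob(profile, i, j, t):
    """P(i -> j | t) under uniform relative rates (assumed equal to 1):
    the site keeps its state with probability exp(-t), otherwise a new
    state is drawn from the category profile."""
    stay = math.exp(-t) if i == j else 0.0
    return stay + (1.0 - math.exp(-t)) * profile[j]


# Hypothetical profiles: one dominated by acidic residues, one flat.
acidic = make_profile({"D": 0.48, "E": 0.48, "N": 0.02, "Q": 0.02})
flat = make_profile({aa: 1.0 for aa in AMINO_ACIDS})

for name, prof in [("acidic profile", acidic), ("flat profile", flat)]:
    print(name, "P(D -> E | t=1) =", round(transition_prob(prof, "D", "E", 1.0), 3))
```

A site assigned to the acidic profile is expected to exchange D and E freely while almost never visiting other amino acids, which is the behaviour a single replacement matrix cannot reproduce site by site.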
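Stable categories inferred by the CAT model
[Figure: amino acid profiles of inferred categories, for example one dominated by E and Q and one dominated by D and N. The size of an amino acid is proportional to its stationary probability.]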
Multiple substitutions between two amino acids
[Figure: tree in which a single position switches repeatedly between D and E along the branches.]
What is predicted by evolutionary models?
[Figure: number of substitutions (0 to 7) predicted under the GTR, WAG and CAT models.]
Multiple substitutions between two amino acids
These multiple substitutions are well handled by the CAT model because this position will be explained by a profile dominated by D and E.
[Figure: tree in which a single position switches repeatedly between D and E, next to the corresponding D/E profile.]
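One way to see why the profile matters is to ask how many substitutions a model infers from the same observation. The sketch below assumes the uniform-rate process of a CAT category, under which two sequences separated by time t are identical at a site with probability h + (1 - h)·exp(-t), where h is the sum of squared stationary probabilities, and the expected number of substitutions is (1 - h)·t. The observed identity of 0.6 and both profiles are illustrative assumptions.

```python
import math


def homozygosity(profile):
    """Probability that two independent draws from the profile are identical."""
    return sum(x * x for x in profile)


def expected_substitutions(identity, profile):
    """Substitutions per site implied by an observed identity between two
    sequences, assuming uniform relative rates and the given profile."""
    h = homozygosity(profile)
    if identity <= h:
        raise ValueError("identity at or below saturation for this profile")
    return -(1.0 - h) * math.log((identity - h) / (1.0 - h))


flat = [1.0 / 20] * 20                # all amino acids equally likely
de_profile = [0.5, 0.5] + [0.0] * 18  # only D and E allowed (hypothetical)

observed_identity = 0.6  # illustrative value for D/E-only positions
print("flat profile:", round(expected_substitutions(observed_identity, flat), 3))
print("D/E profile :", round(expected_substitutions(observed_identity, de_profile), 3))
```

Under the flat profile a 60% identity looks far from saturation, so few hidden changes are inferred; under the D/E profile the same identity is close to the 50% saturation level, so the model infers more multiple substitutions.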
What is predicted by the WAG replacement matrix?
[Figure: predicted number of substitutions.]
Further reduction of non-phylogenetic signal
Alignment of Rokas et al. (2005): 50 genes (12,060 amino acid positions). Model CAT+Γ, inferred using PhyloBayes; 100 bootstrap replicates.
[Figure: tree (scale bar 0.02) with bootstrap values on the nodes. Taxa: Human, Mouse, Zebrafish, Tunicate, Triclad platyhelminth, Trematode platyhelminth, Mollusk, Annelid, Priapulid, Arthropod, Nematode, Anthozoan cnidarian, Hydrozoan cnidarian, Hexactinellid poriferan, Calcareous poriferan, Demosponge, Choanoflagellate.]
Reduction of non-phylogenetic signal
[Figure: bootstrap support (0 to 100) for Chordates, Protostomes, Ecdysozoa, Lophotrochozoa, Bilaterians, Cnidarians and Poriferans under MP, rtREV+Γ and CAT+Γ.]
128 genes, 30,257 positions. Philippe et al. (2009) Curr. Biol.; Philippe et al. (2011) PLOS Biol.
[Figure: two trees (scale bars 0.1) of the same dataset, one under model CAT+Γ and one under model WAG+Γ, relating Fungi, Ichthyosporea, Choanoflagellata, Porifera (Homoscleromorpha, Calcarea, Hexactinellida, Demospongiae), Placozoa, Ctenophora, Cnidaria and Bilateria, with support values on the nodes. Porifera are labelled as a group in one tree; in the other, Ctenophora is listed directly after Choanoflagellata.]
Improvement of phylogenetic resolution
[Figure: true history (nodes 1, 2 and 3) with phylogenetic and non-phylogenetic signal.]
Phylogenomics: phylogenetic signal as well as non-phylogenetic signal are abundant.
To improve resolution, one has to use the same methods as to avoid systematic errors:
• Complex model of sequence evolution
• Rich taxon sampling
• Removal of fast evolving positions and taxa
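The last item can be sketched in a few lines. The talk does not say how evolutionary rates are estimated, so the example below uses the number of distinct residues in a column as a deliberately crude stand-in for a per-site rate; the function names are hypothetical, and the toy alignment reuses the first four sequences of the CAT example alignment.

```python
def column_variability(alignment, col):
    """Crude rate proxy (assumption): number of distinct residues in the
    column, ignoring gaps and missing data."""
    states = {seq[col] for seq in alignment.values()} - {"-", "?", "X"}
    return len(states)


def remove_fastest_sites(alignment, fraction):
    """Return a copy of the alignment without the given fraction of columns,
    dropping the most variable columns first and keeping column order."""
    n_sites = len(next(iter(alignment.values())))
    ranked = sorted(range(n_sites), key=lambda c: column_variability(alignment, c))
    n_keep = n_sites - int(round(fraction * n_sites))
    kept = sorted(ranked[:n_keep])
    return {name: "".join(seq[c] for c in kept) for name, seq in alignment.items()}


alignment = {
    "Man":       "MAEIGRLIEFSAMVDFWQNRC",
    "Frog":      "MAEIGRLVEYSAMVDFWQNRC",
    "Zebrafish": "MADLGKLIDYSALVDFWQNRC",
    "Fly":       "MSDIGKLVEFSPMVEFWQQKC",
}

reduced = remove_fastest_sites(alignment, 0.25)
for name, seq in reduced.items():
    print(f"{name:10s} {seq}")
```

In practice the proxy would be replaced by rates estimated under a model, and the tree would be re-inferred on each reduced alignment to check whether the contested groupings are stable.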
Model CAT+Γ, 47 species, 128 genes, 30,257 positions: Philippe et al. (2009) Curr. Biol.
Model CAT+Γ, 18 species (same sampling as Schierwater et al.): Philippe et al. (2011) PLOS Biol.
[Figure: two trees (scale bars 0.1) relating Choanoflagellata, Porifera (Homoscleromorpha, Calcarea, Hexactinellida, Demospongiae), Placozoa, Ctenophora, Cnidaria and Bilateria, with support values on the nodes. Porifera are labelled as a group in the 47-species tree; in the 18-species tree, Ctenophora is listed directly after Choanoflagellata.]
Model heterogeneity
M A D I G R L I E F S A M V D F W
M G E I G R L V E Y S A M V D F W
M A E L G K L I D Y S A L V D F W
M T D I G K L V E F S P M V E F W
M W D I G R L V E F T P M V E Y W
M S D L A R L V D F T A M V D F W
M Y D L G K L I D F S A M I N F W
M A D I G R L I E F S A M V D Y W
M E D I G R L V E Y S A M V D F W
M R D L G K L I D Y S A L V D F W
[Figure labels: over time; character states; across sites.]
• Heterogeneity of character states
• exchange matrices: Dayhoff, WAG … LG, GTR
• Heterogeneity across sites
• gamma distribution, CAT model
• Heterogeneity over time
• covarion model, change points
Hypothesis: heteropecilly, the temporal variation of the amino acid substitution process at a given site (from the Greek poikillō, "to vary")
Progressive removal of heteropecillous sites
Protocol
Data:
• 13 mitochondrial proteins
• 68 species
• Sites removed in order of increasing PIPn value
• Inference with CAT+Γ4 on the reduced datasets
[Figure: CAT+Γ4 tree with Bilateria (Deuterostomia, Protostomia), Cnidaria, Porifera and Choanoflagellata.]
Roure & Philippe (2011) BMC Evol Biol 11:17
Progressive removal of heteropecillous sites
[Figure, built up over four slides: posterior probability as a function of alignment size for competing arrangements of Deuterostomia, Protostomia, Cnidaria, Porifera and Choanoflagellata; Bilateria is labelled on the first slide.]
Roure & Philippe (2011) BMC Evol Biol 11:17
Progressive removal of sites
[Figure: two panels, fast-evolving sites and heteropecillous sites, each showing posterior probability as a function of alignment size.]
The incorrect grouping of Cnidaria and Porifera is not due to the presence of fast-evolving sites, but to the presence of heteropecillous sites, which is erroneously interpreted as a synapomorphy uniting Cnidaria and Porifera.
Roure & Philippe (2011) BMC Evol Biol 11:17