Phylogenetic analysis based on a collection of mitochondrial proteins Audrey Noël, Daniel Darmon
Data • 36 species • bacteria, fungi, green plants, metazoa, protists (red algae, stramenopiles, Ichthyosporea, choanoflagellates) • Sequences were already translated • 11 mitochondrial proteins • atp6, atp9, cob, cox1, cox2, cox3, nad1, nad3, nad4, nad4L, nad5 • Stramenopile: Phytophthora • Ichthyosporea: Amoebidium • Choanoflagellate: Monosiga
Plan of the project [Flow diagram] File parsing → Muscle → Gblocks → Reformatting → Super-matrix (Paup, Phyml, ProML, MrBayes, Protpars, Protdist, Puzzle) / Supertree (Clann) → Tree-puzzle
Muscle program • Multiple sequence comparison by log-expectation • Creates multiple alignments of amino acid or nucleotide sequences • A range of options allows optimization for accuracy, speed, or some compromise between the two • Default parameters are those that gave the best average accuracy in the program author's tests • These tests show that MUSCLE can achieve both better average accuracy and better speed than CLUSTALW or T-Coffee if the number of sequences is less than 50 Muscle - Gblocks - Reformatting - Supertree / Super-matrix – Tree-puzzle
Output file from Muscle
>Hyaloraphidium curvatum mt atp6 ; 237 aa
-----------MTFYSPMEAYQVIPVFG-------------PVNDVAIFLVIGFS----FLLIL----GLGLSK-MQTVVPSNWYLAIEAAHTTIFTMVRTYI--------G--PAY----AYWLPFLFTLFFGLFFSNVFGLLPYSTTPTTHLIITFNLAVFLLVTAIANGFRRYGYAMMGLFIPSGTPLPLIPMLVVVEMLAYVTRIIALGIRISVNMITGHTLVKVIGGFLWE--AFEGGTNIMI---LILPMVLLTVFLVLEVLIAYLQAYIYTFICMITIKDFL-----------
>Harpo105 mt atp6 ; 237 aa
-----------MFCYSPMEAYSVINLGS-------------GFNDVAIFLLFAFG----FITLL----SYALTA-NQTLVPSNWFLGLETYHVTLYSMVQTYI--------G--SKA----GAWFPFIYTLFSALLFSNLFGLLPYSTTPTTHLIITFNLALFLMVTAIANGFRRYRYAIFGVFIPAGTPLGLIPLIVIVEVLAYITRISSLGIRITVNMVTGHTLVKVVSGFIYE--GFLGGTSILI---LALPVALLTVFLILELLIAYLQAYIFTFISCITIKDFS---------
>Monoblepharella mt atp6 ; 237 aa
-----------MFSYSPMEAYAVINL-------------GYGFTDVAIFLIFAFG----FLTFL----GYALTS-NQTLVPSNWFLGLETYHVTLYTMVRTYI--------G--SKA----GAWFPFLYTLFTGLLFSNLFGLLPYSTTPTTHLIITFNLALFLMITAIANGFRRYRYAIFGLFIPAGIPVALIPVISVVEVLAYITRISSLGIRITVNMVTGHTLVKVVSGFIYE--GFLGGTSVII---LALPVALLAVFLILELLIAYLQAYIFTFISCITIKDFS-----------
Gblocks program • Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis • Eliminates poorly aligned positions and divergent regions of an alignment of DNA or protein sequences • These positions may not be homologous or may have been saturated by multiple substitutions and it is convenient to eliminate them prior to phylogenetic analysis
Output file from Gblocks
>Acanthamoeba
PLVLFVFSFI LFANLIGLLP YGFTITGHII FTFQIAFSLF FGITLFFNNK TEFFNLFVPS GVPKPLIPFL VVIEVVSYLI RPFSLSVRLF ANMLAGHTLL NILSAFPLLF IVFIIVLEFC IAIVQAYIFS ILTCIYLND
>Allomyces0
PLIFTFFSFV FISNILGMIP YSFTPTSHIS VTLGLSIAIM IGVTLFSKHQ LDFFSLFVPK GTPLALVPLL VLIEFISYSA RAFSLALRLT ANVSAGHCLF GVISALPLAV LVVLYGLELL VALLQSYVFT LLTCSYLAD
>Amoebidium
PFIFTLFTYI VVLNLMGMVP YVFSATAHIS VALALSFGIW FGVTLFSLHG INFLSMFMPQ GAPMALAPLL VMIELVSYSA RAISLGVRLA ANISAGHLLL AILSGFPALV IFAMSGLELA VAVIQAYVFT LLTCIYIND
>Arabidopsis
PCILVTFLFL LFCNLQGMIP YSFTVTSHFL ITLALSFSIF IGITIFQRHG LHFFSFLLPA GVPLPLAPFL VLLELISYCF RALSLGIRLF ANMMAGHSLV KILSGFPLFI VLALTGLELG VAILQAYVFT ILICIYLND
>Aspergillus
PFIYALFIFI LVNNLIGMVP YSFASTSHFI LTFSMSFTIV LGATFLQRHG LKFFSLFVPS GCPLGLLPLL VLIEFISYLS RNVSLGLRLA ANILSGHMLL SILSGFPLAF IIAFSGLELA IAFIQAQVFV VLTCSYIKD
Reformatting • Each phylogenetic program has its own quirks regarding input format • It was therefore frequently necessary to convert our files into other formats • Clustal, Sreformat and Readseq were very useful
Bootstrap • Statistical method for testing the reliability of a dataset/tree: pseudoreplicate datasets are created by randomly sampling, with replacement, the columns of the original character matrix, giving new matrices of the same size as the original • Steps: • Characters are resampled to create many bootstrap replicate datasets • Each dataset is analysed with parsimony, ML … • Agreement among the resulting trees is summarized, usually in a majority-rule consensus tree • The frequencies of occurrence of groups (bootstrap proportions) are a measure of support for these groups • A bootstrap value can then be assigned to each internal node of the original tree: the number of times the branching pattern seen at that node was reproduced in the replicate trees
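The resampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Seqboot implementation: `bootstrap_replicate`, `bootstrap_proportions` and `closest_pair` are hypothetical names, the alignment is toy data, and `closest_pair` is a deliberately trivial stand-in for a real tree-building program such as Protpars or PhyML.

```python
import random
from collections import Counter

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same size as the original)."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}

def bootstrap_proportions(alignment, infer_groups, n_replicates=100, seed=0):
    """Fraction of replicates in which each group (frozenset of taxa) is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(infer_groups(bootstrap_replicate(alignment, rng)))
    return {group: n / n_replicates for group, n in counts.items()}

def closest_pair(alignment):
    """Toy stand-in for tree inference: report only the most similar pair of taxa."""
    names = sorted(alignment)
    pairs = ((a, b) for i, a in enumerate(names) for b in names[i + 1:])
    best = min(
        pairs,
        key=lambda p: sum(x != y for x, y in zip(alignment[p[0]], alignment[p[1]])),
    )
    return [frozenset(best)]

# Toy alignment (hypothetical data): A and B are identical, C differs at every site.
toy = {"A": "MTFYSPME", "B": "MTFYSPME", "C": "LAGKWQRD"}
support = bootstrap_proportions(toy, closest_pair, n_replicates=50, seed=1)
replicate = bootstrap_replicate(toy, random.Random(0))
```

In a real analysis `infer_groups` would return every clade of the tree inferred from the replicate, and the proportions would be mapped onto the nodes of the original tree.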
Plan for supertree (from each protein file from Gblocks) • Without bootstrap: Protpars → Clann; PhyML → Clann • With bootstrap: Seqboot → Protpars → Consense (strict or majority rule not extended) → Clann
Supertree without bootstrap • Uses the Protpars and PhyML programs
Protpars • Comes from the PHYLIP package • Constructs trees using the parsimony method • Parsimony • selects the tree that requires the minimum number of character changes to explain the observed data
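As an illustration of "minimum number of character changes", here is a minimal sketch of Fitch's small-parsimony count on a fixed rooted binary tree. It assumes unit cost for every amino acid change, which is a simplification and not the step-counting scheme Protpars applies; the function names, trees and sequences are hypothetical.

```python
def fitch(tree, states):
    """Return (possible root states, number of changes) for one character.

    tree is a taxon name (str) or a 2-tuple of subtrees;
    states maps each taxon name to its state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # No shared state: one change is required on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Toy data (hypothetical): two candidate topologies for four taxa.
toy = {"A": "PLV", "B": "PLI", "C": "PFI", "D": "AFI"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, toy)
score_2 = parsimony_score(tree_2, toy)
```

Here `tree_1` needs fewer changes than `tree_2`, so parsimony prefers it; a program such as Protpars additionally has to search over topologies.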
Phyml • Constructs the tree with the maximum likelihood (ML) method • ML • What is the probability (likelihood) that a given tree could have produced the observed data, under a given model? • Allows the incorporation of the processes of character evolution
Clann program • The input file of source trees must be in a PHYLIP-like (Newick) tree format, without: • any branch lengths • any internal node labels • Matrix representation using parsimony (MRP) • a method that combines information from multiple trees
Matrix representation using parsimony • Creates a matrix whose characters refer to the topologies of the source trees • The method examines each internal branch of each rooted source tree and assigns • 1 to any taxon contained within the clade defined by that internal branch • 0 to any taxon that is contained within the source tree, but not in the clade • ? to any taxon not present in the source tree • Each column of the matrix represents one internal node in one of the source trees • A supertree is constructed from this matrix using a heuristic search
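The 1 / 0 / ? coding described above can be sketched as follows. This is an illustrative Python version, not Clann's code: trees are written as nested tuples rather than Newick strings, the function names are hypothetical, and the root node is skipped because its clade contains every taxon of the tree.

```python
def leaves(tree):
    """Set of taxon names below a node (a str leaf or a tuple of subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree, is_root=True):
    """Yield the taxon set defined by each internal branch (root excluded)."""
    if isinstance(tree, str):
        return
    if not is_root:
        yield leaves(tree)
    for child in tree:
        yield from clades(child, is_root=False)

def mrp_matrix(source_trees):
    """Return {taxon: string of '1', '0', '?'}, one column per internal branch."""
    all_taxa = sorted(frozenset().union(*(leaves(t) for t in source_trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    matrix[taxon] += "?"   # taxon missing from this source tree
                elif taxon in clade:
                    matrix[taxon] += "1"   # taxon inside the clade
                else:
                    matrix[taxon] += "0"   # in the tree, but outside the clade
    return matrix

# Two toy source trees (hypothetical) with partly overlapping taxon sets.
source_trees = [
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "E")),
]
matrix = mrp_matrix(source_trees)
```

The resulting matrix is then analysed with parsimony like any other character matrix, which is where the heuristic search comes in.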
MRP [Figure: illustration of matrix representation using parsimony] Source: 1
Heuristic searches • Best method when the dataset contains more than 10 taxa • Tree bisection and reconnection (TBR) was used, as described and implemented in PAUP
Tree bisection and reconnection • The tree is bisected along a branch, yielding two disjoint subtrees • The subtrees are then reconnected by joining a pair of branches, one from each subtree • All possible bisections and pairwise reconnections are evaluated • If a rearrangement succeeds in finding a better tree, a new round of rearrangements is started from this new tree
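To see why exhaustive search stops being practical at around ten taxa, the number of unrooted binary trees for n taxa, (2n-5)!!, can be computed directly. A small Python sketch (the function name is hypothetical):

```python
def n_unrooted_trees(n_taxa):
    """Number of unrooted binary trees on n_taxa leaves: 3 * 5 * ... * (2n - 5)."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 36):
    print(n, n_unrooted_trees(n))
```

Ten taxa already give about two million topologies, and the count for the 36 species used here is far beyond what could be enumerated, hence rearrangement-based heuristics such as TBR.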
Supertree with bootstrap using parsimony • Seqboot program from the PHYLIP package • Protpars from PHYLIP • Consense: • The consensus tree consists of monophyletic groups that occur as often as possible in the data • Strict / majority rule not extended / majority rule extended • Clann
Consensus trees • Majority rule extended • Any set of species that appears in more than 50% of the trees is included • The program then considers the other sets of species in order of the frequency with which they have appeared, and adds those that are compatible with the tree until it is fully resolved • Majority rule not extended • A set of species is included in the consensus tree if it is present in more than half of the input trees • Strict • A set of species must appear in all input trees to be included in the strict consensus tree
Results of supertrees • All the trees shown here are rooted with a group of 3 bacterial species • Branch lengths were calculated with ML from the concatenated sequences • For support, bootstrap values were generated with the Clann program
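The strict and (non-extended) majority-rule criteria can be sketched as a simple count over groups. This is an illustration, not the Consense implementation: each tree is represented only by its set of groups (frozensets of taxon names), the names are hypothetical, and the extended majority rule is left out because it also needs a compatibility check.

```python
from collections import Counter

def consensus_groups(trees, mode="majority"):
    """Return the groups kept by a strict or majority-rule (not extended) consensus.

    trees is a list of trees, each given as a collection of frozensets of taxa.
    """
    counts = Counter(group for tree in trees for group in set(tree))
    n_trees = len(trees)
    if mode == "strict":
        return {group for group, n in counts.items() if n == n_trees}
    if mode == "majority":
        return {group for group, n in counts.items() if n > n_trees / 2}
    raise ValueError("mode must be 'strict' or 'majority'")

# Three toy input trees (hypothetical), each listed by its groups.
AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
trees = [{AB, ABC}, {AB, ABC}, {AB, ABD}]
strict = consensus_groups(trees, mode="strict")
majority = consensus_groups(trees, mode="majority")
```

With these inputs the strict consensus keeps only the group found in all three trees, while the majority rule also keeps the group found in two of the three.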
ML without bootstrap before supertree construction (bootstrap values < 50 not shown) [Tree figure; labelled groups: Bacteria, Fungi, Metazoa, Red algae, Green plants, Stramenopiles]
Parsimony without bootstrap before supertree construction [Tree figure; labelled groups: Bacteria, Red algae, Green plants, Stramenopiles, Fungi, Metazoa]
Parsimony with bootstrap, Consense majority rule not extended (bootstrap values < 50 not shown) [Tree figure; labelled groups: Bacteria, Green plants, Red algae, Stramenopiles, Fungi, Metazoa]
Parsimony with bootstrap, Consense strict [Tree figure; labelled groups: Bacteria, Fungi, Metazoa, Stramenopiles, Red algae, Green plants]
Plan for super-matrix • Without bootstrap: output from concatenation → MrBayes, PAUP, PhyML, Puzzle, Protdist, Protpars, ProML → trees • With bootstrap: output from concatenation → Seqboot (or bootstrap internal to the program) → many trees in one file → MrBayes, PAUP, PhyML, Puzzle, Protdist, Protpars, ProML → trees
Super-Matrix • The second method of tree reconstruction in this project was the super-matrix method • In this method all protein sequences of the same species are concatenated, and protein sequences missing from some species are replaced by question marks • The reconstructed tree can remain accurate even with a surprisingly high proportion of missing data
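The concatenation step can be sketched as below. This is a minimal Python illustration, assuming each per-gene alignment is already trimmed so that all its sequences have the same length; `build_supermatrix` is a hypothetical name and the sequences are toy fragments.

```python
def build_supermatrix(alignments):
    """Concatenate per-gene alignments species by species.

    alignments maps gene name -> {species: aligned sequence}.
    A species lacking a gene receives '?' for every column of that gene.
    """
    species = sorted({name for aln in alignments.values() for name in aln})
    supermatrix = {name: "" for name in species}
    for gene, aln in alignments.items():
        gene_length = len(next(iter(aln.values())))
        for name in species:
            supermatrix[name] += aln.get(name, "?" * gene_length)
    return supermatrix

# Toy input (hypothetical fragments): Allomyces has no cob sequence.
alignments = {
    "atp6": {"Acanthamoeba": "PLVLF", "Allomyces": "PLIFT"},
    "cob": {"Acanthamoeba": "MKN"},
}
supermatrix = build_supermatrix(alignments)
```

Every row of the result has the same length, so the matrix can be reformatted and passed to any of the tree-building programs.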
MRBAYES • This program performs Bayesian inference: it explores the space of possible trees by a random walk (Markov chain Monte Carlo) until the chain converges • Because of this it is not possible to know ahead of time when the program will begin to converge onto a final tree • It is possible, however, to identify afterwards where the ‘burn-in’ phase ended and to study only the trees sampled past that point
MrBayes [Tree figure; labelled groups: Fungi, Metazoa, Stramenopiles, Green plants, Red algae, Bacteria, Harpochytr]
PROTDIST • This program uses protein sequences to compute a distance matrix, under four different models of amino acid replacement. It can also compute a table of similarity between the amino acid sequences. • The distance for each pair of species estimates the total branch length between the two species, and can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. • This is an alternative to the use of the sequence data itself in the parsimony program PROTPARS.
PUZZLE • PUZZLE (Tree-puzzle) is a computer program that reconstructs phylogenetic trees from molecular sequence data by maximum likelihood, using quartet puzzling; it can also compute maximum likelihood pairwise distances. • It uses likelihood mapping to investigate the support of a hypothesized internal branch without computing an overall tree and to visualize the phylogenetic content of a sequence alignment.
Distance [Tree figure; labelled groups: Bacteria, Green plants, Red algae, Stramenopiles, Metazoa, Fungi, Harpochytr]
PHYML • See the earlier Phyml slide (maximum likelihood method)
PROML • This program implements the maximum likelihood method for protein amino acid sequences. It uses either the Jones-Taylor-Thornton or the Dayhoff probability model of change between amino acids. • This program uses a Hidden Markov Model (HMM) method of inferring different rates of evolution at different amino acid positions • The user can specify that there are a number of different possible evolutionary rates • Slow, however
Max Likelihood (PhyML, ProML) [Tree figure with bootstrap values; labelled groups: Fungus, Harpochytr, Metazoa, Stramenopiles, Green plants, Red algae, Bacteria]
PAUP and PROTPARS • Parsimony-based methods • These select the tree that requires the minimum number of character changes to explain the observed data • ProtPars takes into account the ‘rules’ of the genetic code, so amino acid changes are counted according to the nucleotide substitutions they require
Parsimony [Tree figure; labelled groups: Bacteria, Green plants, Red algae, Stramenopiles, Metazoa, Fungi]
Likelihood ratio test • Done with the Tree-puzzle program • Statistical test of the goodness of fit between two models • Used to compare different trees estimated under the same likelihood model
Likelihood ratio test
COMPARISON OF USER TREES
Tree            log L       difference  S.E.       p-1sKH      p-SH
--------------------------------------------------------------------
PhyML           -12958.22    17.63      10.5535    0.0470 -    0.5700 +
Protp.          -12957.19    16.61      11.7588    0.0950 +    0.6020 +
Protp.Bs.MRne   -12957.51    16.93      11.6506    0.0790 +    0.5910 +
Protp.Bs.S      -13476.82   536.23      64.8718    0.0000 -    0.0000 -
Bayes           -12950.25     9.67       7.5080    0.1180 +    0.7770 +
ProtD.sing      -12948.18     7.60       6.7313    0.1310 +    0.8480 +
ProdD.Bs        -12951.33    10.75       7.4288    0.0860 +    0.7420 +
Puzz.sing       -12942.04     1.46       5.0720    0.3660 +    0.9850 +
Puzz.Bs         -12943.63     3.05      10.1913    0.3930 +    0.9240 +
PhyML.sing      -12940.58     0.00      <---- best 1.0000 +    1.0000 +
PhyML.Bs        -12941.99     1.40       4.5803    0.3620 +    0.9700 +
ProML.sing      -12943.46     2.88       9.9232    0.3830 +    0.9120 +
ProML.Bs        -12943.26     2.68       9.5717    0.3880 +    0.9520 +
Paup.sing.Bs    -12964.13    23.55      13.7457    0.0440 -    0.4930 +
Protp.sing      -12943.13     2.55      11.8216    0.4070 +    0.8910 +
Protp.Bs        -12947.66     7.08      11.9463    0.2840 +    0.8320 +
ML: maximum likelihood • Bs: bootstrap • S: Consense strict • MRne: Consense majority rule not extended
Conclusion • The many methods used all resulted in different topologies • However, the major groups seem to cluster well in both categories of tree reconstruction • The exception is the supertree built using parsimony with Consense strict, which leaves many nodes unresolved; on the other hand, the groups that do emerge are present in every input tree • PhyML appears to have given the most likely tree reconstruction • The bootstrap values of the supertree seem to be worse than those of the super-matrix • With the supertree approach, each tree is computed from a single gene, which contains less information than all the genes combined, as in the super-matrix. The phylogenetic signal is therefore stronger in the super-matrix and its results are better • More genes and more species are needed to further test the methods and see if the uncertainty can be resolved
Acknowledgements • Thank you, Naiara!
References • 1. Creevey C.J. and McInerney J.O. (2004) Clann: investigating phylogenetic information through supertree analyses. Bioinformatics.