Phylogenetic Service Set
Web services and workflows to infer and use phylogenetic information in biodiversity studies
Saverio Vicario, CNR-ITB, Bari (Italy)
Meaning of a phylogeny
It is a summary of the evolutionary history of a group of organisms
Topology summarizes the relationships
Branch lengths summarize the expected change along a given section of the tree
The object of this description can be the full organism, groups of organisms, or single hereditary units (SNPs or genes)
Any attribute of an organism that has a heritable component can be "mapped" onto the tree: its history can be inferred from the phylogeny of the species or gene
Contribution to INVA
• Finding the population of origin of invasive species and the modes of the invasion
  • Phylogeography
• Detecting selection acting on alien species, or on native species after alien intervention
  • Molecular Evolution
• Diagnosing an alien species from a mixed sample of individuals
  • Molecular Systematics -> Barcode
• Describing the impact of aliens on a community: biodiversity profiling with phylogenetic diversity (based on species lists or environmental sequencing)
Contribution to ECOS
• Annotation of metagenomes
• Phylogenomics: comparing the divergence of different gene families within and across samples to find which metabolic pathway is more active in a given environment (Phylogenetic inference, Molecular Evolution)
• Biodiversity profiling: phylogenetic differentiation of housekeeping genes across samples to find out the mode of ecosystem formation (Molecular Systematics -> Barcode; Phylogenetic diversity)
• Biodiversity profiling from a species list and a known phylogeny (Phylogenetic diversity)
Overall plan of phylogenetic set
Phylogenetic inference
• Align based on an HMM profile or using a scoring matrix (HMMER 3.0, MUSCLE)
• User interface to describe the model of substitution
• Phylogenetic inference format translator
• Infer phylogeny (MrBayes, RAxML, …)
• Assess convergence of numerical parameters and topology for MCMC/MC inference (CODA R pkg and GeoKS)
• Assess the overall goodness of fit of the phylogenetic inference with a posterior predictive test, and the relative fit with Akaike (HyPhy)
Use of phylogenetic information
• Estimate phylogenetic diversity (Phylocom, …)
• Estimate evolutionary parameters (HyPhy)
Utilities
• Add new sequences to a phylogenetic inference (pplacer)
• Permute, resample, thin tree lists (scripting with R)
Phylogenetic inference
[Figure: the aligned sequences (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG) are observed; the tree and the evolutionary model are guessed through an inference.]
Gathering data
At the moment the data should be user supplied
In the future data gathering should be based on taxonomic, geographic, and trait (i.e. gene) availability
I assume that data input from the taxonomic service set will be possible
Best Practice and Robust Workflow
• BioVeL would like to promote interdisciplinary work by offering robust workflows, so that scientists from other fields can use the state-of-the-art methods of a given discipline
• Phylogenetic inference is easy to misuse, and best practice is often not implemented
Phylogenetic inference pitfall - I
• Depends on the homology calls of the alignment
  • Workflow that gives a conservative quality score for single sites
• Highly dependent on the model (and on the prior, if Bayesian)
  • Need to test the absolute fit of the model to the data
  • Need to compare models
  • Need to help the user in describing the model
• Difference between gene tree and species tree
  • Use species phylogeny tools
  • Check for paralogy
  • Check the homogeneity of the model and check several gene phylogenies
• Is the numerical estimation of the parameters satisfactory?
  • Test the convergence of MCMC/MC
Phylogenetic inference pitfall - II
• The prior on branch lengths is still very problematic
• It is not really possible to produce a robust workflow with Bayesian inference for branch length estimation (molecular clock)
• Explicit demographic models are probably less problematic, because they try to tackle the problem explicitly
Phylogenetic inference pitfall - III
Comparing models
• Comparing models with Akaike is not appropriate under a Bayesian framework; it is probably the state of the art for maximum likelihood, but it is only a relative evaluation
• Good estimates of the Bayes factor are difficult to obtain and not yet standard (see the Phycas and PhyloBayes implementations), and it is still a relative evaluation
• The posterior predictive test and the L statistic seem to be more robust tests, applicable to both maximum likelihood and Bayesian inference
Using mixed models?
Alignment WF for coding sequences
• Given a set of nucleotide coding sequences
• Perform all possible translations, changing frame and genetic code
• Call gene homology with hmmsearch on the Pfam database and find the frame and the compatible genetic codes
• Align the proteins on the protein profile (hmmalign), obtaining site quality scores
• Guide the DNA alignment with the protein alignment and import the quality scores
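The translation step of this workflow can be sketched in Python as below. This is a minimal illustration, not the workflow's actual implementation: only the three forward frames and two NCBI translation tables (1, standard; 2, vertebrate mitochondrial) are covered, and the function names `translate` and `all_translations` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 1 (standard code), codons in the order TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# NCBI translation table 2 (vertebrate mitochondrial) differs in four codons.
VERT_MITO_CODE = {**STANDARD_CODE, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}
GENETIC_CODES = {1: STANDARD_CODE, 2: VERT_MITO_CODE}


def translate(seq, frame=0, code=STANDARD_CODE):
    """Translate seq starting at offset frame (0, 1 or 2); unknown codons become X."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(code.get(codon, "X") for codon in codons)


def all_translations(seq, codes=GENETIC_CODES):
    """Return {(table_id, frame_number): protein} for every code and forward frame."""
    return {
        (table, frame + 1): translate(seq, frame, code)
        for table, code in codes.items()
        for frame in range(3)
    }


if __name__ == "__main__":
    for key, protein in all_translations("ATGTGAGCC").items():
        print(key, protein)
```

Each candidate protein would then be passed to the homology search; the frame and genetic codes that yield a significant hit are the ones retained.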
Other Alignment WF planned
• Generic = MUSCLE + Gblocks
• RNA = Infernal (covariance models, the RNA counterpart of profile HMMs)
• …
Different problems
• Select the correct software and approach for the question
• Describe the model in the input file
• Check for convergence
• Evaluate the model
Select software
[Figure: software placed along a divergence axis that crosses the species barrier. Candidates shown: PhyloBayes?, MrBayes, BEAST?, RAxML, GARLI?, TNT?, MrBayes + BEST. Annotated problems: signal saturation leading to LBA and heterogeneity of rates; mismatch between gene and species tree; demographic complexity.]
Depending on the divergence time, and on whether the history being reconstructed is within or between species, different simplifying assumptions can be used in the model
Oh Evolutionary Model Description Language, where art thou?
• Our first user interface is based on the MrBayes nexus description
• The HyPhy batch language is very rich, but has no priors
• BEAST XML input file …
• …
Details in the model description
[Figure: components of a model description.
Evolutionary model: transition matrix (e.g. GTR, mtREV, HKY); base frequencies (e.g. equal, empirical, estimated); site variation (e.g. equal, gamma, ...).
Groups of sites: S1, S2, S3 over the alignment (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG).
Groups of branches: Topology1, Topology2; B1 = a * B3; B1 <- demographic/geographic model (crossed out).]
Our interface for the model
[Figure: components covered by the interface.
Evolutionary model: transition matrix (e.g. GTR, mtREV, HKY); base frequencies (e.g. equal, empirical, estimated); site variation (e.g. equal, gamma, ...).
Groups of sites: S1, S2, S3 over the alignment (W AGCTGCG, X ACCGGTG, Z AGTTGTG, Y AGTTGCG).
Groups of branches: Topology1, Topology2, …; Topology1 == Topology2 or Topology1 != Topology2; B1 = a * B3.]
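As an illustration of how such a model description could be turned into a MrBayes block, here is a minimal Python sketch. The `SiteGroup` class, the `mrbayes_block` function and the partition name `by_group` are hypothetical and do not reflect the actual interface; only nucleotide HKY and GTR matrices are handled, and the generated commands (`charset`, `partition`, `lset`, `prset`, `unlink`) have not been run against MrBayes here.

```python
from dataclasses import dataclass

# Number of substitution types expected by MrBayes "lset nst=" for two nucleotide models.
NST = {"HKY": 2, "GTR": 6}
# Base-frequency choices mapped to MrBayes "prset statefreqpr=" values.
FREQ_PRIOR = {
    "equal": "fixed(equal)",
    "empirical": "fixed(empirical)",
    "estimate": "dirichlet(1.0,1.0,1.0,1.0)",
}


@dataclass
class SiteGroup:
    """One group of sites with its own evolutionary model (hypothetical structure)."""
    name: str
    first: int
    last: int
    matrix: str = "GTR"         # transition matrix: "HKY" or "GTR"
    base_freq: str = "estimate"  # "equal", "empirical" or "estimate"
    site_var: str = "gamma"      # "equal" or "gamma"


def mrbayes_block(groups, link_topology=True):
    """Render a MrBayes block with one partition per site group."""
    lines = ["begin mrbayes;"]
    for g in groups:
        lines.append(f"  charset {g.name} = {g.first}-{g.last};")
    names = ", ".join(g.name for g in groups)
    lines.append(f"  partition by_group = {len(groups)}: {names};")
    lines.append("  set partition = by_group;")
    for i, g in enumerate(groups, start=1):
        lines.append(f"  lset applyto=({i}) nst={NST[g.matrix]} rates={g.site_var};")
        lines.append(f"  prset applyto=({i}) statefreqpr={FREQ_PRIOR[g.base_freq]};")
    lines.append("  unlink revmat=(all) statefreq=(all) shape=(all);")
    if not link_topology:
        # Topology1 != Topology2: let each group of sites have its own tree.
        lines.append("  unlink topology=(all);")
    lines.append("end;")
    return "\n".join(lines)


if __name__ == "__main__":
    groups = [SiteGroup("S1", 1, 300, "HKY", "empirical"), SiteGroup("S2", 301, 600)]
    print(mrbayes_block(groups))
```

Constraints between branch groups such as B1 = a * B3 are left out of the sketch.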
Convergence
GeoKS for convergence of trees in MCMC (web application: http://mblabproject.it/geoks/ess_options.html)
R pkgs (Coda or Boa) for convergence of continuous parameters
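For continuous parameters, one widely used check is the Gelman-Rubin potential scale reduction factor, which compares between-chain and within-chain variance. The Python sketch below shows the basic statistic only, as an illustration; it omits the sampling-variability correction that dedicated packages may apply, and the function name `gelman_rubin` is hypothetical.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equal-length lists of samples, one list per independent run.
    Values close to 1 suggest that the runs sample the same distribution.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length (>= 2 samples)")
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n    # pooled variance estimate
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    run1 = [0.10, 0.12, 0.11, 0.13, 0.09, 0.12]
    run2 = [0.11, 0.10, 0.12, 0.12, 0.10, 0.13]
    print(round(gelman_rubin([run1, run2]), 3))
```

Tree topologies are not continuous parameters, which is why a separate tool such as GeoKS is needed for them.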
GeoKS
• Based on Billera's tree space
• Compares the distributions of Billera tree distances (topology + branch lengths) of two clouds of trees from a mean tree
• Second round of revision at Systematic Biology
Evaluating model
Relative comparison
• Akaike Information Criterion
• L of Ibrahim (to be implemented with HyPhy)
Absolute assessment
• Posterior predictive test, to be implemented in HyPhy
Not so keen to include MrModeltest: it puts too much emphasis on selecting among transition matrices that are all submodels of the same GTR
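The relative comparison with the Akaike Information Criterion, AIC = 2k - 2 ln L, can be sketched in Python as follows. The log-likelihoods and parameter counts in the usage part are invented illustration values, not results, and the function names are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(scores):
    """Turn {model: AIC} into relative weights that sum to 1."""
    best = min(scores.values())
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


if __name__ == "__main__":
    # Hypothetical fits of two nested models on the same tree and alignment.
    fits = {"HKY": (-2350.0, 4), "GTR": (-2340.0, 8)}
    scores = {model: aic(lnl, k) for model, (lnl, k) in fits.items()}
    for model, weight in akaike_weights(scores).items():
        print(model, scores[model], round(weight, 3))
```

The weights only rank the candidate models against each other; they say nothing about whether the best model fits the data in absolute terms, which is the role of the posterior predictive test.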
What transition matrix?
Nucleotide models require a 4×4 matrix
Some RNA models use a 16×16 matrix
Protein models require a 23×23 matrix, but these are often pre-calculated (e.g. Blosum62)
Codon models use a 61×61 matrix
61×61 matrices are quite CPU-time consuming and are generally used only when the tree is known, but GPU availability makes these models more accessible
Codon models are much more realistic for coding sequences and are the only way to tease apart the different selective forces (ω, dN, dS)
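The nucleotide, RNA doublet and codon state counts follow directly from the alphabet, as the short Python sketch below shows. It assumes the standard genetic code, with its three stop codons excluded from the codon state space; the helper names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

nucleotides = list(BASES)
# Paired RNA sites are modelled as doublets: every ordered pair of bases.
rna_doublets = ["".join(pair) for pair in product(BASES, repeat=2)]
# Codon models use the sense codons only.
sense_codons = [
    codon
    for codon in ("".join(triplet) for triplet in product(BASES, repeat=3))
    if codon not in STOP_CODONS
]


def matrix_entries(n_states):
    """Number of cells in an n x n transition matrix."""
    return n_states * n_states


def reversible_free_rates(n_states):
    """Free exchangeability rates of a general time-reversible matrix (one rate fixed)."""
    return n_states * (n_states - 1) // 2 - 1


if __name__ == "__main__":
    for label, states in [("nucleotide", nucleotides),
                          ("RNA doublet", rna_doublets),
                          ("codon", sense_codons)]:
        n = len(states)
        print(label, n, matrix_entries(n), reversible_free_rates(n))
```

The growth from 16 to 3721 matrix cells gives a feeling for why codon models are so much heavier on the CPU than nucleotide models.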
Use of Ontology in Workflow
• Connect the input and output of two workflows that are semantically coherent
• Substitute services within a workflow, or make them redundant
What Ontology?
• EDAM (http://edamontology.sourceforge.net/)
  • Data and methods of general bioinformatics, including basic phylogeny
• CDAO (https://www.nescent.org/wg_evoinfo/CDAO)
  • Data only, but highly specialized on comparative studies and phylogeny
What to do with it
• Annotate the input and output of services/workflows
Getting already inferred phylogeny
• Where to find them?
  • TreeBase/NESCent web services plan (https://www.nescent.org/wg/evoinfo/index.php?title=PhyloWS)
  • The REST service is not there yet, but Phylr is a first sketch of it
• How likely is it that a phylogeny can be re-used?
  • The taxon lists need to match exactly! Taxonomic services can check the match, taking synonymy into account
• Possible tree operations to match the taxon list:
  • Subsetting or pruning (easy and clean)
    • Tree objects of several scripting languages could do the job
  • Patching several trees or making a supertree (difficult and choice dependent)
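Pruning a tree down to a taxon list can be sketched in a few lines of Python. This is a simplified illustration: the tree is a nested tuple with taxon names as leaves, branch lengths are ignored (a real implementation would add up the lengths of merged branches), and the names `leaves` and `prune` are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon names found in a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def prune(tree, keep):
    """Keep only the taxa in keep, collapsing nodes left with a single child.

    Returns None when no requested taxon is present in the tree.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (prune(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return tuple(kept)


if __name__ == "__main__":
    published_tree = ((("W", "X"), "Y"), "Z")
    wanted = {"W", "Y", "Z", "Q"}
    print("not in tree:", wanted - leaves(published_tree))
    print(prune(published_tree, wanted))
```

The set difference before pruning is where a taxonomic service would come in, to decide whether a missing name is truly absent or only a synonym.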
Phylogenetic Diversity
One general formula includes Rao's and Faith's phylogenetic diversity (PD) and a corrected version of Allen's PD that better generalizes Shannon entropy
I implemented the formula in a Python script in order to estimate phylogenetic beta diversity across communities as the mutual information of the communities
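Two of the measures named here can be illustrated with a short Python sketch using their standard definitions: Faith's PD as the total length of the branches leading to the taxa present, and the phylogenetic entropy of Allen and colleagues as the sum over branches of -L p ln p, where p is the summed relative abundance below the branch. This is not the general formula of the slide nor the author's script; the branch-list representation and the function names are assumptions made for the example.

```python
import math

# Rooted tree ((A:1,B:1):1,C:2) written as (branch length, tips below the branch).
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (1.0, {"A", "B"}),
    (2.0, {"C"}),
]


def faith_pd(branches, present):
    """Sum the lengths of all branches with at least one descendant tip present."""
    return sum(length for length, tips in branches if tips & present)


def phylogenetic_entropy(branches, abundance):
    """Allen-style entropy: -sum(L * p * ln p), p = relative abundance below a branch."""
    total = sum(abundance.values())
    entropy = 0.0
    for length, tips in branches:
        p = sum(abundance.get(tip, 0.0) for tip in tips) / total
        if p > 0:
            entropy -= length * p * math.log(p)
    return entropy


if __name__ == "__main__":
    print(faith_pd(BRANCHES, {"A", "B"}))
    print(faith_pd(BRANCHES, {"A", "C"}))
    print(phylogenetic_entropy(BRANCHES, {"A": 1, "B": 1, "C": 2}))
```

With all branch lengths set to 1 on a star tree, the entropy reduces to the ordinary Shannon entropy of the abundances.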
Phylogenetic diversity
• It was recently considered, in a GEO BON meeting, as an Essential Biodiversity Variable (although in a more general sense than the one used here)
• It allows one to describe the amount of variation within a sample, but also where in the tree, and by how much, samples are differentiated
• It could be a powerful tool to summarize environmental sequencing data
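Example across 3 localities: NI, CI, SI
Anne Chao et al. 2010, Phil. Trans. R. Soc. B 365:3599-3609
[Figure: tree whose terminal branches carry the abundances p1, p2, p3, p4, p5 and whose internal branches carry the sums p1+p2, p4+p3 and p4+p3+p5; one branch is labelled "Only CI and SI".]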
Hypothesis of workflow on phylogenetic differentiation across localities
[Flowchart; boxes in the order they appear on the slide:]
• Define taxonomic group
• Clean environmental sequences
• Get reference sequences from NCBI/EMBL/BOLD
• Build reference alignment
• Filter locus and taxa with HMM profile
• Phylogenetic inference
• Add sequences to alignment
• Add sequences to phylogeny (pplacer)
• Describe region differentiation with phylogenetic diversity
• Identify species, alpha and beta diversity
Other post phylogenetic inference applications
Reconstructing the past history of a given trait on a species phylogeny (e.g. the R pkg ape, but BayesTraits or Phylocom could be more interesting)
Biogeography: comparison of phylogenies across groups of species to infer geographical barriers and events with a general impact on biodiversity
ChronoBiogeography: the same thing but with dating, to distinguish the effects of recurrent climate change
…
Acknowledgments
• CNR - ITB
  • Bachir Balech
  • Arianna Consiglio
  • Giorgio Grillo
• INFN-IGI (Italian Grid Initiative)
  • Giacinto Donvito
  • Pasquale Notarangelo
Testing workflow
Model definition GUI
ICT infrastructure