Introduction to phylogenomics

Julie Thompson, Laboratory of Integrative Bioinformatics and Genomics, IGBMC, Strasbourg, France (julie@igbmc.fr)

Phylogenomics • A combination of : • genomics (study of function and structure of genes and genomes) • molecular phylogenetics (study of evolutionary relationships among organisms) • Two different aspects : • using phylogenetic data to infer functions for DNA and protein sequences (Eisen. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998) • using genomic data to infer phylogenetic relationships (species trees) and to gain insights into the mechanisms of molecular evolution (O'Brien and Stanyon. Phylogenomics. Ancestral primate viewed. Nature 1999)

1. Phylogeny-based functional inference • Homology based methods • Non-homology based methods

Phylogeny-based functional inference • Used in molecular biology, genetics, development, behavior, epidemiology, ecology, systematics, conservation biology, forensics… • draw structural/functional inferences from the structure of the tree or from the way the character states map onto the tree • use these clues to build hypotheses and models of important events, systems, predict behavior, etc. Reviewed in: Brown & Sjölander, PLoS Comput Biol. 2006 Levasseur et al, Evolutionary Bioinformatics, 2008

Phylogeny-based functional inference • A two-step process: 1. Data & information inference, 2. Knowledge extraction • Propagation and modelling from model systems (human, mouse, drosophila, yeast …) to new systems (partial experimental data) [Figure labels: Phylogenetic inference; Sequence; Structure; Function; Evolution; Complexes & Networks: copresence/coabsence, fusion/fission; Interactome: interologous approach; Interfaceome: conserved residue interactions; Promotome: phylogenetic footprint; Transcriptome, proteome…: similar expression]

Propagation from model systems • Classical method : similarity-based functional annotation (from Blast best hit) • Perform Blast search to detect similar sequences • Transfer function from highest scoring sequence with known function • Errors : • gene duplications (ortholog/paralog) • multi-domain proteins • existing database errors [Figure: example hits human, mouse1, mouse2, worm, yeast]

Propagation from model systems • Classical method : similarity-based functional annotation (from Blast best hit) • Problems : • distantly related sequences may have different functions • spurious hits in low complexity regions • propagation of existing database annotation errors • Example : complex domain organisation

Problems : domain organisation [Figure: domain organisation of SW:Y449_MYCGE, SW:Y663_MYCPN, SW:SYFB_IDILO and SPT:A5IAL4_LEGPC; domains shown: RNA binding domain, phenylalanyl-tRNA synthetase]

Annotation errors • Sequence prediction errors : • 65% of the sequences are in silico predictions • 44% of eukaryote predicted proteins are partially incorrect: at least one suspicious indel or divergent segment (Bianchetti et al, 2005) • Function annotation errors : • 66% of sequences in the UniProt database have GO annotations, but only 3% have evidence codes indicating experimental support (Krishnamurthy et al, 2006) • 10-30% of genome functional annotations are erroneous (Devos, Valencia, 2000)

Propagation from model systems • Phylogeny-based inference • Perform Blast search to detect similar sequences • Perform multiple alignment of sequences representing potential homologs • Construct phylogenetic tree and identify orthologs • Infer function from set of orthologs, domain organisation, conserved motifs (also 3D structure, etc.) [Figure: Blast hits, alignment and tree of human, mouse1, mouse2, worm and yeast sequences, with duplication and fusion events marked]

Assumption • We can identify a set of homologous sequences and differentiate orthologs from paralogs • Orthologous sequences (which diverged by speciation) are more reliable predictors of protein function than paralogous sequences (which diverged by gene duplication) [Figure: ancestor gene A; after speciation, mouse gene A and human gene A are orthologs; after duplication, human gene A and human gene A’ are paralogs]

Define orthologous groups • Pairwise orthology: reciprocal best hits (RBH) • Inparanoid (Remm et al., 2001) • COGs: Clusters of Orthologous Groups (Tatusov et al., 1997) • OrthoMCL (Li et al., 2003) • EggNOG (Jensen et al., 2008)
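The reciprocal best hit rule can be sketched in a few lines. This is a minimal illustration, not the procedure used by Inparanoid, COGs, OrthoMCL or EggNOG; the score dictionaries, gene names and scores below are hypothetical stand-ins for parsed Blast output.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is a dict {(query, subject): score}; this input format is an
    assumption made for the sketch, not a Blast output format.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores: hA and hA2 are human paralogs, mA and mB mouse genes.
human_vs_mouse = {("hA", "mA"): 950, ("hA", "mB"): 400,
                  ("hA2", "mA"): 700, ("hA2", "mB"): 300}
mouse_vs_human = {("mA", "hA"): 940, ("mA", "hA2"): 690,
                  ("mB", "hA"): 410, ("mB", "hA2"): 290}

rbh_pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
# rbh_pairs is [("hA", "mA")]: hA2 also hits mA best, but not reciprocally.
```

Only one pair survives here, which also shows the limitation of the rule: each gene can have at most one pairwise ortholog per species.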

Problems leading to wrong orthology assumptions: • varying rates of evolution • sequencing errors, non-predicted sequences • gene duplication followed by differential gene loss [Figure: after a duplication, sub-families A and B (mouse, human, worm, fly) and sub-families X and Y (rat, human), with gene loss in human; resulting RBH pairs: human B with mouse A, human X with rat Y]

Define orthologous groups • Tree-based orthology: build a phylogenetic tree of a group of genes and compare the gene tree to the species tree to define speciation and duplication events • Resampled Inference of Orthologs (RIO) (Zmasek and Eddy, 2002) • Orthostrapper (Storm and Sonnhammer, 2002) • Levels Of Orthology From Trees (LOFT) (van der Heijden et al, 2007) • Example: G protein-coupled receptors [Figure: two trees, each placing an unknown sequence; prediction: opioid receptor; more general prediction: GPCR of unknown specificity]
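A simplified sketch of tree-based orthology follows. It does not reconcile the gene tree with a species tree as RIO, Orthostrapper or LOFT do; instead it assumes a species-overlap heuristic: a node whose two subtrees share a species is treated as a duplication, otherwise as a speciation. The nested-tuple tree format and the `species_gene` naming scheme are hypothetical.

```python
def leaves(tree):
    """List the gene names under a node of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species_of(gene):
    """Assumed naming convention: 'species_gene', e.g. 'human_A'."""
    return gene.split("_")[0]


def ortholog_pairs(tree):
    """Pairs of genes whose last common ancestor is a speciation node."""
    if isinstance(tree, str):
        return []
    left, right = tree
    left_genes, right_genes = leaves(left), leaves(right)
    left_species = {species_of(g) for g in left_genes}
    right_species = {species_of(g) for g in right_genes}
    pairs = []
    if not (left_species & right_species):  # no shared species: speciation
        pairs = [(a, b) for a in left_genes for b in right_genes]
    return pairs + ortholog_pairs(left) + ortholog_pairs(right)


# Hypothetical gene tree: a duplication at the root, then two speciations.
gene_tree = (("human_A", "mouse_A"), ("human_A2", "mouse_A2"))
orthologs = ortholog_pairs(gene_tree)
# orthologs is [("human_A", "mouse_A"), ("human_A2", "mouse_A2")]
```

Unlike reciprocal best hits, this approach does not pair human_A with mouse_A2, because the two are separated by the root duplication.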

Large scale analysis pipelines • FIGENIX (Gouret et al, 2005): automatic pipeline for structural/functional annotation and phylogeny • SIFTER (Engelhardt et al, 2005): statistical inference algorithms to propagate function annotations within a phylogeny • PhyloFacts (Krishnamurthy et al, 2006): database of protein families, integrating different predictions and experimental data in a phylogeny • MACSIMS (Thompson et al, 2006): information management system, to propagate structural/functional data within a multiple alignment

Large scale analysis: example • Phylogenies of peroxisomal proteins (yeast, rat) were reconstructed to determine their origin : eukaryotic, bacterial or archaeal • 39–58% were of eukaryotic origin (biogenesis or maintenance) • 13–18% were of bacterial origin (enzymes), acquired by recruitment of proteins originally targeted to mitochondria (Gabaldón et al. Biology Direct, 2006) [Figure labels: bacteria, archaea]

Large scale analysis : example • Figenix functional analysis of genes lost in mammals/vertebrates, but present in other animals • More than 50% of lost genes are involved in biomolecular metabolism/catabolism • e.g. TPS biosynthesizes trehalose 6-phosphate from UDP-glucose; trehalose is a disaccharide crucial for the survival of species in dry and freezing periods and other stress conditions (Danchin, Gouret and Pontarotti. BMC Evolutionary Biology 2006)

Online resource : PhyloFacts • The Berkeley Phylogenomics Group : a phylogenomic encyclopedia containing 10,000 'books' for protein families, with pre-calculated structural, functional and evolutionary analyses [Figure: pipeline components FlowerPower, MUSCLE, NJ, MP, ML, SAM, Blast to PDB, MSA analysis, SCI-PHY, PFAM] http://phylogenomics.berkeley.edu/phylofacts

Online resource : PhyloFacts • Search with fasta sequence http://phylogenomics.berkeley.edu/phylofacts

Phylogeny-based inference • Warning: inference accuracy depends on evolutionary distance and the particular functional attribute under consideration • Some attributes of protein families, such as the 3D structure, are conserved across large evolutionary distances • Other attributes, such as substrate specificity, can be modified by a few amino acid substitutions in critical positions

MACSIMS : Information Management System http://bips.u-strasbg.fr/MACSIMS/ • Data collection : • creation of a relational database (BIRD, H. Nguyen) • Information management: • data validation • reliable propagation • Efficient exploitation : • automatic, high-throughput processing (XML format) • visualisation (JalView editor) (Thompson et al, 2006)

Substrate specificity [Figure labels: rhodocoxin reductase, thioredoxin reductase, FAD binding, *** marking aligned positions] http://bips.u-strasbg.fr/MACSIMS/

MACSIMS : Information Management System Sulfatase protein family : GALNS • Mutations in GALNS gene are implicated in Morquio A syndrome : • mutation C79Y -> severe phenotype • others -> milder phenotypes http://bips.u-strasbg.fr/MACSIMS/

Non-homology based methods • They predict functional associations between proteins : • physical interactions • co-membership in pathways, regulons or other cellular processes • When no characterised homologs are available, non-homology methods can be used to analyze other patterns : • gene co-inheritance (phylogenetic profiling) • gene context • domain fusion • gene neighborhood (operon, synteny, …) • gene regulation (phylogenetic footprinting / shadowing)

Phylogenetic profiling • Joint presence or joint absence of two traits across large numbers of species can be used to infer a biological connection • e.g. involvement of two different proteins in the same biological pathway (Pellegrini et al., 1999) • Hypothesis: • A biological process (photosynthesis, methanogenesis, histidine biosynthesis, …) may require the concerted action of many proteins • If some protein critical to a process is lost, other proteins dedicated to that process would become useless; natural selection makes it unlikely they will be retained over evolutionary time • Therefore, genes that are functionally related should be gained and lost together from genomes during evolution, which results in a correlation of their occurrence vectors

Phylogenetic profiling • For each gene, code Presence (1) or Absence (0) in each species • Group genes with same or similar profiles • Genes with similar profiles are likely to be functionally related
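The coding and grouping steps above can be sketched as follows. The gene names, the five-species profiles and the use of Hamming distance with a mismatch threshold are assumptions made for illustration; published methods use a range of similarity measures.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of species in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def linked_genes(profiles, max_mismatches=0):
    """Gene pairs whose profiles differ in at most `max_mismatches` species."""
    return sorted(
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if hamming(profiles[g1], profiles[g2]) <= max_mismatches
    )


# Hypothetical profiles over five species: 1 = present, 0 = absent.
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (0, 1, 1, 0, 1),
    "geneD": (1, 1, 0, 0, 0),
}

identical = linked_genes(profiles)                   # [("geneA", "geneB")]
similar = linked_genes(profiles, max_mismatches=1)   # adds the geneD pairs
```

With only five species, identical profiles arise easily by chance, which is why a large number of genomes is needed in practice.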

Phylogenetic profiling: example • Comparative Genomics Identifies a Flagellar and Basal Body Proteome that Includes the BBS5 Human Disease Gene (Li et al, Cell, 2004)

Phylogenetic profiling • Other methods: • Similarity-based methods (correlating rates of evolution) (e.g. Marcotte, 2000) • Comparison of trees, rather than simple co-presence/co-absence (e.g. for STRING database, von Mering et al, 2003) • Problems/limitations : • Need to include a large number of genomes • Genes may be missed or badly predicted • A functional link is inferred, but with no clues to the exact gene functions

Domain fusion (Rosetta stone) • Hypothesis: • some pairs of interacting proteins have homologs in another organism fused into a single protein chain • A comparison of sequence homologs from multiple organisms can reveal these fused sequences • called Rosetta Stone sequences because they decipher the interactions between the protein pairs (Marcotte et al, 1999) • Example:

Rosetta stone : genome analysis • Prediction of E. coli genome-wide gene network • Problems : • The networks generated are sparse, but begin to define cellular systems • May not be scalable to higher eukaryotes due to large numbers of duplicate genes and promiscuous domains
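The Rosetta Stone idea can be sketched by treating each protein as a set of domain (family) identifiers, which is an assumption standing in for the sequence homology searches of the original method. All protein and domain names below are hypothetical.

```python
from itertools import combinations


def rosetta_links(query_proteins, other_proteins):
    """Find pairs of query proteins that appear fused in another genome.

    Both arguments map protein names to sets of domain identifiers.
    A link (p1, p2, fused) is reported when p1 and p2 share no domain and
    all domains of both are found together in one protein of the other
    genome.
    """
    links = set()
    for (p1, d1), (p2, d2) in combinations(sorted(query_proteins.items()), 2):
        if d1 & d2:
            continue  # the two proteins are homologous to each other
        for fused, domains in other_proteins.items():
            if d1 <= domains and d2 <= domains:
                links.add((p1, p2, fused))
    return sorted(links)


# Hypothetical genomes: protA and protB are separate in the query genome
# but their domains occur in a single chain, fusedAB, in the other genome.
query_genome = {"protA": {"dom1"}, "protB": {"dom2"}, "protC": {"dom3"}}
other_genome = {"fusedAB": {"dom1", "dom2"}, "protC_like": {"dom3"}}

links = rosetta_links(query_genome, other_genome)
# links is [("protA", "protB", "fusedAB")]
```

A domain that occurs in many unrelated proteins would link a large number of pairs under this rule, which is the promiscuous-domain problem noted above.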

Gene neighborhood methods • Genes that frequently co-occur in the same operon (genomic region) in a diverse set of species are more likely to physically interact or be involved in the same pathway (Dandekar et al, 1998; Huynen et al, 2000;…) • Example (from Harrington et al, PNAS 2007): the predicted transcription factor (TF) may regulate fatty acid degradation and biosynthesis [Figure labels: fatty acid biosynthesis, fatty acid degradation, predicted transcription factor]
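A minimal sketch of the neighborhood signal: count, for every pair of orthologous groups, the number of genomes in which the two are adjacent. The genome representation (an ordered list of orthologous group identifiers on one linear chromosome), the identifiers and the `max_gap` parameter are assumptions for illustration.

```python
from collections import Counter


def neighbour_counts(genomes, max_gap=1):
    """Count in how many genomes each pair of groups lies within `max_gap`.

    `genomes` maps a species name to its gene order, given as a list of
    orthologous group identifiers.
    """
    counts = Counter()
    for order in genomes.values():
        seen = set()  # count each pair at most once per genome
        for i, gene in enumerate(order):
            for other in order[i + 1:i + 1 + max_gap]:
                if other != gene:
                    seen.add(tuple(sorted((gene, other))))
        counts.update(seen)
    return counts


# Hypothetical gene orders in three species.
genomes = {
    "sp1": ["og1", "og2", "og3", "og4"],
    "sp2": ["og5", "og3", "og2", "og1"],
    "sp3": ["og1", "og4", "og2", "og3"],
}

counts = neighbour_counts(genomes)
# og2 and og3 are adjacent in all three species, og1 and og2 in two.
```

Pairs that stay adjacent across many distantly related genomes are the candidates for a functional link; closely related genomes share gene order simply by descent.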

Protein function prediction using combined methods • E.g. PLEX (Date and Marcotte, 2005) • MySQL relational database, with gene sequences, chromosomal positions, pre-computed phylogenetic profiles and Rosetta Stone linkages, accessible via a web-based interface http://bioinformatics.icmb.utexas.edu/plex/

Protein function prediction using combined methods • Study of protein function prediction in genomes and metagenomes • Combination of homology and non-homology approaches (from Harrington et al, PNAS 2007) [Figure categories: specific function, non-specific function, conserved protein, singleton]

Phylogenetic footprinting • Used to identify Transcription Factor Binding Sites (TFBS) within a non-coding region of DNA • Hypothesis: selective pressure causes regulatory elements to evolve at a slower rate than the non-functional surrounding sequence • Phylogenetic shadowing : a related technique used with closely related species (Tagle et al, 1988)

Phylogenetic footprinting • Protocol: • Carefully choose species with orthologous genes to provide enough sequence divergence • Decide on the length of the upstream / downstream region to be analysed • Align the sequences • Look for conserved regions and analyse them • Example: From Zhang and Gerstein Journal of Biology 2003
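The last step of the protocol, looking for conserved regions in the aligned sequences, can be sketched as below. The sequences are invented, and requiring perfect identity in every column is a deliberately strict assumption; the footprinting programs use scoring schemes that tolerate some substitutions.

```python
def column_identity(alignment):
    """Flag each alignment column: 1 if all sequences share one base."""
    return [int(len(set(col)) == 1 and col[0] != "-")
            for col in zip(*alignment)]


def conserved_blocks(alignment, min_length=6):
    """Return (start, end) half-open intervals of fully conserved columns."""
    flags = column_identity(alignment)
    blocks, start = [], None
    for i, flag in enumerate(flags + [0]):  # trailing 0 closes a final block
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Hypothetical aligned upstream regions from three species.
alignment = [
    "ACGTTGACGTCAGGCT",
    "TCGATGACGTCATGCA",
    "GCGCTGACGTCACGCC",
]

blocks = conserved_blocks(alignment)
motifs = [alignment[0][start:end] for start, end in blocks]
# blocks is [(4, 12)] and motifs is ["TGACGTCA"]
```

The `min_length` threshold matters: short conserved runs such as columns 1 to 2 in this example occur by chance, which relates to the problem of very short binding sites.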

Footprinting programs… • Multiple alignment of genomic regions: PipMaker, AVID, Multiz • Experimentally validated motif databases: DBTSS, EPD • Motif prediction: FirstEF, Eponine and GenScan • Integrated systems: CONREAL, ConSite, Footer, PHYLONET, PromAn, PhyloScan • Problems: • Species specific binding sites • Very short binding sites • Less specific binding factors • Compound binding regions

2. Construction of species trees

2. Construction of species trees • Problem • phylogenetic trees based on single gene families may conflict with each other, for a variety of reasons (gene duplication, loss, horizontal transfer, convergent or parallel evolution…) • Solution • integrate the phylogenetic information from the different gene families to form a single species phylogeny

Construction of species trees • Define groups of orthologous sequences • Then use: • Whole genome features (complete genome alignment, gene content) • Supermatrix (simultaneous-analysis, combined-analysis) • Supertree (separate analysis) (Delsuc et al, Nature Reviews Genetics, 2005)

Gene content • No multiple alignments, but sequence information is used to determine the orthologous genes • Build a matrix indicating the presence or absence of orthologous groups (OGs) in all species (phylogenetic profile) • The binary matrix can be treated in the same way as a multiple sequence alignment (2 states: P present, A absent, in place of the 4 states A, C, G, T) • Infer a phylogenomic tree from the matrix (alignment)
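As a sketch of how a presence/absence matrix can feed a distance method, the block below computes a pairwise distance between species rows. The Jaccard distance is an assumption chosen for simplicity, not the measure of any cited paper, and the matrix is invented.

```python
from itertools import combinations


def gene_content_distance(row_a, row_b):
    """Jaccard distance between two presence (1) / absence (0) rows."""
    shared = sum(a and b for a, b in zip(row_a, row_b))
    union = sum(a or b for a, b in zip(row_a, row_b))
    return 1 - shared / union if union else 0.0


def distance_matrix(matrix):
    """Pairwise distances for all species in a presence/absence matrix."""
    return {
        (s1, s2): gene_content_distance(matrix[s1], matrix[s2])
        for s1, s2 in combinations(sorted(matrix), 2)
    }


# Hypothetical matrix: rows are species, columns are orthologous groups.
matrix = {
    "sp1": [1, 1, 1, 0, 1],
    "sp2": [1, 1, 0, 0, 1],
    "sp3": [0, 1, 0, 1, 0],
}

distances = distance_matrix(matrix)
# sp1 and sp2 share 3 of 4 groups (distance 0.25); sp1 and sp3 share 1 of 5.
```

The resulting matrix could then be passed to any distance-based tree builder, such as neighbor joining.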

Gene content • Distance methods: • Snel, Bork & Huynen. (1999) Nature Genet. • Tekaia, Lazcano & Dujon. (1999) Genome Res. • Lin & Gerstein. (2000) Genome Res. • Maximum parsimony: • Wolf, Rogozin, Grishin, Tatusov & Koonin. (2001) BMC Evol. Biol. • Fitz-Gibbon & House. (1999) Nucleic Acids Res.

Gene order (synteny) • Estimate evolutionary distance from the number of rearrangements necessary to transform one genome into another (computationally complex) • construct phylogenetic trees by minimizing the number of breakpoints between genomes (Blanchette et al 1999) • More practical solution: simply score the presence or absence of pairs of orthologous genes (Korbel et al. 2002, Wolf et al 2001)
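The breakpoint idea can be illustrated with a small count. The sketch assumes two linear genomes with the same set of unsigned genes (gene orientation is ignored), which is a simplification of the breakpoint analysis cited above; the gene orders are invented.

```python
def adjacencies(order):
    """Unordered pairs of genes that are neighbours in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoints(order_a, order_b):
    """Number of adjacencies of genome A that are absent from genome B."""
    return len(adjacencies(order_a) - adjacencies(order_b))


# Hypothetical gene orders: genome_b carries an inversion of genes 3 to 5.
genome_a = [1, 2, 3, 4, 5, 6]
genome_b = [1, 2, 5, 4, 3, 6]

n_breakpoints = breakpoints(genome_a, genome_b)
# n_breakpoints is 2: the adjacencies 2-3 and 5-6 are broken.
```

A single inversion creates two breakpoints, so the count gives a rough distance without searching for the actual rearrangement scenario.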

Gene content / gene order • Problems • Orthology assessment • ‘Big genome attraction’: distantly related species with large genomes may share more genes than more closely related species with small genomes • Sequence information is lost

Superalignments (supermatrix) • Multiple alignments for each gene are concatenated to form a superalignment • Use conventional phylogenetic reconstruction methods (e.g. distance or MP) (Brown et al. 2001, Wolf et al. 2001)
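Concatenation itself is simple; the main practical detail is what to do when a species lacks a gene. The sketch below pads missing genes with gap characters, which is one common convention and an assumption here. Species names and sequences are invented.

```python
def concatenate(alignments, species):
    """Concatenate per-gene alignments into one superalignment.

    Each alignment maps species name to an aligned sequence. A species
    missing from a gene alignment is padded with gaps of that length.
    """
    supermatrix = {sp: "" for sp in species}
    for alignment in alignments:
        length = len(next(iter(alignment.values())))
        for sp in species:
            supermatrix[sp] += alignment.get(sp, "-" * length)
    return supermatrix


# Hypothetical alignments for two genes; mouse lacks the second gene.
gene1 = {"human": "MKT", "mouse": "MKS", "yeast": "MRT"}
gene2 = {"human": "LLVA", "yeast": "LIVA"}

superalignment = concatenate([gene1, gene2], ["human", "mouse", "yeast"])
# human: "MKTLLVA", mouse: "MKS----", yeast: "MRTLIVA"
```

The result has one row per species and can be given to any conventional tree reconstruction program.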

Superalignments • Example: RibAlign • analysis of 16S ribosomal RNA (rRNA) sequences has been the de facto gold standard for the assessment of phylogenetic relationships among prokaryotes • concatenation of ribosomal protein sequences (MAFFT, Phylip: ProML, MrBayes)

Superdistance (supermatrix) • Superdistance methods first calculate distance matrices for all gene families. • The phylogenomic distance between two species is then defined as the average distance between all the shared gene families (Kunin et al., 2005)
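The averaging step can be written out directly. This is a sketch of the definition on the slide, not the implementation of Kunin et al.; the per-family distance dictionaries, species names and values are hypothetical.

```python
def superdistance(sp1, sp2, family_distances):
    """Average distance between two species over their shared gene families.

    `family_distances` is a list with one dict per gene family, mapping a
    frozenset of two species names to their distance in that family.
    Returns None when the species share no gene family.
    """
    pair = frozenset((sp1, sp2))
    values = [dist[pair] for dist in family_distances if pair in dist]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical distances from three gene families.
family1 = {frozenset(("A", "B")): 0.10, frozenset(("A", "C")): 0.40}
family2 = {frozenset(("A", "B")): 0.20, frozenset(("B", "C")): 0.50}
family3 = {frozenset(("A", "B")): 0.30}
families = [family1, family2, family3]

d_ab = superdistance("A", "B", families)  # mean of 0.10, 0.20, 0.30
d_ac = superdistance("A", "C", families)  # only family1 contributes
```

Because each species pair is averaged over a different set of families, the method tolerates families that are missing from some genomes.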

Supertree • Reconstruct phylogenetic trees for each gene family separately • Combine the multiple gene family trees to form a single phylogenomic tree (Gene Tree Reconciliation) (Bininda-Emonds, 2004; Daubin et al., 2002)

Gene tree reconciliation methods • Consensus tree methods are used to combine fully overlapping source trees (strict, majority consensus rules, …) • (e.g. MinCut, Semple and Steel 2000) (from de Queiroz and Gatesy, Trends Ecol Evol, 2007)
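The majority consensus rule can be sketched on rooted trees written as nested tuples: collect the clades of every source tree and keep those found in more than half of them. The tree format and taxa are assumptions for illustration; assembling the kept clades back into a tree is omitted.

```python
from collections import Counter


def clades(tree):
    """Set of clades (frozensets of leaf names) of a rooted nested tuple."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found


def majority_rule(trees):
    """Clades present in more than half of the trees, root clade excluded."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    all_taxa = max(clades(trees[0]), key=len)  # the root clade
    kept = [c for c, n in counts.items()
            if n > len(trees) / 2 and c != all_taxa]
    return sorted((tuple(sorted(c)) for c in kept),
                  key=lambda c: (len(c), c))


# Three hypothetical gene trees on the same four taxa.
trees = [
    ((("human", "mouse"), "fly"), "yeast"),
    ((("human", "mouse"), "fly"), "yeast"),
    ((("human", "fly"), "mouse"), "yeast"),
]

consensus = majority_rule(trees)
# consensus is [("human", "mouse"), ("fly", "human", "mouse")]
```

The strict consensus rule corresponds to replacing the threshold with `n == len(trees)`, which here would keep only the three-taxon clade.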

Gene tree reconciliation methods • Indirect supertree construction represents individual source trees as matrices, then combines them using an optimization criterion : • Matrix representation using parsimony (MRP) • “flip” supertrees • Average consensus procedure • Most Similar Supertree (MSSA) • Maximum Quartet Fit (QFIT) • Maximum Splits Fit (SFIT) (from Bininda-Emonds et al, 2002) • Software: Clann, http://bioinf.may.ie/software/clann/
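The matrix step of MRP can be sketched as follows: every non-root clade of every source tree becomes one column, coded 1 for taxa inside the clade, 0 for taxa in that tree but outside the clade, and ? for taxa absent from the tree. The nested-tuple trees and taxa are hypothetical, and the all-zero outgroup row and the parsimony search that normally follow are left out.

```python
def leaf_set(tree):
    """All leaf names under a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def internal_clades(tree):
    """Non-root clades of a rooted tree, listed in post-order."""
    result = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        for child in node:
            visit(child, False)
        if not is_root:
            result.append(leaf_set(node))

    visit(tree, True)
    return result


def mrp_matrix(trees):
    """Code partially overlapping source trees as one character matrix."""
    taxa = sorted(frozenset().union(*(leaf_set(t) for t in trees)))
    rows = {taxon: "" for taxon in taxa}
    for tree in trees:
        present = leaf_set(tree)
        for clade in internal_clades(tree):
            for taxon in taxa:
                rows[taxon] += ("1" if taxon in clade
                                else "0" if taxon in present else "?")
    return rows


# Two hypothetical source trees with partially overlapping taxa.
source_trees = [(("A", "B"), "C"), (("C", "D"), "B")]
mrp = mrp_matrix(source_trees)
# mrp is {"A": "1?", "B": "10", "C": "01", "D": "?1"}
```

Unlike the consensus rules, this coding accepts source trees that do not contain the same taxa, which is what makes supertree construction possible.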
