
Introduction to phylogenomics

Introduction to phylogenomics. Julie Thompson Laboratory of Integrative Bioinformatics and Genomics IGBMC, Strasbourg, France julie@igbmc.fr. Phylogenomics. A combination of : genomics (study of function and structure of genes and genomes)


Presentation Transcript


  1. Introduction to phylogenomics Julie Thompson Laboratory of Integrative Bioinformatics and Genomics IGBMC, Strasbourg, France julie@igbmc.fr

  2. Phylogenomics • A combination of : • genomics (study of function and structure of genes and genomes) • molecular phylogenetics (study of evolutionary relationships among organisms) • Two different aspects : • using phylogenetic data to infer functions for DNA and protein sequences (Eisen. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998) • using genomic data to infer phylogenetic relationships (species trees) and to gain insights into the mechanisms of molecular evolution (O'Brien and Stanyon. Phylogenomics. Ancestral primate viewed. Nature 1999)

  3. 1. Phylogeny-based functional inference • Homology based methods • Non-homology based methods

  4. Phylogeny-based functional inference • Used in molecular biology, genetics, development, behavior, epidemiology, ecology, systematics, conservation biology, forensics… • draw structural/functional inferences from the structure of the tree or from the way the character states map onto the tree • use these clues to build hypotheses and models of important events, systems, predict behavior, etc. • Reviewed in: Brown & Sjölander, PLoS Comput Biol. 2006; Levasseur et al, Evolutionary Bioinformatics, 2008

  5. Phylogeny-based functional inference • A two-step process: • 1. Data & information inference: phylogenetic inference from model systems (human, mouse, drosophila, yeast …) • 2. Knowledge extraction: propagation and modelling to new systems (human, partial experimental data) • [Figure: levels of analysis and associated approaches. Sequence, Structure, Function, Evolution; Complexes & Networks: copresence/coabsence, fusion/fission; Interactome: interologous approach; Interfaceome: conserved residue interactions; Promotome: phylogenetic footprint; Transcriptome, proteome…: similar expression]

  6. Propagation from model systems • Classical method : similarity-based functional annotation (from Blast best hit) • Perform Blast search to detect similar sequences • Transfer function from highest scoring sequence with known function • Errors : • gene duplications (ortholog/paralog) • multi-domain proteins • existing database errors • [Figure: Blast hit list with human, mouse1, mouse2, worm and yeast sequences]

  7. Propagation from model systems • Classical method : similarity-based functional annotation (from Blast best hit) • Problems : • distantly related sequences may have different functions • spurious hits in low complexity regions • propagation of existing database annotation errors • Example : complex domain organisation

  8. Problems : domain organisation • [Figure: domain organisation of SW:Y449_MYCGE, SW:Y663_MYCPN, SW:SYFB_IDILO and SPT:A5IAL4_LEGPC; labelled domains: RNA binding domain, phenylalanyl-tRNA synthetase]

  9. Annotation errors • Sequence prediction errors : • 65% of the sequences are in silico predictions • 44% of eukaryote predicted proteins are partially incorrect: at least one suspicious indel or divergent segment (Bianchetti et al, 2005) • Function annotation errors : • 66% of sequences in the UniProt database have GO annotations, but only 3% have evidence codes indicating experimental support (Krishnamurthy et al, 2006) • 10-30% of genome functional annotations are erroneous (Devos, Valencia, 2000)

  10. Propagation from model systems • Phylogeny-based inference : • Perform Blast search to detect similar sequences • Perform multiple alignment of sequences representing potential homologs • Construct phylogenetic tree and identify orthologs • Infer function from set of orthologs, domain organisation, conserved motifs (also 3D structure, etc.) • [Figure: Blast hits (human, mouse1, mouse2, worm, yeast), their multiple alignment, and the resulting tree annotated with a duplication and a fusion event]

  11. Assumption • We can identify a set of homologous sequences and differentiate orthologs from paralogs • orthologous sequences (that diverged by speciation) are more reliable predictors of protein function than paralogous sequences (that diverged by gene duplication) • [Figure: ancestor gene A; speciation gives mouse gene A and human gene A (orthologs); duplication gives human gene A and human gene A’ (paralogs)]

  12. Define orthologous groups • pairwise orthology: reciprocal best hits (RBH) • Inparanoid (Remm et al., 2001) • COGs: Clusters of Orthologous Groups (Tatusov et al., 1997) • OrthoMCL (Li et al., 2003) • eggNOG (Jensen et al., 2008) • [Figure: ortholog/paralog diagram repeated from the previous slide]
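The reciprocal best hit (RBH) criterion named on this slide can be sketched in a few lines. This is only a minimal illustration, not how Inparanoid, COGs or OrthoMCL work internally; the score dictionaries, gene names and function names are hypothetical stand-ins for parsed Blast output.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict {(query, subject): score}, e.g. parsed Blast bit scores
    (hypothetical input format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores: human gene A2 is a paralog whose best mouse hit is also mA.
human_vs_mouse = {("hA", "mA"): 950, ("hA", "mB"): 400,
                  ("hA2", "mA"): 700, ("hA2", "mB"): 300}
mouse_vs_human = {("mA", "hA"): 950, ("mA", "hA2"): 700,
                  ("mB", "hA"): 400, ("mB", "hA2"): 300}

rbh_pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(rbh_pairs)
```

In this toy case only hA and mA are each other's best hit, so the paralog hA2 is left unpaired.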

  13. Problems leading to wrong orthology assumptions : • varying rates of evolution • sequencing errors, non-predicted sequences • gene duplication followed by differential gene loss • [Figure: tree of sub-families A, B, X and Y with rat, human, mouse, worm and fly sequences, showing a duplication and a gene loss in human; resulting RBH pairs: human B / mouse A, human X / rat Y]

  14. Define orthologous groups • Tree-based orthology: build a phylogenetic tree of a group of genes and compare gene tree to species tree to define speciation, duplication events • Resampled Inference of Orthologs (RIO) (Zmasek and Eddy, 2002) • Orthostrapper (Storm and Sonnhammer, 2002) • Levels Of Orthology From Trees (LOFT) (van der Heijden et al, 2007) • Example: G protein-coupled receptors • [Figure: tree placing two unknown sequences; one gets a specific prediction (opioid receptor), the other a more general prediction (GPCR of unknown specificity)]
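Tree-based orthology assignment can be illustrated with a small sketch. Note that it swaps in a simple species-overlap rule (a node is called a duplication when its two subtrees share a species) rather than the explicit gene tree versus species tree comparison used by RIO, Orthostrapper or LOFT. The nested-tuple tree encoding and the `species_gene` leaf naming are assumptions made for the example.

```python
def leaves(node):
    """Return all leaf names under a node of a binary nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def species_of(leaf):
    """Leaf names are assumed to look like 'human_A' (species, then gene)."""
    return leaf.split("_")[0]


def label_events(node, events=None):
    """Label each internal node as 'duplication' or 'speciation'."""
    if events is None:
        events = []
    if isinstance(node, str):
        return events
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species_of(x) for x in left_leaves} & {species_of(x) for x in right_leaves}
    kind = "duplication" if shared else "speciation"
    events.append((kind, sorted(left_leaves), sorted(right_leaves)))
    label_events(left, events)
    label_events(right, events)
    return events


def ortholog_pairs(tree):
    """Genes separated by a speciation node are reported as orthologs."""
    pairs = set()
    for kind, left_leaves, right_leaves in label_events(tree):
        if kind == "speciation":
            pairs.update(tuple(sorted((a, b))) for a in left_leaves for b in right_leaves)
    return pairs


# Toy gene tree: an ancient duplication gave sub-families A and B.
gene_tree = (("human_A", "mouse_A"), ("human_B", "mouse_B"))
events = label_events(gene_tree)
orthologs = ortholog_pairs(gene_tree)
print(events[0][0], sorted(orthologs))
```

Here the root is labelled a duplication, so human_A and mouse_B are treated as paralogs even though they come from different species.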

  15. Large scale analysis pipelines • FIGENIX (Gouret et al, 2005): automatic pipeline for structural/functional annotation and phylogeny • SIFTER (Engelhardt et al, 2005): statistical inference algorithms to propagate function annotations within a phylogeny • PhyloFacts (Krishnamurthy et al, 2006): database of protein families, integrating different predictions and experimental data in a phylogeny • MACSIMS (Thompson et al, 2006): information management system, to propagate structural/functional data within a multiple alignment

  16. Large scale analysis: example • Phylogenies of peroxisomal proteins (yeast, rat) were reconstructed to determine their origin : eukaryotic, bacterial or archaeal • 39–58% were of eukaryotic origin (biogenesis or maintenance) • 13–18% were of bacterial origin (enzymes) by recruitment of proteins originally targeted to mitochondria • (Gabaldón et al, Biology Direct, 2006)

  17. Large scale analysis : example • Figenix functional analysis of genes lost in mammals/vertebrates, but present in other animals • More than 50% of lost genes are involved in biomolecular metabolism/catabolism • e.g. TPS synthesizes trehalose 6-phosphate from UDP-glucose; this is the precursor of trehalose, a disaccharide crucial for the survival of species in dry and freezing periods and other stress conditions • (Danchin, Gouret and Pontarotti, BMC Evolutionary Biology, 2006)

  18. Online resource : PhyloFacts • The Berkeley Phylogenomics Group : a phylogenomic encyclopedia containing 10,000 'books' for protein families, with pre-calculated structural, functional and evolutionary analyses • [Figure: analysis pipeline with components FlowerPower, MUSCLE, NJ / MP / ML, SAM, Blast to PDB, MSA analysis, SCI-PHY, PFAM] http://phylogenomics.berkeley.edu/phylofacts

  19. Online resource : PhyloFacts • Search with fasta sequence http://phylogenomics.berkeley.edu/phylofacts

  20. http://phylogenomics.berkeley.edu/phylofacts

  21. Phylogeny-based inference • Warning: inference accuracy depends on evolutionary distance and the particular functional attribute under consideration • Some attributes of protein families, such as the 3D structure, are conserved across large evolutionary distances • Other attributes, such as substrate specificity, can be modified by a few amino acid substitutions in critical positions

  22. MACSIMS : Information Management System http://bips.u-strasbg.fr/MACSIMS/ • Data collection : • creation of a relational database (BIRD, H. Nguyen) • Information management: • data validation • reliable propagation • Efficient exploitation : • automatic, high-throughput processing (XML format) • visualisation (JalView editor) • (Thompson et al, 2006)

  23. Substrate specificity • [Figure: MACSIMS alignment view; labels: rhodocoxin reductase, thioredoxin reductase, FAD binding] http://bips.u-strasbg.fr/MACSIMS/

  24. MACSIMS : Information Management System Sulfatase protein family : GALNS • Mutations in GALNS gene are implicated in Morquio A syndrome : • mutation C79Y -> severe phenotype • others -> milder phenotypes http://bips.u-strasbg.fr/MACSIMS/

  25. “non-homology” based methods • They predict functional associations between proteins : • physical interactions • co-membership in pathways, regulons or other cellular processes • When no characterised homologs are available, 'nonhomology' methods can be used to analyze other patterns : • gene co-inheritance (phylogenetic profiling) • gene context • domain fusion • gene neighborhood (operon, synteny, …) • gene regulation (phylogenetic footprinting / shadowing)

  26. Phylogenetic profiling • Joint presence or joint absence of two traits across large numbers of species can be used to infer a biological connection • e.g. involvement of two different proteins in the same biological pathway (Pellegrini et al., 1999) • Hypothesis: • A biological process (photosynthesis, methanogenesis, histidine biosynthesis, …) may require the concerted action of many proteins • If some protein critical to a process is lost, other proteins dedicated to that process would become useless; natural selection makes it unlikely they will be retained over evolutionary time • Therefore, genes that are functionally related should be gained and lost together from genomes during evolution, which results in a correlation of their occurrence vectors

  27. Phylogenetic profiling • For each gene, code Presence (1) or Absence (0) in each species • Group genes with same or similar profiles • Genes with similar profiles are likely to be functionally related
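The profile coding and grouping steps on this slide can be sketched as follows. This is a minimal example under stated assumptions: the gene names, the five-species profiles and the Hamming-distance threshold `max_diff` are invented for illustration, and published methods use more careful similarity measures.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of species in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def similar_profile_pairs(profiles, max_diff=1):
    """Return gene pairs whose profiles differ in at most max_diff species."""
    return sorted(
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if hamming(profiles[g1], profiles[g2]) <= max_diff
    )


# One tuple per gene: presence (1) or absence (0) in five hypothetical species.
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (0, 1, 1, 0, 1),
    "geneD": (1, 1, 0, 0, 0),
}

identical = similar_profile_pairs(profiles, max_diff=0)
near = similar_profile_pairs(profiles, max_diff=1)
print(identical)
print(near)
```

With `max_diff=0` only geneA and geneB are linked; relaxing the threshold to 1 also links geneD to both.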

  28. Phylogenetic profiling: example • 'Comparative Genomics Identifies a Flagellar and Basal Body Proteome that Includes the BBS5 Human Disease Gene' (Li et al, Cell, 2004)

  29. Phylogenetic profiling • Other methods: • Similarity-based methods (correlating rates of evolution) (eg. Marcotte, 2000) • Comparison of trees, rather than simple co-presence/co-absence (eg. for STRING database, von Mering et al, 2003) • Problems/limitations : • Need to include a large number of genomes • Genes may not be predicted (or badly predicted) • Functional link is inferred, but no clues to exact gene functions

  30. Domain fusion (Rosetta stone) • Hypothesis: • some pairs of interacting proteins have homologs in another organism fused into a single protein chain • A comparison of sequence homologs from multiple organisms can reveal these fused sequences • called Rosetta Stone sequences because they decipher the interactions between the protein pairs (Marcotte et al, 1999) • Example:
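A minimal sketch of the Rosetta Stone idea: two proteins in the genome of interest are linked when a single protein in another genome contains a match to each of them. The family identifiers (`famA`, `famB`, `famC`) and the dictionaries are hypothetical; a real analysis would derive them from sequence searches and would need to filter promiscuous domains.

```python
def rosetta_links(query_proteins, other_proteins):
    """Find protein pairs linked by a fused protein in another genome.

    query_proteins: dict protein name -> set of family ids (genome of interest)
    other_proteins: dict protein name -> set of family ids (reference genome)
    Returns a list of (protein_a, protein_b, fused_protein) tuples.
    """
    links = []
    names = sorted(query_proteins)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            fams_a, fams_b = query_proteins[a], query_proteins[b]
            if fams_a & fams_b:
                continue  # the pair is simply homologous, not a fusion signal
            for fused, fams in sorted(other_proteins.items()):
                if fams_a & fams and fams_b & fams:
                    links.append((a, b, fused))
    return links


# Toy data modelled on a well-known case: separate gyrase subunits in
# one genome, a single fused topoisomerase chain in another.
ecoli = {"gyrA": {"famA"}, "gyrB": {"famB"}, "recA": {"famC"}}
yeast = {"TOP2": {"famA", "famB"}, "RAD51": {"famC"}}

links = rosetta_links(ecoli, yeast)
print(links)
```

The shared-family check is a design choice for the sketch: without it, any two paralogs would be "linked" by every homolog in the other genome.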

  31. Rosetta stone : genome analysis • Prediction of E. coli genome-wide gene network • Problems : • The networks generated are sparse, but begin to define cellular systems • May not be scalable to higher eukaryotes due to large numbers of duplicate genes, promiscuous domains

  32. Gene neighborhood methods • genes that frequently co-occur in the same operon (genomic region) in a diverse set of species are more likely to physically interact or be involved in the same pathway (Dandekar et al, 1998; Huynen et al, 2000;…) • Example: [Figure: genomic region containing fatty acid biosynthesis genes, fatty acid degradation genes and a predicted transcription factor] the predicted TF may regulate fatty acid degradation and biosynthesis (from Harrington et al, PNAS 2007)

  33. Protein function prediction using combined methods • E.g. PLEX (Date and Marcotte, 2005) • MySQL relational database, with gene sequences, chromosomal positions, pre-computed phylogenetic profiles and Rosetta Stone linkages, accessible via a web-based interface http://bioinformatics.icmb.utexas.edu/plex/

  34. Protein function prediction using combined methods • Study of protein function prediction in genomes and metagenomes • Combination of homology and non-homology approaches • [Figure legend: specific function, non-specific function, conserved protein, singleton] (from Harrington et al, PNAS 2007)

  35. Phylogenetic footprinting • Used to identify Transcription Factor Binding Sites (TFBS) within a non-coding region of DNA • Hypothesis: selective pressure causes regulatory elements to evolve at a slower rate than the non-functional surrounding sequence • Phylogenetic shadowing : a related technique used with closely related species • (Tagle et al, 1988)

  36. Phylogenetic footprinting • Protocol: • Carefully choose species with orthologous genes to provide enough sequence divergence • Decide on the length of the upstream / downstream region to be analysed • Align the sequences • Look for conserved regions and analyse them • Example: from Zhang and Gerstein, Journal of Biology, 2003
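The last protocol step (look for conserved regions in the aligned sequences) can be sketched as a scan for runs of fully conserved alignment columns. This is a deliberately crude stand-in for the footprinting programs, which use evolutionary models; the three toy sequences and the `min_len` threshold are assumptions.

```python
def conserved_runs(alignment, min_len=6):
    """Find runs of columns that are identical in all sequences.

    alignment: list of equal-length aligned strings ('-' marks a gap).
    Returns (start, end, motif) tuples with 0-based, end-exclusive coordinates.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(*alignment))
    conserved = [len(set(col)) == 1 and "-" not in col for col in columns]
    runs, start = [], None
    # The trailing False closes a run that reaches the end of the alignment.
    for i, ok in enumerate(conserved + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                runs.append((start, i, alignment[0][start:i]))
            start = None
    return runs


# Hypothetical upstream regions from three species, already aligned.
upstream = [
    "ACGTTATAAAGGCTA",
    "TCGATATAAAGCCTT",
    "GCGCTATAAAGACTG",
]

footprints = conserved_runs(upstream, min_len=6)
print(footprints)
```

Short conserved runs (here the two-column stretches) fall below `min_len` and are ignored, which mirrors the problem with very short binding sites listed on the next slide.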

  37. Footprinting programs… • Multiple alignment of genomic regions: PipMaker, AVID, Multiz • Experimentally validated motif databases: DBTSS, EPD • Motif prediction: FirstEF, Eponine and GenScan • Integrated systems: CONREAL, ConSite, Footer, PHYLONET, PromAn, PhyloScan • Problems: • Species specific binding sites • Very short binding sites • Less specific binding factors • Compound binding regions

  38. 2. Construction of species trees

  39. 2. Construction of species trees • Problem • phylogenetic trees based on single gene families may conflict with one another for a variety of reasons (gene duplication, loss, horizontal transfer, convergent or parallel evolution…) • Solution • integrate the phylogenetic information from the different gene families to form a single species phylogeny

  40. Construction of species trees • Define groups of orthologous sequences • Then use: • Whole genome features (complete genome alignment, gene content) • Supermatrix (simultaneous-analysis, combined-analysis) • Supertree (separate analysis) Delsuc et al, Nature reviews, 2005

  41. Gene content • No multiple alignments, but sequence information is used to determine the orthologous genes • Build a matrix indicating the presence or absence of orthologous groups (OGs) in all species (phylogenetic profile) • Binary matrix can be treated in the same way as a multiple sequence alignment • Infer a phylogenomic tree from matrix (alignment) • [Figure: gene content matrix with 2 states (P present, A absent) beside a DNA alignment with 4 states (A, C, G, T)]
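A minimal sketch of the gene content approach: build the presence/absence sets and turn them into pairwise distances that a distance-based tree method could use. The Jaccard distance is used here as a generic choice and is not claimed to be the measure of any method cited in this talk; the species and orthologous group identifiers are invented.

```python
def jaccard_distance(set_a, set_b):
    """1 - |shared OGs| / |all OGs found in either genome|."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def gene_content_distances(content):
    """content: dict species -> set of orthologous group ids present."""
    species = sorted(content)
    return {
        (s, t): jaccard_distance(content[s], content[t])
        for i, s in enumerate(species)
        for t in species[i + 1:]
    }


# Hypothetical presence data; equivalent to rows of a P/A matrix.
content = {
    "human": {"OG1", "OG2", "OG3", "OG4"},
    "mouse": {"OG1", "OG2", "OG3", "OG5"},
    "yeast": {"OG1", "OG6"},
}

distances = gene_content_distances(content)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 2))
```

The resulting matrix could then be passed to any distance-based tree-building method such as neighbour joining.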

  42. Gene content • Distance methods: • Snel, Bork & Huynen. (1999) Nature Genet. • Tekaia, Lazcano & Dujon. (1999) Genome Res. • Lin & Gerstein. (2000) Genome Res. • Maximum parsimony: • Wolf, Rogozin, Grishin, Tatusov & Koonin. (2001) BMC Evol. Biol. • Fitz-Gibbon & House. (1999) Nucleic Acids Res.

  43. Gene order (synteny) • Estimate evolutionary distance from the number of rearrangements necessary to transform one genome into another (computationally complex) • construct phylogenetic trees by minimizing the number of breakpoints between genomes (Blanchette et al 1999) • More practical solution: simply score the presence or absence of pairs of orthologous genes (Korbel et al. 2002, Wolf et al 2001)
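Both ideas on this slide rest on gene adjacencies. The sketch below counts breakpoints between two gene orders, and the same `adjacencies` helper gives the presence or absence of neighbouring gene pairs. It assumes unsigned genes on a single linear chromosome with identical gene content, which is a strong simplification.

```python
def adjacencies(order):
    """Set of unordered neighbouring gene pairs in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoints(order_a, order_b):
    """Number of adjacencies of genome A that are broken in genome B."""
    if set(order_a) != set(order_b):
        raise ValueError("this sketch assumes identical gene content")
    return len(adjacencies(order_a) - adjacencies(order_b))


def shared_adjacencies(order_a, order_b):
    """Neighbouring gene pairs conserved in both genomes."""
    return adjacencies(order_a) & adjacencies(order_b)


# Genome B differs from genome A by one inversion of the segment g3-g4.
genome_a = ["g1", "g2", "g3", "g4", "g5"]
genome_b = ["g1", "g2", "g4", "g3", "g5"]

n_breakpoints = breakpoints(genome_a, genome_b)
conserved_pairs = shared_adjacencies(genome_a, genome_b)
print(n_breakpoints, sorted(sorted(p) for p in conserved_pairs))
```

A single inversion creates two breakpoints here, while the g1-g2 and g3-g4 adjacencies survive.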

  44. Gene content / gene order • Problems • Orthology assessment • ‘big genome attraction’: distantly related species with large genomes may share more genes than more closely related species with small genomes. • Sequence information is lost

  45. Superalignments (supermatrix) • multiple alignments for each gene are concatenated to form a superalignment • Use conventional phylogenetic reconstruction methods (e.g. distance or MP) (Brown et al. 2001, Wolf, et al 2001)
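Concatenating per-gene alignments into a superalignment can be sketched as below. Padding a species that lacks a gene with gap characters is one common convention and is an assumption here, as are the toy sequences and the function name.

```python
def concatenate_alignments(alignments):
    """Concatenate per-gene alignments into one superalignment.

    alignments: list of dicts, each mapping species -> aligned sequence.
    Species missing from a gene alignment are padded with '-'.
    """
    species = sorted({sp for aln in alignments for sp in aln})
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        length = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Two hypothetical gene alignments; yeast lacks the second gene.
gene1 = {"human": "MKV", "mouse": "MKI", "yeast": "MRV"}
gene2 = {"human": "LLAG", "mouse": "LLSG"}

supermatrix = concatenate_alignments([gene1, gene2])
for sp, seq in supermatrix.items():
    print(sp, seq)
```

The resulting dictionary can be written out in any alignment format and analysed with conventional tree reconstruction programs.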

  46. Superalignments • Example: RibAlign • analysis of 16S ribosomal RNA (rRNA) sequences has been the de-facto gold standard for the assessment of phylogenetic relationships among prokaryotes • concatenation of ribosomal protein sequences (MAFFT, Phylip: ProML, MrBayes)

  47. Superdistance (supermatrix) • Superdistance methods first calculate distance matrices for all gene families. • The phylogenomic distance between two species is then defined as the average distance between all the shared gene families (Kunin et al., 2005)
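The superdistance definition above (average distance over all shared gene families) can be written directly. This is a minimal sketch: the per-family distance dictionaries are hypothetical values, and how each per-family distance is estimated is left out.

```python
from collections import defaultdict


def superdistance(family_matrices):
    """Average each species-pair distance over the families containing both.

    family_matrices: list of dicts {(species_a, species_b): distance},
    one dict per gene family.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for matrix in family_matrices:
        for pair, dist in matrix.items():
            key = tuple(sorted(pair))  # treat (a, b) and (b, a) alike
            totals[key] += dist
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Family 2 is absent from yeast, so it only informs the human-mouse pair.
family1 = {("human", "mouse"): 0.10, ("human", "yeast"): 0.80, ("mouse", "yeast"): 0.90}
family2 = {("mouse", "human"): 0.20}

averaged = superdistance([family1, family2])
for pair, dist in sorted(averaged.items()):
    print(pair, round(dist, 3))
```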

  48. Supertree • Reconstruct phylogenetic trees for each gene family separately • Combine the multiple gene family trees to form a single phylogenomic tree (Gene Tree Reconciliation) (Bininda-Emonds, 2004; Daubin et al., 2002)

  49. Gene tree reconciliation methods • Consensus tree methods are used to combine fully overlapping source trees (strict, majority consensus rules, …) • e.g. MinCut (Semple and Steel, 2000) • (from de Queiroz and Gatesy, Trends Ecol Evol, 2007)
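The majority consensus rule mentioned above can be sketched for rooted source trees on the same set of species: keep every clade found in more than half of the trees. The nested-tuple tree encoding and the toy trees are assumptions, and the sketch returns the retained clades rather than building the consensus tree itself.

```python
from collections import Counter


def collect_clades(tree, out):
    """Return the leaf set under `tree`; add each internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leafset = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.add(leafset)
    return leafset


def majority_rule_clades(trees):
    """Clades present in more than half of the rooted source trees."""
    counts = Counter()
    for tree in trees:
        found = set()
        root = collect_clades(tree, found)
        found.discard(root)  # the full leaf set is trivially shared
        counts.update(found)
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


# Three hypothetical gene trees; the third disagrees on human's sister.
t1 = ((("human", "mouse"), "fly"), "yeast")
t2 = ((("human", "mouse"), "fly"), "yeast")
t3 = ((("human", "fly"), "mouse"), "yeast")

consensus = majority_rule_clades([t1, t2, t3])
print(sorted(sorted(c) for c in consensus))
```

The human plus fly grouping appears in only one of three trees and is dropped; a strict consensus would instead keep only clades present in all trees.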

  50. Gene tree reconciliation methods • Indirect supertree construction represents individual source trees as matrices, then combines them using an optimization criterion : • Matrix representation using parsimony (MRP) • “flip” supertrees • Average consensus procedure • Most Similar Supertree (MSSA) • Maximum Quartet Fit (QFIT) • Maximum Splits Fit (SFIT) • (from Bininda-Emonds et al, 2002) • Software: Clann, http://bioinf.may.ie/software/clann/
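The matrix step of MRP can be sketched as follows: each clade of each source tree becomes one column, coded 1 for species inside the clade, 0 for species in that tree but outside the clade, and ? for species missing from that tree. This covers only the encoding; the parsimony search on the matrix is left to a phylogeny program, and the nested-tuple trees are invented examples.

```python
def collect_clades(tree, out):
    """Return the leaf set under `tree`; add each internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leafset = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.add(leafset)
    return leafset


def mrp_matrix(trees):
    """Encode rooted source trees as a species -> character string matrix."""
    per_tree = []
    for tree in trees:
        found = set()
        root = collect_clades(tree, found)
        found.discard(root)  # the root clade carries no grouping information
        per_tree.append((root, sorted(found, key=sorted)))
    taxa = sorted(set().union(*(root for root, _ in per_tree)))
    rows = {taxon: "" for taxon in taxa}
    for root, clades in per_tree:
        for clade in clades:
            for taxon in taxa:
                if taxon in clade:
                    rows[taxon] += "1"
                elif taxon in root:
                    rows[taxon] += "0"
                else:
                    rows[taxon] += "?"
    return rows


# Two partially overlapping hypothetical source trees.
tree1 = (("human", "mouse"), "fly")
tree2 = (("mouse", "rat"), "fly")

matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

Implementations commonly add an all-zero outgroup row to root the analysis; that step is omitted here.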
