Scale of the ‘unknown’ gene problem. Principles of comparative genomics. Shared plant-prokaryote genes. Comparative genomics When Blast tells you nothing…. The ‘guilt by association’ principle ‘Two-dimensional’ gene annotation SEED subsystems. Plant-prokaryote examples
Scale of the ‘unknown’ gene problem
Principles of comparative genomics
• Shared plant-prokaryote genes
• Comparative genomics
  - When Blast tells you nothing…
  - The ‘guilt by association’ principle
  - ‘Two-dimensional’ gene annotation
  - SEED subsystems
• Plant-prokaryote examples
  - Filling ‘pathway holes’ – FolQ
  - Linking new functions to known systems – COG0354
Whole genome sequencing progress
[Figure: number of genome projects (ongoing and complete), Dec 1997 to Mar 2011; y-axis: number of genomes, 0 to 10000. Source: www.genomesonline.org]
● Functional annotation of genes has nowhere near kept pace
● Functional annotations are often absent, vague, or wrong
Orphan enzymes
• 1437/3736 enzymes (38%) with EC numbers have no associated genes
Orphan genes
• 20-60% of genes in any given genome have no known function or only a vague one (‘esterase’ etc.)
The unknown protein problem in various groups
[Figure: percentage of unknown vs known proteins encoded by diverse genomes from Bacteria, Archaea and Eukarya; y-axis: percent of proteins, 0 to 100. Genomes shown: Human, Arabidopsis, Escherichia coli, Solibacter usitatus, Chlamydia trachomatis, Lactobacillus casei, Synechocystis, Staphylococcus aureus, Acidobacterium, Pyrococcus abyssi, Haloarcula marismortui]
Data from The SEED http://theseed.uchicago.edu/
Plants & prokaryotes share many (unknown) genes

Source of genes     Number of genes   % of genome
Cyanobacteria       5470              21.0
Proteobacteria      1170              4.6
Gram+ bacteria      2280              9.1
Other bacteria      1160              4.6
Archaea             1090              4.4
Total               11170             43.4

From de Crecy-Lagard & Hanson, Trends Microbiol 15: 563 (2007)
● Estimates for Arabidopsis vary, but all are many thousands
● Functions of most shared genes are metabolic
● Shared genes identifiably come from various prokaryote groups
● Plants are conglomerates of microbial metabolic genes
● Many opportunities for comparative genomics
The power of comparative genomics
● Suppose you have an unknown plant protein:
● BlastP search gives various prokaryote hits
● None of them have clear functions
Dead end?
● No! This is the beginning of comparative genomics
● Predicts functions via the ‘guilt by association’ principle
● Genes of related function are associated in various ways
● e.g. enzymes in a pathway, proteins in a complex
● Whatever a gene’s associates do, it probably does too
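As a minimal sketch of the ‘guilt by association’ principle, the snippet below predicts a function for an unknown gene by majority vote among the known functions of its associates. The gene names, annotations and the voting rule are illustrative assumptions, not data or a method from the slides.

```python
from collections import Counter

def predict_by_association(associates, annotations):
    """Return (most common known function among associates, its vote share), or None."""
    # Keep only associates that have a known (non-empty) annotation.
    known = [annotations[g] for g in associates if annotations.get(g)]
    if not known:
        return None
    function, votes = Counter(known).most_common(1)[0]
    return function, votes / len(known)

# Hypothetical annotations (made up for illustration).
annotations = {
    "geneA": "folate synthesis",
    "geneB": "folate synthesis",
    "geneC": "Fe/S assembly",
    "geneD": None,  # another unknown gene: contributes no vote
}
associates_of_unknown = ["geneA", "geneB", "geneC", "geneD"]
prediction = predict_by_association(associates_of_unknown, annotations)
print(prediction)
```

A prediction of this kind is only a hypothesis; it still needs testing by genetics or biochemistry.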
Association evidence
[Figure: types of association evidence (genomic and post-genomic) that feed into predictions, which are then tested by genetics and biochemistry. Evidence types shown: gene clustering, gene fusion, shared regulatory sites, phylogenetic occurrence, co-expression, organelle proteomes, protein-protein interactions, essentiality & other phenome data, structures]
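Phylogenetic occurrence, one of the evidence types above, can be sketched as a comparison of presence/absence profiles: genes that occur in the same set of genomes are candidates for a shared function. The Jaccard index used here is one common similarity measure chosen for illustration, and the genome labels and profiles are made up.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two sets of genomes in which each gene occurs."""
    union = profile_a | profile_b
    return len(profile_a & profile_b) / len(union) if union else 0.0

# Hypothetical presence sets: genomes g1..g6 in which each gene family is found.
profiles = {
    "unknown": {"g1", "g2", "g3", "g5"},
    "pathway_enzyme": {"g1", "g2", "g3", "g5", "g6"},
    "unrelated": {"g4", "g6"},
}

# Score every other gene family against the unknown one.
scores = {
    name: jaccard(profiles["unknown"], profile)
    for name, profile in profiles.items()
    if name != "unknown"
}
print(scores)
```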
Two-dimensional gene annotation
• ‘Dimensions’ are:
  - Molecular function (e.g., an enzyme activity with EC no.)
  - Functional context (e.g., other enzymes of a pathway)
• ‘2-Dimensions good, 1-dimension bad’
  - Even an EC no. function may be wrong if pathway not there
  - Pathway context may be wrong if certain enzymes missing
• GenBank etc. annotations are 1-dimensional (mol. function)
SEED subsystems
• A subsystem (SS) is a set of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)
• Subsystems capture both annotation dimensions
• SSs cover many genomes and have the form of a spreadsheet:
  - Columns are molecular functions
  - Rows are genomes
  - Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome
[Figure: folate biosynthesis subsystem spreadsheet, with a pathway hole marked]
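The spreadsheet layout described above lends itself to a simple search for pathway holes: genomes that have most roles of a subsystem filled but one or more cells empty. The sketch below is not the SEED data model or API; the dictionary layout, the gene IDs and the 0.8 completeness threshold are all assumptions made for illustration.

```python
# Columns of the hypothetical spreadsheet: functional roles of the folate subsystem.
ROLES = ["FolE", "FolQ", "FolB", "FolK", "FolP", "FolC", "FolA"]

# Rows are genomes; each cell lists gene IDs for that role (IDs are made up).
subsystem = {
    "genome_1": {r: [f"g1_{i:04d}"] for i, r in enumerate(ROLES) if r != "FolQ"},
    "genome_2": {"FolA": ["g2_0001"]},
    "genome_3": {r: [f"g3_{i:04d}"] for i, r in enumerate(ROLES)},
}

def pathway_holes(subsystem, roles, min_fraction=0.8):
    """Report missing roles only for genomes where the pathway looks mostly present."""
    holes = {}
    for genome, cells in subsystem.items():
        present = [r for r in roles if cells.get(r)]
        if len(present) / len(roles) >= min_fraction:
            missing = [r for r in roles if not cells.get(r)]
            if missing:
                holes[genome] = missing
    return holes

holes = pathway_holes(subsystem, ROLES)
print(holes)
```

The threshold separates a genuine hole (pathway present, one gene unidentified) from a genome that probably lacks the pathway altogether, which is the functional-context check that one-dimensional annotation misses.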
Plant – prokaryote examples
• Prokaryote association evidence is mainly genomic
• Plant association evidence is mainly post-genomic
• Post-genomic evidence is noisier but very useful
• Superb plant post-genomic resources:
  - Microarrays, RNAseq (organ- and environment-specific)
  - Organellar targeting prediction, proteomics (location can rule out a function)
  - Phenome databases (chlorosis, lethality can support a function)
  - Vast plant metabolism bibliome
FolQ – Filling a pathway hole
[Figure: folate synthesis pathway. GTP → (FolE) → DHN-P3 → (FolQ) → DHN-P → (P-ase) → DHN → (FolB) → HMDHP → (FolK) → HMDHP-P2 → (FolP, + pABA) → DHP → (FolC, + Glu) → DHF → (FolA) → THF. pABA branch: Chorismate → (PabAB) → ADC → (PabC) → pABA]
[Figure: Lactococcus lactis folate gene cluster: folEK, folP, ylgG, folC]
• FolQ gene universally missing (prokaryotes, plants, fungi, protists)
• Missing step known to be a pyrophosphohydrolase, ~17 kDa
• Search genomes for small hydrolase clustered with fol genes
• YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa
• YlgG has a plant homolog – At1g68760
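The search strategy above (a small hydrolase clustered with fol genes) can be sketched as a filter over gene records. Only the ylgG record reflects the slide (19 kDa, Nudix hydrolase, flanked by folP and folC); the other records, the record layout and the 25 kDa size cutoff are hypothetical.

```python
FOL_GENES = {"folE", "folB", "folK", "folP", "folC", "folA"}

def folq_candidates(genes, max_kda=25.0):
    """Names of genes that are small, hydrolase-like, and adjacent to a fol gene."""
    hits = []
    for gene in genes:
        small = gene["mass_kda"] <= max_kda  # assumed cutoff near the ~17 kDa target
        hydrolase = "hydrolase" in gene["family"].lower()
        clustered = bool(FOL_GENES & set(gene["neighbors"]))
        if small and hydrolase and clustered:
            hits.append(gene["name"])
    return hits

genes = [
    {"name": "ylgG", "mass_kda": 19.0, "family": "Nudix hydrolase",
     "neighbors": ["folP", "folC"]},
    # The records below are invented to show each filter rejecting a gene.
    {"name": "hypA", "mass_kda": 45.0, "family": "Hydrolase", "neighbors": ["folP"]},
    {"name": "hypB", "mass_kda": 15.0, "family": "Transporter", "neighbors": ["folC"]},
    {"name": "hypC", "mass_kda": 18.0, "family": "Nudix hydrolase", "neighbors": ["rpsA"]},
]

candidates = folq_candidates(genes)
print(candidates)
```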
FolQ – Experimental tests
• YlgG & At1g68760 act on DHN-P3
• ylgG KO accumulates DHN-P3
[Figure: recombinant YlgG and At1g68760 proteins release DHN-P + PPi from DHN-P3; product formation (nmol/assay) of DHN-P, Pi and PPi vs time (minutes)]
[Figure: fluorescence traces showing DHN-P3 in wild type (WT) vs ylgG knockout (KO)]
COG0354 – Linking a new function to known system
COG0354 – A folate protein for Fe/S cluster repair in oxidative stress
• In all kingdoms of life
  - Bacteria
  - Archaea
  - Plants
  - Animals
  - Fungi
• 2 plant proteins
  - 1 related to rickettsias (mitochondria)
  - 1 related to cyanobacteria (plastids)
• Homolog of GcvT protein
  - But clearly a distinct clade
[Figure: phylogenetic tree of COG0354 and GcvT proteins. COG0354 sequences: Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, Synechococcus, Nostoc, Haloarcula, Natronomonas, Corynebacterium, Streptomyces, Blastopirellula, Pirellula, Mouse, Fly, Yeast, Leishmania, At4g12130, At1g60990. GcvT sequences: Solibacter, Yeast, Mouse, Arabidopsis, Rice. Label: Folate-dependent]
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
  - Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)
  - And with ferritin 2 (Fe storage)
[Figure: expression of mitochondrial COG0354, mitochondrial frataxin and ferritin 2 across a developmental series. Source: Arabidopsis Transcriptome DB (Max Planck Institute, Golm)]
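Co-expression of the kind described above is commonly quantified with a Pearson correlation across samples. The sketch below implements the coefficient directly; the expression values are invented for illustration and are not taken from the Arabidopsis Transcriptome DB.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Made-up expression levels over five developmental stages.
cog0354 = [1.0, 2.0, 3.0, 4.0, 5.0]
frataxin = [2.0, 4.1, 5.9, 8.2, 9.8]   # rises with COG0354
unrelated = [5.0, 1.0, 4.0, 2.0, 3.0]  # no consistent trend

r_frataxin = pearson(cog0354, frataxin)
r_unrelated = pearson(cog0354, unrelated)
print(round(r_frataxin, 3), round(r_unrelated, 3))
```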
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
[Figure: gene clusters containing COG0354 (0354); legend: COG0354, Fe/S protein, Fe/S partner]
● Suf cluster in Rubrobacter xylanophilus: 0354, sufC, sufB, sufS, sufD, thiC
● Nif cluster in Methylococcus capsulatus: 0354, nifQ, fd, nifX, nifN, nifE, fd, nifK, nifD, nifH
● Sdh operon in Stenotrophomonas maltophilia: 0354, sdhC, sdhD, sdhB, sdhA
● NAD synthesis cluster in Pelagibacter ubique: 0354, nadA, nadC
● MiaB (Radical SAM) in Buchnera aphidicola: 0354, MiaB
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
• Only occurs if IscA is present
  - IscA proteins are scaffolds in Fe/S cluster assembly
[Figure: presence (gene present) or absence (gene absent) of COG0354 and IscA across bacterial taxa (Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Acidobacteria, α-, β-, γ- and δ/ε-Proteobacteria, Spirochaetes, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes, Deinococcus/Thermus, Chloroflexi, Thermotogae) and archaeal taxa (Nanoarchaeota, Crenarchaeota, Euryarchaeota)]
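The statement that COG0354 only occurs if IscA is present is a one-directional dependency between phylogenetic profiles, and it can be checked by listing genomes that break it. The presence data below are hypothetical placeholders, not the distribution shown in the figure.

```python
def violations(presence, gene, required):
    """Genomes that contain `gene` but lack `required`."""
    return [
        genome
        for genome, genes in presence.items()
        if gene in genes and required not in genes
    ]

# Hypothetical gene content per taxon (made up for illustration).
presence = {
    "taxon_A": {"COG0354", "IscA"},
    "taxon_B": {"IscA"},  # IscA without COG0354 is allowed
    "taxon_C": set(),     # neither gene
    "taxon_D": {"COG0354", "IscA"},
}

# COG0354 => IscA holds if no genome has COG0354 without IscA.
cog_without_isca = violations(presence, "COG0354", "IscA")
# The reverse direction need not hold.
isca_without_cog = violations(presence, "IscA", "COG0354")
print(cog_without_isca, isca_without_cog)
```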
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
• Only occurs if IscA is present
• Associated with aerobic lifestyle
COG0354 – Comparative genomics & post-genomic data
• Co-expression in Arabidopsis
• Clusters with Fe/S proteins
• Only occurs if IscA is present
• Associated with aerobic lifestyle
• H2O2-induced in E. coli
• High-throughput screens
  - Essentiality & phenomics
  - Proteomics
● Essential gene in:
  - Mycobacterium tuberculosis
  - Haemophilus influenzae
  - Pseudomonas aeruginosa
● Important gene in:
  - E. coli (slow growth)
  - Yeast (petite)
● Plant proteins both expressed
● Cyano-like protein in plastids
● E. coli protein has folate site
COG0354 – Predictions & Experimental Validation
COG0354 PREDICTIONS, each followed by its experimental validation:
● Is a folate-dependent enzyme
  - Folate mutations abolish activity
● Combats oxidative stress
  - Mutant is oxidative stress-sensitive
● Helps make/repair Fe/S clusters
  - Mutant has many Fe/S enzyme defects
● Function is ancient & ubiquitous (like Fe/S proteins themselves)
  - Complementation by genes from all kingdoms
[Figure: complementation test on LB + plumbagin (oxidative stress). Controls: E. coli, Vector. Plant & mammal: Plant M, Plant C, Mammal. Fungi, protist, Archaea: Yeast, Protist, Archaea]
The power of comparative genomics
Hypothesis that connects and unifies observations
“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”
William Whewell (1794-1866)
English scientist, philosopher, Anglican priest
An early influence on Charles Darwin
Coined the term “scientist”