SIO, San Diego, May 2006 What's going on in the environment? Getting a grip on microbial physiology with genomics and metagenomics Rob Edwards http://phage.sdsu.edu/~rob Fellowship for Interpretation of Genomes, San Diego State University, Burnham Institute for Medical Research, IMEC, LLC
Outline • Sequencing statistics scare skeptics • The SEED database • Some simply stunning Subsystems • Mysterious missing methionine metabolism • Marine metabolism mined from metagenomics • Fabulous four-five-four for facile functional findings • Marine phage most puzzling
The Players • FIG: Fellowship for Interpretation of Genomes • NMPDR: Natl. Microbial Pathogen Data Resource • BRC: NIH Bioinformatics Resource Centers • SEED: The SEED database.
When will the 1,000th microbial genome be sequenced? [Plot: complete genomes (y-axis, 1,000 to 5,000) by year (x-axis, 1996 to 2008)]
Outline • Sequencing statistics scare skeptics • The SEED database • Some simply stunning Subsystems • Mysterious missing methionine metabolism • Marine metabolism mined from metagenomics • Fabulous four-five-four for facile functional findings • Marine phage most puzzling
The SEED database developed by FIG http://theseed.uchicago.edu/FIG/index.cgi • Current version: • 580 Bacteria (342 complete) • 38 Archaea (26 complete) • 562 Eukarya (29 complete) • 1335 Viruses • 2 Environmental Genomes
The problem: How do you generate consistent annotations for 1,000 genomes?
Basic biology lacI lacZ lacY lacA
Different types of clustering [Figure labels: < 80%, < 80%, < 80%]
Occurrence of clustering in different genomes [Chart: fraction of genes in clusters (left axis, 0 to 1) and number of genomes (right axis, 0 to 120) per group. Series: clusters of genes w/ maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Groups: Average, Aquificae, Firmicutes, Chloroflexi, Chlamydiae, Deinococcus-Thermus, Spirochaetes, Thermotogae, Bacteroidetes, Cyanobacteria, Actinobacteria, Proteobacteria]
Outline • Sequencing statistics scare skeptics • The SEED database • Some simply stunning Subsystems • Mysterious missing methionine metabolism • Marine metabolism mined from metagenomics • Fabulous four-five-four for facile functional findings • Marine phage most puzzling
The Subsystems Approach to Annotation • Subsystem is a generalization of “pathway” • collection of functional roles jointly involved in a biological process or complex • Functional Role is the abstract biological function of a gene product • atomic, or user-defined, examples: • 6-phosphofructokinase (EC 2.7.1.11) • LSU ribosomal protein L31p • Streptococcal virulence factors • Does not contain “putative”, “thermostable”, etc • Populated subsystem is complete spreadsheet of functions and roles
Subsystems developed based on • Wet lab • Chromosomal context • Metabolic context • Phylogenetic context • Microarray data • Proteomics data • …
Example Subsystem: Histidine Degradation • Conversion of histidine to glutamate • Functional roles defined in table • Inclusion in subsystem is only by functional role • Controlled vocabulary …
Subsystem Spreadsheet
Columns: HutH, HutU, HutI, GluF, HutG, NfoD, ForI, Organism, Variant
Annotated genes per organism, with variant code:
Bacteroides thetaiotaomicron: Q8A4B3, Q8A4A9, Q8A4B1, Q8A4B0 (variant 1)
Desulfotalea psychrophila: gi51246205, gi51246204, gi51246203, gi51246202 (variant 1)
Halobacterium sp.: Q9HQD5, Q9HQD8, Q9HQD6, Q9HQD7 (variant 2)
Deinococcus radiodurans: Q9RZ06, Q9RZ02, Q9RZ05, Q9RZ04 (variant 2)
Bacillus subtilis: P10944, P25503, P42084, P42068 (variant 2)
Caulobacter crescentus: P58082, Q9A9MI, P58079, Q9A9M0, Q9A9L9 (variant 3)
Pseudomonas putida: Q88CZ7, Q88CZ6, Q88CZ9, Q88D00, Q88CZ3 (variant 3)
Xanthomonas campestris: Q8PAA7, P58988, Q8PAA6, Q8PAA8, Q8PAA5 (variant 3)
Listeria monocytogenes: no genes (variant -1)
• Column headers taken from table of functional roles • Rows are selected genomes or organisms • Cells are populated with specific, annotated genes • Functional variants defined by the annotated roles • Variant code -1 indicates subsystem is not functional • Clustering shown by color
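The spreadsheet described on this slide can be sketched as a small data structure. This is only an illustration of the idea (roles as columns, organisms as rows, genes as cells, one variant code per row), not the SEED's actual schema. The class name `Subsystem`, its methods, and the organism and gene names in the example are hypothetical placeholders; only the role names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """A populated subsystem: functional roles (columns) x organisms (rows)."""
    name: str
    roles: list                                    # column headers
    cells: dict = field(default_factory=dict)      # organism -> {role: gene id}
    variants: dict = field(default_factory=dict)   # organism -> variant code

    def add_row(self, organism, variant, genes):
        # Inclusion is only by functional role, so reject unknown roles.
        unknown = set(genes) - set(self.roles)
        if unknown:
            raise ValueError(f"roles not in subsystem: {sorted(unknown)}")
        self.cells[organism] = dict(genes)
        self.variants[organism] = variant

    def is_functional(self, organism):
        # Variant code -1 marks a subsystem that is not functional.
        return self.variants[organism] != "-1"

    def empty_roles(self, organism):
        # Roles with no annotated gene in this organism.
        return [r for r in self.roles if r not in self.cells[organism]]


# Hypothetical rows: the organism names and gene ids are placeholders.
hut = Subsystem(
    "Histidine Degradation",
    ["HutH", "HutU", "HutI", "GluF", "HutG", "NfoD", "ForI"],
)
hut.add_row("Organism A", "1", {"HutH": "gene_1", "HutU": "gene_2", "HutI": "gene_3"})
hut.add_row("Organism B", "-1", {})
print(hut.empty_roles("Organism A"))
```

Listing the empty roles for a row is one simple way to surface candidate "missing genes" for follow-up.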
Subsystem Spreadsheet: “The Populated Subsystem”
Subsystem Diagram • Three functional variants • Universal subset has three roles, followed by three alternative paths from IV to VI • No ForI known experimentally
Subsystem Spreadsheet • Prediction from subsystems confirmed experimentally
Outline • Sequencing statistics scare skeptics • The SEED database • Some simply stunning Subsystems • Mysterious missing methionine metabolism • Marine metabolism mined from metagenomics • Fabulous four-five-four for facile functional findings • Marine phage most puzzling
How do bacteria make methionine? • Acquire homoserine • Convert cysteine to cystathionine • Convert cystathionine to homocysteine • Sulfur and acetylhomoserine sulfhydrylase • Acquire met or convert homocysteine to methionine
? Missing genes ?
Cyanoseed: http://cyanoseed.theFIG.info
Marineseed: http://theseed.uchicago.edu/FIG/organisms.cgi?show=marine
Subsystems are not just for gene clusters • genome context (virulence islands, prophages, conserved gene clusters) • virulence mechanism • enzymatic activity • cellular localization • predicted or measured co-regulation • common phenotype • combinations of criteria
How much progress has been made? • 541 subsystems encoded • 80 – 85% of the genes in core machinery are contained in subsystems • 30 – 35% of the genes in NMPDR organism genomes are contained in subsystems • 20 – 30% of the genes in other genomes are contained in subsystems
Outline • Sequencing statistics scare skeptics • The SEED database • Some simply stunning Subsystems • Mysterious missing methionine metabolism • Marine metabolism mined from metagenomics • Fabulous four-five-four for facile functional findings • Marine phage most puzzling
Metagenomics • 200 liters water or 5-500 g fresh fecal matter • Concentrate and purify viruses • Epifluorescent microscopy • Extract nucleic acids (DNA/RNA) • LASL • Sequence. Breitbart et al., multiple papers
Control datasets for metagenome comparisons Number of proteins in different datasets
Determination of Statistical Differences Between Metagenomes • Take 10,000 proteins from sample 1 • Count frequency of each subsystem • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from samples 1 and 2 with 95% CI Rodriguez-Brito (2006). BMC Bioinformatics
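The resampling steps on this slide can be sketched in Python with NumPy. This is one reading of the bullets, not the published implementation. It assumes sampling with replacement and flags a subsystem when either sample's median count falls outside the pooled 95% interval; the slide specifies neither choice. The function names and the toy labels are hypothetical.

```python
import numpy as np


def resample_counts(labels, categories, n_draw, n_rep, rng):
    """Draw n_draw proteins with replacement, n_rep times.

    Returns an (n_rep, len(categories)) matrix of subsystem counts.
    Drawing with replacement and counting is a multinomial draw whose
    probabilities are the observed subsystem proportions.
    """
    labels = np.asarray(labels)
    observed = np.array([(labels == c).sum() for c in categories], dtype=float)
    probs = observed / observed.sum()
    return rng.multinomial(n_draw, probs, size=n_rep)


def compare_metagenomes(sample1, sample2, n_draw=10_000, n_rep=20_000, seed=0):
    """Compare per-subsystem medians of two samples to a pooled 95% interval."""
    rng = np.random.default_rng(seed)
    categories = sorted(set(sample1) | set(sample2))

    # Median count of each subsystem across the resamples of each sample.
    med1 = np.median(resample_counts(sample1, categories, n_draw, n_rep, rng), axis=0)
    med2 = np.median(resample_counts(sample2, categories, n_draw, n_rep, rng), axis=0)

    # Pool both samples and build the 95% interval from the pooled resamples.
    pooled = resample_counts(list(sample1) + list(sample2), categories, n_draw, n_rep, rng)
    low, high = np.percentile(pooled, [2.5, 97.5], axis=0)

    results = {}
    for i, cat in enumerate(categories):
        # Assumption: "different" means either median lies outside the interval.
        outside = not (low[i] <= med1[i] <= high[i] and low[i] <= med2[i] <= high[i])
        results[cat] = {
            "median1": float(med1[i]),
            "median2": float(med2[i]),
            "ci": (float(low[i]), float(high[i])),
            "differs": outside,
        }
    return results


if __name__ == "__main__":
    # Toy input: one subsystem label per protein (hypothetical data).
    site1 = ["Iron acquisition"] * 900 + ["Nitrification"] * 100
    site2 = ["Iron acquisition"] * 100 + ["Nitrification"] * 900
    for name, row in compare_metagenomes(site1, site2, n_draw=1000, n_rep=2000).items():
        print(name, row)
```

Each input is a list with one subsystem label per protein. The defaults mirror the slide's 10,000 proteins and 20,000 repetitions, and smaller values keep a quick check fast.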
Comparison of all Subsystems More in Sargasso More in SEED
Is serine being used as an osmolyte? • Few trehalose, proline, sucrose synthetic genes • Serine is most abundant amino acid in ocean (Suttle, Keil) • Serine is more effective osmoprotectant than glycine betaine (Yancey)
Outline • Sequencing statistics scare skeptics • The SEED database • Some simply stunning Subsystems • Mysterious missing methionine metabolism • Marine metabolism mined from metagenomics • Fabulous four-five-four for facile functional findings • Marine phage most puzzling
Metagenomics • 200 liters water or 5-500 g fresh fecal matter • Concentrate and purify viruses • Epifluorescent microscopy • Extract nucleic acids (DNA/RNA) • LASL ("So 2004"), now 454 • Sequence. Breitbart et al., multiple papers
454 Sequence Data (Only from Rohwer Lab, in one year) • 42 libraries • 22 microbial, 20 phage • 1,028,563,420 bp total • 33% of the human genome • 95% of all complete and partial bacterial genomes • 10% of community sequencing of JGI per year • 9,933,184 sequences • Average 236,511 per library • Average read length 103.5 bp • Av. read length has not increased in 12 months
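The derived figures on this slide can be checked from its raw totals with a few lines of arithmetic. The human genome size used here (about 3.1 billion bp) is an assumption and is not stated on the slide.

```python
# Totals taken from the slide.
total_bp = 1_028_563_420
total_reads = 9_933_184
libraries = 42

# Assumption: approximate human genome size in bp (not given on the slide).
HUMAN_GENOME_BP = 3.1e9

mean_read_length = total_bp / total_reads      # bp per read
reads_per_library = total_reads / libraries    # reads per library
human_fraction = total_bp / HUMAN_GENOME_BP    # share of one human genome

print(f"{mean_read_length:.1f} bp per read")
print(f"{reads_per_library:,.0f} reads per library")
print(f"{human_fraction:.0%} of the human genome")
```

The read length and the human genome share match the slide (103.5 bp and 33%). The per-library average comes out at about 236,504, within a few reads of the 236,511 shown.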
The Soudan Mine, Minnesota • Red Stuff: Oxidized • Black Stuff: Reduced
Red and Black Samples Are Different • Cloned and 454 sequenced 16S are indistinguishable [Figure labels: Black stuff; Cloned Red; Red]
There are different amounts of metabolism in each environment
There are different amounts of substrates in each environment [Figure panels: Red Stuff; Black Stuff]
But are the differences significant? • Sample 10,000 proteins from site 1 • Count frequency of each “subsystem” • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from sites 1 and 2 with 95% CI Rodriguez-Brito (2006). BMC Bioinformatics
Subsystem differences & metabolism: Iron acquisition • Black Stuff: Siderophore enterobactin biosynthesis; ferric enterobactin transport; ABC transporter ferrichrome; ABC transporter heme • Black stuff: ferrous iron (Fe2+, ferroan [(Mg,Fe)6(Si,Al)4O10(OH)8]) • Red stuff: ferric iron (goethite [FeO(OH)])
Nitrification differentiates the samples Edwards (2006) BMC Genomics
The challenge is explaining the differences between samples • Red Sample: Arg, Trp, His; Ubiquinone; FA oxidation; Chemotaxis, Flagella; Methylglyoxal metabolism • Black Sample: Ile, Leu, Val; Siderophores; Glycerolipids; NiFe hydrogenase; Phenylpropionate degradation
• We can cheaply compare the important biochemistry happening in different environments • We don’t care which organisms are doing the metabolism, but we know what organisms are there