640 likes | 750 Views
Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?]. Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research. Outline.
E N D
Challenges for metagenomic data analysis and lessons from viral metagenomes[What would you do if sequencing were free?] Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research
Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?
Why Metagenomics? • What is there? • How many are there? • What are they doing?
How do you sequence the environment? • Extract DNA
CsCl step gradient 1.1 g ml-1 1.35 g ml-1 1.5 g ml-1 1.7 g ml-1
How do you sequence the environment? • Extract DNA • Create library
David Mead - Linker-Amplified Shotgun Libraries (LASLs) Soil Extraction Kit This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA Breitbart (2002) PNAS
How do you sequence the environment? • Extract DNA • Create library • Sequence fragments
Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?
Why Phages? • Phages are viruses that infect bacteria • 10:1 ratio of phages:bacteria • 1031 phages on the planet • Specific interactions (probably) • one virus : one host • Small genome size • Higher coverage • Horizontal gene transfer • 1025-1028 bp DNA per year in the oceans
Uncultured Viruses 200 liters water 5-500 g fresh fecal matter Concentrate and purify viruses Epifluorescent Microscopy Extract nucleic acids DNA/RNA LASL Sequence
Bioinformatics • BLASTagainst NR • blastx, tblastn, tblastx • BLAST against boutique databases • Complete phage genomes, ACLAME, Other libraries, 16S • Parsing to present data in a useful format
BLAST and Parsing • http://phage.sdsu.edu/blast • Submit BLAST to local and remote databases • Local (as fast as possible) • NCBI (one search every 3 seconds) • Many concurrent searches • One search versus 1,000 searches • Parse data into tables • Access to taxonomy etc
Most Viral Genes are Unknown Known 22% Unknown 78% TBLAST (E<0.001) 3,093 sequences Breitbart (2002) PNAS Rohwer (2003) Cell
GenBank has more than doubled since 2001 … 60 billion base pairs 60 million sequences
GenBank has more than doubled since 2001 …but the fraction of unknowns remains constant Edwards (2005) Nature Rev. Microbiol.
All of the new genes in the databases are coming from environmental sequences
Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?
Human-associated viruses • More bacteria than somatic cells by at least an order of magnitude • More phages than bacteria by an order of magnitude • Sample the bacteria in the intestine by sampling their phage
Most Viral DNA Sequences in Adult Human Feces are Unknown Phages Eukaryotic Viruses 6% Known 40% Unknown 60% Phages 94% TBLAST (E<0.001) 532 sequences Breitbart (2003) J. Bacteriol.
Adults Versus Babies No bacteria or viruses in 1st fecal sample Abundant bacterial and viral communities by 1 week of age • >108 VLP/g feces
Baby Feces Viruses • Most sequences are unknown (≈70%) • Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts • From microarray studies, sequences are stable in the baby over a 3 month period • Same types of phage as present in adult feces • one identical sequence to an unrelated adult!
DNA viruses in feces are phages. Feces ≠ intestines. RNA viruses?
Most Human RNA Viruses are Known Other Plant Viruses 9% Unknown 8% Other 26% Pepper Mild Mottle Virus 65% Known 92% TBLAST (E<0.001) ≈36,000 sequences Zhang (2006) PLoS Biology
Pepper Mild Mottle Virus (PMMV) • ssRNA virus; ≈6 kb genome • Related to Tobacco Mosaic Virus • Infects members of Capsicum family • Widely distributed – spread through seeds • Fruits are small, malformed, mottled • Rod-shaped virions Viral particles in fecal sample TOBACCO MOSAIC VIRUS http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/
S1 S2 S3 S4 S5 S6 S7 S8 S9 PMMV PMMV is common in Human Feces Fecal samples Extract total RNA RT-PCR for PMMV San Diego : 78% people are positive Singapore : 67% people are positive 10-50 fold increase in feces compared to food 106-109 PMMV copies per gram dry weight of feces
Which Foods Contain PMMV? Chili powder Chili sauces NOT FOUND IN FRESH PEPPERS Hong Kong chili sauce Hong Kong green chili Pork noodle red chili Vegetarian chili Indian curry Chinese food Chicken rice
Koch’s Postulates Thesunmachine.net http://www.sweatnspice.com
Human microbial metagenome is more important than human genome
Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?
How do you sequence the environment? • Extract DNA • Create library • Sequence fragments Everything so far from 40,000 sequences This is so 2004
How do you sequence the environment? • Extract DNA • Pyrosequence
454 Pyrosequencing • DNA extraction from environment • Whole genome amplification } • Emulsion-based PCR • Luciferase-based sequencing SDSU } 454 Inc. Margulies (2005) Nature
454 Sequence Data(Only from Rohwer Lab) • 21 libraries • 10 microbial, 11 phage • 597,340,328 bp total • 20% of the human genome • 50% of all complete and partial microbial genomes • 5,769,035 sequences • Average 274,716 per library • Average read length 103.5 bp • Av. read length has not increased in 7 months
Growth of sequence data 600 million bp 6 million reads
Cost of sequencing • One reaction = $10,000 • One reaction = 250,000 reads • 250 reads = $10 • 1 read = 4¢ • 1 read = 100bp • 1 bp = 0.04¢ ($400 per 1x 1,000,000 bp) • Sanger sequencing ca. $1/rxn, 0.2¢/bp • real cost ca. $5/rxn, 1¢/bp 454 sequencing does cot require cloning, arraying etc.
Bioinformatics • 597,340,328 bp total • 5,769,035 sequences • 7 months • Existing tools are not sufficient
Current Pipeline http://phage.sdsu.edu/~rob/Pyrosequencing/ • Dereplicate • BLAST against • 16S • Complete phage • nr (SEED) • subsystems
Sequencing is cheap and easy. Bioinformatics is neither.
Outline • How and why we sequence environments • Viral metagenomics • Marine stories • Human stories • Pyrosequencing • Mine story • Is there a Future?
The Soudan Mine, Minnesota Red Stuff Oxidized Black Stuff Reduced
Red and Black Samples Are Different Black stuff Cloned and 454 sequenced 16S are indistinguishable Cloned Red Red
Annotation of metagenomes by subsystems A subsystem is a group of genes that work together • Metabolism • Pathway • Cellular structures • Anything an annotator thinks is interesting
There are different amounts of metabolism in each environment
There are different amounts ofsubstrates in each environment Red Stuff Black Stuff
But are the differences significant? • Sample 10,000 proteins from site 1 • Count frequency of each subsystem • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from sites 1 and 2 with 95% CI Rodriguez-Brito (2006). In Review