Metagenome analysis: use case Natalia Ivanova MGM Workshop September 29, 2011
Minoan eruption and metagenomics
"…it seemed as though the sea was being sucked backwards, as if it were being pushed back by the shaking of the land…Behind us were frightening dark clouds, rent by lightning twisted and hurled, opening to reveal huge figures of flame. These were like lightning, but bigger."
From Pliny the Younger’s Letter
Apart from Minoan eruption…
[Figures: diagram by Gary Massoth/PMEL; figure from Chernicoff & Stanley, Geology, 2007]
Sampling sites: white mat and red mat
Key gradients, white vs red:
• Temperature: 60 °C vs 18 °C
• CO2 tension: >99% vs <1%
Standard JGI metagenome pipeline
DNA sample => DNA QC => shotgun libraries (454 standard, 454 long mate pair, Illumina standard, Illumina long mate pair) and SSU pyrotags
SSU pyrotags => Analysis (http://pyrotagger.jgi-psf.org)
• Community composition
• Semi-quantitative – OTU abundance
Shotgun libraries => Assembly => Metagenome in IMG/M-ER (contigs + unassembled reads) => Analysis
• Community composition
• Functional analysis
PhyloDistribution results – BLASTp of metagenome CDSs against isolates in IMG
Pyrotags vs PhyloDistribution – white mat
Big differences in abundance (an order of magnitude or more) of Bacteroidetes and Thermotogae
Possible explanations
• Primer bias in pyrotags (against Proteobacteria)?
• Amplification artifacts in pyrotags – well known for metagenome data
• Sequencing GC bias in the metagenome – low- and high-GC sequences (<30% and >65%) are underrepresented in Illumina data
• K-mer assembler problems: abundant populations may be underrepresented in the assembly if incorrect k-mer/coverage parameters are selected
PCR artifacts in metagenome data
454 technology includes an emulsion PCR step, which may lead to artificial overrepresentation of certain sequences
Reason: presence of free beads during the library prep step; escaped emPCR products bind to free beads and are disproportionately amplified
What about GC bias?
[Figure panels: Medium GC (Arcanobacterium); High GC (Cellulomonas); Low GC (Brachyspira)]
Question: how do you find average/max/min GC content for a clade?
Answer: IMG=>Genome Browser=>View Phylogenetically=>click on green + to select the clade, then “Add selected to Genome Cart”=>Compare Genomes=>Genome Statistics
Result:
• Thermotogae GC percent: 41 average / 47 max / 31 min
• Bacteroidetes GC percent: 42.5 average / 66 max / 31 min
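Outside IMG, the same average/max/min summary can be approximated directly from genome sequences. The sketch below is a minimal illustration, not the IMG Genome Statistics implementation: the function names, the toy sequences, and the choice to ignore ambiguous bases are all assumptions. The <30% and >65% cut-offs are the ones quoted on the "Possible explanations" slide.

```python
from statistics import mean


def gc_percent(seq):
    """Return the GC content of a nucleotide sequence as a percentage."""
    seq = seq.upper()
    # Count only unambiguous bases so that N's do not dilute the estimate
    # (an assumption of this sketch).
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def clade_gc_summary(genomes):
    """Summarise per-genome GC percent for a clade (dict: name -> sequence)."""
    values = [gc_percent(seq) for seq in genomes.values()]
    return {"average": mean(values), "max": max(values), "min": min(values)}


def is_extreme_gc(gc, low=30.0, high=65.0):
    """Flag GC values in the ranges the slides describe as underrepresented."""
    return gc < low or gc > high


# Hypothetical toy sequences, far shorter than real genomes.
toy_clade = {
    "genome_a": "ATGCATGCAT",  # 4 of 10 bases are G or C
    "genome_b": "GGCCGGCCAT",  # 8 of 10 bases are G or C
    "genome_c": "ATATATATGC",  # 2 of 10 bases are G or C
}
summary = clade_gc_summary(toy_clade)
print(summary)
print({name: is_extreme_gc(gc_percent(seq)) for name, seq in toy_clade.items()})
```

With the clade values reported above (Thermotogae 31 to 47, Bacteroidetes 31 to 66), only the Bacteroidetes maximum crosses the 65% cut-off.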
Are there any abundant populations that could be filtered out in assembly?
[Figure: typical Pyrotagger output]
There are 2 highly abundant populations – just 2 clusters account for nearly all Bacteroidetes and Thermotogae in the sample
Let’s take a closer look at the assemblies and unassembled reads
Functional analysis: metagenome as a bag of functions
Red mat is taxonomically more diverse. Is it more diverse functionally?
Rarefaction curves: white mat is expected to have ~4000 different Pfams; red mat ~3600
Question: where do you find this information?
Answer: IMG=>Taxon Details=>Metagenome Statistics; Genes with Pfam=>Display as a list=>Export
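A rarefaction curve can be computed from an exported list of Pfam gene counts. The sketch below uses the standard analytical (hypergeometric) rarefaction formula; it is an assumption that this matches how the curves on the slide were produced, and the counts are hypothetical toy values rather than data from either mat. Note that it only interpolates within the observed sample; estimating a plateau such as ~4000 Pfams requires extrapolation, which is not shown.

```python
from math import comb


def expected_families(counts, n):
    """Expected number of distinct families in a random subsample of n genes.

    counts: number of genes observed for each family (e.g. each Pfam).
    Uses E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)), where N is the total
    number of genes and N_i the number of genes in family i.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total gene count")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when n > total - c, i.e. the family is then
    # certain to appear in the subsample.
    return sum(1 - comb(total - c, n) / denom for c in counts)


# Hypothetical toy data: four families with 5, 3, 1 and 1 genes.
toy_counts = [5, 3, 1, 1]
curve = [expected_families(toy_counts, n) for n in range(sum(toy_counts) + 1)]
for n, value in enumerate(curve):
    print(n, round(value, 3))
```

Plotting such a curve for each sample at the same subsampling depths allows functional richness to be compared even when the two metagenomes differ in size.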
Abundance Comparisons
Motility and chemotaxis genes are overrepresented in white mat (detected by both Pfams and COG Categories)
[Figure panels: white mat; red mat]
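One simple way to check whether a functional category is overrepresented in one sample is to compare its share of all genes in each metagenome. The sketch below uses a pooled two-proportion z-test; this is an illustrative choice and not necessarily the statistic IMG applies in its abundance comparison tools. The gene counts are hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z(k1, n1, k2, n2):
    """Compare k1/n1 with k2/n2; return (z, two-sided p-value).

    k1, k2: genes assigned to the category in sample 1 and sample 2.
    n1, n2: total genes considered in each sample.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("category is absent from or universal in both samples")
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: 300 of 100,000 genes in sample 1 vs 150 of 100,000
# genes in sample 2 fall in the category of interest.
z, p_value = two_proportion_z(300, 100_000, 150, 100_000)
print(f"z = {z:.2f}, p = {p_value:.2e}")
```

When many Pfams or COG categories are tested at once, the p-values need a multiple-testing correction before any category is called overrepresented.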
Is motility/chemotaxis common to all organisms in white mat?
Scenario 1: the function/pathway is overrepresented because it is present in all members of the community, possibly at higher copy number
Scenario 2: the function/pathway is overrepresented because it is present in one clade, which is absent from the second sample
Question: can we distinguish between the two scenarios?
Answer: click on the gene count for protein family/functional category, add all genes to Gene Cart=>add scaffolds to Scaffold Cart=>PhyloDistribution of all scaffolds in the Scaffold Cart
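The logic behind the Scaffold Cart step can be sketched as a tally: once each scaffold carrying a gene of interest has a clade assignment, check whether one clade accounts for most of them. The code below is a minimal illustration under stated assumptions: the clade labels are hypothetical, the 0.8 threshold is arbitrary, and real PhyloDistribution output would first have to be exported and parsed.

```python
from collections import Counter


def clade_fractions(clades):
    """Return the share of scaffolds assigned to each clade."""
    if not clades:
        raise ValueError("no clade assignments given")
    counts = Counter(clades)
    total = sum(counts.values())
    return {clade: n / total for clade, n in counts.items()}


def dominant_clade(clades, threshold=0.8):
    """Return (clade, share) if one clade reaches the threshold, else None.

    A dominant clade points towards Scenario 2 (function carried by one
    clade); no dominant clade is more consistent with Scenario 1.
    The threshold is an arbitrary value chosen for illustration.
    """
    fractions = clade_fractions(clades)
    clade, share = max(fractions.items(), key=lambda item: item[1])
    return (clade, share) if share >= threshold else None


# Hypothetical clade assignments for scaffolds that carry motility genes.
one_clade_case = ["clade_A"] * 9 + ["clade_B"]
mixed_case = ["clade_A"] * 4 + ["clade_B"] * 3 + ["clade_C"] * 3

print(dominant_clade(one_clade_case))
print(dominant_clade(mixed_case))
```

A fuller comparison would also take each clade's share of the whole community into account, since an abundant clade contributes many scaffolds for every function.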
Are Sulfurimonas-like bacteria present in both samples?
• The total number of sequences in all clusters assigned to Epsilonproteobacteria is 50 in white mat and 66 in red mat
• Largest cluster in white mat includes 125K+ sequences
• Largest cluster in red mat includes 14K+ sequences
Question: what about the presence of Sulfurimonas-like bacteria in the metagenomes?
Answer: go to Compare Genomes=>PhyloDistribution=>Genome vs Metagenomes, select the genome; the histogram shows the number of BLASTp hits from CDSs in all metagenomes to this genome
Conclusions
• The two communities have different compositions; the white mat, sampled next to the hydrothermal vent, has lower complexity
• Community composition as sampled by pyrotags and by the metagenome may be quite different due to a number of biases
• Some protein families/functional categories are more abundant in one sample than in the other because of different community composition, and not necessarily because they are more important in that environment