Metagenomic dataset preprocessing – data reduction
Konstantinos Mavrommatis
KMavrommatis@lbl.gov
Complexity
[Figure: example metagenomes along a species complexity axis (1, 10, 100, 1000, 10000): Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil.]
The total metagenome is the result of a cell community. Cells belong to different organisms ranging from strains to domains.
• Who is there? (Phylogenetic content)
• What does it do? (Functional content)
• Why is it there? (Comparative study)
Dataset processing
[Workflow diagram; box labels as extracted: Sample preparation; High throughput sequencing; Assemble reads; Analysis; Feature prediction; ?; QC; Functional annotation and comparative analysis; Binning.]
Dataset processing (v 3.0a)
Submitted files (fasta/fastq): assembled contigs; 454 reads; Illumina reads.
• File QC: check the character set and the contig names.
• Remove trailing Ns.
• Trimming: Q=20. Trimming: Q=13.
• Low complexity filter: size of 80 bp.
• Dereplication: prefix = 5, identity 95%.
• Clustering: 100% identity.
Output: fasta file for gene calling.
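The File QC and trailing-N steps on this slide could look roughly like the sketch below. It is only an illustration: the allowed alphabet (`VALID_CHARS`), the rule that names must be non-empty and free of whitespace, and the function names are assumptions, not the pipeline's actual checks.

```python
import re
from typing import List

# Assumed alphabet; the real pipeline may also accept IUPAC ambiguity codes.
VALID_CHARS = set("ACGTNacgtn")


def strip_trailing_ns(seq: str) -> str:
    """Remove any run of N/n characters at the end of the sequence."""
    return seq.rstrip("Nn")


def check_record(name: str, seq: str) -> List[str]:
    """Return a list of problems found in one fasta record (empty if none)."""
    problems = []
    # Assumed naming rule: a usable name is non-empty and has no whitespace.
    if not name or re.search(r"\s", name):
        problems.append("name is empty or contains whitespace")
    bad = sorted(set(seq) - VALID_CHARS)
    if bad:
        problems.append("unexpected characters: " + "".join(bad))
    return problems


if __name__ == "__main__":
    print(check_record("contig 1", "ACGX"))
    print(strip_trailing_ns("ACGTNNN"))
```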
Dataset processing: Feature prediction pipeline (v 3.0a)
Input: fasta file for gene calling (unassembled reads + assembled contigs).
• CRISPR detection: crt / pilercr.
• RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam).
• CDS detection. Isolates: prodigal. Metagenomes: varies.
• Conflict resolution.
• Concatenation of all results. Creation of final output file.
Output: file for IMG.
Dataset processing: Quality trimming
[Figure courtesy of Alex Copeland. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/]
• Remove low-quality sequence from the ends of the reads.
• 454 datasets: lucy.
• Illumina datasets: keep the longest high quality string.
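The "longest high quality string" idea for Illumina reads can be sketched as follows. This is a minimal illustration, assuming quality values have already been decoded to integer Phred scores; the function names and the default threshold are placeholders, and the pipeline's real trimmer may behave differently.

```python
def longest_high_quality_span(quals, min_q=20):
    """Return (start, end) of the longest run where every quality >= min_q.

    `end` is exclusive. If no base passes, (0, 0) is returned.
    Ties are resolved in favour of the earliest run.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(seq, quals, min_q=20):
    """Keep only the longest high-quality stretch of a read."""
    start, end = longest_high_quality_span(quals, min_q)
    return seq[start:end], quals[start:end]


if __name__ == "__main__":
    seq = "ACGTACGT"
    quals = [5, 30, 30, 10, 30, 30, 30, 2]
    print(trim_read(seq, quals, min_q=20))
```

The threshold is a parameter so that the same function can be run with either of the cut-offs shown in the pipeline overview (Q=20 or Q=13).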
Dataset processing: Low complexity filter
Examples of low complexity sequence:
tatatatatatatatatat
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
• Using dust (NCBI).
• Remove sequences with fewer than 80 informative bases.
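To make the "informative bases" criterion concrete, here is a simplified DUST-style stand-in. It is not NCBI's dust implementation: the fixed, non-overlapping 64 nt windows, the score threshold of 2.0, and all function names are assumptions chosen for illustration. Only the 80-base cut-off comes from the slide.

```python
from collections import Counter


def dust_score(window: str) -> float:
    """DUST-style score: sum over triplets of c*(c-1)/2, divided by (triplets - 1)."""
    triplets = [window[i:i + 3] for i in range(len(window) - 2)]
    if len(triplets) < 2:
        return 0.0
    counts = Counter(triplets)
    return sum(c * (c - 1) / 2 for c in counts.values()) / (len(triplets) - 1)


def informative_bases(seq: str, window: int = 64, threshold: float = 2.0) -> int:
    """Count a/c/g/t bases that lie outside windows flagged as low complexity."""
    seq = seq.lower()
    masked = [False] * len(seq)
    # Assumption: fixed, non-overlapping windows; real dust uses a sliding scheme.
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        if dust_score(chunk) > threshold:
            for i in range(start, start + len(chunk)):
                masked[i] = True
    return sum(1 for base, m in zip(seq, masked) if not m and base in "acgt")


def passes_filter(seq: str, min_informative: int = 80) -> bool:
    """Keep a sequence only if it has at least `min_informative` informative bases."""
    return informative_bases(seq) >= min_informative


if __name__ == "__main__":
    # All 64 triplets concatenated: a simple, varied 192 nt test sequence.
    varied = "".join(a + b + c for a in "acgt" for b in "acgt" for c in "acgt")
    for name, s in [("poly-a", "a" * 100), ("ta-repeat", "ta" * 50), ("varied", varied)]:
        print(name, informative_bases(s), passes_filter(s))
```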
Dataset processing: Sequence dereplication
[Figure: two groups of example reads; the figure carries the label "Not dereplicated".]
atcccat atc-cat atcccat atcccat atcccat
gctacat gctncat gctacat gctacat
• Using uclust.
• 95% identity (global alignment).
• Identical prefix (5 nt).
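The two dereplication criteria (identical 5 nt prefix, 95% global identity) can be illustrated with a small greedy sketch. This does not reproduce uclust: identity is approximated here as 1 minus edit distance divided by the longer length, reads are processed longest first, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Assumed identity measure: 1 - edit distance / length of the longer read."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def dereplicate(reads, prefix_len=5, min_identity=0.95):
    """Greedy dereplication: drop a read if it shares its prefix with an
    already kept read and is at least `min_identity` identical to it."""
    buckets = {}  # prefix -> kept reads that start with this prefix
    kept = []
    for read in sorted(reads, key=len, reverse=True):
        bucket = buckets.setdefault(read[:prefix_len], [])
        if any(identity(read, rep) >= min_identity for rep in bucket):
            continue  # near-duplicate of a kept read with the same prefix
        bucket.append(read)
        kept.append(read)
    return kept


if __name__ == "__main__":
    base = "atcccatgca" * 4
    near = base[:20] + "n" + base[21:]   # one mismatch, same prefix
    other_prefix = "g" + base[1:]        # one mismatch inside the prefix
    print(dereplicate([base, near, other_prefix]))
```

In this sketch a read that differs inside the first 5 nt is never compared with the original, so it is kept even when the overall identity is above the threshold.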
Dataset processing: Evaluation of processing tools
• Unassembled sequences, because of their small size, quality problems, and large number, need to be processed with efficient pipelines.
• Simulated datasets:
  • Using sequences extracted from finished genomes (perfect sequences).
  • Using reads that have been used to assemble finished genomes (real errors).
• Evaluation and development of new tools/wrappers.
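A "perfect sequences" dataset of the kind described here could be generated along these lines. This is a hedged sketch only: the uniform choice of start positions, the fixed read length, and the function name are assumptions, and the slides do not say how the simulated datasets were actually built.

```python
import random


def simulate_perfect_reads(genome: str, read_length: int, n_reads: int, seed: int = 0):
    """Sample error-free reads from a finished genome sequence.

    Returns a list of (start, read) pairs so that each read can be traced
    back to its true position when evaluating a tool.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, len(genome) - read_length)  # inclusive bounds
        reads.append((start, genome[start:start + read_length]))
    return reads


if __name__ == "__main__":
    toy_genome = "acgttgcaagctt" * 20  # toy stand-in for a finished genome
    for start, read in simulate_perfect_reads(toy_genome, 30, 3):
        print(start, read)
```

Keeping the true start position with every read makes it possible to check afterwards whether a feature predicted on the read matches the annotation of the source genome.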
Dataset processing: Feature prediction
Available methods:
• Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal.
• Similarity based: Blastx, USEARCH.
[Figure: predictions on an isolate compared with a metagenome; labels: MISSED, CORRECT, WRONG, NEW.]
[Figure: contigs; labels: frameshift, wrong prediction.]
Why annotate unassembled reads?
• Additional information about functions and phylogeny.
• More accurate statistics based on unassembled + assembled.
[Figure labels: Assembled only; Unassembled + assembled + real metagenome.]