Metagenomic dataset preprocessing – data reduction

Konstantinos Mavrommatis, KMavrommatis@lbl.gov

Complexity

The total metagenome is the result of a cell community. Cells belong to different organisms, ranging from strains to domains.

• Who is there? (Phylogenetic content)
• What does it do? (Functional content)
• Why is it there? (Comparative study)

Figure: example communities (Acid Mine Drainage, Sargasso Sea, Termite Hindgut, Cow rumen, Soil) on a species complexity axis (1, 10, 100, 1000, 10000).

Dataset processing

• Sample preparation
• High throughput sequencing
• Assemble reads
• Analysis
• Feature prediction
• ?
• QC
• Functional annotation and comparative analysis
• Binning

Dataset processing (v 3.0a)

Submitted files (fasta/fastq):
• Assembled contigs
• 454 reads
• Illumina reads

Steps:
• File QC: check character set and contig names; remove trailing Ns.
• Trimming: Q=20
• Trimming: Q=13
• Low complexity filter: size of 80 bp
• Dereplication: prefix = 5, identity 95%
• Clustering: 100% identity

Output: fasta file for gene calling
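
The File QC step (check character set, remove trailing Ns) can be illustrated with a minimal sketch. The function names and the accepted alphabet (A, C, G, T, N) are assumptions made for illustration; the slide does not state which characters the pipeline accepts.

```python
# Hypothetical helpers illustrating the File QC step; not the pipeline's code.

# Assumption: only unambiguous bases plus N are accepted.
ALLOWED_BASES = frozenset("ACGTN")


def has_valid_charset(seq, alphabet=ALLOWED_BASES):
    """Return True if every character of seq is in the allowed alphabet."""
    return set(seq.upper()) <= alphabet


def strip_trailing_ns(seq):
    """Remove any run of N/n characters from the end of the sequence."""
    return seq.rstrip("Nn")


if __name__ == "__main__":
    contig = "ACGTACGTNNNN"
    if has_valid_charset(contig):
        print(strip_trailing_ns(contig))
```

Only Ns at the end are removed; internal Ns are left in place, as the slide mentions trailing Ns only.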

Dataset processing – Feature prediction pipeline (v 3.0a)

Input: fasta file for gene calling

• CRISPR detection: crt / pilercr
• RNA detection: tRNAscan / hmmer / Blast / (isolates: Rfam)
• CDS detection: isolates: prodigal; metagenomes: varies
  Unassembled reads + assembled contigs
• Conflict resolution
• Concatenation of all results; creation of final output file

Output: file for IMG

Dataset processing – Quality trimming

Remove low-quality sequence from the ends of the reads.
• 454 datasets: lucy
• Illumina: longest high-quality string

Figure courtesy of Alex Copeland. http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
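
The "longest high quality string" rule for Illumina reads can be sketched as below. This is a minimal illustration, not lucy and not the pipeline's own code: the function names and the Phred+33 quality encoding are assumptions, and the default threshold of 20 is simply one of the two Q values (20 and 13) listed for the pipeline.

```python
# Hypothetical sketch: keep the longest contiguous run of bases whose
# quality is at or above a threshold.

def longest_high_quality_span(quals, min_q=20):
    """Return (start, end) of the longest run with quality >= min_q.

    end is exclusive; (0, 0) is returned when no base passes.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None
    return best_start, best_start + best_len


def trim_read(seq, qual_string, min_q=20, offset=33):
    """Trim a read to its longest high-quality stretch.

    Assumption: qualities are ASCII-encoded with the given offset (Phred+33).
    """
    quals = [ord(c) - offset for c in qual_string]
    start, end = longest_high_quality_span(quals, min_q)
    return seq[start:end], qual_string[start:end]


if __name__ == "__main__":
    print(trim_read("ACGTACGT", "+??&???#"))
```

When two runs have the same length, this sketch keeps the first one; a real trimmer may break ties differently.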

Dataset processing – Low complexity filter

Examples of low-complexity sequence:
tatatatatatatatatat
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

• Using dust (NCBI)
• Remove sequences with fewer than 80 informative bases
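
The 80-base cutoff can be sketched as a check applied after masking. The sketch assumes the low-complexity regions have already been masked (for example by dust) as lowercase letters or Ns, so that informative bases are the remaining uppercase A, C, G, T; the function names and that masking convention are assumptions, not something the slide specifies.

```python
# Hypothetical sketch of the post-masking length filter.

def informative_bases(masked_seq):
    """Count unmasked bases, assuming masked positions are lowercase or N."""
    return sum(1 for base in masked_seq if base in "ACGT")


def passes_low_complexity_filter(masked_seq, min_informative=80):
    """Keep a sequence only if it retains enough informative bases."""
    return informative_bases(masked_seq) >= min_informative


if __name__ == "__main__":
    read = "ACGT" * 19 + "tatatatatatatatatat"
    print(informative_bases(read), passes_low_complexity_filter(read))
```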

Dataset processing – Dereplication

Dataset processing – Sequence dereplication

• Using uclust
• 95% identity (global alignment)
• Identical prefix (5 nt)

Figure: example reads (atcccat, atc-cat, atcccat, atcccat, atcccat; gctacat, gctncat, gctacat, gctacat), with the label "Not dereplicated".
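
The two criteria (identical 5 nt prefix, 95% global identity) can be illustrated with a small greedy sketch. This is not uclust: identity is approximated here as 1 minus the edit distance divided by the longer sequence length, which is an assumption standing in for a real global alignment, and all function names are hypothetical.

```python
# Hypothetical greedy dereplication sketch; a stand-in for uclust, not its algorithm.

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def identity(a, b):
    """Approximate global identity as 1 - edit distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def dereplicate(seqs, prefix_len=5, min_identity=0.95):
    """Keep a sequence unless an earlier representative shares its prefix
    and is at least min_identity similar."""
    representatives = []
    for seq in seqs:
        for rep in representatives:
            if (seq[:prefix_len] == rep[:prefix_len]
                    and identity(seq, rep) >= min_identity):
                break  # seq is a replicate of rep
        else:
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    base = "ATCCCATGCT" * 4
    near = base[:20] + "C" + base[21:]   # one substitution, same prefix
    print(len(dereplicate([base, near])))
```

The sketch compares every read against every representative, so it is quadratic in the worst case and is meant only to make the criteria concrete.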

Dataset processing – Evaluation of processing tools

• Unassembled sequences, because of their small size, quality problems, and large number, need to be processed with efficient pipelines.
• Simulated datasets:
  • Using sequences extracted from finished genomes (perfect sequences)
  • Using reads that have been used to assemble finished genomes (real errors)
• Evaluation and development of new tools/wrappers.

Dataset processing – Feature prediction

Available methods:
• Ab initio: Metagene, MetaGeneMark, FragGeneScan, Prodigal
• Similarity based: Blastx, USEARCH

Figure: comparison of isolate and metagenome gene predictions, with categories MISSED, CORRECT, WRONG, NEW.
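
One way to tally the four categories is sketched below. The slide does not define them, so the definitions here are assumptions: CORRECT means a prediction identical to a reference gene, WRONG means a prediction that overlaps a reference gene without matching it, NEW means a prediction overlapping no reference gene, and MISSED means a reference gene with no overlapping prediction. Genes are hypothetical (start, end, strand) tuples in the coordinates of the finished genome.

```python
# Hypothetical classification sketch; category definitions are assumptions.
from collections import Counter


def overlaps(a, b):
    """True if two (start, end, strand) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_predictions(reference, predicted):
    """Count CORRECT, WRONG, NEW and MISSED genes."""
    counts = Counter()
    reference_set = set(reference)
    for gene in predicted:
        if gene in reference_set:
            counts["CORRECT"] += 1
        elif any(overlaps(gene, ref) for ref in reference):
            counts["WRONG"] += 1
        else:
            counts["NEW"] += 1
    for ref in reference:
        if not any(overlaps(ref, gene) for gene in predicted):
            counts["MISSED"] += 1
    return counts


if __name__ == "__main__":
    reference = [(1, 300, "+"), (400, 900, "-"), (1000, 1500, "+")]
    predicted = [(1, 300, "+"), (450, 900, "+"), (2000, 2300, "+")]
    print(dict(classify_predictions(reference, predicted)))
```

For predictions made on simulated reads, the coordinates would first have to be mapped back to the source genome; that step is outside this sketch.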

Trimming

454 Ti (no errors)

454 Ti (with errors)

Illumina 115 bp

Illumina 74 bp

Contigs frameshift Wrong prediction

Why annotate unassembled reads?

• Additional information about functions and phylogeny
• More accurate statistics based on unassembled + assembled

Figure labels: Assembled only; Unassembled + assembled + real metagenome.

Processing time (metagenomes)

Processing time (isolates)

Thank you for your attention
