Most pipelines work the same way!

Presentation Transcript


  1. Most pipelines work the same way!

  2. Metagenomics Processing • Merge paired-end reads • Preprocessing • Functional assignments • Taxonomic assignments • Contamination removal • Gene prediction • Contig clustering • Binning reads

  3. Metagenomics • Quality control – Prinseq, Deconseq • Annotation – FOCUS, Real time metagenomics, mg-rast, Super FOCUS • Statistics – STAMP • Population genomes – crAss, metabat, ContigClustering

  4. Metagenomics Processing • Preprocessing: FASTQC, FastX Toolkit, fitGCP, NGS QC Toolkit, Non-pareil, Prinseq, QC-Chain, Streaming Trim • Gene prediction: FragGeneScan, GlimmerMG, MetaGeneAnnotator, MetaGeneMark, MetaGun, Orphelia, Prodigal • Contig clustering: AbundanceBin, CompostBin, concoct, crAss, tetra • Taxonomic assignment and functional assignment: CLAMS, Sequedex, DiScRIBinATE, SORT-ITEMS, genometa, SPANNER, GSMer, SPHINX, PPLACER, TaxSOM, RTMg, Treephyler, CARMA, myTaxa, FOCUS, PhylopythiaS, KRAKEN, phymmbl, LMAT, RAIphy, MEGAN, TACOA, Metaplan, Taxy

  5. Bad data analysis

  6. Preprocessing Data Rob Schmieder

  7. Good data analysis: New dataset → Quality control & Preprocessing → Similarity search, Assembly

  8. 3 Tools for metagenomic data http://prinseq.sourceforge.net http://tagcleaner.sourceforge.net http://deconseq.sourceforge.net

  9. Quality control and data preprocessing http://edwards.sdsu.edu/prinseq Rob Schmieder

  10. Number and length of sequences • Reads should be approximately the same length (same number of cycles) → short reads are likely lower quality [Figure panels: Bad, Good]
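
The length check on slide 10 can be sketched in a few lines of Python. This is a minimal illustration, not PRINSEQ's method: the function name `flag_short_reads` and the cutoff of half the median length are assumptions chosen for the example.

```python
from statistics import median


def flag_short_reads(lengths, min_fraction_of_median=0.5):
    """Return the read lengths that fall well below the typical length.

    The cutoff (a fraction of the median length) is an illustrative
    assumption, not a recommended default.
    """
    if not lengths:
        return []
    cutoff = median(lengths) * min_fraction_of_median
    return [n for n in lengths if n < cutoff]


# Four reads near 100 bp and one much shorter read.
print(flag_short_reads([100, 100, 100, 30, 98]))
```

A large share of flagged reads suggests the dataset needs a closer look before any downstream analysis.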

  11. Linearly degrading quality across the read Trim low quality ends
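
Trimming low-quality ends, as named on slide 11, can be illustrated with a simple 3'-end trimmer. This is a sketch under stated assumptions: quality scores are already decoded to integers, and the threshold of 20 is an arbitrary example value, not a setting taken from the slides.

```python
def trim_low_quality_3prime(seq, quals, min_qual=20):
    """Remove bases from the 3' end while their quality is below min_qual.

    seq   : read sequence as a string
    quals : list of integer quality scores, one per base
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq)
    # Walk inward from the 3' end until a base meets the threshold.
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


trimmed_seq, trimmed_quals = trim_low_quality_3prime(
    "ACGTACGT", [38, 37, 35, 30, 25, 18, 12, 5]
)
print(trimmed_seq, trimmed_quals)
```

Real trimmers often use a sliding window mean instead of a per-base cutoff, so that a single good base near the end does not stop the trimming early.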

  12. High quality throughout the sequence • Good quality through the length of the sequence • Sequence quality falls off quickly → bad sequence data

  13. Ion quality scores

  14. Low quality sequence issues • Most assemblers or aligners do not take into account quality scores • Errors in reads complicate assembly, might cause misassembly, or make assembly impossible

  15. What if quality scores are not available? Alternative: • Infer quality from the percentage of Ns found in the sequence • Remove regions with a high number of Ns • Huse et al. found that the presence of any ambiguous base call was a sign of overall poor sequence quality • Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
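
When no quality scores exist, the N-based proxy from slide 15 can be computed directly from the sequence. The sketch below is illustrative only: the function names and the 1% default threshold are assumptions, not values from the slides or from Huse et al.

```python
def n_fraction(seq):
    """Fraction of ambiguous bases (N) in a sequence."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def filter_by_n(reads, max_n_fraction=0.01):
    """Keep only reads whose fraction of Ns is at or below the threshold."""
    return [read for read in reads if n_fraction(read) <= max_n_fraction]


reads = ["ACGTACGT", "ACGNNNGT", "ACGTACGN"]
# A threshold of 0.0 drops every read that contains at least one N.
print(filter_by_n(reads, max_n_fraction=0.0))
```

Setting `max_n_fraction=0.0` corresponds to the strictest option, removing every read with any ambiguous base.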

  17. Ambiguous bases • If you can afford the loss, filter out all reads containing Ns • Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides • Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A) • 2-bit example: 00 – A, 01 – C, 10 – G, 11 – T
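
The 2-bit encoding on slide 17 leaves no room for a fifth symbol, which is why Ns must be replaced. The following sketch packs a sequence into an integer using the slide's mapping (00 – A, 01 – C, 10 – G, 11 – T). It is an illustration of the idea, not the internal format of Velvet, SSAHA2 or BWA; the function names are hypothetical, and treating every non-ACGT character like N is an assumption.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode_2bit(seq, n_replacement="A"):
    """Pack a nucleotide string into an integer, two bits per base.

    Any character outside A/C/G/T (for example N) is replaced by
    n_replacement first, because two bits cannot represent it.
    """
    value = 0
    for base in seq.upper():
        if base not in CODE:
            base = n_replacement
        value = (value << 2) | CODE[base]
    return value


def decode_2bit(value, length):
    """Unpack an integer produced by encode_2bit back into a string."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BASES[(value >> shift) & 0b11])
    return "".join(bases)


packed = encode_2bit("ANGT")
print(bin(packed), decode_2bit(packed, 4))  # the N comes back as A
```

After the round trip the N has silently become an A, which is how an ambiguous base can turn into an apparent real base in an assembly or alignment.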

  18. Quality filtering • Any region with a homopolymer will tend to have a lower quality score • Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages • Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
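
A mean-quality filter using the cutoff of 25 cited on slide 18 might look like the sketch below. It assumes Phred+33 encoded quality strings (the common FASTQ convention); other encodings need a different offset. The function names are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_quality(quals):
    """Arithmetic mean of the quality scores; 0.0 for an empty read."""
    return sum(quals) / len(quals) if quals else 0.0


def passes_mean_quality(quals, min_mean=25):
    """True if the read's mean quality is at least min_mean."""
    return mean_quality(quals) >= min_mean


high = decode_phred33("IIII")  # 'I' decodes to 40
low = decode_phred33("5555")   # '5' decodes to 20
print(passes_mean_quality(high), passes_mean_quality(low))
```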

  19. Sequence duplicates

  20. Real or artificial duplicate? • Metagenomics = random sampling of genomic material • Why do reads start at the same position? • Why do these reads have the same errors? • No specific pattern or location on sequencing plate • Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME J (2009)

  21. One micro-reactor – Many beads Martine Yerle (Laboratory of Cellular Genetics, INRA, France)

  22. Impacts of duplicates • False variant (SNP) calling • Require more computing resources • Find similar database sequences for same query sequence • Assembly process takes longer • Increase in memory requirements • Abundance or expression measures can be wrong

  23. Impacts of duplicates • False variant (SNP) calling • Require more computing resources • Find similar database sequences for same query sequence • Assembly process takes longer • Increase in memory requirements • Abundance or expression measures can be wrong
     Reference  ...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
     Reads                GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
                               GTGTACATGAACACAGTATATGAGCATACAGAT...
                                      TGAACACAGTCTATGAGCATACAGAT...
                                      TGAACACAGTCTATGAGCATACAGAT...
                                      TGAACACAGTCTATGAGCATACAGAT...
                                      TGAACACAGTCTATGAGCATACAGAT...
                                      TGAACACAGTCTATGAGCATACAGAT...
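
Duplicate reads that start at the same position and carry the same error can be collapsed before variant calling. The sketch below removes exact duplicates and reads that are a prefix of another read (same start, shorter length). It is a simplified illustration, not PRINSEQ's implementation, and the function name is hypothetical. Note the assumption built in: every such read is treated as artificial, which can also remove real duplicates from highly abundant sequences.

```python
def remove_exact_and_prefix_duplicates(reads):
    """Drop reads that equal, or are a prefix of, another read.

    After lexicographic sorting, every read that starts with a given
    read sits directly after it, so checking the next neighbour is
    enough. One copy of each duplicated read is kept, and the
    original input order of the kept reads is preserved.
    """
    order = sorted(range(len(reads)), key=lambda i: reads[i])
    drop = set()
    for current, following in zip(order, order[1:]):
        if reads[following].startswith(reads[current]):
            drop.add(current)
    return [read for i, read in enumerate(reads) if i not in drop]


reads = [
    "TGAACACAGTCTATG",
    "TGAACACAGTCTATG",
    "TGAACACAGTCTATG",
    "GTGTACATG",
    "GTGTA",
]
print(remove_exact_and_prefix_duplicates(reads))
```

With the duplicates collapsed to a single copy, one erroneous base no longer outvotes the reads that agree with the reference.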

  25. Detect and remove tag sequences http://edwards.sdsu.edu/tagcleaner

  26. No tag MID tag WTA tags

  27. Imperfect primer annealing

  28. Fragment-to-fragment concatenations

  29. Data upload Tag sequence definition

  30. Tag sequence prediction

  31. Parameter definition Download results

  32. Identification and removal of sequence contamination http://edwards.sdsu.edu/deconseq

  33. Contaminant identification • Previous methods had critical limitations • Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences • Sequence similarity seems to be the only reliable option to identify single contaminant sequences • BLAST against human reference genome is slow and lacks corresponding regions (gaps, variants, …) • Novel sequences in every new human genome sequenced* • * Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

  34. Faster algorithms for Next-gen data

  35. Principal component analysis (PCA) of dinucleotide relative abundance Microbial metagenomes Viral metagenomes
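
For readers unfamiliar with the measure, dinucleotide relative abundance is commonly defined as the odds ratio ρ_xy = f_xy / (f_x · f_y), the observed dinucleotide frequency divided by the frequency expected from the two bases alone. The sketch below computes a single-strand version for one sequence; this is a simplifying assumption, since published analyses usually pool each sequence with its reverse complement first. The function name is hypothetical and the PCA step is not shown.

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Return rho_xy = f_xy / (f_x * f_y) for all 16 dinucleotides.

    Single-strand version; positions containing characters other
    than A/C/G/T are skipped.
    """
    seq = seq.upper()
    mono = Counter(base for base in seq if base in "ACGT")
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di = Counter(p for p in pairs if set(p) <= set("ACGT"))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        rho[x + y] = observed / expected if expected > 0 else 0.0
    return rho


profile = dinucleotide_relative_abundance("ACACGTGTAC")
print(round(profile["AC"], 3), round(profile["GG"], 3))
```

The 16 values form a profile of the whole dataset, which is why the measure can separate metagenomes from each other but cannot classify a single read.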

  36. Contaminant identification • Current methods have critical limitations • Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences • Sequence similarity seems to be the only reliable option to identify single contaminant sequences • BLAST against human reference genome is slow and lacks corresponding regions (gaps, variants, …) • Novel sequences in every new human genome sequenced* • * Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

  37. DeconSeq web interface Two types of reference databases Remove Retain

  38. DeconSeq web interface (cont.)

  39. DeconSeq • Identity = how similar the query sequence is to the reference sequence • Coverage = how much of the query sequence is similar to the reference sequence
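
The two measures on slide 39 can be expressed as simple ratios over one alignment. The sketch below is an illustration, not DeconSeq's code: the function names are hypothetical, identity is taken as matches over aligned length, coverage as aligned length over query length, and the 90% thresholds are example values rather than the tool's defaults.

```python
def identity_and_coverage(query_length, aligned_length, matches):
    """Return (identity %, coverage %) for one query/reference alignment.

    query_length   : total length of the query read
    aligned_length : number of query bases inside the alignment
    matches        : number of aligned bases identical to the reference
    """
    if query_length <= 0 or aligned_length <= 0:
        return 0.0, 0.0
    identity = 100.0 * matches / aligned_length
    coverage = 100.0 * aligned_length / query_length
    return identity, coverage


def looks_like_contaminant(query_length, aligned_length, matches,
                           min_identity=90.0, min_coverage=90.0):
    """Flag a read when both measures reach the (illustrative) thresholds."""
    identity, coverage = identity_and_coverage(
        query_length, aligned_length, matches
    )
    return identity >= min_identity and coverage >= min_coverage


print(identity_and_coverage(200, 180, 171))
print(looks_like_contaminant(200, 180, 171))
```

Requiring both thresholds avoids flagging a read whose short, well-matching region is only a small part of the whole query.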

  40. DeconSeq Blue = More similar to “retain” Red = More similar to “remove”

  41. Human DNA contamination identified in 145 out of 202 metagenomes

  42. http://prinseq.sourceforge.net/manual.html

  43. Pairing Data

  44. Two types of paired ends Mate pairs Paired end reads

  45. Repeats Paired end reads or mate pairs A B C

  46. Mate pair sequencing

  47. Mate pair Sequencing Add linkers

  48. Mate pair sequencing Sequencing Nick migration

  49. Paired end sequencing
