1 / 26

Canadian Bioinformatics Workshops

Learn how to determine the taxonomic composition of metagenomic samples and statistically analyze differences in taxonomic abundance. Understand the pros and cons of 16S and metagenomic sequencing methods.

keener
Download Presentation

Canadian Bioinformatics Workshops

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 3Metagenomic Taxonomic Composition Morgan Langille Analysis of Metagenomic Data June 24-26, 2015

  4. Learning Objectives of Module • Understand the pros and cons between 16S and metagenomic sequencing • Understand different approaches for determining the taxonomic composition of a metagenomics sample • Be able to run Metaphlan2 on one or more samples • Be able to determine statistically significant differences in taxonomic abundance across sample groups using STAMP

  5. 16S vs Metagenomics • 16S is targeted sequencing of a single gene which acts as a marker for identification • Pros • Well established • Sequencing costs are relatively cheap (~10,000 reads/sample) • Only amplifies what you want (no host contamination) • Cons • Primer choice can bias results towards certain organisms • Usually not enough resolution to identify to the strain level • Need different primers usually for archaea& eukaryotes (18S) • Doesn’t identify viruses

  6. 16S vs Metagenomics • Metagenomics: sequencing ALL the DNA in a sample • Pros • Less bias from sequencing • Can identify all microbes (euks, viruses, etc.) • Provides functional information (“What are they doing?”) • Cons • Host/site contamination can be signficant • Expensive (more sequencing depth is required) • May not be able to sequence “rare” microbes • Complex bioinformatics

  7. Metagenomics: Who is there? • Goal: Identify the relative abundance of different microbes in a sample given using metagenomics • Problems: • Reads are all mixed together • Reads can be short (~100bp) • Lateral gene transfer • Two broad approaches • Binning Based • Marker Based

  8. Binning Based • Attempts to “bin” reads into the genome from which they originated • Composition-based • Uses GC composition or k-mers (e.g. Naïve Bayes Classifier) • Generally not very precise and not recommended • Sequence-based • Compare reads to large reference database using BLAST (or some other similarity search method) • Reads are assigned based on “Best-hit” or “Lowest Common Ancestor” approach

  9. LCA: Lowest Common Ancestor • Use all BLAST hits above a threshold and assign taxonomy at the lowest level in the tree which covers these taxa. • Notable Examples: • MEGAN: http://ab.inf.uni-tuebingen.de/software/megan5/ • One of the first metagenomic tools • Does functional profiling too! • MG-RAST: https://metagenomics.anl.gov/ • Web-based pipeline (might need to wait awhile for results) • Kraken: https://ccb.jhu.edu/software/kraken/ • Fastest binning approach to date and very accurate. • Large computing requirements (e.g. >128GB RAM)

  10. Marker Based • Single Gene • Identify and extract reads hitting a single marker gene (e.g. 16S, cpn60, or other “universal” genes) • Use existing bioinformatics pipeline (e.g. QIIME, etc.) • Multiple Gene • Several universal genes • PhyloSift(Darling et al, 2014) • Uses 37 universal single-copy genes • Clade specific markers • MetaPhlAn (Segataet al, 2012)

  11. Marker or Binning? • Binning approaches • May be too computationally intensive • May not adequately reflect organism abundances due to genome size • Marker approaches • Doesn’t allow functions to be linked directly to organisms • Genome reconstruction is not possible • Very sensitive to choice of markers

  12. Why MetaPhlAn? • Fast (marker database is considerably smaller) • Markers for bacteria, archaea, eukaryotes, and viruses (since MetaPhlAn2 was released) • Being continuously updated and supported • Used by the Human Microbiome Project • Generally accepted as a robust method for taxonomy assignment • Main Disadvantage: not all reads are assigned a taxonomic label

  13. MetaPhlAn • Uses “clade-specific” gene markers • A clade represents a set of genomes that can be as broad as a phylum or as specific as a species • Uses ~1 million markers derived from 17,000 genomes • ~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic • Can identify down to the species level (and possibly even strain level) • Can handle millions of reads on a standard computer within a few minutes

  14. MetaPhlAn • Open-source: • https://bitbucket.org/biobakery/metaphlan2

  15. MetaPhlAn Marker Selection

  16. MetaPhlAn Marker Selection

  17. Using MetaPhlan • MetaPhlan uses Bowtie2 for sequence similarity searching (nucleotide sequences vs. nucleotide database) • Paired-end data can be used directly • Each sample is processed individually and then multiple sample can be combined together at the last step • Output is relative abundances at different taxonomic levels

  18. Absolute vs. Relative Abundance • Absolute abundance: Numbers represent real abundance of thing being measured (e.g. the actual quantity of a particular gene or organism) • Relative abundance: Numbers represent proportion of thing being measured within sample • In almost all cases microbiome studies are measuring relative abundance • This is due to DNA amplification during sequencing library preparation not being quantitative

  19. Relative Abundance Use Case • Sample A: • Has 108 bacterial cells (but we don’t know this from sequencing) • 25% of the microbiome from this sample is classified as Shigella • Sample B: • Has 106 bacterial cells (but we don’t know this from sequencing) • 50% of the microbiome from this sample is classified as Shigella • “Sample B contains twice as much Shigella as Sample A” • WRONG! (If quantified it we would find Sample A has more Shigella) • “Sample B contains a greater proportion of Shigella compared to Sample A” • Correct!

  20. Visualization and Statistics • Various tools are available to determine statistically significant taxonomic differences across groups of samples • Excel • SigmaPlot • R • MeV (MultiExperiment Viewer) • Python (matplotlib) • LefSe & Graphlan (Huttenhower Group) • STAMP

  21. STAMP

  22. STAMP Plots

  23. STAMP • Input • “Profile file”: Table of features (samples by OTUs, samples by functions, etc.) • Features can form a heirarchy (e.g. Phylum, Order, Class, etc) to allow data to be collapsed within the program • “Group file”: Contains different metadata for grouping samples • Can be two groups: (e.g. Healthy vs Sick) or multiple groups (e.g. Water depth at 2M, 4M, and 6M) • Output • PCA, heatmap, box, and bar plots • Tables of significantly different features

  24. Questions?

  25. We are on a Coffee Break & Networking Session

More Related