
CompostBin : A DNA composition based metagenomic binning algorithm

CompostBin : A DNA composition based metagenomic binning algorithm. Sourav Chatterji * , Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis schatterji@ucdavis.edu. Overview of Talk. Metagenomics and the binning problem. CompostBin. The Microbial World.



Presentation Transcript


  1. CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis schatterji@ucdavis.edu

  2. Overview of Talk • Metagenomics and the binning problem. • CompostBin

  3. The Microbial World

  4. Exploring the Microbial World • Culturing • Majority of microbes currently unculturable. • No ecological context. • Molecular Surveys (e.g. 16S rRNA) • “who is out there?” • “what are they doing?”

  5. Metagenomics

  6. Interpreting Metagenomic Data • Nature of Metagenomic Data • Mosaic • Intraspecies polymorphism • Fragmentary • New Sequencing Technologies • Enormous amount of data • Short Reads

  7. Metagenomic Binning Classification of sequences by taxa

  8. Binning in Action • Glassy Winged Sharpshooter (Homalodisca coagulata). • Feeds on plant xylem (poor in organic nutrients). • Microbial Endosymbionts

  9. Current Binning Methods • Assembly • Align with Reference Genome • Database Search [MEGAN, BLAST] • Phylogenetic Analysis • DNA Composition [TETRA, PhyloPythia]

  10. Current Binning Methods • Need closely related reference genomes. • Poor performance on short fragments. • Sanger sequence reads 500-1000 bp long. • Current assembly methods unreliable • Complex Communities Hard to Bin.

  11. Overview of Talk • Metagenomics and the binning problem. • CompostBin

  12. Genome Signatures • Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? • Yes [Karlin et al. 1990s] • What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

  13. Imperfect World • Horizontal Gene Transfer • Recent Estimates [Ge et al. 2005] • Varies between 0-6% of genes. • Typically ~2%. • But… • Amelioration

  14. DNA-composition metrics The K-mer Frequency Metric CompostBin uses hexamers
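
As an illustration of the K-mer frequency metric on this slide, here is a minimal sketch that turns a read into a hexamer frequency vector. The function name `kmer_frequencies` and its conventions (normalizing by the number of valid K-mers, skipping K-mers that contain ambiguous bases, not merging reverse complements) are assumptions for this example and are not taken from the talk.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of seq (length 4**k).

    Assumption: k-mers containing anything other than A, C, G, T are skipped.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]
```

With the default `k=6` each read becomes a vector with 4**6 = 4096 entries, which is the high-dimensional space the next slide refers to.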

  15. DNA-composition metrics • Working with K-mers for Binning. • Curse of Dimensionality : O(4^K) independent dimensions. • Statistical noise increases with decreasing fragment lengths. • Project data into a lower dimensional space to decrease noise. • Principal Component Analysis.

  16. PCA separates species Gluconobacter oxydans [65% GC] and Rhodospirillum rubrum [61% GC]

  17. Effect of Skewed Relative Abundance [Figure panels: Abundance 20:1; Abundance 1:1] B. anthracis and L. monocytogenes

  18. A Weighting Scheme For each read, find overlap with other sequences

  19. A Weighting Scheme • Calculate the redundancy of each position (figure example: 4, 5, 5, 3). • Weight is inverse of average redundancy.
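
The weighting rule on these two slides can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: overlaps are assumed to be already known and given as half-open `(start, end)` intervals in the read's own coordinates, and the read is assumed to count once toward its own redundancy so that a read with no overlaps gets weight 1. The name `read_weight` is hypothetical.

```python
def read_weight(read_len, overlaps):
    """Weight of a read = 1 / (average per-position redundancy).

    overlaps: list of (start, end) half-open intervals, in this read's
    coordinates, that are covered by other reads (assumed precomputed).
    """
    # Assumption: the read itself contributes 1 to every position.
    redundancy = [1] * read_len
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_len, end)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_len
    return 1.0 / average


# Four positions with redundancies 4, 5, 5, 3 (the numbers shown on the slide):
w = read_weight(4, [(0, 3), (0, 3), (0, 4), (1, 4)])  # 1 / 4.25
```

Reads from an abundant species overlap many other reads, so they receive small weights; this is what counteracts the skew in relative abundance.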

  20. Weighted PCA • Calculate the weighted mean $\mu_w = \frac{1}{N}\sum_{i=1}^{N} w_i X_i$. • Calculate the weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. • PCs are eigenvectors of $M_w$. • Use first three PCs for further analysis.
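
A minimal NumPy sketch of the weighted PCA steps on this slide: weighted mean, weighted covariance matrix, eigenvectors, first three PCs. The function name `weighted_pca` is hypothetical, and the rescaling of the weights so that they sum to N is an assumption made here so that dividing by N gives a proper weighted mean; the slide does not state how the weights are normalized.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project the rows of X onto the top principal components of the
    weighted covariance matrix. Returns (projections, components, mean)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Assumption: rescale weights to sum to N.
    w = w * n / w.sum()
    # Weighted mean: (1/N) * sum_i w_i X_i
    mu = (w[:, None] * X).sum(axis=0) / n
    Xc = X - mu
    # Weighted covariance: sum_i w_i (X_i - mu)(X_i - mu)^T
    M = (w[:, None] * Xc).T @ Xc
    # M is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1][:n_components]
    pcs = eigvecs[:, order]
    return Xc @ pcs, pcs, mu
```

With all weights equal this reduces to ordinary PCA, which makes it easy to compare the two projections on the same data.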

  21. Weighted PCA separates species [Figure panels: PCA; Weighted PCA] B. anthracis and L. monocytogenes, abundance 20:1

  22. Un-supervised Classification ?

  23. Semi-Supervised Classification • 31 Marker Genes [courtesy Martin Wu] • Omni-present • Relatively Immune to Lateral Gene Transfer • Reads containing these marker genes can be classified with high reliability.

  24. Semi-supervised Classification Use a semi-supervised version of the normalized cut algorithm

  25. The Semi-supervised Normalized Cut Algorithm • Calculate the K-nearest neighbor graph from the point set. • Update graph with marker information. • If two nodes are from the same species, add an edge between them. • If two nodes are from different species, remove any edge between them. • Bisect the graph using the normalized-cut algorithm.

  26. Generalization to multiple bins • Apply algorithm recursively • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

  27. Generalization to multiple bins Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

  28. Testing • Simulate Metagenomic Sequencing • Sanger Reads • Variables • Number of species • Relative abundance • GC content • Phylogenetic Diversity • Test on a “real” dataset where answer is well-established.
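
A simulated test set of the kind described on this slide could be generated along these lines. This is only a sketch: the function `simulate_reads`, its parameters, and the sampling model (a genome is chosen with probability proportional to abundance times genome length, read lengths are uniform in the 500-1000 bp Sanger range quoted earlier, and no sequencing errors are introduced) are assumptions for illustration, not the simulation procedure used by the authors.

```python
import random


def simulate_reads(genomes, abundances, n_reads,
                   read_len_range=(500, 1000), seed=0):
    """Sample labeled reads from a dict of genome sequences.

    genomes: {name: sequence}; abundances: {name: relative abundance}.
    Returns a list of (source_name, read_sequence) pairs.
    """
    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: sampling probability ~ abundance * genome length.
    weights = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]
        genome = genomes[name]
        length = min(rng.randint(*read_len_range), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads
```

Because every read keeps the name of its source genome, binning accuracy can be scored directly against the known answer while varying the number of species and their relative abundance.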

  29. Results

  30. Conclusions/Future Directions • Satisfactory performance • No Training on Existing Genomes ☺ • Sanger Reads ☹ • Low number of Species ☹ • Future Work • Holy Grail : Complex Communities • Semi-supervised projection? • Hybrid Assembly/Binning

  31. Acknowledgements • UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann • UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan • Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff
