CompostBin: A DNA composition based metagenomic binning algorithm • Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen • UC Davis • schatterji@ucdavis.edu
Overview of Talk • Metagenomics and the binning problem. • CompostBin
Exploring the Microbial World • Culturing • Majority of microbes currently unculturable. • No ecological context. • Molecular Surveys (e.g. 16S rRNA) • “who is out there?” • “what are they doing?”
Interpreting Metagenomic Data • Nature of Metagenomic Data • Mosaic • Intraspecies polymorphism • Fragmentary • New Sequencing Technologies • Enormous amount of data • Short Reads
Metagenomic Binning • Classification of sequences by taxa
Binning in Action • Glassy Winged Sharpshooter (Homalodisca coagulata). • Feeds on plant xylem (poor in organic nutrients). • Microbial Endosymbionts
Current Binning Methods • Assembly • Align with Reference Genome • Database Search [MEGAN, BLAST] • Phylogenetic Analysis • DNA Composition [TETRA, PhyloPythia]
Current Binning Methods • Need closely related reference genomes. • Poor performance on short fragments. • Sanger sequence reads 500-1000 bp long. • Current assembly methods unreliable • Complex Communities Hard to Bin.
Overview of Talk • Metagenomics and the binning problem. • CompostBin
Genome Signatures • Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? • Yes [Karlin et al. 1990s] • What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?
Imperfect World • Horizontal Gene Transfer • Recent Estimates [Ge et al. 2005] • Varies between 0-6% of genes. • Typically ~2%. • But… • Amelioration
DNA-composition metrics • The K-mer Frequency Metric • CompostBin uses hexamers
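A minimal Python sketch of the K-mer frequency metric named on this slide, assuming a plain sliding-window count over the four-letter alphabet with K = 6 (hexamers). The function name `kmer_frequencies` and the choice to skip windows containing ambiguous bases are illustrative assumptions, not details taken from the talk.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    """
    # Map every possible k-mer to a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # assumption: skip windows with N or other codes
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

With `k=6` each read maps to a vector of 4096 frequencies, which is the high-dimensional representation that the following slides reduce with PCA.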
DNA-composition metrics • Working with K-mers for Binning. • Curse of Dimensionality: O(4^K) independent dimensions. • Statistical noise increases with decreasing fragment lengths. • Project data into a lower dimensional space to decrease noise. • Principal Component Analysis.
PCA separates species • Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]
Effect of Skewed Relative Abundance • [Figure panels: Abundance 20:1; Abundance 1:1] • B. anthracis and L. monocytogenes
A Weighting Scheme • For each read, find overlap with other sequences.
A Weighting Scheme • [Figure: example redundancy values 4, 5, 5, 3] • Calculate the redundancy of each position. • Weight is inverse of average redundancy.
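The weighting rule on this slide can be sketched in a few lines of Python. The helper name `read_weight` is hypothetical, and the sketch assumes the per-position redundancy of a read has already been computed from its overlaps with other reads; the values 4, 5, 5, 3 from the slide's figure are used only as illustrative input.

```python
def read_weight(redundancy):
    """Weight of a read: the inverse of its average per-position redundancy.

    `redundancy` holds one value per position of the read, i.e. how many
    reads cover that position (assumed to be computed elsewhere).
    """
    if not redundancy:
        raise ValueError("redundancy must contain at least one position")
    average = sum(redundancy) / len(redundancy)
    return 1.0 / average


# Illustrative input taken from the slide's figure.
example_weight = read_weight([4, 5, 5, 3])  # average 4.25, weight 4/17
```

Reads from an abundant species overlap many other reads, so they receive small weights and no longer dominate the subsequent analysis.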
Weighted PCA • Calculate the weighted mean $\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$. • Calculate the weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$. • PCs are eigenvectors of $M_w$. • Use first three PCs for further analysis.
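A hedged NumPy sketch of the weighted PCA steps on this slide. It assumes each row of `X` is the K-mer frequency vector of one read and `w` holds the read weights; the function name `weighted_pca` is illustrative, and the mean is divided by N as the slide's formula reads.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Weighted mean: (1/N) * sum_i w_i X_i
    mu_w = (w[:, None] * X).sum(axis=0) / n
    centered = X - mu_w
    # Weighted covariance: sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components
```

With all weights equal to 1 this reduces to ordinary PCA up to a scaling of the covariance matrix, which makes it easy to compare the two projections on the same data.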
Weighted PCA separates species • [Figure panels: PCA; Weighted PCA] • B. anthracis and L. monocytogenes, abundance 20:1
Semi-Supervised Classification • 31 Marker Genes [courtesy Martin Wu] • Omnipresent • Relatively Immune to Lateral Gene Transfer • Reads containing these marker genes can be classified with high reliability.
Semi-supervised Classification • Use a semi-supervised version of the normalized cut algorithm.
The Semi-supervised Normalized Cut Algorithm • Calculate the K-nearest neighbor graph from the point set. • Update graph with marker information. • If two nodes are from the same species, add an edge between them. • If two nodes are from different species, remove any edge between them. • Bisect the graph using the normalized-cut algorithm.
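The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: edges get unit weight, the bisection uses the standard spectral relaxation of the normalized cut (sign of the second eigenvector of the normalized Laplacian), and the name `semi_supervised_ncut` and the `labels` convention (`None` for reads without a marker gene) are hypothetical.

```python
import numpy as np


def semi_supervised_ncut(points, labels, k=5):
    """Bisect points; labels[i] is a species id for marker reads, else None."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    # Step 1: symmetric K-nearest-neighbor graph with unit edge weights.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros((n, n))
    for i in range(n):
        A[i, nearest[i]] = 1.0
    A = np.maximum(A, A.T)
    # Step 2: update the graph with marker information.
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] is None or labels[j] is None:
                continue
            # Same species: add an edge. Different species: remove any edge.
            A[i, j] = A[j, i] = 1.0 if labels[i] == labels[j] else 0.0
    # Step 3: spectral relaxation of the normalized cut (an assumption here).
    deg = A.sum(axis=1)
    deg[deg == 0] = 1e-12  # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler > 0  # boolean side assignment for each point
```

Thresholding at zero is the simplest splitting rule; searching along the eigenvector for the threshold with the smallest normalized cut value is a common alternative. Which side is labelled `True` is arbitrary.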
Generalization to multiple bins • Apply algorithm recursively. • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Testing • Simulate Metagenomic Sequencing • Sanger Reads • Variables • Number of species • Relative abundance • GC content • Phylogenetic Diversity • Test on a “real” dataset where answer is well-established.
Conclusions/Future Directions • Satisfactory performance • No Training on Existing Genomes • Sanger Reads • Low number of Species • Future Work • Holy Grail : Complex Communities • Semi-supervised projection? • Hybrid Assembly/Binning
Acknowledgements • UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann • UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan • Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff