CompostBin : A DNA composition based metagenomic binning algorithm

CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis schatterji@ucdavis.edu

Overview of Talk • Metagenomics and the binning problem. • CompostBin

The Microbial World

Exploring the Microbial World • Culturing • Majority of microbes currently unculturable. • No ecological context. • Molecular Surveys (e.g. 16S rRNA) • “who is out there?” • “what are they doing?”

Metagenomics

Interpreting Metagenomic Data • Nature of Metagenomic Data • Mosaic • Intraspecies polymorphism • Fragmentary • New Sequencing Technologies • Enormous amount of data • Short Reads

Metagenomic Binning • Classification of sequences by taxa

Binning in Action • Glassy Winged Sharpshooter (Homalodisca coagulata). • Feeds on plant xylem (poor in organic nutrients). • Microbial Endosymbionts

Current Binning Methods • Assembly • Align with Reference Genome • Database Search [MEGAN, BLAST] • Phylogenetic Analysis • DNA Composition [TETRA,Phylopythia]

Current Binning Methods • Need closely related reference genomes. • Poor performance on short fragments. • Sanger sequence reads 500-1000 bp long. • Current assembly methods unreliable • Complex Communities Hard to Bin.

Overview of Talk • Metagenomics and the binning problem. • CompostBin

Genome Signatures • Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? • Yes [Karlin et al. 1990s] • What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?

Imperfect World • Horizontal Gene Transfer • Recent Estimates [Ge et al. 2005] • Varies between 0-6% of genes. • Typically ~2%. • But… • Amelioration

DNA-composition metrics • The K-mer Frequency Metric • CompostBin uses hexamers
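The k-mer frequency metric can be illustrated with a minimal sketch. The function name `kmer_frequencies` and the choice to skip k-mers containing ambiguous bases are assumptions made for illustration, and the sketch does not merge reverse complements; it is not taken from the CompostBin implementation.

```python
from itertools import product

def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    alphabet = "ACGT"
    # Fixed ordering of all 4**k k-mers so vectors are comparable across reads.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # k-mers with N or other symbols are skipped (assumption)
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

freqs = kmer_frequencies("ACGTACGTACGTAC")
print(len(freqs))  # 4096, one dimension per hexamer
```

Each read becomes one point in a 4096-dimensional space, which is why the following slides turn to dimensionality reduction.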

DNA-composition metrics • Working with K-mers for Binning. • Curse of Dimensionality: O(4^K) independent dimensions. • Statistical noise increases with decreasing fragment lengths. • Project data into a lower dimensional space to decrease noise. • Principal Component Analysis.

PCA separates species • Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]

Effect of Skewed Relative Abundance • Panels: Abundance 20:1, Abundance 1:1 • B. anthracis and L. monocytogenes

A Weighting Scheme • For each read, find overlap with other sequences.

A Weighting Scheme • Calculate the redundancy of each position [figure labels: 4, 5, 5, 3]. • Weight is the inverse of the average redundancy.
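A minimal sketch of the weighting idea follows. The input format (overlap intervals given in the read's own coordinates) and the convention that a read counts once toward its own redundancy are assumptions made for illustration; `read_weight` is a hypothetical name.

```python
def read_weight(read_length, overlaps):
    """Weight of one read = 1 / (average per-position redundancy).

    overlaps: list of (start, end) half-open intervals, in this read's own
    coordinates, each contributed by one other overlapping read
    (hypothetical input format).
    """
    # Assumption: every position is covered at least once, by the read itself.
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(0, start), min(read_length, end)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average

# A 10 bp toy read: one read overlaps positions 0-4, another overlaps 5-9.
print(read_weight(10, [(0, 5), (5, 10)]))  # 0.5
```

Reads from an abundant species overlap many other reads and so receive small weights, which counteracts the skew shown on the relative-abundance slide.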

Weighted PCA • Calculate the weighted mean μ_w: $\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$ • Calculate the weighted covariance matrix M_w: $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$ • PCs are eigenvectors of M_w. • Use first three PCs for further analysis.
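The weighted PCA steps can be sketched with NumPy. This is an illustrative sketch, not the authors' code: the function name `weighted_pca` is hypothetical, feature vectors are stored as rows of `X`, and the weights are rescaled to sum to N (an assumption) so that dividing by N yields a proper weighted mean.

```python
import numpy as np

def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    w = w * n / w.sum()                        # assumption: weights sum to N
    mu_w = (w[:, None] * X).sum(axis=0) / n    # weighted mean
    centered = X - mu_w
    # M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T, with X_i stored as rows
    M_w = (w[:, None] * centered).T @ centered
    eigvals, eigvecs = np.linalg.eigh(M_w)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]             # columns are the leading PCs
    return centered @ components, components

rng = np.random.default_rng(0)
X = rng.random((50, 8))        # toy data: 50 reads, 8 features
w = rng.random(50) + 0.1       # toy positive weights
Y, pcs = weighted_pca(X, w)
print(Y.shape)  # (50, 3)
```

With uniform weights this reduces to ordinary PCA up to a constant scaling of the covariance matrix.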

Weighted PCA separates species • Panels: PCA, Weighted PCA • B. anthracis and L. monocytogenes, abundance 20:1

Un-supervised Classification ?

Semi-Supervised Classification • 31 Marker Genes [courtesy Martin Wu] • Omni-present • Relatively Immune to Lateral Gene Transfer • Reads containing these marker genes can be classified with high reliability.

Semi-supervised Classification • Use a semi-supervised version of the normalized cut algorithm.

The Semi-supervised Normalized Cut Algorithm • Calculate the K-nearest neighbor graph from the point set. • Update graph with marker information. • If two nodes are from the same species, add an edge between them. • If two nodes are from different species, remove any edge between them. • Bisect the graph using the normalized-cut algorithm.
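The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CompostBin implementation: edges carry unit weights, the normalized cut is approximated by the standard spectral relaxation (second-smallest eigenvector of the normalized Laplacian, thresholded at zero), and the name `semi_supervised_ncut` and the label format are hypothetical.

```python
import numpy as np

def semi_supervised_ncut(points, labels, k=5):
    """Bisect points; labels[i] is a species id for marker reads, else None."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # 1. Symmetric k-nearest-neighbor graph with unit edge weights (assumption).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]   # index 0 is the point itself
        W[i, neighbors] = 1.0
        W[neighbors, i] = 1.0
    # 2. Update the graph with marker information.
    marked = [i for i in range(n) if labels[i] is not None]
    for a in marked:
        for b in marked:
            if a != b:
                # same species: add an edge; different species: remove it
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    # 3. Spectral relaxation of the normalized cut.
    degree = W.sum(axis=1)
    degree[degree == 0] = 1e-12                    # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                # ascending eigenvalues
    fiedler = d_inv_sqrt * vecs[:, 1]
    return fiedler > 0                             # boolean side of the cut

# Toy 1-D example: two groups of 10 points separated by a gap.
pts = [[float(i)] for i in range(10)] + [[11.5 + i] for i in range(10)]
labels = [None] * 20
labels[0] = labels[5] = "A"
labels[14] = labels[19] = "B"
side = semi_supervised_ncut(pts, labels, k=4)
print(side)
```

Which group is reported as `True` depends on the arbitrary sign of the eigenvector. If removing edges disconnects the graph, the connected components already define a partition and the eigenvector step is not meaningful; the sketch does not handle that case.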

Generalization to multiple bins • Apply algorithm recursively. • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Generalization to multiple bins • Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]

Testing • Simulate Metagenomic Sequencing • Sanger Reads • Variables • Number of species • Relative abundance • GC content • Phylogenetic Diversity • Test on a “real” dataset where answer is well-established.
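A toy version of the simulation step might look like the sketch below. It is an assumption-laden illustration: species are drawn in proportion to a relative-abundance weight (ignoring genome size), read lengths are uniform in the Sanger-like 500-1000 bp range, no sequencing errors are modeled, and the names `simulate_reads`, `sp_A` and `sp_B` are hypothetical.

```python
import random

def simulate_reads(genomes, abundance, n_reads, min_len=500, max_len=1000, seed=0):
    """Sample (species, read) pairs; species drawn in proportion to abundance."""
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundance[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)   # uniform start position
        reads.append((name, genome[start:start + length]))
    return reads

# Toy genomes of random bases stand in for real reference sequences.
base_rng = random.Random(1)
toy = {
    "sp_A": "".join(base_rng.choice("ACGT") for _ in range(5000)),
    "sp_B": "".join(base_rng.choice("ACGT") for _ in range(5000)),
}
reads = simulate_reads(toy, {"sp_A": 20, "sp_B": 1}, n_reads=100)
print(len(reads))  # 100
```

Because the true species of every simulated read is known, binning accuracy can be measured directly while varying the number of species and their relative abundance.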

Results

Conclusions/Future Directions • Satisfactory performance • No Training on Existing Genomes • Sanger Reads • Low number of Species • Future Work • Holy Grail: Complex Communities • Semi-supervised projection? • Hybrid Assembly/Binning

Acknowledgements • UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann • UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan • Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff
