Genovo: De Novo Assembly for Metagenomes

Genovo: De Novo Assembly for Metagenomes Gao Song 2010/07/14

Outline • Overview of Metagenomics • Current Assemblers • Genovo Assembly

Overview of Metagenomics

Motivation • Metagenomics is: • Why Do We Need Metagenomics? • Snapshot of a bacterial community • Fewer than 1% of bacteria can be cultivated

Applications • Monitoring the impact of pollutants on ecosystems • Discovery of new genes, enzymes… - Global Ocean Sampling Expedition • Human Microbiome Project • JGI sequenced Acid Mine Drainage sample

Two Paradigms • Marker Gene Sequencing • 16S rRNA: • Two ways • Other marker genes: RuBisCO, NifH • Reveals only composition • Whole Genome Sequencing (WGS) • Detailed picture of the community

Complex Communities • [Figure: ×5000, >1000, 200 L, 1 million]

Current Assemblers

Current Status • Why not assemble reads? • ORFome assembler* • Three steps: • Putative ORFs are annotated for each read • ORFs are assembled using EULER • ORF homologs are searched for in the Integrated Microbial Genomes (IMG) database • Existing WGS assemblers • Sanger reads: Phrap, Celera, Arachne, JAZZ… • Short reads: Velvet, Newbler… * Y. Ye and H. Tang, "An ORFome assembly approach to metagenomics sequences analysis," Journal of Bioinformatics and Computational Biology, vol. 7, no. 3, pp. 455-471, June 2009

Genovo: De Novo Assembly for Metagenomes • Jonathan Laserson, Vladimir Jojic and Daphne Koller. RECOMB 2010, LNBI 6044, pp. 341-356, 2010

Main Idea • Propose a generative model for metagenomic data • Optimize it using iterated conditional modes (ICM) • Apply hill-climbing steps iteratively • Design a score for evaluation
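To make the ICM idea concrete, here is a minimal generic sketch in Python. It is not Genovo's implementation: the function name `icm`, the dictionary-based state, and the toy score in the usage note are all assumptions chosen for illustration. Each sweep sets one variable at a time to the value that maximizes the score while all others are held fixed, and stops when no variable changes.

```python
def icm(state, candidates, score, max_sweeps=100):
    """Generic iterated conditional modes (illustrative, not Genovo's code).

    state:      dict mapping variable name -> current value
    candidates: dict mapping variable name -> list of allowed values
    score:      function(state) -> float, higher is better
    """
    state = dict(state)  # work on a copy
    for _ in range(max_sweeps):
        changed = False
        for var, values in candidates.items():
            # Score of the state with only `var` replaced by v.
            def score_with(v):
                return score({**state, var: v})

            best_v = max(values, key=score_with)
            # Accept only strict improvements, so the score never decreases.
            if score_with(best_v) > score(state):
                state[var] = best_v
                changed = True
        if not changed:  # local optimum reached
            break
    return state


def toy_score(s):
    # Hypothetical score with its maximum at x = 3, y = -1.
    return -(s["x"] - 3) ** 2 - (s["y"] + 1) ** 2
```

For example, `icm({"x": 0, "y": 0}, {"x": list(range(-5, 6)), "y": list(range(-5, 6))}, toy_score)` climbs to the state with `x = 3` and `y = -1`. Like any hill-climbing scheme, this only guarantees a local optimum.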

Model • Initialize contigs: • Infinitely many contigs, each of infinite length • Partition the reads • Using a Chinese Restaurant Process
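The Chinese Restaurant Process partition can be sketched as follows. This is an illustrative sampler only: the function name `crp_partition`, the concentration parameter `alpha`, and the fixed seed are assumptions, not values from the paper. Each read joins an existing contig with probability proportional to the number of reads already in it, or opens a new contig with probability proportional to `alpha`.

```python
import random


def crp_partition(n_reads, alpha, seed=0):
    """Sample a partition of reads into contigs from a CRP (illustrative).

    alpha (> 0) is a hypothetical concentration parameter: larger values
    tend to produce more contigs.
    """
    rng = random.Random(seed)
    assignments = []  # assignments[i] = contig index of read i
    counts = []       # counts[k] = number of reads currently in contig k
    for _ in range(n_reads):
        # Existing contig k has weight counts[k]; a new contig has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new contig
        else:
            counts[k] += 1     # join an existing contig
        assignments.append(k)
    return assignments, counts
```

Because the number of contigs is not fixed in advance, this prior lets the data decide how many contigs are needed, which suits samples with an unknown number of genomes.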

Model • Generate the starting point o_i • Generate the length of each read • Quality of assembly of each read

Algorithm • Using ICM • Starting from an initial condition, hill-climbing moves are performed iteratively • Move 1: Consensus Sequence: • Select the most frequent base at each position
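Move 1 can be illustrated with a small majority-vote sketch. It assumes ungapped reads that are already placed at known offsets in one contig; the function name `consensus` and the use of `N` for uncovered positions are choices made for this example, and Genovo's actual alignments also handle insertions and deletions.

```python
from collections import Counter


def consensus(reads, offsets):
    """Majority-vote consensus for one contig (illustrative, ungapped).

    reads:   list of read strings
    offsets: offsets[i] is the contig position of the first base of reads[i]
    """
    columns = {}  # contig position -> Counter of bases observed there
    for read, off in zip(reads, offsets):
        for j, base in enumerate(read):
            columns.setdefault(off + j, Counter())[base] += 1
    if not columns:
        return ""
    lo, hi = min(columns), max(columns)
    # Most frequent base per covered column; 'N' where no read covers.
    return "".join(
        columns[p].most_common(1)[0][0] if p in columns else "N"
        for p in range(lo, hi + 1)
    )
```

With reads `["ACGT", "CGTA", "ACCT"]` at offsets `[0, 1, 0]`, the third read's mismatching `C` is outvoted and the consensus is `ACGTA`.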

Algorithm • Move 2: Read Mapping • For read i, first remove it, then recalculate its contig and alignment • First, for each potential location, compute the alignment • Then, select the location according to its probability • Filtering: consider only locations that share a common 10-mer with the read
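The 10-mer filter can be sketched with a simple k-mer index. This is an assumed implementation for illustration, not the one used in Genovo: the names `build_kmer_index` and `candidate_locations` are hypothetical, and only exact shared k-mers are considered. Alignments would then be computed only at the returned candidate locations.

```python
from collections import defaultdict


def build_kmer_index(contigs, k=10):
    """Map each k-mer to the (contig id, position) pairs where it occurs."""
    index = defaultdict(list)
    for cid, seq in enumerate(contigs):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((cid, pos))
    return index


def candidate_locations(read, index, k=10):
    """Return (contig id, implied read start) pairs sharing a k-mer with the read."""
    candidates = set()
    for j in range(len(read) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for cid, pos in index.get(read[j:j + k], ()):
            # The k-mer sits at offset j in the read, so the read would start at pos - j.
            candidates.add((cid, pos - j))
    return candidates
```

A read with no 10-mer in common with any contig yields an empty candidate set, so it would be left to start a contig of its own rather than being aligned everywhere.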

Algorithm • Move 3: Update geometric variable • Global moves: • Propose indels • Center • Merge contigs • Chimeric reads • Disassemble the dangling contigs

Evaluation • BLAST • PFAM • Designed score • 1st term: quality of assembly • 2nd term: penalty for total length • 3rd term: prefer to merge when V > V0

Results • Using 454 reads • Compare with Newbler, Velvet and EULER-SR • Single Genome

Results • Metagenome data • Score • PFAM

Discussion • New idea • Applies a mature algorithm to the assembly domain • Systematically describes and analyzes the problem and algorithm • Results are better

Discussion • Slow: minutes vs. hours for 300k 454 reads • Main idea: try to extend contigs as long as possible, so they will have more BLAST hits • Why choose 20 for V0? • How to deal with branching? Repeats? • Model: • Why can it capture the properties of metagenomic data? • How to argue the correctness of the model? • The distribution of starting points

Thank you
