Genovo: De Novo Assembly for Metagenomes Gao Song 2010/07/14
Outline • Overview of Metagenomics • Current Assemblers • Genovo Assembly
Motivation • Metagenomics is: • Why Do We Need Metagenomics? • Snapshot of the bacterial community • Most bacteria cannot be cultivated (<1% can)
Applications • Monitoring the impact of pollutants on ecosystems • Discovery of new genes, enzymes… - Global Ocean Sampling Expedition • Human Microbiome Project • JGI sequenced Acid Mine Drainage sample
Two Paradigms • Marker Gene Sequencing • 16S rRNA: • Two ways • Other marker genes: RuBisCO, NifH • Reveals only the composition • Whole Genome Sequencing (WGS) • Detailed picture of the community
Complex Communities • [Figure; surviving labels: ×5000, >1000, 200 L, 1 million]
Current Status • Why not assemble reads? • ORFome assembler* • Three steps: • Putative ORFs are annotated for each read • ORFs are assembled using EULER • ORF homologs are searched for in the Integrated Microbial Genomes (IMG) database • Existing WGS assemblers • Sanger reads: Phrap, Celera, Arachne, JAZZ… • Short reads: Velvet, Newbler… * Y. Ye and H. Tang, "An ORFome assembly approach to metagenomics sequences analysis," Journal of Bioinformatics and Computational Biology, vol. 7, no. 3, pp. 455-471, June 2009
Genovo: De Novo Assembly for Metagenomes • Jonathan Laserson, Vladimir Jojic and Daphne Koller. RECOMB 2010, LNBI 6044, pp. 341-356, 2010
Main Idea • Propose a generative model for Metagenome data • Using iterated conditional modes (ICM) • Using hill-climbing steps iteratively • Design a score for evaluation
Model • Initialize contigs: • Infinite contigs with infinite length • Partition the reads • Using Chinese Restaurant Process
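The read partition step can be illustrated with a minimal sketch of a Chinese Restaurant Process, where each read joins an existing contig with probability proportional to the number of reads already there, or opens a new contig with probability proportional to a concentration parameter. The function name `crp_partition` and the value `alpha=2.0` are assumptions made for illustration, not taken from the slides.

```python
import random

def crp_partition(num_reads, alpha, rng):
    """Assign each read to a contig index with a Chinese Restaurant Process."""
    assignments = []
    counts = []  # counts[k] = number of reads already assigned to contig k
    for _ in range(num_reads):
        # Existing contig k has weight counts[k]; a brand-new contig has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new contig
        else:
            counts[k] += 1     # join an existing contig
        assignments.append(k)
    return assignments, counts

# alpha is a hypothetical concentration parameter chosen for the demo.
assignments, counts = crp_partition(1000, alpha=2.0, rng=random.Random(0))
```

Because the weight of a new contig stays fixed while existing contigs grow, the number of contigs is unbounded in principle but grows slowly with the number of reads.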
Model • Generate the starting point o_i of each read • Generate the length of the read • Quality of assembly of each read
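A rough sketch of how a single read could be generated under such a model follows. It assumes the starting point is drawn from a symmetric geometric distribution (suggested by the "geometric variable" updated later in the algorithm), that the read length is supplied by the caller, and that noise consists of substitutions only. The names `sample_offset`, `generate_read`, `rho` and `error_rate` are hypothetical, and the slides do not specify the noise model.

```python
import random

BASES = "ACGT"

def sample_offset(rho, rng):
    """Draw an integer o with p(o) proportional to (1 - rho) ** abs(o)."""
    while True:
        magnitude = 0
        while rng.random() >= rho:   # each trial succeeds with probability rho
            magnitude += 1
        sign = rng.choice((1, -1))
        if magnitude == 0 and sign == -1:
            continue                 # reject so that o = 0 is not counted twice
        return sign * magnitude

def contig_base(contig, pos, rng):
    """Contigs are conceptually infinite; letters are created lazily, uniformly."""
    if pos not in contig:
        contig[pos] = rng.choice(BASES)
    return contig[pos]

def generate_read(contig, rho, length, error_rate, rng):
    """Copy `length` letters from the contig starting at a sampled offset."""
    start = sample_offset(rho, rng)
    letters = []
    for pos in range(start, start + length):
        base = contig_base(contig, pos, rng)
        if rng.random() < error_rate:  # assumed noise: substitutions only
            base = rng.choice([b for b in BASES if b != base])
        letters.append(base)
    return start, "".join(letters)

rng = random.Random(1)
contig = {}
start, read = generate_read(contig, rho=0.05, length=30, error_rate=0.0, rng=rng)
```

With `error_rate=0.0` the read is an exact copy of the contig letters it covers, which makes the sketch easy to check.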
Algorithm • Using ICM • Starting from initial condition, hill-climbing moves are performed iteratively • Move 1: Consensus Sequence: • Select the most frequent base
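Move 1 can be sketched as a column-wise majority vote. The example below assumes gap-free alignments given as hypothetical `(start, sequence)` pairs on one contig; handling of indels is omitted.

```python
from collections import Counter

def consensus(aligned_reads):
    """Return (leftmost position, consensus string) for gap-free aligned reads."""
    columns = {}
    for start, seq in aligned_reads:
        for i, base in enumerate(seq):
            columns.setdefault(start + i, Counter())[base] += 1
    if not columns:
        return 0, ""
    lo, hi = min(columns), max(columns)
    letters = []
    for pos in range(lo, hi + 1):
        if pos in columns:
            # Most frequent base among the reads covering this position.
            letters.append(columns[pos].most_common(1)[0][0])
        else:
            letters.append("N")  # position not covered by any read
    return lo, "".join(letters)

reads = [(0, "ACGT"), (2, "GTTA"), (1, "CGTT"), (2, "CTTA")]
offset, seq = consensus(reads)
```

In this toy input the fourth read disagrees at position 2 (C instead of G) and is outvoted by the three other reads covering that position.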
Algorithm • Move 2: Read Mapping • For read i, first remove it, then recompute its contig and alignment • First, compute the alignment for each candidate location • Then, select a location according to its probability • Filtering: only consider locations that share a common 10-mer with the read
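The 10-mer filter can be sketched as a simple index lookup: only contig locations that share at least one exact 10-mer with the read are passed on to the alignment step. The helper names `build_kmer_index` and `candidate_locations` and the toy contigs are assumptions for illustration; the alignment and sampling steps are not shown.

```python
from collections import defaultdict

K = 10  # k-mer size used for filtering, as on the slide

def build_kmer_index(contigs, k=K):
    """Map every k-mer to the set of (contig name, position) where it occurs."""
    index = defaultdict(set)
    for name, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add((name, i))
    return index

def candidate_locations(read, index, k=K):
    """Return (contig name, read start offset) pairs sharing a k-mer with the read."""
    candidates = set()
    for j in range(len(read) - k + 1):
        for name, i in index.get(read[j:j + k], ()):
            candidates.add((name, i - j))  # where the read would start on the contig
    return candidates

contigs = {
    "c0": "ACGTACGTTAGCCGATAGCTAGGCTA",
    "c1": "TTTTTTTTTTGGGGGGGGGG",
}
index = build_kmer_index(contigs)
read = contigs["c0"][5:20]
candidates = candidate_locations(read, index)
```

Reads with no shared 10-mer yield an empty candidate set, which in the generative picture would correspond to starting a new contig.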
Algorithm • Move 3: update the geometric variable • Global moves: • Propose indels • Center • Merge contigs • Chimeric reads • Disassemble the dangling contigs
Evaluation • BLAST • PFAM • Designed score • 1st term: quality of assembly • 2nd term: penalty for total length • 3rd term: prefer to merge when V>V0
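One way to see how the three terms interact is the sketch below. It assumes a per-base length penalty of log 4 and a per-contig bonus of log 4 × V0; these exact weights are an assumption made for illustration and are not given on the slide. Under that assumption, merging two contigs that overlap by V bases changes the score by log 4 × (V − V0), so merging pays off only when V > V0. The function names are hypothetical, and V0 = 20 is the value questioned in the discussion.

```python
import math

LOG4 = math.log(4)

def assembly_score(read_log_likelihoods, contig_lengths, v0=20):
    """Three-term score: read quality, total-length penalty, per-contig bonus."""
    quality = sum(read_log_likelihoods)               # 1st term
    length_penalty = LOG4 * sum(contig_lengths)       # 2nd term (assumed weight)
    contig_bonus = LOG4 * v0 * len(contig_lengths)    # 3rd term (assumed weight)
    return quality - length_penalty + contig_bonus

def merge_gain(len_a, len_b, overlap, v0=20):
    """Score change from merging two contigs, holding read quality fixed."""
    before = assembly_score([], [len_a, len_b], v0)
    after = assembly_score([], [len_a + len_b - overlap], v0)
    return after - before

gain_long_overlap = merge_gain(500, 400, overlap=30)
gain_short_overlap = merge_gain(500, 400, overlap=10)
```

The sketch holds the read-quality term fixed; in practice a merge would also change how well reads align, which the first term would capture.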
Results • Using 454 reads • Compare with Newbler, Velvet and EULER-SR • Single Genome
Results • Metagenome data • Score • PFAM
Discussion • New idea • Applies a mature algorithm to the assembly domain • Systematically describes and analyzes the problem and the algorithm • Results are better
Discussion • Slow: hours (vs. minutes for other assemblers) for 300k 454 reads • Main idea: try to extend contigs as long as possible, so they will have more BLAST hits • Why choose 20 for V0? • How to deal with branching? Repeats? • Model: • Why can it capture the properties of metagenomic data? • How to argue the correctness of the model? • The distribution of starting points