Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee
What is a de Bruijn graph? • A “de Bruijn graph” is a directed graph • An edge represents an overlap between sequences of symbols • $V = (s_1, s_2, \dots, s_m)$ • $E = \{((v_1, v_2, \dots, v_n), (w_1, w_2, \dots, w_n)) : v_2 = w_1, v_3 = w_2, \dots, v_n = w_{n-1}\}$
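The edge condition above can be illustrated with a small sketch. It assumes the vertices are all length-n sequences over a given alphabet; the function name `de_bruijn_graph` and the binary alphabet are illustrative choices, not taken from the slides.

```python
from itertools import product


def de_bruijn_graph(symbols, n):
    """Return (vertices, edges) of the order-n de Bruijn graph over `symbols`.

    Assumption: vertices are all length-n sequences over the alphabet.
    """
    vertices = ["".join(p) for p in product(symbols, repeat=n)]
    # (v, w) is an edge when v2 = w1, v3 = w2, ..., vn = w(n-1),
    # i.e. w is v shifted left by one symbol with a new symbol appended.
    edges = [(v, v[1:] + s) for v in vertices for s in symbols]
    return vertices, edges


vertices, edges = de_bruijn_graph("01", 2)
```

With a two-symbol alphabet and n = 2 there are four vertices, and each vertex has one outgoing edge per symbol.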
Introduction • New sequencing techniques are commercially available (e.g. 454 Sequencing, Solexa) • 454 Sequencing: reads of ~100–200 bp • Solexa: reads of ~30 bp • Algorithms for whole-genome shotgun (WGS) assembly are not suitable for short reads • An overlap graph with one node per read becomes extremely large • Short reads lead to more ambiguous connections in the assembly
Introduction (cont) • The Euler assembler (Pevzner 2001) uses k-mers as the nodes of a de Bruijn graph • Reads are mapped as paths through the de Bruijn graph • High redundancy does not affect the number of nodes • “Velvet” deals effectively with experimental errors and repeats by using de Bruijn graphs built from k-mers
De Bruijn Graphs – structure (figure: structure of the graph)
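De Bruijn Graphs – structure (cont) • Each node represents a series of overlapping k-mers; adjacent k-mers overlap by k-1 nucleotides • Each node is attached to a twin node, which represents the reverse series of reverse-complement k-mers • Twin nodes capture overlaps between reads from opposite strands • The union of a node and its twin node is called a “block” • Nodes are joined by directed arcs: the last k-mer of an arc's origin node overlaps with the first k-mer of its destination node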
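A minimal sketch of the node/twin relationship described above, assuming a node is stored simply as a list of its k-mers. The helper names (`reverse_complement`, `kmers`, `twin_kmers`) are hypothetical and are not Velvet's own data structures.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """All k-mers of seq, in order; adjacent ones overlap by k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def twin_kmers(node_kmers):
    """Reverse series of reverse-complement k-mers (the twin node)."""
    return [reverse_complement(km) for km in reversed(node_kmers)]


node = kmers("ATGCA", 3)   # ['ATG', 'TGC', 'GCA']
twin = twin_kmers(node)    # ['TGC', 'GCA', 'CAT']
```

The twin spells out the reverse complement of the node's sequence, so a read from the opposite strand follows the twin in the same way that a forward read follows the node.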
De Bruijn Graphs – construction • Reads are hashed with a predefined k-mer length • Small k → increased connectivity → more ambiguous repeats • Large k → increased specificity → decreased connectivity • Choose k as a trade-off between “sensitivity” and “specificity”
De Bruijn Graphs – construction (cont) • For each k-mer, the hash table records the ID of the first read in which it occurs and its position in that read • Each k-mer is recorded together with its reverse complement • Nodes are created wherever there are distinct interruption points • Reads are then traced through the graph • Directed arcs are created between nodes where necessary
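The hashing and tracing steps above can be sketched roughly as follows. This is a simplification under stated assumptions: a k-mer and its reverse complement share one table entry via a "canonical" key (a common shortcut, not necessarily how Velvet stores them), every k-mer is treated as its own node, and arcs are counted on the forward strand only. All function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    # Assumption: a k-mer and its reverse complement share one key.
    return min(kmer, reverse_complement(kmer))


def hash_reads(reads, k):
    """Map each k-mer to (read ID, position) of its first occurrence."""
    first_seen = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            first_seen.setdefault(canonical(read[pos:pos + k]), (read_id, pos))
    return first_seen


def trace_reads(reads, k):
    """Count arcs between consecutive k-mers as each read is traced."""
    arcs = Counter()
    for read in reads:
        for pos in range(len(read) - k):
            arcs[(read[pos:pos + k], read[pos + 1:pos + k + 1])] += 1
    return arcs


reads = ["ATGGC", "TGGCA"]
first_seen = hash_reads(reads, 3)
arcs = trace_reads(reads, 3)
```

Here the arc from `TGG` to `GGC` is traversed by both reads, so its multiplicity is 2, while the table still holds only one entry per distinct k-mer no matter how many reads cover it.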
De Bruijn Graphs – simplification • Chains of blocks are simplified • No information is lost • If node A has only one outgoing arc, pointing to node B, and node B has only one incoming arc → merge A and B
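The merge rule above might be sketched like this, assuming the graph is a dictionary from node sequence to a set of successor sequences and that adjacent nodes overlap by k-1 bases. The representation and the name `simplify` are illustrative only; twin nodes and arc multiplicities are left out.

```python
def simplify(succ, k):
    """Merge A and B whenever A's only outgoing arc goes to B and
    B's only incoming arc comes from A. Nodes are labelled by sequence."""
    succ = {node: set(targets) for node, targets in succ.items()}
    for targets in list(succ.values()):
        for t in targets:
            succ.setdefault(t, set())
    pred = {node: set() for node in succ}
    for a, targets in succ.items():
        for b in targets:
            pred[b].add(a)

    merged = True
    while merged:
        merged = False
        for a in list(succ):
            if len(succ[a]) != 1:
                continue
            (b,) = succ[a]
            if b == a or len(pred[b]) != 1:
                continue
            ab = a + b[k - 1:]  # concatenate, dropping the k-1 base overlap
            out = {ab if s == a else s for s in succ.pop(b)}
            inn = {ab if p == b else p for p in pred.pop(a)}
            del succ[a], pred[b]
            succ[ab], pred[ab] = out, inn
            for p in inn - {ab}:
                succ[p].discard(a)
                succ[p].add(ab)
            for s in out - {ab}:
                pred[s].discard(b)
                pred[s].add(ab)
            merged = True
            break
    return succ


graph = {"ATG": {"TGG"}, "TGG": {"GGC"}, "GGC": {"GCA", "GCT"}}
simplified = simplify(graph, 3)
```

In this example the chain ATG → TGG → GGC collapses into a single node `ATGGC`, which keeps both outgoing arcs of the branching node, so no sequence information is lost.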
De Bruijn Graphs – error removal Velvet focuses on “topological features” of the graph • First step: remove tips • Tip: a chain of nodes that is disconnected at one end • Two criteria are used: (1) length and (2) minority count • Length: remove a tip if it is shorter than 2k bp, since two nearby errors can create a tip of up to 2k bp (figure: two nearby errors, each affecting k bp)
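De Bruijn Graphs – error removal (cont) • Minority count: the arc leading to the tip has multiplicity m, lower than the multiplicity n of an alternative arc (m < n) • Starting from node B, going through the tip is an alternative to a more common path (figure: nodes A, B, C and the tip, with arc multiplicities m and n)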
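The two tip criteria can be combined in a short predicate. This is a sketch under the assumption that the tip's length in bp and the multiplicities of the arcs leaving the junction node are already known; the function and parameter names are hypothetical.

```python
def is_removable_tip(tip_length_bp, k, tip_multiplicity, sibling_multiplicities):
    """Apply both criteria from the slides to one candidate tip.

    tip_multiplicity: multiplicity m of the arc leading into the tip.
    sibling_multiplicities: multiplicities n of the other arcs that
    leave the same junction node.
    """
    short_enough = tip_length_bp < 2 * k
    minority = any(tip_multiplicity < n for n in sibling_multiplicities)
    return short_enough and minority


# With k = 21, a 30 bp tip seen once next to a path seen 12 times qualifies.
removable = is_removable_tip(30, 21, 1, [12])
```

Requiring both conditions keeps long dead ends and well-supported branches in the graph, since either one may be genuine sequence.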
De Bruijn Graphs – error removal (cont) Second step: remove bubbles using Tour Bus • A bubble consists of redundant paths that start and end at the same nodes • Bubbles are created by errors or by biological variants such as SNPs (figure: bubble)
De Bruijn Graphs – error removal (cont) Tour Bus 1. Detect redundant paths 2. Compare them using dynamic programming 3. If they are similar, merge them
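Steps 2 and 3 can be illustrated with a plain edit-distance comparison. This is only a sketch: it assumes the two redundant paths have already been detected and extracted as sequences, it uses Levenshtein distance as a stand-in for whatever alignment Tour Bus actually performs, and the 20% divergence threshold is an arbitrary illustrative value.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def paths_are_similar(path_a, path_b, max_divergence=0.2):
    """Decide whether two redundant paths are close enough to merge.

    max_divergence is a hypothetical threshold, not a Velvet default.
    """
    longest = max(len(path_a), len(path_b))
    return longest == 0 or edit_distance(path_a, path_b) / longest <= max_divergence


similar = paths_are_similar("ACGTACGT", "ACGTTCGT")  # one substitution
```

Two paths that differ by a single base, as a sequencing error or a SNP would produce, pass the check, whereas unrelated sequences do not.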
De Bruijn Graphs – error removal (cont) Third step: remove erroneous connections • Erroneous connections are removed after the Tour Bus algorithm, using a basic coverage cutoff • Genuine short nodes that cannot be simplified in the graph should have high coverage
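A coverage cutoff of this kind amounts to a simple filter. The sketch below assumes per-node coverage values are available in a dictionary; the node names, coverage values, and cutoff of 4 are made up for illustration.

```python
def apply_coverage_cutoff(node_coverage, cutoff):
    """Keep only nodes whose coverage reaches the cutoff."""
    return {node: cov for node, cov in node_coverage.items() if cov >= cutoff}


# Hypothetical coverage values; 'n2' stands for a low-coverage erroneous connection.
coverage = {"n1": 23.5, "n2": 1.2, "n3": 18.0}
kept = apply_coverage_cutoff(coverage, cutoff=4.0)
```

Running this step after Tour Bus matters because, by then, genuine short nodes that remain in the graph are expected to have high coverage and so survive the filter.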
Breadcrumb: resolution of repeats • Using read pairs, pair up the long nodes • Flag paired reads using unambiguous long nodes (figure: unambiguous long nodes)
Breadcrumb: resolution of repeats • Extend the nodes as far as possible using the flagged paired reads • All nodes between A and B are paired up to either A or B
Experimental Results Test of the error removal pipeline on simulated data • Simulated reads are drawn from E. coli, S. cerevisiae, C. elegans, and H. sapiens • Coverage density vs. N50 for H. sapiens • N50 is limited by the natural repetition of the reference genome (figure: N50 against coverage density; series: Ideal, + Error (1%), + SNP)
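For reference, the N50 statistic used above can be computed as follows. The sketch uses the usual definition (the largest length L such that contigs of length at least L cover half of the total assembled length); the contig lengths in the example are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


example = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])  # total 54; 10 + 9 + 8 = 27 reaches half
```

A higher N50 means that more of the assembly sits in long contigs, which is why it serves as the measure of assembly quality on the simulated data.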
Experimental Results (cont) Test of the error removal pipeline on experimental data • A 173,428 bp human BAC was sequenced using Solexa machines • Reads were 35 bp long, and k = 31 • Tour Bus increased sensitivity by correcting errors while preserving the integrity of the graph structure
Conclusions • Velvet is a de Bruijn graph-based sequence assembly method for short reads • Errors are handled by removing tips and by the Tour Bus algorithm • A large number of repeats are resolved by the Breadcrumb algorithm • Velvet was assessed on simulated and real datasets and performed well