Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Elena Helman • CSCI2950 • September 30, 2008 • Velvet: Algorithms for de novo short read assembly using de Bruijn graphs • Daniel R. Zerbino and Ewan Birney

Velvet • A set of algorithms that use a modified form of de Bruijn graph for genome assembly from short reads, in order to: • eliminate errors • resolve repeats

Terms • Node (N) • Sequence of a node, s(N) • Block

Graph Construction • Hash table: for each k-mer, records the first read containing that k-mer and the position of its occurrence within that read. • Each k-mer is recorded together with its reverse complement. • For each read, Velvet records which of its original k-mers are overlapped by subsequent reads. • A node is created for each uninterrupted sequence of original overlapping k-mers. • Directed arcs are added using the correspondence between original k-mers and nodes. • Simplification: when node A has only one outgoing arc, pointing to node B, and B has only one incoming arc, the two nodes and their twins are merged.
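The construction and simplification steps can be illustrated with a small Python sketch. This is not Velvet's implementation: the function names are hypothetical, one k-mer is one node before merging, only the forward strand is followed when adding arcs (so twin nodes are not modelled), and keying a k-mer by the lexicographically smaller of itself and its reverse complement is an assumed convention.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_kmer_table(reads, k):
    """Map each k-mer to (read index, offset) of its first occurrence.

    A k-mer and its reverse complement share one entry, keyed by whichever
    of the two sorts first (an assumption of this sketch).
    """
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            key = min(kmer, reverse_complement(kmer))
            table.setdefault(key, (read_id, pos))  # keep the first read only
    return table

def build_arcs(reads, k):
    """Directed arcs between consecutive k-mers (forward strand only)."""
    arcs = defaultdict(set)
    for read in reads:
        for pos in range(len(read) - k):
            arcs[read[pos:pos + k]].add(read[pos + 1:pos + k + 1])
    return arcs

def simplify(arcs):
    """Merge A -> B whenever A has one outgoing arc and B has one incoming arc.

    Returns the merged node sequences. Isolated cycles are ignored here.
    """
    preds = defaultdict(set)
    for src, dsts in arcs.items():
        for dst in dsts:
            preds[dst].add(src)
    nodes = set(arcs) | set(preds)
    merged = []
    for node in nodes:
        incoming = preds.get(node, ())
        # Skip nodes that are absorbed into their unique predecessor.
        if len(incoming) == 1 and len(arcs[next(iter(incoming))]) == 1:
            continue
        seq, cur = node, node
        while len(arcs.get(cur, ())) == 1:
            nxt = next(iter(arcs[cur]))
            if len(preds[nxt]) != 1:
                break
            seq += nxt[-1]  # consecutive k-mers overlap by k - 1 bases
            cur = nxt
        merged.append(seq)
    return sorted(merged)
```

With two overlapping reads and k = 3, the six arcs form a single chain, so `simplify` collapses them into one node spelling the full sequence.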

Error Removal • Tips • A “tip” is a chain of nodes that is disconnected on one end. • A tip is removed if it is shorter than 2k bp and the path to it is less traveled (lower arc multiplicity) than the alternative routes out of the same junction. • Bubbles • Removed with Tour Bus
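The two tip-removal criteria can be sketched as follows. The data layout is an assumption made for illustration: `lengths` maps a node to its sequence length in bp, and `arcs` maps a node to a dictionary of successors and arc multiplicities. The sketch assumes the graph is already simplified, so a tip is a single dead-end node, and it only looks at tips on the outgoing side.

```python
def remove_tips(lengths, arcs, k):
    """Return (pruned arcs, removed nodes) after one pass of tip removal.

    A node is treated as a tip when it has no outgoing arcs, is shorter
    than 2k bp, and the arc leading to it has lower multiplicity than at
    least one other arc leaving the same junction node.
    """
    removed = set()
    for src, out in arcs.items():
        for dst, mult in out.items():
            is_dead_end = not arcs.get(dst)
            is_short = lengths[dst] < 2 * k
            is_minority = any(m > mult for other, m in out.items() if other != dst)
            if is_dead_end and is_short and is_minority:
                removed.add(dst)
    pruned = {
        src: {dst: m for dst, m in out.items() if dst not in removed}
        for src, out in arcs.items()
        if src not in removed
    }
    return pruned, removed
```

For example, with k = 21 a 10 bp dead end reached by an arc of multiplicity 2 is dropped when a sibling arc has multiplicity 20, while an 80 bp dead end is kept because it is not shorter than 2k = 42 bp.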

Tour Bus • Identifies redundant paths • The two sequences are aligned and, if similar enough, merged. • The consensus path is defined as the path that reached the end node first, i.e., the shortest path, where the distance between two nodes A and B is defined as (length of s(B)) / (multiplicity of arc(A->B)) • Each minority node is compared to the corresponding majority (consensus) node using local sequence alignment • All information attached to the minority node is mapped onto the majority node.
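The distance definition above lends itself to a Dijkstra-style search. The sketch below is an illustration only, not Velvet's code: it assumes an acyclic graph, uses a hypothetical layout (`seqs` maps node to sequence, `arcs` maps node to a dictionary of successors and multiplicities), and stops at reporting redundant path pairs. The alignment and merging steps are left out.

```python
import heapq

def path_to(prev, node):
    """Follow predecessor links back to the start and return the path."""
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def find_redundant_paths(seqs, arcs, start):
    """Return (majority path, minority path) pairs found from `start`.

    Stepping from A to B costs len(s(B)) / multiplicity(A -> B), so
    well-covered paths reach a node first and become the consensus.
    """
    dist, prev = {start: 0.0}, {start: None}
    heap, done = [(0.0, start)], set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nxt, mult in arcs.get(node, {}).items():
            cand = d + len(seqs[nxt]) / mult
            if cand < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = cand, node
                heapq.heappush(heap, (cand, nxt))
    pairs = []
    for a, out in arcs.items():
        if a not in done:
            continue
        for b in out:
            if prev.get(b) == a:
                continue  # arc belongs to the shortest-path tree
            majority = path_to(prev, b)
            minority = path_to(prev, a) + [b]
            # Drop the shared prefix so both paths start where they diverge.
            i = 0
            while (i + 1 < min(len(majority), len(minority))
                   and majority[i + 1] == minority[i + 1]):
                i += 1
            pairs.append((majority[i:], minority[i:]))
    return pairs
```

On a simple bubble S -> A -> E and S -> B -> E, where the arcs through A have multiplicity 10 and those through B have multiplicity 1, the path through A reaches E first and is reported as the majority path.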

Testing Tour Bus on simulated data Reads = 35 bp, k = 21 • N50 is the contig length such that 50% of the genome sequence is contained in contigs of length N50 or greater • SNP: Single Nucleotide Polymorphism • Notice the height of the N50 plateau in E. coli versus H. sapiens
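As a worked example of the N50 definition, the following sketch computes it from a list of contig lengths. The function name and the optional `genome_length` parameter are assumptions of this sketch; when no genome length is supplied, the total assembled length is used as a stand-in.

```python
def n50(contig_lengths, genome_length=None):
    """Length L such that contigs of length >= L cover half the reference.

    If genome_length is None, the sum of contig lengths is used instead
    (an assumption of this sketch). Returns 0 if half is never reached.
    """
    total = genome_length if genome_length is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For contigs of 80, 70, 50, 40, 30 and 20 bp (290 bp in total), the two longest contigs already cover 150 bp, which is more than half, so the N50 is 70.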

Testing Tour Bus on real data Reads = 35 bp, k = 31

Breadcrumb • Extends and connects contigs through repeated regions • Using read pairs, Breadcrumb flags all nodes that contain the mates of reads in the established 'long nodes' • Breadcrumb then extends each unique node by going as far as possible from one connected flagged node to the next
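The flag-and-extend idea can be sketched roughly as below. Everything here is a simplifying assumption rather than Velvet's actual procedure: `read_node` maps a read to the node that holds it, `mate_of` maps a read to its mate, `arcs` maps a node to its successors, and the walk stops as soon as the next flagged node is missing or ambiguous.

```python
def breadcrumb_extend(long_node, read_node, mate_of, arcs):
    """Extend `long_node` through flagged nodes and return the path taken."""
    # Flag every node holding the mate of a read placed in the long node.
    flagged = {
        read_node[mate_of[read]]
        for read, node in read_node.items()
        if node == long_node and mate_of.get(read) in read_node
    }
    flagged.discard(long_node)

    path, current = [long_node], long_node
    while True:
        candidates = [n for n in arcs.get(current, ())
                      if n in flagged and n not in path]
        if len(candidates) != 1:
            break  # dead end or ambiguous: stop (assumed stopping rule)
        current = candidates[0]
        path.append(current)
    return path
```

In a graph where L branches to X and Y and both rejoin at M, mates landing in X and M mark a trail L, X, M, so the unflagged branch through Y is not followed.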

Testing Breadcrumb on simulated data Why does N50 decrease as the insert length increases?

Complexity • Hash table • Graph construction • rate-limiting step • needs enough memory to hold the entire genome data • Error correction: O(N log N) • Repeat resolution: dependent on N

Limitations/future directions • Actual assembly • no superpath problem • Test Breadcrumb on real data • Ideas?
