Velvet is a set of algorithms that utilize modified de Bruijn graphs to improve genome assembly with short reads by eliminating errors and resolving repeats.
Elena Helman • CSCI2950 • September 30, 2008 • Velvet: Algorithms for de novo short read assembly using de Bruijn graphs • Daniel R. Zerbino and Ewan Birney
Velvet • Set of algorithms that use an altered form of de Bruijn graphs in genome assembly with short reads to: • eliminate errors • resolve repeats
Terms • Node (N) • sequence of a node s(N) • block
Graph Construction • Hash table: stores, for each k-mer, the first read containing that k-mer and its position within that read. • Each k-mer is recorded together with its reverse complement. • For each read, record which of its original k-mers are overlapped by subsequent reads. • A node is created for each uninterrupted sequence of original overlapping k-mers. • Directed arcs are added using the correspondence between original k-mers and nodes. • Simplification: when node A has only one outgoing arc, and that arc points to a node B that has only one incoming arc, the two nodes (and their twins) are merged.
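The hash-table step above can be sketched as follows. This is a minimal illustration, not Velvet's actual implementation: the function names are hypothetical, and storing a k-mer and its reverse complement under one shared key (the lexicographically smaller of the two) is an assumption made here to model "recorded with its reverse complement".

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def build_kmer_table(reads, k):
    """Map each k-mer to (read index, offset) of its first occurrence.

    Assumption: a k-mer and its reverse complement share one entry,
    keyed by whichever of the two sorts first.
    """
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            key = min(kmer, reverse_complement(kmer))
            # setdefault keeps the first (read, position) seen for this key
            table.setdefault(key, (read_id, pos))
    return table


table = build_kmer_table(["ACGTA", "CGTAC"], 3)
```

With these two toy reads, every k-mer of the second read is already present from the first read (directly or as a reverse complement), so the second read adds no new entries.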
Error Removal • Tips • A “tip” is a chain of nodes that is disconnected on one end. • A tip is removed if it is shorter than 2k bp and the path leading to it is less traveled (lower arc multiplicity) than the alternative routes. • Bubbles • Removed with Tour Bus
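The two tip-removal conditions can be written as a small predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and "less traveled than alternative routes" is interpreted here as the arc into the tip having lower multiplicity than at least one other arc leaving the same junction node.

```python
def is_removable_tip(tip_length, k, tip_arc_multiplicity, other_arc_multiplicities):
    """Decide whether a tip should be clipped.

    tip_length: length of the tip in bp.
    tip_arc_multiplicity: multiplicity of the arc from the junction into the tip.
    other_arc_multiplicities: multiplicities of the other arcs leaving the junction.
    """
    short_enough = tip_length < 2 * k
    # Assumed interpretation: some alternative arc is more heavily traveled.
    less_traveled = any(m > tip_arc_multiplicity for m in other_arc_multiplicities)
    return short_enough and less_traveled
```

For example, with k = 21 a 30 bp tip reached by an arc of multiplicity 1 next to an arc of multiplicity 5 would be removed, while a 50 bp tip would be kept because it is not shorter than 2k = 42 bp.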
Tour Bus • Identifies redundant paths • The two sequences are aligned and, if similar enough, merged. • The consensus path is defined as the path that reached the end node first, i.e., the shortest path, where the distance between two adjacent nodes A and B is defined as (length of s(B)) / (multiplicity of arc A->B). • Each minority node is compared to the corresponding majority (consensus) node using a local sequence alignment. • All information attached to the minority node is mapped onto the majority node.
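The distance definition above suggests a Dijkstra-style traversal, sketched below. This is only an illustration of the search order, assuming a toy graph representation (`arcs` maps a node to `(successor, multiplicity)` pairs, `seq_len` gives the length of s(N)); the function name is hypothetical, and the alignment and merging steps are omitted.

```python
import heapq


def tour_bus_order(arcs, seq_len, start):
    """Visit nodes in order of distance from `start`.

    The cost of arc A->B is len(s(B)) / multiplicity(A->B), so heavily
    traveled paths reach a node first. An arc that leads to a node already
    reached by a shorter path is reported as a collision, i.e. a candidate
    redundant (minority) path.
    """
    dist = {start: 0.0}
    parent = {start: None}
    done = set()
    collisions = []
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for succ, mult in arcs.get(node, []):
            new_dist = d + seq_len[succ] / mult
            if succ in done:
                collisions.append((node, succ))
            elif new_dist < dist.get(succ, float("inf")):
                dist[succ] = new_dist
                parent[succ] = node
                heapq.heappush(heap, (new_dist, succ))
    return dist, parent, collisions


# Toy bubble: S -> A -> E is well covered, S -> B -> E is a rare variant.
arcs = {"S": [("A", 10), ("B", 1)], "A": [("E", 10)], "B": [("E", 1)]}
seq_len = {"S": 20, "A": 20, "B": 20, "E": 20}
dist, parent, collisions = tour_bus_order(arcs, seq_len, "S")
```

In this toy bubble the end node E is first reached through A, so S, A, E plays the role of the consensus path, and the later arrival through B is flagged as the redundant path to be aligned against it.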
Testing Tour Bus on simulated data • Reads = 35 bp, k = 21 • N50 is the contig length such that 50% of the genome sequence is contained in contigs of that length or greater • SNP: Single Nucleotide Polymorphism • Notice the height of the N50 plateau in E. coli versus H. sapiens
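As a worked example of the N50 definition, the sketch below computes it from a list of contig lengths. One assumption to note: it uses the total assembled length as the reference, whereas the slide states the definition relative to the genome sequence; the function name is illustrative.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold
    at least half of the total assembled bases (0 for no contigs)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For contigs of lengths 2 through 10 (54 bp in total), the three longest contigs (10 + 9 + 8 = 27 bp) already cover half of the assembly, so the N50 is 8.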
Testing Tour Bus on real data • Reads = 35 bp, k = 31
Breadcrumb • Extends and connects contigs through repeated regions • Using read pairs, Breadcrumb flags all the nodes that contain the mates of reads in the established 'long nodes' • Breadcrumb then extends each unique long node by going as far as possible from one connected flagged node to the next
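The flag-then-extend idea can be sketched as below. This is a simplified illustration, not Velvet's code: all names and the dictionary-based data layout are hypothetical, and stopping whenever the next flagged node is missing or ambiguous is an assumption about what "as far as possible" means.

```python
def flag_mate_nodes(long_node, reads_in_node, mate_of, node_of_read):
    """Flag every node that holds the mate of a read placed in `long_node`."""
    flagged = set()
    for read in reads_in_node[long_node]:
        mate = mate_of.get(read)
        if mate is not None and mate in node_of_read:
            flagged.add(node_of_read[mate])
    flagged.discard(long_node)  # mates inside the long node itself add nothing
    return flagged


def extend_through_flagged(long_node, successors, flagged):
    """Walk from `long_node` across flagged successors while the choice is unique."""
    path = [long_node]
    current = long_node
    while True:
        candidates = [n for n in successors.get(current, [])
                      if n in flagged and n not in path]
        if len(candidates) != 1:  # assumed stop rule: dead end or ambiguity
            break
        current = candidates[0]
        path.append(current)
    return path


reads_in_node = {"L": ["r1", "r2"]}
mate_of = {"r1": "m1", "r2": "m2"}
node_of_read = {"m1": "X", "m2": "Y"}
successors = {"L": ["X", "Z"], "X": ["Y"], "Y": []}

flagged = flag_mate_nodes("L", reads_in_node, mate_of, node_of_read)
path = extend_through_flagged("L", successors, flagged)
```

Here node Z is a neighbor of the long node L but holds no mate reads, so the walk ignores it and follows the flagged nodes X and then Y.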
Testing Breadcrumb on simulated data • Why does N50 decrease as the insert length increases?
Complexity • Hash table • Graph construction • rate-limiting step • enough memory to hold entire genome data • Error correction: O(N log N) • Repeat resolution: dependent on N
Limitations/future directions • Actual assembly • no superpath problem • Test Breadcrumb on real data • Ideas?