ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe
Presented by: Mohit Jain

Motivation
(P) DNA Sequencing:
• Chromosome length: ~1,000 to ~250,000,000 bp
• Longest sequenceable fragment: ~600 bp
(S) Shotgun Method: determine the sequence by breaking the genome into many small segments (reads)
(P) Sequence/Genome Assembly: combining these reads to reconstruct the genome

Motivation
(S) Original Genome: modelled as the Shortest Common Superstring (SCS) problem, i.e. find the shortest sequence that contains every read as a substring
(P) Repeats: genomes contain repeats, but the SCS represents each repeat only once
(S) Overlap Graph

(S) Overlap Graph
• Each read forms a node
• An edge exists between two nodes if the corresponding reads overlap
Algorithm:
Step 1: Remove redundant edges and classify the remaining edges as required or optional
Step 2: Find the shortest walk that includes all required edges
Red/thin edges: false overlaps
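The overlap graph described above can be sketched in a few lines. This is a minimal illustration only, not the method of any particular assembler: the function names `overlap_length` and `build_overlap_graph`, the `min_overlap` threshold, and the exact-match (error-free) comparison are all assumptions made for the example.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest such overlap is shorter than `min_overlap`.
    Assumes exact matching, i.e. no sequencing errors.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Map each ordered pair of reads (a, b) to its overlap length.

    Nodes are the reads themselves; an edge (a, b) is present only when
    a suffix of `a` overlaps a prefix of `b` by at least `min_overlap`.
    """
    graph = {}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[(a, b)] = length
    return graph


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_overlap=3)
```

Lowering `min_overlap` to 2 in this toy input adds an edge from "ACGTAC" to "ACGGTT" based on the two bases "AC" alone, which illustrates how short overlaps produce spurious edges.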

(P) Overlap Graph
• Microreads are only 25-50 bases long (for HTS)
• Shorter reads => shorter overlaps => more reads => more overlaps
- Very large number of (mostly false) overlaps
- Large number of reads + short overlaps + higher error rate

(S) De Bruijn Graph
• To construct the de Bruijn graph, all reads are broken into overlapping subsequences of length k (k-mers)
• Each subsequence of length k-1 represents a node
• A directed edge e exists from node a to node b iff there exists a k-mer whose prefix of length k-1 is a and whose suffix of length k-1 is b
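The construction above translates almost directly into code. The following is a small sketch under simplifying assumptions: reads are error-free strings, only one strand is considered (no reverse complements), and the function name `build_de_bruijn` is hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Nodes are (k-1)-mers. For every k-mer found in a read, an edge is
    added from its (k-1)-length prefix to its (k-1)-length suffix.
    Successors are stored in a set, so a k-mer seen many times still
    contributes a single edge.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = build_de_bruijn(["ACGT", "CGTA"], k=3)
```

Here the k-mer "CGT" occurs in both reads but yields only one edge, from "CG" to "GT". Unlike the overlap graph, no pairwise comparison of reads is needed.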

(S) De Bruijn Graph
• Condensed by collapsing non-ambiguous (unbranched) paths
• Genome: an Eulerian path in this graph (superwalk: a walk that includes all edges)
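Collapsing non-ambiguous paths can be illustrated as follows. This sketch assumes the graph is given as a mapping from each (k-1)-mer to the set of its successor (k-1)-mers; the function name `condense` is hypothetical, and isolated cycles (where every node has exactly one predecessor and one successor) are deliberately ignored to keep the example short.

```python
from collections import defaultdict


def condense(graph):
    """Collapse unbranched paths of a de Bruijn graph into sequences.

    `graph` maps a (k-1)-mer to the set of its successor (k-1)-mers.
    A node is "simple" when it has exactly one incoming and one outgoing
    edge; walks start at non-simple nodes and run through simple nodes.
    Note: cycles made only of simple nodes are not reported.
    """
    indegree = defaultdict(int)
    nodes = set(graph)
    for successors in graph.values():
        for dst in successors:
            indegree[dst] += 1
            nodes.add(dst)

    def is_simple(node):
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    paths = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # interior of some path, never a starting point
        for nxt in sorted(graph.get(node, ())):
            seq = node + nxt[-1]
            while is_simple(nxt):
                (nxt,) = graph[nxt]  # the single successor
                seq += nxt[-1]
            paths.append(seq)
    return paths


toy = {"AC": {"CG"}, "CG": {"GT"}, "GT": {"TA", "TC"}}
paths = condense(toy)
```

In the toy graph the chain AC, CG, GT is merged into the single sequence "ACGT", and the branch at GT produces the two short sequences "GTA" and "GTC".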

Paired Reads (Mate Pairs)
• Sequence the two ends of a fragment of known size
• Results in better assemblies, but makes assembly more complicated

Current Approaches
• Velvet
• EULER-USR
• ALLPATHS
(Velvet and EULER-USR are based on the de Bruijn graph method)

ALLPATHS
Step I. Builds the unipath graph
Step II. Localizes read sequences before assembly
Unipath: maximal unbranched sequence

Read Localization

Short Fragment Pair Merger
• Fills the gap between the two reads of a pair
• Builds a local unipath graph
• Extends both ends (of all reads) based on the local unipath graph
• For each pair, searches for other pairs that overlap on both ends, and merges them to obtain longer reads

Short Fragment Pair Merger
• Repeat the process for all pairs
• Once the sequence is complete, update the local unipath graph
• Iteratively merge local unipath graphs to obtain a global unipath graph representing the genome

ALLPATHS Paired-Read Assembly Algorithm
Step 1: Creating Approximate Unipaths
1a: Error correction
1b: k-mer numbering and construction of a searchable data structure (ignoring any overlaps between reads)
1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered
Read pairs -> Unipaths -> Localize

Step 2: Selecting Seeds
Seeds = unipaths around which assemblies of genomic regions are built
Ideal seed: a long unipath with low copy number (= 1)
Copy number = inferred from the read coverage of the unipath
2a: For each unipath, compute the closest unipaths in the set to its left and to its right
2b: If the distance between the left and right neighbours is less than 4 kb, remove the middle unipath
2c: After all such unipaths are removed, the remaining unipaths form the seeds
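The seed-selection idea can be sketched roughly as below. Several things here are assumptions made for illustration and are not taken from the ALLPATHS implementation: copy number is estimated as coverage divided by the median coverage and rounded, unipath positions are given as known coordinates on a line (in practice they would have to be inferred from read-pair links), and middle unipaths are removed greedily from left to right. All function and parameter names are hypothetical.

```python
from statistics import median


def estimate_copy_numbers(coverage):
    """Estimate copy number per unipath from its read coverage.

    Assumption: the median coverage corresponds to copy number 1, so the
    estimate is round(coverage / median), with a floor of 1.
    """
    baseline = median(coverage.values())
    return {uid: max(1, round(cov / baseline)) for uid, cov in coverage.items()}


def candidate_seeds(lengths, copy_numbers, min_length):
    """Keep long unipaths whose estimated copy number is 1."""
    return sorted(
        uid for uid, cn in copy_numbers.items()
        if cn == 1 and lengths[uid] >= min_length
    )


def thin_seeds(positions, max_gap=4000):
    """Greedy version of steps 2a-2c.

    `positions` maps a unipath id to hypothetical (start, end) coordinates.
    A unipath is dropped when its left and right neighbours are less than
    `max_gap` bases apart; the scan restarts after every removal.
    """
    kept = sorted(positions, key=lambda uid: positions[uid][0])
    changed = True
    while changed:
        changed = False
        for i in range(1, len(kept) - 1):
            left_end = positions[kept[i - 1]][1]
            right_start = positions[kept[i + 1]][0]
            if right_start - left_end < max_gap:
                del kept[i]
                changed = True
                break
    return kept


coverage = {"u1": 30, "u2": 31, "u3": 62, "u4": 29}
copies = estimate_copy_numbers(coverage)
```

With these toy values, "u3" has roughly twice the median coverage and is therefore assigned copy number 2, which excludes it from the candidate seeds.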

Step 3: Assembling Neighbourhoods Around the Seeds
Neighbourhood = seed + 10 kb on each side
3a: Define a collection of low-copy-number unipaths, using iterative linking
3b: Construct two sets of read clouds:
primary (B): only reads whose true genomic locations are near the seed
secondary (C): all the short-fragment read pairs (~0.5 kb) near the seed partners
The problem of too many closures persists, hence the Short-Fragment Pair Merger is used (progressively merging the secondary read cloud pairs)
Figure labels: unipaths, paired-read links, C

Step 4: Finding All Paths
Compute the closures (including false closures) of all the merged short-fragment pairs
Step 5: Gluing Together the Local Assembly
A sequence graph is formed by iteratively joining closures
Step 6: Building the Global Assembly
The outputs of the local assemblies are glued together to yield a single sequence graph
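One way to picture Step 4 is as a bounded path search: a closure of a read pair is a path through the graph from the node holding one read to the node holding its partner, whose total length is consistent with the fragment size. The sketch below is an illustration under assumptions, not the ALLPATHS code: the graph is a plain adjacency mapping, every node contributes a positive number of bases given in `lengths`, the total counts the start and end nodes in full, and the name `find_closures` is hypothetical.

```python
def find_closures(graph, lengths, start, end, min_len, max_len):
    """Enumerate all paths from `start` to `end` with an allowed length.

    `graph` maps a node to a list of successor nodes.
    `lengths` maps a node to the (positive) number of bases it adds.
    A path is reported when it ends at `end` and its total length lies
    in [min_len, max_len]. The search stops extending a path as soon as
    its length exceeds `max_len`, so it terminates even on cyclic graphs.
    """
    closures = []

    def extend(path, total):
        if total > max_len:
            return
        node = path[-1]
        if node == end and total >= min_len:
            closures.append(list(path))
        for nxt in graph.get(node, []):
            extend(path + [nxt], total + lengths[nxt])

    extend([start], lengths[start])
    return closures


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
lengths = {"A": 100, "B": 50, "C": 300, "D": 100}
closures = find_closures(graph, lengths, "A", "D", min_len=200, max_len=300)
```

In this toy graph only the path through B fits the allowed length window; widening the window to 600 bases also admits the path through C. This is the sense in which a loosely constrained fragment size leads to many closures, some of them false.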

Step 7: Editing the Assembly
Remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other

Experiments
• Simulated data: 10 reference genomes from bacteria and fungi, plus one 10-Mb segment of the human genome, with introduced errors
• Real data: Solexa

Results
Simulated Data
• Highly complete and contiguous assemblies (proportion of genome covered > 96%)
• Fewer than 20 ambiguous regions per megabase
• Assemblies of C. jejuni and E. coli have no errors
• Very high accuracy: less than one error per 10^6 bases
Real Data
• High coverage (99.1%)
• High continuity
• High accuracy (the final assembly matches the reference sequence exactly, with only 12 exceptions)

+ / -
+ Read localization
+ Multi-CPU compatible
+ Extremely good (accurate) results
- Slow
- Very memory intensive
- Impractical assumptions on input data (500 bp +/- 5 bp insert size)

Thank you
