ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads. Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. Presented by: Mohit Jain. Introduction. Overview. Algorithm. Results.

Presentation Transcript


  1. ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads. Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. Presented by: Mohit Jain

  2. Motivation. (P) DNA sequencing: chromosome length ranges from ~1,000 to ~250,000,000 bp, but the longest sequenceable fragment is only ~600 bp. (S) Shotgun method: determine the sequence by breaking the genome into many small segments (reads). (P) Sequence/genome assembly: combining these reads to reconstruct the genome.

  3. Motivation. (S) Model the original genome as the Shortest Common Superstring (SCS): the shortest sequence that contains every read as a substring. (P) Repeats: genomes have repeats, and the SCS represents each repeat only once. (S) Overlap graph.
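The repeat problem on this slide can be seen on a toy example. The sketch below is a brute-force illustration only, not anything from ALLPATHS: the function names and the toy genome are assumptions made here, and the search over all read orderings is only feasible for a handful of reads.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_common_superstring(reads):
    """Brute-force SCS: try every read order, merging with maximal overlap.

    Assumes no read is a substring of another read (true for equal-length,
    distinct reads). Exponential time, so toy inputs only.
    """
    reads = sorted(set(reads))
    best = None
    for order in permutations(reads):
        merged = order[0]
        for read in order[1:]:
            merged = merged + read[overlap(merged, read):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


# Hypothetical genome with a tandem repeat of "ACG", read as all 4-mers.
genome = "ACGACGACGT"
reads = [genome[i:i + 4] for i in range(len(genome) - 3)]
scs = shortest_common_superstring(reads)
print(len(genome), len(scs), scs)
```

Every read is a substring of the result, yet the result is shorter than the genome the reads came from, because the repeated copies are collapsed.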

  4. (S) Overlap Graph. Each read forms a node; an edge exists between two nodes if the reads overlap. Algorithm: Step 1: remove redundant edges and classify the remaining edges as required or optional. Step 2: find the shortest walk that includes all required edges. (Figure legend: red/thin edges are false overlaps.)
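A minimal sketch of the node-and-edge definition above, assuming an all-pairs suffix/prefix comparison and a hypothetical `min_overlap` threshold (neither is specified on the slide). It covers graph construction only, not the edge classification or the walk search.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return {(read_a, read_b): overlap_length} for every ordered pair of
    distinct reads whose suffix/prefix overlap is at least min_overlap."""
    edges = {}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            n = overlap(a, b)
            if n >= min_overlap:
                edges[(a, b)] = n
    return edges


reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # toy reads, chosen for illustration
strict = build_overlap_graph(reads, min_overlap=3)
loose = build_overlap_graph(reads, min_overlap=2)
print(len(strict), len(loose))
```

Lowering the threshold admits an extra edge from the first read to the third, which hints at why short reads, which force short overlaps, inflate the edge count.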

  5. (P) Overlap Graph. Microreads are only 25-50 bases long (for HTS). Shorter reads mean shorter overlaps, so more reads are needed and more overlaps arise. The result is a very large number of (mostly false) overlaps: large number of reads + short overlaps + higher error rate.

  6. (S) De Bruijn Graph • To construct the de Bruijn graph, all reads are broken into overlapping subsequences of length k (k-mers) • Each subsequence of length k-1 represents a node • A directed edge e exists between two nodes a and b iff there exists a k-mer whose prefix is a and whose suffix is b
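The construction rule above translates almost directly into code. This is a minimal sketch with a hypothetical function name; it ignores reverse complements and sequencing errors, both of which a real assembler has to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to.

    Every k-mer found in a read contributes one edge:
    prefix (first k-1 bases) -> suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because edges are stored in sets, a k-mer seen in several reads still yields a single edge, so the graph size depends on the number of distinct k-mers rather than on the number of reads.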

  7. (S) De Bruijn Graph • Condensed by collapsing non-ambiguous paths • Genome: an Eulerian path (superwalk: a walk including all edges) in this graph
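Collapsing non-ambiguous paths can be sketched as follows. This is a generic illustration of condensing a de Bruijn graph, not the ALLPATHS implementation; the function names are assumptions, and isolated cycles (components with no branch point) are skipped for brevity.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """(k-1)-mer nodes, one edge per distinct k-mer (prefix -> suffix)."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def collapse_unbranched(graph):
    """Return the sequences spelled by maximal unbranched paths."""
    indeg = defaultdict(int)
    nodes = set(graph)
    for src, targets in graph.items():
        for dst in targets:
            indeg[dst] += 1
            nodes.add(dst)

    def is_simple(v):
        # A node is interior to a path only with exactly one way in and out.
        return indeg[v] == 1 and len(graph.get(v, ())) == 1

    sequences = []
    for v in nodes:
        if is_simple(v):
            continue  # paths start only at branch points or path ends
        for w in graph.get(v, ()):
            path = [v, w]
            while is_simple(w):
                w = next(iter(graph[w]))
                path.append(w)
            # Consecutive nodes overlap by all but one base.
            sequences.append(path[0] + "".join(n[-1] for n in path[1:]))
    return sorted(sequences)


paths = collapse_unbranched(build_de_bruijn(["ACGTT", "ACGTA"], k=3))
print(paths)
```

In the toy input the two reads agree up to `ACGT` and then diverge, so the shared prefix becomes one collapsed path and each continuation becomes its own short path.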

  8. Paired Reads (Mate pairs) • Sequence the two ends of a fragment of known size • Results in better assemblies, but is more complicated

  9. Current Approaches • Velvet • EULER-USR • ALLPATHS (Velvet and EULER-USR are based on the de Bruijn graph method)

  10. ALLPATHS. Step I: Builds the unipath graph. Step II: Localizes read sequences before assembly. Unipath: maximal unbranched sequence.

  11. Read Localization

  12. Short Fragment Pair Merger • Fills the gap between two paired reads • Builds a local unipath graph • Extends both ends (of all reads) based on the local unipath graph • For each pair, searches for other pairs that overlap on both ends, and merges them to obtain longer reads

  13. Short Fragment Pair Merger • Repeat the process for all pairs • Once the sequence is complete, update the local unipath graph • Iteratively merge local unipath graphs to obtain a global unipath graph, representing the genome

  14. ALLPATHS Paired-Read Assembly Algorithm. Step 1: Creating approximate unipaths.
      1a: Error correction.
      1b: k-mer numbering and a searchable data structure (ignoring any overlaps between reads).
      1c: Computing unipaths from the data structure by walking along the reads until a branch is encountered.
      Read pairs → Unipaths → Localize

  15. Step 2: Selecting seeds. Seeds = unipaths around which assemblies of genomic regions are built. Ideal seed: a long unipath with low copy number (= 1). Copy number is inferred from the read coverage of the unipaths.
      2a: For each unipath, compute the closest unipaths in the set that are to the left and to the right of the given unipath.
      2b: If the distance between the left and right neighbours is less than 4 kb, the middle unipath is removed.
      2c: After all such unipaths are removed, the remaining unipaths form the seeds.
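Steps 2a to 2c can be sketched under some strong simplifying assumptions that are not from the slides: each unipath is given a known coordinate `start` (a de novo assembler has no such coordinates and must infer relative distances, for example from read-pair links), copy number is the coverage divided by a known single-copy coverage and rounded, and "distance between neighbours" is taken as the difference of their start positions. All names and the `min_length` cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unipath:
    name: str
    start: int       # assumed known position, for illustration only
    length: int
    coverage: float  # mean read coverage over the unipath


def estimate_copy_number(coverage, single_copy_coverage):
    """Infer copy number from coverage (simple ratio, at least 1)."""
    return max(1, round(coverage / single_copy_coverage))


def select_seeds(unipaths, single_copy_coverage, min_length=100, max_gap=4000):
    # Candidates: long enough and estimated to occur once in the genome.
    seeds = sorted(
        (u for u in unipaths
         if u.length >= min_length
         and estimate_copy_number(u.coverage, single_copy_coverage) == 1),
        key=lambda u: u.start,
    )
    # Repeatedly drop a middle unipath whose neighbours are closer than max_gap.
    removed = True
    while removed:
        removed = False
        for i in range(1, len(seeds) - 1):
            if seeds[i + 1].start - seeds[i - 1].start < max_gap:
                del seeds[i]
                removed = True
                break
    return seeds


unipaths = [
    Unipath("u1", 0, 500, 30), Unipath("u2", 1000, 500, 31),
    Unipath("u3", 2000, 500, 29), Unipath("u4", 3000, 500, 62),
    Unipath("u5", 3500, 500, 30), Unipath("u6", 10000, 500, 30),
]
seeds = select_seeds(unipaths, single_copy_coverage=30)
print([u.name for u in seeds])
```

The thinning keeps seeds spread out, so that neighbourhoods built around them overlap without being assembled many times over.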

  16. Step 3: Assembling neighbourhoods around the seeds. Neighbourhood = seed + 10 kb on each side.
      3a: Define a collection of low-copy-number unipaths, using iterative linking.
      3b: Construct two sets of read clouds: primary (B), containing only reads whose true genomic locations are near the seed; secondary (C), containing all the short-fragment read pairs (~0.5 kb) near the seed partners.
      The problem of too many closures persists, hence the Short Fragment Pair Merger is used to progressively merge the secondary read cloud pairs.

  17. Step 4: Finding all paths. Compute the closures (including false closures) of all the merged short-fragment pairs.
      Step 5: Gluing together the local assembly. A sequence graph is formed by iteratively joining closures.
      Step 6: Building the global assembly. The outputs of the local assemblies are glued together to yield a single sequence graph.
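One way to picture Step 4 is as an enumeration of every path through a unipath graph that connects the two reads of a pair and has a total length consistent with the fragment size. The sketch below is an illustration under assumptions made here, not the ALLPATHS code: the graph, the per-unipath lengths, and the length window are hypothetical, and the overlap of k-1 bases between adjacent unipaths is ignored when summing lengths.

```python
def find_closures(graph, length, start, end, min_len, max_len):
    """Enumerate all paths from start to end whose summed unipath length
    lies in [min_len, max_len].

    graph:  {unipath: [successor unipaths]}
    length: {unipath: length in bases} (all lengths positive, so the
            max_len bound guarantees termination even on cyclic graphs)
    """
    closures = []

    def dfs(node, path, total):
        if total > max_len:
            return  # too long for the fragment, prune this branch
        if node == end and total >= min_len:
            closures.append(list(path))
        for nxt in graph.get(node, ()):
            path.append(nxt)
            dfs(nxt, path, total + length[nxt])
            path.pop()

    dfs(start, [start], length[start])
    return closures


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
length = {"A": 100, "B": 50, "C": 300, "D": 100}
tight = find_closures(graph, length, "A", "D", min_len=200, max_len=300)
wide = find_closures(graph, length, "A", "D", min_len=200, max_len=600)
print(tight, wide)
```

A tight length window leaves a single closure, while a wide one admits both routes, which is one way to see why well-constrained fragment sizes keep the number of closures manageable.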

  18. Step 7: Editing the assembly. The aim is to remove detritus, eliminate ambiguity, and pull apart regions where repeats are assembled on top of each other.

  19. Experiments • Simulated data: 10 reference genomes from bacteria and fungi, and one 10-Mb segment of the human genome, with introduced errors • Real data: Solexa

  20. Results.
      Simulated data • Highly complete and contiguous assemblies (proportion of genome covered > 96%) • Fewer than 20 ambiguous regions per megabase • Assemblies of C. jejuni and E. coli have no errors • Very high accuracy: less than one error per 10^6 bases
      Real data • High coverage (99.1%) • High continuity • High accuracy (the final assembly matches the reference sequence exactly, with only 12 exceptions)

  21. + / -
      + Read localization
      + Multi-CPU compatible
      + Extremely good (accurate) results
      - Slow
      - Very memory intensive
      - Impractical assumptions on input data (500 bp +/- 5 bp insert size)

  22. Thank you