
SeqAn Mini Assembler Software Project

This software project aims to develop a mini assembler using the SeqAn C++ Sequence Analysis Library to construct large scaffolds for a whole genome shotgun assembly.



Presentation Transcript


  1. Mini Assembler Software Project B: SeqAn. David Weese and Prof. Knut Reinert

  2. SeqAn – The C++ Sequence Analysis Library
     • Sequences: strings, structured sequences, gapped sequences, alterators
     • Alignments: alignment data structures, dynamic programming, alignment heuristics, multidimensional chaining
     • Indices: q-gram hashes, (enhanced) suffix array, suffix trees, lazy indices, compressed indices
     • Searching: exact/approximate, search heuristics, filtering, motif search
     • Graphs: (structural) alignment graphs, word graphs, probabilistic automata, trees
     • Probabilis.: profiles, weight matrices, HMM, SCFG, p-value computations, …
     • Algorithms: FASTA, gQUASAR, SWIFT, …, MEME, PROJECTION, …
     • Biologicals: alphabets, scoring schemes, file formats, base pair probabilities
     • Integration: using external tools, STL, LEDA and Boost graphs, friend libraries (LISA)
     • Miscellan.: allocators, OS access and support, helper data structures and algorithms

  3. Shotgun DNA Sequencing (Technology) [Figure: DNA target sample → SHEAR → SIZE SELECT (library) → LIGATE & CLONE (vector) → SEQUENCE (primer) → reads (mate pairs)]

  4. Shotgun DNA Sequencing
     • Unknown “target” DNA sequence
     • Randomly sample (“shotgun”) fragments: avg. length 550, avg. error 1-2%
     • Fragments have: unknown orientation, sequencing errors, incomplete coverage, repeats
     • Layout → Consensus → Scaffold (ordered contigs)

  5. Project Mini Assembler. Input: reads (mate pairs), generated by a simulator. Goal: construct large scaffolds to obtain a good assembly (N50 measure).
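The N50 measure named in the goal can be computed as in the following sketch. It uses the common definition (the largest length L such that all sequences of length at least L together cover at least half of the total length); the slides do not spell out the definition, and the function name `computeN50` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// N50 under the common definition: sort lengths in decreasing order and
// return the length at which the running sum first reaches half the total.
// Returns 0 for an empty input (a convention chosen for this sketch).
std::size_t computeN50(std::vector<std::size_t> lengths)
{
    if (lengths.empty())
        return 0;

    std::sort(lengths.begin(), lengths.end(), std::greater<std::size_t>());
    const std::size_t total =
        std::accumulate(lengths.begin(), lengths.end(), std::size_t{0});

    std::size_t running = 0;
    for (std::size_t len : lengths)
    {
        running += len;
        if (2 * running >= total)   // at least half of the total is covered
            return len;
    }
    return lengths.back();          // not reached for non-empty input
}
```

For scaffold lengths 80, 70, 50, 40, 30, 20 the total is 290; the two longest scaffolds already cover 150, so the function returns 70.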

  6. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline. Stages: Screener → Overlapper → Unitigger → Scaffolder → RepeatRez → Consensus

  7. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline, stage Screener. Mask known and “de novo” repeats. Task: Build a repeat screener to help the Overlapper.
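One simple way to approach "de novo" repeat screening is to mask k-mers that occur unusually often across all reads. The sketch below is an assumption about how such a screener could look, not the method prescribed by the slides; the function name `maskFrequentKmers` and the parameters `k` and `maxCount` are hypothetical, and reverse complements are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Count every k-mer over all reads, then replace with 'N' each position
// that is covered by a k-mer seen more than maxCount times.
std::vector<std::string> maskFrequentKmers(const std::vector<std::string>& reads,
                                           std::size_t k,
                                           std::size_t maxCount)
{
    assert(k > 0);

    // Pass 1: global k-mer counts.
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& r : reads)
        for (std::size_t i = 0; i + k <= r.size(); ++i)
            ++counts[r.substr(i, k)];

    // Pass 2: mask over-represented k-mers in a copy of the reads.
    std::vector<std::string> masked = reads;
    for (std::size_t n = 0; n < reads.size(); ++n)
    {
        const std::string& r = reads[n];
        for (std::size_t i = 0; i + k <= r.size(); ++i)
            if (counts[r.substr(i, k)] > maxCount)
                masked[n].replace(i, k, k, 'N');
    }
    return masked;
}
```

The Overlapper could then skip seeds that fall into masked regions, which is the sense in which the screener "helps" the next stage.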

  8. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline, stage Overlapper. Find overlaps between reads (A and B in the figure). Task: Construct an overlapper module to compute overlaps with affine gap costs.
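An overlap with affine gap costs can be computed with a Gotoh-style dynamic program in which a suffix of read A is aligned to a prefix of read B. The following is a minimal sketch in plain C++ rather than with SeqAn's alignment module; the names `Overlap` and `bestOverlap` and the default scores are assumptions. Here `gapOpen` is the score of the first gap character and `gapExtend` the score of each further one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

struct Overlap
{
    int score;              // best suffix(a)/prefix(b) alignment score
    std::size_t lengthInB;  // length of the aligned prefix of b
};

Overlap bestOverlap(const std::string& a, const std::string& b,
                    int match = 1, int mismatch = -1,
                    int gapOpen = -3, int gapExtend = -1)
{
    assert(gapOpen <= 0 && gapExtend <= 0);
    const int NEG = std::numeric_limits<int>::min() / 2;  // "minus infinity"
    const std::size_t n = a.size(), m = b.size();

    // M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap;
    // Y: b[j-1] aligned to a gap.
    std::vector<std::vector<int>> M(n + 1, std::vector<int>(m + 1, NEG));
    std::vector<std::vector<int>> X = M, Y = M;

    for (std::size_t i = 0; i <= n; ++i)
        M[i][0] = 0;  // skipping a prefix of a is free
    for (std::size_t j = 1; j <= m; ++j)
        Y[0][j] = gapOpen + static_cast<int>(j - 1) * gapExtend;

    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
        {
            const int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            M[i][j] = std::max({M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]}) + s;
            X[i][j] = std::max(std::max(M[i-1][j], Y[i-1][j]) + gapOpen,
                               X[i-1][j] + gapExtend);
            Y[i][j] = std::max(std::max(M[i][j-1], X[i][j-1]) + gapOpen,
                               Y[i][j-1] + gapExtend);
        }

    // The overlap must end at the end of a, but may stop anywhere in b.
    Overlap best{NEG, 0};
    for (std::size_t j = 0; j <= m; ++j)
    {
        const int s = std::max({M[n][j], X[n][j], Y[n][j]});
        if (s > best.score)
            best = Overlap{s, j};
    }
    return best;
}
```

With the default scores, `bestOverlap("AACGTGCATT", "GCATTAACTC")` finds the exact five-base overlap `GCATT`. The full matrices cost O(|A|·|B|) memory; a real module would likely restrict the computation to a band around a seed.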

  9. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline, stage Unitigger. Compute consistent sub-assemblies (unitigs). Task: Construct the overlap graph. Construct a spanning tree based layout.
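One plausible reading of a "spanning tree based layout" is to keep, for each connected group of reads, the strongest overlaps that connect them without forming a cycle. The sketch below builds such a maximum-weight spanning forest with Kruskal's algorithm and a union-find structure. The slides do not fix the algorithm, so treat this as an assumption; `OverlapEdge` and `maximumSpanningForest` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One edge of the overlap graph: two read indices and the overlap score.
struct OverlapEdge
{
    std::size_t readA;
    std::size_t readB;
    int score;
};

// Kruskal's algorithm on edges sorted by decreasing score.
std::vector<OverlapEdge> maximumSpanningForest(std::size_t readCount,
                                               std::vector<OverlapEdge> edges)
{
    // Union-find over reads, with path halving in find().
    std::vector<std::size_t> parent(readCount);
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    auto find = [&parent](std::size_t x) {
        while (parent[x] != x)
        {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    };

    std::stable_sort(edges.begin(), edges.end(),
                     [](const OverlapEdge& l, const OverlapEdge& r) {
                         return l.score > r.score;
                     });

    std::vector<OverlapEdge> tree;
    for (const OverlapEdge& e : edges)
    {
        assert(e.readA < readCount && e.readB < readCount);
        const std::size_t ra = find(e.readA);
        const std::size_t rb = find(e.readB);
        if (ra != rb)  // the edge joins two components, so no cycle arises
        {
            parent[ra] = rb;
            tree.push_back(e);
        }
    }
    return tree;
}
```

The selected edges fix the relative placement of neighbouring reads; turning them into read offsets and checking consistency of the remaining overlaps is left out of this sketch.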

  10. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline, stage Scaffolder. Build scaffolds (figure labels: unitigs, mated fragments, scaffold). Task: Construct the mate pair contig graph. Construct a scaffold with a greedy approach.
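A greedy scaffolder over a mate pair contig graph might look like the sketch below: mate pairs that link the same two contigs are bundled into one weighted edge, and edges are accepted in order of decreasing support as long as each contig keeps at most two neighbours and no cycle is closed. This is a simplification under stated assumptions (orientation and insert-size distances are ignored), and `MateLink`, `Join`, `greedyScaffold`, and `minSupport` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// One mate pair whose two reads were placed in the given contigs.
struct MateLink { std::size_t contigA; std::size_t contigB; };

// An accepted scaffold edge and the number of mate pairs supporting it.
struct Join { std::size_t contigA; std::size_t contigB; std::size_t support; };

std::vector<Join> greedyScaffold(std::size_t contigCount,
                                 const std::vector<MateLink>& links,
                                 std::size_t minSupport = 2)
{
    // Contig graph: bundle mate pairs per unordered contig pair.
    std::map<std::pair<std::size_t, std::size_t>, std::size_t> support;
    for (const MateLink& l : links)
    {
        assert(l.contigA < contigCount && l.contigB < contigCount);
        if (l.contigA != l.contigB)
            ++support[std::make_pair(std::min(l.contigA, l.contigB),
                                     std::max(l.contigA, l.contigB))];
    }

    std::vector<Join> candidates;
    for (const auto& kv : support)
        if (kv.second >= minSupport)
            candidates.push_back({kv.first.first, kv.first.second, kv.second});
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Join& l, const Join& r) {
                         return l.support > r.support;
                     });

    std::vector<std::size_t> degree(contigCount, 0), parent(contigCount);
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    auto find = [&parent](std::size_t x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    };

    std::vector<Join> accepted;
    for (const Join& j : candidates)
    {
        if (degree[j.contigA] >= 2 || degree[j.contigB] >= 2)
            continue;  // a contig has only two ends
        const std::size_t ra = find(j.contigA), rb = find(j.contigB);
        if (ra == rb)
            continue;  // would close a cycle within one scaffold
        parent[ra] = rb;
        ++degree[j.contigA];
        ++degree[j.contigB];
        accepted.push_back(j);
    }
    return accepted;
}
```

Each connected chain of accepted joins corresponds to one scaffold; requiring at least two supporting mate pairs is a common guard against chimeric links, chosen here as a default rather than taken from the slides.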

  11. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline, stage RepeatRez. Fill gaps (figure labels: stones, “anchored” mate and confirming overlap path). Task: Extend contigs using mates and overlaps.

  12. Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline, stage Consensus. Compute consensus sequence and N50 score.
      Reads:
        AACGTGCATTAACTCGCGCAACTG
        TTCGGGTTGAACGTGCATTAACTCG
        TTCGGGTTGAACGTGTATTAAATCGCGCAACTG
        GGGTTGAACGTGCATTAAATCGCGCAACTG
      Consensus:
        TTCGGGTTGAACGTGCATTAAATCGCGCAACTG
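A consensus can be sketched as a per-column majority vote over reads that have already been placed in a gap-free layout. The code below assumes each read comes with a known start offset; the type `PlacedRead`, the function `majorityConsensus`, and the tie-breaking rule (first base in the order A, C, G, T) are assumptions made for this sketch, and gaps inside reads are not handled.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A read together with its start column in the layout.
struct PlacedRead
{
    std::string bases;
    std::size_t offset;
};

// Majority vote per column; uncovered columns are reported as 'N'.
std::string majorityConsensus(const std::vector<PlacedRead>& layout)
{
    const std::string alphabet = "ACGT";

    std::size_t length = 0;
    for (const PlacedRead& r : layout)
        length = std::max(length, r.offset + r.bases.size());

    // votes[column][base index] counts how many reads show that base.
    std::vector<std::array<std::size_t, 4>> votes(
        length, std::array<std::size_t, 4>{});
    for (const PlacedRead& r : layout)
        for (std::size_t i = 0; i < r.bases.size(); ++i)
        {
            const std::size_t base = alphabet.find(r.bases[i]);
            if (base != std::string::npos)  // ignore N and other symbols
                ++votes[r.offset + i][base];
        }

    std::string consensus(length, 'N');
    for (std::size_t col = 0; col < length; ++col)
    {
        std::size_t best = 0;
        for (std::size_t b = 1; b < 4; ++b)
            if (votes[col][b] > votes[col][best])
                best = b;
        if (votes[col][best] > 0)
            consensus[col] = alphabet[best];
    }
    return consensus;
}
```

For the hypothetical layout `ACGTAC` at offset 0, `GTTCGG` and `GTACGG` at offset 2, and `ACGG` at offset 4, the single disagreeing base is outvoted three to one and the result is `ACGTACGG`.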

  13. Names to Tasks

  14. Tasks
      • Preferred requirements:
        - C/C++ skills
        - „C++ für Fortgeschrittene“ (“Advanced C++”) block seminar (6.4. - 9.4.)
      • Your work:
        - Collaborate and develop a module interface
        - Make use of SeqAn, the C++ Sequence Analysis Library (www.seqan.de)
        - Implement your module in C++
        - Document your code
        - Present your results
      • Material:
        - Assembly lecture notes of „Alg. Bioinformatik“, Reinert WS07/08
        - Links and documents on the homepage

  15. Schedule
      • Seminar block 1 (2.4.)
        - Introduction to principles of software design
        - Tools (IDE, Debugger, Profiler, Bug Tracker, SVN, …)
      • Seminar block 2 (14.4. - 17.4.)
        - SeqAn tutorial (Sequences, Alignments, Graphs, Indices)
        - Assign names to tasks
      • Make and present your plan (until 27.4.)
        - Prepare a presentation
        - What are your data types and interfaces to other modules?
        - What algorithms do you want to use?
        - What do you use from SeqAn, and what do you implement yourself?
      • Start working
      • Present your results, write a final report (until 8.6.)

  16. End of Talk. Questions? weese@inf.fu-berlin.de
