This software project aims to develop a mini assembler, built on SeqAn (the C++ Sequence Analysis Library), that constructs large scaffolds for a whole genome shotgun assembly.
Mini Assembler
Software Project B: SeqAn
David Weese and Prof. Knut Reinert
SeqAn – The C++ Sequence Analysis Library
• Sequences: strings, structured sequences, gapped sequences, alterators
• Alignments: alignment data structures, dynamic programming, alignment heuristics, multidimensional chaining
• Indices: q-gram hashes, (enhanced) suffix array, suffix trees, lazy indices, compressed indices
• Searching: exact/approximate, search heuristics, filtering, motif search
• Graphs: (structural) alignment graphs, word graphs, probabilistic automata, trees
• Probabilistic: profiles, weight matrices, HMM, SCFG, p-value computations, …
• Algorithms: FASTA, gQUASAR, SWIFT, …, MEME, PROJECTION, …
• Biologicals: alphabets, scoring schemes, file formats, base pair probabilities
• Integration: using external tools, STL, LEDA and Boost graphs, friend libraries (LISA)
• Miscellaneous: allocators, OS access and support, helper data structures and algorithms
Shotgun DNA Sequencing (Technology)
• DNA target sample
• SHEAR
• SIZE SELECT (LIBRARY)
• LIGATE & CLONE (vector)
• SEQUENCE (primer)
• Result: reads (mate pairs)
Shotgun DNA Sequencing
• Unknown "target" DNA sequence
• Randomly sample ("shotgun") fragments: avg. length 550, avg. error 1-2%
• Fragments:
- unknown orientation
- sequencing errors
- incomplete coverage
- repeats
• Figure labels: layout, consensus, scaffold (ordered contigs)
Project Mini Assembler
• Input: reads (mate pairs), generated by a simulator
• Goal: construct large scaffolds to obtain a good assembly (N50 measure)
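The N50 measure named in the goal can be sketched as follows. This is a minimal illustration using the common definition (the largest length L such that all sequences of length at least L together cover at least half of the total length). The function name `computeN50` is hypothetical and not part of SeqAn.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: N50 of a set of contig or scaffold lengths.
// Returns 0 for an empty input.
std::size_t computeN50(std::vector<std::size_t> lengths)
{
    if (lengths.empty())
        return 0;

    // Visit the longest sequences first.
    std::sort(lengths.begin(), lengths.end(), std::greater<std::size_t>());
    const std::size_t total =
        std::accumulate(lengths.begin(), lengths.end(), std::size_t(0));

    std::size_t running = 0;
    for (std::size_t len : lengths)
    {
        running += len;
        if (2 * running >= total)   // reached half of the total length
            return len;
    }
    assert(false && "unreachable: the full sum always covers half the total");
    return 0;
}
```

A larger N50 means that more of the assembly sits in long pieces, which is why it serves as the quality target here.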
Project Mini Assembler: Whole Genome Shotgun Assembly Pipeline
Screener → Overlapper → Unitigger → Scaffolder → RepeatRez → Consensus
Project Mini Assembler: Screener
• Mask known and "de novo" repeats
• Task: Build a repeat screener to help the Overlapper
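One simple way to approximate "de novo" repeat screening is to mask k-mers that occur unusually often across the reads. The sketch below is an assumption about how such a screener could look, not the method prescribed by the project. The name `maskFrequentKmers` and the parameters `k` and `maxCount` are hypothetical, and reverse complements are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical screener: every base covered by a k-mer that occurs more than
// maxCount times over all reads is replaced by 'N'.
std::vector<std::string> maskFrequentKmers(const std::vector<std::string>& reads,
                                           std::size_t k, std::size_t maxCount)
{
    assert(k > 0);

    // Pass 1: count every k-mer over all reads.
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& read : reads)
        for (std::size_t i = 0; i + k <= read.size(); ++i)
            ++counts[read.substr(i, k)];

    // Pass 2: mask the positions of over-represented k-mers.
    // Lookups use the unmasked input, so earlier masking does not affect them.
    std::vector<std::string> masked = reads;
    for (std::size_t r = 0; r < reads.size(); ++r)
        for (std::size_t i = 0; i + k <= reads[r].size(); ++i)
            if (counts[reads[r].substr(i, k)] > maxCount)
                masked[r].replace(i, k, k, 'N');

    return masked;
}
```

In practice the threshold would have to be chosen relative to the expected read coverage, since every genomic k-mer appears roughly coverage-many times even outside repeats.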
Project Mini Assembler: Overlapper
• Find overlaps between reads (figure: two overlapping reads A and B)
• Task: Construct an overlapper module to compute overlaps with affine gap costs.
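An overlap score with affine gap costs can be computed with a Gotoh-style dynamic program in which skipping a prefix of read A and a suffix of read B is free. The sketch below uses plain standard C++ rather than SeqAn's alignment functions. The names `AffineScore` and `overlapScore` are hypothetical, it considers only one orientation, and it uses quadratic memory for clarity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical scoring scheme: a gap of length L costs gapOpen + (L - 1) * gapExtend.
struct AffineScore { int match; int mismatch; int gapOpen; int gapExtend; };

// Best score of aligning a suffix of `a` to a prefix of `b`; 0 if no positive overlap.
int overlapScore(const std::string& a, const std::string& b, const AffineScore& s)
{
    assert(s.gapOpen <= 0 && s.gapExtend <= 0);
    const std::size_t n = a.size(), m = b.size();
    const int NEG = std::numeric_limits<int>::min() / 2;  // "minus infinity", overflow-safe

    // M: last column is a (mis)match, X: a[i-1] against a gap, Y: b[j-1] against a gap.
    using Matrix = std::vector<std::vector<int>>;
    Matrix M(n + 1, std::vector<int>(m + 1, NEG));
    Matrix X = M, Y = M;

    for (std::size_t i = 0; i <= n; ++i)
        M[i][0] = 0;                                      // skipping a prefix of a is free
    for (std::size_t j = 1; j <= m; ++j)                  // b has to start at its first base
        Y[0][j] = s.gapOpen + static_cast<int>(j - 1) * s.gapExtend;

    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
        {
            const int sub = (a[i - 1] == b[j - 1]) ? s.match : s.mismatch;
            M[i][j] = std::max({M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]}) + sub;
            X[i][j] = std::max({M[i - 1][j] + s.gapOpen, X[i - 1][j] + s.gapExtend,
                                Y[i - 1][j] + s.gapOpen});
            Y[i][j] = std::max({M[i][j - 1] + s.gapOpen, Y[i][j - 1] + s.gapExtend,
                                X[i][j - 1] + s.gapOpen});
        }

    // The overlap has to reach the end of a; the rest of b is free.
    int best = 0;
    for (std::size_t j = 1; j <= m; ++j)
        best = std::max({best, M[n][j], X[n][j], Y[n][j]});
    return best;
}
```

A full overlapper would also try the reverse complement of each read and would use a filter to avoid comparing all read pairs.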
Project Mini Assembler: Unitigger
• Compute consistent sub-assemblies (unitigs)
• Task:
- Construct overlap graph
- Construct a spanning tree based layout
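A spanning-tree-based layout can start from a maximum-weight spanning forest of the overlap graph, where reads are vertices and overlaps are weighted edges. The sketch below uses Kruskal's algorithm with a union-find structure. Weighting edges by overlap score is an assumption, and the names `Overlap` and `spanningForest` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical edge of the overlap graph: two read indices and an overlap score.
struct Overlap { std::size_t readA; std::size_t readB; int score; };

// Kruskal's algorithm: keeps the highest-scoring overlaps that do not close a cycle.
std::vector<Overlap> spanningForest(std::size_t numReads, std::vector<Overlap> overlaps)
{
    // Union-find over read indices with path halving.
    std::vector<std::size_t> parent(numReads);
    std::iota(parent.begin(), parent.end(), std::size_t(0));
    auto find = [&parent](std::size_t x) {
        while (parent[x] != x)
        {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    };

    // Best overlaps first.
    std::sort(overlaps.begin(), overlaps.end(),
              [](const Overlap& l, const Overlap& r) { return l.score > r.score; });

    std::vector<Overlap> tree;
    for (const Overlap& o : overlaps)
    {
        assert(o.readA < numReads && o.readB < numReads);
        const std::size_t ra = find(o.readA), rb = find(o.readB);
        if (ra != rb)               // joins two components, so no cycle
        {
            parent[ra] = rb;
            tree.push_back(o);
        }
    }
    return tree;
}
```

Each tree edge fixes the relative placement of two reads, so walking the tree from any root yields a layout for every connected component.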
Project Mini Assembler: Scaffolder
• Build scaffolds (figure: unitigs linked by mated fragments into a scaffold)
• Task:
- Construct mate pair contig graph
- Construct scaffold with greedy approach
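A greedy scaffolder can be sketched on a graph whose vertices are contigs and whose edges count the mate pairs connecting two contigs. Under the simplifying assumption that orientation and mate distances are ignored, the sketch accepts links in order of decreasing support as long as each contig keeps at most two neighbours and no cycle is formed. The names `MateLink`, `greedyScaffold` and `minSupport` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical edge of the mate pair contig graph.
struct MateLink { std::size_t contigA; std::size_t contigB; unsigned support; };

// Greedily chains contigs into linear scaffolds; returns the accepted links.
std::vector<MateLink> greedyScaffold(std::size_t numContigs, std::vector<MateLink> links,
                                     unsigned minSupport)
{
    std::vector<std::size_t> component(numContigs);
    std::iota(component.begin(), component.end(), std::size_t(0));
    auto find = [&component](std::size_t x) {
        while (component[x] != x)
        {
            component[x] = component[component[x]];
            x = component[x];
        }
        return x;
    };
    std::vector<unsigned> degree(numContigs, 0);   // neighbours of a contig within its scaffold

    // Best-supported links first.
    std::sort(links.begin(), links.end(),
              [](const MateLink& l, const MateLink& r) { return l.support > r.support; });

    std::vector<MateLink> accepted;
    for (const MateLink& link : links)
    {
        if (link.support < minSupport)
            break;                                 // all remaining links are weaker
        assert(link.contigA < numContigs && link.contigB < numContigs);
        if (degree[link.contigA] >= 2 || degree[link.contigB] >= 2)
            continue;                              // a scaffold is a linear chain
        const std::size_t ra = find(link.contigA), rb = find(link.contigB);
        if (ra == rb)
            continue;                              // would close a cycle
        component[ra] = rb;
        ++degree[link.contigA];
        ++degree[link.contigB];
        accepted.push_back(link);
    }
    return accepted;
}
```

Requiring a minimum support of two or more mate pairs per link is a common guard against chimeric or misplaced mates.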
Project Mini Assembler: RepeatRez
• Fill gaps (figure: "stones", placed via an "anchored" mate and a confirming overlap path, or o-path)
• Task: Extend contigs using mates and overlaps
Project Mini Assembler: Consensus
• Compute consensus sequence and N50 score
• Example reads (from the figure):
- AACGTGCATTAACTCGCGCAACTG
- TTCGGGTTGAACGTGCATTAACTCG
- TTCGGGTTGAACGTGTATTAAATCGCGCAACTG
- GGGTTGAACGTGCATTAAATCGCGCAACTG
• Example consensus: TTCGGGTTGAACGTGCATTAAATCGCGCAACTG
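Once the reads of a contig are laid out as rows of a multiple alignment, a simple consensus takes the most frequent symbol in each column. The sketch below assumes rows of equal length over A, C, G, T, with '-' for a gap and any other character (for example '.') where a read does not cover the column. The name `majorityConsensus` is hypothetical, and ties are broken in the order A, C, G, T, '-'.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical column-wise majority vote over an already computed read layout.
std::string majorityConsensus(const std::vector<std::string>& rows)
{
    if (rows.empty())
        return std::string();

    const std::string alphabet = "ACGT-";
    const std::size_t width = rows.front().size();
    std::string consensus;

    for (std::size_t col = 0; col < width; ++col)
    {
        std::array<std::size_t, 5> counts{};            // one counter per alphabet symbol
        for (const std::string& row : rows)
        {
            assert(row.size() == width);
            const std::size_t symbol = alphabet.find(row[col]);
            if (symbol != std::string::npos)            // skip uncovered positions
                ++counts[symbol];
        }
        // max_element returns the first maximum, which defines the tie-breaking order.
        const std::size_t best = static_cast<std::size_t>(
            std::distance(counts.begin(), std::max_element(counts.begin(), counts.end())));

        if (counts[best] == 0)
            consensus += 'N';                           // no read covers this column
        else if (alphabet[best] != '-')
            consensus += alphabet[best];                // gap-majority columns are dropped
    }
    return consensus;
}
```

With an average error rate of 1-2% per read, a majority vote over a few covering reads already corrects most isolated sequencing errors.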
Tasks
• Preferable requirements:
- C/C++ skills
- „C++ für Fortgeschrittene“ ("Advanced C++") block seminar (6.4. - 9.4.)
• Your work:
- Collaborate and develop a module interface
- Make use of SeqAn, the C++ Sequence Analysis Library (www.seqan.de)
- Implement your module in C++
- Document your code
- Present your results
• Material:
- Assembly lecture notes of "Alg. Bioinformatik", Reinert WS07/08
- Links and documents on the homepage
Schedule
• Seminar block 1 (2.4.)
- Introduction to principles of software design
- Tools (IDE, Debugger, Profiler, Bug Tracker, SVN, …)
• Seminar block 2 (14.4. - 17.4.)
- SeqAn tutorial (Sequences, Alignments, Graphs, Indices)
- Assign names to tasks
• Make and present your plan (until 27.4.)
- Prepare a presentation
- What are your data types and interfaces to other modules?
- Which algorithms do you want to use?
- What do you use from SeqAn, and what do you implement yourself?
• Start working
• Present your results, write a final report (until 8.6.)
End of Talk
Questions?
weese@inf.fu-berlin.de