PE-Assembler: De novo assembler using short paired-end reads • Pramila Nuwantha Ariyaratne
Outline • Method • Read screening • Seed building • Contig extension • Scaffolding • Gap filling • Result
Data sets used • Single-end reads • Paired-end reads • ReadLength (from 25 bp to 100 bp) • Insert size varies from MinSpan to MaxSpan • The information is mainly from these data sets.
Overview • The read screening step selects a set of reads as starting points. • The seed building step extends these reads using single-end reads until they are longer than MaxSpan. Successfully extended regions are called seeds. • The contig extension step tries to extend every seed using paired-end reads; the resulting sequences are called contigs.
Read screening • Get all k-mers from all the reads. • A k-mer that is expected to occur in the actual genome is called a ‘solid’ k-mer. • A k-mer that is expected to occur within a repeat region is called a ‘repeat’ k-mer. • Repeat region example: • ACTTTGACACACACACAC……ACACACACGTTGAG
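The first bullet above (collecting every k-mer from every read) can be sketched in a few lines of Python. The function names `kmers` and `count_kmers` are hypothetical and chosen only for illustration; the slides do not say how PE-Assembler stores its counts.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    print(kmers("ACCGTATA", 5))
    print(count_kmers(["ACACAC"], 2))
```

A short tandem repeat such as ACACAC yields the same k-mer many times, which is why k-mers from repeat regions tend to have unusually high counts.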
Read screening • A read is a solid read if: • The occurrence counts of all its k-mers lie within the two threshold cut-offs. • Example: • Two cut-offs [42, 120] from the previous graph. • k = 5 • Read: ACCGTATA • k-mers: ACCGT, CCGTA, CGTAT, GTATA • Counts: 100, 70, 90, 140 • Not a solid read, because 140 exceeds the upper cut-off of 120.
Read screening • Example: • Two cut-offs [42, 120] from the previous graph. • k = 5 • Read: ACCGTATG • k-mers: ACCGT, CCGTA, CGTAT, GTATG • Counts: 100, 70, 90, 70 • A solid read, because every count lies within [42, 120].
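The two worked examples can be checked with a small Python sketch. The k-mer counts are the ones given on the slides; the function name `is_solid_read` and the choice to treat both cut-offs as inclusive are assumptions made for illustration.

```python
def is_solid_read(read, k, counts, low, high):
    """True if every k-mer of the read has a count within [low, high].

    Assumption: the cut-offs are inclusive and a k-mer that was never
    counted has a count of 0. Reads shorter than k are rejected.
    """
    if len(read) < k:
        return False
    return all(
        low <= counts.get(read[i:i + k], 0) <= high
        for i in range(len(read) - k + 1)
    )


# k-mer counts taken from the two slide examples.
slide_counts = {"ACCGT": 100, "CCGTA": 70, "CGTAT": 90, "GTATA": 140, "GTATG": 70}

if __name__ == "__main__":
    print(is_solid_read("ACCGTATA", 5, slide_counts, 42, 120))  # False: GTATA has count 140
    print(is_solid_read("ACCGTATG", 5, slide_counts, 42, 120))  # True
```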
Seed Building • Try to extend the solid read using all overlapping reads.
Seed Building • Because of sequencing errors or small repeats, there may be multiple feasible candidates.
Seed Building • To resolve ambiguities due to sequencing errors, we extend every candidate base up to ReadLength. • If only one candidate path reaches the full distance ReadLength, that path is assumed to be the correct extension. • If no path or more than one path is found, try extending from the other side.
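The look-ahead rule on this slide can be sketched as a small recursive search. This is a minimal illustration, not the PE-Assembler implementation: the names `next_bases`, `surviving_paths` and `extend`, the `min_overlap` parameter, and the use of exact suffix/prefix matching are all assumptions.

```python
def next_bases(seq, reads, min_overlap):
    """Bases that at least one overlapping read proposes as the next base.

    A read overlaps if one of its prefixes (length >= min_overlap)
    exactly matches a suffix of seq; the read base after that prefix
    is a candidate.
    """
    bases = set()
    for read in reads:
        longest = min(len(seq), len(read) - 1)
        for ov in range(longest, min_overlap - 1, -1):
            if seq.endswith(read[:ov]):
                bases.add(read[ov])
    return bases


def surviving_paths(seq, reads, min_overlap, depth):
    """All read-supported extensions of exactly `depth` bases."""
    if depth == 0:
        return [""]
    paths = []
    for base in sorted(next_bases(seq, reads, min_overlap)):
        for tail in surviving_paths(seq + base, reads, min_overlap, depth - 1):
            paths.append(base + tail)
    return paths


def extend(seq, reads, min_overlap, read_length):
    """Extend seq by read_length bases if exactly one path survives, else None."""
    paths = surviving_paths(seq, reads, min_overlap, read_length)
    if len(paths) == 1:
        return seq + paths[0]
    return None  # no path or several paths: the caller would try the other side


# Toy data: all 5-mers of ACGGTCATTGCA plus one erroneous read (CGGTA).
toy_reads = ["ACGGT", "CGGTC", "GGTCA", "GTCAT", "TCATT",
             "CATTG", "ATTGC", "TTGCA", "CGGTA"]

if __name__ == "__main__":
    print(extend("ACGGT", toy_reads, 3, 5))
```

In the toy data the erroneous read proposes an `A` after `ACGGT`, but no other read supports continuing from it, so that branch dies after one base and only the correct path reaches the full look-ahead distance. The search is exponential in the worst case, which is acceptable for a sketch but not for real data.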
Seed Building • Finally, when the sequence reaches MaxSpan in length (it is then called a seed), it is verified. • Verification: at least one paired-end read must overlap this seed within the expected span [MinSpan, MaxSpan].
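A rough sketch of this verification step follows. It assumes several things the slide does not state: both mates are given in the same orientation as the seed (real paired-end data would normally need the second mate reverse-complemented), matching is exact, and the span is measured from the start of the first mate to the end of the second. The names `occurrences` and `seed_is_verified` are hypothetical.

```python
def occurrences(text, pattern):
    """Yield every start position of pattern in text (overlaps allowed)."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


def seed_is_verified(seed, read_pairs, min_span, max_span):
    """True if at least one read pair maps onto the seed with a valid span.

    Assumption: span = end of the second mate minus start of the first mate.
    """
    for first, second in read_pairs:
        for start in occurrences(seed, first):
            for pos in occurrences(seed, second):
                span = pos + len(second) - start
                if pos >= start and min_span <= span <= max_span:
                    return True
    return False


if __name__ == "__main__":
    seed = "ACGGTCATTGCA"
    pairs = [("ACGG", "TGCA")]
    print(seed_is_verified(seed, pairs, 10, 14))
```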
Contig Extension • This step aims to extend each verified seed into a longer contig using paired-end reads. • Multiple feasible candidates may arise for three reasons. • First, sequencing errors. • Second, short tandem repeats; these are handled in the gap filling step. • Third, long repeats, which are longer than MaxSpan.
Scaffolding • Find the correct ordering of the resulting set of contigs. • Gao Song is currently working on it.
Gap filling • The gap filling step assembles the gap region between two adjacent contigs to form a longer contig.
Simulated data results • Results are compared using: • Average length of all contigs. • N50 and N90 of contigs; bigger is better. • Coverage. • Large misassemblies: accuracy is much more important than the other metrics.
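For reference, N50 and N90 can be computed as below, using the common definition: the length of the contig at which the running total, taken from the longest contig downward, first reaches 50% (or 90%) of the total assembled length. The function name `nx` and the example lengths are illustrative only and are not results from the slides.

```python
def nx(lengths, percent):
    """Return the Nx statistic (e.g. percent=50 for N50) of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    contig_lengths = [80, 70, 50, 40, 30, 20]  # made-up example lengths
    print("N50:", nx(contig_lengths, 50))
    print("N90:", nx(contig_lengths, 90))
    print("Average:", sum(contig_lengths) / len(contig_lengths))
```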