280 likes | 606 Views
De-novo Assembly. Day 4. The Challenge. Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome Issues : Coverage Errors in reads Reads vary from very short (35bp) to quite long (800bp), and are double-stranded Non-uniqueness of solution
E N D
De-novo Assembly Day 4
The Challenge • Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome • Issues: • Coverage • Errors in reads • Reads vary from very short (35bp) to quite long (800bp), and are double-stranded • Non-uniqueness of solution • Running time and memory
In an ideal world… • Reads have no error • Reads are long enough and each appears once • Each read given in the same orientation (all 5’ to 3’, for example) • Contiguous reads have enough overlap so that they can easily be assembled
Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp Not all sequencing technologies produce mate-pairs. Different error models Different read lengths 10,000bp
Terminology • Reads are what you start with (35bp-800bp) • Fragmented assemblies produce contigs • Contigs can be put together into scaffolds
Reference-based vsde novoassembly • Comparative assembly means you have a “reference” genome • Same OR related species • Can get weird in bacteria and viruses due to recombination • De novo assembly means you do everything from scratch
Comparative Assembly Much easier than de novo! Basic idea: • Take the reads and map them onto the reference (allowing for small mismatches) • Collect all overlapping reads, produce a multiple sequence alignment, and produce consensus sequence
Comparative Assembly • Fast • Short reads can map to several places (especially if they have errors) • Needs close reference genome • Repeats are problematic • Can be highly accurate even when reads have errors
De Novo assembly • Much easier to do with long reads • Need very good coverage • Generally produces fragmented assemblies • Necessarywhen you don’t have a closely related (and correctly assembled) reference genome
Conceptual steps in de novo assembly 1. Find reads that overlap by a specified number of bases (the k-mer size) 2. Merge overlapping, “good” reads into longer contigs 3. Link contigs to form scaffolds using paired-end information Diagrams from Serafim Batzoglou, Stanford
De Novo Assembly paradigms Overlap-layout-consensus methods k-mer graph (especially useful for assembly from short reads) aka de Bruijn graph 11
de Bruijn graph hask-mers as nodes connected by reads;assembly involves finding Eulerian path through graph Diagram from Michael Schatz, Cold Spring Harbor
de Bruijngraph • Vertices are k-mers that appear in some read, and edges defined by overlap of k-1 nucleotides • Small values of k produce small graphs • Does not require all-pairs overlap calculation! • But: loss of information about reads can lead to “chimeric”contigs, and incorrect assemblies • Also produces fragmented assemblies (even shorter contigs)
de BruijnGraph use • Create the de Bruijn graph for the following string, using k=5 • ACATAGGATTCAC • Find the Eulerian path • Is the Eulerian path unique? • Reconstruct the sequence from this path
But… • Because of • Errors in reads • Repeats • Insufficient coverage de Bruijn graphs generally don’t Eulerianpaths/circuits • This means the first step doesn’t completely assemble the genome
Velvet and SOAPdenovo are two leading assemblers • Velvet is from EMBL-EBI • http://www.ebi.ac.uk/~zerbino/velvet • SOAPdenovois from BGI • http://soap.genomics.org.cn/soapdenovo.html • k-mer size is adjustable parameter • Typically it is adjusted to maximize N50 length of scaffolds or contigs • N50 length is central measure of distribution weighted by lengths
NOTE: QC step slightly different • You still do standard checks with something like FASTQC • Read Trimming is different • Algorithms/tools can deal with different read lengths • Finding overlaps give longer contigs • So we never want to sacrifice good reads • Solution: • remove sequencing adapters • trim individual reads as needed • http://www.usadellab.org/cms/index.php?page=trimmomatic
Three useful measures for optimization of k-mer length • Meanlength: the usual average of the lengths • Median length: the length for which half the sequences are shorter & half are longer • N50 length: the length that splits the total bases in half, after the lengths are ordered • Example values for distribution of contig lengths: • Mean length: 627 • Median length: 200 • N50 length: 1,718 • We’ll look at using N50 in in the practical
Practical prep • Download and install • TRIMMOMATIC: http://www.usadellab.org/cms/index.php?page=trimmomatic • VELVET: http://www.ebi.ac.uk/~zerbino/velvet/ • Use make ’MAXKMERLENGTH=70’ when compiling • Grab practical dataset from course page