1 / 19

De-novo Assembly

De-novo Assembly. Day 4. The Challenge. Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome Issues : Coverage Errors in reads Reads vary from very short (35bp) to quite long (800bp), and are double-stranded Non-uniqueness of solution

keitha
Download Presentation

De-novo Assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. De-novo Assembly Day 4

  2. The Challenge • Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome • Issues: • Coverage • Errors in reads • Reads vary from very short (35bp) to quite long (800bp), and are double-stranded • Non-uniqueness of solution • Running time and memory

  3. In an ideal world… • Reads have no error • Reads are long enough and each appears once • Each read given in the same orientation (all 5’ to 3’, for example) • Contiguous reads have enough overlap so that they can easily be assembled

  4. Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp Not all sequencing technologies produce mate-pairs. Different error models Different read lengths 10,000bp

  5. Terminology • Reads are what you start with (35bp-800bp) • Fragmented assemblies produce contigs • Contigs can be put together into scaffolds

  6. Reference-based vsde novoassembly • Comparative assembly means you have a “reference” genome • Same OR related species • Can get weird in bacteria and viruses due to recombination • De novo assembly means you do everything from scratch

  7. Comparative Assembly Much easier than de novo! Basic idea: • Take the reads and map them onto the reference (allowing for small mismatches) • Collect all overlapping reads, produce a multiple sequence alignment, and produce consensus sequence

  8. Comparative Assembly • Fast • Short reads can map to several places (especially if they have errors) • Needs close reference genome • Repeats are problematic • Can be highly accurate even when reads have errors

  9. De Novo assembly • Much easier to do with long reads • Need very good coverage • Generally produces fragmented assemblies • Necessarywhen you don’t have a closely related (and correctly assembled) reference genome

  10. Conceptual steps in de novo assembly 1. Find reads that overlap by a specified number of bases (the k-mer size) 2. Merge overlapping, “good” reads into longer contigs 3. Link contigs to form scaffolds using paired-end information Diagrams from Serafim Batzoglou, Stanford

  11. De Novo Assembly paradigms Overlap-layout-consensus methods k-mer graph (especially useful for assembly from short reads) aka de Bruijn graph 11

  12. de Bruijn graph hask-mers as nodes connected by reads;assembly involves finding Eulerian path through graph Diagram from Michael Schatz, Cold Spring Harbor

  13. de Bruijngraph • Vertices are k-mers that appear in some read, and edges defined by overlap of k-1 nucleotides • Small values of k produce small graphs • Does not require all-pairs overlap calculation! • But: loss of information about reads can lead to “chimeric”contigs, and incorrect assemblies • Also produces fragmented assemblies (even shorter contigs)

  14. de BruijnGraph use • Create the de Bruijn graph for the following string, using k=5 • ACATAGGATTCAC • Find the Eulerian path • Is the Eulerian path unique? • Reconstruct the sequence from this path

  15. But… • Because of • Errors in reads • Repeats • Insufficient coverage de Bruijn graphs generally don’t Eulerianpaths/circuits • This means the first step doesn’t completely assemble the genome

  16. Velvet and SOAPdenovo are two leading assemblers • Velvet is from EMBL-EBI • http://www.ebi.ac.uk/~zerbino/velvet • SOAPdenovois from BGI • http://soap.genomics.org.cn/soapdenovo.html • k-mer size is adjustable parameter • Typically it is adjusted to maximize N50 length of scaffolds or contigs • N50 length is central measure of distribution weighted by lengths

  17. NOTE: QC step slightly different • You still do standard checks with something like FASTQC • Read Trimming is different • Algorithms/tools can deal with different read lengths • Finding overlaps give longer contigs • So we never want to sacrifice good reads • Solution: • remove sequencing adapters • trim individual reads as needed • http://www.usadellab.org/cms/index.php?page=trimmomatic

  18. Three useful measures for optimization of k-mer length • Meanlength: the usual average of the lengths • Median length: the length for which half the sequences are shorter & half are longer • N50 length: the length that splits the total bases in half, after the lengths are ordered • Example values for distribution of contig lengths: • Mean length: 627 • Median length: 200 • N50 length: 1,718 • We’ll look at using N50 in in the practical

  19. Practical prep • Download and install • TRIMMOMATIC: http://www.usadellab.org/cms/index.php?page=trimmomatic • VELVET: http://www.ebi.ac.uk/~zerbino/velvet/ • Use make ’MAXKMERLENGTH=70’ when compiling • Grab practical dataset from course page

More Related