De-novo Assembly

De-novo Assembly Day 4

The Challenge • Given many (millions or billions) of reads, produce a linear (or perhaps circular) genome • Issues: • Coverage • Errors in reads • Reads vary from very short (35bp) to quite long (800bp), and are double-stranded • Non-uniqueness of solution • Running time and memory

In an ideal world… • Reads have no error • Reads are long enough and each appears once • Each read given in the same orientation (all 5’ to 3’, for example) • Contiguous reads have enough overlap so that they can easily be assembled

Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp Not all sequencing technologies produce mate-pairs. Different error models Different read lengths 10,000bp

Terminology • Reads are what you start with (35bp-800bp) • Fragmented assemblies produce contigs • Contigs can be put together into scaffolds

Reference-based vsde novoassembly • Comparative assembly means you have a “reference” genome • Same OR related species • Can get weird in bacteria and viruses due to recombination • De novo assembly means you do everything from scratch

Comparative Assembly Much easier than de novo! Basic idea: • Take the reads and map them onto the reference (allowing for small mismatches) • Collect all overlapping reads, produce a multiple sequence alignment, and produce consensus sequence

Comparative Assembly • Fast • Short reads can map to several places (especially if they have errors) • Needs close reference genome • Repeats are problematic • Can be highly accurate even when reads have errors

De Novo assembly • Much easier to do with long reads • Need very good coverage • Generally produces fragmented assemblies • Necessarywhen you don’t have a closely related (and correctly assembled) reference genome

Conceptual steps in de novo assembly 1. Find reads that overlap by a specified number of bases (the k-mer size) 2. Merge overlapping, “good” reads into longer contigs 3. Link contigs to form scaffolds using paired-end information Diagrams from Serafim Batzoglou, Stanford

De Novo Assembly paradigms Overlap-layout-consensus methods k-mer graph (especially useful for assembly from short reads) aka de Bruijn graph 11

de Bruijn graph hask-mers as nodes connected by reads;assembly involves finding Eulerian path through graph Diagram from Michael Schatz, Cold Spring Harbor

de Bruijngraph • Vertices are k-mers that appear in some read, and edges defined by overlap of k-1 nucleotides • Small values of k produce small graphs • Does not require all-pairs overlap calculation! • But: loss of information about reads can lead to “chimeric”contigs, and incorrect assemblies • Also produces fragmented assemblies (even shorter contigs)

de BruijnGraph use • Create the de Bruijn graph for the following string, using k=5 • ACATAGGATTCAC • Find the Eulerian path • Is the Eulerian path unique? • Reconstruct the sequence from this path

But… • Because of • Errors in reads • Repeats • Insufficient coverage de Bruijn graphs generally don’t Eulerianpaths/circuits • This means the first step doesn’t completely assemble the genome

Velvet and SOAPdenovo are two leading assemblers • Velvet is from EMBL-EBI • http://www.ebi.ac.uk/~zerbino/velvet • SOAPdenovois from BGI • http://soap.genomics.org.cn/soapdenovo.html • k-mer size is adjustable parameter • Typically it is adjusted to maximize N50 length of scaffolds or contigs • N50 length is central measure of distribution weighted by lengths

NOTE: QC step slightly different • You still do standard checks with something like FASTQC • Read Trimming is different • Algorithms/tools can deal with different read lengths • Finding overlaps give longer contigs • So we never want to sacrifice good reads • Solution: • remove sequencing adapters • trim individual reads as needed • http://www.usadellab.org/cms/index.php?page=trimmomatic

Three useful measures for optimization of k-mer length • Meanlength: the usual average of the lengths • Median length: the length for which half the sequences are shorter & half are longer • N50 length: the length that splits the total bases in half, after the lengths are ordered • Example values for distribution of contig lengths: • Mean length: 627 • Median length: 200 • N50 length: 1,718 • We’ll look at using N50 in in the practical

Practical prep • Download and install • TRIMMOMATIC: http://www.usadellab.org/cms/index.php?page=trimmomatic • VELVET: http://www.ebi.ac.uk/~zerbino/velvet/ • Use make ’MAXKMERLENGTH=70’ when compiling • Grab practical dataset from course page

De-novo Assembly

De-novo Assembly

Presentation Transcript

De Novo Genome Assembly Using vSMP

ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads

Parallel de novo Assembly of Large Genomes from Paired Short Reads

De novo assembly from Illumina

Secuenciación de novo

Genovo : De Novo Assembly for Metagenomes

De novo assembly of RNA

De Novo Repeat Classification and Fragment Assembly

De novo assembly from clinical sample

De novo genome assembly

De novo Peptide Design

De Novo Genome Assembly - Introduction

De Novo Genome Assembly - Introduction

De novo assembly validation

Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs