De Novo Genome Assembly Using vSMP
Wayne Pfeiffer (SDSC/UCSD)
August 10, 2011
Goal: determine the DNA sequence of the chromosomes that make up a genome

Background
• Chromosomes consist of two long DNA molecules in a double helix
• Humans have 23 pairs of nuclear chromosomes, each 50M to 250M base pairs long
• This gives 3.1 billion bases in the human haploid genome
• Humans have 20,000+ genes (which comprise the exome) that are typically a few kbases long, though dystrophin is 2.4 Mbases long
• High-throughput sequencers output only short, paired-end reads (typically ~100 bases long)

Challenge
• Assemble a genome or exome sequence from short DNA reads without using a reference sequence (i.e., de novo)

Algorithm
• Generate and analyze a large graph that is stored in shared memory

[Diagram: a pair of reads sequenced from the two ends of a DNA fragment]
Conceptual steps in de novo assembly
1. Find reads that overlap by a specified number of bases (the k-mer size)
2. Merge overlapping, “good” reads into longer contigs
3. Link contigs to form scaffolds using paired-end information

Diagrams from Serafim Batzoglou, Stanford
de Bruijn graph has k-mers as nodes connected by reads; assembly involves finding an Eulerian path through the graph

Diagram from Michael Schatz, Cold Spring Harbor
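To make the idea concrete, here is a minimal Python sketch of de Bruijn assembly on a toy, error-free read. It is not the Velvet or SOAPdenovo implementation; it uses the common textbook formulation in which nodes are (k-1)-mers and each k-mer contributes a directed edge, with the Eulerian path found by Hierholzer's algorithm. All names and the toy read are illustrative.

    from collections import defaultdict

    def build_graph(reads, k):
        # Each k-mer in a read adds a directed edge from its
        # (k-1)-mer prefix to its (k-1)-mer suffix.
        graph = defaultdict(list)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].append(kmer[1:])
        return graph

    def eulerian_path(graph):
        # Hierholzer's algorithm, assuming an Eulerian path exists
        # (true for error-free, fully covered toy data).
        out_deg = {n: len(s) for n, s in graph.items()}
        in_deg = defaultdict(int)
        for succs in graph.values():
            for s in succs:
                in_deg[s] += 1
        # Start where out-degree exceeds in-degree by one, if such a node exists.
        start = next((n for n in graph if out_deg[n] - in_deg[n] == 1),
                     next(iter(graph)))
        edges = {n: list(s) for n, s in graph.items()}
        stack, path = [start], []
        while stack:
            node = stack[-1]
            if edges.get(node):
                stack.append(edges[node].pop())
            else:
                path.append(stack.pop())
        return path[::-1]

    path = eulerian_path(build_graph(["ATGGCGTGCA"], k=4))
    # Spell out the assembled sequence from the node path.
    print(path[0] + "".join(n[-1] for n in path[1:]))  # ATGGCGTGCA

Real assemblers must additionally handle sequencing errors, repeats (which make the path non-unique), and reverse complements; that is why their graphs are large and memory-hungry.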
Velvet & SOAPdenovo are two leading assemblersthat use de Bruijn graph algorithm • Velvet is from EMBL-EBI • Code has two steps: hash & graph • Both are parallelized with OpenMP • Either step can use more time or memory depending upon problem & computer • http://www.ebi.ac.uk/~zerbino/velvet • SOAPdenovo is from BGI • Code has four steps: pregraph, contig, map, & scaffold • pregraph & map are parallelized with Pthreads; others are serial • pregraph uses the most time & memory • http://soap.genomics.org.cn/soapdenovo.html • k-mer size is adjustable parameter • Typically it is adjusted to maximize N50 length of scaffolds or contigs • N50 length is central measure of distribution weighted by lengths
Three central measures of the distribution of sequence lengths can be compared

Two common measures
• Mean length: the usual average of the lengths
• Median length: the length for which half the sequences are shorter & half are longer

Less common measure
• N50 length: the length that splits the total bases in half, after the lengths are ordered

Example values for the distribution of scaffold lengths for the Daphnia genome (from SOAPdenovo with k-mer size = 39)
• Mean length: 627
• Median length: 200
• N50 length: 1,718
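A short Python sketch of the three measures on a toy length list (the function name is illustrative; the mean and median come from the standard library):

    from statistics import mean, median

    def n50(lengths):
        # Shortest length L such that sequences of length >= L
        # together contain at least half of the total bases.
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length

    lengths = [2, 3, 4, 5, 6]   # toy scaffold lengths, total = 20
    print(mean(lengths))        # mean = 4
    print(median(lengths))      # median = 4
    print(n50(lengths))         # N50 = 5, since 6 + 5 = 11 >= 20 / 2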
We do not get one contig or even one scaffold for each chromosome of the Daphnia genome or for each gene of the human exome when assembling short reads

Problem                        Reads   Read length   k-mer size   Contigs   Longest contig   Scaffolds
Daphnia genome: Huqhos2        208M    44, 75        39           684k      24k              351k
Human exome: HT002.s4.dedup    186M    101           73           279k      13k              258k

Complete assemblies are difficult to obtain
• Roughly half (!) of the human genome consists of repeats, with peaks around 300 bases & 6 kbases, which are longer than the reads
• However, it helps to have reads that are paired-end and have different (but roughly known) insert lengths (i.e., fragment lengths)

Tabulated results are for
• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI
• Illumina paired-end reads with k-mer size picked to maximize scaffold N50 length
• SOAPdenovo 1.05
Velvet & SOAPdenovo often give similar assemblies,though Velvet seems better for Daphnia genome SOAP- SOAP- Velvet denovo Velvet denovo Read k-mer longest longest scaffold scaffold Problem length size scaffold scaffold N50 len N50 len Daphnia genome: 44, 75 31 94k243k 4,017 1,634 Huqhos2 35 99k 283k 4,383 * 1,696 39 95k 225k 4,220 1,718 * Human exome: 101 63 err 14k err 656 HT002.s4.dedup 67 17k 13k 787 * 692 73 17k 13k 768 702 * Tabulated results are for • Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI • Illumina paired-end reads with * for optimal k-mer size • Velvet 1.1.03 & SOAPdenovo 1.05
Velvet runs slower than SOAPdenovo; time roughly increases with GB of reads & decreases with k-mer size

Problem                        Reads   Read length   Reads on disk (GB)   k-mer size   Velvet time (h)   SOAPdenovo time (h)
Daphnia genome: Huqhos2        208M    44, 75        28                   31           6.4               2.2
                                                                          35           5.0 *             1.6
                                                                          39           4.4               1.7 *
Human exome: HT002.s4.dedup    186M    101           45                   63           err               1.7
                                                                          67           2.6 *             1.8
                                                                          73           2.2               1.5 *

Tabulated results are for
• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI
• Illumina paired-end reads with * marking the optimal k-mer size
• One 32-core node of Triton PDAF with 2.5-GHz AMD Shanghais
• Velvet 1.1.03 with 8 threads & SOAPdenovo 1.05 with 16 threads
Velvet requires more memory than SOAPdenovo; memory roughly increases with GB of reads, but has a more complicated dependence on k-mer size

Problem                        Reads   Read length   Reads on disk (GB)   k-mer size   Velvet memory (GB)   SOAPdenovo memory (GB)
Daphnia genome: Huqhos2        208M    44, 75        28                   31           221                  18
                                                                          35           139 *                33
                                                                          39           94                   33 *
Human exome: HT002.s4.dedup    186M    101           45                   63           err                  59
                                                                          67           124 *                97
                                                                          73           113                  97 *

Tabulated results are for
• Daphnia data from Abe Tucker of Indiana & human data from Sam Levy of STSI
• Illumina paired-end reads with * marking the optimal k-mer size
• One 32-core node of Triton PDAF with 256 GB of memory
• Velvet 1.1.03 with 8 threads & SOAPdenovo 1.05 with 16 threads; blue cases plotted in later slides
Benchmark tests were run on three computers that allow large shared-memory runs

Triton PDAF from Sun at SDSC
• 32-core nodes with 2.5-GHz AMD Shanghais
• 256 & 512 GB of memory per node

Dash from Appro at SDSC
• 8-core nodes with 2.4-GHz Intel Nehalems
• 48 GB of memory per node
• vSMP supernode combining 16 regular nodes for larger memory

Ember from SGI at NCSA
• 384-core NUMA nodes with 2.66-GHz Intel Nehalem-EXes
• 2 TB of memory per NUMA node
• Highly variable run times
Here are some details
• For vSMP runs, numactl was called before the usual run command to limit threads to specific processors, e.g.,
    numactl --cpunodebind=0,1 <usual run command>
• No source code changes were made
• Memory used was obtained from the vmem record in the PBS job summary sent by e-mail, e.g.,
    vmem record: resources_used.vmem=128389148kb
    Memory used: 128389148 kB / 1024^2 kB/GB ≈ 122 GB
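That conversion is easy to script; below is a minimal Python sketch, assuming the vmem record has exactly the format shown above (the helper name is illustrative):

    import re

    def vmem_gb(pbs_summary):
        # Extract resources_used.vmem (reported in kB) from a PBS job
        # summary and convert it to GB (1 GB = 1024^2 kB).
        match = re.search(r"resources_used\.vmem=(\d+)kb", pbs_summary)
        return int(match.group(1)) / 1024**2

    print(round(vmem_gb("resources_used.vmem=128389148kb")))  # 122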
Using 8 threads (or cores), the graph step of Velvet is fast on Dash with vSMP, but the hash step is much slower when vSMP is needed above 48 GB
What is going on?
• Memory access for the graph step is fairly regular
  • This is efficient with vSMP
  • Performance has improved significantly over the past year through tuning of vSMP by ScaleMP
• Memory access for the hash step is nearly random
  • This is inefficient with vSMP
  • Changes to the Velvet source code are likely needed to obtain better performance
• Assessments are based on memory profiling by ScaleMP
For Velvet assemblies of the Daphnia genome, Dash with vSMP is fastest from 1 to 8 cores, but its speed drops a lot on 16 cores (i.e., on 2 nodes)
For Velvet assemblies of the human exome, Dash with vSMP is much slower than Triton PDAF at all core counts because the hash step is slow
For SOAPdenovo assemblies using 8 threads (or cores), Dash with vSMP is much slower than the other computers for the human exome; this is similar to the behavior observed using Velvet for the same data
Here are some take-away messages
• vSMP performs very well for some Velvet assemblies
  • These are cases where access to large memory is regular
  • Run time is actually faster on Dash with vSMP than on other computers with large shared-memory nodes!
• vSMP performs less well for other Velvet assemblies & all SOAPdenovo assemblies studied
  • These are cases where access to large memory is random
  • Although run time using vSMP is relatively slow, it provides the capability to do analyses that are possible on only a few computers