320 likes | 471 Views
Parallel de novo Assembly of Large Genomes from Paired Short Reads. Srinivas Aluru Electrical and Computer Engineering Iowa State University Computer Science and Engineering Indian Institute of Technology Bombay. Sequencing Technology (1977-2005). Sanger Sequencing 500-1000 bp per run
E N D
Parallel de novo Assembly of Large Genomes from Paired Short Reads Srinivas Aluru Electrical and Computer Engineering Iowa State University Computer Science and Engineering Indian Institute of Technology Bombay
Sequencing Technology (1977-2005) Sanger Sequencing • 500-1000 bp per run • Up to 96 runs ABI 3700
Sequencing long DNA Multiple copies of the same source Sequence Unordered genome fragments Randomly fragment the copies Iowa State University
Next-gen Sequencing 454 36 –> 50 –> 75 bp 100 – 200 million reads 400-500 bp 1.25 million reads Polonator • 26 bp • 200 – 400 million reads Illumina Genome Analyzer
High-throughput sequencing • 10-fold increase per year • Throughput of Solid sequencer doubling every quarter recently • In 2008, Illumina customers sequenced ≈10 X GenBank size in 2005 (Science 2009) • $15 million for 5-7X of Sanger vs. $48K for 30X of Illumina for human genome (June 2009)
The Resulting Data Deluge Broad this year: 50 Gbases in 12 days ~4GB/instrument/day 98 instruments 400 Gbases/day 4 TB/day raw data 1.5 Petabytes/year Broad next year: 200 Gbases in 8 days 25GB/instrument/day 100 instruments 2.5 Tbases/day 28 TB/day raw data 10 Petabytes/year CERN 15 PB/year
Exact Matching Filter Directly detectpromising pairs Sanger: De novo genome assembly Overlap-Layout-Consensus pairs ≡O(n2l2) run-time O(n) pairs O(nl2) run-time Iowa State University
Genome Sequencing - Examples • Human; $1 B+ (Venter et al. 2001): • 10,000 CPU hours for overlaps • 10,000 CPU hours for the rest • Parallel assembler (3 months 1 day; Kalyanaraman & Aluru JPDC 2007) • B73 maize genome; $30 million (WSU, UAZ, ISU, CSHL; Science Nov. 2009) Iowa State University
Next-Gen: Graph based assembly TACG ACGTAAC Build de Bruijn graph from reads. Each read is a path Genome is a super path Use paired read constraints as guide CGTA CGTA CGTA GTAA ACGT ACGT ACGT GTAA GTAA GTAA TAAC TAAC TAAC AACG TAAC
Rapidly Growing Area • ALLPATHS (Butler et al. 2008) • EULER-SR (Chaissonet al. 2008) • SHARCGS (Dohmet al. 2008) • Velvet (Zerbino and Birney 2008) • Bidirected graphs (Medvedev & Brudno 2008) • YAGA (Jackson, Schnable and Aluru 2009) • ABySS (Simpson et al. 2009) Key Challenge: memory-intensive graph models for high quality + high-coverage to compensate for short reads
Generating Paired Reads GACGGTGACGTTGGG ACGGTGACGTTGGGAGCGGAGCG AGGGCTTAGAC AGGGCTTAGACGGT AGGGCTTAGACGGT AGGGCTTAGACGGTG GACGGTGACGTTGGG ACGGTGACGTTGGGAGCGGAGCG AGGGCTTAGACGGTG AGGGCTTAGAC GGTGACGGTGACGTTGGGAGCGGAGCG GGTGACGGTGACGTTGGGAGCGGAGCG AGCGGAGCG AGCGGAGCG CCCAACGTCACCGTC CGCTCCGCTCCCAACGTCACCGT CGCTCCGCTCCCAACGTCACCGT ACCGTCTAAGCCCT ACCGTCTAAGCCCT CCCAACGTCACCGTC GTCTAAGCCCT GTCTAAGCCCT CGCTCCGCTCCCAACGTCACCGTCACC CGCTCCGCTCCCAACGTCACCGTCACC CGCTCCGCT CGCTCCGCT CACCGTCTAAGCCCT CACCGTCTAAGCCCT
Multiple Fragment Types Minimum Fragment Size Maximum Fragment Size Minimum Read Length Maximum Read Length
Error Correction K-spectrum approach (Chaissonet al., Bioinformatics 2008) SHREC (Shroderet al., Bioinformatics 2009) Reptile (Yang and Aluru, 2010)
Importance of Error Correction BartonellaHenselae 330X coverage, 36bp Illumina reads
Bidirected de Bruijn Graph + CGCCGAC TCGCCGAC CGCCGACT GCCGACTA CCGACTAC CGACTACT CGCCGAC GTAGTCGC TAGTCGGC AGTAGTCG GTCGGCGA AGTCGGCG GTCGGCG GTCGGCG - TCGCCGACTACT - AGTAGTCGGCGA + - + + + - -
Bidirected String Graph • Edges labeled with two strings CGACTACT GCCGACTA CCGACTAC TCGCCGAC CGCCGACT AGTAGTCG GTAGTCGC GTCGGCGA AGTCGGCG TAGTCGGC T/A C/T TCGC/AGTA G/G C/A
Parallel Graph Representation • Distributed tuple List for edges • <u, v, du, dv, Cu, Cv> • <v, u, dv, du, Cv, Cu> • 2k bits to represent node IDs • Sorting tuples: • by node label: edges incident to a node in local memory • by (smaller node label, larger node label): both tuples corresponding to an edge in local memory
Parallel Graph Construction • Reads are partitioned to processors based on total size • Each processor generates (k+1)-molecules (edges) from its short reads • Sort to eliminate duplicates and keep frequency count
Graph Compaction • Compact chains into single edges • Reduction to familiar list ranking • List ranking: • Distributed linked list with adjacency information • Find distance from each node to end of list • Extend: • Multiple linked lists • Identify nodes with multiple edges • Undirected
Transforming Edge Compaction to List Ranking: The transformation can be accomplished using a constant number of parallel sorts.
Errors Detection and Removal Bubble Spurious Link Tip Singleton k k k k TCGTTGCGTGCGTGAGCGT TCGTTGCGTGCGTGAGCGT TCGTTGCGTGCGTGAGCGT
Repeats and the String Graph Repeat 1 Repeat 2 While a path through the graph corresponds to the genome, it is not obvious which path it is Paired reads provide approximate distance constraints that can be used to choose between alternate traversals
Measuring path support • Given a path in the graph • Each (k+1)-molecule in a read maps to some position in the graph • Can check to see if the observed distance in the path matches distance constraints
Clustering Path Support • Important idea: can specify the amount of expected support for two edges in a path • For each fragment type, much of some range should be supported • Can cluster observed support into support ranges and check to see if these observed ranges match expected ranges • Use single linkage clustering to combine multiple ranges into a single unit • Key Idea: Clustering remains invariant in the face of intervening path changes • can be precalculated prior to graph traversal • These “partial” clusters can be converted to true clusters during path walking
Creating a path extension “lock” For each edge in the path, we create an expected support “lock” given a candidate edge extension:
Finding a Key (cluster) that Fits • Convert precalculated “partial” clusters to clusters specific to this path • Check calculated clusters against the lock: • Fits if: • Lock is a subrange of key • Key fits “tightly” against one side of the lock Fit No Fit
Parallel Short Sequence Assembler • YAGA (Yet Another Genome Assembler) • 12,000 line C++ implementation • A parallel support library built on MPI • A runtime library that allows alternate implementation of underlying functionality (eases debugging) • Computational methods as permutations and transformations on arrays of tuples. Assembler Computation Primitives Runtime Library Parallel Serial MPI 80 Runtime Perfect Speedup 40 20 10 5 16 32 64 128 256
Experimental Results • Synthetic data from previously assembled genomes • Protocol I – 900±100, 4300±600 • Protocol II - 330±30, 660±60, 1100±100, 2200±200, 4400±400 • 30-50bp reads, 1% error rate, and 300X coverage
Experimental Results NX score is the maximum length L such that at least X percent of the genome is covered by assembled contigs of length L
Performance Results (Blue Gene/L) D. Melanogaster • ~900 million reads and 512 processors • 50 mins. for data transfer • 100 mins. for the parallel phases • 20 mins. for remaining
Conclusions and Future Work • HPC needed to scale graph methods to large genomes • Method applicable to heterogeneous reads • Can we do novo assemble large mammalian and plant genomes? • Error correction for repeat-rich genomes? • Parallel path walking? • Improve assembly using multiple parameters? • Scaffolding using graph and (k+1) pair clusters?