Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA

Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig
Presenter: Erkan Okuyan

Motivation
• High-throughput platforms (Illumina, 454, SOLiD) produce massive amounts of sequencing data: short reads with a high error rate.
• Assembly is sensitive to errors in the reads, so sequencing errors need to be corrected.
• Error correction at this scale is computationally demanding.

Definitions
- Let R = {r_1, r_2, …, r_k} be a set of k reads with |r_i| = L.
- Let r_i ∈ {A, C, G, T}^L for all 1 ≤ i ≤ k.
- Let m (multiplicity) and l (length) satisfy m > 1 and l < L.
• Definition 1 (Solid and Weak): An l-tuple (a DNA string of length l) is called solid with respect to R and m if it is a substring of at least m reads in R, and weak otherwise.
  • Intuition: an l-tuple that occurs in at least m reads is probably a correct l-tuple.
• Definition 2 (Spectrum): The spectrum of R with respect to m and l, denoted T_{m,l}(R), is the set of all solid l-tuples with respect to R and m.
  • Intuition: the spectrum T_{m,l}(R) approximates the set of all correct l-tuples.
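Definitions 1 and 2 can be illustrated with a minimal CPU-side sketch in Python. The function name `spectrum` and the toy reads are assumptions made for illustration; they do not come from the talk.

```python
from collections import Counter

def spectrum(reads, l, m):
    """Return the set of solid l-tuples: those that occur in at least m reads."""
    counts = Counter()
    for read in reads:
        # A set ensures each read contributes at most once per l-tuple,
        # matching "substring of at least m reads".
        tuples_in_read = {read[i:i + l] for i in range(len(read) - l + 1)}
        counts.update(tuples_in_read)
    return {t for t, c in counts.items() if c >= m}

# Hypothetical toy data: the third read has one error (T instead of G at index 3).
reads = ["ACGGTA", "ACGGTA", "ACGTTA"]
solid = spectrum(reads, l=3, m=2)
print(sorted(solid))  # ['ACG', 'CGG', 'GGT', 'GTA']
```

The l-tuples introduced by the error (CGT, GTT, TTA) appear in only one read, so they are weak and stay out of the spectrum.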

Definitions (continued)
- R, k, L, m, and l are as on the previous slide.
• Definition 3 (T-string): A DNA string s is called a T_{m,l}(R)-string if every l-tuple in s is an element of T_{m,l}(R).
• Definition 4 (Spectral Alignment Problem, SAP): Given a DNA string s and a spectrum T_{m,l}(R), find a T_{m,l}(R)-string s* in the set of all T_{m,l}(R)-strings that minimizes the distance function d(s, s*).
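A small Python sketch of Definitions 3 and 4, under two assumptions that are not stated on the slide: d is taken to be the Hamming distance, and the search is limited to distance at most 1. The names `is_t_string` and `nearest_t_string_1` are hypothetical.

```python
def is_t_string(s, spec, l):
    """True if every l-tuple of s is an element of the spectrum spec."""
    return all(s[i:i + l] in spec for i in range(len(s) - l + 1))

def nearest_t_string_1(s, spec, l):
    """Brute force: return a T-string within Hamming distance 1 of s, or None."""
    if is_t_string(s, spec, l):
        return s
    for pos in range(len(s)):
        for base in "ACGT":
            if base != s[pos]:
                candidate = s[:pos] + base + s[pos + 1:]
                if is_t_string(candidate, spec, l):
                    return candidate
    return None

# Hypothetical spectrum for l = 3.
spec = {"ACG", "CGG", "GGT", "GTA"}
print(is_t_string("ACGTTA", spec, 3))         # False
print(nearest_t_string_1("ACGTTA", spec, 3))  # ACGGTA
```

The brute-force search tests every single-base substitution against the whole read, which is why the talk replaces it with a voting scheme.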

CUDA (Compute Unified Device Architecture)
• Integrated host + device application program
  • Serial or modestly parallel parts in host C code
  • Highly parallel parts in device SPMD kernel C code
• Execution alternates between host and device:
  Serial Code (host)
  Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host)
  Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);

CUDA Execution
• A GPU device
  • Is a coprocessor to the CPU (the host)
  • Has its own DRAM (device memory)
  • Runs many threads in parallel
• Data-parallel portions of an application are expressed as device kernels, which run on many threads
• Differences between GPU and CPU threads
  • GPU threads are extremely lightweight
  • Very little creation overhead
  • A GPU needs thousands of threads for full efficiency

Parallel Error Correction with CUDA
• Each kernel thread is responsible for correcting a single read r_i.
• Voting-based algorithm
  • Step 1: Calculation of the voting matrix
  • Step 2: Single-mutation fixing, trimming, or discarding

Step 1: Voting Matrix Calculation
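The slide's figure is not reproduced here. The following Python sketch shows one plausible reading of the voting step for a single read, which is the work one kernel thread would do. It assumes that every weak l-tuple tries all single-base substitutions and votes for those that make it solid. The name `voting_matrix` and the toy data are hypothetical, and this is not the authors' CUDA kernel.

```python
BASES = "ACGT"

def voting_matrix(read, spec, l):
    """votes[pos][b] = number of weak l-tuples that become solid
    when read[pos] is replaced by BASES[b]."""
    votes = [[0] * 4 for _ in read]
    for start in range(len(read) - l + 1):
        tup = read[start:start + l]
        if tup in spec:
            continue  # solid l-tuples cast no votes
        for off in range(l):
            for b, base in enumerate(BASES):
                if base == tup[off]:
                    continue
                mutated = tup[:off] + base + tup[off + 1:]
                if mutated in spec:  # one membership test per candidate
                    votes[start + off][b] += 1
    return votes

# Hypothetical spectrum (l = 3) and a read with an error at index 3.
spec = {"ACG", "CGG", "GGT", "GTA"}
votes = voting_matrix("ACGTTA", spec, 3)
print(votes[3])  # [0, 0, 3, 0]: three weak l-tuples vote for G at index 3
```

Each weak l-tuple costs 3·l membership tests in this sketch, which is why the speed of the membership test matters so much.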

Step 2: Fixing/Trimming/Discarding Reads
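The slide gives no detail beyond its title, so the Python sketch below is only a guess at what a fix/trim/discard policy could look like. The assumptions: the substitution with the most votes is tried first, an unfixable read is trimmed to its longest stretch of solid l-tuples, and it is discarded if that stretch is shorter than a hypothetical `min_len`. All function names are made up for illustration.

```python
BASES = "ACGT"

def is_t_string(s, spec, l):
    return all(s[i:i + l] in spec for i in range(len(s) - l + 1))

def longest_solid_stretch(read, spec, l):
    """Longest substring of read whose l-tuples are all solid ('' if none)."""
    best_start, best_len, run_start, run_len = 0, 0, 0, 0
    for i in range(len(read) - l + 1):
        if read[i:i + l] in spec:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return read[best_start:best_start + best_len + l - 1] if best_len else ""

def fix_trim_or_discard(read, votes, spec, l, min_len):
    if is_t_string(read, spec, l):
        return "ok", read
    # Try the single substitution that received the most votes.
    count, pos, b = max((votes[p][b], p, b)
                        for p in range(len(read)) for b in range(4))
    if count > 0:
        fixed = read[:pos] + BASES[b] + read[pos + 1:]
        if is_t_string(fixed, spec, l):
            return "fixed", fixed
    # Otherwise keep the longest error-free stretch if it is long enough.
    trimmed = longest_solid_stretch(read, spec, l)
    if len(trimmed) >= min_len:
        return "trimmed", trimmed
    return "discarded", None

spec = {"ACG", "CGG", "GGT", "GTA"}       # hypothetical spectrum, l = 3
votes = [[0] * 4 for _ in range(6)]
votes[3][2] = 3                           # three votes for G at index 3
print(fix_trim_or_discard("ACGTTA", votes, spec, 3, 5))  # ('fixed', 'ACGGTA')
```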

Fast Membership Tests
• The first kernel (voting matrix calculation) dominates the runtime.
• It requires (L − l)·(l + 3·p·l) membership tests, where p is the number of l-tuples that do not belong to the spectrum.
• A space-efficient Bloom filter speeds up membership tests against the spectrum.
• The Bloom filter is computed on the CPU and stored in texture memory (a fast read-only cache) on the device.

Bloom Filter
• Probabilistic data structure
  • No false negatives
  • Small percentage of false positives
• Space efficient and fast
• Uses a bit array B of length m and d hash functions h_1, …, h_d (here m is the array length, not the multiplicity from the definitions)
  • To insert x, set B[h_i(x)] = 1 for i = 1, …, d
  • To query y, check whether B[h_i(y)] = 1 for all i = 1, …, d
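The insert and query rules above can be sketched in a few lines of Python. The d hash functions are simulated by salting SHA-256 with the index i. That choice is an assumption made for brevity, not the hashing used in the talk, and the class name `BloomFilter` is hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits      # m on the slide
        self.num_hashes = num_hashes  # d on the slide
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # h_i(item): salt the item with i to obtain d different hash values.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
for tup in ("ACG", "CGG", "GGT", "GTA"):
    bf.insert(tup)
print(bf.query("ACG"))  # True: inserted items are never reported absent
```

A production filter would pack the bits into machine words. The byte-per-bit array here only keeps the sketch short.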

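Bloom Filter Example
• The elements a and b are inserted into a Bloom filter with m = 10 bits, n = 2 inserted elements, and d = 3 hash functions.
• A query for c returns false because at least one of its bits is 0.
• A query for the element d returns true because all of its bits are 1, although it was never inserted (a false positive).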
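For the parameters in this example, the false-positive probability can be estimated with the standard approximation (1 − e^(−d·n/m))^d, which assumes independent, uniformly distributed hash functions. The function name `false_positive_rate` is an assumption, and the slide itself gives no number.

```python
import math

def false_positive_rate(m_bits, n_items, d_hashes):
    """Standard Bloom filter approximation: (1 - e^(-d*n/m))^d."""
    return (1.0 - math.exp(-d_hashes * n_items / m_bits)) ** d_hashes

rate = false_positive_rate(m_bits=10, n_items=2, d_hashes=3)
print(f"{rate:.3f}")  # 0.092
```

With only 10 bits, roughly 9% of queries for absent elements are expected to return true. A larger bit array lowers this rate quickly.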

Overall Algorithm
• Pre-computation on the CPU: program the Bloom filter bit vector (via a counting Bloom filter) by hashing each l-tuple present in the reads of R.
• Data transfer from CPU to GPU: allocate device memory and transfer the Bloom filter and the reads.
• Execute the CUDA kernel.
• Data transfer from GPU to CPU: transfer the set of corrected/trimmed reads back to the host.
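The pre-computation step can be sketched in Python as follows. The slide does not say how the counting Bloom filter is turned into the bit vector. This sketch assumes that each counter is thresholded at the multiplicity m, so a query succeeds only if all d counters of an l-tuple reached m. The function names and table sizes are hypothetical.

```python
import hashlib

def positions(item, num_slots, num_hashes):
    """d salted SHA-256 hash positions for item (an illustrative choice)."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8],
                           "big") % num_slots
            for i in range(num_hashes)]

def build_spectrum_bits(reads, l, m, num_slots, num_hashes):
    counters = [0] * num_slots  # the counting Bloom filter
    for read in reads:
        # Count each l-tuple once per read.
        for tup in {read[i:i + l] for i in range(len(read) - l + 1)}:
            for pos in set(positions(tup, num_slots, num_hashes)):
                counters[pos] += 1
    # Threshold: a slot is set only if its counter reached the multiplicity m.
    return [1 if c >= m else 0 for c in counters]

def probably_solid(tup, bits, num_hashes):
    return all(bits[pos] for pos in positions(tup, len(bits), num_hashes))

reads = ["ACGGTA", "ACGGTA", "ACGTTA"]  # hypothetical toy reads
bits = build_spectrum_bits(reads, l=3, m=2, num_slots=4096, num_hashes=3)
print(probably_solid("CGG", bits, 3))   # True: CGG occurs in two reads
```

Counters can only overestimate the number of reads containing an l-tuple, so solid l-tuples are never rejected. Weak l-tuples may occasionally pass as false positives.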

Performance Evaluation
• System Parameters
  • NVIDIA GeForce GTX 280 with 1 GB memory
  • AMD Opteron dual-core 2.2 GHz CPU with 2 GB memory
• Datasets
  • Artificial sets (1%, 2%, 3% error rates)
    • Yeast chromosomes (S.cer5, S.cer7)
    • Bacterial genomes (H.inf, E.col)
  • Real set
    • Staphylococcus aureus strain MW2 (H.Aci) (error rate ~1%)

Performance Evaluation

Discussion/Conclusion (GOOD)
• Runtime speedups of 10 to 19 times are reported.
• Bigger datasets are not an issue as long as the Bloom filter fits in texture memory; the reads can be processed in more than one load/correct round.
• Further parallelization on distributed-memory GPU farms is possible.

Discussion/Conclusion (BAD)
• Does not exploit the fast shared memory within thread blocks. A read r_i does not have to be handled by a single thread: the voting matrix can be constructed in parallel, so further speed-up is possible.
• The predetermined read length L is somewhat restrictive.

Thank You
