270 likes | 380 Views
Parallel particle-based reaction diffusion: a GPU implementation. Lorenzo Dematté – Cosbi HIBI2010. Systems biology. Understand biological systems as a system. S tructure and dynamics rather than the characteristics of isolated parts of a cell or organism (Kitano 2002 ).
E N D
Parallel particle-based reaction diffusion: a GPU implementation Lorenzo Dematté– Cosbi HIBI2010
Systems biology • Understand biological systems as a system. • Structure and dynamicsrather than the characteristics of isolated parts of a cell or organism (Kitano 2002). • Modelling & Simulation
M&S in Systems Biology What? A simplified abstract view of the system: complex reality, empirical objects, phenomena, and physical processes. Why? impossible or impractical to create experimental conditions; human thought processes amplified; (leverage computational power to simulate, visualize, manipulate) Modelling Simulation (and more) How? consistency to empirical data: ability to explain past observations, ability to predict future observations. Simulation. Visualization
Several simulation methods Computational complexity
Spatial simulation • Why? localized fluctuations, diffusion have an important role • mRNA movement within the cytoplasm [Fusco2003] • mRNA localization in budding yeast; Morphogen gradients [Alberts] • Bacterial chemotaxis [Bray2003] • Synapse specificity of long-term facilitation in neural cells [Kandel2001] • Microtubule assisted protein movement during the cell cycle [Novak2001] • … • Two main ways:
Parallel M&S: motivation A+B -> C
Scaling up model complexity • More complex models • More complex methods • Solution? • More powerful computers… • Moore’s law • More cores! Need for parallel processing • Different “threads” of execution • Partition of work/data into (mostly) independent sub-units
Parallelization techniques • Multiple Replication in Parallel (MRIP) • Launch independent replications (uncorrelated) on multiple processes (cores/CPUs/computers). • Allows for sensitivity analysis, statistics, evolutionary computations, optimization problems… (widespread, relatively easy) • Typically data parallel • Single Replication in Parallel (SRIP) • Decomposition of a stochastic trajectory into logical processes, running on different processors. • Challenging, as it have to preserve “exactness” (prove to be equivalent to a sequential simulation) • Task parallel: DES (Discrete Event Simulations) • Data parallel Evolutionary computations Spatial Gillespie GPU Particle based methods
GPU computing for SB • Recent survey [BiB2010] • No Brownian Dynamics • Lots of MD implementations • Why? • Many related problems (collision detection, ABMs) already have GPU implementations • Candidate: Smoldyn • Single molecular detail, fixed time step • Paper by author: no speed-up Andrews, Steven S. and Dennis Bray. "Stochastic simulation of chemical reactions with spatial resolution and single molecule detail." Phys. Biol. 1:137-151, 2004
GPUs • Strong points: • Peak performance in the TFlop range • Weak points: • GPU programming is challenging • Not everything fits on the GPU model! • Very fine-grained • Re-think a problem in a different way
GPUs • Streaming Processors (execution unit) • Single program on partitioned data + SIMD (SIMT) • Typically data parallel
GPUs CPU/OS Threads VS. GPU “Threads” //CPU code void increment_cpu(float *a, float b, int N) { for (intidx = 0; idx<N; idx++) a[idx] = a[idx] + b; } //GPU code __device__ void increment_gpu(float *a, float b, int N){ intidx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < N) a[idx] = a[idx] + b; } • Context switch is “free” • Actually, GPUs need to switch to keep SM busy • Thousands of threads in-flight • Each “thread” works on a piece of data
Initialize Smoldyn on the CPU • Zeroth reaction B A A • Diffusion • First order • Second order C # 0th reactions Binding/unbinding radii # molecules (foreach) # molecules # molecules^2, # molecules (space partition)
Challenges • General • Arithmetic intensity: operations / words transferred • Random number generation • Mersenne Twisters • Gaussian distributions • LUT in GPU constant memory (read-only, cached) • Variable number of molecules • Keeping a data parallel approach (streams) • Control flow: rethink (if (nProducts == 1) then…) • Natural CPU implementation: linked list w. pointers
Implementation: data structures 0.1 1.1 A 0.2 -2 C 1.2 A 2.4 2.3 A -3 ... ... ... ... A -2 1.2 ... Same for the reaction list Bim rates/radii in a 2D matrix
Implementation: data structures • Struct of arrays: better memory access (also compared to striping) • Parallel implementation of Mersenne Twister (MT) • Single twister not parallel • Instead, many twisters • different parameters (correlation) • dcmt • Gaussian LUT • Constant memory
Implementation: diffusion One kernel for each molecule uint index = blockIdx.x * blockDim.x + threadIdx.x; if (index >= numMolecules) return; // Ensure coalesced reads volatile float4 pos = posArray[index]; volatile inttypeId = typeArray[index]; intrngIndex = index % MT_RNG_COUNT; TwisterState* rngState = &(rngStateArray[rngIndex]); // Get diffusion rate float rate = tex1Dfetch(diffusionRatesTex, typeId); intrandX = MTRand(rngState, rngIndex); // Transform it to gaussian, combine with rate pos.x += rate * gaussianLookupTable[randX & tableDim-1]; // ... same for y, z
Implementation: first order ... A C A A ... A A C A+B 0 A A A 0 ... A 0 B 0 0 ... 0 A A A 0 B ... A 0 B 0 0 C 0 ... 0 0 1 2 3 3 ... n 0 0 1 1 1 2 ... m CUDPP “style” scan A A A B ... A B C ... n + m
Implementation: second order • GPU collision detection • Used for video-games • Missile-vehicle, vehicle-vehicle • “Special effects” (smoke, trails, water) • N-body, bruteforce • Excellent results on CUDA • But collisions are local interactions • Spatial subdivision with a uniform grid • CUDA Whitepaper [S.Green] • Uses sorting to improve performance
Implementation: second order Cell -> which particles? ?? cellId = SortedList[CellStart [CellIndex]].Second
Implementation: second order uintgridHash = calcGridHash(gridPos); // get start of bucket for this cell uintstartIndex = FETCH(cellStart, gridHash); uint mol1Index = gridParticleIndex[index], mol1Type = types[mol1Index]; // iterate over particles in this cell uintendIndex = FETCH(cellEnd, gridHash); for(uint j=startIndex; j<endIndex; j++) { if (j != index) { // check not colliding with self uintmol2Index = gridParticleIndex[j], mol2Type = types[mol2Index]; // is a reaction possible between them? inttableIdx = mol1Type * numTypes + mol2Type; int r = reactionTable[tableIdx]; if (r == -1) continue; float3 pos2 = make_float3(FETCH(oldPos, j)); float3 relPos = pos1 - pos2; // calculate relative position if (lengthSquared(relPos) <= reactionList->bindRadiusSquared[r]) { float p = reactionList->prob[r]; if (p == 1 || randReal(randState, threadId) <p) bimReact(...); } } } Index is the current particle index, j the one we will test . Both are indexes to the sorted array NOTE: as for the “diffusion” and “first” steps, code for the algorithm is unchanged. CUDA helps here: we just copy-pasted the core C code.
Initialize • Initialize Smoldyn on the GPU • Zeroth reaction • Zeroth reaction CPU CPU GPU • First order • Second order • Diffusion Transfer data Use aux array for “birth”. Compact “birth” and particles arrays (death). Diffusion First order (segmented scan) Read back N Use particle array (non-death), # particles can only diminish. Compact particle array. Fuse them (copy) (reallocate if needed) Second order (build grid, compute hash, process cell, segmented scan) Read back N (Opt) Reallocate • Data structures entirely on the GPU
Summary • Think data parallel • Minimize data exchange CPU-GPU • New data and control flow structure also on multi- many-core • Memory key to great performances • ~140x speed-up only after memory tweaks • Specific to GPUs: prefix scan (compaction) and sorted grid (collision detection) • Future: • Smoldyn input/integration • Editor for spatial structure
Questions? Thank You