
Parallel particle-based reaction diffusion: a GPU implementation




  1. Parallel particle-based reaction diffusion: a GPU implementation Lorenzo Dematté – Cosbi, HIBI 2010

  2. Systems biology • Understand biological systems as a system. • Structure and dynamics rather than the characteristics of isolated parts of a cell or organism (Kitano 2002). • Modelling & Simulation

  3. M&S in Systems Biology • What? A simplified, abstract view of the system: complex reality, empirical objects, phenomena, and physical processes. • Why? Impossible or impractical to create experimental conditions; human thought processes amplified (leverage computational power to simulate, visualize, manipulate). • How? Consistency with empirical data: the ability to explain past observations and to predict future observations. • Modelling, Simulation, Visualization (and more)

  4. Several simulation methods [Figure: simulation methods compared by computational complexity]

  5. Parallel M&S: motivation

  6. Spatial simulation • Why? localized fluctuations, diffusion have an important role • mRNA movement within the cytoplasm [Fusco2003] • mRNA localization in budding yeast; Morphogen gradients [Alberts] • Bacterial chemotaxis [Bray2003] • Synapse specificity of long-term facilitation in neural cells [Kandel2001] • Microtubule assisted protein movement during the cell cycle [Novak2001] • … • Two main ways:

  7. Parallel M&S: motivation A+B -> C

  8. Scaling up model complexity • More complex models • More complex methods • Solution? • More powerful computers… • Moore’s law • More cores! Need for parallel processing • Different “threads” of execution • Partition of work/data into (mostly) independent sub-units

  9. Parallelization techniques • Multiple Replication in Parallel (MRIP) • Launch independent (uncorrelated) replications on multiple processes (cores/CPUs/computers). • Allows for sensitivity analysis, statistics, evolutionary computations, optimization problems… (widespread, relatively easy) • Typically data parallel • Single Replication in Parallel (SRIP) • Decomposition of a stochastic trajectory into logical processes, running on different processors. • Challenging, as it has to preserve "exactness" (be provably equivalent to a sequential simulation) • Task parallel: DES (Discrete Event Simulation) • Data parallel: evolutionary computations, spatial Gillespie, GPU particle-based methods

  10. GPU computing for SB • Recent survey [BiB2010] • No Brownian Dynamics • Lots of MD implementations • Why? • Many related problems (collision detection, ABMs) already have GPU implementations • Candidate: Smoldyn • Single-molecule detail, fixed time step • Paper by the author reports no speed-up. Andrews, Steven S. and Dennis Bray. "Stochastic simulation of chemical reactions with spatial resolution and single molecule detail." Phys. Biol. 1:137-151, 2004

  11. GPUs • Strong points: • Peak performance in the TFlop range • Weak points: • GPU programming is challenging • Not everything fits on the GPU model! • Very fine-grained • Re-think a problem in a different way

  12. GPUs • Streaming Processors (execution unit) • Single program on partitioned data + SIMD (SIMT) • Typically data parallel

  13. GPUs CPU/OS Threads vs. GPU "Threads"
  // CPU code
  void increment_cpu(float *a, float b, int N) {
      for (int idx = 0; idx < N; idx++)
          a[idx] = a[idx] + b;
  }
  // GPU code (a kernel launched from the host is __global__, not __device__)
  __global__ void increment_gpu(float *a, float b, int N) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)
          a[idx] = a[idx] + b;
  }
  • Context switch is "free" • Actually, GPUs need to switch to keep the SMs busy • Thousands of threads in-flight • Each "thread" works on a piece of data

  14. Initialize Smoldyn on the CPU • Zeroth reaction — cost: # 0th reactions • Diffusion — cost: # molecules (foreach) • First order — cost: # molecules • Second order — cost: # molecules² (brute force) or # molecules (space partition); uses binding/unbinding radii

  15. Challenges • General • Arithmetic intensity: operations / words transferred • Random number generation • Mersenne Twisters • Gaussian distributions • LUT in GPU constant memory (read-only, cached) • Variable number of molecules • Keeping a data parallel approach (streams) • Control flow: rethink (if (nProducts == 1) then…) • Natural CPU implementation: linked list w. pointers
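The Gaussian lookup table mentioned above can be sketched on the CPU as follows. This is a minimal illustration of the general trick, not the deck's actual implementation; the function names and table size are ours, and the inverse CDF here uses slow bisection since it only runs once at table-build time.

```cpp
#include <cmath>
#include <vector>

// Inverse standard-normal CDF by bisection on std::erf (slow, but it only
// runs once, when the lookup table is built).
double normInv(double p) {
    double lo = -8.0, hi = 8.0;
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        double cdf = 0.5 * (1.0 + std::erf(mid / std::sqrt(2.0)));
        (cdf < p ? lo : hi) = mid;  // keep the half-interval containing p
    }
    return 0.5 * (lo + hi);
}

// Table of equal-probability Gaussian deviates. With tableDim a power of
// two, a uniform random integer r maps to an approximately Gaussian sample
// via lut[r & (tableDim - 1)]: one table read, no per-draw transcendentals.
std::vector<float> buildGaussianLUT(int tableDim) {
    std::vector<float> lut(tableDim);
    for (int i = 0; i < tableDim; ++i)
        lut[i] = (float)normInv((i + 0.5) / tableDim);
    return lut;
}
```

On the GPU, the slides place this read-only table in cached constant memory, which is why the per-draw cost is a single cheap lookup.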

  16. Implementation: data structures [Diagram: molecule coordinates and types stored in flat parallel arrays; the reaction list uses the same layout; bimolecular rates/radii in a 2D matrix]

  17. Implementation: data structures • Struct of arrays: better memory access (also compared to striping) • Parallel implementation of Mersenne Twister (MT) • Single twister not parallel • Instead, many twisters • different parameters (correlation) • dcmt • Gaussian LUT • Constant memory
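The struct-of-arrays point can be made concrete with a small sketch (field names are illustrative, not Smoldyn's). In an array-of-structs, thread i reading field x of molecule i touches memory with a stride of the whole struct; in a struct-of-arrays, consecutive threads read consecutive addresses, which is what CUDA hardware coalesces into one transaction.

```cpp
#include <cstddef>
#include <vector>

// Array of structs: thread i reads molecules[i].x,
// so consecutive threads access memory sizeof(MoleculeAoS) bytes apart.
struct MoleculeAoS { float x, y, z; int type; };

// Struct of arrays: thread i reads x[i],
// so consecutive threads access adjacent floats (coalesced on the GPU).
struct MoleculesSoA {
    std::vector<float> x, y, z;
    std::vector<int>   type;
    void resize(std::size_t n) {
        x.resize(n); y.resize(n); z.resize(n); type.resize(n);
    }
};
```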

  18. Implementation: diffusion One thread for each molecule
  uint index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= numMolecules) return;
  // Ensure coalesced reads
  volatile float4 pos = posArray[index];
  volatile int typeId = typeArray[index];
  int rngIndex = index % MT_RNG_COUNT;
  TwisterState* rngState = &(rngStateArray[rngIndex]);
  // Get diffusion rate
  float rate = tex1Dfetch(diffusionRatesTex, typeId);
  int randX = MTRand(rngState, rngIndex);
  // Transform it to gaussian, combine with rate
  pos.x += rate * gaussianLookupTable[randX & (tableDim - 1)];
  // ... same for y, z

  19. Implementation: first order [Diagram: reaction products are written to an auxiliary "birth" array while dead molecules are flagged with 0; a CUDPP-style prefix scan compacts both the particle array (n survivors) and the birth array (m products), which are then concatenated into n + m molecules]
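The scan-based compaction sketched in that slide works as follows: alive molecules are flagged 1, dead ones 0, and an exclusive prefix scan of the flags gives each survivor its output index, turning compaction into a single parallel gather. Here is a minimal serial stand-in (on the GPU the scan itself is parallel, e.g. CUDPP's):

```cpp
#include <vector>

// Stream compaction via exclusive prefix scan (CUDPP-style).
// alive[i] is 1 if molecule i survives, 0 if it died this step.
std::vector<int> compact(const std::vector<int>& data,
                         const std::vector<int>& alive) {
    // Exclusive scan of the alive flags: idx[i] = # survivors before i.
    std::vector<int> idx(alive.size());
    int sum = 0;
    for (std::size_t i = 0; i < alive.size(); ++i) { idx[i] = sum; sum += alive[i]; }
    // Scatter survivors to their scanned positions (a gather on the GPU).
    std::vector<int> out(sum);
    for (std::size_t i = 0; i < data.size(); ++i)
        if (alive[i]) out[idx[i]] = data[i];
    return out;
}
```

The same mechanism compacts the "birth" array of reaction products before the two arrays are fused.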

  20. Implementation: second order • GPU collision detection • Used for video games • Missile-vehicle, vehicle-vehicle • "Special effects" (smoke, trails, water) • N-body, brute force • Excellent results on CUDA • But collisions are local interactions • Spatial subdivision with a uniform grid • CUDA whitepaper [S. Green] • Uses sorting to improve performance

  21. Implementation: second order • Cell → which particles? Particles are sorted by cell hash, so each cell's particles form a contiguous run in the sorted list: cellId = SortedList[CellStart[CellIndex]].Second
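The sorted-grid lookup can be sketched on the CPU like this (names are illustrative; the GPU version follows the CUDA particles whitepaper and uses a radix sort). The idea: sort (cellHash, particleIndex) pairs by hash, then record where each cell's run begins and ends, so "which particles are in cell c?" becomes a contiguous slice of the sorted list.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct UniformGrid {
    std::vector<std::pair<unsigned, int>> sorted;  // (cellHash, particleIndex)
    std::vector<int> cellStart, cellEnd;           // run boundaries per cell

    void build(const std::vector<unsigned>& hashes, int numCells) {
        sorted.clear();
        for (int i = 0; i < (int)hashes.size(); ++i)
            sorted.push_back({hashes[i], i});
        std::sort(sorted.begin(), sorted.end());   // radix sort on the GPU
        cellStart.assign(numCells, -1);            // -1 marks an empty cell
        cellEnd.assign(numCells, -1);
        for (int i = 0; i < (int)sorted.size(); ++i) {
            unsigned h = sorted[i].first;
            if (i == 0 || sorted[i - 1].first != h)               cellStart[h] = i;
            if (i + 1 == (int)sorted.size() || sorted[i + 1].first != h) cellEnd[h] = i + 1;
        }
    }
};
```

With this, the particles of cell c are `sorted[cellStart[c] .. cellEnd[c])`, which is exactly the loop the kernel on the next slide runs.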

  22. Implementation: second order
  uint gridHash = calcGridHash(gridPos);
  // get start of bucket for this cell
  uint startIndex = FETCH(cellStart, gridHash);
  uint mol1Index = gridParticleIndex[index], mol1Type = types[mol1Index];
  // iterate over particles in this cell
  uint endIndex = FETCH(cellEnd, gridHash);
  for (uint j = startIndex; j < endIndex; j++) {
      if (j != index) {  // check not colliding with self
          uint mol2Index = gridParticleIndex[j], mol2Type = types[mol2Index];
          // is a reaction possible between them?
          int tableIdx = mol1Type * numTypes + mol2Type;
          int r = reactionTable[tableIdx];
          if (r == -1) continue;
          float3 pos2 = make_float3(FETCH(oldPos, j));
          float3 relPos = pos1 - pos2;  // calculate relative position
          if (lengthSquared(relPos) <= reactionList->bindRadiusSquared[r]) {
              float p = reactionList->prob[r];
              if (p == 1 || randReal(randState, threadId) < p) bimReact(...);
          }
      }
  }
  index is the current particle's index, j the one we will test; both are indexes into the sorted array. NOTE: as for the "diffusion" and "first order" steps, the code for the algorithm is unchanged. CUDA helps here: we just copy-pasted the core C code.

  23. Initialize Smoldyn on the GPU • Zeroth reaction: stays on the CPU • Transfer data; data structures entirely on the GPU • Diffusion • First order (segmented scan): use an aux array for "birth"; compact the "birth" and particle arrays (death); fuse them (copy, reallocate if needed); read back N • Second order (build grid, compute hash, process cells, segmented scan): uses the particle array (non-death), so # particles can only diminish; compact the particle array; read back N; (opt.) reallocate
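The per-step orchestration in that flow might look like the host-side loop below. This is a hypothetical sketch, not Smoldyn's API: every function is a stub standing in for a CUDA kernel launch, and the exact step order is our reading of the diagram; the key point from the slide is that only the molecule count N is read back to the CPU each step.

```cpp
// Hypothetical host loop mirroring the slide's flow; each call stands in
// for one or more CUDA kernel launches on device-resident data.
struct Sim { int n = 100; };  // n = # molecules, the only value read back

static void zerothOrderCPU(Sim& s) { s.n += 1; }   // tiny, stays on the CPU
static void diffuse(Sim&) {}                       // diffusion kernel
static int  firstOrder(Sim& s)  { return s.n; }    // segmented scan; births/deaths compacted
static int  secondOrder(Sim& s) { return s.n; }    // grid build + hash + per-cell kernel

void step(Sim& s) {
    zerothOrderCPU(s);     // CPU
    diffuse(s);            // GPU
    s.n = firstOrder(s);   // GPU, read back N (may grow: births appended)
    s.n = secondOrder(s);  // GPU, read back N (can only shrink: deaths)
    // (opt.) reallocate device arrays if N outgrew their capacity
}
```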

  24. Results

  25. Results (2)

  26. Summary • Think data parallel • Minimize data exchange between CPU and GPU • New data and control flow structures apply on multi-/many-core too • Memory is key to great performance • ~140x speed-up, but only after memory tweaks • Specific to GPUs: prefix scan (compaction) and sorted grid (collision detection) • Future: • Smoldyn input/integration • Editor for spatial structure

  27. Questions? Thank You
