Early Experiences in GPU-enabling the Gadget-2 Simulation Code Syed Akbar Mehdi and Aamir Shafi School of Electrical Engineering and Computer Science (SEECS) National University of Sciences and Technology (NUST)
Collaborators • Dr Bryan Carpenter from the University of Portsmouth, UK
Presentation Outline • Introduction to the HPC research group • Emergence of Cluster of GPUs • Programming GPUs • Introduction to the Gadget-2 Code • Preliminary Performance Evaluation • Summary
HPC Research Group at SEECS NUST • Research interests: • Research, development, and evaluation of parallel programming languages, libraries, and paradigms • Supporting computational scientists in doing their job • Parallel computing training: • Workshops • UG/PG courses at NUST • Computational resources: • Three small-scale compute clusters
Computational Resources A Tier-2 compliant data centre …
MPJ Express† • MPJ Express is an MPI-like library for parallel Java applications on compute clusters/clouds and multicore processors • The software is open source, currently developed and maintained at NUST: • Available for free from the SourceForge website • http://mpj-express.org • A recent success story: • MPJ Express has been adopted by SHARCNET (https://www.sharcnet.ca), an HPC consortium of Canadian academic institutions † Shafi et al, Nested Parallelism for Multi-core HPC Systems using Java, JPDC, pp 532-545, 69(6), June 2009
Presentation Outline • Introduction to the HPC research group • Emergence of Cluster of GPUs • Programming GPUs • Introduction to the Gadget-2 Code • Preliminary Performance Evaluation • Summary
Cluster of GPUs • Increasing interest in building parallel hardware using a mixture of heterogeneous computing devices: • Multicore CPUs • Manycore GPUs • On such hardware—also known as a cluster of GPUs—the compute-intensive parts of the user application are executed on GPUs • The current generation of GPUs is capable of executing general-purpose computation in a massively parallel fashion
The TOP10 in the TOP500† List † The TOP500 Project: http://top500.org
A Typical Cluster of GPUs [diagram: eight nodes (Node 1 through Node 8) connected by an interconnect, each node containing a CPU, a GPU, and memory]
Presentation Outline • Introduction to the HPC research group • Emergence of Cluster of GPUs • Programming GPUs • Introduction to the Gadget-2 Code • Preliminary Performance Evaluation • Summary
Graphics Processing Units (GPUs) • A GPU is a specialized processor that offloads graphics rendering from the host CPU: • Or at least this used to be the “old” definition • Modern GPUs are capable of executing general-purpose computation • The two leading GPU manufacturers are: • Nvidia • AMD
Floating-Point Operations per Second for the CPU and the GPU Image courtesy: NVIDIA CUDA Programming Guide version 3.0
CPU versus GPU Images courtesy: NVIDIA CUDA Programming Guide version 3.0
GPU Architecture and its Programming [diagram: the CUDA memory hierarchy — the host beside a device grid of thread blocks; each block has its own shared memory, each thread its own registers and local memory, and all blocks access the global, constant, and texture memories] Image courtesy: Ibid
General Purpose GPU (GPGPU) Computing Image courtesy: Ibid
Compute Unified Device Architecture (CUDA) • CUDA is the computing engine in Nvidia GPUs that is accessible to software developers through industry-standard programming languages: • CUDA offers simple extensions to the C/C++ language (a minimal example follows below) • Nvidia’s way of programming GPUs • Explicit memory management • OpenCL: • A vendor-neutral way of programming GPUs
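To make the programming model concrete, here is a minimal CUDA sketch (my own illustration, not taken from the slides): a kernel written with CUDA's C extensions, launched over many threads, with the explicit host/device memory management mentioned above.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Minimal CUDA kernel: each thread scales one array element.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h = (float *) malloc(bytes);
        for (int i = 0; i < n; i++)
            h[i] = 1.0f;

        // Explicit memory management: allocate on the device, copy input over.
        float *d;
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

        // Copy the result back and release device memory.
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]);

        cudaFree(d);
        free(h);
        return 0;
    }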
Presentation Outline • Introduction to the HPC research group • Emergence of Cluster of GPUs • Programming GPUs • Introduction to the Gadget-2 Code • Preliminary Performance Evaluation • Summary
GPU-enabling Gadget-2 • Gadget-2: • A free production code for cosmological N-body and hydrodynamic computations†. • Written in C—already fully parallelized using MPI. • Versions of it have been used in various astrophysics research papers, including the Millennium Simulation. • Our aim in this study is to execute the compute-intensive parts of this code on the GPU: • Gadget-2 is an N-body simulation code whose computation is highly irregular † http://www.mpa-garching.mpg.de/gadget
Dynamics in Gadget • Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity, plus hydrodynamic forces on gas • Classical N-body problem
Gadget Main Loop • Schematic view of the Gadget code:

    ... Initialize ...
    while (not done) {
        move_particles();               // update positions
        domain_Decomposition();
        compute_accelerations();
        advance_and_find_timesteps();   // update velocities
    }

• Most of the interesting work happens in domain_Decomposition() and compute_accelerations().
Computing Forces • The function compute_accelerations() must compute the forces experienced by all particles. • In particular, it must compute the gravitational force. • Because gravity is a long-range force, every particle affects every other. Naively, the total cost is O(N²) (see the sketch below). • With N ≈ 10¹⁰, this is infeasible. • Need some kind of approximation ...
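To make the O(N²) cost concrete, a minimal sketch of the naive direct summation (my own illustration; the Particle type is hypothetical, not Gadget-2's):

    #include <math.h>

    typedef struct { double pos[3], acc[3], mass; } Particle;   // hypothetical

    // Naive pairwise gravity: the inner loop visits all N particles for each
    // of the N targets, hence O(N^2) work overall.
    void direct_gravity(Particle *p, int n, double G, double eps2)
    {
        for (int i = 0; i < n; i++) {
            p[i].acc[0] = p[i].acc[1] = p[i].acc[2] = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i)
                    continue;
                double dx = p[j].pos[0] - p[i].pos[0];
                double dy = p[j].pos[1] - p[i].pos[1];
                double dz = p[j].pos[2] - p[i].pos[2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;   // softened distance
                double f  = G * p[j].mass / (r2 * sqrt(r2));
                p[i].acc[0] += f * dx;
                p[i].acc[1] += f * dy;
                p[i].acc[2] += f * dz;
            }
        }
    }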
Barnes-Hut Tree Recursively split space into an octree (a quadtree in this 2D example), until no node contains more than 1 particle.
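Once built, the tree makes the force calculation cheap: a node that subtends a small enough angle is accepted as a single pseudo-particle at its centre of mass. A recursive sketch of this walk follows (my own illustration; the Node layout and the s/r < theta opening criterion are the textbook Barnes-Hut version, not Gadget-2's exact code):

    #include <math.h>
    #include <stddef.h>

    // Hypothetical octree node: centre of mass, total mass, side length.
    typedef struct Node {
        double com[3], mass, size;
        struct Node *child[8];      // NULL where there is no child
    } Node;

    // Accumulate the acceleration on a particle at pos by walking the tree.
    void walk(const Node *node, const double pos[3], double theta,
              double G, double eps2, double acc[3])
    {
        if (node == NULL || node->mass == 0.0)
            return;
        double dx = node->com[0] - pos[0];
        double dy = node->com[1] - pos[1];
        double dz = node->com[2] - pos[2];
        double r2 = dx*dx + dy*dy + dz*dz + eps2;
        double r  = sqrt(r2);

        int is_leaf = 1;
        for (int c = 0; c < 8; c++)
            if (node->child[c] != NULL)
                is_leaf = 0;

        if (is_leaf || node->size / r < theta) {
            // Node is distant (or a single particle): accept it whole.
            double f = G * node->mass / (r2 * r);
            acc[0] += f * dx;
            acc[1] += f * dy;
            acc[2] += f * dz;
        } else {
            // Node is too close: open it and visit the children.
            for (int c = 0; c < 8; c++)
                walk(node->child[c], pos, theta, G, eps2, acc);
        }
    }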
Distribution of BH Tree in Gadget† † Springel et al, The cosmological simulation code GADGET-2, Mon.Not.Roy.Astron.Soc. 364 (2005) 1105-1134
GPU-enabling the Gadget-2 code • The big idea is “to perform the tree walk (force calculation) on the GPU instead of the CPU” • Steps: • Copy particles and tree information to the GPU memory • Perform the tree walk in parallel for multiple particles on the GPU (see the sketch below) • Copy the particles array back to the CPU
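A hedged sketch of these three steps (all names here, GpuParticle, GpuNode, the next/more indices, THETA, EPS2, are hypothetical illustrations, not the actual implementation). The tree is flattened into an index-linked array so each GPU thread can walk it iteratively, without recursion; masses are taken in G = 1 units:

    #include <cuda_runtime.h>
    #include <math.h>

    #define THETA 0.5f    // opening angle
    #define EPS2  1e-4f   // softening

    typedef struct { float pos[3], acc[3]; } GpuParticle;
    typedef struct {
        float com[3], mass, size;
        int next;   // index of next sibling (walk resumes here if node accepted)
        int more;   // index of first child (-1 for a leaf)
    } GpuNode;

    // One thread performs the whole tree walk for one particle.
    __global__ void treewalk_kernel(GpuParticle *p, const GpuNode *tree, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float ax = 0.f, ay = 0.f, az = 0.f;
        int node = 0;                          // start at the root
        while (node >= 0) {
            GpuNode nd = tree[node];
            float dx = nd.com[0] - p[i].pos[0];
            float dy = nd.com[1] - p[i].pos[1];
            float dz = nd.com[2] - p[i].pos[2];
            float r2 = dx*dx + dy*dy + dz*dz + EPS2;
            float r  = sqrtf(r2);
            if (nd.more < 0 || nd.size / r < THETA) {
                float f = nd.mass / (r2 * r);  // accept node as pseudo-particle
                ax += f * dx; ay += f * dy; az += f * dz;
                node = nd.next;                // continue with the next sibling
            } else {
                node = nd.more;                // open node: descend to first child
            }
        }
        p[i].acc[0] = ax; p[i].acc[1] = ay; p[i].acc[2] = az;
    }

    void gpu_compute_gravity(GpuParticle *h_part, int n, GpuNode *h_tree, int nn)
    {
        GpuParticle *d_part; GpuNode *d_tree;
        // Step 1: copy particles and tree information to GPU memory.
        cudaMalloc(&d_part, n * sizeof(GpuParticle));
        cudaMalloc(&d_tree, nn * sizeof(GpuNode));
        cudaMemcpy(d_part, h_part, n * sizeof(GpuParticle), cudaMemcpyHostToDevice);
        cudaMemcpy(d_tree, h_tree, nn * sizeof(GpuNode), cudaMemcpyHostToDevice);
        // Step 2: perform the tree walk in parallel for many particles.
        treewalk_kernel<<<(n + 255) / 256, 256>>>(d_part, d_tree, n);
        // Step 3: copy the particle array back to the CPU.
        cudaMemcpy(h_part, d_part, n * sizeof(GpuParticle), cudaMemcpyDeviceToHost);
        cudaFree(d_part); cudaFree(d_tree);
    }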
GPU-enabled Gadget-2 code executing on a cluster of four nodes. Each node has a CPU and a GPU. [diagram: MPI processes 0 to 3 on Nodes A to D, each node with a CPU, DRAM, and a GPU running thread blocks (Block 0, Block 1), all connected by the cluster interconnect; a zoomed-in thread block on process 0 shows one thread per particle: thread (0,0) performs the tree walk for particle 1, thread (1,0) for particle 2, and so on up to thread (BlkWidth,1)]
Presentation Outline • Introduction to the HPC research group • Emergence of Cluster of GPUs • Programming GPUs • Introduction to the Gadget-2 Code • Preliminary Performance Evaluation • Summary
Preliminary Performance Evaluation [chart: the GPU-enabled code reaches a 14.1x speedup over the 1x CPU baseline]
Optimization • Reduce the memory transfer between the CPU and the GPU • “42 TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence” by Hamada et al: • Multiple tree walks use the same “interaction list” • Perform the tree walk “level by level”: • Currently implementing this algorithm
Summary • Our group focuses on the Computer Science (CS) aspects of HPC: • Languages, libraries, and scientific software • Need for collaboration between CS and the scientific community on a national scale before reaching out to industry: • At NUST we are going in this direction • “Cluster of GPUs” is an interesting trend, which is likely to continue • We discussed GPU-enabling the Gadget-2 code
TreePM • Artificially split the gravitational potential into two parts:

$$\frac{Gm}{r} \;=\; \underbrace{\frac{Gm}{r}\,\operatorname{erfc}\!\left(\frac{r}{2r_s}\right)}_{\Phi_{\text{short}}(r)} \;+\; \underbrace{\frac{Gm}{r}\left[1-\operatorname{erfc}\!\left(\frac{r}{2r_s}\right)\right]}_{\Phi_{\text{long}}(r)}$$

• Calculate Φshort using BH; calculate Φlong by projecting the particle mass distribution onto a mesh, then working in Fourier space.
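As an illustration of what the split buys (my own sketch, consistent with the formula above, not code from the slides): the short-range force inherits an erfc-based cutoff, so it vanishes for r much larger than r_s and the BH walk only has to visit nearby nodes.

    #include <math.h>

    // Magnitude of the short-range force from Phi_short above, with u = r/(2 r_s):
    //   F_short(r) = (G m / r^2) * [ erfc(u) + (2u / sqrt(pi)) * exp(-u^2) ]
    // The bracketed factor decays rapidly for r >> r_s, so distant tree nodes
    // contribute essentially nothing and can be skipped.
    double short_range_force(double G, double m, double r, double rs)
    {
        double u = r / (2.0 * rs);
        double cutoff = erfc(u) + (2.0 * u / sqrt(M_PI)) * exp(-u * u);
        return G * m / (r * r) * cutoff;
    }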
Domain Decomposition • Can’t just divide space in a fixed way, because some regions will have many more particles than others – poor load balancing. • Can’t just divide particles in a fixed way, because particles move independently through space, and we want to keep physically close particles on the same processor, as far as practical – communication problem.
Peano-Hilbert Curve † † Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
Decomposition based on P-H Curve • Sort particles by position on the Peano-Hilbert curve, then divide them evenly into P domains (see the sketch below). • Characteristics of this decomposition: • Good load balancing. • Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space. • Domains have a relatively simple mapping to BH octree nodes.
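A serial sketch of the idea (peano_hilbert_key is a hypothetical stand-in for the key computation; the real code does all of this with a distributed sort, as the next slide notes):

    #include <stdlib.h>

    typedef struct { double pos[3]; unsigned long key; } PHParticle;

    // Hypothetical: maps a 3D position to its 1D position along the P-H curve.
    unsigned long peano_hilbert_key(const double pos[3]);

    static int cmp_key(const void *a, const void *b)
    {
        unsigned long ka = ((const PHParticle *) a)->key;
        unsigned long kb = ((const PHParticle *) b)->key;
        return (ka > kb) - (ka < kb);
    }

    // Sort particles along the curve, then cut the curve into P equal pieces:
    // particle i (in sorted order) lands in domain i*P/n.
    void decompose(PHParticle *p, int n, int P, int *domain_of)
    {
        for (int i = 0; i < n; i++)
            p[i].key = peano_hilbert_key(p[i].pos);
        qsort(p, n, sizeof(PHParticle), cmp_key);
        for (int i = 0; i < n; i++)
            domain_of[i] = (int) ((long long) i * P / n);
    }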
Communication in Gadget • We can identify four recurring “non-trivial” patterns: • Distributed sort of particles according to their P-H key: implements the domain decomposition. • Export of particles to other nodes for calculation of remote contributions to force, density, etc., and retrieval of the results (see the sketch below). • Projection of particle density onto a regular grid for calculation of Φlong; distribution of the results back to the irregularly distributed particles. • Distributed Fast Fourier Transform.
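As an illustration of the second pattern (the Export type and buffer layout are hypothetical; per-destination send counts are assumed to be tallied already), the standard count-then-data exchange with MPI_Alltoall followed by MPI_Alltoallv:

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { double pos[3]; double result; } Export;   // hypothetical

    // Each rank first tells every other rank how many particles it is
    // exporting, then the particle data itself moves in one collective call.
    void exchange_exports(Export *sendbuf, int *sendcount, int nprocs,
                          Export **recvbuf_out, int *nrecv_out)
    {
        int *recvcount = malloc(nprocs * sizeof(int));
        MPI_Alltoall(sendcount, 1, MPI_INT, recvcount, 1, MPI_INT,
                     MPI_COMM_WORLD);

        int *sdispl = malloc(nprocs * sizeof(int));
        int *rdispl = malloc(nprocs * sizeof(int));
        int *sbytes = malloc(nprocs * sizeof(int));
        int *rbytes = malloc(nprocs * sizeof(int));
        int nrecv = 0, soff = 0, roff = 0;
        for (int i = 0; i < nprocs; i++) {
            sbytes[i] = sendcount[i] * (int) sizeof(Export);
            rbytes[i] = recvcount[i] * (int) sizeof(Export);
            sdispl[i] = soff;
            rdispl[i] = roff;
            soff += sbytes[i];
            roff += rbytes[i];
            nrecv += recvcount[i];
        }
        Export *recvbuf = malloc(nrecv * sizeof(Export));
        MPI_Alltoallv(sendbuf, sbytes, sdispl, MPI_BYTE,
                      recvbuf, rbytes, rdispl, MPI_BYTE, MPI_COMM_WORLD);

        *recvbuf_out = recvbuf;
        *nrecv_out   = nrecv;
        free(recvcount); free(sdispl); free(rdispl);
        free(sbytes); free(rbytes);
    }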