
Multicore Optimization of Particle-to-Grid Computations



  1. Multicore Optimization of Particle-to-Grid Computations • Lenny Oliker1, Kamesh Madduri1, Samuel Williams1, Eun-Jin Im4, Khaled Ibrahim1, Stephane Ethier2, John Shalf1, Katherine Yelick1,3 • 1 CRD/NERSC, Lawrence Berkeley National Laboratory • 2 Princeton Plasma Physics Laboratory • 3 Computer Science Division, UC Berkeley • 4 Kookmin University, Seoul, Korea

  2. Fusion Powers the Sun and Stars • Can we harness fusion power on Earth?

  3. Fusion Reaction • The two ions need to break the Coulomb barrier to get close enough to fuse together.

  4. The Case for Fusion Energy • Worldwide demand for energy continues to increase • Due to population increases and economic development • Worldwide oil and gas production is near or past peak • Need for alternative sources: coal, fission, fusion • Increasing evidence that the release of greenhouse gases is causing global climate change • This makes nuclear (fission or fusion) preferable to fossil (coal) • Fusion has clear advantages over fission • Inherently safe (no China syndrome) • No weapons proliferation considerations (security) • Greatly reduced waste disposal problems (no Yucca Mountain) • Can produce electricity and hydrogen • Abundant fuel, available to all nations • Deuterium and lithium supplies will last thousands of years

  5. The Most Successful Magnetic Confinement Configuration is the Tokamak • [Figure: tokamak schematic showing the plasma, magnets, and magnetic field]

  6. Plasma Science Challenges • Macroscopic Stability • What limits the pressure in plasmas? • Wave-particle Interactions • How do particles and plasma waves interact? • Microturbulence & Transport • What causes plasma transport? • Plasma-material Interactions • How can high-temperature plasma and material surfaces co-exist?

  7. Gyrokinetic Toroidal Code (GTC) • GTC: a 3D, fully self-consistent Particle-in-Cell (PIC) code that solves the kinetic equation in a toroidal geometry. • ITER: an international collaboration to build the first fusion science experiment to achieve a self-sustaining fusion reaction, the "burning plasma". • GTC is one of the numerical tools used to predict the efficiency of energy confinement in ITER. • [Figure: donut-shaped ("tokamak") reactor]

  8. The Memory Bottleneck in Parallel GTC • Two levels of parallelization in the GTC MPI implementation: 1D domain decomposition (along the torus, typically 64- to 128-way) and particle decomposition within each toroidal domain. • Particle decomposition among MPI processes in each domain requires replication of the poloidal plane. • The massive memory footprint due to grid replication hinders GTC scaling for ITER-sized devices. • [Figures: 3D toroidal mesh; cross-section (poloidal plane), uniform in ψ-θ]

  9. Our Contributions • Memory-efficient multicore optimizations for the GTC charge deposition step. • Explored 13 variants, with different grid decomposition and synchronization strategies. • Experimental study and performance tuning on cache-based multi-socket, multicore systems: AMD Barcelona, Intel Nehalem, Sun Victoria Falls. • The memory-efficient strategies avoid a P-way increase in memory footprint, enabling us to solve large problem instances. • For small problem sizes, we achieve a performance improvement of 1.5-4.4x over the optimized MPI implementation with particle decomposition.

  10. Charge deposition Kernel Overview Challenges: • Improve locality of grid access • Distribute work uniformly (load balancing) • Efficiently resolve data dependencies • Minimize memory requirements

  11. Illustration of the PIC "Scatter" Step (Charge Deposition / Particle-to-Grid Interpolation) • The PIC cycle per time step Δt: "Scatter" — weight particles to the field, (xi, vi) → (ρ, J)j; "Solve" — field solve, (ρ, J)j → (E, B)j; "Gather" — weight the field to particles, (E, B)j → Fj; "Push" — move particles, Fi → vi → xi. • PIC scales as N instead of N²: particles interact with the electromagnetic field on the grid rather than with each other directly. • Grid memory accesses depend on the order in which particles are processed. • A parallel implementation with a shared grid means multiple threads update grid locations in parallel. • GTC updates are based on a varying Larmor radius.
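To make the four-step cycle concrete, here is a minimal C sketch of one PIC time step; the routine names and argument lists are illustrative placeholders, not GTC's actual interfaces.

```c
/* Hypothetical skeleton of one PIC time step of length dt.
 * All routine names and signatures are illustrative only. */
void scatter(const double *particles, long np, double *rho, long ngrid);
void solve(const double *rho, double *field, long ngrid);
void gather_and_push(double *particles, long np, const double *field,
                     long ngrid, double dt);

void pic_step(double *particles, long np, double *rho, double *field,
              long ngrid, double dt)
{
    scatter(particles, np, rho, ngrid);         /* (x_i, v_i) -> (rho, J)_j       */
    solve(rho, field, ngrid);                   /* (rho, J)_j -> (E, B)_j         */
    gather_and_push(particles, np, field,
                    ngrid, dt);                 /* (E, B)_j -> F_i; advance x, v  */
}
```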

  12. Illustration of the GTC "Gyrokinetic" Charge Deposition Step: Irregular Memory Accesses • The gyrating motion of a charged particle (ion) is replaced by a moving ring. • Scatter step: the ring is approximated by four points, each assigned a quarter of the charge to deposit on the neighboring grid coordinates. • The charge at 8-32 distinct grid points is updated by each ion. • The gyrokinetic radius (Larmor radius) of a particle varies during the simulation.

  13. Multithreaded Parallelization • Parallelization is complemented by grid decomposition and efficient synchronization, to reduce shared-array update overhead and enhance locality. • Charge deposition pseudo-code — for each particle: #1. Get position (5 loads from particle arrays, 3 loads from grid arrays, 2 stores, 1 sqrt, ~20 flops). #2. Perform four-point gyrokinetic averaging (8-32 updates to the charge density grid array, 20 stores to auxiliary particle arrays, 40 loads from grid arrays, ~160 flops). • Ideal arithmetic intensity (flop/byte) = 0.6.
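The per-particle work in step #2 can be pictured with a minimal C sketch of four-point gyro-averaging onto a flattened 2D poloidal grid. The (ψ, θ) indexing, bilinear weights, and all names below are illustrative assumptions; the second bounding ζ-plane and full boundary handling (which give the 8-32 updates per ion quoted above) are omitted.

```c
#include <math.h>

/* Hypothetical four-point gyro-averaged deposit onto one poloidal plane.
 * grid is a flattened npsi x ntheta array; dpsi/dtheta are cell sizes.
 * Assumes psi, theta >= 0; radial clamping is crude and for illustration only. */
void deposit_particle(double psi, double theta, double rho_larmor,
                      double charge, double *grid,
                      int npsi, int ntheta, double dpsi, double dtheta)
{
    for (int k = 0; k < 4; k++) {
        /* Sample the gyro-ring at four points, each carrying 1/4 of the charge. */
        double ang = k * (M_PI / 2.0);
        double p = psi   + rho_larmor * cos(ang);
        double t = theta + rho_larmor * sin(ang);

        int ip = (int)(p / dpsi);
        if (ip > npsi - 2) ip = npsi - 2;          /* clamp at the outer surface */
        int it = (int)(t / dtheta) % ntheta;
        double wp = p / dpsi   - (int)(p / dpsi);  /* bilinear weights */
        double wt = t / dtheta - (int)(t / dtheta);

        double q = 0.25 * charge;
        grid[ ip      * ntheta +  it              ] += q * (1.0 - wp) * (1.0 - wt);
        grid[ ip      * ntheta + (it + 1) % ntheta] += q * (1.0 - wp) * wt;
        grid[(ip + 1) * ntheta +  it              ] += q * wp * (1.0 - wt);
        grid[(ip + 1) * ntheta + (it + 1) % ntheta] += q * wp * wt;
    }
}
```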

  14. Decomposition Strategies: Shared Grid vs. Full Replication • 1. Shared grid: one grid shared by all threads. The most memory-efficient approach and the simplest Pthreads approach, but it requires synchronization for charge updates. • 2. Full replication: each thread maintains a private copy of the grid (P private replicates). No synchronization is required, but a P-way reduction is needed to obtain the final result. Similar to the reference GTC MPI implementation. • [Figure: thread-private grid copies reduced into a single shared grid]
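A minimal sketch of the full-replication variant's reduction phase, assuming a Pthreads setup in which thread t deposits into replicates[t] during the scatter; the struct layout and names are illustrative.

```c
#include <pthread.h>

/* Hypothetical P-way reduction for the full-replication strategy.
 * Each of the P threads owns replicates[tid], a private grid copy that was
 * filled without any synchronization during the deposit phase. */
typedef struct {
    double **replicates;   /* P private grid copies          */
    double  *shared_grid;  /* final charge density           */
    long     ngrid;        /* grid points per copy           */
    int      tid, nthreads;
    pthread_barrier_t *barrier;
} reduce_args_t;

void reduce_replicates(reduce_args_t *a)
{
    /* Wait until every thread has finished depositing into its copy. */
    pthread_barrier_wait(a->barrier);

    /* Each thread reduces a disjoint chunk of grid points over all P copies,
     * so the reduction itself needs no locking. */
    long chunk = (a->ngrid + a->nthreads - 1) / a->nthreads;
    long lo = (long)a->tid * chunk;
    long hi = lo + chunk < a->ngrid ? lo + chunk : a->ngrid;

    for (long g = lo; g < hi; g++) {
        double sum = 0.0;
        for (int t = 0; t < a->nthreads; t++)
            sum += a->replicates[t][g];
        a->shared_grid[g] = sum;
    }
}
```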

  15. Decomposition Strategies: Grid Partitioning • Particle displacement and velocity are smallest in the radial (ψ) direction, which motivates a radial partitioning of the grid with a radially-binned particle ordering. • The particle Larmor radius is smaller than the number of grid flux surfaces, which motivates creating local ghost flux surfaces for each thread to further reduce shared-grid updates. • 3. Partitioned grid: thread-local updates go to the partitioned grid, non-local updates to the shared grid. One two-way (local-shared) reduction step is required to aggregate charge density values. • 4. Partitioned grid with ghost flux surfaces: thread-local updates go to the partitioned grid or ghost surfaces, non-local updates to the shared grid. Two reduction steps (ghost-ghost, local-ghost-shared) are required to aggregate the final charge density.
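One way to picture the routing decision in the partitioned-grid-with-ghosts variant is the C sketch below. The slab bounds, ghost count, and the atomic_add_double helper (one possible implementation appears under slide 16) are all illustrative assumptions.

```c
/* Hypothetical routing of a single charge update under the partitioned-grid-
 * with-ghosts strategy. Each thread owns radial surfaces [psi_lo, psi_hi)
 * plus nghost ghost surfaces on either side, held in a private array priv;
 * anything farther away goes to the synchronized shared grid. */
void atomic_add_double(double *addr, double val);   /* see sketch under slide 16 */

void route_update(int ipsi, int itheta, double val,
                  double *priv, double *shared,
                  int psi_lo, int psi_hi, int nghost, int ntheta)
{
    if (ipsi >= psi_lo - nghost && ipsi < psi_hi + nghost) {
        /* Owned slab or its ghost surfaces: private, no synchronization.
         * Ghost contributions are merged in a later reduction step. */
        priv[(ipsi - (psi_lo - nghost)) * ntheta + itheta] += val;
    } else {
        /* Rare distant update (large Larmor radius): shared grid. */
        atomic_add_double(&shared[ipsi * ntheta + itheta], val);
    }
}
```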

  16. Synchronization Strategies • Atomics: atomically increment one grid point (implemented in assembly code); ~32 atomic calls per particle. • Fine-grained locks: individually lock one grid point (POSIX thread mutex locks); ~32 lock calls per particle. • Medium-grained locks: lock all ζ for a given θ-ψ (POSIX thread mutex locks); ~16 lock calls per particle. • Coarse-grained locks: lock all θ-ζ for a given ψ (POSIX thread mutex locks); ~8 lock calls per particle. • Trade-off between lock overhead and contention: moving toward finer granularity, lock overhead increases but contention decreases.
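The atomics strategy above used hand-written x86 assembly for the floating-point increments; a portable, purely illustrative equivalent is a compare-and-swap loop on the 64-bit image of the double, sketched here with GCC's __atomic builtins.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative atomic increment of a double via a CAS loop on its 64-bit
 * bit pattern (the original work used x86 assembly instead). */
static inline void atomic_add_double(double *addr, double val)
{
    uint64_t old_bits, new_bits;
    double old_val, new_val;
    do {
        old_bits = __atomic_load_n((uint64_t *)addr, __ATOMIC_RELAXED);
        memcpy(&old_val, &old_bits, sizeof old_val);
        new_val = old_val + val;
        memcpy(&new_bits, &new_val, sizeof new_bits);
    } while (!__atomic_compare_exchange_n((uint64_t *)addr,
                                          &old_bits, new_bits, 0,
                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}
```

A fine-grained-lock variant would instead wrap the plain += in a pthread_mutex_lock/unlock pair on a per-grid-point mutex, trading the CAS retry loop for lock overhead.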

  17. Other Performance Optimizations • Serial performance: address calculation and charge deposition loops fused to improve temporal cache locality; data structure padding to reduce misses due to cache-line fragmentation; partial SIMDization of charge updates (on x86 systems). • Grid decomposition: finer partitioning of grid points among threads. • Parallelization/memory affinity: NUMA-aware particle initialization; affinity binding (pinning threads to cores, static scheduling of particle updates).
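A minimal Linux/Pthreads sketch of the affinity-binding and NUMA-aware-initialization ideas, assuming a naive thread-id-to-core mapping; the names are illustrative, and a real code would account for the socket topology.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Illustrative affinity binding plus first-touch initialization:
 * pin the calling thread to a core, then have it write its own chunk of the
 * particle array so those pages are allocated on the local NUMA node. */
void pin_and_first_touch(int tid, double *particle_data, long lo, long hi)
{
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(tid, &cpus);                          /* naive tid -> core mapping */
    pthread_setaffinity_np(pthread_self(), sizeof cpus, &cpus);

    for (long i = lo; i < hi; i++)                /* first touch => local pages */
        particle_data[i] = 0.0;
}
```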

  18. Experimental Setup: GTC Problem Instances • We assume a radially-binned particle distribution; the rate of change of particle position is smallest in the radial direction, and periodic radial binning is essential for performance. • We experiment with different grid sizes and particles-per-cell counts.
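The periodic radial binning can be pictured as a counting sort of particles by radial (ψ) bin, as in the illustrative sketch below; the particle layout and field names are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative radial binning: counting sort of particles by psi bin so that
 * consecutive particles deposit onto nearby flux surfaces. */
typedef struct { double psi, theta, zeta, v_par, weight; } ion_t;

void bin_particles_by_psi(ion_t *p, ion_t *scratch,
                          long np, int nbins, double psi_max)
{
    long *offset = calloc((size_t)nbins + 1, sizeof *offset);

    for (long i = 0; i < np; i++) {               /* histogram of bin sizes */
        int b = (int)(p[i].psi / psi_max * nbins);
        if (b >= nbins) b = nbins - 1;
        offset[b + 1]++;
    }
    for (int b = 0; b < nbins; b++)               /* prefix sum -> start offsets */
        offset[b + 1] += offset[b];
    for (long i = 0; i < np; i++) {               /* scatter into bin order */
        int b = (int)(p[i].psi / psi_max * nbins);
        if (b >= nbins) b = nbins - 1;
        scratch[offset[b]++] = p[i];
    }
    memcpy(p, scratch, (size_t)np * sizeof *p);
    free(offset);
}
```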

  19. Architectural Details of Parallel Systems • [Table: architectural details of the evaluated systems — Sun UltraSparc T5140 (Niagara2), AMD Opteron 2356 (Barcelona), and Intel Gainestown X5550 (Nehalem) — including double-precision peak GFlop/s.]

  20. Comparative Performance of Various Parallel Strategies • Problem size B (150K grid points), 5 particles per cell. • Baseline version: Pthreads parallelism, data layout optimizations to improve cache performance, NUMA-aware memory allocation. • [Chart: performance (GFlop/s) of the MPI, reduction, shared, partitioned, and partitioned + ghosts variants on Barcelona, Nehalem, and Niagara2.]

  21. Comparative Performance of Various Parallel Strategies (cont.) • Problem size B (150K grid points), 5 particles per cell. • Observation: the baseline memory-efficient and reduction versions outperform MPI. • [Chart as on slide 20.]

  22. Comparative Performance of Various Parallel Strategies (cont.) • Problem size B (150K grid points), 5 particles per cell. • Observations: the baseline memory-efficient and reduction versions outperform MPI; Nehalem is comparatively faster. • [Chart as on slide 20.]

  23. Comparative Performance of Various Parallel Strategies (cont.) • Problem size B (150K grid points), 5 particles per cell. • Observations: the baseline memory-efficient and reduction versions outperform MPI; Nehalem is comparatively faster; atomics is the best-performing synchronization strategy, faster than Pthreads-based fine-grained synchronization. • [Chart as on slide 20.]

  24. Parallel Performance with Optimizations • Problem size B (150K grid points), 5 particles per cell. • + Thread pinning: significant performance improvements on Nehalem and Barcelona; the partitioned grid + ghosts strategy benefits most. • [Chart: performance (GFlop/s) of the MPI, reduction, shared, partitioned, and partitioned + ghosts variants on Barcelona, Nehalem, and Niagara2.]

  25. Parallel Performance with Optimizations (cont.) • Problem size B (150K grid points), 5 particles per cell. • + SIMDization: some improvement seen on Nehalem; Barcelona doesn't benefit much. • [Chart as on slide 24.]

  26. Grid Memory Footprint • Grid size B (150K grid points); particle array memory utilization is the same for all cases. • Memory requirements of the full replication version are prohibitive: naively using multicore is infeasible. • [Chart: grid memory footprint of each variant on Niagara2 and Nehalem.]

  27. Additionally Evaluated Architectures • NVIDIA Fermi, Intel Nehalem-EX, AMD Istanbul.

  28. Charge Deposition Optimizations Summary (Double Precision) • Grid size "C", 20 particles per cell. • [Chart: double precision plus sorting of charge updates.]

  29. Charge deposition optimizations summary (single and mixed-precision)

  30. Performance: Current vs. Prior-Generation Architectures

  31. Charge Deposition + Push: Cross-Architectural Comparison • Problem size: C20. • Charge deposition notes: the CPU baseline is shared grid + atomics; the GPU baseline is DP + atomicCAS; the optimized CPU version is pure DP, while the optimized GPU version uses mixed precision.

  32. (Auto)Tuning parameters include: • Initial particle distribution/density • Larmor radius distribution/maximum • Grid size and geometry • Particle binning/sorting (frequency) • Synchronization granularity • Grid decomposition: ghost surfaces, replication • Mixed precision • CPU atomics (SIMD atomics) • Low-level optimization (NUMA, SSE) • On-the-fly computation vs. read from DRAM • GPU atomics (CAS vs. integer atomic increment) • GPU cooperative threading (memory coalescing)
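One way an auto-tuner could organize these knobs is a simple parameter bundle that it sweeps while timing the charge-deposition kernel; the struct below is purely an illustrative assumption, not the tuner described here.

```c
/* Illustrative bundle of tuning parameters an auto-tuner might sweep. */
typedef struct {
    int bin_frequency;        /* particle re-sorting interval, in time steps  */
    int ghost_surfaces;       /* ghost flux surfaces per thread (0 = none)    */
    int sync_strategy;        /* 0 atomics, 1 fine, 2 medium, 3 coarse locks  */
    int replicate_grid;       /* 1 = full per-thread grid replication         */
    int mixed_precision;      /* single-precision deposits, DP accumulation   */
    int use_simd;             /* SSE charge-update path on x86                */
    int gpu_atomic_mode;      /* 0 = CAS, 1 = integer atomic increment        */
} tuning_params_t;
```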

  33. Conclusions • Our memory-efficient charge deposition approaches (with grid decomposition and synchronization) enable the solution of large problem instances on current multicore systems. • We achieve significant performance improvement over the memory-inefficient MPI implementation. • The maximum Larmor radius scales as (# of radial flux surfaces)/16, leading to load imbalance with a static particle decomposition on highly threaded systems. • Pipelined atomic intrinsic support would improve the performance of the memory-efficient Pthreads approach.

  34. Conclusions (cont.) • The CUDA strategy of geometric decomposition plus parallel primitives like sorting/scanning is inadequate for these codes. • Large SMPs require grid replication for load balancing and fast intra-socket synchronization. • Nehalem-EX delivered the best performance, but Istanbul delivered more flops relative to peak (and relative to Barcelona). • Fermi improved 1.1x-2.5x vs. Tesla, but was slower than the (24/32-core) Istanbul/Nehalem-EX. • Future work will require a 2D particle and grid decomposition to mitigate the complexities of locality, data dependencies, and load balancing. • Future work will also identify GTC parameters to tune and techniques to automatically select algorithmic variants.

  35. Acknowledgments • This work was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft (Award #024263) and Intel (Award #024894), and by matching funding through U.C. Discovery (Award #DIG07-10227).
