Stratified Magnetohydrodynamics Accelerated Using GPUs: SMAUG
The Sheffield Advanced Code
• The Sheffield Advanced Code (SAC) is a novel fully non-linear MHD code based on the Versatile Advection Code (VAC)
• Designed for simulations of linear and non-linear wave propagation in a gravitationally strongly stratified magnetised plasma
• Shelyag, S.; Fedun, V.; Erdélyi, R., Astronomy and Astrophysics, Volume 486, Issue 2, 2008, pp. 655-662
Numerical Diffusion
• Central differencing can generate numerical instabilities
• It is difficult to find solutions for shocked systems
• We define a hyperviscosity parameter from the ratio of the maximum third-order forward difference of a variable to its maximum first-order forward difference
• By tracking the evolution of the hyperviscosity we can identify numerical noise and apply smoothing where necessary
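As a schematic illustration only (the exact stencils and coefficients are those given in Shelyag et al. 2008), a hyperviscosity coefficient of this kind for a variable u in direction i can be written as

    \nu_i(u) = c_{\mathrm{hyp}}\, \Delta x_i\, c_{\mathrm{max}}\,
               \frac{\max\lvert \Delta^{(3)} u \rvert}{\max\lvert \Delta^{(1)} u \rvert},
    \qquad
    \Delta^{(1)} u_j = u_{j+1} - u_j,
    \qquad
    \Delta^{(3)} u_j = u_{j+2} - 3u_{j+1} + 3u_j - u_{j-1},

where c_max is a characteristic (e.g. maximum wave) speed and c_hyp a tunable constant. The ratio is large only where grid-scale noise is present, so the extra diffusion is applied selectively rather than over the whole mesh.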
Why MHD Using GPUs?
• Consider a simplified 2D problem
• Solving the flux equation
• Derivatives computed using central differencing
• Time stepping using Runge-Kutta
• Excellent scaling with GPUs (see the sketch below), but:
• Central differencing requires numerical stabilisation
• Stabilisation with GPUs is trickier, requiring
• A reduction/maximum routine
• An additional and larger mesh
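A minimal sketch of the kind of update step described above, using illustrative names (advect_step, u, u_new, etc.) rather than SMAUG's actual routines: one thread per grid cell evaluates the central differences and performs a single explicit (Runge-Kutta-stage-like) update.

    __global__ void advect_step(const double *u, double *u_new,
                                double vx, double vy,
                                double dx, double dy, double dt,
                                int nx, int ny)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        /* interior points only; boundary/halo cells are handled separately */
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int idx = j * nx + i;
            /* central differences in x and y */
            double dudx = (u[idx + 1]  - u[idx - 1])  / (2.0 * dx);
            double dudy = (u[idx + nx] - u[idx - nx]) / (2.0 * dy);
            /* one explicit stage of du/dt = -(vx du/dx + vy du/dy) */
            u_new[idx] = u[idx] - dt * (vx * dudx + vy * dudy);
        }
    }

Because every cell is updated independently from read-only neighbours, the kernel scales across thousands of threads; the stabilisation step, by contrast, needs a global maximum (a reduction) over the mesh, which is what makes it trickier on GPUs.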
Halo Messaging
• Each processor has a "ghost" layer
• Used in the calculation of the update
• Obtained from the neighbouring processors
• Top and bottom layers are passed to the neighbouring processors
• These become the neighbours' ghost layers
• Rows are distributed over processors: N/nproc rows per processor
• Every processor stores all N columns
• SMAUG-MPI implements messaging using a 2D halo model for 2D problems and a 3D halo model for 3D problems
• Consider a 2D model: for simplicity, distribute the layers over a line of processes (see the decomposition sketch after the diagram below)
[Diagram: four processors arranged along a line, each holding rows from its local minimum to maximum index over the N+1 grid points. Each processor sends its top layer to the processor above and its bottom layer to the processor below, and receives the corresponding layers into its top and bottom ghost rows.]
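A minimal sketch of this row decomposition (illustrative names only, not SMAUG-MPI's actual data structures): each rank works out how many rows it owns and which ranks hold its ghost layers.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        const int N = 1024;                  /* global number of rows (illustrative)  */
        int local_rows = N / nproc;          /* rows owned by this rank               */
        int first_row  = rank * local_rows;  /* global index of this rank's first row */

        /* neighbours for the halo exchange; MPI_PROC_NULL at the domain ends */
        int up   = (rank == 0)         ? MPI_PROC_NULL : rank - 1;
        int down = (rank == nproc - 1) ? MPI_PROC_NULL : rank + 1;

        /* local storage would hold local_rows owned rows plus 2 ghost rows,
           each N columns wide */
        printf("rank %d owns rows %d..%d (neighbours %d, %d)\n",
               rank, first_row, first_row + local_rows - 1, up, down);

        MPI_Finalize();
        return 0;
    }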
MPI Implementation
• Based on the halo messaging technique employed in the SAC code

    void exchange_halo(vector v) {
        /* gather halo data from v into gpu_buffer1 */
        cudaMemcpy(host_buffer1, gpu_buffer1, ...);
        MPI_Isend(host_buffer1, ..., destination, ...);
        MPI_Irecv(host_buffer2, ..., source, ...);
        MPI_Waitall(...);
        cudaMemcpy(gpu_buffer2, host_buffer2, ...);
        /* scatter halo data from gpu_buffer2 to halo regions in v */
    }
Halo Messaging with GPU Direct
• Simpler, faster call structure

    void exchange_halo(vector v) {
        /* gather halo data from v into gpu_buffer1 */
        MPI_Isend(gpu_buffer1, ..., destination, ...);
        MPI_Irecv(gpu_buffer2, ..., source, ...);
        MPI_Waitall(...);
        /* scatter halo data from gpu_buffer2 to halo regions in v */
    }
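A slightly fuller sketch of this GPU Direct variant, assuming a CUDA-aware MPI build so that device pointers can be handed directly to MPI_Isend/MPI_Irecv. The pack/unpack kernels and all names (pack_row, field, send_top, ...) are illustrative rather than SMAUG's actual routines, and error checking is omitted.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* copy one row of the field into a contiguous device buffer */
    __global__ void pack_row(const double *field, double *buf, int row, int nx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nx) buf[i] = field[row * nx + i];
    }

    /* copy a contiguous device buffer into one (ghost) row of the field */
    __global__ void unpack_row(double *field, const double *buf, int row, int nx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nx) field[row * nx + i] = buf[i];
    }

    /* 'field' holds ny owned rows (1..ny) plus ghost rows 0 and ny+1;
       all buffers are device pointers */
    void exchange_halo(double *field, double *send_top, double *send_bot,
                       double *recv_top, double *recv_bot,
                       int nx, int ny, int up, int down, MPI_Comm comm)
    {
        int threads = 256, blocks = (nx + threads - 1) / threads;

        pack_row<<<blocks, threads>>>(field, send_top, 1, nx);   /* first owned row */
        pack_row<<<blocks, threads>>>(field, send_bot, ny, nx);  /* last owned row  */
        cudaDeviceSynchronize();                                 /* buffers ready   */

        MPI_Request req[4];
        /* device pointers passed straight to MPI (GPU Direct / CUDA-aware MPI) */
        MPI_Irecv(recv_top, nx, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Irecv(recv_bot, nx, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Isend(send_top, nx, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Isend(send_bot, nx, MPI_DOUBLE, down, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        unpack_row<<<blocks, threads>>>(field, recv_top, 0, nx);      /* top ghost    */
        unpack_row<<<blocks, threads>>>(field, recv_bot, ny + 1, nx); /* bottom ghost */
    }

Compared with the host-buffered version above, the two cudaMemcpy calls disappear; whether the interconnect reads GPU memory directly, or the MPI library stages it internally, depends on the system's GPU Direct support.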
Progress with MPI Implementation
• Successfully running two-dimensional models under GPU Direct
• Wilkes GPU cluster at the University of Cambridge
• N8 GPU facility, Iceberg
• The 2D MPI version is verified
• Currently optimising communications performance under GPU Direct
• The 3D MPI version is implemented but still requires testing
Orszag-Tang Test: 200x200 model at t = 0.1 s, 0.26 s, 0.42 s and 0.58 s
A Model of Wave Propagation in the Magnetised Solar Atmosphere
• The model features a flux tube with a torsional driver, in a fully stratified quiet solar atmosphere based on VALIIIC
• The grid size is 128x128x128, representing a box in the solar atmosphere of dimensions 1.5x2x2 Mm
• The flux tube has a magnetic field strength of 1000 G
• Driver amplitude: 200 km/s
Timing for Orszag-Tang Using SAC/SMAUG with Different Architectures
Performance Results (Hyperdiffusion disabled) • Timings in seconds for 100 iterations (Orszag-Tang test)
Performance Results (With Hyperdiffusion enabled) • Timings in seconds for 100 iterations (Orszag-Tang test)
Conclusions
• We have demonstrated that we can successfully compute large problems by distributing them across multiple GPUs
• For 2D problems the performance of messaging with and without GPU Direct is similar; this is expected to change when 3D models are tested
• It is likely that much of the communications overhead arises from the routines used to transfer data within GPU memory
• Performance enhancements are possible through modification of the application architecture
• Further work is needed with larger models for comparison with the x86 MPI implementation
• The algorithm has been implemented in 3D; testing of 3D models will be undertaken over the forthcoming weeks