Performance Optimizations for running NIM on GPUs
Jacques Middlecoff, NOAA/OAR/ESRL/GSD/AB, Jacques.Middlecoff@noaa.gov
Mark Govett, Tom Henderson, Jim Rosinski
Goal for NIM
• Optimize MPI communication between distributed processors
• An old topic, but GPU speed gives it new urgency
• FIM example: 320 processors, 8192-point footprint
  • Communication takes 2.4% of the dynamics computation time
• NIM has similar communication and computation characteristics, and 8192 points is in NIM's sweet spot for the GPU
• A 25X GPU speedup means communication takes 60% of the computation time (worked arithmetic below)
• That still gives a 16X overall speedup
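The 60% and 16X figures follow from the 2.4% baseline, assuming communication time stays fixed while the computation speeds up 25X (times below are normalized to the CPU dynamics computation time):

  comp_CPU = 1.000, comm = 0.024
  comp_GPU = comp_CPU / 25 = 0.040
  comm / comp_GPU = 0.024 / 0.040 = 0.60, so communication takes 60% of the GPU computation time
  speedup = (comp_CPU + comm) / (comp_GPU + comm) = 1.024 / 0.064 ≈ 16X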
Optimizations to be discussed
• NIM: the halo to be communicated between processors is packed and unpacked on the GPU (see the sketch below)
  • No copy of the entire variable to and from the CPU
  • Packing on the GPU is about the same speed as packing on the CPU
• Halo computation
• Overlapping communication with computation
• Mapped, pinned memory
• NVIDIA GPUDirect technology
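A minimal CUDA sketch of GPU-side halo packing and unpacking; the kernel names, the index array d_halo_index, and the (npts, nz) storage layout with the horizontal index fastest are assumptions, not taken from NIM:

  // Gather the halo points of a field into a contiguous send buffer on the GPU,
  // so only the small buffer (not the entire variable) crosses the PCIe bus.
  // Launch with blockDim.x threads over halo points and gridDim.y = nz levels.
  __global__ void pack_halo(const float *d_var, float *d_sendbuf,
                            const int *d_halo_index, int nhalo, int npts)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // halo point
      int k = blockIdx.y;                             // vertical level
      if (i < nhalo)
          d_sendbuf[k * nhalo + i] = d_var[k * npts + d_halo_index[i]];
  }

  // Scatter a received buffer back into the halo region of a field.
  __global__ void unpack_halo(float *d_var, const float *d_recvbuf,
                              const int *d_halo_index, int nhalo, int npts)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int k = blockIdx.y;
      if (i < nhalo)
          d_var[k * npts + d_halo_index[i]] = d_recvbuf[k * nhalo + i];
  }

Only the packed buffer then moves across the PCIe bus for the MPI exchange, instead of the whole variable.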
Halo Computation
• Redundant computation to avoid communication
• Calculate values in the halo instead of receiving them with an MPI send (sketch below)
• Trades computation time for communication time
• GPUs create more opportunity for halo computation
• NIM already has halo computation for everything not requiring extra communication
• The next step for NIM is to look at halo computations that require new, but less frequent, communication
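A sketch of the idea, under the assumption that the nhalo halo points are stored immediately after the nip interior points; the kernel and variable names are hypothetical:

  // By launching over interior + halo points instead of interior only,
  // each processor computes its own copy of the halo values and the MPI
  // exchange of d_x can be skipped. This is only valid when d_rhs is
  // already correct in the halo, i.e. no extra communication is needed.
  __global__ void update(float *d_x, const float *d_rhs, float dt,
                         int nip, int nhalo, int npts)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // horizontal point
      int k = blockIdx.y;                             // vertical level
      if (i < nip + nhalo)                            // interior AND halo
          d_x[k * npts + i] += dt * d_rhs[k * npts + i];
  }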
Overlapping Communication with Computation
• Works best with a co-processor to handle the communication
• One approach: overlap communication with other calculations between when a variable is set and when it is used
  • Not enough computation time on the GPU to hide the communication
• Another approach: calculate the perimeter first, then communicate while calculating the interior (sketch below)
  • Loop level: not enough computation on the GPU
  • Subroutine level: not enough computation time
  • Entire dynamics: not feasible for NIM (next slide)
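A minimal sketch of the perimeter-first pattern using two CUDA streams and non-blocking MPI; the kernels are assumed to be defined elsewhere (arguments abbreviated relative to the earlier packing sketch), error checking is omitted, and h_sendbuf/h_recvbuf are assumed to be pinned so the asynchronous copies can overlap:

  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void compute_perimeter(float *d_var);   // points the neighbors need
  __global__ void compute_interior(float *d_var);    // everything else
  __global__ void pack_halo(const float *d_var, float *d_sendbuf);
  __global__ void unpack_halo(float *d_var, const float *d_recvbuf);

  void exchange_overlapped(float *d_var, float *d_sendbuf, float *d_recvbuf,
                           float *h_sendbuf, float *h_recvbuf,
                           int count, int nbr, MPI_Comm comm,
                           cudaStream_t s_comm, cudaStream_t s_comp,
                           dim3 gperim, dim3 gint, dim3 gpack, dim3 blk)
  {
      size_t nbytes = count * sizeof(float);
      MPI_Request req[2];

      // 1. Perimeter work and halo pack on the communication stream.
      compute_perimeter<<<gperim, blk, 0, s_comm>>>(d_var);
      pack_halo<<<gpack, blk, 0, s_comm>>>(d_var, d_sendbuf);
      cudaMemcpyAsync(h_sendbuf, d_sendbuf, nbytes,
                      cudaMemcpyDeviceToHost, s_comm);

      // 2. Interior work runs concurrently on the computation stream.
      compute_interior<<<gint, blk, 0, s_comp>>>(d_var);

      // 3. Once the perimeter buffer is on the host, exchange it with the neighbor.
      cudaStreamSynchronize(s_comm);
      MPI_Irecv(h_recvbuf, count, MPI_FLOAT, nbr, 0, comm, &req[0]);
      MPI_Isend(h_sendbuf, count, MPI_FLOAT, nbr, 0, comm, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

      // 4. Copy the received halo back to the GPU and unpack it.
      cudaMemcpyAsync(d_recvbuf, h_recvbuf, nbytes,
                      cudaMemcpyHostToDevice, s_comm);
      unpack_halo<<<gpack, blk, 0, s_comm>>>(d_var, d_recvbuf);

      cudaStreamSynchronize(s_comm);
      cudaStreamSynchronize(s_comp);  // interior done before the next step
  }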
Overlapping Communication with Computation: Entire Dynamics
(Figure: perimeter and interior regions of the subdomain)
• 14 exchanges per time step
• 3-iteration Runge-Kutta loop
• Exchanges inside the RK loop result in a 7-deep halo
• Way too much communication
• More halo computation? Move the exchanges out of the RK loop?
• Considerable code restructuring required
Mapped, Pinned Memory: Theory
• Mapped, pinned memory is CPU memory
  • Mapped so the GPU can access it across the PCIe bus
  • Page-locked so the OS can't swap it out
  • Available only in limited amounts
• Integrated GPUs: always a performance gain
• Discrete GPUs (what we have): advantageous only in certain cases
  • The data is not cached on the GPU
  • Global loads and stores must be coalesced
• Zero-copy: both the GPU and the CPU can access the data (allocation sketch below)
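A minimal CUDA sketch of allocating mapped, pinned (zero-copy) memory; the function names are hypothetical and error checking is omitted:

  #include <cuda_runtime.h>

  // Call once, before the CUDA context is created, so page-locked host
  // allocations can be mapped into the device address space.
  void enable_zero_copy(void)
  {
      cudaSetDeviceFlags(cudaDeviceMapHost);
  }

  // Allocate page-locked host memory that is also mapped into the GPU's
  // address space, so kernels can read and write it directly across PCIe.
  // Returns the host pointer; *d_ptr receives the device alias of the same memory.
  float *alloc_mapped_pinned(size_t nbytes, float **d_ptr)
  {
      float *h_ptr = NULL;
      cudaHostAlloc((void **)&h_ptr, nbytes, cudaHostAllocMapped);
      cudaHostGetDevicePointer((void **)d_ptr, h_ptr, 0);
      return h_ptr;
  }

A kernel can then write through the device alias; once the kernel has been synchronized, the CPU can hand the host pointer straight to MPI with no explicit cudaMemcpy, which is the zero-copy behavior described above.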
Mapped, Pinned Memory: Practice
• Zero-copy pack: pack the halo on the GPU directly into mapped memory (SendBuf = VAR); 2.7X slower. Why?
• Zero-copy unpack: unpack the halo on the GPU directly from mapped memory (VAR = RecvBuf); same speed, but with no copy
• Using mapped, pinned memory for a fast copy instead (sketch below):
  • SendBuf is mapped and pinned
  • A regular GPU array (d_buff) is packed on the GPU
  • d_buff is copied to SendBuf
  • Twice as fast as copying d_buff to a CPU array
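A sketch of the fast-copy variant under the assumptions above; pack_halo is the kernel from the earlier sketch, and the remaining names are hypothetical:

  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void pack_halo(const float *d_var, float *d_sendbuf,
                            const int *d_halo_index, int nhalo, int npts);

  // Pack into a regular device buffer, copy that buffer into the mapped,
  // pinned SendBuf, then hand SendBuf to MPI.
  void pack_and_send(const float *d_var, float *d_buff, float *SendBuf,
                     const int *d_halo_index, int nhalo, int npts, int nz,
                     int nbr, MPI_Comm comm, MPI_Request *req)
  {
      int count = nhalo * nz;
      dim3 blk(128), grd((nhalo + 127) / 128, nz);

      pack_halo<<<grd, blk>>>(d_var, d_buff, d_halo_index, nhalo, npts);

      // Device-to-host copy into pinned memory: about twice as fast as
      // copying into an ordinary pageable CPU array, per the slide.
      cudaMemcpy(SendBuf, d_buff, count * sizeof(float),
                 cudaMemcpyDeviceToHost);

      MPI_Isend(SendBuf, count, MPI_FLOAT, nbr, 0, comm, req);
  }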
Mapped, Pinned Memory: Results
• NIM: 10242 horizontal points, 96 vertical levels, 10 processors
• Lowest value selected to avoid skew
Mapped, Pinned Memory: Results (continued)
• NIM: 10242 horizontal points, 96 vertical levels, 10 processors
• Barriers before each operation: 0.08-0.21, 0.24-0.43
NVIDIA GPUDirect Technology
• Eliminates the CPU's role in interprocessor communication
• Based on an interface between the GPU and InfiniBand
• Both devices share pinned memory buffers
• Data written by the GPU can be sent immediately by InfiniBand
• What about overlapping communication with computation? Is there no longer a co-processor to do the communication?
• We have this technology but have yet to install it
Questions?