
Performance Optimizations for running NIM on GPUs



Presentation Transcript


  1. Performance Optimizations for running NIM on GPUs
  Jacques Middlecoff, NOAA/OAR/ESRL/GSD/AB, Jacques.Middlecoff@noaa.gov
  Mark Govett, Tom Henderson, Jim Rosinski

  2. Goal for NIM
  • Optimize MPI communication between distributed processors
  • An old topic, but GPU speed gives it new urgency
  • FIM example: 320 processors, an 8,192-point footprint per processor
  • Communication takes 2.4% of the dynamics computation time
  • NIM has similar communication and computation characteristics
  • 8,192 points is in NIM's sweet spot for the GPU
  • A 25X GPU speedup means communication takes 60% of the compute time! (see the arithmetic below)
  • That still gives a 16X overall speedup
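A quick check of the 60% and 16X figures, assuming the communication time is fixed by the network while computation shrinks 25X (t_comp = 1, t_comm = 0.024):

```latex
\[
\frac{t_{\mathrm{comm}}}{t_{\mathrm{comp}}/25} = 0.024 \times 25 = 0.60,
\qquad
S = \frac{t_{\mathrm{comp}} + t_{\mathrm{comm}}}{t_{\mathrm{comp}}/25 + t_{\mathrm{comm}}}
  = \frac{1.024}{0.064} \approx 16
\]
```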

  3. Optimizations to be discussed
  • NIM: the halo to be communicated between processors is packed and unpacked on the GPU (see the sketch below)
  • No copy of the entire variable to and from the CPU
  • About the same speed as packing on the CPU
  • Halo computation
  • Overlapping communication with computation
  • Mapped, pinned memory
  • NVIDIA GPUDirect technology
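A sketch of what GPU-side halo packing and unpacking can look like; the names (packHalo, haloIdx) are hypothetical, not NIM's actual routines. The point is that only the small packed buffer ever crosses the PCIe bus, never the whole variable:

```cuda
// Gather halo points into a contiguous send buffer on the GPU.
// haloIdx[i] holds the index of the i-th halo point of var (assumed layout).
__global__ void packHalo(const float *var, const int *haloIdx,
                         float *sendBuf, int nHalo)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nHalo)
        sendBuf[i] = var[haloIdx[i]];
}

// Scatter received halo values back into the variable.
__global__ void unpackHalo(float *var, const int *haloIdx,
                           const float *recvBuf, int nHalo)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nHalo)
        var[haloIdx[i]] = recvBuf[i];
}
```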

  4. Halo Computation
  • Redundant computation to avoid communication: calculate values in the halo instead of receiving them via MPI (sketched below)
  • Trades computation time for communication time
  • GPUs create more opportunity for halo computation
  • NIM already uses halo computation everywhere it does not require extra communication
  • Next step: look at halo computations that require new, but less frequent, communication
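A hedged illustration of the trade, using a generic 6-neighbor icosahedral stencil rather than NIM's actual dynamics; all names (diffuse, nbr, nInterior, nHalo1) are hypothetical. If the previous exchange filled the halo to depth 2, the depth-1 halo values can be recomputed locally instead of exchanged:

```cuda
// Redundant halo computation: extend the kernel's range to cover the
// depth-1 halo (indices nInterior .. nInterior+nHalo1-1), whose stencil
// inputs are valid because the previous exchange filled a depth-2 halo.
// This replaces an MPI exchange of the depth-1 halo after this step.
__global__ void diffuse(float *out, const float *in, const int *nbr,
                        int nInterior, int nHalo1)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = nInterior + nHalo1;           // interior plus depth-1 halo
    if (i < n) {
        // nbr holds 6 neighbor indices per icosahedral point (assumed layout).
        float sum = 0.0f;
        for (int k = 0; k < 6; k++)
            sum += in[nbr[6 * i + k]];
        out[i] = in[i] + 0.1f * (sum / 6.0f - in[i]);
    }
}
```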

  5. Overlapping Communication with Computation
  • Works best with a co-processor to handle the communication
  • Overlap communication with other calculations between where a variable is set and where it is used
    - Not enough computation time on the GPU to hide the communication
  • Calculate the perimeter first, then communicate while calculating the interior (sketched below)
    - Loop level: not enough computation on the GPU
    - Subroutine level: not enough computation time
    - Entire dynamics: not feasible for NIM (next slide)
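A minimal sketch of the perimeter-first pattern, assuming non-blocking MPI, two CUDA streams, and a single neighbor; all names (stepWithOverlap, computePerimeter, computeInterior) are illustrative, and the kernels are assumed defined elsewhere:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative kernels: perimeter points first, interior later,
// plus the GPU-side halo pack from the earlier sketch.
__global__ void computePerimeter(float *var, int nPerim);
__global__ void computeInterior(float *var, int nPerim, int nTotal);
__global__ void packHalo(const float *var, const int *haloIdx,
                         float *sendBuf, int nHalo);

// One time step with the exchange hidden behind interior computation.
void stepWithOverlap(float *d_var, const int *d_haloIdx,
                     float *d_sendBuf, float *h_sendBuf, float *h_recvBuf,
                     int nPerim, int nTotal, int nHalo, int nbrRank,
                     MPI_Comm comm, cudaStream_t s0, cudaStream_t s1)
{
    MPI_Request req[2];

    // 1. Perimeter first, so its halo can leave as early as possible.
    computePerimeter<<<(nPerim + 255) / 256, 256, 0, s0>>>(d_var, nPerim);
    packHalo<<<(nHalo + 255) / 256, 256, 0, s0>>>(d_var, d_haloIdx,
                                                  d_sendBuf, nHalo);
    cudaMemcpyAsync(h_sendBuf, d_sendBuf, nHalo * sizeof(float),
                    cudaMemcpyDeviceToHost, s0);
    cudaStreamSynchronize(s0);

    // 2. Start the exchange; the CPU and NIC act as the comm co-processor.
    MPI_Irecv(h_recvBuf, nHalo, MPI_FLOAT, nbrRank, 0, comm, &req[0]);
    MPI_Isend(h_sendBuf, nHalo, MPI_FLOAT, nbrRank, 0, comm, &req[1]);

    // 3. Interior work on another stream overlaps the messages in flight.
    computeInterior<<<(nTotal + 255) / 256, 256, 0, s1>>>(d_var, nPerim, nTotal);

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    // Unpack h_recvBuf into d_var's halo before its next use.
}
```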

  6. Overlapping Communication with Computation: Entire Dynamics
  [Slide diagram: perimeter vs. interior of the domain]
  • 14 exchanges per time step, inside a 3-iteration Runge-Kutta loop
  • Exchanges in the RK loop result in a 7-deep halo
  • Way too much communication
  • More halo computation? Move the exchanges out of the RK loop?
  • Considerable code restructuring required

  7. Mapped, Pinned Memory: Theory
  • Mapped, pinned memory is CPU memory
    - Mapped so the GPU can access it across the PCIe bus
    - Page-locked so the OS can't swap it out
    - Available only in limited amounts
  • Integrated GPUs: always a performance gain
  • Discrete GPUs (what we have): advantageous only in certain cases
    - The data is not cached on the GPU
    - Global loads and stores must be coalesced
  • Zero-copy: both the GPU and CPU can access the data (allocation sketched below)
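A minimal allocation sketch using the CUDA runtime calls this slide describes; the API names are real, but the surrounding program and sizes are illustrative:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int nHalo = 8192;               // illustrative buffer size
    float *h_buf, *d_buf;

    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host memory the GPU can address across the PCIe bus.
    cudaHostAlloc((void**)&h_buf, nHalo * sizeof(float), cudaHostAllocMapped);

    // Device-side alias of the same memory: zero-copy, not cached on the
    // GPU, so accesses should be coalesced and each element touched once.
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);

    // A kernel launched with d_buf now reads/writes host memory directly.

    cudaFreeHost(h_buf);
    return 0;
}
```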

  8. Mapped, Pinned Memory: Practice
  • Packing the halo on the GPU directly into zero-copy memory (SendBuf = VAR) is 2.7X slower. Why?
  • Unpacking the halo on the GPU from zero-copy memory (VAR = RecvBuf) runs at the same speed, with no copy needed
  • Using mapped, pinned memory for a fast copy instead (sketched below):
    - SendBuf is mapped and pinned
    - A regular GPU array (d_buff) is packed on the GPU
    - d_buff is copied to SendBuf
    - Twice as fast as copying d_buff to a pageable CPU array
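The winning recipe from this slide as a hedged sketch, reusing the hypothetical packHalo kernel from above; SendBuf and d_buff are the slide's own names:

```cuda
// Pack into a regular device array (d_buff), then DMA it into the mapped,
// pinned SendBuf. Per the slide, this is twice as fast as copying d_buff
// into pageable CPU memory, while zero-copy packing straight into SendBuf
// was 2.7X slower, plausibly because the scattered, uncoalesced PCIe
// writes of a pack defeat zero-copy.
void packAndStage(const float *d_var, const int *d_haloIdx,
                  float *d_buff, float *SendBuf, int nHalo)
{
    packHalo<<<(nHalo + 255) / 256, 256>>>(d_var, d_haloIdx, d_buff, nHalo);
    cudaMemcpy(SendBuf, d_buff, nHalo * sizeof(float), cudaMemcpyDeviceToHost);
    // SendBuf (pinned) is now ready to hand to MPI_Send.
}
```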

  9. Mapped, Pinned Memory: Results
  • NIM: 10,242 horizontal points, 96 vertical levels, 10 processors
  • Lowest value selected to avoid skew

  10. Mapped, Pinned Memory: Results
  • NIM: 10,242 horizontal points, 96 vertical levels, 10 processors
  • Barriers before each operation: .08-.21, .24-.43

  11. NVIDIA GPUDirect Technology
  • Eliminates the CPU from interprocessor communication
  • Based on an interface between the GPU and InfiniBand
    - Both devices share pinned memory buffers
    - Data written by the GPU can be sent immediately by InfiniBand
  • Overlapping communication with computation?
    - There is no longer a co-processor to do the communication?
  • We have this technology but have yet to install it
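GPUDirect as described here lets the InfiniBand driver and CUDA register the same pinned host buffer, removing a host-to-host staging copy. In later CUDA-aware MPI stacks built on GPUDirect, the same idea surfaces as passing device pointers straight to MPI. A hedged fragment of the latter, not the configuration on the slide; all names are illustrative:

```cuda
// With a CUDA-aware MPI, the packed device buffer (d_sendBuf) can be
// handed to MPI directly; no staging copy through CPU memory is needed.
MPI_Request req;
MPI_Isend(d_sendBuf, nHalo, MPI_FLOAT, nbrRank, 0, MPI_COMM_WORLD, &req);
```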

  12. Questions?
