
Hybrid MPI/CUDA


Presentation Transcript


  1. Hybrid MPI/CUDA: Scaling accelerator code

  2. Why Hybrid CUDA?
  • CUDA is fast! (for some problems)
  • CUDA on a single card is like OpenMP (doesn’t scale)
  • MPI can only scale so far
    • Excessive power
    • Communication overhead
    • Large amount of work remains for each node
  • What if you can harness the power of multiple accelerators on multiple MPI processes?

  3. Hybrid Architectures
  • Tesla S1050 connected to nodes
    • 1 GPU, connected directly to a node
    • Al-Salam @ Earlham (as11 & as12)
  • Tesla S1070
    • A server node with 4 GPUs, typically connected via PCI-E to 2 nodes
    • Sooner @ OU has some of these
    • Lincoln @ NCSA (192 nodes)
    • Accelerator Cluster (AC) @ NCSA (32 nodes)
  [Slide diagram: a 4-GPU unit shared between two nodes, each node with its own RAM]

  4. MPI/CUDA Approach
  • CUDA will be:
    • Doing the computational heavy lifting
    • Dictating your algorithm & parallel layout (data parallel)
  • Therefore:
    • Design the CUDA portions first
    • Use MPI to move work to each node

  5. Implementation
  • Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
    • Sometimes you won’t have a choice…
  • Debugging tips:
    • Develop/test/debug the one-node version first
    • Then test it with multiple nodes to verify communication
  • Basic structure (see the sketch after this slide):
      move data to each node
      while not done:
          copy data to GPU
          do work <<< >>>
          get new state out of GPU
          communicate with others
      aggregate results from all nodes
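A minimal sketch of that loop in MPI + CUDA C, kept in a single .cu file only for brevity (slide 7 recommends splitting the MPI and CUDA pieces). The kernel name do_work, the array size, the fixed iteration count, and the placeholder communication step are assumptions, and error checking is omitted:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void do_work(float *d_data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_data[i] *= 0.5f;                /* stand-in for the real computation */
    }

    int main(int argc, char **argv) {
        const int N = 1 << 20;                       /* total problem size */
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;                        /* move data to each node (assumes an even split) */
        float *data = (float *)malloc(chunk * sizeof(float));
        for (int i = 0; i < chunk; i++) data[i] = 1.0f;

        float *d_data;
        cudaMalloc(&d_data, chunk * sizeof(float));

        for (int step = 0; step < 10; step++) {      /* "while not done" */
            cudaMemcpy(d_data, data, chunk * sizeof(float),
                       cudaMemcpyHostToDevice);      /* copy data to GPU */
            do_work<<<(chunk + 255) / 256, 256>>>(d_data, chunk);
            cudaMemcpy(data, d_data, chunk * sizeof(float),
                       cudaMemcpyDeviceToHost);      /* get new state out of GPU */
            MPI_Barrier(MPI_COMM_WORLD);             /* communicate with others (e.g. halo exchange) here */
        }

        float local = 0.0f, total = 0.0f;
        for (int i = 0; i < chunk; i++) local += data[i];
        MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0,
                   MPI_COMM_WORLD);                  /* aggregate results from all nodes */
        if (rank == 0) printf("total = %f\n", total);

        cudaFree(d_data);
        free(data);
        MPI_Finalize();
        return 0;
    }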

  6. Multi-GPU Programming
  • A CPU thread can only have a single active context to communicate with a GPU
  • cudaGetDeviceCount(int *count)
  • cudaSetDevice(int device)
  • Be careful using MPI rank alone; the device count only counts the cards visible from each node
  • Use MPI_Get_processor_name() to determine which processes are running where (see the sketch below)
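One way to handle that, as a sketch (the helper name pick_local_device and the MPI_Allgather approach are illustrative assumptions, not from the slides): gather every rank's processor name, count how many lower-numbered ranks share this rank's hostname, and use that local index as the device number. Call it once, right after MPI_Init:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Pick a GPU for this rank: my device number = how many lower-numbered
     * ranks are running on the same node as me (wrapped by the device count). */
    void pick_local_device(void) {
        char name[MPI_MAX_PROCESSOR_NAME];
        int len, rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        memset(name, 0, sizeof(name));
        MPI_Get_processor_name(name, &len);

        /* Every rank needs to see every other rank's hostname. */
        char *all = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

        int local = 0;
        for (int r = 0; r < rank; r++)
            if (strcmp(all + (size_t)r * MPI_MAX_PROCESSOR_NAME, name) == 0)
                local++;                         /* another rank shares my node */

        int count;
        cudaGetDeviceCount(&count);              /* only the GPUs visible from this node */
        cudaSetDevice(local % count);            /* wrap around if ranks > GPUs */
        printf("rank %d on %s -> GPU %d of %d\n", rank, name, local % count, count);
        free(all);
    }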

  7. Compiling
  • CUDA needs nvcc, MPI needs mpicc
  • Dirty trick: wrap mpicc with nvcc
  • nvcc processes .cu files and sends the rest to its wrapped compiler
  • The kernel, the kernel invocation, and cudaMalloc are all best off in a .cu file somewhere
  • MPI calls should be in .c files
  • There are workarounds, but this is the simplest approach:
      nvcc --compiler-bindir mpicc main.c kernel.cu
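A minimal sketch of that split, using hypothetical files kernel.cu and main.c and a C-callable launcher named run_kernel (all names are illustrative); the extern "C" wrapper keeps nvcc's C++ name mangling from breaking the link against the C object file:

    /* kernel.cu -- compiled by nvcc: kernel, kernel invocation, cudaMalloc */
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    extern "C" void run_kernel(float *host, int n) {
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(host, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }

    /* main.c -- compiled by the wrapped mpicc: all MPI calls live here */
    #include <mpi.h>
    #include <stdio.h>

    void run_kernel(float *host, int n);    /* implemented in kernel.cu */

    int main(int argc, char **argv) {
        int rank;
        float data[256];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 256; i++) data[i] = (float)rank;
        run_kernel(data, 256);              /* hand this rank's chunk to the GPU */
        printf("rank %d: data[0] = %f\n", rank, data[0]);
        MPI_Finalize();
        return 0;
    }

Both files then build with the wrapped-compiler line shown on the slide.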

  8. Executing
  • Typically one MPI process per available GPU
  • On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:
      #BSUB -R "select[cuda > 0]"
      #BSUB -R "rusage[cuda=2]"
      #BSUB -l nodes=1:ppn=2
  • On AC, each node has 4 GPUs, and the number used corresponds to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:
      #BSUB -l nodes=2:tesla:cuda3.2:ppn=4

  9. Hybrid CUDA Lab
  • We already have Area Under a Curve code for MPI and CUDA independently.
  • You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area (a sketch follows below).
  • Otherwise, feel free to take any code we’ve used so far and experiment!
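A sketch of one way the hybrid version could look (not the lab's actual starter code): assume a midpoint-rule integration of f(x) = x*x over [0, 1]; each rank integrates its own sub-interval on its GPU with a block-level shared-memory reduction, then MPI_Reduce combines the per-rank subtotals.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __device__ float f(float x) { return x * x; }    /* example curve; exact area is 1/3 */

    /* Each thread adds one rectangle; each block reduces its threads' areas in shared memory. */
    __global__ void area_kernel(float *block_sums, float a, float dx, int n) {
        extern __shared__ float cache[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (i < n) ? f(a + (i + 0.5f) * dx) * dx : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) block_sums[blockIdx.x] = cache[0];
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_total = 1 << 22;                 /* rectangles over [0, 1] */
        const float dx = 1.0f / n_total;
        int n = n_total / size;                      /* this rank's rectangles (assumes even split) */
        float a = rank * n * dx;                     /* left edge of this rank's sub-interval */

        int threads = 256, blocks = (n + threads - 1) / threads;
        float *d_block_sums;
        cudaMalloc(&d_block_sums, blocks * sizeof(float));
        area_kernel<<<blocks, threads, threads * sizeof(float)>>>(d_block_sums, a, dx, n);

        float *block_sums = (float *)malloc(blocks * sizeof(float));
        cudaMemcpy(block_sums, d_block_sums, blocks * sizeof(float),
                   cudaMemcpyDeviceToHost);          /* implicitly waits for the kernel */
        float local = 0.0f;
        for (int b = 0; b < blocks; b++) local += block_sums[b];

        float total = 0.0f;
        MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("area ~= %f (exact: 1/3)\n", total);

        cudaFree(d_block_sums);
        free(block_sums);
        MPI_Finalize();
        return 0;
    }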
