Hybrid MPI/CUDA: Scaling accelerator code
Why Hybrid CUDA?
• CUDA is fast! (for some problems)
• CUDA on a single card is like OpenMP (it doesn't scale beyond one machine)
• MPI can only scale so far
  • Excessive power
  • Communication overhead
  • A large amount of work remains for each node
• What if you could harness the power of multiple accelerators across multiple MPI processes?
Hybrid Architectures
• Tesla S1050 connected to nodes
  • 1 GPU, connected directly to a node
  • Al-Salam @ Earlham (as11 & as12)
• Tesla S1070
  • A server node with 4 GPUs, typically connected via PCI-E to 2 nodes
  • Sooner @ OU has some of these
  • Lincoln @ NCSA (192 nodes)
  • Accelerator Cluster (AC) @ NCSA (32 nodes)
[Diagram: two nodes, each with its own RAM, sharing the four GPUs of a Tesla unit over PCI-E]
MPI/CUDA Approach
• CUDA will be:
  • Doing the computational heavy lifting
  • Dictating your algorithm & parallel layout (data parallel)
• Therefore:
  • Design the CUDA portions first
  • Use MPI to move work to each node
Implementation
• Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
  • Sometimes you won't have a choice…
• Debugging tips:
  • Develop/test/debug the one-node version first
  • Then test it with multiple nodes to verify the communication

move data to each node
while not done:
    copy data to GPU
    do work <<< >>>
    get new state out of GPU
    communicate with others
aggregate results from all nodes
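A minimal CUDA C sketch of the loop above. The kernel do_work(), the chunk size N, and the single-pass "convergence" test are placeholders rather than anything from the lab codes, and for brevity the MPI calls and the kernel share one .cu file (the Compiling slide below covers the usual split).

/* hybrid_loop.cu -- illustrative only; do_work() and N are made up */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                              /* elements owned by each rank */

__global__ void do_work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;             /* stand-in for the real computation */
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "move data to each node" -- here each rank simply fills its own chunk */
    float *host = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) host[i] = (float)(rank * N + i);

    float *dev;
    cudaMalloc((void **)&dev, N * sizeof(float));

    int done = 0;
    while (!done) {
        cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);  /* copy data to GPU */
        do_work<<<(N + 255) / 256, 256>>>(dev, N);                         /* do work <<< >>> */
        cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost);  /* get new state out of GPU */

        /* communicate with others: all ranks agree on whether to stop
           (one pass here; a real code would test convergence) */
        int local_done = 1;
        MPI_Allreduce(&local_done, &done, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
    }

    /* aggregate results from all nodes onto rank 0 */
    float *all = NULL;
    if (rank == 0) all = (float *)malloc((size_t)size * N * sizeof(float));
    MPI_Gather(host, N, MPI_FLOAT, all, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("rank 0 gathered %d values\n", size * N);

    cudaFree(dev);
    free(host);
    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}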
Multi-GPU Programming
• A CPU thread can only have a single active context to communicate with a GPU
• cudaGetDeviceCount(int *count)
• cudaSetDevice(int device)
• Be careful using MPI rank alone: the device count only covers the cards visible from each node
• Use MPI_Get_processor_name() to determine which processes are running where (sketched below)
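One common way to combine these, sketched below under the assumption of one MPI process per GPU: every rank gathers all processor names, counts how many lower-numbered ranks share its node, and uses that "local rank" to pick a device. The file and variable names are illustrative; only the MPI and CUDA runtime calls shown are real.

/* pick_gpu.c -- plain C, linked against the CUDA runtime (no kernel needed) */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, len, r, local_rank = 0, ngpus = 0;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &len);

    /* Gather every rank's hostname so each process can see who shares its node */
    char *all = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    /* Local rank = how many lower-numbered ranks run on the same node */
    for (r = 0; r < rank; r++)
        if (strcmp(&all[r * MPI_MAX_PROCESSOR_NAME], name) == 0)
            local_rank++;

    cudaGetDeviceCount(&ngpus);              /* counts only the cards on THIS node */
    if (ngpus > 0)
        cudaSetDevice(local_rank % ngpus);   /* map local rank -> local GPU */

    printf("rank %d on %s: local rank %d, using GPU %d of %d\n",
           rank, name, local_rank, (ngpus > 0) ? local_rank % ngpus : -1, ngpus);

    free(all);
    MPI_Finalize();
    return 0;
}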
Compiling
• CUDA needs nvcc; MPI needs mpicc
• Dirty trick: wrap mpicc with nvcc
  • nvcc processes the .cu files and sends the rest to its wrapped compiler
• The kernel, its invocation, cudaMalloc, etc. are all best off in a .cu file somewhere
• MPI calls should be in .c files
• There are workarounds, but this is the simplest approach

nvcc --compiler-bindir mpicc main.c kernel.cu
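A sketch of the file split this implies. launch_scale() is an illustrative wrapper name, not anything from the lab code; the extern "C" wrapper is what lets the plain-C MPI driver call into code compiled by nvcc.

/* kernel.cu -- everything that needs nvcc lives here */
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

/* C-callable wrapper: allocate, copy in, launch, copy out */
extern "C" void launch_scale(float *host, int n)
{
    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}

/* main.c -- plain C with MPI; knows nothing about CUDA beyond the wrapper */
#include <mpi.h>
#include <stdio.h>

void launch_scale(float *host, int n);   /* implemented in kernel.cu */

int main(int argc, char **argv)
{
    int i, rank;
    float data[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 256; i++) data[i] = (float)(rank * 256 + i);
    launch_scale(data, 256);             /* GPU does the work */
    /* ... MPI communication of the results would go here ... */

    printf("rank %d: data[0] = %f\n", rank, data[0]);
    MPI_Finalize();
    return 0;
}

The pair then builds with the wrapped-compiler command shown above.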
Executing
• Typically one MPI process per available GPU
• On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:

#BSUB -R "select[cuda > 0]"
#BSUB -R "rusage[cuda=2]"
#BSUB -l nodes=1:ppn=2

• On AC, each node has 4 GPUs, and the number of GPUs used corresponds to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:

#BSUB -l nodes=2:tesla:cuda3.2:ppn=4
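The launch itself just needs the process count to match the number of GPUs the scheduler granted; the exact launcher and flags vary by site, but for the 8-GPU AC request above it might look something like this (executable name is illustrative):

mpirun -np 8 ./hybrid_loop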
Hybrid CUDA Lab
• We already have Area Under a Curve code for MPI and CUDA independently
• You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals for the complete area (a sketch follows)
• Otherwise, feel free to take any code we've used so far and experiment!
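A minimal sketch of that structure, with the per-rank integration done on the CPU as a stand-in for that rank's CUDA kernel; the integrand x^2, the interval [0,1], and the file name are all illustrative, not the lab's actual code.

/* area_hybrid.c -- each rank integrates its slice, MPI sums the subtotals */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int n = 1000000;                  /* rectangles per rank */
    double a, b, h, x, local_area = 0.0, total_area = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    a = (double)rank / size;                /* this rank's sub-interval of [0,1] */
    b = (double)(rank + 1) / size;
    h = (b - a) / n;

    /* Midpoint rule over this rank's slice -- replace this loop with a call
       into the CUDA area kernel for the hybrid version */
    for (i = 0; i < n; i++) {
        x = a + (i + 0.5) * h;
        local_area += x * x * h;
    }

    /* Combine the subtotals from every rank on rank 0 */
    MPI_Reduce(&local_area, &total_area, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("area under x^2 on [0,1] = %f (exact: 0.333333)\n", total_area);

    MPI_Finalize();
    return 0;
}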