www.nec-labs.com CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications Yi Yang, NEC Labs Huiyang Zhou, NCSU PPoPP'2014
Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014
Background • Many-core architectures • Overcome the limitations of instruction-level parallelism (ILP). • Achieve high performance at lower energy cost. • Thread-level parallelism (TLP) has been the key to utilizing many-core architectures • CPUs support 10+ threads • Intel Many Integrated Core (MIC) supports 200+ threads • GPGPUs support 10K+ threads • TLP is used to • Occupy a large number of cores. • Hide the off-chip memory latency. PPoPP'2014
GPGPU architecture • Single-instruction, multiple-data (SIMD) • A warp of threads (32 threads) executes the same instruction on different data • A thread can read registers from another thread in the same warp using the __shfl instruction (on the latest NVIDIA Kepler GPUs) • Memory coalescing • A warp of threads accesses data in a single cache line to maximize memory bandwidth • A thread block contains multiple warps • Threads in the same thread block can communicate using shared memory (a software-managed on-chip cache) • Threads in the same thread block run on the same SM (streaming multiprocessor) PPoPP'2014
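As a minimal sketch of the register-exchange feature (illustrative names, not from the original slides): every lane reads the register value held by lane 0 of its warp. On Kepler-era toolkits (CUDA 5.x, as in this talk) the intrinsic is __shfl(); newer toolkits require the __shfl_sync() variant with an explicit mask.
__global__ void shfl_broadcast_demo(int *out) {
    int lane  = threadIdx.x % 32;        // lane id within the warp
    int value = lane * 10;               // a per-thread register value
    int fromLane0 = __shfl(value, 0);    // read lane 0's copy of 'value'
    out[blockIdx.x * blockDim.x + threadIdx.x] = fromLane0;
}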
Parallel programs to enable TLP • Parallel programming languages • OpenMP • CUDA and OpenCL • OpenACC • In order to write a correct parallel program, developers have to • Identify parallel code sections or parallel loops • Modify the code sections or loops using a specific language • In order to achieve high performance • Understand the hardware platform • None of these steps is easy PPoPP'2014
How to write a parallel (CUDA) program
void tmv_single_thread(float *a, float *b, float *c, int w, int h) {
  for (int k = 0; k < w; k++) {
    float sum = 0;
    for (int i = 0; i < h; i++)
      sum += a[i*w+k] * b[i];
    c[k] = sum;
  }
}
Transposed-matrix-vector multiplication (TMV)
• Two loops in the single-thread program: which one do you prefer to parallelize?
• The usual choice is the outer loop, with one GPU thread per output element:
-----------------------------------
__global__ void tmv_kernel(...) { int k = threadIdx.x + blockIdx.x*blockDim.x; ... }
-----------------------------------
• Why not parallelize the inner loop? PPoPP'2014
Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014
Why not parallelize the inner loop • Developers prefer to parallelize outer loops. • How to handle reduction or scan variables (sum +=)? • How to utilize the GPGPU features when parallelizing the nested loop?
__global__ void tmv(float *a, float *b, float *c, int w, int h) {
  float sum = 0;
  int tx = threadIdx.x + blockIdx.x*blockDim.x;
  for (int i = 0; i < h; i++)
    sum += a[i*w+tx] * b[i];
  c[tx] = sum;
}
Kernel code of transposed-matrix-vector multiplication (TMV)
So we can find nested parallelism in many parallel programs PPoPP'2014
Impact of nested parallelism • The potential thread-level parallelism is not fully utilized • If we exploit the nested parallelism, we can get more TLP to make use of under-utilized resources • 10K threads per GPU = 100 threads from the outer loop X 100 threads from the inner loop • The workload/resource usage of each thread is heavy • If we exploit the nested parallelism, we can reduce the workload/resource usage per thread • With limited resources, we can have more threads PPoPP'2014
NVIDIA dynamic parallelism • NVIDIA dynamic parallelism: launch child kernels from a GPU thread • Memory-copy microbenchmark • We launch each child kernel from a parent thread • Each thread of a child kernel copies one element • The overall data to be copied (number of child kernels * number of threads per child kernel): 64M floats • For the same overall workload, increasing the number of child kernels reduces performance. • E.g., 4K child kernel launches (16K threads per child kernel): 34 GB/s • 142 GB/s without dynamic parallelism • Up to 63 GB/s with dynamic parallelism enabled PPoPP'2014
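A hedged sketch of how such a memory-copy microbenchmark can be structured with dynamic parallelism (kernel names and sizes are illustrative, not the exact benchmark code); device-side launches need compute capability 3.5 and compilation with -rdc=true.
__global__ void copy_child(const float *src, float *dst, int offset) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];                 // each child thread copies one element
}

__global__ void copy_parent(const float *src, float *dst,
                            int elemsPerChild, int threadsPerBlock) {
    int parent = blockIdx.x * blockDim.x + threadIdx.x;
    // Each parent thread launches one child kernel for its own chunk.
    copy_child<<<elemsPerChild / threadsPerBlock, threadsPerBlock>>>(
        src, dst, parent * elemsPerChild);
}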
Limitations of NVIDIA dynamic parallelism • Child kernel launch overhead • Communication between child kernel and parent kernel • Significant overhead, as it has to go through global memory • Complicates code development • Not a good fit for applications with small loop counts PPoPP'2014
Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014
Our solution: CUDA-NP • Developers add an OpenMP-like pragma to the parallel loop • Our compiler framework generates optimized code that leverages the nested parallelism
__global__ void tmv(float *a, float *b, float *c, int w, int h) {
  float sum = 0;
  int tx = threadIdx.x + blockIdx.x*blockDim.x;
  #pragma np parallel for reduction(+:sum)
  for (int i = 0; i < h; i++)
    sum += a[i*w+tx] * b[i];
  c[tx] = sum;
}
Kernel code of transposed-matrix-vector multiplication (TMV) PPoPP'2014
Execution diagram • Assume each thread block of the baseline has 8 threads • The optimized kernel has 8*4 threads per thread block • 4 slave threads are used to process each parallel section.
[Figure: a) Execution time of the baseline — master threads run the sequential sections and the loop section. b) Execution time of the optimized kernel — slave threads run the parallel section between the sequential sections.] PPoPP'2014
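On the host side, the only launch-time change for this configuration is the thread-block shape; a minimal sketch (assuming the tmv/tmv_np kernels from these slides and a matrix width w divisible by 8):
// Baseline: 8 threads per block, one master thread per output element.
dim3 block(8);
tmv<<<w / 8, block>>>(a, b, c, w, h);

// CUDA-NP: keep the 8 master threads in x, add 4 slave threads in y.
dim3 block_np(8, 4);
tmv_np<<<w / 8, block_np>>>(a, b, c, w, h);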
Example after CUDA-NP • Introduce threads in the Y dimension as slave threads • Process the parallel section using multiple slave threads • Apply a reduction after the parallel section • The master thread executes the non-parallel sections
__global__ void tmv_np(float *a, float *b, float *c, int w, int h) {
  float sum = 0;
  int tx = threadIdx.x + blockIdx.x*blockDim.x;
  int slave_id = threadIdx.y;
  for (int i = slave_id; i < h; i += slave_size)  // slave_size = number of slave threads per master
    sum += a[i*w+tx] * b[i];
  sum = reduction(sum);
  if (slave_id == 0) c[tx] = sum;
}
Kernel code of transposed-matrix-vector multiplication (TMV) PPoPP'2014
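The reduction() call is generated by the compiler framework; the slides do not show its body, but a shared-memory version for this 8x4 configuration might look like the sketch below (the buffer dimensions and the power-of-two assumption on blockDim.y are illustrative):
// Combine the partial sums of the slave threads that share one master thread.
__device__ float reduction(float val) {
    __shared__ float buf[8][4];          // [masters per block][slaves per master]
    buf[threadIdx.x][threadIdx.y] = val;
    __syncthreads();
    for (int s = blockDim.y / 2; s > 0; s >>= 1) {
        if (threadIdx.y < s)
            buf[threadIdx.x][threadIdx.y] += buf[threadIdx.x][threadIdx.y + s];
        __syncthreads();
    }
    return buf[threadIdx.x][0];          // slave 0 ends up with the full sum
}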
Slave threads organization • Inter-warp nested parallelism • For a master thread, we allocate its slave threads in different warps. • Master thread id 0: slave thread ids 0, 8, 16, 24.
[Figure: Inter-warp NP (warp size is 8 for illustration) — master thread ids 0-7 occupy the first warp; slave thread ids 8-15, 16-23, and 24-31 occupy the following warps.] PPoPP'2014
Slave threads organization • Intra-warp nested parallelism • For a master thread, we allocate its slave threads in the same warp. • Master thread id 0: slave thread ids 0, 1, 2, 3.
[Figure: Intra-warp NP — master thread ids 0, 4, 8, ..., 28; each master's slave threads are the three threads immediately following it (e.g., 1, 2, 3 for master 0).] PPoPP'2014
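One way to view the two layouts is how a thread's master and slave ids would be derived from a flat thread id (a conceptual sketch for 8 masters and 4 slaves per master, matching the figures; the actual CUDA-NP example uses a 2D thread block instead):
int tid = threadIdx.x;            // flat id within a 32-thread block

// Inter-warp NP: masters form the first group; each slave group sits in a later warp.
int master_inter = tid % 8;       // which master this thread serves
int slave_inter  = tid / 8;       // slave id 0..3 (slave 0 is the master itself)

// Intra-warp NP: a master and its slaves occupy consecutive lanes of one warp.
int master_intra = tid / 4;       // which master this thread serves
int slave_intra  = tid % 4;       // slave id 0..3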
Variables across parallel sections • Scalar variables • Inputs/Live-Ins to Parallel Sections • Outputs/Live-Outs from Parallel Sections • Array variables • Inputs/Live-Ins to Parallel Sections • Outputs/Live-Outs from Parallel Sections PPoPP'2014
Scalar variables • Inputs/Live-Ins to parallel sections • A scalar variable of the master thread has to be broadcast to its slave threads. • Intra-warp NP on Kepler: __shfl can be used to broadcast a scalar variable to the slave threads (see the sketch after this slide) • Intra-warp NP on legacy hardware or inter-warp NP: shared memory • Scalar Outputs/Live-Outs from parallel sections • Reduction and scan variables • Intra-warp NP on Kepler: __shfl can be used • Intra-warp NP on legacy hardware or inter-warp NP: shared memory implementation PPoPP'2014
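For the intra-warp case on Kepler, one possible live-in broadcast uses the width argument of __shfl so that each group of slave lanes reads from its own master lane (a hedged sketch assuming 4 slave threads per master on consecutive lanes; newer toolkits use __shfl_sync):
// 'master_value' is the live-in scalar computed by the master thread.
// width = 4 restricts the shuffle to each group of 4 consecutive lanes,
// so srcLane 0 means "lane 0 of my own group", i.e. the master lane.
__device__ float broadcast_live_in(float master_value) {
    return __shfl(master_value, 0, 4);
}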
Array structures across parallel sections • Global memory or shared memory • Visible to all slave threads • Local array (local memory or registers) • Replace the local array with global memory • Replace the local array with shared memory • Partition the local array into smaller local arrays, one per slave thread (see the sketch below) PPoPP'2014
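A rough illustration of the partitioning option (hypothetical sizes and a placeholder f(), not the compiler's actual output): if the master thread used a 16-element local array indexed by the parallel loop, each of 4 slave threads only needs the 4 elements touched by its own iterations.
// Before (master thread):  float tmp[16];  for (int i = 0; i < 16; i++) tmp[i] = f(i);
// After (4 slave threads): each slave keeps only its 4-element slice.
float tmp[16 / 4];                            // 16 / slave_size elements per slave
for (int i = threadIdx.y; i < 16; i += 4)     // slave j handles iterations j, j+4, j+8, j+12
    tmp[i / 4] = f(i);                        // i / slave_size maps to the local slice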
Inter-Warp NP vs. Intra-Warp NP
[Table comparing inter-warp and intra-warp NP; the callout marks the only advantage of intra-warp NP.] PPoPP'2014
Outline • Background • Motivation • CUDA-NP • Experiments • Conclusions PPoPP'2014
Experimental Results • NVIDIA GTX 680 GPU • CUDA SDK 5.0 • Benchmarks • NVIDIA SDK: MarchingCubes (MC) • GPGPU-Sim: Libor (LIB) • Rodinia: LUD (LU), Leukocyte (LE), Streamcluster (SS), Computational Fluid Dynamics (CFD), BucketSort (BK), and Nearest Neighbor (NN) • TMV and MV PPoPP'2014
Best speedup over baseline • CUDA-NP achieves speedups from 1.36x to 6.69x • On average, CUDA-NP achieves a 2.18x speedup across the ten benchmarks PPoPP'2014
Intra-warp NP vs. inter-warp NP • Most benchmarks prefer inter-warp NP • LU has control divergence in the baseline • NN prefers intra-warp NP due to un-coalesced memory accesses in the baseline PPoPP'2014
Number of slave threads • More TLP is not always useful • Most benchmarks prefer 4 or 8 slave threads to achieve the best performance PPoPP'2014
Performance comparison for TMV • CUBLAS 5.0 is a highly optimized library from NVIDIA • For the 1K input, the CUDA-NP version delivers a 4.9x speedup over CUBLAS • CUDA-NP doesn't hurt performance for large input sizes PPoPP'2014
Benefit of the shfl instruction • The __shfl instruction is very useful for MC and LU because it saves shared memory • MC and LU use shared memory intensively PPoPP'2014
Conclusions • Many benchmarks have nested parallelism with small loop counts • We propose CUDA-NP as a compiler framework to support directive-based nested parallelism • Our compiler explores both intra-warp NP and inter-warp NP, and handles live variables across code sections • 2.18x speedup on average PPoPP'2014
Thanks PPoPP'2014
Local array replacement PPoPP'2014
Comparison • NVIDIA dynamic parallelism • NN, TMV, LE, LIB, and CFD are 28.92, 7.61, 13.45, 125.67, and 52.29 times slower than their baselines, respectively. • MC, LU, MV, SS, and BK use shared memory • They would require copying data from shared memory to global memory to utilize NVIDIA dynamic parallelism PPoPP'2014
Experimental Methodology • NVIDIA K20c • Benchmarks PPoPP'2014