CS179: GPU Programming Lecture 10: GPU-Accelerated Libraries
Today • Some useful libraries: • cuRAND • cuBLAS • cuFFT
cuRAND • Oftentimes, we want random data • Simulations often need entropy to behave realistically • How to obtain on the GPU? • No rand(), or simple equivalent • Could use a pseudo-random function with inputs based on thread properties • Ex. (sketched below): float r = cosf(999 * threadIdx.x + 123 * threadIdx.y) • Works okay, but not great
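A minimal sketch of that hack; the constants and kernel name are arbitrary choices for illustration. It is deterministic and statistically weak, hence "okay, but not great":

    __global__ void cheapRandom(float *out) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        // Crude "hash": run the thread index through cosf().
        // Same inputs always give the same outputs -- okay, not great.
        out[id] = 0.5f * (cosf(999.0f * id + 123.0f) + 1.0f);  // maps into [0, 1]
    }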
cuRAND • What you could do with your current tools (sketched below): • Generate N random numbers on the CPU • Allocate space on the GPU • Memcpy to the GPU • Not bad -- if we only need to do this once • Issues: • Number generation is synchronous • Memcpy can be slow • Much better if the random data can live entirely on the GPU
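For concreteness, the copy-from-host approach might look like this (a sketch; the helper name and buffer layout are made up for illustration):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Generate n random floats on the CPU, then copy them to the GPU once.
    void fillDeviceRandom(float *d_data, int n) {
        float *h_data = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; i++)
            h_data[i] = rand() / (float)RAND_MAX;  // synchronous, serial
        cudaMemcpy(d_data, h_data, n * sizeof(float),
                   cudaMemcpyHostToDevice);        // this copy can be slow
        free(h_data);
    }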
cuRAND • Solution: cuRAND • CUDA random number library • Works on both host and device • Lots of different distributions • Uniform, normal, log-normal, Poisson, etc.
cuRAND • Performance • [Performance chart from the original slide not included in this transcript]
cuRAND Host API • Using on the host: • Called from host code • Output buffer must be allocated on the GPU • Random numbers are generated on the GPU • Several pseudorandom generators available • Several random distributions available
cuRAND Host API • Functions to know (put together in the sketch below): • curandCreateGenerator(&g, GEN_TYPE) • GEN_TYPE = CURAND_RNG_PSEUDO_DEFAULT, CURAND_RNG_PSEUDO_XORWOW • Doesn't particularly matter; the differences are small • curandSetPseudoRandomGeneratorSeed(g, SEED) • Again, SEED doesn't matter too much, just pick one (ex.: time(NULL)) • curandGenerate______(…) • Depends on distribution • Ex.: curandGenerate(g, dst, n), curandGenerateNormal(g, dst, n, mean, stddev) • curandDestroyGenerator(g)
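Putting those calls together, a minimal sketch that fills a device buffer with uniform floats (the buffer name and size are illustrative):

    #include <cuda_runtime.h>
    #include <curand.h>
    #include <time.h>

    int main() {
        const size_t n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));   // output lives on the GPU

        curandGenerator_t g;
        curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(g, time(NULL));
        curandGenerateUniform(g, d_data, n);      // uniform floats in (0, 1]
        curandDestroyGenerator(g);

        // … launch kernels that consume d_data …
        cudaFree(d_data);
        return 0;
    }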
cuRAND Host API • curandGenerate() launches asynchronously • Much faster than serial CPU generation • However, all values must be generated and stored before the kernel that uses them runs • (The output pointer is a device pointer for generators from curandCreateGenerator(); only curandCreateGeneratorHost() writes to host memory, which then needs a memcpy) • Introduces some undesired overhead • Might need more memory than we can allocate in one go • Solution: cuRAND device API
cuRAND Device API • Supports RNG inside kernels • No need to generate random data before the kernel launch • We don't have to copy and store all the data at once • RNG states are stored entirely on the GPU • The host still has to allocate device memory for those states (see the example below)
cuRAND Device API • Example:

    curandState *devStates;
    cudaMalloc(&devStates, sizeof(curandState) * nThreads);
    kernel<<<gD, bD, sM>>>(devStates, …);
    cudaFree(devStates);  // don't forget to free!
cuRAND Device API • Example continued:

    // On the device (needs #include <curand_kernel.h>):
    __global__ void kernel(curandState *states, float *v, float a, float b,
                           unsigned long long seed) {
        int id = … // calculate thread id
        curand_init(seed, id, 0, &states[id]);
        // generate random value in range (0, 1]
        v[id] = curand_uniform(&states[id]);
        // transform to range [a, b]
        v[id] = v[id] * (b - a) + a;
    }
cuRAND Device API • Note the difference between cuRAND states and the actual values • A state tracks the generator's position in its sequence; the seed, sequence number, and offset passed to curand_init() determine it • Numbers aren't generated until curand_<DISTRIBUTION>(&state) is called (see the sketch below)
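To make the state/value distinction concrete, a state can be initialized once and drawn from repeatedly; each draw advances the state. A sketch, with perThread as an illustrative parameter:

    // Needs #include <curand_kernel.h>
    __global__ void manyDraws(curandState *states, float *out, int perThread) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curand_init(1234, id, 0, &states[id]);  // seed the state once
        for (int i = 0; i < perThread; i++)
            // each call advances states[id] and produces a fresh value
            out[id * perThread + i] = curand_uniform(&states[id]);
    }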
cuRAND Overview • Can generate numbers on either host or device • Either way, the host must allocate the device memory involved • Many different generators, seeds, and distributions available • Check out these for more details: • http://docs.nvidia.com/cuda/curand/host-api-overview.html • http://docs.nvidia.com/cuda/curand/device-api-overview.html
cuBLAS • Linear algebra is extremely important in many applications • Physics, engineering, mathematics, computer graphics, networking, … • Anything STEM, really • Linear algebra systems are oftentimes HUGE • Ex.: inverting a 10⁶ × 10⁶ matrix would take a while on a CPU… • Linear algebra systems are oftentimes parallelizable • Element a[0][0] doesn't care about what a[1][0] will be, just what it was • Linear algebra is a perfect candidate for the GPU
cuBLAS • cuBLAS: CUDA's linear algebra library • Based on BLAS (Basic Linear Algebra Subprograms) • Supports all 152 standard BLAS routines • Works pretty similarly to BLAS
cuBLAS • Performance • [Performance charts from the original slides not included in this transcript]
cuBLAS • Several levels of BLAS: • BLAS1: Handles vector & vector-vector functions • Sum, min, max, etc. • Add, scale, dot, etc. • BLAS2: Handles matrix-vector functions • Multiplication, generally • BLAS3: Handles matrix-matrix functions • Multiplication, addition, etc.
cuBLAS • Using it is fairly simple • Call initialization before any cuBLAS calls • cublasInit() • Call the functions you need from host code (each launches GPU kernels for you) • Call shutdown once you're done with cuBLAS • cublasShutdown() • (These are the legacy-API calls; the newer v2 API uses a handle with cublasCreate()/cublasDestroy(), as in the sketch below) • Check out the following for more info: • http://docs.nvidia.com/cuda/cublas/index.html
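As a concrete illustration, a minimal sketch of a BLAS1 call (saxpy: y = alpha*x + y) using the handle-based v2 API; the buffer names and sizes are made up for the example:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        // … fill x and y (e.g., cudaMemcpy from the host, or cuRAND) …

        cublasHandle_t handle;
        cublasCreate(&handle);                       // v2 analogue of cublasInit()
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha*x + y
        cublasDestroy(handle);                       // v2 analogue of cublasShutdown()

        cudaFree(x);
        cudaFree(y);
        return 0;
    }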
cuBLAS • Alternative: cuSPARSE • Another CUDA LA library • Generally works well when dealing with sparse matrices (most entries are 0) • Works pretty well even with dense vectors
cuFFT • Another concept with lots of applications, scalability, and parallelizability: the Fourier transform • Commonly used in physics, signal processing, etc. • Oftentimes needs to run in real time • Makes great use of the GPU
cuFFT • Supports 1D, 2D, or 3D Fourier Transforms • 1D transforms can have up to 128 million elements • Based on Cooley-Tukey and Bluestein FFT algorithms • Similar API to FFTW, if familiar • Thread-safe, streamed, asynchronous execution • Supports both in-place and out-of-place transforms • Supports real, complex, float, double data
cuFFT • Performance • [Performance charts from the original slides not included in this transcript]
cuFFT • Usage is fairly simple • Allocate space on the GPU • Same old cudaMalloc() call • Create a cuFFT plan • Tells dimension, sizes, and data types • cufftPlan3d(&plan, nx, ny, nz, TYPE) • TYPE = CUFFT_C2C, CUFFT_C2R, CUFFT_R2C (complex-to-complex, complex-to-real, real-to-complex)
cuFFT • Execute the plan • cufftExecC2C(plan, in_data, out_data, CUFFT_FORWARD) • Replace C2C with your plan type • Can replace CUFFT_FORWARD with CUFFT_INVERSE • Destroy plan, clean up data • cufftDestroy(plan) • cudaFree(in_data), cudaFree(out_data) • Check out more here: • http://docs.nvidia.com/cuda/cufft/index.html
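Putting those steps together, a minimal sketch of an in-place 1D complex-to-complex forward transform (this uses cufftPlan1d rather than the 3D plan shown above; the size and buffer name are illustrative):

    #include <cuda_runtime.h>
    #include <cufft.h>

    int main() {
        const int nx = 1024;
        cufftComplex *data;
        cudaMalloc(&data, nx * sizeof(cufftComplex));
        // … fill data with a signal (e.g., cudaMemcpy from the host) …

        cufftHandle plan;
        cufftPlan1d(&plan, nx, CUFFT_C2C, 1);           // one transform of length nx
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFT

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }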
GPU-Accelerated Libraries • Many more available • https://developer.nvidia.com/gpu-accelerated-libraries • OpenCV: Computer vision library (has GPU acceleration libraries) • NPP: Performance primitives library, helps with signal/image processing • Check them out! • Best practice for learning: • Check out documentation • Check out examples • Modify example code • Repeat above until familiar, then use in your own code!