CS179: GPU Programming


Presentation Transcript


  1. CS179: GPU Programming Lecture 10: GPU-Accelerated Libraries

  2. Today • Some useful libraries: • cuRAND • cuBLAS • cuFFT

  3. cuRAND • Oftentimes, we want random data • Simulations often need entropy to behave realistically • How to obtain on GPU? • No rand(), or simple equivalent • Could use a pseudo-random function with inputs based on thread properties • Ex.: int i = cos(999 * threadIdx.x + 123 * threadIdx.y) • Works okay, but not great
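
As a sketch (not from the original slides; the kernel name, constants, and output buffer are made up), the hack looks something like this in a kernel:

    #include <cuda_runtime.h>

    // Hypothetical "poor man's RNG": derive a value in [0, 1] from the
    // thread id alone. Deterministic, and the values are poorly distributed.
    __global__ void cheapRandom(float *out) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        // cosf of a large multiple of the id scrambles neighboring threads
        // a little, but visible patterns remain -- "okay, but not great"
        out[id] = 0.5f * (cosf(999.0f * id + 123.0f) + 1.0f);
    }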

  4. cuRAND • What you could do with your current tools: • Generate N random numbers on the CPU • Allocate space on the GPU • Memcpy to the GPU • Not bad -- if we only need to do this once • Issues: • Number generation is synchronous • Memcpy can be slow • Much better if the random data can live entirely on the GPU
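
A minimal sketch of that baseline approach, assuming a made-up helper name (devData is already allocated on the GPU):

    #include <cstdlib>
    #include <cuda_runtime.h>

    // Generate n uniform floats serially on the CPU, then copy them over.
    void fillRandomViaHost(float *devData, size_t n) {
        float *hostData = (float *)malloc(n * sizeof(float));
        for (size_t i = 0; i < n; ++i)
            hostData[i] = rand() / (float)RAND_MAX;   // synchronous, serial
        cudaMemcpy(devData, hostData, n * sizeof(float),
                   cudaMemcpyHostToDevice);           // this copy can be slow
        free(hostData);
    }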

  5. cuRAND • Solution: cuRAND • CUDA random number library • Works on both host and device • Lots of different distributions • Uniform, normal, log-normal, Poisson, etc.

  6. cuRAND • Performance [benchmark chart omitted from transcript]

  7. cuRAND Host API • Using cuRAND on the host: • Call from the host • Allocates memory on the GPU • Generates random numbers on the GPU • Several pseudorandom generators available • Several random distributions available

  8. cuRAND Host API • Functions to know: • curandCreateGenerator(&g, GEN_TYPE) • GEN_TYPE = CURAND_RNG_PSEUDO_DEFAULT, CURAND_RNG_PSEUDO_XORWOW, … • The choice doesn't particularly matter; differences are small • curandSetPseudoRandomGeneratorSeed(g, SEED) • Again, SEED doesn't matter too much, just pick one (ex.: time(NULL)) • curandGenerate______(…) • Depends on distribution • Ex.: curandGenerate(g, src, n), curandGenerateNormal(g, src, n, mean, stddev) • curandDestroyGenerator(g)
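
Putting the four calls together, a minimal host-API sketch (buffer size, seed, and variable names are arbitrary choices here; error checking omitted):

    #include <curand.h>
    #include <cuda_runtime.h>
    #include <time.h>

    int main(void) {
        const size_t n = 1 << 20;
        float *devData;
        cudaMalloc(&devData, n * sizeof(float));   // output buffer on the GPU

        curandGenerator_t g;
        curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(g, (unsigned long long)time(NULL));
        curandGenerateUniform(g, devData, n);      // n uniform floats in (0, 1]

        curandDestroyGenerator(g);
        cudaFree(devData);
        return 0;
    }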

  9. cuRAND Host API • curandGenerate() launches asynchronously • Much faster than serial CPU generation • The src argument must be a device pointer for a generator made with curandCreateGenerator(), so results land directly in GPU memory • (curandCreateGeneratorHost() fills host memory instead, which then needs a memcpy) • Still, all values must be pre-generated into one buffer before your kernel runs • Introduces some undesired overhead • Might need more memory than we can allocate in one go • Solution: cuRAND device API

  10. cuRAND Device API • Supports RNG inside kernels • No need to generate random data before the kernel launch • We don't have to copy and store all the data at once • Stores RNG states entirely on the GPU • Still need to allocate the state memory from the host

  11. cuRAND Device API • Example: curandState *devStates; cudaMalloc(&devStates, sizeof(curandState) * nThreads); kernel<<<gD, bD, sM>>>(devStates, …); cudaFree(devStates); // don't forget to free!

  12. cuRAND Device API • Example continued: // On the device: __global__ void kernel(curandState *states, …) { int id = …; // calculate thread id curand_init(seed, id, 0, &states[id]); // generate random value in range (0, 1] v[id] = curand_uniform(&states[id]); // transform to range [a, b] v[id] = v[id] * (b - a) + a; }

  13. cuRAND Device API • Note the difference between cuRAND states and the actual values • A state tracks the generator's position in its pseudorandom sequence • Numbers aren't generated until curand_<DISTRIBUTION>(&state) is called
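
A small sketch of that distinction (hypothetical kernel; the three-draws-per-thread choice is arbitrary): the state is seeded once, and each later call advances it to produce a fresh number:

    #include <curand_kernel.h>

    __global__ void drawSeveral(curandState *states, float *out,
                                unsigned long long seed) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curand_init(seed, id, 0, &states[id]);  // set up the state; no numbers yet
        for (int k = 0; k < 3; ++k)             // each call advances the state
            out[3 * id + k] = curand_uniform(&states[id]);
    }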

  14. cuRAND Overview • Can generate numbers on either host or device • Whether generating on host or device, the host must allocate space on the device • Many different random seeds, distributions available • Check out these for more details: • http://docs.nvidia.com/cuda/curand/host-api-overview.html • http://docs.nvidia.com/cuda/curand/device-api-overview.html

  15. cuBLAS • Linear algebra is extremely important in many applications • Physics, engineering, mathematics, computer graphics, networking, … • Anything STEM, really • Linear algebra systems are oftentimes HUGE • Ex.: inverting a 10^6 × 10^6 matrix would take a while on a CPU… • Linear algebra systems are oftentimes parallelizable • Element a[0][0] doesn't care about what a[1][0] will be, just what it was • Linear algebra is a perfect candidate for the GPU

  16. cuBLAS • cuBLAS: CUDA's linear algebra library • Based on BLAS (Basic Linear Algebra Subprograms) • Supports all 152 standard BLAS routines • Works pretty similarly to BLAS

  17-19. cuBLAS • Performance [benchmark charts omitted from transcript]

  20. cuBLAS • Several levels of BLAS: • BLAS1: handles vector and vector-vector functions • Sum, min, max, etc. • Add, scale, dot, etc. • BLAS2: handles matrix-vector functions • Multiplication, generally • BLAS3: handles matrix-matrix functions • Multiplication, addition, etc.

  21. cuBLAS • Using it is fairly simple • Call initialization before any cuBLAS routines • cublasInit() • Call whatever routines you need from host code (cuBLAS launches its own kernels) • Call shutdown after you're done with cuBLAS • cublasShutdown() • Check out the following for more info: • http://docs.nvidia.com/cuda/cublas/index.html
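
Note that cublasInit()/cublasShutdown() come from the legacy cuBLAS API; current releases use cublasCreate()/cublasDestroy() on a handle. A minimal sketch of the same flow with a BLAS1 routine (SAXPY, y = alpha*x + y; the helper name is made up, error checking omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Compute y = alpha * x + y on the GPU for host arrays h_x, h_y.
    void saxpyOnGPU(int n, float alpha, const float *h_x, float *h_y) {
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);                           // cf. cublasInit()
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // the BLAS1 call
        cublasDestroy(handle);                           // cf. cublasShutdown()

        cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
    }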

  22. cuBLAS • Alternative: cuSPARSE • Another CUDA linear algebra library • Generally works well when dealing with sparse matrices (ones where most entries are 0) • Works pretty well even with dense vectors

  23. cuFFT • Another concept with lots of applications, scalability, and parallelizability: the Fourier transform • Commonly used in physics, signal processing, etc. • Oftentimes needs to run in real time • Makes great use of the GPU

  24. cuFFT • Supports 1D, 2D, or 3D Fourier Transforms • 1D transforms can have up to 128 million elements • Based on Cooley-Tukey and Bluestein FFT algorithms • Similar API to FFTW, if familiar • Thread-safe, streamed, asynchronous execution • Supports both in-place and out-of-place transforms • Supports real, complex, float, double data

  25-26. cuFFT • Performance [benchmark charts omitted from transcript]

  27. cuFFT • Usage is fairly simple • Allocate space on the GPU • Same old cudaMalloc() call • Create a cuFFT plan • Tells cuFFT the dimensionality, sizes, and data type • cufftPlan3d(&plan, nx, ny, nz, TYPE) • TYPE = CUFFT_C2C, CUFFT_C2R, or CUFFT_R2C (complex-to-complex, complex-to-real, real-to-complex)

  28. cuFFT • Execute the plan • cufftExecC2C(plan, in_data, out_data, CUFFT_FORWARD) • Replace C2C with your plan type • Can replace CUFFT_FORWARD with CUFFT_INVERSE • Destroy plan, clean up data • cufftDestroy(plan) • cudaFree(in_data), cudaFree(out_data) • Check out more here: • http://docs.nvidia.com/cuda/cufft/index.html
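
Putting slides 27 and 28 together, a minimal in-place 3D complex-to-complex sketch (the sizes are placeholders; error checking omitted). One caveat worth a comment: cuFFT's inverse transform is unnormalized, so a forward + inverse round trip scales the data by nx*ny*nz:

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main(void) {
        const int nx = 64, ny = 64, nz = 64;
        cufftComplex *data;
        cudaMalloc(&data, sizeof(cufftComplex) * nx * ny * nz);
        // ... fill data, e.g. cudaMemcpy a host signal into it ...

        cufftHandle plan;
        cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);     // describe the transform

        cufftExecC2C(plan, data, data, CUFFT_FORWARD); // in-place forward FFT
        cufftExecC2C(plan, data, data, CUFFT_INVERSE); // unnormalized inverse

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }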

  29. GPU-Accelerated Libraries • Many more available • https://developer.nvidia.com/gpu-accelerated-libraries • OpenCV: computer vision library (has GPU-accelerated modules) • NPP: NVIDIA Performance Primitives, helps with signal/image processing • Check them out! • Best practice for learning: • Check out the documentation • Check out the examples • Modify example code • Repeat the above until familiar, then use it in your own code!
