CUDA Workshop, Week 4

CUDA Workshop, Week 4: NVVP, Existing Libraries, Q/A

Agenda • Text book / resources • Eclipse Nsight, NVIDIA Visual Profiler • Available libraries • Questions • Certificate dispersal • (Optional) Multiple GPUs: Where’s Pixel-Waldo?

Text Book / Resources Text book • Programming Massively Parallel Processors: A Hands-on Approach • David Kirk, Wen-mei Hwu

Text Book / Resources NVIDIA Developer Zone • Early access to updated drivers / updates • Heavily curated help forum • Requires registration and approval (nearly automated) • developer.nvidia.com

Text Book / Resources US! • We’re pretty passionate about this GPU computing stuff. • Collaboration is cool. • If you think you’ve got a problem that can benefit from GPU computation, we may have some ideas.

Eclipse Nsight, NVVP • Nsight: IDE built on an Eclipse foundation • CUDA-aware syntax highlighting / suggestions / recognition • Hooked into NVVP

Eclipse Nsight, NVVP • NVVP: deep profiling of every aspect of GPU execution (memory bandwidth, branch divergence, bank conflicts, compute / transfer overlap, and more) • Provides suggestions for optimization • Graphical view of GPU performance

Eclipse Nsight, NVVP • Nsight and NVVP are available on our cuda# machines • Connect with X forwarding: ssh -X <user>@<cuda machine> • Nsight demo on Week 3 code

Available Libraries • Why re-invent the wheel? • Many GPU-enabled tools built on CUDA are already available • These tools have been extensively tested for efficiency and in most cases will outperform custom solutions • Some require CUDA-like code structure

Available Libraries Linear Algebra, cuBLAS • CUDA-enabled Basic Linear Algebra Subprograms (BLAS) • GPU-accelerated version of the complete standard BLAS library • Provided with the CUDA toolkit, along with code examples • Callable from C and Fortran
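As a rough illustration of calling cuBLAS from C/C++, the sketch below runs the BLAS level-1 routine SAXPY (y = alpha * x + y) on the GPU. The wrapper name `saxpy_on_gpu` and its return-empty-on-failure convention are assumptions made for this example; only the `cuda*` and `cublas*` calls come from the toolkit.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical helper: computes y = alpha * x + y on the GPU with cuBLAS.
// Returns the updated y, or an empty vector if any CUDA/cuBLAS call fails.
std::vector<float> saxpy_on_gpu(float alpha, const std::vector<float>& x,
                                std::vector<float> y) {
    assert(!x.empty() && x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const size_t bytes = x.size() * sizeof(float);

    float* d_x = nullptr;
    float* d_y = nullptr;
    cublasHandle_t handle = nullptr;

    // Short-circuit evaluation stops at the first failing call.
    const bool ok =
        cudaMalloc(&d_x, bytes) == cudaSuccess &&
        cudaMalloc(&d_y, bytes) == cudaSuccess &&
        cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
        // Copy host vectors to the device (stride 1 on both sides).
        cublasSetVector(n, sizeof(float), x.data(), 1, d_x, 1) == CUBLAS_STATUS_SUCCESS &&
        cublasSetVector(n, sizeof(float), y.data(), 1, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
        // The actual BLAS call: d_y = alpha * d_x + d_y.
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
        // Copy the result back to the host.
        cublasGetVector(n, sizeof(float), d_y, 1, y.data(), 1) == CUBLAS_STATUS_SUCCESS;

    if (handle) cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return ok ? y : std::vector<float>{};
}
```

Build with something like `nvcc saxpy.cu -lcublas` (the file name is arbitrary). Note that no kernel is written by hand; the library supplies it.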


Available Libraries Linear Algebra, CULA, MAGMA • CULA and MAGMA build on BLAS with LAPACK-level routines • CULA (Paid) • CULA-dense: LAPACK and BLAS implementations, solvers, decompositions, basic matrix operations • CULA-sparse: routines specialized for sparse matrices, specialized storage structures, iterative methods • MAGMA (Free, BSD license, Fortran bindings) • LAPACK and BLAS implementations, developed by the same development team as LAPACK


Available Libraries IMSL Fortran/C Numerical Library • Large collection of GPU-accelerated mathematical and statistical functions • Free evaluation, paid extension • http://www.roguewave.com/products/imsl-numerical-libraries/fortran-library.aspx

Available Libraries Image/Signal Processing: NVIDIA Performance Primitives • 1900 image processing and 600 signal processing algorithms • Free and provided with the CUDA toolkit, code examples included • Can be used in tandem with graphics APIs such as OpenGL and DirectX


Available Libraries CUDA without the CUDA: Thrust Library • Thrust is a high-level interface to GPU computing • Offers template-interface access to sort, scan, reduce, etc. • A production-tested version is provided with the CUDA toolkit
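To give a feel for the template interface, here is a minimal sketch that sorts, scans, and reduces a vector on the GPU using Thrust. The `ThrustResult` struct and the function name `sort_scan_reduce` are made up for this example; the `thrust::` calls are the library's own.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <cassert>
#include <vector>

// Hypothetical result bundle for this example.
struct ThrustResult {
    std::vector<int> sorted;       // input in ascending order
    std::vector<int> prefix_sums;  // inclusive scan of the sorted values
    int total;                     // sum of all values
};

// Sort, scan, and reduce on the GPU without writing a kernel.
ThrustResult sort_scan_reduce(const std::vector<int>& input) {
    assert(!input.empty());

    // Constructing a device_vector from host iterators copies host -> device.
    thrust::device_vector<int> d_values(input.begin(), input.end());

    thrust::sort(d_values.begin(), d_values.end());

    thrust::device_vector<int> d_scan(d_values.size());
    thrust::inclusive_scan(d_values.begin(), d_values.end(), d_scan.begin());

    ThrustResult result;
    result.total = thrust::reduce(d_values.begin(), d_values.end(), 0);

    // thrust::copy handles the device -> host transfer.
    result.sorted.resize(d_values.size());
    result.prefix_sums.resize(d_scan.size());
    thrust::copy(d_values.begin(), d_values.end(), result.sorted.begin());
    thrust::copy(d_scan.begin(), d_scan.end(), result.prefix_sums.begin());
    return result;
}
```

The calls deliberately resemble the C++ standard library algorithms, which is why code written this way needs no explicit kernels, launch configurations, or `cudaMemcpy` calls.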


Available Libraries Python and CUDA • PyCUDA • Python interface to CUDA functions • Simply a collection of wrappers, but effective • NumbaPro (Paid) • Announced this year at GTC 2013, native CUDA Python compiler • Python = 4th major CUDA language

Available Libraries R and CUDA • R+GPU • Package with accelerated alternatives for common R statistical functions • Rpud / rpudplus • Package with accelerated alternatives for common R statistical functions • Rcuda • … Package with accelerated alternatives for common R statistical functions


Questions?

Certificate Dispersal

Multiple GPUs • Where’s Pixel-Waldo? • Motivation: Given two images that contain a unique suspect and a number of distinct bystanders, identify the suspect by pairwise comparison.

Multiple GPUs • This is hard. • We’ll simplify the problem by reducing the targets to pixel triples.

Multiple GPUs • Step 0: Upload an image and a list to store targets to each GPU. • Diagram: f.bmp on GPU0, s.bmp on GPU1; both target lists start as 0 | 0 | 0 | …

Multiple GPUs • Step 1: Find all positions of potential targets (triples) within each image using both GPUs independently. • Diagram: GPU0 (f.bmp) target list 11 | 143 | 243 | …; GPU1 (s.bmp) target list 3 | 1632 | 54321 | …
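Step 1 can be sketched for a single GPU as follows. This is not the workshop source. It assumes that a "pixel triple" is one RGB pixel stored as a `uchar3` and that (0, 0, 0) is background, so every non-background pixel counts as a potential target. The names `find_targets` and `find_targets_on_gpu` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// ASSUMPTION: a target is any pixel whose RGB triple is not (0, 0, 0).
__device__ bool is_target(uchar3 p) { return p.x != 0 || p.y != 0 || p.z != 0; }

// One thread per pixel. Each target claims the next free slot in `positions`
// through an atomic counter, so the order of the output is not deterministic.
__global__ void find_targets(const uchar3* image, int num_pixels,
                             int* positions, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels && is_target(image[i])) {
        positions[atomicAdd(count, 1)] = i;
    }
}

// Hypothetical host wrapper: returns the target positions in ascending order.
std::vector<int> find_targets_on_gpu(const std::vector<uchar3>& image) {
    assert(!image.empty());
    const int n = static_cast<int>(image.size());
    uchar3* d_image = nullptr;
    int* d_positions = nullptr;
    int* d_count = nullptr;
    check(cudaMalloc(&d_image, n * sizeof(uchar3)));
    check(cudaMalloc(&d_positions, n * sizeof(int)));  // worst case: all pixels
    check(cudaMalloc(&d_count, sizeof(int)));
    check(cudaMemcpy(d_image, image.data(), n * sizeof(uchar3),
                     cudaMemcpyHostToDevice));
    check(cudaMemset(d_count, 0, sizeof(int)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    find_targets<<<blocks, threads>>>(d_image, n, d_positions, d_count);
    check(cudaGetLastError());

    int count = 0;
    check(cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost));
    std::vector<int> positions(count);
    if (count > 0) {
        check(cudaMemcpy(positions.data(), d_positions, count * sizeof(int),
                         cudaMemcpyDeviceToHost));
    }
    cudaFree(d_image);
    cudaFree(d_positions);
    cudaFree(d_count);

    std::sort(positions.begin(), positions.end());  // make the order repeatable
    return positions;
}
```

In the two-GPU setting the same kernel would be launched once per device, each on its own image, since the two searches do not depend on each other.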

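Multiple GPUs • Step 2: Allow GPU0 to access GPU1 memory, then use both images and target lists to compare potential suspects. • Diagram: GPU0 (f.bmp) target list 11 | 143 | 243 | …; GPU1 (s.bmp) target list 3 | 1632 | 54321 | …; the GPUs are connected over the PCI bus; result pair starts as 0 | 0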
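Letting one GPU read another GPU's memory is done with the CUDA runtime's peer-access calls. The helper below is a minimal sketch; the name `try_enable_peer_access` is hypothetical, and whether peer access is available at all depends on the hardware and how the devices are connected.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical helper: returns true if kernels running on `dev` may now
// dereference pointers that were allocated on `peer`.
bool try_enable_peer_access(int dev, int peer) {
    assert(dev != peer);

    int num_devices = 0;
    if (cudaGetDeviceCount(&num_devices) != cudaSuccess) return false;
    if (dev < 0 || peer < 0 || dev >= num_devices || peer >= num_devices) {
        return false;
    }

    // Ask the runtime whether this pair of devices supports peer access.
    int can_access = 0;
    if (cudaDeviceCanAccessPeer(&can_access, dev, peer) != cudaSuccess ||
        !can_access) {
        return false;
    }

    // Peer access is enabled from the point of view of the current device.
    if (cudaSetDevice(dev) != cudaSuccess) return false;
    cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // clear the error state; access is already on
        return true;
    }
    return err == cudaSuccess;
}
```

With `try_enable_peer_access(0, 1)` returning true, a kernel launched on GPU0 can be handed a pointer from a `cudaMalloc` made on GPU1. Access is one-directional, so GPU1 reading GPU0 memory would need a second call with the arguments swapped.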

Multiple GPUs • Step 3: Print the positions of the single matching suspect. • Diagram: f.bmp, CPU, GPU0 (target list 11 | 143 | 243 | …), PCI bus; result pair 132 | 629
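The pairwise comparison behind steps 2 and 3 might look like the sketch below. It is not the workshop source: it keeps both images on one device so that it runs on a single GPU, and it assumes a target is one RGB pixel (`uchar3`) and that two targets match when their triples are identical. The names `compare_targets`, `upload`, and `find_suspect` are hypothetical. In the two-GPU version, `img_s` and `pos_s` would point into GPU1 memory and be read by GPU0 through peer access.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// One thread per (i, j) pair of candidates. The suspect is assumed unique,
// so at most one thread writes to `match`.
__global__ void compare_targets(const uchar3* img_f, const int* pos_f, int n_f,
                                const uchar3* img_s, const int* pos_s, int n_s,
                                int* match) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // candidate in first image
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // candidate in second image
    if (i >= n_f || j >= n_s) return;
    uchar3 a = img_f[pos_f[i]];
    uchar3 b = img_s[pos_s[j]];
    if (a.x == b.x && a.y == b.y && a.z == b.z) {
        match[0] = pos_f[i];
        match[1] = pos_s[j];
    }
}

// Copy a host vector into freshly allocated device memory.
template <typename T>
static T* upload(const std::vector<T>& v) {
    T* d = nullptr;
    check(cudaMalloc(&d, v.size() * sizeof(T)));
    check(cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice));
    return d;
}

// Returns (position in first image, position in second image),
// or (-1, -1) if no pair of candidates matches.
std::pair<int, int> find_suspect(const std::vector<uchar3>& img_f,
                                 const std::vector<int>& pos_f,
                                 const std::vector<uchar3>& img_s,
                                 const std::vector<int>& pos_s) {
    assert(!pos_f.empty() && !pos_s.empty());
    const int n_f = static_cast<int>(pos_f.size());
    const int n_s = static_cast<int>(pos_s.size());
    uchar3* d_img_f = upload(img_f);
    uchar3* d_img_s = upload(img_s);
    int* d_pos_f = upload(pos_f);
    int* d_pos_s = upload(pos_s);
    std::vector<int> h_match = {-1, -1};
    int* d_match = upload(h_match);

    dim3 threads(16, 16);
    dim3 blocks((n_f + 15) / 16, (n_s + 15) / 16);
    compare_targets<<<blocks, threads>>>(d_img_f, d_pos_f, n_f,
                                         d_img_s, d_pos_s, n_s, d_match);
    check(cudaGetLastError());
    check(cudaMemcpy(h_match.data(), d_match, 2 * sizeof(int),
                     cudaMemcpyDeviceToHost));

    cudaFree(d_img_f); cudaFree(d_img_s);
    cudaFree(d_pos_f); cudaFree(d_pos_s); cudaFree(d_match);
    return {h_match[0], h_match[1]};
}
```

Printing the two values of the returned pair corresponds to step 3. Comparing every candidate against every other is quadratic in the number of candidates, which is acceptable here because the target lists are much shorter than the images.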

Multiple GPUs • Walk through the source code. • Things to note: • This is un-optimized and known to be inefficient, but it covers the concepts of asynchronous streams, GPU context switching, universal addressing, and peer-to-peer access. • The source code requires the tclap library to compile. • The source code will be made available in a GitHub repository after the workshop.
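Two of the concepts listed above, asynchronous streams and switching between GPUs, can be shown with a toy example that is unrelated to the workshop source. The sketch queues a copy-in, a kernel, and a copy-out on every visible GPU before waiting on any of them. The kernel `add_one` and the wrapper `add_one_on_all_gpus` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Toy kernel: add 1 to every element.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Runs the toy kernel on every visible GPU; returns one result per device.
std::vector<std::vector<int>> add_one_on_all_gpus(const std::vector<int>& input) {
    assert(!input.empty());
    int num_devices = 0;
    check(cudaGetDeviceCount(&num_devices));
    const int n = static_cast<int>(input.size());
    const size_t bytes = input.size() * sizeof(int);

    std::vector<cudaStream_t> streams(num_devices);
    std::vector<int*> d_buf(num_devices, nullptr);
    std::vector<std::vector<int>> out(num_devices, std::vector<int>(n));

    // Phase 1: queue work on each device without waiting for it to finish.
    for (int dev = 0; dev < num_devices; ++dev) {
        check(cudaSetDevice(dev));  // later calls target this GPU
        check(cudaStreamCreate(&streams[dev]));
        check(cudaMalloc(&d_buf[dev], bytes));
        // Note: copies only overlap fully when host memory is pinned
        // (cudaMallocHost); pageable memory is used here for brevity.
        check(cudaMemcpyAsync(d_buf[dev], input.data(), bytes,
                              cudaMemcpyHostToDevice, streams[dev]));
        add_one<<<(n + 255) / 256, 256, 0, streams[dev]>>>(d_buf[dev], n);
        check(cudaGetLastError());
        check(cudaMemcpyAsync(out[dev].data(), d_buf[dev], bytes,
                              cudaMemcpyDeviceToHost, streams[dev]));
    }

    // Phase 2: wait for each device's stream, then release its resources.
    for (int dev = 0; dev < num_devices; ++dev) {
        check(cudaSetDevice(dev));
        check(cudaStreamSynchronize(streams[dev]));
        check(cudaFree(d_buf[dev]));
        check(cudaStreamDestroy(streams[dev]));
    }
    return out;
}
```

Splitting the loop into a queueing phase and a waiting phase is what allows the devices to work at the same time; synchronizing inside the first loop would serialize them.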
