CUDA Workshop, Week 4

This week's session covers the NVIDIA Visual Profiler (NVVP) and the Eclipse-based Nsight IDE for in-depth GPU profiling. It also covers existing CUDA libraries for linear algebra, image and signal processing, and higher-level GPU development, followed by a Q/A session and certificate distribution. An optional final segment introduces multi-GPU programming. The recommended textbook is Programming Massively Parallel Processors.

ashannon

Presentation Transcript


  1. CUDA Workshop, Week 4 NVVP, Existing Libraries, Q/A

  2. Agenda • Text book / resources • Eclipse Nsight, NVIDIA Visual Profiler • Available libraries • Questions • Certificate distribution • (Optional) Multiple GPUs: Where’s Pixel-Waldo?

  3. Text Book / Resources Text book • Programming Massively Parallel Processors: A Hands-on Approach • David Kirk, Wen-mei Hwu

  4. Text Book / Resources NVIDIA Developer Zone • Early access to updated drivers and other updates • Heavily curated help forum • Requires registration and approval (nearly automated) • developer.nvidia.com

  5. Text Book / Resources US! • We’re pretty passionate about this GPU computing stuff. • Collaboration is cool. • If you think you’ve got a problem that could benefit from GPU computation, we may have some ideas.

  6. Eclipse Nsight, NVVP • IDE built on the Eclipse platform • CUDA-aware syntax highlighting / suggestions / recognition • Hooked into NVVP

  7. Eclipse Nsight, NVVP • Deep profiling of every aspect of GPU execution (memory bandwidth, branch divergence, bank conflicts, compute/transfer overlap, and more) • Provides suggestions for optimization • Graphical view of GPU performance

  8. Eclipse Nsight, NVVP • Nsight and NVVP are available on our cuda# machines • Log in with X forwarding so the GUI displays locally: ssh -X <user>@<cuda machine> • Nsight demo on Week 3 code

  9. Available Libraries • Why reinvent the wheel? • Many GPU-enabled tools built on CUDA are already available • These tools have been extensively tested and tuned for performance, and in most cases they will outperform custom solutions • Some require a CUDA-like code structure

  10. Available Libraries Linear Algebra, cuBLAS • CUDA-enabled Basic Linear Algebra Subprograms • GPU-accelerated version of the complete standard BLAS library • Provided with the CUDA toolkit; code examples are also provided • Callable from C and Fortran

  11. Available Libraries Linear Algebra, cuBLAS

  12. Available Libraries Linear Algebra, cuBLAS
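The program below is a minimal sketch of the usual cuBLAS calling pattern: create a handle, copy the vectors to the GPU, call a routine, and copy the result back. It uses SAXPY (y = a·x + y) from the cuBLAS v2 API. The helper name `saxpy_gpu` and the build line are our own assumptions, and error checking is omitted to keep the sketch short.

```cuda
// Build (assumed): nvcc saxpy_cublas.cu -o saxpy_cublas -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: computes y = a*x + y on the GPU with cuBLAS and
// returns the updated y. Error checking is omitted for brevity.
std::vector<float> saxpy_gpu(float a, const std::vector<float>& x,
                             std::vector<float> y) {
    const int n = static_cast<int>(x.size());
    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetVector(n, sizeof(float), x.data(), 1, d_x, 1);  // host -> device
    cublasSetVector(n, sizeof(float), y.data(), 1, d_y, 1);

    // alpha is passed by pointer; in the default (host) pointer mode it is read from host memory.
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    cublasGetVector(n, sizeof(float), d_y, 1, y.data(), 1);  // device -> host
    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> y = {10.0f, 20.0f, 30.0f, 40.0f};
    std::vector<float> r = saxpy_gpu(2.0f, x, y);
    for (float v : r) printf("%g ", v);   // prints: 12 24 36 48
    printf("\n");
    return 0;
}
```

The same handle can be reused across many calls, so a real program would usually create it once rather than per operation.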

  13. Available Libraries Linear Algebra, CULA, MAGMA • CULA and MAGMA extend BLAS • CULA (paid) • CULA-dense: LAPACK and BLAS implementations, solvers, decompositions, basic matrix operations • CULA-sparse: sparse-matrix routines, specialized storage structures, iterative methods • MAGMA (free, BSD license, Fortran bindings) • LAPACK and BLAS implementations, developed by the same development team as LAPACK.

  14. Available Libraries Linear Algebra, CULA, MAGMA

  15. Available Libraries Linear Algebra, CULA, MAGMA

  16. Available Libraries IMSL Fortran/C Numerical Library • Large collection of GPU-accelerated mathematical and statistical functions • Free evaluation, paid extension • http://www.roguewave.com/products/imsl-numerical-libraries/fortran-library.aspx

  17. Available Libraries Image/Signal Processing: NVIDIA Performance Primitives (NPP) • 1,900 image-processing and 600 signal-processing primitives • Free and provided with the CUDA toolkit; code examples included • Can be used in tandem with graphics APIs such as OpenGL and DirectX

  18. Available Libraries Image/Signal Processing: NVIDIA Performance Primitives
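NPP functions operate directly on device pointers. The minimal sketch below calls one signal-processing primitive, `nppsAdd_32f`, to add two float signals element-wise. The image primitives (the `nppi` family) follow the same pattern but take extra pitch and region-of-interest arguments. The helper `npp_add` and the link flags are assumptions. Recent toolkits may also deprecate this call in favor of a `_Ctx` variant that takes an explicit stream context.

```cuda
// Build (library names assumed for recent toolkits): nvcc npp_add.cu -o npp_add -lnpps -lnppc
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <npps.h>

// Hypothetical helper: element-wise sum of two signals using the NPP
// primitive nppsAdd_32f, which reads and writes device memory.
std::vector<float> npp_add(const std::vector<float>& a, const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(Npp32f);
    Npp32f *d_a = nullptr, *d_b = nullptr, *d_sum = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_sum, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    NppStatus status = nppsAdd_32f(d_a, d_b, d_sum, n);   // d_sum[i] = d_a[i] + d_b[i]
    if (status != NPP_SUCCESS)
        fprintf(stderr, "nppsAdd_32f returned %d\n", static_cast<int>(status));

    std::vector<float> sum(n);
    cudaMemcpy(sum.data(), d_sum, bytes, cudaMemcpyDeviceToHost);  // ordered after the NPP call
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_sum);
    return sum;
}

int main() {
    std::vector<float> r = npp_add({1.0f, 2.0f, 3.0f}, {10.0f, 20.0f, 30.0f});
    for (float v : r) printf("%g ", v);   // prints: 11 22 33
    printf("\n");
    return 0;
}
```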

  19. Available Libraries CUDA without the CUDA: Thrust Library • Thrust is a high-level interface to GPU computing. • Offers template-based, STL-like access to sort, scan, reduce, etc. • A production-tested version is provided with the CUDA toolkit.

  20. Available Libraries CUDA without the CUDA: Thrust Library

  21. Available Libraries CUDA without the CUDA: Thrust Library

  22. Available Libraries CUDA without the CUDA: Thrust Library
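Thrust's interface resembles the C++ STL. Containers such as `thrust::device_vector` hold GPU data, and algorithms such as `thrust::sort`, `thrust::reduce`, and `thrust::inclusive_scan` run on that data in parallel without any hand-written kernels. The following is a minimal sketch; `Summary` and `summarize` are hypothetical names.

```cuda
// Build: nvcc thrust_demo.cu -o thrust_demo   (Thrust ships with the CUDA toolkit)
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Hypothetical result type: sorted values, their inclusive prefix sums, and their total.
struct Summary {
    std::vector<int> sorted;
    std::vector<int> prefix;
    int total;
};

Summary summarize(const std::vector<int>& input) {
    thrust::device_vector<int> d(input.begin(), input.end());    // host -> device copy
    thrust::sort(d.begin(), d.end());                             // parallel sort on the GPU
    int total = thrust::reduce(d.begin(), d.end(), 0);            // parallel sum
    thrust::device_vector<int> scanned(d.size());
    thrust::inclusive_scan(d.begin(), d.end(), scanned.begin());  // parallel prefix sum

    Summary s;
    s.sorted.resize(d.size());
    s.prefix.resize(d.size());
    thrust::copy(d.begin(), d.end(), s.sorted.begin());           // device -> host
    thrust::copy(scanned.begin(), scanned.end(), s.prefix.begin());
    s.total = total;
    return s;
}

int main() {
    Summary s = summarize({5, 1, 4, 2, 3});
    for (int v : s.sorted) printf("%d ", v);   // 1 2 3 4 5
    printf("| sum = %d | prefix:", s.total);   // sum = 15
    for (int v : s.prefix) printf(" %d", v);   // 1 3 6 10 15
    printf("\n");
    return 0;
}
```

Memory management and kernel launches are hidden inside the containers and algorithms, which is what the slide means by "CUDA without the CUDA."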

  23. Available Libraries Python and CUDA • PyCUDA • Python interface to CUDA functions • Essentially a collection of wrappers, but effective • NumbaPro (paid) • Announced at GTC 2013; a native CUDA Python compiler • Python = 4th major CUDA language

  24. Available Libraries R and CUDA • R+GPU • Package with accelerated alternatives for common R statistical functions • Rpud / rpudplus • Package with accelerated alternatives for common R statistical functions • Rcuda • … Package with accelerated alternatives for common R statistical functions

  25. Available Libraries R and CUDA

  26. Questions?

  27. Certificate Distribution

  28. Multiple GPUs • Where’s Pixel-Waldo? Motivation: Given two images that both contain the same single suspect plus a number of distinct bystanders, identify the suspect by pairwise comparison.

  29. Multiple GPUs • This is hard. • We’ll simplify the problem by reducing the targets to pixel triples.

  30. Multiple GPUs (Diagram: f.bmp on GPU0 and s.bmp on GPU1, each with a zeroed target list 0 | 0 | 0 | ….) Step 0: Upload one image, plus a list for storing targets, to each GPU.

  31. Multiple GPUs (Diagram: f.bmp on GPU0 with target list 11 | 143 | 243 | …; s.bmp on GPU1 with target list 3 | 1632 | 54321 | ….) Step 1: Find all positions of potential targets (triples) within each image, using both GPUs independently.
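The sketch below covers steps 0 and 1 under two assumptions. Each image is a single row of 8-bit grayscale pixels with background 0, and a hypothetical "triple" is three consecutive non-zero pixels that begin a run. Each image is uploaded to its own GPU on its own stream, with `cudaSetDevice` switching the current GPU. One thread per pixel then appends matching positions to that GPU's list with `atomicAdd`. The names `find_triples` and `find_all` are our own; the workshop's actual source may work differently.

```cuda
// Build: nvcc find_triples.cu -o find_triples
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical target rule: a triple starts at i when pixels i, i+1, i+2 are
// non-zero and pixel i-1 is background (0), or i is the first pixel.
__global__ void find_triples(const unsigned char* img, int n, int* positions, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 2 >= n) return;
    bool starts_run = (i == 0) || (img[i - 1] == 0);
    if (starts_run && img[i] && img[i + 1] && img[i + 2]) {
        int slot = atomicAdd(count, 1);   // reserve a unique slot; slot order is not deterministic
        positions[slot] = i;
    }
}

// Steps 0 and 1: each image goes to its own GPU (wrapping around if there are
// fewer GPUs than images), on its own stream. All copies and launches are
// queued before any synchronization, so the GPUs work concurrently.
std::vector<std::vector<int>> find_all(const std::vector<std::vector<unsigned char>>& images) {
    int ngpu = 0;
    if (cudaGetDeviceCount(&ngpu) != cudaSuccess || ngpu == 0) return {};
    const size_t k = images.size();
    std::vector<cudaStream_t> stream(k);
    std::vector<unsigned char*> h_img(k), d_img(k);
    std::vector<int*> d_pos(k), d_cnt(k);

    for (size_t g = 0; g < k; ++g) {
        const int n = static_cast<int>(images[g].size());
        cudaSetDevice(static_cast<int>(g % ngpu));                  // switch the current GPU
        cudaStreamCreate(&stream[g]);
        cudaMallocHost(&h_img[g], n);                               // pinned, so the copy can be async
        std::memcpy(h_img[g], images[g].data(), n);
        cudaMalloc(&d_img[g], n);
        cudaMalloc(&d_pos[g], n * sizeof(int));
        cudaMalloc(&d_cnt[g], sizeof(int));
        cudaMemsetAsync(d_pos[g], 0, n * sizeof(int), stream[g]);  // the zeroed target list
        cudaMemsetAsync(d_cnt[g], 0, sizeof(int), stream[g]);
        cudaMemcpyAsync(d_img[g], h_img[g], n, cudaMemcpyHostToDevice, stream[g]);
        const int blocks = std::max(1, (n + 255) / 256);
        find_triples<<<blocks, 256, 0, stream[g]>>>(d_img[g], n, d_pos[g], d_cnt[g]);
    }

    std::vector<std::vector<int>> found(k);
    for (size_t g = 0; g < k; ++g) {
        cudaSetDevice(static_cast<int>(g % ngpu));
        cudaStreamSynchronize(stream[g]);                           // wait for this GPU's work only
        int count = 0;
        cudaMemcpy(&count, d_cnt[g], sizeof(int), cudaMemcpyDeviceToHost);
        found[g].resize(count);
        cudaMemcpy(found[g].data(), d_pos[g], count * sizeof(int), cudaMemcpyDeviceToHost);
        std::sort(found[g].begin(), found[g].end());
        cudaFree(d_img[g]); cudaFree(d_pos[g]); cudaFree(d_cnt[g]);
        cudaFreeHost(h_img[g]);
        cudaStreamDestroy(stream[g]);
    }
    return found;
}

int main() {
    // Tiny hypothetical stand-ins for f.bmp and s.bmp (one grayscale row each).
    std::vector<std::vector<unsigned char>> images = {
        {0, 5, 6, 7, 0, 0, 1, 2, 3, 0},
        {9, 9, 9, 0, 4, 4, 0, 8, 8, 8}};
    std::vector<std::vector<int>> found = find_all(images);
    for (size_t g = 0; g < found.size(); ++g) {
        printf("image %zu:", g);
        for (int p : found[g]) printf(" %d", p);   // image 0: 1 6, image 1: 0 7
        printf("\n");
    }
    return 0;
}
```

`atomicAdd` hands out list slots in whatever order threads arrive, so the list is sorted on the host before it is returned.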

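  32. Multiple GPUs (Diagram: f.bmp on GPU0 with target list 11 | 143 | 243 | …; s.bmp on GPU1 with target list 3 | 1632 | 54321 | …; result slots 0 | 0; the GPUs are connected over the PCI bus.) Step 2: Allow GPU0 to access GPU1’s memory, then use both images and target lists to compare potential suspects.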
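The sketch below covers steps 2 and 3 under a hypothetical rule: two candidates match when their pixel triples are identical. Image B and its candidate list stay in GPU1's memory, and `cudaDeviceEnablePeerAccess` lets GPU0 read them directly over the PCI bus. A kernel on GPU0 then compares every candidate pair. If there is only one GPU, or the pair does not support peer access, the sketch falls back to placing everything on GPU0. `match_triples`, `to_device`, and `find_suspect` are hypothetical names, and the candidate lists are hard-coded stand-ins for step 1's output.

```cuda
// Build: nvcc match_p2p.cu -o match_p2p
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical match rule: candidates match when the three pixels starting at
// their positions are identical. One thread per candidate in image A.
__global__ void match_triples(const unsigned char* imgA, const int* posA, int nA,
                              const unsigned char* imgB, const int* posB, int nB,
                              int2* match, int* found) {
    int a = blockIdx.x * blockDim.x + threadIdx.x;
    if (a >= nA) return;
    int p = posA[a];
    for (int b = 0; b < nB; ++b) {
        int q = posB[b];   // imgB and posB may live on the peer GPU, so these reads cross the PCI bus
        if (imgA[p] == imgB[q] && imgA[p + 1] == imgB[q + 1] && imgA[p + 2] == imgB[q + 2]) {
            if (atomicExch(found, 1) == 0) *match = make_int2(p, q);   // first match wins
        }
    }
}

// Hypothetical helper: allocates on the current GPU and uploads a vector.
template <typename T>
T* to_device(const std::vector<T>& v) {
    T* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(T));
    cudaMemcpy(d, v.data(), v.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Steps 2 and 3: returns true and sets (p, q) if a candidate in A matches one in B.
bool find_suspect(const std::vector<unsigned char>& imgA, const std::vector<int>& posA,
                  const std::vector<unsigned char>& imgB, const std::vector<int>& posB,
                  int* p, int* q) {
    if (posA.empty() || posB.empty()) return false;
    int ngpu = 0, devB = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 1) {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);                    // can GPU0 read GPU1's memory?
        if (can) {
            cudaSetDevice(0);
            cudaError_t e = cudaDeviceEnablePeerAccess(1, 0);  // one-way: GPU0 -> GPU1
            if (e == cudaSuccess || e == cudaErrorPeerAccessAlreadyEnabled) devB = 1;
            cudaGetLastError();                                 // clear a non-fatal "already enabled" error
        }
    }
    cudaSetDevice(devB);                                        // B's data lives on GPU1 if P2P works
    unsigned char* d_imgB = to_device(imgB);
    int* d_posB = to_device(posB);

    cudaSetDevice(0);                                           // A's data and the comparison on GPU0
    unsigned char* d_imgA = to_device(imgA);
    int* d_posA = to_device(posA);
    int2* d_match = nullptr;
    int* d_found = nullptr;
    cudaMalloc(&d_match, sizeof(int2));
    cudaMalloc(&d_found, sizeof(int));
    cudaMemset(d_found, 0, sizeof(int));

    int nA = static_cast<int>(posA.size()), nB = static_cast<int>(posB.size());
    match_triples<<<(nA + 255) / 256, 256>>>(d_imgA, d_posA, nA, d_imgB, d_posB, nB, d_match, d_found);

    int found = 0;
    int2 m = make_int2(-1, -1);
    cudaMemcpy(&found, d_found, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaMemcpy(&m, d_match, sizeof(int2), cudaMemcpyDeviceToHost);
    cudaFree(d_imgA); cudaFree(d_posA); cudaFree(d_match); cudaFree(d_found);
    cudaSetDevice(devB);
    cudaFree(d_imgB); cudaFree(d_posB);
    if (found) { *p = m.x; *q = m.y; }
    return found != 0;
}

int main() {
    // Hypothetical images and step-1 candidate lists.
    std::vector<unsigned char> f = {0, 9, 8, 7, 0, 1, 2, 3, 0};
    std::vector<unsigned char> s = {0, 0, 4, 4, 4, 0, 1, 2, 3, 0};
    int p = -1, q = -1;
    if (find_suspect(f, {1, 5}, s, {2, 6}, &p, &q))
        printf("suspect at f.bmp position %d and s.bmp position %d\n", p, q);   // 5 and 6
    else
        printf("no suspect found\n");
    return 0;
}
```

This brute-force comparison performs on the order of nA·nB remote reads. Copying B's list to GPU0 once, or sorting it, would reduce PCI traffic.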

  33. Multiple GPUs (Diagram: CPU and GPU0 connected over the PCI bus; GPU0 holds f.bmp, its target list 11 | 143 | 243 | …, and the result 132 | 629.) Step 3: Print the positions of the single matching suspect.

  34. Multiple GPUs • Walk through the source code. • Things to note: • The code is unoptimized and known to be inefficient, but it covers the concepts of asynchronous streams, GPU context switching, unified virtual addressing, and peer-to-peer access. • The source code requires the TCLAP library to compile. • The source code will be made available in a GitHub repository after the workshop.
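Unified virtual addressing (UVA) gives host memory and every GPU's memory a single address space, so the runtime can determine where a pointer lives. In the minimal sketch below, `cudaPointerGetAttributes` reports the owning device, and `cudaMemcpy` with `cudaMemcpyDefault` infers the direction of each copy, including GPU-to-GPU copies. `owning_device` is a hypothetical helper. On a single-GPU machine, both buffers land on GPU0.

```cuda
// Build: nvcc uva_demo.cu -o uva_demo
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the GPU that owns a device pointer, or -1 on
// error. Under UVA the runtime can tell this from the address alone.
int owning_device(const void* ptr) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) return -1;
    return attr.device;
}

int main() {
    int ngpu = 0;
    if (cudaGetDeviceCount(&ngpu) != cudaSuccess || ngpu == 0) return 1;
    const int other = ngpu - 1;                 // GPU1 on a two-GPU machine, else GPU0

    int* h_val = nullptr;
    cudaMallocHost(&h_val, sizeof(int));        // pinned host memory, also in the unified address space
    *h_val = 42;

    cudaSetDevice(other);
    int* d_src = nullptr;
    cudaMalloc(&d_src, sizeof(int));
    cudaSetDevice(0);
    int* d_dst = nullptr;
    cudaMalloc(&d_dst, sizeof(int));

    // cudaMemcpyDefault: the runtime infers each copy's direction from the addresses.
    cudaMemcpy(d_src, h_val, sizeof(int), cudaMemcpyDefault);   // host -> GPU `other`
    cudaMemcpy(d_dst, d_src, sizeof(int), cudaMemcpyDefault);   // GPU `other` -> GPU0
    *h_val = 0;
    cudaMemcpy(h_val, d_dst, sizeof(int), cudaMemcpyDefault);   // GPU0 -> host

    printf("src on GPU%d, dst on GPU%d, value %d\n",
           owning_device(d_src), owning_device(d_dst), *h_val);
    cudaFree(d_dst);
    cudaSetDevice(other);
    cudaFree(d_src);
    cudaFreeHost(h_val);
    return 0;
}
```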
