
CUDA Odds and Ends

Presentation Transcript


  1. CUDA Odds and Ends Joseph Kider University of Pennsylvania CIS 565 - Fall 2011

  2. Sources • Patrick Cozzi Spring 2011 • NVIDIA CUDA Programming Guide • CUDA by Example • Programming Massively Parallel Processors

  3. Agenda • Atomic Functions • Paged-Locked Host Memory • Streams • Graphics Interoperability

  4. Atomic Functions • What is the value of count if 8 threads execute ++count?
      __device__ unsigned int count = 0;
      // ...
      ++count;
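
A minimal sketch of why the answer is not simply 8 (the kernel name and host code here are hypothetical, not from the deck): ++count compiles to a separate read, increment, and write, so the threads' read-modify-write sequences can interleave and updates can be lost.

      #include <cstdio>

      __device__ unsigned int count = 0;

      // Eight threads each perform a non-atomic increment. Because the
      // read-modify-write sequences can interleave, the final value can
      // be anywhere from 1 to 8.
      __global__ void incrementRace()
      {
        ++count;
      }

      int main()
      {
        incrementRace<<<1, 8>>>();
        unsigned int result;
        cudaMemcpyFromSymbol(&result, count, sizeof(result));
        printf("count = %u\n", result);  // not guaranteed to print 8
        return 0;
      }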

  5. Atomic Functions • Read-modify-write atomic operation • Guaranteed no interference from other threads • No guarantee on order • Shared or global memory • Requires compute capability 1.1 (> G80) See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

  6. Atomic Functions • What is the value of count if 8 threads execute atomicInc below?
      __device__ unsigned int count = 0;
      // ...
      atomicInc(&count, 1);  // atomic ++count
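
One caveat worth flagging (this is the documented behavior of atomicInc in the programming guide, not a claim made by the deck): atomicInc(addr, val) stores ((old >= val) ? 0 : (old + 1)), that is, it wraps to zero once the counter reaches val. With val = 1 the counter ping-pongs between 0 and 1, so for a plain atomic increment the usual call is:

      atomicAdd(&count, 1);  // unambiguous atomic ++count

atomicInc's wrapping behavior is instead handy for things like ring-buffer indices.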

  7. Atomic Functions • How do you implement atomicAdd?
      __device__ int atomicAdd(int *address, int val);

  8. Atomic Functions • How do you implement atomicAdd?
      __device__ int atomicAdd(int *address, int val)
      {
        // Made-up keyword:
        __lock(address) {
          *address += val;
        }
      }

  9. Atomic Functions • How do you implement atomicAdd without locking?

  10. Atomic Functions • How do you implement atomicAdd without locking? • What if you were given an atomic compare-and-swap?
      int atomicCAS(int *address, int compare, int val);

  11. Atomic Functions • atomicCAS pseudo-implementation:
      int atomicCAS(int *address, int compare, int val)
      {
        // Made-up keyword:
        __lock(address) {
          int old = *address;
          *address = (old == compare) ? val : old;
          return old;
        }
      }

  14. Atomic Functions • Example:
      *addr = 1;
      atomicCAS(addr, 1, 2);
      atomicCAS(addr, 1, 3);
      atomicCAS(addr, 2, 3);

  15. Atomic Functions • Example:
      *addr = 1;
      atomicCAS(addr, 1, 2);  // returns 1, *addr = 2
      atomicCAS(addr, 1, 3);
      atomicCAS(addr, 2, 3);

  16. Atomic Functions • Example:
      *addr = 1;
      atomicCAS(addr, 1, 2);  // returns 1, *addr = 2
      atomicCAS(addr, 1, 3);  // returns 2, *addr = 2 (compare failed)
      atomicCAS(addr, 2, 3);

  17. Atomic Functions • Example:
      *addr = 1;
      atomicCAS(addr, 1, 2);  // returns 1, *addr = 2
      atomicCAS(addr, 1, 3);  // returns 2, *addr = 2 (compare failed)
      atomicCAS(addr, 2, 3);  // returns 2, *addr = 3

  18. Atomic Functions • Again, how do you implement atomicAdd given atomicCAS?
      __device__ int atomicAdd(int *address, int val);

  19. Atomic Functions
      __device__ int atomicAdd(int *address, int val)
      {
        int old = *address, assumed;
        do {
          assumed = old;
          old = atomicCAS(address, assumed, val + assumed);
        } while (assumed != old);
        return old;
      }

  20. Atomic Functions
      __device__ int atomicAdd(int *address, int val)
      {
        int old = *address, assumed;  // read the original value at *address
        do {
          assumed = old;
          old = atomicCAS(address, assumed, val + assumed);
        } while (assumed != old);
        return old;
      }

  21. Atomic Functions
      __device__ int atomicAdd(int *address, int val)
      {
        int old = *address, assumed;
        do {
          assumed = old;
          // If the value at *address didn't change, increment it
          old = atomicCAS(address, assumed, val + assumed);
        } while (assumed != old);
        return old;
      }

  22. Atomic Functions
      __device__ int atomicAdd(int *address, int val)
      {
        int old = *address, assumed;
        do {
          assumed = old;
          old = atomicCAS(address, assumed, val + assumed);
        } while (assumed != old);  // otherwise, loop until atomicCAS succeeds
        return old;
      }
      The value of *address after this function returns is not necessarily the original value of *address + val. Why?
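
The same CAS loop generalizes to types the hardware has no native atomic for. As a sketch, here is the well-known pattern from the CUDA C Programming Guide for double-precision atomicAdd, built on the 64-bit atomicCAS (which requires compute capability 1.2 or higher for global memory):

      __device__ double atomicAddDouble(double *address, double val)
      {
        unsigned long long int *address_as_ull =
            (unsigned long long int *)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
          assumed = old;
          // Reinterpret the bits as a double, add, and attempt the swap
          old = atomicCAS(address_as_ull, assumed,
                          __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);  // retry if another thread got in first
        return __longlong_as_double(old);
      }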

  23. Atomic Functions • Lots of atomics:
      Arithmetic: atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS()
      Bitwise: atomicAnd(), atomicOr(), atomicXor()
      See B.10 in the NVIDIA CUDA C Programming Guide

  24. Atomic Functions • How can threads from different blocks work together? • Use atomics sparingly. Why?
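
One answer to both questions at once, as a sketch (the kernel and its names are hypothetical): have each block combine its threads' contributions in shared memory first, then issue a single global atomic per block, so contention on the global counter drops by a factor of the block size. (Shared-memory atomics require compute capability 1.2 or higher.)

      // Sum an array with one global atomicAdd per block
      // instead of one per thread.
      __global__ void blockSum(const int *in, int n, int *total)
      {
        __shared__ int blockTotal;
        if (threadIdx.x == 0) blockTotal = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&blockTotal, in[i]);  // cheap shared-memory atomic
        __syncthreads();

        // Only thread 0 of each block touches the global counter
        if (threadIdx.x == 0) atomicAdd(total, blockTotal);
      }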

  25. Page-Locked Host Memory • Page-locked memory • Host memory that the operating system will not page out of physical RAM • Also called pinned memory

  26. Page-Locked Host Memory • Benefits • Overlap kernel execution and data transfers [timeline diagram: normally the data transfer must finish before kernel execution starts; with page-locked memory, the transfer and kernel execution overlap] See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

  27. Page-Locked Host Memory • Benefits • Increased memory bandwidth for systems with a front-side bus • Up to ~2x throughput Image from http://arstechnica.com/hardware/news/2009/10/day-of-nvidia-chipset-reckoning-arrives.ars

  28. Page-Locked Host Memory • Benefits • Write-Combining Memory • Page-locked memory is cacheable by default • Allocate with the cudaHostAllocWriteCombined flag to: • Avoid polluting L1 and L2 caches • Avoid snooping transfers across PCIe • Improve transfer performance up to 40%, in theory • Reading from write-combining memory is slow! • Only write to it from the host

  29. Page-Locked Host Memory • Benefits • Page-locked host memory can be mapped into the address space of the device on some systems • What systems allow this? • What does this eliminate? • What applications does this enable? • Call cudaGetDeviceProperties() and check canMapHostMemory
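
A sketch of the mapped-memory path, with error checking omitted and bytes standing in for whatever allocation size is needed; the calls themselves are standard CUDA runtime API:

      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);

      if (prop.canMapHostMemory) {
        // Must be set before any call creates the CUDA context
        cudaSetDeviceFlags(cudaDeviceMapHost);

        float *hostPtr, *devPtr;
        cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);

        // devPtr aliases the same physical pages; kernels can then
        // read and write host memory directly over PCIe, eliminating
        // explicit cudaMemcpy calls for this buffer.
        cudaHostGetDevicePointer(&devPtr, hostPtr, 0);
      }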

  30. Page-Locked Host Memory • Usage: cudaHostAlloc() / cudaMallocHost(), cudaFreeHost(), cudaMemcpyAsync() See 3.2.5 in the NVIDIA CUDA C Programming Guide
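
A minimal usage sketch tying these calls together (devPtr, stream, and size are hypothetical placeholders; note the pairing is cudaHostAlloc with cudaFreeHost):

      float *pinned;
      // cudaMallocHost(&pinned, size) is the flag-free equivalent;
      // cudaHostAllocWriteCombined could be passed here instead for
      // host-write-only staging buffers (see slide 28).
      cudaHostAlloc(&pinned, size, cudaHostAllocDefault);

      // Pinned memory is what allows this copy to be truly asynchronous
      cudaMemcpyAsync(devPtr, pinned, size, cudaMemcpyHostToDevice, stream);

      cudaStreamSynchronize(stream);  // don't free while the copy is in flight
      cudaFreeHost(pinned);           // not cudaFree()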

  31. Page-Locked Host Memory DEMO CUDA SDK Example: bandwidthTest

  32. Page-Locked Host Memory • What’s the catch? • Page-locked memory is scarce • Allocations will start failing long before allocations in pageable memory do • Pinning reduces the amount of physical memory available to the OS for paging • Allocating too much will hurt overall system performance

  33. Streams • Stream: a sequence of commands that execute in order • Streams may execute their commands out of order or concurrently with respect to other streams [diagram: Stream A and Stream B, each a queue of Command 0, Command 1, Command 2]

  34. Streams • Is this a possible order? [timeline: one stream's Command 0, 1, 2 execute first, followed by the other stream's Command 0, 1, 2; fully serialized]

  35. Streams • Is this a possible order? [timeline: the same fully serialized pattern with the streams' roles swapped]

  36. Streams • Is this a possible order? [timeline: the two streams' commands alternate: Command 0 of each, then Command 1 of each, then Command 2 of each]

  37. Streams • Is this a possible order? [timeline: one stream's Command 2 executes before its own Command 1, i.e. commands within a stream run out of order]

  38. Streams • Is this a possible order? [timeline: the streams run concurrently: both Command 0s at the same time, then both Command 1s, then both Command 2s]

  39. Streams • In CUDA, what commands go in a stream? • Kernel launches • Host ↔ device memory transfers

  40. Streams • Code Example • Create two streams • Each stream: • Copy page-locked memory to device • Launch kernel • Copy memory back to host • Destroy streams

  41. Stream Example (Step 1 of 3)
      cudaStream_t stream[2];
      for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
      }
      float *hostPtr;
      cudaMallocHost(&hostPtr, 2 * size);

  42. Stream Example (Step 1 of 3)
      cudaStream_t stream[2];
      for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);  // create two streams
      }
      float *hostPtr;
      cudaMallocHost(&hostPtr, 2 * size);

  43. Stream Example (Step 1 of 3)
      cudaStream_t stream[2];
      for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
      }
      float *hostPtr;
      cudaMallocHost(&hostPtr, 2 * size);  // allocate two buffers in page-locked memory

  44. Stream Example (Step 2 of 3)
      for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
        kernel<<<100, 512, 0, stream[i]>>>(/* ... */);
        cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
      }

  45. Stream Example (Step 2 of 3)
      for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(/* ... */, cudaMemcpyHostToDevice, stream[i]);
        kernel<<<100, 512, 0, stream[i]>>>(/* ... */);
        cudaMemcpyAsync(/* ... */, cudaMemcpyDeviceToHost, stream[i]);
      }
      Commands are assigned to, and executed by, streams
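
Filled in, the /* ... */ elisions are the usual pointer and size arguments. In this sketch inputDev, outputDev, N, and the kernel's parameter list are hypothetical (the deck does not specify them); each stream works on its own half of the buffers:

      const size_t bytes = N * sizeof(float);
      for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(inputDev + i * N, hostPtr + i * N, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
        kernel<<<100, 512, 0, stream[i]>>>(outputDev + i * N,
                                           inputDev + i * N, N);
        cudaMemcpyAsync(hostPtr + i * N, outputDev + i * N, bytes,
                        cudaMemcpyDeviceToHost, stream[i]);
      }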

  46. Stream Example (Step 3 of 3)
      for (int i = 0; i < 2; ++i) {
        // Blocks until the stream's commands complete
        cudaStreamDestroy(stream[i]);
      }

  47. Streams • Assume compute capabilities: • Overlap of data transfer and kernel execution • Concurrent kernel execution • Concurrent data transfer • How can the streams overlap? See G.1 in the NVIDIA CUDA C Programming Guide for more on compute capabilities
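
Whether a particular GPU supports each kind of overlap can be queried at runtime; a small sketch:

      #include <cstdio>

      int main()
      {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("copy/kernel overlap: %d\n", prop.deviceOverlap);
        printf("async copy engines:  %d\n", prop.asyncEngineCount);
        printf("concurrent kernels:  %d\n", prop.concurrentKernels);
        return 0;
      }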

  48. Streams [timeline: Stream A and Stream B each perform a host→device copy, kernel execution, then a device→host copy; one stream's transfers overlap the other's kernel execution] • Can we have more overlap than this?

  49. Streams [timeline: a more aggressive schedule in which the two streams' copies and kernel executions overlap further] • Can we have this?

  50. Streams • Implicit Synchronization • An operation that requires a dependency check (to see whether a kernel has finished executing) blocks all kernel launches from any stream until the checked kernel finishes See 3.2.6.5.3 in the NVIDIA CUDA C Programming Guide for all limitations
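
A practical consequence, following NVIDIA's general performance guidance rather than anything specific to this deck: issuing work breadth-first, all copies before all kernel launches, reduces the chance that a dependency check on one stream stalls launches in another. Reusing the hypothetical names from the step 2 sketch:

      for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(inputDev + i * N, hostPtr + i * N, bytes,
                        cudaMemcpyHostToDevice, stream[i]);
      for (int i = 0; i < 2; ++i)
        kernel<<<100, 512, 0, stream[i]>>>(outputDev + i * N,
                                           inputDev + i * N, N);
      for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(hostPtr + i * N, outputDev + i * N, bytes,
                        cudaMemcpyDeviceToHost, stream[i]);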
