1 / 59

CUDA Lecture 8 CUDA Memories

CUDA Lecture 8 CUDA Memories. Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron. Grid. Block (0, 0). Block (1, 0). Shared Memory. Shared Memory. Registers. Registers. Registers. Registers. Thread (0, 0). Thread (1, 0). Thread (0, 0). Thread (1, 0). Host.

raina
Download Presentation

CUDA Lecture 8 CUDA Memories

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CUDA Lecture 8CUDA Memories Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. Grid Block (0, 0) Block (1, 0) Shared Memory Shared Memory Registers Registers Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Host Global Memory Constant Memory Hardware Implementation of CUDA Memories • Each thread can: • Read/write per-thread registers • Read/write per-thread local memory • Read/write per-block shared memory • Read/write per-grid global memory • Read/only per-gridconstant memory CUDA Memories – Slide 2

  3. CUDA Variable Type Qualifiers • __device__ is optional when used with __local__, __shared__ or __constant__ CUDA Memories – Slide 3

  4. CUDA Variable Type Qualifiers (cont.) • Automatic scalar variables without any qualifier reside in a register • Compiler will spill to thread local memory • Automatic array variables without any qualifier reside in a thread-local memory CUDA Memories – Slide 4

  5. CUDA Variable Type Performance • scalarvariables reside in fast, on-chip registers • sharedvariables reside in fast, on-chip memories • thread-local arrays and global variables reside in uncached off-chip memory • constant variables reside in cached off-chip memory CUDA Memories – Slide 5

  6. CUDA Variable Type Scale • 100Ks per-thread variables, R/W by 1 thread • 100s sharedvariables, each R/W by 100s of threads • 1 global variable is R/W by 100Ks threads • 1 constant variable is readable by 100Ks threads CUDA Memories – Slide 6

  7. Where to declare variables? CUDA Memories – Slide 7

  8. Example: Thread-local Variables CUDA Memories – Slide 8

  9. Example: Shared Variables CUDA Memories – Slide 9

  10. Example: Shared Variables (cont.) Two loads CUDA Memories – Slide 10

  11. Example: Shared Variables (cont.) // once by thread i // again by thread i+1 CUDA Memories – Slide 11

  12. Example: Shared Variables (cont.) CUDA Memories – Slide 12

  13. Example: Shared Variables (cont.) CUDA Memories – Slide 13

  14. Example: Shared Variables (cont.) CUDA Memories – Slide 14

  15. Optimization Analysis • Experiment performed on a GT200 chip • Improvement likely better on an older architecture • Improvement likely worse on a newer architecture • Optimizations tend to come with a development cost CUDA Memories – Slide 15

  16. Variable Type Restrictions • Pointers can only point to memory allocated or declared in global memory: • Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr) • Obtained as the address of a global variable: float* ptr = &GlobalVar; CUDA Memories – Slide 16

  17. Variable Type Restrictions (cont.) • So you can use pointers and point at any memory space per se: CUDA Memories – Slide 17

  18. Variable Type Restrictions (cont.) • Pointers aren’t typed on memory space • Where does ptrpoint? • ptr is a __shared__pointer variable, not a pointer to a __shared__variable! CUDA Memories – Slide 18

  19. Don’t confuse the compiler! CUDA Memories – Slide 19

  20. Advice • Prefer dereferencing pointers in simple, regular access patterns • Avoid propagating pointers • Avoid pointers to pointers • The GPU would rather not pointer chase • Linked lists will not perform well • Pay attention to compiler warning messages Warning: Cannot tell what pointer points to, assuming global memory space • Crash waiting to happen CUDA Memories – Slide 20

  21. A Common Programming Strategy • Global memory resides in device memory (DRAM) • Much slower access than shared memory • So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory: • Generalize from adjacent_difference example • Divide and conquer CUDA Memories – Slide 21

  22. A Common Programming Strategy (cont.) • Partition data into subsets that fit into shared memory CUDA Memories – Slide 22

  23. A Common Programming Strategy (cont.) • Handleeach data subset with one thread block as follows: CUDA Memories – Slide 23

  24. A Common Programming Strategy (cont.) • Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism CUDA Memories – Slide 24

  25. A Common Programming Strategy (cont.) • Perform the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element CUDA Memories – Slide 25

  26. A Common Programming Strategy (cont.) • Copy the results from shared memory back to global memory CUDA Memories – Slide 26

  27. A Common Programming Strategy (cont.) • Constant memory also resides in device memory (DRAM) • Much slower access than shared memory • But…cached! • Highly efficient access for read-only data CUDA Memories – Slide 27

  28. A Common Programming Strategy (cont.) • Carefully partition data according to access patterns • Read-only __constant__ memory (very fast if in cache) • R/W & shared within block __shared__ memory (very fast) • R/W within each thread  registers (very fast) • Indexed R/W within each thread  local memory (slow) • R/W inputs/results cudaMalloc’edglobal memory (very slow) CUDA Memories – Slide 28

  29. Communication through Memory • This is a race condition; the result isundefined • The order in which threads access the variable is undefined without explicit coordination • Two ways to enforce well-defined semantics CUDA Memories – Slide 29

  30. Communication through Memory (cont.) • Use barriers (e.g., __syncthreads) to ensure data is ready for access • The state of the entire data array is now well-defined for all threads in this block. CUDA Memories – Slide 30

  31. Communication through Memory (cont.) • Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable • After this kernel exits, the value of *resultwill be the sum of the inputs CUDA Memories – Slide 31

  32. Resource Contention • Atomic operations aren’t cheap; they imply serialized access to a variable. • How many threads will contend for exclusive access to result? CUDA Memories – Slide 32

  33. Hierarchical Atomics • Divide and Conquer • Per-thread atomicAdd to a __shared__partial sum • Per-block atomicAdd to the total sum S S0 S1 Si CUDA Memories – Slide 33

  34. Hierarchical Atomics (cont.) CUDA Memories – Slide 34

  35. Advice • Use barriers such as __syncthreadsto wait until __shared__data is ready • Prefer barriers to atomics when data access patterns are regular or predictable • Prefer atomics to barriers when data access patterns are sparse or unpredictable • Atomics to __shared__variables are much faster than atomics to global variables • Don’t synchronize or serialize unnecessarily CUDA Memories – Slide 35

  36. Example: Matrix Multiplication using Shared Memory • Generalize adjacent_difference example • AB = A * B • Each element ABij • = dot(row(A,i),col(B,j)) • Parallelization strategy • Thread ABij • 2D kernel B A AB CUDA Memories – Slide 36

  37. First Try: Matrix Multiply Kernel using Multiple Blocks CUDA Memories – Slide 37

  38. How will this perform? CUDA Memories – Slide 38

  39. Grid Block (0, 0) Block (1, 0) Shared Memory Shared Memory Registers Registers Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Host Global Memory Constant Memory How will this perform? (cont.) • All threads access global memory for their input matrix elements • The actual code runs at about 15 GFLOPS • Need to drastically cut down memory accesses to get closer to the peak 805 GFLOPS CUDA Memories – Slide 39

  40. Idea: Use __shared__ memory to reuse global data • Each input element is read by width threads • Load each element into __shared__memory and have several threads use the local version to reduce the memory bandwidth B A AB width CUDA Memories – Slide 40

  41. Tiled Multiply TILE_WIDTH • Partition kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B • Load a tile of both matrices into __shared__each phase B A AB CUDA Memories – Slide 41

  42. Tiled Multiply (cont.) TILE_WIDTH • Each phase • each block computes one square sub-matrix ABsub of size TILE_WIDTH • each phase, each thread computes a partial result, one element of ABsub B A AB CUDA Memories – Slide 42

  43. A Small Example B0,0 B1,0 B0,1 B1,1 B0,2 B1,2 B0,3 B1,3 A0,0 A1,0 A2,0 A3,0 AB0,0 AB1,0 AB2,0 AB3,0 A0,1 A1,1 A2,1 A3,1 AB0,1 AB1,1 AB2,1 AB3,1 AB0,2 AB1,2 AB2,2 AB3,2 AB0,3 AB1,3 AB2,3 AB3,3 CUDA Memories – Slide 43

  44. A Small Example (cont.) • Every A and B element is used exactly twice in generating a 2-by-2 tile of AB Access order CUDA Memories – Slide 44

  45. Breaking A and B into Tiles B0,0 B1,0 B0,1 B1,1 B0,2 B1,2 B0,3 B1,3 A0,0 A1,0 A2,0 A3,0 AB0,0 AB1,0 AB2,0 AB3,0 A0,1 A1,1 A2,1 A3,1 AB0,1 AB1,1 AB2,1 AB3,1 AB0,2 AB1,2 AB2,2 AB3,2 AB0,3 AB1,3 AB2,3 AB3,3 CUDA Memories – Slide 45

  46. Breaking A and B into Tiles (cont.) • Each phase of a thread block uses one tile from A and one from B time CUDA Memories – Slide 46

  47. Tiled Multiply (cont.) TILE_WIDTH • Each phase • each block computes one square sub-matrix ABsub of size TILE_WIDTH • each phase, each thread computes a partial result, one element of ABsub B A AB CUDA Memories – Slide 47

  48. Better Implementation • Set up the execution configuration CUDA Memories – Slide 48

  49. Better Implementation (cont.) CUDA Memories – Slide 49

  50. Better Implementation (cont.) CUDA Memories – Slide 50

More Related