CUDA Lecture 8: CUDA Memories. Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Hardware Implementation of CUDA Memories [Diagram: per-thread registers, per-block shared memory, and per-grid global and constant memories shared between the device and the host] • Each thread can: • Read/write per-thread registers • Read/write per-thread local memory • Read/write per-block shared memory • Read/write per-grid global memory • Read-only per-grid constant memory CUDA Memories – Slide 2
CUDA Variable Type Qualifiers • __device__ is optional when used with __local__, __shared__ or __constant__ CUDA Memories – Slide 3
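As a quick illustration (not from the original slides), the qualifiers map onto declarations like the following; the variable and kernel names are placeholders:

    __constant__ float ConstVar;      // per-grid constant memory (__device__ is optional here)
    __device__   float GlobalVar;     // per-grid global memory

    __global__ void KernelFunc(float *ptr)
    {
        __shared__ float SharedVar;   // per-block shared memory (__device__ is optional here)
        float RegVar = ConstVar;      // automatic scalar -> per-thread register
        float LocalArray[16];         // automatic array  -> per-thread local memory

        LocalArray[0] = RegVar + GlobalVar;
        if (threadIdx.x == 0) SharedVar = LocalArray[0];
        __syncthreads();
        ptr[blockIdx.x] = SharedVar;
    }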
CUDA Variable Type Qualifiers (cont.) • Automatic scalar variables without any qualifier reside in a register • The compiler will spill them to thread-local memory when it runs out of registers • Automatic array variables without any qualifier reside in thread-local memory CUDA Memories – Slide 4
CUDA Variable Type Performance • scalar variables reside in fast, on-chip registers • shared variables reside in fast, on-chip memories • thread-local arrays and global variables reside in uncached off-chip memory • constant variables reside in cached off-chip memory CUDA Memories – Slide 5
CUDA Variable Type Scale • 100,000s of per-thread variables, each R/W by 1 thread • 100s of shared variables, each R/W by 100s of threads • 1 global variable, R/W by 100,000s of threads • 1 constant variable, readable by 100,000s of threads CUDA Memories – Slide 6
Where to declare variables? • If the host can access it (global or constant memory), declare it outside of any function • Otherwise (registers, local memory, shared memory), declare it inside the kernel CUDA Memories – Slide 7
Example: Thread-local Variables CUDA Memories – Slide 8
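The example code did not survive extraction; a minimal sketch of a kernel that keeps its working value in a per-thread (register) variable might look like this (the kernel and variable names are illustrative, not the original):

    // Each thread keeps its own copy of coeff in a register;
    // no other thread can see or modify it.
    __global__ void scale(float *result, const float *input, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) {
            float coeff = 2.0f;                          // automatic scalar -> register
            result[i] = coeff * input[i];
        }
    }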
Example: Shared Variables CUDA Memories – Slide 9
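The kernel on this slide was also lost; a plausible reconstruction of the naive adjacent-difference kernel (computing result[i] = input[i] - input[i-1] straight from global memory) is:

    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0) {
            // each thread issues two loads from global memory
            int x_i           = input[i];
            int x_i_minus_one = input[i - 1];
            result[i] = x_i - x_i_minus_one;
        }
    }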
Example: Shared Variables (cont.) • Note the two loads from global memory per thread: input[i] and input[i-1] CUDA Memories – Slide 10
Example: Shared Variables (cont.) • Each input element is therefore fetched from global memory twice: once by thread i and again by thread i+1 CUDA Memories – Slide 11
Example: Shared Variables (cont.) CUDA Memories – Slide 12
Example: Shared Variables (cont.) CUDA Memories – Slide 13
Example: Shared Variables (cont.) CUDA Memories – Slide 14
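Slides 12–14 stepped through the shared-memory version of the kernel; the listing is missing, so the following is a reconstruction under the usual assumptions (BLOCK_SIZE matches the block dimension, and the array length is a multiple of the block size):

    #define BLOCK_SIZE 256                      // assumed to equal blockDim.x

    __global__ void adj_diff(int *result, int *input)
    {
        int tx = threadIdx.x;                   // shorthand for threadIdx.x
        __shared__ int s_data[BLOCK_SIZE];      // one element per thread

        // each thread loads one element from global into shared memory
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // barrier: every load must complete before any thread reads s_data
        __syncthreads();

        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx - 1];
        else if (i > 0)
            // the first thread of a block: its left neighbor lives in another
            // block's tile, so that one element still comes from global memory
            result[i] = s_data[tx] - input[i - 1];
    }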
Optimization Analysis • Experiment performed on a GT200 chip • Improvement likely better on an older architecture • Improvement likely worse on a newer architecture • Optimizations tend to come with a development cost CUDA Memories – Slide 15
Variable Type Restrictions • Pointers can only point to memory allocated or declared in global memory: • Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr) • Obtained as the address of a global variable: float* ptr = &GlobalVar; CUDA Memories – Slide 16
Variable Type Restrictions (cont.) • You can, however, declare pointers and point them at any memory space: CUDA Memories – Slide 17
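The accompanying snippet was dropped during extraction; a minimal sketch of the idea (the same plain int* can hold addresses in different memory spaces) might be:

    __global__ void foo(int *input)       // input points into global memory
    {
        int *gptr = input;                // pointer into global memory

        __shared__ int s[64];
        int *sptr = s;                    // pointer into shared memory

        int x = 0;
        int *lptr = &x;                   // pointer to a per-thread variable

        // all three are plain int*; the pointer type says nothing
        // about which memory space the address belongs to
        if (threadIdx.x == 0) {
            sptr[0] = *gptr;
            *lptr   = sptr[0];
            gptr[0] = *lptr;
        }
    }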
Variable Type Restrictions (cont.) • Pointers aren’t typed on memory space • Where does ptr point? • ptr is a __shared__ pointer variable, not a pointer to a __shared__ variable! CUDA Memories – Slide 18
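The declaration in question was presumably along these lines (a reconstruction, not the original listing):

    __global__ void foo(void)
    {
        // ptr itself lives in shared memory; the int it points to can be
        // anywhere -- the qualifier applies to the pointer, not the pointee
        __shared__ int *ptr;
    }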
Don’t confuse the compiler! CUDA Memories – Slide 19
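The slide's code is missing; one way to create the ambiguity described on the next slide is to pass pointers into different memory spaces through the same function parameter. A sketch (illustrative names; on older architectures the compiler had to guess the memory space of such a parameter):

    __device__ void bar(int *ptr, int value)
    {
        *ptr = value;     // which memory space does ptr point into?
    }

    __global__ void baz(int *global_out)
    {
        __shared__ int s_val;
        bar(&s_val, 1);           // called with a pointer into shared memory
        bar(global_out, 2);       // called with a pointer into global memory
    }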
Advice • Prefer dereferencing pointers in simple, regular access patterns • Avoid propagating pointers • Avoid pointers to pointers • The GPU would rather not pointer chase • Linked lists will not perform well • Pay attention to compiler warning messages, e.g. “Warning: Cannot tell what pointer points to, assuming global memory space” • If that guess is wrong, it is a crash waiting to happen CUDA Memories – Slide 20
A Common Programming Strategy • Global memory resides in device memory (DRAM) • Much slower access than shared memory • So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory: • Generalize from adjacent_difference example • Divide and conquer CUDA Memories – Slide 21
A Common Programming Strategy (cont.) • Partition data into subsets that fit into shared memory CUDA Memories – Slide 22
A Common Programming Strategy (cont.) • Handleeach data subset with one thread block as follows: CUDA Memories – Slide 23
A Common Programming Strategy (cont.) • Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism CUDA Memories – Slide 24
A Common Programming Strategy (cont.) • Perform the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element CUDA Memories – Slide 25
A Common Programming Strategy (cont.) • Copy the results from shared memory back to global memory CUDA Memories – Slide 26
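Putting the pieces together, a generic skeleton of this load → synchronize → compute → store pattern (a sketch; process() is a hypothetical stand-in for the real per-tile computation):

    #define TILE 256                               // assumed tile size = block size

    // hypothetical per-tile computation: here just an average of two neighbors
    __device__ float process(const float *tile, int tx)
    {
        int left = (tx > 0) ? tx - 1 : tx;
        return 0.5f * (tile[left] + tile[tx]);
    }

    __global__ void tiled_kernel(float *out, const float *in)
    {
        __shared__ float tile[TILE];
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        // 1. cooperatively load this block's subset into shared memory
        tile[threadIdx.x] = in[i];
        __syncthreads();

        // 2. compute on the subset; each thread may re-read tile elements cheaply
        float value = process(tile, threadIdx.x);

        // 3. copy the results back to global memory
        out[i] = value;
    }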
A Common Programming Strategy (cont.) • Constant memory also resides in device memory (DRAM) • Much slower access than shared memory • But…cached! • Highly efficient access for read-only data CUDA Memories – Slide 27
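As a brief illustration (not from the original slides) of the read-only, cached case, a small __constant__ table filled from the host with cudaMemcpyToSymbol might look like:

    __constant__ float coeffs[4];            // cached, read-only from the device's point of view

    __global__ void apply(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[i % 4];        // every thread reads the same small table
    }

    // host side, before the launch:
    //   float h_coeffs[4] = {0.1f, 0.2f, 0.3f, 0.4f};
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));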
A Common Programming Strategy (cont.) • Carefully partition data according to access patterns • Read-only → __constant__ memory (very fast if in cache) • R/W & shared within block → __shared__ memory (very fast) • R/W within each thread → registers (very fast) • Indexed R/W within each thread → local memory (slow) • R/W inputs/results → cudaMalloc’ed global memory (very slow) CUDA Memories – Slide 28
Communication through Memory • Unsynchronized access to a shared variable (as in the sketch below) is a race condition; the result is undefined • The order in which threads access the variable is undefined without explicit coordination • Two ways to enforce well-defined semantics CUDA Memories – Slide 29
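The kernel this slide refers to is missing; a reconstruction of the kind of unsynchronized write it describes:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        // every thread in the block writes a different value to the same
        // location with no coordination -- which write wins is undefined
        my_shared_variable = threadIdx.x;
    }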
Communication through Memory (cont.) • Use barriers (e.g., __syncthreads) to ensure data is ready for access • The state of the entire data array is now well-defined for all threads in this block. CUDA Memories – Slide 30
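A reconstruction of the barrier example (the original listing was lost; BLOCK_SIZE is assumed to equal the block dimension):

    #define BLOCK_SIZE 256

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();
        // the state of the entire data array is now
        // well-defined for all threads in this block
    }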
Communication through Memory (cont.) • Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable • After this kernel exits, the value of *result will be the sum of the inputs CUDA Memories – Slide 31
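A reconstruction of the atomic example (it assumes *result has been initialized to zero before the launch):

    __global__ void sum(int *input, int *result)
    {
        // exclusive, serialized update of the shared accumulator
        atomicAdd(result, input[threadIdx.x]);
        // after the kernel exits, *result holds the sum of the inputs
    }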
Resource Contention • Atomic operations aren’t cheap; they imply serialized access to a variable. • How many threads will contend for exclusive access to result? CUDA Memories – Slide 32
Hierarchical Atomics • Divide and conquer • Per-thread atomicAdd to a __shared__ partial sum • Per-block atomicAdd to the total sum [Diagram: per-block partial sums S0, S1, …, Si combined into the total sum S] CUDA Memories – Slide 33
Hierarchical Atomics (cont.) CUDA Memories – Slide 34
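The two-level kernel on this slide is missing; a reconstruction of the hierarchical scheme:

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        // thread 0 of each block initializes the block's partial sum
        if (threadIdx.x == 0) partial_sum = 0;
        __syncthreads();

        // per-thread atomicAdd into the fast __shared__ partial sum
        atomicAdd(&partial_sum, input[blockIdx.x * blockDim.x + threadIdx.x]);
        __syncthreads();

        // one atomicAdd per block into the global total
        if (threadIdx.x == 0) atomicAdd(result, partial_sum);
    }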
Advice • Use barriers such as __syncthreads to wait until __shared__ data is ready • Prefer barriers to atomics when data access patterns are regular or predictable • Prefer atomics to barriers when data access patterns are sparse or unpredictable • Atomics to __shared__ variables are much faster than atomics to global variables • Don’t synchronize or serialize unnecessarily CUDA Memories – Slide 35
Example: Matrix Multiplication using Shared Memory • Generalize the adjacent_difference example • AB = A * B • Each element ABij = dot(row(A,i), col(B,j)) • Parallelization strategy: one thread per element ABij, launched as a 2D kernel [Diagram: matrices A, B, and AB] CUDA Memories – Slide 36
First Try: Matrix Multiply Kernel using Multiple Blocks CUDA Memories – Slide 37
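The kernel listing was lost in extraction; a reconstruction of a straightforward multi-block kernel in the deck's A/B/AB naming, assuming square Width × Width matrices and a launch that covers every element:

    __global__ void MatrixMulKernel(float *A, float *B, float *AB, int Width)
    {
        // each thread computes one element of AB
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;

        float ABvalue = 0;
        // every operand of the dot product comes straight from global memory
        for (int k = 0; k < Width; ++k)
            ABvalue += A[Row * Width + k] * B[k * Width + Col];

        AB[Row * Width + Col] = ABvalue;
    }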
How will this perform? CUDA Memories – Slide 38
How will this perform? (cont.) [Diagram: CUDA memory hierarchy, as on Slide 2] • All threads access global memory for their input matrix elements • The actual code runs at about 15 GFLOPS • Need to drastically cut down memory accesses to get closer to the peak of 805 GFLOPS CUDA Memories – Slide 39
Idea: Use __shared__ memory to reuse global data • Each input element is read by width threads • Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth [Diagram: matrices A, B, and AB] CUDA Memories – Slide 40
Tiled Multiply • Partition the kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B • Load a tile of both matrices into __shared__ memory each phase [Diagram: TILE_WIDTH × TILE_WIDTH tiles of A, B, and AB] CUDA Memories – Slide 41
Tiled Multiply (cont.) • In each phase: • each block computes one square sub-matrix ABsub of size TILE_WIDTH • each thread computes a partial result for one element of ABsub [Diagram: TILE_WIDTH × TILE_WIDTH tiles of A, B, and AB] CUDA Memories – Slide 42
A Small Example [Diagram: a small 4 × 4 example showing the elements of A (A0,0 … A3,1), B (B0,0 … B1,3), and the result AB (AB0,0 … AB3,3)] CUDA Memories – Slide 43
A Small Example (cont.) • Every A and B element is used exactly twice in generating a 2-by-2 tile of AB [Table: access order of the A and B elements for the four elements of the tile] CUDA Memories – Slide 44
Breaking A and B into Tiles [Diagram: the same small example with A and B partitioned into 2 × 2 tiles] CUDA Memories – Slide 45
Breaking A and B into Tiles (cont.) • Each phase of a thread block uses one tile from A and one tile from B [Diagram: the tiles used by a thread block in each phase, laid out over time] CUDA Memories – Slide 46
Tiled Multiply (cont.) • In each phase: • each block computes one square sub-matrix ABsub of size TILE_WIDTH • each thread computes a partial result for one element of ABsub [Diagram: TILE_WIDTH × TILE_WIDTH tiles of A, B, and AB] CUDA Memories – Slide 47
Better Implementation • Set up the execution configuration CUDA Memories – Slide 48
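The configuration code is missing; a reconstruction, assuming square Width × Width matrices with Width a multiple of TILE_WIDTH (TILE_WIDTH, Width, and the device pointers d_A, d_B, d_AB are assumed to be defined elsewhere; the kernel name matches the tiled kernel sketched below):

    // one thread per AB element, organized into TILE_WIDTH x TILE_WIDTH blocks
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulTiled<<<dimGrid, dimBlock>>>(d_A, d_B, d_AB, Width);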
Better Implementation (cont.) CUDA Memories – Slide 49
Better Implementation (cont.) CUDA Memories – Slide 50
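The tiled kernel spread over these last two slides was lost in extraction; a reconstruction of the standard tiled matrix-multiply kernel, again in the deck's A/B/AB naming and assuming Width is a multiple of TILE_WIDTH:

    #define TILE_WIDTH 16

    __global__ void MatrixMulTiled(float *A, float *B, float *AB, int Width)
    {
        // shared-memory tiles of A and B for the current phase
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row = blockIdx.y * TILE_WIDTH + ty;
        int Col = blockIdx.x * TILE_WIDTH + tx;

        float ABvalue = 0;

        // loop over the phases: one tile of A and one tile of B per phase
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // collaboratively load the two tiles into shared memory
            As[ty][tx] = A[Row * Width + (m * TILE_WIDTH + tx)];
            Bs[ty][tx] = B[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();

            // accumulate this phase's partial dot product from shared memory
            for (int k = 0; k < TILE_WIDTH; ++k)
                ABvalue += As[ty][k] * Bs[k][tx];
            __syncthreads();
        }

        AB[Row * Width + Col] = ABvalue;
    }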