CUDA Lecture 8: CUDA Memories. Prepared 8/9/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Hardware Implementation of CUDA Memories [Diagram: per-thread registers, per-block shared memory, and per-grid global and constant memories shared between the device and the host] • Each thread can: • Read/write per-thread registers • Read/write per-thread local memory • Read/write per-block shared memory • Read/write per-grid global memory • Read-only per-grid constant memory CUDA Memories – Slide 2
CUDA Variable Type Qualifiers • __device__ is optional when used with __local__, __shared__ or __constant__ CUDA Memories – Slide 3
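As a quick illustration (not from the original slides), the qualifiers map onto declarations like the following; the variable and kernel names are placeholders:

    __constant__ float ConstVar;      // per-grid constant memory (__device__ is optional here)
    __device__   float GlobalVar;     // per-grid global memory

    __global__ void KernelFunc(float *ptr)
    {
        __shared__ float SharedVar;   // per-block shared memory (__device__ is optional here)
        float RegVar = ConstVar;      // automatic scalar -> per-thread register
        float LocalArray[16];         // automatic array  -> per-thread local memory

        LocalArray[0] = RegVar + GlobalVar;
        if (threadIdx.x == 0) SharedVar = LocalArray[0];
        __syncthreads();
        ptr[blockIdx.x] = SharedVar;
    }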
CUDA Variable Type Qualifiers (cont.) • Automatic scalar variables without any qualifier reside in a register • The compiler will spill them to thread-local memory when it runs out of registers • Automatic array variables without any qualifier reside in thread-local memory CUDA Memories – Slide 4
CUDA Variable Type Performance • scalar variables reside in fast, on-chip registers • shared variables reside in fast, on-chip memories • thread-local arrays and global variables reside in uncached off-chip memory • constant variables reside in cached off-chip memory CUDA Memories – Slide 5
CUDA Variable Type Scale • 100,000s of per-thread variables, each R/W by 1 thread • 100s of shared variables, each R/W by 100s of threads • 1 global variable, R/W by 100,000s of threads • 1 constant variable, readable by 100,000s of threads CUDA Memories – Slide 6
Where to declare variables? • If the host can access it (global or constant memory), declare it outside of any function • Otherwise (registers, local memory, shared memory), declare it inside the kernel CUDA Memories – Slide 7
Example: Thread-local Variables CUDA Memories – Slide 8
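The example code did not survive extraction; a minimal sketch of a kernel that keeps its working value in a per-thread (register) variable might look like this (the kernel and variable names are illustrative, not the original):

    // Each thread keeps its own copy of coeff in a register;
    // no other thread can see or modify it.
    __global__ void scale(float *result, const float *input, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) {
            float coeff = 2.0f;                          // automatic scalar -> register
            result[i] = coeff * input[i];
        }
    }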
Example: Shared Variables CUDA Memories – Slide 9
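The kernel on this slide was also lost; a plausible reconstruction of the naive adjacent-difference kernel (computing result[i] = input[i] - input[i-1] straight from global memory) is:

    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0) {
            // each thread issues two loads from global memory
            int x_i           = input[i];
            int x_i_minus_one = input[i - 1];
            result[i] = x_i - x_i_minus_one;
        }
    }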
Example: Shared Variables (cont.) • Note the two loads from global memory per thread: input[i] and input[i-1] CUDA Memories – Slide 10
Example: Shared Variables (cont.) • Each input element is therefore fetched from global memory twice: once by thread i and again by thread i+1 CUDA Memories – Slide 11
Example: Shared Variables (cont.) CUDA Memories – Slide 12
Example: Shared Variables (cont.) CUDA Memories – Slide 13
Example: Shared Variables (cont.) CUDA Memories – Slide 14
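Slides 12–14 stepped through the shared-memory version of the kernel; the listing is missing, so the following is a reconstruction under the usual assumptions (BLOCK_SIZE matches the block dimension, and the array length is a multiple of the block size):

    #define BLOCK_SIZE 256                      // assumed to equal blockDim.x

    __global__ void adj_diff(int *result, int *input)
    {
        int tx = threadIdx.x;                   // shorthand for threadIdx.x
        __shared__ int s_data[BLOCK_SIZE];      // one element per thread

        // each thread loads one element from global into shared memory
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // barrier: every load must complete before any thread reads s_data
        __syncthreads();

        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx - 1];
        else if (i > 0)
            // the first thread of a block: its left neighbor lives in another
            // block's tile, so that one element still comes from global memory
            result[i] = s_data[tx] - input[i - 1];
    }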
Optimization Analysis • Experiment performed on a GT200 chip • Improvement likely better on an older architecture • Improvement likely worse on a newer architecture • Optimizations tend to come with a development cost CUDA Memories – Slide 15
Variable Type Restrictions • Pointers can only point to memory allocated or declared in global memory: • Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr) • Obtained as the address of a global variable: float* ptr = &GlobalVar; CUDA Memories – Slide 16
Variable Type Restrictions (cont.) • You can, however, declare pointers and point them at any memory space: CUDA Memories – Slide 17
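The accompanying snippet was dropped during extraction; a minimal sketch of the idea (the same plain int* can hold addresses in different memory spaces) might be:

    __global__ void foo(int *input)       // input points into global memory
    {
        int *gptr = input;                // pointer into global memory

        __shared__ int s[64];
        int *sptr = s;                    // pointer into shared memory

        int x = 0;
        int *lptr = &x;                   // pointer to a per-thread variable

        // all three are plain int*; the pointer type says nothing
        // about which memory space the address belongs to
        if (threadIdx.x == 0) {
            sptr[0] = *gptr;
            *lptr   = sptr[0];
            gptr[0] = *lptr;
        }
    }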
Variable Type Restrictions (cont.) • Pointers aren’t typed on memory space • Where does ptr point? • ptr is a __shared__ pointer variable, not a pointer to a __shared__ variable! CUDA Memories – Slide 18
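The declaration in question was presumably along these lines (a reconstruction, not the original listing):

    __global__ void foo(void)
    {
        // ptr itself lives in shared memory; the int it points to can be
        // anywhere -- the qualifier applies to the pointer, not the pointee
        __shared__ int *ptr;
    }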
Don’t confuse the compiler! CUDA Memories – Slide 19
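The slide's code is missing; one way to create the ambiguity described on the next slide is to pass pointers into different memory spaces through the same function parameter. A sketch (illustrative names; on older architectures the compiler had to guess the memory space of such a parameter):

    __device__ void bar(int *ptr, int value)
    {
        *ptr = value;     // which memory space does ptr point into?
    }

    __global__ void baz(int *global_out)
    {
        __shared__ int s_val;
        bar(&s_val, 1);           // called with a pointer into shared memory
        bar(global_out, 2);       // called with a pointer into global memory
    }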
Advice • Prefer dereferencing pointers in simple, regular access patterns • Avoid propagating pointers • Avoid pointers to pointers • The GPU would rather not pointer chase • Linked lists will not perform well • Pay attention to compiler warning messages, e.g. “Warning: Cannot tell what pointer points to, assuming global memory space” • If that guess is wrong, it is a crash waiting to happen CUDA Memories – Slide 20
A Common Programming Strategy • Global memory resides in device memory (DRAM) • Much slower access than shared memory • So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory: • Generalize from adjacent_difference example • Divide and conquer CUDA Memories – Slide 21
A Common Programming Strategy (cont.) • Partition data into subsets that fit into shared memory CUDA Memories – Slide 22
A Common Programming Strategy (cont.) • Handleeach data subset with one thread block as follows: CUDA Memories – Slide 23
A Common Programming Strategy (cont.) • Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism CUDA Memories – Slide 24
A Common Programming Strategy (cont.) • Perform the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element CUDA Memories – Slide 25
A Common Programming Strategy (cont.) • Copy the results from shared memory back to global memory CUDA Memories – Slide 26
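Putting the pieces together, a generic skeleton of this load → synchronize → compute → store pattern (a sketch; process() is a hypothetical stand-in for the real per-tile computation):

    #define TILE 256                               // assumed tile size = block size

    // hypothetical per-tile computation: here just an average of two neighbors
    __device__ float process(const float *tile, int tx)
    {
        int left = (tx > 0) ? tx - 1 : tx;
        return 0.5f * (tile[left] + tile[tx]);
    }

    __global__ void tiled_kernel(float *out, const float *in)
    {
        __shared__ float tile[TILE];
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

        // 1. cooperatively load this block's subset into shared memory
        tile[threadIdx.x] = in[i];
        __syncthreads();

        // 2. compute on the subset; each thread may re-read tile elements cheaply
        float value = process(tile, threadIdx.x);

        // 3. copy the results back to global memory
        out[i] = value;
    }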
A Common Programming Strategy (cont.) • Constant memory also resides in device memory (DRAM) • Much slower access than shared memory • But…cached! • Highly efficient access for read-only data CUDA Memories – Slide 27
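As a brief illustration (not from the original slides) of the read-only, cached case, a small __constant__ table filled from the host with cudaMemcpyToSymbol might look like:

    __constant__ float coeffs[4];            // cached, read-only from the device's point of view

    __global__ void apply(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= coeffs[i % 4];        // every thread reads the same small table
    }

    // host side, before the launch:
    //   float h_coeffs[4] = {0.1f, 0.2f, 0.3f, 0.4f};
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));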
A Common Programming Strategy (cont.) • Carefully partition data according to access patterns • Read-only → __constant__ memory (very fast if in cache) • R/W & shared within block → __shared__ memory (very fast) • R/W within each thread → registers (very fast) • Indexed R/W within each thread → local memory (slow) • R/W inputs/results → cudaMalloc’ed global memory (very slow) CUDA Memories – Slide 28
Communication through Memory • Unsynchronized access to a shared variable (as in the sketch below) is a race condition; the result is undefined • The order in which threads access the variable is undefined without explicit coordination • Two ways to enforce well-defined semantics CUDA Memories – Slide 29
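The kernel this slide refers to is missing; a reconstruction of the kind of unsynchronized write it describes:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        // every thread in the block writes a different value to the same
        // location with no coordination -- which write wins is undefined
        my_shared_variable = threadIdx.x;
    }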
Communication through Memory (cont.) • Use barriers (e.g., __syncthreads) to ensure data is ready for access • The state of the entire data array is now well-defined for all threads in this block. CUDA Memories – Slide 30
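A reconstruction of the barrier example (the original listing was lost; BLOCK_SIZE is assumed to equal the block dimension):

    #define BLOCK_SIZE 256

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();
        // the state of the entire data array is now
        // well-defined for all threads in this block
    }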
Communication through Memory (cont.) • Use atomic operations (e.g., atomicAdd) to ensure exclusive access to a variable • After this kernel exits, the value of *result will be the sum of the inputs CUDA Memories – Slide 31
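A reconstruction of the atomic example (it assumes *result has been initialized to zero before the launch):

    __global__ void sum(int *input, int *result)
    {
        // exclusive, serialized update of the shared accumulator
        atomicAdd(result, input[threadIdx.x]);
        // after the kernel exits, *result holds the sum of the inputs
    }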
Resource Contention • Atomic operations aren’t cheap; they imply serialized access to a variable. • How many threads will contend for exclusive access to result? CUDA Memories – Slide 32
Hierarchical Atomics • Divide and conquer • Per-thread atomicAdd to a __shared__ partial sum • Per-block atomicAdd to the total sum [Diagram: per-block partial sums S0, S1, …, Si combined into the total sum S] CUDA Memories – Slide 33
Hierarchical Atomics (cont.) CUDA Memories – Slide 34
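The two-level kernel on this slide is missing; a reconstruction of the hierarchical scheme:

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        // thread 0 of each block initializes the block's partial sum
        if (threadIdx.x == 0) partial_sum = 0;
        __syncthreads();

        // per-thread atomicAdd into the fast __shared__ partial sum
        atomicAdd(&partial_sum, input[blockIdx.x * blockDim.x + threadIdx.x]);
        __syncthreads();

        // one atomicAdd per block into the global total
        if (threadIdx.x == 0) atomicAdd(result, partial_sum);
    }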
Advice • Use barriers such as __syncthreads to wait until __shared__ data is ready • Prefer barriers to atomics when data access patterns are regular or predictable • Prefer atomics to barriers when data access patterns are sparse or unpredictable • Atomics to __shared__ variables are much faster than atomics to global variables • Don’t synchronize or serialize unnecessarily CUDA Memories – Slide 35
Example: Matrix Multiplication using Shared Memory • Generalize the adjacent_difference example • AB = A * B • Each element ABij = dot(row(A,i), col(B,j)) • Parallelization strategy: one thread per element ABij, launched as a 2D kernel [Diagram: matrices A, B, and AB] CUDA Memories – Slide 36
First Try: Matrix Multiply Kernel using Multiple Blocks CUDA Memories – Slide 37
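The kernel listing was lost in extraction; a reconstruction of a straightforward multi-block kernel in the deck's A/B/AB naming, assuming square Width × Width matrices and a launch that covers every element:

    __global__ void MatrixMulKernel(float *A, float *B, float *AB, int Width)
    {
        // each thread computes one element of AB
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;

        float ABvalue = 0;
        // every operand of the dot product comes straight from global memory
        for (int k = 0; k < Width; ++k)
            ABvalue += A[Row * Width + k] * B[k * Width + Col];

        AB[Row * Width + Col] = ABvalue;
    }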
How will this perform? CUDA Memories – Slide 38
How will this perform? (cont.) [Diagram: CUDA memory hierarchy, as on Slide 2] • All threads access global memory for their input matrix elements • The actual code runs at about 15 GFLOPS • Need to drastically cut down memory accesses to get closer to the peak of 805 GFLOPS CUDA Memories – Slide 39
Idea: Use __shared__ memory to reuse global data • Each input element is read by width threads • Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth [Diagram: matrices A, B, and AB] CUDA Memories – Slide 40
Tiled Multiply • Partition the kernel loop into phases so that the data accesses in each phase are focused on one subset (tile) of A and B • Load a tile of both matrices into __shared__ memory each phase [Diagram: TILE_WIDTH × TILE_WIDTH tiles of A, B, and AB] CUDA Memories – Slide 41
Tiled Multiply (cont.) • In each phase: • each block computes one square sub-matrix ABsub of size TILE_WIDTH • each thread computes a partial result for one element of ABsub [Diagram: TILE_WIDTH × TILE_WIDTH tiles of A, B, and AB] CUDA Memories – Slide 42
A Small Example [Diagram: a small 4 × 4 example showing the elements of A (A0,0 … A3,1), B (B0,0 … B1,3), and the result AB (AB0,0 … AB3,3)] CUDA Memories – Slide 43
A Small Example (cont.) • Every A and B element is used exactly twice in generating a 2-by-2 tile of AB [Table: access order of the A and B elements for the four elements of the tile] CUDA Memories – Slide 44
Breaking A and B into Tiles [Diagram: the same small example with A and B partitioned into 2 × 2 tiles] CUDA Memories – Slide 45
Breaking A and B into Tiles (cont.) • Each phase of a thread block uses one tile from A and one tile from B [Diagram: the tiles used by a thread block in each phase, laid out over time] CUDA Memories – Slide 46
Tiled Multiply (cont.) • In each phase: • each block computes one square sub-matrix ABsub of size TILE_WIDTH • each thread computes a partial result for one element of ABsub [Diagram: TILE_WIDTH × TILE_WIDTH tiles of A, B, and AB] CUDA Memories – Slide 47
Better Implementation • Set up the execution configuration CUDA Memories – Slide 48
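The configuration code is missing; a reconstruction, assuming square Width × Width matrices with Width a multiple of TILE_WIDTH (TILE_WIDTH, Width, and the device pointers d_A, d_B, d_AB are assumed to be defined elsewhere; the kernel name matches the tiled kernel sketched below):

    // one thread per AB element, organized into TILE_WIDTH x TILE_WIDTH blocks
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulTiled<<<dimGrid, dimBlock>>>(d_A, d_B, d_AB, Width);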
Better Implementation (cont.) CUDA Memories – Slide 49
Better Implementation (cont.) CUDA Memories – Slide 50
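The tiled kernel spread over these last two slides was lost in extraction; a reconstruction of the standard tiled matrix-multiply kernel, again in the deck's A/B/AB naming and assuming Width is a multiple of TILE_WIDTH:

    #define TILE_WIDTH 16

    __global__ void MatrixMulTiled(float *A, float *B, float *AB, int Width)
    {
        // shared-memory tiles of A and B for the current phase
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

        int tx = threadIdx.x, ty = threadIdx.y;
        int Row = blockIdx.y * TILE_WIDTH + ty;
        int Col = blockIdx.x * TILE_WIDTH + tx;

        float ABvalue = 0;

        // loop over the phases: one tile of A and one tile of B per phase
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // collaboratively load the two tiles into shared memory
            As[ty][tx] = A[Row * Width + (m * TILE_WIDTH + tx)];
            Bs[ty][tx] = B[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();

            // accumulate this phase's partial dot product from shared memory
            for (int k = 0; k < TILE_WIDTH; ++k)
                ABvalue += As[ty][k] * Bs[k][tx];
            __syncthreads();
        }

        AB[Row * Width + Col] = ABvalue;
    }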