CS179: GPU Programming Lecture 5: Memory
Today • GPU Memory Overview • CUDA Memory Syntax • Tips and tricks for memory handling
Memory Overview • Very slow access: • Between host and device • Slow access: • Global memory, local memory (both live off-chip) • Fast access: • Shared memory, constant memory, texture memory • Very fast access: • Register memory
Global Memory • Read/write • Shared between blocks and grids • Persists across multiple kernel executions • Very slow to access • No caching (on the early hardware assumed here; newer GPUs cache global memory in L1/L2)
Constant Memory • Read-only in device • Cached in multiprocessor • Fairly quick • Cache can broadcast to all active threads
Texture Memory • Read-only in device • 2D cached -- quick access • Filtering methods available
Shared Memory • Read/write per block • Memory is shared within block • Generally quick • Has bad worst-cases
Local Memory • Read/write per thread • Slow (stored off-chip in device memory, not on the multiprocessor) • Each thread can only see its own local memory • Indexable (can do arrays)
Register Memory • Read/write per thread • Extremely fast • Each thread can only see its own register memory • Not indexable (can't do arrays)
Syntax: Register Memory • Default memory type • Declare as normal -- no special syntax • int var = 1; • Only accessible by current thread
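As a minimal sketch (the kernel name and parameters are hypothetical, not from the lecture), an ordinary scalar declared inside a kernel lands in a register:

    __global__ void scale(float *data, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float tmp = factor * factor;  // plain local scalar -- lives in a register
        data[i] *= tmp;               // assumes the launch covers exactly the array
    }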
Syntax: Local Memory • "Global" scratch space for a thread -- persists across device-function calls within that thread • No declaration keyword in CUDA C (there is no __local__ qualifier) -- the compiler places data in local memory automatically • Typical cases: per-thread arrays indexed at runtime, large structures, register spills
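A hedged sketch of how local memory arises (the kernel and its names are hypothetical): a per-thread array indexed with a value known only at runtime generally cannot stay in registers, so the compiler spills it to local memory:

    __global__ void gather(const int *idx, float *out, int n) {
        float scratch[32];                 // per-thread array
        for (int k = 0; k < 32; ++k)
            scratch[k] = (float)k;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scratch[idx[i] & 31]; // runtime index -- forces local memory
    }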
Syntax: Shared Memory • Shared across threads in a block, not across blocks • Declared as variables or arrays, not allocated through pointers • Declare with the __shared__ keyword (optionally __device__ __shared__) • __shared__ int var[64]; • Arrays without a compile-time size are declared extern __shared__ int var[]; and sized at launch (see the sketch below)
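A minimal sketch of both declaration styles (kernel names are hypothetical); note how the dynamic version gets its size from the third launch-configuration argument:

    __global__ void static_shared(float *out) {
        __shared__ float tile[256];        // size fixed at compile time
        tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = tile[255 - threadIdx.x];
    }

    __global__ void dynamic_shared(float *out) {
        extern __shared__ float buf[];     // size supplied at launch
        buf[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
    }

    // Launch: third <<<>>> argument = bytes of dynamic shared memory per block
    // dynamic_shared<<<blocks, 256, 256 * sizeof(float)>>>(d_out);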
Syntax: Global Memory • Allocated with cudaMalloc (see the sketch below) • Can pass pointers between host and kernel • Transfer is slow! • Statically declared with the __device__ keyword • __device__ int var = 1;
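A minimal end-to-end sketch, assuming a trivial kernel: allocate global memory, copy across the (slow) host-device link, run, and copy back:

    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;          // reads and writes global memory
    }

    int main() {
        const int n = 1024;
        int h[n] = {0};
        int *d;
        cudaMalloc(&d, n * sizeof(int));  // allocates global memory
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);  // slow!
        increment<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);  // slow!
        cudaFree(d);
        return 0;
    }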
Syntax: Constant Memory • Declare with the __device__ __constant__ keyword • __device__ __constant__ int var = 1; • Can also just use __constant__ • Set from the host using cudaMemcpyToSymbol (or cudaMemcpy) • cudaMemcpyToSymbol(var, src, count);
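A sketch under the same conventions (names are illustrative): fill a __constant__ symbol from the host, then read it in a kernel, where every thread in a warp hitting the same element triggers the cache broadcast mentioned earlier:

    __constant__ float coeffs[8];             // lives in constant memory

    __global__ void apply(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = coeffs[0] * i;    // one address per warp -- broadcast
    }

    // Host side, before launching:
    // float h_coeffs[8] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));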
Syntax: Texture Memory • To be discussed later…
Memory Issues • Each multiprocessor has a set amount of memory • Limits the number of blocks that can be resident at once • (# of blocks) x (memory used per block) <= total memory • e.g., with 16 KB of shared memory per multiprocessor, blocks using 4 KB each cap residency at 4 blocks • Either get lots of blocks using little memory, or fewer blocks using lots of memory
Memory Issues • Register memory is limited! • Divided among resident threads, just as shared memory is divided among blocks • Can have many threads using few registers, or few threads using many registers • The former is better: more parallelism
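The per-device limits can be queried at runtime; a sketch using the CUDA runtime API (the fields shown are standard members of cudaDeviceProp, though available fields vary by CUDA version):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        printf("shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("registers per block:     %d\n", p.regsPerBlock);
        return 0;
    }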
Memory Issues • Global accesses: slow! • Can be sped up when accesses are contiguous • Memory coalescing: arranging accesses so each warp reads one contiguous run of memory • Coalesced accesses are: • Contiguous accesses • In-order accesses • Aligned accesses
Memory Coalescing: Aligned Accesses • Threads read 4, 8, or 16 bytes at a time from global memory • Accesses must be aligned in memory! • Good: reads starting at aligned addresses (0x00, 0x04, ..., 0x14) • Bad: a read starting at an unaligned address (e.g., 0x07) • Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?
Memory Coalescing: Aligned Accesses • Also bad: an access pattern whose beginning is unaligned, even if the rest is contiguous
Memory Coalescing: Aligned Accesses • Built-in types force alignment • float3 (12 B) is smaller than float4 (16 B), but float3 array elements are not 16-byte aligned • To align a struct, use __align__(x) // x = 4, 8, 16 • cudaMalloc aligns the start of each allocation automatically • cudaMallocPitch aligns the start of each row for 2D arrays
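A sketch of forcing alignment (the struct name is an assumption):

    // 12 bytes of data padded to 16, so array elements never straddle
    // a 16-byte boundary and can be fetched in one transaction
    struct __align__(16) Point3 {
        float x, y, z;
    };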
Memory Coalescing: Contiguous Accesses • Contiguous = the warp's accesses cover one unbroken run of memory • Example of non-contiguous access (figure): threads 3 and 4 swap their accesses, breaking the run
Memory Coalescing: Contiguous Accesses • Which is better? • index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y); • index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x); • The first: block offsets are multiples of blockDim.x, so consecutive threads in a warp hit consecutive addresses with no gaps or overlap (see the sketch below)
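A sketch of the first (coalesced) indexing inside a kernel (the kernel itself is hypothetical):

    __global__ void copy2d(const float *in, float *out) {
        int index = threadIdx.x
                  + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
        out[index] = in[index];   // each warp touches one contiguous, aligned run
    }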
Memory Coalescing: Contiguous Accesses • Case 1: Contiguous accesses • (figure: consecutive threads map to consecutive banks -- bank[0] serves thread[0][0] and thread[1][0], bank[1] serves thread[0][1] and thread[1][1], and so on through bank[2], bank[3])
Memory Coalescing: In-order Accesses • In-order accesses • Do not skip addresses • Access addresses in order in memory • Bad examples (figures): on the left, address 140 is skipped; on the right, many addresses are skipped
Memory Coalescing • Good example (figure): contiguous, in-order, aligned accesses
Memory Coalescing • Not as much of an issue on newer hardware • Many restrictions have been relaxed -- e.g., accesses no longer need to be perfectly sequential • However, coalescing and alignment are still good practice!
Memory Issues • Shared memory: • Can also be limiting • Broken up into banks • Optimal when an entire warp reads shared memory together • Banks: • Each bank services only one thread at a time • Bank conflict: when two threads try to access the same bank • Causes slowdowns in the program!
Bank Conflicts • Bad: • Many threads trying to access the same bank
Bank Conflicts • Good: • Few to no bank conflicts
Bank Conflicts • Each bank services one 32-bit word at a time; words map to banks by address mod 64 bytes (16 banks x 4 bytes) • Bank 0 services 0x00, 0x40, 0x80, etc.; bank 1 services 0x04, 0x44, 0x84, etc. • Want to avoid multiple threads accessing the same bank • Keep data spread out • Split data larger than 4 bytes into multiple accesses • Be careful with strides that share a factor with the bank count (e.g., even strides) -- see the sketch below
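A sketch of the standard padding trick (not from the lecture): in a 32x32 shared tile, reading a column puts every thread of a warp on the same bank; padding each row to 33 words spreads a column across all banks. Assumes hardware that allows a 32x32 (1024-thread) block:

    __global__ void transpose32(const float *in, float *out) {
        __shared__ float tile[32][33];     // 33, not 32: the pad kills conflicts
        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * 32 + x];       // coalesced read from global
        __syncthreads();
        out[y * 32 + x] = tile[x][y];      // column read -- conflict-free via pad
    }

    // Launch for a single 32x32 matrix:
    // transpose32<<<1, dim3(32, 32)>>>(d_in, d_out);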
Broadcasting • Fast distribution of data to threads • Happens when an entire warp reads the same address • The word is broadcast to all threads in a single read (see the sketch below)
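A minimal sketch (hypothetical kernel): every thread reads the same shared-memory word, so one bank satisfies the whole warp in a single broadcast:

    __global__ void scale_by_first(float *data) {
        __shared__ float s;
        if (threadIdx.x == 0) s = data[0]; // one thread loads the value
        __syncthreads();
        data[threadIdx.x] *= s;            // same address for all -- broadcast
    }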
Summary • Best memory management: • Balances memory optimization with parallelism • Break problem up into coalesced chunks • Process data in shared memory, then copy back to global • Remember to avoid bank conflicts!
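Tying the advice together, a hedged sketch (the in-block reversal is just placeholder work): coalesced load into shared memory, per-block processing, coalesced store back to global:

    __global__ void process(const float *in, float *out) {
        extern __shared__ float buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = in[i];                    // coalesced global read
        __syncthreads();
        out[i] = buf[blockDim.x - 1 - threadIdx.x];  // conflict-free, coalesced write
    }

    // process<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);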
Next Time • Texture memory • CUDA Applications in graphics