CS179: GPU Programming Lecture 5: Memory
Today • GPU Memory Overview • CUDA Memory Syntax • Tips and tricks for memory handling
Memory Overview • Very slow access: • Between host and device • Slow access: • Global memory, local memory (both live off-chip) • Fast access: • Shared memory, constant memory, texture memory • Very fast access: • Register memory
Global Memory • Read/write • Shared between blocks and grids • Persists across multiple kernel executions • Very slow to access • No caching (on the early hardware assumed here; newer GPUs cache global memory in L1/L2)
Constant Memory • Read-only in device • Cached in multiprocessor • Fairly quick • Cache can broadcast to all active threads
Texture Memory • Read-only in device • 2D cached -- quick access • Filtering methods available
Shared Memory • Read/write per block • Memory is shared within block • Generally quick • Has bad worst-cases
Local Memory • Read/write per thread • Slow (stored off-chip in device memory, not on the multiprocessor) • Each thread can only see its own local memory • Indexable (can do arrays)
Register Memory • Read/write per thread • Extremely fast • Each thread can only see its own register memory • Not indexable (can't do arrays)
Syntax: Register Memory • Default memory type • Declare as normal -- no special syntax • int var = 1; • Only accessible by current thread
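As a minimal sketch (the kernel name and parameters are hypothetical, not from the lecture), an ordinary scalar declared inside a kernel lands in a register:

    __global__ void scale(float *data, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float tmp = factor * factor;  // plain local scalar -- lives in a register
        data[i] *= tmp;               // assumes the launch covers exactly the array
    }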
Syntax: Local Memory • "Global" scratch space for a thread -- persists across device-function calls within that thread • No declaration keyword in CUDA C (there is no __local__ qualifier) -- the compiler places data in local memory automatically • Typical cases: per-thread arrays indexed at runtime, large structures, register spills
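A hedged sketch of how local memory arises (the kernel and its names are hypothetical): a per-thread array indexed with a value known only at runtime generally cannot stay in registers, so the compiler spills it to local memory:

    __global__ void gather(const int *idx, float *out, int n) {
        float scratch[32];                 // per-thread array
        for (int k = 0; k < 32; ++k)
            scratch[k] = (float)k;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scratch[idx[i] & 31]; // runtime index -- forces local memory
    }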
Syntax: Shared Memory • Shared across threads in a block, not across blocks • Declared as variables or arrays, not allocated through pointers • Declare with the __shared__ keyword (optionally __device__ __shared__) • __shared__ int var[64]; • Arrays without a compile-time size are declared extern __shared__ int var[]; and sized at launch (see the sketch below)
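A minimal sketch of both declaration styles (kernel names are hypothetical); note how the dynamic version gets its size from the third launch-configuration argument:

    __global__ void static_shared(float *out) {
        __shared__ float tile[256];        // size fixed at compile time
        tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = tile[255 - threadIdx.x];
    }

    __global__ void dynamic_shared(float *out) {
        extern __shared__ float buf[];     // size supplied at launch
        buf[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
    }

    // Launch: third <<<>>> argument = bytes of dynamic shared memory per block
    // dynamic_shared<<<blocks, 256, 256 * sizeof(float)>>>(d_out);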
Syntax: Global Memory • Allocated with cudaMalloc (see the sketch below) • Can pass pointers between host and kernel • Transfer is slow! • Statically declared with the __device__ keyword • __device__ int var = 1;
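A minimal end-to-end sketch, assuming a trivial kernel: allocate global memory, copy across the (slow) host-device link, run, and copy back:

    #include <cuda_runtime.h>

    __global__ void increment(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;          // reads and writes global memory
    }

    int main() {
        const int n = 1024;
        int h[n] = {0};
        int *d;
        cudaMalloc(&d, n * sizeof(int));  // allocates global memory
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);  // slow!
        increment<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);  // slow!
        cudaFree(d);
        return 0;
    }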
Syntax: Constant Memory • Declare with the __device__ __constant__ keyword • __device__ __constant__ int var = 1; • Can also just use __constant__ • Set from the host using cudaMemcpyToSymbol (or cudaMemcpy) • cudaMemcpyToSymbol(var, src, count);
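A sketch under the same conventions (names are illustrative): fill a __constant__ symbol from the host, then read it in a kernel, where every thread in a warp hitting the same element triggers the cache broadcast mentioned earlier:

    __constant__ float coeffs[8];             // lives in constant memory

    __global__ void apply(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = coeffs[0] * i;    // one address per warp -- broadcast
    }

    // Host side, before launching:
    // float h_coeffs[8] = { /* ... */ };
    // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));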
Syntax: Texture Memory • To be discussed later…
Memory Issues • Each multiprocessor has a set amount of memory • Limits the number of blocks that can be resident at once • (# of blocks) x (memory used per block) <= total memory • e.g., with 16 KB of shared memory per multiprocessor, blocks using 4 KB each cap residency at 4 blocks • Either get lots of blocks using little memory, or fewer blocks using lots of memory
Memory Issues • Register memory is limited! • Divided among resident threads, just as shared memory is divided among blocks • Can have many threads using few registers, or few threads using many registers • The former is better: more parallelism
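The per-device limits can be queried at runtime; a sketch using the CUDA runtime API (the fields shown are standard members of cudaDeviceProp, though available fields vary by CUDA version):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);
        printf("shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
        printf("registers per block:     %d\n", p.regsPerBlock);
        return 0;
    }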
Memory Issues • Global accesses: slow! • Can be sped up when accesses are contiguous • Memory coalescing: arranging accesses so each warp reads one contiguous run of memory • Coalesced accesses are: • Contiguous accesses • In-order accesses • Aligned accesses
Memory Coalescing: Aligned Accesses • Threads read 4, 8, or 16 bytes at a time from global memory • Accesses must be aligned in memory! • Good: reads starting at aligned addresses (0x00, 0x04, ..., 0x14) • Bad: a read starting at an unaligned address (e.g., 0x07) • Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?
Memory Coalescing: Aligned Accesses • Also bad: an access pattern whose beginning is unaligned, even if the rest is contiguous
Memory Coalescing: Aligned Accesses • Built-in types force alignment • float3 (12 B) is smaller than float4 (16 B), but float3 array elements are not 16-byte aligned • To align a struct, use __align__(x) // x = 4, 8, 16 • cudaMalloc aligns the start of each allocation automatically • cudaMallocPitch aligns the start of each row for 2D arrays
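A sketch of forcing alignment (the struct name is an assumption):

    // 12 bytes of data padded to 16, so array elements never straddle
    // a 16-byte boundary and can be fetched in one transaction
    struct __align__(16) Point3 {
        float x, y, z;
    };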
Memory Coalescing: Contiguous Accesses • Contiguous = the warp's accesses cover one unbroken run of memory • Example of non-contiguous access (figure): threads 3 and 4 swap their accesses, breaking the run
Memory Coalescing: Contiguous Accesses • Which is better? • index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y); • index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x); • The first: block offsets are multiples of blockDim.x, so consecutive threads in a warp hit consecutive addresses with no gaps or overlap (see the sketch below)
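A sketch of the first (coalesced) indexing inside a kernel (the kernel itself is hypothetical):

    __global__ void copy2d(const float *in, float *out) {
        int index = threadIdx.x
                  + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
        out[index] = in[index];   // each warp touches one contiguous, aligned run
    }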
Memory Coalescing: Contiguous Accesses • Case 1: Contiguous accesses • (figure: consecutive threads map to consecutive banks -- bank[0] serves thread[0][0] and thread[1][0], bank[1] serves thread[0][1] and thread[1][1], and so on through bank[2], bank[3])
Memory Coalescing: In-order Accesses • In-order accesses • Do not skip addresses • Access addresses in order in memory • Bad examples (figures): on the left, address 140 is skipped; on the right, many addresses are skipped
Memory Coalescing • Good example (figure): contiguous, in-order, aligned accesses
Memory Coalescing • Not as much of an issue on newer hardware • Many restrictions have been relaxed -- e.g., accesses no longer need to be perfectly sequential • However, coalescing and alignment are still good practice!
Memory Issues • Shared memory: • Can also be limiting • Broken up into banks • Optimal when an entire warp reads shared memory together • Banks: • Each bank services only one thread at a time • Bank conflict: when two threads try to access the same bank • Causes slowdowns in the program!
Bank Conflicts • Bad: • Many threads trying to access the same bank
Bank Conflicts • Good: • Few to no bank conflicts
Bank Conflicts • Each bank services one 32-bit word at a time; words map to banks by address mod 64 bytes (16 banks x 4 bytes) • Bank 0 services 0x00, 0x40, 0x80, etc.; bank 1 services 0x04, 0x44, 0x84, etc. • Want to avoid multiple threads accessing the same bank • Keep data spread out • Split data larger than 4 bytes into multiple accesses • Be careful with strides that share a factor with the bank count (e.g., even strides) -- see the sketch below
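A sketch of the standard padding trick (not from the lecture): in a 32x32 shared tile, reading a column puts every thread of a warp on the same bank; padding each row to 33 words spreads a column across all banks. Assumes hardware that allows a 32x32 (1024-thread) block:

    __global__ void transpose32(const float *in, float *out) {
        __shared__ float tile[32][33];     // 33, not 32: the pad kills conflicts
        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * 32 + x];       // coalesced read from global
        __syncthreads();
        out[y * 32 + x] = tile[x][y];      // column read -- conflict-free via pad
    }

    // Launch for a single 32x32 matrix:
    // transpose32<<<1, dim3(32, 32)>>>(d_in, d_out);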
Broadcasting • Fast distribution of data to threads • Happens when an entire warp reads the same address • The word is broadcast to all threads in a single read (see the sketch below)
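A minimal sketch (hypothetical kernel): every thread reads the same shared-memory word, so one bank satisfies the whole warp in a single broadcast:

    __global__ void scale_by_first(float *data) {
        __shared__ float s;
        if (threadIdx.x == 0) s = data[0]; // one thread loads the value
        __syncthreads();
        data[threadIdx.x] *= s;            // same address for all -- broadcast
    }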
Summary • Best memory management: • Balances memory optimization with parallelism • Break problem up into coalesced chunks • Process data in shared memory, then copy back to global • Remember to avoid bank conflicts!
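Tying the advice together, a hedged sketch (the in-block reversal is just placeholder work): coalesced load into shared memory, per-block processing, coalesced store back to global:

    __global__ void process(const float *in, float *out) {
        extern __shared__ float buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = in[i];                    // coalesced global read
        __syncthreads();
        out[i] = buf[blockDim.x - 1 - threadIdx.x];  // conflict-free, coalesced write
    }

    // process<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);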
Next Time • Texture memory • CUDA Applications in graphics