1 / 35

CS179: GPU Programming

CS179: GPU Programming. Lecture 5: Memory. Today. GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling. Memory Overview. Very slow access: Between host and device Slow access: Global Memory Fast access: Shared memory, constant memory, texture memory, local memory

ophira
Download Presentation

CS179: GPU Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS179: GPU Programming Lecture 5: Memory

  2. Today • GPU Memory Overview • CUDA Memory Syntax • Tips and tricks for memory handling

  3. Memory Overview • Very slow access: • Between host and device • Slow access: • Global Memory • Fast access: • Shared memory, constant memory, texture memory, local memory • Very fast access: • Register memory

  4. Global Memory • Read/write • Shared between blocks and grids • Same across multiple kernel executions • Very slow to access • No caching!

  5. Constant Memory • Read-only in device • Cached in multiprocessor • Fairly quick • Cache can broadcast to all active threads

  6. Texture Memory • Read-only in device • 2D cached -- quick access • Filtering methods available

  7. Shared Memory • Read/write per block • Memory is shared within block • Generally quick • Has bad worst-cases

  8. Local Memory • Read/write per thread • Not too fast (stored independent of chip) • Each thread can only see its own local memory • Indexable (can do arrays)

  9. Register Memory • Read/write per thread function • Extremely fast • Each thread can only see its own register memory • Not indexable (can’t do arrays)

  10. Syntax:Register Memory • Default memory type • Declare as normal -- no special syntax • intvar = 1; • Only accessible by current thread

  11. Syntax:Local Memory • “Global” variables for threads • Can modify across local functions for a thread • Declare with __device__ __local__ keyword • __device__ __local__ intvar = 1; • Can also just use __local__

  12. Syntax: Shared Memory • Shared across threads in block, not across blocks • Cannot use pointers, but can use array syntax for arrays • Declare with __device__ __shared__ keyword • __device__ __shared__ intvar[]; • Can also just use __shared__ • Don’t need to declare size for arrays

  13. Syntax: Global Memory • Created with cudaMalloc • Can pass pointers between host and kernel • Transfer is slow! • Declare with __device__keyword • __device__ intvar = 1;

  14. Syntax: Constant Memory • Declare with __device__ __constant__ keyword • __device__ __constant__ intvar = 1; • Can also just use __constant__ • Set using cudaMemcpyToSymbol(or cudaMemcpy) • cudaMemcpyToSymbol(var, src, count);

  15. Syntax: Texture Memory • To be discussed later…

  16. Memory Issues • Each multiprocessor has set amount of memory • Limits amount of blocks we can have • (# of blocks) x (memory used per block) <= total memory • Either get lots of blocks using little memory, or fewer blocks using lots of memory

  17. Memory Issues • Register memory is limited! • Similar to shared memory in blocks • Can have many threads using fewer registers, or few threads using many registers • Former is better, more parallelism

  18. Memory Issues • Global accesses: slow! • Can be sped up when memory is contiguous • Memory coalescing: making memory contiguous • Coalesced accesses are: • Contiguous accesses • In-order accesses • Aligned accesses

  19. Memory Coalescing:Aligned Accesses • Threads read 4, 8, or 16 bytes at a time from global memory • Accesses must be aligned in memory! • Good: • Bad: • Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE? 0x00 0x04 0x14 0x00 0x07 0x14

  20. Memory CoalescingAligned Accesses Also bad: beginning unaligned

  21. Memory Coalescing:Aligned Accesses • Built-in types force alignment • float3 (12B) takes up the same space as float4 (16B) • float3 arrays are not aligned! • To align a struct, use __align__(x) // x = 4, 8, 16 • cudaMallocaligns the start of each block automatically • cudaMalloc2D aligns the start of each row for 2D arrays

  22. Memory Coalescing:Contiguous Accesses • Contiguous = memory is together • Example: non-contiguous memory • Thread 3 and 4 swapped accesses!

  23. Memory Coalescing:Contiguous Accesses • Which is better? • index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y); • index = threadIdx.x+ blockDim.y* (blockIdx.y+ gridDim.y* blockIdx.x);

  24. Memory Coalescing:Contiguous Accesses • Case 1: Contiguous accesses bank[0] thread[0][0] thread[1][0] bank[1] thread[0][1] thread[1][1] bank[2] bank[3]

  25. Memory Coalescing:Contiguous Accesses • Case 1: Contiguous accesses bank[0] thread[0][0] thread[1][0] bank[1] thread[0][1] thread[1][1] bank[2] bank[3]

  26. Memory Coalescing:In-order Accesses • In-order accesses • Do not skip addresses • Access addresses in order in memory • Bad example: • Left: address 140 skipped • Right: lots of skipped addresses

  27. Memory Coalescing • Good example:

  28. Memory Coalescing • Not as much of an issue in new hardware • Many restrictions relaxed -- e.g., do not need to have sequential access • However, memory coalescing and alignment still good practice!

  29. Memory Issues • Shared memory: • Also can be limiting • Broken up into banks • Optimal when entire warp is reading shared memory together • Banks: • Each bank services only one thread at a time • Bank conflict: when two threads try to access same block • Causes slowdowns in program!

  30. Bank Conflicts • Bad: • Many threads trying to access the same bank

  31. Bank Conflicts • Good: • Few to no bank conflicts

  32. Bank Conflicts • Banks service 32-bit words at a time at addresses mod 64 • Bank 0 services 0x00, 0x40, 0x80, etc., bank 1 services 0x04, 0x44, 0x84, etc. • Want to avoid multiple thread access to same bank • Keep data spread out • Split data that is larger than 4 bytes into multiple accesses • Be careful of data elements with even stride

  33. Broadcasting • Fast distribution of data to threads • Happens when entire warp tries to access same address • Memory will get broadcasted to all threads in one read

  34. Summary • Best memory management: • Balances memory optimization with parallelism • Break problem up into coalesced chunks • Process data in shared memory, then copy back to global • Remember to avoid bank conflicts!

  35. Next Time • Texture memory • CUDA Applications in graphics

More Related