1 / 45

General Purpose Graphics Processing Units (GPGPUs)

General Purpose Graphics Processing Units (GPGPUs). Lecture notes from MKP, J. Wang, and S. Yalamanchili. What is a GPGPU?. Graphics Processing Unit (GPU): (NVIDIA/AMD/Intel) Many-core Architecture Massively Data-Parallel Processor (Compared with a CPU) Highly Multi-threaded GPGPU:

melia
Download Presentation

General Purpose Graphics Processing Units (GPGPUs)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. General Purpose Graphics Processing Units (GPGPUs) Lecture notes from MKP, J. Wang, and S. Yalamanchili

  2. What is a GPGPU? • Graphics Processing Unit (GPU): (NVIDIA/AMD/Intel) • Many-core Architecture • Massively Data-Parallel Processor (Compared with a CPU) • Highly Multi-threaded • GPGPU: • General-Purpose GPU, High Performance Computing • Become popular with CUDA and OpenCL programming languages

  3. Motivation • High Throughput and Memory Bandwidth

  4. Discrete GPUs in the System

  5. Fused GPUs: AMD & Intel Not as powerful as the discrete GPUs On-Chip and sharing the cache

  6. Core Count: NVIDIA • All cores are not created equal • Need to understand the programming model 1536 cores at 1GHz

  7. GPU Architectures (NVIDIA Tesla) Streaming multiprocessor 8 × Streamingprocessors

  8. NVIDIA GK110 Architectures

  9. CUDA Programming Model • NVIDIA • Compute Unified Device Architecture (CUDA) • Kernel: C-like function executed on GPU • SIMD or SIMT • Single Instruction Multiple Data/thread (SIMD, SIMT) • All threads execute the same instruction • But on its own data • Lock Step Thread 0 1 2 3 4 5 6 7 Inst 0 Data Inst 1 Data

  10. CUDA Thread Hierarchy • Each thread uses IDs to decide what data to work on • 3-dimension • Hierarchy: Thread, Block, Grid Block Thread 0 0,1,0 0,0 0,2,0 2,0 0,3,0 1,0 3,0 0,0,0 0,2,1 1 0,1 1,1 2,1 0,3,1 3,1 0,1,1 0,0,1 2 0,2,2 0,1,2 0,0,2 2,2 3,2 1,2 0,3,2 0,2 3 0,3 1,3 2,3 3,3 0,0,3 0,1,3 0,2,3 0,3,3 Kernel 0 Kernel 1 Kernel 2 1,0,0 1,0,1 1,0,2 1,0,3 Grid Grid Grid Block (0,0,0) Block (0,0,1) Block (0,0,0) Block (0,0,1) Block (0,0,0) Block (0,0,1) Block (0,1,0) Block (0,1,1) Block (0,1,0) Block (0,1,1) Block (0,1,0) Block (0,1,1)

  11. Vector Addition • Let’s assume N=16, blockDim=4  4 blocks for (int index = 0; index < N; ++index) { c[index] = a[index] + b[index]; } + + + + + blockIdx.x = 0 blockDim.x = 4 threadIdx.x = 0,1,2,3 Idx= 0,1,2,3 blockIdx.x = 1 blockDim.x = 4 threadIdx.x = 0,1,2,3 Idx= 4,5,6,7 blockIdx.x = 2 blockDim.x = 4 threadIdx.x = 0,1,2,3 Idx= 8,9,10,11 blockIdx.x = 3 blockDim.x = 4 threadIdx.x = 0,1,2,3 Idx= 12,13,14,15

  12. Vector Addition Kernel CPU Program GPU Program void vector_add ( float *a, float* b, float *c, int N) { for (int index = 0; index < N; ++index) c[index] = a[index] + b[index]; } } int main () { vector_add(a, b, c, N); } __global__ vector_add ( float *a, float *b, float *c, int N) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index < N) c[index] = a[index]+b[index]; } int main() { dim3 dimBlock( blocksize, blocksize) ; dim3 dimGrid (N/dimBlock.x, N/dimBlock.y); add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N); }

  13. GPU Architecture Basics …… PC SM SM SM The SI in SIMT I-Cache Fetch Memory Controller Decoder Memory Core Core Core Core CUDA Core EX FP Unit INT Unit In-order Core MEM WB

  14. Execution of a CUDA Program • Blocks are scheduled and executed independently on SMs • All blocks share memory

  15. Executing a Block of Threads • Execution Unit: Warp • a group of threads (32 for NVIDIA GPUs) • Blocks are partitioned into warps with consecutive thread ID. SM Warp 0 Warp 1 Warp 2 Warp 3 Warp 0 Warp 1 Warp 2 Warp 3 Block 0 128 Threads Block 1 128 Threads

  16. Warp Execution • A warp executes one common instruction at a time • Threads in a warp are mapped to CUDA cores • Warps are switched and executed on SM Warp Execution Inst 1 Inst 2 T T T T T T T T T T T T Inst 3 One warp One warp One warp PC SM Core Core Core Core

  17. Handling Branches • CUDA Code: if(…) … (True for some threads) else … (True for others) • What if threads takes different branches? • Branch Divergence! taken not taken T T T T

  18. Branch Divergence • Occurs within a warp • All branch conditions are serialized and will be executed • Performance issue: low warp utilization • if(…) • {… } • else { • …} Idle threads

  19. Vector Addition • N = 60 • 64 Threads, 1 block • Q: Is there any branch divergence? In which warp? __global__ vector_add ( float *a, float *b, float *c, int N) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index < N) c[index] = a[index]+b[index]; }

  20. Example: VectorAddon GPU CUDA: __global__ vector_add ( float *a, float *b, float *c, int N) { int index = blockIdx.x * blockDim.x + threadIdx.x; if (index < N) c[index] = a[index]+b[index]; } PTX (Assembly): setp.lt.s32 %p, %r5, %rd4; //r5 = index, rd4 = N @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; //r6 = &a[index] ld.global.f32 %f2, [%r7]; //r7 = &b[index] add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; //r8 = &c[index] L2: ret;

  21. Example: VectorAdd on GPU • N=8, 8 Threads, 1 block, warp size = 4 • 1 SM, 4 Cores • Pipeline: • Fetch: • One instruction from each warp • Round-robin through all warps • Execution: • In-order execution within warps • With proper data forwarding • 1 Cycle each stage • How many warps?

  22. Execution Sequence setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  23. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; setp W0 FE DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  24. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; setp W1 FE setp W0 DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  25. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; bra W0 FE setp W1 DE setp W0 setp W0 setp W0 setp W0 EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  26. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; @p bra W1 FE @p bra W0 DE setp W1 setp W1 setp W1 setp W1 EXE EXE EXE EXE setp W0 setp W0 setp W0 setp W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  27. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; bra L2 FE @p bra W1 DE bra W0 bra W0 bra W0 bra W0 EXE EXE EXE EXE setp W0 setp W1 setp W0 setp W1 setp W0 setp W1 setp W1 setp W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  28. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; bra L2 FE DE bra W1 bra W1 bra W1 bra W1 EXE EXE EXE EXE setp W1 bra W0 bra W0 setp W1 bra W0 setp W1 setp W1 bra W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  29. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W0 FE DE EXE EXE EXE EXE bra W1 bra W0 bra W0 bra W1 bra W1 bra W0 bra W1 bra W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  30. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W1 FE ld W0 DE EXE EXE EXE EXE bra W1 bra W1 bra W1 bra W1 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  31. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W0 FE ld W1 DE ld W0 ld W0 ld W0 ld W0 EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  32. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ld W1 FE ld W0 DE ld W1 ld W1 ld W1 ld W1 EXE EXE EXE EXE ld W0 ld W0 ld W0 ld W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  33. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; add W0 FE ld W1 DE ld W1 ld W1 ld W1 ld W1 EXE EXE EXE EXE ld W0 ld W0 ld W0 ld W0 ld W0 ld W0 ld W0 ld W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  34. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; add W1 FE add W0 DE ld W1 ld W1 ld W1 ld W1 ld W1 ld W1 ld W1 ld W1 EXE EXE EXE EXE ld W0 ld W0 ld W0 ld W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  35. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; st W0 FE add W1 DE ld W1 ld W1 ld W1 ld W1 EXE EXE EXE EXE ld W0 add W0 ld W0 add W0 add W0 ld W0 add W0 ld W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  36. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; st W1 FE st W0 DE add W1 ld W1 ld W1 add W1 ld W1 add W1 ld W1 add W1 EXE EXE EXE EXE add W0 add W0 add W0 add W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  37. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ret FE st W1 DE add W1 add W1 add W1 add W1 EXE EXE EXE EXE add W0 st W0 add W0 st W0 st W0 add W0 st W0 add W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  38. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; ret FE ret DE add W1 st W1 add W1 st W1 add W1 st W1 st W1 add W1 EXE EXE EXE EXE st W0 st W0 st W0 st W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  39. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE ret DE st W1 st W1 st W1 st W1 EXE EXE EXE EXE ret st W0 ret st W0 ret st W0 ret st W0 MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  40. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE ret st W1 ret st W1 ret st W1 st W1 ret EXE EXE EXE EXE ret ret ret ret MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  41. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE ret ret ret ret EXE EXE EXE EXE ret ret ret ret MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  42. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE ret ret ret ret EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  43. Execution Sequence (cont.) setp.lt.s32 %p, %r5, %rd4; @p bra L1; bra L2; L1: ld.global.f32 %f1, [%r6]; ld.global.f32 %f2, [%r7]; add.f32 %f3, %f1, %f2; st.global.f32 [%r8], %f3; L2: ret; FE DE EXE EXE EXE EXE MEM MEM MEM MEM WB WB WB WB Warp0 Warp1

  44. Study Guide • Be able to define the terms thread block, warp, and SIMT with examples • Understand the Vector Addition Example in enough detail to • Know what operations are in each core at any cycle • Given a number of pipeline stages in each core know how many warps are required to fill the pipelines? • How many instructions are executed in total? • Key differences between fused and discrete GPUs

  45. Glossary • CUDA • Branch divergence • Kernel • OpenCL • Stream Multiprocessor • Thread block • Warp

More Related