
ECE 569 High Performance Processors and Systems


Presentation Transcript


  1. ECE 569 High Performance Processors and Systems
• Administrative
• HW2 available, due next Thursday 2/13 @ start of class
• GPUs
• Thrust library
• Execution model
ECE 569 -- 06 Feb 2014

  2. CPU vs. GPU
[diagram: CPU with a large control unit, cache, and a few ALUs vs. GPU with many small ALUs; both backed by DRAM]
• Fundamentally different designs: where the CPU spends its area on a few complex cores, the GPU spends it on simpler, slower, massively more cores

  3. GPGPU languages
• CUDA
• OpenCL
• Microsoft DirectCompute
• …

  4. Higher-level Abstractions
• Microsoft AMP
• Thrust
• …

  5. Thrust
• STL-like approach to GPU programming
• A template library for CUDA
• Installed as part of CUDA 4.0 and newer
• 80-20 rule: make 80% of CUDA programming easier…

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    int N = 100;
    thrust::host_vector<int> h_vec(N);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);      // generate data on host
    thrust::device_vector<int> d_vec = h_vec;                 // copy to device
    thrust::sort(d_vec.begin(), d_vec.end());                 // sort on device
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());  // copy back to host
}

  6. Demo
• Sorting in Thrust…
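The transcript does not capture the demo code itself; what follows is a minimal sketch of what such a sorting demo might look like, reusing the pattern from slide 5. Sorting in descending order via the standard thrust::greater comparator is an illustrative addition, not necessarily what the demo showed.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h_vec(100);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);      // random data on host
    thrust::device_vector<int> d_vec = h_vec;                 // copy to device
    thrust::sort(d_vec.begin(), d_vec.end(),
                 thrust::greater<int>());                     // descending sort on the GPU
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());  // copy results back
}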

  7. Demo
• Summing a vector in Thrust…

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <stdio.h>

int main() {
    int N = 100;
    thrust::host_vector<int> h_vec(N);
    for (int i = 0; i < N; i++)  // fill with 1, 2, 3, ..., N:
        h_vec[i] = (i + 1);
    thrust::device_vector<int> d_vec = h_vec;  // copy to device
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
    printf("** Sum: %d\n\n", sum);
    return 0;
}

  8. High-level GPU Architecture
[diagram: host with host memory connected to the GPU and its global memory]

  9. Streaming Multiprocessor (SM)
• 8 Streaming Processors (SP)
• 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
• 1 to 512 threads active
• Shared instruction fetch per 32 threads
• Covers latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access
[diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, eight SPs, and two SFUs, above global memory]

  10. Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory (DRAM)
• Each multiprocessor has:
• A set of 32-bit registers per processor
• On-chip shared memory
• A read-only constant cache
• A read-only texture cache
• A data cache (Fermi only)
[diagram: device with streaming multiprocessors 1…N, each with per-processor registers, shared memory, an instruction unit, constant and texture caches, and (Fermi only) a data cache, above device memory holding the global, constant, and texture spaces]
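A hedged sketch of how these memory spaces show up in CUDA source (the kernel, array sizes, and names are illustrative, and it assumes 256-thread blocks):

__constant__ float coeffs[16];              // constant space: read-only in kernels, cached

__global__ void memorySpaces(float *gdata)  // gdata points into global memory (DRAM)
{
    __shared__ float tile[256];             // on-chip shared memory, one copy per block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;    // assumes blockDim.x == 256
    tile[t] = gdata[i] * coeffs[0];         // stage a value through shared memory
    __syncthreads();                        // make the writes visible block-wide
    gdata[i] = tile[t];
}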

  11. Terminology
• Device = GPU = set of multiprocessors
• Streaming Multiprocessor = set of processors & shared memory
• Kernel = GPU program
• Grid = array of thread blocks that execute a kernel
• Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
• Warp = a subset of a thread block (typically 32 threads) that forms the basic unit of scheduling
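To make the terminology concrete, here is a minimal sketch (kernel, sizes, and names are hypothetical): the grid is an array of blocks, each block is a group of threads, and the hardware slices each block into 32-thread warps.

__global__ void kernel(int *out)
{
    // which block in the grid, and which thread within the block:
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// hypothetical launch: a grid of 40 thread blocks, 256 threads per block;
// each block executes as 256 / 32 = 8 warps
//   kernel<<<40, 256>>>(d_out);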

  12. NVIDIA GPU Execution Model
I. SIMD execution of a warp
II. Multithreaded execution across different warps / blocks
III. Each thread block mapped to a single SM

  13. SIMT = Single-Instruction Multiple Threads
• Coined by NVIDIA
• Combines SIMD execution within a warp with SPMD execution across warps
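One visible consequence of SIMT, shown as a hypothetical sketch: a branch that splits the threads of one warp makes that warp execute both paths serially (divergence), while different warps remain free to follow different paths.

__global__ void divergent(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < 16)   // splits every warp in half, so the warp
        data[i] *= 2;            // serially executes both paths (divergence)
    else
        data[i] += 1;
}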

  14. CUDA Thread Block Overview
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
• Block size: 1 to 512 concurrent threads
• Block shape: 1D, 2D, or 3D
• Block dimensions in threads
• Threads have thread id numbers within the block
• Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work (see the sketch below)
• Threads in different blocks cannot cooperate
• Each block can execute in any order relative to other blocks!
[diagram: a CUDA thread block with thread ids 0, 1, 2, 3, …, m all running the same thread program; courtesy John Nickolls, NVIDIA]
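A minimal sketch of block-level cooperation under these rules (the kernel name is illustrative; assumes blocks of 256 threads): threads stage data in shared memory, synchronize, then read values written by other threads of the same block.

__global__ void reverseBlock(int *data)
{
    __shared__ int buf[256];                  // visible to this block only
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = data[base + t];                  // each thread stages one element
    __syncthreads();                          // wait until the whole block has written
    data[base + t] = buf[blockDim.x - 1 - t]; // read an element another thread wrote
}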

  15. Launching a Kernel Function
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);       // 5000 thread blocks
dim3 DimBlock(4, 8, 8);      // 256 threads per block
size_t SharedMemBytes = 64;  // 64 bytes of shared memory
KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

• The shared-memory size argument applies only to data that is not statically allocated
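Since the third configuration argument sizes only dynamically allocated shared memory, the kernel names that region with an unsized extern declaration. A hedged sketch:

__global__ void KernelFunc(/* ... */)
{
    extern __shared__ char dynBuf[]; // the SharedMemBytes region from <<<...>>>
    __shared__ float fixedBuf[32];   // statically sized shared memory needs no
                                     // configuration argument
    // ...
}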

  16. Example: Thread Scheduling on G80
• Each block is executed as 32-thread warps
• Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
• A total of 768 threads (max)
• Each block consists of 256/32 = 8 warps
• There are 8 × 3 = 24 warps
[diagram: warps of threads t0…t31 from blocks 1, 2, and 3 mapped onto one SM's instruction fetch/dispatch unit, shared memory, SPs, and SFUs]
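The same counting, written out as a small host-side sketch (numbers taken from the slide):

#include <cstdio>

int main()
{
    const int threadsPerBlock = 256, blocksPerSM = 3, warpSize = 32;
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize; // 256/32 = 8
    int warpsPerSM = warpsPerBlock * blocksPerSM;                    // 8 * 3 = 24
    printf("%d warps per block, %d warps in the SM\n", warpsPerBlock, warpsPerSM);
    return 0;
}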

  17. SM Warp Scheduling
[diagram: the SM's multithreaded warp scheduler interleaving instructions from warps 1, 3, and 8 over time]
• SM hardware implements zero-overhead warp scheduling
• Warps whose next instruction has its operands ready for consumption are eligible for execution
• Eligible warps are selected for execution by a prioritized scheduling policy
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
• If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency (see the arithmetic below)
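The arithmetic behind the 13-warp figure, as a sketch using the slide's G80 numbers:

#include <cstdio>

int main()
{
    const int issueCycles = 4;   // cycles to issue one instruction to a whole warp (G80)
    const int instrPerLoad = 4;  // one global memory access every 4 instructions
    const int latency = 200;     // global memory latency in cycles
    // while one warp waits, each other warp covers issueCycles * instrPerLoad cycles
    int covered = issueCycles * instrPerLoad;            // 16 cycles per warp
    int warpsNeeded = (latency + covered - 1) / covered; // ceil(200/16) = 13
    printf("minimum warps to hide latency: %d\n", warpsNeeded);
    return 0;
}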

  18. How is context switching so efficient?
[diagram: one large register file partitioned among threads 0…256 of blocks 0 and 8]
• Large register file (16K registers/block)
• Each thread is assigned a "window" of physical registers
• Works if the entire thread block's registers do not exceed capacity (otherwise, the compiler fails)
• Similarly, shared-memory requirements must not exceed capacity for all simultaneously scheduled blocks
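One practical way to check this budget, assuming the standard CUDA toolchain: ask nvcc to report per-kernel resource usage, and optionally cap threads per block with __launch_bounds__ so the compiler can budget registers accordingly (the kernel name here is hypothetical).

// compile with:  nvcc -Xptxas -v kernel.cu
// (prints registers per thread and shared memory per block for each kernel)

__global__ void __launch_bounds__(256)  // promise at most 256 threads per block
scaleKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}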

  19. Blocks on the SMs
[diagram: SM 0 and SM 1, each with a multithreaded instruction unit, SPs, and shared memory, running threads t0…tm of their assigned blocks]
• Threads run concurrently
• Each SM maintains thread/block ids
• Each SM manages/schedules thread execution
