GPU Computing Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn
What is GPU? • Graphics Processing Unit • First GPU: GeForce 256 (1999) • Connected to the motherboard via PCI Express • High computational density and memory bandwidth • Massively multithreaded many-core chips • Traditionally used for real-time rendering • Several million units are sold each year.
GPU Pipeline (figure: the graphics pipeline, including the rasterization stage)
Anti-Aliasing (figure: triangle geometry rendered aliased vs. anti-aliased)
GPGPU • General-Purpose Computing on GPUs • Massively Parallel, Simple Operations • Suitable for compute-intensive engineering problems • The original problem must be cast into native graphics operations. • Launched through OpenGL or DirectX API calls • Input data are stored in texture images and issued to the GPU by submitting triangles. • Highly restricted access to input/output • Very tedious; only limited success despite painstaking effort
CPU vs. GPU (figure: a multi-core CPU devotes much of its die to control logic and cache alongside a few ALUs; a many-core GPU devotes most of its die to a large number of ALUs, giving it many more ALUs and higher memory bandwidth)
Power of the Crowd • SM (Streaming Multiprocessor) • Multi-threaded processor core • Processing unit for a thread block • Contains SPs (Streaming Processors) and SFUs (Special Function Units) • SP (Streaming Processor) • Scalar ALU for a single CUDA thread • SIMT (Single-Instruction, Multiple-Thread) • Shared instruction fetch per 32 threads (a warp) (figure: one Streaming Multiprocessor with an instruction L1 cache, an instruction fetch/dispatch unit, shared memory, eight SPs and two SFUs)
Green Computing (chart: GFLOPS per watt for the GTX 750 Ti, GTX 680, GTX 580 and Intel Core i7-980XE)
Supercomputing • TITAN, Oak Ridge National Laboratory • Speed: 24.8 PFLOPS (Theory), 17.6 PFLOPS (Real) • CPU: AMD Opteron 6274 (18,688 × 16 cores) • GPU: NVIDIA Tesla K20 (18,688 × 2496 cores) • Cost: US$ 97 Million • Power: 9 MW
What is CUDA? • Compute Unified Device Architecture • Introduced by NVIDIA in 2007 • Scalable Parallel Programming Model • Small extensions to standard C/C++ • Enable general-purpose GPU computing • Straightforward APIs to manage devices, memory etc. • Only supports NVIDIA GPUs. http://developer.nvidia.com/category/zone/cuda-zone
CUDA-Enabled GPU (figure: the host feeds an input assembler and a thread execution manager, which dispatch threads across many multiprocessors, each with its own parallel data cache and texture unit; load/store units connect them to global memory)
Kepler Architecture • GeForce GTX 680 (Mar. 22, 2012) • GK104, 28 nm process • 3.5 billion transistors on a 294 mm2 die • CUDA Cores: 1536 (8 SMs × 192 SPs) • Memory Bandwidth: 192 GB/s • Peak Performance: 3090 GFLOPS • TDP: 195 W • Release Price: $499
Maxwell Architecture • GeForce GTX 750 Ti (Feb. 18, 2014) • GM107, 28 nm process • 1.87 billion transistors on a 148 mm2 die • CUDA Cores: 640 (5 SMs × 128 cores) • Memory Bandwidth: 86.4 GB/s • Peak Performance: 1306 GFLOPS • TDP: 60 W • Release Price: $149
CUDA Teaching Lab • GTX 750 (GM107): Compute Capability 5.0, 512 CUDA cores, 1 GB 128-bit GDDR5, 80 GB/s, 1044 GFLOPS, TDP 55 W, RMB 799 • GT 630 (GK208): Compute Capability 3.5, 384 CUDA cores, 2 GB 64-bit GDDR3, 14.4 GB/s, 692.7 GFLOPS, TDP 25 W, RMB 419
CUDA Installation https://developer.nvidia.com/cuda-downloads
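Once the toolkit is installed, a short program (not from the slides; a minimal sketch using the standard CUDA runtime API) can confirm that the driver and compiler work and report the properties referred to later, such as compute capability and the number of SMs:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices found: %d\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);   // query the runtime for device i
        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Warp size: %d\n", prop.warpSize);
        printf("  Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
    }
    return 0;
}

Compile and run with, for example, nvcc devicequery.cu -o devicequery && ./devicequery (the file name is arbitrary).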
Heterogeneous Computing (figure: a CUDA program runs partly on the host CPU and partly on the device GPU)
Grids, Blocks and Threads (figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of thread blocks, e.g. Grid 1 arranged as 3 × 2 blocks, and each block, e.g. Block (1, 1) of Grid 2, contains its own array of threads, here 5 × 3)
Thread Block • Threads have thread ID numbers within their block. • Threads use their thread ID to select work. • Threads are assigned to SMs at block granularity. • Each GT200 SM can hold up to 8 blocks. • Each GT200 SM can hold up to 1024 threads. • Threads in the same block can share data and synchronize. • Threads in different blocks cannot cooperate. • Each block can execute in any order relative to other blocks. (figure: a thread block shown as an array of threads with IDs 0, 1, 2, 3, …, m, all running the same thread program)
Transparent Scalability (figure: the same kernel grid of eight blocks is scheduled two blocks at a time on a small device and four blocks at a time on a larger one; because blocks execute independently and in any order, the same code runs on both without change)
Memory Space • Each thread can: • Read/write per-thread registers • Read/write per-block shared memory • Read/write per-grid global memory • Read-only per-grid constant memory (figure: within a grid, each block has its own shared memory and per-thread registers, while all blocks share global and constant memory, which the host can also access) GeForce GTX 680 example: memory bandwidth is 192 GB/s and a single-precision float occupies 4 bytes, so at most 48 billion floats can be streamed from global memory per second; a kernel that performs one floating-point operation per loaded value therefore reaches roughly 48 GFLOPS in practice, far below the 3090 GFLOPS peak. (A small kernel sketch touching each memory space follows.)
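The slide above only lists the memory spaces; as a hedged illustration (the names scale, demo and tile are my own), a kernel touching each of them might look like:

__constant__ float scale;                            // per-grid constant memory, written by the host

__global__ void demo(float *out, const float *in)    // out, in point to global memory
{
    __shared__ float tile[256];                      // per-block shared memory
    int i = threadIdx.x;                             // i lives in a per-thread register
    tile[i] = in[i];                                 // global -> shared
    __syncthreads();
    out[i] = tile[i] * scale;                        // shared and constant -> global
}

The host would set scale with cudaMemcpyToSymbol() before launching the kernel.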
Hello World!

int main(void)
{
    printf("Hello World!\n");
    return 0;
}

__global__ void mykernel(void)
{
}

int main(void)
{
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Your first CUDA code!
Device Code • The CUDA keyword __global__ indicates a kernel function that: • Runs on the device • Is called from the host • The CUDA keyword __device__ indicates a device function that: • Runs on the device • Is called from a kernel function or another device function (see the sketch below) • Triple angle brackets <<< >>> indicate a call from host code to device code (a kernel launch). • nvcc separates the source code into two components: • Device functions are processed by the NVIDIA compiler. • Host functions are processed by the standard host compiler. • $ nvcc hello.cu
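The slides define __device__ but do not show it in use; a minimal sketch (the names square and mykernel2 are illustrative) is:

__device__ int square(int x)          // runs on the device, callable only from device code
{
    return x * x;
}

__global__ void mykernel2(int *out)   // runs on the device, launched from the host
{
    out[threadIdx.x] = square(threadIdx.x);
}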
Addition on Device

__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;
}

• add() will execute on the device.
• add() will be called from the host.
• a, b, c must point to device memory.
• We need to allocate memory on the GPU.
Memory Management • Host and device memories are separate entities. • Device pointers point to GPU memory. • May be passed to/from host code. • May not be dereferenced in host code. • Host pointers point to CPU memory. • May be passed to/from device code. • May not be dereferenced in device code. • CUDA APIs for handling device memory: • cudaMalloc(), cudaFree(), cudaMemcpy() • C equivalents: malloc(), free(), memcpy()
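The examples that follow call these APIs without checking their results. Every CUDA runtime call returns a cudaError_t, and a common pattern (the CHECK macro name here is my own, not part of the CUDA API) is to wrap each call:

#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CHECK(cudaMalloc((void **)&d_a, size));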
Addition on Device: main()

int main(void)
{
    int a, b, c;              // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
Addition on Device: main()

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
Moving to Parallel • Each call to add() adds two integers. • With add() running in parallel, we can do vector addition in parallel. • add<<<nblocks, 1>>>(d_a, d_b, d_c) • Each parallel invocation of add() is referred to as a block. • By using blockIdx.x to index into the array, each block handles a different index. • Blocks can be 2D (see the sketch below): • dim3 nblocks(M, N) • blockIdx.x, blockIdx.y
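The slide only names the 2D launch syntax; as a sketch (the kernel name add2d and the row-major layout with M columns per row are assumptions), one block per matrix element could be indexed like this:

__global__ void add2d(int *a, int *b, int *c, int width)
{
    // blockIdx.y selects the row, blockIdx.x the column of a row-major matrix.
    int index = blockIdx.y * width + blockIdx.x;
    c[index] = a[index] + b[index];
}

// Launch with an M x N grid of single-thread blocks:
// dim3 nblocks(M, N);
// add2d<<<nblocks, 1>>>(d_a, d_b, d_c, M);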
Vector Addition on Device

__global__ void add(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
Vector Addition on Device: main()

#define N 512

int main(void)
{
    int *a, *b, *c;           // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate space for host copies of a, b, c
    // Set up initial values
    a = (int *)malloc(size); rand_ints(a, N);
    b = (int *)malloc(size); rand_ints(b, N);
    c = (int *)malloc(size); rand_ints(c, N);
Vector Addition on Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
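rand_ints() is used above but never defined on the slides; a minimal sketch of such a helper, together with an optional result check, could be:

#include <stdlib.h>

void rand_ints(int *x, int n)
{
    // Fill the array with small random integers.
    for (int i = 0; i < n; i++)
        x[i] = rand() % 100;
}

// Optional check, placed after the device-to-host copy:
// for (int i = 0; i < N; i++)
//     if (c[i] != a[i] + b[i]) { printf("Mismatch at %d\n", i); break; }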
CUDA Threads

• Each block can be split into parallel threads.
• Threads can be up to 3D:
• dim3 nthreads(M, N, P)
• threadIdx.x, threadIdx.y, threadIdx.z

__global__ void add(int *a, int *b, int *c)
{
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

add<<<1, N>>>(d_a, d_b, d_c);
Combining Blocks and Threads • We have seen parallel vector addition using: • Many blocks with one thread each • One block with many threads • Let’s adapt vector addition to use both blocks and threads. • Why bother?
Indexing

M = 8;  // 8 threads per block
int index = threadIdx.x + blockIdx.x * M;

int index = threadIdx.x + blockIdx.x * blockDim.x;

__global__ void add(int *a, int *b, int *c)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
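For example, with 8 threads per block, the thread with threadIdx.x = 5 in block blockIdx.x = 2 computes index = 5 + 2 × 8 = 21, so every element of the global array is handled by exactly one thread.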
Indexing

#define N (2048*2048)
#define M 512   // THREADS_PER_BLOCK
…
add<<<N/M, M>>>(d_a, d_b, d_c);

__global__ void add(int *a, int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

add<<<(N+M-1)/M, M>>>(d_a, d_b, d_c, N);
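The second launch handles sizes that are not a multiple of the block size: (N+M-1)/M rounds the number of blocks up, and the if (index < n) guard disables the surplus threads. For instance, with n = 1000 and M = 512, (1000+511)/512 = 2 blocks (1024 threads) are launched and the last 24 threads do nothing.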
Data Access Pattern (figure: a 1D stencil with a halo of radius elements on each side; each output element is computed from 2*radius + 1 neighbouring input elements, so each input element is read up to 2*radius + 1 times)
Sharing Data Between Threads • Each thread generates one output element. • blockDim.x elements per block • Each input element needs to be read several times. • High I/O cost • Within a block, threads can share data via shared memory. • Data are not visible to threads in other blocks. • Extremely fast on-chip memory • Declared using the keyword __shared__, allocated per block. • Read (blockDim.x + 2*radius) input elements from global to shared memory (see the sketch below).
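The slides stop before showing the kernel itself; below is a sketch of a 1D stencil using shared memory. RADIUS and BLOCK_SIZE are illustrative constants, the launch is assumed to use BLOCK_SIZE threads per block (with BLOCK_SIZE >= RADIUS), and the input array is assumed to be padded with RADIUS halo elements at each end.

#define RADIUS     3
#define BLOCK_SIZE 256

__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread loads its own element; the first RADIUS threads also load the halos.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Wait until the whole block has filled the shared tile.
    __syncthreads();

    // Apply the stencil from shared memory.
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result to global memory.
    out[gindex] = result;
}

The __syncthreads() barrier is what makes the shared tile safe to read: without it, a thread could consume halo elements that a neighbouring thread has not yet loaded.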