500 likes | 741 Views
CUDA Crash-course. Alastair Nottingham Security and Networks Research Group. Lecture Overview. Overview. Introduction To GPGPU. CUDA Architecture. Memory Model. Programming with CUDA. Maximising Performance. Results. Introduction to GPGPU. Introduction. Introduction to GPGPU.
E N D
CUDA Crash-course Alastair Nottingham Security and Networks Research Group
Lecture Overview Overview Introduction To GPGPU CUDA Architecture Memory Model Programming with CUDA Maximising Performance Results
Introduction Introduction to GPGPU • What is a GPU? • A GPU is essentially a massively parallel coprocessor, primarily utilised for real-time rendering of complex graphical scenes. • What is CUDA? • Compute Unified Device Architecture. • A highly parallel architecture for General Processing on a GPU (GPGPU). • Benefits: • Commodity hardware. • Negligible CPU overhead. • Astoundingly fast, when done right. • What should I use it for? • Highly parallelisable processing of very large datasets. • What shouldn’t I use it for? • Sequential processing tasks (obviously). • Context sensitive data-set processing. • Small data-sets.
Brief History of GPGPU • Pre-1999– Early graphics cards. • 1999 – Geforce256, the first GPU. • 2001 – nfiniteFX and custom programmable shaders. • 2003 – Stanford Brook, the first GPGPU language. • 2006 – DirectX10 and the Unified Shading Model. • 2007 – NVIDIA CUDA and AMD (Fire)Stream released. • 2009 – OpenCL& DirectCompute. • 2011 – AMD Accelerated Parallel Processing (APP) SDK. Introduction to GPGPU
CPU vs. GPU (Performance) Introduction to GPGPU GT200 G80 Ultra G92 G80 G71 G70 3.2 GHz Harpertown 3.0 GHz Core2 Duo NV40 NV35 NV30 Jan 2003 Jun 2003 Apr 2004 Jun 2005 Mar 2006 Nov 2006 May 2007 Jun 2008 G80 Ultra G80 G71 NV40 NV30 Woodcrest Harpertown Prescott EE Northwood
Currently…. Introduction to GPGPU
CPU vs. GPU (Architecture) Introduction to GPGPU Control ALU ALU ALU ALU CACHE The GPU devotes more transistors to data processing CPU GPU
For simplicity, discussion will focus on GTX 280 architecture.
GTX 280 Hardware Model (abstract) CUDA Architecture 1 GB Global Memory
CUDA Programming Model (Abstract) CUDA Architecture CUDA DEVICE Block 1 Block N Shared Memory Shared Memory ... ... ... Thread 1 Thread M Thread 1 Thread M Local Memory 1 Local Memory M Local Memory 1 Local Memory M Device Memory Global Memory Constant Memory A CUDA Device executes a Grid containing N Blocks. Each Block has a Shared Memory store. And executes M threads. Each Thread has a local memory store (registers). Device memory is accessible to all threads in all blocks. Device memory is divided into Global and Constant Memory.
How Thread Blocks Map to Multiprocessors CUDA Architecture GTX 280 Multiprocessor (1 of 30 on GTX 280) Maximum 1024 Threads FULL OCCUPANCY PARTIAL OCCUPANCY • Occupancy is also affected by: • Per Block Shared Memory Requirements • Per Block Register Requirements Thread Block 256 Threads Thread Block 192 Threads Thread Block 512 Threads (Maximum Block Size) Thread Block 192 Threads Thread Block 256 Threads Thread Block 192 Threads Thread Block 256 Threads Thread Block 512 Threads (Maximum Block Size) Thread Block 192 Threads Thread Block 192 Threads Thread Block 256 Threads Unused Capacity 64 Threads
Thread Warps CUDA Architecture CUDA enabled GPUs implement a SIMT (Single Instruction, Multiple Thread) architecture, with zero thread scheduling overhead. While essentially SPMD (Single Program, Multiple Data), threads are executed in SIMD batches called Warps. Each Warp contains 32 threads. All threads in a warp execute the same instructions in unison. • Data-dependent conditional branching within a warp is handled by executing both if and else branches sequentially. • Threads which do not match the condition sleep. • If no threads match a condition, the associated branch is not executed at all. • Warp size is set in hardware, and is unlikely to change in successive GPU generations. Thread divergence in different warps Thread divergence in same warp if Thread 0 Thread 0 else Thread 32 Thread 1
Why Warps? CUDA Architecture GTX 280 Multiprocessor Register File Shared Memory Processing Cores Instruction Register 1 2 3 4 5 6 7 8 Texture Cache • Each processing core can issue an instruction to 4 threads in the time it takes the instruction register to update. • This corresponds to threads sharing a single instruction. • The instruction register on Fermi cards is twice as fast. • GTX 400 series MP has 32 cores and 2 IRs. • GTX 500 series MP has 48 cores and 3 IRs. • Warps are differentiated by block-level thread index: • Threads 00-31 are in the first warp. • Threads 32-63 are in the second warp. • Threads 64-95 and in the third warp, etc.
Thread Synchronisation CUDA Architecture Thread Synchronisation provides a mechanism for merging divergent thread paths in a block, thus reducing divergence. if (threadIdx.x > 7) //if thread index is greater than 7 { ...Important Stuff... //warp divergence } __syncthreads(); //no more warp divergence All threads must wait at __syncthreads() until all other threads in the block reach the directive. Only then may threads recommence. Must be used with caution – if some threads cant reach the directive, the kernel will never complete. if (threadIdx.x > 7) //if thread index is greater than 7 { ...Important Stuff... __syncthreads(); //FAIL }
Global Memory Memory Model Global Memory is the largest memory region on the device, with 1024MB of GDDR3 memory. While plentiful, it is at least two orders of magnitude slower than on-chip memory. Access Speed: 200 – 1000+ clock cycles Exceptions: Global memory performance can be improved through Coalescing, or by leveraging the texture cache.
Coalescing Global Memory Memory Model
Texture Memory Memory Model To avoid the headaches of coalescing, you can instead abuse the 64KB texture cache available on each multiprocessor, by binding regions of Global Memory to a Texture Reference. Memory accessed using a Texture Reference reads roughly as fast as fully coalesced Global Memory. • Texture References Limitations: • READ ONLY. • Memory must be bound to texture references in host-side code, prior to kernel execution. • Only supports 32-bit and 64-bit ints, floats, doubles, and vectors of these types. • On GTX 280, texture memory is mysteriously limited to 512 MB...
Registers Memory Model Each Multiprocessor contains 16 384 registers, which reside in an on-chip register file. • These registers are divided equally between each of the blocks running on the multiprocessor. • Blocks of 512 threads get 8192 registers each. • Blocks of 256 threads get 4096 registers each. • Access Speed: 0 clock cycles • Exceptions: • Read-After-Write Dependencies: • 24 clock cycles if thread count is less than 192. • 0 otherwise. • Register Bank Conflicts: • Register allocation is handled internally, so cannot be explicitly avoided. • Can be minimised by ensuring a multiple of 64 threads per block. • Register Pressure: • Occurs when registers are over allocated, and are pushed into global memory. • Extremely expensive (200+ clock cycles)
Shared Memory Memory Model • 16KB of Shared memory resides on each multiprocessor, acting as an explicit cache. This memory is shared between all executing blocks. • Blocks of 512 threads get 8KB each. • Blocks of 256 threads get 4KB each. • Values stored in shared memory are accessible by any other thread in the executing block. This allows threads in the same block to communicate with another, while also providing fast temporary storage. • Access Speed: 1 clock cycle • Exceptions: • Memory Bank Conflicts: • If N threads in a half-warp try to access the same memory bank, their requests must be serialised into N separate requests. 1 2 3 4 5 6 7 8 Memory Banks 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Shared Memory Indexes
Constant Memory Memory Model CUDA devices have 64KB of constant memory, with 16KB of cache. This region can be used to store program instructions, pointers to global memory, and kernel arguments (which would otherwise consume shared memory). Access Speed on Cache Hit: Fast as a Register (0 – 24 clock cycles) Access Speed on Cache Miss: Slow as Global Memory (200 – 1000+ clock cycles). Exceptions: On a cache hit, constant memory is as fast as a register ONLY if all threads within the half-warp access the same cached value. If threads in the half-warp request different cached values, then the requests are serialised.
Using the CUDA Runtime API Programming with CUDA • Typical Program Structure (Host Code) • Includes: #include “cuda.h”, #include “cuda_runtime_api.h” • Allocate Memory for host and device • cudaMalloc, cudaMallocHost, cudaMallocPitch, etc. • Fill host memory • Copy host memory to device memory • cudaMemcpy, cudaMemcpyHostToDevice • Execute Kernel • myKernel<<<BLOCK_DIM, GRID_DIM [, SHARED_MEM]>>>(args); • Copy device memory to host • cudaMemcpy, cudaMemcpyDeviceToHost • Free allocated memory • cudaFree Fill Memory Execute Kernel Collect Results
Copying Data to and from a Device Programming with CUDA The CUDA Runtime API provides functions for allocating Device Memory, and copying data to and from the device. int * data_h, * data_d; cudaMallocHost((void **) &data_h, 256 * sizeof(int)); //== data_h = (int *) malloc(256 * sizeof(int)); cudaMalloc((void **) &data_d, 256 * sizeof(int)); for (int k = 0; k < 256; ++k) data_h[k] = k; //fill host array // (dest, source, size, type) cudaMemcpy(data_d, data_h, 256 * sizeof(int), cudaMemcpyHostToDevice); /* - run the kernel (asynchronous) - do any host-side processing */ cudaThreadSynchronize(); //ensure all threads have completed cudaMemcpy(data_h, data_d, 256 * sizeof(int), cudaMemcpyDeviceToHost); cudaFreeHost(data_h); //same as free(data_h) cudaFree(data_d); Declare, Allocate and Fill Arrays Copy Data to and from Device Memory Free Memory
Executing a Kernel Programming with CUDA Only CUDA Kernels may be called from the Host. Kernels need to be informed of Grid and Block dimensions (and optionally Shared Memory size) when called. These are passed to the kernel at runtime. This allows kernels to be optimised for a variety of datasets. dim3BlockDim(256); //y and z components set to 1 dim3GridDim(data_size / BlockDim.x); //assume data_size is multiple of 256 intShared = BlockDim.x * 4 * sizeof(int); //each thread uses 4 ints shared storage arbitraryKernel<<<GridDim, BlockDim, Shared>>>(someData); Note that 1D grid and block dimensions may be integers, but dim3 is required for higher dimensions.
Kernel Orienteering Programming with CUDA gridDim.x NOTE: Block execution order is not guaranteed. Ensure block independence. block block block blockDim.x Shared Memory Shared Memory Shared Memory thread thread thread thread thread thread thread thread thread thread thread thread 0 1 2 3 0 1 2 3 0 1 2 3 threadIdx.x 0 1 2 blockIdx.x Global Thread ID =blockDim.x *blockIdx.x +threadIdx.x 6 = 4 * 1 + 2 Max Threads / Block: 512 Max Shared Memory / Block: 16 384 bytes Max Blocks / Grid: 65 536 per dimension (> 281 trillion)
Example 1 – Increasing 1024integers by 1. • __global__ void add_one(int* data_in, int* data_out) • { • //find the index of the thread • intthread = blockDim.x * blockIdx.x + threadIdx.x; • //read in data and increment • inttmp = data_in[thread] + 1; • //copy out data • data_out[thread] = tmp; • } • … • add_one<<< 4, 256 >>>(data_in_host, data_out_host); //on host Programming with CUDA
Declaring Shared Memory Programming with CUDA Shared Memory is typically statically allocated within a kernel, using the __shared__ qualifier. When statically allocated, multiple shared variables and arrays may be declared. __shared__ char temp1[512]; __shared__ inttemp2[SHARED_SIZE]; Shared Memory may also be dynamically allocated using the extern qualifier. All dynamically allocated shared memory starts at the same memory offset, so layout must be explicitly managed in code. extern __shared__ char array[]; short * array0 = (short *) array; float * array1 = (float *) &array0[128]; int* array2 = (int *) &array1[64];
Using Shared Memory Programming with CUDA Once declared, shared memory can be treated as a normal array. Because shared memory is shared by all threads in a block, it is important for threads to orientate themselves, such that they read and write to the correct elements of the array. //each thread deals with 1 piece of data temp1[threadIdx.x] = device_array[blockIdx.x * blockDim.x + threadIdx.x]; //each thread deals with 16 pieces of data for (int k = 0; k < 16; ++k) temp2[ 16 * threadIdx.x + k] = device_array[16 * (blockIdx.x * blockDim.x + threadIdx.x) + k]; if (temp3[threadIdx.x] == 5) /* do something */ When Reading and Writing to Shared Memory operated on by other threads in a block, use __syncthreads() and __threadfence_block() to protect data integrity. Atomic functions are also helpful.
Tools: Occupancy Calculator And Visual Profiler Programming with CUDA • Occupancy Calculator: • Excel Spreadsheet for CUDA algorithm design. • Given Threads, Registers and Shared Memory usage per block, calculates the associated performance implications. • Allows you to maximise GPU utilization within your kernel. • Visual Profiler: • Visualizes Kernels, to help determine bottlenecks, defects and areas of general low performance. • Produces graphs and tables. • Useful for diagnosing poor performance.
Speed Bumps Maximising Performance Memory Access Latency Due to the wide range of explicit criteria for optimal performance, poorly crafted Kernels can suffer significant penalties. Thread Divergence Because warp level thread-divergence is essentially serialised by the instruction register, decisional logic should be eliminated where possible. Host-Device Transfer Overhead PCI-E 16x can transfer at up to 16GB/s, which, in the grand scheme of things, is quite slow. Operator Performance Certain arithmetic operators perform relatively poorly, and should be avoided.
Example: Array Reversal Maximising Performance __global__ void reverse_array(int * array_in, int* array_out) //naive kernel { intcurr_index = blockDim.x * blockIdx.x + threadIdx.x; array_out[curr_index] = array_in[gridDim.x * blockDim.x – 1 – curr_index]; } __global__ void reverse_array_ex(int * array_in, int*array_out) //less naive kernel { __shared__ inttmp[BLOCK_DIM]; tmp[blockDim.x - 1 - threadIdx.x] = array_in[blockDim.x * blockIdx.x + threadIdx.x]; __syncthreads(); array_out[blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x] = tmp[threadIdx.x]; } Better Kernel array_in 0 1 2 3 4 5 6 7 8 9 10 11 tmp 3 2 1 0 7 6 5 4 11 10 9 8 array_out 11 10 9 8 7 6 5 4 3 2 1 0
Summing Elements in an Array (Conceptual) Maximising Performance 15 6 32 11 2 27 19 9 0 1 2 3 4 5 6 7 21 0 43 0 29 0 28 0 0 1 2 3 4 5 6 7 64 0 0 0 57 0 0 0 0 1 2 3 4 5 6 7 121 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7
A Better Solution Maximising Performance 15 6 32 11 2 27 19 9 0 1 2 3 4 5 6 7 21 43 29 28 0 0 0 0 0 1 2 3 4 5 6 7 64 57 0 0 0 0 0 0 Minimises Divergence in Thread Warps 0 1 2 3 4 5 6 7 121 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7
Performance difference • With 256 threads in a block, both solutions take 9 iterations to sum 512 elements. Total active warps differ significantly, however.
Nested Loop Unroll-and-Jam Parallel Classification 1. Initial Nested Loop 2. Partially Unrolled Outer Loop for (int j = 0; j < M; j++) { for (int k = 0; k < N; k++) { foo(j, k); } } for (int j = 0; j < M; j += 4) { for (int k = 0; k < N; k++) { foo(j, k); } for (int k = 0; k < N; k++) { foo(j + 1, k); } for (int k = 0; k < N; k++) { foo(j + 2, k); } for (int k = 0; k < N; k++) { foo(j + 3, k); } } 3. Unroll-and-Jam for (int j = 0; j < M; j += 4) { for (int k = 0; k < N; k++) { foo(j, k); foo(j + 1, k); foo(j + 2, k); foo(j + 3, k); } }
Optimising to Host-To-Device Transfer Maximising Performance • CUDA supports two basic memory types: • Pageable Memory (8 GBps) • Pagelocked Memory (16 GBps) • Pagelocked memory transfers faster than pageable memory, and supports several optimisations. It is however a scarce resource, and thus overuse can degrade system performance. • Write-Combined Memory – • Transfers faster over PCI-E, and frees up L1 and L2 cache resources for the rest of the program to use. It should not be read from by the host. • Mapped Memory – • Map a region of host memory onto the GPU. Eliminates all transfer overhead in integrated devices, but is very slow on discreet devices (only 4GBps) . • Asynchronous Streams – • Typically, data transfer and kernel execution happen synchronously (separately). Through streams, data transfer and kernel execution can occur asynchronously (concurrently), dramatically improving the overall speed of the program. Buffers are advised.
Integer Operator Performance Maximising Performance • Most arithmetic integer operators perform well in CUDA (between 2 and 8 operations per clock cycle). However, integer division and modulo operations are extremely expensive, taking up to 10x longer to compute. • This expense can be avoided if the dividend is a multiple of 2 using bitwise operations. These are particularly fast for 32-bit integers. • a / b == a >> log2(b) • a % b == a & (b – 1) • The performance of the operators and functions available to kernels is provided in the CUDA Programming Guide.
Our Results (1): Comparative Packet Classificatin Speeds Results
Our Results (2): Classification Throughput vs. Packet Count Results