Learn about CUDA programming for CPU and GPU, memory bandwidth, parallel processing, and floating-point operations. Explore Hillis' Thesis '85 and the evolution of GPU architecture. Understand CUDA thread organization, kernel functions, and thread hierarchy examples.
Hillis’ Thesis ’85 (back to the future!)
[Figure: the same piece of silicon organized as a sequential computer vs. a parallel computer]
• Proposed "The Connection Machine": a massive number of processors, each with a small memory, operating in SIMD mode.
• The CM-1 and CM-2 machines from Thinking Machines Corporation (TMC) were examples of this architecture, with 32K-128K processors.
CUDA Supports Various Languages or Application Programming Interfaces
Automatic Scalability A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.
• NVIDIA GPUs have a number of multiprocessors, each of which executes in parallel with the others.
• On Tesla, each multiprocessor has a group of 8 stream processors; a Fermi multiprocessor has two groups of 16 stream processors.
• A core refers to a stream processor. The high-end Tesla accelerators have 30 multiprocessors, for a total of 240 cores; a high-end Fermi has 16 multiprocessors, for 512 cores.
• Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion: all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
• SIMT handles conditionals somewhat differently than SIMD, though the effect is much the same: some cores are disabled for conditional operations.
Compute Capability
• The compute capability of a device is defined by a major revision number and a minor revision number.
• Devices with the same major revision number are of the same core architecture.
• The major revision number of devices based on the Fermi architecture is 2. Prior devices are all of compute capability 1.x (their major revision number is 1).
• The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.
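A minimal sketch (an addition, not from the slides) of how the compute capability and multiprocessor count described above can be queried at run time through the CUDA runtime API; device 0 is an arbitrary choice.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Query the properties of device 0 (assumed to exist)
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}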
CUDA-Enabled Devices with Compute Capability, Number of Multiprocessors, and Number of CUDA Cores
Features and Technical Specifications
Programming Model
• Heterogeneous Programming: serial code executes on the host while parallel code executes on the device.
CUDA Thread Organization • A thread block can have up to 512 threads
Kernels
• CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
• A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<…>>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
Vector Addition Example

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}
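Note that this launch uses a single block (<<<1, N>>>), so N is bounded by the per-block thread limit noted above (N <= 512 on these devices), and A, B, and C must point to arrays in device memory.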
Thread Hierarchy
• threadIdx is a 3-component vector, so threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block.
• For a one-dimensional block of size Dx, the thread ID of a thread of index (x) is x.
• For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y * Dx).
• For a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y * Dx + z * Dx * Dy).
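As a small illustration (an addition, not part of the original slides), the three-dimensional formula above can be evaluated inside a kernel with the built-in threadIdx and blockDim variables; the output array ids is a hypothetical buffer with one slot per thread of the block.

__global__ void LinearThreadId(int* ids)
{
    // blockDim holds the block size (Dx, Dy, Dz)
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    ids[tid] = tid;  // each thread records its own linear thread ID
}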
Matrix Addition Example Using 1 Block

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
Matrix Addition Example Using Multiple Blocks

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
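As written, the launch configuration assumes N is a multiple of 16. A common variation (our assumption, not shown on the slide) rounds the grid size up so that matrix sizes that are not exact multiples are also covered; the guard if (i < N && j < N) in the kernel already discards the surplus threads.

dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);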
CUDA API Functions for Data Transfer Between Memories.
Matrix-Matrix Multiplication Example

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Load M and N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    // 2. Kernel invocation code – to be shown later
    ...

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
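For brevity, the host code above, like the original slide, ignores the status codes returned by the CUDA runtime calls. A minimal sketch of how a single call could be checked (the pattern is ours, not the slide's; it assumes <cstdio> is included):

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));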
Kernel Function

// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue stores the Pd element that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Mdelement = Md[ty * Width + k];
        float Ndelement = Nd[k * Width + tx];
        Pvalue += Mdelement * Ndelement;
    }

    // Write the result to device memory; each thread writes one element
    Pd[ty * Width + tx] = Pvalue;
}
Host Code that Launches a Kernel

// Set up the execution configuration
dim3 dimBlock(WIDTH, WIDTH);
dim3 dimGrid(1, 1);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WIDTH);
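Since this configuration uses a single block of WIDTH x WIDTH threads, the 512-threads-per-block limit mentioned earlier caps WIDTH at 22 (22 * 22 = 484 <= 512, while 23 * 23 = 529 > 512); larger matrices need multiple blocks, as in the next example.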
Block / Grid Definition Examples

dim3 dimBlock(4, 2, 2);
dim3 dimGrid(2, 2, 1);
KernelFunction<<<dimGrid, dimBlock>>>(…);

dim3 dimBlock(16, 16, 1);
dim3 dimGrid(100, 1, 1);
KernelFunction<<<dimGrid, dimBlock>>>(…);

• The first configuration launches 2 * 2 * 1 = 4 blocks of 4 * 2 * 2 = 16 threads each (64 threads in total); the second launches 100 blocks of 16 * 16 * 1 = 256 threads each (25,600 threads in total).
• Note: the grid and block dimensions can be given as the values of variables; they do not need to be compile-time constants.
A simple example of using multiple blocks to calculate Pd
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and of Md
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and of Nd
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}
Small Matrix Multiplication Example
• Assume 2x2 blocks.
• Observation: thread(0,0) and thread(1,0) both access row 0 of Md, and both threads read these Md elements from global memory.
• If we arrange for thread(0,0) and thread(1,0) to collaborate so that these Md elements are loaded from global memory only once, we can reduce the total number of accesses to global memory by half.
• In general, every Md and Nd element is accessed exactly twice during the execution of block(0,0). Therefore, if all four threads collaborate in their accesses to global memory, we can reduce the traffic to global memory by half.
Global memory accesses performed by threads in block(0,0)
• The potential reduction of global memory traffic in matrix multiplication is proportional to the dimension of the blocks used.
• With NxN blocks, the potential reduction of global memory traffic would be N; for example, with 16x16 blocks the traffic to global memory can be cut to 1/16 of the original.
Tiling Md and Nd to utilize shared memory
• Let the threads collaboratively load Md and Nd elements into shared memory before they individually use these elements in their dot-product calculations.
• Note that shared memory is quite small, and one must be careful not to exceed its capacity when loading Md and Nd elements into it.
• This can be accomplished by dividing the Md and Nd matrices into smaller tiles, whose size is chosen so that they fit into shared memory.
• For simplicity, the tile dimensions can be chosen to be equal to the block dimensions.
Execution phases of a tiled matrix multiplication algorithm (block size=2)
Tiled Matrix Multiplication Kernel using shared memory

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}

• The first __syncthreads() ensures that every thread has finished loading its tile elements before any thread uses them; the second ensures that all threads have finished using the current tile before it is overwritten with the next one.
Thread Assignment in GeForce-8 Series GPU Devices
• In the GeForce-8 series hardware, the execution resources are organized into Streaming Multiprocessors (SMs). For example, the GeForce 8800 GTX implementation has 16 SMs.
• Up to 8 blocks can be assigned to each SM, as long as there are enough resources to satisfy the needs of all the blocks.
• With 16 SMs in a GeForce 8800 GTX processor, up to 128 blocks can be simultaneously assigned to the Streaming Multiprocessors.
Thread Assignment in GeForce-8 Series GPU Devices (continued)
• Most grids contain many more than 128 blocks.
• The run-time system maintains a list of blocks that need to execute and assigns new blocks to Streaming Multiprocessors as they complete execution of the blocks previously assigned to them.
• In the GeForce 8800 GTX design, up to 768 threads can be assigned to each SM. This could be in the form of 3 blocks of 256 threads each, 6 blocks of 128 threads each, etc.
• With 16 SMs in the GeForce 8800 GTX, there can be up to 12,288 threads simultaneously residing in SMs for execution.
Thread Scheduling
• Thread scheduling is implementation specific.
• In the GeForce 8800 GTX, once a block is assigned to an SM, it is further divided into 32-thread units called warps.
• Warps are the unit of thread scheduling in SMs.
• Each warp consists of 32 threads of consecutive thread ID values: threads 0 through 31 form the first warp, 32 through 63 the second warp, and so on.
• Example: if each block has 256 threads, we can determine the number of warps that reside in each SM. Each block has 256/32 = 8 warps. With three blocks in each SM, we have 8 * 3 = 24 warps in each SM.
• This is in fact the maximum number of warps that can reside in each SM of the GeForce 8800 GTX, since there can be no more than 768 threads in each SM, which amounts to 768/32 = 24 warps.
• SMs are designed such that only one of these warps is actually executed by the hardware at any point in time.
Warp-Based Thread Scheduling
• Why do we need so many warps if only one of them can execute at any point in time?
• This is how these processors efficiently execute long-latency operations such as accesses to global memory.
• When an instruction executed by the threads in a warp needs to wait for the result of a previously initiated long-latency operation, the warp is placed into a waiting area, and one of the other resident warps that is no longer waiting for results is selected for execution.
• If more than one warp is ready for execution, a priority mechanism is used to select one.
Divergence in Execution
• At any point in time, the hardware selects and executes one warp; an instruction is run for all threads in the warp before moving on to the next instruction.
• This style of execution is motivated by hardware cost constraints: it allows the cost of fetching and processing an instruction to be amortized over a large number of threads.
• It works well when all threads within a warp follow the same control-flow path when working on their data. For an if-then-else construct, execution works well when either all threads execute the then part or all execute the else part.
• When threads within a warp take different control-flow paths, that is, when some threads execute the then part and others execute the else part, this simple execution style no longer works well. In such a situation, executing the warp requires multiple passes through the divergent paths: one pass for the threads that follow the then part and another pass for those that follow the else part. These passes are sequential to each other and therefore add to the execution time.
• When threads in the same warp follow different paths of control flow, we say that these threads diverge in their execution. A small illustration follows.
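As a minimal sketch (ours, not from the original slides), the two kernels below contrast a branch that diverges within a warp with one that does not; the data array and the arithmetic are arbitrary choices for illustration.

// Divergent: even- and odd-numbered threads of the same warp take
// different paths, so the two paths are executed in separate passes.
__global__ void DivergentBranch(float* data)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // taken by even-numbered threads
    else
        data[i] = data[i] + 1.0f;   // taken by odd-numbered threads
}

// Non-divergent: the condition depends only on blockIdx, so all 32
// threads of any given warp follow the same control-flow path.
__global__ void UniformBranch(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}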