ME964 High Performance Computing for Engineering Applications

ME964High Performance Computing for Engineering Applications CUDA API Sept. 18, 2008

Before we get started… • Last Time • The CUDA API • Start discussing CUDA programming model • Today • HW2 is due on Friday, 11:59 PM • HW3 available on the class website (due Sept.25) • A look at a matrix multiplication example • The CUDA execution model 2

Going back to the G80 HW… split personality n. Two distinct personalities in the same entity, each of which prevails at a particular time. 3 HK-UIUC

Calling a Kernel Function, and the Concept of Execution Configuration Last slide of previous lecture • A kernel function must be called with an execution configuration: __global__ void KernelFunc(...); // declaration dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...); • Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit sync needed for blocking 4 HK-UIUC

SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP TF TF TF TF TF TF TF TF Texture Texture Texture Texture Texture Texture Texture Texture Texture L1 L1 L1 L1 L1 L1 L1 L1 Host Host Input Assembler Input Assembler Setup / Rstr / ZCull Thread Execution Manager Vtx Thread Issue Geom Thread Issue Pixel Thread Issue Thread Processor Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache Parallel DataCache L2 L2 L2 L2 L2 L2 Load/store Load/store Load/store Load/store Load/store Load/store FB FB FB FB FB FB Global Memory G80 Thread Computing Pipeline • The future of GPUs is programmable processing • So – build the architecture around the processor • Processors execute computing threads • Alternative operating mode specifically for computing 5 HK-UIUC

Simple Example:Matrix Multiplication • A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs • Leave shared memory usage until later • Local variable and register usage • Thread ID usage • Memory data transfer API between host and device 6 HK-UIUC

Square Matrix Multiplication Example N • P = M * N of size WIDTH x WIDTH • Software Design Decisions: • One thread handles one element of P • Each thread will access entries in M and N WIDTH times from global memory WIDTH M P WIDTH WIDTH WIDTH 7 HK-UIUC

Step 1: Matrix Data Transfers(Host Side) // Allocate the device memory where we will copy M to // Although not shown, you do the same for N and P matrices Matrix Md; #define WIDTH 16 Md.width = WIDTH; Md.height = WIDTH; Md.pitch = WIDTH; int size = WIDTH * WIDTH * sizeof(float); cudaMalloc((void**)&Md.elements, size); // Copy M from the host to the device cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice); ...//do your work here… (see next slides) // Read the result from the device to the host into P cudaMemcpy(P.elements, Pd.elements, size, cudaMemcpyDeviceToHost); // Free device memory cudaFree(Md.elements); 8 HK-UIUC

Step 2: Matrix MultiplicationA Simple Host Code in C // Matrix multiplication on the (CPU) host in double precision; void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P) { for (int i = 0; i < M.height; ++i) { for (int j = 0; j < N.width; ++j) { double sum = 0; for (int k = 0; k < M.width; ++k) { double a = M.elements[i * M.width + k]; //you’ll see a lot of this… double b = N.elements[k * N.width + j]; // and of this as well… sum += a * b; } P.elements[i * N.width + j] = sum; } } } 9 HK-UIUC

Step 3: Matrix Multiplication, Host-side. Main Program Code int main(void) { // Allocate and initialize the matrices. // The last argument in AllocateMatrix: should an initialization with // random numbers be done? Yes: 1. No: 0 (everything is set to zero) Matrix M = AllocateMatrix(WIDTH, WIDTH, 1); Matrix N = AllocateMatrix(WIDTH, WIDTH, 1); Matrix P = AllocateMatrix(WIDTH, WIDTH, 0); // M * N on the device MatrixMulOnDevice(M, N, P); // Free matrices FreeMatrix(M); FreeMatrix(N); FreeMatrix(P); return 0; } 10 HK-UIUC

Step 3: Matrix MultiplicationHost-side code // Matrix multiplication on the device void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P) { // Load M and N to the device Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N); // Allocate P on the device Matrix Pd = AllocateDeviceMatrix(P); CopyToDeviceMatrix(Pd, P); // clear memory // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(WIDTH, WIDTH); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd); // Read P from the device CopyFromDeviceMatrix(P, Pd); // Free device matrices FreeDeviceMatrix(Md); FreeDeviceMatrix(Nd); FreeDeviceMatrix(Pd); } Continue here… 11 HK-UIUC

Multiply Using One Thread Block N Grid 1 Block 1 • One Block of threads computes matrix P • Each thread computes one element of P • Each thread • Loads a row of matrix M • Loads a column of matrix N • Perform one multiply and addition for each pair of M and N elements • Compute to off-chip memory access ratio close to 1:1 • Not that good… • Size of matrix limited by the number of threads allowed in a thread block Thread (2, 2) 48 M P BLOCK_SIZE 12 HK-UIUC

Step 4: Matrix Multiplication- Device-side Kernel Function // Matrix multiplication kernel – thread specification __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P) { // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y; // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; for (int k = 0; k < M.width; ++k) { float Melement = M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; } // Write the matrix to device memory; // each thread writes one element P.elements[ty * P.pitch + tx] = Pvalue; } N WIDTH M P tx WIDTH ty 13 WIDTH WIDTH HK-UIUC

Step 5: Some Loose Ends // Allocate a device matrix of same size as M. Matrix AllocateDeviceMatrix(const Matrix M) { Matrix Mdevice = M; int size = M.width * M.height * sizeof(float); cudaMalloc((void**)&Mdevice.elements, size); return Mdevice; } // Copy a host matrix to a device matrix. void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost) { int size = Mhost.width * Mhost.height * sizeof(float); cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice); } // Copy a device matrix to a host matrix. void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice) { int size = Mdevice.width * Mdevice.height * sizeof(float); cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost); } // Free a device matrix. void FreeDeviceMatrix(Matrix M) { cudaFree(M.elements); } void FreeMatrix(Matrix M) { free(M.elements); } 14 HK-UIUC

Step 6: Handling Arbitrary Sized Square Matrices N • Have each 2D thread block to compute a (BLOCK_WIDTH)2 sub-matrix (tile) of the result matrix • Each has (BLOCK_WIDTH)2 threads • Generate a 2D Grid of (WIDTH/BLOCK_WIDTH)2 blocks WIDTH P M by NOTE: You still need to put a loop around the kernel call for cases where WIDTH is really large ty WIDTH bx tx WIDTH WIDTH 15 HK-UIUC

The Common Pattern to CUDA Programming • Phase 1: Allocate memory on the device and copy to the device the data required to carry out computation on the GPU • Phase 2: Let the GPU crunch the numbers for you based on the kernel that you define • Phase 3: Bring back the results from the GPU. Free memory on the device (clean up…). You’re done. 16

A Common Programming Pattern (Cntd.)BRINGING THE SHARED MEMORY INTO THE PICTURE • Local and global memory reside in device memory (DRAM) - much slower access than shared memory • An advantageous way of performing computation on the device is to block data to take advantage of fast shared memory: • Partition datainto data subsets that fit into shared memory • Handle each data subset with one thread block by: • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element • Copying results from shared memory back to global memory 17 HK-UIUC

Device Stream Multiprocessor N Stream Multiprocessor 2 Stream Multiprocessor 1 Shared Memory Registers Registers Registers Instruction Unit … Processor 1 Processor 2 Processor M Constant Cache Texture Cache Device memory How about the memory not in registers or shared? • The local, global, constant, and texture spaces are regions of device memory • Beyond registers and shared memory, each multiprocessor has: • Read/Write global memory • There is a lot of this… • A read-only constant cache • To speed up access to the constant memory space • A read-only texture cache • To speed up access to the texture memory space Global, constant, texture memories 18

A Common Programming Pattern (cont.) • Texture and Constant memory also reside in device memory (DRAM) - much slower access than shared memory • But… cached! • Highly efficient access for read-only data • Carefully divide data according to access patterns • R/O no structure  constant memory • R/O array structured  texture memory • R/W shared within Block  shared memory • R/W registers spill to local memory • R/W inputs/results  global memory 19 HK-UIUC

Device Multiprocessor N Multiprocessor 2 Multiprocessor 1 Instruction Unit … Scalar Processor 1 Scalar Processor 2 Scalar Processor M G80 Hardware Implementation:A Set of SIMD Multiprocessors • The device is a set of Stream Multiprocessors (14 on 8800 GT, but that’s totally irrelevant) • Each multiprocessor has a collection of 32-bit Scalar Processors with a Single Instruction Multiple Data architecture – shared instruction unit • At each clock cycle, a Stream Multiprocessor executes the same instruction on a group of threads called a warp • The number of threads in a warp is the warp size 20 HK-UIUC

Hardware Implementation: Execution Model (review) • Each block of a grid issplit into warps, each gets executed by one Stream Multiprocessor (SM) • The device processes only one grid at a time • Each block is executed by one Stream Multiprocessor • The shared memory space resides in the on-chip shared memory • A Stream Multiprocessor can executemultiple blocks concurrently • Shared memory and registers are partitioned among the threads of all concurrent blocks • So, decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently, which is good • Currently, up to eight blocks can be processed concurrently (this is bound to change, as the HW changes…) 21

ME964 High Performance Computing for Engineering Applications