Lecture 2: Introduction to Parallel Computing Using CUDA IEEE Boston Continuing Education Program Ken Domino, Domem Technologies May 9, 2011
Announcements • Course website updates: Syllabus – http://domemtech.com/ieee-pp/Syllabus.docx Lecture 1 – http://domemtech.com/ieee-pp/Lecture1.pptx Lecture 2 – http://domemtech.com/ieee-pp/Lecture2.pptx References – http://domemtech.com/ieee-pp/References.docx • The Ocelot April 5 download is not working
PRAM • Parallel Random Access Machine (PRAM). • Idealized SIMD parallel computing model. • An unlimited number of RAMs, called Processing Units (PUs). • The RAMs execute the same instructions and run synchronously. • Shared Memory is unlimited and is accessed in one unit of time. • Shared Memory access is one of CREW, CRCW, or EREW. • Communication between RAMs is only through Shared Memory.
PRAM pseudo code • Parallel for loop • for Pi, 1 ≤ i ≤ n in parallel do … end • (aka “data-level parallelism”)
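To make the idea concrete (this example is not from the slides), here is a plain-C loop whose iterations are independent; on a PRAM, all n iterations could run in a single parallel step, one per processing unit:

/* Hypothetical illustration: element-wise vector add.
   Each iteration is independent, so "for Pi, 1 <= i <= n in parallel"
   could assign one iteration to each PU and finish in one time step. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)   /* serial on one RAM ...                 */
        c[i] = a[i] + b[i];       /* ... but each i could be one PU's work */
}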
Synchronization • A simple example from C:
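The slide's C code is not reproduced in this text. Below is a sketch of the kind of routine it refers to, reconstructed from the char_in/char_out names used on the following slides (the classic character-echo example; treat the details as an assumption):

#include <stdio.h>

/* Reconstruction (assumption): two threads calling echo() share the
   global variables char_in and char_out. */
char char_in, char_out;

void echo(void)
{
    char_in  = getchar();    /* read one character into the shared input buffer */
    char_out = char_in;      /* copy it to the shared output buffer             */
    putchar(char_out);       /* write it back out                               */
}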
Synchronization • What happens if we have two threads competing for the same resources (char_in/char_out)?
Synchronization • What happens if two threads execute this code serially? No prob!
Synchronization • What happens if two threads execute this code in parallel? We can sometimes get a problem: T2’s write to char_in overwrites T1’s char_in before T1 has used it!
Synchronization • Synchronization forces thread serialization, so that concurrent access does not cause problems.
Synchronization • Two types (a sketch of both follows below): • Mutual exclusion, using a “mutex” semaphore, i.e., a lock • Cooperation, where a thread waits on an object until all other threads are ready, using wait() + notify() or barrier synchronization
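A minimal host-side sketch of both kinds, using POSIX threads (the slides do not name a specific library; the calls below are standard pthreads):

#include <pthread.h>

pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;  /* mutual exclusion: a lock        */
pthread_barrier_t bar;   /* cooperation; initialized elsewhere with
                            pthread_barrier_init(&bar, NULL, num_threads)               */
int counter = 0;

void *worker(void *arg)
{
    pthread_mutex_lock(&lock);      /* only one thread at a time in the critical section */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&bar);     /* every thread waits here until all have arrived    */
    /* past this point, all increments are complete */
    return NULL;
}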
Deadlock • Can arise from the use of mutual exclusion on two or more resources: each thread holds one lock while waiting forever for a lock held by the other.
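A sketch of the classic case (hypothetical lock names, not from the slides), where two threads take the same two locks in opposite order:

#include <pthread.h>

pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg)
{
    pthread_mutex_lock(&A);   /* T1 holds A ...                                        */
    pthread_mutex_lock(&B);   /* ... and waits for B                                   */
    pthread_mutex_unlock(&B); pthread_mutex_unlock(&A);
    return NULL;
}

void *thread2(void *arg)
{
    pthread_mutex_lock(&B);   /* T2 holds B ...                                        */
    pthread_mutex_lock(&A);   /* ... and waits for A: deadlock if T1 already holds A   */
    pthread_mutex_unlock(&A); pthread_mutex_unlock(&B);
    return NULL;
}

Acquiring the locks in one fixed order in every thread avoids the cycle.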
PRAM Synchronization • “Stay idle” – a processor waits until the other processors complete; this is “cooperative” synchronization
CUDA • “Compute Unified Device Architecture” • Developed by NVIDIA, introduced November 2006 • Based on C, extended later to work with C++. • CUDA provides three key abstractions: • a hierarchy of thread groups • shared memories • barrier synchronization http://www.nvidia.com/object/IO_37226.html, http://www.gpgpu.org/oldsite/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf, Nickolls, J., Buck, I., Garland, M. and Skadron, K. Scalable parallel programming with CUDA. Queue, 6 (2). 40-53.
NVIDIA GPU Architecture (figure legend) • Multiprocessor (MP) = texture/processor cluster (TPC) • Dynamic random-access memory (DRAM), aka “global memory” • Raster operation processor (ROP) • L2 – level-2 memory cache
NVIDIA GPU Architecture (figure legend) • Streaming Multiprocessor (SM) • Streaming processor (SP) • Streaming multiprocessor control (SMC) • Texture processing unit (TPU) • Con Cache – “constant” memory • Sh. Memory – “shared” memory • Multithreaded instruction fetch and issue unit (MTIFI) • 1st generation, G80 – 2006; 3rd generation, Fermi, GTX 570 – 2010
Single-instruction, multiple-thread • “SIMT” • SIMT = SIMD + SPMD (single program, multiple data). • Multiple threads. • Sort of “single instruction” – except that each instruction is executed by multiple independent parallel threads. • Instruction set architecture: a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.
Single-instruction, multiple-thread • The Streaming Multiprocessor (SM) is a hardware multithreaded unit. • Threads are executed in groups of 32 parallel threads called warps. • Each thread has its own set of registers. • The individual threads composing a warp are from the same program and start together at the same program address, but they are otherwise free to branch and execute independently.
Single-instruction, multiple-thread • The instruction executed is the same for every thread in a warp. • If the threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken.
Single-instruction, multiple-thread • Warps are serialized if there is: • Divergence in instructions (i.e., a data-dependent conditional branch) • Write access to the same memory location
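A sketch of a kernel that forces such divergence (hypothetical example, not from the course materials):

__global__ void diverge(int *d)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)          /* data-dependent branch inside a warp                 */
        d[tid] = 2 * tid;      /* the warp executes this path first ...               */
    else
        d[tid] = 3 * tid;      /* ... then this path, and the two paths then rejoin   */
}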
Warp Scheduling • The SM hardware implements warp scheduling with near-zero overhead • Warps whose next instruction has its operands ready for consumption can be executed • Eligible warps are selected for execution by priority • All threads in a warp execute the same instruction • 4 clock cycles are needed to dispatch an instruction for all threads of a warp (G80)
Cooperative Thread Array (CTA) • An abstraction for synchronizing threads • AKA a thread block; blocks are organized into a grid • A CTA’s threads are mapped onto warps
Cooperative Thread Array (CTA) • Each thread has a unique integer thread ID (TID). • Threads of a CTA share data in global or shared memory • Threads synchronize with the barrier instruction. • CTA thread programs use their TIDs to select work and index shared data arrays.
Cooperative Thread Array (CTA) • The programmer declares a 1D, 2D, or 3D shape and its dimensions in threads. • The TID is a 1D, 2D, or 3D index.
Kernel • Every thread in a grid executes the same body of instructions, called a kernel. • In CUDA, it’s just a function.
CUDA Kernels • Kernels are declared with __global__ void • Parameters are the same for all threads.

__global__ void fun(float * d, int size)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x
            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
            + blockDim.x * gridDim.x * threadIdx.y;
    if (idx < 0) return;
    if (idx >= size) return;
    d[idx] = idx * 10.0 / 0.1;
}
CUDA Kernels • Kernels are called via “chevron syntax” • Func<<< Dg, Db, Ns, S >>>(parameters) • Dg is of type dim3 and specifies the dimension and size of the grid • Db is of type dim3 and specifies the dimension and size of each block • Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block • S is of type cudaStream_t and specifies the associated stream • A kernel has void return type; results must be returned through a pointer parameter (device memory) • Example: • Foo<<<1, 100>>>(1, 2, i);
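A sketch of a launch using these parameters, reusing the fun() kernel from the previous slide (the grid/block sizes are arbitrary, and d/size are assumed to have been set up earlier with cudaMalloc()):

/* Host code sketch: a 2D grid of 2D blocks, plus dynamic shared memory.      */
dim3 grid(16, 16);                   /* Dg: 16 x 16 blocks                     */
dim3 block(8, 8);                    /* Db: 8 x 8 threads per block            */
size_t ns = 256 * sizeof(float);     /* Ns: bytes of dynamic shared memory     */
fun<<<grid, block, ns>>>(d, size);   /* S omitted, so the default stream 0 is used;
                                        fun() ignores the shared memory – Ns is shown
                                        only to illustrate the syntax          */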
Memory • CTA’s have various types of memory available • Global, shared, constant, texture, registers • Threads can access host memory, too.
CUDA Memory • Data types (int, long, float, double, etc.) are the same as on the host. • Shared memory is shared between the threads in a block. • Global memory is shared by all threads in all blocks. • Constant memory is shared by all threads in all blocks, but it cannot be changed by a kernel (so it can be faster). • Host memory (of the CPU) can be accessed by all threads in all blocks.
Shared Memory • __shared__ declares a variable that: • Resides in the shared memory space of a thread block, • Has the lifetime of the block, • Is accessible only from the threads within the block. • Examples: • extern __shared__ float shared[]; • (or its size is given on the kernel call – later!)
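A sketch (hypothetical kernel, not from the slides) using a statically sized __shared__ array together with the barrier intrinsic __syncthreads():

/* Assumes a launch with blockDim.x <= 128, e.g. reverse<<<blocks, 128>>>(d). */
__global__ void reverse(float *d)
{
    __shared__ float s[128];                 /* one copy per block                 */
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];                      /* stage this block's data            */
    __syncthreads();                         /* wait until every thread has loaded */
    d[base + t] = s[blockDim.x - 1 - t];     /* write the block back reversed      */
}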
Global Memory • __device__ declares a variable that: • Resides in the global memory space; • Has the lifetime of the application; • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()) • Global memory can also be allocated through cudaMalloc() • Examples: • __device__ int data[100]; • cudaMalloc(&d, 100*sizeof(int));
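A sketch tying these together (hypothetical kernel and function names; the symbol-copy calls are the runtime functions listed above):

__device__ int data[100];                /* lives in global memory for the whole run */

__global__ void scale(void)
{
    data[threadIdx.x] *= 2;              /* visible to every thread in the grid      */
}

void host_side(void)                     /* host code */
{
    int h[100] = {0};
    cudaMemcpyToSymbol(data, h, sizeof(h));     /* host array -> __device__ variable  */
    scale<<<1, 100>>>();
    cudaMemcpyFromSymbol(h, data, sizeof(h));   /* __device__ variable -> host array  */
}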
Basic host function calls • Global memory allocation via cudaMalloc() • Copying memory between host and GPU via cudaMemcpy() • Kernels are called by chevron syntax
Counting 6’s • We have an array of integers, h[0..size-1], and want to count the number of 6’s that appear in the array. • How do we do this in CUDA?
Counting 6’s • Divide the array into blocks of blocksize threads. • For each block, sum the number of times 6 appears. • Return the sum for each block.
Counting 6’s • Divide the array into blocks of blocksize threads. • For each block, sum the number of times 6 appears. • Return the sum for each block.

#include <stdio.h>
#include <stdlib.h>

__global__ void c6(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int val = d_in[i + blockIdx.x * blockDim.x];
        if (val == 6)
            sum++;
    }
    d_out[blockIdx.x] = sum;
}
Counting 6’s

int main()
{
    int size = 300;
    int * h = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i)
        h[i] = i % 10;
    int * d_in;
    int * d_out;
    int bsize = 100;
    int blocks = size / bsize;      // assumes size % bsize == 0
    int threads_per_block = bsize;
    int rv1 = cudaMalloc(&d_in, size*sizeof(int));
    int rv2 = cudaMalloc(&d_out, blocks*sizeof(int));
    int rv3 = cudaMemcpy(d_in, h, size*sizeof(int), cudaMemcpyHostToDevice);
    c6<<<blocks, threads_per_block>>>(d_in, d_out, size);
    cudaThreadSynchronize();
    int rv4 = cudaGetLastError();
    int * r = (int*)malloc(blocks * sizeof(int));
    int rv5 = cudaMemcpy(r, d_out, blocks*sizeof(int), cudaMemcpyDeviceToHost);
    int sum = 0;
    for (int i = 0; i < blocks; ++i)
        sum += r[i];
    printf("Result = %d\n", sum);
    return 0;
}

• In the main program, call the kernel with the correct dimensions of the block. • Note: size % blocksize = 0. • How would we extend this for arbitrary array size?
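One possible answer to the last question (a sketch, not the course's official solution): round the block count up on the host and make the kernel guard against reading past the end of the array.

__global__ void c6_any(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int idx = i + blockIdx.x * blockDim.x;
        if (idx < size && d_in[idx] == 6)    /* guard: the last block may be partial */
            sum++;
    }
    d_out[blockIdx.x] = sum;
}

/* Host side: int blocks = (size + bsize - 1) / bsize;   // round up */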
Developing CUDA programs • Install the CUDA SDK (drivers, Toolkit, examples) • Windows, Linux, Mac: • Use Version 4.0, release candidate 2. (The older 3.2 release does not work easily with VS2010! You can install both VS2010 and VS2008, but you will have to manage paths.) • http://developer.nvidia.com/cuda-toolkit-40 • Install the toolkit, tools SDK, and example code • For the drivers, you must have an NVIDIA GPU card • Recommendation: the CUDA examples use definitions from a common helper library – avoid using it so your own code does not depend on it.
Developing CUDA programs • Emulation • Do not install the CUDA drivers (the installation will fail). • Windows and Mac hosts only: • Install VirtualBox. • Create a 40GB virtual drive. • Install Ubuntu from an ISO image on VirtualBox. • Install Ocelot (http://code.google.com/p/gpuocelot/downloads/list) • Install the various dependencies (sudo apt-get install <package>, for g++, boost, etc.) • Note: there is a problem with the current release of Ocelot – I emailed Gregory.Diamos@gatech.edu to resolve the build issue.
Developing CUDA programs • Windows: • Install VS2010 C++ Express (http://www.microsoft.com/visualstudio/en-us/products/2010-editions/visual-cpp-express) • (Test installation with “Hello World” .cpp example.)
Developing CUDA programs • Windows: • Create an empty C++ console project • Create a hw.cu “hello world” program in the source directory • Project -> Custom Build Rules, check the box for CUDA 4.0 targets • Add hw.cu to your empty project • Note: the “.cu” suffix stands for “CUDA source code”. You can put CUDA syntax into .cpp files, but the build environment won’t know what to compile them with (cl/g++ vs. nvcc).
Developing CUDA programs

hw.cu:

#include <stdio.h>

__global__ void fun(int * mem)
{
    *mem = 1;
}

int main()
{
    int h = 0;
    int * d;
    cudaMalloc(&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
    fun<<<1,1>>>(d);
    cudaThreadSynchronize();
    int rv = cudaGetLastError();
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result = %d\n", h);
    return 0;
}
Developing CUDA programs • Compile, link, and run • (Version 4.0 installation adjusts all environmental variables.)
NVCC • nvcc (the NVIDIA CUDA compiler) is a driver program for the compiler phases • Use the -keep option to see intermediate files. (You need to add “.” to the include directories on compile.)
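For example (assuming nvcc is on the PATH, and using the hw.cu file from the earlier slides):

nvcc -keep -I. hw.cu -o hw

This leaves the intermediate files (e.g. the generated .cu.cpp file, the .ptx file, and the hw.sm_10.cubin shown on the next slides) in the current directory.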
NVCC • Compiles a “.cu” file into a “.cu.cpp” file • Two types of targets: virtual and real, represented by PTX assembly code and “cubin” binary code, respectively.
PTXAS • Compiles PTX assembly code into machine code, placed in an ELF module.

# cat hw.sm_10.cubin | od -t x1 | head
0000000 7f 45 4c 46 01 01 01 33 02 00 00 00 00 00 00 00
0000020 02 00 be 00 01 00 00 00 00 00 00 00 34 18 00 00
0000040 34 00 00 00 0a 01 0a 00 34 00 20 00 03 00 28 00
0000060 16 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000120 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
0000140 03 00 00 00 00 00 00 00 00 00 00 00 a4 03 00 00
0000160 7f 01 00 00 00 00 00 00 00 00 00 00 04 00 00 00
0000200 00 00 00 00 0b 00 00 00 03 00 00 00 00 00 00 00
0000220 00 00 00 00 23 05 00 00 22 00 00 00 00 00 00 00

• Disassembly of the machine code can be done using cuobjdump or my own utility nvdis (http://forums.nvidia.com/index.php?showtopic=183438)
PTX, the GPU assembly code • PTX = “Parallel Thread Execution” • Target for PTX is an abstract GPU machine. • Contains operations for load, store, register declarations, add, sub, mul, etc. .version 1.4 .target sm_10, map_f64_to_f32 // compiled with …/be.exe // nvopencc 4.0 built on 2011-03-24 .entry _Z3funPi ( .param .u32 __cudaparm__Z3funPi_mem) { .reg .u32 %r<4>; .loc 16 4 0 $LDWbegin__Z3funPi: .loc 16 6 0 mov.s32 %r1, 1; ld.param.u32 %r2, [__cudaparm__Z3funPi_mem]; st.global.s32 [%r2+0], %r1; .loc 16 7 0 exit; $LDWend__Z3funPi: } // _Z3funPi
CUDA GPU targets • Virtual – PTX code is embedded in the executable as a string, then compiled at run time “just-in-time”. • Real – PTX code is compiled ahead of time into machine code for a specific GPU and embedded in the executable.
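A sketch of how the target kind is chosen at compile time, using the nvcc -arch/-code options and the hw.cu example from earlier:

nvcc -arch=compute_10 -code=compute_10 hw.cu -o hw     (virtual: embed PTX only, JIT-compiled at run time)
nvcc -arch=compute_10 -code=sm_10      hw.cu -o hw     (real: embed sm_10 machine code)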
Next time • For next week, we will go into more detail: • The CUDA runtime API; • Writing efficient CUDA code; • Look at some important examples.