Control Flow These notes introduce thread scheduling and control-flow statements in kernel code. ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt
Stream processing Term used to denote the processing of a stream of instructions operating in a data-parallel fashion, as in GPUs. Each execution unit executes the same instruction on different data. A stream is the group of data items being processed. A stream processor is the execution unit operating in this fashion with its own local resources (registers etc.)
Stream Processors GPU execution resources are organized into "stream processors" (SPs) – previously referred to as execution "cores" (a term used for data-parallel computers). Each streaming multiprocessor has compute resources such as a register file, an instruction scheduler, etc. A number of blocks are assigned to each streaming multiprocessor (SM) for execution. Limits on the number of threads that can be simultaneously tracked and scheduled limit the number and size of blocks that can be assigned to each SM.
Terms Sometimes one sees the terms "CUDA cores", "thread processors", or "streaming processors" (SPs).* NVIDIA groups streaming processors (SPs) into streaming multiprocessors (SMs). Each streaming multiprocessor shares control logic and an instruction cache. *In the book "Programming Massively Parallel Processors" by Kirk and Hwu, Morgan Kaufmann, 2010, page 8
NVIDIA GPUs C2050 Fermi (as in coit-grid06.uncc.edu and coit-grid07.uncc.edu): 14 streaming multiprocessors (SMs), each SM with 32 streaming processors (cores), so 448 cores. Apparently Fermi was originally intended to have 512 cores (16 SMs) but ran too hot. GeForce GTX 480 (March 2010): 15 SMs (480 cores)
Tesla K20 (as in coit-grid08.uncc.edu) 2496 stream processors (SPs) Kepler architecture - “SMX (streaming multiprocessor) design that delivers up to 3x more performance per watt compared to the SM in Fermi.”* * NVIDIA® TESLA® KEPLER GPU COMPUTING ACCELERATORS http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf
Thread Scheduling Once a block is assigned to an SM, it is divided into 32-thread units called warps. The size of a warp could change between implementations. One warp is actually executed in hardware at a time (some documentation talks about a half-warp, a 16-thread unit, actually executing simultaneously). Execution in an SM starts with the first warp in the first block.
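As a sketch, the warp a thread belongs to follows from integer division of its thread index by the warp size; inside a kernel this can be computed with the CUDA built-ins threadIdx and warpSize (the function name warp_info is hypothetical):

```cuda
// Sketch: which warp a thread is in, and its lane within that warp.
// warpSize is a CUDA built-in; it is 32 on current NVIDIA GPUs.
__device__ void warp_info(int *warp, int *lane) {
    int i = threadIdx.x;    // thread index within a 1-D block
    *warp = i / warpSize;   // e.g. threads 0..31 are warp 0, 32..63 warp 1
    *lane = i % warpSize;   // position of this thread within its warp
}
```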
For a program without control instructions (no if statements etc.), the same instruction is executed for each thread in the warp simultaneously
Control-flow instructions When there is a divergent path, first the instructions on one path are executed and then the instructions on the other path, within each warp. This causes the two paths to be serialized. But different warps are considered separately: it is possible for one warp to execute one path and another warp to execute the other path at the same time.
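A minimal sketch of a kernel that diverges inside every warp (the kernel name and arrays are hypothetical):

```cuda
// Hypothetical kernel illustrating warp divergence.
// Odd and even threads sit in the same 32-thread warp, so each warp
// executes the if-path and then the else-path serially, with the
// threads on the inactive path masked off each time.
__global__ void divergent(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)        // even lanes take this path first
            a[i] = a[i] * 2.0f;
        else                   // then odd lanes take this path
            a[i] = a[i] + 1.0f;
    }
}
```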
Maximum performance Ideally have no control-flow statements. If control-flow statements are necessary, the programmer might be able to arrange for each warp to execute just one path. Example if (threadID < 16) /* do this */ ; if (threadID < 32) /* do this */ ; if (threadID < 48) /* do this */ ; Need to test/check
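One way to arrange this (a sketch; kernel and array names are hypothetical) is to make the branch condition uniform within each warp, for example by branching on the warp index rather than the thread index:

```cuda
// Hypothetical kernel: the condition is the same for every thread in
// a given 32-thread warp, so no warp diverges and no path is serialized.
__global__ void per_warp_paths(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / warpSize) % 2 == 0)   // uniform across each warp
        a[i] = a[i] * 2.0f;        // even-numbered warps take this path
    else
        a[i] = a[i] + 1.0f;        // odd-numbered warps take this path
}
```

Different warps may still take different paths, but since divergence is only costly within a warp, both paths can proceed concurrently.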
Compiler loop unrolling Sometimes the compiler unrolls loops; then there are no divergent paths. Example for (i = 0; i < 4; i++) a[i] = 0; becomes a[0] = 0; a[1] = 0; a[2] = 0; a[3] = 0;
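In CUDA C this unrolling can also be requested explicitly with the #pragma unroll directive supported by nvcc; the kernel below is only an illustrative sketch:

```cuda
// Sketch: ask the compiler to fully unroll the loop. The generated
// code then contains four stores and no loop branch, so the loop
// contributes no control flow at run time.
__global__ void zero4(float *a) {
    #pragma unroll
    for (int i = 0; i < 4; i++)
        a[threadIdx.x * 4 + i] = 0.0f;
}
```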
Branch predication instructions The compiler can also use branch predication instructions to eliminate divergent paths. A branch predication instruction is a machine instruction that combines a Boolean condition (predicate) with an operation such as addition. Example <p1> ADD R1, R2, R3 where the predicate <p1> holds a condition such as CC == zero, etc.
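At the source level, a C-language view of what predication achieves can be sketched as follows (the function name pick is hypothetical; the actual transformation happens in the compiler's generated machine code):

```cuda
// Sketch: instead of branching, every thread computes a result and
// the conditional assignment is turned into a predicated operation,
// so the warp never splits into two serialized paths.
//   Divergent form:  if (x > 0) y = a + b; else y = a - b;
__device__ float pick(float x, float a, float b) {
    float y = a - b;           // else-path result, computed by all threads
    if (x > 0.0f) y = a + b;   // short body the compiler can predicate
    return y;
}
```

Predication pays off for short conditional bodies, where executing both operations is cheaper than serializing two divergent paths.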