Introduction to CUDA Programming: Profiler, Assembly, and Floating-Point
Andreas Moshovos, Winter 2009
Some material from: Wen-Mei Hwu and David Kirk (NVIDIA); Robert Strzodka and Dominik Göddeke, NVISION08 presentation, http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf
The CUDA Profiler • Both GUI and command-line • Non-GUI control: • CUDA_PROFILE • set to 1 or 0 to enable or disable the profiler • CUDA_PROFILE_LOG • set to the name of the log file (will default to ./cuda_profile.log) • CUDA_PROFILE_CSV • set to 1 or 0 to enable or disable a comma separated version of the log • CUDA_PROFILE_CONFIG • specify a config file with up to 4 signals
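The CUDA_PROFILE_CONFIG file simply lists the hardware signals to collect, one per line. A minimal sketch, assuming the counter names documented for the CUDA 2.x profiler (gld_incoherent, gld_coherent, gst_incoherent, gst_coherent; exact names vary between toolkit versions):
gld_incoherent
gld_coherent
gst_incoherent
gst_coherent
With CUDA_PROFILE=1 and this config file selected, each kernel launch line in the log reports these four counters alongside the timing columns.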
Profiler Counters • Grid size X, Y • Block size X, Y, Z • Dyn smem per block: dynamic shared memory • Sta smem per block: static shared memory • Reg per thread • Mem transfer dir: 0 = host to device, 1 = device to host • Mem transfer size: bytes
Interpreting Profiler Counters • Values represent events within a thread warp • Only targets one multiprocessor • Values will not correspond to the total number of warps launched for a particular kernel. • Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. • Values are best used to identify relative performance differences between non-optimized and optimized code • e.g., make the number of non-coalesced loads go from some non-zero value to zero
CUDA Visual Profiler • Helps measure and find potential performance problems • GPU and CPU timing for all kernel invocations and memcpys • Time stamps • Access to hardware performance counters
PTX: Assembly for NVIDIA GPUs • Parallel Thread eXecution • Virtual Assembly • Translated to actual machine code at runtime • Allows for different hardware implementations • Might enable additional optimizations • E.g., %clock register to time blocks of code
GPU Code Generation Flow
• Parallel Thread eXecution (PTX)
• Virtual Machine and ISA
• Programming model
• Execution resources and state
• ISA – Instruction Set Architecture: variable declarations, instructions and operands
• Translator is an optimizing compiler: translates PTX to target code at program install time
• Driver implements the VM runtime, coupled with the Translator
• Flow: C/C++ Application → C/C++ Compiler → PTX Code; ASM-level Library (from the programmer) → PTX Code; PTX Code → PTX-to-Target Translator → Target code (G80, …)
How to See the PTX code • nvcc -keep • Produces .ptx and .cubin • nvcc --opencc-options -LIST:source=on
PTX Example
• CUDA:
float4 me = gx[gtid];
me.x += me.y * me.z;
• PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
# 174 me.x += me.y * me.z;
mad.f32 $f1, $f5, $f3, $f1;
• Registers are virtual – the actual hardware registers are hidden from PTX
Another Example: CUDA Function
• CUDA:
__device__ void interaction(float4 b0, float4 b1, float3 *accel)
{
  float3 r;
  r.x = b1.x - b0.x;
  r.y = b1.y - b0.y;
  r.z = b1.z - b0.z;
  float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
  float s = 1.0f / sqrt(distSqr);
  accel->x += r.x * s;
  accel->y += r.y * s;
  accel->z += r.z * s;
}
• PTX:
sub.f32 $f18, $f1, $f15;
sub.f32 $f19, $f3, $f16;
sub.f32 $f20, $f5, $f17;
mul.f32 $f21, $f18, $f18;
mul.f32 $f22, $f19, $f19;
mul.f32 $f23, $f20, $f20;
add.f32 $f24, $f21, $f22;
add.f32 $f25, $f23, $f24;
rsqrt.f32 $f26, $f25;
mad.f32 $f13, $f18, $f26, $f13;
mov.f32 $f14, $f13;
mad.f32 $f11, $f19, $f26, $f11;
mov.f32 $f12, $f11;
mad.f32 $f9, $f20, $f26, $f9;
mov.f32 $f10, $f9;
Predicated Execution
• Source:
if (cond) { Then Code }
After Code
• With a branch:
p = Evaluate cond
Branch to After if not true
Then Code
After: After Code
• With predication:
p = Evaluate cond
(p) Then Code
After Code
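As a concrete illustration, a short then-clause like the one in the sketch below is a typical candidate for predication; whether the compiler actually predicates it depends on the code and the compiler version (hypothetical kernel, not from the original slides):
__global__ void clamp_negatives(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a[i];
        if (v < 0.0f)       // short then-clause: likely compiled to a
            v = 0.0f;       // predicated instruction instead of a branch
        a[i] = v;
    }
}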
Parameterized Variable Names • How to create 100 register “variables” • .reg .b32 %r<100> • Declares %r0 - %r99
Addresses as Operands • A PTX instruction operand can refer to the value of x, the value of tbl[12], or the base address of tbl
Compiling a loop that calls a function – again
• CUDA (sx is shared; mx, accel are local):
for (i = 0; i < K; i++) {
  if (i != threadIdx.x) {
    interaction( sx[i], mx, &accel );
  }
}
• PTX:
mov.s32 $r12, 0;
$Lt_0_26:
setp.eq.u32 $p1, $r12, $r5;
@$p1 bra $Lt_0_27;
mul.lo.u32 $r13, $r12, 16;
add.u32 $r14, $r13, $r1;
ld.shared.f32 $f15, [$r14+0];
ld.shared.f32 $f16, [$r14+4];
ld.shared.f32 $f17, [$r14+8];
[func body from previous slide inlined here]
$Lt_0_27:
add.s32 $r12, $r12, 1;
mov.s32 $r15, 128;
setp.ne.s32 $p2, $r12, $r15;
@$p2 bra $Lt_0_26;
Yet Another Example: SAXPY code
cvt.u32.u16 $blockid, %ctaid.x; // Calculate i from thread/block IDs
cvt.u32.u16 $blocksize, %ntid.x;
cvt.u32.u16 $tid, %tid.x;
mad24.lo.u32 $i, $blockid, $blocksize, $tid;
ld.param.u32 $n, [N]; // Nothing to do if n ≤ i
setp.le.u32 $p1, $n, $i;
@$p1 bra $L_finish;
mul.lo.u32 $offset, $i, 4; // Load y[i]
ld.param.u32 $yaddr, [Y];
add.u32 $yaddr, $yaddr, $offset;
ld.global.f32 $y_i, [$yaddr+0];
ld.param.u32 $xaddr, [X]; // Load x[i]
add.u32 $xaddr, $xaddr, $offset;
ld.global.f32 $x_i, [$xaddr+0];
ld.param.f32 $alpha, [ALPHA]; // Compute and store alpha*x[i] + y[i]
mad.f32 $y_i, $alpha, $x_i, $y_i;
st.global.f32 [$yaddr+0], $y_i;
$L_finish: exit;
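For reference, a CUDA kernel along these lines would look roughly as follows; this is a sketch reconstructed from the PTX above, not the exact source it was generated from (the parameters mirror the .param names N, ALPHA, X, Y):
__global__ void saxpy(int n, float alpha, float *x, float *y)
{
    // i = blockIdx.x * blockDim.x + threadIdx.x (the cvt/mad24 sequence above)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // the setp.le.u32 / bra guard
        y[i] = alpha * x[i] + y[i];  // the loads, mad.f32, and st.global.f32
}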
The %clock register • Real-time cycle counter • How to read: mov.u32 $r1, %clock; • Can be used to time code • It measures real time, not just the time spent executing this thread • If a thread is blocked, time still elapses
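In CUDA C the same counter is exposed through the clock() device function, so a region can be timed without writing PTX by hand. A minimal sketch (per-block cycle count written out by thread 0; remember the caveat above that time spent blocked is included):
__global__ void timed_kernel(float *data, clock_t *cycles)
{
    clock_t start = clock();         // reads the %clock cycle counter

    float v = data[threadIdx.x];
    v = v * v + 1.0f;                // the code being timed
    data[threadIdx.x] = v;

    clock_t stop = clock();
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = stop - start;
}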
PTX Reference • Please Read the PTX ISA specification • Posted under the handouts section
Occupancy Calculator
• http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
• GPU Occupancy = active warps / max warps
• Inputs: threads/block, registers/thread, shared memory/block
• Get the per-kernel resource usage with nvcc -cubin; the .cubin file contains, e.g.:
code {
name = my_kernel
lmem = 0
smem = 24
reg = 5
bar = 0
bincode { ... }
const { ... }
}
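As a worked example, using the G80 per-multiprocessor limits (768 threads / 24 warps, 8192 registers, 16 KB shared memory; illustrative numbers): with 256 threads per block, 10 registers per thread, and 4 KB of shared memory per block, three blocks fit on a multiprocessor (3 x 256 x 10 = 7680 registers, 12 KB shared memory, 768 threads), giving 24/24 active warps, i.e., 100% occupancy. Raise the register count to 16 per thread and only two blocks fit (2 x 256 x 16 = 8192 registers), giving 16/24 ≈ 67% occupancy.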
IEEE Floating Point Representation • A floating point binary number consists of three parts: sign (S), exponent (E), and mantissa (M). • Each (S, E, M) pattern uniquely identifies a floating point number. • For each bit pattern, its IEEE floating-point value is derived as: value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B • The interpretation of S is simple: S = 0 gives a positive number and S = 1 a negative number.
Normalized Representation • Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating point number unique. • For example, the only mantissa value allowed for 0.5D is M = 1.0 • 0.5D = 1.0B * 2^-1 • Neither 10.0B * 2^-2 nor 0.1B * 2^0 qualifies • Because all mantissa values are of the form 1.XX…, one can omit the "1." part in the representation. • The mantissa value of 0.5D in a 2-bit mantissa field is therefore 00, obtained by omitting the "1." from 1.00.
Exponent Representation • In an n-bit exponent representation, 2^(n-1) - 1 (the bias) is added to the exponent's 2's complement value to form its excess representation: represented E = actual E + BIAS, so actual E = represented E - BIAS • With this encoding, a simple unsigned integer comparator can be used to compare the magnitude of two FP numbers • Symmetric range for +/- exponents, with the all-ones pattern (e.g., 111 for a 3-bit exponent) reserved • A worked 3-bit example follows below
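As a short worked example of the excess (biased) encoding: a 3-bit exponent uses a bias of 2^2 - 1 = 3, so an actual exponent of -1 is stored as -1 + 3 = 2 = 010B, an actual exponent of 2 is stored as 2 + 3 = 5 = 101B, and a stored pattern of 110B decodes back to 6 - 3 = 3. The all-ones pattern 111B is the reserved one.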
A Hypothetical 5-bit Floating Point Representation • Assume 1-bit S, 2-bit E, and 2-bit M • 0.5D = 1.00B * 2-1 • 0.5D = 0 00 00, • where • S = 0, • E = 00 • M = (1.)00
Representable Numbers • The representable numbers of a given format are the set of all numbers that can be exactly represented in the format. • For example, an unsigned 3-bit integer format represents exactly the integers 0 through 7.
Flush To Zero • Treat all bit patterns with E = 0 as 0.0 • This takes away several representable numbers near zero and lumps them all into 0.0 • For a representation with a large M, a large number of representable numbers will be removed.
Denormalized Numbers • The actual method adopted by the IEEE standard is called denormalized numbers or gradual underflow. • The method relaxes the normalization requirement for numbers very close to 0: whenever E = 0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX. • In general, if the n-bit exponent field is 0, the value is 0.M * 2^(-2^(n-1) + 2)
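Applying the formula to the hypothetical 5-bit format above (n = 2 exponent bits), the denormal exponent is -2^1 + 2 = 0, so the pattern 0 00 01 represents 0.01B * 2^0 = 0.25D and 0 00 10 represents 0.10B * 2^0 = 0.5D; the representable numbers near zero become evenly spaced rather than being flushed to 0.0.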
Floating Point Numbers • As the exponent gets larger, the distance between two consecutive representable numbers increases
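For IEEE single precision the spacing between consecutive representable numbers with exponent E is 2^(E-23): between 1.0 and 2.0 it is 2^-23 ≈ 1.2 x 10^-7, while between 2^24 and 2^25 it is 2, so above 2^24 not even every integer is representable.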
Arithmetic Instruction Throughput • int and float add, shift, min, max and float mul, mad: 4 cycles per warp • int multiply (*) is 32-bit by default • requires multiple cycles per warp • Use the __mul24() / __umul24() intrinsics for a 4-cycle 24-bit int multiply • Needed on G80; on G20 it should be OK • Integer divide and modulo are expensive • The compiler will convert literal power-of-2 divides to shifts • Be explicit in cases where the compiler can't tell that the divisor is a power of 2 • Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below)
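A small sketch of both points (hypothetical kernel; __mul24()/__umul24() are the documented 24-bit multiply intrinsics):
__global__ void index_demo(int *out, int stride_pow2, int n)
{
    // 24-bit multiply intrinsic: fast on G80-class hardware, valid as long as
    // blockIdx.x * blockDim.x fits in 24 bits
    int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    if (i < n) {
        // stride_pow2 is assumed to be a power of 2, so the expensive
        // i % stride_pow2 can be replaced by a single AND:
        int lane = i & (stride_pow2 - 1);
        out[i] = lane;
    }
}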
Arithmetic Instruction Throughput • Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp • These are the versions prefixed with "__" • Examples: __rcp(), __sin(), __exp() • Other functions are combinations of the above • y / x == rcp(x) * y == 20 cycles per warp • sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp
Runtime Math Library • There are two types of runtime math operations • __func(): direct mapping to hardware ISA • Fast but low accuracy (see the programming guide for details) • Examples: __sin(x), __exp(x), __pow(x,y) • func(): compiles to multiple instructions • Slower but higher accuracy (5 ulp, units in the last place, or less) • Examples: sin(x), exp(x), pow(x,y) • The -use_fast_math compiler option forces every func() to compile to __func()
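A short sketch contrasting the two flavors; note that the fast single-precision intrinsics are spelled __sinf()/__expf() in CUDA C (hypothetical kernel):
__global__ void math_flavors(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float accurate = sinf(x) + expf(x);      // multiple instructions, higher accuracy
        float fast     = __sinf(x) + __expf(x);  // maps to hardware, lower accuracy
        out[i] = accurate - fast;                // inspect the difference
    }
}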
Make your program float-safe! • G20 has double precision support • G80 is single-precision only • Double precision has additional performance cost • Only one unit per multiprocessor • Careless use of double or undeclared types may run more slowly on G80+ • Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed • Add ‘f’ specifier on float literals: • foo = bar * 0.123; // double assumed • foo = bar * 0.123f; // float explicit • Use float version of standard library functions • foo = sin(bar); // double assumed • foo = sinf(bar); // single precision explicit
Deviations from IEEE-754 • Addition and multiplication are IEEE 754 compliant • Maximum 0.5 ulp (units in the last place) error • However, they are often combined into a multiply-add (FMAD) • The intermediate result is truncated • Division is non-compliant (2 ulp) • Not all rounding modes are supported • Denormalized numbers are not supported • No mechanism to detect floating-point exceptions
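When FMAD contraction matters, a multiply and an add can be kept as separately rounded operations with the __fmul_rn()/__fadd_rn() intrinsics, which the compiler never fuses into FMAD. A sketch:
__device__ float dot3_no_fmad(float3 a, float3 b)
{
    // Each multiply and add is rounded to nearest individually, so no
    // truncated FMAD intermediate is introduced.
    float s = __fmul_rn(a.x, b.x);
    s = __fadd_rn(s, __fmul_rn(a.y, b.y));
    s = __fadd_rn(s, __fmul_rn(a.z, b.z));
    return s;
}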
Units in the Last Place Error • If the result of an FP computation is 3.12 x 10^-2 = 0.0312, but the answer computed to infinite precision is 0.0312159, then the error is 0.0312159 - 0.0312 = 0.0000159, i.e., 0.159 ulp (one unit in the last place of 3.12 x 10^-2 is 0.0001) • For binary representations with round-to-nearest, the maximum error is 0.5 ulp
Mixed Precision Methods From slides by Robert Strzodka and Dominik Göddeke http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf
Single vs Double Precision FP • Float: s23e8 (1 sign bit, 8 exponent bits, 23 mantissa bits) • Double: s52e11 (1 sign bit, 11 exponent bits, 52 mantissa bits)
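In terms of accuracy, the 23-bit vs. 52-bit mantissa corresponds to a machine epsilon of 2^-23 ≈ 1.2 x 10^-7 for float versus 2^-52 ≈ 2.2 x 10^-16 for double, roughly 7 versus 16 decimal digits; that accuracy gap is what the mixed precision methods introduced above aim to bridge at single-precision speed.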