Introduction to CUDA Programming


Presentation Transcript


  1. Introduction to CUDA Programming: Profiler, Assembly, and Floating-Point • Andreas Moshovos, Winter 2009 • Some material from: Wen-Mei Hwu and David Kirk (NVIDIA), and from the NVISION08 presentation by Robert Strzodka and Dominik Göddeke: http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf

  2. The CUDA Profiler • Both GUI and command-line versions • Non-GUI control is through environment variables: • CUDA_PROFILE • set to 1 or 0 to enable or disable the profiler • CUDA_PROFILE_LOG • set to the name of the log file (defaults to ./cuda_profile.log) • CUDA_PROFILE_CSV • set to 1 or 0 to enable or disable a comma-separated version of the log • CUDA_PROFILE_CONFIG • specify a config file with up to 4 signals (see the example below)
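  A minimal sketch of driving the command-line profiler. The environment variables are the ones listed above; the signal names in the config file are assumptions typical of the G80-era profiler and vary with the CUDA version:

      export CUDA_PROFILE=1                     # enable the profiler
      export CUDA_PROFILE_LOG=./my_profile.log  # log file name
      export CUDA_PROFILE_CSV=1                 # comma-separated output
      export CUDA_PROFILE_CONFIG=./profile.cfg  # file listing up to 4 signals

      # profile.cfg (assumed signal names)
      gld_incoherent
      gld_coherent
      gst_incoherent
      gst_coherent

      ./my_app          # run the application normally; counters appear in the log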

  3. Profiler Signals

  4. Profiler Counters • Grid size X, Y • Block size X, Y, Z • Dyn smem per block: • Dynamic shared memory • Sta smem per block: • Static shared memory • Reg per thread • Mem transfer dir • Direction: 0 = host to device, 1 = device to host • Mem transfer size • in bytes

  5. Interpreting Profiler Counters • Values represent events within a thread warp • Only targets one multiprocessor • Values will not correspond to the total number of warps launched for a particular kernel. • Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. • Values are best used to identify relative performance differences between non-optimized and optimized code • e.g., make the number of non-coalesced loads go from some non-zero value to zero

  6. CUDA Visual Profiler • Helps measure and find potential performance problems • GPU and CPU timing for all kernel invocations and memcpys • Time stamps • Access to hardware performance counters

  7. Assembly

  8. PTX: Assembly for NVIDIA GPUs • Parallel Thread eXecution • Virtual Assembly • Translated to actual machine code at runtime • Allows for different hardware implementations • Might enable additional optimizations • E.g., %clock register to time blocks of code

  9. GPU Code Generation Flow • Parallel Thread eXecution (PTX) • Virtual machine and ISA • Programming model • Execution resources and state • ISA – Instruction Set Architecture • Variable declarations • Instructions and operands • Translator is an optimizing compiler • Translates PTX to target code at program install time • Driver implements the VM runtime • Coupled with the translator • [Flow diagram: C/C++ application → C/C++ compiler → PTX code; ASM-level library (written by the programmer) → PTX code; PTX code → PTX-to-target translator → target code, e.g., G80, …]

  10. How to See the PTX code • nvcc -keep • Produces .ptx and .cubin files • nvcc --opencc-options -LIST:source=on

  11. PTX Example
  • CUDA:
      float4 me = gx[gtid];
      me.x += me.y * me.z;
  • PTX:
      ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];   # 174 me.x += me.y * me.z;
      mad.f32 $f1, $f5, $f3, $f1;
  • PTX registers are virtual – the actual hardware registers are hidden from PTX

  12. PTX Syntax Example

  13. Another Example: CUDA Function
  • CUDA:
      __device__ void interaction(float4 b0, float4 b1, float3 *accel)
      {
          float3 r;                                  // declaration implied by the original slide
          r.x = b1.x - b0.x;
          r.y = b1.y - b0.y;
          r.z = b1.z - b0.z;
          float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
          float s = 1.0f / sqrt(distSqr);
          accel->x += r.x * s;
          accel->y += r.y * s;
          accel->z += r.z * s;
      }
  • PTX:
      sub.f32   $f18, $f1, $f15;
      sub.f32   $f19, $f3, $f16;
      sub.f32   $f20, $f5, $f17;
      mul.f32   $f21, $f18, $f18;
      mul.f32   $f22, $f19, $f19;
      mul.f32   $f23, $f20, $f20;
      add.f32   $f24, $f21, $f22;
      add.f32   $f25, $f23, $f24;
      rsqrt.f32 $f26, $f25;
      mad.f32   $f13, $f18, $f26, $f13;
      mov.f32   $f14, $f13;
      mad.f32   $f11, $f19, $f26, $f11;
      mov.f32   $f12, $f11;
      mad.f32   $f9, $f20, $f26, $f9;
      mov.f32   $f10, $f9;

  14. PTX Data types
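  The table itself was an image in the original; as a brief reconstruction, the standard PTX fundamental types are:

      .s8  .s16 .s32 .s64    signed integers
      .u8  .u16 .u32 .u64    unsigned integers
      .f16 .f32 .f64         floating point
      .b8  .b16 .b32 .b64    untyped bits
      .pred                  predicate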

  15. Predicated Execution • For code of the form: if (cond) { Then Code } After Code
  • Using a branch:
      p = evaluate cond
      branch to After if p not true
      Then Code
      After: After Code
  • Using predication:
      p = evaluate cond
      (p) Then Code        (executed only where p is true)
      After Code

  16. PTX Predicated Execution
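  The slide's example did not survive the transcript; a minimal hand-written sketch of predicated PTX (register names are illustrative):

      setp.lt.f32  $p1, $f1, $f2;      // $p1 = ($f1 < $f2)
      @$p1 add.f32 $f3, $f3, $f4;      // add executes only where $p1 is true
      @!$p1 bra    $Skip;              // or guard a branch with the negated predicate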

  17. Variable Declaration
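  The slide body was an image; as a hand-written sketch (not from the slides), a PTX declaration names a state space, a type, and optionally an alignment, array size, or initializer:

      .reg    .f32 $f<10>;              // 10 virtual float registers
      .global .u32 counter;             // a 32-bit variable in global memory
      .shared .align 4 .b8 sdata[256];  // 256 bytes of shared memory
      .const  .f32 pi = 3.14159;        // constant memory with an initializer
      .local  .b32 scratch[8];          // per-thread local memory array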

  18. Parameterized Variable Names • How to create 100 register “variables” • .reg .b32 %r<100> • Declares %r0 - %r99

  19. Addresses as Operands • The original slide contrasts three operand forms: the value of x, the value of tbl[12], and the base address of tbl (see the sketch below).
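  The accompanying PTX was an image; a hand-written sketch of the three forms (assuming x and tbl live in global memory and tbl holds 4-byte floats):

      ld.global.f32 $f1, [x];        // the value of x
      ld.global.f32 $f2, [tbl+48];   // the value of tbl[12]  (12 * 4 bytes = offset 48)
      mov.u32       $r1, tbl;        // the base address of tbl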

  20. Compiling a Loop That Calls a Function - again
  • CUDA (sx is shared; mx, accel are local):
      for (i = 0; i < K; i++) {
          if (i != threadIdx.x) {
              interaction(sx[i], mx, &accel);
          }
      }
  • PTX:
      mov.s32     $r12, 0;
      $Lt_0_26:
      setp.eq.u32 $p1, $r12, $r5;
      @$p1 bra    $Lt_0_27;
      mul.lo.u32  $r13, $r12, 16;
      add.u32     $r14, $r13, $r1;
      ld.shared.f32 $f15, [$r14+0];
      ld.shared.f32 $f16, [$r14+4];
      ld.shared.f32 $f17, [$r14+8];
      [function body from the previous slide inlined here]
      $Lt_0_27:
      add.s32     $r12, $r12, 1;
      mov.s32     $r15, 128;
      setp.ne.s32 $p2, $r12, $r15;
      @$p2 bra    $Lt_0_26;

  21. Yet Another Example: SAXPY code
      cvt.u32.u16   $blockid, %ctaid.x;      // Calculate i from thread/block IDs
      cvt.u32.u16   $blocksize, %ntid.x;
      cvt.u32.u16   $tid, %tid.x;
      mad24.lo.u32  $i, $blockid, $blocksize, $tid;
      ld.param.u32  $n, [N];                 // Nothing to do if n <= i
      setp.le.u32   $p1, $n, $i;
      @$p1 bra      $L_finish;
      mul.lo.u32    $offset, $i, 4;          // Load y[i]
      ld.param.u32  $yaddr, [Y];
      add.u32       $yaddr, $yaddr, $offset;
      ld.global.f32 $y_i, [$yaddr+0];
      ld.param.u32  $xaddr, [X];             // Load x[i]
      add.u32       $xaddr, $xaddr, $offset;
      ld.global.f32 $x_i, [$xaddr+0];
      ld.param.f32  $alpha, [ALPHA];         // Compute and store alpha*x[i] + y[i]
      mad.f32       $y_i, $alpha, $x_i, $y_i;
      st.global.f32 [$yaddr+0], $y_i;
      $L_finish:
      exit;
  • The CUDA kernel such PTX is typically generated from is sketched below.
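  For reference, a CUDA kernel along these lines (a sketch, not taken from the slides):

      __global__ void saxpy(int n, float alpha, const float *x, float *y)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;   // i from thread/block IDs
          if (i < n)                                       // nothing to do if n <= i
              y[i] = alpha * x[i] + y[i];                  // one mad.f32 per element
      }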

  22. The %clock register • Real time clock cycle counter • How to read: • mov.u32 $r1, %clock; • Can be used to time code (see the sketch below) • It measures real time, not just time spent executing this thread • If a thread is blocked, time still elapses
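  A minimal sketch of timing a code region from CUDA C; the device clock() intrinsic reads a per-multiprocessor cycle counter corresponding to PTX %clock:

      __global__ void timed(float *data, clock_t *cycles)
      {
          clock_t start = clock();                 // read the cycle counter
          data[threadIdx.x] *= 2.0f;               // ... code being timed ...
          clock_t stop = clock();
          if (threadIdx.x == 0)
              *cycles = stop - start;              // real elapsed cycles, stalls included
      }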

  23. PTX Reference • Please Read the PTX ISA specification • Posted under the handouts section

  24. Occupancy Calculator • http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls • GPU Occupancy • Active warps / max warps • Inputs: • Threads/block • Registers/thread • Shared memory/block • Get the per-kernel numbers with nvcc -cubin, which reports, e.g.:
      code {
          name = my_kernel
          lmem = 0
          smem = 24
          reg  = 5
          bar  = 0
          bincode { ... }
          const { ... }
      }
  • A worked example follows below.
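  A rough worked example (assuming G80-class per-multiprocessor limits: 768 threads = 24 warps, 8192 registers, 16 KB shared memory, at most 8 blocks) using the cubin numbers above with 256-thread blocks:

      Registers per block:   256 threads * 5 regs = 1280
      Shared mem per block:  24 B static (+ any dynamic allocation)
      Blocks per SM:         min(768/256, 8192/1280, 16384/24, 8) = 3
      Active warps:          3 blocks * 8 warps = 24 of 24  ->  100% occupancy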

  25. Occupancy Calculator Example

  26. Floating Point Considerations

  27. Comparison of FP Capabilities

  28. IEEE Floating Point Representation • A floating point binary number consists of three parts: • sign (S), exponent (E), and mantissa (M). • Each (S, E, M) pattern uniquely identifies a floating point number. • For each bit pattern, its IEEE floating-point value is derived as: • value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B • The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.

  29. Normalized Representation • Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating point number unique. • For example, the only mantissa value allowed for 0.5D is M = 1.0 • 0.5D = 1.0B * 2^-1 • Neither 10.0B * 2^-2 nor 0.1B * 2^0 qualifies • Because all mantissa values are of the form 1.XX…, one can omit the “1.” part in the representation. • The mantissa value of 0.5D in a 2-bit mantissa is 00, which is derived by omitting “1.” from 1.00.

  30. Exponent Representation • In an n-bit exponent representation, 2^(n-1) - 1 is added to the actual exponent to form its excess (biased) representation. • See the reconstructed table below for a 3-bit exponent representation • A simple unsigned integer comparator can then be used to compare the magnitude of two FP numbers • Symmetric range for +/- exponents (111 reserved) • Actual E = represented E - BIAS
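  The referenced table was an image; a reconstruction consistent with the text (BIAS = 2^(3-1) - 1 = 3) would be:

      Stored pattern    Represented E (pattern - 3)
      000               -3   (used for zero / denormalized numbers)
      001               -2
      010               -1
      011                0
      100                1
      101                2
      110                3
      111               reserved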

  31. A Hypothetical 5-bit Floating Point Representation • Assume 1-bit S, 2-bit E, and 2-bit M • 0.5D = 1.00B * 2^-1 • 0.5D = 0 00 00, • where • S = 0, • E = 00 • M = (1.)00
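  A small C sketch (not from the slides) that decodes this hypothetical 5-bit format under the normalized interpretation, with BIAS = 1 for the 2-bit exponent:

      #include <stdio.h>
      #include <math.h>

      /* Decode a 5-bit pattern: 1 sign bit, 2 exponent bits (bias 1), 2 mantissa bits. */
      float decode5(unsigned bits)
      {
          unsigned s = (bits >> 4) & 1;
          unsigned e = (bits >> 2) & 3;
          unsigned m = bits & 3;
          float mant = 1.0f + m / 4.0f;            /* implied leading 1, two fraction bits */
          float val  = ldexpf(mant, (int)e - 1);   /* multiply by 2^(E - BIAS) */
          return s ? -val : val;
      }

      int main(void)
      {
          printf("%g\n", decode5(0x00));  /* pattern 0 00 00 -> 0.5 */
          return 0;
      }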

  32. Representable Numbers • The representable numbers of a given format is the set of all numbers that can be exactly represented in the format. • For example, an unsigned 3-bit integer format represents exactly the integers 0 through 7 (shown as ticks on a number line in the original slide).

  33. Hypothetical 5-bit FP: Representable Numbers

  34. Flush To Zero • Treat all bit patterns with E=0 as 0.0 • This takes away several representable numbers near zero and lumps them all into 0.0 • For a representation with a large M field, a large number of representable numbers will be removed.

  35. Hypothetical 5-bit FP: Representable Numbers

  36. Denormalized Numbers • The actual method adopted by the IEEE standard is called denormalized numbers or gradual underflow. • The method relaxes the normalization requirement for numbers very close to 0. • Whenever E=0, the mantissa is no longer assumed to be of the form 1.XX. Rather, it is assumed to be 0.XX. In general, if the n-bit exponent field is 0, the value is • 0.M * 2^(-2^(n-1) + 2) • For the hypothetical 5-bit format (n = 2) this is 0.M * 2^0, so the pattern 0 00 01 represents 0.01B = 0.25D.

  37. Hypothetical 5-bit FP: Representable Numbers

  38. Floating Point Numbers • As the exponent gets larger • The distance between adjacent representable numbers increases

  39. Arithmetic Instruction Throughput • int and float add, shift, min, max and float mul, mad: 4 cycles per warp • int multiply (*) is by default 32-bit • requires multiple cycles / warp • Use the __mul24() / __umul24() intrinsics for a 4-cycle 24-bit int multiply (see the sketch below) • Needed on G80; on GT200 the native 32-bit multiply should be OK • Integer divide and modulo are expensive • The compiler will convert literal power-of-2 divides to shifts • Be explicit in cases where the compiler can’t tell that the divisor is a power of 2 • Useful trick: foo % n == foo & (n-1) if n is a power of 2
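  A small CUDA sketch (illustrative, not from the slides) of the 24-bit multiply intrinsic and the power-of-two modulo trick; n is assumed to be a power of 2:

      __global__ void index_demo(int *out, int n)
      {
          // __mul24 is the fast 24-bit multiply; fine here because block and
          // thread indices are well below 2^24.
          int i    = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
          int lane = i & (n - 1);        // same as i % n when n is a power of 2
          out[i] = lane;
      }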

  40. Arithmetic Instruction Throughput • Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp • These are the versions prefixed with “__” • Examples: __sinf(), __expf() • Other functions are combinations of the above • y / x == rcp(x) * y == 20 cycles per warp • sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

  41. Runtime Math Library • There are two types of runtime math operations • __func(): direct mapping to hardware ISA • Fast but low accuracy (see the programming guide for details) • Examples: __sinf(x), __expf(x), __powf(x,y) • func(): compiles to multiple instructions • Slower but higher accuracy (5 ulp, units in the last place, or less) • Examples: sinf(x), expf(x), powf(x,y) • The -use_fast_math compiler option forces every func() to compile to __func() (see the sketch below)
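  An illustrative sketch (not from the slides) contrasting the two flavors; with nvcc -use_fast_math the sinf() call below would itself be compiled as __sinf():

      __global__ void fast_vs_accurate(const float *in, float *fast, float *accurate, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) {
              fast[i]     = __sinf(in[i]);   // hardware special-function unit, lower accuracy
              accurate[i] = sinf(in[i]);     // multi-instruction software version, few-ulp accuracy
          }
      }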

  42. Make your program float-safe! • GT200 has double precision support • G80 is single-precision only • Double precision has an additional performance cost • Only one double-precision unit per multiprocessor • Careless use of double or undeclared types may run more slowly on GT200 • It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed • Add the ‘f’ suffix to float literals: • foo = bar * 0.123; // double assumed • foo = bar * 0.123f; // float explicit • Use the float version of standard library functions: • foo = sin(bar); // double assumed • foo = sinf(bar); // single precision explicit

  43. Deviations from IEEE-754 • Addition and multiplication are IEEE 754 compliant • Maximum 0.5 ulp (units in the last place) error • However, they are often combined into a multiply-add (FMAD) • The intermediate result is truncated • Division is non-compliant (2 ulp) • Not all rounding modes are supported • Denormalized numbers are not supported • No mechanism to detect floating-point exceptions

  44. Units in the Last Place Error • If the result of a FP computation is: • 3.12 x 10^-2 = 0.0312 • But the answer when computed to infinite precision is: • 0.0312159 • Then the error is: • 0.0312159 - 0.0312 = 0.0000159, i.e., 0.159 units in the last place • With round-to-nearest, the maximum error of a correctly rounded result is 0.5 ulp

  45. Mixed Precision Methods • From slides by Robert Strzodka and Dominik Göddeke • http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf

  46. What is a Mixed Precision Method?
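  The slide body was an image; as a brief hedged summary, a mixed precision method does the bulk of the arithmetic in a fast low precision (e.g., single) and computes only the critical quantities, such as residuals, in high precision (e.g., double). The classic example is iterative refinement for solving A*x = b. A small illustrative C sketch, not taken from the slides:

      #include <stdio.h>

      /* Low-precision direct solve of a 2x2 system (Cramer's rule, all in float). */
      static void solve_float(const float A[2][2], const float b[2], float x[2])
      {
          float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
          x[0] = ( A[1][1] * b[0] - A[0][1] * b[1]) / det;
          x[1] = (-A[1][0] * b[0] + A[0][0] * b[1]) / det;
      }

      int main(void)
      {
          double A[2][2] = {{1.0, 0.5}, {0.5, 1.0 / 3.0}};   /* mildly ill-conditioned */
          double b[2]    = {1.0, 2.0};
          double x[2]    = {0.0, 0.0};

          for (int iter = 0; iter < 5; ++iter) {
              double r[2];                                   /* residual in double (accurate) */
              for (int i = 0; i < 2; ++i)
                  r[i] = b[i] - (A[i][0] * x[0] + A[i][1] * x[1]);

              float Af[2][2] = {{(float)A[0][0], (float)A[0][1]},
                                {(float)A[1][0], (float)A[1][1]}};
              float rf[2] = {(float)r[0], (float)r[1]};
              float d[2];
              solve_float(Af, rf, d);                        /* correction solved in float (fast) */

              x[0] += d[0];                                  /* accumulate the solution in double */
              x[1] += d[1];
          }
          printf("x = %.15f %.15f\n", x[0], x[1]);
          return 0;
      }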

  47. Mixed Precision Performance Gains

  48. Single vs Double Precision FP • Float: s23e8 (1 sign, 8 exponent, 23 mantissa bits) • Double: s52e11 (1 sign, 11 exponent, 52 mantissa bits)

  49. Round off and Cancellation

  50. Double Precision != Better Accuracy
