Automated Floating-Point Precision Analysis
Michael O. Lam, Ph.D. Defense, 6 Jan 2014
Jeff Hollingsworth, Advisor
Context • Floating-point arithmetic is ubiquitous
Context
Floating-point arithmetic represents real numbers as ±1.frac × 2^exp
• Sign bit
• Exponent
• Significand ("mantissa" or "fraction")
Single precision: 8-bit exponent, 23-bit significand (32 bits total)
Double precision: 11-bit exponent, 52-bit significand (64 bits total)
Context
Example encodings (single precision / double precision):
• 2.0   → 0x40000000 / 0x4000000000000000
• 2.625 → 0x40280000 / 0x4005000000000000
• 0.1   → 0x3DCCCCCD / 0x3FB999999999999A (inexact: 0.1 has no finite binary representation)
• 1.234 → 0x3F9DF3B6 / 0x3FF3BE76C8B43958 (inexact)
Context • Floating-point is ubiquitous but problematic • Rounding error • Accumulates after many operations • Not always intuitive (e.g., non-associative) • Naïve approach: higher precision • Lower precision is preferable • Tesla K20X is 2.3X faster in single precision • Xeon Phi is 2.0X faster in single precision • Single precision uses 50% of the memory bandwidth
Problem • Current analysis solutions are lacking • Numerical analysis methods are difficult • Static analysis is too conservative • Trial-and-error is time-consuming • We need better analysis solutions • Produce easy-to-understand results • Incorporate runtime effects • Automated or semi-automated
Thesis Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.
Contributions • Floating-point software analysis framework • Cancellation detection • Mixed-precision configuration • Reduced-precision analysis Initial emphasis on capability over performance
Example: Sum2PI_X

/* SUM2PI_X – approximate pi*x in a computationally-
 * heavy way to demonstrate various CRAFT analyses */

/* constants */
#define PI  3.14159265359
#define EPS 1e-7

/* loop iterations; OUTER is X */
#define OUTER 2000
#define INNER 30

int sum2pi_x() {
    int i, j, k;
    real x, y, acc, sum;
    real final = PI * OUTER;        /* correct answer */
    sum = 0.0;
    for (i=0; i<OUTER; i++) {
        acc = 0.0;
        for (j=1; j<INNER; j++) {
            /* calculate 2^j */
            x = 1.0;
            for (k=0; k<j; k++)
                x *= 2.0;           /* 870K execs */
            /* approximately calculate pi */
            y = (real)PI / x;       /* 58K execs */
            acc += y;               /* 58K execs */
        }
        sum += acc;                 /* 2K execs */
    }
    real err = fabs(final-sum)/fabs(final);
    if (err < EPS) printf("SUCCESSFUL!\n");
    else           printf("FAILED!!!\n");
}
Contribution 1 of 4 Software Framework
Framework CRAFT: Configurable Runtime Analysis for Floating-point Tuning
Framework • Dyninst: a binary analysis library • Parses executable files (InstructionAPI & ParseAPI) • Inserts instrumentation (DyninstAPI) • Supports full binary modification (PatchAPI) • Rewrites binary executable files (SymtabAPI) • Binary-level analysis benefits • Programming language-agnostic • Supports closed third-party libraries • Sensitive to compiler transformations
Framework • CRAFT framework • Dyninst-based binary mutator (C/C++) • Swing-based GUI viewers (Java) • Automated search scripts (Ruby) • Proof-of-concept analyses • Instruction counting • Not-a-Number (NaN) detection • Range tracking (from Brown et al. 2007)
Sum2PI_X: no NaNs detected
Contribution 2 of 4 Cancellation Detection
Cancellation • Loss of significant digits due to subtraction • Cancellation detection • Instrument every addition and subtraction • Report cancellation events

Examples (parenthesized counts are significant digits):

      2.491264 (7)             1.613647 (7)
    - 2.491252 (7)           - 1.613647 (7)
      0.000012 (2)             0.000000 (0)
    (5 digits cancelled)     (all digits cancelled)
Cancellation: Results • Gaussian elimination • Detect effects of a small pivot value • Highlight algorithmic differences • Domain-specific insights • Dense point fields • Color saturations • Error checking • Larger cancellations are better
Cancellation: Conclusions • Automated analysis can detect cancellation • Cancellation detection serves a wide variety of purposes • Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]
Contribution 3 of 4 Mixed Precision
Mixed Precision • Tradeoff: Single (32 bits) vs. Double (64 bits) • Single precision is faster • 2X+ computational speedup in recent hardware • 50% reduction in memory storage and bandwidth • Double precision is more accurate • 16 digits vs. 7 digits
Mixed Precision • Most operations use single precision • Crucial operations use double precision

Mixed-precision linear solver [Buttari 2008]:

     1: LU ← PA                     (single)
     2: solve Ly = Pb               (single)
     3: solve Ux_0 = y              (single)
     4: for k = 1, 2, ... do
     5:   r_k ← b – A·x_{k-1}       (double)
     6:   solve Ly = P·r_k          (single)
     7:   solve U·z_k = y           (single)
     8:   x_k ← x_{k-1} + z_k       (double)
     9:   check for convergence
    10: end for

Only the residual and update steps run in double precision; all other steps are single-precision. 50% speedup on average (12X in special cases), but difficult to prototype by hand.
Mixed Precision
Workflow: CRAFT takes the original double-precision binary plus a mixed-precision configuration, and produces a modified mixed-precision binary.
Mixed Precision • Simulate single precision by storing the 32-bit value inside the 64-bit double-precision field • The double is down-cast, the single-precision result is placed in the low word, and the high word is set to the non-signalling NaN pattern 0x7FF4DEAD to flag the slot as replaced
Mixed Precision
Original code for gvec[i,j] = gvec[i,j] * lvec[3] + gvar:

    movsd 0x601e38(%rax,%rbx,8), %xmm0
    mulsd -0x78(%rsp), %xmm0
    addsd -0x4f02(%rip), %xmm0
    movsd %xmm0, 0x601e38(%rax,%rbx,8)
Mixed Precision
Instrumented code: each double-precision arithmetic instruction becomes its single-precision form, preceded by a check/replace of its operands:

    movsd 0x601e38(%rax,%rbx,8), %xmm0
    <check/replace -0x78(%rsp) and %xmm0>
    mulss -0x78(%rsp), %xmm0
    <check/replace -0x4f02(%rip) and %xmm0>
    addss -0x4f02(%rip), %xmm0
    movsd %xmm0, 0x601e38(%rax,%rbx,8)
Mixed Precision
Check/replace instrumentation (inserted before each replaced instruction):

    push %rax
    push %rbx
    <for each input operand>
      <copy input into %rax>
      mov  %rbx, 0xffffffff00000000
      and  %rax, %rbx            # extract high word
      mov  %rbx, 0x7ff4dead00000000
      test %rax, %rbx            # check for flag
      je   next                  # skip if replaced
      <copy input into %rax>
      cvtsd2ss %rax, %rax        # down-cast value
      or   %rax, %rbx            # set flag
      <copy %rax back into input>
    next:
    <next operand>
    pop  %rbx
    pop  %rax
    <replaced instruction>       # e.g. addsd => addss
Mixed Precision • Question: Which parts to replace? • Answer: Automatic search • Empirical, iterative feedback loop • User-defined verification routine • Heuristic search optimization
Automated Search • Keys to search algorithm • Depth-first search • Look for replaceable larger structures first • Modules, functions, blocks, etc. • Prioritization • Inspect highly-executed routines first
Mixed Precision: Sum2PI_X Failed single-precision replacement
Mixed Precision: Sum2PI_X
The outer accumulator is declared with its own type (sum_type) so that its precision can differ from the rest of the computation; the failed replacement above shows it must remain double-precision:

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc;         /* replaceable */
        sum_type sum;           /* must remain double-precision */
        real final = PI * OUTER;
        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                x = 1.0;
                for (k=0; k<j; k++)
                    x *= 2.0;
                y = (real)PI / x;
                acc += y;
            }
            sum += acc;
        }
        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else           printf("FAILED!!!\n");
    }
Mixed Precision: Results • SuperLU • Lower error threshold = fewer replacements
Mixed Precision: Results • AMGmk • Highly-adaptive multigrid microkernel • Built-in error tolerance • Search found complete replacement • Manual conversion • Speedup: 175s to 95s (1.8X) • Conventional x86_64 hardware
Mixed Precision: Results • Memory-based analysis • Replacement candidates: output operands • Generally higher replacement rates • Analysis found several valid variable-level replacements
Mixed Precision: Conclusions • Automated tools can prototype mixed-precision configurations • Automated search can provide precision-level replacement insights • Precision analysis could provide another “knob” for application tuning • Even if computation requires double precision, storage/communication may not
Contribution 4 of 4 Reduced Precision
Reduced Precision • Simulate reduced precision with truncation • Truncate the result after every operation • Allows any precision from zero bits up to full double precision • Less overhead than mixed-precision analysis (fewer added operations) • Search routine identifies component-level precision requirements
Reduced Precision: GUI • Bit-level precision requirements, displayed on a scale from 0 bits through single precision up to full double precision
Reduced Precision: Sum2PI_X
• 0 bits (single – exponent only)
• 22 bits (single)
• 27 bits (double – overly conservative)
• 32 bits (double)
Reduced Precision • Faster search convergence compared to mixed-precision analysis