This paper discusses fault detection in numerical libraries, distinguishing between errors and round-offs in computed results.
Tests and Tolerances for High-Performance Software-Implemented Fault Detection
Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou
Objective
• Software fault detection in common numerical libraries by checking the computed output
• The faulty environment here essentially consists of bit flips in the application's state space
• Distinguish between errors and round-off in computed results
Faults and EDMs
• Single Event Upsets (SEUs)
  • Radiation-induced errors causing bit flips in memory and cache
  • Affect application data and code
  • Data errors are more difficult to detect
• Error Detecting Middleware (EDM)
  • Wraps existing numerical libraries
  • Avoids altering the internals of the library
  • The check must be more efficient than the original computation
Numerical Error Checking - Summary
• Consider common numerical matrix computations
• Use "post-conditions" to evaluate correctness
  • Post-condition: a necessary relation between inputs and computed outputs
• Use well-known upper bounds on error propagation within numerical algorithms for matrix computations
• Define tests and tolerances to separate errors from round-off
• Develop input-independent tolerances
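Definitions: Vector & Matrix norms
• Vectors:
  • $\|v\|_1 = \sum_i |v_i|$
  • $\|v\|_\infty = \max_i |v_i|$
  • $\|v\|_2 = \left(\sum_i |v_i|^2\right)^{1/2}$
• Matrices:
  • $\|A\|_1$ = maximum absolute column sum of $A$
  • $\|A\|_\infty$ = maximum absolute row sum of $A$
  • $\|A\|_2$ = largest singular value of $A$
  • $\|A\|_F = \left(\sum_{i,j} |a_{ij}|^2\right)^{1/2}$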
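The norm definitions above can be sketched in pure Python as follows. The function names `vec_norm` and `mat_norm` and the string labels `"inf"` and `"fro"` are illustrative assumptions; the matrix 2-norm is left out because it needs a singular value decomposition.

```python
import math

def vec_norm(v, p):
    """Vector norm for p in {1, 2, "inf"} (labels are illustrative)."""
    mags = [abs(x) for x in v]
    if p == 1:
        return sum(mags)
    if p == 2:
        return math.sqrt(sum(m * m for m in mags))
    if p == "inf":
        return max(mags)
    raise ValueError("unsupported norm: %r" % (p,))

def mat_norm(A, p):
    """Matrix norm for p in {1, "inf", "fro"}; A is a list of rows."""
    if p == 1:      # maximum absolute column sum
        return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))
    if p == "inf":  # maximum absolute row sum
        return max(sum(abs(a) for a in row) for row in A)
    if p == "fro":  # Frobenius norm
        return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))
    raise ValueError("unsupported norm: %r" % (p,))

A = [[1.0, -2.0], [3.0, 4.0]]
print(mat_norm(A, 1), mat_norm(A, "inf"), mat_norm(A, "fro"))
```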
Matrices review
• Orthogonal matrix: $A A^T = I \Rightarrow A^{-1} = A^T$
• Unitary matrix: $A^{*T} = A^{-1}$
• Permutation matrix: reordered rows of $I$
• Sub-multiplicative property: $\|Av\| \le \|A\|\,\|v\|$ and $\|AB\| \le \|A\|\,\|B\|$
Numerical Functions
• Matrix multiplication
• QR decomposition: $A = Q R$
  • $A$ = input matrix
  • $Q$ = orthogonal matrix
  • $R$ = upper triangular matrix
• Singular value decomposition: $A = U D V^T$
  • $A$ = input matrix
  • $D$ = diagonal matrix
  • $U$ and $V$ = orthogonal matrices
Numerical Functions (contd.)
• LU decomposition: $A = P L U$
  • $P$ = permutation matrix
  • $L$ = lower triangular matrix
  • $U$ = upper triangular matrix
• System solution
  • Solve for $x$ in $Ax = b$, given $A$ and $b$
• Matrix inverse
  • Given $A$, find $B$ such that $AB = I$
Numerical Functions (contd.)
• Fourier transform
  • Given $x$, find $y$ such that $y = Wx$, where $W$ is the $n \times n$ matrix of Fourier basis functions, $W_{mk} = e^{-j 2\pi k m / n}$
• Inverse Fourier transform
  • Given $y$, find $x$ such that $x = n^{-1} W^{*T} y$, where $W$ is the same $n \times n$ matrix of Fourier basis functions ($n^{-1} W^{*T} = W^{-1}$)
Probe Vector
• Checking the post-condition $A = \hat{Q}\hat{R}$ directly is computationally intensive
• Instead, multiply both sides by a probe vector $w$ and compare the resulting vectors
  • $A w \gtrless \hat{Q}\hat{R} w$
• Choice of $w$
  • Elements of $w$ should not vary greatly in magnitude
  • $w$ should be non-zero everywhere
  • Can be a vector of all ones, except for the FFT
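Error Propagation – Matrix multiplication
• Error matrix $E = \hat{P} - AB$
  • $\hat{P}$ = mult($A$, $B$), the computed product
• $\|E\|_\infty \le n\,\|A\|_\infty \|B\|_\infty\, u$
  • $u$ = difference between unity and the next larger floating-point number, $n$ = dimension common to $A$ and $B$
• $d = \hat{P} w - A B w = E w$
• $\|d\|_\infty = \|E w\|_\infty \le \|E\|_\infty \|w\|_\infty \le n\,\|A\|_\infty \|B\|_\infty \|w\|_\infty\, u$
• Test: $\|d\|_\infty / (\|A\|_\infty \|B\|_\infty \|w\|_\infty) \gtrless \tau u$, where $\tau$ is the decision threshold
  • $n$ is ignored: in the average case, round-off errors are independent of dimension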
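The matrix multiplication test above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the names `matmul_check` and `tau`, the default `tau=1000.0`, and the all-ones probe vector are choices made for the example, and the inputs are assumed to have non-zero norms.

```python
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Stand-in for the library routine being wrapped."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def inf_norm_mat(M):
    return max(sum(abs(x) for x in row) for row in M)

def inf_norm_vec(v):
    return max(abs(x) for x in v)

def matmul_check(A, B, P_hat, tau=1000.0):
    """Return (ok, ratio), ratio = ||d|| / (||A|| ||B|| ||w|| u).

    tau is a placeholder threshold (an assumption of this sketch).
    """
    w = [1.0] * len(B[0])            # all-ones probe vector
    lhs = matvec(P_hat, w)           # P_hat w
    rhs = matvec(A, matvec(B, w))    # A (B w): two matrix-vector products
    d = [l - r for l, r in zip(lhs, rhs)]
    scale = inf_norm_mat(A) * inf_norm_mat(B) * inf_norm_vec(w)
    ratio = inf_norm_vec(d) / (scale * U)
    return ratio <= tau, ratio

A = [[1.5, 2.0], [-0.5, 4.0]]
B = [[2.0, 0.25], [1.0, -3.0]]
P_hat = matmul(A, B)
print(matmul_check(A, B, P_hat))   # fault-free product
P_hat[0][1] += 1e-3                # simulate a corrupted entry
print(matmul_check(A, B, P_hat))
```

Evaluating `A (B w)` as two matrix-vector products keeps the check at quadratic cost, whereas recomputing the product would cost as much as the original multiplication.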
Error Propagation
• QRD:
  • $\|d\|_F / (\|A\|_F \|w\|_F) \gtrless \tau u$
  • $d = \hat{Q}\hat{R} w - A w$
• SVD:
  • $\|d\| / (\|A\|\,\|w\|) \gtrless \tau u$
  • $d = \hat{U}\hat{D}\hat{V}^T w - A w$
• LUD:
  • $\|d\| / (\|A\|\,\|w\|) \gtrless \tau u$
  • $d = \hat{P}\hat{L}\hat{U} w - A w$
Error Propagation (contd.)
• Solve $Ax = b$:
  • $\|d\| / (\|A\|\,\|\hat{x}\|) \gtrless \tau u$
  • $d = A\hat{x} - b$
• Matrix inverse:
  • $\|d\| / (\|A\|\,\|\hat{B}\|\,\|w\|) \gtrless \tau u$
  • $d = \hat{B} A w - w$
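The system-solution test on this slide amounts to a scaled residual check. A minimal sketch follows, assuming the infinity norm, a hypothetical function name `solve_check`, and a placeholder threshold `tau=1000.0`; none of these choices come from the slides.

```python
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double

def solve_check(A, x_hat, b, tau=1000.0):
    """Return (ok, ratio), ratio = ||A x_hat - b|| / (||A|| ||x_hat|| u).

    Uses infinity norms; assumes A and x_hat are non-zero.
    """
    Ax = [sum(a * x for a, x in zip(row, x_hat)) for row in A]
    d = [p - q for p, q in zip(Ax, b)]             # residual d = A x_hat - b
    norm_A = max(sum(abs(a) for a in row) for row in A)
    norm_x = max(abs(x) for x in x_hat)
    norm_d = max(abs(e) for e in d)
    ratio = norm_d / (norm_A * norm_x * U)
    return ratio <= tau, ratio

A = [[4.0, 1.0], [2.0, 3.0]]
b = [3.0, 4.0]
x_hat = [0.5, 1.0]                      # exact solution of A x = b
print(solve_check(A, x_hat, b))
x_bad = [0.5, 1.0 + 2.0 ** -20]         # solution with a corrupted low-order bit
print(solve_check(A, x_bad, b))
```

No probe vector is needed here because the post-condition already compares two vectors.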
Error Propagation - FFT
• Forward transform:
  • $d = (\hat{y} - Wx)^T w$
  • $W$ is the $n \times n$ forward transform matrix containing the Fourier basis functions
  • $w$ cannot have a sparse transform
  • Error propagation: $\|e\| \le 5\, n \log_2 n\, \|x\|\, u$
  • $|d| / (n \log_2 n\, \|x\|_2 \|w\|_2) \gtrless \tau u$
• Inverse transform:
  • $d = (\hat{x} - n^{-1} W^T y)^T w$
  • $|d| / (\log_2 n\, \|y\|_2 \|w\|_2) \gtrless \tau u$
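One way to evaluate the forward-transform test without recomputing $Wx$ is to use the symmetry of $W$: $(Wx)^T w = x^T (W w)$, so the transform of the probe vector can be computed once and reused. The sketch below assumes this rearrangement, a naive `dft` as a stand-in for the library FFT, a seeded pseudo-random probe vector, and a placeholder threshold `tau=1000.0`; all names are hypothetical.

```python
import cmath
import math
import random
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double

def dft(x):
    """Naive O(n^2) transform, standing in for the library FFT under test."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * ((k * m) % n) / n)
                for m in range(n)) for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

def make_probe(n, seed=0):
    """Probe with entries in [0.5, 1.5]; its transform is not sparse."""
    rng = random.Random(seed)
    w = [rng.uniform(0.5, 1.5) for _ in range(n)]
    return w, dft(w)  # W w is computed once per length n

def fft_check(x, y_hat, w, Ww, tau=1000.0):
    """Return (ok, ratio) for d = y_hat^T w - x^T (W w); requires n >= 2."""
    n = len(x)
    lhs = sum(yk * wk for yk, wk in zip(y_hat, w))
    rhs = sum(xk * vk for xk, vk in zip(x, Ww))
    scale = n * math.log2(n) * norm2(x) * norm2(w)
    ratio = abs(lhs - rhs) / (scale * U)
    return ratio <= tau, ratio

x = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5]
w, Ww = make_probe(len(x))
y_hat = dft(x)
print(fft_check(x, y_hat, w, Ww))
y_hat[0] += 1e-3  # simulate a corrupted output element
print(fft_check(x, y_hat, w, Ww))
```

With an all-ones probe, $Ww$ would be zero everywhere except its first entry, so the check would only look at one element of $x$; this is why the probe must not have a sparse transform.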
Comparison Tests
• $\Delta$ = RHS − LHS and $\delta = \|\Delta w\|$
  • ($\Delta$ is never actually computed)
• T0: $\delta / \|w\| \gtrless \tau u$
  • Trivial test: un-normalized comparison
• T1: $\delta / (\sigma_1 \|w\|) \gtrless \tau u$
  • Ideal test: may not always be computable
• T2: $\delta / (\sigma_2 \|w\|) \gtrless \tau u$
  • Approximate matrix test: based on computed quantities
• T3: $\delta / (\|w\| + \sigma_3) \gtrless \tau u$
  • Approximate vector test: higher chance of false alarms
Experiments
• Faults are injected in half the runs by changing a random bit of the algorithm's state space
• Faults are injected at a random point of execution
• The threshold value is chosen based on the error quantity computed under faulty and fault-free conditions
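A single bit flip in a double-precision value can be simulated as sketched below. This is a simplification of the experiment described above: it corrupts one element of a data array rather than arbitrary algorithm state at a random point of execution, and the names `flip_bit` and `inject_fault` are hypothetical.

```python
import random
import struct

def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of an IEEE-754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

def inject_fault(data, rng):
    """Return a copy of data with one random bit flipped in one element."""
    i = rng.randrange(len(data))
    bit = rng.randrange(64)
    corrupted = list(data)
    corrupted[i] = flip_bit(data[i], bit)
    return corrupted, i, bit

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [1.0, 2.0, 3.0]
print(inject_fault(data, rng))
```

Flips in the low-order mantissa bits change a value by about as much as round-off does, which is why the tests need a tolerance instead of an exact comparison.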
[Figure: curves comparing tests T0, T1, T2 and T3]
Alternate tests for FFT
• Parseval's condition: $(\|x\|_2 - n^{-1/2} \|y\|_2) / \|x\|_2 \gtrless \tau u$
• Choose a vector $w_2$ whose real and imaginary parts both equal $\cos(4\pi(k - n/2)/n)$, $k = 0, 1, \ldots, n-1$, and compute the difference as before
[Figure: ROC for FFT]
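Parseval's condition can be checked in linear time once the transform is available. The sketch below assumes a naive `dft` as a stand-in for the library FFT, a hypothetical function name `parseval_check`, and a placeholder threshold `tau=1000.0`.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double

def dft(x):
    """Naive O(n^2) transform, standing in for the library FFT under test."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * ((k * m) % n) / n)
                for m in range(n)) for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

def parseval_check(x, y_hat, tau=1000.0):
    """Return (ok, ratio), ratio = | ||x|| - ||y_hat|| / sqrt(n) | / (||x|| u).

    Assumes x is non-zero.
    """
    n = len(x)
    nx = norm2(x)
    ratio = abs(nx - norm2(y_hat) / math.sqrt(n)) / (nx * U)
    return ratio <= tau, ratio

x = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5]
y_hat = dft(x)
print(parseval_check(x, y_hat))
y_hat[0] += 1e-3  # simulate a corrupted output element
print(parseval_check(x, y_hat))
```

A fault that changes only the phase of an output element leaves the norm unchanged, so this test complements the probe-vector test and does not replace it.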
Related work
• ABFT (algorithm-based fault tolerance): introduced by Huang & Abraham for matrix operations, 1984
  • Error detection based on the algorithm employed: the matrix is encoded with a checksum matrix
  • Vastly extended by others for various numerical operations
• Result Checking (RC): introduced by Blum & Wasserman, with a focus on computation errors, 1996
• Prata & Silva compared the two and found that, for matrix multiplication and QRD, RC is more efficient than ABFT, 1999
Summary
• Faults are detected based on conditions that the numerical output must satisfy
• Implemented as wrappers around existing libraries
• Experiments are run under fault-free and faulty conditions, observing the decision criterion
• $\tau_{ub} \gg \tau^*$ ⇒ $\tau$ can be set based on an average-case outlook rather than assuming the worst-case scenario
• Select a trade-off between fault detection and false alarms
• Can be extended to other common computations such as sorting, integration, etc.