Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou

Objective
• Detect faults in software by checking the computed output of common numerical libraries
• The faulty environment here essentially consists of bit flips in the application's state space
• Distinguish between errors and round-off in computed results

Faults and EDMs
• Single Event Upsets (SEUs)
  • Radiation-induced errors causing bit flips in memory and cache
  • Affect application data and code
  • Data errors are more difficult to detect
• Error Detecting Middleware (EDM)
  • Wraps existing numerical libraries
  • Avoids altering the internals of the library
  • The check should be more efficient than the original computation

Numerical Error Checking - Summary
• Consider common numerical matrix computations
• Use "post-conditions" to evaluate correctness
  • Post-condition: a necessary relation between the inputs and the computed outputs
• Use well-known upper bounds on error propagation within numerical algorithms for matrix computations
• Define tests and tolerances to separate errors from round-off
• Develop input-independent tolerances

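Definitions: Vector & Matrix norms
• Vector:
  • $\|v\|_1 = \sum_i |v_i|$
  • $\|v\|_\infty = \max_i |v_i|$
  • $\|v\|_2 = \left(\sum_i |v_i|^2\right)^{1/2}$
• Matrices:
  • $\|A\|_1$ = maximum absolute column sum of $A$
  • $\|A\|_\infty$ = maximum absolute row sum of $A$
  • $\|A\|_2$ = largest singular value of $A$
  • $\|A\|_F = \left(\sum_{i,j} |a_{ij}|^2\right)^{1/2}$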
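The norms above are cheap to compute directly from the entries, apart from $\|A\|_2$, which needs a singular value computation and is left out here. The following is a minimal sketch in plain Python; the function names are illustrative choices, not names from the presentation.

```python
import math

def vec_norm_1(v):
    """Sum of absolute values."""
    return sum(abs(x) for x in v)

def vec_norm_inf(v):
    """Largest absolute value."""
    return max(abs(x) for x in v)

def vec_norm_2(v):
    """Euclidean length."""
    return math.sqrt(sum(abs(x) ** 2 for x in v))

def mat_norm_1(A):
    """Maximum absolute column sum (A is a list of rows)."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def mat_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def mat_norm_fro(A):
    """Square root of the sum of squared absolute entries."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))

A = [[1.0, -2.0], [3.0, 4.0]]
print(mat_norm_1(A), mat_norm_inf(A), mat_norm_fro(A))
```

For this `A` the column sums are 4 and 6 and the row sums are 3 and 7, so the 1-norm is 6 and the ∞-norm is 7.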

Matrices review
• Orthogonal matrix: $A A^T = I \Rightarrow A^{-1} = A^T$
• Unitary matrix: $A^{*T} = A^{-1}$
• Permutation matrix: reordered rows of $I$
• Sub-multiplicative property: $\|Av\| \le \|A\|\,\|v\|$ and $\|AB\| \le \|A\|\,\|B\|$

Numerical Functions
• Matrix multiplication
• QR decomposition: $A = QR$
  • $A$ = input matrix
  • $Q$ = orthogonal matrix
  • $R$ = upper triangular matrix
• Singular value decomposition: $A = U D V^T$
  • $A$ = input matrix
  • $D$ = diagonal matrix
  • $U$ and $V$ = orthogonal matrices

Numerical Functions (contd.)
• LU decomposition: $A = PLU$
  • $P$ = permutation matrix
  • $L$ = lower triangular matrix
  • $U$ = upper triangular matrix
• System solution
  • Solve for $x$ in $Ax = b$, given $A$ and $b$
• Matrix inverse
  • Given $A$, find $B$ such that $AB = I$

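Numerical Functions (contd.)
• Fourier transform
  • Given $x$, find $y$ such that $y = Wx$, where $W$ is the matrix of Fourier bases, $W_{nk} = e^{-j 2\pi kn/N}$ for a transform of length $N$
• Inverse Fourier transform
  • Given $y$, find $x$ such that $x = n^{-1} W^{*T} y$, where $W$ is the $n \times n$ matrix of Fourier bases and $n$ is the transform length ($n^{-1} W^{*T} = W^{-1}$)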

Operations & Post-conditions

Probe Vector
• Checking the post-condition $A = \hat{Q}\hat{R}$ directly is computationally intensive (hats mark computed quantities)
• Instead, multiply by a probe vector $w$ and compare vectors: $A w$ against $\hat{Q}\hat{R} w$
• Choice of $w$
  • Elements of $w$ should not vary greatly in magnitude
  • $w$ should be non-zero everywhere
  • Can be a vector of all ones, except for the FFT

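Error Propagation – Matrix multiplication
• Error matrix: $E = \hat{P} - AB$, where $\hat{P} = \mathrm{mult}(A, B)$ is the computed product
• $\|E\|_\infty \le n\,\|A\|_\infty \|B\|_\infty\, u$
  • $u$ = difference between unity and the next larger floating-point number
  • $n$ = dimension common to $A$ and $B$
• $d = \hat{P} w - A B w = E w$
• $\|d\|_\infty = \|E w\|_\infty \le \|E\|_\infty \|w\|_\infty \le n\,\|A\|_\infty \|B\|_\infty \|w\|_\infty\, u$
• Test: $\|d\|_\infty / (\|A\|_\infty \|B\|_\infty \|w\|_\infty) \gtrless \tau u$, with threshold $\tau$
• $n$ is ignored: in the average case, round-off errors are independent of the dimension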
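The matrix multiplication test above can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the authors' implementation: the helper names, the all-ones probe vector, and the default threshold `tau=100.0` are choices made here, and $u$ is taken to be the double-precision machine epsilon. Both operands are assumed to be non-zero so that the scale factor is non-zero.

```python
import sys

U_ROUNDOFF = sys.float_info.epsilon  # u: gap between 1.0 and the next larger double

def matvec(M, v):
    """Matrix-vector product; M is a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Plain triple-loop product, standing in for the library routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def norm_inf_mat(M):
    return max(sum(abs(m) for m in row) for row in M)

def matmul_check(A, B, P_hat, tau=100.0):
    """Return (passed, score), where score = ||d|| / (||A|| ||B|| ||w|| u)."""
    w = [1.0] * len(B[0])                 # all-ones probe vector
    lhs = matvec(P_hat, w)                # P_hat w
    rhs = matvec(A, matvec(B, w))         # A (B w): two matrix-vector products
    d = [p - q for p, q in zip(lhs, rhs)]
    scale = norm_inf_mat(A) * norm_inf_mat(B) * norm_inf_vec(w)
    score = norm_inf_vec(d) / (scale * U_ROUNDOFF)
    return score <= tau, score

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.25]]
P = matmul(A, B)
print(matmul_check(A, B, P))              # fault-free product

P_bad = [row[:] for row in P]
P_bad[0][0] += 1.0                        # simulate a corrupted output entry
print(matmul_check(A, B, P_bad))
```

The check costs three matrix-vector products instead of a second matrix-matrix product, which is what keeps it cheaper than the computation it guards.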

Error Propagation
• QRD:
  • $\|d\|_F / (\|A\|_F \|w\|_F) \gtrless \tau u$
  • $d = \hat{Q}\hat{R} w - A w$
• SVD:
  • $\|d\| / (\|A\|\,\|w\|) \gtrless \tau u$
  • $d = \hat{U}\hat{D}\hat{V}^T w - A w$
• LUD:
  • $\|d\| / (\|A\|\,\|w\|) \gtrless \tau u$
  • $d = \hat{P}\hat{L}\hat{U} w - A w$
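All three factorization tests share one shape: apply the computed factors to $w$ from right to left, subtract $Aw$, and normalize. A minimal sketch of that shared shape follows, assuming the Frobenius norm for $A$ and the 2-norm for vectors (as in the QRD line), an all-ones probe vector, and machine epsilon for $u$. The function name and the hand-built LU example are illustrative.

```python
import math
import sys

U_ROUNDOFF = sys.float_info.epsilon

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm_2_vec(v):
    return math.sqrt(sum(x * x for x in v))

def norm_fro_mat(M):
    return math.sqrt(sum(m * m for row in M for m in row))

def factorization_score(A, factors):
    """Score ||F1 F2 ... Fk w - A w|| / (||A||_F ||w|| u) for computed factors Fi."""
    w = [1.0] * len(A[0])
    v = w
    for F in reversed(factors):           # rightmost factor is applied first
        v = matvec(F, v)
    d = [p - q for p, q in zip(v, matvec(A, w))]
    return norm_2_vec(d) / (norm_fro_mat(A) * norm_2_vec(w) * U_ROUNDOFF)

# Hand-built LU factors of A (the permutation is the identity and is omitted).
A = [[4.0, 3.0], [6.0, 3.0]]
L = [[1.0, 0.0], [1.5, 1.0]]
U = [[4.0, 3.0], [0.0, -1.5]]
print(factorization_score(A, [L, U]))      # fault-free factors

U_bad = [[4.0, 3.0], [0.0, -1.0]]          # one corrupted entry
print(factorization_score(A, [L, U_bad]))
```

The same function covers QRD with `factors=[Q, R]` and SVD with `factors=[U, D, V_transposed]`; the score is then compared against a chosen $\tau$.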

Error Propagation (contd.)
• Solve $Ax = b$:
  • $\|d\| / (\|A\|\,\|\hat{x}\|) \gtrless \tau u$
  • $d = A\hat{x} - b$
• Matrix inverse:
  • $\|d\| / (\|A\|\,\|\hat{B}\|\,\|w\|) \gtrless \tau u$
  • $d = \hat{B} A w - w$
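The two residual tests above can be sketched in the same style. The slide leaves the norm unspecified, so the ∞-norm used here is an assumption, as are the function names, the all-ones probe vector, and the use of machine epsilon for $u$. Note that the system-solution test needs no probe vector because the residual is already a vector.

```python
import sys

U_ROUNDOFF = sys.float_info.epsilon

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def norm_inf_mat(M):
    return max(sum(abs(m) for m in row) for row in M)

def solve_score(A, b, x_hat):
    """Score ||A x_hat - b|| / (||A|| ||x_hat|| u) for a computed solution x_hat."""
    d = [r - bi for r, bi in zip(matvec(A, x_hat), b)]
    return norm_inf_vec(d) / (norm_inf_mat(A) * norm_inf_vec(x_hat) * U_ROUNDOFF)

def inverse_score(A, B_hat):
    """Score ||B_hat A w - w|| / (||A|| ||B_hat|| ||w|| u) for a computed inverse B_hat."""
    w = [1.0] * len(A[0])
    d = [p - wi for p, wi in zip(matvec(B_hat, matvec(A, w)), w)]
    scale = norm_inf_mat(A) * norm_inf_mat(B_hat) * norm_inf_vec(w)
    return norm_inf_vec(d) / (scale * U_ROUNDOFF)

A = [[2.0, 1.0], [1.0, 3.0]]
b = [4.0, 7.0]
print(solve_score(A, b, [1.0, 2.0]))       # exact solution
print(solve_score(A, b, [1.0, 2.5]))       # corrupted solution

D = [[2.0, 0.0], [0.0, 4.0]]
print(inverse_score(D, [[0.5, 0.0], [0.0, 0.25]]))   # correct inverse
print(inverse_score(D, [[0.5, 0.0], [0.0, 0.5]]))    # corrupted inverse
```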

Error Propagation - FFT
• Forward transform:
  • $d = (\hat{y} - Wx)^T w$
  • $W$ is the $n \times n$ forward transform matrix containing the Fourier basis functions
  • The transform of $w$ must not be sparse (this rules out the all-ones vector)
  • Error propagation: $\|e\| \le 5 n \log_2 n\, \|x\|\, u$
  • $|d| / (n \log_2 n\, \|x\|_2 \|w\|_2) \gtrless \tau u$
• Inverse transform:
  • $d = (\hat{x} - n^{-1} W^{*T} y)^T w$
  • $|d| / (\log_2 n\, \|y\|_2 \|w\|_2) \gtrless \tau u$

Comparison Tests
• $\Delta$ = RHS − LHS and $\delta = \|\Delta w\|$ ($\Delta$ is never actually computed)
• T0: $\delta / \|w\| \gtrless \tau u$
  • Trivial test: un-normalized comparison
• T1: $\delta / (\sigma_1 \|w\|) \gtrless \tau u$
  • Ideal test: may not always be computable
• T2: $\delta / (\sigma_2 \|w\|) \gtrless \tau u$
  • Approximate matrix test: based on computed quantities
• T3: $\delta / (\|w\| + \sigma_3) \gtrless \tau u$
  • Approximate vector test: higher chance of false alarms

Experiments
• Faults are injected in half of the runs by changing a random bit of the algorithm's state space
• Faults are injected at a random point of execution
• The threshold value is chosen based on the error quantity computed under the faulty and fault-free conditions
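A single bit flip in a double-precision value can be simulated by reinterpreting its 8 bytes as an integer. The sketch below is a simplification of the experiment described above: it flips a bit in one matrix entry rather than anywhere in the algorithm's state space or at a random point of execution, and the function names are hypothetical.

```python
import random
import struct

def flip_bit(x, bit):
    """Flip one bit of the IEEE 754 double x (bit 0 = least significant, 63 = sign)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def inject_fault(matrix, rng):
    """Flip a random bit of a random entry in place; return (row, col, bit)."""
    i = rng.randrange(len(matrix))
    j = rng.randrange(len(matrix[0]))
    bit = rng.randrange(64)
    matrix[i][j] = flip_bit(matrix[i][j], bit)
    return i, j, bit

print(flip_bit(1.0, 63))   # sign bit
print(flip_bit(1.0, 52))   # lowest exponent bit
print(flip_bit(1.0, 0))    # lowest mantissa bit

M = [[1.0, 2.0], [3.0, 4.0]]
print(inject_fault(M, random.Random(0)), M)
```

Flipping the sign bit of 1.0 gives -1.0 and flipping the lowest exponent bit gives 0.5, while a flip in the low mantissa bits changes the value by an amount comparable to round-off. That last case is why the tolerance has to separate faults from rounding rather than test for exact equality.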

Choosing $\tau$

[Figure: curves labeled T0, T1, T2, T3]
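Given test scores collected from fault-free and faulty runs, each candidate threshold yields one false-alarm rate and one detection rate, and sweeping the threshold traces a ROC curve. The sketch below shows that bookkeeping only; the score lists are made-up illustrative numbers, not results from the presentation, and the function names are hypothetical.

```python
def rates(fault_free_scores, faulty_scores, tau):
    """Return (false_alarm_rate, detection_rate) when a fault is flagged for score > tau."""
    false_alarms = sum(s > tau for s in fault_free_scores)
    detections = sum(s > tau for s in faulty_scores)
    return (false_alarms / len(fault_free_scores),
            detections / len(faulty_scores))

def roc_points(fault_free_scores, faulty_scores):
    """One (tau, false_alarm_rate, detection_rate) triple per observed score value."""
    taus = sorted(set(fault_free_scores) | set(faulty_scores))
    return [(tau, *rates(fault_free_scores, faulty_scores, tau)) for tau in taus]

# Illustrative scores in units of u (invented for this example).
fault_free = [0.2, 0.5, 1.1, 3.0]
faulty = [0.9, 40.0, 2e3, 1e9]

print(rates(fault_free, faulty, 5.0))
print(rates(fault_free, faulty, 0.6))
for point in roc_points(fault_free, faulty):
    print(point)
```

With these numbers a threshold of 5 gives no false alarms but misses the one fault whose score sits inside the round-off range, while a threshold of 0.6 catches every fault at the cost of flagging half of the fault-free runs.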

Alternate tests for FFT
• Parseval's condition: $(\|x\|_2 - n^{-1/2} \|y\|_2) / \|x\|_2 \gtrless \tau u$
• Choose a second probe vector $w_2$ with real and imaginary parts equal to $\cos(4\pi(k - n/2)/n)$, $k = 0, 1, \ldots, n-1$, and compute the difference as before

[Figure: ROC for FFT]
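Parseval's condition needs only two norms, so it can be checked without any probe vector. The sketch below uses a naive $O(n^2)$ DFT as a stand-in for the FFT routine under test; that stand-in, the function names, and the use of machine epsilon for $u$ are assumptions made for illustration.

```python
import cmath
import math
import sys

U_ROUNDOFF = sys.float_info.epsilon

def naive_dft(x):
    """Unnormalized forward DFT, y_m = sum_k x_k exp(-2*pi*i*m*k/n)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]

def norm_2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

def parseval_score(x, y_hat):
    """Score | ||x|| - n^(-1/2) ||y_hat|| | / (||x|| u)."""
    n = len(x)
    nx = norm_2(x)
    return abs(nx - norm_2(y_hat) / math.sqrt(n)) / (nx * U_ROUNDOFF)

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
y = naive_dft(x)
clean = parseval_score(x, y)

y_bad = list(y)
y_bad[3] += 1.0                           # simulate a corrupted output element
faulty = parseval_score(x, y_bad)
print(clean, faulty)
```

A limitation of this test is that it only sees changes in the norm of the output: negating one output element, as a sign-bit flip would, leaves the score unchanged. The probe-vector test with $w_2$ does not have that blind spot.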

Related work
• ABFT (Algorithm-Based Fault Tolerance): introduced by Huang & Abraham for matrix operations, 1984
  • Error detection based on the algorithm employed: the matrix is encoded with a checksum matrix
  • Vastly extended by others for various numerical operations
• Result Checking (RC): introduced by Blum & Wasserman, with a focus on computation errors, 1996
• Prata & Silva compared the two and found RC more efficient than ABFT for matrix multiplication and QRD, 1999

Summary
• Faults are detected based on conditions that the numerical output must satisfy
• Implemented as wrappers around existing libraries
• Run experiments under fault-free and faulty conditions and observe the decision criterion
• $\tau_{ub} \gg \tau^*$, so $\tau$ can be set based on an average-case outlook rather than assuming the worst-case scenario
• Selecting $\tau$ is a trade-off between fault detection and false alarms
• Can be extended to other common computations such as sorting, integration, etc.
