Algorithm-Based Fault Tolerance for Matrix Operations

Proposed by: Kuang-Hua Huang and Jacob A. Abraham

Problem Description
• Achieving a fault-tolerant model that is algorithm based rather than hardware based
• Existing techniques require high overhead cost
  • Error Masking (hardware redundancy)
  • Error Detection and Recovery (hardware/time redundancy)

Existing Techniques
• Error Masking
  • Triple Modular Redundancy (TMR): 200% overhead
  • Quadded Logic: 300% overhead
• Error Detection and Recovery
  • Totally Self-Checking (TSC) circuits: 73%-83% hardware
  • Alternating Logic: 100% time + 85% hardware
  • Recomputing with Shifted Operands (RESO): 100% time
  • Watchdog processors

Algorithm-Based Fault Tolerance
• Pros
  • Detects and corrects errors
  • Extremely low overhead
• Cons
  • Not generally applicable (mostly useful for MPP systems)
  • Some error patterns are undetectable when more than one error occurs

Approach
• Encoding of data
• Redesign of the algorithm
• Information must be easy to recover
• Time overhead must be low
• Distribution of computation steps
• All errors can be detected and corrected

Checksum Matrices
• Definitions
  • Column checksum matrix
  • Row checksum matrix
  • Full checksum matrix
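The slide only names the three checksum matrices. As a minimal sketch, assuming the usual construction (a column checksum matrix appends a row of column sums, a row checksum matrix appends a column of row sums, and a full checksum matrix does both), they could be built as follows. The function names are illustrative and are not taken from the presentation.

```python
def column_checksum(a):
    """Return A_c: the matrix a with an extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """Return A_r: the matrix a with an extra column holding each row's sum."""
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    """Return A_f: the column checksum matrix of the row checksum matrix."""
    return column_checksum(row_checksum(a))


A = [[1, 2],
     [3, 4]]
A_c = column_checksum(A)  # [[1, 2], [3, 4], [4, 6]]
A_r = row_checksum(A)     # [[1, 2, 3], [3, 4, 7]]
A_f = full_checksum(A)    # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```

The bottom-right element of A_f (10 here) is the sum of all information elements, so it checks both the checksum row and the checksum column.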

Theorems
(Subscripts c, r, and f denote the column, row, and full checksum matrices.)
• Matrix Multiplication
  • A * B = C ⇒ A_c * B_r = C_f
• LU Decomposition
  • C = L * U ⇒ C_f = L_c * U_r
• Addition
  • A + B = C ⇒ A_f + B_f = C_f
• Scalar Multiplication
  • c * A_f = (c * A)_f
• Transpose
  • (A_f)^T = (A^T)_f
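The identities above can be checked numerically on a small example. The sketch below is only an illustration under the assumed checksum construction (appended sums); the helper names and the 2x2 test matrices are hypothetical, and the LU decomposition case is not covered.

```python
def column_checksum(a):
    """A_c: append a row of column sums."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """A_r: append a column of row sums."""
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    """A_f: append both the row-sum column and the column-sum row."""
    return column_checksum(row_checksum(a))


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise matrix sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    """Multiply every element of a by the scalar c."""
    return [[c * x for x in row] for row in a]


def transpose(a):
    """Matrix transpose."""
    return [list(col) for col in zip(*a)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Multiplication: A_c * B_r equals the full checksum matrix of A * B.
mult_ok = matmul(column_checksum(A), row_checksum(B)) == full_checksum(matmul(A, B))
# Addition: A_f + B_f equals the full checksum matrix of A + B.
add_ok = add(full_checksum(A), full_checksum(B)) == full_checksum(add(A, B))
# Scalar multiplication: c * A_f equals the full checksum matrix of c * A.
scale_ok = scale(3, full_checksum(A)) == full_checksum(scale(3, A))
# Transpose: (A_f)^T equals the full checksum matrix of A^T.
transpose_ok = transpose(full_checksum(A)) == full_checksum(transpose(A))
```

Because the multiplication operands carry their checksums with them, the product arrives already encoded; no separate pass is needed to compute the checksums of C.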

Error Detection and Correction
• Detection
  • Compute the sum (S1) of the information elements in each row/column and compare it to the corresponding checksum (S2)
• Location
  • Intersection of the inconsistent row and column (those with S1 ≠ S2)
• Correction
  • Error in an information element: E = E' + (S2 - S1), where E' is the erroneous value
  • Error in a checksum: replace S2 with S1
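The detection, location, and correction steps can be sketched for a full checksum matrix that contains at most one wrong element. This is an illustrative sketch, not the authors' implementation: the function name is hypothetical, and it assumes integer data so that sums can be compared exactly (floating-point data would need a tolerance).

```python
def correct_single_error(cf):
    """Detect, locate, and correct at most one wrong element of cf in place.

    cf is a full checksum matrix (last row and last column hold checksums).
    Returns the (row, column) that was corrected, or None if cf is consistent.
    """
    last_r, last_c = len(cf) - 1, len(cf[0]) - 1
    # Detection: compare each computed sum (S1) with its checksum (S2).
    bad_rows = [i for i, row in enumerate(cf) if sum(row[:-1]) != row[-1]]
    bad_cols = [j for j in range(last_c + 1)
                if sum(cf[i][j] for i in range(last_r)) != cf[last_r][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single-error pattern; cannot correct")
    # Location: intersection of the inconsistent row and column.
    i, j = bad_rows[0], bad_cols[0]
    if j == last_c:
        # The row checksum itself is wrong: replace S2 with S1.
        cf[i][j] = sum(cf[i][:-1])
    elif i == last_r:
        # The column checksum itself is wrong: replace S2 with S1.
        cf[i][j] = sum(cf[k][j] for k in range(last_r))
    else:
        # An information element is wrong: E = E' + (S2 - S1).
        s1, s2 = sum(cf[i][:-1]), cf[i][-1]
        cf[i][j] += s2 - s1
    return i, j


good = [[19, 22, 41],
        [43, 50, 93],
        [62, 72, 134]]

data_fault = [row[:] for row in good]
data_fault[1][0] = 40                      # corrupt an information element
data_location = correct_single_error(data_fault)

checksum_fault = [row[:] for row in good]
checksum_fault[0][2] = 0                   # corrupt a row checksum
checksum_location = correct_single_error(checksum_fault)
```

In both cases the matrix is restored to its original contents, and the returned location identifies the element that was repaired.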

Mesh Connected Processor Arrays
In a mesh connected system, each processor individually computes one element of the resultant matrix. In a systolic array, an array of processors handles a row of values.
(Figure labels: Array A, Array B)

Overhead for MPP Systems
• Mesh connected arrays
• Systolic arrays

Undetectable Loop Patterns
• Certain patterns of errors mask one another
• Caused by faulty processors
• A minimum number of processors is required to detect the errors
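One way to see how errors can mask each other is a rectangle of four wrong elements whose deviations cancel in every affected row and column. The sketch below constructs such a pattern; it is an assumed example of a masking pattern, not necessarily the exact pattern drawn on the original slide, and the names are hypothetical.

```python
def is_consistent(cf):
    """Return True if every row and column sum of cf matches its checksum."""
    last_r = len(cf) - 1
    rows_ok = all(sum(row[:-1]) == row[-1] for row in cf)
    cols_ok = all(sum(cf[i][j] for i in range(last_r)) == cf[last_r][j]
                  for j in range(len(cf[0])))
    return rows_ok and cols_ok


good = [[19, 22, 41],
        [43, 50, 93],
        [62, 72, 134]]

# Four errors on the corners of a rectangle, with alternating signs.
masked = [row[:] for row in good]
d = 5
masked[0][0] += d
masked[0][1] -= d
masked[1][0] -= d
masked[1][1] += d

undetected = is_consistent(masked) and masked != good
```

Here `undetected` is true: the data is wrong in four places, yet each row and column still sums to its checksum, so the checks report no error.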

Uniprocessor Systems
• In a uniprocessor system, a faulty processor can cause all elements to be incorrect

Conclusion
• Algorithm-based fault tolerance applied to matrix operations
• Low ratio of redundancy
• Ability to detect and correct errors
• Ongoing research
