Algorithm-Based Fault Tolerance for Matrix Operations • Proposed by: Kuang-Hua Huang and Jacob A. Abraham
Problem Description • Achieve fault tolerance that is algorithm-based rather than hardware-based • Existing techniques carry a high overhead cost • Error Masking (hardware redundancy) • Error Detection and Recovery (hardware/time redundancy)
Existing Techniques (overhead) • Error Masking • Triple Modular Redundancy (TMR) – 200% • Quadded Logic – 300% • Error Detection and Recovery • Totally Self-Checking (TSC) circuits – 73%–83% hardware • Alternating Logic – 100% time + 85% hardware • Recomputing with Shifted Operands (RESO) – 100% time • Watchdog processors
Algorithm-Based Fault Tolerance • Pros • Detects and corrects errors • Extremely low overhead • Cons • Not generally applicable (mostly useful for MPP systems) • Some error patterns are undetectable when more than one error occurs
Approach • Encoding of data • Redesign of the algorithm • Information must be easy to recover • Time overhead must be low • Distribution of computation steps • Errors caused by a faulty unit can be detected and corrected
Checksum Matrices • Definitions • Column checksum matrix • Row checksum matrix • Full checksum matrix
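The three definitions above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming matrices are stored as lists of rows; the function names `column_checksum`, `row_checksum`, and `full_checksum` are hypothetical helpers, not names from the original slides.

```python
def column_checksum(a):
    """Return a copy of `a` with one extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """Return a copy of `a` with one extra column holding the sum of each row."""
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    """Return `a` extended with both a checksum column and a checksum row."""
    return column_checksum(row_checksum(a))


A = [[1, 2],
     [3, 4]]

A_c = column_checksum(A)  # [[1, 2], [3, 4], [4, 6]]
A_r = row_checksum(A)     # [[1, 2, 3], [3, 4, 7]]
A_f = full_checksum(A)    # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```

The bottom-right element of the full checksum matrix (10 here) is the sum of all information elements, so it agrees with both the checksum row and the checksum column.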
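Theorems (subscripts $c$, $r$, $f$ denote the column, row, and full checksum matrices) • Matrix Multiplication • If $A B = C$, then $A_c B_r = C_f$ • LU Decomposition • If $C = L U$, then $C_f = L_c U_r$ • Addition • If $A + B = C$, then $A_f + B_f = C_f$ • Scalar Multiplication • $c \cdot A_f = (c \cdot A)_f$ • Transpose • $A_f^T = (A^T)_f$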
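The checksum-preserving identities can be checked numerically on a small case. The sketch below is only a spot check on 2×2 integer matrices, not a proof; the helper names (`matmul`, `column_checksum`, `row_checksum`, `full_checksum`, `transpose`) are hypothetical and defined here so the block stands alone. LU decomposition is omitted to keep the example short.

```python
def matmul(x, y):
    """Plain triple-loop matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(a):
    return [list(col) for col in zip(*a)]


def column_checksum(a):
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    return column_checksum(row_checksum(a))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Multiplication: column checksum times row checksum gives a full checksum product.
C = matmul(A, B)
C_f = matmul(column_checksum(A), row_checksum(B))
mult_ok = C_f == full_checksum(C)

# Addition: the sum of full checksum matrices is the full checksum of the sum.
A_f, B_f = full_checksum(A), full_checksum(B)
sum_f = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(A_f, B_f)]
add_ok = sum_f == full_checksum([[p + q for p, q in zip(ra, rb)] for ra, rb in zip(A, B)])

# Scalar multiplication: scaling commutes with the full checksum encoding.
scalar_ok = [[3 * v for v in row] for row in A_f] == full_checksum([[3 * v for v in row] for row in A])

# Transpose: the transpose of a full checksum matrix is the full checksum of the transpose.
transpose_ok = transpose(A_f) == full_checksum(transpose(A))
```

Because the multiplication runs directly on the encoded operands, the checksums of the result come out of the same computation rather than from a separate pass.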
Error Detection and Correction • Detection • Compute the sum (S1) of the information elements in each row/column and compare it to the corresponding checksum (S2) • Location • The erroneous element lies at the intersection of the inconsistent row and column (those with S1 ≠ S2) • Correction • Error in an information element: E = E’ + (S2 – S1), where E’ is the erroneous value • Error in a checksum: replace S2 with S1
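The detect, locate, and correct steps can be sketched for the single-error case as follows. This is an illustrative sketch, assuming a full checksum matrix of integers stored as a list of rows; `correct_single_error` is a hypothetical name, and with floating-point data the equality tests would need a tolerance.

```python
def correct_single_error(cf):
    """Locate and fix one wrong element of a full checksum matrix, in place.

    Returns the (row, column) of the corrected element, or None if every
    row and column is consistent.
    """
    rows, cols = len(cf), len(cf[0])
    # A row is inconsistent when the sum of its leading elements (S1)
    # differs from its last element, the stored checksum (S2).
    bad_rows = [i for i in range(rows) if sum(cf[i][:-1]) != cf[i][-1]]
    # Same test per column, against the checksum stored in the last row.
    bad_cols = [j for j in range(cols)
                if sum(cf[i][j] for i in range(rows - 1)) != cf[-1][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single correctable error")
    i, j = bad_rows[0], bad_cols[0]   # intersection of the inconsistent row and column
    s1, s2 = sum(cf[i][:-1]), cf[i][-1]
    if j == cols - 1:
        cf[i][j] = s1                 # the checksum itself is wrong: replace S2 with S1
    else:
        cf[i][j] += s2 - s1           # E = E' + (S2 - S1)
    return (i, j)


# Full checksum matrix of [[19, 22], [43, 50]] with one element corrupted (22 -> 25).
damaged = [[19, 25, 41],
           [43, 50, 93],
           [62, 72, 134]]
location = correct_single_error(damaged)

# The same matrix with a corrupted row checksum instead (93 -> 90).
damaged_checksum = [[19, 22, 41],
                    [43, 50, 90],
                    [62, 72, 134]]
checksum_location = correct_single_error(damaged_checksum)
```

In the first case row 0 and column 1 are the only inconsistent lines, so the element at their intersection is restored to 25 + (41 - 44) = 22.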
Mesh Connected Processor Arrays • In a mesh-connected system, each processor individually handles the calculation of one element of the resultant matrix. • In a systolic array, an array of processors handles a row of values. • [Figure labels: Array A, Array B]
Overhead for MPP systems • Mesh Connected Arrays • Systolic Arrays
Undetectable Loop Patterns • Certain patterns of errors mask one another • Caused by faulty processors • Requires a minimum number of processors to detect errors • [Figure: grids with erroneous elements marked X]
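One way such a masking pattern can arise is four errors at the corners of a rectangle whose offsets cancel within every affected row and column. The sketch below illustrates this under assumptions: the matrix values and the offset of 5 are arbitrary, and `inconsistent_lines` is a hypothetical helper, not part of the original slides.

```python
def inconsistent_lines(cf):
    """Return (bad_rows, bad_cols) of a full checksum matrix stored as a list of rows."""
    rows, cols = len(cf), len(cf[0])
    bad_rows = [i for i in range(rows) if sum(cf[i][:-1]) != cf[i][-1]]
    bad_cols = [j for j in range(cols)
                if sum(cf[i][j] for i in range(rows - 1)) != cf[-1][j]]
    return bad_rows, bad_cols


correct = [[19, 22, 41],
           [43, 50, 93],
           [62, 72, 134]]

# A single error is visible: one row and one column become inconsistent.
single = [row[:] for row in correct]
single[0][0] += 5
single_result = inconsistent_lines(single)

# Four errors forming a loop: +5 and -5 alternate around the rectangle,
# so every row sum and column sum is unchanged and the checks all pass.
loop = [row[:] for row in correct]
loop[0][0] += 5
loop[0][1] -= 5
loop[1][0] -= 5
loop[1][1] += 5
loop_result = inconsistent_lines(loop)
```

Here `loop` differs from `correct` in four elements, yet the checks report nothing, which is why multiple errors cannot all be caught by plain sum checksums.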
Uniprocessor Systems • In a uniprocessor system, a faulty processor can cause every element of the result to be incorrect
Conclusion • Algorithm-based fault tolerance applied to matrix operations • Low ratio of redundancy • Ability to detect and correct errors • Ongoing research