Algorithm-Based Fault Tolerance for Matrix Operations

Algorithm-Based Fault Tolerance for Matrix Operations. Proposed by: Kuang-Hua Huang and Jacob A. Abraham. Problem description: achieving a fault-tolerant model that is algorithm-based rather than hardware-based. Existing techniques, such as error masking (hardware redundancy), require a high overhead cost.


Presentation Transcript


  1. Algorithm-Based Fault Tolerance for Matrix Operations
     Proposed by: Kuang-Hua Huang and Jacob A. Abraham

  2. Problem Description
     • Achieving a fault-tolerant model that is algorithm-based rather than hardware-based
     • Existing techniques require high overhead cost
       • Error masking (hardware redundancy)
       • Error detection and recovery (hardware/time redundancy)

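  3. Existing Techniques
     • Error Masking
       • Triple Modular Redundancy (TMR): 200% hardware overhead
       • Quadded Logic: 300% hardware overhead
     • Error Detection and Recovery
       • Totally self-checking (TSC) circuits: 73%-83% hardware overhead
       • Alternating Logic: 100% time overhead plus 85% hardware overhead
       • Recomputing with Shifted Operands (RESO): 100% time overhead
       • Watchdog processors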

  4. Algorithm-Based Fault Tolerance
     • Pros
       • Detects and corrects errors
       • Extremely low overhead
     • Cons
       • Not generally applicable (mostly useful for MPP systems)
       • Some error patterns are undetectable when more than one error occurs

  5. Approach
     • Encoding of data
     • Redesign of the algorithm
     • Information must be easy to recover
     • Time overhead must be low
     • Distribution of computation steps
     • All errors can be detected and corrected

  6. Checksum Matrices
     • Definitions
       • Column checksum matrix
       • Row checksum matrix
       • Full checksum matrix
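
The slide lists the three definitions by name only. As a minimal sketch, assuming the usual meaning (a column checksum matrix appends a row of column sums, a row checksum matrix appends a column of row sums, and a full checksum matrix carries both), they could be built as follows. The function names are illustrative and do not come from the presentation.

```python
def column_checksum(a):
    """Return A_c: the matrix with an extra last row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """Return A_r: the matrix with an extra last column holding each row's sum."""
    return [row + [sum(row)] for row in a]


def full_checksum(a):
    """Return A_f: the column checksum matrix of the row checksum matrix."""
    return column_checksum(row_checksum(a))


A = [[1, 2], [3, 4]]
print(column_checksum(A))  # [[1, 2], [3, 4], [4, 6]]
print(row_checksum(A))     # [[1, 2, 3], [3, 4, 7]]
print(full_checksum(A))    # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```

The bottom-right entry of the full checksum matrix is the sum of all information elements, so it is consistent with both the checksum row and the checksum column.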

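  7. Theorems (subscripts c, r, and f denote the column, row, and full checksum matrices)
     • Matrix Multiplication
       • $A \times B = C \Rightarrow A_c \times B_r = C_f$
     • LU Decomposition
       • $C = L \times U \Rightarrow C_f = L_c \times U_r$
     • Addition
       • $A + B = C \Rightarrow A_f + B_f = C_f$
     • Scalar Multiplication
       • $c \times A_f = (c \times A)_f$
     • Transpose
       • $A_f^T = (A^T)_f$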
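
The checksum-preservation properties on this slide can be checked numerically on a small case. The sketch below uses plain Python lists and integer entries (an assumption made to avoid floating-point rounding). The helper names are hypothetical, and the LU decomposition property is left out for brevity.

```python
def row_checksum(a):
    return [row + [sum(row)] for row in a]


def column_checksum(a):
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def full_checksum(a):
    return column_checksum(row_checksum(a))


def matmul(x, y):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def scale(c, x):
    return [[c * p for p in row] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

checks = {
    # A_c * B_r equals the full checksum matrix of A * B
    "multiplication": matmul(column_checksum(A), row_checksum(B))
    == full_checksum(matmul(A, B)),
    # A_f + B_f equals (A + B)_f
    "addition": add(full_checksum(A), full_checksum(B)) == full_checksum(add(A, B)),
    # c * A_f equals (c * A)_f
    "scalar multiplication": scale(3, full_checksum(A)) == full_checksum(scale(3, A)),
    # (A_f)^T equals (A^T)_f
    "transpose": transpose(full_checksum(A)) == full_checksum(transpose(A)),
}
for name, holds in checks.items():
    print(name, holds)  # each line prints True
```

For multiplication, only the operands are encoded. The checksums of the product come out of the ordinary multiplication, with no separate checksum computation.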

  8. Error Detection and Correction
     • Detection
       • Compute the sum (S1) of the information elements in each row/column and compare it to the corresponding checksum (S2)
     • Location
       • The erroneous element lies at the intersection of the inconsistent row and column (those with S1 ≠ S2)
     • Correction
       • Error in an information element: E = E' + (S2 - S1), where E' is the erroneous value
       • Error in a checksum: replace S2 with S1
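
The detection, location, and correction steps above can be sketched for a full checksum matrix that contains at most one erroneous element. The functions `inconsistent_lines` and `correct_single_error` are hypothetical names. The sketch assumes integer entries and rejects any pattern that does not look like a single error.

```python
def inconsistent_lines(cf):
    """Return (rows, columns) whose computed sum S1 differs from the checksum S2."""
    n, m = len(cf) - 1, len(cf[0]) - 1  # index of the checksum row / checksum column
    bad_rows = [i for i in range(n + 1) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m + 1)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    return bad_rows, bad_cols


def correct_single_error(cf):
    """Fix one erroneous element in place; return its (row, column) or None."""
    bad_rows, bad_cols = inconsistent_lines(cf)
    if not bad_rows and not bad_cols:
        return None  # every row and column is consistent
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single-error pattern")
    i, j = bad_rows[0], bad_cols[0]
    m = len(cf[0]) - 1
    s1, s2 = sum(cf[i][:m]), cf[i][m]
    if j < m:
        cf[i][j] += s2 - s1  # E = E' + (S2 - S1)
    else:
        cf[i][m] = s1        # the checksum itself was wrong: S2 takes the value S1
    return i, j


Cf = [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
Cf[1][0] = 40                    # inject a single error
print(correct_single_error(Cf))  # (1, 0)
print(Cf[1][0])                  # 43
```

The same row-based formula also covers an error in the bottom checksum row, because that row must sum to the bottom-right total.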

  9. Mesh Connected Processor Arrays
     In a mesh-connected system, each processor individually handles a calculation in the resultant matrix. In a systolic array, an array of processors handles a row of values.
     [Figure: Array A and Array B]

  10. Overhead for MPP Systems
     [Figure panels: Mesh Connected Arrays; Systolic Arrays]

  11. Undetectable Loop Patterns
     • Certain patterns of errors mask the errors
     • Caused by faulty processors
     • Requires a minimum number of processors to detect the errors
     [Figure: grids with erroneous elements marked X]
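
One way to see how errors can mask each other is a rectangle of four erroneous elements whose deviations cancel in every row and column. This is an illustrative case chosen for the sketch, with an assumed deviation `d` and a hypothetical helper `inconsistent_lines`. It is not necessarily the exact pattern drawn on the slide.

```python
def inconsistent_lines(cf):
    """Return (rows, columns) whose computed sum differs from the stored checksum."""
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n + 1) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m + 1)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    return bad_rows, bad_cols


Cf = [[19, 22, 41], [43, 50, 93], [62, 72, 134]]  # a correct full checksum matrix

d = 5  # assumed size of each deviation
faulty = [row[:] for row in Cf]
faulty[0][0] += d  # the four corners of a rectangle: every affected row
faulty[0][1] -= d  # and column receives +d once and -d once
faulty[1][0] -= d
faulty[1][1] += d

print(faulty != Cf)                # True: four elements are wrong
print(inconsistent_lines(faulty))  # ([], []): no row or column looks inconsistent
```

Because each affected row and column contains two errors that cancel, every checksum comparison passes and the fault goes unnoticed.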

  12. Uniprocessor Systems
     • In a uniprocessor system, a faulty processor can cause all elements to be incorrect

  13. Conclusion
     • Algorithm-based fault tolerance applied to matrix operations
     • Low ratio of redundancy
     • Ability to detect and correct errors
     • Ongoing research
