Algorithm-Based Fault Tolerance Matrix Multiplication

Algorithm-Based Fault Tolerance Matrix Multiplication Greg Bronevetsky

Problem at Hand • Have matrices A and B • Want to compute their product: AB • Ask a matrix-matrix-multiply (MMM) implementation to compute product • Answer: C • Question: Is C the correct answer? How could we know for sure?

Algorithm-Based Fault Tolerance • Encode input matrices via error-correcting code • Run regular MMM algorithm on encoded matrices • Encoding invariant under MMM • Naturally outputs encoded matrices • Encoding guarantees: • If up to t errors in output, will detect error • If up to c<t errors in output, can decode correct output matrix

Outline • Linear Error Correcting Codes • ABFT = Linear Encoding of Matrices • Algorithm-Based Fault Tolerance

Error Correcting Codes • Map f: Σ^k → Σ^n • k-long data words → n-long codewords • We use Σ = {0, 1} • Code of length n is a “sparse” subset of Σ^n • Very few possible words are valid codewords • Rate of code: k/n • Amount of information communicated by each codeword

Minimum Distance • Minimum distance: dmin = min d(x, y) over all pairs of distinct codewords x ≠ y • d() = Hamming distance • Hamming distance: number of spots where words differ • Measures difficulty of decoding/correcting corrupted codewords
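The two definitions on this slide can be checked directly on a tiny code. The sketch below is illustrative only: the function names and the length-3 repetition code are assumptions chosen for the example, not part of the original slides.

```python
from itertools import combinations


def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(x, y))


def minimum_distance(code):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))


# Hypothetical example: the binary repetition code with k=1, n=3.
repetition_code = [(0, 0, 0), (1, 1, 1)]
d_min = minimum_distance(repetition_code)
```

The brute-force search compares every pair of codewords, so it is only practical for very small codes.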

Detection and Correction • Code may detect errors in up to dmin-1 spots • No such error can morph one codeword into another • May correct errors in up to (dmin-1)/2 spots (rounded down) • Can still find “closest” codeword • More details later… • Each codeword defines circle around itself of radius dmin/2

Linear Codes • Codewords form linear subspace inside Σ^n • In rowspace of generator matrix G, e.g. a 3×7 matrix G for a (n=7, k=3) code

Property 1 • Linear combination of any codewords is also a codeword: for any x, y ∈ C, (x+y) ∈ C • Codeword*constant is a codeword: for any z ∈ C and constant a, a*z ∈ C • <0,0…0> always a codeword • Proof: basic properties of linear spaces

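Property 2 • Minimum distance of linear code = min weight(x) over nonzero codewords x ∈ C • Where weight(x) = number of nonzero entries of x = d(x, 0) • Proof: d(x, y) = weight(x-y), and x-y is itself a codeword by Property 1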
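Properties 1 and 2 can be verified by enumeration for a small binary linear code. In this sketch the generator matrix `G` is a hypothetical (n=6, k=3) example in the form [I | P]; it is not the (n=7, k=3) matrix from the slides, and the helper names are assumptions.

```python
from itertools import combinations, product

# Hypothetical generator matrix G = [I | P] of a binary (n=6, k=3) code.
G = [
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
]


def encode(message, generator):
    """Codeword = message * G, with arithmetic over GF(2)."""
    n = len(generator[0])
    return tuple(
        sum(m * row[j] for m, row in zip(message, generator)) % 2
        for j in range(n)
    )


def add(x, y):
    """Componentwise sum of two words over GF(2)."""
    return tuple((a + b) % 2 for a, b in zip(x, y))


def weight(x):
    """Number of nonzero entries."""
    return sum(1 for a in x if a != 0)


# All 2^k codewords: the row space of G.
codewords = {encode(m, G) for m in product((0, 1), repeat=len(G))}

# Property 1: the sum of any two codewords is again a codeword.
closed = all(add(x, y) in codewords for x in codewords for y in codewords)

# Property 2: minimum distance equals minimum weight of a nonzero codeword.
min_weight = min(weight(c) for c in codewords if any(c))
min_dist = min(weight(add(x, y)) for x, y in combinations(codewords, 2))
```

Over GF(2) addition and subtraction coincide, which is why `weight(add(x, y))` serves as the Hamming distance here.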

Parity Check Matrix • H: dual matrix to G • Contains basis of space orthogonal to G’s row space • (n-k)-dimensional space • H is (n-k)×n • Codewords satisfy: Hx = 0 for every x ∈ C • Note: H also defines a linear code

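Property 3 • dmin = min # of columns of H that can sum to 0 • Proof: for a binary codeword x, Hx = 0 says that the columns of H at x’s nonzero positions sum to 0, so the lightest nonzero codeword picks out the smallest such set of columns; by Property 2 its weight is dmin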

Property 4 • Minimum distance of linear code ≤ n-k+1 • Proof • Total n dimensions (since codewords are n-vectors) • G’s rowspace rank = k • Thus, H’s column space rank = n-k • Thus, any n-k+1 columns will be linearly dependent • Some of them add up to 0 • By Property 3, this is ≥ dmin
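Properties 3 and 4 can be illustrated by searching a parity-check matrix for the smallest set of columns that sums to zero. The matrix `H` below is a hypothetical example for a binary (n=6, k=3) code, and the function name is an assumption; the search is exhaustive and only meant for small cases.

```python
from itertools import combinations

# Hypothetical parity-check matrix H = [P^T | I] of a binary (n=6, k=3) code.
H = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]


def min_columns_summing_to_zero(parity):
    """Size of the smallest nonempty set of columns whose GF(2) sum is 0."""
    rows, n = len(parity), len(parity[0])
    columns = [tuple(parity[r][j] for r in range(rows)) for j in range(n)]
    for size in range(1, n + 1):
        for subset in combinations(columns, size):
            if all(sum(col[r] for col in subset) % 2 == 0 for r in range(rows)):
                return size
    return None  # no dependent set: only possible for the trivial code


n = len(H[0])
k = n - len(H)                      # H has n-k rows
d_min = min_columns_summing_to_zero(H)
singleton_bound = n - k + 1         # Property 4: d_min <= n-k+1
```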

Encoding a Matrix • Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984 • Encode each row of matrix via extra column • Column entries = sums of matrix rows

Encoding a Matrix • Encode each column of matrix via extra row • Row entries = sums of matrix columns • Full Encoding: add both the extra column and the extra row; the corner entry is the sum of all matrix entries
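A minimal sketch of the three checksum encodings described on the two "Encoding a Matrix" slides, using plain lists of integers. The function names are assumptions made for this example.

```python
def encode_rows(matrix):
    """Row encoding: append a column whose entries are the row sums."""
    return [list(row) + [sum(row)] for row in matrix]


def encode_columns(matrix):
    """Column encoding: append a row whose entries are the column sums."""
    return [list(row) for row in matrix] + [[sum(col) for col in zip(*matrix)]]


def encode_full(matrix):
    """Full encoding: checksum column plus checksum row.

    The bottom-right corner ends up holding the sum of all data entries.
    """
    return encode_columns(encode_rows(matrix))


A = [[1, 2],
     [3, 4]]
A_full = encode_full(A)
```

For the 2×2 matrix above, the fully encoded matrix is 3×3 with corner entry 1+2+3+4.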

Detecting Errors • Suppose matrix A is corrupted to matrix Â • Entry âi,j is wrong • Row i and column j no longer match their checksums • Can detect error’s exact position: <i,j>

Correcting Errors • Can correct error using row or col checksum
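Locating and repairing a single wrong entry in a fully encoded matrix can be sketched as follows. The code assumes integer entries (so checksums compare exactly), assumes the last column and last row hold the checksums, and assumes the corrupted entry is a data entry rather than a checksum; all function names are hypothetical.

```python
def locate_single_error(full):
    """Return (i, j) of the single inconsistent data entry, or None if clean."""
    rows, cols = len(full) - 1, len(full[0]) - 1
    bad_rows = [i for i in range(rows)
                if sum(full[i][:cols]) != full[i][cols]]
    bad_cols = [j for j in range(cols)
                if sum(full[i][j] for i in range(rows)) != full[rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("not a single corrupted data entry")


def correct_single_error(full):
    """Rebuild the bad entry from its row checksum; returns a new matrix."""
    position = locate_single_error(full)
    fixed = [list(row) for row in full]
    if position is None:
        return fixed
    i, j = position
    cols = len(full[0]) - 1
    others = sum(v for c, v in enumerate(full[i][:cols]) if c != j)
    fixed[i][j] = full[i][cols] - others
    return fixed


good = [[1, 2, 3],
        [3, 4, 7],
        [4, 6, 10]]
corrupted = [list(row) for row in good]
corrupted[1][0] = 9          # inject one error at position <1,0>
```

With floating-point data the equality tests would need a tolerance, which this sketch does not attempt.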

Big Trick: Preservation of Encoding • Column-encoded mtx * Row-encoded mtx = Fully-encoded mtx • Can check MMM computation by checking encoding of output • If product matrix has an erroneous entry • Can detect • Can correct
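The preservation property can be checked numerically: multiplying a column-encoded A (extra row of column sums) by a row-encoded B (extra column of row sums) should give the fully encoded product AB. The sketch below uses a naive triple-loop multiply as a stand-in for the MMM implementation; all names and the example matrices are assumptions.

```python
def matmul(X, Y):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def encode_rows(M):
    """Append a checksum column of row sums."""
    return [list(row) + [sum(row)] for row in M]


def encode_columns(M):
    """Append a checksum row of column sums."""
    return [list(row) for row in M] + [[sum(col) for col in zip(*M)]]


def has_valid_full_encoding(F):
    """True if every row and column of F matches its checksum entry."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in F)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*F))
    return rows_ok and cols_ok


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

C_full = matmul(encode_columns(A), encode_rows(B))   # (2+1) x (2+1) result
expected = encode_columns(encode_rows(matmul(A, B)))
```

Checking `has_valid_full_encoding(C_full)` costs O(n²) additions, compared with O(n³) for recomputing the product.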

Applications • Matrix Multiplication • Given encoded A and B, • Check whether MMM result C (?=AB) has valid encoding • Matrix Factorization • Given a factorization A=WZ • Verify correctness by verifying encodings of factors • Factors row- OR column-encoded • Can only detect, not correct errors

Weighted ABFT • Oftentimes need to check row- or column-encoded matrices • Ex: factorization, data integrity check • Can only detect errors in such matrices • Can we also correct? • Yes, by generalizing to weighted checking rows/columns

Weighting • Suppose we have d n-vectors w1…wd • Can column-encode matrix A: append the d extra rows w1A, …, wdA • Let’s try out:

Weighted Error Detection

Weighted Error Correction • Weighted encoding detects and corrects single errors • Even for non-full encoding
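A sketch of single-error correction with two weighted checksums on one column. The weights are an assumption for illustration, since the slide's own choice is not preserved in the transcript: w1 = <1, 1, …, 1> and w2 = <1, 2, …, n>. If one data entry at (1-based) position p is off by δ, the two checksum mismatches are δ and p·δ, so their ratio gives the position. The code assumes integer data and that the checksums themselves are intact.

```python
def encode_weighted(column):
    """Append two checksums: plain sum (w1) and position-weighted sum (w2)."""
    c1 = sum(column)
    c2 = sum((i + 1) * v for i, v in enumerate(column))
    return list(column) + [c1, c2]


def correct_weighted(encoded):
    """Detect and fix a single wrong data entry; returns a new list."""
    data, c1, c2 = list(encoded[:-2]), encoded[-2], encoded[-1]
    s1 = sum(data) - c1                                      # = delta
    s2 = sum((i + 1) * v for i, v in enumerate(data)) - c2   # = p * delta
    if s1 == 0 and s2 == 0:
        return data + [c1, c2]                               # consistent
    if s1 == 0 or s2 % s1 != 0 or not 1 <= s2 // s1 <= len(data):
        raise ValueError("not a single corrupted data entry")
    position = s2 // s1 - 1                                  # 0-based index
    data[position] -= s1
    return data + [c1, c2]


clean = encode_weighted([4, 7, 1])
corrupted = list(clean)
corrupted[1] = 10            # inject an error of +3 at index 1
```

No row checksum is needed here, which is the point of the weighted scheme: a column-encoded matrix alone is enough to correct.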

“Surprise” • But this is all just a linear code! • Generator matrix for above scheme:

Generating Encodings • Given m=<ai,1, ai,2, …, ai,k> as message word (or matrix row/column)
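One way to see the checksum scheme as a linear code is to write its generator matrix explicitly. The construction below, G = [I | w1ᵀ | … | wdᵀ], is an assumed reconstruction (the matrix on the slide is not preserved in the transcript), with arithmetic over the integers and hypothetical function names.

```python
def checksum_generator(k, weight_vectors):
    """Build the k x (k+d) matrix G = [I_k | w_1^T | ... | w_d^T]."""
    return [
        [1 if i == j else 0 for j in range(k)] + [w[i] for w in weight_vectors]
        for i in range(k)
    ]


def encode(message, generator):
    """Codeword = message * G (ordinary integer arithmetic)."""
    n = len(generator[0])
    return [sum(m * row[j] for m, row in zip(message, generator))
            for j in range(n)]


# Assumed weights: all-ones and 1..k, matching a plain and a weighted checksum.
G = checksum_generator(3, [[1, 1, 1], [1, 2, 3]])
codeword = encode([4, 7, 1], G)
```

The identity block copies the message through unchanged, and each weight vector contributes one checksum entry.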

Surprise?? • Not too surprising really • Why else would MMM preserve encoding? • Another possibility: • Efficient: can be implemented via bit shifts • Room open for using any linear code!

Error Detection/Correction in General • To show for linear codes: • Can detect fewer than dmin errors • Can correct up to (dmin-1)/2 errors • Let x be the original codeword • Let x̂ = x + e be the corrupted codeword • e: error vector

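Error Detection in General • s = Hx̂ = H(x + e) = Hx + He = He, where x is the original codeword and x̂ = x + e the corrupted one • s called the “syndrome vector” • Independent of original codeword • Note: 0 < weight(e) < dmin since < dmin errors • Thus: e is not a codeword, so He ≠ 0 • Detection: if s ≠ 0, then ERROR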
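The key fact on this slide, that the syndrome depends only on the error vector and not on the codeword, can be observed on a small example. `H` is a hypothetical parity-check matrix for a binary (n=6, k=3) code with dmin = 3; it is not taken from the slides.

```python
# Hypothetical parity-check matrix of a binary (n=6, k=3) code.
H = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]


def syndrome(word, parity):
    """s = H * word over GF(2)."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in parity)


def add(x, y):
    """Componentwise sum over GF(2)."""
    return tuple((a + b) % 2 for a, b in zip(x, y))


codeword = (1, 0, 0, 1, 1, 0)      # a valid codeword: its syndrome is 0
error = (0, 0, 1, 0, 0, 0)         # one flipped bit, weight < dmin
corrupted = add(codeword, error)

s = syndrome(corrupted, H)
error_detected = any(s)            # nonzero syndrome means ERROR
```

Here `s` equals the column of `H` at the flipped position, which is exactly `syndrome(error, H)`.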

Error Correction in General • Clearly e is correction vector • x̂ - e corrects the error in the corrupted word x̂ • Sufficient to prove: weight(e) ≤ (dmin-1)/2 ⇒ H is isomorphism: correction vectors ↔ syndrome vectors • i.e. for each correction vector (want to know) ∃ unique syndrome vector • Thus, possible to correct any such error • May not be efficient

H is Onto • weight(e) ≤ (dmin-1)/2 < dmin • rank(H) = n-k ≥ (dmin-1)/2 • Thus, rank(H) ≥ weight(e) and He ≠ 0 • Not enough 1’s in e to sum H’s columns to 0 • H maps onto its range

H is 1-1 • Let e1 and e2 be correction vectors, e1 ≠ e2 • Suppose that: • weight(e1), weight(e2) ≤ (dmin-1)/2 • He1 = He2 = s • He1-He2 = H(e1-e2) = s-s = 0 • And so, (e1-e2) is a nonzero codeword • Thus, weight(e1-e2) ≥ dmin • But weight(e1), weight(e2) ≤ (dmin-1)/2 and so weight(e1-e2) ≤ dmin-1 • Contradiction! So e1 = e2
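The one-to-one correspondence between low-weight correction vectors and syndromes is what makes table-based decoding possible. The sketch below builds that table by brute force for a hypothetical binary (n=6, k=3) code with dmin = 3, so it covers errors of weight at most (dmin-1)/2 = 1. The matrix and all names are assumptions, and as the earlier slide notes, this approach is not efficient for large codes.

```python
from itertools import combinations

# Hypothetical parity-check matrix of a binary (n=6, k=3) code, dmin = 3.
H = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]


def syndrome(word, parity):
    """s = H * word over GF(2)."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in parity)


def error_vectors(n, max_weight):
    """All binary n-vectors with at most max_weight ones."""
    for w in range(max_weight + 1):
        for positions in combinations(range(n), w):
            yield tuple(1 if i in positions else 0 for i in range(n))


def build_syndrome_table(parity, max_weight):
    """Map syndrome -> correction vector; fails if the map is not 1-1."""
    table = {}
    for e in error_vectors(len(parity[0]), max_weight):
        s = syndrome(e, parity)
        if s in table:
            raise ValueError("two correction vectors share a syndrome")
        table[s] = e
    return table


def correct(word, parity, table):
    """Subtract the correction vector that matches the word's syndrome."""
    e = table.get(syndrome(word, parity))
    if e is None:
        raise ValueError("more errors than the table covers")
    return tuple((w - b) % 2 for w, b in zip(word, e))


table = build_syndrome_table(H, max_weight=1)
received = (1, 0, 1, 1, 1, 0)      # codeword 100110 with bit 2 flipped
decoded = correct(received, H, table)
```

Building the table with `max_weight=2` for this code raises the `ValueError`, which matches the slide's argument: uniqueness is only guaranteed up to weight (dmin-1)/2.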

Other Encoding Schemes • Linear codes preserved by matrix multiplication • Presumably, fancier codes might be preserved by fancier computations • Limit: • S. Winograd showed in 1962 that any code s.t. f(xy) = f(x) f(y) has rate (k/n) or minimum weight → 0 as k → ∞ • How general can we get? • Do good solutions exist for small k? • k=64 bits should be good enough

Summary • For Matrix Multiplication can encode input via linear codes • Solutions exist for more complex codes • Ex: Fourier Transforms • On parallel systems must ensure: • No processor touches >1 element per row/column • Else, if one processor fails, encoding overwhelmed with errors • To ensure this must modify algorithm • Separate check placement theory
