Algorithm-Based Fault Tolerance: Matrix Multiplication • Greg Bronevetsky
Problem at Hand • Have matrices A and B • Want to compute their product: AB • Ask a matrix-matrix-multiply (MMM) implementation to compute product • Answer: C • Question: Is C the correct answer? How could we know for sure?
Algorithm-Based Fault Tolerance • Encode input matrices via error-correcting code • Run regular MMM algorithm on encoded matrices • Encoding is invariant under MMM • Naturally outputs encoded matrices • Encoding guarantees: • If up to t errors in output, will detect error • If up to c<t errors in output, can decode correct output matrix
Outline • Linear Error Correcting Codes • Algorithm-Based Fault Tolerance • ABFT = Linear Encoding of Matrices
Error Correcting Codes • Map $f: \Sigma^k \to \Sigma^n$ • k-long data words → n-long codewords • We use $\Sigma = \{0, 1\}$ • Code of length n is a “sparse” subset of $\Sigma^n$ • Very few possible words are valid codewords • Rate of code $R = k/n$: amount of information communicated by each codeword
Minimum Distance • Minimum distance: $d_{min} = \min_{x \neq y \in C} d(x, y)$, where $d(\cdot,\cdot)$ = Hamming distance • Hamming distance: number of spots where words differ • Measures difficulty of decoding/correcting corrupted codewords
Detection and Correction • Code may detect errors in fewer than $d_{min}$ spots • No such error can morph one codeword into another • May correct errors in up to $\lfloor (d_{min}-1)/2 \rfloor$ spots • Can still find “closest” codeword • More details later… • Each codeword defines circle around itself of radius $d_{min}/2$
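The distance definitions above can be tried on a toy code. This is a minimal sketch, not taken from the slides: the 3-fold repetition code and the function names are assumptions chosen only for illustration.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))

def nearest_codeword(word, code):
    """Decode by picking the codeword closest to the received word."""
    return min(code, key=lambda c: hamming_distance(word, c))

# Hypothetical toy code (n=3, k=1): each data bit is repeated three times.
REPETITION = [(0, 0, 0), (1, 1, 1)]

d_min = minimum_distance(REPETITION)
detectable = d_min - 1          # errors in fewer than d_min spots
correctable = (d_min - 1) // 2  # errors in up to floor((d_min-1)/2) spots
decoded = nearest_codeword((1, 0, 1), REPETITION)
```

With this code a single flipped bit still leaves the received word closer to the original codeword than to any other, which is the "circle of radius $d_{min}/2$" picture.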
Linear Codes • Codewords form a linear subspace inside $\Sigma^n$ • Codewords lie in the rowspace of the generator matrix G • Example: a (n=7, k=3) code
Property 1 • Linear combination of any codewords is also a codeword: for any $x, y \in C$, $(x+y) \in C$ • Codeword*constant is a codeword: for any $z \in C$ and constant $a$, $a \cdot z \in C$ • <0,0…0> is always a codeword • Proof: basic properties of linear spaces
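Property 2 • Minimum distance of linear code = minimum weight of a nonzero codeword: $d_{min} = \min_{x \in C,\, x \neq 0} weight(x)$ • Where weight(x) = number of nonzero entries of x • Proof: $d(x, y) = weight(x - y)$, and $x - y$ is itself a codeword by Property 1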
Parity Check Matrix • H: dual matrix to G • Rows contain a basis of the space orthogonal to G’s row space • (n-k)-dimensional space • H is (n-k)×n • Space defined as: $\{y \in \Sigma^n : Gy = 0\}$, so $Hx = 0$ for every codeword x • Note: H also defines a linear code
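Property 3 • $d_{min}$ = min # of columns of H that can sum to 0 • Proof: $Hx = 0$ for every codeword x, so the columns of H selected by the 1’s of x sum to 0; the fewest such columns equals the minimum weight of a nonzero codeword, which is $d_{min}$ by Property 2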
Property 4 • Minimum distance of linear code $\le n-k+1$ • Proof • Total n dimensions (since codewords are n-vectors) • G’s rowspace rank = k • Thus, H’s column space rank = n-k • Thus, any n-k+1 columns of H are linearly dependent • Some nonempty subset of them adds up to 0 • By Property 3, $d_{min} \le n-k+1$
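Properties 1, 2 and 4 can be checked by brute force on a small code. The sketch below is an illustration under assumptions: the generator matrix `G` is a hypothetical (n=7, k=3) example and is not claimed to be the one shown on the original slide.

```python
from itertools import product

# Hypothetical (n=7, k=3) generator matrix over {0, 1}; rows span the code.
G = [
    (1, 0, 1, 0, 1, 0, 1),
    (0, 1, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]

def encode(message, G):
    """Codeword = message * G, with all arithmetic mod 2."""
    n = len(G[0])
    return tuple(sum(m * row[j] for m, row in zip(message, G)) % 2
                 for j in range(n))

def all_codewords(G):
    """Every codeword, obtained by encoding each of the 2^k messages."""
    return [encode(m, G) for m in product((0, 1), repeat=len(G))]

def min_distance_linear(G):
    """Property 2: d_min is the smallest weight of a nonzero codeword."""
    return min(sum(c) for c in all_codewords(G) if any(c))

def is_closed_under_addition(G):
    """Property 1: the mod-2 sum of any two codewords is a codeword."""
    words = set(all_codewords(G))
    return all(tuple((a + b) % 2 for a, b in zip(x, y)) in words
               for x in words for y in words)

n, k = len(G[0]), len(G)
d_min = min_distance_linear(G)
singleton_bound = n - k + 1  # Property 4: d_min <= n - k + 1
```

Enumerating all $2^k$ codewords is only practical for small k; it is used here because it mirrors the definitions directly.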
Encoding a Matrix • Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984 • Encode each row of matrix via extra column • Column entries = sums of matrix rows
Encoding a Matrix • Encode each column of matrix via extra row • Row entries = sums of matrix columns • Full Encoding:
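The two encodings and their combination can be sketched in a few lines. This is an illustrative sketch, assuming matrices are stored as lists of rows; the function names are not from the slides.

```python
def row_encode(M):
    """Append a checksum column: each row gains the sum of its entries."""
    return [list(row) + [sum(row)] for row in M]

def column_encode(M):
    """Append a checksum row: entry j is the sum of column j."""
    return [list(row) for row in M] + [[sum(col) for col in zip(*M)]]

def full_encode(M):
    """Apply both encodings; the corner entry is the sum of all entries."""
    return column_encode(row_encode(M))

A = [[1, 2],
     [3, 4]]

A_row = row_encode(A)      # [[1, 2, 3], [3, 4, 7]]
A_col = column_encode(A)   # [[1, 2], [3, 4], [4, 6]]
A_full = full_encode(A)    # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```

Encoding rows first and columns second gives the same result as the other order, since the corner entry is the total of all data entries either way.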
Detecting Errors • Suppose matrix A is corrupted to matrix $\hat{A}$ • Entry $\hat{a}_{i,j}$ is wrong • Checksums of row i and column j no longer match • Can detect error’s exact position: <i,j>
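Correcting Errors • Can correct error using row or col checksum • Correct value of entry <i,j> = checksum of row i minus the sum of the other entries in row i (or the same with column j)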
Big Trick: Preservation of Encoding • Column-encoded matrix * Row-encoded matrix = Fully-encoded matrix • Can check MMM computation by checking encoding of output • If product matrix has an erroneous entry • Can detect • Can correct
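The preservation property, and the detect/locate/correct steps from the preceding slides, can be sketched as follows. This is a minimal illustration under assumptions: integer entries (floating-point data would need a tolerance instead of exact comparison), a single wrong data entry, and helper names that are not from the slides.

```python
def matmul(X, Y):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def row_encode(M):
    """Append a checksum column holding each row's sum."""
    return [list(row) + [sum(row)] for row in M]

def column_encode(M):
    """Append a checksum row holding each column's sum."""
    return [list(row) for row in M] + [[sum(col) for col in zip(*M)]]

def failing_checks(C):
    """Data rows and data columns of a fully encoded C whose checksum fails."""
    rows = [i for i, row in enumerate(C[:-1]) if sum(row[:-1]) != row[-1]]
    cols = [j for j, col in enumerate(list(zip(*C))[:-1])
            if sum(col[:-1]) != col[-1]]
    return rows, cols

def correct_single_error(C):
    """Repair one wrong data entry, located by the failing row and column."""
    rows, cols = failing_checks(C)
    if not rows and not cols:
        return [list(r) for r in C]
    if len(rows) != 1 or len(cols) != 1:
        raise ValueError("not a single data-entry error")
    i, j = rows[0], cols[0]
    fixed = [list(r) for r in C]
    # Row checksum minus the other data entries of row i.
    fixed[i][j] = C[i][-1] - (sum(C[i][:-1]) - C[i][j])
    return fixed

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(column_encode(A), row_encode(B))  # comes out fully encoded

corrupted = [list(r) for r in C]
corrupted[0][1] = 99                         # inject one wrong entry
located = failing_checks(corrupted)
repaired = correct_single_error(corrupted)
```

An error in a checksum entry itself makes only one of the two checks fail, so this sketch reports it rather than trying to repair it.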
Applications • Matrix Multiplication • Given encoded A and B, • Check whether MMM result C (?=AB) has valid encoding • Matrix Factorization • Given a factorization A=WZ • Verify correctness by verifying encodings of factors • Factors row- OR column-encoded • Can only detect, not correct errors
Weighted ABFT • Oftentimes need to check row- or column-encoded matrices • Ex: factorization, data integrity check • Can only detect errors in such matrices • Can we also correct? • Yes, by generalizing to weighted checking rows/columns
Weighting • Suppose we have d n-vectors w1…wd • Can column-encode matrix A by appending the d checksum rows $w_1^T A, \dots, w_d^T A$ • Let’s try out:
Weighted Error Correction • Weighted encoding Detects and Corrects single errors • Even for non full-encoding
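A weighted column encoding can locate and fix an error without any row checksums. The sketch below is an assumption-laden illustration: it uses d=2 with the weight vectors <1,1,…,1> and <1,2,…,n>, which are a common choice but are not confirmed to be the ones tried on the slide, and it assumes integer entries with at most one wrong data entry per column.

```python
def weighted_column_encode(M, weights):
    """Append one checksum row per weight vector w: entry j = sum_i w[i]*M[i][j]."""
    data = [list(row) for row in M]
    n_rows, n_cols = len(data), len(data[0])
    checks = [[sum(w[i] * data[i][j] for i in range(n_rows))
               for j in range(n_cols)] for w in weights]
    return data + checks

def correct_columns(E, n_rows):
    """Fix at most one wrong data entry per column.

    Assumes E was built with weights [1,...,1] and [1,2,...,n_rows].
    """
    fixed = [list(r) for r in E]
    for j in range(len(E[0])):
        col = [E[i][j] for i in range(n_rows)]
        s1 = sum(col) - E[n_rows][j]                      # equals delta
        s2 = sum((i + 1) * v for i, v in enumerate(col)) - E[n_rows + 1][j]
        if s1 == 0 and s2 == 0:
            continue                                      # column is consistent
        if s1 == 0 or s2 % s1 != 0 or not 1 <= s2 // s1 <= n_rows:
            raise ValueError("column %d: not a single data-entry error" % j)
        i = s2 // s1 - 1                                  # s2 = (i+1) * delta
        fixed[i][j] -= s1
    return fixed

A = [[1, 2],
     [3, 4],
     [5, 6]]
W = [[1, 1, 1], [1, 2, 3]]          # hypothetical weight vectors
E = weighted_column_encode(A, W)

corrupted = [list(r) for r in E]
corrupted[1][0] = 10                 # inject an error of size 7 in row 1, column 0
repaired = correct_columns(corrupted, n_rows=3)
```

The first checksum gives the size of the error and the ratio of the two checksums' discrepancies gives its row, which is why two weight vectors suffice for a single error.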
“Surprise” • But this is all just a linear code! • Generator matrix for above scheme:
Generating Encodings • Given $m = \langle a_{i,1}, a_{i,2}, \dots, a_{i,k} \rangle$ as message word (or matrix row/column) • Its encoding is the codeword $mG$
Surprise?? • Not too surprising really • Why else would MMM preserve encoding? • Another possibility: • Efficient: can be implemented via bit shifts • Room open for using any linear code!
Error Detection/Correction in General • To show for linear codes: • Can detect fewer than $d_{min}$ errors • Can correct up to $(d_{min}-1)/2$ errors • Let $x$ be the original codeword • Let $\hat{x} = x + e$ be the corrupted codeword • e: error vector
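Error Detection in General • $s = H\hat{x} = H(x + e) = Hx + He = He$, where $\hat{x} = x + e$ is the corrupted codeword • s called the “syndrome vector” • Independent of original codeword • Note: weight(e) < $d_{min}$ since < $d_{min}$ errors • Thus: $He \neq 0$ for $e \neq 0$, since fewer than $d_{min}$ columns of H cannot sum to 0 (Property 3) • Detection: if $s \neq 0$, then ERROR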
Error Correction in General • Clearly e is the correction vector • $\hat{x} - e$ corrects the error in the corrupted codeword $\hat{x}$ • Sufficient to prove: for weight(e) $\le (d_{min}-1)/2$, H is an isomorphism: correction vectors ↔ syndrome vectors • i.e. for each correction vector (want to know) there is a unique syndrome vector • Thus, possible to correct any such error • May not be efficient
H is Onto • weight(e) $\le (d_{min}-1)/2 < d_{min}$ • rank(H) = n-k $\ge (d_{min}-1)/2$ • Thus, rank(H) $\ge$ weight(e) and $He \neq 0$ • Not enough 1’s in e to sum H’s columns to 0 • H maps onto its range • Thus,
H is 1-1 • Let e1 and e2 be correction vectors, $e_1 \neq e_2$ • Suppose that: • weight(e1), weight(e2) $\le (d_{min}-1)/2$ • He1 = He2 = s • He1-He2 = H(e1-e2) = s-s = 0 • And so, (e1-e2) is a nonzero codeword • Thus, weight(e1-e2) $\ge d_{min}$ • But weight(e1), weight(e2) $\le (d_{min}-1)/2$ and so weight(e1-e2) $\le d_{min}-1$ • Contradiction! ⇒ e1 = e2
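The one-to-one map between low-weight correction vectors and syndromes can be turned into a lookup table. The following is a sketch under assumptions: the parity check matrix `H` is a hypothetical (n=7, k=4) example with $d_{min}=3$, so it corrects single errors; it is not a matrix from the slides.

```python
from itertools import combinations

# Hypothetical parity check matrix over {0, 1}: column c is the number c+1
# written in binary (least significant bit in the first row).
H = [
    (1, 0, 1, 0, 1, 0, 1),
    (0, 1, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]

def syndrome(H, word):
    """s = H * word, with arithmetic mod 2."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)

def build_syndrome_table(H, max_weight):
    """Map each syndrome to the unique error vector of weight <= max_weight."""
    n = len(H[0])
    table = {}
    for weight in range(max_weight + 1):
        for positions in combinations(range(n), weight):
            e = tuple(1 if i in positions else 0 for i in range(n))
            s = syndrome(H, e)
            if s in table:
                # Would contradict the 1-1 argument: max_weight is too large.
                raise ValueError("two correction vectors share a syndrome")
            table[s] = e
    return table

def correct(H, table, received):
    """Look up the correction vector for the syndrome and remove it."""
    e = table[syndrome(H, received)]
    return tuple((r + x) % 2 for r, x in zip(received, e))

table = build_syndrome_table(H, max_weight=1)   # (d_min - 1) // 2 = 1

codeword = (1, 1, 1, 0, 0, 0, 0)                # H * codeword = 0
received = (1, 1, 1, 0, 1, 0, 0)                # one flipped bit at index 4
s = syndrome(H, received)
recovered = correct(H, table, received)
```

The table has one entry per correctable error, so its size grows quickly with n and the number of correctable errors, which matches the remark that correction "may not be efficient".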
Other Encoding Schemes • Linear codes are preserved by matrix multiplication • Presumably, fancier codes might be preserved by fancier computations • Limit: • S. Winograd showed in 1962 that any code s.t. f(xy) = f(x) f(y) has rate (k/n) or minimum weight → 0 as k → ∞ • How general can we get? • Do good solutions exist for small k? • k=64 bits should be good enough
Summary • For Matrix Multiplication can encode input via linear codes • Solutions exist for more complex codes • Ex: Fourier Transforms • On parallel systems must ensure: • No processor touches >1 element per row/column • Else, if one processor fails, encoding overwhelmed with errors • To ensure this must modify algorithm • Separate check placement theory