ECE 753: FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering
Low-Level Fault Tolerance: ECC

Overview
• Introduction
• Motivation and Background
• Hamming Codes – by example
• SEC-DED Codes – Algebraic method
• SEC-DED Codes – Hardware
• SEC-DED-SBD Codes
• Cyclic Codes – (time permitting)
• Summary

Introduction
• References
  • Chapter 3 of Koren and Krishna
  • Appendix A of the book [siew:92], also included in the set of reading material
  • Following references
    • Reddy – “A class of linear codes …” IEEETC, May 1978
    • Any book on coding theory

Motivation and Background
• Memories are an integral part of digital systems (computers)
• The majority of chip and/or board area is taken up by memories
• Hence, reliability improvement methods must pay attention to memories (RAMs, ROMs, etc.)

Motivation and Background (contd.)
• Types of faults prevalent in memories
  • During manufacturing
    • Stuck-at
    • Timing faults
    • Coupling and pattern sensitive faults
  • During operation
    • Cell failures due to life, stress – same as stuck-at
    • Alpha particle hits – cell content change
      • Sensitive to system location. Higher hit rates at altitude and in flight
      • Need non-testing based solutions
    • Random failures – bit/nibble/byte/card failures

Motivation and Background (contd.)
• Theoretical Foundation
  • Linear and modern algebra
  • Concept of groups, fields, and vector spaces
  • We will focus on binary codes but will have to include polynomial algebra
• Theory – Informal definitions and results
  • Vector: a collection of bits represented as a string
  • Information bits: a collection of k bits
  • Code word: encoded information bit string
    • k information bits are encoded to n bits. The encoded information word is a code word.
  • Check bits: r (= n − k) extra bits used to encode information bits

Motivation and Background (contd.)
• Theory – Informal definitions and results
  • Hamming weight (HW) of a vector v: number of 1’s in v
  • Hamming distance (HD) between a pair of vectors $v_1$ and $v_2$: number of places in which the two vectors differ from each other. $HD(v_1, v_2) = HW(v_1 \oplus v_2)$
  • Code: collection of code words.
  • Block code: each code word contains the same number of bits.
  • Minimum Hamming distance of a code: minimum of all HDs between all pairs of code words in a code.
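
The three definitions above can be sketched in a few lines of Python. This is only an illustration: vectors are stored as integers, and the names `hamming_weight`, `hamming_distance`, `min_distance` and the 4-bit even-parity example code are assumptions made here, not part of the lecture.

```python
from itertools import combinations

def hamming_weight(v: int) -> int:
    """Number of 1s in the binary representation of v."""
    return bin(v).count("1")

def hamming_distance(v1: int, v2: int) -> int:
    """HD(v1, v2) = HW(v1 XOR v2)."""
    return hamming_weight(v1 ^ v2)

def min_distance(code) -> int:
    """Minimum HD over all pairs of distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# Hypothetical example code: 3 information bits followed by one even-parity bit.
even_parity_code = [(d << 1) | (hamming_weight(d) & 1) for d in range(8)]
```

Calling `min_distance(even_parity_code)` gives the minimum distance of the single-parity code used as an example on the next slide.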

Motivation and Background (contd.)
Theory – Informal definitions and results (contd.)
• Error detection: an erroneous word (a code word with one or more bit errors) is not a code word
• Basic result 1: A code is capable of detecting t errors if and only if the min HD of the code is at least $t + 1$.
  • Proof: use a sphere packing argument to show this.
  • Example: Use of parity. We know that we can detect a single error. What is the minimum HD for such a code? Prove that the min HD is 2 using the argument that no two binary strings with even (odd) Hamming weight can have an HD of 1.

Motivation and Background (contd.)
Theory – Informal definitions and results (contd.)
• Basic result 2: A code is capable of correcting t errors if and only if the min HD of the code is at least $2t + 1$.
  • Proof: use a sphere packing argument as before.
• Combine the two results: A code is capable of correcting t errors and detecting d errors ($d \ge t$) if and only if the min HD of the code is at least $t + d + 1$.
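
The three distance conditions translate directly into small helper functions. The sketch below is illustrative only; the function names are hypothetical and simply restate the inequalities $d_{min} \ge t+1$, $d_{min} \ge 2t+1$ and $d_{min} \ge t+d+1$.

```python
def max_detectable(d_min: int) -> int:
    """Largest t with d_min >= t + 1 (detection only)."""
    return d_min - 1

def max_correctable(d_min: int) -> int:
    """Largest t with d_min >= 2t + 1 (correction only)."""
    return (d_min - 1) // 2

def can_correct_and_detect(d_min: int, t: int, d: int) -> bool:
    """True if a code with this minimum distance can correct t errors
    and detect d errors at the same time (requires d >= t)."""
    return d >= t and d_min >= t + d + 1
```

For example, a minimum distance of 4 satisfies the combined condition with t = 1 and d = 2, which is the SEC-DED case treated later.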

Hamming Codes – by example
• A linear block code
• Consider a (7,4) Hamming code
• Let $i_1, i_2, i_3, i_4$ be information symbols
• Let $p_1, p_2, p_4$ be check symbols
• The parity equations:
  $p_1 = i_1 \oplus i_2 \oplus i_4$
  $p_2 = i_1 \oplus i_3 \oplus i_4$
  $p_4 = i_2 \oplus i_3 \oplus i_4$

Hamming Codes – by example (contd.)
• Can write the equations as follows (easy to remember)

  bit:       p1  p2  i1  p4  i2  i3  i4
              1   0   1   0   1   0   1
              0   1   1   0   0   1   1
              0   0   0   1   1   1   1
  position:   1   2   3   4   5   6   7

• This encodes a 4-bit information word into a 7-bit codeword

Hamming Codes – by example (contd.)
• Properties of the code
  • If there is no error, all parity equations will be satisfied
  • Denote the outcomes of these equation checks as $c_1, c_2, c_4$
  • If there is exactly one error, then $c_1, c_2, c_4$ point to the position of the error
  • The vector $(c_1, c_2, c_4)$ is called the syndrome
  • The above (7,4) Hamming code is an SEC (single error correcting) code
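
A minimal sketch of the (7,4) example, assuming the bit order p1 p2 i1 p4 i2 i3 i4 and reading the syndrome as the binary number c4 c2 c1. The function names `encode`, `syndrome` and `correct` are chosen here for illustration.

```python
def encode(i1, i2, i3, i4):
    """Encode 4 information bits into positions 1..7: p1 p2 i1 p4 i2 i3 i4."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    return [p1, p2, i1, p4, i2, i3, i4]

def syndrome(word):
    """Re-evaluate the three parity checks; return c1 + 2*c2 + 4*c4."""
    b = [0] + list(word)              # shift so that b[1]..b[7] are positions 1..7
    c1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    c2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    c4 = b[4] ^ b[5] ^ b[6] ^ b[7]
    return c1 + 2 * c2 + 4 * c4

def correct(word):
    """Flip the bit the syndrome points to (valid for at most one error)."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed
```

With a single flipped bit at position j, `syndrome` returns j, so `correct` restores the original code word. With two or more errors the result is a miscorrection, which is why this code is SEC only.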

Hamming Codes – by example (contd.)
• The above method of construction can be generalized to construct an (n,k) Hamming code
• Simple bound
  • k = number of information bits
  • r = number of check bits
  • n = k + r = total number of bits
  • n + 1 = number of error patterns with at most one error (n single errors plus the no-error case)
  • Each error (including no error) must have a distinct syndrome
  • With r check bits, the maximum number of distinct syndromes is $2^r$
  • Hence: $2^r \ge n + 1$

Hamming Codes – by example (contd.)
• Simple bound: when $2^r = n + 1$, the corresponding Hamming code is a perfect code
• Perfect Hamming codes can be constructed as follows:

  bit:       p1   p2   i1  p4   i2  i3  i4  p8   i5  ...
  position:  2^0  2^1  3   2^2  5   6   7   2^3  9   ...

• Parity equations can be written as before from the above matrix representation
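
The bound $2^r \ge n + 1$ with $n = k + r$ can be evaluated numerically. The helper names below are hypothetical; the sketch only restates the bound and the power-of-two placement of check bits.

```python
def min_check_bits(k: int) -> int:
    """Smallest r satisfying 2**r >= k + r + 1."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

def is_perfect(n: int, r: int) -> bool:
    """A Hamming code is perfect when the bound holds with equality."""
    return 2 ** r == n + 1

def check_bit_positions(n: int):
    """Check bits sit at the power-of-two positions 1, 2, 4, 8, ..."""
    return [p for p in range(1, n + 1) if p & (p - 1) == 0]
```

For k = 4 this gives r = 3 and the perfect (7,4) code; for k = 11 it gives r = 4 and the perfect (15,11) code.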

SEC-DED Codes – Algebraic method
• Definitions
  • (G, *) is an abelian (commutative) group if
    • There is a 0 in G (identity)
    • For every a in G, $a^{-1}$ is also in G (inverses)
    • For all a and b in G, a*b = b*a is also in G (closed, commutative)
  • Examples
    • G = {0, 1}; * = $\oplus$ (Exclusive-OR)
    • $(Z_3, +_3)$, the integers {0, 1, 2} under addition modulo 3, is a commutative group

SEC-DED Codes – Algebraic method (contd.)
• Definitions (contd.)
  • (F, +, ·) is a field if
    • (F, +) is an abelian group with identity 0
    • (F − {0}, ·) is an abelian group
  • Examples
    • (F, $\oplus$, ·) is a field
    • F = {0, 1}; $\oplus$ = Exclusive-OR; · = AND
    • The above field is called GF(2)
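
For small finite sets the group conditions can be checked by brute force. The sketch below is an illustration with hypothetical names; it also checks associativity, a standard group axiom that the informal definition above leaves implicit, and it does not check the distributive law that a full field definition requires.

```python
from itertools import product

def is_abelian_group(elements, op, identity):
    """Brute-force check of the abelian group axioms on a finite set."""
    elems = list(elements)
    closed = all(op(a, b) in elems for a, b in product(elems, repeat=2))
    associative = all(op(op(a, b), c) == op(a, op(b, c))
                      for a, b, c in product(elems, repeat=3))
    commutative = all(op(a, b) == op(b, a) for a, b in product(elems, repeat=2))
    has_identity = all(op(a, identity) == a for a in elems)
    has_inverses = all(any(op(a, b) == identity for b in elems) for a in elems)
    return closed and associative and commutative and has_identity and has_inverses

def xor(a, b):
    return a ^ b

def and_(a, b):
    return a & b

def add_mod3(a, b):
    return (a + b) % 3

gf2_additive = is_abelian_group([0, 1], xor, 0)          # ({0,1}, XOR)
gf2_multiplicative = is_abelian_group([1], and_, 1)      # (F - {0}, AND)
z3_additive = is_abelian_group([0, 1, 2], add_mod3, 0)   # (Z3, + mod 3)
```

Including 0 in the multiplicative check, as in `is_abelian_group([0, 1], and_, 1)`, fails because 0 has no inverse under AND, which is why the field definition removes 0 first.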

SEC-DED Codes – Algebraic method (contd.)
• Definitions (contd.)
  • V is a vector space over a field F if
    • (V, +) is an abelian group
    • v in V and c in F $\Rightarrow$ cv is in V
    • c(u + v) = cu + cv
    • (c + d)v = cv + dv
    • c(dv) = (cd)v
  • $S \subseteq V$ is a subspace if S is itself a vector space
  • A linear combination of vectors is a vector
    • $u = c_1 v_1 + c_2 v_2 + c_3 v_3 + \dots + c_n v_n$

SEC-DED Codes – Algebraic method (contd.)
• Some results and more definitions
  • Over GF(2), the collection of all n-bit vectors forms a vector space
  • Let $v_1, v_2, \dots, v_k$ be n-bit vectors. Then all $2^k$ linear combinations of these k vectors form a subspace
  • A set of k vectors $v_1, v_2, \dots, v_k$ is linearly independent if, for every choice of coefficients $c_i$ ($i = 1, \dots, k$) that are not all 0,
    $c_1 v_1 + c_2 v_2 + c_3 v_3 + \dots + c_k v_k \ne 0$

SEC-DED Codes – Algebraic method (contd.)
• Some results and more definitions (contd.)
  • The largest number of linearly independent vectors in a vector space is the dimension of the space.
  • The dimension of the space containing all n-bit vectors is n
  • The dimension of the space containing all $2^k$ linear combinations of k vectors is no more than k.
  • A binary (n,k) linear block code is a k-dimensional subspace of an n-dimensional vector space
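
Linear independence and dimension over GF(2) can be checked with XOR-based elimination. In this sketch vectors are stored as integers (one bit per coordinate); `gf2_rank` and `span` are illustrative names, not from the lecture.

```python
def gf2_rank(vectors):
    """Dimension of the space spanned by the vectors over GF(2)."""
    basis = []
    for v in vectors:
        for b in basis:
            # v ^ b is smaller than v exactly when v contains b's leading 1,
            # so min() clears that leading bit when it is present.
            v = min(v, v ^ b)
        if v:                      # nonzero remainder: v is independent of the basis
            basis.append(v)
    return len(basis)

def span(vectors):
    """Set of all linear combinations (coefficients 0 or 1) of the vectors."""
    combos = {0}
    for v in vectors:
        combos |= {c ^ v for c in combos}
    return combos
```

A set of k vectors is linearly independent when `gf2_rank` returns k, and in that case `span` contains exactly $2^k$ distinct vectors.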

SEC-DED Codes – Algebraic method (contd.)
• A binary (n,k) linear block code can be described by a collection of k carefully chosen vectors. Each code word is a linear combination of these k vectors, thus forming a k-dimensional subspace.
• These k vectors can be written as a $k \times n$ matrix G, called the generator matrix. A code word for a k-bit information word, say vector a, is obtained as aG
• Example: For the (7,4) Hamming code described earlier

  p1  p2  i1  p4  i2  i3  i4
   1   1   1   0   0   0   0
   1   0   0   1   1   0   0     = G
   0   1   0   1   0   1   0
   1   1   0   1   0   0   1

• Note: a code word is a linear combination of rows of G
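
Encoding with the generator matrix is a vector-matrix product over GF(2). A minimal sketch using the G shown above, with `encode_with_G` as an illustrative name:

```python
# Generator matrix of the (7,4) Hamming code, columns p1 p2 i1 p4 i2 i3 i4.
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]

def encode_with_G(a, G=G):
    """Code word aG over GF(2): the XOR of the rows of G selected by a."""
    n = len(G[0])
    return [sum(a[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n)]
```

For the information word a = (1, 0, 1, 1), rows 1, 3 and 4 of G are added, which reproduces the parity equations given earlier.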

SEC-DED Codes – Algebraic method (contd.)
• Two vectors $v_1$ and $v_2$ are orthogonal if $v_1 \cdot v_2 = 0$
• The code generated by G can also be represented by an $r \times n$ matrix H in which each n-bit row vector of H is orthogonal to every row vector of G.
  • Hence $G H^T = 0$
  • dim G + dim H = n
• Example: For the (7,4) Hamming code described earlier the H matrix is:

  p1  p2  i1  p4  i2  i3  i4
   1   0   1   0   1   0   1
   0   1   1   0   0   1   1     = H
   0   0   0   1   1   1   1

• Check that $G H^T = 0$
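
The check $G H^T = 0$ can be carried out mechanically. The sketch below uses the G and H of the (7,4) example; the helper name is an assumption made for illustration.

```python
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def gf2_product_with_transpose(A, B):
    """Compute A * B^T over GF(2): entry (i, j) is the dot product of row i of A and row j of B."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) % 2 for row_b in B]
            for row_a in A]

GHt = gf2_product_with_transpose(G, H)   # a 4 x 3 matrix
```

Every entry of `GHt` is 0, so each row of G satisfies all three parity checks.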

SEC-DED Codes – Algebraic method (contd.)
• There are two ways to encode data words
  • Use the G (generator) matrix
  • Use the H (parity check) matrix
  • We will use H, being of lower dimensionality
• Consider the following representation of H:
  $H = [\, P \mid I_r \,]$, where P is an $r \times k$ matrix and $I_r$ is the $r \times r$ identity matrix
• Consider a code word $(a_1, a_2, \dots, a_k, p_1, p_2, \dots, p_r)$
• We can write the parity check equations from the above H, i.e. from $H a^T = 0$, where a here denotes the whole code word
• Example: For the (7,4) Hamming code we can write the H matrix as:

  a1  a2  a3  a4  p1  p2  p4
   1   0   1   1   1   0   0
   1   1   0   1   0   1   0     = H
   0   1   1   1   0   0   1

• The previous parity equations can be obtained from this H in a simple manner (with $a_2$ and $a_3$ playing the roles of $i_3$ and $i_2$, respectively)
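
With H in the form $[\,P \mid I_r\,]$, each check bit is the GF(2) dot product of one row of P with the information bits. A minimal sketch using the P from the example above; the function names are illustrative.

```python
# The r x k block P of the systematic H = [P | I_r] shown above.
P = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]

def check_bits(a):
    """Row i of P selects the information bits that form check bit i."""
    return [sum(p * x for p, x in zip(row, a)) % 2 for row in P]

def encode_systematic(a):
    """Code word (a1..ak, p1..pr): data bits unchanged, check bits appended."""
    return list(a) + check_bits(a)

def syndrome(word):
    """H * word^T over GF(2) with H = [P | I_r]."""
    r = len(P)
    H = [P[i] + [1 if j == i else 0 for j in range(r)] for i in range(r)]
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]
```

Because the identity block contributes each check bit exactly once to its own equation, every word produced by `encode_systematic` has an all-zero syndrome.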

SEC-DED Codes – Algebraic method (contd.)
• Note that H is specified such that all information bits stay intact and together, and the check bits stay together and depend only on the information bits
• A code specified by an H of the above type is called a systematic code
  • Data bits and check bits stay separate from each other
  • It is easy to extract the data bits from a code word
• Statement: rearranging the columns of H does not change the code. All it does is change the positions of the check bits and information bits
• Question: when can we write an arbitrary H in systematic form?

SEC-DED Codes – Algebraic method (contd.)
• Theorem: If H is an $r \times n$ matrix with rank(H) = r (rank r means H contains r linearly independent columns), then H can be transformed to a systematic form
  • A row operation on H means a linear combination of parity check equations. Thus the solution of the equations does not change
  • First rearrange the columns of H such that the last r columns are linearly independent
  • Next find a matrix M that performs row operations on H such that M, when it multiplies the last r columns, gives an $r \times r$ identity matrix. Thus M is in fact the inverse of the matrix that consists of the last r columns of H
  • Now the matrix MH will be in systematic form
• Example in class
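
The row-operation step can be sketched as Gauss-Jordan elimination over GF(2). This is an illustration, not the in-class example: it assumes the columns have already been rearranged so that the last r columns are linearly independent, and it applies the row operations directly instead of forming M explicitly.

```python
def to_systematic(H):
    """Row-reduce H over GF(2) so that its last r columns become the identity.

    Assumes the last r columns are linearly independent; otherwise no pivot
    is found and next() raises StopIteration.
    """
    H = [row[:] for row in H]            # work on a copy
    r, n = len(H), len(H[0])
    k = n - r
    for i in range(r):
        col = k + i
        pivot = next(j for j in range(i, r) if H[j][col])
        H[i], H[pivot] = H[pivot], H[i]  # bring a 1 onto the diagonal
        for j in range(r):
            if j != i and H[j][col]:     # clear the rest of the column
                H[j] = [x ^ y for x, y in zip(H[j], H[i])]
    return H

# The (7,4) Hamming H in position order; its last three columns are independent.
H_hamming = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
H_sys = to_systematic(H_hamming)
```

Since only row operations are used, `H_sys` accepts exactly the same code words as `H_hamming`; only the roles of the bit positions as data or check bits change.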

SEC-DED Codes – Algebraic method (contd.)
• Definition: The syndrome S of an n-bit word x is $S = H x^T$
  • Note: S is an r-bit vector
  • Note also that in the above equation $x^T$ selects a linear combination of the columns of H
  • Example: consider a (6,3) systematic H and a 6-bit vector x
• Theorem: for an (n,k) linear block code represented by H, the syndrome of every code word is 0
  • The proof is more or less based on the way we have defined a block code and the H matrix
• Definition: The error word, E, is a vector that represents where a codeword is erroneous
• Example in class to define all these terms

SEC-DED Codes – Algebraic method (contd.)
• Theorem: Let C be a code word and E be an error word, i.e. $C' = C + E$ is the erroneous word (code word with an error in it). Let $S'$ be the syndrome of the word $C'$; then $S' = H E^T$
• Theorem: A linear block code represented by H is SEC if and only if the columns of H are distinct and nonzero
• Theorem: A linear block code represented by H is SEC-DED if:
  • All columns of H are distinct and nonzero
  • The sum of any two columns of H is nonzero and is not equal to a third column of H
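
The first two theorems can be illustrated on the (7,4) Hamming H. The sketch below is hedged: `syndrome`, `columns` and `is_sec` are names chosen here, and the code word and error word are example values.

```python
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(H, x):
    """S = H x^T over GF(2), returned as a list of r bits."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]

def columns(H):
    """Columns of H as tuples."""
    return list(zip(*H))

def is_sec(H):
    """SEC condition: all columns of H are nonzero and pairwise distinct."""
    cols = columns(H)
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)

C = [0, 1, 1, 0, 0, 1, 1]                      # a code word of the (7,4) code
E = [0, 0, 0, 0, 1, 0, 0]                      # single error at position 5
C_err = [c ^ e for c, e in zip(C, E)]          # C' = C + E
```

Here `syndrome(H, C_err)` equals `syndrome(H, E)`, which is simply column 5 of H, so the syndrome depends on the error and not on the code word.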

SEC-DED Codes – Algebraic method (contd.)
• Consider an H matrix in which each column has an odd number of 1’s. The code generated by such an H matrix is called an odd weight column code
• Example: consider r = 4. Let us consider an H, a $4 \times 8$ matrix:

  1 0 0 0   0 1 1 1
  0 1 0 0   1 0 1 1     = H
  0 0 1 0   1 1 0 1
  0 0 0 1   1 1 1 0
  (first four columns: wt = 1; last four columns: wt = 3)

  This is an (8,4) SEC-DED code
• Theorem: An odd weight column code is a SEC-DED code
• Theorem: A Hamming code with overall parity is a SEC-DED code
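
The SEC-DED column conditions can be tested by brute force on the (8,4) example. The function name `is_sec_ded` is an assumption for illustration.

```python
from itertools import combinations

# The odd weight column H of the (8,4) example above.
H_odd = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]

def is_sec_ded(H):
    """Columns distinct and nonzero, and no sum of two columns is zero or equals a column."""
    cols = list(zip(*H))
    col_set = set(cols)
    if len(col_set) != len(cols) or not all(any(c) for c in cols):
        return False
    for a, b in combinations(cols, 2):
        total = tuple(x ^ y for x, y in zip(a, b))
        # total is nonzero because a != b; it must not look like a single error.
        if total in col_set:
            return False
    return True
```

The check succeeds for `H_odd` because the sum of two odd weight columns always has even weight and therefore cannot equal any column of H.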

SEC-DED Codes – Algebraic method (contd.)
• Shortened codes
  • Sometimes we are interested in codes that do not exactly satisfy the bound derived for perfect Hamming codes. For example, consider the case when k = 8. Clearly we will need r = 4 for single error correction, since $2^4 \ge 12 + 1$. But we do not want to have a (15,11) code. What we want is a (12,8) code. The following result comes in handy to design such codes and still have error correction capability
  • Result: Deleting columns of H does not alter the error correction capability of the corresponding code
  • Proof: the conditions stated in the theorems (for example, columns remaining odd weight columns, or no two columns being identical) do not change by deleting columns of H.
  • What columns to delete? See the hardware issues that follow.
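
Shortening can be sketched by building the (15,11) Hamming H and deleting three data columns. Which columns to drop is a design choice; the choice below (positions 13, 14 and 15) is only an assumption made for illustration, as are the function names.

```python
def hamming_H(r):
    """r x (2**r - 1) parity check matrix whose column j is j written in binary."""
    n = 2 ** r - 1
    return [[(pos >> i) & 1 for pos in range(1, n + 1)] for i in range(r)]

def delete_columns(H, drop):
    """Remove the columns whose 0-based indices are listed in drop."""
    drop = set(drop)
    return [[bit for j, bit in enumerate(row) if j not in drop] for row in H]

def columns_distinct_nonzero(H):
    """The SEC condition on the columns of H."""
    cols = list(zip(*H))
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)

H15 = hamming_H(4)                         # (15,11) code
H12 = delete_columns(H15, [12, 13, 14])    # drop positions 13, 14, 15: (12,8) code
```

The remaining 12 columns are still distinct and nonzero, so the shortened code keeps its single error correction capability.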

SEC-DED Codes – Hardware
• Encoding hardware
  [Figure: the k information bits feed an XOR tree that produces the r check bits; the k information bits are also passed through unchanged.]

SEC-DED Codes – Hardware (contd.)
• Decoding hardware – Algorithm
  • Compute syndrome S
  • If S = 0 then no error
  • If S ≠ 0, decode S
    • If S is in range (decoded S ≤ n) then correct the S-th bit
    • Else there is an uncorrectable error
• Note: it is easy to determine if S is 0
• Decoding S is also straightforward
• Correction implies a bit flip (EOR operation)
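
The decoding steps above can be sketched in software. This sketch assumes a Hamming-style H whose column j is j in binary, so the syndrome read as a number is the error position, and it uses a shortened 12-bit code as the example. The `encode` helper and all names are hypothetical and included only to make the example self-contained.

```python
def syndrome(word):
    """XOR of the (1-based) positions that hold a 1."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def encode(data, n):
    """Place data bits at the non-power-of-two positions, then set the check bits."""
    word = [0] * n
    data_positions = [p for p in range(1, n + 1) if p & (p - 1)]
    for p, bit in zip(data_positions, data):
        word[p - 1] = bit
    s = syndrome(word)
    for p in range(1, n + 1):
        if p & (p - 1) == 0:               # power-of-two position: check bit
            word[p - 1] = 1 if s & p else 0
    return word

def decode(word):
    """Return (word after decoding, status) following the algorithm above."""
    n = len(word)
    s = syndrome(word)
    if s == 0:
        return list(word), "no error"
    if s <= n:
        fixed = list(word)
        fixed[s - 1] ^= 1                  # correction is a single bit flip
        return fixed, "corrected"
    return list(word), "uncorrectable"
```

In the 12-bit example, a double error in positions 1 and 12 produces syndrome 13, which is out of range and is reported as uncorrectable. Other double errors can still produce an in-range syndrome and be miscorrected, since this sketch is SEC only.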

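SEC-DED Codes – Hardware (contd.)
• Decoding hardware – Implementation
  [Figure: the received word (k information bits and r check bits) feeds an EOR tree that computes the syndrome; OR/NOR logic on the syndrome signals error / no error; a decoder selects one of n lines for the error corrector (n EORs), which outputs the corrected word.]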

SEC-DED Codes – Hardware (contd.)
• Hardware simplification
  • Reduce the number of EORs
    • Have as few 1s in the matrix as possible
  • Reduce delay – depth of the EOR tree
    • Have as few 1s in each row of H as possible
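
Both cost measures can be estimated directly from H. The sketch below assumes two-input EOR gates arranged in balanced trees, one tree per row of H; the function names are illustrative.

```python
from math import ceil, log2

def xor_gate_count(H):
    """A row with w ones needs w - 1 two-input EOR gates."""
    return sum(max(sum(row) - 1, 0) for row in H)

def xor_tree_depth(H):
    """Depth of a balanced two-input EOR tree for the heaviest row."""
    w = max(sum(row) for row in H)
    return ceil(log2(w)) if w > 1 else 0

# The (7,4) Hamming H: every row has four 1s.
H_hamming = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
```

Under these assumptions the (7,4) syndrome generator needs 9 gates with a depth of 2, which shows why fewer 1s overall and fewer 1s per row reduce cost and delay.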

SEC-DED-SBD Codes
• Motivation
  • Many memories are organized as byte oriented
  • Failures manifest themselves as follows
    • Random failure – bit error
    • Chip failure – byte error
  • The objective is to detect such byte errors while detecting and correcting random errors. Hence the error model
    • Single random error
    • Multiple errors limited to within a byte

SEC-DED-SBD Codes (contd.)
• Theorem (Reddy): Let $E_1$ and $E_2$ be two sets of error patterns with $E_1 \cap E_2 = \emptyset$. A linear block code described by H can correct all errors in $E_1$ and detect all errors in $E_2$ if and only if
  • For every e in $E_1 \cup E_2$: $H e^T \ne 0$
  • For every pair of distinct $e_i, e_j$ in $E_1$: $H e_i^T \ne H e_j^T$, and
  • For every $e_i$ in $E_2$ there is no $e_j$ in $E_1$ such that $H e_i^T = H e_j^T$

SEC-DED-SBD Codes (contd.)
• To demonstrate the use of the theorem, let us look at an example H matrix and its capabilities for a small byte (nibble) size
  • b = number of bits in each memory card
  • n = total number of bits in a code word
  • r = number of check bits
  • $n = b\,(2^{\,r-b+1} - 1)$
• For b = 4 and r = 5 we have $n = 4\,(2^{2} - 1) = 12$. Thus we will construct a (12,7) code which will be able to correct any single error and detect errors confined to 4-bit nibbles

SEC-DED-SBD Codes (contd.)
• Many parts of the code are shown as blocks in the following figure
  [Figure: block view of the H matrix, with one part labeled "Correction part" and another labeled "Detect multiple errors in byte".]

SEC-DED-SBD Codes (contd.)
• Now let us look at the complete matrix (columns grouped into three 4-bit nibbles)

  0 0 0 0   1 1 1 1   1 1 1 1
  1 1 1 1   0 0 0 0   1 1 1 1
  1 0 0 0   1 0 0 0   1 0 0 0
  0 1 0 1   0 1 0 1   0 1 0 1
  0 0 1 1   0 0 1 1   0 0 1 1

SEC-DED-SBD Codes (contd.)
• The capability can be proven as follows
  • $E_1$: single errors; $E_2$: multiple errors limited to a 4-bit nibble
  • All columns are nonzero, and any combination of columns within a 4-bit nibble is also nonzero
  • All columns are distinct, providing single error correction capability
  • The last 3 rows, together with the first 2 rows (which sum to 00 for an even number of errors within a nibble), guarantee that no combination of errors limited to a nibble will have a syndrome identical to a single error syndrome
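
The three conditions of Reddy's theorem can also be checked exhaustively for the (12,7) example. This brute-force sketch is an illustration, not the proof from the lecture; the names and the encoding of error patterns as tuples of 0-based column indices are assumptions.

```python
from itertools import combinations

# The 5 x 12 matrix of the (12,7) example; columns 0-3, 4-7, 8-11 are the nibbles.
H = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]
B = 4                                    # nibble size b
N = len(H[0])

def syndrome(error_positions):
    """XOR of the columns of H at the given 0-based error positions."""
    return tuple(sum(row[j] for j in error_positions) % 2 for row in H)

# E1: all single errors. E2: all multi-bit errors confined to one nibble.
E1 = [(j,) for j in range(N)]
E2 = [combo
      for start in range(0, N, B)
      for size in range(2, B + 1)
      for combo in combinations(range(start, start + B), size)]

single_syndromes = {syndrome(e) for e in E1}
all_nonzero = all(any(syndrome(e)) for e in E1 + E2)           # condition 1
singles_distinct = len(single_syndromes) == len(E1)            # condition 2
no_aliasing = all(syndrome(e) not in single_syndromes for e in E2)  # condition 3
```

All three flags evaluate to true for this matrix. A double error that spans two nibbles can still alias with a single error, for example columns 2 and 7 (1-based) against column 12, which is consistent with the comment below that the distance has to be increased by 1 to obtain DED.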

SEC-DED-SBD Codes (contd.)
• Comments
  • The code can be converted to a systematic code
  • The distance of the code can be increased by 1 to make it a DED code
  • This code can also be shortened

Summary
• Why ECC in fault tolerance
• Hamming code – by example
• Algebra and algebraic coding
• Codes
• Hardware
• SEC-SBD code
