
ECE 753: FAULT-TOLERANT COMPUTING

ECE 753: FAULT-TOLERANT COMPUTING. Kewal K. Saluja, Department of Electrical and Computer Engineering. Low-Level Fault Tolerance: ECC. Overview: Introduction; Motivation and Background; Hamming Codes – by example; SEC-DED Codes – Algebraic method; SEC-DED Codes – Hardware; SEC-DED-SBD Codes


Presentation Transcript


  1. ECE 753: FAULT-TOLERANT COMPUTING • Kewal K. Saluja, Department of Electrical and Computer Engineering • Low-Level Fault Tolerance: ECC

  2. Overview • Introduction • Motivation and Background • Hamming Codes – by example • SEC-DED Codes – Algebraic method • SEC-DED Codes – Hardware • SEC-DED-SBD Codes • Cyclic Codes – (time permitting) • Summary ECE 753 Fault Tolerant Computing

  3. Introduction • References • Chapter 3 of Koren and Krishna • Appendix A of the book [siew:92] (also included in the set of reading material) • Following references • Reddy, “A class of linear codes …”, IEEE Transactions on Computers, May 1978 • Any book on coding theory

  4. Motivation and Background • Memories are an integral part of digital systems (computers) • The majority of chip and/or board area is taken by memories • Hence reliability improvement methods must pay attention to memories (RAMs, ROMs, etc.)

  5. Motivation and Background (contd.) • Types of faults prevalent in memories • During manufacturing • Stuck-at • Timing faults • Coupling and pattern-sensitive faults • During operation • Cell failures due to life, stress (behave the same as stuck-at faults) • Alpha particle hits (cell content changes) • Sensitive to system location: hit rates are higher at altitude and in flight • Need non-testing based solutions • Random failures: bit/nibble/byte/card failures

  6. Motivation and Background (contd.) • Theoretical Foundation • Linear and modern algebra • Concept of groups, fields, and vector spaces • We will focus on binary codes but will have to include polynomial algebra • Theory – Informal definitions and results • Vector: a collection of bits represented as a string • Information bits: a collection of k bits • Code word: an encoded information bit string • k information bits are encoded to n bits; the encoded information word is a code word • Check bits: the r (= n − k) extra bits used to encode the information bits

  7. Motivation and Background (contd.) • Theory – Informal definitions and results • Hamming weight (HW) of a vector v: number of 1’s in v • Hamming distance (HD) between a pair of vectors v1 and v2: number of places in which the two vectors differ. $HD(v_1, v_2) = HW(v_1 \oplus v_2)$ • Code: a collection of code words • Block code: each code word contains the same number of bits • Minimum Hamming distance of a code: the minimum of the HDs between all pairs of distinct code words in the code
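The two definitions above can be tried out with a few lines of Python. This is only an illustrative sketch: the function names `hamming_weight` and `hamming_distance` and the two sample vectors are assumptions made here, and bit vectors are represented as Python integers.

```python
def hamming_weight(v: int) -> int:
    """Number of 1 bits in the bit vector v (stored as an integer)."""
    return bin(v).count("1")


def hamming_distance(v1: int, v2: int) -> int:
    """Number of positions where v1 and v2 differ, i.e. HW(v1 XOR v2)."""
    return hamming_weight(v1 ^ v2)


v1 = 0b1011001  # sample vectors, chosen only for illustration
v2 = 0b1001101
print(hamming_weight(v1))        # 4
print(hamming_distance(v1, v2))  # 2
```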

  8. Motivation and Background (contd.) • Theory – Informal definitions and results (contd.) • Error detection: an erroneous word (a code word with one or more bit errors) is not a code word • Basic result 1: A code is capable of detecting t errors if and only if the min HD of the code is at least t+1 • Proof: use a sphere packing argument to show this • Example: Use of parity. We know that we can detect a single error. What is the minimum HD for such a code? Prove that the min HD is 2 using the argument that no two binary strings with even (odd) Hamming weight can have an HD of 1.

  9. Motivation and Background (contd.) • Theory – Informal definitions and results (contd.) • Basic result 2: A code is capable of correcting t errors if and only if the min HD of the code is at least 2t+1 • Proof: use a sphere packing argument as before • Combine the two results: A code is capable of correcting t errors and detecting d errors ($d \ge t$) if and only if the min HD of the code is at least t+d+1
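As a small check of these results, the sketch below builds the even-parity code for 3 information bits by brute force and reads off its detection and correction capability from the minimum distance. The helper names `min_distance` and `even_parity_code` are hypothetical, and exhaustive search is only practical for tiny codes.

```python
from itertools import combinations, product


def min_distance(code):
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(sum(a != b for a, b in zip(u, v)) for u, v in combinations(code, 2))


def even_parity_code(k):
    """All k-bit words, each extended with one even-parity check bit."""
    return [bits + (sum(bits) % 2,) for bits in product((0, 1), repeat=k)]


code = even_parity_code(3)
d_min = min_distance(code)
print(d_min)             # 2
print(d_min - 1)         # errors detectable: 1
print((d_min - 1) // 2)  # errors correctable: 0
```

A minimum distance of 2 gives single-error detection (t + 1 = 2) but no correction (2t + 1 = 3 would be needed for t = 1).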

  10. Hamming Codes – by example • A linear block code • Consider a (7,4) Hamming code • Let i1, i2, i3, i4 be information symbols • Let p1, p2, p4 be check symbols • The parity equations: $p_1 = i_1 \oplus i_2 \oplus i_4$, $p_2 = i_1 \oplus i_3 \oplus i_4$, $p_4 = i_2 \oplus i_3 \oplus i_4$

  11. Hamming Codes – by example (contd.) • Can write the equations as follows (easy to remember):

      position:  1   2   3   4   5   6   7
      bit:       p1  p2  i1  p4  i2  i3  i4
                 1   0   1   0   1   0   1
                 0   1   1   0   0   1   1
                 0   0   0   1   1   1   1

      This encodes a 4-bit information word into a 7-bit codeword
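The encoding can be sketched directly from the three parity equations. The function name `encode_7_4` is an assumption made for illustration; the bit order of the returned list follows the positions p1 p2 i1 p4 i2 i3 i4 used in the slides.

```python
def encode_7_4(i1, i2, i3, i4):
    """Encode 4 information bits into the 7-bit word [p1, p2, i1, p4, i2, i3, i4]."""
    p1 = i1 ^ i2 ^ i4  # covers positions 1, 3, 5, 7
    p2 = i1 ^ i3 ^ i4  # covers positions 2, 3, 6, 7
    p4 = i2 ^ i3 ^ i4  # covers positions 4, 5, 6, 7
    return [p1, p2, i1, p4, i2, i3, i4]


print(encode_7_4(1, 0, 1, 1))  # [0, 1, 1, 0, 0, 1, 1]
```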

  12. Hamming Codes – by example (contd.) • Properties of the code • If there is no error, all parity equations will be satisfied • Denote the outcomes of these equation checks as c1, c2, c4 • If there is exactly one error, then c1, c2, c4 point to the position of the error • The vector (c1, c2, c4) is called the syndrome • The above (7,4) Hamming code is a SEC (single error correcting) code
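A minimal sketch of how the syndrome points at the erroneous bit, assuming the word is stored as a list in the order p1 p2 i1 p4 i2 i3 i4. The names `syndrome_7_4` and `correct_7_4` are hypothetical; the syndrome is read as the number c1 + 2·c2 + 4·c4.

```python
def syndrome_7_4(word):
    """Return c1 + 2*c2 + 4*c4 for word = [p1, p2, i1, p4, i2, i3, i4]."""
    b = [None] + list(word)  # shift to 1-based positions
    c1 = b[1] ^ b[3] ^ b[5] ^ b[7]
    c2 = b[2] ^ b[3] ^ b[6] ^ b[7]
    c4 = b[4] ^ b[5] ^ b[6] ^ b[7]
    return c1 + 2 * c2 + 4 * c4


def correct_7_4(word):
    """Flip the bit the syndrome points at (if any); return (word, syndrome)."""
    s = syndrome_7_4(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


codeword = [0, 1, 1, 0, 0, 1, 1]  # a valid code word of the (7,4) code
received = list(codeword)
received[4] ^= 1                  # inject an error at position 5
fixed, s = correct_7_4(received)
print(s)                  # 5
print(fixed == codeword)  # True
```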

  13. Hamming Codes – by example (contd.) • The above method of construction can be generalized to construct an (n,k) Hamming code • Simple bound • k = number of information bits • r = number of check bits • n = k + r = total number of bits • n + 1 = number of error patterns with at most one error (n single-bit errors plus the no-error case) • Each of these patterns (including no error) must have a distinct syndrome • With r check bits the maximum number of possible syndromes is $2^r$ • Hence: $2^r \ge n + 1$
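The bound can be turned into a small search for the fewest check bits needed for a given k. This is a sketch; `min_check_bits` is a name chosen here, and it simply looks for the smallest r with 2^r ≥ k + r + 1.

```python
def min_check_bits(k: int) -> int:
    """Smallest r such that 2**r >= n + 1, where n = k + r."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


for k in (4, 8, 11, 32, 64):
    r = min_check_bits(k)
    print(f"k={k:2d}  r={r}  n={k + r}")
```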

  14. Hamming Codes – by example (contd.) • Simple bound (contd.): when $2^r = n + 1$ the corresponding Hamming code is a perfect code • Perfect Hamming codes can be constructed as follows:

      bit:       p1   p2   i1  p4   i2  i3  i4  p8   i5  ...
      position:  2^0  2^1  3   2^2  5   6   7   2^3  9   ...

      Parity equations can be written as before from the above matrix representation
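One way to sketch this construction in code: column j of H (j = 1 … 2^r − 1) is the binary representation of j, and positions that are powers of two hold check bits. The function names below are assumptions made for illustration.

```python
def hamming_parity_check(r):
    """H for the perfect Hamming code with r check bits.

    Row `bit` contains bit `bit` (least significant first) of each
    position number 1 .. 2**r - 1.
    """
    n = 2 ** r - 1
    return [[(pos >> bit) & 1 for pos in range(1, n + 1)] for bit in range(r)]


def column_labels(r):
    """Label positions: powers of two are check bits, the rest information bits."""
    labels, info = [], 0
    for pos in range(1, 2 ** r):
        if pos & (pos - 1) == 0:  # pos is a power of two
            labels.append(f"p{pos}")
        else:
            info += 1
            labels.append(f"i{info}")
    return labels


print(column_labels(3))  # ['p1', 'p2', 'i1', 'p4', 'i2', 'i3', 'i4']
for row in hamming_parity_check(3):
    print(row)
```

For r = 3 this reproduces the three parity rows of the (7,4) example.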

  15. SEC-DED Codes – Algebraic method • Definitions • (G, *) – An abelian (commutative) group • There is a 0 in G (identity) • For every a in G, $a^{-1}$ is also in G (inverses) • For all a and b in G, a*b = b*a is also in G (closed, commutative) • The operation * is associative • Examples • G = {0, 1}; * = $\oplus$ (Exclusive-OR) • $(Z_3, +_3)$, the integers modulo 3 under addition modulo 3, is a commutative group

  16. SEC-DED Codes – Algebraic method (contd.) • Definitions (contd.) • (F, +, ·) – A field if • (F, +) is an abelian group with identity 0 • (F − {0}, ·) is an abelian group • · distributes over + • Examples • (F, $\oplus$, ·) is a field • F = {0, 1}; $\oplus$ = Exclusive-OR; · = AND • The above field is called GF(2)

  17. SEC-DED Codes – Algebraic method (contd.) • Definitions (contd.) • Vector space over a field F • (V, +) is an abelian group • v in V and c in F $\Rightarrow$ cv is in V • c(u + v) = cu + cv • (c + d)v = cv + dv • c(dv) = (cd)v • $S \subseteq V$ is a subspace if S is itself a vector space • A linear combination of vectors is a vector • $u = c_1 v_1 + c_2 v_2 + c_3 v_3 + \dots + c_n v_n$

  18. SEC-DED Codes – Algebraic method (contd.) • Some results and more definitions • Over GF(2) the collection of all n-bit vectors forms a vector space • Let v1, v2, … , vk be n-bit vectors. Then all $2^k$ linear combinations of these k vectors form a subspace • A set of k vectors v1, v2, … , vk is linearly independent if, whenever not all $c_i = 0$ (i = 1, …, k), $c_1 v_1 + c_2 v_2 + c_3 v_3 + \dots + c_k v_k \ne 0$

  19. SEC-DED Codes – Algebraic method (contd.) • Some results and more definitions (contd.) • The largest number of linearly independent vectors in a vector space is the dimension of the space • The dimension of the space containing all n-bit vectors is n • The dimension of the space containing all $2^k$ linear combinations of k vectors is no more than k • A binary (n,k) linear block code is a k-dimensional subspace of an n-dimensional vector space

  20. SEC-DED Codes – Algebraic method (contd.) • A binary (n,k) linear block code can be described by a collection of k carefully chosen vectors. Each code word is a linear combination of these k vectors, thus forming a k-dimensional subspace • These k vectors can be written as a $k \times n$ matrix G, called the generator matrix. The code word for a k-bit information word, say vector a, is obtained as aG • Example: For the (7,4) Hamming code described earlier

      p1  p2  i1  p4  i2  i3  i4
      1   1   1   0   0   0   0
      1   0   0   1   1   0   0     = G
      0   1   0   1   0   1   0
      1   1   0   1   0   0   1

      Note: a code word is a linear combination of rows of G
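To make the product aG concrete, the sketch below enumerates all 2^4 code words generated by this G and reports the smallest nonzero weight. The helper name `encode` is an assumption. For a linear code the minimum distance equals the minimum weight of a nonzero code word, so the result is also the code's minimum Hamming distance.

```python
from itertools import product

# Generator matrix of the (7,4) Hamming code, columns p1 p2 i1 p4 i2 i3 i4.
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]


def encode(a, G):
    """Code word aG over GF(2): the XOR of the rows of G selected by a."""
    n = len(G[0])
    return [sum(a[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n)]


codewords = [encode(a, G) for a in product((0, 1), repeat=4)]
min_weight = min(sum(c) for c in codewords if any(c))
print(len(codewords))         # 16
print(min_weight)             # 3
print(encode((1, 0, 1, 1), G))  # [0, 1, 1, 0, 0, 1, 1]
```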

  21. SEC-DED Codes – Algebraic method (contd.) • Two vectors v1 and v2 are orthogonal if $v_1 \cdot v_2 = 0$ • The code generated by G can also be represented by an $r \times n$ matrix H in which each row (an n-bit vector) is orthogonal to every row of G • Hence $G H^T = 0$ • dim G + dim H = n • Example: For the (7,4) Hamming code described earlier the H matrix is:

      p1  p2  i1  p4  i2  i3  i4
      1   0   1   0   1   0   1
      0   1   1   0   0   1   1     = H
      0   0   0   1   1   1   1

      • Check that $G H^T = 0$
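The check can be carried out mechanically. In this sketch the helper `gf2_matmul_transpose` is a name introduced here; it multiplies one matrix by the transpose of another with arithmetic modulo 2.

```python
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def gf2_matmul_transpose(A, B):
    """Compute A * B^T over GF(2): entry (i, j) is row i of A dot row j of B."""
    return [[sum(x * y for x, y in zip(ra, rb)) % 2 for rb in B] for ra in A]


product_ght = gf2_matmul_transpose(G, H)
print(product_ght)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
```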

  22. SEC-DED Codes – Algebraic method (contd.) • There are two ways to encode data words • Use the G (generator) matrix • Use the H (parity check) matrix • We will use H, it being of lower dimensionality • Consider the following representation of H: $H = [\,P_r \mid I_r\,]$, where $P_r$ is an $r \times k$ matrix and $I_r$ is the $r \times r$ identity matrix • Consider a code word (a1, a2, … , ak, p1, p2, … , pr) • We can write the parity check equations from the above H, i.e. from $H a^T = 0$ • Example: For the (7,4) Hamming code we can write the H matrix as:

      a1  a2  a3  a4  p1  p2  p4
      1   0   1   1   1   0   0
      1   1   0   1   0   1   0     = H
      0   1   1   1   0   0   1

      • Can obtain the previous parity equations from this H in a simple manner (the rows match the earlier equations with a2 = i3 and a3 = i2)

  23. SEC-DED Codes – Algebraic method (contd.) • Note that H is specified such that all information bits stay intact and together, and the check bits stay together and depend only on the information bits • A code specified by an H of the above type is called a systematic code • Data bits and check bits stay separate from each other • It is easy to extract data bits from a code word • Statement: rearranging the columns of H does not change the code; it only changes the positions of the check bits and information bits • Question: when can we write an arbitrary H in systematic form?

  24. SEC-DED Codes – Algebraic method (contd.) • Theorem: If H is an $r \times n$ matrix with rank(H) = r (rank r means H contains r linearly independent columns), then H can be transformed to a systematic form • A row operation on H is a linear combination of parity check equations, so the solution set of the equations does not change • First rearrange the columns of H such that the last r columns are linearly independent • Next find a matrix M that performs row operations on H such that M, multiplied by the last r columns, gives the $r \times r$ identity matrix. Thus M is in fact the inverse of the matrix that consists of the last r columns of H • Now the matrix MH will be in systematic form • Example in class
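The in-class example is not part of this transcript, so the sketch below uses a made-up (6,3) parity check matrix. It assumes the column rearrangement has already been done, so the last r columns are linearly independent, and it computes M by Gauss-Jordan elimination over GF(2). All function names and the matrix H are assumptions made for illustration.

```python
def gf2_inverse(B):
    """Inverse of a square 0/1 matrix over GF(2) by Gauss-Jordan elimination."""
    r = len(B)
    # Augment B with the identity matrix: [B | I].
    aug = [list(B[i]) + [int(i == j) for j in range(r)] for i in range(r)]
    for col in range(r):
        pivot = next((row for row in range(col, r) if aug[row][col]), None)
        if pivot is None:
            raise ValueError("columns are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(r):
            if row != col and aug[row][col]:
                aug[row] = [x ^ y for x, y in zip(aug[row], aug[col])]
    return [row[r:] for row in aug]


def gf2_matmul(A, B):
    """Matrix product A * B over GF(2)."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) % 2
             for j in range(len(B[0]))] for i in range(len(A))]


def to_systematic(H):
    """Return MH, where M is the inverse of the last r columns of H."""
    r = len(H)
    M = gf2_inverse([row[-r:] for row in H])
    return gf2_matmul(M, H)


# Hypothetical (6,3) parity check matrix, not in systematic form.
H = [
    [0, 1, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
H_sys = to_systematic(H)
for row in H_sys:
    print(row)
# [1, 1, 0, 1, 0, 0]
# [1, 0, 1, 0, 1, 0]
# [0, 1, 1, 0, 0, 1]
```

The last three columns of the result form the identity matrix, so the output has the form [P | I].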

  25. SEC-DED Codes – Algebraic method (contd.) • Definition: The syndrome S of an n-bit word x is $S = H x^T$. Note that S is an r-bit vector • Note also that in the above equation $x^T$ selects a linear combination of columns of H • Example: consider a (6,3) systematic H and a 6-bit vector x • Theorem: for an (n,k) linear block code represented by H the syndrome of every code word is 0 • The proof follows more or less from the way we have defined a block code and the H matrix • Definition: The error word, E, is a vector that represents where a code word is erroneous • Example in class to define all these terms

  26. SEC-DED Codes – Algebraic method (contd.) • Theorem: let C be a code word and E be an error word, i.e. C’ = C + E is the erroneous word (code word with an error in it). Let S’ be the syndrome of the word C’; then $S' = H E^T$ • Theorem: A linear block code represented by H is SEC if and only if the columns of H are distinct and nonzero • Theorem: A linear block code represented by H is SEC-DED if: • All columns of H are distinct and nonzero • The sum of any two columns of H is nonzero and is not equal to a third column of H
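The first theorem follows in one line from linearity and from the fact that every code word has syndrome 0. The derivation below is a sketch in the notation of the slides; the symbols $e_j$ (error word with a single 1 in position j) and $h_j$ (column j of H) are introduced here for illustration.

```latex
S' = H C'^T = H (C + E)^T = H C^T + H E^T = 0 + H E^T = H E^T
% Single error in position j: E = e_j, hence S' = H e_j^T = h_j.
% Double error in positions j and l: S' = h_j + h_l.
```

This is why the SEC condition asks for distinct nonzero columns, and why the SEC-DED condition asks that the sum of two columns be neither zero nor another column.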

  27. SEC-DED Codes – Algebraic method (contd.) • Consider an H matrix in which each column has an odd number of 1’s. The code generated by such an H matrix is called an odd weight column code • Example: consider r = 4. Let us consider an H, a $4 \times 8$ matrix:

      1  0  0  0  0  1  1  1
      0  1  0  0  1  0  1  1     = H
      0  0  1  0  1  1  0  1
      0  0  0  1  1  1  1  0
      (columns 1-4 have weight 1; columns 5-8 have weight 3)

      This is an (8,4) SEC-DED code • Theorem: An odd weight column code is a SEC-DED code • Theorem: A Hamming code with overall parity is a SEC-DED code
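The column conditions from the SEC and SEC-DED theorems can be checked by brute force. The sketch below applies them to the (8,4) matrix above and, for contrast, to the (7,4) Hamming H; the function names are assumptions made for illustration.

```python
from itertools import combinations

H_8_4 = [
    [1, 0, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
H_7_4 = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def columns(H):
    """Columns of H as tuples."""
    return [tuple(row[j] for row in H) for j in range(len(H[0]))]


def is_sec(H):
    """All columns nonzero and pairwise distinct."""
    cols = columns(H)
    return all(any(c) for c in cols) and len(set(cols)) == len(cols)


def is_sec_ded(H):
    """SEC, and no sum of two columns is zero or equal to another column."""
    if not is_sec(H):
        return False
    cols = columns(H)
    col_set = set(cols)
    for a, b in combinations(cols, 2):
        s = tuple(x ^ y for x, y in zip(a, b))
        if not any(s) or s in col_set:
            return False
    return True


def is_odd_weight_column(H):
    """Every column has an odd number of 1s."""
    return all(sum(c) % 2 == 1 for c in columns(H))


print(is_odd_weight_column(H_8_4), is_sec_ded(H_8_4))  # True True
print(is_sec(H_7_4), is_sec_ded(H_7_4))                # True False
```

The (7,4) code fails the SEC-DED test because, for example, columns 1 and 2 sum to column 3.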

  28. SEC-DED Codes – Algebraic method (contd.) • Shortened codes • Sometimes we are interested in codes that do not exactly satisfy the bound derived for perfect Hamming codes. For example, consider the case when k = 8. We need r = 4 check bits (since $2^4 = 16 \ge 8 + 4 + 1$), but we do not want the full (15,11) code; what we want is a (12,8) code. The following result comes in handy to design such codes and still have error correction capability • Result: Deleting columns of H does not reduce the error correction capability of the corresponding code • Proof: the conditions stated in the theorems (for example, the remaining columns being odd weight columns, or no two columns being identical) are not affected by deleting columns of H • Which columns to delete? See the hardware issues that follow.

  29. SEC-DED Codes – Hardware • Encoding hardware • [Figure: the k information bits feed an XOR tree that produces the r check bits; the output is the k information bits together with the r check bits]

  30. SEC-DED Codes – Hardware (contd.) • Decoding hardware – Algorithm • Compute syndrome S • If S = 0 then no error • If $S \ne 0$ { decode S • If S is in range (decoded $S \le n$) then correct the S-th bit • Else there is an uncorrectable error } • Note: it is easy to determine if S is 0 • Decoding S is also straightforward • Correction implies a bit flip (EOR operation)
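The algorithm can be sketched in software under one stated assumption: H is the positional Hamming matrix (column j is the binary representation of j) shortened to n columns, so the syndrome is simply the XOR of the 1-based positions that hold a 1. The function names and the n = 12 example are assumptions made here; a SEC-DED decoder would additionally use the overall parity to flag double errors.

```python
def syndrome(word):
    """XOR of the 1-based positions holding a 1 (column j of H is binary j)."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def decode(word):
    """Apply the decoding algorithm; return (word after correction, status)."""
    n = len(word)
    s = syndrome(word)
    if s == 0:
        return list(word), "no error"
    if s <= n:
        fixed = list(word)
        fixed[s - 1] ^= 1  # correction is a bit flip (EOR with 1)
        return fixed, f"corrected bit {s}"
    return list(word), "uncorrectable error"


zero = [0] * 12          # the all-zero word is a code word of any linear code

single = list(zero)
single[6] = 1            # error in position 7
print(decode(single)[1])  # corrected bit 7

double = list(zero)
double[4] = 1            # errors in positions 5 and 8
double[7] = 1
print(decode(double)[1])  # uncorrectable error (syndrome 13 > 12)
```

With only the SEC conditions, a double error whose syndrome happens to fall in range (for example positions 1 and 2, syndrome 3) would be miscorrected, which is the case the DED property is meant to catch.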

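  31. SEC-DED Codes – Hardware (contd.) • Decoding hardware – Implementation • [Figure: the k information bits and r check bits feed an EOR tree that produces the syndrome; blocks labeled OR, NOR, and decoder follow; the n decoder outputs drive an error corrector made of n EORs, which outputs the corrected word]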

  32. SEC-DED Codes – Hardware (contd.) • Hardware simplification • Reduce number of EORs • Have as few 1s in the matrix as possible • Reduce delay – depth of EOR tree • Have as few 1s in each row of H as possible

  33. SEC-DED-SBD Codes • Motivation • Many memories are organized as byte oriented • Failures manifest themselves as follows • Random failure – bit error • Chip failure – byte error • The objective is to detect such byte errors while detecting and correcting random errors. Hence the error model • Single random error • Multiple errors limited to within a byte

  34. SEC-DED-SBD Codes (contd.) • Theorem (Reddy): Let $E_1$ and $E_2$ be two sets of error patterns with $E_1 \cap E_2 = \emptyset$. A linear block code described by H can correct all errors in $E_1$ and detect all errors in $E_2$ if and only if • For every e in $E_1 \cup E_2$: $H e^T \ne 0$ • For all $e_i \ne e_j$ in $E_1$: $H e_i^T \ne H e_j^T$ • For every $e_i$ in $E_2$ there is no $e_j$ in $E_1$ such that $H e_i^T = H e_j^T$

  35. SEC-DED-SBD Codes (contd.) • To demonstrate the use of the theorem, let us look at an example H matrix and its capabilities for a small byte (nibble) size • b = number of bits in each memory card • n = total number of bits in a code word • r = number of check bits • $n = b\,(2^{r-b+1} - 1)$ • For b = 4 and r = 5 we have n = 12. Thus we will construct a (12,7) code which will be able to correct any single error and detect errors confined to 4-bit nibbles
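The arithmetic behind the (12,7) example, written out as a short worked step (k = n − r is the usual count of information bits):

```latex
n = b\,\bigl(2^{\,r-b+1} - 1\bigr) = 4\,\bigl(2^{\,5-4+1} - 1\bigr) = 4\,(2^{2} - 1) = 4 \cdot 3 = 12,
\qquad k = n - r = 12 - 5 = 7
```

The code word therefore consists of three 4-bit nibbles.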

  36. SEC-DED-SBD Codes (contd.) • Many parts of the code are shown as blocks in the following figure • [Figure: block structure of H, with blocks labeled "Correction part" and "Detect multiple errors in byte"]

  37. SEC-DED-SBD Codes (contd.) • Now let us look at the complete matrix (5 rows, 12 columns, shown grouped into three 4-bit nibbles):

      0 0 0 0   1 1 1 1   1 1 1 1
      1 1 1 1   0 0 0 0   1 1 1 1
      1 0 0 0   1 0 0 0   1 0 0 0
      0 1 0 1   0 1 0 1   0 1 0 1
      0 0 1 1   0 0 1 1   0 0 1 1

  38. SEC-DED-SBD Codes (contd.) • The capability can be proven as follows • E1 = single errors, E2 = multiple errors limited to a 4-bit nibble • All columns are nonzero and any combination of columns within a 4-bit nibble is also nonzero • All columns are distinct, providing single error correction capability • The last 3 rows guarantee that no combination of errors limited to a nibble will have a syndrome identical to a single error syndrome
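The three conditions of Reddy's theorem can also be checked exhaustively for this small matrix. The sketch below assumes the 5 × 12 matrix read row by row from the previous slide, takes E1 as all single-bit errors and E2 as all patterns of two or more errors inside one nibble; the variable and function names are assumptions made for illustration.

```python
from itertools import combinations

H = [
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]
B, N = 4, 12  # nibble size and code word length


def syndrome(positions):
    """Syndrome of the error pattern with 1s at the given 0-based positions."""
    return tuple(sum(row[p] for p in positions) % 2 for row in H)


# E1: all single-bit errors. E2: 2, 3 or 4 errors confined to one nibble.
E1 = [(p,) for p in range(N)]
E2 = [tuple(base + off for off in offs)
      for base in range(0, N, B)
      for size in range(2, B + 1)
      for offs in combinations(range(B), size)]

s1 = [syndrome(e) for e in E1]
s2 = [syndrome(e) for e in E2]

cond_nonzero = all(any(s) for s in s1 + s2)  # every error is detected
cond_distinct = len(set(s1)) == len(s1)      # single errors are distinguishable
cond_separate = set(s1).isdisjoint(s2)       # nibble errors never look like single errors

print(len(E1), len(E2))                            # 12 33
print(cond_nonzero, cond_distinct, cond_separate)  # True True True
```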

  39. SEC-DED-SBD Codes (contd.) • Comments • The code can be converted to a systematic code • The distance of the code can be increased by 1 to make it a DED code • This code can also be shortened

  40. Summary • Why ECC in Fault tolerance • Hamming code – by example • Algebra and Algebraic coding • Codes • Hardware • SEC-SBD code
