CprE 545: Fault Tolerant Systems

Presentation Transcript


  1. CprE 545: Fault Tolerant Systems
     System Level Fault Diagnosis
     CprE 545 Fault Tolerant Systems (G. Manimaran), Iowa State University

  2. Introduction
     • The basic goal of system level diagnosis is to identify all the faulty units in a system.
     • The PMC model is used both to determine how diagnosable a system is and to perform the diagnosis.
     • The PMC model was introduced by Preparata, Metze, and Chien in 1967.

  3. PMC Model
     • In the PMC model, a system S is decomposed into “n” units, not necessarily identical, denoted by U = {u1, u2, …, un}.
     • Each unit is considered to be either completely working or completely faulty; there is no intermediate state.
     • The status of the units does not change during the diagnosis.

  4. PMC Model (Contd.)
     • Each unit in U is assigned a particular subset of U to test (no unit tests itself). The complete set of tests is called the connection assignment and is represented as a directed graph G = (U, E).
     • In this graph, each node represents a unit, and each edge represents a testing link.
     • An edge (Ui, Uj) exists in G if and only if node Ui tests node Uj.
     • aij = outcome of the test (Ui, Uj): 0 if Ui finds Uj fault-free, 1 if Ui finds Uj faulty.
     • The value of aij is arbitrary if the node Ui is faulty.
     • The set of test outcomes of a system S is called the syndrome of S.
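The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-unit ring `RING_TESTS`, the function name `pmc_syndrome`, and the use of a random choice to stand in for the "arbitrary" outcome of a faulty tester are not part of the slides.

```python
import random

# Hypothetical connection assignment: five units tested in a ring,
# 1 -> 2 -> 3 -> 4 -> 5 -> 1. Each pair is (tester, tested).
RING_TESTS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]

def pmc_syndrome(tests, faulty, rng=None):
    """Return {(i, j): a_ij} for every test edge under the PMC model."""
    rng = rng or random.Random()
    syndrome = {}
    for tester, tested in tests:
        if tester in faulty:
            # A faulty tester reports an arbitrary outcome (modelled as random).
            syndrome[(tester, tested)] = rng.choice([0, 1])
        else:
            # A fault-free tester reports 1 exactly when the tested unit is faulty.
            syndrome[(tester, tested)] = 1 if tested in faulty else 0
    return syndrome

syndrome = pmc_syndrome(RING_TESTS, faulty={1}, rng=random.Random(0))
print(syndrome)
```

With unit 1 faulty, the entry for edge (1, 2) may be 0 or 1, while every other entry is fixed by the fault set.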

  5. Connection Assignment Graph
     • aij = 0 if Uj is non-faulty; aij = 1 if Uj is faulty
     • [Figure: five units tested in a ring, U1 → U2 → U3 → U4 → U5 → U1, with test outcomes a12 = x, a23 = 0, a34 = 0, a45 = 0, a51 = 1]
     • The syndrome of this system is a 5-bit vector: (a12, a23, a34, a45, a51) = (x, 0, 0, 0, 1)
     • Is it 1-fault diagnosable? Is it 2-fault diagnosable?

  6. Centralized Diagnosis
     • In the PMC model, the syndrome is assumed to be analyzed by a centralized supervisor, which is an ultra-reliable processor.
     • t-fault diagnosable: A system S is t-diagnosable if, given a syndrome, all faulty units in S can be identified, provided that the number of faulty units does not exceed t.
     • Two conditions are necessary for a system with “n” units to be t-diagnosable, and together they are sufficient when no two units test each other:
         • n ≥ 2t + 1
         • Each unit is tested by at least t others
     • Several centralized algorithms exist to analyze the syndrome.
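As a rough illustration, the two listed conditions can be checked mechanically for a given connection assignment. The function name `meets_pmc_conditions` and the five-unit ring used as input are assumptions made for this sketch; the check covers only the two conditions on the slide and nothing else.

```python
def meets_pmc_conditions(n, tests, t):
    """Check n >= 2t + 1 and that every unit is tested by at least t others.

    Units are assumed to be numbered 1..n; tests is a list of
    (tester, tested) pairs.
    """
    testers = {u: set() for u in range(1, n + 1)}
    for tester, tested in tests:
        if tester != tested:  # no unit tests itself
            testers[tested].add(tester)
    return n >= 2 * t + 1 and all(len(s) >= t for s in testers.values())

# Hypothetical example: five units tested in a ring.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
print(meets_pmc_conditions(5, ring, 1))  # True
print(meets_pmc_conditions(5, ring, 2))  # False: each unit has only one tester
```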

  7. Diagnosability vs. Diagnosis problems
     • Diagnosability: the problem of determining “t” for a given system, i.e., the maximum number of units that can be faulty such that the set of faulty units can still be uniquely identified from any syndrome.
     • Diagnosis: the problem of determining the faulty units from a given syndrome, given that there are at most “t” faulty units.
     • The diagnosability problem is concerned only with what is theoretically possible.
     • The diagnosis problem is concerned with actually finding an algorithm that identifies the faulty units from a given syndrome (provided, of course, that the system is diagnosable).
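Both problems can be illustrated by brute force on a small system. The sketch below is not one of the centralized algorithms the slides refer to; it simply enumerates fault sets, which is only practical for a handful of units. It relies on the PMC rule that a fault-free tester always reports the true state of the unit it tests, so two fault sets can share a syndrome only if no unit that is fault-free in both tests a unit that is faulty in exactly one. All function names and the ring example are hypothetical.

```python
from itertools import combinations

def fault_sets(units, t):
    """All candidate fault sets containing at most t units."""
    return [frozenset(c) for k in range(t + 1) for c in combinations(units, k)]

def distinguishable(tests, f1, f2):
    """True if a unit fault-free in both sets tests a unit faulty in exactly one."""
    return any((u not in (f1 | f2)) and (v in (f1 ^ f2)) for u, v in tests)

def is_t_diagnosable(units, tests, t):
    """Diagnosability check: no two fault sets of size <= t share a syndrome."""
    return all(distinguishable(tests, a, b)
               for a, b in combinations(fault_sets(units, t), 2))

def max_t(units, tests):
    """Largest t for which the system is t-diagnosable."""
    t = 0
    while t < len(units) and is_t_diagnosable(units, tests, t + 1):
        t += 1
    return t

def consistent(tests, syndrome, faulty):
    """A fault set explains a syndrome if every fault-free tester is truthful."""
    return all(syndrome[(u, v)] == (1 if v in faulty else 0)
               for u, v in tests if u not in faulty)

def diagnose(units, tests, syndrome, t):
    """Diagnosis: every fault set of size <= t that explains the syndrome."""
    return [set(f) for f in fault_sets(units, t) if consistent(tests, syndrome, f)]

# Hypothetical example: five units tested in a ring, with unit 1 faulty.
units = [1, 2, 3, 4, 5]
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
observed = {(1, 2): 0, (2, 3): 0, (3, 4): 0, (4, 5): 0, (5, 1): 1}

print(max_t(units, ring))                  # 1
print(diagnose(units, ring, observed, 1))  # [{1}]
```

In a 1-diagnosable system, `diagnose` with t = 1 returns exactly one fault set for any syndrome that a single fault can produce.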

  8. Distributed Diagnosis
     • The centralized approach is not suitable for distributed systems.
     • The goal of system diagnosis in distributed systems is to ensure that if some nodes fail (or recover), the other nodes in the system find out about the failure (or recovery) in finite time.

  9. Adaptive Distributed System Level Diagnosis
     • The Adaptive DSD algorithm is executed by each node in the system.
     • Each node “i” maintains an array “TESTED_UPi”. It contains “n” elements, indexed by node number.
     • Each element of “TESTED_UPi” contains a node number.
     • The entry TESTED_UPi[k] = j means that node “i” has received diagnostic information from a fault-free node specifying that node “k” has tested node “j” and found it fault-free.
     • An entry TESTED_UPi[m] may be arbitrary if the node “m” is faulty.

  10. Adaptive DSD: Overview
      • The nodes are sequentially ordered in a circular list, say 0, 1, …, n-1, 0.
      • A node “i” sequentially tests nodes (i+1) mod n, (i+2) mod n, … until it finds a fault-free node.
      • Diagnostic information from this fault-free node is copied to the local TESTED_UP array.

  11. Adaptive DSD: Algorithm for node “i” (one round)
      t = i
      Repeat
          t = (t + 1) mod n
          Request t to forward TESTED_UPt to “i”
      Until (i tests t as “fault-free”)
      TESTED_UPi[i] = t
      For j = 0 to (n-1) do
          If (j != i)   /* copies the array contents, keeping node i's own entry */
              TESTED_UPi[j] = TESTED_UPt[j]
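One round of the algorithm can be simulated in Python as a minimal sketch. Several things here are assumptions rather than part of the slides: nodes are numbered 0 to n-1, the copy loop skips node i's own entry, the real test is replaced by a callback `is_fault_free`, array forwarding is modelled by reading `tested_up[t]` directly, and a guard stops the search if every other node tests faulty. The fault set {1, 4, 5} is a made-up example.

```python
def adaptive_dsd_round(i, n, is_fault_free, tested_up):
    """Run one testing round at node i, updating tested_up[i] in place.

    tested_up[k] is node k's local TESTED_UP array (a list of length n).
    is_fault_free(j) stands in for the actual test of node j.
    Returns the fault-free node found, or None if there is none.
    """
    t = i
    while True:
        t = (t + 1) % n
        if t == i:
            return None  # assumption: give up once every other node tested faulty
        if is_fault_free(t):
            break
    tested_up[i][i] = t
    for j in range(n):
        if j != i:  # keep node i's own entry, copy everything else from t
            tested_up[i][j] = tested_up[t][j]
    return t

# Hypothetical run: 8 nodes, of which 1, 4 and 5 are faulty.
n = 8
faulty = {1, 4, 5}
tested_up = [[None] * n for _ in range(n)]
for _ in range(n):  # n rounds
    for node in range(n):
        if node not in faulty:
            adaptive_dsd_round(node, n, lambda j: j not in faulty, tested_up)

print(tested_up[0])  # [2, None, 3, 6, None, None, 7, 0]
```

Entries for the faulty nodes stay at their initial value (`None` here), which mirrors the statement that such entries carry no reliable information.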

  12. Adaptive DSD: Example
      • [Figure: nodes 0 to 7 arranged in a ring, with labels TESTED_UP1, TESTED_UP2, TESTED_UP3, TESTED_UP6, and TESTED_UP7]
      • Over several rounds, the information in the TESTED_UP arrays is spread to all the nodes.

  13. The Diagnose Algorithm
      • Uses STATEi[k] = FAULTY / FAULT-FREE, i.e., the state of node “k” as found by node “i”
      • Algorithm
          Initialize STATEi[j] = FAULTY for all j
          t = i
          Repeat
              STATEi[t] = FAULT-FREE
              t = TESTED_UPi[t]
          Until (t = i)
      • Intuitively, it walks the “test edges” around the circular list, from each fault-free node to the node it tested as fault-free, until it returns to node “i”.
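The Diagnose steps translate almost directly into Python. This sketch assumes 0-based node numbers and a fully populated TESTED_UP array whose fault-free entries form a cycle through node i; the function name `diagnose_states` and the sample array are hypothetical. The loop is bounded by n, in line with the walk visiting each node at most once.

```python
FAULTY = "FAULTY"
FAULT_FREE = "FAULT-FREE"

def diagnose_states(i, tested_up_i):
    """Return the STATE array computed at node i from its TESTED_UP array."""
    n = len(tested_up_i)
    state = [FAULTY] * n          # every node is presumed faulty at first
    t = i
    for _ in range(n):            # at most n steps
        state[t] = FAULT_FREE
        t = tested_up_i[t]        # move to the node that t tested fault-free
        if t == i:
            break
    return state

# Hypothetical TESTED_UP array at node 0 in an 8-node system where
# nodes 1, 4 and 5 are faulty (their entries carry no information).
tested_up_0 = [2, None, 3, 6, None, None, 7, 0]
state = diagnose_states(0, tested_up_0)
print([k for k, s in enumerate(state) if s == FAULTY])  # [1, 4, 5]
```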

  14. Properties – Adaptive DSD algorithm
      • It takes “n” rounds to fill up the TESTED_UP array.
      • The STATE array can be filled in at most “n” steps.
      • An arbitrary number of faulty units (up to n-1) can be detected.
      • Assumption: no failures or recoveries occur during the execution of the algorithm (i.e., during the “n” rounds).

  15. What is the “test”?
      • Node “i” tests node “j”: a process is created at node “j”.
      • Process creation itself verifies that the process scheduler is operational.
      • The process checks several hardware and software facilities and the disk subsystem, and performs some known arithmetic operations.
      • If the result of the test is not provided within a “timeout” period, then the tested node is assumed to have failed.
