CprE 545: Fault Tolerant Systems

System Level Fault Diagnosis
CprE 545 Fault Tolerant Systems (G. Manimaran), Iowa State University

Introduction
• The basic goal of system-level diagnosis is to identify all the faulty units in a system.
• The PMC model is used both to determine how diagnosable a system is and to perform the diagnosis.
• The PMC model was introduced by Preparata, Metze, and Chien in 1967.

PMC Model
• In the PMC model, a system S is decomposed into “n” units, not necessarily identical, denoted by U = {u1, u2, …, un}.
• Each unit is considered to be either completely fault-free or completely faulty; there is no intermediate state.
• The status of a unit does not change during the diagnosis.

PMC Model (Contd.)
• In the PMC model, each unit belonging to U is assigned a particular subset of U to test (no unit tests itself). The complete set of tests is called the connection assignment and is represented as a directed graph G = (U, E).
• In this graph, each node represents a unit, and each edge represents a testing link.
• An edge (Ui, Uj) exists in G if and only if node Ui tests node Uj.
• aij = outcome of the test (Ui, Uj).
• The value of aij is arbitrary if the node Ui is faulty.
• The set of test outcomes of a system S is called the syndrome of S.
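To make the notion of a syndrome concrete, here is a minimal Python sketch that produces one from a connection assignment and a set of faulty units. The name `generate_syndrome`, the edge-list representation, the 0/1 encoding (0 = test passed, 1 = test failed), and the use of a random bit for a faulty tester's outcome are all assumptions made for this illustration.

```python
import random

def generate_syndrome(edges, faulty, rng=None):
    """Return {(i, j): a_ij} for every test edge (i, j).

    A fault-free tester reports 1 if the tested unit is faulty and 0
    otherwise. A faulty tester's outcome is arbitrary, modelled here
    (as an assumption) by a random bit.
    """
    rng = rng or random.Random()
    syndrome = {}
    for i, j in edges:
        if i in faulty:
            syndrome[(i, j)] = rng.randint(0, 1)  # unreliable outcome
        else:
            syndrome[(i, j)] = 1 if j in faulty else 0
    return syndrome

# Hypothetical five-unit ring in which each unit tests its successor.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
s = generate_syndrome(ring, faulty={1}, rng=random.Random(0))
```

Only the outcome of the test performed by the faulty unit 1 varies between runs; every outcome reported by a fault-free tester is fully determined by the fault set.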

Connection Assignment Graph
• aij = 0 if Uj is fault-free; aij = 1 if Uj is faulty (assuming the tester Ui is fault-free).
• [Figure: five units tested in a ring, U1 → U2 → U3 → U4 → U5 → U1, with test outcomes a12 = x, a23 = 0, a34 = 0, a45 = 0, a51 = 1.]
• Is it 1-fault diagnosable? Is it 2-fault diagnosable?
• The syndrome of this system is a 5-bit vector: (a12, a23, a34, a45, a51) = (x, 0, 0, 0, 1).
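One way to explore the two questions for this particular syndrome is to enumerate every candidate fault set and keep those that could have produced it. The sketch below does this by brute force; the function names are hypothetical, and the outcome marked x is represented by `None` and treated as placing no constraint, which is an assumption of this example.

```python
from itertools import combinations

def is_consistent(edges, syndrome, faulty):
    """True if `faulty` could have produced `syndrome` under PMC rules."""
    for (i, j), outcome in zip(edges, syndrome):
        if i in faulty or outcome is None:
            continue  # faulty tester (or unknown outcome): no constraint
        if outcome != (1 if j in faulty else 0):
            return False
    return True

def consistent_fault_sets(units, edges, syndrome, t):
    """All fault sets of size <= t that are consistent with the syndrome."""
    return [set(c)
            for size in range(t + 1)
            for c in combinations(units, size)
            if is_consistent(edges, syndrome, set(c))]

units = [1, 2, 3, 4, 5]
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
syndrome = [None, 0, 0, 0, 1]  # None stands for the outcome marked x

print(consistent_fault_sets(units, ring, syndrome, t=1))  # [{1}]
print(consistent_fault_sets(units, ring, syndrome, t=2))  # [{1}, {1, 2}]
```

With at most one fault, this syndrome points uniquely to U1. Allowing two faults leaves {U1} and {U1, U2} indistinguishable, so this syndrome alone shows that the ring cannot be 2-fault diagnosable.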

Centralized Diagnosis
• In the PMC model, the syndrome is assumed to be analyzed by a centralized supervisor, which is an ultra-reliable processor.
• t-fault diagnosable: A system S is t-diagnosable if, given a syndrome, all faulty units in S can be identified, provided that the number of faulty units does not exceed t.
• Two conditions are necessary for a system with “n” units to be t-diagnosable, and together they are sufficient when no two units test each other:
  • n ≥ 2t + 1
  • Each unit is tested by at least t others
• Several centralized algorithms exist to analyze the syndrome.
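The two conditions on n and on the number of testers per unit are easy to check mechanically. The following sketch computes the largest t that satisfies both for a given connection assignment; the function name and the convention that units are numbered 1 through n are assumptions of this example, and the result is only the bound implied by these two conditions.

```python
def max_t_from_conditions(n, edges):
    """Largest t with n >= 2t + 1 and every unit tested by >= t others."""
    testers = {u: set() for u in range(1, n + 1)}
    for i, j in edges:
        testers[j].add(i)  # unit i tests unit j
    min_testers = min(len(s) for s in testers.values())
    return min((n - 1) // 2, min_testers)

# Five-unit ring: each unit is tested by exactly one other unit.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
print(max_t_from_conditions(5, ring))  # 1

# Hypothetical denser assignment: each unit tests its next two successors.
two_step = [(i, (i + k - 1) % 5 + 1) for i in range(1, 6) for k in (1, 2)]
print(max_t_from_conditions(5, two_step))  # 2
```

For the ring, the count of testers per unit is the limiting factor; for the denser assignment, both conditions allow t = 2.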

Diagnosability vs. Diagnosis Problems
• Diagnosability problem: determining “t” for a given system, i.e., the maximum number of units that can be faulty such that the set of faulty units can still be uniquely identified on the basis of any syndrome.
• Diagnosis problem: determining the faulty units from a given syndrome, given that there are at most “t” faulty units.
• The diagnosability problem is concerned only with what is theoretically possible.
• The diagnosis problem is concerned with actually finding an algorithm that identifies the faulty units from a given syndrome (provided, of course, that the system is diagnosable).

Distributed Diagnosis
• The centralized approach is not suitable for distributed systems.
• The goal of system diagnosis in distributed systems is to ensure that if some nodes fail (or recover), then the other nodes in the system find out about the failure (recovery) in a finite time.

Adaptive Distributed System Level Diagnosis
• The Adaptive DSD algorithm is executed by each node in the system.
• Each node “i” maintains an array “TESTED_UPi”. It contains “n” elements, indexed by the node number.
• Each element of “TESTED_UPi” contains a node number.
• The entry TESTED_UPi[k] = j means that node “i” has received diagnostic information from a fault-free node specifying that node “k” has tested node “j” and found it to be fault-free.
• An entry TESTED_UPi[m] may be arbitrary if the node “m” is faulty.

Adaptive DSD: Overview
• The nodes are sequentially ordered in a circular list, say as 0, 1, …, n-1, 0.
• A node “i” sequentially tests nodes (i+1) mod n, (i+2) mod n, … until it finds a fault-free node.
• Diagnostic information from this fault-free node is copied to the local TESTED_UP array.

Adaptive DSD: Algorithm for node “i” (one round)
• t = i
• Repeat
  • t = (t + 1) mod n
  • Request t to forward TESTED_UPt to “i”
• Until (i tests t as “fault-free”)
• TESTED_UPi[i] = t
• For j = 0 to (n-1) do
  • If (j != i) /* copies the array contents, except i's own entry */
    • TESTED_UPi[j] = TESTED_UPt[j]
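The round described on this slide can be sketched in Python as follows. This is an illustrative simulation under stated assumptions, not the original implementation: nodes are numbered 0 to n-1, `tested_up[k]` holds node k's local array, the callback `is_fault_free` is a hypothetical stand-in for the actual test, and the early return when no other fault-free node exists is a guard added for this example.

```python
def adaptive_dsd_round(i, n, tested_up, is_fault_free):
    """Run one Adaptive DSD round at node i, updating tested_up[i] in place.

    tested_up[k] is node k's local TESTED_UP list (None = no entry yet);
    is_fault_free(i, t) stands in for "i tests t" (hypothetical callback).
    """
    t = i
    while True:
        t = (t + 1) % n
        if t == i:
            return None  # added guard: no other fault-free node was found
        if is_fault_free(i, t):
            break
    tested_up[i][i] = t
    for j in range(n):
        if j != i:  # keep i's own entry, copy everything else from t
            tested_up[i][j] = tested_up[t][j]
    return t

# Hypothetical four-node system in which node 2 is faulty.
n = 4
faulty = {2}
tested_up = [[None] * n for _ in range(n)]
tested_up[3][3] = 0  # suppose node 3 has already found node 0 fault-free
found = adaptive_dsd_round(1, n, tested_up, lambda i, t: t not in faulty)
print(found, tested_up[1])  # 3 [None, 3, None, 0]
```

Node 1 skips the faulty node 2, records node 3 as its fault-free successor, and inherits node 3's knowledge that node 0 is fault-free.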

Adaptive DSD: Example
• [Figure: eight nodes, 0 through 7, arranged in a ring, with the arrays TESTED_UP1, TESTED_UP2, TESTED_UP3, TESTED_UP6, and TESTED_UP7 shown next to their nodes.]
• Over several rounds, the information in the TESTED_UP arrays is spread to all the nodes.

The Diagnose Algorithm
• Uses STATEi[k] = FAULTY / FAULT-FREE, i.e., the state of node “k” as found by node “i”
• Algorithm
  • Initialize STATEi[j] = FAULTY for all j
  • t = i
  • Repeat
    • STATEi[t] = FAULT-FREE
    • t = TESTED_UPi[t]
  • Until (t = i)
• Intuitively, it is like following the “test edges” around the circular list, starting at node “i”, until the path returns to “i”.
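A minimal Python sketch of the Diagnose steps is given below. It assumes nodes numbered 0 to n-1 and a TESTED_UP array that already holds a complete cycle of fault-free nodes; the iteration bound of n is a safeguard added for this example, and the eight-node fault pattern is hypothetical.

```python
FAULTY, FAULT_FREE = "FAULTY", "FAULT-FREE"

def diagnose(i, tested_up_i):
    """Return STATE_i, the state of every node as diagnosed by node i."""
    n = len(tested_up_i)
    state = [FAULTY] * n
    t = i
    for _ in range(n):  # a cycle of fault-free nodes has at most n members
        state[t] = FAULT_FREE
        t = tested_up_i[t]
        if t == i:
            break
    return state

# Hypothetical eight-node system with nodes 1, 4, and 5 faulty.
# Fault-free test cycle: 0 -> 2 -> 3 -> 6 -> 7 -> 0.
# Entries for faulty nodes are arbitrary and shown here as None.
tested_up_0 = [2, None, 3, 6, None, None, 7, 0]
state = diagnose(0, tested_up_0)
print([k for k, s in enumerate(state) if s == FAULTY])  # [1, 4, 5]
```

Entries belonging to faulty nodes are never read, because the walk only ever moves to a node that some fault-free node has tested as fault-free.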

Properties – Adaptive DSD algorithm
• It takes at most “n” rounds to fill up the TESTED_UP array.
• The STATE array can be filled in at most “n” steps.
• An arbitrary number of faulty units can be detected (up to n-1).
• Assumption: There are no failures or recoveries during the execution of the algorithm (i.e., during the “n” rounds).

What is the “test”?
• Node “i” tests node “j”: a process is created at node “j”.
• Process creation itself verifies that the process scheduler is operational.
• The process checks several hardware and software facilities and the disk subsystem, and performs some known arithmetic operations.
• If the result of the test is not provided within a “timeout” period, then the tested node is assumed to have failed.
