“Designing Masking Fault Tolerance via Nonmasking Fault Tolerance”

Oğuzhan YILDIRIM – Erkin GÜVEL, Boğaziçi University, Computer Engineering Department, oguzhan.yildirim@boun.edu.tr, erkin.guvel@boun.edu.tr

Introduction • Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. • Nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from which programs continually (re)satisfy their specification.
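The two definitions can be contrasted on a recorded execution. The sketch below is only an illustration under simplifying assumptions: a state is a plain integer, the specification is reduced to a state predicate `safe`, and a finite trace stands in for an infinite execution. The names `is_masking`, `is_nonmasking` and `last_fault_index` are hypothetical and do not come from the original work.

```python
from typing import Callable, Sequence

State = int  # assumption: a state is just an integer in this toy model


def is_masking(trace: Sequence[State], safe: Callable[[State], bool]) -> bool:
    """Masking: every state of the trace satisfies the specification."""
    return all(safe(s) for s in trace)


def is_nonmasking(trace: Sequence[State], safe: Callable[[State], bool],
                  last_fault_index: int) -> bool:
    """Nonmasking: once faults stop, the trace reaches a point from
    which every remaining state satisfies the specification."""
    suffix = trace[last_fault_index + 1:]
    # Look for a position after which all remaining states are safe.
    return any(all(safe(s) for s in suffix[i:]) for i in range(len(suffix)))
```

For example, with `safe = lambda s: s >= 0`, the trace `[0, -1, -2, -1, 0, 1]` with its last fault at index 2 is nonmasking (it converges back to safe states) but not masking (it visits unsafe states on the way).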

Objectives • We will show that a practical method to design masking fault-tolerance is to first design nonmasking fault-tolerance and then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault-tolerance.

Outline • Novel method for the design of “masking” fault-tolerant systems • Actions: critical and noncritical • Overview of the methodology • Case study

The Importance • It is often simpler and cheaper to design nonmasking fault-tolerance than to design masking fault-tolerance. • It is often simpler and cheaper to design safe programs, or programs with well-defined failure, than to design masking fault-tolerant programs.

Critical Actions • Critical actions are those actions whose execution in the presence of faults can violate the system specification. • For example, database transactions and actions that produce an output or commit a result are critical.

Noncritical Actions • The execution of noncritical actions need not mask faults; in other words, when noncritical actions execute, the system state may be “unsafe”. • The execution of noncritical actions in unsafe states must not allow the system to remain in unsafe states forever; otherwise the system will never execute its critical actions.

Overview • First Stage: The system is designed so that after faults stop occurring, subsequent execution of the system actions guarantees that the system reaches a safe state. • Second Stage: The critical actions are modified so that their execution always masks faults.

First Stage • In this stage, first, a nonmasking fault-tolerant version of the program is designed. Then, certain actions of the nonmasking fault-tolerant program are distinguished as being critical. • No specific approach is prescribed; many acceptable methods exist.

First Stage (Cont.) • One option is to design the tolerance requirement hand-in-hand with the other requirements of the program. • Another is to transform an existing fault-intolerant program into one that is nonmasking fault-tolerant.

Second Stage • In this stage, first, a “safe predicate” is identified for each critical action. Then, each critical action is augmented so that it is executed only in states where its safe predicate holds. Finally, the augmentation itself is shown to mask the effects of faults. The resulting program is masking fault-tolerant. • No specific approach is prescribed; many acceptable methods exist.

Second Stage (Cont.) • One option is to add actions that check whether the program state satisfies the safe predicate, and to allow execution to proceed only when the check succeeds. • Another is to enforce real-time constraints on the execution of critical actions.
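The first option, guarding a critical action with a check of its safe predicate, can be sketched as a small wrapper. This is a minimal illustration, not the construction from the original work: the state is a plain dictionary, and the names `guard`, `commit`, `is_safe` and the `corrupt` flag are all hypothetical.

```python
from typing import Callable, Dict

State = Dict[str, int]  # assumption: program state as a flat dictionary


def guard(action: Callable[[State], None],
          safe_predicate: Callable[[State], bool]) -> Callable[[State], bool]:
    """Return a version of `action` that runs only when the safe predicate holds."""
    def guarded(state: State) -> bool:
        if not safe_predicate(state):
            return False      # unsafe state: the critical action is skipped
        action(state)
        return True           # safe state: the critical action was executed
    return guarded


def commit(state: State) -> None:
    """Hypothetical critical action: publish the current value."""
    state["committed"] = state["value"]


def is_safe(state: State) -> bool:
    """Hypothetical safe predicate: no fault has corrupted the state."""
    return state["corrupt"] == 0


guarded_commit = guard(commit, is_safe)
```

Calling `guarded_commit` in a state with `corrupt` set leaves the state untouched; once the noncritical actions have restored a safe state, the same call goes through.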

Application Case Study: Leader Election • System Logic • Arora’s Program: Spanning Tree • Leader Election

System Logic • A system consists of processes, which have unique integer ids, and channels, each of which connects a unique pair of processes. • At any instant, each process is either “up” or “down”. • The system is subject to fail-stop faults and repairs of processes.
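The system model above can be written down as a small data structure. The sketch is an assumption-laden illustration: the class name `System` and its methods are hypothetical, channels are modeled as unordered pairs of process ids, and a fault or repair simply flips the up/down status of one process.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass
class System:
    up: Dict[int, bool] = field(default_factory=dict)          # process id -> up?
    channels: Set[FrozenSet[int]] = field(default_factory=set)  # unordered id pairs

    def add_process(self, pid: int) -> None:
        if pid in self.up:
            raise ValueError("process ids must be unique")
        self.up[pid] = True

    def add_channel(self, a: int, b: int) -> None:
        if a == b or a not in self.up or b not in self.up:
            raise ValueError("a channel connects two distinct known processes")
        self.channels.add(frozenset((a, b)))  # at most one channel per pair

    def fail_stop(self, pid: int) -> None:
        self.up[pid] = False   # fault: the process goes down

    def repair(self, pid: int) -> None:
        self.up[pid] = True    # repair: the process comes back up

    def up_neighbors(self, pid: int) -> Set[int]:
        """Neighbors of `pid` that are currently up."""
        result = set()
        for channel in self.channels:
            if pid in channel:
                (other,) = channel - {pid}
                if self.up[other]:
                    result.add(other)
        return result
```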

Arora’s Nonmasking Program • The case study builds on Arora’s nonmasking fault-tolerant program for distributed maintenance of a rooted spanning tree. • The program allows faults to yield states in which there are multiple trees or unrooted trees. • To deal with unrooted trees, the program has actions that inform all processes in unrooted trees that they have no root process.

Leader Election Problem • The action of interest is the one that declares a process to be the leader. • A unique process is to be elected as the leader; at no point during the election may multiple processes declare themselves leaders. • The goal is to design a masking fault-tolerant program for leader election.

Leader Election Problem (Cont.) • In the absence of faults, our tree maintenance program elects a unique process as leader. • However, in the presence of faults, it allows multiple processes to declare themselves leaders.

Defining Critical Action • In keeping with the proposed method, we proceed by identifying the critical actions in the nonmasking fault-tolerant tree maintenance program. • After this identification, the nonmasking fault-tolerant program is augmented to yield a masking fault-tolerant system.

Defining Critical Action (Cont.) • The critical actions in the tree maintenance program are the actions that elect a process as leader. • These actions can be executed safely only in states where no process has been elected as leader.

Section 2: • Check that the critical action is executed only in a safe state. • Guarantee that the critical action is implemented in a masking fault-tolerant way.

Checking Critical Action • A diffusing computation is used to check whether the critical action is executed in a safe state. • This diffusing computation verifies the safe predicate by reaching all other processes and determining that no process is leader.

Diffusing Computation • The diffusing computation we design consists of two phases: “propagate” and “complete”. • The computation extends in a top-down manner: upon receiving a diffusing computation from its parent in the tree, a process enters the propagate phase and propagates the computation to all of its neighbors. Upon receiving a response from all of its neighbors, the process sends a response to its parent and reverts its phase to complete.
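The two phases can be illustrated with a sequential simulation. This is a simplified sketch, not the distributed program itself: recursion stands in for message passing, the computation is propagated only to tree children rather than to all neighbors, and the names `diffuse`, `children`, `is_leader` and `phase` are hypothetical. The value returned to the parent is true exactly when no process in the subtree is a leader.

```python
from typing import Dict, List


def diffuse(node: int,
            children: Dict[int, List[int]],
            is_leader: Dict[int, bool],
            phase: Dict[int, str]) -> bool:
    """Run the computation in the subtree rooted at `node`.

    Returns True if no process in that subtree is a leader.
    """
    phase[node] = "propagate"            # computation received from the parent
    result = not is_leader[node]
    for child in children.get(node, []):
        # Always visit the child first, then combine its response.
        result = diffuse(child, children, is_leader, phase) and result
    phase[node] = "complete"             # all responses collected; reply to parent
    return result
```

With `children = {1: [2, 3], 3: [4]}` and no leader anywhere, a call on process 1 returns true and leaves every process in the complete phase; if process 4 is already a leader, the same call returns false.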

Masking The Critical Action • If a child suffers a fail-stop fault, the parent completes prematurely with the result value false. • A new diffusing computation is then started and assigned a new sequence number. • In this way, masking is achieved through redundancy: the diffusing computation is repeated.
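The retry scheme can be sketched in the same sequential style. Everything here is an assumption made for illustration: `wave` and `elect_with_retry` are hypothetical names, `up_per_wave` lists the up/down status of each process at the time of each attempt, and the tree shape is kept fixed across attempts. In a message-passing implementation the sequence number would let processes discard responses from an aborted computation; in this sketch it only labels the attempts.

```python
from typing import Dict, List, Optional


def wave(node: int,
         children: Dict[int, List[int]],
         up: Dict[int, bool],
         is_leader: Dict[int, bool]) -> bool:
    """One diffusing computation; a down process yields a premature False."""
    if not up[node]:
        return False                     # fail-stop: parent gets result false
    result = not is_leader[node]
    for child in children.get(node, []):
        result = wave(child, children, up, is_leader) and result
    return result


def elect_with_retry(root: int,
                     children: Dict[int, List[int]],
                     up_per_wave: List[Dict[int, bool]],
                     is_leader: Dict[int, bool]) -> Optional[int]:
    """Repeat the computation until one attempt succeeds.

    Returns the sequence number of the successful attempt, or None.
    """
    for seq, up in enumerate(up_per_wave):
        if wave(root, children, up, is_leader):
            is_leader[root] = True       # critical action, executed in a safe state
            return seq
    return None
```

If process 3 is down during attempt 0 and repaired before attempt 1, the root is elected with sequence number 1; if some other process is already a leader, no attempt succeeds and the root is never elected.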

[Figure: diffusing computation with the root waiting for an answer; a fail-stop fault occurs, followed by repair and recomputation.]

Conclusion • In this presentation, we presented a novel method for designing masking fault-tolerant programs. First, a nonmasking fault-tolerant program was designed to ensure that once faults stop occurring the program eventually reaches a safe state. • Then, a masking component was designed to ensure that the composite program is masking fault-tolerant.

References • Designing Masking Fault-tolerance via Nonmasking Fault-tolerance. Department of Computer and Information Science, The Ohio State University, Columbus, Ohio. • B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, pages 220-232, 1975. • J.-C. Laprie. Dependable computing and fault tolerance: Concepts and terminology. Proceedings of the 15th International Symposium on Fault-Tolerant Computing, pages 2-11, 1985. • Internet Research

Thanks for Listening… • Any questions?
