Integrated Fault Tolerant Techniques Using Parallel Processing

Nasser Alsaedi

Introduction
The ultimate goals of any computer system design are reliable execution of tasks and on-time delivery of service. To increase system reliability we need fault tolerance, and to increase system performance we need parallel processing. This presentation covers integrated fault-tolerant techniques that tolerate hardware and software faults in parallel computers.

Integrated Fault-Tolerant (IFT) Techniques: Introduction
The proposed IFT technique is devised for reliable execution of tasks and concurrent on-line system-level fault diagnosis, where both hardware (processors and communication channels) and software are subject to failure.

Integrated Fault-Tolerant Techniques
For reliable execution of tasks, different program versions of each task are assigned to a group of processors. Processors are grouped using the DGMM algorithm. A task is released if at least (th + 1) processors agree with each other on the outputs for at least (ts + 1) different program versions and the outputs of all the program versions are the same.

Integrated Fault-Tolerant Techniques: The Proposed Work
High Reliability Approach: IFT considers the system as a whole, an integration of hardware and software. Both hardware failures and software failures are considered, in contrast to most existing works, which assume that only one of them, not both, can be faulty.
High Performance Approach: In contrast to most existing works, which focus mainly on improving system reliability and use system resources lavishly, IFT attempts to maximize performance concurrently.

Concerns for the High Reliability and Performance Approach
1) Since every system is fault-free most of the time, allocating a task Ti to (2thi + 1) processors to tolerate thi hardware faults, as is done in some existing works, wastes system resources. Instead, we initially allocate (thi + 1) processors to the task Ti, which is the minimum for tolerating thi hardware faults, and add more processors as needed in case of failures.
2) A similar procedure is used for tolerating software failures. It is important to realize that software is fault-free most of the time as well.

3) Dynamic Group Maximum Matching (DGMM) algorithm for grouping the system graph. The DGMM algorithm always attempts to maximize system performance by increasing the number of concurrent tasks in the system (parallel processing).

4) On-Line Fault Diagnosis: In IFT, faults are diagnosed by running user programs, in contrast to some existing works that require running diagnostic programs. With on-line fault diagnosis, the system continuously executes useful application programs instead of diagnostic programs for failure detection, which add extra overhead and may not provide 100% fault coverage.
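To make concern 1 concrete, the following sketch compares how many task groups could be formed under the two initial allocations. The system size of 36 processors is only an example value, the function name is illustrative, and the count is an upper bound that ignores the connectivity constraints DGMM has to respect.

```python
def max_concurrent_tasks(num_processors, th, redundant=False):
    """Upper bound on tasks that can run at once, ignoring connectivity.

    redundant=False: initial group of th + 1 processors (the IFT choice).
    redundant=True:  initial group of 2*th + 1 processors.
    """
    group_size = 2 * th + 1 if redundant else th + 1
    return num_processors // group_size


for th in (0, 1, 2):
    ift = max_concurrent_tasks(36, th)
    full = max_concurrent_tasks(36, th, redundant=True)
    print(f"th={th}: {ift} groups of {th + 1} vs {full} groups of {2 * th + 1}")
```

For th = 2 this gives 12 groups of three processors against 7 groups of five, which is the kind of gain in concurrency the minimal allocation is aiming for while the system is fault-free.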

Integrated Fault-Tolerant Techniques
Each task Ti has a hardware reliability degree th, where th denotes the upper bound on the number of faulty processors and communication channels that the system can tolerate with respect to the task Ti.
Each task Ti has a software reliability degree ts, where ts denotes the upper bound on the number of faulty program versions that the system can tolerate with respect to the task Ti.

Dynamic Group Maximum Matching (DGMM) Algorithm
The function of the DGMM algorithm is to find a group of connected processors, assign these processors to the task, and maximize system performance. For example, if the task hardware reliability degree is th = 2, DGMM attempts to find a group of g connected processors, where g = th + 1 = 2 + 1 = 3.

System Model
A system is modeled by a graph G(N, E), where N and E are the node set and the edge set of the graph G, respectively. A node represents a processor with its local memory, while an edge represents a communication channel between two neighboring processors.
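As a small illustration of this model, the sketch below builds the graph of a binary cube (the topology used in the examples of this talk) as adjacency sets. The function names and the adjacency-set representation are choices made here for illustration, not something the slides prescribe.

```python
def hypercube(dim):
    """Adjacency sets of a binary dim-cube: neighbors differ in exactly one bit."""
    return {p: {p ^ (1 << b) for b in range(dim)} for p in range(2 ** dim)}


def edges(adj):
    """Edge set E, with each undirected edge listed once as (low, high)."""
    return {(p, q) for p in adj for q in adj[p] if p < q}


cube = hypercube(3)           # N = {0, ..., 7}, one node per processor
print(sorted(cube[0]))        # neighbors of P0
print(len(edges(cube)))       # number of communication channels
```

For the 3-cube, P0 is linked to P1, P2 and P4, and the graph has 12 edges.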

A task Ti finishes execution if thi + 1 processors agree with each other on tsi + 1 program versions.
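The release rule can be sketched as a simple check. This is a simplified reading, not the author's implementation: it assumes outputs are hashable values, counts agreement over the whole group rather than over neighboring pairs only, and treats "all versions give the same output" as "all versions that reached agreement give the same output". The function names are hypothetical.

```python
from collections import Counter


def agreed_output(outputs, th):
    """Return the output shared by at least th + 1 processors, or None.

    outputs maps processor id -> output of one program version.
    """
    if not outputs:
        return None
    value, count = Counter(outputs.values()).most_common(1)[0]
    return value if count >= th + 1 else None


def can_release(version_outputs, th, ts):
    """version_outputs holds one {processor: output} dict per executed version."""
    agreed = [agreed_output(o, th) for o in version_outputs]
    agreed = [v for v in agreed if v is not None]
    # Need ts + 1 versions with enough agreeing processors, all with one output.
    return len(agreed) >= ts + 1 and len(set(agreed)) == 1


# th = 1, ts = 1: two processors must agree on two versions.
print(can_release([{0: 42, 1: 42}, {0: 42, 1: 42}], th=1, ts=1))
print(can_release([{0: 42, 1: 42}, {0: 42, 1: 7}], th=1, ts=1))
```

The first call satisfies the rule; the second does not, because only one version reaches agreement among th + 1 = 2 processors.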

Dynamic Group Maximum Matching (DGMM) algorithm
The proposed DGMM algorithm is a generalization of the group maximum matching concept. In this generalization, the system is dynamically partitioned into disjoint groups of different sizes. At the same time, the DGMM algorithm attempts to minimize the time needed to release the correct outputs and to maximize the on-line fault diagnosis capabilities. This is achieved by trying to increase the group connectivity.

Dynamic Group Maximum Matching (DGMM) algorithm specification
Algorithm
1. If |Gi| = 0 then
   (a) Find a free processor Pj with the lowest degree in the system graph G. In case of a tie, choose a processor randomly.
   (b) If such a processor Pj exists then
       i. Gi = Pj. /* add the processor Pj to the group Gi of the task Ti */
       ii. Delete the processor Pj with all edges incident to it from the system graph G.
2. While (system graph G is non-empty) and (|Gi| < gi) and (Gi has free neighboring processors) do
   (a) Find a neighboring processor Pj with the lowest degree among the processors neighboring the group Gi of the task Ti. In case of a tie, choose a neighboring processor with the highest number of links connected to the processors already in the group Gi.
   (b) Gi = Gi + Pj. /* add the processor Pj to the group Gi of the task Ti */
   (c) Delete the processor Pj with all edges incident to it from the system graph G.
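The two steps of the specification can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the system graph is kept as fixed adjacency sets `adj` plus a set `free` of processors not yet deleted (both names are illustrative), and ties in step 1 are broken by lowest processor id instead of randomly so that the run is reproducible.

```python
def hypercube(dim):
    """Adjacency sets of a binary dim-cube (used here only as a test system)."""
    return {p: {p ^ (1 << b) for b in range(dim)} for p in range(2 ** dim)}


def dgmm(adj, free, group, size):
    """Grow `group` towards `size` processors; chosen processors leave `free`."""
    def degree(p):
        # Degree in the remaining system graph: only free neighbors count.
        return len(adj[p] & free)

    # Step 1: seed an empty group with a lowest-degree free processor.
    if not group and free:
        seed = min(free, key=lambda p: (degree(p), p))
        group.append(seed)
        free.discard(seed)

    # Step 2: add free neighbors of the group until it is large enough.
    while free and len(group) < size:
        members = set(group)
        candidates = {q for p in group for q in adj[p]} & free
        if not candidates:
            break
        # Lowest degree first; ties go to the most links into the group.
        best = min(candidates,
                   key=lambda p: (degree(p), -len(adj[p] & members), p))
        group.append(best)
        free.discard(best)
    return group


adj = hypercube(3)
free = set(range(8))
g1 = dgmm(adj, free, [], 3)
g2 = dgmm(adj, free, [], 2)
g3 = dgmm(adj, free, [], 5)   # only three processors are left
print(g1, g2, g3)
```

On a fault-free 3-cube with this tie-breaking, the sketch returns {P0, P1, P2} for a group size of 3, then {P3, P7} for a group size of 2, and finally only the three remaining processors {P4, P5, P6} when a group of 5 is requested. A caller can pass an existing group back in with a larger `size` to grow it by one processor.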

Dynamic Group Maximum Matching (DGMM) Algorithm Example
Consider the binary 3-cube system shown. Assume that a task T1 with a group size of g1 = 3 is scheduled for execution, then a task T2 with a group size of g2 = 2, and then a task T3 with a group size of g3 = 5.

DGMM Example

In this section I introduce two integrated fault-tolerant scheduling algorithms. These scheduling algorithms are based on the Integrated Fault-Tolerant (IFT) technique and the Dynamic Group Maximum Matching (DGMM) algorithm.

The Integrated Fault-Tolerant First-Come, First-Served (FCFS) scheduling algorithm
• When a task Ti, which may have more than one program version, arrives at the system, it is inserted along with its group size in the queue Q. When the task Ti is scheduled for execution, the DGMM algorithm is called to find a group of the required size for the task Ti.
• If the returned group size is equal to the required group size, the first program version V1i of the task Ti is assigned to the group Gi for execution.

The Integrated Fault-Tolerant First-Come, First-Served (FCFS) scheduling algorithm
• If the DGMM algorithm cannot find a group of the required size gi, it is called again each time a task leaves the system or is inserted in the aborted task queue Qa.
• If the DGMM algorithm returns the required group size, the first program version V1i of the task Ti is assigned to the group Gi for execution.

The Integrated Fault-Tolerant First-Come, First-Served (FCFS) scheduling algorithm
• When a version Vji of a task Ti completes its execution on all the processors in the group Gi, neighboring processors exchange and compare their outputs. Then the disagreement graph DGi is obtained.
• If there is a disagreement between (thi + 1) processors on the outputs, the DGMM algorithm is called to increase the group size of the task Ti by one (gi = gi + 1), and the system executes the first version of Ti again.
• Otherwise, the next version of Ti is executed.

The Integrated Fault-Tolerant First-Come, First-Served (FCFS) scheduling algorithm
• A task Ti is released if at least (thi + 1) different processors agree with each other on the output for at least (tsi + 1) different program versions and the outputs of all the program versions are the same.
• When the task Ti finishes its execution, the detected faulty components are deleted from the system.
• Otherwise, the task Ti is aborted for later execution.
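The slides do not spell out how the disagreement graph is built, so the following is only one plausible sketch. It assumes that each node of DGi is a maximal set of group processors connected through links across which the compared outputs agree, and that link faults are ignored. All function names are hypothetical.

```python
def hypercube(dim):
    """Adjacency sets of a binary dim-cube (test system for the sketch)."""
    return {p: {p ^ (1 << b) for b in range(dim)} for p in range(2 ** dim)}


def agreement_classes(adj, group, outputs):
    """Partition `group` into sets of processors linked by agreeing comparisons."""
    members, seen, classes = set(group), set(), []
    for start in group:
        if start in seen:
            continue
        cls, stack = set(), [start]
        while stack:
            p = stack.pop()
            if p in cls:
                continue
            cls.add(p)
            # Follow only links inside the group whose endpoints agree.
            stack.extend(q for q in adj[p] & members
                         if q not in cls and outputs[q] == outputs[p])
        seen |= cls
        classes.append(cls)
    return classes


def releasable_output(adj, group, outputs, th):
    """Output of an agreement class with at least th + 1 processors, else None."""
    for cls in agreement_classes(adj, group, outputs):
        if len(cls) >= th + 1:
            return outputs[next(iter(cls))]
    return None


adj = hypercube(3)
# Three processors agree: one class of size 3, enough for th = 2.
print(releasable_output(adj, [0, 1, 2], {0: "ok", 1: "ok", 2: "ok"}, th=2))
# Two processors disagree: two classes of size 1, so the group must grow.
print(releasable_output(adj, [3, 7], {3: "bad", 7: "ok"}, th=1))
```

When no class is large enough, the scheduler would call DGMM to add one processor to the group and run the first version again, as described in the bullets above.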

The Integrated Fault-Tolerant First-Come, First-Served (FCFS) scheduling algorithm: Example
Consider a binary 3-cube system where processors P4 and P3 are faulty and the link between processors P6 and P7 is faulty. Assume tasks arrive and are queued in the task queue Q in the following order, and that every task has one version. First, a task T1 with group size g1 = 3 (th1 = 2). Second, a task T2 with group size g2 = 2 (th2 = 1). Finally, a task T3 with group size g3 = 3 (th3 = 2). Show how the tasks are executed by the system.

InFCFS Example
DGMM allocates G1 = {P0, P1, P2} for the task T1. DGMM allocates G2 = {P3, P7} for the task T2. DGMM allocates G3 = {P4, P5, P6} for the task T3.

InFCFS Example
The system obtains DG1 for the task T1. DG1 has a node with three processors, and 3 > th1 = 2, so the output of the processors in that node is released.

InFCFS Example
The system obtains DG2 for the task T2. DG2 has two nodes with different outputs. DGMM increases G2 by 1 (adds processor P1 to the group G2).

InFCFS Example
The system obtains DG2 for the task T2. P3 disagrees with more than th2 = 1 neighboring processors, so P3 is concluded to be faulty.
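The diagnosis step in this example can be sketched as a count of disagreeing neighbors inside the group. This is a simplified illustration that treats every disagreement as a processor fault and ignores faulty links; the function names and the output values "ok" and "bad" are made up for the example.

```python
def hypercube(dim):
    """Adjacency sets of a binary dim-cube (the 3-cube of the example)."""
    return {p: {p ^ (1 << b) for b in range(dim)} for p in range(2 ** dim)}


def diagnose(adj, group, outputs, th):
    """Processors that disagree with more than th neighbors inside the group."""
    members = set(group)
    faulty = set()
    for p in group:
        disagreements = sum(1 for q in adj[p] & members
                            if outputs[q] != outputs[p])
        if disagreements > th:
            faulty.add(p)
    return faulty


adj = hypercube(3)
# G2 after growing by one processor: P3 (faulty), P7 and P1.
print(diagnose(adj, [3, 7, 1], {3: "bad", 7: "ok", 1: "ok"}, th=1))
```

P3 disagrees with both P7 and P1, which is more than th2 = 1, while P7 and P1 each disagree with only one neighbor, so only P3 is reported.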

InFCFS Example
The system obtains DG3 for the task T3. DGMM increases G3 by 1 (adds processor P7 to the group G3).

InFCFS Example
The system obtains DG3 for the task T3. DGMM increases G3 by 1 (adds processor P1 to the group G3).

InFCFS Example
The system obtains DG3 for the task T3. DG3 has a node Z with three processors, and 3 > th3 = 2, so the output of the processors in that node is released.

The Integrated Fault-Tolerant FCFS + Smallest Fit First scheduling algorithm
• When a task Ti, which may have more than one program version, arrives at the system, it is inserted along with its group size in the queue Q. When the task Ti is scheduled for execution, the DGMM algorithm is called to find a group of the required size for the task Ti.
• If the returned group size is equal to the required group size, the first program version V1i of the task Ti is assigned to the group Gi for execution.

The Integrated Fault-Tolerant FCFS + Smallest Fit First scheduling algorithm
• If the group size returned by the DGMM algorithm is smaller than the required group size, the returned group is allocated to the first program version V1j of the first task Tj in the task queue that fits the returned group. Next, the DGMM algorithm is called to find another subgraph of size gi in a different part of the system graph to allocate to the task Ti.
• If the DGMM algorithm returns the required group size, the first program version V1i of the task Ti is assigned to the group Gi for execution.

The Integrated Fault-Tolerant FCFS + Smallest Fit First scheduling algorithm
• When a version Vji of a task Ti completes its execution on all the processors in the group Gi, neighboring processors exchange and compare their outputs. Then the disagreement graph DGi is obtained.
• If there is a disagreement between (thi + 1) processors on the outputs of the first version of Ti, the DGMM algorithm is called to increase the group size of the task Ti by one (gi = gi + 1), and the system executes the first version of Ti again.
• Otherwise, the next version of Ti is executed.

The Integrated Fault-Tolerant FCFS + Smallest Fit First scheduling algorithm
• A task Ti is released if at least (thi + 1) different processors agree with each other on the output for at least (tsi + 1) different program versions and the outputs of all the program versions are the same.
• When the task Ti finishes its execution, the detected faulty components are deleted from the system.
• Otherwise, the task Ti is aborted for later execution.
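The step that distinguishes this algorithm from plain FCFS is the queue lookup performed when DGMM returns a group that is too small. A minimal sketch of that lookup is shown below. It assumes that a task "fits" when its required group size is no larger than the returned group, and it models the queue as (task name, required group size) pairs; both are illustrative choices.

```python
from collections import deque


def pick_task(queue, available):
    """Remove and return the first queued task whose group size fits `available`.

    queue holds (task_name, required_group_size) pairs in arrival order.
    Returns None, leaving the queue unchanged, when nothing fits.
    """
    for i, (name, size) in enumerate(queue):
        if size <= available:
            del queue[i]
            return name, size
    return None


queue = deque([("T1", 3), ("T2", 2), ("T3", 1)])
# DGMM found only 2 connected free processors for T1 (which needs 3).
print(pick_task(queue, 2))
print(list(queue))
```

Here T2 is the first task that fits a group of two processors, so it runs ahead of T1, and T1 stays at the head of the queue for the next DGMM call.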

The features of the simulator
The computing environment is an M × M torus system (M 1) connected to a host machine, where scheduling and construction of the task disagreement graphs take place. Each task (program) Ti arrives at the system along with its reliability degree ti and is assigned to a group Gi of size gi (initially gi = ti + 1). Task interarrival times are exponentially distributed with average arrival rate λ. Task execution times are exponentially distributed, and tasks arriving at the system may have different mean execution times.

Simulation Model
In our simulation we consider a 6 x 6 torus system (M = 6). We assume that there are long tasks and short tasks. The mean execution time of a long task is 10 time units and the mean execution time of a short task is 1 time unit. We assume that there are three types of task hardware reliability degrees: thi = 0 (type 0), thi = 1 (type 1) and thi = 2 (type 2). We assume that the task software reliability degree is tsi = 1.
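A workload with these parameters could be generated along the following lines. The slides do not give the mix of long and short tasks or of reliability types, so the sketch assumes both are chosen uniformly at random; the function name, the dictionary fields and the fixed seed are also assumptions.

```python
import random


def generate_tasks(n, arrival_rate, seed=0):
    """Generate n tasks with exponential interarrival and execution times."""
    rng = random.Random(seed)
    tasks, now = [], 0.0
    for i in range(n):
        now += rng.expovariate(arrival_rate)   # interarrival time, mean 1/rate
        mean_exec = rng.choice((10.0, 1.0))    # long or short task (assumed 50/50)
        th = rng.choice((0, 1, 2))             # task type (assumed uniform)
        tasks.append({
            "id": i,
            "arrival": now,
            "exec_time": rng.expovariate(1.0 / mean_exec),
            "th": th,
            "ts": 1,
            "group_size": th + 1,              # initial group size g = th + 1
        })
    return tasks


tasks = generate_tasks(100, arrival_rate=0.5)
print(tasks[0])
```

`random.expovariate` takes the rate (one over the mean), which is why the execution time is drawn with `1.0 / mean_exec`.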

36 processors, each connected to three other processors

Simulation Result
We consider four failure cases for each type of task software reliability.
First case: processors and communication links are fault-free.
Second case: only communication links are subject to failures.
Third case: only processors are subject to failures.
Fourth case: both processors and communication links are subject to failures.
We evaluate two performance metrics:
1- system mean response time.
2- percentage of tasks of type i completed, for i = 0, 1, 2.
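The two metrics could be computed from a simulation trace as sketched below. The record layout is hypothetical, response time is taken as finish time minus arrival time, and the second metric is read here as the share of type-i tasks that completed; the slides do not define either detail.

```python
def summarize(tasks):
    """tasks: dicts with 'th', 'arrival' and 'finish' (None if not completed)."""
    done = [t for t in tasks if t["finish"] is not None]
    mean_response = (sum(t["finish"] - t["arrival"] for t in done) / len(done)
                     if done else None)
    pct_completed = {}
    for th in (0, 1, 2):
        of_type = [t for t in tasks if t["th"] == th]
        if of_type:
            completed = sum(1 for t in of_type if t["finish"] is not None)
            pct_completed[th] = 100.0 * completed / len(of_type)
    return mean_response, pct_completed


trace = [
    {"th": 0, "arrival": 0.0, "finish": 2.0},
    {"th": 0, "arrival": 1.0, "finish": None},   # aborted or still queued
    {"th": 1, "arrival": 1.0, "finish": 5.0},
]
print(summarize(trace))
```

For this three-task trace the mean response time is 3.0 time units, half of the type 0 tasks completed, and all of the type 1 tasks completed.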

FCFS performance

FCFS performance
For FCFS, the plots show that as the task arrival rate λ increases, the average response time also increases, and the percentage of tasks completed decreases for all task types. Furthermore, the percentage of tasks completed is almost the same for all task types under each of the failure cases. In other words, FCFS does not favor one type of task over another for execution.

FCFSSFF performance

FCFSSFF performance

Simulation Result: FCFS + Smallest Fit First Performance
Under the Integrated Fault-Tolerant First-Come, First-Served + Smallest Fit First (FCFSSFF) scheduling algorithm, our simulation study showed that, under the conditions tested here, beyond a certain point the system average response time decreases as the arrival rate λ increases; at still higher arrival rates, the system average response time increases. The FCFSSFF scheduling algorithm also favors tasks with small groups over tasks with large groups for execution.

Presentation Question What is the goal of Integrated Fault Tolerant Techniques? IFT attempts to maximize the system reliability and the system performance while concurrently diagnosing both hardware and software faults.
