A new transformation scheme based on active replication strategy that tolerates failures
Hamoudi Kalla, Alain Girault and Yves Sorel
Pop Art team and Aoste team
Paris, April 23, 2004
Outline
• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures
  • Both processor and communication media failures
• Example
• Conclusion and future work
Introduction
[Figure: design flow. A compiler turns the high-level program into a model of the algorithm. This model, the architecture specification, the distribution constraints, the execution times, the real-time constraints and the failure specification feed the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule. A code generator then produces the fault-tolerant distributed code.]
Models: Application algorithm
• Algorithm graph
[Figure: algorithm graph with operations I1, I2, A, B, C and O]
• « I1 » and « I2 » are input operations (sensors)
• « O » is the output operation (actuator)
• « A », « B » and « C » are computation operations
• « A → C » is a data-dependence
Models: Hardware architecture
• Architecture graph
[Figure: two architecture graphs over processors P1, P2 and P3, one with the point-to-point links L12, L13 and L23 and one with the multipoint link B1. Inset: a processor, with the labels operator, memory, com1 and com2]
• « P1 », « P2 » and « P3 » are processors
• « L12 », « L13 » and « L23 » are point-to-point communication links
• « B1 » is a multipoint communication link
• « com1 » and « com2 » are communicators
Models: Component failures
• Only processors and communication media (point-to-point and multipoint) can fail.
• Failures can be characterized as transient or permanent.
• At most a fixed number of processors can fail (fail-stop).
• At most a fixed number of communication media can fail (fail-stop), either partially or completely.
[Figure: three panels: processor failures, partial communication media failures, complete communication media failures]
Problem
• Find a distributed schedule of the algorithm on the architecture that tolerates processor and communication media failures.
[Figure: SynDEx* takes the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3 with links L12, L13, L23) and performs the distribution/scheduling]
*SynDEx is a system-level CAD software tool for optimizing the implementation of real-time embedded applications on multicomponent architectures.
State of the art
• “A system is fault tolerant if it can mask the presence of faults in the system by using hardware and/or software redundancy”
(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication media failures
Active software redundancy (Hashimoto et al., 2002 (a); Fragopoulou and Akl, 1995 (b)):
(a) Multiple redundant copies of an operation are scheduled on different processors.
(b) Multiple redundant copies of a message are sent along disjoint paths.
Passive software redundancy (Qin et al., 2002 (a); Sriram et al., 1999 (b)):
(a) Each operation is replicated as a primary copy and backup copies, but only the primary is executed.
(b) One copy of the message is sent and, if it fails, another copy is transmitted.
Outline
• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures (point-to-point links)
  • Both processor and communication media failures
• Example
• Conclusion and future work
The proposed fault-tolerant method
Principle (1): We use active software redundancy for both operations and communications.
Motivations:
• Makes the recovery from failures bounded.
• Makes the system predictable.
• Easier to integrate into SynDEx.
The proposed fault-tolerant method
Principle (2):
[Figure: the algorithm graph (Alg), the number NPF of processor failures and the number NLF of link failures are the inputs of the graph transformation, which outputs a new Alg with redundancy and exclusion relations. This new Alg, the architecture graph (Arc) and the real-time and embedding constraints are the inputs of the fault-tolerant distribution and scheduling heuristic (SynDEx), which outputs the fault-tolerant distributed real-time executive.]
The proposed fault-tolerant method
Algorithm graph transformation (1): Tolerating NPF processor failures
[Figure: (a) initial algorithm sub-graph: A sends data to B; (b1) final algorithm sub-graph: NPF+1 replicas of A and NPF+1 replicas of B]
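The operation replication above can be sketched in a few lines of Python. This is only an illustration, not the SynDEx implementation: the function name `replicate_operations`, the replica naming scheme (`A1`, `A2`, ...) and the choice to connect every replica of the producer to every replica of the consumer are assumptions made for the example.

```python
def replicate_operations(edges, npf):
    """Replicate each operation NPF+1 times (hypothetical helper).

    edges: list of (producer, consumer) data-dependences.
    Assumption: every replica of the producer feeds every replica
    of the consumer.
    """
    def replicas(op):
        # Replica names A1, A2, ... are an illustrative convention.
        return [f"{op}{i}" for i in range(1, npf + 2)]

    new_edges = []
    for src, dst in edges:
        for s in replicas(src):
            for d in replicas(dst):
                new_edges.append((s, d))
    return new_edges


# Initial sub-graph A -> B, tolerating one processor failure (NPF = 1).
print(replicate_operations([("A", "B")], npf=1))
```

With NPF = 1, each of the two operations gets two replicas, so the single dependence A → B becomes four dependences under the full-connection assumption.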
The proposed fault-tolerant method
Algorithm graph transformation (2): Tolerating NLF link failures
[Figure: (a) initial algorithm sub-graph: A sends data to B; (b2) final algorithm sub-graph: one replica of A, one replica of B and NLF+1 replicas of data]
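As a rough sketch of this second transformation, the data-dependence alone is replicated NLF+1 times while A and B stay single. The helper names and the copy labels (`data1`, `data2`, ...) are hypothetical, and the example assumes that each copy travels over a different link, so that one link failure loses at most one copy.

```python
def replicate_dependence(src, dst, nlf):
    """Return NLF+1 copies of the data-dependence src -> dst (hypothetical helper)."""
    return [(src, dst, f"data{k}") for k in range(1, nlf + 2)]


def surviving_copies(copies, lost_labels):
    """Copies still delivered when the copies named in lost_labels are lost."""
    return [c for c in copies if c[2] not in lost_labels]


copies = replicate_dependence("A", "B", nlf=1)
print(copies)
# Assumption: one failed link loses exactly one copy, here data1.
print(surviving_copies(copies, {"data1"}))
```

Under that assumption, NLF link failures can remove at most NLF of the NLF+1 copies, so at least one copy still reaches B.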
The proposed fault-tolerant method
Algorithm graph transformation (3): Tolerating NPF processor and NLF link failures, with NPF=1 and NLF=1
[Figure, five panels over two replicas of A and two replicas of B: (a) initial algorithm sub-graph; (b) operations redundancy; (c) data-dependence redundancy; (d) data-dependence distribution (1); (e) data-dependence distribution (2). The panels also show the data edges and a routing operation R.]
The proposed fault-tolerant method
Algorithm graph transformation (4): Tolerating NPF processor and NLF link failures, with NPF=1 and NLF=1
[Figure: panel (e), data-dependence distribution (2), detailed as Case 1 and Case 2 over the replicas A1, A2, B1 and B2 and the routing operation R]
The proposed fault-tolerant method
Algorithm graph transformation (5): Tolerating NPF processor and NLF link failures, with NPF>=1 and NLF>=1
[Figure: (a) initial algorithm sub-graph: A sends data to B; (b) final algorithm sub-graph: NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R]
The proposed fault-tolerant method
[Figure: the flow of Principle (2), illustrated on the sub-graph where A sends data to B. Given NPF processor failures and NLF link failures, the graph transformation produces a new Alg with redundancy and exclusion relations (NPF+1 replicas of A, NPF+1 replicas of B, NLF routing operations R). With the architecture graph (Arc) and the real-time and embedding constraints, the fault-tolerant distribution and scheduling heuristic (SynDEx) then produces the fault-tolerant distributed real-time executive.]
The proposed fault-tolerant method
Implementation
• B1 receives its input data NPF+NLF+1 times (here NPF=1 and NLF=1, so three times). As soon as it receives the first input, B1 is executed, and it ignores the later inputs.
• start time (B1) = min ( end communication [A1, A2, R] )
[Figure: SynDEx maps a transformed algorithm sub-graph (two replicas of A, two replicas of B, routing operation R) onto an architecture graph with processors P1, P2, P3, P4 and links L12, L14, L23, L24, L34, giving a temporary schedule]
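The start-time rule above can be illustrated with a minimal Python sketch. The function name `start_time`, the use of `None` to mark a copy that never arrives, and the numeric times are all hypothetical; only the rule itself (B1 starts at the earliest end of communication among A1, A2 and R) comes from the slide.

```python
def start_time(end_of_communication):
    """Earliest start of B1: the first copy of the input to arrive wins.

    end_of_communication maps each sender (A1, A2, R) to the time at which
    its copy is fully received, or None if the copy is lost to a failure.
    """
    arrived = [t for t in end_of_communication.values() if t is not None]
    if not arrived:
        raise ValueError("no copy of the input data arrived")
    return min(arrived)


# Hypothetical arrival times for NPF = 1 and NLF = 1 (three copies).
print(start_time({"A1": 4.0, "A2": 6.5, "R": 5.0}))   # 4.0
# If the copy from A1 is lost, B1 starts on the next copy to arrive.
print(start_time({"A1": None, "A2": 6.5, "R": 5.0}))  # 5.0
```

Because every copy is sent regardless of failures, a failure only delays the start of B1 to the arrival of the next copy, which is what keeps the recovery time bounded.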
Outline
• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures (multipoint links)
  • Both processor and communication media failures
• Example
• Conclusion and future work
The proposed fault-tolerant method
We use active software redundancy of operations: each operation is replicated on NPF+1 different processors to tolerate NPF processor failures.
[Figure: algorithm sub-graph, architecture graph (labels P1, P2, P3, P4, B1, B2) and temporary schedule]
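The constraint that the NPF+1 replicas sit on different processors can be made concrete with a naive placement sketch. This is not the SynDEx heuristic, which also accounts for execution times and real-time constraints; `place_replicas` is a hypothetical helper that simply takes the first NPF+1 processors.

```python
def place_replicas(operation, processors, npf):
    """Assign NPF+1 replicas of an operation to distinct processors.

    Naive illustration only: the first NPF+1 processors are used, with no
    attempt to optimize the schedule.
    """
    if npf + 1 > len(processors):
        raise ValueError("need at least NPF+1 processors")
    return {f"{operation}{i + 1}": processors[i] for i in range(npf + 1)}


# One tolerated processor failure on a four-processor architecture.
print(place_replicas("A", ["P1", "P2", "P3", "P4"], npf=1))
```

Since no two replicas share a processor, the failure of any NPF processors leaves at least one replica of the operation running.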
The proposed fault-tolerant method
• Use passive software redundancy of communications.
• Split each data communication into NLF messages (data fragmentation).
The proposed fault-tolerant method
Why data fragmentation?
• Distinguish between complete and partial communication link failures.
• Enable rapid recovery from processor and communication link failures.
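Data fragmentation itself can be sketched as follows. The slides only state that each data communication is split into several messages; the byte-level splitting, the nearly equal fragment sizes and the names `fragment` and `reassemble` are assumptions made for this example, and the recovery protocol is not modeled.

```python
def fragment(data, n):
    """Split a payload into n consecutive fragments of nearly equal size."""
    if n < 1:
        raise ValueError("n must be at least 1")
    size = -(-len(data) // n)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]


def reassemble(fragments):
    """Rebuild the payload from its fragments, in order."""
    return b"".join(fragments)


parts = fragment(b"abcdefg", 3)
print(parts)              # [b'abc', b'def', b'g']
print(reassemble(parts))  # b'abcdefg'
```

Sending several small messages instead of one large one means the receiver can tell which fragments arrived, which is one plausible way to observe whether a link failed partially or completely.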
The proposed fault-tolerant method • Recovery from processor failures
The proposed fault-tolerant method • Recovery from partial communication link failures
The proposed fault-tolerant method • Recovery from complete communication media failures
Conclusion and future work
Result
• A new method to tolerate both communication link and processor failures in distributed real-time systems, which may reduce the overhead of recovery from failures.
Future work
• Benchmarks.
• Using passive redundancy to tolerate communication link failures.
• Taking into account sensor and actuator failures.
References
[Fragopoulou and Akl, 1995] Fragopoulou, P. and Akl, S.G. (1995). Fault tolerant communication algorithms on the star network using disjoint paths. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS’95, Kingston, Canada.
[Sriram et al., 1999] Sriram, R., Manimaran, G., and Murthy, C.S.R. (1999). An integrated scheme for establishing dependable real-time channels in multihop networks. In Proc. ICCCN, pages 528–533.
[Qin et al., 2002] Qin, X., Jiang, H., and Swanson, D.R. (2002). An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 31st International Conference on Parallel Processing, Vancouver, Canada.
[Hashimoto et al., 2002] Hashimoto, K., Tsuchiya, T., and Kikuno, T. (2002). Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems.