POPART. Rhône-Alpes. Tolerating Communication and Processor Failures in Distributed Real-Time Systems. Hamoudi Kalla, Alain Girault and Yves Sorel. Grenoble, November 13, 2003. Outline. Introduction Modeling distributed real-time systems The Fault model Related work
POPART Rhône-Alpes Tolerating Communication and Processor Failures in Distributed Real-Time Systems Hamoudi Kalla, Alain Girault and Yves Sorel Grenoble, November 13, 2003
Outline • Introduction • Modeling distributed real-time systems • The Fault model • Related work • Processor fault tolerance • Communication fault tolerance • Conclusion and future work
Introduction • A compiler turns the high-level program into a model of the algorithm. • This model, together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, feeds the fault-tolerant distribution and scheduling heuristic. • The heuristic produces a fault-tolerant distributed static schedule. • A code generator turns this schedule into fault-tolerant distributed code.
Modeling distributed real-time systems • Algorithm model (figure: data-flow graph over the operations I1, I2, A, B, C and O) • I1 and I2 are input operations (sensors) • O is the output operation (actuator) • A, B and C are computation operations
Modeling distributed real-time systems • Architecture model (figure: processors P1, P2 and P3 connected by buses B1 and B2; each processor comprises a computation unit, memory and co-processors) • P1, P2 and P3 are processors • B1 and B2 are communication buses
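The two models above can be written down as plain data structures. The following is a minimal sketch only: the class names and, in particular, the data-dependency edges are assumptions made for illustration, since the slide figure showing the actual edges is not available here. Every bus is also assumed to connect all three processors.

```python
from dataclasses import dataclass


@dataclass
class Algorithm:
    """Data-flow graph: operations plus (producer, consumer) dependencies."""
    operations: set
    edges: set

    def predecessors(self, op):
        # Operations whose results `op` consumes.
        return {src for (src, dst) in self.edges if dst == op}


@dataclass
class Architecture:
    """Processors plus, for each bus, the set of processors attached to it."""
    processors: set
    buses: dict


# Operations come from the slides; the edges are HYPOTHETICAL.
algorithm = Algorithm(
    operations={"I1", "I2", "A", "B", "C", "O"},
    edges={("I1", "A"), ("I2", "A"), ("A", "B"),
           ("A", "C"), ("B", "O"), ("C", "O")},
)

# Processors and buses come from the slides; full connectivity is assumed.
architecture = Architecture(
    processors={"P1", "P2", "P3"},
    buses={"B1": {"P1", "P2", "P3"}, "B2": {"P1", "P2", "P3"}},
)
```

With this representation, `algorithm.predecessors("O")` lists the operations that must complete before the actuator can run, which is the information a scheduling heuristic needs.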
The Fault Model • Tolerating a fixed number of fail-silent processor faults. • Tolerating a fixed number of fail-silent bus faults, which may be complete or partial. (Figures: processor faults, partial bus faults and complete bus faults, each shown on the architecture with processors P1, P2, P3 and buses B1, B2.)
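As a rough illustration of the difference between the two kinds of bus fault, the sketch below classifies a bus from the set of processors that can still use it. The definitions are assumptions, not taken from the slides: a complete fault is taken to mean that no attached processor can use the bus any more, and a partial fault that only some of them have lost it. The function name is hypothetical.

```python
def classify_bus_fault(attached, still_connected):
    """Classify a bus as "none", "partial" or "complete" (assumed definitions)."""
    attached = set(attached)
    alive = set(still_connected) & attached
    if alive == attached:
        return "none"       # every attached processor can still use the bus
    if not alive:
        return "complete"   # the bus is lost for everyone
    return "partial"        # only some processors have lost the bus


print(classify_bus_fault({"P1", "P2", "P3"}, {"P1", "P2"}))
```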
Problem • Find a distributed schedule of the algorithm on the architecture that tolerates both processor and communication failures. (Figure: the operations I1, I2, A, B, C and O scheduled onto processors P1, P2, P3 and buses B1, B2.)
Related Work (1) • Time-Triggered Architecture (TTA): active replication of operations and communications (20 years = 100 master's theses and 25 doctoral theses). • Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.
Related Work (2) Time-Triggered Architecture (TTA): • Processor fault tolerance: k replicas or copies of each operation are actively allocated to separate processors. • Communication fault tolerance: k’ replicas or copies of each communication are actively allocated to separate buses.
Related Work (3) Forward Error Correction (FEC): • Processor fault tolerance: k replicas or copies of each operation are actively or passively allocated to separate processors. • Communication fault tolerance: First, each communication is encoded by the FEC code into k’ messages carrying redundant information. Next, the k’ messages are actively allocated to separate buses.
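To make the FEC idea concrete, here is a minimal sketch using the simplest possible erasure code: the data is cut into k’ − 1 fragments and one XOR parity fragment is added, so that any single lost message can be rebuilt from the others. This is an illustrative stand-in, not the code used in the cited work, and the function names are hypothetical.

```python
def fec_encode(data: bytes, k_prime: int) -> list:
    """Return k_prime messages: k_prime - 1 data fragments plus XOR parity."""
    n_data = k_prime - 1
    if n_data < 1:
        raise ValueError("k_prime must be at least 2")
    size = -(-len(data) // n_data)              # ceiling division
    padded = data.ljust(size * n_data, b"\0")   # pad to equal-size fragments
    fragments = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    parity = bytes(size)
    for frag in fragments:
        parity = bytes(a ^ b for a, b in zip(parity, frag))
    return fragments + [parity]


def fec_decode(messages: list, length: int) -> bytes:
    """Rebuild the data; at most one entry of `messages` may be None (lost)."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost message")
    messages = list(messages)
    if missing:
        size = len(next(m for m in messages if m is not None))
        rebuilt = bytes(size)
        for m in messages:
            if m is not None:
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, m))
        messages[missing[0]] = rebuilt          # XOR of the rest restores it
    return b"".join(messages[:-1])[:length]     # drop parity and padding


msgs = fec_encode(b"sensor-value", 3)           # one message per bus
msgs[0] = None                                  # the bus carrying it failed
print(fec_decode(msgs, len(b"sensor-value")))
```

A single parity fragment tolerates the loss of one message only; tolerating more losses would need a stronger code.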
Processor fault tolerance • Use active software replication of operations: each operation is replicated on k + 1 different processors in order to tolerate k processor failures.
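The placement constraint behind active replication can be sketched as follows. This is not the authors' scheduling heuristic: it is a simple round-robin placement that ignores execution times and real-time constraints, and it assumes the usual fail-silent argument that surviving f processor failures requires f + 1 replicas on distinct processors. All names are hypothetical.

```python
def place_replicas(operations, processors, failures):
    """Give each operation failures + 1 replicas on distinct processors."""
    replicas = failures + 1
    procs = sorted(processors)
    if replicas > len(procs):
        raise ValueError("not enough processors for the requested tolerance")
    placement = {}
    for index, op in enumerate(sorted(operations)):
        # Consecutive offsets modulo len(procs) are distinct
        # because replicas <= len(procs).
        placement[op] = [procs[(index + r) % len(procs)]
                         for r in range(replicas)]
    return placement


def survives(placement, failed):
    """True if every operation keeps at least one replica on a live processor."""
    return all(any(p not in failed for p in procs)
               for procs in placement.values())


placement = place_replicas(["A", "B", "C"], ["P1", "P2", "P3"], failures=1)
print(placement)
print(survives(placement, failed={"P2"}))
```

Because all replicas run in every execution, the failure of a processor is masked without any detection step, at the price of the extra computation.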
Communication fault tolerance (1) • Use passive software replication of communications, which requires watchdog timers. • Split each data communication into k messages (data fragmentation).
Communication fault tolerance (2) • Use passive software replication of communications, which requires watchdog timers.
Communication fault tolerance (3) • Split each data communication into k messages (data fragmentation).
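The combination of fragmentation and passive replication can be simulated in a few lines. The sketch below makes several assumptions that are not in the slides: k equals the number of buses, fragment i is first sent on bus i, and an expired watchdog is modelled simply as "the primary bus is in the faulty set", after which the backup copy is sent on a healthy bus. Function names are hypothetical.

```python
def fragment(data: bytes, k: int) -> list:
    """Split data into k consecutive fragments (the last ones may be shorter)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    size = -(-len(data) // k)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(k)]


def send_with_watchdog(data: bytes, buses: list, faulty: set):
    """Send one fragment per bus; resend on a healthy bus after a timeout."""
    healthy = [bus for bus in buses if bus not in faulty]
    if not healthy:
        raise RuntimeError("every bus has failed")
    pieces, timeouts, routes = [], [], []
    for index, piece in enumerate(fragment(data, len(buses))):
        primary = buses[index]
        if primary in faulty:
            # The watchdog expires: only now is the backup copy sent.
            timeouts.append((index, primary))
            used = healthy[index % len(healthy)]
        else:
            used = primary
        pieces.append(piece)
        routes.append(used)
    return b"".join(pieces), timeouts, routes


payload, timeouts, routes = send_with_watchdog(b"abcdefgh", ["B1", "B2"], {"B2"})
print(payload, timeouts, routes)
```

In the fault-free case no backup message is sent at all, which is the main attraction of passive replication over sending every message on every bus.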
Communication fault tolerance (3) Why fragment communication data? • To distinguish between complete and partial communication faults.
Communication fault tolerance (4) Why fragment communication data? • To enable rapid recovery from processor and bus failures.
Recovery from failures (1) • Processor fault
Recovery from failures (2) • Partial bus fault
Recovery from failures (3) • Complete bus fault
Conclusion and future work Result • A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures. Future work • Implementation of the proposed method in the SynDEx tool. • Simulations.