Tolerating Communication and Processor Failures in Distributed Real-Time Systems

POPART, Rhône-Alpes
Hamoudi Kalla, Alain Girault and Yves Sorel
Grenoble, November 13, 2003

Outline
• Introduction
• Modeling distributed real-time systems
• The fault model
• Related work
• Processor fault tolerance
• Communication fault tolerance
• Conclusion and future work

Introduction
[Figure: design flow] A compiler turns the high-level program into a model of the algorithm. This model, together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, is the input of the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule. A code generator then produces the fault-tolerant distributed code.

Modeling distributed real-time systems
• Algorithm model
[Figure: algorithm graph with operations I1, I2, A, B, C and O]
  I1 and I2 are input operations (sensors)
  O is the output operation (actuator)
  A, B and C are computation operations
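The algorithm model can be sketched as a small dependency graph. The slide names the operations, but the edges of the graph did not survive in this text, so the dependencies below are purely hypothetical and serve only to show how such a model might be represented and ordered.

```python
# Hypothetical dependencies (the real edges are not recoverable from the slide).
# Mapping: operation -> set of operations whose results it needs.
ALGORITHM = {
    "I1": set(),          # input operation (sensor)
    "I2": set(),          # input operation (sensor)
    "A": {"I1", "I2"},    # computation operation
    "B": {"A"},           # computation operation
    "C": {"A"},           # computation operation
    "O": {"B", "C"},      # output operation (actuator)
}


def execution_order(graph):
    """Return one order of operations that respects every dependency."""
    remaining = {op: set(deps) for op, deps in graph.items()}
    order = []
    while remaining:
        # Operations whose dependencies have all been scheduled already.
        ready = sorted(op for op, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle in the algorithm graph")
        for op in ready:
            order.append(op)
            del remaining[op]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


order = execution_order(ALGORITHM)
```

Under the assumed edges, the sensors come first and the actuator last; a real scheduler would also need the execution times and distribution constraints listed in the introduction.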

Modeling distributed real-time systems
• Architecture model
[Figure: processors P1, P2 and P3 connected by buses B1 and B2; detail of one processor showing its computation unit, memory and co-processors]
  P1, P2 and P3 are processors
  B1 and B2 are communication buses

The Fault Model
• Tolerating a fixed number of fail-silent processor faults.
• Tolerating a fixed number of fail-silent bus faults, both complete and partial.
[Figure: three copies of the architecture (P1, P2, P3, B1, B2) illustrating processor faults, partial bus faults and complete bus faults]
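The three fault classes above can be illustrated with a small model of the architecture. This is only a sketch under assumptions that the slides do not state: every processor is taken to be attached to both buses, and a partial bus fault is modeled as a set of processors cut off from an otherwise working bus. The names `FaultState` and `usable_buses` are hypothetical.

```python
from dataclasses import dataclass, field

PROCESSORS = {"P1", "P2", "P3"}
BUSES = {"B1", "B2"}  # assumption: each processor is attached to both buses


@dataclass
class FaultState:
    failed_processors: set = field(default_factory=set)  # fail-silent processors
    failed_buses: set = field(default_factory=set)       # complete bus faults
    # Partial bus faults (assumed model): bus -> processors cut off from it.
    cut_off: dict = field(default_factory=dict)


def usable_buses(src, dst, state):
    """Return the buses over which src can still send data to dst."""
    if src in state.failed_processors or dst in state.failed_processors:
        return set()
    usable = set()
    for bus in BUSES:
        if bus in state.failed_buses:
            continue  # complete fault: nobody can use this bus
        isolated = state.cut_off.get(bus, set())
        if src in isolated or dst in isolated:
            continue  # partial fault: this pair cannot use the bus
        usable.add(bus)
    return usable
```

With a partial fault on B1 that isolates P3, P1 and P2 can still use both buses, while any communication with P3 must go through B2; a complete fault on B1 removes it for every pair.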

Problem
• Find a distributed schedule of the algorithm on the architecture that tolerates both processor and communication failures.
[Figure: the algorithm graph (I1, I2, A, B, C, O) scheduled onto the architecture (P1, P2, P3, B1, B2)]

Related Work (1)
• Time-Triggered Architecture (TTA): active replication of operations and communications (20 years of work: 100 master's theses and 25 doctoral theses).
• Forward Error Correction (FEC): passive or active replication of operations, and active replication of communications.

Related Work (2)
Time-Triggered Architecture (TTA):
• Processor fault tolerance: k replicas (copies) of each operation are allocated to separate processors, using active replication.
• Communication fault tolerance: k' replicas (copies) of each communication are allocated to separate buses, using active replication.

Related Work (3)
Forward Error Correction (FEC):
• Processor fault tolerance: k replicas (copies) of each operation are allocated to separate processors, using active or passive replication.
• Communication fault tolerance: first, each communication is encoded with an FEC code into k' messages carrying redundant information. Next, the k' messages are allocated to separate buses, using active replication.
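To make the FEC idea concrete, here is a minimal sketch using a single XOR parity fragment, the simplest erasure code. It is not the code used by the related work, which the slides do not specify. With `n_data` data fragments plus one parity fragment, k' = n_data + 1 messages are produced, one per bus, and the data survives the loss of any one of them. The function names are hypothetical.

```python
def fec_encode(data: bytes, n_data: int) -> list:
    """Split data into n_data equal fragments and append one XOR parity fragment."""
    size = -(-len(data) // n_data)                 # ceiling division
    padded = data.ljust(size * n_data, b"\0")      # zero padding to a full block
    frags = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    parity = bytes(size)
    for frag in frags:
        parity = bytes(a ^ b for a, b in zip(parity, frag))
    return frags + [parity]


def fec_decode(messages: list, length: int) -> bytes:
    """Rebuild the data when at most one message is missing (given as None)."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can repair only one loss")
    msgs = list(messages)
    if missing:
        size = len(next(m for m in msgs if m is not None))
        rebuilt = bytes(size)
        for m in msgs:
            if m is not None:
                # XOR of all surviving messages equals the lost one.
                rebuilt = bytes(a ^ b for a, b in zip(rebuilt, m))
        msgs[missing[0]] = rebuilt
    # Drop the parity fragment and the padding.
    return b"".join(msgs[:-1])[:length]
```

Because all k' messages are sent actively, the receiver needs no timeout to recover: it decodes as soon as enough messages have arrived, at the price of always using the extra bus bandwidth.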


Processor fault tolerance
• Use active software replication of operations: each operation is replicated on k+1 different processors to tolerate k processor failures, so that at least one replica always survives.
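Active replication of operations can be sketched as follows. The sketch assumes k+1 replicas per operation, since at least one replica must survive k fail-silent processors, and it uses a naive round-robin placement that ignores execution times and real-time constraints. The actual distribution heuristic is not described in these slides, so `replicate` and `tolerates` are hypothetical helpers.

```python
from itertools import combinations


def replicate(operations, processors, k):
    """Place k+1 replicas of each operation on distinct processors (round robin)."""
    procs = sorted(processors)
    if k + 1 > len(procs):
        raise ValueError("need at least k+1 processors")
    placement = {}
    for i, op in enumerate(operations):
        placement[op] = {procs[(i + j) % len(procs)] for j in range(k + 1)}
    return placement


def tolerates(placement, processors, k):
    """True if every operation keeps a live replica under any k processor failures."""
    for failed in combinations(sorted(processors), k):
        failed = set(failed)
        for replicas in placement.values():
            if replicas <= failed:   # all replicas sit on failed processors
                return False
    return True


placement = replicate(["A", "B", "C"], {"P1", "P2", "P3"}, k=1)
```

With three processors and k = 1, each operation gets two replicas, so any single processor failure leaves one replica running; two simultaneous failures are not covered.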

Communication fault tolerance (1)
• Use passive software replication of communications, which requires a watchdog timer.
• Split each data communication into k messages (data fragmentation).

Communication fault tolerance (2)
• Use passive software replication of communications, which requires a watchdog timer.
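The role of the watchdog timer in passive replication can be sketched with simulated time. Only the primary copy of a communication is sent; a backup copy is sent on another bus only if the watchdog expires without the data having arrived. The function name, the `bus_ok` callback and the time units are assumptions for illustration.

```python
def send_with_failover(buses, bus_ok, tx_time, timeout):
    """Send on the first bus; fail over to the next one when the watchdog expires.

    buses   : buses to try, primary first
    bus_ok  : function telling whether a bus currently delivers messages
    tx_time : transmission time of the message on a working bus
    timeout : watchdog delay before the backup copy is sent
    Returns (bus used, time at which the data is received).
    """
    elapsed = 0
    for bus in buses:
        if bus_ok(bus):
            return bus, elapsed + tx_time   # data received before the watchdog expires
        elapsed += timeout                  # watchdog expired: try the backup bus
    raise RuntimeError("all buses failed")
```

In the fault-free case no bandwidth is spent on backups, but each failed bus delays delivery by one watchdog period, which is why the timeout must fit within the real-time constraints.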

Communication fault tolerance (3)
• Split each data communication into k messages (data fragmentation).
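Data fragmentation can be sketched as splitting a payload into k indexed fragments, spreading them over the buses, and reassembling them on the receiver. The round-robin assignment and the idea that only missing fragments need to be re-sent are assumptions made for this illustration; the slides do not give the actual allocation rule.

```python
def fragment(data: bytes, k: int):
    """Split data into k indexed fragments (the last ones may be shorter or empty)."""
    size = -(-len(data) // k)   # ceiling division
    return [(i, data[i * size:(i + 1) * size]) for i in range(k)]


def assign_to_buses(fragments, buses):
    """Assumed policy: spread the fragments over the buses in round-robin order."""
    ordered = sorted(buses)
    return {idx: ordered[idx % len(ordered)] for idx, _ in fragments}


def missing_fragments(received, k):
    """Indices of the fragments that have not arrived yet."""
    got = {idx for idx, _ in received}
    return [i for i in range(k) if i not in got]


def reassemble(fragments):
    """Concatenate the fragments in index order."""
    return b"".join(payload for _, payload in sorted(fragments))


frags = fragment(b"abcdefg", 3)
```

If the fragment carried by a faulty bus is lost, `missing_fragments` identifies it, so a recovery step would only need to re-send that part rather than the whole communication.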

Communication fault tolerance (4)
Why fragment communications?
• To distinguish complete from partial communication faults.
• To enable rapid recovery from processor and bus failures.

Recovery from failures (1) • Processor fault

Recovery from failures (2) • Partial bus fault

Recovery from failures (3) • Complete bus fault

Example (1)

Example (2)

Conclusion and future work
Result
• A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures.
Future work
• Implementation of the proposed method in the SynDEx tool.
• Simulations.

Questions ?
