
Tolerating Communication and Processor Failures in Distributed Real-Time Systems

POPART, Rhône-Alpes. Tolerating Communication and Processor Failures in Distributed Real-Time Systems. Hamoudi Kalla, Alain Girault and Yves Sorel. Grenoble, November 13, 2003.


Presentation Transcript


  1. POPART, Rhône-Alpes. Tolerating Communication and Processor Failures in Distributed Real-Time Systems. Hamoudi Kalla, Alain Girault and Yves Sorel. Grenoble, November 13, 2003.

  2. Outline • Introduction • Modeling distributed real-time systems • The Fault model • Related work • Processor fault tolerance • Communication fault tolerance • Conclusion and future work

  3. Introduction [Figure: design flow] A compiler turns the high-level program into a model of the algorithm. This model, together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, is the input of the fault-tolerant distribution and scheduling heuristic. The heuristic produces a fault-tolerant distributed static schedule, from which the code generator produces fault-tolerant distributed code.

  4. Modeling distributed real-time systems • Algorithm Model [Figure: data-flow graph over the operations I1, I2, A, B, C and O] « I1 » and « I2 » are input operations (sensors); « O » is the output operation (actuator); « A », « B » and « C » are computation operations.

  5. Modeling distributed real-time systems • Architecture Model [Figure: processors P1, P2 and P3 connected by buses B1 and B2; processor detail labelled computation unit, memory, co-processor …] « P1 », « P2 » and « P3 » are processors; « B1 » and « B2 » are communication buses.

  6. The Fault Model • Tolerating a fixed number of fail-silent processors. • Tolerating a fixed number of fail-silent buses, with both complete and partial faults. [Figures: three copies of the P1, P2, P3 / B1, B2 architecture, captioned "Processor faults", "Partial bus faults" and "Complete bus faults"]

  7. Problem • Find a distributed schedule of the algorithm on the architecture that tolerates both processor and communication failures. [Figure: the algorithm graph (I1, I2, A, B, C, O) scheduled onto the architecture (P1, P2, P3, B1, B2)]

  8. Related Work (1) • Time-Triggered Architecture (TTA): active replication of operations and communications (20 years = 100 master's theses and 25 doctoral theses). • Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.

  9. Related Work (2) Time-Triggered Architecture (TTA): • Processor fault tolerance: k replicas or copies of each operation are actively allocated to separate processors. • Communication fault tolerance: k’ replicas or copies of each communication are actively allocated to separate buses.

  10. Related Work (3) Forward Error Correction (FEC): • Processor fault tolerance: k replicas or copies of each operation are actively or passively allocated to separate processors. • Communication fault tolerance: First, each communication is encoded by the FEC code into k’ messages carrying redundant information. Next, the k’ messages are actively allocated to separate buses.
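As a minimal illustration of the coding step described for FEC, the sketch below uses a single XOR parity message, one of the simplest forward-error-correcting schemes. The slides do not say which FEC code is meant, so the scheme itself, the function names `fec_encode` and `fec_decode`, and the zero-padding are all assumptions made for illustration. The payload becomes k' equal-size messages (k'-1 data chunks plus one parity chunk), so the loss of any single message, for example the one carried by a failed bus, can be repaired.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def fec_encode(data: bytes, k_prime: int) -> List[bytes]:
    """Return k_prime messages: k_prime - 1 data chunks followed by one XOR parity chunk."""
    if k_prime < 2:
        raise ValueError("need at least one data message and one parity message")
    n_data = k_prime - 1
    size = max(1, -(-len(data) // n_data))        # ceiling division, at least 1 byte
    padded = data.ljust(size * n_data, b"\0")     # pad so all chunks have equal size
    chunks = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def fec_decode(messages: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the payload when at most one message is missing (given as None)."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("single-parity code can repair only one lost message")
    repaired = list(messages)
    if missing:
        present = [m for m in repaired if m is not None]
        # XOR of all surviving messages equals the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return b"".join(repaired[:-1])[:length]       # drop parity and padding
```

With k' = 3, for instance, each message would travel on its own bus and the receiver could rebuild the payload from any two of the three.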

  11. Outline • Introduction • Modeling distributed real-time systems • The Fault model • Related work • Processor fault tolerance • Communication fault tolerance • Conclusion and future work

  12. Processor fault tolerance • Use active software replication of operations: each operation is replicated on k+1 different processors in order to tolerate k processor failures.
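The idea of placing replicas on distinct processors can be sketched as follows. This is only an illustration: the round-robin placement is an assumption and is not the authors' heuristic, which also has to account for execution times and distribution constraints. The names `replicate` and `surviving` are hypothetical.

```python
def replicate(operations, processors, replicas):
    """Place `replicas` copies of each operation on distinct processors (round-robin)."""
    if replicas > len(processors):
        raise ValueError("not enough processors to host distinct replicas")
    placement = {}
    for i, op in enumerate(operations):
        # Consecutive indices modulo len(processors) are distinct while replicas <= len(processors).
        placement[op] = [processors[(i + r) % len(processors)] for r in range(replicas)]
    return placement


def surviving(placement, failed):
    """Operations that still have at least one replica on a processor that has not failed."""
    failed = set(failed)
    return {op for op, hosts in placement.items() if set(hosts) - failed}


operations = ["I1", "I2", "A", "B", "C", "O"]
processors = ["P1", "P2", "P3"]
placement = replicate(operations, processors, replicas=2)
```

With two replicas per operation, `surviving(placement, ["P2"])` still contains every operation: an operation stays available as long as at least one of its hosting processors is alive.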

  13. Communication fault tolerance (1) • Use passive software replication of communications, which needs a « watchdog timer ». • Split each data communication into k messages (data fragmentation).

  14. Communication fault tolerance (2) Use passive software replication of communications, which needs a « watchdog timer ».
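A watchdog-driven failover might look like the sketch below. It is a simulation in logical time, not the authors' implementation: a bus is modelled as a function that returns the delivery delay in milliseconds, or `None` when nothing arrives (fail-silent behaviour). The name `send_with_watchdog` and the bus model are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical bus model: returns the delivery delay in ms, or None if the bus is silent.
Bus = Callable[[bytes], Optional[float]]


def send_with_watchdog(message: bytes, buses: Sequence[Bus],
                       timeout_ms: float) -> Tuple[int, float]:
    """Try the buses in order; move to the next one only when the watchdog expires.

    Returns (index of the bus that delivered, total elapsed time in ms).
    """
    elapsed = 0.0
    for index, bus in enumerate(buses):
        delay = bus(message)
        if delay is not None and delay <= timeout_ms:
            return index, elapsed + delay
        # Watchdog expired: the full timeout was spent waiting before the backup is used.
        elapsed += timeout_ms
    raise RuntimeError("all buses failed")


def healthy_bus(message: bytes) -> Optional[float]:
    return 2.0      # delivers after 2 ms


def silent_bus(message: bytes) -> Optional[float]:
    return None     # fail-silent: nothing is ever delivered
```

The sketch makes the cost of passive replication visible: the backup transmission starts only after a full timeout, so each failed bus adds `timeout_ms` to the delivery time.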

  15. Communication fault tolerance (3) Split each data communication into k messages (data fragmentation).
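Fragmentation itself is simple to sketch. The version below cuts a payload into k contiguous messages whose sizes differ by at most one byte. The function names `fragment` and `reassemble` are hypothetical, and the slides do not say how the fragments are sized or ordered, so the even split is an assumption.

```python
from typing import List


def fragment(data: bytes, k: int) -> List[bytes]:
    """Split data into k contiguous fragments whose sizes differ by at most one byte."""
    if k < 1:
        raise ValueError("k must be at least 1")
    base, extra = divmod(len(data), k)
    fragments, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # the first `extra` fragments get one more byte
        fragments.append(data[start:start + size])
        start += size
    return fragments


def reassemble(fragments: List[bytes]) -> bytes:
    """Concatenate fragments, in order, back into the original payload."""
    return b"".join(fragments)
```

For a 10-byte payload and k = 3, this yields fragments of 4, 3 and 3 bytes, and `reassemble` restores the original data.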

  16. Communication fault tolerance (3) Why fragment communication data? To distinguish between complete and partial communication faults.

  17. Communication fault tolerance (4) Why fragment communication data? To enable rapid recovery from processor and bus failures.

  18. Recovery from failures (1) • Processor fault

  19. Recovery from failures (2) • Partial bus fault

  20. Recovery from failures (3) • Complete bus fault

  21. Example (1)

  22. Example (2)

  23. Conclusion and future work Result • A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures. Future work • Implementation of the proposed method in the SynDEx tool. • Simulations.

  24. Questions ?
