Feedback Based Real-Time Fault Tolerance Issues and Possible Solutions

Xue Liu, Hui Ding, Kihwal Lee, Marco Caccamo, Lui Sha


Presentation Transcript


  1. Feedback Based Real-Time Fault Tolerance Issues and Possible Solutions Xue Liu, Hui Ding, Kihwal Lee, Marco Caccamo, Lui Sha

  2. Major Issues in Software Reliability
     • Software is becoming more and more complex
       • More features → larger code size
       • Rapid evolution → introduction of new code
     • E.g. Apache: 0.8 MLOC (1998), 10 MLOC (2002), 27 MLOC (2004)
     • E.g. Windows XP: 40-50 MLOC
     • Gray’s estimate: 1 bug / KLOC

  3. Growing Software Complexity
     • Consequences: poorly managed or maintained systems; software bugs and errors
     • Managed by human operators
       • Shortage of skilled operators due to the growing complexity
       • Costly
       • To err is human
     • Result: faults
     • Complexity adds difficulty to management and breeds bugs.
     • Control the complexity in computer systems!
     • Build systems that are robust against software bugs

  4. Feedback Control and Computing Systems
     • Feedback control has a successful track record in controlling electro/mechanical systems
     • Observation 1: Computing systems have been crucial in the success of feedback control
       • Digital designs and implementations, etc.
     • Observation 2: Feedback control has appealing properties
       • Tolerance of errors (model, sensing, actuation, etc.) in the physical process
       • Uses runtime feedback for error correction
     • Reflection: Can feedback control help to solve the fault tolerance problem in computing systems?

  5. Targeted applications: Real-time control systems
     • Q: Feedback control helps to tolerate errors in mechanical systems; can it also help to tolerate software errors?
     [Diagram labels: Feedback Control; Tolerant of Errors in Mechanical Systems; Tolerant of Errors in Software Systems]
     • Idea 1: Feedback Control of Software Execution
       • Mechanical systems: Sense (feedback) -> Control (error correction) -> Actuation
       • Software systems: Sense (feedback) -> Control (error correction) -> Execution
     • Idea 2: Using Simplicity to Control Complexity
       • A simple and reliable core that gives acceptable performance
       • The system under complex control software remains in states that are recoverable by the simple core (this achieves fault tolerance)

  6. A Typical Feedback Control Loop for Mechanical Systems
     [Block diagram labels: Reference Input; Controller (Decision); Actuator (Execution); Mechanical System (Plant); Sensor (Sensing/error identification); negative feedback]
     • Sense: Measure the system output and identify whether an error exists
     • Control: Decision
     • Actuation: Execution
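The sense/control/actuate cycle on this slide can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the integrator plant, the proportional gain, and the `actuator_efficiency` parameter (an actuator that delivers only part of each command) are hypothetical and do not come from the slides.

```python
def closed_loop(reference, steps=100, gain=0.5, actuator_efficiency=0.7):
    """Run a sense -> control -> actuate loop on a toy integrator plant."""
    state = 0.0
    for _ in range(steps):
        measured = state                          # Sense: read the plant output
        error = reference - measured              # Identify whether an error exists
        command = gain * error                    # Control: decide on a correction
        state += actuator_efficiency * command    # Actuate: imperfect execution
    return state


def open_loop(reference, actuator_efficiency=0.7):
    """Command the reference once, with no feedback to correct the shortfall."""
    return actuator_efficiency * reference
```

With these hypothetical numbers the closed loop settles at the reference even though the actuator under-delivers, while the open-loop command stays at 70% of it. That is the error-tolerance property the deck proposes to carry over to software execution.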

  7. Related Work – Simplex Architecture
     • A simple reliable core (HAC)
     • Diversity in the form of 2 alternatives (HAC, HPC)
     • Feedback control of the software execution
     [Data flow block diagram labels: Decision; Simple high assurance control subsystem (HAC); Complex high performance control subsystem (HPC); Plant]
     • Sense (feedback) -> Decision (control/error correction) -> Execution (actuation)

  8. Drawbacks of Simplex
     • P1: The analytically redundant high assurance controller (HAC) runs in parallel with the complex controller (HPC)
       • Lowers system performance and increases operating costs
       • Limits the application of Simplex to safety-critical domains
     • P2: HAC and HPC must run at the same period
     • Our new proposal: On-demand Real-Time Guard (ORTGA). The HAC runs only when a fault occurs!
     • Design goals of ORTGA
       1. Similar functionality to Simplex
       2. Much less resource usage
       3. Flexibility

  9. ORTGA Architecture: Key Ideas
     • (1) Reduce the resource usage of Simplex
       • Solution: “on-demand” execution of the HAC
       • Only when the control under the HPC is detected as faulty is the HAC switched in to take over the plant
     • (2) Flexibility
       • Solution: the periods of the HAC and the HPC are multiples of a common subperiod
       • The HAC and the HPC can have different periods
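The two key ideas can be sketched as a job-release timeline measured in subperiod ticks. This is a minimal illustration, not the ORTGA middleware: the function `schedule`, its parameters, and the choice to count HAC releases from the switch-over tick are all assumptions made for the example.

```python
def schedule(ticks, hpc_every, hac_every, fault_tick=None):
    """Return which controller job is released at each subperiod tick.

    hpc_every, hac_every: periods as integer multiples of the subperiod.
    fault_tick: tick at which the HPC is declared faulty (None means never).
    """
    timeline = []
    for t in range(ticks):
        faulty = fault_tick is not None and t >= fault_tick
        if not faulty:
            # Normal operation: only the HPC consumes CPU time.
            timeline.append("HPC" if t % hpc_every == 0 else "-")
        else:
            # On demand: the HAC takes over, released at its own period,
            # counted here from the switch-over tick (an assumption).
            timeline.append("HAC" if (t - fault_tick) % hac_every == 0 else "-")
    return timeline
```

For example, `schedule(8, 2, 4, fault_tick=5)` releases the HPC at ticks 0, 2 and 4 and the HAC at tick 5; with no fault, no HAC job is ever released.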

  10. Background: Maximum Stability Region
      [Figure labels: Maximum Stability Region (Recovery Region); Stability Region; State Constraints; Lyapunov Functions]
      • The largest region of the state space in which the system is still stable under the current controller

  11. How to determine the Maximum Stability Region?
      • In the operation of a plant, there is a set of state constraints representing safety, physical limitations of devices, environmental and other operational requirements.
      • They can be represented as a normalized polytope, C^T X ≤ 1, in the N-dimensional state space.
      • We must be able to take the control away from a faulty [...]
      [Figure: Operation Constraints and Admissible States; labels: State constraints, Admissible States]

  12. Maximum Stability Region
      [Figure: State Constraints and the switching rule (Lyapunov function); labels: State constraints, Recovery Region, Lyapunov function]
      • A stability region is closed with respect to the operation of the simple controller: a state that starts inside it stays inside it.
      • It is described by a Lyapunov function inside the polytope.
      • The maximum recovery region can be found using linear matrix inequalities (LMI).
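The two membership tests implied by slides 11 and 12 can be sketched as follows. The sketch assumes a quadratic Lyapunov function V(x) = x^T P x with the recovery region taken as V(x) ≤ 1; the matrix `P` and the constraint rows in `C` below are hypothetical numbers chosen by hand, not the result of an LMI computation.

```python
def in_polytope(C, x):
    """Admissible if every normalized constraint row c satisfies c . x <= 1."""
    return all(sum(ci * xi for ci, xi in zip(c, x)) <= 1.0 for c in C)


def lyapunov(P, x):
    """Quadratic Lyapunov function V(x) = x^T P x."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))


def in_recovery_region(P, x):
    """Recoverable by the simple controller if V(x) <= 1 (assumed level set)."""
    return lyapunov(P, x) <= 1.0


# Hypothetical example: the box |x1| <= 2, |x2| <= 2 and a circle of radius 2 inside it.
C = [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]]
P = [[0.25, 0.0], [0.0, 0.25]]
```

In this example the state `[1.9, 1.9]` satisfies the state constraints but lies outside the recovery region, which is why the switching rule has to test the Lyapunov function and not only the polytope.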

  13. Research Issues of ORTGA
      • How to detect faults in the HPC?
        • Timing faults:
          • Application-level support: a monitor detects missed heartbeat messages
          • OS support: the scheduler detects task deadline misses
        • Other faults:
          • A wide range of traditional fault detection techniques can be used
      • When to recover if a fault in the HPC is detected?
        • Recover early? Too early: false alarms
        • Recover late? Too late: could not recover in time
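Application-level detection of timing faults could look like the heartbeat monitor below. This is a sketch under stated assumptions: the class name, the `slack` allowance for jitter, and the `tolerated_misses` threshold are hypothetical and are not part of the ORTGA design described in the slides.

```python
class HeartbeatMonitor:
    """Counts heartbeats that the HPC failed to send on time."""

    def __init__(self, period, slack=0.0, tolerated_misses=0, start=0.0):
        self.period = period                      # expected heartbeat interval
        self.slack = slack                        # jitter allowed before a beat counts as missed
        self.tolerated_misses = tolerated_misses  # misses accepted before declaring a fault
        self.last_beat = start

    def beat(self, now):
        """Record a heartbeat received from the HPC at time `now`."""
        self.last_beat = now

    def missed(self, now):
        """Number of heartbeats that were due since the last one but never arrived."""
        overdue = now - self.last_beat - self.period - self.slack
        if overdue <= 0:
            return 0
        return 1 + int(overdue // self.period)

    def is_faulty(self, now):
        """Declare a timing fault once more beats are missed than tolerated."""
        return self.missed(now) > self.tolerated_misses
```

A nonzero `tolerated_misses` is one simple way to trade detection latency against false alarms, which is the early-versus-late question the slide raises.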

  14. When to recover
      • Why not recover too early?
        • Control tasks have been shown to tolerate several deadline misses
        • Sometimes the system just has some delay (overload, communication delay, etc.)
        • These are not “real” faults
        • Try to minimize recoveries caused by false alarms
      • Why not recover too late?
        • If you recover too late, there is no time left to make the system stable!

  15. Right Time To Recover (RTTR)
      • An example of a “desirable” late but timely recovery (under RM)
      • Assumption: The fault is detected at t=2.0, before its task deadline D=8
      • Observation: Sometimes a late but timely recovery makes the system more schedulable
      • Find the RTTR instead of minimizing the MTTR!

  16. When to recover? A possible solution to determine RTTR
      [Figure labels: heartbeats HB1 (t1) and HB2 (t2); Monitor finds HB3 missing; Prediction (t3); time markers t, ts, tr; Recovered Threads; Stability Region S of Controlled Plant]
      • Idea
        • Recover as late as possible,
        • but not too late
        • If the state of the plant under the HPC is about to leave the HAC-established stability region, recover!
        • Otherwise, wait (maybe the HPC is still OK)
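The "as late as possible, but not too late" rule can be sketched as a prediction check. Everything concrete here is an assumption for illustration: a scalar linear plant x[k+1] = a·x[k] + b·u, a stability region of the form p·x² ≤ 1, and a `horizon` standing for the number of steps the HAC would need to take over. The slides do not specify the prediction method.

```python
def should_recover(x, u_held, a, b, p, horizon):
    """Return True if the predicted state leaves the region p * x**2 <= 1.

    Prediction model (an assumption): x[k+1] = a * x[k] + b * u_held, i.e. the
    plant keeps receiving the last command the HPC issued before it went silent.
    """
    for _ in range(horizon):
        x = a * x + b * u_held
        if p * x * x > 1.0:
            return True   # about to leave the stability region: switch to the HAC now
    return False          # still inside for the whole horizon: keep waiting
```

With the hypothetical values a=1.1, b=0.1, p=1.0 and a held command of 0, a state of 0.5 stays inside the region for three steps, so the monitor can keep waiting; a state of 0.8 crosses the boundary on the third step, so recovery should start immediately.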

  17. Reduce Resource Usage: On-demand Execution of HAC
      • Performance gain of ORTGA
      • HPC’s timing parameters: {Cp, Tp}
      • HAC’s timing parameters: {Ca, Ta}
      • A total savings of: [formula not preserved in the transcript]
      • Relative saving: [formula not preserved in the transcript]
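The slide's formulas did not survive in the transcript, so the following is only one plausible reading, not the authors' stated result. It assumes that Cp and Ca are execution times, that Tp and Ta are periods, that Simplex pays the utilization of both controllers, and that ORTGA pays only for the HPC while no fault is present.

```python
def utilization(c, t):
    """Processor utilization of a periodic task with execution time c and period t."""
    return c / t


def ortga_saving(cp, tp, ca, ta):
    """Return (absolute, relative) utilization saved in fault-free operation.

    Assumption: Simplex runs the HPC and the HAC all the time, while ORTGA
    runs only the HPC until a fault is detected.
    """
    simplex = utilization(cp, tp) + utilization(ca, ta)
    ortga = utilization(cp, tp)
    saving = simplex - ortga          # equals ca / ta under these assumptions
    return saving, saving / simplex
```

For the hypothetical parameters Cp=2, Tp=10, Ca=1, Ta=10, this gives an absolute saving of about 0.1 of the processor and a relative saving of about one third.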

  18. Ongoing Work: A Proof-of-Concept System
      • Double inverted pendulum system
        • Double Quanser inverted pendulum with custom-made tracks
        • PC/104-sized, i486-compatible system
        • Customized Linux 2.6 kernel and root image in flash memory
        • ORTGA middleware layer

  19. Conclusions
      • Feedback Based Real-Time Fault Tolerance
        • Leverage feedback control of software execution
      • ORTGA Architecture
        • On-demand execution of the reliable core (HAC) only when a fault occurs
        • Significantly reduces resource usage
      • Issues and possible solutions
        • How to detect faults
        • When to recover to maintain system stability
        • How to find the RTTR (instead of minimizing the MTTR)

  20. Backup Slides

  21. Software Fault Model in RT Control Systems
      [Figure labels: Timing fault; GRMS; Capability abuse; Privilege management; Semantic fault; Analytic Redundancy (simple & complex controllers)]
      • Timing fault: a task misses its deadlines
      • Capability abuse:
        • Corrupting others’ code or data
        • Unauthorized acquisition of process/resource management capability
      • Semantic fault: incorrect results that can lead to:
        • Poor control performance
        • Instability in the plant
