ORTEGA: An Efficient and Flexible Software Fault Tolerance Architecture for Real-Time Control Systems
Authors: Xue Liu, Hui Ding, Kihwal Lee, Qixin Wang, Lui Sha
Presentation by Evan Frenn

Overview
- Introduction
- Software Faults
- General Assumptions
- Related Work
- ORTEGA
- Limitations
- Design challenges/solutions
- Evaluation
- Conclusion

Introduction
- Problem: how to design fault-tolerant architectures for real-time control systems
- Examples of real-time control systems?
- What are faults?
  - Hardware malfunction
  - Communication medium malfunction
  - Software malfunction

Software Faults
- Resource sharing faults
  - Corruption of memory
  - Handled by address space protection
- Timing faults
  - Failure to meet timing constraints (e.g. an infinite loop)
  - Handled by a real-time scheduling method, e.g. Generalized Rate-Monotonic Scheduling
- Semantic faults
  - Producing the wrong output
  - Handled by using a high assurance controller (HAC)
  - Assumption: the HAC is always correct

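The slide names rate-monotonic scheduling as the guard against timing faults but does not say which schedulability test the authors rely on. One classic check is the Liu and Layland utilization bound: n periodic tasks with deadlines equal to their periods are schedulable under rate-monotonic priorities if their total utilization does not exceed n(2^(1/n) - 1). The sketch below illustrates that bound only; the function names and task values are hypothetical.

```python
def rm_utilization_bound(n):
    """Liu and Layland bound for n periodic tasks under rate-monotonic priorities."""
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_bound(tasks):
    """tasks is a list of (worst_case_execution_time, period) pairs.

    The test is sufficient but not necessary: a task set that fails it
    may still be schedulable and needs a more exact analysis.
    """
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


if __name__ == "__main__":
    # Hypothetical task set: (execution time, period) in milliseconds.
    controller_tasks = [(1.0, 4.0), (1.0, 5.0)]
    print(passes_rm_bound(controller_tasks))
```

A scheduler that enforces the budgets used in this test can preempt a task stuck in an infinite loop, which is how a scheduling method contains a timing fault.
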
General Assumptions
- The authors assume the existence of two distinct controllers:
  - High Assurance Controller (HAC): proven to be reliable, based on its simple construction, which allows formal methods for verification and validation
  - High Performance Controller (HPC): uses advanced control techniques for higher performance
    - Additional features or a more complex control structure (e.g. neural networks)
- Is it common to have a choice of controllers?

Related Work
- Simplex
  - Runs the HAC and HPC in parallel to allow rapid response to an HPC fault
- Limitations:
  - Inefficient: the HAC is always running, even when no faults are present
  - Inflexible: the HAC and HPC are required to have the same sampling/control periods
  - Does not allow the HAC to make up the time lost to the fault

ORTEGA
- On-demand Real-TimE GuArd (ORTEGA): 3 major components
  - Decision module: determines which control command to use for each period
    - Simply uses a semaphore to lock the suspended control module
    - Allows the HPC to run during normal operation
  - HPC module
  - HAC module

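The selection logic of the decision module can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the `is_safe` predicate, and the choice to keep the HAC in control once it has been activated are all assumptions, and the sketch leaves out the real-time scheduling and the semaphore that suspends the inactive controller.

```python
class DecisionModule:
    """Hypothetical per-period selector between an HPC and an HAC."""

    def __init__(self, hpc, hac, is_safe):
        self.hpc = hpc            # callable: state -> control command
        self.hac = hac            # callable: state -> control command
        self.is_safe = is_safe    # callable: (state, command) -> bool (assumed check)
        self.hac_active = False   # the HAC stays idle until it is needed

    def step(self, state):
        """Return the control command to apply for this period."""
        if not self.hac_active:
            try:
                command = self.hpc(state)
                if self.is_safe(state, command):
                    return command
            except Exception:
                # A crash in the HPC is treated like an unsafe command.
                pass
            # Assumption: once a fault is seen, the HAC keeps control.
            self.hac_active = True
        return self.hac(state)
```

During normal operation only the HPC runs, which is where the CPU saving over running both controllers every period comes from.
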
ORTEGA ctd.
- Comparison to Simplex:
  - Decision module allows efficient CPU utilization
  - Decision structure removes the requirement for the HAC and HPC to run in lockstep
- CPU usage savings

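A back-of-the-envelope way to see where the saving comes from, assuming both controllers share one period and ignoring the decision module's own overhead: a scheme that runs both controllers every period pays for both execution times, while an on-demand scheme pays only for the HPC in fault-free periods. The function names and the execution times below are illustrative and are not taken from the paper.

```python
def always_on_utilization(c_hpc, c_hac, period):
    """CPU utilization when the HPC and HAC both run every period."""
    return (c_hpc + c_hac) / period


def on_demand_utilization(c_hpc, period):
    """CPU utilization in fault-free periods when only the HPC runs."""
    return c_hpc / period


def relative_saving(c_hpc, c_hac, period):
    """Fraction of the always-on utilization that the on-demand scheme avoids."""
    baseline = always_on_utilization(c_hpc, c_hac, period)
    return (baseline - on_demand_utilization(c_hpc, period)) / baseline


if __name__ == "__main__":
    # Hypothetical execution times and period, in milliseconds.
    print(relative_saving(c_hpc=6.0, c_hac=2.0, period=20.0))
```

Under these assumptions the saving depends only on the ratio of the HAC's execution time to the combined execution time, not on the period.
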
Limitations
- The on-demand functionality of ORTEGA leads to a single-period delay in the recovery procedure
  - Can be overcome using a state projection technique, which requires projecting the next state of the plant… How?
  - ORTEGA's ability to dynamically change the period of the HAC minimizes the delay

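The slide leaves open how the next plant state is projected. One common option, assumed here rather than taken from the paper, is a discrete-time linear model x[k+1] = A x[k] + B u[k] with a single control input. The matrices and numbers below are hypothetical.

```python
def project_next_state(A, B, x, u):
    """One-step prediction x_next = A x + B u for a single-input linear model.

    A is a list of rows, B and x are lists of the same length, u is a scalar.
    """
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i * u
        for row, b_i in zip(A, B)
    ]


if __name__ == "__main__":
    # Hypothetical double-integrator-like model with a 0.1 s step.
    A = [[1.0, 0.1],
         [0.0, 1.0]]
    B = [0.0, 0.1]
    print(project_next_state(A, B, x=[1.0, 2.0], u=3.0))
```

With such a prediction in hand, a recovering controller can compute its command for the state the plant will be in one period later rather than for the state that was last measured.
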
Design Challenges
- Maximum stability region
  - ORTEGA requires that the HPC always stay within the stable region that can be handled by the HAC
  - If a fault occurs in the HPC outside the HAC's stability region, the HAC will be unable to recover the system
  - To reduce restrictions on the HPC, the stability region of the HAC must be maximized

Design Solution
- Looked at constraining the next plant state based on its current state and the current control output
- Use Lyapunov stability criteria to calculate the stability region given the state constraints; the output is an ellipsoid of stable states
- The stable-state ellipsoid is then converted using a Linear Matrix Inequality (LMI)
- Prove that the state of the controller can never leave the stability region, provided it starts in the region
- Explained using the stable states of an inverted pendulum

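An ellipsoidal stability region of this kind is usually written as the sublevel set of a quadratic Lyapunov function, {x : x^T P x <= c} with P symmetric positive definite. The sketch below only tests membership in such a set; the matrix P, the level c, and the function names are hypothetical, and computing P itself (for example by solving an LMI) is outside the sketch.

```python
def quadratic_form(P, x):
    """Evaluate V(x) = x^T P x for P given as a list of rows."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))


def in_stability_region(P, x, level=1.0):
    """True if x lies inside the ellipsoid {x : x^T P x <= level}."""
    return quadratic_form(P, x) <= level


if __name__ == "__main__":
    # Hypothetical P for a two-state plant, e.g. (angle, angular velocity).
    P = [[4.0, 0.0],
         [0.0, 1.0]]
    print(in_stability_region(P, [0.4, 0.5]))
    print(in_stability_region(P, [0.6, 0.0]))
```

A larger ellipsoid accepts more plant states, which is why maximizing the HAC's stability region loosens the restrictions placed on the HPC.
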
Evaluation
- Evaluated ORTEGA controlling an inverted pendulum
- Stability region of the device measured by the angle of the pendulum
- Tested 2 configurations:
  - Non-faulty HPC and HAC: used as a baseline against Simplex for CPU savings
  - Non-faulty HAC and faulty HPC: tested ORTEGA's ability to control the system

Evaluated Bugs
- Infinite loop
- Non-performing bug: HPC crashes and outputs zero
- Maximum control output: HPC faults to outputting the maximum value
- Bang-bang: HPC faults to outputting the maximum value, then the minimum value
- Positive feedback control: HPC outputs the opposite of the correct values
- Divide by zero

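Most of the bugs listed above can be mimicked by wrapping a working controller in a faulty one. The sketch below is an illustration of that idea, not the authors' test harness: the wrapper names, the actuator limit `U_MAX`, and the use of `-U_MAX` as the minimum output are assumptions. The infinite-loop bug is left out because it cannot be demonstrated in a snippet that terminates.

```python
import itertools

U_MAX = 10.0  # hypothetical actuator limit; the minimum is assumed to be -U_MAX


def non_performing(hpc):
    """HPC has crashed and outputs zero regardless of the state."""
    return lambda state: 0.0


def max_output(hpc):
    """HPC is stuck at the maximum control output."""
    return lambda state: U_MAX


def bang_bang(hpc):
    """HPC alternates between the maximum and minimum outputs."""
    toggle = itertools.cycle([U_MAX, -U_MAX])
    return lambda state: next(toggle)


def positive_feedback(hpc):
    """HPC outputs the opposite of the correct command."""
    return lambda state: -hpc(state)


def divide_by_zero(hpc):
    """HPC raises ZeroDivisionError while computing its command."""
    def faulty(state):
        return hpc(state) / 0
    return faulty
```

Each wrapper takes the healthy controller as its argument so that all faults share one signature, even when the fault ignores the original controller.
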
Conclusion
- Evaluation results:
  - ORTEGA saves 30% of CPU resources over Simplex when the HPC and HAC have the same period
  - ORTEGA saves up to 50% when the sampling rate is dynamic
  - ORTEGA tolerates all faults. True?
- How does this apply to plant control systems?
- Faults that are not tested?
- Instances where delay matters?