Analysis and design of Fault Tolerant Real-time systems

Analysis and design of Fault Tolerant Real-time systems, by Roozbeh Izadi-Zamanabadi, Department of Control Engineering, Aalborg University

Overview • Introduction

Definitions - I Dependability: The trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers. The service delivered by a system is its behaviour as it is perceived by its users (humans or physical systems that interact with the computer system). Dependability is a general concept with several related attributes. The most significant attributes are reliability, availability, safety, and security. Reliability deals with continuity of service. Availability deals with readiness for usage. Safety deals with avoidance of catastrophic consequences on the environment.

Definitions - II Security deals with prevention of unauthorized access and/or handling of information. Fault prevention: the goal is to prevent faults from occurring or being introduced into the system. Fault tolerance: the goal is to provide service despite the presence of faults in the system. Fault tolerance uses protective redundancy to mask failures, i.e. the system contains components that would not be needed if no fault tolerance were to be supported. Fault prevention methods focus on methodologies for design, testing, and validation. Fault tolerance methods focus on how to use components in such a manner that failures can be masked.

Software Architecture for Real-time Systems • Architecture Description Languages (ADLs) are used to: • Communicate design solutions among software engineers • Support analysis of the architecture (verify that the quality requirements are met) • Make maintenance easier

ADL – Desired properties An ADL should provide six classes of properties: • Composition: the architecture is described by components and connections. • Abstraction: used to describe the role of each element clearly and exactly. • Reusability: it should be possible to reuse components, connectors, and architectural patterns. • Configuration: the architectural structure among components should be separated from the structure within the components. • Heterogeneity: the possibility of combining different heterogeneous descriptions. • Analysis: support for different kinds of analysis.

Architectural views • Structural view • Module view • Logical view • Hardware view • Temporal view • Communication view • Synchronization view

Structural view [Figure: Modules A, B, C, and D with their interconnections] Describes the overall architectural design and style. It consists of software modules and their interconnections. • MASCOT design methodology: decomposed component-level view • HRT-HOOD (OO methodology): parent objects

Module view [Figure: Modules A, B, C, and D] Exposes all the functions, methods, or submodules in the components modelled in the structural view. It is desirable to keep the interaction between functions in different components to a minimum.

Logical view Functions from the module view are described in more logical detail. Different types of state machines and process algebras can be used, e.g. timed automata for real-time systems (representing time as well as concurrency). [Figure: automata with edges labelled a! and a?]

Hardware view [Figure: Modules A, B, C, and D together with Processor 1 and Processor 2] Relevant for distributed systems with separate CPUs, or when functionality must be pre-allocated among different nodes in the system.

Temporal view Correctness of a real-time system depends on correct functions as well as correct timing (i.e. not too early and not too late). The temporal view contains data such as: • release time (the earliest start time of the task) • deadline (the latest completion time of a task) • period (the interval between successive releases; its inverse is the task frequency) • ... HRT-HOOD has a temporal view that is divided into two parts: 1. describes the execution strategies (either cyclic or sporadic) 2. provides temporal attributes, e.g. period times, deadlines, ...
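The temporal attributes listed on this slide can be captured in a small record. The following is only a sketch: the class name `TaskTiming`, the helper `meets_deadline`, and the choice to treat the deadline as relative to the release time are assumptions made for illustration and are not taken from HRT-HOOD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskTiming:
    """Temporal attributes of one task (hypothetical record, for illustration)."""
    name: str
    release_time: float             # earliest start time of the task
    deadline: float                 # latest completion time, relative to release (assumption)
    period: Optional[float] = None  # None models a sporadic task

    @property
    def is_cyclic(self) -> bool:
        return self.period is not None


def meets_deadline(task: TaskTiming, start: float, execution_time: float) -> bool:
    """Check one job: not started too early and not finished too late."""
    if start < task.release_time:
        return False  # too early
    return start + execution_time <= task.release_time + task.deadline


control = TaskTiming("control", release_time=0.0, deadline=10.0, period=20.0)
alarm = TaskTiming("alarm", release_time=0.0, deadline=5.0)  # sporadic
```

Keeping these attributes in one place is what later allows a schedulability test to be run on the whole task set.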

Communication view • Models the communication among tasks and processes. • Communication is performed using messages and signals. • A Message Sequence Chart (MSC) can be translated into ordinary finite state automata, and is hence easy to verify formally, for instance using temporal logic. [Figure: Message Sequence Chart (MSC) with processes P1, P2, P3 exchanging msg 1 to msg 4]
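As a rough illustration of translating an MSC into finite state automata, the sketch below projects a chart onto one process and builds a linear automaton over send (`!`) and receive (`?`) events. The slide does not say who sends which message, so the senders and receivers in `example_msc` are hypothetical, as are all function names.

```python
def project(msc, process):
    """Return the ordered send/receive events that one process sees in the MSC."""
    events = []
    for sender, receiver, msg in msc:
        if sender == process:
            events.append("!" + msg)   # send event
        elif receiver == process:
            events.append("?" + msg)   # receive event
    return events


def to_automaton(events):
    """Linear automaton: state i moves to state i + 1 on the i-th event."""
    return {i: (event, i + 1) for i, event in enumerate(events)}


def accepts(automaton, trace):
    """True if the trace drives the automaton from state 0 to its final state."""
    state = 0
    for event in trace:
        if state not in automaton or automaton[state][0] != event:
            return False
        state = automaton[state][1]
    return state == len(automaton)


# Hypothetical message flow between P1, P2, and P3.
example_msc = [
    ("P1", "P2", "msg1"),
    ("P2", "P3", "msg2"),
    ("P3", "P2", "msg3"),
    ("P2", "P1", "msg4"),
]
p2_automaton = to_automaton(project(example_msc, "P2"))
```

A real translation would also handle branching and concurrent regions of the chart; this sketch covers only a single linear scenario.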

Synchronization view In a multi-tasking system with several tasks running concurrently, it is necessary to synchronize access to shared resources in order to avoid inconsistency. Different synchronization techniques are available depending on the real-time operating system, e.g. • Pre-run-time scheduling (a table generated before run time is used) • Event triggered (semaphores are used) [Figure: the synchronization view in relation to the temporal view (separation in time) and the communication view (synchronization signals used)]
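The event-triggered case can be sketched with a binary semaphore guarding a shared resource. This uses Python threads purely for illustration; a real-time operating system would provide its own semaphore primitives, and the names `SharedCounter` and `run_tasks` are hypothetical.

```python
import threading


class SharedCounter:
    """A shared resource whose updates are guarded by a binary semaphore."""

    def __init__(self):
        self.value = 0
        self._sem = threading.Semaphore(1)  # one task at a time

    def increment(self):
        with self._sem:                  # acquire on entry, release on exit
            current = self.value         # read
            self.value = current + 1     # write; unsafe without the semaphore


def run_tasks(n_tasks, increments):
    """Run n_tasks concurrent tasks that each update the shared counter."""
    counter = SharedCounter()

    def task():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With pre-run-time scheduling the same consistency would instead come from the table itself, which never places two tasks that share a resource in overlapping time slots.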

Architecture analysis The main goal of using a software architecture notation when designing is the ability to analyse and verify the design at an early stage of the development process. The quality properties of a software system are generally divided into two classes: • Functional: those concerned with the runtime behaviour of the software, e.g. performance or reliability. • Nonfunctional: those concerned with the quality of the software itself, e.g. maintainability and reusability.

Architecture analysis methods • Questioning techniques: • Scenario based (derived from the system requirements) • Checklist based (derived from the system domain) • Measuring techniques, by property class: • Scenario execution: nonfunctional properties • Simulation/prototyping: functional properties • Mathematical methods: functional properties

Architecture analysis methods - 1 • A scenario is always system specific, i.e. tailor-made for a particular application in a domain. • Checklists contain questions that are valid for all architectures in a particular domain. • Example checklist for safety-critical real-time systems: • Is the system schedulable? • Is there error recovery code in the system to clean up after error detection? • Example scenario: • What happens when division by zero occurs in the control task?
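The example scenario can be made concrete with a small sketch of a control task that contains error recovery code. The control law, the names `control_step` and `SAFE_OUTPUT`, and the choice of 0.0 as the safe actuator value are all assumptions made for illustration.

```python
SAFE_OUTPUT = 0.0  # assumed safe actuator command (hypothetical)


def control_step(setpoint, measurement, gain_denominator):
    """One step of a proportional controller with error recovery.

    Returns (output, error_detected).
    """
    try:
        gain = 1.0 / gain_denominator   # may divide by zero
        return gain * (setpoint - measurement), False
    except ZeroDivisionError:
        # Error recovery: fall back to the safe output and report the error.
        return SAFE_OUTPUT, True
```

Walking the scenario through the architecture then amounts to asking which component detects the error, what it outputs, and who is notified.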

Architecture analysis methods - 2 • Measuring techniques: • Scenario execution: "execute" the questions stated by a scenario on the architecture and investigate the effects (suited for analysis of nonfunctional quality properties). • Simulation/prototyping: the prototype used should be as small as possible (used to analyse the functional quality properties). • Mathematical methods: used when mathematical models exist (such as timed automata). Examples are schedulability tests for real-time systems and statistical reliability modelling.

Functional analysis Functional quality properties

Functional quality properties • Performance: • must have algorithmic solutions as inputs • prototyping/simulation techniques are used • Ex.: event throughput or queue length for events in a system • The performance measure is not absolute (it is used to compare different architectures) • Reliability: • Attempts have been made to borrow theories used for hardware systems and adapt them to software. • Note: software never wears out. • An alternative method is to measure the testability. Testability is a function of the effort required to assure the required level of reliability or availability.

Reliability Is achieved by using the following approaches to handle faults: • Fault avoidance: • is about designing error-free systems. • Formal or semi-formal methods are used. • Semi-formal methods offer a structured way of reasoning (both at design and analysis level). • They are based on some "formal" notation representing the system model, e.g. Unified Modelling Language (UML), ADLs, etc. Example of such methods: Object-Oriented Analysis and Design (OOA/OOD).

Reliability - 1 • Fault removal: is basically the task of finding errors by testing and removing them by error correction. • Fault tolerance: two approaches are used: 1. Tolerate faults from the environment, e.g. operator errors, hardware errors, etc. 2. Tolerate design faults within the software itself. Ad 1. Redundant hardware (each unit with its own software blocks) is used. Ad 2. Solution approaches include: • Recovery blocks • N-version programming
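The two software approaches named above can be sketched as follows. This is a minimal illustration under stated assumptions: the alternates are pure functions, so no state has to be restored between attempts (a full recovery block would roll back to a checkpoint first), and the integer square root example, including the deliberately faulty `halve_guess`, is hypothetical.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue                      # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions, x):
    """Run independently written versions and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass                          # a crashed version casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among the versions")
    return value


def halve_guess(x):
    """Deliberately faulty 'integer square root' (design fault)."""
    return x // 2


def float_sqrt(x):
    return int(x ** 0.5)


def is_integer_sqrt(x, r):
    """Acceptance test: r is the integer square root of x."""
    return r * r <= x < (r + 1) * (r + 1)
```

For example, `recovery_block([halve_guess, math.isqrt], is_integer_sqrt, 16)` rejects the faulty primary and falls back to the second alternate, while `n_version([math.isqrt, float_sqrt, halve_guess], 16)` outvotes the faulty version.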

Safety • Concerns failures that endanger human life and the environment, i.e. hazards. • Hazard analysis is performed in order to identify hazards. • Techniques for assessing safety properties are mostly scenario based and work either forward or backward. • Backward methods: the analysis starts with the hazard as a scenario and tries to trace it back to the responsible component. Ex.: FTA (Fault Tree Analysis). • Forward methods: the effects of an error in a component are investigated. Ex.: FMEA (Failure Mode and Effect Analysis), HAZOP (Hazard and Operability Studies).
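A backward analysis such as FTA can be sketched as a tree of AND/OR gates over basic events. The sketch below computes the probability of the top event under the assumption that basic events are statistically independent; the tree, the event names, and the probabilities are hypothetical.

```python
def top_event_probability(node, p):
    """Probability of a fault tree node, assuming independent basic events.

    A node is either a basic event name (str) or a pair (gate, children)
    with gate equal to "AND" or "OR".
    """
    if isinstance(node, str):
        return p[node]
    gate, children = node
    probabilities = [top_event_probability(child, p) for child in children]
    if gate == "AND":                     # all children must fail
        result = 1.0
        for q in probabilities:
            result *= q
        return result
    if gate == "OR":                      # at least one child fails
        none_fail = 1.0
        for q in probabilities:
            none_fail *= 1.0 - q
        return 1.0 - none_fail
    raise ValueError("unknown gate: " + str(gate))


# Hypothetical hazard: both redundant sensors fail, or the controller fails.
hazard_tree = ("OR", [("AND", ["sensor_a_fails", "sensor_b_fails"]),
                      "controller_fails"])
basic_events = {"sensor_a_fails": 0.1,
                "sensor_b_fails": 0.1,
                "controller_fails": 0.01}
```

Reading the tree from the top down mirrors the backward direction of the method: start at the hazard and trace it to the components that can cause it.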

Safety - 1 Depending on the result of the safety analysis, the design may have to be changed. Different design approaches to avoid catastrophic failures can be applied based on the severity of an accident caused by the hazard: • Hazard elimination: achieved by • Substitution (replacing a dangerous design possibility with a functionally equivalent but not dangerous solution). • Decoupling (separating safety-critical parts from non-critical software, e.g. safety kernels, firewalls, ...). • Simplification (the KISS rule should be kept in mind). • Hazard reduction: reduces but does not eliminate the hazards. Ex.: erect a fence around an industrial robot.

Safety - 2 • Hazard control: use fail-safe design, i.e. the system is designed to detect the hazard and then transfer itself into a safe state if one exists. If no safe state exists (such as in airplanes), use fault-tolerance methods, such as redundancy, to keep the primary functions alive. • Damage minimization: if accidents occur, the consequences and losses must be reduced (minimize the exposure of the environment or human beings to the accident).

Availability and security Availability = 1 - (MTTR/MTBF), where MTTR = mean time to repair and MTBF = mean time between failures. Security: the goal is to protect the software against malicious actions. Achieved through safety/security kernels, firewalls, etc. Scenario-based methods can be used to assess this property.
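The availability formula on this slide is simple enough to check numerically. In the sketch below the function name and the input validation are assumptions; note also that the formula only gives a value between 0 and 1 if MTBF is read as the full failure-to-failure cycle including repair time, so that MTTR does not exceed MTBF.

```python
def availability(mttr, mtbf):
    """Availability = 1 - MTTR / MTBF, both given in the same time unit."""
    if mtbf <= 0 or mttr < 0 or mttr > mtbf:
        raise ValueError("expected 0 <= MTTR <= MTBF and MTBF > 0")
    return 1.0 - mttr / mtbf


# Hypothetical system: 2 hours to repair, one failure every 1000 hours.
example = availability(mttr=2.0, mtbf=1000.0)
```

For the hypothetical figures above the system is unavailable for 2 out of every 1000 hours.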

Real-time requirements • Temporal correctness of tasks is important. • Analysis: schedulability test, i.e. whether the task set is schedulable given the resources and temporal constraints. • Resources: CPUs, communication buses, actuators, etc. • Temporal constraints include release times, deadlines, worst-case execution time, jitter, etc.
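One well-known schedulability test is the Liu and Layland utilization bound for rate monotonic scheduling, sketched below. The slides do not name this particular test, so it is offered only as an example. It assumes independent, preemptable periodic tasks on a single CPU with deadlines equal to periods, and it is sufficient but not necessary: a task set that fails the test may still be schedulable.

```python
def rm_utilization_bound(n):
    """Liu and Layland bound n * (2^(1/n) - 1) for n tasks."""
    return n * (2 ** (1.0 / n) - 1)


def utilization(tasks):
    """Total CPU utilization of tasks given as (wcet, period) pairs."""
    return sum(wcet / period for wcet, period in tasks)


def rm_schedulable(tasks):
    """Sufficient (not necessary) test for rate monotonic scheduling."""
    return utilization(tasks) <= rm_utilization_bound(len(tasks))


# Hypothetical task sets as (worst-case execution time, period).
light_load = [(1, 4), (1, 5), (2, 10)]   # utilization 0.65
heavy_load = [(2, 4), (2, 5), (2, 10)]   # utilization 1.10
```

The bound falls from 1.0 for a single task towards ln 2 (about 0.693) as the number of tasks grows.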

Scheduling strategies [Figure: taxonomy of scheduling strategies with the nodes Scheduling; Preemptive/non-preemptive; Run-time scheduling; Pre-run-time scheduling; Priority based; Static priorities; Dynamic priorities; FPS+RM; User defined; PCP; ED] Abbreviations: RM = rate monotonic; FPS = fixed priority scheduling; ED = earliest deadline; PCP = priority ceiling protocol.
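For the FPS+RM branch of this taxonomy, an exact test is the classical response-time analysis, sketched below as an illustration rather than as the method used in these slides. It assumes independent, preemptable periodic tasks on one CPU, deadlines equal to periods, no blocking or jitter, and integer time units; the function name and task sets are hypothetical.

```python
def response_times(tasks):
    """Worst-case response times under fixed priorities assigned rate monotonically.

    tasks: list of (wcet, period) pairs in integer time units.
    Returns one entry per task in priority order (shortest period first):
    the response time, or None if the task misses its deadline (= period).
    """
    ordered = sorted(tasks, key=lambda task: task[1])  # RM priority order
    results = []
    for i, (wcet, period) in enumerate(ordered):
        response = wcet
        while True:
            # Interference from every higher-priority task released so far;
            # -(-a // b) is integer ceiling division.
            interference = sum(-(-response // t_j) * c_j
                               for c_j, t_j in ordered[:i])
            candidate = wcet + interference
            if candidate > period:
                results.append(None)      # deadline miss
                break
            if candidate == response:
                results.append(response)  # fixed point reached
                break
            response = candidate
    return results


feasible = [(1, 4), (2, 6), (3, 12)]      # utilization about 0.83
infeasible = [(2, 4), (3, 6), (3, 12)]    # utilization 1.25
```

The `feasible` set has a utilization above the three-task rate monotonic bound of about 0.78, yet every task meets its deadline, which shows why an exact analysis can accept task sets that a utilization-bound test rejects.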

Non-functional quality properties

Non-functional quality properties - 1 • Cost: • depends on other properties such as maintainability, testability, and reusability. • Cost estimation is based on historical experience with similar systems. • Testability: • Testing is used to establish the functional correctness of the software, hence testability is essential. • Depends on three individual properties:

Non-functional quality properties - 2 • Observability: • the result must be observable • In the structural view, components are black boxes; only the interfaces are observable. The bigger the interface, the more visibility → higher testability. • Controllability: • Given an input (to the task or a sub-system) one may control the path taken in the program. If the path depends only on the input itself, maximum controllability is achieved. • If there are data dependencies between different modules, the controllability is decreased → lower testability. • Reproducibility: • To get high testability, the order in which processes execute must be controllable or deterministic, i.e. high reproducibility.

Non-functional quality properties - 3 • Reusability: • Example: the Standard Template Library (STL) for the object-oriented language C++. • Portability: • The focus is on dependencies between the software components in the system and the platform. • Platform: hardware (e.g. processors, A/D converters) and the operating system. • The fewer direct dependencies between a component and the platform, the higher the degree of portability.

Non-functional quality properties - 4 • Maintainability: • Def.: the amount of change in the software architecture required by adding new functionality or correcting errors. • Scenarios derived from the requirements of the new function are used to analyse the existing architecture. A reference list will be provided on the net.
