Juan Pardo Fault Tolerant Systems Group Polytechnic University of Valencia Spain

Reliability study of an embedded operating system for industrial applications
Pardo, J., Campelo, J.C., Serrano, J.J.

Research Objectives
• Critical industrial and fault-tolerant applications need operating systems (OS) that guarantee correct and safe behaviour despite the appearance of errors.
• Software fault injection techniques can be used to validate the behaviour of an operating system in the presence of errors.
• These techniques corrupt the information passed to operating system calls to show how the system reacts to invalid or corrupted values at the kernel calls.
WSRS '04

Research Objectives
• This work presents the development of, and results from, software fault injection in an embedded system composed of a Real-Time Operating System (RTOS) and a microcontroller.
• A software fault injection tool has been developed. The proposed methodology treats the operating system as a black box whose source code is not available.
• To this end, a layer between the operating system and the application under execution has been developed.
• The OS error detection coverage has been measured, and the critical OS data structures that should be improved to increase the final robustness of the operating system are discussed.

Introduction
• Computer software is involved in many aspects of our lives. Despite its enormous expansion, it is still far from perfect.
• Tests are required to measure the quality of software.
• Fault tolerance deals with software's ability to hide problems, specifically the effects of faults [Voas98].
• Robustness is the degree to which a system operates correctly in the presence of exceptional inputs or stressful environmental conditions.
• Robustness can thus be viewed as an indication of the OS's capacity to resist or react to faults induced by the applications running on top of it, or originating from the hardware layer or from device drivers [DBench02].

Introduction
• Fault-tolerant system
• Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults. It is generally implemented by error detection and subsequent system recovery.
• A system able to continue working despite the appearance of errors
• Safe behaviour: a known state that poses no risk to the system
• Dependability
• Avoiding the loss of human lives or significant economic losses
• Final product quality → validation before going to market

Introduction
Dependability: the dependability of a computing system is its ability to deliver service that can justifiably be trusted (A. Avizienis, J.-C. Laprie, B. Randell).

State of the art: Fault Injection Techniques

Advantages & drawbacks (SWIFI)
• Total control over when and where to inject → controllability
• Simulation of higher-level faults
• Reduced cost
• Higher reachability
• Higher portability → flexibility
• Low risk of damaging the circuit under test
• Easy automation of the injection campaigns
• Good observability: processors increasingly include internal debugging tools

Advantages & drawbacks (SWIFI)
• There are zones that software cannot reach.
• Less precise timing measurements → interference with the system, overhead, etc.
• Injection and activation agents add overhead to the system
• Runtime injection → little intrusion
• Objective: minimize the overhead
• Drawback for RTOS
• Easy automation of injection campaigns
• Pre-runtime → less intrusion

SW Fault Injection
• SW fault injection tools:
• FIAT: Fault Injection Based Automated Testing Environment, Carnegie Mellon University.
• EFI, PROFI: Processor Fault Injector, Dortmund University.
• FERRARI: Fault and ERRor Automatic Real-time Injector, University of Texas.
• SFI, DOCTOR: intergrateD sOftware implemented fault injeCTiOn enviRonment, University of Michigan.
• FINE: Fault Injection and moNitoring Environment, University of Illinois.
• FTAPE: Fault Tolerance and Performance Evaluator, University of Illinois.
• XCEPTION: University of Coimbra.
• MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
• BALLISTA: Carnegie Mellon University.

Tools
[Figure: Infineon C166 block diagram labels: core, ROM, RAM 1 KByte (x2), XRAM 1 KByte (x2), CAN bus control, interrupt unit (IR + PEC control), PWM, SSC, USART, WDT, CAPCOM 1+2, ADC, GPT 1+2]
• MicroC/OS-II → RTOS
• Infineon C166 → microcontroller
• Tasking → compiler, debugger, ...
• Infineon microcontroller characteristics:
• 16-bit, high performance
• On-chip CMOS
• 16.5 MIPS, 25/33 MHz
• Advantages of both CISC and RISC
• High peripheral functionality
• Typical in automotive applications

COTS components
• The main motivation for using Commercial Off-The-Shelf (COTS) components in a system design is the notable cost reduction in final product development.
• COTS components are a cost-effective means of rapidly prototyping complex software systems.
• On the other hand, COTS software components pose serious certification problems because their design process is unknown.

COTS components
• COTS software is composed of general-purpose components with poor dependability specifications.
• COTS components are usually black boxes: the source code is not available and the internal architecture (structure and data flow) is not adequately documented.

µC/OS-II Operating System
• It was selected because it has been widely used for several years. First version: MicroC/OS, 1992.
• Industrial robots, motor control, medical instruments, etc.
• It is 99% compliant with the Motor Industry Software Reliability Association (MISRA) C Coding Standards.
• All Modified Condition/Decision Coverage (MC/DC) code in MicroC/OS-II has been removed, improving code quality for RTCA / EUROCAE DO-178B Level A certified environments for avionics applications (Validated Software Corp.).

µC/OS-II: Characteristics
• Portable: µC/OS-II is written in highly portable ANSI C, with target microprocessor-specific code written in assembly language.
• ROMable: µC/OS-II was designed for embedded applications. With the proper tool chain (i.e., C compiler, assembler, and linker/locator), it can be embedded as part of a product.
• Scalable: only the services needed by the application have to be included, which reduces the amount of memory (both RAM and ROM) needed. Scalability is accomplished through conditional compilation.
• Preemptive: µC/OS-II is a fully preemptive real-time kernel. It always runs the highest-priority task that is ready.
• Multitasking: µC/OS-II can manage up to 64 tasks; however, the current version of the software reserves eight of these tasks for system use, leaving up to 56 tasks for the application. Each task has a unique priority assigned to it, which means that µC/OS-II cannot do round-robin scheduling.
(Jean J. Labrosse)

µC/OS-II: Characteristics
• Deterministic: the execution time of all µC/OS-II functions and services is deterministic, so it is always known how long µC/OS-II will take to execute a function or a service. Furthermore, the execution time of µC/OS-II services does not depend on the number of tasks running in the application.
• Task stacks: each task requires its own stack; µC/OS-II allows each task to have a different stack size, which reduces the amount of RAM needed by the application.
• Services: system services such as mailboxes, queues, semaphores, fixed-sized memory partitions, time-related functions, etc.
• Interrupt management: interrupts can suspend the execution of a task. If a higher-priority task is awakened as a result of the interrupt, the highest-priority task will run as soon as all nested interrupts complete. Interrupts can be nested up to 255 levels deep.
• Robust and reliable: µC/OS-II is based on µC/OS, which has been used in hundreds of commercial applications since 1992.
(Jean J. Labrosse)

Black-box approach
• The aim was to study the OS using a black-box approach.
• The OS source code was therefore left unmodified, to avoid intruding on the OS behaviour as far as possible.
• To this end, a layer named the Meta-Kernel was developed between the OS and the application under execution.
• Through this layer, faults were injected into any of the parameters of the system calls to measure the OS robustness.
• In black-box testing, input is fed into a program and the output is checked. What goes on inside the program (the black box) is unimportant [Voas98].

System Design
• MicroC/OS-II OS → black box
• OS source code not modified
• Injector → layer between the OS and the application
• Injection into the parameters of system calls
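The idea of an injector layer sitting between the application and an unmodified OS can be sketched as below. This is only an illustration in Python, not the authors' Meta-Kernel: `fake_sem_post`, the `MetaKernel` class and its `corrupt` hook are hypothetical names, and the stand-in service merely imitates a kernel call that rejects unknown event identifiers.

```python
from typing import Callable, Optional, Tuple


def fake_sem_post(event_id: int) -> int:
    """Hypothetical stand-in for a kernel service (the real OS is a black box).

    Returns 0 on success and 1 when the event id is not a known semaphore.
    """
    return 0 if event_id in (10, 11, 12) else 1


class MetaKernel:
    """Layer between application and OS: forwards calls, optionally corrupting arguments."""

    def __init__(self) -> None:
        # Hook that receives the argument tuple and returns a corrupted copy.
        self.corrupt: Optional[Callable[[Tuple[int, ...]], Tuple[int, ...]]] = None
        self.log = []  # (service name, original args, args sent, result)

    def call(self, service, *args):
        sent = args
        if self.corrupt is not None:
            sent = self.corrupt(args)
            self.corrupt = None  # only one fault is injected per experiment
        result = service(*sent)
        self.log.append((service.__name__, args, sent, result))
        return result
```

The application calls `mk.call(fake_sem_post, 10)` instead of the service itself, so arming `mk.corrupt` before a call corrupts exactly one invocation while the service code stays untouched.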

Injector Attributes
• Prediction, elimination
• Pre-runtime & runtime
• High level
• Transient faults
• One bit changed in a system call (bit-flip)
• One fault injected per experiment
• Workload for tool testing
[Figure: software fault injection attributes]

Workload Design
• Characteristics:
• Maximum use of system calls
• System calls for synchronization, semaphores, memory, queues, messages, task handling, timing management, etc.
• Open module to include computations
• Workload for testing both the injection tool and the OS

Workload Design
• The system workload ran continuously and consisted of a series of tasks executing the application.
• Meanwhile, an injection agent injected faults and invalid values at the kernel calls in order to monitor the system robustness.

Errors Classification
• Errors that could affect the system
• Classification according to the detection mechanisms
• Measurements of error detection coverage and latency times after the fault injection

Injection Model
• The faultload is the most critical dimension of an OS benchmark and, more generally, of any dependability benchmark.
• Two techniques can be used to corrupt system call parameters: the ‘bit-flip technique’, which systematically flips bits of the target parameters, and the ‘selective substitution technique’, which introduces invalid data values into the system call parameters.
• Studies have demonstrated the equivalence of the errors provoked by the two techniques [DBench02].
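Selective substitution can be illustrated with a small Python sketch. The table of invalid values is an assumption made for the example (it is not taken from the study), as are the names `INVALID_VALUES` and `substitute`; the priority value 64 is chosen only because the kernel is described as managing up to 64 tasks.

```python
# Hypothetical invalid values per parameter kind (assumed for illustration only).
INVALID_VALUES = {
    "pointer": [0x0000, 0xFFFF],
    "priority": [64, 255],
    "timeout": [0xFFFF],
}


def substitute(params, index, kind, choice=0):
    """Return a copy of params with params[index] replaced by an invalid value.

    `kind` selects the list of candidate values and `choice` the entry in it.
    """
    corrupted = list(params)
    corrupted[index] = INVALID_VALUES[kind][choice]
    return tuple(corrupted)
```

Unlike a bit-flip, the replacement value here does not depend on the original argument, only on the kind of parameter being targeted.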

Injection Model
• BIT-FLIP technique
• Chosen randomly at runtime: the system call, the parameter, and the bit
• Models the consequences of physical faults: electromagnetic interference (EMI), noise, hardware faults, ...
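A minimal Python sketch of the bit-flip selection follows. The function names, the 16-bit default width (chosen to match a 16-bit microcontroller) and the representation of calls as a name-to-parameter-count mapping are assumptions made for illustration; every listed call is assumed to have at least one parameter.

```python
import random


def flip_bit(value: int, bit: int, width: int = 16) -> int:
    """Flip one bit of a width-bit unsigned value using XOR with a one-bit mask."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def choose_target(calls, rng, width=16):
    """Pick a system call, one of its parameters, and a bit position at random.

    `calls` maps a call name to its number of parameters (each must be >= 1).
    """
    name = rng.choice(sorted(calls))
    param = rng.randrange(calls[name])
    bit = rng.randrange(width)
    return name, param, bit
```

Passing a seeded `random.Random` instance as `rng` makes a campaign repeatable, and flipping the same bit twice restores the original value, which matches the transient nature of the injected faults.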

Analysis of the obtained results
• Coding of the different outcomes:
• D0: No error, correct output (the fault injection did not affect the system).
• D1: Error detected by the operating system (µC/OS-II error code).
• D2: Error detected by the application (the application result was not correct).
• D3: Error that caused the system to hang (system failure).
• D4: Error detected by the microcontroller.
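The outcome coding can be expressed as a small classifier. This Python sketch is an assumption about how the observations of one experiment might be mapped to D0..D4: the input flags and, in particular, the order in which the conditions are checked are hypothetical and are not stated in the slides.

```python
def classify(hung, mcu_exception, os_error_code, output_ok):
    """Map the observations of one experiment to an outcome class D0..D4.

    hung           -- the system stopped responding
    mcu_exception  -- the microcontroller raised a hardware trap
    os_error_code  -- error code returned by the OS call (0 means no error)
    output_ok      -- the application produced the expected result
    The precedence of the checks is an assumption made for this sketch.
    """
    if hung:
        return "D3"
    if mcu_exception:
        return "D4"
    if os_error_code != 0:
        return "D1"
    if not output_ok:
        return "D2"
    return "D0"
```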

Analysis of the obtained results
Coverage [Powell95, Constantinescu95]:
• Complete system (µC/OS-II + microcontroller): C_cs = D0 + D1 + D2 + D4 = 65.7 + 21 + 2 + 2.5 = 91.2 %
• Operating system (µC/OS-II): C_OS = D0 + D1 = 65.7 + 21 = 86.7 %
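The coverage arithmetic can be checked with a few lines of Python. The percentages are those reported on the slide; the share attributed to D3 is not given there and is derived here under the assumption that the classes D0..D4 are exhaustive and mutually exclusive.

```python
# Outcome percentages reported on the slide.
outcomes = {"D0": 65.7, "D1": 21.0, "D2": 2.0, "D4": 2.5}

# Coverage of the complete system: every outcome except a system hang (D3).
c_cs = sum(outcomes[k] for k in ("D0", "D1", "D2", "D4"))
# Coverage of the operating system alone: no effect, or detected by the OS.
c_os = outcomes["D0"] + outcomes["D1"]
# Remainder, assuming D0..D4 are exhaustive and mutually exclusive (an assumption).
d3 = 100.0 - c_cs

print(f"C_cs = {c_cs:.1f} %")  # C_cs = 91.2 %
print(f"C_OS = {c_os:.1f} %")  # C_OS = 86.7 %
print(f"D3   = {d3:.1f} %")    # D3   = 8.8 %
```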

Analysis of the obtained results
• Error detection latencies
• Time between the injection and its detection by the OS
• Mean value obtained: 304 μs
• Measured with one of the microcontroller's built-in timers
• High precision

Other Results
‘E1’, which corresponds to ‘OS_ERR_EVENT_TYPE’, was the most frequent error code. It was produced when the fault was injected into a semaphore, message queue or mailbox, and the system reacted by entering a hung state. The second most frequent, ‘E42’, which corresponds to ‘OS_PRIO_INVALID’, was obtained when the injection targeted task management system calls.
[Table: frequency of the most typical error codes returned by the OS]

Other Results
Moreover, the injection campaigns showed how errors propagated through the system. The corrupted system call was recorded, together with the system call that finally detected the error and the time the system took to detect it.
[Figure: error propagation]
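One possible shape for such a propagation record is sketched below in Python. The field names, the microsecond unit and the helper functions are assumptions made for illustration and do not describe the authors' tool.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PropagationRecord:
    corrupted_call: str   # system call whose parameter was corrupted
    detecting_call: str   # system call that finally reported the error
    t_inject_us: float    # timer reading at injection, in microseconds
    t_detect_us: float    # timer reading at detection, in microseconds

    @property
    def latency_us(self) -> float:
        """Time elapsed between injection and detection."""
        return self.t_detect_us - self.t_inject_us


def propagated(record: PropagationRecord) -> bool:
    """True when the error was detected by a call other than the corrupted one."""
    return record.corrupted_call != record.detecting_call


def mean_latency_us(records) -> float:
    """Mean detection latency over the records; raises ValueError if there are none."""
    if not records:
        raise ValueError("no records")
    return mean(r.latency_us for r in records)
```

Collecting one record per detected error gives both the propagation paths (corrupted call to detecting call) and the mean detection latency from the same data.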

Other Results
• Finally, the most critical system calls were identified, with the aim of improving their robustness and, in turn, the final OS dependability.
• For example, injection into some data structures related to the event control block produced many failures, and most of the time the system hung.
• This is because these structures store the list of tasks waiting for an event: if the injection corrupts that information, the system loses the sequence of next actions and enters an unsafe state without knowing how to react (the system hangs).
• This indicates where special attention is needed, since an error in those data structures could provoke critical failures in the system.

Conclusions
• The experiments yielded the error detection coverage, error detection latencies, error propagation, typical OS error codes, etc.
• Fault injection into the code and data memory segments of the microkernel will also be implemented.
• Possible improvements to MicroC/OS-II to increase its dependability should take into account that errors in certain data structures could provoke critical failures in the system.
• These data structures should implement some mechanism to protect the information they hold.

Future Research
• In future work, these data are to be compared with those of other COTS RTOSs working under the same conditions.
• RT fault injector to minimize intrusion (without internal debug support, intrusion > 0)
• Nexus-implemented fault injection
• Other architecture: Motorola MPC565
• Intrusion → null
• Preliminary results
• Better controllability and observability
• Best option to validate RTOS and applications

Contact Data
Juan Pardo, Fault Tolerant Systems Group, Polytechnic University of Valencia, Spain
Email: juaparal@upvnet.upv.es
Web: http://www.disca.upv.es/gstf/
