Fault Detection in a HW/SW CoDesign Environment

Fault Detection in a HW/SW CoDesign Environment Prepared by A. Gaye Soykök

Outline • Introduction • System Specification • Fault model • Some terminology • Methodology Analysis • Reliable communication • HW/SW Partitioning

Introduction • System reliability aspects are generally considered only toward the end of the design process, at low abstraction levels • Working at low abstraction levels introduces more overhead • Not all systems can be considered at low levels • It is better to handle fault detection at higher levels • It is better to assess, for the sake of system performance, whether fault detection should be done in HW or SW

Introduction • At system level, several parameters are considered and one design is chosen among several alternatives • Time constraints • Power consumption • Testability • Area

Introduction • Fault detection facilities are introduced at system level • HW/SW binding of components is affected • System Specification: which parts are critical and need fault detection • Design methodologies: how these detection facilities are applied, either in HW or SW • HW/SW partitioning: which parts are in SW and which are in HW, guided by the methodologies

System Specification • Language must let the user specify which sections require reliability aspects, for example SystemC or OCCAM • Architecture: CPU (DSP or general purpose), coprocessors (ASIC or FPGA)

Fault Model • Single Functional Failure • Any number of physical faults causes a functional module to perform incorrectly • HW is faulty; software is affected by hardware • CPU, communication channels, one of the coprocessors, or memory may fail • A module failure is detected before any other module fails • Temporal, architectural and informational redundancy are adopted

Some Terminology • Nominal: original system function elements • Checking: redundant elements for fault detection • Checker: element that compares checking and nominal • Each of these elements can be independently implemented in either HW or SW
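The three roles can be sketched in software as follows. This is only an illustrative sketch of the all-SW case: the function names (`nominal_sum`, `checking_sum`, `checker`, `run_checked`) and the `MismatchError` exception are hypothetical, and summing a list stands in for any system function.

```python
from typing import Callable, List


class MismatchError(Exception):
    """Raised by the checker when nominal and checking results differ."""


def nominal_sum(values: List[int]) -> int:
    # Nominal element: the original system function.
    return sum(values)


def checking_sum(values: List[int]) -> int:
    # Checking element: a redundant computation of the same result,
    # coded differently from the nominal one.
    total = 0
    for v in reversed(values):
        total += v
    return total


def checker(nominal_result: int, checking_result: int) -> int:
    # Checker element: compares the two results and raises on mismatch.
    if nominal_result != checking_result:
        raise MismatchError(
            f"nominal={nominal_result}, checking={checking_result}"
        )
    return nominal_result


def run_checked(
    values: List[int],
    nominal: Callable[[List[int]], int] = nominal_sum,
    checking: Callable[[List[int]], int] = checking_sum,
) -> int:
    # Each role is a separate callable, mirroring the idea that each
    # element could be bound to HW or SW independently.
    return checker(nominal(values), checking(values))
```

Passing a deliberately wrong `nominal` callable to `run_checked` is a simple way to emulate a fault and see the checker raise `MismatchError`.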

HW or SW • Nominal SW, checker SW, checking SW: checking and checker are executed either by the system processor or by a dedicated processor. Ex: self-checking SW, assertions, dual processor and VLIW

HW or SW (Cont’d) • Nominal SW, checker HW and checking SW: interface for functional redundancy check, VLIW with hardware, DMA checker • Nominal SW, checker HW and checking HW: CED (concurrent error detection) solutions are implemented totally in HW. Ex: dynamically configurable checker

HW or SW (Cont’d) • Nominal HW, checker HW, checking HW: classical approach. Ex: duplication, TSC (totally self-checking) devices

Methodologies Analysis - Concepts • Number and type of processing elements • Whether a special architecture is necessary • Synchronization issues between processing elements • Allocation of checker memory space • Checker structure and complexity • Selection of a checker methodology to raise errors in case of mismatches

Methodologies Analysis - Metrics • Detection latency: the time between the instant an error occurs and the instant it is detected • Coverage: how many of the existing faults can be detected • Performance degradation: overhead caused by fault detection facilities compared to nominal functions

Methodologies Analysis - Metrics (Cont’d) • Material cost: cost of physical components • Design cost: effort needed to design the system
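The measurable metrics (detection latency, coverage, performance degradation) could be computed from a fault-injection log roughly as below. This is a sketch under assumptions: the `FaultRecord` type, the function names, and the choice of averaging latency over detected faults only are illustrative and not taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultRecord:
    occurred_at: float             # instant the error occurs
    detected_at: Optional[float]   # instant it is detected, None if missed


def coverage(records: List[FaultRecord]) -> float:
    # Fraction of the existing faults that were detected.
    if not records:
        raise ValueError("no fault records")
    detected = sum(1 for r in records if r.detected_at is not None)
    return detected / len(records)


def mean_detection_latency(records: List[FaultRecord]) -> float:
    # Average time between occurrence and detection, over detected faults.
    latencies = [
        r.detected_at - r.occurred_at
        for r in records
        if r.detected_at is not None
    ]
    if not latencies:
        raise ValueError("no detected faults")
    return sum(latencies) / len(latencies)


def performance_degradation(nominal_time: float, checked_time: float) -> float:
    # Relative overhead of the fault detection facilities compared to
    # the nominal function alone.
    if nominal_time <= 0:
        raise ValueError("nominal_time must be positive")
    return (checked_time - nominal_time) / nominal_time
```

Material cost and design cost are left out because they depend on component prices and engineering effort rather than on run-time measurements.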

Reliable Communication • Apart from data processing, communication needs to be reliable • Hardware redundancy: line duplication • Information redundancy: data encoding • Most effective when data encoding is used where SW is involved and hardware sections employ dedicated lines (duplicated, encoded)
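As one simple instance of information redundancy, a word can be sent with an even-parity bit. The slides do not name a specific code, so parity is an assumption chosen for brevity; it detects any odd number of flipped bits but misses even numbers of flips. The function names are hypothetical.

```python
def parity(word: int) -> int:
    # Returns 1 if the number of set bits is odd, else 0.
    # Assumes a non-negative integer.
    if word < 0:
        raise ValueError("word must be non-negative")
    return bin(word).count("1") % 2


def encode(word: int) -> int:
    # Append an even-parity bit as the least significant bit, so every
    # valid codeword has an even number of set bits.
    return (word << 1) | parity(word)


def is_valid(codeword: int) -> bool:
    # A codeword is accepted only if its overall parity is even.
    return parity(codeword) == 0


def decode(codeword: int) -> int:
    # Strip the parity bit, raising if the check fails.
    if not is_valid(codeword):
        raise ValueError("parity error detected on the channel")
    return codeword >> 1
```

For example, `decode(encode(0b1011))` returns the original word, while flipping a single bit of the codeword before decoding raises `ValueError`.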

HW/SW Partitioning • After the system is specified, the methodologies have been assessed, and different alternatives have been produced with cost functions, the partitioning step takes place • Evaluate cost functions, evaluate the user's constraints • Reliability aspects make it more complex: make the partitioning in two stages!

HW/SW Partitioning (Cont’d) • First level: classical aspects and functions are taken into account • Second level: given the first-level solution, reliability aspects are introduced, and the solution from the solution set that has the best trade-off and still satisfies the first-level constraints is chosen • If no reliability constraints are given, the second level is not carried out

HW/SW Partitioning (Cont’d) • If a specific architecture is required for reliability (for example dual processor), the first level benefits from earlier partitioning solutions • A solution may not exist after reliability constraints are introduced, and the first level may need to be repeated

HW/SW Partitioning (Cont’d) • Reliability constraints, which drive the second stage, may be • Hard, ex: 100% fault coverage • Soft, ex: any fault coverage • Parameters considered • Fault coverage • Performance degradation • Detection latency • Area overhead
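The two-stage selection described on the partitioning slides might be sketched like this. Everything concrete here is an assumption for illustration: the `Candidate` fields, the area and time limits standing in for the classical constraints, the equal-weight sum used as the trade-off, and the use of `None` to signal that the first level has to be repeated. All reliability parameters are assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    area: float               # classical cost
    exec_time: float          # classical cost
    fault_coverage: float     # 0..1
    perf_degradation: float   # 0..1, normalized
    detection_latency: float  # 0..1, normalized
    area_overhead: float      # 0..1, normalized


def first_level(cands: List[Candidate], max_area: float,
                max_time: float) -> List[Candidate]:
    # First level: only classical constraints are checked.
    return [c for c in cands
            if c.area <= max_area and c.exec_time <= max_time]


def reliability_cost(c: Candidate) -> float:
    # Assumed trade-off: equal-weight sum of the four parameters,
    # with missing coverage counted as a cost.
    return ((1.0 - c.fault_coverage) + c.perf_degradation
            + c.detection_latency + c.area_overhead)


def partition(cands: List[Candidate], max_area: float, max_time: float,
              min_coverage: Optional[float] = None) -> Optional[Candidate]:
    feasible = first_level(cands, max_area, max_time)
    if not feasible:
        return None
    if min_coverage is None:
        # No reliability constraint: the second level is not carried out.
        return min(feasible, key=lambda c: c.area)
    # Second level: a hard constraint is min_coverage=1.0,
    # a soft one is min_coverage=0.0 (any coverage).
    reliable = [c for c in feasible if c.fault_coverage >= min_coverage]
    if not reliable:
        return None  # no solution: the first level may need repeating
    return min(reliable, key=reliability_cost)
```

A `None` result corresponds to the case where no solution exists after the reliability constraints are introduced.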

Conclusion • Design for reliability has been merged into the HW/SW codesign process, resulting in a final design that has on-line fault detection properties • Future work is introducing fault tolerance into the HW/SW codesign process
