MISSION / SAFETY CRITICAL ARCHITECTURES. Presented by Linda S. Alger, INTRUSION TOLERANT WORKSHOP, Williamsburg, Virginia, 5 October 1999

MISSION / SAFETY CRITICAL ARCHITECTURES. Presented by Linda S. Alger, INTRUSION TOLERANT WORKSHOP, Williamsburg, Virginia, 5 October 1999. OVERVIEW: Mission/Safety Critical Fault Tolerance; Fault Tolerant Techniques; Examples of Draper Fault Tolerant Systems; Backup.

Presentation Transcript


  1. MISSION / SAFETY CRITICAL ARCHITECTURES • Presented by Linda S. Alger • INTRUSION TOLERANT WORKSHOP • Williamsburg, Virginia • 5 October 1999

  2. OVERVIEW • Mission/Safety Critical Fault Tolerance • Fault Tolerant Techniques • Examples of Draper Fault Tolerant Systems • Backup

  3. Mission / Safety Critical Fault Tolerance • Fault tolerance is the ability to provide intended functionality in the presence of faults. Redundancy is used to mitigate the effects of system errors due to faults. • There are two classes of faults that are of particular concern to mission and safety critical applications: • Random hardware faults • Software / common mode faults • Techniques have been developed for both classes of faults • Random hardware faults: • Formally validated solutions using hardware redundancy and exact consensus to tolerate arbitrarily malicious (Byzantine) faults • Continuous self tests for latent faults • State restoration • Software / common mode faults: • Design diversity • Exception handlers • Watchdog timers

  4. Hardware Redundancy, Synchronization & Exact Consensus • Technique • Hardware redundancy using exact consensus with rigorously implemented fault containment regions and tight synchronization • Problems being addressed • Random hardware faults • High coverage fault detection and identification • Identifying all failure modes and methods for dealing with them • Dealing with arbitrarily malicious (Byzantine) faults • Avoidance of output errors through fault masking by voters • Pros • No need to exhaustively identify all possible failure modes • System amenable to formal verification methods • Cons • Does not protect from common mode faults • Assumes recovery from the first fault occurs prior to occurrence of a second fault
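
The fault masking by voters mentioned on this slide can be illustrated with a minimal sketch. It assumes each redundant channel reports its result as a byte string and that agreement means bit-for-bit equality; the function names `exact_majority_vote` and `disagreeing_channels` are hypothetical and do not come from the presentation.

```python
from collections import Counter
from typing import List, Optional, Sequence


def exact_majority_vote(values: Sequence[bytes]) -> Optional[bytes]:
    """Return the value held by a strict majority of channels, or None."""
    if not values:
        return None
    winner, count = Counter(values).most_common(1)[0]
    # Exact consensus: values must match bit for bit to count as agreeing.
    return winner if count > len(values) // 2 else None


def disagreeing_channels(values: Sequence[bytes]) -> List[int]:
    """Indices of channels whose output differs from the voted value.

    Returns an empty list when there is no majority, because the
    fault cannot then be attributed to a specific channel.
    """
    voted = exact_majority_vote(values)
    if voted is None:
        return []
    return [i for i, v in enumerate(values) if v != voted]


if __name__ == "__main__":
    # Four channels, one of which (index 2) delivers a corrupted result.
    channel_outputs = [b"\x01\x02", b"\x01\x02", b"\xff\x02", b"\x01\x02"]
    print(exact_majority_vote(channel_outputs))
    print(disagreeing_channels(channel_outputs))
```

The voter masks the bad value without knowing how the channel failed, which is the point behind "no need to exhaustively identify all possible failure modes". It only works if healthy channels produce identical bits, hence the need for tight synchronization and identical inputs.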

  5. State Restoration • Technique • Resynchronize and reset the state of the channel being recovered using the fault tolerant clock, inter-channel data exchange from known ‘good’ channels, and the voting network • Incremental recovery using tagged memory and the data exchange network • Problems being addressed • Random hardware faults • Recovering temporarily failed or repaired hardware • Pros • Recovery process is a straightforward application of the existing cross channel data exchange and voting hardware • Cons • System is offline during all or a portion of the recovery process • Incremental recovery may not converge

  6. Design Diversity • Technique • N-version programming with confidence voter, layered on Fault Tolerant Processor with attached processors • Problems being addressed • Common mode software problems without losing Byzantine resilience to random hardware faults • Coincident software faults across N versions • Pros • Unified approach addressing random hardware faults and common mode software faults • Isolation between hardware and software faults • Cons • Cannot use the exact consensus approach to vote software versions; thresholds must be set to manage false alarms vs. missed detections • Not enough testing done to prove that the confidence voter solved coincident software faults
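
The threshold trade-off named under "Cons" can be made concrete with a small sketch. This is not the confidence voter from the presentation, whose algorithm the slides do not describe; it is a generic inexact voter that takes the median of the version outputs and flags any version further than a tolerance from it. The name `threshold_vote` and the sample numbers are illustrative assumptions.

```python
from statistics import median
from typing import List, Sequence, Tuple


def threshold_vote(outputs: Sequence[float],
                   tolerance: float) -> Tuple[float, List[int]]:
    """Return (voted value, indices of versions flagged as outliers).

    Diverse software versions rarely agree bit for bit, so agreement
    is judged against a tolerance rather than by exact comparison.
    """
    voted = median(outputs)
    flagged = [i for i, x in enumerate(outputs) if abs(x - voted) > tolerance]
    return voted, flagged


if __name__ == "__main__":
    # Hypothetical outputs of four versions; version 3 is clearly wrong.
    outputs = [10.0, 10.2, 9.9, 14.0]
    for tol in (0.05, 0.5, 5.0):
        print(tol, threshold_vote(outputs, tol))
```

With a tight tolerance every version is flagged (false alarms); with a loose one the bad version goes unnoticed (missed detection). Only the middle setting isolates version 3.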

  7. HIERARCHICAL APPROACH TO FAULT MANAGEMENT • Near Perfect Fault Detection, Identification and Reconfiguration of Random, Arbitrarily Malicious Faults to a Fault Containment Region • Continuous Hardware-based Fault Masking and Detection • Near Real-Time (msec.) Software-based Fault Isolation and Reconfiguration • Resolution of Transient, Intermittent, and Hard Faults by Heuristics • Software-based Self-Tests (Built-In-Test) Run in Background on a Time-Available Basis: • Uncover Latent Faults (Especially with Voters, etc.) • Localize Faults to Module or Chip Level • On-line Repair & Diagnostic Capability

  8. SEAWOLF SHIP CONTROL PROCESSING UNIT (SCPU) • Quadruply redundant fault tolerant computer • Fault isolation to the card level • Heuristics to resolve transient, intermittent, & hard faults • Channel recovery with SCPU off line for < 150 msec. • Continuous self tests for latent faults • Exception handlers, watchdog timer, & overrun flag for common mode software faults

  9. Flight Critical Computer for the X-38 / Crew Return Vehicle • Fault Tolerant Parallel Processor • Eight Processors: one quad group & four simplex groups • COTS hardware and software • Four Fault Containment Regions • Expandable to Five • Network Element provides: • Hardware synchronization • Source congruent data exchange & voting • Message passing between parallel processors • Error detection • Exception handlers, watchdog timer, overrun flag and memory management for software faults

  10. Fault Tolerant Processor with Attached Processors (FTP/AP) • Quadruply redundant fault tolerant processor with 4 attached processors • Four software versions of critical function on AP • Fault isolation to hardware FCR or software version • Confidence voter instead of majority voter used to resolve software version discrepancies and the issue of coincident errors • Hardened kernel approach used for operating system software • Recovery of software versions is accomplished by continuous execution with output masked • Output is compared to voted output • If output agrees for several iterations, version is restored
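
The version recovery rule in the last three bullets can be sketched as a small state machine. The class name `VersionMonitor`, the default of three agreeing iterations (the slide says only "several"), and the choice to mask a version on its first disagreement are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VersionMonitor:
    """Tracks whether one software version's output is masked or in use."""
    required_agreements: int = 3  # assumed value; the slide says "several"
    masked: bool = False
    streak: int = 0  # consecutive agreements seen while masked

    def observe(self, agrees_with_voted: bool) -> bool:
        """Record one iteration; return True if the output may be used."""
        if not self.masked:
            if not agrees_with_voted:
                # Assumption: a single disagreement masks the version.
                self.masked = True
                self.streak = 0
            return not self.masked
        # Masked: the version keeps executing and is compared every iteration.
        self.streak = self.streak + 1 if agrees_with_voted else 0
        if self.streak >= self.required_agreements:
            self.masked = False  # restore the version
            self.streak = 0
        return not self.masked


if __name__ == "__main__":
    monitor = VersionMonitor()
    history = [True, False, True, True, True, True]
    print([monitor.observe(a) for a in history])
```

The caller decides what "agrees" means (for example, a tolerance check against the voted output), so the monitor itself stays independent of the voting scheme.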

  11. Backup

  12. Fault Containment Region (FCR) • An FCR is a collection of components that operates correctly regardless of any arbitrary logical or electrical fault outside the region. • An arbitrary logical or electrical fault in an FCR cannot cause the hardware outside the region to misbehave or fail in any manner. • Faults cannot propagate across containment regions but their effects (errors) can.

  13. Error Containment • Voting planes are used to mask errors at different stages in a fault tolerant system. • Input voting prevents failed sensor values from propagating to application programs. • Internal computer voting prevents erroneous data from a failed Fault Containment Region from propagating to other FCRs. • Output voting and monitor/interlock mechanisms prevent outputs of failed FCRs from propagating outside the computational core. • Actuator voting masks errors in the transmission mechanism connecting computer to actuators.

  14. Requirements for Exact Consensus & Byzantine Resilience • Requirements for Exact Consensus: • Identical initial states • Identical inputs • Identical operations • No faults • Bounded skew • Theoretically correct implementation of f-Byzantine Resilience requires: • Bit-wise comparison of results emanating from redundant sites of equivalent state complexity • 3f+1 fault containment regions (FCRs) • 2f+1 inter-FCR connectivity • f+1 round inter-FCR protocol • FCR synchronism
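
The three numeric requirements on this slide are simple functions of f, the number of simultaneous Byzantine faults to be tolerated. The sketch below evaluates them; the names `ByzantineRequirements` and `byzantine_requirements` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByzantineRequirements:
    faults: int        # f, the number of Byzantine faults tolerated
    fcrs: int          # 3f + 1 fault containment regions
    connectivity: int  # 2f + 1 inter-FCR connectivity
    rounds: int        # f + 1 rounds of inter-FCR exchange


def byzantine_requirements(f: int) -> ByzantineRequirements:
    """Evaluate the slide's formulas for a given fault count f."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return ByzantineRequirements(
        faults=f, fcrs=3 * f + 1, connectivity=2 * f + 1, rounds=f + 1
    )


if __name__ == "__main__":
    for f in (1, 2):
        print(byzantine_requirements(f))
```

For f = 1 this gives 4 FCRs and 2 rounds of exchange, which matches the single-fault configuration shown on the following two slides.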

  15. A Correct Solution: 4 FCRs • Four participants in input distribution algorithm

  16. A Correct Solution: 2 Rounds of Exchange • 2-round input distribution algorithm • Vote exchanged values
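
The slides give no detail of the exchange itself, so the following is only a sketch of one standard way a 2-round input distribution among four participants can work: a source sends its input to three receivers (round 1), each receiver relays what it got to the other two (round 2), and each receiver votes on the three values it holds. The FCR names, the function names, and the `relay` hook used to inject a faulty participant are illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Optional


def majority(values: List[Hashable]) -> Optional[Hashable]:
    """Strict majority of the values, or None if there is none."""
    winner, count = Counter(values).most_common(1)[0]
    return winner if count > len(values) // 2 else None


def honest_relay(sender: str, receiver: str, value: Hashable) -> Hashable:
    """Round-2 behaviour of a healthy FCR: forward the value unchanged."""
    return value


def two_round_exchange(
    round1: Dict[str, Hashable],
    relay: Callable[[str, str, Hashable], Hashable] = honest_relay,
) -> Dict[str, Optional[Hashable]]:
    """round1 maps each receiving FCR to the value the source sent it.

    relay(sender, receiver, value) is what `sender` forwards to
    `receiver` in round 2. Returns the value each receiver votes.
    """
    receivers = list(round1)
    result = {}
    for me in receivers:
        # Own round-1 value plus the values relayed by the other receivers.
        seen = [round1[me]]
        seen += [relay(o, me, round1[o]) for o in receivers if o != me]
        result[me] = majority(seen)
    return result


if __name__ == "__main__":
    # Faulty source: sends a different value to one receiver.
    print(two_round_exchange({"B": 1, "C": 1, "D": 2}))

    # Healthy source, faulty receiver C that relays garbage.
    def lying_c(sender, receiver, value):
        return 99 if sender == "C" else value

    print(two_round_exchange({"B": 7, "C": 7, "D": 7}, relay=lying_c))
```

In both cases the healthy receivers end up with the same voted value, even though one of the four participants behaves arbitrarily. When a faulty source sends three different values, no majority exists and every receiver obtains `None`, which a real system would replace with an agreed default.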