
MISSION / SAFETY CRITICAL ARCHITECTURES
Presented by Linda S. Alger
INTRUSION TOLERANT WORKSHOP
Williamsburg, Virginia
5 October 1999

OVERVIEW
• Mission/Safety Critical Fault Tolerance
• Fault Tolerant Techniques
• Examples of Draper Fault Tolerant Systems
• Backup

Mission / Safety Critical Fault Tolerance
• Fault Tolerance is the ability to provide intended functionality in the presence of faults. Redundancy is used to mitigate the effects of system error due to faults.
• There are two classes of faults that are of particular concern to mission and safety critical applications:
  • Random hardware faults
  • Software / Common mode faults
• Techniques have been developed for both classes of faults
  • Random hardware faults:
    • Formally validated solutions using hardware redundancy and exact consensus to tolerate arbitrarily malicious (Byzantine) faults
    • Continuous Self Tests for latent faults
    • State restoration
  • Software / Common mode faults:
    • Design diversity
    • Exception handlers
    • Watchdog timers

Hardware Redundancy, Synchronization & Exact Consensus
• Technique
  • Hardware redundancy using exact consensus with rigorously implemented fault containment regions and tight synchronization
• Problems being addressed
  • Random hardware fault
  • High coverage fault detection and identification
    • Identifying all failure modes and methods for dealing with them
    • Dealing with arbitrarily malicious (Byzantine) faults
  • Avoidance of output errors through fault masking by voters
• Pros
  • No need to exhaustively identify all possible failure modes
  • System amenable to formal verification methods
• Cons
  • Does not protect from common mode faults
  • Assumes recovery from first fault occurs prior to occurrence of second fault
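A minimal sketch of fault masking by an exact-consensus voter, assuming each redundant channel produces a bit-identical output word when healthy. The function name `exact_vote`, the channel labels, and the sample values are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def exact_vote(channel_outputs):
    """Bit-for-bit majority vote over redundant channel outputs.

    channel_outputs maps a channel label to its output word.
    Returns (voted_value, suspect_channels).
    """
    counts = Counter(channel_outputs.values())
    value, votes = counts.most_common(1)[0]
    # A strict majority is required; otherwise the fault cannot be masked.
    if votes <= len(channel_outputs) // 2:
        raise RuntimeError("no majority: more faults than the voter can mask")
    # Any channel that disagrees with the voted value is flagged for isolation.
    suspects = sorted(ch for ch, out in channel_outputs.items() if out != value)
    return value, suspects

# Hypothetical quad-redundant frame: channel C delivers a corrupted word.
outputs = {"A": 0x3F2A, "B": 0x3F2A, "C": 0x3F2B, "D": 0x3F2A}
voted, suspects = exact_vote(outputs)
```

The voter does not need to know how channel C failed, only that its word differs, which is why no exhaustive list of failure modes is needed.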

State Restoration
• Technique
  • Resynchronize and reset the state of the channel being recovered using the fault tolerant clock, inter-channel data exchange from known ‘good’ channels and voting network
  • Incremental recovery using tagged memory and data exchange network
• Problems being addressed
  • Random hardware fault
  • Recovering temporarily failed or repaired hardware
• Pros
  • Recovery process is a straightforward application of the existing cross channel data exchange and voting hardware
• Cons
  • System is off line during recovery process or a portion of the recovery process
  • Incremental recovery may not converge
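A rough sketch of the two recovery styles above, assuming channel memory can be modelled as a list of words and that "tagged memory" means a set of pages marked as changed since the last transfer. All names (`vote`, `restore_state`, `incremental_restore`) and the memory model are hypothetical.

```python
from collections import Counter

def vote(words):
    """Majority vote over copies of one word read from the good channels."""
    value, votes = Counter(words).most_common(1)[0]
    if votes <= len(words) // 2:
        raise RuntimeError("good channels disagree; cannot restore this word")
    return value

def restore_state(good_channels, page_count):
    """Full restoration: rebuild every page from voted copies."""
    return [vote([memory[page] for memory in good_channels])
            for page in range(page_count)]

def incremental_restore(good_channels, recovered, dirty_pages):
    """Incremental restoration: transfer only pages tagged as changed."""
    for page in sorted(dirty_pages):
        recovered[page] = vote([memory[page] for memory in good_channels])
    dirty_pages.clear()

# Three good channels holding identical state.
good = [[11, 22, 33], [11, 22, 33], [11, 22, 33]]
recovered = restore_state(good, page_count=3)

# The application keeps running and changes page 1 on the good channels.
for memory in good:
    memory[1] = 99
dirty = {1}
incremental_restore(good, recovered, dirty)
```

One plausible reading of the convergence concern: if the running application tags pages faster than they can be transferred, the dirty set never empties.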

Design Diversity
• Technique
  • N-version programming with confidence voter, layered on Fault Tolerant Processor with attached processors
• Problems being addressed
  • Common mode software problems without losing Byzantine resilience to random hardware faults
  • Coincident software faults across N versions
• Pros
  • Unified approach addressing random hardware faults and common mode software faults
  • Isolation between hardware and software faults
• Cons
  • Cannot use exact consensus approach to voting software versions, need to set thresholds to manage false alarms vs. missed detections
  • Not enough testing done to prove confidence voter solved coincident software faults
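The slides do not describe how the confidence voter works internally. The sketch below is a generic threshold (inexact) voter, shown only to illustrate why diverse versions need a tolerance and how that tolerance trades false alarms against missed detections. The function `threshold_vote`, the version labels, and the numbers are assumptions.

```python
from statistics import median_low

def threshold_vote(version_outputs, tolerance):
    """Inexact vote: versions within `tolerance` of the middle output agree.

    Returns (voted_value, flagged_versions).
    """
    # median_low always returns one of the actual outputs, so at least
    # one version agrees with the reference value.
    mid = median_low(version_outputs.values())
    agreeing = {v: x for v, x in version_outputs.items()
                if abs(x - mid) <= tolerance}
    flagged = sorted(set(version_outputs) - set(agreeing))
    voted = sum(agreeing.values()) / len(agreeing)
    return voted, flagged

# Four diverse versions of the same computation; v4 is clearly wrong.
outputs = {"v1": 10.02, "v2": 9.98, "v3": 10.01, "v4": 12.5}

voted, flagged = threshold_vote(outputs, tolerance=0.1)
# A tighter threshold also flags v2, a healthy version (false alarm).
_, flagged_tight = threshold_vote(outputs, tolerance=0.02)
```

Loosening the tolerance has the opposite effect: a genuinely wrong version that lands near the others goes undetected.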

HIERARCHICAL APPROACH TO FAULT MANAGEMENT
• Near Perfect Fault Detection, Identification and Reconfiguration of Random, Arbitrarily Malicious Faults to a Fault Containment Region
• Continuous Hardware based Fault Masking and Detection
• Near Real-Time (msec.) Software based Fault Isolation and Reconfiguration
• Resolution of Transient, Intermittent, and Hard faults by Heuristics
• Software based Self-Tests (Built-In-Test) Run in Background on a time-available basis:
  • Uncover Latent Faults (Especially with Voters, etc.)
  • Localize Faults to Module or Chip Level
• On-line Repair & Diagnostic Capability
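The heuristics used to separate transient, intermittent, and hard faults are not given in the slides. As an illustration only, one simple approach counts voter disagreements for a channel over a sliding window of frames; the window size and thresholds below are invented for the example.

```python
def classify_fault(error_history, window=10, intermittent_min=2, hard_min=8):
    """Classify a channel from its recent per-frame error flags.

    error_history is a sequence of booleans, oldest first, where True
    means the voter flagged the channel in that frame.
    """
    recent = list(error_history)[-window:]
    errors = sum(recent)  # True counts as 1
    if errors == 0:
        return "healthy"
    if errors >= hard_min:
        return "hard"
    if errors >= intermittent_min:
        return "intermittent"
    return "transient"

one_glitch = [False] * 9 + [True]
recurring = [False, True, False, False, True, False, False, True, False, False]
stuck = [True] * 10

labels = [classify_fault(h) for h in (one_glitch, recurring, stuck)]
```

A real system would likely tie each label to an action, for example retrying after a transient but reconfiguring around a hard fault.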

SEAWOLF SHIP CONTROL PROCESSING UNIT (SCPU)
• Quadruply redundant fault tolerant computer
• Fault isolation to the card level
• Heuristics to resolve transient, intermittent, & hard faults
• Channel recovery with SCPU off line for < 150 msec.
• Continuous self tests for latent faults
• Exception handlers, watchdog timer, & overrun flag for common mode software faults

Flight Critical Computer for the X-38 / Crew Return Vehicle
• Fault Tolerant Parallel Processor
• Eight Processors: one quad group & four simplex groups
• COTS hardware and software
• Four Fault Containment Regions
  • Expandable to Five
• Network Element provides:
  • Hardware synchronization
  • Source congruent data exchange & voting
  • Message passing between parallel processors
  • Error detection
• Exception handlers, watchdog timer, overrun flag and memory management for software faults

Fault Tolerant Processor with Attached Processors (FTP/AP)
• Quadruply redundant fault tolerant processor with 4 attached processors
• Four software versions of critical function on AP
• Fault isolation to hardware FCR or software version
• Confidence voter instead of majority voter used to resolve software version discrepancies and the issue of coincident errors
• Hardened kernel approach used for operating system software
• Recovery of software versions is accomplished by continuous execution with output masked
  • Output is compared to voted output
  • If output agrees for several iterations, version is restored
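The version recovery rule in the last bullet can be sketched as a small state machine. The class name `VersionMonitor`, the count of three consecutive agreements, and the tolerance parameter are assumptions; the slides say only "several iterations".

```python
class VersionMonitor:
    """Tracks one masked software version until it earns restoration."""

    def __init__(self, required_agreements=3):
        self.required = required_agreements  # assumed value for "several"
        self.masked = False
        self.streak = 0

    def mask(self):
        """Exclude the version's output from the vote; it keeps executing."""
        self.masked = True
        self.streak = 0

    def observe(self, version_output, voted_output, tolerance=0.0):
        """Compare the masked version's output with the voted output."""
        if not self.masked:
            return
        if abs(version_output - voted_output) <= tolerance:
            self.streak += 1
            if self.streak >= self.required:
                self.masked = False  # version is restored to the vote
                self.streak = 0
        else:
            self.streak = 0  # any disagreement restarts the count

monitor = VersionMonitor()
monitor.mask()
# (version output, voted output) pairs for successive iterations
for mine, voted in [(5.0, 5.0), (5.0, 5.0), (9.0, 6.0), (6.0, 6.0)]:
    monitor.observe(mine, voted)
still_masked_after_four = monitor.masked
for mine, voted in [(7.0, 7.0), (8.0, 8.0)]:
    monitor.observe(mine, voted)
```

Resetting the streak on any disagreement is a conservative design choice for this sketch; the source does not say whether agreements must be consecutive.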

Backup

Fault Containment Region (FCR)
• An FCR is a collection of components that operates correctly regardless of any arbitrary logical or electrical fault outside the region.
• An arbitrary logical or electrical fault in an FCR cannot cause the hardware outside the region to misbehave or fail in any manner.
• Faults cannot propagate across containment regions but their effects (errors) can.

Error Containment
• Voting planes are used to mask errors at different stages in a fault tolerant system.
  • Input voting prevents a failed sensor value from propagating to application programs.
  • Internal computer voting prevents erroneous data from a failed Fault Containment Region from propagating to other FCRs.
  • Output voting and the monitor/interlock mechanism prevent outputs of failed FCRs from propagating outside the computational core.
  • Actuator voting masks errors in the transmission mechanism connecting the computer to the actuators.

Requirements for Exact Consensus & Byzantine Resilience
• Requirements for Exact Consensus:
  • Identical initial states
  • Identical inputs
  • Identical operations
  • No faults
  • Bounded skew
• Theoretically correct implementation of f-Byzantine Resilience requires:
  • Bit-wise comparison of results emanating from redundant sites of equivalent state complexity
  • 3f+1 fault containment regions (FCRs)
  • 2f+1 inter-FCR connectivity
  • f+1 round inter-FCR protocol
  • FCR synchronism
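The three counting requirements are simple functions of f, the number of simultaneous Byzantine faults to be tolerated. A small helper makes the arithmetic concrete; the function and key names are illustrative.

```python
def byzantine_requirements(f):
    """Minimum resources for f-Byzantine resilience, per the list above."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "fcrs": 3 * f + 1,          # fault containment regions
        "connectivity": 2 * f + 1,  # disjoint inter-FCR paths
        "rounds": f + 1,            # rounds of inter-FCR exchange
    }

single_fault = byzantine_requirements(1)
double_fault = byzantine_requirements(2)
```

For f = 1 this gives 4 FCRs, 3-way connectivity, and 2 rounds of exchange; for f = 2 it gives 7 FCRs, 5-way connectivity, and 3 rounds.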

A Correct Solution: 4 FCRs
• Four participants in input distribution algorithm

A Correct Solution: 2-Rounds of Exchange
• 2-Round input distribution algorithm
• Vote exchanged values
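A sketch of a two-round input distribution for f = 1, modelled as one source FCR and three receiving FCRs (four participants). The exact message pattern used in the Draper hardware is not given in the slides, so this follows the textbook single-fault agreement scheme; `distribute`, `majority`, the default value, and the FCR labels are assumptions.

```python
from collections import Counter

def majority(values, default=0):
    """Strict majority of the values, or an agreed default if none exists."""
    value, votes = Counter(values).most_common(1)[0]
    return value if votes > len(values) // 2 else default

def distribute(round1, relay=None):
    """Two-round exchange followed by a vote at each receiver.

    round1 maps each receiver to the value the source sent it (round 1).
    relay(sender, receiver, value) returns what `sender` passes on in
    round 2; the default models fault-free receivers that relay honestly.
    """
    relay = relay or (lambda sender, receiver, value: value)
    receivers = sorted(round1)
    results = {}
    for r in receivers:
        seen = [round1[r]]  # value received directly from the source
        # values relayed by the other receivers in round 2
        seen += [relay(s, r, round1[s]) for s in receivers if s != r]
        results[r] = majority(seen)  # vote exchanged values
    return results

# Case 1: faulty source sends inconsistent values; receivers still agree.
faulty_source = distribute({"B": 1, "C": 1, "D": 2})

# Case 2: good source sends 7, but receiver D relays garbage.
faulty_relay = distribute(
    {"B": 7, "C": 7, "D": 7},
    relay=lambda s, r, v: 99 if s == "D" else v,
)
```

In both cases the fault-free receivers end up with the same value, and when the source is fault-free that value is the one it sent.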
