Analysis of the SPIDER Fault-Tolerance Protocols

Analysis of the SPIDER Fault-Tolerance Protocols • Paul S. Miner, p.s.miner@larc.nasa.gov • 5th NASA Langley Formal Methods Workshop, Williamsburg, VA, June 14, 2000

What is SPIDER? • A general purpose fault-tolerant architecture • Scalable Processor-Independent Design for Electromagnetic Resilience • Intended to serve as a platform to explore recovery strategies for HIRF/EMI induced faults • Developed as part of an FAA funded case-study to exercise RTCA DO-254: Design Assurance Guidance for Airborne Electronic Hardware Lfm2000

RTCA DO-254/EUROCAE ED-76 • Developed by RTCA Special Committee 180 and EUROCAE Working Group 46 • Approved by RTCA Program Management Committee in April 2000 • Approved by EUROCAE(?) • FAA Advisory Circular (?) • Earliest would be sometime this fall

Formal Methods in DO-254 • Formal Methods is one of the advanced analysis techniques suggested when developing hardware to support safety-critical (Level A or B) aircraft functions • DO-254 section on Formal Methods based upon material from NASA Formal Methods Guidebook, Volume II, (NASA-GB-001-97)

DO-254 Case-Study Participants • NASA LaRC (Design Team) • Paul Miner, Project Lead • Mahyar Malekpour, Design Engineer • Wilfredo Torres-Pomales, Design Engineer • Victor Carreño, Process Assurance • FAA (Sponsor and Certification Liaison) • Leanna Rierson, Pete Saraceni, Dennis Wallace, Connie Beane, and Will Struck

Strategy for DO-254 Case-Study • Fault-tolerance protocols and reliability models use the same fault classifications • Reliability analysis using SURE (Butler) • Calculates P(enough good hardware) • Formal proof of fault-tolerance protocols using PVS (SRI) • Shows: enough good hardware => correct operation

SPIDER Design Concept • Inspired by several earlier designs • Main concept inspired by Palumbo’s Fault-tolerant processing system (U.S. Patent 5,533,188) • Developed as part of Fly-By-Light/Power-By-Wire project • Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; . . .

SPIDER Architecture • N simplex general purpose nodes logically connected via a Reliable Optical BUS (ROBUS) • A ROBUS is an ultra-reliable unit providing basic fault-tolerant services • A ROBUS is implemented as a special purpose fault-tolerant device

SPIDER Architecture • [Figure: processing nodes 0 through 7 connected via the ROBUS]

Logical View of ROBUS • ROBUS operates as a time-division multiplexed access broadcast bus • ROBUS strictly enforces write access • no babbling idiots • Processing nodes may be grouped to provide differing degrees of fault-tolerance • some voting available within ROBUS

Logical View of ROBUS (Sample Configuration) • [Figure: processing nodes 0 through 7 attached to the ROBUS]

ROBUS Characteristics • Bus access schedule statically determined • similar to SAFEbus, TTA • Some fault-tolerance functions provided by processing nodes • ROBUS will not have general purpose processing capabilities • Processing Elements need not be uniform • support for dissimilar architectures

ROBUS Requirements • All fault-free nodes observe the exact same sequence of messages • ROBUS provides a reliable time source (RTS) • The nodes are synchronized relative to this RTS • ROBUS provides correct and consistent system diagnostic information to all fault-free nodes • For a 10 hour mission, P(ROBUS Failure) < 10^-10

Interactive Consistency (Byzantine Agreement) • Agreement: For any message, all non-faulty receiving nodes will agree on the value of the message • Validity: If the originator of the message is non-faulty, good receivers will receive the message sent

Clock Synchronization • Precision: There is a small positive constant dmax such that for any two clocks that are good at t, |C1(t) - C2(t)| ≤ dmax • Accuracy: All good clocks maintain an accurate measure of the passage of time (within a linear envelope of real time)

Diagnosis • Correctness: Every node diagnosed as faulty by a good node is faulty • A good node can never conclude that another good node is faulty • Completeness: Every faulty node is (eventually) diagnosed as being faulty • This is not always possible (pathological case involves asymmetric fault)

Physical Segregation • ROBUS decomposed into physically isolated Fault Containment Regions (FCR) • Two main design elements: Bus Interface Unit (BIU) and Redundancy Management Unit (RMU) • Processing elements may form separate FCRs • FCRs fail independently • This is necessary to achieve reliability goals

ROBUS Topology • [Figure: ROBUS(N,M), showing processing elements PE 1 through PE N, bus interface units BIU 1 through BIU N, and redundancy management units RMU 1 through RMU M]

Hybrid Fault Assumptions • The failure status of an FCR is subdivided into four mutually exclusive cases • Good (or fault-free) • Benign Faulty (known bad by all good) • Symmetric Faulty (same to all good) • Asymmetric Faulty (Byzantine, malicious) • This is a global classification; individual FCRs do not know the failure status of other FCRs

Fault Classification • Partition the RMUs into disjoint subsets based upon fault classification • GR, BR, SR, and AR for good, benign, symmetric, and asymmetric RMUs respectively • Similarly partition the BIUs • GB, BB, SB, and AB

Tolerating Asymmetric Faults • Requires 3f + 1 participants in protocol to withstand f simultaneous asymmetric faults • Requires 2f + 1 disjoint communication paths between any two participants • Requires f + 1 levels of communication • ROBUS Topology satisfies these conditions for N ≥ 3, M ≥ 3, f = 1 • For target reliability, we must tolerate at least 1 asymmetric fault
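
The three resource bounds on this slide are simple arithmetic in f. A minimal Python sketch that tabulates them; the function name and dictionary keys are hypothetical, chosen only for illustration:

```python
def asymmetric_fault_requirements(f: int) -> dict:
    """Minimum resources needed to withstand f simultaneous asymmetric faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "participants": 3 * f + 1,          # 3f + 1 participants in the protocol
        "disjoint_paths": 2 * f + 1,        # 2f + 1 disjoint communication paths
        "communication_levels": f + 1,      # f + 1 levels of communication
    }


# The single-fault case (f = 1) targeted by the design.
print(asymmetric_fault_requirements(1))
```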

Interactive Consistency • SPIDER IC protocol is simple adaptation of IC algorithm for Draper FTP Architecture • Existing PVS proof due to Lincoln and Rushby, COMPASS’94, pages 107-120 • Protocol generalizes one suggested in Daniel Davies and John Wakerly, Synchronization and Matching in Redundant Systems, IEEE Trans. on Computers, Vol. C-27, No. 6, June 1978

SPIDER IC Protocol • Algorithm OMS (ignoring hybrid fault model): 1. Processing element j sends value v to BIU j 2. BIU j broadcasts v to all RMUs 3. All RMUs broadcast value received from BIU j to all BIUs 4. Each BIU votes on the values received from the RMUs to determine value from j 5. Each BIU forwards the voted value to its PE
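
The five steps of OMS can be sketched as a small Python simulation. This is an illustration under stated assumptions, not the verified protocol: the source BIU is taken to be good, the vote is modeled as a strict majority, and the names `majority`, `oms`, and `one_asymmetric_rmu` are hypothetical.

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value, or None if no value has one."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def oms(v, n_biu, n_rmu, faulty_rmu=None):
    """Return the value each BIU forwards to its PE after one exchange.

    faulty_rmu is an optional function (rmu, biu, value) -> value that
    models how a faulty RMU may alter what it relays to a given BIU.
    """
    # Steps 1-2: PE j gives v to BIU j, which broadcasts v to all RMUs.
    rmu_received = [v] * n_rmu
    forwarded = []
    for biu in range(n_biu):
        # Step 3: every RMU relays what it received to this BIU.
        relayed = []
        for rmu in range(n_rmu):
            value = rmu_received[rmu]
            if faulty_rmu is not None:
                value = faulty_rmu(rmu, biu, value)
            relayed.append(value)
        # Steps 4-5: the BIU votes and forwards the result to its PE.
        forwarded.append(majority(relayed))
    return forwarded


def one_asymmetric_rmu(rmu, biu, value):
    """RMU 0 tells every BIU something different; the others are good."""
    return -(biu + 1) if rmu == 0 else value


print(oms(42, n_biu=3, n_rmu=3, faulty_rmu=one_asymmetric_rmu))
```

With three RMUs and one of them asymmetric, each BIU still sees two matching copies of the original value, so the vote masks the fault.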

Adapting for hybrid faults • Simple modification to steps 3 and 4 to enable special handling of manifestly bad messages • from benign faulty or asymmetrically faulty sources • OMHS(p,v,q) denotes the value received by q, when p broadcast value v using hybrid oral messages protocol on SPIDER • Verified in PVS, using simple modifications to Lincoln and Rushby’s proof of the Draper FTP Interactive Consistency Protocol

Interactive Consistency Results • Agreement: For BIU g, if (|AR| = 0) or (g ∉ AB and |GR| > |SR| + |AR|), then for p, q ∈ GB: OMHS(g,v,p) = OMHS(g,v,q) • Validity: If |GR| > |SR| + |AR|, then for p ∈ GB: • If g ∈ GB, then OMHS(g,v,p) = v • If g ∈ BB, then OMHS(g,v,p) = Error • If g ∈ SB, then OMHS(g,v,p) = sent(g,v)
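
The validity cases above can be illustrated with a hybrid vote in Python. The slides do not give the exact modification to steps 3 and 4, so everything here is an assumption: a good RMU is taken to relay an explicit report when the source was manifestly bad (in the style of Lincoln and Rushby's hybrid algorithm), the vote is a strict majority over messages that are not manifestly bad, and the names `ERROR`, `SOURCE_ERROR`, `relay`, `hybrid_majority`, and `biu_result` are hypothetical.

```python
from collections import Counter

ERROR = "Error"               # a manifestly bad (missing or malformed) message
SOURCE_ERROR = "SourceError"  # assumed: a valid report that the source was bad


def relay(received):
    """Step 3 (assumed): a good RMU reports a manifestly bad source explicitly."""
    return SOURCE_ERROR if received == ERROR else received


def hybrid_majority(values):
    """Strict majority among the values that are not manifestly bad."""
    usable = [x for x in values if x != ERROR]
    if not usable:
        return ERROR
    value, count = Counter(usable).most_common(1)[0]
    return value if 2 * count > len(usable) else ERROR


def biu_result(rmu_messages):
    """Step 4 (assumed): vote, then map a reported source error back to Error."""
    voted = hybrid_majority(rmu_messages)
    return ERROR if voted == SOURCE_ERROR else voted


# Two good RMUs, one symmetric-faulty RMU sending 9, one benign-faulty RMU.
print(biu_result([relay(7), relay(7), 9, ERROR]))          # good source sent 7
print(biu_result([relay(ERROR), relay(ERROR), 9, ERROR]))  # benign-faulty source
```

In both calls |GR| = 2 > |SR| + |AR| = 1, so a good BIU recovers the value from a good source and reports Error for a benign-faulty source, matching the first two validity cases.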

Alternative Verification Options • For a fixed number of participants, it is easy to demonstrate correctness of Interactive Consistency protocol using symbolic simulation • Amount of effort needed to verify using theorem prover is negligible • PVS proof is mostly symbolic evaluation • there is a small amount of deductive reasoning to evaluate the abstract specification of hybrid majority

Clock Synchronization Goal • To match the degree of fault-tolerance of the IC protocol, the synchronization protocol should ensure synchronization under the following fault assumptions: • |GR| > |SR| + |AR| • Not (|AR| > 0 & |AB| > 0)

Clock Synchronization • Need added assumption: |GB| > |SB| + |AB| • With this assumption, a modified form of the Davies and Wakerly protocol (1978, IEEE ToC) ensures synchronization of the RMUs • Modified protocol is similar to the Srikanth and Toueg protocol (1987, JACM)
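
The fault assumptions for synchronization (the two carried over from the IC protocol plus the added BIU assumption) can be written as one predicate over the sizes of the fault classes. A minimal Python sketch; `FaultCounts` and `sync_assumptions_hold` are hypothetical names, and the counts are inputs to an analysis, since no FCR knows the global fault status at run time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultCounts:
    """Sizes of the four fault classes for one kind of unit (RMU or BIU)."""
    good: int
    benign: int
    symmetric: int
    asymmetric: int


def sync_assumptions_hold(rmu: FaultCounts, biu: FaultCounts) -> bool:
    """Check the fault assumptions stated for the synchronization protocol."""
    rmu_majority = rmu.good > rmu.symmetric + rmu.asymmetric   # |GR| > |SR| + |AR|
    biu_majority = biu.good > biu.symmetric + biu.asymmetric   # |GB| > |SB| + |AB|
    # Not (|AR| > 0 & |AB| > 0): asymmetric faults on at most one side.
    one_sided = not (rmu.asymmetric > 0 and biu.asymmetric > 0)
    return rmu_majority and biu_majority and one_sided


# One asymmetric RMU among three, all BIUs good.
print(sync_assumptions_hold(FaultCounts(2, 0, 0, 1), FaultCounts(3, 0, 0, 0)))
```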

Clock Synchronization Basics • Clocks (counters) driven by oscillator with a bounded drift (ρ) from its stated frequency • Periodically (every P ticks) clocks will adjust the count based on exchange with other clocks • The periods are indexed by round number (k) • Protocol seeks to ensure that at beginning of every round, all good clocks are within dmin

Synchronization Basics • If all good clocks start round within dmin of each other, by the end of the round they can be at most dmin + 2ρP apart • If good clocks then make a small adjustment so that they start the next round within dmin, then both precision and accuracy are satisfied • Several machine checked proofs exist
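
The end-of-round bound is a one-line calculation: two good clocks that start within dmin can each drift by at most the drift bound times P, in opposite directions. A minimal Python sketch, writing the drift bound as `rho` (an assumed name) and using made-up numbers purely for illustration:

```python
def worst_case_skew(d_min: float, rho: float, period_ticks: float) -> float:
    """Upper bound on the separation of two good clocks at the end of a round.

    Each clock drifts by at most rho * period_ticks, so the gap can grow
    by at most 2 * rho * period_ticks over the round.
    """
    return d_min + 2 * rho * period_ticks


# Hypothetical values: start within 4 ticks, drift bound 1e-5, 100000-tick round.
print(worst_case_skew(d_min=4, rho=1e-5, period_ticks=100_000))
```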

Network Imprecision • There is imprecision in communication • If a node transmits a message at time t, observing nodes will receive it within time interval [t + d, t + d + e] • d is the minimum communication delay • e is the inherent imprecision (e > (1 + ρ) ticks) • e is a lower bound on synchronization precision • For many Byzantine Resilient protocols, dmin 2e
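
The reception interval can be made concrete with a short Python sketch. The function names are hypothetical, and times are in whatever unit the caller uses for t, d, and e:

```python
def receive_window(t, d, e):
    """Interval [t + d, t + d + e] in which a message sent at time t is received."""
    if d < 0 or e < 0:
        raise ValueError("d and e must be non-negative")
    return (t + d, t + d + e)


def arrival_is_plausible(sent_at, received_at, d, e):
    """True if an observed arrival time is consistent with the network model."""
    earliest, latest = receive_window(sent_at, d, e)
    return earliest <= received_at <= latest


# Hypothetical numbers: sent at tick 10, minimum delay 3, imprecision 2.
print(receive_window(10, 3, 2))
```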

Simple Protocol (for round k) • RMU: (Perform the following concurrently) • If Ready?(k) then broadcast (round k) to all BIUs • If Accept?(k) then reset counter for round k • BIU: • If Accept?(k) then broadcast (round k) to all RMUs

Informal Description • Each good RMU broadcasts when its clock reaches a specific value • Each good BIU waits until it knows it has received a message from at least one good RMU. It then relays this information to all RMUs • Each RMU waits until it knows it has a message from at least one good BIU before resetting

Ready and Accept • Ready?(k) is an event triggered by a pre-determined local counter value, kP - a • a is a constant offset for communication delays • P is the nominal duration of a round • k is the round index • Degree of fault-tolerance is determined by Accept?(k)
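
The Ready?(k) trigger is just a comparison of the local counter against kP - a. A minimal Python sketch; the function names and the sample values of P and a are hypothetical:

```python
def ready_count(k: int, period: int, offset: int) -> int:
    """Local counter value kP - a at which Ready?(k) is triggered."""
    return k * period - offset


def is_ready(local_counter: int, k: int, period: int, offset: int) -> bool:
    """True once the local counter has reached the trigger value for round k."""
    return local_counter >= ready_count(k, period, offset)


# Hypothetical parameters: rounds of 1000 ticks, offset of 20 ticks.
print([ready_count(k, period=1000, offset=20) for k in range(1, 4)])
```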

Accept?(k) • Wait until there is a hybrid majority of observed (round k) events to trigger Accept?(k) • A selection function under the hybrid fault model ignores manifestly bad values • This protocol ensures that all good RMUs accept (round k) within a short time interval, provided the maximum fault assumptions are not violated • Can also synchronize BIUs by echoing RMU accept
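
One way to picture the Accept?(k) trigger is as the moment a strict majority of the eligible sources has been heard from. The Python sketch below is an illustration under assumptions: "hybrid majority" is read as a strict majority of the sources not regarded as manifestly bad, the set of eligible sources is supplied by the caller, and `accept_time` is a hypothetical name.

```python
def accept_time(arrivals, eligible):
    """Time at which Accept?(k) would fire, or None if it never does.

    arrivals: dict mapping source id -> arrival time of its (round k) message
    eligible: set of source ids not currently regarded as manifestly bad
    """
    times = sorted(t for source, t in arrivals.items() if source in eligible)
    needed = len(eligible) // 2 + 1   # strict majority of eligible sources
    return times[needed - 1] if len(times) >= needed else None


# Sources 0 and 1 are good; source 2 is faulty and sends early;
# source 3 is manifestly bad and therefore not eligible.
print(accept_time({0: 100.0, 1: 101.5, 2: 90.0}, eligible={0, 1, 2}))
```

Because a strict majority is required, the early message from the single faulty source cannot trigger acceptance on its own; the trigger waits for a message from a good source.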

Verification in PVS • Built generic clock synchronization theory in PVS • PVS theory from Ulm introduced too much potential error in formulation of clock drift assumptions • New theory allows proofs of clock skew as tight as best theoretical results • Support for some protocols absent • All pieces in place to complete SPIDER verification • Estimate 1-2 weeks effort to tie up loose ends

Alternative Verification Options • Protocol is (almost) finite state • Should be possible to use a model checker to confirm that all good nodes start each round within dmin of each other • Plausible tools for this are HyTech, UPPAAL • Still need theorem prover to get Precision and Accuracy results • Theorem prover can verify for arbitrary number of participants

Diagnosis • Plan to adapt MAFT on-line diagnosis algorithms to SPIDER architecture • MAFT algorithms previously verified using PVS • Chris Walter, Patrick Lincoln, and Neeraj Suri. Formally verified on-line diagnosis, IEEE Trans. On Software Engineering, Nov. 1997 • For diagnosis of Processing Elements, there exist verified Group Membership protocols • Katz, Lincoln, and Rushby, Low overhead time-triggered group membership, In 11th Workshop on Distributed Algorithms (WDAG’97), pages 155-169, LNCS 1320

Summary • New conceptual design for a family of fault-tolerant systems • Design being developed using DO-254 guidance • Critical fault-tolerance verified using PVS • Able to reuse or adapt many existing proofs of fault-tolerance protocols • Unable to reuse existing Clock synchronization proofs

Future Plans • Report documenting SPIDER Conceptual design and proofs of fault-tolerance by end of summer • First laboratory prototype implementation of SPIDER by December • Second generation design in 2001
