
Analysis of the SPIDER Fault-Tolerance Protocols

Paul S. Miner (p.s.miner@larc.nasa.gov), 5th NASA Langley Formal Methods Workshop, Williamsburg, VA, June 14, 2000.


Presentation Transcript


  1. Analysis of the SPIDER Fault-Tolerance Protocols Paul S. Miner p.s.miner@larc.nasa.gov 5th NASA Langley Formal Methods Workshop Williamsburg, VA June 14, 2000

  2. What is SPIDER? • A general purpose fault-tolerant architecture • Scalable Processor-Independent Design for Electromagnetic Resilience • Intended to serve as a platform to explore recovery strategies for HIRF/EMI induced faults • Developed as part of an FAA funded case-study to exercise RTCA DO-254: Design Assurance Guidance for Airborne Electronic Hardware Lfm2000

  3. RTCA DO-254/EUROCAE ED-76 • Developed by RTCA Special Committee 180 and EUROCAE Working Group 46 • Approved by RTCA Program Management Committee in April 2000 • Approved by EUROCAE(?) • FAA Advisory Circular (?) • Earliest would be sometime this fall

  4. Formal Methods in DO-254 • Formal Methods is one of the advanced analysis techniques suggested when developing hardware to support safety-critical (Level A or B) aircraft functions • DO-254 section on Formal Methods based upon material from NASA Formal Methods Guidebook, Volume II, (NASA-GB-001-97)

  5. DO-254 Case-Study Participants • NASA LaRC (Design Team) • Paul Miner, Project Lead • Mahyar Malekpour, Design Engineer • Wilfredo Torres-Pomales, Design Engineer • Victor Carreño, Process Assurance • FAA (Sponsor and Certification Liaison) • Leanna Rierson, Pete Saraceni, Dennis Wallace, Connie Beane, and Will Struck

  6. Strategy for DO-254 Case-Study • Fault-tolerance protocols and reliability models use the same fault classifications • Reliability analysis using SURE (Butler) • Calculates P(enough good hardware) • Formal proof of fault-tolerance protocols using PVS (SRI) • enough good hardware => correct operation

  7. SPIDER Design Concept • Inspired by several earlier designs • Main concept inspired by Palumbo’s Fault-tolerant processing system (U.S. Patent 5,533,188) • Developed as part of Fly-By-Light/Power-By-Wire project • Other ideas from Draper’s FTPP, FTP, and FTMP; Allied-Signal’s MAFT; SRI’s SIFT; Kopetz’s TTA; Honeywell’s SAFEbus; . . .

  8. SPIDER Architecture • N simplex general purpose nodes logically connected via a Reliable Optical BUS (ROBUS) • A ROBUS is an ultra-reliable unit providing basic fault-tolerant services • A ROBUS is implemented as a special purpose fault-tolerant device

  9. SPIDER Architecture [Figure: nodes 0 to 7 and the ROBUS]

  10. Logical View of ROBUS • ROBUS operates as a time-division multiplexed access broadcast bus • ROBUS strictly enforces write access • no babbling idiots • Processing nodes may be grouped to provide differing degrees of fault-tolerance • some voting available within ROBUS

  11. Logical View of ROBUS (Sample Configuration) [Figure: nodes 0 to 7 and the ROBUS]

  12. ROBUS Characteristics • Bus access schedule statically determined • similar to SAFEbus, TTA • Some fault-tolerance functions provided by processing nodes • ROBUS will not have general purpose processing capabilities • Processing Elements need not be uniform • support for dissimilar architectures

  13. ROBUS Requirements • All fault-free nodes observe the exact same sequence of messages • ROBUS provides a reliable time source (RTS) • The nodes are synchronized relative to this RTS • ROBUS provides correct and consistent system diagnostic information to all fault-free nodes • For a 10 hour mission, P(ROBUS Failure) < 10^-10

  14. Interactive Consistency (Byzantine Agreement) • Agreement: For any message, all non-faulty receiving nodes will agree on the value of the message • Validity: If the originator of the message is non-faulty, good receivers will receive the message sent

  15. Clock Synchronization • Precision: There is a small positive constant dmax such that for any two clocks that are good at t, |C1(t) - C2(t)| ≤ dmax • Accuracy: All good clocks maintain an accurate measure of the passage of time (within a linear envelope of real time)
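
The two properties on this slide can be read as simple predicates over clock readings. The sketch below is illustrative only: the function names, the drift bound `rho`, and the constant `slack` in the linear envelope are assumptions introduced here, not quantities given in the slides.

```python
from itertools import combinations


def precision_holds(good_clock_readings, d_max):
    """True if every pair of good clocks differs by at most d_max."""
    return all(abs(c1 - c2) <= d_max
               for c1, c2 in combinations(good_clock_readings, 2))


def accuracy_holds(clock_reading, real_time, rho, slack):
    """True if the clock lies inside a linear envelope of real time.

    rho (drift bound) and slack (constant offset) are assumed parameters.
    """
    lower = (1 - rho) * real_time - slack
    upper = (1 + rho) * real_time + slack
    return lower <= clock_reading <= upper
```

Precision constrains good clocks relative to each other, while accuracy constrains each good clock relative to real time, so the two checks take different inputs.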

  16. Diagnosis • Correctness: Every node diagnosed as faulty by a good node is faulty • A good node can never conclude that another good node is faulty • Completeness: Every faulty node is (eventually) diagnosed as being faulty • This is not always possible (the pathological case involves an asymmetric fault)
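
One way to make the two diagnosis properties concrete is with set inclusion. This is a minimal sketch under assumptions: the `accusations` mapping (each good node to the set of nodes it currently diagnoses as faulty) and the function names are hypothetical, and completeness is read here as "every good node has diagnosed every faulty node", ignoring the "eventually" qualifier.

```python
def diagnosis_correct(accusations, faulty):
    """Correctness: good nodes only ever accuse nodes that really are faulty."""
    return all(accused <= faulty for accused in accusations.values())


def diagnosis_complete(accusations, faulty):
    """Completeness (assumed reading): every good node accuses every faulty node."""
    return all(faulty <= accused for accused in accusations.values())
```

For example, with `faulty = {"n3"}` and `accusations = {"n0": {"n3"}, "n1": set()}`, the diagnosis is correct but not yet complete, because `n1` has not diagnosed `n3`.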

  17. Physical Segregation • ROBUS decomposed into physically isolated Fault Containment Regions (FCR) • Two main design elements • Bus Interface Unit (BIU) • Redundancy Management Unit (RMU) • Processing elements may form separate FCRs • FCRs fail independently • This is necessary to achieve reliability goals

  18. ROBUS Topology [Figure: ROBUS(N,M) with PE 1 to PE N, BIU 1 to BIU N, and RMU 1 to RMU M]

  19. Hybrid Fault Assumptions • The failure status of an FCR is subdivided into four mutually exclusive cases • Good (or fault-free) • Benign Faulty (known bad by all good) • Symmetric Faulty (same to all good) • Asymmetric Faulty (Byzantine, malicious) • This is a global classification; individual FCRs do not know the failure status of other FCRs

  20. Fault Classification • Partition the RMUs into disjoint subsets based upon fault classification • GR, BR, SR, and AR for good, benign, symmetric, and asymmetric RMUs respectively • Similarly partition the BIUs • GB, BB, SB, and AB

  21. Tolerating Asymmetric Faults • Requires 3f + 1 participants in protocol to withstand f simultaneous asymmetric faults • Requires 2f + 1 disjoint communication paths between any two participants • Requires f + 1 levels of communication • ROBUS Topology satisfies these conditions for N ≥ 3, M ≥ 3, f = 1 • For target reliability, we must tolerate at least 1 asymmetric fault
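
The three bounds on this slide are simple functions of f. The helper below just evaluates them; the function name and the dictionary keys are hypothetical labels chosen for illustration.

```python
def asymmetric_fault_requirements(f):
    """Evaluate the slide's three bounds for tolerating f asymmetric faults."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return {
        "participants": 3 * f + 1,        # 3f + 1 participants
        "disjoint_paths": 2 * f + 1,      # 2f + 1 disjoint paths per pair
        "communication_levels": f + 1,    # f + 1 levels of communication
    }
```

For f = 1 this gives 4 participants, 3 disjoint paths, and 2 levels of communication.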

  22. Interactive Consistency • SPIDER IC protocol is a simple adaptation of the IC algorithm for the Draper FTP Architecture • Existing PVS proof due to Lincoln and Rushby, COMPASS’94, pages 107-120 • Protocol generalizes one suggested in Daniel Davies and John Wakerly, Synchronization and Matching in Redundant Systems, IEEE Trans. on Computers, Vol. C-27, No. 6, June 1978

  23. SPIDER IC Protocol Algorithm OMS (ignoring hybrid fault model): 1. Processing element j sends value v to BIU j 2. BIU j broadcasts v to all RMUs 3. All RMUs broadcast value received from BIU j to all BIUs 4. Each BIU votes on the values received from the RMUs to determine value from j 5. Each BIU forwards the voted value to its PE
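
Steps 2 to 5 above can be sketched as a small simulation. This is not the verified protocol, only an illustration under stated assumptions: the originating BIU is fault-free, a faulty RMU is modelled by a hypothetical `rmu_behavior` callback that may send a different value to each BIU, and the vote is a strict majority that yields `None` when no value wins.

```python
from collections import Counter


def majority(values):
    """Return the strict-majority value, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def oms_round(v, n_bius, n_rmus, rmu_behavior=None):
    """Simulate steps 2-5 for value v sent by a fault-free BIU j.

    rmu_behavior (assumed fault model) maps an RMU index to a function
    (received_value, biu_index) -> value that RMU relays to that BIU.
    """
    rmu_behavior = rmu_behavior or {}
    honest = lambda value, biu: value
    # Step 2: BIU j broadcasts v to all RMUs.
    received_by_rmu = [v] * n_rmus
    voted = []
    for biu in range(n_bius):
        # Step 3: every RMU relays what it received to this BIU.
        relayed = [rmu_behavior.get(r, honest)(received_by_rmu[r], biu)
                   for r in range(n_rmus)]
        # Step 4: the BIU votes; step 5 would forward the result to its PE.
        voted.append(majority(relayed))
    return voted
```

With three RMUs, one of which sends a different junk value to each BIU, every BIU still votes for the original value because the two good RMUs outvote the faulty one.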

  24. Adapting for hybrid faults • Simple modification to steps 3 and 4 to enable special handling of manifestly bad messages • from benign faulty or asymmetrically faulty sources • OMHS(p,v,q) denotes the value received by q, when p broadcast value v using hybrid oral messages protocol on SPIDER • Verified in PVS, using simple modifications to Lincoln and Rushby’s proof of the Draper FTP Interactive Consistency Protocol
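
The special handling of manifestly bad messages can be illustrated with a hybrid majority vote. The slides do not give the PVS definition, so the sketch below assumes one common formulation: discard manifestly bad values (represented by a hypothetical `ERROR` sentinel), take a strict majority of what remains, and return `ERROR` if nothing remains or no value wins.

```python
from collections import Counter

ERROR = "Error"  # assumed sentinel for a manifestly bad message


def hybrid_majority(values):
    """Strict majority over the values that are not manifestly bad."""
    usable = [x for x in values if x != ERROR]
    if not usable:
        return ERROR
    value, count = Counter(usable).most_common(1)[0]
    return value if count > len(usable) / 2 else ERROR
```

For instance, a BIU that hears `"v"` from two good RMUs, `ERROR` from a benign faulty RMU, and `"x"` from a symmetric faulty RMU still selects `"v"`, since the benign value is excluded before voting.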

  25. Interactive Consistency Results • Agreement: For BIU g, if (|AR| = 0) or (g ∉ AB and |GR| > |SR| + |AR|), then for p, q ∈ GB: OMHS(g,v,p) = OMHS(g,v,q) • Validity: If |GR| > |SR| + |AR|, then for p ∈ GB: • If g ∈ GB, then OMHS(g,v,p) = v • If g ∈ BB, then OMHS(g,v,p) = Error • If g ∈ SB, then OMHS(g,v,p) = sent(g,v)

  26. Alternative Verification Options • For a fixed number of participants, it is easy to demonstrate correctness of Interactive Consistency protocol using symbolic simulation • Amount of effort needed to verify using theorem prover is negligible • PVS proof is mostly symbolic evaluation • there is a small amount of deductive reasoning to evaluate the abstract specification of hybrid majority

  27. Clock Synchronization Goal • To match the degree of Fault-Tolerance of the IC protocol, the synchronization protocol should ensure synchronization using the following fault assumptions: • |GR| > |SR| + |AR| • Not (|AR| > 0 & |AB| > 0)

  28. Clock Synchronization • Need added assumption: |GB| > |SB| + |AB| • With this assumption, a modified form of the Davies and Wakerly protocol (1978, IEEE ToC) ensures synchronization of the RMUs • Modified protocol is similar to the Srikanth and Toueg protocol (1987, JACM)
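
The fault assumptions for synchronization (the two from the goal slide plus the added one here) can be collected into a single check. The `FaultCounts` record and the function name are hypothetical; the three conditions themselves are taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultCounts:
    """Number of units in each fault class (for the RMUs or for the BIUs)."""
    good: int
    benign: int
    symmetric: int
    asymmetric: int


def sync_assumptions_hold(rmus, bius):
    """Check the three fault assumptions stated for clock synchronization."""
    rmu_majority = rmus.good > rmus.symmetric + rmus.asymmetric   # |GR| > |SR| + |AR|
    biu_majority = bius.good > bius.symmetric + bius.asymmetric   # |GB| > |SB| + |AB|
    not_both_asymmetric = not (rmus.asymmetric > 0 and bius.asymmetric > 0)
    return rmu_majority and biu_majority and not_both_asymmetric
```

Benign faulty units do not appear in any of the three conditions, which reflects the special handling of manifestly bad messages under the hybrid fault model.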

  29. Clock Synchronization Basics • Clocks (counters) driven by oscillator with a bounded drift (ρ) from its stated frequency • Periodically (every P ticks) clocks will adjust the count based on exchange with other clocks • The periods are indexed by round number (k) • Protocol seeks to ensure that at beginning of every round, all good clocks are within dmin

  30. Synchronization Basics • If all good clocks start round within dmin of each other, by the end of the round they can be at most dmin + 2ρP apart • If good clocks then make a small adjustment so that they start the next round within dmin, then both precision and accuracy are satisfied • Several machine-checked proofs exist
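
The end-of-round bound is easy to evaluate numerically. The sketch below assumes the drift bound is named `rho` and that all quantities are in clock ticks; the reasoning is that over a round of P ticks each good clock can drift by roughly rho * P, and two clocks may drift in opposite directions. The sample numbers are made up for illustration.

```python
def end_of_round_skew(d_min, rho, period):
    """Worst-case separation of two good clocks at the end of a round.

    d_min:  maximum separation at the start of the round (ticks)
    rho:    assumed name for the bounded oscillator drift
    period: nominal round length P (ticks)
    """
    return d_min + 2 * rho * period


# Illustrative values only: 2 ticks initial skew, drift 1e-6, P = 1,000,000 ticks.
skew = end_of_round_skew(d_min=2, rho=1e-6, period=1_000_000)
```

With these sample values the skew grows from 2 ticks to about 4 ticks over the round, which is the amount the resynchronization step then has to pull back to within dmin.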

  31. Network Imprecision • There is imprecision in communication • If a node transmits a message at time t, observing nodes will receive it within time interval [t + d, t + d + e] • d is the minimum communication delay • e is the inherent imprecision (e > (1 + ρ) ticks) • e is a lower bound on synchronization precision • For many Byzantine Resilient protocols, dmin [?] 2e

  32. Simple Protocol (for round k) • RMU (perform the following concurrently): • If Ready?(k) then broadcast (round k) to all BIUs • If Accept?(k) then reset counter for round k • BIU: • If Accept?(k) then broadcast (round k) to all RMUs

  33. Informal Description • Each good RMU broadcasts when its clock reaches a specific value • Each good BIU waits until it knows it has received a message from at least one good RMU. It then relays this information to all RMUs • Each RMU waits until it knows it has a message from at least one good BIU before resetting

  34. Ready and Accept • Ready?(k) is an event triggered by a pre-determined local counter value, kP - a • a is a constant offset for communication delays • P is the nominal duration of a round • k is the round index • Degree of fault-tolerance is determined by Accept?(k)
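
The Ready?(k) trigger is a plain counter comparison. A minimal sketch, assuming integer tick counts and hypothetical function names:

```python
def ready_count(k, period, offset):
    """Local counter value kP - a at which Ready?(k) fires."""
    return k * period - offset


def ready(counter, k, period, offset):
    """True exactly when the local counter reaches the Ready?(k) value."""
    return counter == ready_count(k, period, offset)
```

For example, with P = 1000 ticks and a = 20 ticks, an RMU would broadcast for round 3 when its counter reads 2980.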

  35. Accept?(k) • Wait until there is a hybrid majority of observed (round k) events to trigger Accept?(k) • A selection function under the hybrid fault model ignores manifestly bad values • This protocol ensures that all good RMUs accept (round k) within a short time interval, provided the maximum fault assumptions are not violated • Can also synchronize BIUs by echoing RMU accept
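
The Accept?(k) trigger can be sketched as a threshold test. The exact selection function is not given in the slides, so this is an assumed reading: sources known to be manifestly bad are excluded, and the event fires once (round k) messages have been observed from a strict majority of the remaining sources. All names are hypothetical.

```python
def accept(round_k_senders, manifestly_bad, all_sources):
    """Assumed Accept?(k): hybrid majority of sources have sent (round k).

    round_k_senders: sources from which a (round k) message was observed
    manifestly_bad:  sources excluded as manifestly faulty
    all_sources:     every source this unit listens to
    """
    eligible = all_sources - manifestly_bad
    seen = round_k_senders & eligible
    return len(seen) > len(eligible) / 2
```

Under the assumption that good sources outnumber symmetric and asymmetric ones, a strict majority of eligible sources must include at least one good source, which matches the informal description of waiting for a message from at least one good unit.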

  36. Verification in PVS • Built generic clock synchronization theory in PVS • PVS theory from Ulm introduced too much potential error in formulation of clock drift assumptions • New theory allows proofs of clock skew as tight as best theoretical results • Support for some protocols absent • All pieces in place to complete SPIDER verification • Estimate 1-2 weeks effort to tie up loose ends

  37. Alternative Verification Options • Protocol is (almost) finite state • Should be possible to use a model checker to confirm that all good nodes start each round within dmin of each other • Plausible tools for this are HyTech, UPPAAL • Still need theorem prover to get Precision and Accuracy results • Theorem prover can verify for arbitrary number of participants

  38. Diagnosis • Plan to adapt MAFT on-line diagnosis algorithms to SPIDER architecture • MAFT algorithms previously verified using PVS • Chris Walter, Patrick Lincoln, and Neeraj Suri. Formally verified on-line diagnosis, IEEE Trans. on Software Engineering, Nov. 1997 • For diagnosis of Processing Elements, there exist verified Group Membership protocols • Katz, Lincoln, and Rushby, Low overhead time-triggered group membership, In 11th Workshop on Distributed Algorithms (WDAG’97), pages 155-169, LNCS 1320

  39. Summary • New conceptual design for a family of fault-tolerant systems • Design being developed using DO-254 guidance • Critical fault-tolerance verified using PVS • Able to reuse or adapt many existing proofs of fault-tolerance protocols • Unable to reuse existing clock synchronization proofs

  40. Future Plans • Report documenting SPIDER Conceptual design and proofs of fault-tolerance by end of summer • First laboratory prototype implementation of SPIDER by December • Second generation design in 2001
