Fault-tolerance in Component-based Systems. Anish Arora, The Ohio State University (on leave at Microsoft Research). March, 2000.
Fault-tolerance - the what and the why
Fault-tolerance properties include:
• safety: hazards do not affect critical operation
• liveness: failures do not affect availability
• security: intruders do not compromise secrecy
• timeliness: loads do not violate timing constraints
• self-tuning: reconfigurations do not nullify functionality
These properties are increasingly relevant:
• high-profile outages: brokerages, portals, telcos, in space
• unreliable commodity devices being integrated
• new media: powerline, wireless, cable, fiber optics, lasers
• new opportunities for intruders, new sorts of faults
1st Message: Fault-tolerance Components
• Our thesis: Fault-tolerant system = Fault-intolerant system composed with Tolerance components
• Separation principle: separate fault-tolerance from functionality
  • fault-tolerance design is not an afterthought
• Structure principle: few tolerance component frameworks should suffice
2nd Message: Scalable Fault-tolerance
• A special case: Fault-tolerant component = Fault-intolerant component composed with Tolerance components
• Principle of scale: making key system components fault-tolerant is often sufficient for, or simplifies, system fault-tolerance
  • exploit end-to-end assumptions, client-component contracts, and component specification
Outline of Talk
• Concepts
  • systems, specifications, components
  • faults
  • intolerance, tolerance
• Fault-tolerance components
  • theory: detectors, correctors
  • practice: Sun’s Netra server, Microsoft’s home automation system
• Scalable fault-tolerance
  • method: graybox design
  • application: failure-detector-based consensus
Systems
• Computations of a system [Alpern, Schneider 85] = safety computations ∩ liveness computations
• Safety is a set of sequences in which no sequence “does anything bad”
• Liveness is a set of sequences that contains, for each prefix, an extension which “does something good”
[Figure: a computation, with labels “state” and “event”]
Specifications
• A set of “desirable” sequences, like systems
• can be decomposed into safety and liveness parts
• Let C be a system, B a specification: C imp B iff computations(C) ⊆ computations(B)
• Note: the definition of imp is readily extended to allow internal state in C and B (⊆ on projections of computations on external state)
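The definition of imp can be tried out on a toy model. The sketch below is only an illustration under strong assumptions: computations are finite tuples of states (real computations may be infinite), each state is a tuple of (variable, value) pairs, and the names `imp` and `project` are hypothetical helpers, not taken from the talk.

```python
# Toy model (assumption): a system or specification is a finite set of
# computations; a computation is a tuple of states; a state is a tuple of
# (variable, value) pairs.

def imp(system, spec):
    """C imp B iff computations(C) is a subset of computations(B)."""
    return system <= spec


def project(computations, external):
    """Drop every variable that is not part of the external state."""
    return {
        tuple(
            tuple((var, val) for var, val in state if var in external)
            for state in computation
        )
        for computation in computations
    }


# C keeps an internal variable "tmp"; B only talks about "x".
C = {((("x", 0), ("tmp", 7)), (("x", 1), ("tmp", 9)))}
B = {
    ((("x", 0),), (("x", 1),)),
    ((("x", 0),), (("x", 2),)),
}

direct = imp(C, B)                         # internal state gets in the way
on_external = imp(project(C, {"x"}), B)    # compare on external state only
```

Here `direct` is false while `on_external` is true, which is the reason the note extends imp to projections on external state.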
Components
• Interface: set of methods
• Characteristics:
  • unit of composition
  • usually have no state, but their instantiations do
  • specification may be visible
  • implementation may/may not be visible
• Examples:
  • ADTs
  • COM+, CORBA, JavaBeans
  • Seuss [Misra 94]
  • RESOLVE [Weide, Ogden 94]
Faults
• Classes:
  • message: loss, corruption, replay, preplay, forgery
  • process: hangs, crash, fail-stops, Byzantine failure
  • sensor: stuck-at, intermittent failure
  • memory: transient corruption
  • channel: eavesdropping, fail-stops
• Computations of a fault-class F are sequences too!
• Let C [] F be the computations of system C in the presence of F
  • not: (C imp B) ⇒ (C [] F) imp B
  • nor: (C [] F) imp (B [] F)
Fault-tolerance
• In the presence of a fault-class, a fault-tolerant system must satisfy a tolerant specification
• Tolerant specifications are potentially weaker than the original specifications
• Types of tolerant specifications:
  • masking = original specification
  • fail-safe = safety part only
  • stabilizing = liveness part ∩ “eventual safety” part
Fault-tolerance Components
• The separation: fault-tolerant system C' = fault-intolerant system C composed with tolerance components
• “Minimal” tolerance components
  • used to achieve tolerance, not to resatisfy the specification
• Reuse criterion
  • in the absence of faults, C' behaves as C does
  • in the presence of faults, if C' recovers, it behaves as C does
Fault-tolerance Components … contd.
The structure principle:
• The reuse criterion implies that a few tolerance component frameworks suffice for different sorts of fault-tolerance
• enables structured design and automatic synthesis of fault-tolerance
• tolerance components can be composed via fault-tolerance DLLs, dynamic binding, and dynamic instantiation
Theory of Tolerance Components [Arora, Kulkarni 97a,b]
Theorem: For fail-safe implementation, detectors are necessary & sufficient in the class of reuse designs
Theorem: For stabilizing implementation, correctors are necessary & sufficient in the class of reuse designs
Theorem: For masking implementation, detectors and correctors are necessary & sufficient in the class of reuse designs
Why Detectors for Fail-safe Tolerance
• Before a method is executed, detect whether the extended prefix would violate safety; this can be detected using only the last state of the prefix
• For each system method, there exists a state predicate s.t. execution of the method in a state where that predicate is true satisfies safety
• detect whether execution of the method in the given state is safe
• Preservation of safety
[Figure: a computation prefix, annotated “assume that safety not violated here” and “check whether safety violated here”]
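The idea of guarding each method with a detection predicate can be sketched as follows. This is a minimal illustration, not the construction from the talk: the `Account` class, the safety property "balance never goes negative", and the helper names `with_detector` and `enough_funds` are all assumptions made for the example.

```python
class Account:
    """Fault-intolerant component: withdraw() blindly updates the state."""

    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        self.balance -= amount


def enough_funds(account, amount):
    """Detection predicate: true only in states where withdraw() keeps
    the (assumed) safety property balance >= 0."""
    return 0 <= amount <= account.balance


def with_detector(method, predicate):
    """Compose a method with a detector that looks only at the current
    (last) state before letting the method run."""

    def guarded(state, *args):
        if not predicate(state, *args):
            # Fail-safe: refuse the step. Safety is preserved,
            # but no progress is promised.
            return False
        method(state, *args)
        return True

    return guarded


safe_withdraw = with_detector(Account.withdraw, enough_funds)
account = Account(10)
first = safe_withdraw(account, 4)    # allowed, balance becomes 6
second = safe_withdraw(account, 10)  # blocked, balance stays 6
```

The guarded method gives up liveness (a blocked request is never served) but not safety, which matches the fail-safe reading of a tolerant specification.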
Detectors
• Specification: (detection state predicate, witness state predicate)
• “Large” detectors in distributed systems are built out of “parallel” or “sequential” composition of “smaller” ones
• Traditional examples: error detection codes, acceptance tests, comparators, snapshot procedures, exception conditions
[Figure: detector composition diagram with nodes labeled d and w, and the labels “safety” and “liveness”]
Why Correctors for Stabilizing Tolerance
• Restore the system to a state from where its safety and liveness are both satisfied
• Ensure that eventually safety and liveness are satisfied
[Figure: two sets of states, labeled “states reached in presence of faults” and “states from where safety and liveness are satisfied”]
Correctors
• Specification: (correction state predicate, witness state predicate)
• “Large” correctors in distributed systems are built out of “parallel” or “sequential” composition of “smaller” ones
• Traditional examples: error correction codes, reset procedures, voters, rollback recovery, constraint satisfaction
[Figure: corrector composition diagram with nodes labeled d, w, and c, and the labels “safety” and “liveness”]
Detectors & Correctors for Masking Tolerance
• Ensure safety in the presence of faults
  • detectors are necessary and sufficient
• Ensure eventual safety and liveness
  • correctors are necessary and sufficient
• Detectors and correctors can be added in stepwise fashion
[Figure: stepwise design. Starting from the intolerant system, adding detectors yields a fail-safe system and adding correctors yields a stabilizing system; adding correctors to the fail-safe system, or detectors to the stabilizing system, yields a masking system]
Self-tolerance of Tolerance Components
• Detectors for fail-safe systems must be (at least) fail-safe
• Correctors for stabilizing systems must be (at least) stabilizing
• Detectors & correctors for masking systems must be masking
SUN’s Stabilizing Internet Server
• Problem: scalable implementation of Web, News, and Email on a workstation cluster, given commodity services, kernels, h/w
• Methods of intolerant system:
  • client invokes name service for IP address of server
  • client invokes service from named server, server responds
• Specification: Always
  • each service request is mapped to an IP address
  • the mapped IP address corresponds to a correct server
  • eventually the request is correctly completed
  • load on each workstation is bounded from above
Desired Tolerance and Its Design
• Fault-classes:
  • server / node failure / repair
  • communication failures
  • timing failures
  • load bursts
  • server upgrades
• Tolerance: stabilizing tolerant implementation
  • but high availability; i.e., fast convergence
• Design: add correctors to intolerant system
  • correction predicate: interface invariant of intolerant system
    • all invoked IP addresses uniquely identify an up, active server
    • each up server is within its load bounds
Design … contd.
• Negation of correction predicate yields “bad” predicates:
  • IP address identifies a down server
  • IP address identifies an inactive server
  • IP address identifies an overloaded server
  • there is no IP address for a given service
  • multiple servers have same IP address
  • the load of the server exceeds its bound
• Instantiate a corrector for each “bad” predicate:
  • if IP address identifies a down server, reassign its service address, inform its group, and restart service
  • if duplicate IP addresses occur, retain only one
  • if server overloaded, reassign its service address, and remove it from service group…
• Compose system and correctors: correctors are independent
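The pattern of one corrector per “bad” predicate can be sketched on a toy cluster. Everything concrete below is an assumption made for illustration: the `Server` record, the dictionary mapping service addresses to server names, the example addresses, and the "move to the least-loaded healthy server" repair. Informing the group and restarting services are left out, and only two of the bad predicates are covered.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    up: bool = True
    load: int = 0
    bound: int = 10


def bad_down(servers, assignment):
    """Bad predicate: a service address identifies a down server."""
    return [ip for ip, name in assignment.items() if not servers[name].up]


def bad_overloaded(servers, assignment):
    """Bad predicate: a service address identifies an overloaded server."""
    return [
        ip
        for ip, name in assignment.items()
        if servers[name].up and servers[name].load > servers[name].bound
    ]


def reassign(servers, assignment, ip):
    """Corrector action: move the address to the least-loaded healthy server."""
    healthy = [s for s in servers.values() if s.up and s.load <= s.bound]
    if healthy:
        assignment[ip] = min(healthy, key=lambda s: s.load).name


# One independent corrector (bad predicate, action) per bad predicate.
CORRECTORS = [(bad_down, reassign), (bad_overloaded, reassign)]


def run_correctors(servers, assignment):
    for bad, fix in CORRECTORS:
        for ip in bad(servers, assignment):
            fix(servers, assignment, ip)


servers = {
    "a": Server("a", up=False),
    "b": Server("b", load=2),
    "c": Server("c", load=20),
}
assignment = {"10.0.0.1": "a", "10.0.0.2": "c", "10.0.0.3": "b"}
run_correctors(servers, assignment)
```

After one pass, no address in this toy state satisfies either bad predicate. Because each corrector reads and repairs only its own predicate, the pairs can be added or removed without touching the others.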
SUN Netra Servers
• Design corrector framework: [Arora, Poduska 95]
  • timing-based information exchange between servers
  • stabilizing
  • tunable for performance
• Singhai prototyped our ideas at SUN; they were tested in-house for 18 months and sold commercially [Singhai 98]
• Reported 10-second failover latency, i.e., better than 99.999% availability if machines fail for 2 hrs/month
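One way to read the availability figure is sketched below. The slide does not spell out the calculation, so this rests on assumptions: a 30-day month, a single machine failure per month lasting two hours, and the service being unavailable only during the 10-second failover.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def availability(downtime_seconds_per_month):
    """Fraction of the month during which the service is reachable."""
    return 1.0 - downtime_seconds_per_month / SECONDS_PER_MONTH


# Without failover the service is down for the whole 2-hour machine outage.
without_failover = availability(2 * 3600)

# With failover the service is down only for the 10-second switch-over
# (assumption: one failure event per month).
with_failover = availability(10)

five_nines = 0.99999
meets_five_nines = with_failover > five_nines
```

Under these assumptions the monthly downtime budget for 99.999% is roughly 26 seconds, so a single 10-second failover fits within it, while two hours of unmasked downtime does not.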
MSR’s Aladdin Home Automation System
• Problem: remote control of heterogeneous devices via in-home PC cluster and heterogeneous networks
• System model:
  • peer-to-peer communication
  • peers may join and leave spontaneously
  • peers may have hidden state: nonpollable
  • all components refresh information periodically
  • subscribers “discover / lookup” providers
• Application methods:
  • client sends email home to control devices (“shut garage”)
  • client receives cell-phone email if critical sensor fires (“water in crawl space”)
Vulnerabilities in Aladdin
• Specification: Always
  • email service is working
  • lookup services are working
  • device controllers are working
  • PCs & peripherals & daemons are working
• Fault-classes:
  • sensors / devices: battery / interference / babbling
  • PCs: hang / crash / “blue screen”
  • in-home networks: loss / interference / intruders
  • power supply: failure
Tolerances in Aladdin
• Tolerances desired:
  • masking for common-case faults (bounds on data quality)
  • stabilizing for rare-case faults
  • safety (for critical devices)
  • security (for critical devices and data)
• Detector/corrector design is similar, but with key differences:
  • tolerance components are larger
  • significant interference between tolerance components
  • extensibility of tolerance components is essential
  • tolerance framework supports state and event predicates
Soft-State Store (SSS) Framework
• SSS maintains state for each information item
  • items are sensor status, lookup data, witness predicates
  • items are stored only for a given interval, unless refreshed
  • storage is volatile or persistent, based on refresh frequency
  • allows both eventing and polling of information
  • is replicated: despite loose consistency, is masking
• Advantages:
  • automatically deals with device leaving
  • efficient use of storage
  • transparent failover of lookup services
  • avoids instability due to outdated state information
  • enables dealing with non-pollable devices
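The core soft-state rule, "items are stored only for a given interval, unless refreshed", can be sketched as follows. This is not the Aladdin implementation: the class and method names are hypothetical, time is passed in explicitly as a number so the example is deterministic, and replication, persistence, and eventing are omitted.

```python
class SoftStateStore:
    """Minimal soft-state store: every item expires unless refreshed."""

    def __init__(self):
        self._items = {}  # key -> (value, expiry time)

    def refresh(self, key, value, ttl, now):
        """Insert or renew an item; it stays valid until now + ttl."""
        self._items[key] = (value, now + ttl)

    def lookup(self, key, now):
        """Polling interface: return the value, or None if absent or stale."""
        entry = self._items.get(key)
        if entry is None or entry[1] <= now:
            self._items.pop(key, None)
            return None
        return entry[0]

    def expire(self, now):
        """Drop all stale items and report which keys were removed."""
        stale = [key for key, (_, expiry) in self._items.items() if expiry <= now]
        for key in stale:
            del self._items[key]
        return stale


store = SoftStateStore()
store.refresh("garage_door", "closed", ttl=30, now=0)
fresh = store.lookup("garage_door", now=10)   # still within its interval
store.refresh("garage_door", "open", ttl=30, now=20)
renewed = store.lookup("garage_door", now=45)  # the refresh extended its life
gone = store.lookup("garage_door", now=60)     # device stopped refreshing
```

A device that leaves simply stops refreshing, and its entry disappears without any explicit deregistration, which is the first advantage listed on the slide.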
Aladdin Status
• Deployed in Yi-Min Wang’s house [Wang, Arora, et al 00]
• 50K LOC (a significant part is the natural language interface)
• It works!
• Yi-Min is sleeping better
• when his wife returns home, he receives a cell-phone notification that the garage door opened
• in-house scenarios are growing
• to be featured in PBS & Discovery specials this fall
• Fault-tolerance overhead is negligible
Relation to Extant Methods
• Detector/corrector-based design encompasses extant
  • blackbox methods (hidden implementation & specification)
  • whitebox methods (visible implementation & specification)
  • graybox methods (visible specification, not implementation)
• Has enabled new efficient, structured protocols
  • distributed reset
  • barrier/resource synchronization
  • ATM congestion control …
http://www.cis.ohio-state.edu/~anish
Scalable Fault-tolerance
• Scalability of synchronous black-box replication is limited [Birman 98]
• Exploit loose consistency where possible
• Exploit component specification
• Exploit end-to-end assumptions, client-component contracts
• “Graybox” design method: for when component implementations are not available [AKD 99]
Graybox design method
[Figure: client, client-component interface, component]
Application: Consensus
• Problem:
  • N nodes, 0 .. N-1, communicating asynchronously
  • each node decides eventually
  • no two nodes decide different values
  • if a node decides value v, some node had chosen v
• Fault-class: nodes may crash
• Tolerance: masking, for correct nodes
• Solution is impossible if crashes are undetectable [FLP85]
Consensus … contd.
• Cottage industry is growing around these solutions, for use in group-based systems
• Possible, if a ◇S failure detector exists [CHT95]
  • eventually, all crashed nodes are permanently suspected by all correct nodes
  • eventually, some correct node is never suspected by any correct node
Fault-intolerant System [Arora, Schiper]
• Client
  • Coordinator, c: choose value; broadcast value
  • Participant, p: deliver decision
• + Corrector at p
  • If p detects c crash (using ◇S), go to next round; if round number mod N is node id, node becomes new c
• Client-component interface
  • If any node has delivered decision, subsequent choices are the same
• Component
  • Choose + Detector2: safe choice
  • Broadcast + Detector1: reliable
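The corrector rule on this slide (advance the round when the coordinator is suspected; the node whose id equals the round number mod N takes over) can be sketched in a few lines. This covers only the rotation step, not a consensus protocol: the failure detector is modeled as a plain set of suspected node ids handed in by the caller, and the class and function names are hypothetical.

```python
def coordinator_for(round_number, n):
    """Rotating-coordinator rule: round r is led by node r mod N."""
    return round_number % n


class Participant:
    def __init__(self, node_id, n):
        self.node_id = node_id
        self.n = n
        self.round = 0

    def is_coordinator(self):
        return coordinator_for(self.round, self.n) == self.node_id

    def on_suspicions(self, suspected):
        """Corrector step: if the current coordinator is suspected, move to
        the next round. Returns True if this node is now the coordinator."""
        if coordinator_for(self.round, self.n) in suspected:
            self.round += 1
        return self.is_coordinator()


p1 = Participant(node_id=1, n=3)
before = p1.is_coordinator()          # round 0 is led by node 0
took_over = p1.on_suspicions({0})     # node 0 suspected: round 1, node 1 leads
unchanged = p1.on_suspicions({2})     # node 2 is not the coordinator: no change
```

This is the optimistic variant in the sense that a suspicion is acted on immediately; a more cautious corrector would gather confirmations from other nodes before advancing the round.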
Design Space
• Corrector:
  • Optimistic (trust failure detector suspicions)
  • Conservative (collect confirmations from some nodes)
  • Pessimistic (collect confirmations from all nodes)
• Detector:
  • Detector1 and Detector2 have independent centralized & distributed implementations
• Shared memory and message passing implementations exist for all components
At least 24 solutions designed, of which 21 are new
Same design experiences for other failure detector models!
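One plausible reading of the count of 24 is that the design choices listed on this slide combine independently. The slide does not state this explicitly, so the enumeration below is an assumption, and the option labels are just the slide's wording turned into strings.

```python
from itertools import product

correctors = ["optimistic", "conservative", "pessimistic"]
detector1 = ["centralized", "distributed"]
detector2 = ["centralized", "distributed"]
communication = ["shared memory", "message passing"]

# Assumption: every combination of choices is a distinct solution.
designs = list(product(correctors, detector1, detector2, communication))
design_count = len(designs)  # 3 * 2 * 2 * 2
```

Under this reading, each solution is a tuple such as ("optimistic", "centralized", "distributed", "message passing"), and the component structure is what makes the whole space reachable from a handful of building blocks.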
Summary
The component-based approach to fault-tolerance:
• Decomposes system requirements
  • separates fault-tolerance from functionality
  • reuses a small number of tolerance frameworks
• Characterizes fault-classes
  • for each tolerance type, adds relevant tolerance components
• Is general and effective:
  • completeness results
  • examples of improved designs
  • new method for graybox designs
Future Directions
Graybox fault-tolerance design
• primitives for self-repairing Network O.S.
• for implementations conforming to standards
• via DLLs