Chapter 8 Fault Tolerance

Presentation Transcript


  1. Chapter 8 Fault Tolerance • Introduction • Process resilience • Reliable communication • Failure recovery • Distributed commit

  2. Dependability • Dependability is the ability to avoid service failures that are more frequent or severe than desired. It is an important goal of distributed systems. • Requirements for dependable systems • Availability: the probability that the system is available to perform its functions at any moment • 99.999% availability (five 9s) → about 5 minutes of downtime per year • Reliability: the ability of the system to run continuously without failure • Down for 1 ms every hour → over 99.9999% availability but highly unreliable • Down for two weeks every year → high reliability but only 96% availability • Safety: when a system temporarily fails to operate correctly, nothing catastrophic happens • Maintainability: how easily a failed system can be repaired • Security: covered in Chapter 9
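
The availability figures on this slide follow from simple ratios of downtime to total time. A minimal sketch of that arithmetic, assuming a 365-day year (the function names are illustrative, not from the slides):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR


def availability(downtime, period):
    """Fraction of time the system is up; both arguments use the same unit."""
    return 1 - downtime / period


five_nines = downtime_minutes_per_year(0.99999)  # roughly 5.26 minutes
blip = availability(0.001, 3600)                 # 1 ms down per hour, in seconds
outage = availability(14, 365)                   # two weeks down per year, in days
```

The second case stays above 99.9999% available yet fails every hour, while the third never fails between outages yet is only about 96% available, which is why availability and reliability are listed as separate requirements.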

  3. Failures and Faults • Building a dependable system comes down to preventing failures • A failure of a system occurs when the system cannot meet its promises • Failures are caused by faults • A fault is an anomalous condition. There are three categories of faults: • Transient faults: occur once and then disappear (e.g., wireless communication being interrupted by external interference) • Intermittent faults: reoccur irregularly (e.g., a loose contact on a connector) • Permanent faults: persist until the faulty component is replaced (e.g., software bugs)

  4. Types of Failures • Arbitrary failures are also known as Byzantine failures

  5. Fault Tolerance • In a single-machine system, a failure is almost always total • All components are affected and the entire system may be brought down (e.g., OS crash, disk failures) • Partial failures are possible in distributed systems • When one component fails, it may affect some components while leaving others unaffected • Fault tolerance means that a system can provide its services even in the presence of faults • Fault tolerance requires • preventing faults and failures from affecting other components of the system • automatically recovering from partial failures

  6. Failure Masking • Failure masking is a fault tolerance technique that hides the occurrence of failures from other processes • The most common approach to failure masking is redundancy • Three types of redundancy: • Information redundancy: add extra bits to allow recovery from garbled bits • Time redundancy: repeat an action if needed • Physical redundancy: add extra equipment or processes so that the system can tolerate the loss or malfunctioning of some components
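
The first two kinds of redundancy can be sketched in a few lines. The block below is one illustration, not the method used in the slides: a Hamming(7,4) code stands in for information redundancy (three extra bits let the receiver correct any single garbled bit), and a retry loop stands in for time redundancy. `TransientFault` and `retry` are hypothetical names introduced for this sketch.

```python
def hamming_encode(data):
    """Information redundancy: add 3 parity bits to 4 data bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Parity bits sit at positions 1, 2 and 4 (1-based).
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming_decode(code):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # 0 means no error was detected
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


class TransientFault(Exception):
    """Hypothetical error type for a fault that may vanish on retry."""


def retry(action, attempts=3):
    """Time redundancy: repeat the action until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientFault:
            if attempt == attempts:
                raise
```

Retrying only helps against transient or intermittent faults; a permanent fault will fail on every attempt, so the last exception is re-raised.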

  7. An Example of Physical Redundancy • Triple modular redundancy: the effect of a single component failing is completely masked.
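
The masking effect of triple modular redundancy comes from a majority voter placed after three replicated components. A minimal software sketch, assuming each component is modeled as a plain function (the names `vote` and `tmr` are illustrative only):

```python
from collections import Counter


def vote(a, b, c):
    """Majority voter: the value at least two inputs agree on, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def tmr(replicas, x):
    """Feed the same input to three replicas and vote on their outputs."""
    return vote(*(replica(x) for replica in replicas))


def working(x):
    return 2 * x


def broken(x):
    return -1  # a failed component producing a wrong output


masked = tmr([working, broken, working], 21)
```

Here `masked` is 42: the single wrong output is outvoted. If two of the three replicas fail in different ways, the voter has no majority and returns `None`.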

  8. Process Resilience • Protection against process failures can be achieved by organizing several identical processes into a group • Flat group: all processes are equal; the processes make decisions collectively • No single point of failure, but decision making is more complicated • Hierarchical group: a single coordinator makes all decisions • Decision making is simpler, but the coordinator is a single point of failure

  9. Fault Tolerance in Process Groups • Having a group of identical processes allows us to mask one or more faulty processes in that group • A group of replicated processes is said to be k-fault-tolerant if it can survive k faults and still meet its specifications • With crash failures, k+1 processes are sufficient to survive k faults • With Byzantine failures, processes may produce erroneous, random, or malicious results → 2k+1 processes are required to survive k faults (group output is defined by voting) • Assumption: all requests arrive at all members in the group in the same order (this requires atomic multicast) → only then are we sure that all members do exactly the same thing
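
The two group sizes can be read as follows: with crash failures one surviving reply is enough, whereas with Byzantine failures the k+1 correct replies must outvote up to k wrong ones. A small sketch under those assumptions (function names are hypothetical; a crashed replica is modeled as `None`):

```python
from collections import Counter


def min_group_size(k, byzantine):
    """Smallest group that survives k faults of the given kind."""
    return 2 * k + 1 if byzantine else k + 1


def crash_output(replies):
    """Crash failures: any reply that arrived is correct; None means crashed."""
    for reply in replies:
        if reply is not None:
            return reply
    return None


def byzantine_output(replies, k):
    """Byzantine failures: accept a value only if at least k+1 replicas report it."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None
```

For k = 2, `byzantine_output([7, 7, 7, 0, 9], 2)` yields 7 because three correct replicas outvote two faulty ones, while a group of only four with the same two faulty members could not guarantee a k+1 quorum.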

  10. Agreement in Faulty Systems • The goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue within a finite number of steps • Q1: Can consensus be reached with nonfaulty processes and an unreliable communication channel? • A: Two nonfaulty processes can never reach agreement in the presence of an unreliable channel • Q2: Can consensus be reached with faulty (Byzantine) processes and a reliable channel? • A: It depends

  11. Conditions for Consensus • Assume processes may be faulty and communication is reliable. • A system is synchronous iff the processes operate in a lock-step mode (i.e., there is a constant c≥1, such that if any process has taken c+1 steps, every other process has taken at least one step).

  12. Byzantine Agreement Problem • Byzantine agreement problem: Can N generals reach consensus about each other's troop strengths when the communication channel is perfect but some of the generals are traitors and will lie to prevent agreement? • Formally, there are N processes, and each process i provides a value v_i to the others. The goal is to let each process construct a vector V of length N such that if process i is nonfaulty, V[i] = v_i. Otherwise V[i] is undefined. • Assuming processes are synchronous, messages are unicast with ordering preserved, and communication delay is bounded, agreement with k faulty processes can be achieved if there are at least 2k+1 nonfaulty processes, i.e., 3k+1 processes in total [Lamport et al., 1982].

  13. Byzantine Agreement Problem: An Example • The Byzantine agreement problem for 3 nonfaulty processes and 1 faulty process with v_i = i. Consensus is reached for the nonfaulty processes. (a) Each process sends its value to the others. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives after each process passes its vector from (b) to every other process.

  14. Byzantine Agreement Problem: Another Example • The Byzantine agreement problem for 2 nonfaulty processes and 1 faulty process. The algorithm fails to produce agreement.
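
The two examples can be reproduced with a short simulation of the value exchange, the vector exchange, and a final per-element majority vote. This is a sketch under stated assumptions, not the figures from the slides: process 3 is taken to be the faulty one, v_i = i, and the faulty process is modeled by a hypothetical `lie` function that sends a different bogus value in every message.

```python
from collections import Counter

UNKNOWN = None  # marker for an element on which no majority exists


def lie(sender, receiver, slot):
    """Hypothetical adversary: a distinct bogus value per message."""
    return f"lie({sender}->{receiver},{slot})"


def byzantine_agreement(n, faulty):
    """Return the result vector computed by each nonfaulty process."""
    procs = range(1, n + 1)

    # (a) and (b): each process s sends v_s = s; receiver r assembles a vector.
    assembled = {
        r: {s: (lie(s, r, s) if s in faulty else s) for s in procs}
        for r in procs
    }

    results = {}
    for r in procs:
        if r in faulty:
            continue
        # (c): r receives one vector from every other process.
        received = []
        for s in procs:
            if s == r:
                continue
            if s in faulty:
                received.append({slot: lie(s, r, slot) for slot in procs})
            else:
                received.append(assembled[s])
        # Final step: keep an element only if a strict majority agrees on it.
        vector = []
        for slot in procs:
            value, count = Counter(v[slot] for v in received).most_common(1)[0]
            vector.append(value if count > len(received) / 2 else UNKNOWN)
        results[r] = vector
    return results


four = byzantine_agreement(4, faulty={3})   # every nonfaulty vector: [1, 2, None, 4]
three = byzantine_agreement(3, faulty={3})  # every nonfaulty vector: [None, None, None]
```

With four processes, each nonfaulty process receives two honest vectors that outvote the faulty one, so all of them agree on the values of processes 1, 2, and 4. With three processes, each nonfaulty process receives only one honest and one bogus vector, no element reaches a majority, and nothing can be decided.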
