Fault Tolerance

Fault Tolerance • Part I: Introduction • Part II: Process Resilience • Part III: Reliable Communication • Part IV: Recovery • Part V: Distributed Commit • Chapter 7

Fault Tolerance Chapter 7 Part I Introduction

Fault Tolerance • A distributed system (DS) should be fault tolerant • It should be able to continue functioning in the presence of faults • Fault tolerance is related to dependability

Dependability • Dependability Includes • Availability • Reliability • Safety • Maintainability

Availability & Reliability (1) • Availability: A measurement of whether a system is ready to be used immediately • System is available at any given moment • Reliability: A measurement of whether a system can run continuously without failure • System continues to function for a long period of time

Availability & Reliability (2) • A system that goes down for 1 ms every hour has an availability of more than 99.99%, but is unreliable • A system that never crashes but is shut down for two weeks every year is highly reliable but only about 96% available
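The percentages on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes availability is simply uptime divided by total time; the function name `availability` is illustrative and not part of the original slides.

```python
def availability(downtime, period):
    """Percentage of `period` during which the system is up.

    Both arguments must use the same unit (ms, days, ...).
    """
    return 100.0 * (period - downtime) / period


MS_PER_HOUR = 60 * 60 * 1000

# Down for 1 ms in every hour: almost always available, yet it fails hourly.
hourly_glitch = availability(downtime=1, period=MS_PER_HOUR)

# Never crashes, but is switched off for a planned period each year.
one_week_off = availability(downtime=7, period=365)
two_weeks_off = availability(downtime=14, period=365)

print(f"1 ms/hour down: {hourly_glitch:.5f}%")
print(f"1 week/year off: {one_week_off:.1f}%")
print(f"2 weeks/year off: {two_weeks_off:.1f}%")
```

One week of planned downtime per year works out to roughly 98% availability, and two weeks to roughly 96%.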

Safety & Maintainability • Safety: A measurement of how safe failures are • If the system fails, nothing serious happens • Maintainability: A measurement of how easy it is to repair a system • Ideally, the system should be able to fix itself

Faults • A system fails when it cannot meet its promises (specifications) • An error is part of a system state that may lead to a failure • A fault is the cause of the error • Fault-Tolerance: the system can provide services even in the presence of faults • Faults can be: • Transient (appear once and disappear) • Intermittent (appear-disappear behavior) • Permanent (appear and persist until repaired)

Failure Models • Different types of failures.

Failure Masking by Redundancy • Redundancy is the key technique for hiding failures • Redundancy types: • Information: add extra (control) information • Error-correction codes in messages • Time: repeat an action until it succeeds • Transactions • Physical: add extra components (S/W & H/W) • Electronic circuits

Example – Redundancy in Circuits • Triple modular redundancy.
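Triple modular redundancy can be illustrated in software with a majority voter. The sketch below assumes each stage is replicated three times and a voter forwards the value that at least two copies agree on; `majority_vote`, `tmr`, `good` and `faulty` are hypothetical names chosen for this example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def tmr(copy_a, copy_b, copy_c, x):
    """Run three copies of the same stage on x and vote on their outputs."""
    return majority_vote([copy_a(x), copy_b(x), copy_c(x)])


def good(x):
    return x + 1


def faulty(x):
    return -999  # hypothetical faulty copy that outputs garbage


masked = tmr(good, good, faulty, 41)             # one fault is outvoted
unmasked = tmr(good, faulty, lambda x: 0, 41)    # two faults: no majority
print(masked, unmasked)
```

A single faulty copy is masked because the two correct copies outvote it; with two faulty copies the voter has no majority to forward.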

Fault Tolerance Chapter 7 Part II Process Resilience

Process Resilience • Mask process failures by replication • Organize processes into groups • A message sent to a group is delivered to all members • If a member fails, another should fill in • Groups can be created and deleted dynamically • A process can join or leave several groups

Flat Groups versus Hierarchical Groups • Communication in a flat group. • Communication in a simple hierarchical group

Group Management • Need implementation for: creating/removing groups and joining/leaving groups • Solution 1: group server • Centralized solution, single point of failure • Simple • Solution 2: distributed • No single point of failure • Create/remove group ??? • Synchronization

Process Replication • Replicate a process and group replicas in one group • How many replicas do we create? • A system is k fault-tolerant if it can survive and function even if it has k faulty processes • k+1 replicas for crash failures • 2k+1 replicas for Byzantine failures
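The replica counts above follow from how a client combines replies. A minimal sketch, assuming the client takes a majority vote over replies in the Byzantine case; `replicas_needed` and `voted_result` are illustrative names, not from the slides.

```python
from collections import Counter


def replicas_needed(k, byzantine):
    """Replicas required to tolerate k faulty processes."""
    return 2 * k + 1 if byzantine else k + 1


def voted_result(replies):
    """Value returned by a strict majority of replicas, or None."""
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None


k = 2
n = replicas_needed(k, byzantine=True)            # 5 replicas
enough = voted_result(["ok"] * (n - k) + ["bad"] * k)

# With only 2k replicas, k faulty replies can tie the vote.
too_few = voted_result(["ok"] * k + ["bad"] * k)
print(n, enough, too_few)
```

With crash failures a single surviving replica is enough, which is why k+1 suffices there; with arbitrary (Byzantine) replies the k+1 correct answers must outnumber the k wrong ones.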

Agreement • Need agreement in DS: • Leader, commit, synchronize • Distributed Agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps • Perfect processes, faulty channels: two-army • Faulty processes, perfect channels: Byzantine generals

Two-Army Problem

Impossible Consensus • Agreement is impossible in asynchronous DS, even if only one process fails [Fischer et al.] • Asynchronous DS: cannot distinguish a slow process from a crashed one

Possible Consensus • Agreement is possible in synchronous DS [e.g., Lamport et al.] • Byzantine Generals Problem • Synchronous DS: can distinguish a slow process from a crashed one

Byzantine Generals Problem    

Byzantine Generals – Example (1) • The Byzantine generals problem for 3 loyal generals and 1 traitor. • (a) The generals announce their troop strengths (in units of 1 kilosoldier). • (b) The vectors that each general assembles based on (a). • (c) The vectors that each general receives in step 3.

Byzantine Generals – Example (2) • The same as in the previous slide, except now with 2 loyal generals and 1 traitor.

Byzantine Generals • Given three processes, if one is faulty, consensus is impossible • Given N processes, if F processes are faulty, consensus is impossible if N ≤ 3F
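The two examples can be reproduced with a small simulation. This is a sketch under stated assumptions: channels are perfect and synchronous, each general first announces its strength, then relays the vector it assembled, and finally takes a per-element majority over the vectors received from the others. The function `lie` is a hypothetical traitor strategy that tells every receiver something different.

```python
from collections import Counter

UNKNOWN = None


def majority(values):
    """Value reported by a strict majority, or UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else UNKNOWN


def run_agreement(strengths, traitors, lie):
    """Return, for each loyal general, the vector it decides on."""
    generals = sorted(strengths)

    def says(sender, receiver, about, honest_value):
        # A traitor may send anything; a loyal general sends what it knows.
        if sender in traitors:
            return lie(sender, receiver, about)
        return honest_value

    # Steps 1-2: announce strengths; general g assembles vector[g].
    vector = {
        g: {s: strengths[s] if s == g else says(s, g, s, strengths[s])
            for s in generals}
        for g in generals
    }

    # Steps 3-4: relay vectors; decide each element by majority.
    decisions = {}
    for g in generals:
        if g in traitors:
            continue
        decisions[g] = {
            about: majority([says(s, g, about, vector[s][about])
                             for s in generals if s != g])
            for about in generals
        }
    return decisions


def lie(sender, receiver, about):
    """Hypothetical traitor: a different bogus value for every receiver."""
    return 100 * receiver + about


four = run_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3}, lie=lie)
three = run_agreement({1: 1, 2: 2, 3: 3}, traitors={3}, lie=lie)
print(four[1])   # loyal values recovered, traitor's entry unknown
print(three[1])  # no majority for any element
```

With four generals and one traitor, every loyal general decides on the same vector and recovers the loyal strengths. With three generals and one traitor, no element reaches a majority, matching N ≤ 3F.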

Fault Tolerance Chapter 7 Part III Reliable Communication

Reliable Client/Server Communication

Reliable Client/Server Communication • Channels may exhibit crash, omission, timing, and arbitrary failures • Point-to-point communication: use TCP channels • TCP masks omission failures • TCP does not mask crash failures • The DS itself may mask them

Reliable RPC/RMI • Client unable to locate server • Lost request from client to server • Server crashes after receiving client request • Server reply to client is lost • Client crashes after sending request

1. Client unable to locate server • Server is down • Client has outdated proxy • Solution: raise exception • Not all languages have exception handling • Location/Failure transparency

2. Lost request from client to server • Lost message • Solution: timeouts • OS or proxy starts a timer • If the timer expires before a reply or ack arrives, resend • Server must detect duplicate messages • If too many requests are lost, client might conclude that server is down (back to 1.)
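The timeout-and-resend scheme can be sketched with a simulated lossy channel. Everything here is an assumption for illustration: `LossyChannel` drops messages by send index instead of using a real timer, `EchoServer` caches replies by request id to detect duplicates, and `call_with_retries` gives up after a fixed number of attempts.

```python
class EchoServer:
    def __init__(self):
        self.replies = {}      # request id -> reply already computed
        self.executions = 0

    def handle(self, request_id, payload):
        if request_id in self.replies:       # duplicate: do not re-execute
            return self.replies[request_id]
        self.executions += 1
        self.replies[request_id] = payload.upper()
        return self.replies[request_id]


class LossyChannel:
    """Drops every request whose 0-based send index is in `drop`."""

    def __init__(self, server, drop):
        self.server = server
        self.drop = set(drop)
        self.sent = 0

    def send(self, request_id, payload):
        index = self.sent
        self.sent += 1
        if index in self.drop:
            return None        # models the timer expiring without a reply
        return self.server.handle(request_id, payload)


def call_with_retries(channel, request_id, payload, max_attempts=5):
    """Resend the same request id until a reply arrives or attempts run out."""
    for _ in range(max_attempts):
        reply = channel.send(request_id, payload)
        if reply is not None:
            return reply
    raise ConnectionError("too many lost requests: server presumed down")


server = EchoServer()
channel = LossyChannel(server, drop={0, 1})
reply = call_with_retries(channel, request_id=1, payload="hi")
print(reply, channel.sent, server.executions)
```

Reusing the same request id on every resend is what lets the server recognise a retransmission. When every attempt is lost, the client falls back to the "cannot locate server" case.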

3. Server crashes after receiving request – Problem • A server in client-server communication • (a) Normal case • (b) Crash after execution (should raise exception) • (c) Crash before execution (should re-transmit request) • Client cannot tell whether (b) or (c) occurred

3. Server crashes after receiving request – Solutions • At-least-once semantics: keep sending the request until the RPC/RMI has been done at least once • At-most-once semantics: do it once or not at all • No guarantees: client is on its own • Exactly-once semantics: ideal, but impossible • Example: • M: completion message • P: print text • C: crash

3. Server crashes after receiving request – Example • Different combinations of client and server strategies in the presence of server crashes.
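The combinations can be enumerated mechanically. The sketch below assumes the server either sends the completion message before printing (M->P) or prints first (P->M), that a crash can occur before, between, or after those two events, and that the recovered server prints once more whenever the client reissues the request. The policy names are illustrative labels for the client's reissue strategies.

```python
# Events that happened before the crash, per server ordering.
SERVER_ORDERS = {
    "M->P": ["MP", "M", ""],   # crash after both, after M only, before both
    "P->M": ["PM", "P", ""],   # crash after both, after P only, before both
}
POLICIES = ["always", "never", "when_acked", "when_not_acked"]


def outcome(events_before_crash, policy):
    """Return OK (printed once), DUP (printed twice) or ZERO (never printed)."""
    printed = "P" in events_before_crash
    acked = "M" in events_before_crash
    reissue = {
        "always": True,
        "never": False,
        "when_acked": acked,
        "when_not_acked": not acked,
    }[policy]
    prints = int(printed) + int(reissue)
    return {0: "ZERO", 1: "OK", 2: "DUP"}[prints]


def always_correct(order, policy):
    """True if the text is printed exactly once in every crash scenario."""
    return all(outcome(e, policy) == "OK" for e in SERVER_ORDERS[order])


for order, scenarios in SERVER_ORDERS.items():
    for policy in POLICIES:
        row = [outcome(e, policy) for e in scenarios]
        print(f"{order:5} {policy:15} {row}")

safe_pairs = [(o, p) for o in SERVER_ORDERS for p in POLICIES
              if always_correct(o, p)]
print(safe_pairs)
```

No pairing of server ordering and client policy yields OK in all three crash scenarios, which is the sense in which exactly-once semantics cannot be guaranteed here.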

4. Server reply to client is lost • Lost message • Slow server • Solution: timeouts • Works with lost messages • Works with slow servers if the operation is idempotent • With stateful servers, let the server detect duplicate requests
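Why idempotence matters when a reply is lost can be shown with a toy account. This sketch assumes the first execution succeeds but its reply never arrives, so the client times out and resends; `Account`, `deposit_once` and `call_with_lost_first_reply` are hypothetical names for illustration.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance
        self.replies = {}            # request id -> reply already sent

    def set_balance(self, amount):   # idempotent: repeating it is harmless
        self.balance = amount
        return self.balance

    def deposit(self, amount):       # not idempotent
        self.balance += amount
        return self.balance

    def deposit_once(self, request_id, amount):
        """Stateful server: execute each request id at most once."""
        if request_id not in self.replies:
            self.replies[request_id] = self.deposit(amount)
        return self.replies[request_id]


def call_with_lost_first_reply(operation, *args):
    operation(*args)           # executed, but the reply is lost
    return operation(*args)    # client times out and resends


idempotent = call_with_lost_first_reply(Account(100).set_balance, 50)
duplicated = call_with_lost_first_reply(Account(100).deposit, 10)
filtered = call_with_lost_first_reply(Account(100).deposit_once, "req-1", 10)
print(idempotent, duplicated, filtered)
```

The plain deposit is applied twice, while the idempotent operation and the duplicate-filtering version both leave the account in the intended state.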

5. Client Crashes after Sending Request • Orphan: a computation with a dead parent • Wastes CPU cycles, holds locks, etc. • Extermination: the proxy logs each request in safe storage before sending it. When the client reboots, orphans are killed • Reincarnation: the client divides time into numbered epochs. When the client reboots, it announces a new epoch and orphans are killed • Gentle reincarnation: orphans are killed only if their parent cannot be found • Expiration: RPCs/RMIs need to renew leases • None of these is desirable in practice
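Reincarnation can be sketched as epoch bookkeeping on the server side. This is an illustration only: `EpochServer` and its methods are hypothetical, and "killing" an orphan is modelled as removing it from a table of running calls.

```python
class EpochServer:
    """Tracks running remote computations together with the client's epoch."""

    def __init__(self):
        self.epoch = {}      # client id -> latest epoch seen
        self.running = {}    # call id -> (client id, epoch)

    def start(self, call_id, client, epoch):
        if epoch < self.epoch.get(client, 0):
            return False     # request from an earlier incarnation: reject
        self.epoch[client] = epoch
        self.running[call_id] = (client, epoch)
        return True

    def new_epoch(self, client, epoch):
        """Client rebooted and announced a new epoch: kill its orphans."""
        self.epoch[client] = epoch
        orphans = [call_id for call_id, (who, e) in self.running.items()
                   if who == client and e < epoch]
        for call_id in orphans:
            del self.running[call_id]
        return orphans


server = EpochServer()
server.start("c1", "A", 1)
server.start("c2", "A", 1)
server.start("c3", "B", 1)
killed = server.new_epoch("A", 2)    # client A rebooted
print(killed, list(server.running))
```

Only computations started by the rebooted client in an older epoch are removed; other clients' calls keep running.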

Reliable Group Communication

Reliable Group Communication • For simplicity, assume a group is static and processes do not fail • Reliable communication = deliver the message to all group members • Any order delivery • Ordered delivery

Basic Reliable-Multicasting Schemes • A simple solution to reliable multicasting when all receivers are known and are assumed not to fail • Message transmission • Reporting feedback
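A minimal sketch of this scheme, assuming the sender numbers its messages and keeps them in a history buffer, and that a receiver noticing a gap in the sequence numbers asks the sender to retransmit. `Sender`, `Receiver` and the `lost_by` parameter (used to simulate loss) are illustrative names.

```python
class Sender:
    def __init__(self):
        self.history = {}      # seq -> payload, kept for retransmission
        self.next_seq = 0

    def multicast(self, payload, receivers, lost_by=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        for r in receivers:
            if r.name not in lost_by:      # simulate loss for some receivers
                r.receive(seq, payload, self)

    def retransmit(self, seq, receiver):
        receiver.receive(seq, self.history[seq], self)


class Receiver:
    def __init__(self, name):
        self.name = name
        self.expected = 0      # next sequence number to deliver
        self.pending = {}      # received but not yet deliverable
        self.delivered = []

    def receive(self, seq, payload, sender):
        if seq < self.expected or seq in self.pending:
            return             # duplicate
        self.pending[seq] = payload
        # Report feedback: request every message missing before `seq`.
        for missing in range(self.expected, seq):
            if missing >= self.expected and missing not in self.pending:
                sender.retransmit(missing, self)
        # Deliver in sequence order.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


sender = Sender()
a, b = Receiver("a"), Receiver("b")
sender.multicast("m0", [a, b], lost_by={"b"})
sender.multicast("m1", [a, b], lost_by={"b"})
sender.multicast("m2", [a, b])
print(a.delivered, b.delivered)
```

Receiver b misses the first two messages, detects the gap when the third arrives, and recovers them from the sender's history buffer.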

Scalability Issues • Too many ACK messages => performance problems • Solution 1: only send negative acks (NACKs) • How long should a sender buffer a message before discarding it? • Solution 2: Feedback Suppression • Scalable Reliable Multicasting (SRM) • Solution 3: Hierarchical Feedback Control

Nonhierarchical Feedback Control • Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
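Feedback suppression can be sketched as a timing calculation. The assumptions here are mine: each receiver that missed the message schedules its retransmission request after a randomly chosen delay, requests are multicast to the whole group, and every request takes the same propagation time to reach the others. `nacks_sent` is a hypothetical helper; the delays are given explicitly so the result is deterministic.

```python
def nacks_sent(delays_ms, propagation_ms):
    """Return the receivers that actually send a retransmission request.

    A receiver sends when its timer fires unless it has already seen
    someone else's request, in which case it suppresses its own.
    """
    first = min(delays_ms.values())
    seen_by_all_at = first + propagation_ms
    return sorted(r for r, d in delays_ms.items()
                  if d == first or d < seen_by_all_at)


delays = {"r1": 300, "r2": 50, "r3": 200, "r4": 60}
senders = nacks_sent(delays, propagation_ms=20)
print(senders)
```

Receivers whose timers fire before the first request has propagated still send theirs, which is why suppression reduces feedback without eliminating duplicates entirely.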

Hierarchical Feedback Control • The essence of hierarchical reliable multicasting. • Each local coordinator forwards the message to its children. • A local coordinator handles retransmission requests.

Atomic Multicast • Need to achieve reliable communication in the presence of process failures • Transaction-like delivery: deliver m to all processes in a group or to none of them • Called atomic multicast • Example: assignment 3 • Update must be done at all replicas • What if a replica crashes? • Keep recovery log events locally until replica comes back • Do not perform update (Atomic multicast)

Virtual Synchrony: DS Logical Organization • The logical organization of a distributed system to distinguish between message receipt and message delivery

Virtual Synchrony (2) • The principle of virtual synchronous multicast.

Message Ordering (1) • Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Message Ordering (2) • Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting
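FIFO-ordered multicasting can be sketched with a hold-back queue that separates receipt from delivery. This assumes every sender numbers its own messages from 0; `FifoReceiver` and the message labels are illustrative.

```python
from collections import defaultdict


class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next seq to deliver
        self.held = defaultdict(dict)      # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, message):
        """Receipt: buffer the message, then deliver whatever is in order."""
        self.held[sender][seq] = message
        while self.next_seq[sender] in self.held[sender]:
            msg = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append(msg)
            self.next_seq[sender] += 1


p = FifoReceiver()
p.receive("P1", 1, "m2")   # arrives early: held back
p.receive("P4", 0, "m3")
p.receive("P1", 0, "m1")   # releases m1, then m2
p.receive("P4", 1, "m4")
print(p.delivered)
```

Messages from the same sender are delivered in the order they were sent, while messages from different senders may interleave differently at different receivers.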

Implementing Virtual Synchrony (1) • Six different versions of virtually synchronous reliable multicasting.

Implementing Virtual Synchrony (2) • Process 4 notices that process 7 has crashed, sends a view change • Process 6 sends out all its unstable messages, followed by a flush message • Process 6 installs the new view when it has received a flush message from everyone else

Fault Tolerance Chapter 7 Part IV Recovery

Recovery: Stable Storage • (a) Stable storage • (b) Crash after drive 1 is updated • (c) Bad spot
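The three cases can be sketched with two simulated drives. This is a minimal illustration under stated assumptions: each block is stored with a CRC-32 checksum, writes always go to drive 1 first and then to drive 2, and recovery compares the two copies block by block. `Drive`, `StableStorage`, `corrupt` and the `crash_after_first` flag are hypothetical names used to simulate the failures.

```python
import zlib


class Drive:
    def __init__(self):
        self.blocks = {}       # block number -> (data, checksum)

    def write(self, n, data):
        self.blocks[n] = (data, zlib.crc32(data))

    def read(self, n):
        """Return the data, or None if the block is missing or damaged."""
        entry = self.blocks.get(n)
        if entry is None:
            return None
        data, checksum = entry
        return data if zlib.crc32(data) == checksum else None

    def corrupt(self, n):
        """Simulate a bad spot by invalidating the stored checksum."""
        data, checksum = self.blocks[n]
        self.blocks[n] = (data, checksum ^ 1)


class StableStorage:
    def __init__(self):
        self.d1, self.d2 = Drive(), Drive()

    def write(self, n, data, crash_after_first=False):
        self.d1.write(n, data)           # always update drive 1 first
        if crash_after_first:
            return                       # simulated crash before drive 2
        self.d2.write(n, data)

    def recover(self, n):
        a, b = self.d1.read(n), self.d2.read(n)
        if a is None and b is not None:
            self.d1.write(n, b)          # bad spot on drive 1
        elif a is not None and a != b:
            self.d2.write(n, a)          # drive 2 stale or damaged


s = StableStorage()
s.write(0, b"old")
s.write(0, b"new", crash_after_first=True)
stale = s.d2.read(0)
s.recover(0)
repaired = s.d2.read(0)

s.d1.corrupt(0)
damaged = s.d1.read(0)
s.recover(0)
restored = s.d1.read(0)
print(stale, repaired, damaged, restored)
```

Because drive 1 is always written first, a mismatch between two valid copies means drive 1 holds the newer value, so recovery copies it to drive 2; a checksum failure is repaired from whichever copy is still valid.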
