Fault Tolerance Part I Introduction Part II Process Resilience Part III Reliable Communication Part IV Recovery Part V Distributed Commit Chapter 7
Fault Tolerance Chapter 7 Part I Introduction
Fault Tolerance • A DS should be fault-tolerant • Should be able to continue functioning in the presence of faults • Fault tolerance is related to dependability
Dependability • Dependability Includes • Availability • Reliability • Safety • Maintainability
Availability & Reliability (1) • Availability: A measurement of whether a system is ready to be used immediately • System is available at any given moment • Reliability: A measurement of whether a system can run continuously without failure • System continues to function for a long period of time
Availability & Reliability (2) • A system that goes down for 1 ms every hour has an availability of more than 99.99%, but is unreliable • A system that never crashes but is shut down for a week once every year is 100% reliable but only about 98% available
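Both availability figures come from the same arithmetic: availability is the fraction of time the system is usable. A minimal Python sketch, in which the function name `availability` and the 52-week year are illustrative assumptions:

```python
def availability(downtime: float, period: float) -> float:
    """Fraction of the period during which the system is usable."""
    return 1.0 - downtime / period

# 1 ms of downtime in every hour (3600 s), both expressed in seconds.
hourly_glitch = availability(0.001, 3600.0)

# One week of downtime in every year, both expressed in weeks (assumes 52 weeks).
yearly_shutdown = availability(1.0, 52.0)

print(f"1 ms per hour:   {hourly_glitch:.7%}")
print(f"1 week per year: {yearly_shutdown:.1%}")
```

One millisecond per hour leaves availability above 99.9999%, whereas one week out of 52 leaves roughly 98%. Neither number says anything about reliability, which depends on how often the system fails rather than on the total downtime.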
Safety & Maintainability • Safety: A measurement of how safe failures are • System fails, nothing serious happens • Maintainability: A measurement of how easy it is to repair a system • System should be able to fix itself
Faults • A system fails when it cannot meet its promises (specifications) • An error is part of a system state that may lead to a failure • A fault is the cause of the error • Fault-Tolerance: the system can provide services even in the presence of faults • Faults can be: • Transient (appear once and disappear) • Intermittent (appear-disappear behavior) • Permanent (appear and persist until repaired)
Failure Models • Different types of failures.
Failure Masking by Redundancy • Redundancy is a key technique for hiding failures • Redundancy types: • Information: add extra (control) information • Error-correction codes in messages • Time: repeat an action until it succeeds • Transactions • Physical: add extra components (S/W & H/W) • Electronic circuits
Example – Redundancy in Circuits • Triple modular redundancy.
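Triple modular redundancy can be sketched in software as three copies of a component followed by a majority voter. This is only an illustration of the idea, not a circuit model; `majority_vote`, `tmr_stage`, and the `faulty` parameter are hypothetical names introduced here.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tmr_stage(component, x, faulty=None):
    """Run three copies of `component` on x and vote on the result.

    `faulty` optionally maps a copy index (0..2) to the wrong output
    that copy produces, to simulate a failed component.
    """
    faulty = faulty or {}
    outputs = [faulty.get(i, component(x)) for i in range(3)]
    return majority_vote(outputs)

increment = lambda v: v + 1
print(tmr_stage(increment, 1))                        # no faults
print(tmr_stage(increment, 1, faulty={0: 99}))        # one faulty copy is masked
print(tmr_stage(increment, 1, faulty={0: 99, 1: 98})) # two faulty copies: no majority
```

A single faulty copy is outvoted by the other two; with two faulty copies the voter in this sketch reports that no majority exists.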
Fault Tolerance Chapter 7 Part II Process Resilience
Process Resilience • Mask process failures by replication • Organize processes into groups • A message sent to a group is delivered to all members • If a member fails, another should fill in • Groups can be created and deleted dynamically • A process can join or leave several groups
Flat Groups versus Hierarchical Groups • Communication in a flat group. • Communication in a simple hierarchical group
Group Management • Need implementation for: creating/removing groups and joining/leaving groups • Solution 1: group server • Centralized solution, single point of failure • Simple • Solution 2: distributed • No single point of failure • Create/remove group ??? • Synchronization
Process Replication • Replicate a process and group replicas in one group • How many replicas do we create? • A system is k fault-tolerant if it can survive and function even if it has k faulty processes • k+1 replicas for crash failures • 2k+1 replicas for Byzantine failures
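The replica counts above can be checked with a few lines of Python. This is a sketch under the assumption that clients take a majority vote over replica answers in the Byzantine case; the function names are hypothetical.

```python
def replicas_needed(k: int, byzantine: bool) -> int:
    """Replicas required to tolerate k faulty processes."""
    # Crash failures: one surviving replica is enough, so k + 1.
    # Byzantine failures: correct replicas must outvote the k faulty ones, so 2k + 1.
    return 2 * k + 1 if byzantine else k + 1

def correct_majority(n: int, k: int) -> bool:
    """True if the n - k correct replicas form a strict majority of n."""
    return (n - k) > n // 2

for k in range(1, 4):
    n = replicas_needed(k, byzantine=True)
    print(k, n, correct_majority(n, k), correct_majority(n - 1, k))
```

With 2k + 1 replicas the k + 1 correct answers always win the vote; with only 2k replicas the k faulty answers can tie with the k correct ones.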
Agreement • Need agreement in DS: • Leader, commit, synchronize • Distributed Agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps • Perfect processes, faulty channels: two-army • Faulty processes, perfect channels: Byzantine generals
Impossible Consensus • Agreement is impossible in asynchronous DS, even if only one process fails [Fischer et al.] • Asynchronous DS: cannot distinguish a slow process from a crashed one
Possible Consensus • Agreement is possible in synchronous DS [e.g., Lamport et al.] • Byzantine Generals Problem • Synchronous DS: can distinguish a slow process from a crashed one
Byzantine Generals Problem
Byzantine Generals – Example (1) • The Byzantine generals problem for 3 loyal generals and 1 traitor. • (a) The generals announce their troop strengths (in units of 1 kilosoldier). • (b) The vectors that each general assembles based on (a) • (c) The vectors that each general receives in step 3.
Byzantine Generals – Example (2) • The same as in the previous slide, except now with 2 loyal generals and 1 traitor.
Byzantine Generals • Given three processes, if one is faulty, consensus is impossible • Given N processes, if F processes are faulty, consensus is impossible if N ≤ 3F
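The two examples can be replayed with a small simulation of the vector-exchange algorithm: every general announces its value, assembles a vector, forwards the vector to the others, and then takes a per-slot majority. This is a sketch under stated assumptions: the traitor's behaviour (`lie`, a distinct made-up number per receiver and slot) and all function names are hypothetical.

```python
from collections import Counter

UNKNOWN = None

def lie(sender, receiver, slot):
    """Hypothetical traitor: sends a different bogus value per receiver and slot."""
    return -(100 * sender + 10 * receiver + slot + 1)

def byzantine_agreement(values, traitors):
    n = len(values)
    # Steps 1-2: each general announces its value; general i assembles vectors[i].
    vectors = [[lie(j, i, j) if j in traitors else values[j] for j in range(n)]
               for i in range(n)]
    result = {}
    for i in range(n):
        if i in traitors:
            continue
        # Step 3: general i receives a vector from every other general.
        received = [[lie(j, i, s) for s in range(n)] if j in traitors else vectors[j]
                    for j in range(n) if j != i]
        # Step 4: per slot, keep the value reported by a strict majority.
        decided = []
        for s in range(n):
            value, count = Counter(vec[s] for vec in received).most_common(1)[0]
            decided.append(value if count > len(received) // 2 else UNKNOWN)
        result[i] = decided
    return result

print(byzantine_agreement([1, 2, 3, 4], traitors={2}))  # 3 loyal, 1 traitor
print(byzantine_agreement([1, 2, 3], traitors={2}))     # 2 loyal, 1 traitor
```

With four generals, every loyal general ends up with the same vector, in which only the traitor's slot is unknown. With three generals, no slot reaches a majority, which illustrates the N ≤ 3F bound for N = 3 and F = 1.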
Fault Tolerance Chapter 7 Part III Reliable Communication
Reliable Client/Server Communication • Channels may exhibit crash, omission, timing, and arbitrary failures • Point-to-Point communication: use TCP channels • TCP masks omission failures • TCP does not mask crash failures • The DS itself may mask them
Reliable RPC/RMI • Client unable to locate server • Lost request from client to server • Server crashes after receiving client request • Server reply to client is lost • Client crashes after sending request
1. Client unable to locate server • Server is down • Client has outdated proxy • Solution: raise exception • Not all languages have exception handling • Location/Failure transparency
2. Lost request from client to server • Lost message • Solution: timeouts • OS or proxy starts a timer • If the timer expires before a reply or ack arrives, resend • Server must detect duplicate messages • If too many requests are lost, client might conclude that server is down (back to 1.)
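The timeout-and-resend scheme, together with duplicate detection at the server, can be sketched as follows. Message loss is simulated deterministically rather than with real timers or sockets; `Server`, `LossyChannel`, `call`, and the request identifier are hypothetical names introduced for the illustration.

```python
class Server:
    def __init__(self):
        self.seen = {}        # request id -> cached reply
        self.executions = 0   # how many times the operation really ran

    def handle(self, request_id, payload):
        if request_id in self.seen:          # duplicate: reply without re-executing
            return self.seen[request_id]
        self.executions += 1
        reply = payload.upper()              # stand-in for the real operation
        self.seen[request_id] = reply
        return reply

class LossyChannel:
    """Drops the messages whose positions (1, 2, 3, ...) are listed in `drop`."""
    def __init__(self, drop):
        self.drop = set(drop)
        self.sent = 0

    def deliver(self):
        self.sent += 1
        return self.sent not in self.drop

def call(server, channel, request_id, payload, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        if not channel.deliver():            # request lost: timer expires, resend
            continue
        reply = server.handle(request_id, payload)
        if not channel.deliver():            # reply lost: timer expires, resend
            continue
        return reply, attempt
    raise TimeoutError("too many lost messages: server presumed down")

server = Server()
reply, attempts = call(server, LossyChannel(drop={1, 3}), request_id=42, payload="print")
print(reply, attempts, server.executions)
```

In this run the first request and the first reply are lost, so the client needs three attempts, yet the server executes the operation only once because the resent request carries the same identifier.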
3. Server crashes after receiving request – Problem • A server in client-server communication • (a) Normal case • (b) Crash after execution (should raise exception) • (c) Crash before execution (should re-transmit request) • Client cannot tell whether (b) or (c) occurred
3. Server crashes after receiving request – Solutions • At-least-once semantics: keep resending the request until the RPC/RMI is done at least once • At-most-once semantics: the request is executed once or not at all • No guarantees: client is on its own • Exactly-once semantics: ideal, but impossible to guarantee • Example: • M: completion message • P: print text • C: crash
3. Server crashes after receiving request – Example • Different combinations of client and server strategies in the presence of server crashes.
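The combinations can be enumerated with a short script using the M, P, and C events from the previous slide. It assumes that the server either sends the completion message before printing ("MP") or after it ("PM"), that a crash can occur after 0, 1, or 2 of these steps, and that the recovered server executes any request the client reissues. The strategy names and the `outcome` function are hypothetical.

```python
def outcome(server_order, steps_before_crash, reissue):
    """Classify how many times the text is printed: ZERO, OK (once), or DUP (twice)."""
    done = server_order[:steps_before_crash]   # events completed before the crash
    printed = 1 if "P" in done else 0
    acked = "M" in done                        # did the client get the completion message?
    resend = {
        "always": True,
        "never": False,
        "when_acked": acked,
        "when_not_acked": not acked,
    }[reissue]
    if resend:
        printed += 1                           # recovered server prints the reissued request
    return {0: "ZERO", 1: "OK", 2: "DUP"}[printed]

table = {
    (order, reissue): [outcome(order, steps, reissue) for steps in (2, 1, 0)]
    for order in ("MP", "PM")
    for reissue in ("always", "never", "when_acked", "when_not_acked")
}

for (order, reissue), results in table.items():
    print(f"{order}  {reissue:<15} {results}")
```

Each row lists the result for a crash after both steps, after the first step, and before any step. No row is all OK, which is why no combination of client and server strategy yields exactly-once semantics under these assumptions.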
4. Server reply to client is lost • Lost message • Slow server • Solution: timeouts • Works with lost messages • Works with slow servers if operation is idempotent • With stateful servers, let server detect duplicate requests
5. Client Crashes after Sending Request • Orphan: a computation with a dead parent • Wastes CPU cycles, holds locks, etc. • Extermination: proxy logs each request in safe storage before sending it. When client reboots, orphans are killed • Reincarnation: client divides time into numbered epochs. When client reboots, it announces a new epoch. Orphans are killed • Gentle reincarnation: Orphans are killed if parent cannot be found • Expiration: RPCs/RMIs need to renew leases • None of these is desirable in practice
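Reincarnation can be sketched as a server that tags every remote computation with the client's epoch and kills older ones when a new epoch is announced. The class and method names are hypothetical, and the sketch ignores how the announcement is broadcast.

```python
class Server:
    def __init__(self):
        self.running = {}   # computation id -> (client, epoch)

    def start(self, comp_id, client, epoch):
        """Record a remote computation started on behalf of `client`."""
        self.running[comp_id] = (client, epoch)

    def new_epoch(self, client, epoch):
        """Kill every computation this client started in an earlier epoch."""
        orphans = [comp_id for comp_id, (owner, started) in self.running.items()
                   if owner == client and started < epoch]
        for comp_id in orphans:
            del self.running[comp_id]
        return orphans

server = Server()
server.start("rpc-1", client="A", epoch=1)
server.start("rpc-2", client="B", epoch=1)
killed = server.new_epoch("A", epoch=2)     # client A rebooted and announces epoch 2
server.start("rpc-3", client="A", epoch=2)
print(killed, sorted(server.running))
```

Only the rebooted client's old computation is removed; work for other clients and work started in the new epoch is left alone.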
Reliable Group Communication • For simplicity, assume a group is static and processes do not fail • Reliable communication = deliver the message to all group members • Any order delivery • Ordered delivery
Basic Reliable-Multicasting Schemes • A simple solution to reliable multicasting when all receivers are known and are assumed not to fail • Message transmission • Reporting feedback
Scalability Issues • Too many ACK messages => performance problems • Solution 1: only send negative acks (NACKS) • How long will a sender buffer a message before discarding it? • Solution 2: Feedback Suppression • Scalable Reliable Multicasting (SRM) • Solution 3: Hierarchical Feedback Control
Nonhierarchical Feedback Control • Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
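Feedback suppression can be sketched with a deterministic model: each receiver that missed a message schedules its retransmission request at some delay, and cancels it if another receiver's multicast request reaches it first. The `nacks_sent` function, the fixed `latency`, and the example delays are assumptions made for the illustration; a real protocol would draw the delays at random.

```python
def nacks_sent(delays, latency):
    """Return the receivers that actually multicast a retransmission request.

    delays:  receiver name -> time at which its request is scheduled
    latency: time a multicast request needs to reach the other receivers
    """
    first = min(delays.values())
    # A receiver still sends if its timer fires before the earliest request arrives.
    return sorted(name for name, t in delays.items()
                  if t == first or t < first + latency)

delays = {"A": 0.30, "B": 0.10, "C": 0.12, "D": 0.45}
print(nacks_sent(delays, latency=0.0))    # instantaneous multicast
print(nacks_sent(delays, latency=0.05))   # slower network: less suppression
```

With an instantaneous multicast only the earliest receiver sends a request. As latency grows relative to the spread of the timers, more receivers fire before they hear the first request, so suppression becomes less effective.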
Hierarchical Feedback Control • The essence of hierarchical reliable multicasting. • Each local coordinator forwards the message to its children. • A local coordinator handles retransmission requests.
Atomic Multicast • Need to achieve reliable communication in the presence of process failures • Transaction-like delivery: deliver m to all processes in a group or to none of them • Called atomic multicast • Example: assignment 3 • Update must be done at all replicas • What if a replica crashes? • Keep recovery log events locally until replica comes back • Do not perform update (Atomic multicast)
Virtual Synchrony (1): DS Logical Organization • The logical organization of a distributed system to distinguish between message receipt and message delivery
Virtual Synchrony (2) • The principle of virtual synchronous multicast.
Message Ordering (1) • Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.
Message Ordering (2) • Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting
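FIFO-ordered delivery is commonly implemented with per-sender sequence numbers and a hold-back queue, which also makes the difference between receiving and delivering a message concrete. The following is a minimal sketch of that approach, not necessarily the mechanism the slides have in mind; `FifoReceiver` and the message names are hypothetical.

```python
from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next sequence number to deliver
        self.held = defaultdict(dict)      # sender -> {seq: message} received, not yet delivered
        self.delivered = []                # messages handed to the application, in order

    def receive(self, sender, seq, message):
        """Called when a message arrives from the network (receipt)."""
        self.held[sender][seq] = message
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1

r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives early: held back until m1 shows up
r.receive("P4", 0, "m3")   # other sender: delivered immediately
r.receive("P1", 0, "m1")   # releases m1, then the held-back m2
print(r.delivered)
```

Messages from the same sender are delivered in the order they were sent, while messages from different senders may interleave differently at each receiver.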
Implementing Virtual Synchrony (1) • Six different versions of virtually synchronous reliable multicasting.
Implementing Virtual Synchrony (2) • Process 4 notices that process 7 has crashed, sends a view change • Process 6 sends out all its unstable messages, followed by a flush message • Process 6 installs the new view when it has received a flush message from everyone else
Fault Tolerance Chapter 7 Part IV Recovery
Recovery Stable Storage • Stable Storage • Crash after drive 1 is updated • Bad spot
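The two failure cases named above can be sketched with a pair of in-memory "drives" whose blocks carry a checksum. This assumes the usual scheme in which drive 1 is always written first and recovery compares corresponding blocks; the class, its methods, and the `crash_after_drive1` flag are hypothetical, and real stable storage works on disk blocks rather than Python lists.

```python
import zlib

def make_block(data: bytes):
    return {"data": data, "crc": zlib.crc32(data)}

def is_valid(block):
    return zlib.crc32(block["data"]) == block["crc"]

class StableStorage:
    def __init__(self, nblocks):
        self.drive1 = [make_block(b"") for _ in range(nblocks)]
        self.drive2 = [make_block(b"") for _ in range(nblocks)]

    def write(self, i, data, crash_after_drive1=False):
        self.drive1[i] = make_block(data)      # drive 1 is always updated first
        if crash_after_drive1:
            return                             # simulated crash between the two writes
        self.drive2[i] = make_block(data)

    def recover(self):
        for i in range(len(self.drive1)):
            b1, b2 = self.drive1[i], self.drive2[i]
            if is_valid(b1) and not is_valid(b2):
                self.drive2[i] = dict(b1)      # bad spot on drive 2
            elif is_valid(b2) and not is_valid(b1):
                self.drive1[i] = dict(b2)      # bad spot on drive 1
            elif is_valid(b1) and b1["data"] != b2["data"]:
                self.drive2[i] = dict(b1)      # crash after drive 1 was updated: drive 1 is newer

storage = StableStorage(nblocks=2)
storage.write(0, b"old")
storage.write(0, b"new", crash_after_drive1=True)   # crash after drive 1 is updated
storage.write(1, b"x")
storage.drive1[1]["data"] = b"y"                    # bad spot: data no longer matches checksum
storage.recover()
print(storage.drive2[0]["data"], storage.drive1[1]["data"])
```

After recovery both drives hold the newer value for block 0, and the damaged copy of block 1 is restored from the intact drive.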