Fault Tolerance

Presentation Transcript


  1. Fault Tolerance Part I Introduction Part II Process Resilience Part III Reliable Communication Part IV Recovery Part V Distributed Commit Chapter 7

  2. Fault Tolerance Chapter 7 Part I Introduction

  3. Fault Tolerance • A distributed system (DS) should be fault-tolerant • It should be able to continue functioning in the presence of faults • Fault tolerance is related to dependability

  4. Dependability • Dependability Includes • Availability • Reliability • Safety • Maintainability

  5. Availability & Reliability (1) • Availability: A measurement of whether a system is ready to be used immediately • System is available at any given moment • Reliability: A measurement of whether a system can run continuously without failure • System continues to function for a long period of time

  6. Availability & Reliability (2) • A system that goes down for 1 ms every hour has an availability of more than 99.9999%, but is highly unreliable • A system that never crashes but is shut down for two weeks once every year is highly reliable but only 96% available
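The figures on this slide come down to fraction-of-time arithmetic. A minimal sketch, where the helper name `availability` and the 365-day year are assumptions made for illustration: 1 ms of downtime per hour leaves availability above 99.9999%, while a planned yearly shutdown costs roughly 2% per week of downtime (about 98% available for one week, about 96% for two).

```python
def availability(downtime_seconds: float, period_seconds: float) -> float:
    """Fraction of the period during which the system is usable."""
    return 1.0 - downtime_seconds / period_seconds


HOUR = 3600.0
WEEK = 7 * 24 * HOUR
YEAR = 365 * 24 * HOUR          # assumption: non-leap year

flaky = availability(0.001, HOUR)          # down 1 ms in every hour
one_week_off = availability(WEEK, YEAR)    # shut down one week per year
two_weeks_off = availability(2 * WEEK, YEAR)

print(f"{flaky:.7%}  {one_week_off:.1%}  {two_weeks_off:.1%}")
```

Availability says nothing about how often the system fails: the first system fails 24 times a day yet scores far higher than the second, which never fails unexpectedly.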

  7. Safety & Maintainability • Safety: A measurement of how safe failures are • System fails, nothing serious happens • Maintainability: A measurement of how easy it is to repair a system • System should be able to fix itself

  8. Faults • A system fails when it cannot meet its promises (specifications) • An error is part of a system state that may lead to a failure • A fault is the cause of the error • Fault-Tolerance: the system can provide services even in the presence of faults • Faults can be: • Transient (appear once and disappear) • Intermittent (appear-disappear behavior) • Permanent (appear and persist until repaired)

  9. Failure Models • Different types of failures.

  10. Failure Masking by Redundancy • Redundancy is the key technique for hiding failures • Redundancy types: • Information: add extra (control) information • Example: error-correction codes in messages • Time: perform an action repeatedly until it succeeds • Example: redoing an aborted transaction • Physical: add extra components (S/W & H/W) • Example: electronic circuits

  11. Example – Redundancy in Circuits • Triple modular redundancy.
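Triple modular redundancy can be sketched in a few lines: run three copies of a stage and let a voter pass on whatever at least two of them agree on. The function names and the way a fault is injected below are assumptions for illustration, not part of the slides.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value at least two of the three inputs agree on, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def tmr_stage(compute, x, faulty_replica=None):
    """Run three replicas of `compute`; optionally corrupt one replica's output."""
    outputs = []
    for replica in range(3):
        out = compute(x)
        if replica == faulty_replica:
            out = ~out              # simulated fault: bitwise-inverted result
        outputs.append(out)
    return majority_vote(*outputs)
```

With one faulty replica the voter still outputs the correct value; if all three inputs differ it has no majority and reports `None`.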

  12. Fault Tolerance Chapter 7 Part II Process Resilience

  13. Process Resilience • Mask process failures by replication • Organize processes into groups • A message sent to a group is delivered to all members • If a member fails, another should fill in • Groups can be created and deleted dynamically • A process can join or leave several groups

  14. Flat Groups versus Hierarchical Groups • Communication in a flat group. • Communication in a simple hierarchical group

  15. Group Management • Need implementation for: creating/removing groups and joining/leaving groups • Solution 1: group server • Centralized solution, single point of failure • Simple • Solution 2: distributed • No single point of failure • Create/remove group ??? • Synchronization

  16. Process Replication • Replicate a process and group replicas in one group • How many replicas do we create? • A system is k fault-tolerant if it can survive and function even if it has k faulty processes • k+1 replicas for crash failures • 2k+1 replicas for Byzantine failures
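The replica counts on this slide follow from how a client picks an answer. A rough sketch, assuming crashed replicas simply return nothing (modelled as `None`) and Byzantine replicas may return any value; the function names are hypothetical.

```python
from collections import Counter


def replicas_needed(k: int, byzantine: bool) -> int:
    """Group size needed to survive k faulty members (rule from the slide)."""
    return 2 * k + 1 if byzantine else k + 1


def group_answer(replies, byzantine: bool):
    """Crash model: any reply that arrived is correct. Byzantine model: majority wins."""
    if not byzantine:
        return next(r for r in replies if r is not None)   # None = crashed replica
    return Counter(replies).most_common(1)[0][0]
```

With k = 2, three replicas suffice for crash failures because one survivor is enough, while five are needed for Byzantine failures so that the three correct replies outvote the two faulty ones.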

  17. Agreement • Need agreement in DS: • Leader, commit, synchronize • Distributed Agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps • Perfect processes, faulty channels: two-army • Faulty processes, perfect channels: Byzantine generals

  18. Two-Army Problem

  19. Impossible Consensus • Agreement is impossible in asynchronous DS, even if only one process fails [Fischer et al.] • Asynchronous DS: cannot distinguish a slow process from a crashed one

  20. Possible Consensus • Agreement is possible in synchronous DS [e.g., Lamport et al.] • Byzantine Generals Problem • Synchronous DS: can distinguish a slow process from a crashed one

  21. Byzantine Generals Problem    

  22. Byzantine Generals – Example (1) • The Byzantine generals problem for 3 loyal generals and 1 traitor. • (a) The generals announce their troop strengths (in units of 1 kilosoldier). • (b) The vectors that each general assembles based on (a). • (c) The vectors that each general receives in step 3.

  23. Byzantine Generals – Example (2) • The same as in the previous slide, except now with 2 loyal generals and 1 traitor.

  24. Byzantine Generals • Given three processes, if one is faulty, consensus is impossible • Given N processes, of which F are faulty, consensus is impossible if N ≤ 3F
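The two examples can be replayed with a small simulation of the vector-exchange algorithm: announce values, assemble vectors, exchange vectors, then take a per-element majority. This is a sketch under simplifying assumptions: there is exactly one traitor, and it lies by sending a different junk number in every message. The function name and data layout are hypothetical.

```python
from collections import Counter

UNKNOWN = None


def byzantine_agreement(strengths, traitor):
    """strengths: general id -> true troop strength; traitor: id of the faulty general.
    Returns, for each loyal general, the vector it decides on."""
    ids = sorted(strengths)
    lie = iter(range(1000, 10**6))   # assumption: the traitor never repeats a value

    # Steps 1-2: every general announces its strength; each assembles a vector.
    vector = {
        g: {s: strengths[s] if (s != traitor or s == g) else next(lie) for s in ids}
        for g in ids
    }

    decided = {}
    for g in ids:
        if g == traitor:
            continue
        # Step 3: g receives the vector of every other general (junk from the traitor).
        received = [
            {s: next(lie) for s in ids} if h == traitor else vector[h]
            for h in ids if h != g
        ]
        # Step 4: per element, keep the value that a strict majority reported.
        decided[g] = {}
        for s in ids:
            value, count = Counter(v[s] for v in received).most_common(1)[0]
            decided[g][s] = value if count > len(received) / 2 else UNKNOWN
    return decided
```

With four generals and one traitor, every loyal general ends up with the same vector, in which only the traitor's entry is unknown. With three generals and one traitor, no element reaches a majority, which matches N ≤ 3F.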

  25. Fault Tolerance Chapter 7 Part III Reliable Communication

  26. Reliable Client/Server Communication

  27. Reliable Client/Server Communication • Channels may exhibit crash, omission, timing, and arbitrary failures • Point-to-point communication: use TCP channels • TCP masks omission failures • TCP does not mask crash failures • The distributed system itself may mask them

  28. Reliable RPC/RMI • Client unable to locate server • Lost request from client to server • Server crashes after receiving client request • Server reply to client is lost • Client crashes after sending request

  29. 1. Client unable to locate server • Server is down • Client has outdated proxy • Solution: raise exception • Not all languages have exception handling • Exceptions break location/failure transparency

  30. 2. Lost request from client to server • Lost message • Solution: timeouts • OS or proxy starts a timer • If the timer expires before a reply or ack arrives, resend • Server must detect duplicate messages • If too many requests are lost, client might conclude that server is down (back to 1.)
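The timeout-and-resend loop, together with duplicate detection on the server, might look like the sketch below. The lossy channel is simulated by a list saying what gets dropped on each attempt; `DedupServer`, `call`, and `ServerDown` are hypothetical names, and real code would use an actual timer and network.

```python
class ServerDown(Exception):
    """Raised when the client gives up after too many lost messages."""


class DedupServer:
    def __init__(self):
        self.replies = {}      # request id -> reply already computed
        self.executions = 0    # how many times the operation really ran

    def handle(self, request_id, payload):
        if request_id in self.replies:          # duplicate: answer from cache
            return self.replies[request_id]
        self.executions += 1
        self.replies[request_id] = payload.upper()
        return self.replies[request_id]


def call(server, request_id, payload, losses, max_attempts=5):
    """losses[i] is 'request', 'reply', or None: what the channel drops on attempt i."""
    for attempt in range(max_attempts):
        lost = losses[attempt] if attempt < len(losses) else None
        if lost == "request":
            continue                            # timer expires, resend
        reply = server.handle(request_id, payload)
        if lost == "reply":
            continue                            # server ran, but client saw nothing
        return reply
    raise ServerDown(f"no reply after {max_attempts} attempts")
```

Because every resend carries the same request id, the server executes the operation once even when the client has to send it three times.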

  31. 3. Server crashes after receiving request – Problem • A server in client-server communication: • (a) Normal case • (b) Crash after execution (should raise exception) • (c) Crash before execution (should re-transmit request) • The client cannot tell whether (b) or (c) occurred

  32. 3. Server crashes after receiving request – Solutions • At-least-once semantics: keep resending the request until the RPC/RMI has been carried out at least once • At-most-once semantics: the RPC/RMI is carried out once or not at all • No guarantees: the client is on its own • Exactly-once semantics: ideal, but impossible to guarantee in general • Example: • M: completion message • P: print text • C: crash

  33. 3. Server crashes after receiving request – Example • Different combinations of client and server strategies in the presence of server crashes.
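The combinations can be enumerated mechanically using the M/P/C notation of the previous slide. The sketch below assumes four client reissue policies (always, never, only when the completion message was received, only when it was not) and that a reissued request always runs to completion. The policy names and function names are made up for this illustration.

```python
POLICIES = ["always", "never", "when_acked", "when_not_acked"]


def crash_points(order: str):
    """For server order 'MP' or 'PM', the events completed before the crash."""
    return [order[:n] for n in range(3)]        # nothing, first event, both events


def outcome(done_before_crash: str, policy: str) -> str:
    printed = "P" in done_before_crash          # text was printed before the crash
    acked = "M" in done_before_crash            # completion message reached the client
    reissue = {
        "always": True,
        "never": False,
        "when_acked": acked,
        "when_not_acked": not acked,
    }[policy]
    prints = int(printed) + int(reissue)
    return {0: "ZERO", 1: "OK", 2: "DUP"}[prints]


def always_ok(order: str, policy: str) -> bool:
    return all(outcome(done, policy) == "OK" for done in crash_points(order))


for order in ("MP", "PM"):
    for policy in POLICIES:
        row = [outcome(done, policy) for done in crash_points(order)]
        print(order, policy, row)
```

No pairing of server order and client policy yields `OK` at every crash point, which is why exactly-once semantics cannot be guaranteed here.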

  34. 4. Server reply to client is lost • Lost message • Slow server • Solution: timeouts • Works with lost messages • Works with slow servers if the operation is idempotent • With stateful servers, let the server detect duplicate requests
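Why idempotency matters for blind retransmission is easy to see on a toy example. The `Account` class and its two operations are assumptions chosen for illustration: one operation can be repeated safely, the other cannot.

```python
class Account:
    def __init__(self):
        self.balance = 0

    def set_balance(self, amount):   # idempotent: repeating it changes nothing
        self.balance = amount

    def deposit(self, amount):       # not idempotent: each repeat adds again
        self.balance += amount


def replay(operation, amount, times):
    """Apply the same request `times` times, as a retrying client would."""
    account = Account()
    for _ in range(times):
        operation(account, amount)
    return account.balance
```

Replaying `set_balance` twice leaves the same balance as running it once, whereas replaying `deposit` doubles the amount. The second kind of operation is the one that needs duplicate detection on the server.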

  35. 5. Client Crashes after Sending Request • Orphan: a computation with a dead parent • Orphans waste CPU cycles, hold locks, etc. • Extermination: the proxy logs each request in safe storage before sending it. When the client reboots, orphans are killed • Reincarnation: the client divides time into numbered epochs. When the client reboots, it announces a new epoch. Orphans are killed • Gentle reincarnation: orphans are killed only if their parent cannot be found • Expiration: RPCs/RMIs need to renew leases • None of these is desirable in practice
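Reincarnation can be sketched from the server's point of view: remember the latest epoch each client has announced and kill anything started under an older one. `EpochServer` and its methods are hypothetical names, and "killing" is modelled by removing the job from a list.

```python
class EpochServer:
    def __init__(self):
        self.current_epoch = {}   # client -> highest epoch announced so far
        self.running = []         # (client, epoch, job) for computations in progress

    def start(self, client, epoch, job):
        if epoch < self.current_epoch.get(client, 0):
            return False          # request from a previous incarnation: refuse
        self.running.append((client, epoch, job))
        return True

    def announce_epoch(self, client, epoch):
        """Called when a rebooted client broadcasts its new epoch number."""
        self.current_epoch[client] = epoch
        orphans = [r for r in self.running if r[0] == client and r[1] < epoch]
        self.running = [r for r in self.running if r not in orphans]
        return orphans            # the computations that were killed
```

Gentle reincarnation would differ only in `announce_epoch`: instead of killing every old computation, the server would first try to locate its owner.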

  36. Reliable Group Communication

  37. Reliable Group Communication • For simplicity, assume a group is static and processes do not fail • Reliable communication = deliver the message to all group members • Any order delivery • Ordered delivery

  38. Basic Reliable-Multicasting Schemes • A simple solution to reliable multicasting when all receivers are known and are assumed not to fail • Message transmission • Reporting feedback

  39. Scalability Issues • Too many ACK messages => performance problems • Solution 1: only send negative ACKs (NACKs) • Open question: how long should a sender buffer a message before discarding it? • Solution 2: Feedback Suppression • Scalable Reliable Multicasting (SRM) • Solution 3: Hierarchical Feedback Control

  40. Nonhierarchical Feedback Control • Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
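The suppression effect can be modelled with one line of timing logic. In this sketch each receiver that missed a message has scheduled its NACK after some delay, and a multicast NACK reaches everyone after a fixed propagation time. Both the function name and the uniform propagation delay are simplifying assumptions.

```python
def nacks_sent(delays, propagation=0.0):
    """delays: receiver -> delay before its scheduled NACK fires.
    Returns the receivers that actually send a NACK."""
    first = min(delays.values())
    # A receiver stays silent if it hears the earliest NACK before its own timer fires.
    return sorted(r for r, d in delays.items() if d <= first + propagation)
```

With instantaneous delivery only the earliest receiver sends. The longer a NACK takes to propagate, the more receivers fire before they can be suppressed, which is why SRM-style schemes have to tune the delay range carefully.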

  41. Hierarchical Feedback Control • The essence of hierarchical reliable multicasting. • Each local coordinator forwards the message to its children. • A local coordinator handles retransmission requests.

  42. Atomic Multicast • Need to achieve reliable communication in the presence of process failures • Transaction-like delivery: deliver m to all processes in a group or to none of them • Called atomic multicast • Example: assignment 3 • Update must be done at all replicas • What if a replica crashes? • Keep recovery log events locally until replica comes back • Do not perform update (Atomic multicast)

  43. Virtual Synchrony (1): DS Logical Organization • The logical organization of a distributed system to distinguish between message receipt and message delivery

  44. Virtual Synchrony (2) • The principle of virtual synchronous multicast.

  45. Message Ordering (1) • Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

  46. Message Ordering (2) • Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting
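FIFO-ordered multicasting is commonly implemented with per-sender sequence numbers and a hold-back queue. This is also where the difference between receiving and delivering a message becomes visible. A minimal sketch, with the class name and message labels chosen only for illustration:

```python
from collections import defaultdict


class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next sequence number to deliver
        self.held = defaultdict(dict)      # sender -> {seq: message} received early
        self.delivered = []                # messages handed to the application

    def receive(self, sender, seq, message):
        """Receipt: buffer the message, then deliver whatever is now in order."""
        self.held[sender][seq] = message
        while self.next_seq[sender] in self.held[sender]:
            in_order = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append(in_order)
            self.next_seq[sender] += 1
```

Messages from one sender are delivered in the order they were sent, but nothing constrains how messages from different senders interleave, so two receivers may legitimately see different overall orders.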

  47. Implementing Virtual Synchrony (1) • Six different versions of virtually synchronous reliable multicasting.

  48. Implementing Virtual Synchrony (2) • Process 4 notices that process 7 has crashed, sends a view change • Process 6 sends out all its unstable messages, followed by a flush message • Process 6 installs the new view when it has received a flush message from everyone else

  49. Fault Tolerance Chapter 7 Part IV Recovery

  50. Recovery: Stable Storage • Stable storage • Crash after drive 1 is updated • Bad spot
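The three situations named on this slide can be reproduced with two in-memory "drives" that store each block together with a checksum. This is a sketch under stated assumptions: drive 1 is always written first, a bad spot shows up as a checksum mismatch, and the class and method names are hypothetical.

```python
import zlib


def make_block(data: bytes):
    return (data, zlib.crc32(data))            # block = payload plus checksum


def is_good(block) -> bool:
    return zlib.crc32(block[0]) == block[1]


class StableStorage:
    def __init__(self, n_blocks: int):
        self.drive1 = [make_block(b"") for _ in range(n_blocks)]
        self.drive2 = [make_block(b"") for _ in range(n_blocks)]

    def write(self, i: int, data: bytes, crash_after_drive1: bool = False):
        self.drive1[i] = make_block(data)      # drive 1 is always updated first
        if crash_after_drive1:
            return                             # simulated crash between the two writes
        self.drive2[i] = make_block(data)

    def recover(self):
        """Compare the drives block by block and repair the one that is wrong."""
        for i, (b1, b2) in enumerate(zip(self.drive1, self.drive2)):
            if not is_good(b1):
                self.drive1[i] = b2            # bad spot on drive 1: copy from drive 2
            elif not is_good(b2) or b1 != b2:
                self.drive2[i] = b1            # drive 1 holds the newer, valid data

    def read(self, i: int) -> bytes:
        block = self.drive1[i] if is_good(self.drive1[i]) else self.drive2[i]
        return block[0]
```

After a crash between the two writes, recovery propagates the newer block from drive 1 to drive 2. After a bad spot on one drive, the good copy on the other drive is used to repair it.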
