
Failure Detectors: A Perspective



  1. Failure Detectors: A Perspective Sam Toueg University of Toronto

  2. Context: Distributed Systems with Failures • Group Membership • Group Communication • Atomic Broadcast • Primary/Backup systems • Atomic Commitment • Consensus • Leader Election • … In such systems, applications often need to determine which processes are up (operational) and which are down (crashed). This service is provided by a Failure Detector (FD). FDs are at the core of many fault-tolerant algorithms and applications. FDs are found in many systems: e.g., ISIS, Ensemble, Relacs, Transis, air traffic control systems.

  3. Failure Detectors An FD is a distributed oracle that provides hints about the operational status of processes. However: • Hints may be incorrect • FD may give different hints to different processes • FD may change its mind (over & over) about the operational status of a process
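To make the oracle concrete, here is a minimal sketch (not from the talk; all names are illustrative) of the interface such an FD module exposes to its local application:

    class FailureDetector:
        """A local FD module: a distributed oracle that gives hints, not facts."""

        def __init__(self):
            self._suspected = set()  # current (possibly incorrect) hints

        def suspect(self, pid):
            # The FD may start suspecting a process (e.g., on a timeout)...
            self._suspected.add(pid)

        def trust(self, pid):
            # ...and may change its mind (over & over) when it hears from it.
            self._suspected.discard(pid)

        def is_suspected(self, pid) -> bool:
            # A hint only: other processes' FD modules may answer differently.
            return pid in self._suspected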

  4. [Figure: processes p, q, r, s, t each query their local FD and get different, changing suspicion lists; a slow process is wrongly suspected.]

  5. Talk Outline • Using FDs to solve consensus • Broadening the use of FDs • Putting theory into practice

  6. [Figure: Consensus among p, q, r, s, t. Processes propose values (5, 7, 8, 2, 5); one process crashes; all the surviving processes decide the same value, 5.]

  7. Consensus A paradigm for reaching agreement despite failures • Equivalent to Atomic Broadcast • Can be used to solve Atomic Commitment • Can be used to solve Group Membership • ….

  8. Solving Consensus • In synchronous systems: Possible • In asynchronous systems: Impossible [FLP83] • even if: • at most one process may crash, and • all links are reliable

  9. Why this difference? • In synchronous systems: timeouts can be used to determine with certainty whether a process has crashed => perfect failure detection is possible • In asynchronous systems: one cannot determine with certainty whether a process has crashed (it may just be slow, or its messages may be delayed) => perfect failure detection is impossible

  10. Solving Consensus with Failure Detectors Is perfect failure detection necessary for consensus? No. The failure detector ◇S suffices: • ◇S can be used to solve consensus [CT91] • ◇S is the weakest FD that can solve consensus [CHT92] ◇S may initially output arbitrary information, but there is a time after which: • every process that crashes is suspected (completeness) • some process that does not crash is never suspected (accuracy)
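As an illustration (a sketch, not part of the talk): in a partially synchronous system, timeouts that grow after each false suspicion give these guarantees. The sketch below in fact yields eventually perfect detection, which is stronger than ◇S and hence suffices; all names are illustrative.

    import time

    class EventualFD:
        """Sketch: a timeout-based FD that doubles a process's timeout after
        each false suspicion. Once timeouts exceed the (unknown) bounds on
        message delays, correct processes are no longer suspected."""

        def __init__(self, processes, initial_timeout=1.0):
            self.timeout = {p: initial_timeout for p in processes}
            self.last_heard = {p: time.monotonic() for p in processes}
            self.suspected = set()

        def on_message(self, p):
            self.last_heard[p] = time.monotonic()
            if p in self.suspected:        # we suspected a live process:
                self.suspected.discard(p)  # change our mind, and
                self.timeout[p] *= 2       # be more patient next time

        def poll(self):
            now = time.monotonic()
            for p, heard in self.last_heard.items():
                if now - heard > self.timeout[p]:
                    self.suspected.add(p)
            return self.suspected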

  11. [Figure: if an FD D can be used to solve consensus, then each process p, q, r, s, t can transform its module of D into a module of ◇S, i.e., D can be transformed into ◇S.]

  12. Solving Consensus using ◇S: Rotating Coordinator Algorithms • Processes are numbered 1, 2, …, n • They execute asynchronous rounds • In round r, the coordinator is process (r mod n) + 1 • In round r, the coordinator tries to impose its estimate as the consensus value; it succeeds if it does not crash and it is not suspected by ◇S • These algorithms work for up to f < n/2 crashes

  13. A Consensus Algorithm using ◇S (Mostefaoui and Raynal 1999)
  Every process p sets its estimate to its initial value.
  for rounds r := 0, 1, 2, … do {round r msgs are tagged with r}
  • the coordinator c of round r sends its estimate v to all
  • every p waits until (a) it receives v from c, or (b) it suspects c (according to ◇S); if (a) then it sends v to all, if (b) then it sends ? to all
  • every p waits until it receives a msg (v or ?) from n-f processes; if it received at least (n+1)/2 msgs v, then it decides v; if it received at least one msg v, then estimate := v; if it received only ? msgs, then it does nothing
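The decision rule in the final step is the heart of the algorithm. A small sketch of just that rule (message transport and the ◇S module abstracted away; `msgs` stands for the n-f messages a process collected in the round):

    def round_outcome(msgs, n, estimate):
        # msgs: the n-f messages received this round, each a value v or '?'.
        # All v messages in a round carry the same value (the coordinator's
        # estimate), so taking the first one is safe.
        values = [m for m in msgs if m != '?']
        if len(values) >= (n + 1) // 2:
            return ('decide', values[0])   # a majority reported v
        if values:
            return ('adopt', values[0])    # at least one v: estimate := v
        return ('keep', estimate)          # only '?': do nothing

    # Example with n = 7, f = 3 (each process hears from n - f = 4 others):
    print(round_outcome([5, 5, 5, 5], 7, 2))        # ('decide', 5)
    print(round_outcome([5, '?', '?', '?'], 7, 2))  # ('adopt', 5)
    print(round_outcome(['?'] * 4, 7, 2))           # ('keep', 2)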

  14. Why does it work? Agreement (example: n = 7, f = 3): if some process p decides v, then p received v from at least (n+1)/2 = 4 processes; every other process hears from n - f = 4 processes, and since 4 + 4 > 7, it receives at least one msg v, so every q changes its estimate to v; from then on, v is the only possible decision.

  15. Why does it work? • Termination: • With ◇S, no process blocks forever waiting for a message from a dead coordinator • With ◇S, eventually some correct process c is not falsely suspected; when c becomes the coordinator, every process receives c's estimate and decides

  16. What Happens if the Failure Detector Misbehaves? • The consensus algorithm is: • Safe -- always! • Live -- during "good" FD periods [Figure: successive instances Consensus 1, Consensus 2, Consensus 3 remain safe throughout and terminate during good periods.]

  17. Failure Detector Abstraction Some advantages: • Increases the modularity and portability of algorithms • Suggests why consensus is not so difficult in practice • Determines the minimal information about failures needed to solve consensus • Encapsulates various models of partial synchrony

  18. Failure Detection Abstraction Initially, applicability was limited: • Model: FLP only • process crashes only • a crash is permanent (no recovery possible) • no link failures (no msg losses) • Problems solved: consensus, atomic broadcast only

  19. Talk Outline • Using FDs to solve consensus • Broadening the use of FDs • Putting theory into practice

  20. Broadening the Applicability of FDs Other models: • Crashes + link failures (fair links) • Network partitioning • Crash/recovery • Byzantine (arbitrary) failures • FDs + randomization Other problems: • Atomic Commitment • Group Membership • Leader Election • k-set Agreement • Reliable Communication

  21. Talk Outline • Using FDs to solve consensus • Broadening the use of FDs • Putting theory into practice

  22. Putting Theory into Practice In practice: • FD implementations need to be message-efficient ==> FDs with linear msg complexity (ring, hierarchical, gossip) • "Eventual" guarantees are not sufficient ==> FDs with QoS guarantees • Failure detection should be easily available ==> a shared FD service (with QoS guarantees)

  23. On Failure Detectors with QoS Guarantees

  24. A Simple FD Problem: q monitors p. p sends heartbeats to q; heartbeats can be lost or delayed. Probabilistic model: • pL: probability of heartbeat loss • D: heartbeat delay (a random variable)
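A toy simulation of this model (parameter names are illustrative; the exponential delay matches the example given later in the talk):

    import random

    def simulate_heartbeats(n, eta, p_loss, mean_delay, seed=1):
        """Each heartbeat i is sent at time i*eta, lost with probability
        p_loss, and otherwise delayed by an exponential D with the given mean."""
        rng = random.Random(seed)
        arrivals = {}
        for i in range(n):
            if rng.random() >= p_loss:  # heartbeat i was not lost
                arrivals[i] = i * eta + rng.expovariate(1 / mean_delay)
        return arrivals  # heartbeat number -> arrival time at q

    # e.g., 1000 heartbeats, eta = 1 sec, pL = 0.01, E(D) = 0.02 sec:
    print(len(simulate_heartbeats(1000, 1.0, 0.01, 0.02)))  # roughly 990 arrive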

  25. Typical FD Behavior [Figure: while p is up, the FD at q alternates between trust and suspect (mistakes); once p goes down, the FD eventually suspects p permanently.]

  26. QoS of Failure Detectors The QoS specification of an FD quantifies: • how fast it detects actual crashes • how well it avoids mistakes (i.e., false detections) What QoS metrics should we use?

  27. Detection Time • TD: the time to detect a crash [Figure: p crashes while trusted; TD runs from the crash until the FD's suspicion becomes permanent.]

  28. Accuracy Metrics • TMR: time between two consecutive mistakes • TM: duration of a mistake [Figure: p is up throughout; each mistake lasts TM, and consecutive mistakes start TMR apart.]

  29. Another Accuracy Metric The application queries the FD at random times. • PA: the probability that the FD is correct at a random time
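All three metrics can be measured from a trace of the FD's output while p is up. A sketch (a hypothetical helper, assuming the trace contains at least two complete mistakes):

    def accuracy_metrics(transitions, start, end):
        """transitions: time-sorted (t, 'suspect' | 'trust') events at q
        while p is up; [start, end] is the observation window."""
        mistake_durations, mistake_starts = [], []
        began = None
        for t, output in transitions:
            if output == 'suspect' and began is None:
                began = t
                mistake_starts.append(t)
            elif output == 'trust' and began is not None:
                mistake_durations.append(t - began)  # one mistake, length TM
                began = None
        e_tm = sum(mistake_durations) / len(mistake_durations)
        # TMR: start-to-start gap between consecutive mistakes
        gaps = [b - a for a, b in zip(mistake_starts, mistake_starts[1:])]
        e_tmr = sum(gaps) / len(gaps)
        # PA: fraction of the window during which the FD (correctly) trusts p
        p_a = 1 - sum(mistake_durations) / (end - start)
        return e_tm, e_tmr, p_a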

  30. A Common FD Algorithm

  31. A Common FD Algorithm [Figure: p sends heartbeats; at each heartbeat arrival, q restarts a timer of length TO; q suspects p whenever the timer expires.] • Timing out also depends on the previous heartbeat

  32. Large Detection Time • TD depends on the delay of the last heartbeat sent by p [Figure: p crashes just after sending a heartbeat; q's timeout is anchored at that heartbeat's arrival, so a large delay stretches TD.]
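A sketch of this common algorithm (names are illustrative), showing why the detection time is tied to the arrival of the last heartbeat:

    import time

    class TimeoutFD:
        """q restarts a timer of length TO at each heartbeat arrival and
        suspects p whenever the timer expires."""

        def __init__(self, TO):
            self.TO = TO
            self.last_arrival = time.monotonic()

        def on_heartbeat(self):
            # Timing out depends on the previous heartbeat: the timer is
            # anchored at its *arrival*, so a slow last heartbeat pushes
            # the suspicion time (and hence TD) out with it.
            self.last_arrival = time.monotonic()

        def suspects(self):
            return time.monotonic() - self.last_arrival > self.TO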

  33. A New FD Algorithm and its QoS

  34. New FD Algorithm p sends heartbeats h_(i-1), h_(i), h_(i+1), h_(i+2), …; q fixes corresponding freshness points τ_(i-1), τ_(i), τ_(i+1), τ_(i+2), … • At any time t ∈ [τ_(i), τ_(i+1)), q trusts p iff it has received heartbeat h_(i) or higher
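A sketch of this algorithm, under the assumption (from the underlying QoS work of Chen, Toueg, and Aguilera) that heartbeat h_(i) is sent at time σ_(i) = σ_0 + i·η and its freshness point is τ_(i) = σ_(i) + δ for a fixed shift δ:

    class FreshnessFD:
        """q trusts p during [tau_i, tau_{i+1}) iff some heartbeat h_j with
        j >= i has arrived, where tau_i = sigma0 + i*eta + delta."""

        def __init__(self, eta, delta, sigma0=0.0):
            self.eta, self.delta, self.sigma0 = eta, delta, sigma0
            self.highest = -1  # highest heartbeat number received so far

        def on_heartbeat(self, i):
            self.highest = max(self.highest, i)

        def trusts(self, now):
            # Index of the current freshness interval [tau_i, tau_{i+1}):
            i = int((now - self.sigma0 - self.delta) // self.eta)
            return self.highest >= i

Note that the times at which q may start suspecting p (the freshness points) are fixed in advance, independent of when the last heartbeat happened to arrive; this is what bounds the detection time on the next slide.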

  35. Detection Time is Bounded [Figure: p crashes just after sending h_(i); q trusts p at most until the freshness point τ_(i+1), so TD is bounded and does not depend on the delay of the last heartbeat.]

  36. Optimality Result Among all FD algorithms with the same heartbeat rate and detection time, this FD has the best query accuracy probability PA

  37. QoS Analysis • Given: • the system behavior pL and Pr(D ≤ t) • the parameters η and δ of the FD algorithm • Can compute the QoS of this FD algorithm: • Max detection time TD • Average time between mistakes E(TMR) • Average duration of a mistake E(TM) • Query accuracy probability PA


  39. Satisfying QoS Requirements • Given a set of QoS requirements: • Compute η and δ to achieve these requirements

  40. Computing FD Parameters to Achieve the QoS • Assume pL and Pr(D ≤ x) are known • Problem to be solved: find η and δ such that TD ≤ T_D^U, E(TM) ≤ T_M^U, and E(TMR) ≥ T_MR^L

  41. Configuration Procedure • Step 1: compute … and let … • Step 2: let …, and find the largest η_max that satisfies … • Step 3: set … [the closed-form expressions for each step appeared as formulas on the slide]

  42. [Figure: the Configurator takes the probabilistic behavior of heartbeats (pL, Pr(D ≤ x)) and the QoS requirements (T_D^U, T_M^U, T_MR^L), and outputs the parameters η and δ to the Failure Detector.]

  43. Example • Probability of heartbeat loss: pL = 0.01 • Heartbeat delay D is exponentially distributed with average delay E(D) = 0.02 sec QoS requirements: • Detect a crash within 30 sec • At most one mistake per month (on average) • A mistake is corrected within 60 sec (on average) Algorithm parameters: • Send a heartbeat every η = 9.97 sec • Set the shift to δ = 20.03 sec

  44. If the System Behavior is Not Known If pL and Pr(D ≤ x) are not known: • estimate pL, E(D), V(D) using the received heartbeats • use E(D) and V(D) instead of Pr(D ≤ x) in the configuration procedure
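A sketch of the estimation step (assuming q knows the sending schedule and clocks are synchronized, a simplification):

    def estimate_behavior(arrivals, eta, sigma0=0.0):
        """arrivals: heartbeat number -> arrival time (e.g., as produced by
        the simulation sketch above). Returns estimates of pL, E(D), V(D)."""
        n_sent = max(arrivals) + 1  # approximated by the highest index received
        p_loss = 1 - len(arrivals) / n_sent
        delays = [t - (sigma0 + i * eta) for i, t in arrivals.items()]
        e_d = sum(delays) / len(delays)  # sample mean of D
        v_d = sum((d - e_d) ** 2 for d in delays) / len(delays)  # sample variance
        return p_loss, e_d, v_d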

  45. [Figure: an Estimator of the probabilistic behavior of heartbeats computes pL, E(D), V(D); the Configurator combines these with the QoS requirements (T_D^U, T_M^U, T_MR^L) and outputs η and δ to the Failure Detector.]

  46. Example • Probability of heartbeat loss: pL = 0.01 • The distribution of the heartbeat delay D is not known, but E(D) = V(D) = 0.02 sec are known QoS requirements: • Detect a crash within 30 sec • At most one mistake per month (on average) • A mistake is corrected within 60 sec (on average) Algorithm parameters: • Send a heartbeat every η = 9.71 sec • Set the shift to δ = 20.29 sec

  47. A Failure Detector Service with QoS Guarantees

  48. Approaches to Failure Detection • Currently: • each application implements its own FD • there is no systematic way of setting timeouts and sending rates • We propose an FD as a shared service: • continuously running on every host • can detect process and host crashes • provides failure information to all applications

  49. Advantages of a Shared FD Service • Sharing: • applications can concurrently use the same FD service • merging FD messages can decrease network traffic • Modularity: • a well-defined API (see the sketch below) • different FD implementations may be used in different environments • Reduced implementation effort: • programming fault-tolerant applications becomes easier
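What might that API look like? A purely illustrative sketch; these names and signatures are assumptions, not the service's actual interface:

    from dataclasses import dataclass

    @dataclass
    class QoS:
        max_detection_time: float         # T_D^U, in seconds
        max_mistake_duration: float       # T_M^U, average, in seconds
        min_time_between_mistakes: float  # T_MR^L, average, in seconds

    class SharedFDService:
        def monitor(self, process_id: str, qos: QoS) -> None:
            """Start monitoring a process with the given QoS; the service
            derives eta and delta itself (no user-visible timeouts)."""
            ...

        def is_suspected(self, process_id: str) -> bool:
            """Query the service's current hint about a monitored process."""
            ...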

  50. Advantages of a Shared FD Service with QoS • QoS guarantees: • applications can specify their desired QoS • applications do not need to set operational FD parameters (e.g., timeouts and sending rates) • Adaptivity: • adapts to changing network conditions (message delays and losses) • adapts to changing QoS requirements
