Unreliable Failure Detectors for Reliable Distributed Systems

T. Chandra, Sam Toueg, 1996.

Presentation Transcript


  1. Unreliable Failure Detectors for Reliable Distributed Systems • T. Chandra, Sam Toueg, 1996

  2. Why Fault Tolerance

  3. Fault Tolerance in Asynchronous Distributed Systems • Problem: • Deterministic consensus is impossible in an asynchronous system if even one process may crash (FLP, 1985) • Solution: • Randomization • Failure Detectors

  4. Unreliable Failure Detectors • A distributed oracle that provides hints about the operational status of processes • Hints may be incorrect • Different hints to different processors • May change its mind over time • Mistakes should not make the system behave incorrectly!

  5. Example • Timeout based • Every proc periodically sends an “I am alive” msg • If none is received from a proc within the timeout, suspect that proc • If one is received later, remove the proc from the suspected list
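
The timeout scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the names `HeartbeatDetector`, `on_heartbeat`, `suspected` and `timeout` are hypothetical, and the current time is passed in explicitly (rather than read from a clock) so the example is deterministic.

```python
class HeartbeatDetector:
    """Timeout-based failure detector run by one monitoring process (sketch)."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Assumption: treat t = 0 as the moment every peer was last heard from.
        self.last_heard = {peer: 0.0 for peer in peers}

    def on_heartbeat(self, peer, now):
        """Record an "I am alive" message from `peer` received at time `now`."""
        self.last_heard[peer] = now

    def suspected(self, now):
        """Answer a query: peers not heard from within the last `timeout`."""
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


fd = HeartbeatDetector(peers=["q", "r"], timeout=3.0)
fd.on_heartbeat("q", now=1.0)
fd.on_heartbeat("r", now=1.0)
fd.on_heartbeat("q", now=4.0)
print(fd.suspected(now=4.5))   # {'r'}  (r has been silent for 3.5 > 3.0)
fd.on_heartbeat("r", now=5.0)  # r was only slow, so the suspicion was a mistake
print(fd.suspected(now=5.5))   # set()
```

The second query shows the detector changing its mind: a late heartbeat removes `r` from the suspected set, which is exactly the kind of mistake an unreliable failure detector is allowed to make.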

  6. Hints may be incorrect

  7. Different hints to different processors

  8. May change its mind over time

  9. Characterizing Failure Detectors • Completeness • Suspect every process that actually crashes • Accuracy • Limit the number of correct processes that are suspected

  10. Completeness Vs Accuracy

  11. The Model • Crash failures only with no recovery • Failure Detector works in Query/Response manner • Query the FD • Response = Currently suspected procs • FD properties

  12. Completeness • Strong Completeness • Eventually, every crashed process is permanently suspected by every correct process • Weak Completeness • Eventually, every crashed process is permanently suspected by some correct process
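
One way to write the two completeness properties formally, in the style of the Chandra and Toueg paper. The symbols are notation assumed here, not taken from the slides: F is a failure pattern, crashed(F) and correct(F) are the processes that do and do not crash in F, and H(q, t) is the set of processes that q's failure detector suspects at time t.

```latex
\begin{align*}
\text{Strong completeness:}\quad
  & \exists t \;\forall p \in \mathit{crashed}(F) \;\forall q \in \mathit{correct}(F)
    \;\forall t' \ge t :\; p \in H(q, t') \\
\text{Weak completeness:}\quad
  & \exists t \;\forall p \in \mathit{crashed}(F) \;\exists q \in \mathit{correct}(F)
    \;\forall t' \ge t :\; p \in H(q, t')
\end{align*}
```

The two formulas differ only in the quantifier over q: every correct process versus some correct process.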

  13. Strong Completeness

  14. Weak Completeness

  15. Accuracy • Strong Accuracy • A process is never suspected before it crashes • Weak Accuracy • Some correct process is never suspected • Both are perpetual accuracy properties: they hold at all times

  16. Eventual Accuracy • Eventual Strong Accuracy • After a time, correct processes do not suspect correct processes • Eventual Weak Accuracy • After a time, some correct process is not suspected by any correct process
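
The four accuracy properties from slides 15 and 16 can be written side by side, in the style of the Chandra and Toueg paper. The symbols are notation assumed here, not taken from the slides: Ω is the set of all processes, F(t) is the set of processes that have crashed by time t, correct(F) is the set of processes that never crash, and H(q, t) is the set of processes that q's failure detector suspects at time t.

```latex
\begin{align*}
\text{Strong accuracy:}\quad
  & \forall t \;\forall p, q \in \Omega \setminus F(t) :\; p \notin H(q, t) \\
\text{Weak accuracy:}\quad
  & \exists p \in \mathit{correct}(F) \;\forall t \;\forall q \in \Omega \setminus F(t) :\; p \notin H(q, t) \\
\text{Eventual strong accuracy:}\quad
  & \exists t \;\forall t' \ge t \;\forall p, q \in \mathit{correct}(F) :\; p \notin H(q, t') \\
\text{Eventual weak accuracy:}\quad
  & \exists t \;\exists p \in \mathit{correct}(F) \;\forall t' \ge t \;\forall q \in \mathit{correct}(F) :\; p \notin H(q, t')
\end{align*}
```

The eventual versions only add a leading "there is a time t after which", so mistakes made before t are permitted.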

  17. Failure Detector Classes
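
Pairing each of the two completeness properties with each of the four accuracy properties gives eight classes. The lookup below is a small sketch of that grid using the class names from the Chandra and Toueg paper; the identifiers `FD_CLASSES` and `fd_class` are hypothetical helpers introduced for illustration.

```python
# (completeness, accuracy) -> failure detector class
FD_CLASSES = {
    ("strong", "strong"): "P (Perfect)",
    ("strong", "weak"): "S (Strong)",
    ("strong", "eventual strong"): "◇P (Eventually Perfect)",
    ("strong", "eventual weak"): "◇S (Eventually Strong)",
    ("weak", "strong"): "Q",
    ("weak", "weak"): "W (Weak)",
    ("weak", "eventual strong"): "◇Q",
    ("weak", "eventual weak"): "◇W (Eventually Weak)",
}


def fd_class(completeness, accuracy):
    """Look up the class that satisfies the given pair of properties."""
    return FD_CLASSES[(completeness, accuracy)]


print(fd_class("strong", "weak"))  # S (Strong)
```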

  18. The Consensus Problem • Termination: • Every correct process eventually decides • Uniform Integrity: • Every process decides at most once • Agreement: • No two correct procs decide differently • Uniform Validity: • If a proc decides v, then v was proposed by some proc
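
The four properties can be checked mechanically against the outcome of a finished run. The function below is an illustrative sketch; `check_consensus` and its argument layout are assumptions made here, not part of the talk. Termination is checked only for correct processes, since a crashed process cannot decide, and values are assumed to be hashable.

```python
def check_consensus(proposals, decisions, correct):
    """Check one finished run against the four consensus properties.

    proposals: {pid: proposed value}
    decisions: {pid: list of values that pid decided, in order}
    correct:   set of pids that never crashed
    """
    all_decided = {v for vs in decisions.values() for v in vs}
    decided_by_correct = {v for p in correct for v in decisions.get(p, [])}
    return {
        # every correct process decided at least once
        "termination": all(len(decisions.get(p, [])) >= 1 for p in correct),
        # no process (correct or not) decided more than once
        "uniform_integrity": all(len(vs) <= 1 for vs in decisions.values()),
        # correct processes decided on at most one distinct value
        "agreement": len(decided_by_correct) <= 1,
        # every decided value was proposed by some process
        "uniform_validity": all_decided <= set(proposals.values()),
    }


result = check_consensus(
    proposals={1: "a", 2: "b", 3: "c"},
    decisions={1: ["a"], 2: ["a"]},  # process 3 crashed before deciding
    correct={1, 2},
)
print(all(result.values()))  # True
```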

  19. Solving Consensus using FDs • An algorithm to solve consensus using S • S satisfies strong completeness and weak accuracy • Tolerates up to n-1 failures in an n-proc system

  20. The Algorithm • At every proc p:
        procedure propose(v_p)
          V_p ← (⊥, ⊥, ⊥, …, ⊥)
          V_p[p] ← v_p
          Δ_p ← V_p

  21. Phase 1 (asynchronous rounds r_p, 1 ≤ r_p ≤ n-1)
        for r_p ← 1 to n-1
          send (r_p, Δ_p, p) to all
          wait until [∀q: rcvd (r_p, Δ_q, q) or q ∈ D_p]   {D_p: procs currently suspected by p's FD}
          msgs[r_p] ← {(r_p, Δ_q, q) | rcvd (r_p, Δ_q, q)}
          Δ_p ← (⊥, ⊥, ⊥, …, ⊥)
          for k ← 1 to n
            if V_p[k] = ⊥ and ∃ (r_p, Δ_q, q) ∈ msgs[r_p] with Δ_q[k] ≠ ⊥ then
              V_p[k] ← Δ_q[k]
              Δ_p[k] ← Δ_q[k]

  22. Phase 2
        send V_p to all
        wait until [∀q: rcvd V_q or q ∈ D_p]   {D_p: procs currently suspected by p's FD}
        lastmsgs ← {V_q | rcvd V_q}
        for k ← 1 to n
          if ∃ V_q ∈ lastmsgs with V_q[k] = ⊥ then V_p[k] ← ⊥

  23. Phase 3 • decide (first non-⊥ component of V_p)
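
The three phases can be tried out with a small lockstep simulation. This is a sketch under simplifying assumptions, not the asynchronous algorithm itself: all processes advance round by round, a crash is described by hand as "in round r the message reaches only these recipients", and a false suspicion is modelled as a process never waiting for (or using) the suspected process's messages. The names `run_consensus`, `crashes` and `false_suspects` are hypothetical.

```python
BOT = None  # stands in for the slides' ⊥; proposals must not be None


def run_consensus(proposals, crashes=None, false_suspects=None):
    """Lockstep simulation of Phases 1-3 for processes 0..n-1.

    crashes: {q: (rnd, recipients)} - q's round-`rnd` message reaches only
        `recipients`, then q crashes (Phase 2 counts as round n).
    false_suspects: {p: set of q} - correct processes q that p wrongly
        suspects, so p proceeds without q's messages.
    Crashed processes are suspected by everyone (strong completeness); the
    caller must leave one correct process unsuspected (weak accuracy).
    """
    n = len(proposals)
    crashes = crashes or {}
    false_suspects = false_suspects or {}
    alive = set(range(n))
    V = [[BOT] * n for _ in range(n)]
    for p in range(n):
        V[p][p] = proposals[p]
    delta = [row[:] for row in V]

    def exchange(payloads, rnd):
        """Deliver one round of messages; return {p: payloads p received}."""
        inbox = {p: [] for p in alive}
        crashing = set()
        for q in alive:
            crash = crashes.get(q)
            if crash and crash[0] == rnd:
                crashing.add(q)
                recipients = crash[1]
            else:
                recipients = range(n)
            for p in recipients:
                if p in inbox and q not in false_suspects.get(p, set()):
                    inbox[p].append(payloads[q])
        alive.difference_update(crashing)
        return {p: msgs for p, msgs in inbox.items() if p in alive}

    # Phase 1: n-1 rounds, each relaying only the newly learned values.
    for rnd in range(1, n):
        for p, msgs in exchange(delta, rnd).items():
            new_delta = [BOT] * n
            for k in range(n):
                known = [d[k] for d in msgs if d[k] is not BOT]
                if V[p][k] is BOT and known:
                    V[p][k] = new_delta[k] = known[0]
            delta[p] = new_delta

    # Phase 2: drop every component that some received vector lacks.
    snapshot = [row[:] for row in V]  # messages carry copies, not live lists
    for p, msgs in exchange(snapshot, n).items():
        for k in range(n):
            if any(v[k] is BOT for v in msgs):
                V[p][k] = BOT

    # Phase 3: decide the first non-⊥ component.
    return {p: next(v for v in V[p] if v is not BOT) for p in sorted(alive)}


# Process 0 reaches only process 1 before crashing; 1 reaches only 2, then crashes.
print(run_consensus(["a", "b", "c", "d"], crashes={0: (1, [1]), 1: (2, [2])}))
# {2: 'a', 3: 'a'}

# Process 0 crashes before sending anything, so its proposal is never learned.
print(run_consensus(["a", "b", "c", "d"], crashes={0: (1, [])}))
# {1: 'b', 2: 'b', 3: 'b'}
```

In the first run the value "a" survives only because it is relayed along a chain of crashing processes, one hop per round, which illustrates why Phase 1 runs for n-1 rounds.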

  24. Consensus solved using S • Every correct process reaches Phase 3 • V_p of every proc has at least one non-⊥ component • Every correct proc decides on some non-⊥ value in Phase 3 (termination) • This non-⊥ value is proposed by some proc (uniform validity) • No process decides more than once (uniform integrity) • All procs in Phase 3 have the same vector to decide from (uniform agreement)

  25. Failure Detectors • Can be used to bridge the gap between known impossibility results and the need for practical solutions for fault-tolerant asynchronous distributed systems
