Unreliable Failure Detectors for Reliable Distributed Systems • Tushar Deepak Chandra and Sam Toueg, 1996
Fault Tolerance in Asynchronous Distributed Systems • Problem: • Deterministic consensus is impossible if even one process may crash (FLP, 1985) • Solutions: • Randomization • Failure detectors
Unreliable Failure Detectors • A distributed oracle that provides hints about the operational status of processes • Hints may be incorrect • May give different hints to different processes • May change its mind over time • Mistakes should not make the system behave incorrectly!
Example • Timeout based • Every process periodically sends an “I am alive” message • If no message arrives from a process within the timeout, suspect it • If a message arrives later, remove the process from the suspected list
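The timeout scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the class name `HeartbeatDetector`, its methods, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class HeartbeatDetector:
    """Timeout-based failure detector sketch (all names are hypothetical)."""

    def __init__(self, processes, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        now = clock()
        # Treat every process as "just heard from" when monitoring starts.
        self.last_seen = {p: now for p in processes}

    def heartbeat(self, p):
        """Record an "I am alive" message from process p."""
        self.last_seen[p] = self.clock()

    def suspects(self):
        """Answer a query: the processes currently suspected."""
        now = self.clock()
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


# Demo with a fake clock so the run is deterministic.
fake_time = [0.0]
fd = HeartbeatDetector(["p", "q"], timeout=2.0, clock=lambda: fake_time[0])
fake_time[0] = 1.0
fd.heartbeat("p")            # p reports in at t = 1.0, q stays silent
fake_time[0] = 2.5
print(fd.suspects())         # q is suspected, p is not
fd.heartbeat("q")            # a late heartbeat clears the suspicion
print(fd.suspects())
```

A slow but live process is suspected and later cleared, which is exactly the kind of mistake an unreliable failure detector is allowed to make.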
Characterizing Failure Detectors • Completeness • Suspect every process that actually crashes • Accuracy • Limit the number of correct processes that are suspected
The Model • Crash failures only with no recovery • Failure Detector works in Query/Response manner • Query the FD • Response = Currently suspected procs • FD properties
Completeness • Strong Completeness • Eventually, every crashed process is permanently suspected by every correct process • Weak Completeness • Eventually, every crashed process is permanently suspected by some correct process
Accuracy • Strong Accuracy • No process is suspected before it crashes • Weak Accuracy • Some correct process is never suspected • Both are perpetual accuracy properties: they must hold at all times
Eventual Accuracy • Eventual Strong Accuracy • There is a time after which correct processes are not suspected by any correct process • Eventual Weak Accuracy • There is a time after which some correct process is not suspected by any correct process
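To make the completeness and accuracy definitions concrete, the sketch below evaluates them on a finite recorded trace. The real properties speak about infinite runs, so as an assumption of this example "eventually" is approximated by "in the last snapshot". The function name `check_properties` and the trace format are hypothetical.

```python
import math


def check_properties(trace, crash_time):
    """trace[t][p] is the set p suspects at time t; crash_time[q] is the time
    q crashes (processes absent from crash_time are correct).
    Finite-trace approximation: "eventually" means "in the last snapshot"."""
    procs = set(trace[0])
    correct = {p for p in procs if p not in crash_time}
    crashed = procs - correct
    final = trace[-1]

    def alive(p, t):
        return crash_time.get(p, math.inf) > t

    # (t, p, q): process p, still alive at time t, suspects q at time t.
    suspicions = [(t, p, q) for t, snap in enumerate(trace)
                  for p in procs if alive(p, t) for q in snap[p]]
    ever_suspected = {q for _, _, q in suspicions}
    finally_suspected = {q for p in correct for q in final[p]}
    return {
        "strong completeness": all(q in final[p] for q in crashed for p in correct),
        "weak completeness": all(any(q in final[p] for p in correct) for q in crashed),
        "strong accuracy": all(not alive(q, t) for t, _, q in suspicions),
        "weak accuracy": bool(correct - ever_suspected),
        "eventual strong accuracy": not (correct & finally_suspected),
        "eventual weak accuracy": bool(correct - finally_suspected),
    }


# Process c crashes at time 1. At time 0, a wrongly suspects b and
# b suspects c before c has crashed.
trace = [
    {"a": {"b"}, "b": {"c"}, "c": set()},
    {"a": {"c"}, "b": {"c"}, "c": set()},
    {"a": {"c"}, "b": {"c"}, "c": set()},
]
result = check_properties(trace, {"c": 1})
for name, holds in result.items():
    print(f"{name}: {holds}")
```

In this trace only strong accuracy fails: process a is never suspected, so weak accuracy still holds despite the early mistakes.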
The Consensus Problem • Termination: • Every correct process eventually decides • Uniform Integrity: • Every process decides at most once • Agreement: • No two correct processes decide differently • Uniform Validity: • If a process decides v, then v was proposed by some process
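The four properties can be checked mechanically on the outcome of a finished run. The sketch below is only an illustration: the function name `check_consensus` and the representation of a run as a list of `(process, value)` decision events are assumptions, and termination is judged at the end of the recorded run rather than "eventually".

```python
from collections import Counter


def check_consensus(proposals, decision_events, correct):
    """proposals: {process: proposed value}
    decision_events: list of (process, value), one entry per decide step
    correct: set of processes that never crash"""
    times_decided = Counter(p for p, _ in decision_events)
    decided_by_correct = {v for p, v in decision_events if p in correct}
    return {
        "termination": all(times_decided[p] >= 1 for p in correct),
        "uniform integrity": all(c <= 1 for c in times_decided.values()),
        "agreement": len(decided_by_correct) <= 1,
        "uniform validity": all(v in proposals.values() for _, v in decision_events),
    }


proposals = {"p": 1, "q": 2, "r": 3}
good = check_consensus(proposals, [("p", 2), ("q", 2)], correct={"p", "q"})
bad = check_consensus(proposals, [("p", 2), ("q", 3), ("p", 2)], correct={"p", "q"})
print(good)
print(bad)
```

The second run violates uniform integrity (p decides twice) and agreement (p and q decide different values).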
Solving Consensus using FDs • An algorithm to solve consensus using S • S satisfies strong completeness and weak accuracy • Tolerates up to n-1 failures in a system of n processes
The Algorithm • At every process p:
  procedure propose(vp)
    Vp ← (⊥, ⊥, ..., ⊥)
    Vp[p] ← vp
    ∆p ← Vp
Phase 1 (asynchronous rounds rp, 1 ≤ rp ≤ n-1)
  for rp ← 1 to n-1
    send (rp, ∆p, p) to all
    wait until [for all q: received (rp, ∆q, q) or q ∈ Dp]    {Dp: the processes p's failure detector currently suspects}
    msgs[rp] ← {(rp, ∆q, q) | received (rp, ∆q, q)}
    ∆p ← (⊥, ⊥, ..., ⊥)
    for k ← 1 to n
      if Vp[k] = ⊥ and some (rp, ∆q, q) ∈ msgs[rp] has ∆q[k] ≠ ⊥ then
        Vp[k] ← ∆q[k]
        ∆p[k] ← ∆q[k]
Phase 2
  send Vp to all
  wait until [for all q: received Vq or q ∈ Dp]    {Dp: the processes p's failure detector currently suspects}
  lastmsgs ← {Vq | received Vq}
  for k ← 1 to n
    if some Vq ∈ lastmsgs has Vq[k] = ⊥ then Vp[k] ← ⊥
Phase 3
  decide (first non-⊥ component of Vp)
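The three phases can be tried out with a small Python simulation. It is a sketch under strong simplifying assumptions, not the authors' code: all processes advance in lock-step, the failure detector suspects exactly the processes that have already crashed (one possible detector that satisfies strong completeness and weak accuracy), and the `crashes` schedule format and the name `run_consensus` are hypothetical.

```python
BOT = None  # stands for the "no value" symbol ⊥


def run_consensus(proposals, crashes=None):
    """Simulate one lock-step run; return {process: decision} for survivors.

    crashes maps a process to (step, recipients): at that step it sends only
    to `recipients` and then crashes. Steps 1..n-1 are the Phase 1 rounds,
    step n is Phase 2. (Hypothetical schedule format, for illustration.)
    """
    n = len(proposals)
    crashes = crashes or {}
    alive = set(range(n))
    V = [[BOT] * n for _ in range(n)]
    for p in range(n):
        V[p][p] = proposals[p]
    delta = [list(row) for row in V]

    def exchange(step, payload):
        """Deliver payload[p] from every live p, then apply this step's crashes."""
        nonlocal alive
        inbox = {q: [] for q in range(n)}
        for p in sorted(alive):
            crash_step, reach = crashes.get(p, (None, None))
            targets = reach if crash_step == step else range(n)
            for q in targets:
                inbox[q].append(list(payload[p]))
        # A sender missing from an inbox has crashed, so it is suspected
        # and the receiver stops waiting for it.
        alive = {p for p in alive if crashes.get(p, (None,))[0] != step}
        return inbox

    # Phase 1: n-1 rounds, relaying only newly learned values (delta).
    for r in range(1, n):
        inbox = exchange(r, delta)
        for p in alive:
            new_delta = [BOT] * n
            for k in range(n):
                if V[p][k] is BOT:
                    learned = next((d[k] for d in inbox[p] if d[k] is not BOT), BOT)
                    V[p][k] = new_delta[k] = learned
            delta[p] = new_delta

    # Phase 2: drop every component that some received vector is missing.
    inbox = exchange(n, V)
    for p in alive:
        for k in range(n):
            if any(v[k] is BOT for v in inbox[p]):
                V[p][k] = BOT

    # Phase 3: decide the first non-BOT component.
    return {p: next(x for x in V[p] if x is not BOT) for p in sorted(alive)}


# Process 0 reaches only process 1 in round 1, then crashes.
print(run_consensus(["a", "b", "c", "d"], {0: (1, {1})}))
# Process 0 crashes before sending anything.
print(run_consensus(["a", "b", "c", "d"], {0: (1, set())}))
```

In the first run process 1 relays the value "a" in round 2, so every survivor decides "a". In the second run nobody ever learns "a" and the survivors decide "b".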
Consensus solved using S • Every correct process reaches Phase 3 • Vp of every process has at least one non-⊥ component • Every correct process decides on some non-⊥ value in Phase 3 (termination) • This non-⊥ value was proposed by some process (uniform validity) • No process decides more than once (uniform integrity) • All processes in Phase 3 have the same vector to decide from (uniform agreement)
Failure Detectors • Can be used to bridge the gap between known impossibility results and the need for practical solutions for fault-tolerant asynchronous distributed systems