Unreliable Failure Detectors for Reliable Distributed Systems

Unreliable Failure Detectors for Reliable Distributed Systems
T. Chandra, Sam Toueg, 1996

Why Fault Tolerance

Fault Tolerance in Asynchronous Distributed Systems
• Problem: deterministic consensus is impossible if even one process may crash (FLP, 1985)
• Solution: randomization, or failure detectors

Unreliable Failure Detectors
• A distributed oracle that provides hints about the operational status of processes
• Hints may be incorrect
• Different hints to different processes
• May change its mind over time
• Mistakes should not make the system behave incorrectly!

Example
• Timeout based
• Every process periodically sends an "I am alive" message
• If no message arrives from a process within the timeout, suspect that process
• If a message from a suspected process arrives later, remove it from the suspected set
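
The timeout scheme above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the class name `TimeoutFailureDetector`, its methods, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all assumptions.

```python
class TimeoutFailureDetector:
    """Hypothetical timeout-based failure detector run by one observing process."""

    def __init__(self, peers, timeout):
        self.timeout = timeout
        # Assumption: treat every peer as heard from at time 0,
        # so nobody is suspected right at start-up.
        self.last_heard = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        """Record an "I am alive" message and withdraw any suspicion of peer."""
        self.last_heard[peer] = now
        self.suspected.discard(peer)

    def query(self, now):
        """Return the set of peers currently suspected."""
        for peer, t in self.last_heard.items():
            if now - t > self.timeout:
                self.suspected.add(peer)
        return set(self.suspected)


fd = TimeoutFailureDetector(peers=["q", "r"], timeout=3.0)
fd.on_heartbeat("q", now=1.0)
fd.on_heartbeat("r", now=1.0)
fd.on_heartbeat("q", now=4.0)
print(fd.query(now=5.0))  # {'r'}: r has been silent for 4 > 3 time units
fd.on_heartbeat("r", now=5.5)
print(fd.query(now=6.0))  # set(): the suspicion of r turned out to be a mistake
```

The second query shows the detector changing its mind: r was only slow, not crashed, which is exactly the kind of mistake an unreliable failure detector is allowed to make.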

Hints may be incorrect

Different hints to different processes

May change its mind over time

Characterizing Failure Detectors
• Completeness: suspect every process that actually crashes
• Accuracy: limit the number of correct processes that are suspected

Completeness Vs Accuracy

The Model
• Crash failures only, with no recovery
• The failure detector (FD) works in a query/response manner
  • Query the FD
  • Response = the processes currently suspected
• FD properties

Completeness
• Strong Completeness: eventually, every crashed process is permanently suspected by every correct process
• Weak Completeness: eventually, every crashed process is permanently suspected by some correct process

Strong Completeness

Weak Completeness

Accuracy
• Strong Accuracy: no process is suspected before it crashes
• Weak Accuracy: some correct process is never suspected
• Both are perpetual accuracy properties: they must hold at all times
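
Because the perpetual accuracy properties must hold at every instant, they can be checked on a recorded prefix of a run. The sketch below assumes a hypothetical trace format, a list of `(time, observer, suspects)` samples, and a `crash_time` map that has an entry only for processes that crash. The function names are illustrative.

```python
def strong_accuracy(trace, crash_time):
    """True if no process is suspected before it crashes."""
    for t, _observer, suspects in trace:
        for q in suspects:
            if q not in crash_time or t < crash_time[q]:
                return False
    return True


def weak_accuracy(trace, crash_time, processes):
    """True if some correct process is never suspected by anyone."""
    correct = set(processes) - set(crash_time)
    ever_suspected = set()
    for _t, _observer, suspects in trace:
        ever_suspected |= suspects
    return bool(correct - ever_suspected)


# r crashes at time 5; p wrongly suspects q at time 1.
crash_time = {"r": 5}
trace = [
    (1, "p", {"q"}),
    (6, "p", {"r"}),
    (6, "q", {"r"}),
]
print(strong_accuracy(trace, crash_time))                  # False
print(weak_accuracy(trace, crash_time, ["p", "q", "r"]))   # True
```

Here the single false suspicion of q breaks strong accuracy, while weak accuracy still holds because the correct process p is never suspected.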

Eventual Accuracy
• Eventual Strong Accuracy: after some time, no correct process is suspected by any correct process
• Eventual Weak Accuracy: after some time, some correct process is not suspected by any correct process

Failure Detector Classes
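
Pairing each of the two completeness properties with each of the four accuracy properties gives eight classes. A sketch of that grid as a lookup table follows. The class symbols are the ones used in the Chandra and Toueg paper (S matches the definition used later in this transcript), while `FDClass`, `CLASSES`, and `classes_with` are illustrative names.

```python
from typing import NamedTuple


class FDClass(NamedTuple):
    completeness: str
    accuracy: str


CLASSES = {
    "P":  FDClass("strong", "strong"),           # Perfect
    "S":  FDClass("strong", "weak"),             # Strong
    "◇P": FDClass("strong", "eventual strong"),  # Eventually Perfect
    "◇S": FDClass("strong", "eventual weak"),    # Eventually Strong
    "Q":  FDClass("weak", "strong"),
    "W":  FDClass("weak", "weak"),               # Weak
    "◇Q": FDClass("weak", "eventual strong"),
    "◇W": FDClass("weak", "eventual weak"),      # Eventually Weak
}


def classes_with(completeness=None, accuracy=None):
    """Names of the classes matching the given properties (None = any)."""
    return [
        name
        for name, c in CLASSES.items()
        if completeness in (None, c.completeness)
        and accuracy in (None, c.accuracy)
    ]


print(classes_with(completeness="strong"))    # ['P', 'S', '◇P', '◇S']
print(classes_with(accuracy="eventual weak")) # ['◇S', '◇W']
```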

The Consensus Problem
• Termination: every correct process eventually decides
• Uniform Integrity: every process decides at most once
• Agreement: no two correct processes decide differently
• Uniform Validity: if a process decides v, then v was proposed by some process
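
The four properties can be read as predicates over the outcome of one run. The checker below is a small sketch under assumed inputs: `proposals` maps each process to its proposed value, `decisions` maps each process to the list of values it decided (in order), and `correct` is the set of processes that never crash. None of these names come from the talk.

```python
def check_consensus(proposals, decisions, correct):
    """Evaluate the four consensus properties on a finished run."""
    proposed = set(proposals.values())
    decided_by_correct = {decisions[p][0] for p in correct if decisions.get(p)}
    return {
        # every correct process decided something
        "termination": all(bool(decisions.get(p)) for p in correct),
        # nobody, correct or faulty, decided more than once
        "uniform_integrity": all(len(d) <= 1 for d in decisions.values()),
        # correct processes share a single decision value
        "agreement": len(decided_by_correct) <= 1,
        # every decided value was proposed by someone
        "uniform_validity": all(v in proposed for d in decisions.values() for v in d),
    }


proposals = {"p": 1, "q": 2, "r": 3}
decisions = {"p": [2], "q": [2], "r": [3]}  # r decided 3 and then crashed
result = check_consensus(proposals, decisions, correct={"p", "q"})
print(result)
```

In this run all four properties hold even though the faulty process r decided differently, because Agreement as stated only constrains correct processes. A uniform version of agreement would reject the same run.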

Solving Consensus using FDs
• An algorithm that solves consensus using a failure detector of class S
• S satisfies strong completeness and weak accuracy
• Tolerates up to n-1 failures in a system of n processes

The Algorithm
At every process p:
  procedure propose(v_p)
    V_p ← (⊥, ⊥, ..., ⊥)
    V_p[p] ← v_p
    Δ_p ← V_p

Phase 1 (asynchronous rounds r_p, 1 ≤ r_p ≤ n-1)
  for r_p ← 1 to n-1
    send (r_p, Δ_p, p) to all
    wait until [for all q: received (r_p, Δ_q, q) or q ∈ D_p]   {D_p: processes currently suspected by p's failure detector}
    msgs[r_p] ← {(r_p, Δ_q, q) | received (r_p, Δ_q, q)}
    Δ_p ← (⊥, ⊥, ..., ⊥)
    for k ← 1 to n
      if V_p[k] = ⊥ and there is (r_p, Δ_q, q) ∈ msgs[r_p] with Δ_q[k] ≠ ⊥ then
        V_p[k] ← Δ_q[k]
        Δ_p[k] ← Δ_q[k]

Phase 2
  send V_p to all
  wait until [for all q: received V_q or q ∈ D_p]
  lastmsgs ← {V_q | received V_q}
  for k ← 1 to n
    if there is V_q ∈ lastmsgs with V_q[k] = ⊥ then V_p[k] ← ⊥

Phase 3
  decide (first non-⊥ component of V_p)
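
To see the three phases work together, here is a small lock-step simulation in Python. It is a sketch under simplifying assumptions, not the authors' code: all processes move through the steps together, a crashing process reaches only a random subset of receivers, and a suspicion is modelled as the receiver ignoring that sender's message for the step. The names `run_consensus`, `crash_step`, and `never_suspected`, and the probabilities 0.5 and 0.3, are made up for illustration. Weak accuracy is modelled by one correct process whose messages are never ignored.

```python
import random

BOT = None  # stands for ⊥ ("no value known")


def run_consensus(proposals, crash_step=None, never_suspected=0, seed=0):
    """Simulate the three phases for processes 0..n-1.

    proposals[p]    : value proposed by p (must not be None)
    crash_step[p]   : step at which p crashes part-way through its broadcast
                      (steps 1..n-1 are Phase 1 rounds, step n is Phase 2)
    never_suspected : a correct process that nobody ever suspects
    Returns {p: decision} for the processes that reach Phase 3.
    """
    rng = random.Random(seed)
    n = len(proposals)
    crash_step = crash_step or {}
    assert never_suspected not in crash_step, "this process must be correct"
    assert all(v is not BOT for v in proposals)

    V = [[BOT] * n for _ in range(n)]
    for p in range(n):
        V[p][p] = proposals[p]
    delta = [list(v) for v in V]
    alive = set(range(n))

    def exchange(step, payload):
        """One 'send to all, then wait' step; returns {receiver: {sender: msg}}."""
        crashing = {q for q in alive if crash_step.get(q) == step}
        inbox = {p: {} for p in alive - crashing}
        for q in alive:
            for p in inbox:
                if q in crashing and rng.random() < 0.5:
                    continue  # q crashed before its message to p went out
                if q not in (p, never_suspected) and rng.random() < 0.3:
                    continue  # p suspected q and stopped waiting for it
                inbox[p][q] = payload[q]
        alive.difference_update(crashing)
        return inbox

    # Phase 1: n-1 rounds, each relaying only the values learned in the last round.
    for r in range(1, n):
        inbox = exchange(r, {q: list(delta[q]) for q in alive})
        for p, msgs in inbox.items():
            delta[p] = [BOT] * n
            for k in range(n):
                if V[p][k] is BOT:
                    for d in msgs.values():
                        if d[k] is not BOT:
                            V[p][k] = delta[p][k] = d[k]
                            break

    # Phase 2: erase every component that some received vector does not know.
    inbox = exchange(n, {q: list(V[q]) for q in alive})
    for p, msgs in inbox.items():
        for k in range(n):
            if any(v[k] is BOT for v in msgs.values()):
                V[p][k] = BOT

    # Phase 3: decide on the first non-⊥ component.
    return {p: next(x for x in V[p] if x is not BOT) for p in sorted(alive)}


print(run_consensus(["a", "b", "c", "d"], crash_step={1: 1, 3: 4}))
# {0: 'a', 2: 'a'}
```

With the default `never_suspected=0`, every survivor decides `proposals[0]`, because component 0 is known to everyone after round 1 and is never erased in Phase 2. Passing another process as `never_suspected` and varying `seed` gives runs where the decision depends on which messages were ignored, while the survivors still agree.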

Consensus solved using S
• Every correct process reaches Phase 3
• V_p of every process has at least one non-⊥ component
• Every correct process decides on some non-⊥ value in Phase 3 (termination)
• This non-⊥ value was proposed by some process (uniform validity)
• No process decides more than once (uniform integrity)
• All processes in Phase 3 have the same vector to decide from (uniform agreement)

Failure Detectors
• Can be used to bridge the gap between known impossibility results and the need for practical solutions for fault-tolerant asynchronous distributed systems
