Failure Detectors: A Perspective
Sam Toueg, University of Toronto
Context: Distributed Systems with Failures
Group Membership, Group Communication, Atomic Broadcast, Primary/Backup systems, Atomic Commitment, Consensus, Leader Election, …
• In such systems, applications often need to determine which processes are up (operational) and which are down (crashed)
• This service is provided by a Failure Detector (FD)
• FDs are at the core of many fault-tolerant algorithms and applications
• FDs are found in many systems: e.g., ISIS, Ensemble, Relacs, Transis, Air Traffic Control Systems, etc.
Failure Detectors An FD is a distributed oracle that provides hints about the operational status of processes. However: • Hints may be incorrect • FD may give different hints to different processes • FD may change its mind (over & over) about the operational status of a process
[Diagram: the FD gives different hints about a slow process s to processes p, q, r, t; s is slow, not crashed, yet it is suspected by some processes and trusted by others.]
Talk Outline
• Using FDs to solve consensus
• Broadening the use of FDs
• Putting theory into practice
[Diagram: consensus among processes p, q, r, s, t with initial values 5, 7, 8, 2, 5; one process crashes, and the surviving processes all decide 5.]
Consensus A paradigm for reaching agreement despite failures • Equivalent to Atomic Broadcast • Can be used to solve Atomic Commitment • Can be used to solve Group Membership • ….
Solving Consensus • In synchronous systems: Possible • In asynchronous systems: Impossible [FLP83] • even if: • at most one process may crash, and • all links are reliable
Why this difference?
• In synchronous systems: use timeouts to determine with certainty whether a process has crashed
  => a perfect failure detector
• In asynchronous systems: one cannot determine with certainty whether a process has crashed (it may be slow, or its messages may be delayed)
  => no perfect failure detector
Solving Consensus with Failure Detectors
Is perfect failure detection necessary for consensus? No.
Failure detector ◇S:
• Initially, it can output arbitrary information
• But there is a time after which:
  - every process that crashes is suspected (completeness)
  - some process that does not crash is not suspected (accuracy)
• ◇S can be used to solve consensus [CT91]
• ◇S is the weakest FD to solve consensus [CHT92]
If an FD D can be used to solve consensus, then D can be transformed into ◇S.
[Diagram: processes p, q, r, s, t each transform the local output of D into the output of ◇S.]
Solving Consensus using ◇S: Rotating Coordinator Algorithms
• Processes are numbered 1, 2, …, n
• They execute asynchronous rounds
• In round r, the coordinator is process (r mod n) + 1
• In round r, the coordinator:
  - tries to impose its estimate as the consensus value
  - succeeds if it does not crash and it is not suspected by ◇S
• These algorithms work for up to f < n/2 crashes
A Consensus Algorithm using ◇S (Mostéfaoui and Raynal 1999)
every process p sets estimate to its initial value
for rounds r := 0, 1, 2, … do   {round-r msgs are tagged with r}
  • the coordinator c of round r sends its estimate v to all
  • every p waits until (a) it receives v from c, or (b) it suspects c (according to ◇S)
    - if (a) then send v to all
    - if (b) then send ? to all
  • every p waits until it receives a msg (v or ?) from n−f processes
    - if it received at least (n+1)/2 msgs v then decide v
    - if it received at least one msg v then estimate := v
    - if it received only ? msgs then do nothing
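The second-phase rules above can be sketched in Python. This is an illustrative fragment, not the authors' code: the function name `phase2_step` and the `("EST", v)` / `("?", None)` message encoding are my assumptions, and the coordinator and round machinery around it is omitted.

```python
# Sketch of the phase-2 decision rules of the round.
# `msgs` holds the messages p received from n - f processes:
# each is ("EST", v) for a value message or ("?", None) for a suspicion.

def phase2_step(n, estimate, msgs):
    """Return (decided, new_estimate) after receiving n - f phase-2 msgs."""
    vals = [v for tag, v in msgs if tag == "EST"]
    # In any given round, all EST messages carry the same coordinator
    # estimate, so vals[0] is the unique value v of this round.
    if len(vals) >= (n + 1) // 2:      # a majority of v msgs: decide v
        return True, vals[0]
    if vals:                           # at least one v msg: adopt v
        return False, vals[0]
    return False, estimate             # only ? msgs: keep old estimate
```

For example, with n = 7 and f = 3, a process that receives four EST messages decides, while a process that receives one EST and three ? messages only adopts the value.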
Why does it work? Agreement (example: n = 7, f = 3):
If p decides v in some round, it received at least (n+1)/2 = 4 msgs v. Every process that completes the round hears from n − f = 4 processes, and since 4 + 4 > 7 these two sets must intersect. So every such q receives at least one msg v and changes its estimate to v, and from then on v is the only value in the system.
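The counting step behind this argument is a one-line pigeonhole check; the helper below (a hypothetical name, for illustration only) verifies that a decision quorum of (n+1)/2 senders must overlap any set of n − f senders a process hears from:

```python
# A decider received "v" from at least (n+1)//2 processes; every process
# hears from n - f processes; the two sets are guaranteed to overlap
# iff their sizes sum to more than n (pigeonhole).
def quorums_intersect(n, f):
    return (n + 1) // 2 + (n - f) > n
```

Note that the check fails once f reaches n/2, which is why the algorithm requires f < n/2.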
Why does it work?
• Termination:
  - With ◇S, no process blocks forever waiting for a message from a dead coordinator
  - With ◇S, eventually some correct process c is not falsely suspected. When c becomes the coordinator, every process receives c's estimate and decides
What Happens if the Failure Detector Misbehaves?
• The consensus algorithm is:
  - Safe: always!
  - Live: during "good" FD periods
[Diagram: successive instances Consensus 1, Consensus 2, Consensus 3, …]
Failure Detector Abstraction
Some advantages:
• Increases the modularity and portability of algorithms
• Suggests why consensus is not so difficult in practice
• Determines the minimal information about failures needed to solve consensus
• Encapsulates various models of partial synchrony
Failure Detection Abstraction
Initially, applicability was limited:
• Model: FLP only
  - process crashes only
  - a crash is permanent (no recovery possible)
  - no link failures (no msg losses)
• Problems solved: consensus and atomic broadcast only
Talk Outline
• Using FDs to solve consensus
• Broadening the use of FDs
• Putting theory into practice
Broadening the Applicability of FDs
Other models:
• Crashes + link failures (fair links)
• Network partitioning
• Crash/recovery
• Byzantine (arbitrary) failures
• FDs + randomization
Other problems:
• Atomic Commitment
• Group Membership
• Leader Election
• k-set Agreement
• Reliable Communication
Talk Outline
• Using FDs to solve consensus
• Broadening the use of FDs
• Putting theory into practice
Putting Theory into Practice
In practice:
• FD implementations need to be message-efficient
  => FDs with linear msg complexity (ring, hierarchical, gossip)
• "Eventual" guarantees are not sufficient
  => FDs with QoS guarantees
• Failure detection should be easily available
  => a shared FD service (with QoS guarantees)
A Simple FD Problem: q monitors p
• p sends heartbeats to q
• Heartbeats can be lost or delayed
Probabilistic model:
• pL: probability of heartbeat loss
• D: heartbeat delay (a random variable)
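This probabilistic model is easy to play with in a toy Monte Carlo simulation. The function name, the unit sending interval, and the choice of an exponential delay distribution are my assumptions for illustration (the model itself allows any distribution of D):

```python
import random

def simulate_heartbeats(n, p_l, mean_delay, rng):
    """Toy model: heartbeat i is sent at time i (unit interval); it is
    lost with probability p_l, otherwise it arrives after an
    exponentially distributed delay with the given mean."""
    arrivals = []
    for i in range(n):
        if rng.random() >= p_l:            # heartbeat not lost
            arrivals.append((i, i + rng.expovariate(1.0 / mean_delay)))
    return arrivals
```

With pL = 0.01 and E(D) = 0.02, the running example used later in the talk, roughly 1% of heartbeats never arrive and the rest are only slightly delayed.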
Typical FD Behavior
[Diagram: while p is up, the FD at q alternates between trust and suspect (making mistakes); after p goes down, the FD suspects p permanently.]
QoS of Failure Detectors The QoS specification of an FD quantifies: • how fast it detects actual crashes • how well it avoids mistakes (i.e., false detections) What QoS metrics should we use?
Detection Time
• TD: the time from p's crash until the FD starts to permanently suspect p
[Diagram: p crashes while trusted; TD later, the FD switches to permanent suspicion.]
Accuracy Metrics
• TMR: time between two consecutive mistakes
• TM: duration of a mistake
[Diagram: while p is up, mistakes of duration TM recur, separated by TMR.]
Another Accuracy Metric
An application queries the FD at random times:
• PA: probability that the FD output is correct at a random time
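These metrics are related: under a renewal assumption (mistakes recur every E(TMR) on average and last E(TM) each), the long-run fraction of time the FD output is wrong is E(TM)/E(TMR), which yields a back-of-the-envelope estimate of PA. The helper and its name are illustrative, not a formula quoted from the talk:

```python
def query_accuracy(e_tm, e_tmr):
    """Approximate PA as the long-run fraction of time NOT spent in a
    mistake, assuming mistake starts form a renewal process."""
    return 1.0 - e_tm / e_tmr
```

For example, mistakes lasting 0.5 sec on average and recurring every 50 sec give PA of about 0.99.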
A Common FD Algorithm
• q suspects p if no heartbeat arrives for TO time units; each arriving heartbeat restarts the timeout
• So timing-out also depends on the arrival time of the previous heartbeat
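The common algorithm can be sketched as a timer that restarts on every heartbeat. This is a minimal sketch with a pluggable clock for testability; the class and method names are mine, not from the talk:

```python
import time

class TimeoutFD:
    """q's side of the classic timeout-based detector (illustrative)."""

    def __init__(self, timeout, now=time.monotonic):
        self.timeout = timeout
        self.now = now                     # injectable clock for tests
        self.last_heartbeat = self.now()

    def on_heartbeat(self):
        self.last_heartbeat = self.now()   # restart the timeout

    def suspects(self):
        return self.now() - self.last_heartbeat > self.timeout
```

Because the timer restarts from each heartbeat's arrival, a late heartbeat pushes the whole timeout window back, which is exactly the drawback the next slide points out.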
Large Detection Time
• TD depends on the delay of the last heartbeat sent by p: the timeout only starts from that heartbeat's (possibly late) arrival
[Diagram: p crashes after its last heartbeat; q times out only TO after that heartbeat arrives.]
New FD Algorithm
• Heartbeats h_{i−1}, h_i, h_{i+1}, h_{i+2}, … are sent at regular intervals; each h_i has a corresponding freshness point τ_i
• At any time t ∈ [τ_i, τ_{i+1}), q trusts p iff it has received heartbeat h_i or higher
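The freshness-point rule can be sketched as follows. The sketch assumes heartbeat i is sent at time i·η and its freshness point is τ_i = i·η + δ, where η is the sending interval and δ a fixed shift; the exact timing model in the talk may differ, and all names here are illustrative:

```python
class FreshnessFD:
    """q's side of the freshness-point detector (illustrative sketch)."""

    def __init__(self, eta, delta):
        self.eta = eta        # heartbeat sending interval
        self.delta = delta    # shift: tau_i = i*eta + delta
        self.highest = -1     # highest heartbeat sequence number received

    def on_heartbeat(self, seq):
        self.highest = max(self.highest, seq)

    def trusts(self, t):
        # Index i of the current freshness interval [tau_i, tau_{i+1});
        # before tau_0, q expects nothing yet and trusts p.
        i = int((t - self.delta) // self.eta) if t >= self.delta else -1
        return self.highest >= i
```

Unlike the timeout detector, the trust condition at time t never depends on when earlier heartbeats arrived: if p crashes right after sending h_i, q starts suspecting p at τ_{i+1} at the latest, so detection takes at most η + δ.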
Detection Time is Bounded
[Diagram: if p crashes right after sending h_i, q suspects p at freshness point τ_{i+1}; TD no longer depends on the delay of the last heartbeat.]
Optimality Result Among all FD algorithms with the same heartbeat rate and detection time, this FD has the best query accuracy probability PA
QoS Analysis
Given:
• the system behavior: pL and Pr(D ≤ t)
• the parameters η (heartbeat interval) and δ (shift) of the FD algorithm
one can compute the QoS of this FD algorithm:
• max detection time TD
• average time between mistakes E(TMR)
• average duration of a mistake E(TM)
• query accuracy probability PA
Satisfying QoS Requirements
• Given a set of QoS requirements:
• Compute η and δ to achieve these requirements
Computing FD Parameters to Achieve the QoS
• Assume pL and Pr(D ≤ x) are known
• Problem to be solved: …
Configuration Procedure
• Step 1: compute … and let …
• Step 2: let …; find the largest η_max that satisfies …
• Step 3: set …
[Diagram: the probabilistic behavior of heartbeats (pL, Pr(D ≤ x)) and the QoS requirements (T_D^U, T_M^U, T_MR^L) are fed into a Configurator, which outputs the parameters η and δ to the Failure Detector.]
Example
• Probability of heartbeat loss: pL = 0.01
• Heartbeat delay D is exponentially distributed with average delay E(D) = 0.02 sec
QoS requirements:
• Detect a crash within 30 sec
• At most one mistake per month (on average)
• A mistake is corrected within 60 sec (on average)
Resulting algorithm parameters:
• Send a heartbeat every η = 9.97 sec
• Set the shift to δ = 20.03 sec
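A quick sanity check on these numbers (my arithmetic, not a line from the talk): with the freshness-point detector, the worst case for detection is a crash immediately after a heartbeat is sent, so the crash is detected one sending interval plus one shift later, and the example's parameters make that worst case exactly the 30 sec requirement:

```python
eta, delta = 9.97, 20.03      # parameters from the example
t_d_max = eta + delta         # worst-case detection time: eta + delta
assert abs(t_d_max - 30.0) < 1e-9
```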
If System Behavior is Not Known
If pL and Pr(D ≤ x) are not known:
• estimate pL, E(D), V(D) using the heartbeats
• use E(D) and V(D) instead of Pr(D ≤ x) in the configuration procedure
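The estimation step might look like the sketch below. It assumes q knows each heartbeat's sending time (seq·η on a clock synchronized with p's, a simplification the real procedure does not require) and how many heartbeats were sent; the function name and record format are hypothetical:

```python
def estimate_behavior(records, eta, sent_count):
    """Estimate (pL, E(D), V(D)) from received heartbeats.

    records: list of (seq, arrival_time) pairs for heartbeats that
    arrived; heartbeat seq is assumed to be sent at time seq * eta.
    """
    delays = [arrival - seq * eta for seq, arrival in records]
    n = len(delays)
    e_d = sum(delays) / n                              # sample mean of D
    v_d = sum((d - e_d) ** 2 for d in delays) / n      # sample variance
    p_l = 1.0 - n / sent_count                         # fraction lost
    return p_l, e_d, v_d
```

For instance, if 4 heartbeats were sent at unit intervals and 3 arrived at times 0.02, 1.04, and 3.03, the estimates are pL = 0.25 and E(D) = 0.03.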
[Diagram: an estimator of the probabilistic behavior of heartbeats supplies pL, E(D), V(D); together with the QoS requirements (T_D^U, T_M^U, T_MR^L), the Configurator computes η and δ for the Failure Detector.]
Example
• Probability of heartbeat loss: pL = 0.01
• The distribution of heartbeat delay D is not known, but E(D) = V(D) = 0.02 sec are known
QoS requirements:
• Detect a crash within 30 sec
• At most one mistake per month (on average)
• A mistake is corrected within 60 sec (on average)
Resulting algorithm parameters:
• Send a heartbeat every η = 9.71 sec
• Set the shift to δ = 20.29 sec
Approaches to Failure Detection
Currently:
• each application implements its own FD
• there is no systematic way of setting timeouts and sending rates
We propose an FD as a shared service:
• continuously running on every host
• can detect process and host crashes
• provides failure information to all applications
Advantages of a Shared FD Service
• Sharing:
  - applications can concurrently use the same FD service
  - merging FD messages can decrease network traffic
• Modularity:
  - well-defined API
  - different FD implementations may be used in different environments
• Reduced implementation effort:
  - programming fault-tolerant applications becomes easier
Advantages of a Shared FD Service with QoS
• QoS guarantees:
  - applications can specify their desired QoS
  - applications do not need to set operational FD parameters (e.g., timeouts and sending rates)
• Adaptivity:
  - adapts to changing network conditions (message delays and losses)
  - adapts to changing QoS requirements