Error Confinement: New Measures for Fault Tolerance and Core Bootstrapping

Distributed Error-Confinement. Shay Kutten (Technion), with Boaz Patt-Shamir (Tel Aviv U.) and Yossi Azar (Tel Aviv U.)

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.

Motivation: “error propagation” (example). Internet routing: node A computes its shortest path to C based on messages from S. (1) Assume no fault. Message from S to A: “distance 7 to C”. A concludes: “My distance to C via S: 7+4=11”. [Figure: nodes C, S, A, B with link weights 7 and 4; traffic to C.]

Motivation: “error propagation” (example). (2) With fault (at B): a state-corrupting fault (the adversary modifies data memory). B now advertises “distance 0 to C”. Message from S to A: “distance 7 to C”. A still holds: “My distance to C via S: 7+4=11”. [Figure: nodes C, S, A, B with link weights 7, 4 and 2; traffic to C.]

State-corrupting fault (self-stabilization): not malicious! Just a one-time change of memory content.

Motivation: “error propagation” (example). (3) B’s fault propagated to A. Message from S to A: “distance 7 to C”. B (fault): “distance 0 to C”. A now computes: “My distance to C via B: 2+0=2”.

Motivation: “error propagation” (example). B’s fault propagated to A: “My distance to C via B: 2+0=2”. (4) Traffic to C is sent the wrong way as a result of the fault propagation.
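The four steps of the example can be replayed in a few lines. This is a minimal sketch, assuming a plain distance-vector rule (pick the neighbor minimizing link cost plus advertised distance); the function name `best_route` and the dictionary layout are illustrative and not taken from the talk. The link costs 4 (A to S) and 2 (A to B) are the ones used in the slide arithmetic.

```python
# Link costs from A to its neighbors, as used in the slide arithmetic (7+4, 2+0).
LINK_COST_FROM_A = {"S": 4, "B": 2}


def best_route(adverts):
    """Return (next_hop, distance): the neighbor minimizing link cost + advertised distance."""
    return min(
        ((hop, LINK_COST_FROM_A[hop] + dist) for hop, dist in adverts.items()),
        key=lambda pair: pair[1],
    )


# (1) No fault: A only considers the advertisement from S ("distance 7 to C").
route_ok = best_route({"S": 7})

# (2)-(3) B's memory is corrupted, so B advertises "distance 0 to C".
route_bad = best_route({"S": 7, "B": 0})

print(route_ok)   # ('S', 11)
print(route_bad)  # ('B', 2): (4) traffic to C is now sent the wrong way
```

A itself has no fault, yet its output is wrong: this is the propagation that error confinement is meant to rule out.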

This is, actually, how the Internet (then called “ARPANET”) crashed in 1980. Faulty node: “I have distance 0 to everybody”. [Figure: nodes S, A, B, C, D.]

“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). Sounds impossible? B (fault): “distance 0 to C”. A to B: “I do not believe you!” A’s output (to routing): “My distance to C via S: 7+4=11”.

Error Confinement (Formally) • Π: problem specification; P: protocol. • P solves Π with error confinement if, for any execution of P with behavior β (possibly containing a state-corrupting fault), there exists a behavior β′ satisfying Π such that for all non-faulty nodes v: β′_v = β_v. (A behavior ignores time. “Stabilization” deals also with faulty nodes.)
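The formal condition can be read as a simple check. This is a sketch under stated assumptions: a behavior is modeled as a dictionary from node to output, the specification as a finite list of legal behaviors, and the name `confined` is hypothetical. The real definition ranges over all executions of a protocol, which this toy check does not attempt.

```python
def confined(observed, spec, non_faulty):
    """True if some legal behavior agrees with `observed` on every non-faulty node."""
    return any(
        all(legal.get(v) == observed.get(v) for v in non_faulty)
        for legal in spec
    )


# Hypothetical broadcast specification: every node outputs the input value "m".
spec = [{"S": "m", "A": "m", "B": "m"}]

# B is faulty and outputs garbage; S and A are non-faulty and output "m".
faulty_only = confined({"S": "m", "A": "m", "B": "x"}, spec, {"S", "A"})

# The error has spread: the non-faulty node A also outputs garbage.
propagated = confined({"S": "m", "A": "x", "B": "x"}, spec, {"S", "A"})

print(faulty_only)  # True
print(propagated)   # False
```

Only the outputs of non-faulty nodes are compared, which is why the first case passes even though B's output is wrong.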

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.

Introducing a new measure of fault resilience. The resilience of a protocol is smaller at first: the environment (e.g. user) gives the input to S at time t0. [Figure: timeline t0, t1, t2 over nodes S, A, B, C, D.]

The resilience of a protocol is smaller at first (cont.). The environment (e.g. user) gives the input to S at time t0. If the adversary changes the state of S at time tf, shortly after the input, then the input is lost forever. [Figure: timeline t0, tf, t1, t2 over nodes S, A, B, C, D.]

The resilience of a protocol grows with time. However, a fault, even in S, can be tolerated if it “waits” until after S distributed the input value. [Figure: input given to S at t0; distribution from S to A, B, C, D; fault at tf.]

The resilience of a protocol grows with time. To destroy the replicated value, the adversary needs to hit more nodes at tf > t1 > t0. [Figure: input at t0; replicas at S, A, B, C, D at t1.]

The resilience continues to grow with time. If no faults occurred by some later time t3, then the input is replicated even further. [Figure: nodes S, A, B, C, D at times t1, t2, t3.]

The resilience continues to grow with time. The later the faults, the more faults can be tolerated. [Figure: nodes S, A, B, C, D at times t1, t2, t3; fault at tf.]

Time-Space Cone. The later the faults, the more faults can be tolerated, if the protocol is designed to be robust. [Figure: nodes S, A, B, C, D at times t1, t2, t3.]

“Narrow” cone: a LESS fault-tolerant algorithm. Slower replication: fewer nodes offer help. [Figure: nodes S, A, B, C, D at times t1, t2, t3.]

A “wider” cone: a more fault-tolerant algorithm. Replication to more nodes, faster. [Figure: nodes S, A, B, C, D at times t1, t2, t3.]
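The narrow and wide cones can be compared numerically. This is a minimal sketch, assuming recovery works by a strict-majority vote among the replicas, so that k copies survive up to (k - 1) // 2 corruptions. The replica counts per time step are made-up illustrative numbers for a five-node network, and `tolerable_faults` is a hypothetical helper name.

```python
def tolerable_faults(replicas):
    """Largest number of corrupted copies that still leaves a strict majority intact."""
    return (replicas - 1) // 2


# Hypothetical number of nodes holding the input at times t1, t2, t3.
narrow_cone = [1, 2, 3]  # slower replication
wide_cone = [1, 3, 5]    # replication to more nodes, faster

narrow_tolerance = [tolerable_faults(r) for r in narrow_cone]
wide_tolerance = [tolerable_faults(r) for r in wide_cone]

print(narrow_tolerance)  # [0, 0, 1]
print(wide_tolerance)    # [0, 1, 2]
```

In both cones nothing can be tolerated at t1, when only the source holds the value; the wider cone simply reaches each tolerance level earlier.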

So, a recovery of corrupted values is theoretically possible for an adversary that is constrained according to a space-time cone. But what is the algorithm that does the recovery?

Constraining faults: Agility • c-constrained environment: an environment generating faults tf time units after the input (c ≤ 1) only in a minority of the |Ball_S(c·tf)| nodes. • Algorithm with agility c: a broadcast algorithm that guarantees error confinement against c-constrained environments. [Figure: Ball_S(c·tf) around S, inside the node set V.]

An algorithm’s “agility” measures the rate at which the constraint on the adversary can be lifted. [Figure: space-time cone growing from S over time.]
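The constraint on the adversary can be made concrete on a small graph. This is a sketch under several assumptions: distances are hop counts, the ball radius is rounded down to an integer, "minority" means strictly fewer than half of the ball, and faults are only allowed inside the ball. The path topology and the names `ball` and `allowed` are illustrative and not from the talk.

```python
from collections import deque


def ball(graph, source, radius):
    """Nodes within `radius` hops of `source` (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the ball's boundary
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def allowed(graph, source, c, t_f, faulty):
    """True if `faulty` lies inside Ball_source(c * t_f) and is a strict minority of it."""
    b = ball(graph, source, int(c * t_f))
    return faulty <= b and 2 * len(faulty) < len(b)


# Hypothetical path topology S - A - B - C - D.
graph = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

print(allowed(graph, "S", 1, 2, {"A"}))       # True: 1 fault in a ball of 3 nodes
print(allowed(graph, "S", 1, 2, {"A", "B"}))  # False: 2 of 3 is not a minority
print(allowed(graph, "S", 1, 0, {"S"}))       # False: at t_f = 0 the ball is just S
```

The last case matches the earlier slides: a fault at S right after the input is outside what a constrained environment may do, because the input would be lost forever.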

Talk Overview (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.

The message resides at some nodes we term “core”

A node can join the core when it “made sure” it heard the votes of all core nodes

and even the fault can be corrected

Let us view again the join of one node

If the core is such that the adversary’s constraint allows it to hit only a minority of the core… then the message passes to the new node correctly. (Disclaimer: any connection to actual historical rivalry is coincidental.)
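The join step can be sketched as a majority vote over the core. This is a minimal illustration, assuming the joining node has heard one vote from every core node and adopts a value only if a strict majority agrees; the helper name `majority_value` and the node values are hypothetical, and the actual protocol in the talk may differ in its details.

```python
from collections import Counter


def majority_value(votes):
    """Return the value held by a strict majority of the votes, or None if there is none."""
    if not votes:
        return None
    value, count = Counter(votes.values()).most_common(1)[0]
    return value if 2 * count > len(votes) else None


# The core holds the message "m"; the adversary then hits a minority (node B).
core = {"S": "m", "A": "m", "B": "m"}
core["B"] = "x"

# Node D joins after hearing the votes of all core nodes.
joined = majority_value(core)
core["D"] = joined

# Even the fault can be corrected: every core node re-adopts the majority value.
corrected = majority_value(core)
core = {node: corrected for node in core}

print(joined)  # m
print(core)    # {'S': 'm', 'A': 'm', 'B': 'm', 'D': 'm'}
```

Returning `None` when no strict majority exists corresponds to producing no output at all rather than a possibly wrong one.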

“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). B (fault): “distance 0 to C”. A to B: “I do not believe you!” A’s output (to routing): “My distance to C via S: 7+4=11”. [Figure: nodes C, S, A, B, D.]

and even the fault can be corrected

When the core grows, the algorithm can withstand more faults.
