Moving away from the independent and identically distributed failure assumption

University of California, San Diego • Flavio Junqueira • Research Exam/Thesis Proposal • Advisors: Keith Marzullo and Geoffrey M. Voelker

Motivation • Common approach for distributed systems: replicate! • Cheaper than investing in ultra-reliable, specialized components • Enhances performance and availability • E.g. processes on software-based systems • Typical replication strategy • Compute a threshold t on the failures of processes • Determine the degree of replication required, depending on the problem (e.g. n > 3t for Consensus with arbitrary failures) • Replicate to this degree • Well suited for independent and identically distributed failures (IID failure assumption) • Non-negligible probability of t failures in any subset of size t+1 • Is it often a reasonable assumption?

Where IID does not apply… • Systems for the Internet • Hosts execute the same popular software systems • Hosts share the same vulnerabilities • Some major outbreaks • Code Red: over 360,000 hosts [Moore02] • Sapphire: over 75,000 hosts [Moore03a] A threshold on the number of failures is unrealistic.

Where IID does not apply… • Quorum systems in a wide-area network [Amir96] • Failures are strongly correlated • Power outages • Network partitions • Software bugs [Little01] • Single version • A demand may cause all replicas to crash • Multiple independently-developed versions • Difficulty of a demand: difficulty in handling it • Level of difficulty varies among the demands • More difficult demands tend to cause multiple versions to fail

Where IID does not apply… • Multi-computer systems [Tang92] • Correlated failures due to shared resources • Network errors • Shared memory • Impact on availability, reliability, and performance • Grid computing • Master delegates computation • Waits for replies from slaves • Replicate to achieve fault-tolerance • Dependent failures: same sub-network, same software systems, etc.

Outline • System model • Modeling failures • The classical approach: The threshold model • An alternative to the threshold model: Cores/Survivor sets • Applying it to problems: Consensus • Traditional results on Consensus • Consensus in the core/survivor set model • Generalizing the results for Consensus • General bounds on process replication • Coping with dependent failures in the real world • A few systems that assume dependent failures • An application: The Phoenix Recovery System

System model • Set of processes Π = {p1, p2, …, pn} • A process is a unit of computation • Communicate by exchanging messages • Reliable channels • Validity: If a correct process p sends a message m to a correct process q, then q eventually receives m • Integrity: A process p receives a message m from some process q only if q sent m to p

System model • Set Π of processes • Processes exchange messages • Channels are reliable • State • Distributed algorithm: collection of state machines • Step of a process • Atomic • Execution: sequence of steps of processes

Distributed algorithm • Collection of state machines, one for each process p ∈ Π • Proceeds in steps of processes • In a step, a process p • Sends a message to a single process • Receives a message from a single process • Undergoes a state transition • Execution • Sequence of steps of processes in Π

Timing assumptions • Synchronous systems • Clock drift, message delay, processor speed are bounded • Execution in synchronous rounds • In a synchronous round, a process • sends messages to any number of processes • receives messages from any number of processes • Undergoes a state transition • Asynchronous systems • No bounds on clock drift, message delay, or processor speed

Failure modes for processes • Crash failures • For every faulty process p in some execution of an algorithm A, there is a time t_p after which p stops executing steps of A • Arbitrary failures • A faulty process can deviate arbitrarily from the specification of the algorithm • E.g. crashing, sending messages selectively, arbitrarily modifying the content of messages • Receive-omission failures • A faulty process either crashes or selectively fails to receive messages • Assumptions • Once a process fails it does not recover • Probability of a total failure is negligible

Modeling failures

The threshold model • Threshold t on the number of process failures • Degree of reliability: R ∈ [0,1] • The probability of t+1 process failures is smaller than 1−R • Simple and compact representation (n > f(t)) • SIFT project [Wensley76] • Ultra-reliable computer system • Process failures are arbitrary, but non-malicious • Hardware designed to isolate faults (independent failures) • Similar hardware (identically distributed process failures) • IID failure assumption is valid • What if failures are not IID? • Still safe • t is the size of the largest subset of faulty processes in any execution • It does not hurt to consider more
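
The rule "the probability of t+1 process failures is smaller than 1−R" can be made concrete with a small calculation. The sketch below is only an illustration: it assumes failures are IID with a per-process failure probability `p` (a hypothetical parameter, not given in the slides) and reads "t+1 failures" as "at least t+1 failures".

```python
from math import comb

def prob_more_than(n: int, t: int, p: float) -> float:
    """P(more than t of n processes fail), assuming IID failures with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t + 1, n + 1))

def smallest_threshold(n: int, p: float, R: float) -> int:
    """Smallest t such that P(at least t+1 failures) < 1 - R."""
    for t in range(n + 1):
        if prob_more_than(n, t, p) < 1 - R:
            return t
    return n  # only reached when R == 1: every process may fail

# Hypothetical numbers: 4 processes, each failing with probability 0.01.
t = smallest_threshold(n=4, p=0.01, R=0.999)
```

With these assumed numbers the threshold comes out small enough that n > 3t still holds for n = 4; when failures are correlated, the same calculation no longer applies.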

Limitations of the threshold model • Figure legend • R: target degree of reliability • >R: subset of processes has reliability greater than R

An alternative to the threshold model • Desirable properties • Expressive: scenarios in the previous slide • Flexible: not tied to any particular way of characterizing failures • General: widely applicable • Cores [JM03a] • A core c: minimal reliable subset of processes • At least one process in c is correct in every execution of the system • Generalize subsets of size t+1 • Survivor sets [JM03a] • A survivor set: contains all the correct processes of some execution • Generalize subsets of size n-t

Cores and Survivor sets • R: desired degree of reliability • r(X), X ⊆ Π: evaluates to the reliability of X • A subset C ⊆ Π is a core of Π iff • r(C) ≥ R • ∀p ∈ C, r(C − {p}) < R • C_Π: set of cores of Π • A subset S ⊆ Π is a survivor set of Π iff • ∀C ∈ C_Π, S ∩ C ≠ ∅ • ∀p ∈ S, ∃C ∈ C_Π such that (p ∈ C) and ((S − {p}) ∩ C = ∅) • S_Π: set of survivor sets of Π • Cores and survivor sets are the dual of each other
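
To illustrate the two definitions, the following brute-force sketch enumerates cores from a reliability function r and then derives the survivor sets as the minimal sets that intersect every core. The reliability function is an assumption made for the example: processes fail independently with hypothetical probabilities, and r(X) is the probability that at least one process in X is correct. The probabilities were chosen so that the resulting cores match the five-process example used later in the presentation.

```python
from itertools import combinations
from math import prod

def subsets(items):
    """All non-empty subsets of items, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]

def find_cores(processes, r, R):
    """Subsets C with r(C) >= R and r(C - {p}) < R for every p in C."""
    return {C for C in subsets(processes)
            if r(C) >= R and all(r(C - {p}) < R for p in C)}

def find_survivor_sets(processes, cores):
    """Subsets S that intersect every core and stop doing so
    as soon as any single process is removed."""
    def hits_all(S):
        return all(S & C for C in cores)
    return {S for S in subsets(processes)
            if hits_all(S) and not any(hits_all(S - {p}) for p in S)}

# Hypothetical failure probabilities (not from the slides).
fail = {"pa": 0.05, "pb": 0.05, "pc": 0.05, "pd": 0.01, "pe": 0.01}

def r(X):
    """Probability that at least one process of X is correct (independence assumed)."""
    return 1 - prod(fail[p] for p in X)

cores = find_cores(fail, r, R=0.999)
survivors = find_survivor_sets(fail, cores)
```

The enumeration is exponential in the number of processes, so it only serves to check small configurations by hand.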

An alternative definition • Design of algorithms • W: the set of allowed executions • up(w): the set of correct processes in execution w • A subset C ⊆ Π is a core of Π iff • ∀w ∈ W, C ∩ up(w) ≠ ∅ • ∀C' ⊂ C, ∃w ∈ W s.t. C' ∩ up(w) = ∅ • C_Π: set of cores of Π • A subset S ⊆ Π is a survivor set of Π iff • ∃w ∈ W s.t. S = up(w) • ∀S' ⊂ S, ∀w ∈ W, S' ≠ up(w) • S_Π: set of survivor sets of Π • ⟨Π, C_Π, S_Π⟩: system configuration
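
The execution-based definition can be sketched the same way. The code below takes the sets of correct processes of the allowed executions as input (the list `ups` is a hypothetical example, with letters a to e standing for five processes) and returns the survivor sets as the minimal such sets and the cores as the minimal sets that intersect all of them.

```python
from itertools import combinations

def survivor_sets_from_executions(up_sets):
    """Sets up(w) that have no proper subset equal to some other up(w')."""
    ups = {frozenset(u) for u in up_sets}
    return {S for S in ups if not any(T < S for T in ups)}

def cores_from_executions(processes, up_sets):
    """Minimal subsets C such that C contains a correct process in every execution."""
    ups = [frozenset(u) for u in up_sets]
    def always_hit(C):
        return all(C & u for u in ups)
    procs = sorted(processes)
    found = set()
    for k in range(1, len(procs) + 1):
        for c in combinations(procs, k):
            C = frozenset(c)
            # Checking single removals is enough: if a proper subset hit every
            # up(w), so would some C - {p} that contains it.
            if always_hit(C) and not any(always_hit(C - {p}) for p in C):
                found.add(C)
    return found

# Hypothetical allowed executions, described by their sets of correct processes.
ups = ["abcde", "abcd", "abce", "ade", "bde", "cde"]
survivors = survivor_sets_from_executions(ups)
cores = cores_from_executions("abcde", ups)
```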

An example • Blue, Red, and Yellow fail independently • Failures of Yellow processes are highly correlated • r({Red, Blue, Yellow}) = R

Another example • Blue: highly-reliable server • Red: client • Failures of Blue and Red are negatively correlated • Probability of more than 3 Red processes failing is negligible

Determining cores and survivor sets • Probability models • E.g. Markov models used in the analysis of dynamic fault trees [Ren98] • To find cores: Minimal subset of processes s.t. probability of total failure in the subset is negligible • Often difficult in practice • Attribute-based model [JM02] • Processes characterized by attributes • Attributes determine failure correlation • Finding a core is NP-hard • Color-based model [JM02] • Single attribute characterizes a process • Polynomial time algorithm to find cores

Cores/Survivor sets vs. Quorum systems • Cores, Survivor sets, Quorums • Subsets of processes • Quorums [Giff79] • Enforce mutual exclusion [GM85] • E.g. One-copy serializability • Quorums necessarily intersect • Execute operations on behalf of the system • Cores/Survivor sets • Do not necessarily execute operations on behalf of the system • Weaker than quorums: no intersection requirement a priori • Generalize objects commonly used in proofs and algorithms • Cores: subsets of size t+1 • Survivor sets: subsets of size n-t

Consensus

Motivation for Consensus • Replication often requires coordination • Coordination problems • Atomic broadcast • Clock synchronization • Agreement on fault-tolerant processors (FTP)

Consensus specification • Each process begins with a proposed value v ∈ V • Goal: agree on a single value • Typical Consensus definition [Attiya98] • Agreement: No two correct processes decide on different values • Termination: Every correct process eventually decides • Validity: If a process p decides on value v, then v was proposed by some process q • Strong validity: If every process has v as its initial value, then v is the only possible decision value [Attiya98] • Vector validity: A correct process decides on a vector such that [Doudou98] • If pi is correct, then element i of the vector is the initial value of pi or null • At least t+1 elements of the vector are initial values of correct processes
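
The three basic properties can be checked mechanically on the outcome of a finite run. The checker below is a minimal sketch with hypothetical names; since termination is a liveness property, it is approximated here as "every correct process has decided by the end of the recorded run".

```python
def check_consensus(proposals, decisions, correct):
    """proposals: process -> proposed value
    decisions: process -> decided value (processes that never decided are absent)
    correct:   set of processes that did not fail in the run"""
    decided_by_correct = {decisions[p] for p in correct if p in decisions}
    return {
        # No two correct processes decide on different values.
        "agreement": len(decided_by_correct) <= 1,
        # Every correct process has decided (finite-run approximation).
        "termination": all(p in decisions for p in correct),
        # Every decided value was proposed by some process.
        "validity": all(v in proposals.values() for v in decisions.values()),
    }

good = check_consensus({"p1": 0, "p2": 1, "p3": 1}, {"p1": 1, "p2": 1}, {"p1", "p2"})
bad = check_consensus({"p1": 0, "p2": 1}, {"p1": 0, "p2": 1}, {"p1", "p2"})
```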

Synchronous systems - Crash failures • Solution for any number of failures • Full-information algorithm (t+1 rounds) • Early-deciding algorithms [LF82, CB00] • For any execution with f failures, correct processes decide in at most f+1 rounds • Clean round: Round in which no process fails • A process receives messages from the same set of processes in two consecutive rounds • Message complexity: O(f·|Π|²)
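
The role of the t+1 rounds can be seen in a small simulation. The sketch below is a flooding variant (each process relays the set of values it has seen and decides on the minimum), not the full-information protocol itself, and the crash schedule format is an assumption of this example: `crashes[p] = (round, receivers)` means p crashes in that round after sending only to `receivers`.

```python
def flood_consensus(proposals, t, crashes=None):
    """Simulate t+1 synchronous rounds of value flooding under crash failures.
    Returns the decision of every process that is still alive at the end."""
    crashes = crashes or {}
    known = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, t + 2):
        inbox = {p: set() for p in alive}
        crashed_now = set()
        for p in alive:
            if p in crashes and crashes[p][0] == rnd:
                targets = set(crashes[p][1])  # partial send, then crash
                crashed_now.add(p)
            else:
                targets = alive
            for q in targets:
                if q in inbox:
                    inbox[q] |= known[p]
        alive -= crashed_now
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}

proposals = {"p1": 0, "p2": 1, "p3": 1, "p4": 1}
crashes = {"p1": (1, {"p2"})}  # p1 reaches only p2 before crashing
with_t_plus_1_rounds = flood_consensus(proposals, t=1, crashes=crashes)
with_one_round = flood_consensus(proposals, t=0, crashes=crashes)
```

In this schedule a single round is not enough: p2 has seen the value 0 but p3 and p4 have not, so the extra round is what restores agreement.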

In the core/survivor set model • Algorithm SyncCrash [JM03a, JM03d] • Choose a core C, preferably the smallest • Execute an early-deciding algorithm among the processes of C • Every process in Π has an array of |C| positions, one for each process in C • Processes in C send messages to processes in Π − C as well • A process decides when a round with no failures in C happens • Decision in at most |C| rounds • If |C| − 1 < t, this improves on the number of rounds • Message complexity: O(f·|C|·|Π|)

Synchronous systems - Arbitrary failures • Strong Consensus • Impossible if n ≤ 3·t [Lamport82] • Proof idea • Consensus algorithm that solves the problem for |Π| ≤ 3·t • Execution in which agreement is violated • Assume |Π| ≤ 3·t • Partition (A, B, C) of Π s.t. each subset has at most t processes • Execution 1 (A, B, C: v) • Execution 2 (A, B, C: v') • Execution 3 (A: v; B: v', C: *)

In the core/survivor set model • Lower bound on process replication [JM03a, JM03d] • Byzantine Partition: Every partition (A, B, C) of Π is such that at least one of the subsets contains a core • Byzantine Intersection: • The intersection of every pair of survivor sets in S_Π contains a core • The intersection of every three survivor sets in S_Π is not empty • Scenario (A, B, C: v) • Scenario (A, B, C: v') • Scenario (A: v; B: v', C: *)

Equivalence of Byzantine Intersection and Partition • In a partition (A, B, C): • No subset contains a core • S1 ∩ S2 ∩ S3 is empty

Solving Consensus for arbitrary failures • In the threshold model: Lamport et al. [Lamport82] • Solution for n > 3·t in t+1 rounds • In the core/survivor set model • Modified algorithm by Lamport et al. • Solution for systems satisfying Byzantine Partition • Replace subsets of processes of size n−t by survivor sets • Replace majority by intersection of two survivor sets • Enables a solution for some systems • Π = {pa, pb, pc, pd, pe} • C_Π = {{pa, pb, pc}, {pa, pd}, {pa, pe}, {pb, pd}, {pb, pe}, {pc, pd}, {pc, pe}, {pd, pe}} • S_Π = {{pa, pb, pc, pd}, {pa, pb, pc, pe}, {pa, pd, pe}, {pb, pd, pe}, {pc, pd, pe}}
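
The five-process example can be checked by brute force. The sketch below tests Byzantine Intersection and Byzantine Partition on the cores and survivor sets listed in the slide (letters a to e stand for pa to pe) and compares the outcome with the threshold bound, taking t as n minus the size of the smallest survivor set. Treating a set paired with itself as a valid "pair" and allowing empty subsets in a partition are assumptions of this sketch.

```python
from itertools import combinations_with_replacement, product

PROCS = "abcde"
CORES = [frozenset(c) for c in ["abc", "ad", "ae", "bd", "be", "cd", "ce", "de"]]
SURVIVORS = [frozenset(s) for s in ["abcd", "abce", "ade", "bde", "cde"]]

def contains_core(X, cores):
    return any(C <= X for C in cores)

def byzantine_intersection(survivors, cores):
    """Every pair of survivor sets shares a core; every triple shares a process."""
    pairs_ok = all(contains_core(s1 & s2, cores)
                   for s1, s2 in combinations_with_replacement(survivors, 2))
    triples_ok = all(s1 & s2 & s3
                     for s1, s2, s3 in combinations_with_replacement(survivors, 3))
    return pairs_ok and triples_ok

def byzantine_partition(procs, cores):
    """Every split of the processes into (A, B, C) leaves a core inside one part."""
    for labels in product(range(3), repeat=len(procs)):
        parts = [frozenset(p for p, l in zip(procs, labels) if l == i)
                 for i in range(3)]
        if not any(contains_core(part, cores) for part in parts):
            return False
    return True

n = len(PROCS)
t = n - min(len(s) for s in SURVIVORS)   # worst-case number of failures
threshold_ok = n > 3 * t
intersection_ok = byzantine_intersection(SURVIVORS, CORES)
partition_ok = byzantine_partition(PROCS, CORES)
```

Under these assumptions the example satisfies both properties although n > 3·t fails, which is the sense in which the model enables a solution for some systems.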

Lower bound on the number of rounds • Definitions • : replication requirement (e.g. Byzantine Partition) • is a subsystem of iff • satisfies  • A subsystem is minimal if there is no smaller subsystem • Theorem: Given a system [JM03a, JM03b] • is a minimal subsystem of sys • A is a Consensus algorithm

Back to the example • Π = {pa, pb, pc, pd, pe} • C_Π = {{pa, pb, pc}, {pa, pd}, {pa, pe}, {pb, pd}, {pb, pe}, {pc, pd}, {pc, pe}, {pd, pe}} • S_Π = {{pa, pb, pc, pd}, {pa, pb, pc, pe}, {pa, pd, pe}, {pb, pd, pe}, {pc, pd, pe}} • Crash failures • Lower bound on the number of rounds: • Arbitrary failures • Lower bound on the number of rounds: • Bound is different for crash and arbitrary failures!

Asynchronous systems • No solution for pure asynchronous systems even for a single crash failure [FLP85] • Slow process vs. faulty process: requires a liveness property • Common approaches • Partially synchronous systems [DLS88] • Extend model with failure detectors [CT96] • Crash failures (◇S [CT96]) • Crash Partition: Every partition (A, B) of Π is such that either A or B contains a core • Crash Intersection: The intersection of every two survivor sets contains a core (coterie [GM88]) • Arbitrary failures (◇M [Doudou98]) • Byzantine Partition/Intersection

Related work - Hybrid failure models • Moves away only from the identically distributed failure assumption • Different failure modes, one class for each mode [LR94] • Manifest (c): detectable failures (e.g. corrupted messages) • Symmetric (s): behavior deviates arbitrarily, but it is the same for every other processor (e.g. send the same erroneous value to every other process) • Arbitrary (a): behavior deviates arbitrarily (e.g. send different values to different processes) • Algorithm for the Oral Messages problem

Replication requirements elsewhere • More general descriptions of failure scenarios • Fail-prone systems [Malkhi97] • Collusion and adversary structures (malicious players) [Hirt97] • Martin et al. [Martin02] • Confirmable writes in quorum systems • Property: for every subset B in a fail-prone system and every pair of quorums Q1, Q2, we have that (Q1 ∩ Q2) \ B ≠ ∅ ⇒ intersection of every pair of quorums contains a core • Hirt and Maurer [Hirt97] • Secure multi-party protocols • Passive model: no pair of collusions can add up to the set of players ⇒ set of correct players is a coterie • Active model: no three adversaries can add up to the set of players ⇒ intersection of three sets of correct players is not empty

Generalizing n > k t (Work in progress)

Motivation: k integer • Properties establishing bounds on process replication are similar for problems • Asynchronous crash Consensus (◇W) • TM: n > 2·t • C/SS: ∀S1, S2 ∈ S_Π: S1 ∩ S2 ≠ ∅ • State-machine replication: arbitrary failures • TM: n > 2·t • C/SS: ∀S1, S2 ∈ S_Π: S1 ∩ S2 ≠ ∅ • Synchronous arbitrary Consensus • TM: n > 3·t • C/SS: ∀S1, S2, S3 ∈ S_Π: S1 ∩ S2 ∩ S3 ≠ ∅
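
The correspondence between n > k·t and a k-wise intersection property can be checked numerically for the threshold model, where the survivor sets are all subsets of size n−t. The sketch below is a brute-force check over small n, meant only as an illustration of why the two formulations line up; the function names are hypothetical.

```python
from itertools import combinations, combinations_with_replacement

def threshold_survivor_sets(n, t):
    """In the threshold model, every subset of n - t processes is a survivor set."""
    return [frozenset(c) for c in combinations(range(n), n - t)]

def k_wise_intersect(survivor_sets, k):
    """True if every choice of k survivor sets has a non-empty intersection."""
    return all(frozenset.intersection(*group)
               for group in combinations_with_replacement(survivor_sets, k))

# Compare the intersection property with n > k*t for small systems.
results = {
    (n, t, k): k_wise_intersect(threshold_survivor_sets(n, t), k)
    for n in range(1, 7) for t in range(0, n) for k in (2, 3)
}
```

The intuition is a counting argument: k survivor sets together exclude at most k·t processes, so some process belongs to all of them exactly when k·t < n.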

Motivation: k rational • Consensus for synchronous systems with receive-omission faults • In the threshold model: • Proof idea • Execution 1 • Processes in B and C crash • Processes in A propose 0 and decide upon 0 • Execution 2 • Processes in A and C crash • Processes in B propose 1 and decide upon 1 • Execution 3 • Processes in A fail to receive messages from processes not in A • Processes in B fail to receive messages from processes not in B • Processes in A propose 0 and decide upon 0 • Processes in B propose 1 and decide upon 1 • Agreement is violated!

Generalizing the partition and the intersection properties • (, )-Partition. For every partition of , there is a subset such that: • (, )-Intersection. For every : ,: subset of S

Threshold Model ( ) Some intuition on the generalized properties • =3, =2 AC contains a core Core/Survivor set Model

Bounds on process replication • Lower bound • Every set of processes  that satisfies , also satisfies (, )-Partition • In every partition of  into  subsets, there are  subsets s.t. the union contains at least t+1 processes • consequently a core • Upper bound (work in progress) • If a problem P can be solved by an algorithm A in a system satisfying , then P can be solved by a system satisfying (k,1)-Partition • Simulate a system under the threshold model • Rational k • Looking for a candidate algorithm to motivate

Implications • Algorithms designed under the threshold model can be automatically translated to our model, for integer k • There is no need to rethink the whole FT distributed systems world • If it simplifies, one may design an algorithm under the threshold model and later translate using our technique

Correlated failures in the real world (work in progress)

Background: Systems considering dependent failures • Oceanstore [WMK02] • Online mechanism to correlate failures • Identify subsets of maximally independent failures • Problem • Correlate failures only after they have happened • Not useful for malicious behavior • PASIS [BWWG02] • Survivable storage systems • Add correlation level to classical model of availability • Two models to determine correlation level • Conditional probabilities • Beta-binomial distribution • Problem: • Requires the computation of failure distributions

Coping with Internet catastrophes: Phoenix • Possible approaches • Contain Internet pathogens: very challenging [Moore03b] • Recover from catastrophes: replicate data • Typical replication strategy • Assume independent host failures • Compute a threshold t on the number of failures • Replicate to this degree • Shared vulnerabilities ⇒ dependent host failures • Independence of host failures is not a suitable assumption • Threshold t on the number of host failures • From previous events, t can be large • Code Red worm infected over 360,000 hosts

Our replication strategy • Desirable properties • Enable recovery of data after an Internet catastrophe • Small replica sets • Informed strategy for replica placement [JBMSV03] • Sets of hosts that fail independently • Hosts executing different sets of software systems • Classes of software systems: attributes • E.g. Operating system • Potentially vulnerable software systems: attribute values • E.g. Linux, Windows • Replicate data on a set of hosts that have different values for each attribute: cores
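
One way to picture the placement rule is a greedy search. The sketch below is a hypothetical heuristic, not the one used by Phoenix: starting from the host that owns the data, it keeps adding the host that differs from the owner in the most attributes still shared by every chosen host, until no attribute value is common to the whole replica set. Host names and the web server and browser values are made up for the example.

```python
def is_covering(config, hosts):
    """True if, for every attribute, the hosts do not all share the same value."""
    hosts = list(hosts)
    n_attrs = len(config[hosts[0]])
    return all(len({config[h][i] for h in hosts}) > 1 for i in range(n_attrs))

def greedy_core(config, origin):
    """config: host -> tuple of attribute values. Returns a replica set that
    starts at `origin`, or None if the attribute values cannot be covered."""
    chosen = [origin]
    shared = set(range(len(config[origin])))  # attributes all chosen hosts share
    while shared:
        gain = lambda h: sum(1 for i in shared if config[h][i] != config[origin][i])
        best = max((h for h in sorted(config) if h not in chosen),
                   key=gain, default=None)
        if best is None or gain(best) == 0:
            return None
        chosen.append(best)
        shared = {i for i in shared if config[best][i] == config[origin][i]}
    return chosen

# Attribute order: (operating system, web server, web browser); hypothetical hosts.
config = {
    "h1": ("Linux", "Apache", "Firefox"),
    "h2": ("Windows", "IIS", "IE"),
    "h3": ("Windows", "Apache", "IE"),
    "h4": ("Linux", "IIS", "Firefox"),
}
replicas = greedy_core(config, "h1")
```

The result is not guaranteed to be minimal, so it may be larger than a true core; it only guarantees that a single vulnerable attribute value cannot take out every replica.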

An example • Attributes • Operating system: two values • Web server: two values • Web browser: two values • Cores • Red and Green (orthogonal core) • Red, Yellow, and Blue • [Figure: Phoenix hosts labeled with their attribute configurations]

In this presentation… • Feasibility of this approach • What is the impact of diversity on storage overhead and load? • Diversity: distribution of attribute configurations • Storage overhead: size of the replica set (core) • Storage load: given a host h, number of cores h participates in • Simulations • Levels of diversity • Varying attribute sets

System model • A set H of hosts • A set A of attributes • Every attribute has the same cardinality y • A mapping M from hosts to attribute configurations • Diversity • Determined by M • Often skewed in practice (93% Windows) [OneStat] • Modeling diversity • Single parameter f ∈ [0.5, 1) • A share f of the hosts has a share (1−f) of the attribute configurations • Examples of attribute configurations (figure): Example 1: f = 0.5; Example 2: f = 0.75
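
The diversity parameter f can be turned into a concrete mapping M in several ways. The sketch below is one possible reading, stated as an assumption: a share f of the hosts is spread round-robin over the first (1−f) share of the configurations, and the remaining hosts are spread over the remaining configurations. The attribute value names are placeholders.

```python
from itertools import product

def assign_configurations(num_hosts, configs, f):
    """Return one attribute configuration per host, skewed by f in [0.5, 1)."""
    popular_n = max(1, round((1 - f) * len(configs)))
    popular = configs[:popular_n]
    rare = configs[popular_n:] or popular   # fall back if nothing is left over
    cut = round(f * num_hosts)              # hosts using the popular configurations
    return ([popular[i % len(popular)] for i in range(cut)] +
            [rare[i % len(rare)] for i in range(num_hosts - cut)])

# Three attributes with two values each: 8 possible configurations.
configs = list(product(["os1", "os2"], ["ws1", "ws2"], ["wb1", "wb2"]))
uniform = assign_configurations(100, configs, f=0.5)
skewed = assign_configurations(100, configs, f=0.75)
```

With f = 0.5 the hosts are spread evenly over the configurations, while f = 0.75 concentrates three quarters of the hosts on a quarter of the configurations.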

Heuristic to find cores • Attributes • Operating system: two values • Web server: two values • Web browser: two values • Cores • Red and Green • Red, Yellow, and Blue • [Figure: Phoenix hosts labeled with their attribute configurations]
