Replication techniques: Primary-backup, RSM, Paxos • Jinyang Li
Fault tolerance => replication • How to recover a single node from power failure? • Wait for reboot • Data is durable, but service is unavailable temporarily • Use multiple nodes to provide service • Another node takes over to provide service
Replicated state machine (RSM) • RSM is a general replication method • Lab 6: apply RSM to lock service • RSM Rules: • All replicas start in the same initial state • Every replica applies operations in the same order • All operations must be deterministic • All replicas end up in the same state
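The RSM rules above can be summarized in a few lines of code. The sketch below (Python, with a hypothetical `Replica` class and a toy lock-service state, not the lab's actual interface) shows why identical initial state plus the same order of deterministic operations yields identical final state.

```python
# Minimal sketch of an RSM replica (illustrative names, not the lab's API).
class Replica:
    def __init__(self, initial_state):
        self.state = dict(initial_state)   # all replicas start in the same state

    def apply(self, op):
        # op must be deterministic: its effect depends only on (state, op)
        name, args = op
        if name == "acquire":
            self.state[args] = "locked"
        elif name == "release":
            self.state[args] = "free"
        return self.state

# Two replicas fed the same operation log end in the same state.
log = [("acquire", "lock-A"), ("release", "lock-A"), ("acquire", "lock-B")]
r1, r2 = Replica({}), Replica({})
for op in log:
    r1.apply(op)
    r2.apply(op)
assert r1.state == r2.state
```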
RSM • (Diagram: concurrent clients submit opA and opB to the replicas) • How to maintain a single order in the face of concurrent client requests?
RSM: primary/backup • (Diagram: clients send opA and opB to the primary, which forwards them to the backup in a single order) • Primary/backup ensures a single order of ops: • Primary orders operations • Backups execute operations in order
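A minimal sketch of the ordering rule, assuming illustrative `Primary`/`Backup` classes rather than the lab's RPC interface: the primary stamps each request with a sequence number and every backup applies requests strictly in that order, so concurrent clients cannot cause replicas to diverge.

```python
# Sketch: the primary serializes concurrent client requests by stamping each
# with a sequence number; backups apply them strictly in sequence order.
class Primary:
    def __init__(self, backups):
        self.seq = 0
        self.backups = backups

    def handle(self, op):
        self.seq += 1                      # one global order, chosen by the primary
        for b in self.backups:
            b.deliver(self.seq, op)
        return self.seq

class Backup:
    def __init__(self):
        self.next_seq = 1
        self.pending = {}
        self.log = []

    def deliver(self, seq, op):
        self.pending[seq] = op
        while self.next_seq in self.pending:   # apply in order, with no gaps
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

b1, b2 = Backup(), Backup()
p = Primary([b1, b2])
for op in ("opA", "opB"):                  # concurrent clients; primary picks the order
    p.handle(op)
assert b1.log == b2.log == ["opA", "opB"]
```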
Case study: Hypervisor [Bressoud and Schneider] • Hypervisor’s goal: fault tolerance • Banks, telephone exchanges, NASA need fault-tolerant computing • CPUs are the most likely component to fail due to their complexity • Hypervisor: primary/backup replication • If primary fails, backup takes over • Caveat: assumes failure detection is perfect
Hypervisor replicates at VM-level • Why replicate at the VM level? • Hardware fault-tolerant machines were big in the 80s • A software solution is more economical than a hardware one • Replicating at the O/S level is messy (many interfaces) • Replicating at the app level requires programmer effort (web developers are not used to writing RSM code) • Replicating at the VM level has a cleaner interface (and no need to change the O/S or app) • Primary and backup execute the same sequence of machine instructions
A Strawman design • Two identical machines, each with its own memory and disk • Same initial memory/disk content • Start executing on both machines • Will they perform the same computation?
A strawman design • Property: nodes executing the same sequence of ops see the same effect if the ops are deterministic • What are deterministic operations? • ADD, MUL etc. • Read time-of-day register, cycle counter, privilege level? • Read memory? • Read disk? • Interrupt timing? • External input devices (network, keyboard) • External output (network, display)
Deal with I/O devices • Strawman replicates disks at both machines • Problem: the hardware might not behave identically (e.g. fail at different sectors) • Hypervisor connects the I/O devices (SCSI bus, ethernet) to both machines • Only the primary reads/writes to devices • Primary sends read values to the backup • Only the primary handles interrupts from h/w • Primary sends interrupts to the backup
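The I/O rule can be sketched as follows (a toy `FakeDisk` and an in-memory queue stand in for the real SCSI bus and the primary-to-backup channel): only the primary performs the real read, and the backup consumes the forwarded value instead of touching hardware that might behave differently.

```python
from queue import Queue

class FakeDisk:
    def __init__(self, data): self.data = data
    def read(self, sector): return self.data[sector]

def primary_read(disk, sector, to_backup):
    value = disk.read(sector)     # only the primary performs the real I/O
    to_backup.put(value)          # forward the observed value to the backup
    return value

def backup_read(from_primary):
    return from_primary.get()     # backup uses the forwarded value, not its own disk

chan = Queue()
disk = FakeDisk({0: b"boot"})
assert primary_read(disk, 0, chan) == backup_read(chan)
```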
Hypervisor executes in epochs • How to execute interrupts at the same time on both nodes? • Strawman: execute instructions one at a time • Backup waits for the primary to send interrupts at the end of each instruction • Very slow… • Hypervisor executes in epochs • CPU h/w interrupts every N instructions (so both nodes stop at the same point) • Primary delays all interrupts till the end of an epoch • Primary sends all interrupts to the backup
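A rough illustration of the epoch idea, with an `EPOCH` constant standing in for the hardware counter that interrupts every N instructions: interrupts raised mid-epoch are buffered and delivered only at the epoch boundary, so the delivery point depends only on the epoch, not on exactly when the hardware raised the interrupt.

```python
# Purely illustrative: buffer interrupts and deliver them at epoch boundaries,
# so primary and backup handle them at the same instruction count.
EPOCH = 4  # deliver interrupts every EPOCH instructions

def run_epochs(instructions, interrupts_at):
    delivered = []          # (instruction_index, interrupt) pairs, in order
    pending = []
    for i, _ in enumerate(instructions, start=1):
        pending.extend(interrupts_at.get(i, []))   # raised now, but delayed
        if i % EPOCH == 0:                         # epoch boundary
            delivered.extend((i, irq) for irq in pending)
            pending = []
    return delivered

prog = ["ins"] * 8
# The delivery point depends only on the epoch, not the exact arrival instruction.
assert run_epochs(prog, {2: ["timer"]}) == run_epochs(prog, {3: ["timer"]}) == [(4, "timer")]
```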
Hypervisor failover • If the primary fails, the backup must handle I/O • How does the backup know the primary has failed? • How does the backup know if the primary has handled I/O writes in the last epoch? • Relies on the O/S to re-try the I/O • Device needs to support repeated ops • OK for disk writes/reads • OK for network (TCP will figure it out) • How about a keyboard, printer, or ATM cash machine?
Hypervisor implementation • Hypervisor needs to trap every non-deterministic instruction • Time-of-day register • HP TLB replacement • HP branch-and-link instruction • Memory-mapped I/O loads/stores • Performance penalty is reasonable • A factor-of-two slowdown • How about its performance on modern hardware?
Caveats in Hypervisor • What if the network between primary and backup fails? • Primary is still running • Backup becomes a new primary • Two primaries at the same time! • Can timeouts detect failures correctly? • Pings from backup to primary are lost • Pings from backup to primary are delayed
Paxos: general approach • One (or more) node decides to be the leader • Leader proposes a value and solicits acceptance from others • Leader announces the result or tries again
Paxos: fault tolerant agreement • Paxos lets all nodes agree on the same value despite node failures, network failures and delays • Extremely useful: • e.g. Nodes agree that X is the primary • e.g. Nodes agree that the last instruction I is executed
Paxos requirement • Correctness: • All nodes agree on the same value • The agreed value X has been proposed by some node • Fault-tolerance: • If fewer than N/2 nodes fail, the remaining nodes should reach agreement eventually w.h.p. • Liveness is not guaranteed
Why is agreement hard? • What if more than one node becomes leader simultaneously? • What if there is a network partition? • What if a leader crashes in the middle of solicitation? • What if a leader crashes after deciding but before announcing the result? • What if the new leader proposes a different value than the already-decided value?
Paxos setup • Each node proposes a value for a particular view • Agreement is of type vid=<XYZ> • Each node runs as a proposer, acceptor or learner
Strawman • Designate a single node X as acceptor (e.g. one with smallest id) • Each proposer sends its value to X • X decides on one of the values • X announces its decision to all learners • Problem? • Failure of the single acceptor halts decision • Need multiple acceptors!
Strawman 2: multiple acceptors • Each proposer (leader) proposes its value to all acceptors • Each acceptor accepts the first proposal it receives • If the leader receives positive replies from a majority of acceptors, it decides on its own value • There is at most 1 majority, hence only a single value is decided • Leader sends the decided value to all learners • Problem: • What if multiple leaders propose simultaneously, so that no value is accepted by a majority?
Paxos solution • Proposals are ordered by proposal # • Each acceptor may accept multiple proposals • If a proposal with value v is chosen, all higher proposals have value v
Paxos operation: node state • Each node maintains: • na, va: highest proposal # and its corresponding value accepted by me • nh: highest proposal # seen • myn: my proposal # in current Paxos
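The per-node state can be written down directly; the sketch below assumes proposal numbers are (round, node_id) pairs compared lexicographically, which is one common way to make the N1:1-style numbers in these slides totally ordered and unique.

```python
# Sketch of the per-node Paxos state above (illustrative, not the lab's types).
from dataclasses import dataclass
from typing import Optional, Tuple

ProposalNum = Tuple[int, str]   # e.g. (1, "N1"), written N1:1 in the slides

@dataclass
class PaxosState:
    n_a: Optional[ProposalNum] = None   # highest proposal # accepted by me
    v_a: Optional[object] = None        # value of that accepted proposal
    n_h: ProposalNum = (0, "")          # highest proposal # seen in any message
    my_n: ProposalNum = (0, "")         # my own proposal # in the current round
```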
Paxos operation: 3P protocol • Phase 1 (Prepare) • A node decides to be leader (and propose) • Leader chooses myn > nh • Leader sends <prepare, myn> to all nodes • Upon receiving <prepare, n>: if n < nh, reply <prepare-reject>; else set nh = n and reply <prepare-ok, na, va>
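Phase 1 in code, following the pseudocode above and reusing the `PaxosState` sketch (illustrative only): the leader picks a number higher than anything it has seen, and an acceptor that replies prepare-ok implicitly promises to ignore lower-numbered proposals.

```python
# Phase 1 sketch (assumes the PaxosState sketch above).
def start_prepare(state, my_id):
    state.my_n = (state.n_h[0] + 1, my_id)     # choose myn > nh
    return ("prepare", state.my_n)

def on_prepare(state, n):
    if n < state.n_h:
        return ("prepare-reject",)
    state.n_h = n                              # promise: ignore anything lower
    return ("prepare-ok", state.n_a, state.v_a)
```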
Paxos operation • Phase 2 (Accept): • If leader gets prepare-ok from a majority: V = the non-null value corresponding to the highest na received; if V = null, the leader can pick any V; send <accept, myn, V> to all nodes • If leader fails to get a majority of prepare-ok: delay and restart Paxos • Upon receiving <accept, n, V>: if n >= nh, set na = n, va = V and reply <accept-ok>; else reply <accept-reject>
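Phase 2 in the same style (again assuming the earlier `PaxosState` sketch): the leader must adopt the value attached to the highest na reported in the prepare-ok replies, and an acceptor accepts only if the proposal number is at least as high as any it has promised.

```python
# Phase 2 sketch (assumes the PaxosState sketch above).
def choose_value(prepare_ok_replies, my_value):
    # replies: list of (n_a, v_a) pairs from a majority of acceptors
    accepted = [(n, v) for (n, v) in prepare_ok_replies if v is not None]
    if not accepted:
        return my_value                              # free to pick any value
    return max(accepted, key=lambda p: p[0])[1]      # value of the highest n_a

def on_accept(state, n, v):
    if n >= state.n_h:                               # no higher promise outstanding
        state.n_a, state.v_a, state.n_h = n, v, n
        return ("accept-ok",)
    return ("accept-reject",)
```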
Paxos operation • Phase 3 (Decide) • If leader gets accept-ok from a majority • Send <decide, va> to all nodes • If leader fails to get accept-ok from a majority • Delay and restart Paxos
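Putting the three phases together, here is a single round driven by the leader, using direct function calls in place of RPCs and the helpers from the previous sketches; a real implementation would add timeouts and restart with a higher proposal number instead of returning None.

```python
# End-to-end sketch of one Paxos round from the leader's point of view.
def run_round(leader_id, my_value, acceptors):
    _, n = start_prepare(acceptors[leader_id], leader_id)   # Phase 1: prepare
    oks = [r for r in (on_prepare(s, n) for s in acceptors.values())
           if r[0] == "prepare-ok"]
    if len(oks) <= len(acceptors) // 2:
        return None                          # no majority: delay and restart Paxos

    v = choose_value([(na, va) for (_, na, va) in oks], my_value)
    acks = [r for r in (on_accept(s, n, v) for s in acceptors.values())
            if r[0] == "accept-ok"]          # Phase 2: accept
    if len(acks) <= len(acceptors) // 2:
        return None                          # no majority: delay and restart Paxos

    return v                                 # Phase 3: send <decide, v> to all nodes

acceptors = {name: PaxosState() for name in ("N0", "N1", "N2")}
assert run_round("N1", "val1", acceptors) == "val1"
```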
Paxos operation: an example • Initially N0, N1 and N2 each have na = va = null and nh = 0 • N1 becomes leader and sends <Prepare, N1:1> to N0 and N2 • N0 and N2 set nh = N1:1 and reply <prepare-ok, na = va = null> • N1 sends <Accept, N1:1, val1>; N0 and N2 set na = N1:1, va = val1 and reply <accept-ok> • N1 sends <Decide, val1> to all nodes
Paxos properties • When is the value V chosen? • When the leader receives a majority of prepare-ok and proposes V? • When a majority of nodes accept V? • When the leader receives a majority of accept-ok for value V?
Understanding Paxos • What if more than one leader is active? • Suppose two leaders use different proposal numbers, N0:10 and N1:11 • Can both leaders see a majority of prepare-ok?
Understanding Paxos • What if leader fails while sending accept? • What if a node fails after receiving accept? • If it doesn’t restart, ? • If it reboots, ? • What if a node fails after sending prepare-ok? • If it reboots, ?
Using Paxos for RSM • Fault-tolerant RSM requires consistent replica membership • Membership: <primary, backups> • RSM goes through a series of membership changes: <vid-0, primary, backups>, <vid-1, primary, backups>, … • Use Paxos to agree on the <primary, backups> for a particular vid
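One way to picture this (a toy structure, not the lab's classes): keep a map from vid to the agreed membership, and run each view-change agreement among the members of the previous view.

```python
# Illustrative sketch: one Paxos instance per view id; the agreed value for
# view vid is its membership list <primary, backups>.
views = {1: ["N1"]}   # vid 1: static initial configuration

def propose_view(vid, proposed_members, run_paxos_among):
    """Agree on the membership of view `vid` among the members of view vid-1."""
    electorate = views[vid - 1]
    agreed = run_paxos_among(electorate, proposed_members)
    if agreed is not None:
        views[vid] = agreed
    return agreed

# With a stand-in for the real Paxos round, N2 joining yields view 2 = N1,N2:
propose_view(2, ["N1", "N2"], lambda nodes, value: value)
assert views[2] == ["N1", "N2"]
```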
Lab5: Using Paxos for RSM • All nodes start with the static config vid1: N1 • N2 joins: a majority in vid1 (N1) accepts vid2: N1,N2 • N3 joins: a majority in vid2 (N1,N2) accepts vid3: N1,N2,N3 • N3 fails: a majority in vid3 (N1,N2,N3) accepts vid4: N1,N2
Lab5: Using Paxos for RSM • N3 joins while it only knows vid1: N1 • N3 sends <Prepare, vid2, N3:1> to N1 • N1 and N2 have already formed vid2: N1,N2, so N1 replies <oldview, vid2 = N1,N2> • N3 learns the current view
Lab5: Using Paxos for RSM • N3, now knowing vid2: N1,N2, sends <Prepare, vid3, N3:1> to both N1 and N2 • The members of vid2 run Paxos to agree on vid3: N1,N2,N3
Lab6: re-configurable RSM • Use RSM to replicate lock_server • The primary in each view assigns a viewstamp to each client request • A viewstamp is a tuple (vid:seqno) • (0:0)(0:1)(0:2)(0:3)(1:0)(1:1)(1:2)(2:0)(2:1) • All replicas execute client requests in viewstamp order
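Because a viewstamp is just a (vid, seqno) pair, representing it as a tuple gives exactly the execution order listed above; the snippet below only illustrates the ordering, nothing lab-specific.

```python
# Viewstamps as (vid, seqno) tuples compare lexicographically, which is
# exactly the order shown on the slide.
viewstamps = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1)]
assert viewstamps == sorted(viewstamps)   # all replicas execute requests in this order
```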
Lab6: Viewstamp replication • To execute an op with viewstamp vs, a replica must have executed all ops < vs • A replica can transfer state from another node to ensure its state reflects the execution of all ops < vs
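A small sketch of this rule with illustrative helper names: a replica may execute the op stamped vs only once every smaller viewstamp has been applied, and otherwise it copies state from an up-to-date peer first.

```python
def can_execute(executed, vs, log):
    # `executed` is the set of viewstamps already applied; the op stamped `vs`
    # may run only after every smaller viewstamp in the log has been applied.
    return all(v in executed for v in log if v < vs)

def transfer_state(lagging, up_to_date):
    # Copy a peer's state and executed set so the lagging replica's state
    # reflects the execution of all earlier ops.
    lagging["state"] = dict(up_to_date["state"])
    lagging["executed"] = set(up_to_date["executed"])

log = [(1, 0), (1, 1), (1, 2)]
fresh = {"state": {}, "executed": set()}
peer = {"state": {"lock-A": "locked"}, "executed": {(1, 0), (1, 1)}}
assert not can_execute(fresh["executed"], (1, 2), log)
transfer_state(fresh, peer)
assert can_execute(fresh["executed"], (1, 2), log)
```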
Lab5: Using Paxos for RSM • N2 joins: N1 and N2 run Prepare/Accept/Decide to agree on the new view • N1 reports its largest viewstamp, e.g. (1:50) • N1 transfers state to bring N2 up-to-date before N2 executes new requests