COMP 655: Distributed/Operating Systems

COMP 655:Distributed/Operating Systems Winter 2012 Mihajlo Jovanovic Week 7: Fault Tolerance Distributed Systems - COMP 655

Fault Tolerance • Fault tolerance concepts • Implementation – distributed agreement • Distributed agreement meets transaction processing: 2- and 3-phase commit Bonus material • Implementation – reliable point-to-point communication • Implementation – process groups • Implementation – reliable multicast • Recovery • Sparing Distributed Systems - COMP 655

Fault tolerance concepts • Availability – can I use it now? • Usually quantified as a percentage • Reliability – can I use it for a certain period of time? • Usually quantified as MTBF • Safety – will anything really bad happen if it does fail? • Maintainability – how hard is it to fix when it fails? • Usually quantified as MTTR Distributed Systems - COMP 655

Comparing nines • 1 year = 8760 hr • Availability levels • 90% = 876 hr downtime/yr • 99% = 87.6 hr downtime/yr • 99.9% = 8.76 hr downtime/yr • 99.99% = 52.56 min downtime/yr • 99.999% = 5.256 min downtime/yr Distributed Systems - COMP 655

Exercise: how to get five nines • Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider: • Hardware failures, especially disks • Power failures • Network outages • Software installation • What else? • Come up with some ideas about how to solve the problems you identify Distributed Systems - COMP 655

Multiple machines at 99% Assuming independent failures Distributed Systems - COMP 655

Things to watch out for in availability requirements • What constitutes an outage … • A client PC going down? • A client applet going into an infinite loop? • A server crashing? • A network outage? • Reports unavailable? • If a transaction times out? • If 100 transactions time out in a 10 min period? • etc Distributed Systems - COMP 655

More to watch out for • What constitutes being back up after an outage? • When does an outage start? • When does it end? • Are there outages that don’t count? • Natural disasters? • Outages due to operator errors? • What about MTBF? Distributed Systems - COMP 655

Ways to get 99% availability • MTBF = 99 hr, MTTR = 1 hr • MTBF = 99 min, MTTR = 1 min • MTBF = 99 sec, MTTR = 1 sec Distributed Systems - COMP 655

fault causes error may cause failure More definitions • Types of faults: • transient • intermittent • permanent Fault tolerance is continuing to work correctly in the presence of faults. Distributed Systems - COMP 655

Types of failures Distributed Systems - COMP 655

If you remember one thing • Components fail in distributed systems on a regular basis. • Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole • Is available and/or • Is reliable and/or • Is safe and/or • Is maintainable depending on the problem it is trying to solve and the resources available … Distributed Systems - COMP 655

Fault Tolerance • Fault tolerance concepts • Implementation – distributed agreement • Distributed agreement meets transaction processing: 2- and 3-phase commit Distributed Systems - COMP 655

Two-army problem • Red army has 5,000 troops • Blue army and White army have 3,000 troops each • Attack together and win • Attack separately and lose in serial • Communication is by messenger, who might be captured • Blue and white generals have no way to know when a messenger is captured Distributed Systems - COMP 655

Activity: outsmart the generals • Take your best shot at designing a protocol that can solve the two-army problem • Spend ten minutes • Did you think of anything promising? Distributed Systems - COMP 655

Conclusion: go home • “agreement between even two processes is not possible in the face of unreliable communication” Distributed Systems - COMP 655

Byzantine generals • Assume perfect communication • Assume n generals, m of whom should not be trusted • The problem is to reach agreement on troop strength among the non-faulty generals Distributed Systems - COMP 655

Byzantine generals - example n = 4, m = 1 (units are K-troops) • Multicast troop-strength messages • Construct troop-strength vectors • Compare notes: majority rules in each component • Result: 1, 2, and 4 agree on (1,2,unknown,4) Distributed Systems - COMP 655

Doesn’t work with n=3, m=1 Distributed Systems - COMP 655

Fault Tolerance • Fault tolerance concepts • Implementation – distributed agreement • Distributed agreement meets transaction processing: 2- and 3-phase commit Distributed Systems - COMP 655

Distributed commit protocols • What is the problem they are trying to solve? • Ensure that a group of processes all do something, or none of them do • Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do Distributed Systems - COMP 655

2-phase commit What to do when P, in READY state, contacts Q Coordinator Participant Distributed Systems - COMP 655

If coordinator crashes • Participants could wait until the coordinator recovers • Or, they could try to figure out what to do among themselves • Example, if P contacts Q, and Q is in the COMMIT state, P should COMMIT as well Distributed Systems - COMP 655

2-phase commit What to do when P, in READY state, contacts Q • If all surviving participants are in READY state, • Wait for coordinator to recover • Elect a new coordinator (?) Distributed Systems - COMP 655

3-phase commit • Problem addressed: • Non-blocking distributed commit in the presence of failures • Interesting theoretically, but rarely used in practice Distributed Systems - COMP 655

3-phase commit Coordinator Participant Distributed Systems - COMP 655

Bonus material • Implementation – reliable point-to-point communication • Implementation – process groups • Implementation – reliable multicast • Recovery • Sparing Distributed Systems - COMP 655

RPC, RMI crash & omission failures • Client can’t locate server • Request lost • Server crashes after receipt of request • Response lost • Client crashes after sending request Distributed Systems - COMP 655

Can’t locate server • Raise an exception, or • Send a signal, or • Log an error and return an error code Note: hard to mask distribution in this case Distributed Systems - COMP 655

Request lost • Timeout and retry • Back off to “cannot locate server” if too many timeouts occur Distributed Systems - COMP 655

Server crashes after receipt of request • Possible semantic commitments • Exactly once • At least once • At most once Normal Work done Work not done Distributed Systems - COMP 655

Behavioral possibilities • Server events • Process (P) • Send completion message (M) • Crash (C) • Server order • P then M • M then P • Client strategies • Retry every message • Retry no messages • Retry if unacknowledged • Retry if acknowledged Distributed Systems - COMP 655

Combining the options Distributed Systems - COMP 655

Lost replies • Make server operations idempotent whenever possible • Structure requests so that server can distinguish retries from the original Distributed Systems - COMP 655

Client crashes • The server-side activity is called an orphan computation • Orphans can tie up resources, hold locks, etc • Four strategies (at least) • Extermination, based on client-side logs • Client writes a log record before and after each call • When client restarts after a crash, it checks the log and kills outstanding orphan computations • Problems include: • Lots of disk activity • Grand-orphans Distributed Systems - COMP 655

Client crashes, continued • More approaches for handling orphans • Re-incarnation, based on client-defined epochs • When client restarts after a crash, it broadcasts a start-of-epoch message • On receipt of a start-of-epoch message, each server kills any computation for that client • “Gentle” re-incarnation • Similar, but server tries to verify that a computation is really an orphan before killing it Distributed Systems - COMP 655

Yet more client-crash strategies • One more strategy • Expiration • Each computation has a lease on life • If not complete when the lease expires, a computation must obtain another lease from its owner • Clients wait one lease period before restarting after a crash (so any orphans will be gone) • Problem: what’s a reasonable lease period? Distributed Systems - COMP 655

Common problems with client-crash strategies • Crashes that involve network partition (communication between partitions will not work at all) • Killed orphans may leave persistent traces behind, for example • Locks • Requests in message queues Distributed Systems - COMP 655

How to do it? • Redundancy applied • In the appropriate places • In the appropriate ways • Types of redundancy • Data (e.g. error correcting codes, replicated data) • Time (e.g. retry) • Physical (e.g. replicated hardware, backup systems) Distributed Systems - COMP 655

Triple Modular Redundancy Distributed Systems - COMP 655

Tandem Computers • TMR on • CPUs • Memory • Duplicated • Buses • Disks • Power supplies • A big hit in operations systems for a while Distributed Systems - COMP 655

Replicated processing • Based on process groups • A process group consists of one or more identical processes • Key events • Message sent to one member of a group • Process joins group • Process leaves group • Process crashes • Key requirements • Messages must be received by all members • All members must agree on group membership Distributed Systems - COMP 655

Flat or non-flat? Distributed Systems - COMP 655

Effective process groups require • Distributed agreement • On group membership • On coordinator elections • On whether or not to commit a transaction • Effective communication • Reliable enough • Scalable enough • Often, multicast • Typically looking for atomic multicast Distributed Systems - COMP 655

Process groups also require • Ability to tolerate crash failures and omission failures • Need k+1 processes to deal with up to k silent failures • Ability to tolerate performance, response, and arbitrary failures • Need 3k+1 processes to reach agreement with up to k Byzantine failures • Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures Distributed Systems - COMP 655

Reliable multicasting Distributed Systems - COMP 655

COMP 655: Distributed/Operating Systems