
COMP 655: Distributed/Operating Systems

COMP 655: Distributed/Operating Systems. Winter 2012. Mihajlo Jovanovic. Week 7: Fault Tolerance. Topics: fault tolerance concepts; implementation of distributed agreement; distributed agreement meets transaction processing (2- and 3-phase commit); bonus material.


Presentation Transcript


  1. COMP 655: Distributed/Operating Systems • Winter 2012 • Mihajlo Jovanovic • Week 7: Fault Tolerance • Distributed Systems - COMP 655

  2. Fault Tolerance • Fault tolerance concepts • Implementation – distributed agreement • Distributed agreement meets transaction processing: 2- and 3-phase commit • Bonus material: • Implementation – reliable point-to-point communication • Implementation – process groups • Implementation – reliable multicast • Recovery • Sparing

  3. Fault tolerance concepts • Availability – can I use it now? • Usually quantified as a percentage • Reliability – can I use it for a certain period of time? • Usually quantified as MTBF • Safety – will anything really bad happen if it does fail? • Maintainability – how hard is it to fix when it fails? • Usually quantified as MTTR

  4. Comparing nines • 1 year = 8760 hr • Availability levels • 90% = 876 hr downtime/yr • 99% = 87.6 hr downtime/yr • 99.9% = 8.76 hr downtime/yr • 99.99% = 52.56 min downtime/yr • 99.999% = 5.256 min downtime/yr
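The downtime figures on this slide come from multiplying the hours in a year by the unavailable fraction. A minimal Python sketch of that arithmetic, assuming the slide's 365-day year (the function name is illustrative, not from the deck):

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hr, as stated on the slide


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected downtime per year, in hours, at a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for level in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(level)
    print(f"{level}% -> {hours:.4f} hr/yr ({hours * 60:.3f} min/yr)")
```

Each extra nine divides the allowed downtime by ten, which is why the last two levels are easier to read in minutes.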

  5. Exercise: how to get five nines • Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider: • Hardware failures, especially disks • Power failures • Network outages • Software installation • What else? • Come up with some ideas about how to solve the problems you identify

  6. Multiple machines at 99% (assuming independent failures)

  7. Multiple machines at 95% (assuming independent failures)

  8. Multiple machines at 80% (assuming independent failures)
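The tables for these three slides did not survive in the transcript, so the following is only a sketch of the kind of calculation the titles suggest. It assumes each of n machines is independently available with probability a, and computes two quantities: the chance that all n are up (when every machine is required) and the chance that at least one is up (when any machine can serve). Function names and the choice of n values are illustrative.

```python
def all_up(a: float, n: int) -> float:
    """Probability that all n independent machines are up at once."""
    return a ** n


def at_least_one_up(a: float, n: int) -> float:
    """Probability that at least one of n independent machines is up."""
    return 1 - (1 - a) ** n


for a in (0.99, 0.95, 0.80):
    for n in (1, 2, 5, 10):
        print(f"a={a:.2f} n={n:2d}  all up: {all_up(a, n):.4f}  "
              f"at least one up: {at_least_one_up(a, n):.6f}")
```

The two columns move in opposite directions as n grows: requiring every machine makes the system worse than any single machine, while replication makes it better.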

  9. Things to watch out for in availability requirements • What constitutes an outage … • A client PC going down? • A client applet going into an infinite loop? • A server crashing? • A network outage? • Reports unavailable? • A transaction timing out? • 100 transactions timing out in a 10 min period? • etc.

  10. More to watch out for • What constitutes being back up after an outage? • When does an outage start? • When does it end? • Are there outages that don’t count? • Natural disasters? • Outages due to operator errors? • What about MTBF?

  11. Ways to get 99% availability • MTBF = 99 hr, MTTR = 1 hr • MTBF = 99 min, MTTR = 1 min • MTBF = 99 sec, MTTR = 1 sec
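All three combinations give the same availability because only the ratio matters. A short sketch, assuming the usual steady-state formula availability = MTBF / (MTBF + MTTR), which the slide implies but does not state:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; mtbf and mttr must use the same time unit."""
    return mtbf / (mtbf + mttr)


# Hours, minutes, or seconds: the unit cancels out.
for unit in ("hr", "min", "sec"):
    print(f"MTBF = 99 {unit}, MTTR = 1 {unit} -> {availability(99, 1):.2%}")
```

This is why an availability percentage alone says little about user experience: a system that fails every 99 seconds and one that fails every 99 hours can report the same number.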

  12. More definitions • A fault causes an error, which may cause a failure • Types of faults: • transient • intermittent • permanent • Fault tolerance is continuing to work correctly in the presence of faults.

  13. Types of failures

  14. If you remember one thing • Components fail in distributed systems on a regular basis. • Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole • Is available and/or • Is reliable and/or • Is safe and/or • Is maintainable depending on the problem it is trying to solve and the resources available …

  15. Fault Tolerance • Fault tolerance concepts • Implementation – distributed agreement • Distributed agreement meets transaction processing: 2- and 3-phase commit

  16. Two-army problem • Red army has 5,000 troops • Blue army and White army have 3,000 troops each • Attack together and win • Attack separately and lose, one after the other • Communication is by messenger, who might be captured • Blue and White generals have no way to know when a messenger is captured

  17. Activity: outsmart the generals • Take your best shot at designing a protocol that can solve the two-army problem • Spend ten minutes • Did you think of anything promising?

  18. Conclusion: go home • “agreement between even two processes is not possible in the face of unreliable communication”

  19. Byzantine generals • Assume perfect communication • Assume n generals, m of whom should not be trusted • The problem is to reach agreement on troop strength among the non-faulty generals

  20. Byzantine generals - example: n = 4, m = 1 (units are K-troops) • Multicast troop-strength messages • Construct troop-strength vectors • Compare notes: majority rules in each component • Result: 1, 2, and 4 agree on (1,2,unknown,4)

  21. Doesn’t work with n=3, m=1
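The three steps of the example (multicast strengths, build vectors, compare notes by majority) can be simulated in a few lines. This is a sketch under stated assumptions, not the slide's exact figure: general i has strength i, general 3 is the traitor, and the traitor sends a different made-up value in every message. The function name and the use of None for "unknown" are illustrative.

```python
from collections import Counter

UNKNOWN = None


def byzantine_agreement(strengths, faulty):
    """Return {loyal general: agreed vector}; strengths maps id -> true strength."""
    ids = sorted(strengths)
    lie = iter(range(100, 10_000))  # assumption: every lie is a fresh value

    def reported(sender, value):
        # Loyal generals relay the truth; a faulty one says something arbitrary.
        return next(lie) if sender in faulty else value

    # Step 1 and 2: everyone multicasts its strength; each general builds a vector.
    vectors = {
        r: {s: strengths[s] if s == r else reported(s, strengths[s]) for s in ids}
        for r in ids
    }

    # Step 3: everyone multicasts its vector; loyal generals take the
    # per-component majority over the vectors they received from the others.
    result = {}
    for r in ids:
        if r in faulty:
            continue
        received = [
            {k: reported(s, vectors[s][k]) for k in ids} for s in ids if s != r
        ]
        agreed = []
        for k in ids:
            value, count = Counter(v[k] for v in received).most_common(1)[0]
            agreed.append(value if count > len(received) / 2 else UNKNOWN)
        result[r] = tuple(agreed)
    return result


print(byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}))
print(byzantine_agreement({1: 1, 2: 2, 3: 3}, faulty={3}))
```

With n = 4 the loyal generals 1, 2, and 4 all end up with (1, 2, None, 4). With n = 3 each loyal general receives only two vectors, one of them from the traitor, so no component reaches a majority and every entry stays unknown.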

  22. Fault Tolerance • Fault tolerance concepts • Implementation – distributed agreement • Distributed agreement meets transaction processing: 2- and 3-phase commit

  23. Distributed commit protocols • What is the problem they are trying to solve? • Ensure that a group of processes all do something, or none of them do • Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do
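The all-or-nothing goal can be illustrated with a failure-free sketch of two-phase commit over three in-memory servers. Everything here is an assumption made for illustration: the class and function names, the message names VOTE_COMMIT, VOTE_ABORT, GLOBAL_COMMIT, and GLOBAL_ABORT (conventional textbook names, not taken from the slides), and the absence of timeouts, logging, and crashes.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Participant:
    """Toy in-memory participant; a real one would log to stable storage."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self):
        # Phase 1: vote, moving to READY only if willing to commit.
        if self.can_commit:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"
        return Vote.ABORT

    def commit(self):
        self.state = "COMMIT"

    def abort(self):
        self.state = "ABORT"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    commit = all(v is Vote.COMMIT for v in votes)
    # Phase 2: the coordinator sends the same decision to everyone.
    for p in participants:
        if commit:
            p.commit()
        else:
            p.abort()
    return "GLOBAL_COMMIT" if commit else "GLOBAL_ABORT"


servers = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(servers), [p.state for p in servers])
```

A single ABORT vote is enough to abort everywhere, which is what guarantees that the three servers never end up split between committed and aborted.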

  24. 2-phase commit: what to do when P, in READY state, contacts Q (figure panels: Coordinator, Participant)

  25. If coordinator crashes • Participants could wait until the coordinator recovers • Or, they could try to figure out what to do among themselves • For example, if P contacts Q and Q is in the COMMIT state, P should COMMIT as well

  26. 2-phase commit: what to do when P, in READY state, contacts Q • If all surviving participants are in READY state, either: • Wait for coordinator to recover • Elect a new coordinator (?)
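The decision a READY participant P makes after asking its peers can be sketched as a small function. Only two cases are stated in the transcript: Q in COMMIT means P commits, and everyone in READY means waiting for the coordinator. The ABORT and INIT cases below are assumptions that follow the standard textbook treatment of cooperative termination (a peer still in INIT has not voted, so the coordinator cannot have decided to commit); the function name and return strings are illustrative.

```python
def decide_when_ready(peer_states):
    """Decide what a READY participant does after contacting its peers.

    peer_states: iterable of "INIT", "READY", "COMMIT", or "ABORT",
    one entry per surviving participant that could be reached.
    """
    for q_state in peer_states:
        if q_state == "COMMIT":
            return "COMMIT"  # the coordinator must have decided to commit
        if q_state in ("ABORT", "INIT"):
            return "ABORT"   # assumption: a global commit cannot have happened
    # Every reachable peer is READY: nobody knows the outcome, so P blocks.
    return "WAIT_FOR_COORDINATOR"


print(decide_when_ready(["READY", "COMMIT"]))
print(decide_when_ready(["READY", "READY"]))
```

The last case is the blocking scenario that motivates electing a new coordinator or moving to a three-phase protocol.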

  27. 3-phase commit • Problem addressed: • Non-blocking distributed commit in the presence of failures • Interesting theoretically, but rarely used in practice

  28. 3-phase commit (figure panels: Coordinator, Participant)

  29. Bonus material • Implementation – reliable point-to-point communication • Implementation – process groups • Implementation – reliable multicast • Recovery • Sparing

  30. RPC, RMI crash & omission failures • Client can’t locate server • Request lost • Server crashes after receipt of request • Response lost • Client crashes after sending request

  31. Can’t locate server • Raise an exception, or • Send a signal, or • Log an error and return an error code • Note: hard to mask distribution in this case

  32. Request lost • Timeout and retry • Back off to “cannot locate server” if too many timeouts occur
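A minimal sketch of this timeout-and-retry policy. The `send_request` callable, the exception class, and the default limits are hypothetical: the sketch assumes `send_request(timeout_s)` raises Python's built-in `TimeoutError` when no reply arrives in time.

```python
class CannotLocateServer(Exception):
    """Raised when the client gives up after too many timeouts."""


def call_with_retry(send_request, max_attempts=3, timeout_s=1.0):
    """Retry a request on timeout; fall back to 'cannot locate server'."""
    for _ in range(max_attempts):
        try:
            return send_request(timeout_s)
        except TimeoutError:
            continue  # request or reply was lost (or the server is slow): retry
    raise CannotLocateServer(f"no reply after {max_attempts} attempts")
```

Note that the client cannot tell a lost request from a lost reply, so retrying is only safe if the server tolerates duplicates.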

  33. Server crashes after receipt of request • Possible semantic commitments • Exactly once • At least once • At most once (figure panels: Normal, Work done, Work not done)

  34. Behavioral possibilities • Server events • Process (P) • Send completion message (M) • Crash (C) • Server order • P then M • M then P • Client strategies • Retry every message • Retry no messages • Retry if unacknowledged • Retry if acknowledged

  35. Combining the options
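The table for this slide is missing from the transcript, but the combinations can be enumerated from the options listed on the previous slide. The sketch below assumes that "acknowledged" means the completion message M was sent before the crash, and that a retried request is processed exactly once by the recovered server. Outcome labels (OK, DUP, ZERO) and all names are illustrative.

```python
# Client strategies: decide whether to retry, given whether an ack (M) arrived.
STRATEGIES = {
    "Retry every message": lambda acked: True,
    "Retry no messages": lambda acked: False,
    "Retry if acknowledged": lambda acked: acked,
    "Retry if unacknowledged": lambda acked: not acked,
}


def outcome(server_order, crash_after, strategy):
    """server_order: "PM" or "MP"; crash_after: events completed before C (0-2)."""
    done = server_order[:crash_after]          # events finished before the crash
    retried = STRATEGIES[strategy]("M" in done)
    processed = ("P" in done) + retried        # how many times the work ran
    return {0: "ZERO", 1: "OK", 2: "DUP"}[processed]


for order in ("MP", "PM"):
    print(f"Server order: {order[0]} then {order[1]}")
    for strategy in STRATEGIES:
        row = [outcome(order, n, strategy) for n in (2, 1, 0)]
        print(f"  {strategy:24s} {row}")
```

No client strategy yields OK in every column for either server order, which is the point of the exercise: exactly-once semantics cannot be achieved by retry policy alone.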

  36. Lost replies • Make server operations idempotent whenever possible • Structure requests so that server can distinguish retries from the original
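One common way to let a server distinguish retries from the original is to tag each request with a client-chosen identifier and cache the reply. A minimal sketch, assuming an in-memory cache that is never trimmed; the class, the deposit operation, and the identifier format are all hypothetical:

```python
class DedupServer:
    """Toy server whose non-idempotent operation is made safe to retry."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # request_id -> reply already sent for that request

    def deposit(self, request_id, amount):
        if request_id in self._replies:
            # A retry of a request we already executed: resend the old reply.
            return self._replies[request_id]
        self.balance += amount
        reply = self.balance
        self._replies[request_id] = reply
        return reply


server = DedupServer()
server.deposit("client1-req7", 10)
server.deposit("client1-req7", 10)  # retry after a lost reply
print(server.balance)
```

The second call returns the cached reply without touching the balance, so a lost reply followed by a retry has the same effect as a single successful call.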

  37. Client crashes • The server-side activity is called an orphan computation • Orphans can tie up resources, hold locks, etc • Four strategies (at least) • Extermination, based on client-side logs • Client writes a log record before and after each call • When client restarts after a crash, it checks the log and kills outstanding orphan computations • Problems include: • Lots of disk activity • Grand-orphans

  38. Client crashes, continued • More approaches for handling orphans • Re-incarnation, based on client-defined epochs • When client restarts after a crash, it broadcasts a start-of-epoch message • On receipt of a start-of-epoch message, each server kills any computation for that client • “Gentle” re-incarnation • Similar, but server tries to verify that a computation is really an orphan before killing it

  39. Yet more client-crash strategies • One more strategy • Expiration • Each computation has a lease on life • If not complete when the lease expires, a computation must obtain another lease from its owner • Clients wait one lease period before restarting after a crash (so any orphans will be gone) • Problem: what’s a reasonable lease period?
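The expiration strategy can be sketched with an explicit clock so the behaviour is easy to follow. This is an illustration under assumptions: time is passed in as a number, `owner_alive` is a hypothetical callable standing in for "ask the owner for another lease", and a computation that cannot renew simply marks itself killed.

```python
class LeasedComputation:
    """Server-side computation that survives only while its lease is valid."""

    def __init__(self, owner_alive, lease_period, now):
        self.owner_alive = owner_alive    # hypothetical: () -> bool
        self.lease_period = lease_period
        self.expires_at = now + lease_period
        self.killed = False

    def tick(self, now):
        """Return True if the computation may keep running at time `now`."""
        if self.killed:
            return False
        if now < self.expires_at:
            return True
        if self.owner_alive():
            # Lease expired but the owner answered: grant another period.
            self.expires_at = now + self.lease_period
            return True
        self.killed = True                # orphan: owner did not renew
        return False


orphan = LeasedComputation(owner_alive=lambda: False, lease_period=30, now=0)
print(orphan.tick(10), orphan.tick(30))
```

Because an orphan dies within one lease period of its owner's crash, a client that waits that long before restarting finds no leftovers, at the cost of choosing a period that is neither too chatty nor too slow.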

  40. Common problems with client-crash strategies • Crashes that involve network partition (communication between partitions will not work at all) • Killed orphans may leave persistent traces behind, for example • Locks • Requests in message queues

  41. Bonus material • Implementation – reliable point-to-point communication • Implementation – process groups • Implementation – reliable multicast • Recovery • Sparing

  42. How to do it? • Redundancy applied • In the appropriate places • In the appropriate ways • Types of redundancy • Data (e.g. error correcting codes, replicated data) • Time (e.g. retry) • Physical (e.g. replicated hardware, backup systems)

  43. Triple Modular Redundancy
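The figure for this slide is not in the transcript. As a reminder of the core idea, triple modular redundancy runs three copies of a component and passes their outputs through a majority voter. A minimal sketch of such a voter, assuming hashable outputs; the function name and the choice to raise on a three-way disagreement are illustrative:

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


print(tmr_vote(42, 42, 7))  # one faulty replica is outvoted
```

A single faulty replica is masked; two simultaneous faults are not, which is why TMR is usually described as tolerating one failure per stage.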

  44. Tandem Computers • TMR on: • CPUs • Memory • Duplicated: • Buses • Disks • Power supplies • A big hit in operations systems for a while

  45. Replicated processing • Based on process groups • A process group consists of one or more identical processes • Key events • Message sent to one member of a group • Process joins group • Process leaves group • Process crashes • Key requirements • Messages must be received by all members • All members must agree on group membership

  46. Flat or non-flat?

  47. Effective process groups require • Distributed agreement • On group membership • On coordinator elections • On whether or not to commit a transaction • Effective communication • Reliable enough • Scalable enough • Often, multicast • Typically looking for atomic multicast

  48. Process groups also require • Ability to tolerate crash failures and omission failures • Need k+1 processes to deal with up to k silent failures • Ability to tolerate performance, response, and arbitrary failures • Need 3k+1 processes to reach agreement with up to k Byzantine failures • Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures
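The three group-size rules on this slide can be written down directly. A small sketch; the function name and the failure-model labels are illustrative, while the formulas are the ones stated above:

```python
def processes_needed(k: int, goal: str) -> int:
    """Minimum group size to tolerate up to k failures for a given goal."""
    rules = {
        "silent": k + 1,                    # crash/omission: one survivor suffices
        "byzantine_agreement": 3 * k + 1,   # reach agreement despite k traitors
        "byzantine_majority": 2 * k + 1,    # correct replicas outvote k faulty ones
    }
    return rules[goal]


for k in (1, 2, 3):
    print(k, [processes_needed(k, g) for g in
              ("silent", "byzantine_agreement", "byzantine_majority")])
```

For k = 1 this gives 2, 4, and 3 processes, matching the earlier Byzantine generals example, where agreement works with n = 4 but not with n = 3.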

  49. Bonus material • Implementation – reliable point-to-point communication • Implementation – process groups • Implementation – reliable multicast • Recovery • Sparing

  50. Reliable multicasting
