Fundamentals Stream Session 8: Fault Tolerance & Dependability I

Distributed Systems Fundamentals Stream Session 8: Fault Tolerance & Dependability I. CSC 253. Gordon Blair, François Taïani. Overview of the day: Today: Fault-Tolerance in DS / Guest Lecture. 9-10am: Fundamental Stream / Fault Tolerance (a). 10-11am: Guest Lecturer Prof. Jean-Charles Fabre.

Presentation Transcript


  1. Distributed Systems Fundamentals Stream Session 8: Fault Tolerance & Dependability I • CSC 253 • Gordon Blair, François Taïani

  2. Overview of the day Today: Fault-Tolerance in DS / Guest Lecture • 9-10am: Fundamental Stream / Fault Tolerance (a) • 10-11am: Guest Lecturer Prof. Jean-Charles Fabre • How to assess the robustness of OS and middleware • 12-1pm: Fundamental Stream / Fault Tolerance (b) Next week: Advanced Fault-Tolerance G. Blair/ F. Taiani

  3. Overview of the session • A famous example (video) • Basic dependability concepts • Also covered in CSc 365 (Year B) "Safety Critical Systems" • Fault-tolerance at the node level: Replication • Fault-tolerance at the network level: TCP • Fault-tolerance at the environment level: Data centres • Associated Reading: Chapter 7, "Fault Tolerance", of Tanenbaum & van Steen

  4. A Famous Example • April 14, 1912? • Video

  5. A Major Disaster • One of the worst peacetime maritime disasters • more than 1500 dead • only 706 survivors • The wreck was only discovered in 1985 • 2 1/2 miles down on the ocean floor

  6. The Titanic and Fault-Tolerance • How fault-tolerant was the Titanic? • best technology of the time: deemed “practically” unsinkable • 16 watertight compartments • Titanic could float with its first 4 compartments flooded • Some flaws in the design: • poor turning ability • bulkheads only went as high as E-deck • not enough lifeboats!

  7. What Does It Teach Us? • Good to plan ahead for problems • Like collision and hull damage for a ship • Replication / redundancy is the key to survival • Titanic contained 16 watertight compartments • Could still float with the first four flooded (& other combinations) • But there is always a limit to what you can tolerate • But you also need diversity / independence • Compartments are watertight • Critical issue: prevent water propagation (deck E issue) • The case of the Titanic can be applied to many other systems • Generic concepts have been developed to discuss them

  8. Dependability Concepts • Failure: what we want to avoid • Titanic: sinking and killing people • Distributed system: stops functioning correctly • Fault: a problem that could cause a failure • Titanic: the iceberg, design flaws • Distributed system: hardware crash, software bug • Error: an erroneous internal state that can lead to a failure • Titanic: damaged hull, water in compartment • Distributed system: erroneous data, inconsistent internal behaviour

  9. System and Sub-Systems • Faults, errors and failures are recursive notions • A system is usually made of components, e.g. sub-systems • Titanic: hull, decks, compartments, smoke stacks • Distributed system: nodes, network links, routers, hubs, services • Nodes contain CPUs, hard disks, network cards, ... • Network comprises routers, hubs, gateways, name servers, ... • The failure of an internal component is a fault for the enclosing system

  10. Means to Dependability • Fault prevention • Do not go through the Arctic Ocean (no icebergs) • Buy good hardware (fewer hardware faults) • Hire skilled programmers (fewer bugs) • Fault removal • Blast the iceberg (not practicable, as it is an external fault) • Test and replace faulty hardware before it is used • Test and remove bugs • Fault tolerance • Use replication / redundancy and diversity • Prevent water / error propagation

  11. Quality of Service • The notion of 'failure' is not black or white • Consider a web server taking 5 minutes to serve each request • It fulfils its functional specification (to deliver web pages) • It works better than a server that would not reply at all • But can it be considered to be functioning correctly? • Lots of intermediate situations • Quality of Service (QoS) of a system: how 'good' a system is • Multiple dimensions: latency, bandwidth, security, availability, reliability, ... • 'Failure' is usually reserved for when a service becomes unusable • Goal: highest QoS in spite of faults (and at the lowest cost) • [Figure: QoS scale ranging from complete failure to optimal service]

  12. Graceful Degradation • Ideally fault tolerance should mask any internal failure/fault • A router crashes but you do not even notice • In practice this is not always possible • Essentially because of cost and of the fault assumptions made • The goal is then to minimise the impact of faults on the system's users • Avoid complete failure • Maintain the highest possible QoS in spite of faults • This is called "Graceful Degradation" • Extremely important: gives administrators time to react • Users continue using the service, albeit with a lower QoS

  13. Distributed Systems and Fault Tolerance • Fault tolerance requires distribution • For replication and independence of failure modes • E.g. to tolerate earthquakes → do not put all servers in San Francisco • Distribution requires fault tolerance • In large distributed systems, failures of individual nodes are unavoidable • Single points of failure are to be avoided as much as possible

  14. What Can Go Wrong in a DS? • Nodes (servers, clients, peers) • Faulty hardware → crash or data corruption • Power failure → crash (and sometimes data corruption) • Any "physical" accident: fire, flood, earthquake, ... • Network • Same as nodes • Routers, gateways: whole subnet impacted • Name servers: whole name domain impacted • Congestion (dropped packets) • Users / People • Involuntary mistakes • Security attacks (external & internal, i.e. from legitimate users)

  15. Main Steps of Fault Tolerance • Error detection • Detecting that something is wrong (like water in the keel) • The earlier the better • The earlier the more difficult • Error recovery • Preventing errors from propagating (watertight door) • Removing errors • Does not mean the root cause (the fault) has been removed • Fault treatment (usually off-line, manually) • Fault diagnosis • Fault removal

  16. Error Detection • Error detection in the Titanic quite “easy” • No water should leak inside the boat (safety invariant) • Distributed systems: more difficult • The expected result of a request is usually unknown • In some cases a sanity or correctness check is possible • In many cases some form of redundancy is needed • [Figure: a component C computes √x; for x = 2 it returns 1.414213, and the check (1.414213)^2 =? 2 validates the result]
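
The square-root check on this slide can be sketched as follows. This is a minimal illustration, not part of the lecture material: the function name `result_is_plausible` and the tolerance value are assumptions chosen for the example.

```python
import math


def result_is_plausible(x: float, candidate: float, rel_tol: float = 1e-5) -> bool:
    """Correctness check for a square root: square the answer and compare.

    `candidate` is the value returned by the (possibly faulty) component;
    `rel_tol` is an assumed tolerance that absorbs rounding in the answer.
    """
    return math.isclose(candidate * candidate, x, rel_tol=rel_tol)


if __name__ == "__main__":
    # The slide's example: is 1.414213 an acceptable answer for sqrt(2)?
    print(result_is_plausible(2.0, 1.414213))
    # An erroneous answer fails the check and the error is detected.
    print(result_is_plausible(2.0, 1.5))
```

The check works here because verifying a square root (one multiplication) is much cheaper than computing it. Most requests in a distributed system have no such inverse, which is why the slide falls back on redundancy.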

  17. Error Recovery • Backward error recovery • Return the system to a previous (hopefully) error-free state • Can be achieved in a generic manner (replication, checkpointing) • Risk of inconsistency in the general case • Can be minimised (see next week) • Forward error recovery • Bring the system into a valid future state • Highly application dependent • Makes sense as soon as real-time constraints are involved • Examples: video streaming, flight control software
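
Backward error recovery through checkpointing can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class name `CheckpointedAccount`, the balance invariant used for error detection, and the in-process "stable storage" are assumptions made for the example.

```python
import copy


class CheckpointedAccount:
    """Toy service state with checkpoint and rollback."""

    def __init__(self) -> None:
        self.state = {"balance": 0}
        # The saved copy stands in for stable storage.
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        """Record the current state as the last known-good state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Backward recovery: return to the last checkpoint."""
        self.state = copy.deepcopy(self._saved)

    def apply(self, delta: int) -> bool:
        """Apply an update; roll back if the invariant is violated."""
        self.state["balance"] += delta
        if self.state["balance"] < 0:  # error detection (assumed invariant)
            self.rollback()
            return False
        return True


if __name__ == "__main__":
    account = CheckpointedAccount()
    account.apply(10)
    account.checkpoint()
    account.apply(5)
    account.apply(-100)  # violates the invariant, triggers rollback
    print(account.state)
```

Note that the rollback also discards the valid `+5` update made after the checkpoint. Work done since the last checkpoint is lost, which is one source of the inconsistency risk mentioned on the slide.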

  18. Node FT: Replication • Why replicate? • Performance: e.g. replication of heavily loaded web servers • Allows error detection (when replicas disagree) • Allows backward error recovery • Importance of the fault model • Effect of replication depends on how individual replicas fail • Crash faults → availability: availability = 1 - p^n (where n = number of replicas, p = probability of individual failure) • Requirements for a replication service • Replication transparency ... • in spite of network partitions, disconnections, etc.

  19. Focus on Availability • Assumptions (very important!) • Replicas fail independently • Replicas fail silently (crashes, no arbitrary behaviour)
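
Under these two assumptions, the service is unavailable only when all n replicas are down at once, which gives the formula from the previous slide: availability = 1 - p^n. The short sketch below evaluates it; the function name `availability` and the sample value p = 0.05 are illustrative assumptions.

```python
def availability(p: float, n: int) -> float:
    """Availability of a service with n replicas.

    Assumes replicas fail independently and silently, each with
    probability p, so all n are down together with probability p ** n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - p ** n


if __name__ == "__main__":
    # Hypothetical replica that is down 5% of the time.
    for replicas in range(1, 5):
        print(replicas, f"{availability(0.05, replicas):.6%}")
```

If failures are correlated (same power supply, same software bug), the independence assumption breaks and the real availability can be far lower than this formula suggests.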

  20. A Generic Replication Architecture • [Figure: clients (C) exchange requests and replies with front ends (FE); the front ends communicate with the replica managers (RM) that together form the service]

  21. Passive Replication • [Figure: clients (C) and front ends (FE) in front of three replica managers (RM), one labelled Primary and the others Backup] • What is passive replication (primary backup)? • FEs communicate with a single primary RM, which must then communicate with the secondary RMs (slaves) • Requires election of a new primary in case of primary failure • Only tolerates crash faults (silent failure of replicas) • Variant: primary’s state saved to stable storage (cold passive) • saving to stable storage is known as “checkpointing”
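
A minimal in-process sketch of the primary-backup scheme described above. The names `ReplicaManager` and `PassiveGroup` are hypothetical, and the "election" is an assumption made for brevity: the first live replica in a fixed order becomes primary. A real system needs an agreement protocol and a failure detector.

```python
class ReplicaManager:
    """One replica holding a copy of the service state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: dict = {}
        self.alive = True  # set to False to simulate a crash fault


class PassiveGroup:
    """Front-end view of a group of passively replicated RMs."""

    def __init__(self, names: list) -> None:
        self.replicas = [ReplicaManager(n) for n in names]

    def primary(self) -> ReplicaManager:
        # Simplified election: first live replica in a fixed order.
        for rm in self.replicas:
            if rm.alive:
                return rm
        raise RuntimeError("all replicas have crashed")

    def write(self, key: str, value) -> str:
        primary = self.primary()
        primary.state[key] = value
        # The primary forwards the update to every live backup.
        for rm in self.replicas:
            if rm.alive and rm is not primary:
                rm.state[key] = value
        return primary.name

    def read(self, key: str):
        return self.primary().state.get(key)


if __name__ == "__main__":
    group = PassiveGroup(["rm1", "rm2", "rm3"])
    group.write("x", 42)
    group.replicas[0].alive = False  # the primary crashes
    print(group.primary().name, group.read("x"))
```

The sketch only handles silent crashes. A primary that returns wrong values would propagate them to the backups, which is why this scheme tolerates crash faults only.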

  22. Active Replication • [Figure: clients (C) and front ends (FE); each front end is connected to all three replica managers (RM)] • What is active replication? • Front ends multicast requests to every replica manager • Appropriate reliable group communication needed • ordering guarantees crucial (see W4: Group Communication) • Tolerates crash faults and arbitrary faults
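
The sketch below illustrates how a front end might mask both crash faults and arbitrary faults under active replication. Majority voting over the replies is an assumption of this example (the slide only states which faults are tolerated), and the name `active_request` is hypothetical. Replicas are modelled as plain Python callables, so the multicast and its ordering guarantees are not modelled.

```python
from collections import Counter


def active_request(replicas, request):
    """Send `request` to every replica and return the majority reply.

    A replica that raises an exception is treated as crashed (no reply).
    Replies must be hashable so that they can be counted.
    """
    replies = []
    for replica in replicas:
        try:
            replies.append(replica(request))
        except Exception:
            continue  # crashed replica: silently missing reply
    if not replies:
        raise RuntimeError("no replica replied")
    value, votes = Counter(replies).most_common(1)[0]
    # Require a strict majority of the whole group, not just of the replies.
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def correct(x):
    return x * x


def faulty(x):
    return x * x + 1  # arbitrary fault: wrong answer


def crashed(x):
    raise ConnectionError("replica down")


if __name__ == "__main__":
    print(active_request([correct, faulty, correct], 3))
    print(active_request([correct, crashed, correct], 4))
```

With three replicas the vote masks one faulty or crashed replica. Two simultaneous faults leave no majority, and the front end reports a failure rather than return a possibly wrong value.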

  23. Comparing Active vs. Passive • Passive: less expensive, less complex, less powerful • Active: more powerful, more expensive, more complex

  24. Network FT: TCP • TCP is a fault-tolerant protocol • Fault model: correct messages arrive even if packets get lost or corrupted • A TCP packet: [Figure: TCP packet layout, with the checksum field annotated "Error detection against corruption"] • See CSC 251: Networking

  25. TCP Checksum • TCP checksum: the 16-bit blocks of the packet are added and the result is complemented • (with some additional one's-complement steps along the way) • 2^16 (~65 000) possible checksum values • TCP packets are typically around 1500 bytes → 2^12000 possible packets • Many packets are bound to share the same checksum • Roughly: 2^12000 ÷ 2^16 = 2^11984 ≈ 16 × 10^3594 packets per checksum value (using 2^10 ≈ 10^3; the exact figure is closer to 3.5 × 10^3607) • The Trick • Usual corruption is unlikely to produce a consistent checksum • Notion of Hamming distance • Additional checksums at the network (IP) and link (Ethernet) layers → an undetected error requires multiple corruptions (!)
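
The "add the 16-bit blocks and complement" rule can be sketched with the one's-complement Internet checksum used by TCP. This is a simplified illustration: real TCP also includes a pseudo-header with the IP addresses in the sum, which is omitted here, and the function name `internet_checksum` is chosen for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    # Fold the carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    checksum = internet_checksum(payload)
    print(f"{checksum:#06x}")

    # Receiver side: summing data plus checksum must give zero.
    print(internet_checksum(payload + checksum.to_bytes(2, "big")) == 0)

    # A single flipped bit changes the checksum and is detected.
    corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
    print(internet_checksum(corrupted) == checksum)

    # Swapping two 16-bit words is a corruption pattern the sum cannot see.
    swapped = payload[2:4] + payload[0:2] + payload[4:]
    print(internet_checksum(swapped) == checksum)
```

The last case shows the limit described on the slide: because addition is commutative, corruptions that follow the right pattern leave the checksum unchanged.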

  26. Hamming Distance • Between 2 strings of bits: • the number of bit flips required to go from one to the other • Distance between 1011101 and 1001001? 2 • Minimum Hamming distance of a code • minimal Hamming distance between 2 valid "code words" • i.e. minimal number of corruptions needed to turn one valid packet into another valid packet • Simple checksum: • packets of 8 bits • checksum on 4 bits: adds the two 4-bit blocks • What is the minimal Hamming distance of this code? • [Figure: example 8-bit packet 0110 1010 with its 4-bit checksum field; the slide shows the values ???? and 1100]

  27. Hamming Distance • [Figure: worked example on a packet with checksum value 1000. After a first corruption the recomputed sum does not match: invalid packet, corruption detected. After a second corruption the sum matches again: valid packet, corruption not detected.] • Hamming distance is 2 • Note that the 2 corruptions must obey a particular pattern • In a network, corruption usually occurs in bursts
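
The minimum Hamming distance of the toy code (two 4-bit blocks plus a 4-bit checksum) can be checked by brute force. The sketch assumes that "adds the two 4-bit blocks" means addition modulo 16; the transcript does not pin down how carries are handled. The function names are hypothetical.

```python
from itertools import combinations, product


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def encode(block1: int, block2: int) -> str:
    """Valid code word: two 4-bit blocks followed by their 4-bit sum."""
    checksum = (block1 + block2) % 16  # assumption: carry out of 4 bits dropped
    return f"{block1:04b}{block2:04b}{checksum:04b}"


def minimum_distance() -> int:
    """Smallest Hamming distance between any two distinct valid code words."""
    codewords = [encode(a, b) for a, b in product(range(16), repeat=2)]
    return min(hamming_distance(u, v) for u, v in combinations(codewords, 2))


if __name__ == "__main__":
    print(hamming_distance("1011101", "1001001"))
    print(minimum_distance())
```

A minimum distance of 2 means every single-bit corruption is detected, while some pairs of bit flips turn one valid code word into another and go unnoticed.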

  28. What Is in a Checksum? • The TCP checksum is a form of error detection code • Min. Hamming distance reflects the strength of the detection • Far more complex error detection codes exist • For instance CRC: cyclic redundancy check • It is a form of redundancy • information redundancy: if the packet is OK there is no need for the checksum • partial redundancy: impossible to reconstruct the packet from the checksum • When the code used is "strong" enough (min. Hamming distance > 2) it is possible to reconstruct the "most likely" correct packet • Error correction code (not seen in this course) • Provides error recovery

  29. Environment FT: Data Centres • Mission critical systems (your company’s web server) • Little point in addressing only one type of fault • Major risks: power failure & overheating • Colocation provider: rents luxury “parking places” for servers • Multiple network operators • Secured physical access • Multiple power sources (including power generator) • Highly robust air conditioning • Example: Redbus Interhouse plc (http://www.interhouse.org/) • 3 centres in London, others in Amsterdam, Frankfurt, Milan, Paris • Even there things can go wrong: http://www.theregister.co.uk/2004/06/16/redbus_power_fails/

  30. What we'll see next week • Refinement of what we have seen today • Replication scheme: the problem of consistency • Consistent checkpointing for primary backup • Fault-tolerant total ordering for active replication • Contrast replication and distributed transactions • Short flash-back to Week 3 lecture

  31. References • Chapter 7 of Tanenbaum & van Steen • On the Titanic • http://www.charlespellegrino.com/Time%20Line.htm • Very detailed account • http://octopus.gma.org/space1/titanic.html • Fundamental Concepts of Dependability • by Algirdas Avizienis, Jean-Claude Laprie, Brian Randell • http://www.cert.org/research/isw/isw2000/papers/56.pdf

  32. Expected Learning Outcomes • At the end of this 8th Unit: • You should understand the basic dependability concepts of fault, error and failure • You should know the basic steps of fault tolerance • You should appreciate fundamental fault-tolerance techniques like replication and error-detection codes • You should be able to explain and compare the active and passive replication styles
