
Distributed Systems

Distributed Systems. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu ds@pk.org. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults. Deviation from expected behavior Variety of factors hardware software



  1. Distributed Systems: Fault Tolerance
     Paul Krzyzanowski (pxk@cs.rutgers.edu, ds@pk.org)
     Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

  2. Faults
     • Deviation from expected behavior
     • Variety of factors
       • hardware
       • software
       • operator
       • network
     • Three categories
       • transient faults
       • intermittent faults
       • permanent faults

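  3. Faults
     • Three categories
       • transient faults
       • intermittent faults
       • permanent faults
     • Any fault may be
       • fail-silent (fail-stop): the component stops responding
       • Byzantine: the component keeps running but produces wrong or conflicting results
     • Synchronous system vs. asynchronous system
       • synchronous: a response arrives within a known, bounded time, so a missing response signals a fault
       • asynchronous: no bound on response time, so a slow component cannot be told apart from a failed one
       • e.g., serial port transmission (synchronous) versus an IP packet (asynchronous)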

  4. Achieving fault tolerance
     Redundancy
     • information redundancy
       • Hamming codes, parity memory, ECC memory
     • time redundancy
       • timeout & retransmit
     • physical redundancy
       • TMR, RAID disks, backup servers
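
To make the first two kinds of redundancy concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not code from the presentation: the function names are hypothetical, a single even-parity bit stands in for information redundancy (it detects an odd number of flipped bits but cannot correct them, unlike a Hamming code), and `send` is an assumed callable that returns False when a timeout expires without an acknowledgement.

```python
def add_even_parity(bits):
    """Information redundancy: append one bit so the count of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True when the word still contains an even number of 1s."""
    return sum(word) % 2 == 0


def send_with_retry(send, max_attempts=3):
    """Time redundancy: repeat the operation until it is acknowledged.

    `send` is a hypothetical callable: True means an acknowledgement
    arrived, False means the timeout expired. Returns the number of the
    attempt that succeeded, or None if every attempt timed out.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
    return None


word = add_even_parity([1, 0, 1, 1])
print(parity_ok(word))      # stored word passes the check
word[2] ^= 1                # simulate a single-bit fault
print(parity_ok(word))      # the fault is detected
```

Physical redundancy (TMR, backup servers) is covered by the slides that follow.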

  5. How much fault tolerance?
     • 100% fault tolerance cannot be achieved.
     • The closer we wish to get to 100%, the more expensive the system will be.
     • A system is k-fault tolerant if it can withstand k faults.
     • Need k+1 components with silent faults: k can fail and one will still be working
     • Need 2k+1 components with Byzantine faults: k can generate false replies; k+1 will provide a majority vote
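
The two replica counts above can be written as a small helper. This is a sketch only; `replicas_needed` is a hypothetical name chosen for illustration.

```python
def replicas_needed(k, byzantine=False):
    """Minimum number of components needed to tolerate k faulty ones.

    Fail-silent faults: k + 1, so one component is still working.
    Byzantine faults: 2k + 1, so the k + 1 correct replies outvote
    the k false ones.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1


for k in (1, 2, 3):
    print(k, replicas_needed(k), replicas_needed(k, byzantine=True))
```

For k = 1 this gives two components for silent faults and three for Byzantine faults, which is the Triple Modular Redundancy arrangement.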

  6. Active replication
     Technique for fault tolerance through physical redundancy
     [Figure: a system with no redundancy compared with Triple Modular Redundancy (TMR)]
     Triple Modular Redundancy (TMR): threefold component replication to detect and correct a single component failure
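
The voting step of TMR can be sketched as follows. This is a minimal illustration, assuming the three module outputs are plain hashable values; `tmr_vote` is a hypothetical name, and a real voter would usually be implemented in hardware.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three modules.

    One faulty module is outvoted by the other two. If all three
    outputs differ there is no majority and the fault cannot be masked.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one module disagrees")
    return value


print(tmr_vote(7, 7, 9))    # the single bad output (9) is masked
```

The voter masks one faulty module, matching the "single component failure" limit stated on the slide.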

  7. Primary backup
     • One server does all the work
     • When it fails, backup takes over
     • Backup may ping primary with "are you alive?" messages
     • Simpler design: no need for multicast
     • Works poorly with Byzantine faults
     • Recovery may be time-consuming and/or complex
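
A minimal in-process sketch of the ping-and-take-over idea follows. It assumes fail-silent behavior (a crashed primary simply stops answering) and replaces real network probes with a method call; the class names and the `max_missed` threshold are hypothetical choices for illustration.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def ping(self):
        """Answer an "are you alive?" probe; a crashed server stays silent."""
        return self.alive

    def handle(self, request):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Backup(Server):
    def __init__(self, name, primary, max_missed=3):
        super().__init__(name)
        self.primary = primary
        self.max_missed = max_missed  # silent rounds tolerated before takeover
        self.missed = 0
        self.active = False           # True once this backup has taken over

    def check_primary(self):
        """Run one heartbeat round; take over after max_missed silent rounds."""
        if self.active:
            return
        if self.primary.ping():
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = True


primary = Server("primary")
backup = Backup("backup", primary, max_missed=2)
primary.alive = False           # simulate a crash
backup.check_primary()
backup.check_primary()
print(backup.active, backup.handle("req-1"))
```

The sketch also shows why the scheme works poorly with Byzantine faults: a primary that still answers pings but returns wrong results is never detected by the backup.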

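  8. Agreement in faulty systems
     Two army problem
     • good processors, faulty communication lines
     • coordinated attack
     • multiple acknowledgement problem: every acknowledgement may itself be lost, so the sender of the last message can never be sure it arrived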

  9. Agreement in faulty systems
     Byzantine Generals problem
     • reliable communication lines, faulty processors
     • n generals head different divisions
     • m generals are traitors and are trying to prevent others from reaching agreement
       • 4 generals agree to attack
       • 4 generals agree to retreat
       • 1 traitor tells the 1st group that he’ll attack and tells the 2nd group that he’ll retreat
     • can the loyal generals reach agreement?

  10. Agreement in faulty systems
      Byzantine Generals problem
      • Solutions require:
        • 3m+1 participants for m traitors (2m+1 loyal generals)
        • m+1 rounds of message exchanges
        • O(m^2) messages
      • Costly solution!
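
The 3m+1 bound can be illustrated for m = 1 with a small simulation. This is a sketch under stated assumptions, not material from the presentation: it follows the classic oral-messages scheme (the commander sends an order, each lieutenant relays what it heard, each loyal lieutenant takes a majority), the traitor uses one fixed, hypothetical lying strategy, and ties fall back to "retreat". The function names are illustrative.

```python
from collections import Counter


def min_generals(m):
    """Participants required to tolerate m traitors."""
    return 3 * m + 1


def majority(values, default="retreat"):
    """Most common value, or the default when the top two counts tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]


def one_traitor_round(n, order, traitor):
    """Simulate m = 1: general 0 commands, generals 1..n-1 are lieutenants.

    Returns the decision of every loyal lieutenant. The traitor's
    strategy is an assumption: it says "attack" to even-numbered
    recipients and "retreat" to odd-numbered ones.
    """
    lieutenants = range(1, n)

    def lie(recipient):
        return "attack" if recipient % 2 == 0 else "retreat"

    # Round 1: the commander sends an order to each lieutenant.
    received = {i: (lie(i) if traitor == 0 else order) for i in lieutenants}

    # Round 2: each lieutenant relays what it received to the others.
    decisions = {}
    for i in lieutenants:
        if i == traitor:
            continue
        heard = [received[i]]
        for j in lieutenants:
            if j != i:
                heard.append(lie(i) if j == traitor else received[j])
        decisions[i] = majority(heard)
    return decisions


print(min_generals(1))
print(one_traitor_round(4, "attack", traitor=3))  # loyal lieutenants agree
print(one_traitor_round(3, "attack", traitor=2))  # too few generals
```

With four generals the loyal lieutenants reach the same decision whether the traitor is a lieutenant or the commander. With only three, the single loyal lieutenant hears one "attack" and one "retreat" and cannot tell which general is lying.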

  11. The end.
