1 / 19

Fault Tolerance Computer Programs that Can Fix Themselves!

This article discusses the concept of fault-tolerant software for distributed systems, exploring solutions such as token rings and constant state size. Learn how computers can recover from faults and ensure the reliability of their programs.

Download Presentation

Fault Tolerance Computer Programs that Can Fix Themselves!

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fault ToleranceComputer Programs that CanFix Themselves! Prof’s Paul Sivilottiand Tim Long Dept. of Computer & Info. Science The Ohio State University paolo@cis.ohio-state.edu

  2. Ubiquity of Computers

  3. Distributed Systems • Computers are increasingly connected • Can cooperate to solve a common task • Example: Voting day • Ballots counted in individual polling stations • District office sums these counts • State office sums the district numbers • Example: Traffic flow • Calculate best route from OSU to downtown Columbus • Based on: roads, distances, construction information, traffic jams, other cars,…

  4. Distributed Programs:Advantages • Speed • Divide a big problem into smaller parts • Solve smaller parts in parallel • Communication • Problem is inherently distributed across space • Must gather information and coordinate • Robustness • Give the same task to many computers • Even if one fails, others are probably ok

  5. So What’s the Downside? • With more computers, there are more things that can go wrong! • Despite our best efforts, “faults” occur • Examples: • “blue screen”, “segmentation fault”, “crash”,… • Possible causes: • Environmental conditions • Computer unplugged, wireless connection lost • Software bug • Human error in engineering the software • User error • Program given bad input, used improperly

  6. Fault-Tolerant Software • A program that still does the right thing, despite faults • Can fix itself, and eventually recover • For sequential software • Restart system • But for distributed software? • Restart isn’t practical! • System must be “self-stabilizing”

  7. Tokens for Taking Turns • Consider all the chefs across the city • Say they need to take turns • Only one can be on vacation at a time • How do they coordinate when to go on vacation? • Solution: use a “token” • Pass token around • Rule: If you have the token, you can go on vacation

  8. Token Ring • Problem: What about faults? • What happens if token is lost? • One fault means disaster!

  9. Prevent Loss of Token: Binary Ring • Rule (for most chefs): if left neighbor is different from me, then I have the token • Make my number equal to that neighbor’s 0 0 0 1 1 1 1

  10. Completing the Ring 0 0 0 1 1 1 1 • One chef is special: if left neighbor is same as me, then I have the token • Make my number differ from that neighbor’s

  11. “1” token 0 “0” token “1” token Fault: Corruption of Values • Problem: multiple tokens in ring • Tokens chase each other around ring • One fault means disaster 0 0 0 1 1 1 1

  12. k-State Token Ring • Solution: use more values than chefs! • Same rule • If left neighbor different from me: • I have the token! (use it) • Change my value to be equal to neighbor • Again, one chef is special • If left neighbor same as me • I have the token (use it) • Change my value to be one bigger

  13. Activity: Anthropomorphism • Form a ring • Each person has number cards • Each person has a chime • When you get the token: • Play your chime • Then change your number • We’ll run different versions • I’ll introduce “faults” and see if you can recover!

  14. Can We Do Better? • Problems with this approach? • Consider a very large ring • Need a very large number of states! • Consider a dynamic ring • When the size changes, everyone needs to be updated! • A better solution would use a fixed number of states • Independent of ring size

  15. Constant State Size • Recall difficulty with binary ring • Many tokens in ring, of different “types” • Chase each other around the ring, never colliding • Solution A: • Force them to “collide” by having more types than chefs • Solution B?

  16. Don’t Use a Ring! 0 0 0 1 1 1 1 • Solution: chop the ring! • Tokens can’t circulate • They reflect back and forth • All chefs responsible for stabilization

  17. 4-State Token “Ring” • Chefs pass tokens right and left • State = value (0/1) and direction (left/right) • Left rule (same as before) • If left neighbor different from me: • I have the token! (use it) • Change my value to be equal to neighbor • Face right • Right rule (other direction) • If right neighbor same as me and facing me: • I have the token! (use it) • Face left

  18. Activity 2 • Form a “ring” again • Pairs at each station • One person for left rule, one for right • Left rule person: • Always “on” • Copy value and face right (may already be facing that way) • Right rule person: • Only “on” when you and right neighbor face each other • Change direction to face left

  19. Take-Home Messages • Ubiquity of Distributed Systems • Computers are getting smaller, cheaper, faster • Increased connectivity • Faults • Some errors are outside of our control • Sequential programs can be reset • Distributed Fault Tolerance • Distributed programs that fix themselves • Eventually stabilize in spite of faults • Subtlety of Distributed Algorithms • Concurrency and nondeterminism • How can we convince ourselves these algorithms are correct?

More Related