CS514: Intermediate Course in Operating Systems

This lecture discusses the replication of data for increased availability, the vulnerabilities of the quorum scheme, and other options such as primary-backup schemes. It also explores the concept of non-blocking commit for high availability transactional systems.



Presentation Transcript


  1. CS514: Intermediate Course in Operating Systems • Professor Ken Birman • Ben Atkin: TA • Lecture 9: Sept. 21

  2. Conclusion? • We set out to replicate data for increased availability • And concluded that • Quorum scheme works for updates • But commit is required • And represents a vulnerability • Other options?

  3. Other options • We mentioned primary-backup schemes • These are a second way to solve the problem • Based on the log at the data manager

  4. Server replication • Suppose the primary sends the log to the backup server • It replays the log and applies committed transactions to its replicated state • If primary crashes, the backup soon catches up and can take over
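
A minimal sketch of the replay step on the backup, assuming a log made of "update" and "commit" records. The record layout and the names `LogRecord`, `Backup` and `replay` are illustrative assumptions, not something the lecture defines. Updates are buffered per transaction and applied to the replicated state only when the matching commit record arrives.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    txid: int
    kind: str            # "update" or "commit" (hypothetical record types)
    key: str = ""
    value: object = None


@dataclass
class Backup:
    state: dict = field(default_factory=dict)    # committed, replicated state
    pending: dict = field(default_factory=dict)  # txid -> updates awaiting a commit record

    def replay(self, record: LogRecord) -> None:
        if record.kind == "update":
            # Buffer the update; it is not visible until the transaction commits.
            self.pending.setdefault(record.txid, []).append((record.key, record.value))
        elif record.kind == "commit":
            # Apply every buffered update of the committed transaction.
            for key, value in self.pending.pop(record.txid, []):
                self.state[key] = value


# Log as shipped by the primary; it crashes before the commit record
# for transaction 2 reaches the backup.
shipped_log = [
    LogRecord(1, "update", "x", 10),
    LogRecord(2, "update", "y", 20),
    LogRecord(1, "commit"),
]

backup = Backup()
for record in shipped_log:
    backup.replay(record)

print(backup.state)  # {'x': 10}
```

Transaction 2 stays in `pending`: if the primary did commit it before crashing, the backup takes over in a slightly stale state.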

  5. Primary/backup primary backup log Clients initially connected to primary, which keeps backup up to date. Backup tracks log

  6. Primary/backup primary backup Primary crashes. Backup sees the channel break, applies committed updates. But it may have missed the last few updates!

  7. Primary/backup primary backup Clients detect the failure and reconnect to backup. But some clients may have “gone away”. Backup state could be slightly stale. New transactions might suffer from this

  8. Issues? • Under what conditions should backup take over? • Revisits the consistency problem seen earlier with clients and servers • Could end up with a “split brain” • Also notice that this still needs 2PC to ensure that primary and backup stay in the same state!

  9. Split brain: reminder primary backup log Clients initially connected to primary, which keeps backup up to date. Backup follows log

  10. Split brain: reminder primary backup Transient problem causes some links to break but not all. Backup thinks it is now primary, primary thinks backup is down

  11. Split brain: reminder primary backup Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both

  12. Implication? • A strict interpretation of ACID leads to the conclusion that • There are no ACID replication schemes that provide high availability • Most real systems solve this by weakening ACID

  13. Real systems • They use primary-backup with logging • But they simply omit the 2PC • Server might take over in the wrong state (may lag state of primary) • Can use hardware to reduce or eliminate split brain problem

  14. How does hardware help? • Idea is that primary and backup share a disk • Hardware is configured so only one can write the disk • If server takes over it grabs the “token” • Token loss causes primary to shut down (if it hasn’t actually crashed)
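
A minimal sketch of the single-writer idea, modeling the shared disk in software. The class `SharedDisk` and its methods are hypothetical; in a real system the check is enforced by the disk hardware, not by application code.

```python
class SharedDisk:
    """Toy model of a disk that accepts writes only from the token holder."""

    def __init__(self, token_holder):
        self.token_holder = token_holder
        self.blocks = {}

    def grab_token(self, server):
        # Taking over: the new server becomes the only permitted writer.
        previous, self.token_holder = self.token_holder, server
        return previous

    def write(self, server, key, value):
        if server != self.token_holder:
            # Token loss: the old primary learns it must shut down.
            raise PermissionError(f"{server} does not hold the write token")
        self.blocks[key] = value


disk = SharedDisk("primary")
disk.write("primary", "x", 1)

disk.grab_token("backup")          # backup takes over
try:
    disk.write("primary", "x", 2)  # old primary is still running and tries to write
    primary_fenced = False
except PermissionError:
    primary_fenced = True

disk.write("backup", "x", 3)
```

Because only one server can write at any time, a primary that has not actually crashed cannot diverge from the backup after takeover, which is what removes the split brain.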

  15. Reconciliation • This is the problem of fixing the transactions impacted by lack of 2PC • Usually just a handful of transactions • They committed but backup doesn’t know because it never saw the commit record • Later, server recovers and we discover the problem • Need to apply the missing ones • Also causes cascaded rollback • Worst case may require human intervention
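
One way to picture reconciliation, as a sketch under stated assumptions: the recovered primary can list its committed transactions with the keys they wrote, and the backup can list the transactions it ran after takeover with the keys they read. The function `reconcile` and its input shapes are hypothetical.

```python
def reconcile(primary_committed, backup_seen, backup_new):
    """Find commits the backup missed and backup transactions that may need rollback.

    primary_committed: list of (txid, keys_written), in commit order at the primary
    backup_seen:       set of txids whose commit record reached the backup
    backup_new:        list of (txid, keys_read) run at the backup after takeover
    """
    missing = [(tx, keys) for tx, keys in primary_committed if tx not in backup_seen]

    # Keys whose value at the backup was stale during the takeover period.
    stale_keys = set()
    for _, keys in missing:
        stale_keys |= set(keys)

    # Transactions that read a stale key are candidates for cascaded rollback.
    suspect = [tx for tx, keys_read in backup_new if stale_keys & set(keys_read)]
    return [tx for tx, _ in missing], suspect


missing, suspect = reconcile(
    primary_committed=[("t1", {"x"}), ("t2", {"y"}), ("t3", {"z"})],
    backup_seen={"t1", "t2"},
    backup_new=[("b1", {"z"}), ("b2", {"x"})],
)
print(missing)  # ['t3']
print(suspect)  # ['b1']
```

Deciding what to do with the suspect transactions is the hard part, and is where human intervention may be needed.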

  16. Summary • Reliability can be understood in terms of • Availability: system keeps running during a crash • Recoverability: system can recover automatically • Transactions are best for latter • Some systems need both sorts of mechanisms, but there are “deep” tradeoffs involved

  17. Replication and High Availability • All is not lost! • Suppose we move away from the transactional model • Can we replicate data at lower cost and with high availability? • Leads to “virtual synchrony” model • Treats data as the “state” of a group of participating processes • Replicated update: done with multicast

  18. Steps to a solution • First look more closely at 2PC, 3PC, failure detection • 2PC and 3PC both “block” in real settings • But we can replace failure detection by consensus on membership • Then these protocols become non-blocking (although solving a slightly different problem) • Generalized approach leads to ordered atomic multicast in dynamic process groups

  19. Non-blocking Commit • Goal: a protocol that allows all operational processes to terminate the protocol even if some subset crash • Needed if we are to build high availability transactional systems (or systems that use quorum replication)

  20. Definition of problem • Given a set of processes, one of which wants to initiate an action • Participants may vote for or against the action • Originator will perform the action only if all vote in favor; if any votes against (or don’t vote), we will “abort” the protocol and not take the action • Goal is all-or-nothing outcome

  21. Non-triviality • Want to avoid solutions that do nothing (trivial case of “all or none”) • Would like to say that if all vote for commit, protocol will commit ... but in distributed systems we can’t be sure votes will reach the coordinator! • Any “live” protocol risks making a mistake and counting a live process that voted to commit as a failed process, leading to an abort • Hence, non-triviality condition is hard to capture

  22. Typical protocol • Coordinator asks all processes if they can take the action • Processes decide if they can and send back “ok” or “abort” • Coordinator collects all the answers (or times out) • Coordinator computes outcome and sends it back
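
The coordinator's decision rule can be sketched as follows. This is a simplified, single-process model: message passing is replaced by a dictionary of replies, and a participant missing from that dictionary stands for one whose vote did not arrive before the timeout. The function name and argument shapes are assumptions.

```python
def two_phase_commit(participants, replies):
    """Return "commit" only if every participant replied "ok"; otherwise "abort".

    participants: list of participant names asked to vote
    replies:      dict name -> "ok" or "abort"; a missing entry means a timeout
    """
    votes = {p: replies.get(p) for p in participants}  # None marks a timeout
    all_ok = bool(participants) and all(v == "ok" for v in votes.values())
    return "commit" if all_ok else "abort"


everyone = ["p", "q", "r"]
print(two_phase_commit(everyone, {"p": "ok", "q": "ok", "r": "ok"}))     # commit
print(two_phase_commit(everyone, {"p": "ok", "q": "abort", "r": "ok"}))  # abort
print(two_phase_commit(everyone, {"p": "ok", "q": "ok"}))                # abort (r timed out)
```

The third call is the non-triviality problem in miniature: `r` may be alive and may have voted "ok", but a lost or late vote still forces an abort.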

  23. Commit protocol illustrated ok to commit?

  24. Commit protocol illustrated ok to commit? ok with us

  25. Commit protocol illustrated ok to commit? ok with us commit Note: garbage collection protocol not shown here

  26. Failure issues • So far, have implicitly assumed that processes fail by halting (and hence not voting) • In real systems a process could fail in arbitrary ways, even maliciously • This has led to work on the “Byzantine generals” problem, which is a variation on commit set in a “synchronous” model with malicious failures

  27. Failure model impacts costs! • Byzantine model is very costly: 3t+1 processes needed to overcome t failures, protocol runs in t+1 rounds • This cost is unacceptable for most real systems, hence protocols are rarely used • Main area of application: hardware fault-tolerance, security systems • For these reasons, we won’t study such protocols
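
To make the cost concrete, the two bounds quoted on the slide can be tabulated for small t. The helper name `byzantine_cost` is hypothetical; the formulas are the ones stated above.

```python
def byzantine_cost(t):
    """Processes and rounds needed to tolerate t Byzantine failures (bounds from the slide)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return {"processes": 3 * t + 1, "rounds": t + 1}


for t in range(1, 4):
    print(t, byzantine_cost(t))
# 1 {'processes': 4, 'rounds': 2}
# 2 {'processes': 7, 'rounds': 3}
# 3 {'processes': 10, 'rounds': 4}
```

Tolerating even a single malicious process already takes four replicas and two rounds of communication.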

  28. Commit with simpler failure model • Assume processes fail by halting • Coordinator detects failures (unreliably) using timeouts. It can make mistakes! • Now the challenge is to terminate the protocol if the coordinator fails instead of, or in addition to, a participant!

  29. Commit protocol illustrated ok to commit? ok with us crashed! … times out … abort! Note: garbage collection protocol not shown here

  30. Example of a hard scenario • Coordinator starts the protocol • One participant votes to abort, all others to commit • Coordinator and one participant now fail ... we now lack the information to correctly terminate the protocol!

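  31. Commit protocol illustrated [Figure: coordinator asks “ok to commit?”; the surviving participants answered “ok”; the failed participant's vote is unknown and the failed coordinator's decision is unknown]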

  32. Example of a hard scenario • Problem is that if coordinator told the failed participant to abort, all must abort • If it voted for commit and was told to commit, all must commit • Surviving participants can’t deduce the outcome without knowing how failed participant voted • Thus protocol “blocks” until recovery occurs
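
The blocking case can be made concrete with the decision rule the survivors would have to apply once the coordinator is gone. This is a sketch of the usual cooperative termination rule for 2PC, not something taken from the slides; the state names are assumptions.

```python
def terminate_2pc(survivor_states):
    """Decide the outcome from the local states of the surviving participants.

    Each state is one of: "committed", "aborted", "voted_ok",
    "voted_abort", "not_voted".
    """
    states = set(survivor_states)
    if "committed" in states:
        return "commit"   # someone saw the coordinator's commit decision
    if states & {"aborted", "voted_abort", "not_voted"}:
        return "abort"    # the coordinator cannot have decided to commit
    return "blocked"      # all voted ok; the outcome depends on the failed processes


print(terminate_2pc(["voted_ok", "committed"]))  # commit
print(terminate_2pc(["voted_ok", "not_voted"]))  # abort
print(terminate_2pc(["voted_ok", "voted_ok"]))   # blocked
```

The last call is the hard scenario: every survivor voted to commit, so the outcome hinges on the vote of the failed participant and on what the coordinator told it.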

  33. Skeen: Three-phase commit • Seeks to increase availability • Makes an unrealistic assumption that failures are accurately detectable • With this, can terminate the protocol even if a failure does occur

  34. Skeen: Three-phase commit • Coordinator starts protocol by sending request • Participants vote to commit or to abort • Coordinator collects votes, decides on outcome • Coordinator can abort immediately • To commit, coordinator first sends a “prepare to commit” message • Participants acknowledge, commit occurs during a final round of “commit” messages

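  35. Three phase commit protocol illustrated [Figure: coordinator sends “ok to commit?”, participants reply “ok”; coordinator sends “prepare to commit”, participants reply “prepared”; coordinator sends “commit”] Note: garbage collection protocol not shown here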

  36. Observations about 3PC • If any process is in “prepare to commit” all voted for commit • Protocol commits only when all surviving processes have acknowledged prepare to commit • After coordinator fails, it is easy to run the protocol forward to commit state (or back to abort state)
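
Under the assumption that failures are detected accurately, the observations above give the survivors a rule that always terminates. The sketch below uses hypothetical state names and treats the survivors' local states as given.

```python
def terminate_3pc(survivor_states):
    """Decide the outcome after a coordinator failure, assuming accurate failure detection.

    Each state is one of: "not_voted", "voted_ok", "voted_abort",
    "prepared", "committed", "aborted".
    """
    states = set(survivor_states)
    if "committed" in states:
        return "commit"
    if "aborted" in states:
        return "abort"
    if "prepared" in states:
        # Someone reached "prepare to commit", so all processes voted for commit:
        # run the protocol forward.
        return "commit"
    # No survivor is prepared, so no process can have committed yet:
    # run the protocol back.
    return "abort"


print(terminate_3pc(["voted_ok", "prepared"]))  # commit
print(terminate_3pc(["voted_ok", "voted_ok"]))  # abort
```

Unlike the 2PC case, no combination of states returns "blocked". That guarantee depends entirely on the failure detector never being wrong.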

  37. Assumptions about failures • If the coordinator suspects a failure, the failure is “real” and the faulty process, if it later recovers, will know it was faulty • Failures are detectable with bounded delay • On recovery, process must go through a reconnection protocol to rejoin the system! (Find out status of pending protocols that terminated while it was not operational)

  38. Problems with 3PC • With realistic failure detectors (that can make mistakes), protocol still blocks! • Bad case arises during “network partitioning” when the network splits the participating processes into two or more sets of operational processes • Can prove that this problem is not avoidable: there are no non-blocking commit protocols for asynchronous networks

  39. Situation in practical systems? • Most use protocols based on 2PC: 3PC is more costly and, ultimately, still subject to blocking! • Need to extend with a form of garbage collection mechanism to avoid accumulation of protocol state information (can solve in the background) • Some systems simply accept the risk of blocking when a failure occurs • Others reduce the consistency property to make progress at risk of inconsistency with failed processes

  40. Process groups • To overcome cost of replication will introduce dynamic process group model (processes that join, leave while system is running) • Will also relax our consistency goal: seek only consistency within a set of processes that all remain operational and members of the system • In this model, 3PC is non-blocking! • Yields an extremely cheap replication scheme!

  41. Failure detection • Basic question: how to detect a failure • Wait until the process recovers. If it was dead, it tells you • I died, but I feel much better now • Could be a long wait • Use some form of probe • But might make mistakes • Substitute agreement on membership • Now, failure is a “soft” concept • Rather than “up” or “down” we think about whether a process is behaving acceptably in the eyes of peer processes
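
The "probe" option can be sketched as a timeout on the last message heard from each process. The function name, the timestamps and the timeout value are illustrative assumptions.

```python
def suspected(last_heard, now, timeout):
    """Return the processes not heard from within `timeout` seconds, sorted by name."""
    return sorted(p for p, seen in last_heard.items() if now - seen > timeout)


# Time (in seconds) at which a message was last received from each process.
last_heard = {"A": 2.0, "B": 9.5, "C": 4.0}

print(suspected(last_heard, now=10.0, timeout=5.0))  # ['A', 'C']
```

A suspected process may only be slow or temporarily unreachable, which is why this kind of detector can make mistakes and why the lecture goes on to replace it with agreement on membership.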

  42. Architecture • Applications use replicated data for high availability • 3PC-like protocols use membership changes instead of failure notification • Membership agreement: “join/leave” and “P seems to be unresponsive”

  43. Issues? • How to “detect” failures • Can use timeout • Or could use other system monitoring tools and interfaces • Sometimes can exploit hardware • Tracking membership • Basically, need a new replicated service • System membership “lists” are the data it manages • We’ll say it takes join/leave requests as input and produces “views” as output
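
A sketch of the membership service as a function from requests to views. It is a single-process model with no replication or agreement protocol, and the event encoding is an assumption. The event sequence is one reading of the diagram on slide 44.

```python
def membership_views(initial, events):
    """Return the sequence of views produced by a list of membership events.

    initial: list of initial members
    events:  list of (kind, processes) with kind in {"join", "leave", "suspect"}
    """
    members = list(initial)
    history = [tuple(members)]
    for kind, procs in events:
        if kind == "join":
            members = members + [p for p in procs if p not in members]
        elif kind in ("leave", "suspect"):
            # A suspected process is dropped from the view exactly like one that left.
            members = [p for p in members if p not in procs]
        else:
            raise ValueError(f"unknown event kind: {kind}")
        history.append(tuple(members))
    return history


events = [
    ("join", ["B", "D"]),
    ("leave", ["B"]),
    ("join", ["C"]),
    ("suspect", ["A"]),   # "A seems to have failed"
]
history = membership_views(["A"], events)
print(history)
# [('A',), ('A', 'B', 'D'), ('A', 'D'), ('A', 'D', 'C'), ('D', 'C')]
```

Treating "suspect" the same as "leave" is the point of the model: once the view excludes A, the rest of the system behaves as if A has failed, whether or not it really has.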

  44. Architecture [Figure: application processes A, B, C, D send “join”, “leave” and “A seems to have failed” to the GMS processes X, Y, Z; the resulting membership views are {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}]

  45. Issues • Group membership service (GMS) has just a small number of members • This core set will track membership for a large number of system processes • Internally it runs a group membership protocol (GMP) • Full system membership list is just replicated data managed by GMS members, updated using multicast

  46. GMP design • What protocol should we use to track the membership of GMS • Must avoid split-brain problem • Desire continuous availability • We’ll see that a version of 3PC can be used • But can’t “always” guarantee liveness

  47. Reading ahead? • Read chapters 12, 13 • Thought problem: how important is external consistency (called dynamic uniformity in the text)? • Homework: Read about FLP. Identify other “impossibility results” for distributed systems. What is the simplest case of an impossibility result that you can identify?
