CS 3700 Networks and Distributed Systems
Distributed Consensus and Fault Tolerance (or, why can’t we all just get along?)
Black Box Online Services • Storing and retrieving data from online services is commonplace • We tend to treat these services as black boxes • Data goes in, we assume outputs are correct • We have no idea how the service is implemented
Example interactions with black box services (request → response): • debit_transaction(-$75) → OK; get_recent_transactions() → […, “-$75”, …] • add_item_to_cart(“Cheerios”) → OK; get_cart() → [“Lucky Charms”, “Cheerios”] • post_update(“I LOLed”) → OK; get_newsfeed() → […, {“txt”: “I LOLed”, “likes”: 87}]
Peeling Back the Curtain • How are large services implemented? • Different types of services may have different requirements • Leads to different design decisions
Centralization • Advantages of centralization • Easy to set up and deploy • Consistency is guaranteed (assuming a correct software implementation) • Shortcomings • No load balancing • Single point of failure • Example: debit_transaction(-$75) → OK; get_account_balance() → Bob: $225 (the single server takes Bob’s balance from $300 to $225)
Sharding • Data is partitioned across servers by key, e.g. accounts <A-M> on one server and <N-Z> on another (routing sketched below) • Advantages of sharding • Better load balancing • If done intelligently, may allow incremental scalability • Shortcomings • Failures are still devastating • Example: Bob’s debit_account(-$75) is routed to the <A-M> shard, which updates his balance from $300 to $225
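The routing logic behind a range-sharded service is easy to sketch. Below is a minimal Python illustration; the SHARDS map, account balances, and helper names (shard_for, debit_account) are all invented for this example, and real deployments often shard by hashing the key rather than by letter range.

# Minimal sketch of the range-based sharding pictured above (<A-M>, <N-Z>).
# The shard map, account balances, and function names are invented.
SHARDS = {
    ("A", "M"): {"Alice": 600, "Bob": 300},   # shard <A-M>
    ("N", "Z"): {"Nancy": 150, "Zed": 900},   # shard <N-Z>
}

def shard_for(name):
    """Route a key to the shard whose letter range covers its first letter."""
    first = name[0].upper()
    for (lo, hi), store in SHARDS.items():
        if lo <= first <= hi:
            return store
    raise KeyError(name)

def debit_account(name, amount):
    shard_for(name)[name] += amount   # amount is negative for a debit
    return "OK"

def get_account_balance(name):
    return shard_for(name)[name]

assert debit_account("Bob", -75) == "OK"
assert get_account_balance("Bob") == 225  # Bob: $300 -> $225, as in the figure

Because each key maps to exactly one shard, adding capacity only requires splitting a range, which is the “incremental scalability” mentioned above.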
Replication • Each shard is copied across several servers, e.g. three replicas of <A-M> • Advantages of replication • Better load balancing of reads (potentially) • Resilience against failure; high availability (with some caveats) • Shortcomings • How do we maintain consistency? • Example: debit_account(-$75) must be applied to all <A-M> replicas (100% agreement) so that every copy reads Bob: $225
Consistency Failures • The leader cannot disambiguate cases where requests and responses are lost • No ACK on the write: the replica may still hold Bob: $300 while the leader believes $225 • No ACK on the response: the write may have been applied ($225) even though the leader saw a failure • Either way, the replicas no longer agree ($300 vs. $225) • Too few replicas? Adding more copies does not resolve the ambiguity • Asynchronous networks are problematic: a timeout cannot distinguish a crashed replica from a merely slow one
Byzantine Failures • In some cases, replicas may be buggy or malicious, e.g. one replica reports Bob: $1000 while the others report $300, making agreement impossible • When discussing distributed systems, failures due to malice are known as Byzantine failures • Name comes from the Byzantine generals problem • More on this later…
Problem and Definitions • Build a distributed system that meets the following goals: • The system should be able to reach consensus • Consensus [n]: general agreement • The system should be consistent • Data should be correct; no integrity violations • The system should be highly available • Data should be accessible even in the face of arbitrary failures • Challenges: • Many, many different failure modes • Theory tells us that these goals are impossible to achieve (more on this later)
Outline Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)
Forcing Consistency • One approach to building distributed systems is to force them to be consistent • Guarantee that all replicas receive an update… • …Or none of them do • If consistency is guaranteed, then reaching consensus is trivial • Example: debit_account(-$75) reaches every replica, so all copies go $300 → $225 and the client sees OK; a later debit_account(-$50) cannot reach every replica, so it is applied nowhere and the client sees an error
Distributed Commit Problem • Application that performs operations on multiple replicas or databases • We want to guarantee that all replicas get updated, or none do • Distributed commit problem: • Operation is committed when all participants can perform the action • Once a commit decision is reached, all participants must perform the action • These two steps give rise to the Two Phase Commit protocol
Motivating Transactions • transfer_money(Alice, Bob, $100) is really two separate actions: debit_account(Alice, -$100) and debit_account(Bob, $100) • The system becomes inconsistent if any individual action fails • e.g. Alice’s debit succeeds ($600 → $500) but Bob’s credit returns an error and his balance stays at $300, so $100 has vanished
Simple Transactions • Actions inside a transaction behave as a single action • begin_transaction() • debit_account(Alice, -$100): Alice’s balance goes $600 → $500 • debit_account(Bob, $100): Bob’s balance goes $300 → $400 • end_transaction() → OK • At this point, if there haven’t been any errors, we say the transaction is committed
Simple Transactions • If any individual action fails, the whole transaction fails • Failed transactions have no side effects • Incomplete results during transactions are hidden • e.g. debit_account(Alice, -$100) tentatively takes Alice $600 → $500, but when debit_account(Bob, $100) errors, the whole transaction is rolled back (a minimal sketch of this behavior follows below)
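To make the all-or-nothing behavior concrete, here is a minimal single-machine sketch. The snapshot-and-restore trick, the accounts dictionary, and the function names are invented for illustration; real databases implement atomicity with journaling and locking rather than by copying state.

# Sketch of atomic transactions via snapshot/rollback (illustrative only).
accounts = {"Alice": 600, "Bob": 300}

def debit_account(name, amount):
    if name not in accounts:
        raise KeyError(name)
    accounts[name] += amount

def transfer_money(src, dst, amount):
    global accounts
    snapshot = dict(accounts)     # begin_transaction(): remember old state
    try:
        debit_account(src, -amount)
        debit_account(dst, amount)
    except Exception:
        accounts = snapshot       # failed transaction: no side effects
        return "Error"
    return "OK"                   # end_transaction(): commit

assert transfer_money("Alice", "Bob", 100) == "OK"
assert accounts == {"Alice": 500, "Bob": 400}
assert transfer_money("Alice", "Carol", 100) == "Error"  # no such account
assert accounts == {"Alice": 500, "Bob": 400}            # rollback worked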
ACID Properties • Traditional transactional databases support the following: • Atomicity: all or none; if transaction fails then no changes are applied to the database • Consistency: there are no violations of database integrity • Isolation: partial results from incomplete transactions are hidden • Durability: the effects of committed transactions are permanent
Two Phase Commits (2PC) • There are well-known techniques for implementing transactions in centralized databases • E.g. journaling (append-only logs) • Out of scope for this class (take a database class, or CS 5600) • Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting • Protocol operates in rounds • Assume we have a leader or coordinator that manages transactions • Each replica states that it is ready to commit • Leader decides the outcome and instructs replicas to commit or abort • Assume no byzantine faults (i.e. nobody is malicious)
2PC Example (Leader and Replicas 1–3) • Begin by distributing the update: txid = 678; value = y (txid is a logical clock) • Wait to receive “ready to commit” (ready txid = 678) from all replicas • Tell replicas to commit (commit txid = 678) • Replicas apply the write and reply committed txid = 678 • At this point, all replicas are guaranteed to be up-to-date (the leader’s logic is sketched below)
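Under these assumptions (a fixed leader, no byzantine faults), the leader’s two rounds can be sketched in a few lines. The Replica class, message strings, and two_phase_commit function below are invented to mirror the diagram, not a real implementation; in particular, a real leader must also log its decisions and retry commits across crashes, as the following slides discuss.

# Sketch of the 2PC message flow (illustrative, not a real implementation).
class Replica:
    def __init__(self):
        self.value, self.pending = None, {}

    def prepare(self, txid, value):
        self.pending[txid] = value           # stage the write, don't apply
        return "ready"

    def commit(self, txid):
        self.value = self.pending.pop(txid)  # apply the staged write
        return "committed"

    def abort(self, txid):
        self.pending.pop(txid, None)         # discard the staged write
        return "aborted"

def two_phase_commit(txid, value, replicas):
    # Phase 1: distribute the update and wait for every "ready to commit"
    try:
        if not all(r.prepare(txid, value) == "ready" for r in replicas):
            raise RuntimeError("replica not ready")
    except Exception:
        for r in replicas:                   # any failure: abort everywhere
            r.abort(txid)
        return False
    # Phase 2: everyone is ready, so tell every replica to commit
    for r in replicas:
        r.commit(txid)                       # in practice, retried until acked
    return True

replicas = [Replica() for _ in range(3)]
assert two_phase_commit(678, "y", replicas)
assert all(r.value == "y" for r in replicas)  # all replicas up-to-date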
Failure Modes • Replica Failure • Before or during the initial promise phase • Before or during the commit • Leader Failure • Before receiving all promises • Before or during sending commits • Before receiving all committed messages
Replica Failure (1) • The leader distributes txid = 678; value = y, but only some replicas answer ready txid = 678 • Error: not all replicas are “ready” • The leader sends abort txid = 678; replicas discard the staged write (back to x) and reply aborted txid = 678 • The same thing happens if a write or a “ready” is dropped, a replica times out, or a replica returns an error
Replica Failure (2) • All replicas were ready, but one fails during the commit phase: some replicas apply y and reply committed txid = 678, one does not • Known inconsistent state • Leader must keep retrying commit txid = 678 until all commits succeed
Replica Failure (2), continued • Replicas attempt to resume unfinished transactions when they reboot • The recovered replica asks stat txid = 678, is told to commit, applies y, and replies committed txid = 678 • Finally, the system is consistent and may proceed
Leader Failure • What happens if the leader crashes? • Leader must constantly be writing its state to permanent storage • It must pick up where it left off once it reboots • If there are unconfirmed transactions • Send new write messages, wait for “ready to commit” replies • If there are uncommitted transactions • Send new commit messages, wait for “committed” replies • Replicas may see duplicate messages during this process • Thus, it’s important that every transaction have a unique txid
Allowing Progress • Key problem: what if the leader crashes and never recovers? • By default, replicas block until contacted by the leader • Can the system make progress? • Yes, under limited circumstances • After sending a “ready to commit” message, each replica starts a timer • The first replica whose timer expires elects itself as the new leader • Query the other replicas for their status • Send “commits” to all replicas if they are all “ready” • However, this only works if all the replicas are alive and reachable • If a replica crashes or is unreachable, deadlock is unavoidable
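This recovery rule boils down to a small decision function. The sketch below follows the procedure as stated on this slide; the status strings and function name are invented, the “abort” branch is an assumption about what a missing write implies, and a real replica would also need to deduplicate competing recoveries.

# Sketch of the self-elected leader's decision rule (illustrative only).
def recovery_action(statuses):
    """statuses: one entry per replica: "ready", "committed",
    or None if that replica is crashed/unreachable."""
    if None in statuses:
        return "block"   # a replica is unreachable: deadlock is unavoidable
    if all(s in ("ready", "committed") for s in statuses):
        return "commit"  # everyone is ready (or already committed)
    return "abort"       # someone never got the write, so the old leader
                         # cannot have decided to commit: safe to abort

assert recovery_action(["ready", "ready", "ready"]) == "commit"
assert recovery_action(["ready", None, "ready"]) == "block"  # deadlock case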
New Leader • The leader crashes after the replicas send ready txid = 678 • Replica 2’s timeout expires; it begins the recovery procedure • Replica 2 queries the others (stat txid = 678), finds that everyone is ready, and sends commit txid = 678 • All replicas apply y and reply committed txid = 678 • System is consistent again
Deadlock • The leader crashes, and one replica is also down or unreachable • Replica 2’s timeout expires; it begins the recovery procedure (stat txid = 678) • The reachable replicas report ready, but the missing replica’s status is unknown • Cannot proceed (it may not be ready), but cannot abort (the old leader may already have told it to commit)
Garbage Collection • 2PC is somewhat of a misnomer: there is actually a third phase • Garbage collection • Replicas must retain records of past transactions, just in case the leader fails • For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed • Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies • In practice, the leader periodically tells replicas to garbage collect • Transactions <= some txid in the past may be deleted (a sketch of this bookkeeping follows below)
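A tiny sketch of the bookkeeping this implies. The history dictionary and function names are invented; the point is that a duplicate commit from a rebooted leader is answered from the record rather than re-applied, and records are pruned only when the leader says it is safe.

# Sketch of transaction records and garbage collection (illustrative only).
history = {676: "committed", 677: "aborted", 678: "committed"}

def commit(txid):
    if history.get(txid) == "committed":
        return "committed"        # duplicate commit: already applied, no-op
    history[txid] = "committed"   # first time: apply the staged write here
    return "committed"

def garbage_collect(up_to_txid):
    """Leader says records for transactions <= up_to_txid may be deleted."""
    for txid in [t for t in history if t <= up_to_txid]:
        del history[txid]

assert commit(678) == "committed"  # replayed after a leader reboot: harmless
garbage_collect(677)
assert sorted(history) == [678]    # old records pruned, recent ones retained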
2PC Summary • Message complexity: O(2n) • The good: guarantees consistency • The bad: • Write performance suffers if there are failures during the commit phase • Does not scale gracefully (possible, but difficult to do) • A pure 2PC system blocks all writes if the leader fails • Smarter 2PC systems still block all writes if the leader + 1 replica fail • 2PC sacrifices availability in favor of consistency
Can 2PC be Fixed? • The issue with 2PC is reliance on the centralized leader • Only the leader knows if a transaction is 100% ready to commit or not • Thus, if the leader + 1 replica fail, recovery is impossible • Potential solution: Three Phase Commit • Add an additional round of communication • Tell all replicas to prepare to commit, before actually committing • State of the system can always be deduced by a subset of alive replicas that can communicate with each other • … unless there are partitions (more on this later)
3PC Example (Leader and Replicas 1–3) • Begin by distributing the update: txid = 678; value = y • Wait to receive “ready to commit” (ready txid = 678) from all replicas • Tell all replicas that everyone is “ready to commit” (prepare txid = 678), and wait for prepared txid = 678 from all replicas • Tell replicas to commit (commit txid = 678); replicas apply y and reply committed txid = 678 • At this point, all replicas are guaranteed to be up-to-date
Leader Failures (1) • The leader distributes txid = 678; value = y and crashes after receiving ready txid = 678 • Replica 2’s timeout expires; it begins the recovery procedure (stat txid = 678) • Every reachable replica reports ready, so Replica 3 cannot be in the committed state, thus it is okay to abort (abort txid = 678) • Replicas discard y and reply aborted txid = 678 • System is consistent again
Leader Failures (2) • The leader crashes after sending prepare txid = 678 • Replica 2’s timeout expires; it begins the recovery procedure (stat txid = 678) • A replica reports prepared txid = 678: all replicas must have been ready to commit • Replica 2 sends commit txid = 678; replicas apply y and reply committed txid = 678 • System is consistent again (this recovery rule is sketched below)
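Both scenarios follow one recovery rule, sketched below. Because the leader sends “prepare” only after every replica is ready, observing even one prepared (or committed) replica proves that all replicas were ready. The status strings and function are invented for illustration.

# Sketch of the 3PC recovery decision (illustrative only).
def recover_3pc(statuses):
    """statuses: "ready", "prepared", or "committed", one per reachable replica."""
    if any(s in ("prepared", "committed") for s in statuses):
        return "commit"  # the leader reached phase 2, so everyone was ready
    return "abort"       # nobody prepared: the leader cannot have committed

assert recover_3pc(["ready", "ready", "ready"]) == "abort"      # scenario 1
assert recover_3pc(["ready", "prepared", "ready"]) == "commit"  # scenario 2

Note that this rule is only sound when the reachable replicas are all the replicas; as the next slides show, if the network partitions, both sides can run it independently and reach opposite decisions.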
Oh Great, I Fixed Everything! • Wrong • 3PC is not robust against network partitions • What is a network partition? • A split in the network, such that full n-to-n connectivity is broken • i.e. not all servers can contact each other • Partitions split the network into two or more disjoint subnetworks • How can a network partition occur? • A switch or a router may fail, or it may receive an incorrect routing rule • A cable connecting two racks of servers may develop a fault • Network partitions are very real; they happen all the time
Partitioning • The leader distributes txid = 678; value = y, and then the network partitions into two subnets: the leader and Replica 1 on one side, Replicas 2 and 3 on the other • Replicas 2 and 3 time out and initiate leader recovery; seeing only ready states, they abort back to x • Meanwhile, the leader assumes Replicas 2 and 3 have failed and moves on: prepare, prepared, and commit with Replica 1, which applies y • System is inconsistent: one side committed y, the other aborted back to x
3PC Summary • Adds an additional phase vs. 2PC • Message complexity: O(3n) • Really four phases with garbage collection • The good: allows the system to make progress under more failure conditions • The bad: • Extra round of communication makes 3PC even slower than 2PC • Does not work if the network partitions • 2PC will simply deadlock if there is a partition, rather than become inconsistent • In practice, nobody uses 3PC • Additional complexity and performance penalty just isn’t worth it • Loss of consistency during partitions is a deal breaker
Outline Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)
A Moment of Reflection • Goals, revisited: • The system should be able to reach consensus • Consensus [n]: general agreement • The system should be consistent • Data should be correct; no integrity violations • The system should be highly available • Data should be accessible even in the face of arbitrary failures • Achieving these goals may be harder than we thought :( • Huge number of failure modes • Network partitions are difficult to cope with • We haven’t even considered byzantine failures
What Can Theory Tell Us? • Let’s assume the network is synchronous and reliable • Algorithm can be divided into discrete rounds • If a message from r is not received in a round, then r must be faulty • Since we’re assuming synchrony, packets cannot be delayed arbitrarily • During each round, r may send m <= n messages • n is the total number of replicas • You might crash before sending all n messages • If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?
Consensus in a Synchronous System • Initialization: • All replicas choose a value 0 or 1 (can generalize to more values if you want) • Properties: • Agreement: all non-faulty processes ultimately choose the same value • Either 0 or 1 in this case • Validity: if a replica decides on a value, then at least one replica must have started with that value • This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless • Termination: the system must converge in finite time
Algorithm Sketch • Each replica maintains a map M of all known values • Initially, the map only contains the replica’s own value • e.g. M = {‘replica1’: 0} • Each round, broadcast M to all other replicas • On receipt, construct the union of the received M and the local M • Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas • Example with three non-faulty replicas (1, 3, and 5) • M = {‘replica1’: 0, ‘replica3’: 1, ‘replica5’: 0} • Final value is min(M.values()) (a runnable simulation follows below)
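The flooding algorithm is short enough to simulate directly. In the sketch below, crashed replicas fail before round 1 for simplicity; the harder mid-round crash patterns, which are what force f + 1 rounds (next slide), are not modeled. Replica names and the harness are invented.

# Simulation of flooding consensus in the synchronous round model.
def synchronous_consensus(initial, f, crashed=frozenset()):
    """initial maps replica name -> 0 or 1; crashed replicas never send."""
    alive = [r for r in initial if r not in crashed]
    M = {r: {r: initial[r]} for r in alive}        # each replica's local map

    for _ in range(f + 1):                         # f + 1 rounds suffice
        outgoing = {r: dict(M[r]) for r in alive}  # snapshot sent this round
        for sender in alive:                       # broadcast M to everyone
            for receiver in alive:
                M[receiver].update(outgoing[sender])  # union with local M

    decisions = {min(M[r].values()) for r in alive}
    assert len(decisions) == 1   # agreement: all alive replicas decide alike
    return decisions.pop()

value = synchronous_consensus(
    {"replica1": 0, "replica2": 1, "replica3": 1, "replica4": 1, "replica5": 0},
    f=2, crashed={"replica2", "replica4"},
)
assert value == 0  # min over {replica1: 0, replica3: 1, replica5: 0}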
Bounding Convergence Time • How many rounds will it take if we are willing to tolerate f failures? • f + 1 rounds • Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision) • Proof sketch, assuming f = 2 • Worst case scenario is that replicas crash during rounds 1 and 2 • During round 1, replica x crashes • All other replicas don’t know if x is alive or dead • During round 2, replica y crashes • Clear that x is not alive, but unknown if y is alive or dead • During round 3, no more replicas may crash • All replicas are guaranteed to receive updated info from all other replicas • Final decision can be made
A More Realistic Model • The previous result is interesting, but unrealistic • We assumed that the network is synchronous and reliable • Of course, neither of these things is true in reality • What if the network is asynchronous but reliable? • Replicas may take an arbitrarily long time to respond to messages • Let’s also assume that all faults are crash faults • i.e. if a replica has a problem it crashes and never wakes up • No byzantine faults
The FLP Result There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of crash faults. The result is true even if no crash actually occurs! • This is known as the FLP result • Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985 • Extremely powerful result because: • If you can’t agree on 1-bit, generalizing to larger values isn’t going to help you • If you can’t converge with crash faults, no way you can converge with byzantine faults • If you can’t converge on a reliable network, no way you can on an unreliable network
FLP Proof Sketch • In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow • What can x do? • If it waits, it will block since it might never receive the message from y • If it decides, it may find out later that y made a different decision • The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message • Thus, the system oscillates between 0 and 1 and never converges
Impact of FLP • FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate • These runs are extremely unlikely (“probability zero”) • Yet they imply that we can’t find a totally correct solution • And so “consensus is impossible” (“not always possible”) • So what can we do? • Use randomization, probabilistic guarantees (gossip protocols) • Avoid consensus, use quorum systems (Paxos or Raft) • In other words, trade off consistency in favor of availability
Consistency vs. Availability • FLP states that perfect consistency is impossible • Practically, we can get close to perfect consistency, but at significant cost • e.g. using 3PC • Availability begins to suffer dramatically under failure conditions • Is there a way to formalize the tradeoff between consistency and availability?
Eric Brewer’s CAP Theorem • CAP theorem for distributed data replication • Consistency: updates to data are applied to all or none • Availability: must be able to access all data • Network Partition Tolerance: failures can partition the network into disjoint subnetworks • The Brewer Theorem • No system can simultaneously achieve C and A and P • Typical interpretation: “C, A, and P: choose 2” • In practice, all networks may partition, thus you must choose P • So a better interpretation might be “C or A: choose 1” • Originally stated as a conjecture without formal proof; a formalized version was later proven by Gilbert and Lynch (2002) • Widely accepted as a rule of thumb