CS 3700 Networks and Distributed Systems
Distributed Consensus and Fault Tolerance (or, why can’t we all just get along?)
Black Box Online Services • Storing and retrieving data from online services is commonplace • We tend to treat these services as black boxes • Data goes in, we assume outputs are correct • We have no idea how the service is implemented
Example interactions with black box services (request → response): • debit_transaction(-$75) → OK; get_recent_transactions() → […, “-$75”, …] • add_item_to_cart(“Cheerios”) → OK; get_cart() → [“Lucky Charms”, “Cheerios”] • post_update(“I LOLed”) → OK; get_newsfeed() → […, {“txt”: “I LOLed”, “likes”: 87}]
Peeling Back the Curtain • How are large services implemented? • Different types of services may have different requirements • Leads to different design decisions
Centralization • Advantages of centralization • Easy to set up and deploy • Consistency is guaranteed (assuming a correct software implementation) • Shortcomings • No load balancing • Single point of failure • Example: debit_transaction(-$75) → OK; get_account_balance() → Bob: $225 (the single server takes Bob’s balance from $300 to $225)
Sharding • Data is partitioned across servers by key, e.g. accounts <A-M> on one server and <N-Z> on another (routing sketched below) • Advantages of sharding • Better load balancing • If done intelligently, may allow incremental scalability • Shortcomings • Failures are still devastating • Example: Bob’s debit_account(-$75) is routed to the <A-M> shard, which updates his balance from $300 to $225
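The routing logic behind a range-sharded service is easy to sketch. Below is a minimal Python illustration; the SHARDS map, account balances, and helper names (shard_for, debit_account) are all invented for this example, and real deployments often shard by hashing the key rather than by letter range.

# Minimal sketch of the range-based sharding pictured above (<A-M>, <N-Z>).
# The shard map, account balances, and function names are invented.
SHARDS = {
    ("A", "M"): {"Alice": 600, "Bob": 300},   # shard <A-M>
    ("N", "Z"): {"Nancy": 150, "Zed": 900},   # shard <N-Z>
}

def shard_for(name):
    """Route a key to the shard whose letter range covers its first letter."""
    first = name[0].upper()
    for (lo, hi), store in SHARDS.items():
        if lo <= first <= hi:
            return store
    raise KeyError(name)

def debit_account(name, amount):
    shard_for(name)[name] += amount   # amount is negative for a debit
    return "OK"

def get_account_balance(name):
    return shard_for(name)[name]

assert debit_account("Bob", -75) == "OK"
assert get_account_balance("Bob") == 225  # Bob: $300 -> $225, as in the figure

Because each key maps to exactly one shard, adding capacity only requires splitting a range, which is the “incremental scalability” mentioned above.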
Replication • Each shard is copied across several servers, e.g. three replicas of <A-M> • Advantages of replication • Better load balancing of reads (potentially) • Resilience against failure; high availability (with some caveats) • Shortcomings • How do we maintain consistency? • Example: debit_account(-$75) must be applied to all <A-M> replicas (100% agreement) so that every copy reads Bob: $225
Consistency Failures • The leader cannot disambiguate cases where requests and responses are lost • No ACK on the write: the replica may still hold Bob: $300 while the leader believes $225 • No ACK on the response: the write may have been applied ($225) even though the leader saw a failure • Either way, the replicas no longer agree ($300 vs. $225) • Too few replicas? Adding more copies does not resolve the ambiguity • Asynchronous networks are problematic: a timeout cannot distinguish a crashed replica from a merely slow one
Byzantine Failures • In some cases, replicas may be buggy or malicious, e.g. one replica reports Bob: $1000 while the others report $300, making agreement impossible • When discussing distributed systems, failures due to malice are known as Byzantine failures • Name comes from the Byzantine generals problem • More on this later…
Problem and Definitions • Build a distributed system that meets the following goals: • The system should be able to reach consensus • Consensus [n]: general agreement • The system should be consistent • Data should be correct; no integrity violations • The system should be highly available • Data should be accessible even in the face of arbitrary failures • Challenges: • Many, many different failure modes • Theory tells us that these goals are impossible to achieve (more on this later)
Outline Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)
Forcing Consistency • One approach to building distributed systems is to force them to be consistent • Guarantee that all replicas receive an update… • …Or none of them do • If consistency is guaranteed, then reaching consensus is trivial • Example: debit_account(-$75) reaches every replica, so all copies go $300 → $225 and the client sees OK; a later debit_account(-$50) cannot reach every replica, so it is applied nowhere and the client sees an error
Distributed Commit Problem • Application that performs operations on multiple replicas or databases • We want to guarantee that all replicas get updated, or none do • Distributed commit problem: • Operation is committed when all participants can perform the action • Once a commit decision is reached, all participants must perform the action • These two steps give rise to the Two Phase Commit protocol
Motivating Transactions • transfer_money(Alice, Bob, $100) is really two separate actions: debit_account(Alice, -$100) and debit_account(Bob, $100) • The system becomes inconsistent if any individual action fails • e.g. Alice’s debit succeeds ($600 → $500) but Bob’s credit returns an error and his balance stays at $300, so $100 has vanished
Simple Transactions • Actions inside a transaction behave as a single action • begin_transaction() • debit_account(Alice, -$100): Alice’s balance goes $600 → $500 • debit_account(Bob, $100): Bob’s balance goes $300 → $400 • end_transaction() → OK • At this point, if there haven’t been any errors, we say the transaction is committed
Simple Transactions • If any individual action fails, the whole transaction fails • Failed transactions have no side effects • Incomplete results during transactions are hidden • e.g. debit_account(Alice, -$100) tentatively takes Alice $600 → $500, but when debit_account(Bob, $100) errors, the whole transaction is rolled back (a minimal sketch of this behavior follows below)
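To make the all-or-nothing behavior concrete, here is a minimal single-machine sketch. The snapshot-and-restore trick, the accounts dictionary, and the function names are invented for illustration; real databases implement atomicity with journaling and locking rather than by copying state.

# Sketch of atomic transactions via snapshot/rollback (illustrative only).
accounts = {"Alice": 600, "Bob": 300}

def debit_account(name, amount):
    if name not in accounts:
        raise KeyError(name)
    accounts[name] += amount

def transfer_money(src, dst, amount):
    global accounts
    snapshot = dict(accounts)     # begin_transaction(): remember old state
    try:
        debit_account(src, -amount)
        debit_account(dst, amount)
    except Exception:
        accounts = snapshot       # failed transaction: no side effects
        return "Error"
    return "OK"                   # end_transaction(): commit

assert transfer_money("Alice", "Bob", 100) == "OK"
assert accounts == {"Alice": 500, "Bob": 400}
assert transfer_money("Alice", "Carol", 100) == "Error"  # no such account
assert accounts == {"Alice": 500, "Bob": 400}            # rollback worked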
ACID Properties • Traditional transactional databases support the following: • Atomicity: all or none; if transaction fails then no changes are applied to the database • Consistency: there are no violations of database integrity • Isolation: partial results from incomplete transactions are hidden • Durability: the effects of committed transactions are permanent
Two Phase Commits (2PC) • There are well-known techniques for implementing transactions in centralized databases • E.g. journaling (append-only logs) • Out of scope for this class (take a database class, or CS 5600) • Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting • Protocol operates in rounds • Assume we have a leader or coordinator that manages transactions • Each replica states that it is ready to commit • Leader decides the outcome and instructs replicas to commit or abort • Assume no byzantine faults (i.e. nobody is malicious)
2PC Example (Leader and Replicas 1–3) • Begin by distributing the update: txid = 678; value = y (txid is a logical clock) • Wait to receive “ready to commit” (ready txid = 678) from all replicas • Tell replicas to commit (commit txid = 678) • Replicas apply the write and reply committed txid = 678 • At this point, all replicas are guaranteed to be up-to-date (the leader’s logic is sketched below)
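Under these assumptions (a fixed leader, no byzantine faults), the leader’s two rounds can be sketched in a few lines. The Replica class, message strings, and two_phase_commit function below are invented to mirror the diagram, not a real implementation; in particular, a real leader must also log its decisions and retry commits across crashes, as the following slides discuss.

# Sketch of the 2PC message flow (illustrative, not a real implementation).
class Replica:
    def __init__(self):
        self.value, self.pending = None, {}

    def prepare(self, txid, value):
        self.pending[txid] = value           # stage the write, don't apply
        return "ready"

    def commit(self, txid):
        self.value = self.pending.pop(txid)  # apply the staged write
        return "committed"

    def abort(self, txid):
        self.pending.pop(txid, None)         # discard the staged write
        return "aborted"

def two_phase_commit(txid, value, replicas):
    # Phase 1: distribute the update and wait for every "ready to commit"
    try:
        if not all(r.prepare(txid, value) == "ready" for r in replicas):
            raise RuntimeError("replica not ready")
    except Exception:
        for r in replicas:                   # any failure: abort everywhere
            r.abort(txid)
        return False
    # Phase 2: everyone is ready, so tell every replica to commit
    for r in replicas:
        r.commit(txid)                       # in practice, retried until acked
    return True

replicas = [Replica() for _ in range(3)]
assert two_phase_commit(678, "y", replicas)
assert all(r.value == "y" for r in replicas)  # all replicas up-to-date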
Failure Modes • Replica Failure • Before or during the initial promise phase • Before or during the commit • Leader Failure • Before receiving all promises • Before or during sending commits • Before receiving all committed messages
Replica Failure (1) • The leader distributes txid = 678; value = y, but only some replicas answer ready txid = 678 • Error: not all replicas are “ready” • The leader sends abort txid = 678; replicas discard the staged write (back to x) and reply aborted txid = 678 • The same thing happens if a write or a “ready” is dropped, a replica times out, or a replica returns an error
Replica Failure (2) • All replicas were ready, but one fails during the commit phase: some replicas apply y and reply committed txid = 678, one does not • Known inconsistent state • Leader must keep retrying commit txid = 678 until all commits succeed
Replica Failure (2), continued • Replicas attempt to resume unfinished transactions when they reboot • The recovered replica asks stat txid = 678, is told to commit, applies y, and replies committed txid = 678 • Finally, the system is consistent and may proceed
Leader Failure • What happens if the leader crashes? • Leader must constantly be writing its state to permanent storage • It must pick up where it left off once it reboots • If there are unconfirmed transactions • Send new write messages, wait for “ready to commit” replies • If there are uncommitted transactions • Send new commit messages, wait for “committed” replies • Replicas may see duplicate messages during this process • Thus, it’s important that every transaction have a unique txid
Allowing Progress • Key problem: what if the leader crashes and never recovers? • By default, replicas block until contacted by the leader • Can the system make progress? • Yes, under limited circumstances • After sending a “ready to commit” message, each replica starts a timer • The first replica whose timer expires elects itself as the new leader • Query the other replicas for their status • Send “commits” to all replicas if they are all “ready” • However, this only works if all the replicas are alive and reachable • If a replica crashes or is unreachable, deadlock is unavoidable
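This recovery rule boils down to a small decision function. The sketch below follows the procedure as stated on this slide; the status strings and function name are invented, the “abort” branch is an assumption about what a missing write implies, and a real replica would also need to deduplicate competing recoveries.

# Sketch of the self-elected leader's decision rule (illustrative only).
def recovery_action(statuses):
    """statuses: one entry per replica: "ready", "committed",
    or None if that replica is crashed/unreachable."""
    if None in statuses:
        return "block"   # a replica is unreachable: deadlock is unavoidable
    if all(s in ("ready", "committed") for s in statuses):
        return "commit"  # everyone is ready (or already committed)
    return "abort"       # someone never got the write, so the old leader
                         # cannot have decided to commit: safe to abort

assert recovery_action(["ready", "ready", "ready"]) == "commit"
assert recovery_action(["ready", None, "ready"]) == "block"  # deadlock case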
New Leader • The leader crashes after the replicas send ready txid = 678 • Replica 2’s timeout expires; it begins the recovery procedure • Replica 2 queries the others (stat txid = 678), finds that everyone is ready, and sends commit txid = 678 • All replicas apply y and reply committed txid = 678 • System is consistent again
Deadlock • The leader crashes, and one replica is also down or unreachable • Replica 2’s timeout expires; it begins the recovery procedure (stat txid = 678) • The reachable replicas report ready, but the missing replica’s status is unknown • Cannot proceed (it may not be ready), but cannot abort (the old leader may already have told it to commit)
Garbage Collection • 2PC is somewhat of a misnomer: there is actually a third phase • Garbage collection • Replicas must retain records of past transactions, just in case the leader fails • For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed • Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies • In practice, the leader periodically tells replicas to garbage collect • Transactions <= some txid in the past may be deleted (a sketch of this bookkeeping follows below)
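A tiny sketch of the bookkeeping this implies. The history dictionary and function names are invented; the point is that a duplicate commit from a rebooted leader is answered from the record rather than re-applied, and records are pruned only when the leader says it is safe.

# Sketch of transaction records and garbage collection (illustrative only).
history = {676: "committed", 677: "aborted", 678: "committed"}

def commit(txid):
    if history.get(txid) == "committed":
        return "committed"        # duplicate commit: already applied, no-op
    history[txid] = "committed"   # first time: apply the staged write here
    return "committed"

def garbage_collect(up_to_txid):
    """Leader says records for transactions <= up_to_txid may be deleted."""
    for txid in [t for t in history if t <= up_to_txid]:
        del history[txid]

assert commit(678) == "committed"  # replayed after a leader reboot: harmless
garbage_collect(677)
assert sorted(history) == [678]    # old records pruned, recent ones retained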
2PC Summary • Message complexity: O(2n) • The good: guarantees consistency • The bad: • Write performance suffers if there are failures during the commit phase • Does not scale gracefully (possible, but difficult to do) • A pure 2PC system blocks all writes if the leader fails • Smarter 2PC systems still block all writes if the leader + 1 replica fail • 2PC sacrifices availability in favor of consistency
Can 2PC be Fixed? • The issue with 2PC is reliance on the centralized leader • Only the leader knows if a transaction is 100% ready to commit or not • Thus, if the leader + 1 replica fail, recovery is impossible • Potential solution: Three Phase Commit • Add an additional round of communication • Tell all replicas to prepare to commit, before actually committing • State of the system can always be deduced by a subset of alive replicas that can communicate with each other • … unless there are partitions (more on this later)
3PC Example (Leader and Replicas 1–3) • Begin by distributing the update: txid = 678; value = y • Wait to receive “ready to commit” (ready txid = 678) from all replicas • Tell all replicas that everyone is “ready to commit” (prepare txid = 678), and wait for prepared txid = 678 from all replicas • Tell replicas to commit (commit txid = 678); replicas apply y and reply committed txid = 678 • At this point, all replicas are guaranteed to be up-to-date
Leader Failures (1) • The leader distributes txid = 678; value = y and crashes after receiving ready txid = 678 • Replica 2’s timeout expires; it begins the recovery procedure (stat txid = 678) • Every reachable replica reports ready, so Replica 3 cannot be in the committed state, thus it is okay to abort (abort txid = 678) • Replicas discard y and reply aborted txid = 678 • System is consistent again
Leader Failures (2) • The leader crashes after sending prepare txid = 678 • Replica 2’s timeout expires; it begins the recovery procedure (stat txid = 678) • A replica reports prepared txid = 678: all replicas must have been ready to commit • Replica 2 sends commit txid = 678; replicas apply y and reply committed txid = 678 • System is consistent again (this recovery rule is sketched below)
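Both scenarios follow one recovery rule, sketched below. Because the leader sends “prepare” only after every replica is ready, observing even one prepared (or committed) replica proves that all replicas were ready. The status strings and function are invented for illustration.

# Sketch of the 3PC recovery decision (illustrative only).
def recover_3pc(statuses):
    """statuses: "ready", "prepared", or "committed", one per reachable replica."""
    if any(s in ("prepared", "committed") for s in statuses):
        return "commit"  # the leader reached phase 2, so everyone was ready
    return "abort"       # nobody prepared: the leader cannot have committed

assert recover_3pc(["ready", "ready", "ready"]) == "abort"      # scenario 1
assert recover_3pc(["ready", "prepared", "ready"]) == "commit"  # scenario 2

Note that this rule is only sound when the reachable replicas are all the replicas; as the next slides show, if the network partitions, both sides can run it independently and reach opposite decisions.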
Oh Great, I Fixed Everything! • Wrong • 3PC is not robust against network partitions • What is a network partition? • A split in the network, such that full n-to-n connectivity is broken • i.e. not all servers can contact each other • Partitions split the network into two or more disjoint subnetworks • How can a network partition occur? • A switch or a router may fail, or it may receive an incorrect routing rule • A cable connecting two racks of servers may develop a fault • Network partitions are very real; they happen all the time
Partitioning • The leader distributes txid = 678; value = y, and then the network partitions into two subnets: the leader and Replica 1 on one side, Replicas 2 and 3 on the other • Replicas 2 and 3 time out and initiate leader recovery; seeing only ready states, they abort back to x • Meanwhile, the leader assumes Replicas 2 and 3 have failed and moves on: prepare, prepared, and commit with Replica 1, which applies y • System is inconsistent: one side committed y, the other aborted back to x
3PC Summary • Adds an additional phase vs. 2PC • Message complexity: O(3n) • Really four phases with garbage collection • The good: allows the system to make progress under more failure conditions • The bad: • Extra round of communication makes 3PC even slower than 2PC • Does not work if the network partitions • 2PC will simply deadlock if there is a partition, rather than become inconsistent • In practice, nobody uses 3PC • Additional complexity and performance penalty just isn’t worth it • Loss of consistency during partitions is a deal breaker
Outline Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)
A Moment of Reflection • Goals, revisited: • The system should be able to reach consensus • Consensus [n]: general agreement • The system should be consistent • Data should be correct; no integrity violations • The system should be highly available • Data should be accessible even in the face of arbitrary failures • Achieving these goals may be harder than we thought :( • Huge number of failure modes • Network partitions are difficult to cope with • We haven’t even considered byzantine failures
What Can Theory Tell Us? • Let’s assume the network is synchronous and reliable • Algorithm can be divided into discrete rounds • If a message from r is not received in a round, then r must be faulty • Since we’re assuming synchrony, packets cannot be delayed arbitrarily • During each round, r may send m <= n messages • n is the total number of replicas • You might crash before sending all n messages • If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?
Consensus in a Synchronous System • Initialization: • All replicas choose a value 0 or 1 (can generalize to more values if you want) • Properties: • Agreement: all non-faulty processes ultimately choose the same value • Either 0 or 1 in this case • Validity: if a replica decides on a value, then at least one replica must have started with that value • This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless • Termination: the system must converge in finite time
Algorithm Sketch • Each replica maintains a map M of all known values • Initially, the map only contains the replica’s own value • e.g. M = {‘replica1’: 0} • Each round, broadcast M to all other replicas • On receipt, construct the union of the received M and the local M • Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas • Example with three non-faulty replicas (1, 3, and 5) • M = {‘replica1’: 0, ‘replica3’: 1, ‘replica5’: 0} • Final value is min(M.values()) (a runnable simulation follows below)
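The flooding algorithm is short enough to simulate directly. In the sketch below, crashed replicas fail before round 1 for simplicity; the harder mid-round crash patterns, which are what force f + 1 rounds (next slide), are not modeled. Replica names and the harness are invented.

# Simulation of flooding consensus in the synchronous round model.
def synchronous_consensus(initial, f, crashed=frozenset()):
    """initial maps replica name -> 0 or 1; crashed replicas never send."""
    alive = [r for r in initial if r not in crashed]
    M = {r: {r: initial[r]} for r in alive}        # each replica's local map

    for _ in range(f + 1):                         # f + 1 rounds suffice
        outgoing = {r: dict(M[r]) for r in alive}  # snapshot sent this round
        for sender in alive:                       # broadcast M to everyone
            for receiver in alive:
                M[receiver].update(outgoing[sender])  # union with local M

    decisions = {min(M[r].values()) for r in alive}
    assert len(decisions) == 1   # agreement: all alive replicas decide alike
    return decisions.pop()

value = synchronous_consensus(
    {"replica1": 0, "replica2": 1, "replica3": 1, "replica4": 1, "replica5": 0},
    f=2, crashed={"replica2", "replica4"},
)
assert value == 0  # min over {replica1: 0, replica3: 1, replica5: 0}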
Bounding Convergence Time • How many rounds will it take if we are willing to tolerate f failures? • f + 1 rounds • Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision) • Proof sketch, assuming f = 2 • Worst case scenario is that replicas crash during rounds 1 and 2 • During round 1, replica x crashes • All other replicas don’t know if x is alive or dead • During round 2, replica y crashes • Clear that x is not alive, but unknown if y is alive or dead • During round 3, no more replicas may crash • All replicas are guaranteed to receive updated info from all other replicas • Final decision can be made
A More Realistic Model • The previous result is interesting, but unrealistic • We assumed that the network is synchronous and reliable • Of course, neither of these things is true in reality • What if the network is asynchronous but reliable? • Replicas may take an arbitrarily long time to respond to messages • Let’s also assume that all faults are crash faults • i.e. if a replica has a problem it crashes and never wakes up • No byzantine faults
The FLP Result There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of crash faults. The result is true even if no crash actually occurs! • This is known as the FLP result • Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985 • Extremely powerful result because: • If you can’t agree on 1-bit, generalizing to larger values isn’t going to help you • If you can’t converge with crash faults, no way you can converge with byzantine faults • If you can’t converge on a reliable network, no way you can on an unreliable network
FLP Proof Sketch • In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow • What can x do? • If it waits, it will block since it might never receive the message from y • If it decides, it may find out later that y made a different decision • The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message • Thus, the system oscillates between 0 and 1 and never converges
Impact of FLP • FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate • These runs are extremely unlikely (“probability zero”) • Yet they imply that we can’t find a totally correct solution • And so “consensus is impossible” (“not always possible”) • So what can we do? • Use randomization, probabilistic guarantees (gossip protocols) • Avoid consensus, use quorum systems (Paxos or Raft) • In other words, trade off consistency in favor of availability
Consistency vs. Availability • FLP states that perfect consistency is impossible • Practically, we can get close to perfect consistency, but at significant cost • e.g. using 3PC • Availability begins to suffer dramatically under failure conditions • Is there a way to formalize the tradeoff between consistency and availability?
Eric Brewer’s CAP Theorem • CAP theorem for distributed data replication • Consistency: updates to data are applied to all or none • Availability: must be able to access all data • Network Partition Tolerance: failures can partition the network into disjoint subnetworks • The Brewer Theorem • No system can simultaneously achieve C and A and P • Typical interpretation: “C, A, and P: choose 2” • In practice, all networks may partition, thus you must choose P • So a better interpretation might be “C or A: choose 1” • Originally stated as a conjecture without formal proof; a formalized version was later proven by Gilbert and Lynch (2002) • Widely accepted as a rule of thumb