Distributed Systems Overview Ali Ghodsi alig@cs.berkeley.edu
Replicated State Machine (RSM) • Distributed Systems 101 • Fault-tolerance (partial, byzantine, recovery,...) • Concurrency (ordering, asynchrony, timing,...) • Generic solution for distributed systems: Replicated State Machine approach • Represent your system with a deterministic state machine • Replicate the state machine • Feed input to all replicas in the same order
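To make the determinism requirement concrete, here is a minimal sketch (illustrative Python, not from the slides): a toy key-value state machine replicated three times; feeding all replicas the same ordered log leaves them in identical states.

    # Toy deterministic state machine; names here are illustrative.
    class KVStateMachine:
        def __init__(self):
            self.store = {}

        def apply(self, command):
            # Same commands in the same order => same state on every replica.
            op, key, value = command
            if op == "put":
                self.store[key] = value

    # Replicate the machine and feed every replica the same ordered input.
    log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
    replicas = [KVStateMachine() for _ in range(3)]
    for cmd in log:
        for r in replicas:
            r.apply(cmd)

    assert all(r.store == replicas[0].store for r in replicas)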
Total Order Reliable Broadcast aka Atomic Broadcast • Reliable broadcast • Either all correct nodes get the message or none do (even if the sender fails) • Atomic Broadcast • Reliable broadcast that guarantees: all messages delivered in the same order • Replicated state machine trivial with atomic broadcast
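As a sketch of why RSM becomes trivial on top of atomic broadcast: replicas just apply whatever the broadcast layer delivers. A single in-process sequencer stands in for a real atomic-broadcast protocol here (an assumption for illustration, not a fault-tolerant implementation).

    class ToyAtomicBroadcast:
        """In-process stand-in for atomic broadcast: one sequencer,
        so every subscriber sees messages in the same total order."""
        def __init__(self):
            self.subscribers = []
        def subscribe(self, deliver):
            self.subscribers.append(deliver)
        def broadcast(self, msg):
            for deliver in self.subscribers:
                deliver(msg)

    class Replica:
        def __init__(self, abcast):
            self.state = {}
            abcast.subscribe(self.deliver)
        def deliver(self, cmd):
            key, value = cmd
            self.state[key] = value        # deterministic apply

    abcast = ToyAtomicBroadcast()
    replicas = [Replica(abcast) for _ in range(3)]
    abcast.broadcast(("x", 1))
    abcast.broadcast(("x", 2))
    assert all(r.state == {"x": 2} for r in replicas)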
Consensus? • Consensus problem • All nodes propose a value • All correct nodes must agree on one of the values • Must eventually reach a decision (availability) • Atomic Broadcast → Consensus • Broadcast proposal, decide on first delivered value • Consensus → Atomic Broadcast • Unreliably broadcast message to all • 1 consensus per round: • propose set of messages seen but not delivered • Each round deliver the decided messages in a deterministic order • Consensus equivalent to Atomic Broadcast
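The consensus → atomic broadcast direction can be sketched as below; `consensus` is stubbed to decide its own proposal, whereas a real system would run one full consensus instance per round among all nodes.

    def consensus(round_no, proposal):
        return proposal        # stub: a real protocol decides one proposed set

    def atomic_broadcast_rounds(seen, deliver):
        delivered = set()
        round_no = 0
        while delivered != seen:
            round_no += 1
            pending = seen - delivered              # seen but not delivered
            decided = consensus(round_no, frozenset(pending))
            for msg in sorted(decided):             # deterministic order
                deliver(msg)
                delivered.add(msg)

    out = []
    atomic_broadcast_rounds({"m1", "m2", "m3"}, out.append)
    print(out)    # ['m1', 'm2', 'm3'] -- same order at every node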
Consensus impossible • No deterministic 1-crash-robust consensus algorithm exists for the asynchronous model (FLP’85) • 1-crash-robust • Up to one node may crash • Asynchronous model • No global clock • No bounded message delay • Life after the impossibility of consensus? What to do?
Solving Consensus with Failure Detectors • Black box that tells us if a node has failed • Perfect failure detector • Completeness • It will eventually tell us if a node has failed • Accuracy (no lying) • It will never tell us a node has failed if it hasn’t • Perfect FD → Consensus (rotating coordinator):

    xi := input
    for r := 1 to N do
        if r = p then
            forall j do send <val, xi, r> to j;
            decide xi
        else if collect <val, x', r> from r then
            xi := x'
    end
    decide xi
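A runnable toy rendering of the rotating-coordinator algorithm above, with the caveat that it abstracts asynchrony away: nodes move in lockstep rounds, and the perfect failure detector is modeled by a known `crashed` set (collecting from coordinator r fails exactly when r has crashed).

    def rotating_coordinator(inputs, crashed):
        n = len(inputs)
        x = dict(inputs)                   # node id -> current estimate
        for r in range(1, n + 1):
            if r in crashed:
                continue                   # perfect FD: skip crashed coordinator
            for p in x:
                if p not in crashed:
                    x[p] = x[r]            # adopt coordinator r's value
        return {p: v for p, v in x.items() if p not in crashed}

    # Node 1 crashes before its round; the first correct coordinator (node 2)
    # imposes its value, so all correct nodes decide 'b'.
    print(rotating_coordinator({1: "a", 2: "b", 3: "c"}, crashed={1}))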
Solving Consensus • Consensus → Perfect FD? • No. We can’t tell whether a node actually failed or not! • What’s the weakest FD that solves consensus? • Fewest assumptions on top of the asynchronous model!
Enter Omega • Leader Election • Eventually every correct node trusts some correct node • Eventually no two correct nodes trust different correct nodes • Failure detection and leader election are the same problem • Failure detection captures failure behavior • detect failed nodes • Leader election also captures failure behavior • detect correct nodes (a single one, the same for all) • Formally, leader election is an FD • Always suspects all nodes except one (the leader) • Ensures some properties regarding that node
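Omega is often approximated with heartbeats and timeouts: each node trusts the lowest-id node it has heard from recently. This is a sketch under assumed names and timeout tuning, not from the slides; with eventually well-behaved links it converges to one correct leader trusted by all.

    import time

    HEARTBEAT_TIMEOUT = 2.0    # assumed tuning parameter

    class Omega:
        def __init__(self, my_id, all_ids):
            self.my_id = my_id
            self.last_heard = {i: time.monotonic() for i in all_ids}

        def on_heartbeat(self, sender_id):
            self.last_heard[sender_id] = time.monotonic()

        def leader(self):
            now = time.monotonic()
            alive = [i for i, t in self.last_heard.items()
                     if now - t < HEARTBEAT_TIMEOUT or i == self.my_id]
            return min(alive)  # deterministic pick: lowest alive id

    omega = Omega(my_id=2, all_ids=[1, 2, 3])
    omega.on_heartbeat(1)
    print(omega.leader())      # 1, until node 1's heartbeats stop arriving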
Weakest Failure Detector for Consensus • Omega is the weakest failure detector for consensus • How to prove it? • Easy to implement in practice
High Level View of Paxos • Elect a single proposer using Ω • Proposer imposes its proposal on everyone • Everyone decides • Done! • Problem with Ω • Several nodes might initially be proposers (contention) • Solution is abortable consensus • Proposer attempts to enforce a decision • Might abort if there is contention (preserves safety) • Ω ensures eventually 1 proposer succeeds (liveness)
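Abortable consensus is visible in the acceptor’s nack path. Below is a minimal single-decree Paxos acceptor sketch (the message shapes are assumptions; the slides don’t spell out the protocol): a proposer whose ballot loses the race gets a nack, aborts, and retries with a higher ballot once Ω leaves it as the sole proposer.

    class Acceptor:
        def __init__(self):
            self.promised = -1             # highest ballot promised
            self.accepted = (-1, None)     # (ballot, value) accepted so far

        def on_prepare(self, ballot):
            if ballot > self.promised:
                self.promised = ballot
                return ("promise", ballot, self.accepted)
            return ("nack", self.promised)     # contention: proposer aborts

        def on_accept(self, ballot, value):
            if ballot >= self.promised:
                self.promised = ballot
                self.accepted = (ballot, value)
                return ("accepted", ballot)
            return ("nack", self.promised)

    a = Acceptor()
    print(a.on_prepare(1))        # ('promise', 1, (-1, None))
    print(a.on_accept(1, "v"))    # ('accepted', 1)
    print(a.on_prepare(0))        # ('nack', 1) -- stale ballot must abort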
Replicated State Machine • Paxos approach (Lamport) • Client sends input to the Paxos leader • Leader executes a Paxos instance to agree on the command • Well-understood, many papers, optimizations • Viewstamped-replication approach (Liskov) • Have one leader that writes commands to a quorum (no Paxos) • When failures happen, use Paxos to agree • Less understood (Mazieres tutorial)
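A sketch of the Paxos approach as described above: the leader runs one consensus instance per log slot and applies commands in slot order. `paxos_decide` is a stub standing in for a full single-decree Paxos run.

    def paxos_decide(instance, proposal):
        return proposal    # stub: one single-decree Paxos instance per slot

    class PaxosRSM:
        def __init__(self):
            self.log = []            # decided commands, in slot order
            self.state = {}

        def on_client_command(self, cmd):     # leader path
            decided = paxos_decide(instance=len(self.log), proposal=cmd)
            self.log.append(decided)
            key, value = decided
            self.state[key] = value           # deterministic apply
            return decided

    rsm = PaxosRSM()
    rsm.on_client_command(("x", 1))
    rsm.on_client_command(("x", 2))
    print(rsm.state)    # {'x': 2}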
Paxos Siblings • Cheap Paxos (LM’04) • Fewer messages • Directly contact a quorum (e.g. 3 nodes out of 5) • If we fail to get responses from those 3, expand to all 5 • Fast Paxos (L’06) • Reduce from 3 message delays to 2 • Clients optimistically write to a quorum • Requires recovery
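The Cheap Paxos fast path from the slide can be sketched like this (the transport function and its return shape are assumptions): contact a minimal quorum first, and fall back to the full node set only when acks fall short.

    def cheap_write(nodes, msg, send_and_collect):
        quorum = nodes[:len(nodes) // 2 + 1]          # e.g. 3 of 5
        acks = send_and_collect(quorum, msg)
        if len(acks) >= len(quorum):
            return acks                               # fast path succeeded
        return send_and_collect(nodes, msg)           # expand to all nodes

    # Toy transport where node 3 is down: {1,2} of {1,2,3} is short of a
    # quorum, so the write expands and reaches {1,2,4,5}.
    transport = lambda targets, msg: {n for n in targets if n != 3}
    print(cheap_write([1, 2, 3, 4, 5], "op", transport))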
Paxos Siblings • Gaios/SMARTER (Bolosky’11) • Make logging to disk efficient for crash-recovery • Uses pipelining and batching • Generalized Paxos (LM’05) • Commutative operations for the replicated state machine
Atomic Commit • Atomic Commit • Commit IFF no failures and everyone votes commit • Else abort • Consensus on Transaction Commit (LG’04) • One Paxos instance for every resource manager’s vote • Only commit if every instance decided Commit
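Sketching the Paxos Commit decision rule from the slide, with `paxos_decide` again stubbed: one consensus instance fixes each resource manager’s vote, and the transaction commits only if every instance decided Commit.

    def paxos_decide(instance, proposal):
        return proposal    # stub for one fault-tolerant consensus instance

    def transaction_outcome(votes):
        decided = [paxos_decide(instance=rm, proposal=v)
                   for rm, v in votes.items()]
        return "commit" if all(v == "commit" for v in decided) else "abort"

    print(transaction_outcome({"rm1": "commit", "rm2": "commit"}))  # commit
    print(transaction_outcome({"rm1": "commit", "rm2": "abort"}))   # abort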
Reconfigurable Paxos • Change the set of nodes • Replace failed nodes • Add/remove nodes (change the size of the quorum) • Lamport’s idea • Make the set of nodes part of the state machine’s state • SMART (Eurosys’06) • Many problems (e.g. {A,B,C}→{A,B,D} and A fails) • Basic idea: run multiple Paxos instances side by side
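Lamport’s idea can be sketched as below: the member set is part of the replicated state, so a reconfiguration is just another agreed command. The alpha-slot delay (a configuration chosen at slot s only governs slots s + alpha onward) is the standard device for keeping concurrent instances consistent; ALPHA here is illustrative.

    ALPHA = 3    # config chosen at slot s governs slots >= s + ALPHA

    class Membership:
        def __init__(self, members):
            self.configs = {0: set(members)}    # slot -> member set in force

        def apply(self, slot, cmd):
            if cmd[0] == "reconfig":            # agreed like any other command
                self.configs[slot + ALPHA] = set(cmd[1])

        def members_at(self, slot):
            s = max(k for k in self.configs if k <= slot)
            return self.configs[s]

    m = Membership({"A", "B", "C"})
    m.apply(5, ("reconfig", {"A", "B", "D"}))
    print(m.members_at(7))   # {'A', 'B', 'C'} -- old config still in force
    print(m.members_at(8))   # {'A', 'B', 'D'}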