This comprehensive overview delves into the concepts of Paxos and ZooKeeper in distributed systems. It covers the basic ideas and algorithms of Paxos, including phases and properties, and explores Paxos Atomic Broadcast Algorithm and its properties. The text further discusses Paxos Made Live architecture, highlighting key features such as Multi-Paxos, master leases, group membership, and snapshots. ZooKeeper, a wait-free coordination system, is explained with emphasis on its key ideas and mechanisms, client API, and examples of ZooKeeper primitives like configuration management. The comparison with other services like Chubby and highlights of scalability and performance metrics are also discussed.
Detour: Distributed Systems Techniques • Paxos overview (based on Lampson’s talk) • Google: Paxos Made Live (only briefly) • ZooKeeper: a wait-free coordination system by Yahoo! (CSci8211: Distributed Systems: Paxos & ZooKeeper)
Paxos: Agent States & Invariants [slide figure: agent state variables and the invariants that follow from them]
Paxos Algorithm in Plain English • Phase 1 (prepare): • A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors. • If an acceptor receives a prepare request with number n greater than that of any prepare request it has seen, it responds YES to that request with a promise not to accept any more proposals numbered less than n, and includes the highest-numbered proposal (if any) that it has accepted.
Paxos Algorithm in Plain English … • Phase 2 (accept): • If the proposer receives a YES response to its prepare requests from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses (or a value of the proposer’s own choosing if no response reported an accepted proposal). • If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
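For concreteness, here is a minimal sketch of the acceptor side of both phases for a single Paxos instance, with in-memory state only (class and field names are made up for illustration; a real acceptor must also persist its promised/accepted state to stable storage before replying):

```java
// Reply to prepare(n): ok flag plus the highest-numbered proposal accepted so far, if any.
class PrepareReply {
    final boolean ok;
    final long acceptedNum;      // -1 if nothing accepted yet
    final Object acceptedValue;  // null if nothing accepted yet
    PrepareReply(boolean ok, long acceptedNum, Object acceptedValue) {
        this.ok = ok; this.acceptedNum = acceptedNum; this.acceptedValue = acceptedValue;
    }
}

// Hypothetical acceptor state for one Paxos instance (kept in memory for brevity).
class Acceptor {
    private long promised = -1;          // highest prepare number promised
    private long acceptedNum = -1;       // number of the highest accepted proposal
    private Object acceptedValue = null; // value of that proposal

    // Phase 1: handle prepare(n).
    synchronized PrepareReply onPrepare(long n) {
        if (n > promised) {
            promised = n; // promise: no proposal numbered < n will be accepted from now on
            return new PrepareReply(true, acceptedNum, acceptedValue);
        }
        return new PrepareReply(false, -1, null);
    }

    // Phase 2: handle accept(n, v); succeeds unless a higher-numbered prepare was promised.
    synchronized boolean onAccept(long n, Object v) {
        if (n >= promised) {
            promised = n;
            acceptedNum = n;
            acceptedValue = v;
            return true;
        }
        return false;
    }
}
```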
Paxos’s Properties (Invariants) • P1: Any proposal number is unique. • P2: Any two sets of acceptors (quorums) have at least one acceptor in common. • P3: The value sent out in phase 2 is the value of the highest-numbered proposal among all the responses in phase 1.
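One common way to satisfy P1, shown as a sketch (not taken from the slides): build proposal numbers from a per-proposer round counter interleaved with a unique server id, so numbers are globally unique and strictly increasing at each proposer.

```java
// Hypothetical proposal-number generator satisfying P1.
// With N servers, proposer i uses numbers i, i+N, i+2N, ... -- distinct across proposers
// and monotonically increasing at each proposer.
class ProposalNumbers {
    private final int serverId;   // unique id in [0, N)
    private final int numServers; // N
    private long round = 0;

    ProposalNumbers(int serverId, int numServers) {
        this.serverId = serverId;
        this.numServers = numServers;
    }

    long next() {
        return (round++) * numServers + serverId;
    }
}
```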
The Paxos Atomic Broadcast Algorithm • Leader based: each process has an estimate of who the current leader is • To order an operation, a process sends it to its current leader • The leader sequences the operation and launches a consensus algorithm (the Synod protocol) to fix agreement on it
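A rough sketch of this leader-side sequencing, assuming a hypothetical SynodInstance interface that runs one consensus instance per log slot (retries, failures, and leader changes are omitted):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical leader that assigns each incoming operation a slot in the replicated
// sequence and runs one Paxos (Synod) instance per slot to fix the value for that slot.
class PaxosLeader {
    interface SynodInstance { Object decide(Object proposedValue); }
    interface SynodFactory { SynodInstance forSlot(long slot); }

    private final AtomicLong nextSlot = new AtomicLong(0);
    private final Map<Long, Object> chosen = new ConcurrentHashMap<>();
    private final SynodFactory synod;

    PaxosLeader(SynodFactory synod) { this.synod = synod; }

    // Order an operation: pick the next slot, then run consensus to fix a value there.
    long order(Object op) {
        long slot = nextSlot.getAndIncrement();
        Object decided = synod.forSlot(slot).decide(op); // consensus may pick another value
        chosen.put(slot, decided);
        return slot;
    }
}
```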
Failure-Free Message Flow [slide figure: a client C sends a request; in Phase 1 servers S1…Sn exchange “prepare” and “ack” messages; in Phase 2 they exchange “accept” messages; the client then receives the response]
Message Flow: Take 2 w/ Optimization [slide figure: the same client/server exchange with an optimized message pattern across Phase 1 (“prepare”/“ack”) and Phase 2 (“accept”)]
Highlights of Paxos Made Live • Implement Paxos in a large, practical distributed system • have to consider many practical failure scenarios as well as efficiency issues, and “prove” the implementation correct! • e.g., disk failures • Key Features/Mechanisms: • Multi-Paxos: run multiple instances of Paxos to achieve consensus on a series of values, e.g., in a replicated log • Master & master leases • (Global) epoch numbers (to handle master crashes) • Group membership: handle dynamic changes in the # of servers • Snapshots to enable faster recovery (& catch-up) • Handling disk corruption: a replica w/ a corrupted disk rebuilds its log by participating as a non-voting member until it catches up • & good software engineering: runtime checking & testing, etc.
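As an illustration of the snapshot mechanism only (all names below are hypothetical): a lagging replica installs a recent snapshot and then replays just the log suffix, instead of replaying the entire replicated log from the beginning.

```java
import java.util.List;

// Hypothetical replica-side catch-up using snapshots.
class Replica {
    Object state;          // application state machine
    long lastApplied = -1; // index of the last log entry applied to state

    // Install a snapshot taken at snapshotIndex, then replay only the remaining entries.
    void catchUp(Object snapshotState, long snapshotIndex, List<Object> logSuffix) {
        state = snapshotState;
        lastApplied = snapshotIndex;
        for (Object entry : logSuffix) {
            apply(entry);   // far cheaper than replaying the whole log from index 0
            lastApplied++;
        }
    }

    void apply(Object entry) { /* application-specific state transition */ }
}
```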
Highlights of ZooKeeper • ZooKeeper: a wait-free coordination service for processes of distributed applications • wait-free: asynchronous (no blocking) and no locking • with guaranteed FIFO client ordering and linearizable writes • provides a simple & high-performance kernel for building more complex primitives at the client • e.g., rendezvous, read/write locks, etc. • this is in contrast to Google’s Chubby (distributed lock) service, or Amazon’s Simple Queue Service, … • For target workloads (2:1 to 100:1 read/write ratios), it can handle 10^4 – 10^5 transactions per second • Key Ideas & Mechanisms: • a hierarchical namespace, much like a distributed file system, to store data objects (“shared states”): a tree of znodes • but with simpler APIs for clients to coordinate processes
ZooKeeper Service Overview • server: process providing the ZooKeeper service • client: user of the ZooKeeper service • clients establish a session when they connect to ZooKeeper and obtain a handle thru which to issue requests • znode: each associated w/ a version #, & can be of two types • regular: created/deleted explicitly • ephemeral: deleted explicitly, or automatically when the session that created it terminates • a znode may have a sequential flag: created w/ a monotonically increasing counter attached to the name • watch (on a znode): one-time trigger associated with a session to notify of a change in the znode (or its child subtree) [slide figure: ZooKeeper’s hierarchical namespace (data tree)]
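For concreteness, here is roughly how these znode types and watches look with the standard Java ZooKeeper client (the connection string and znode paths are made up, and error handling is omitted):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZnodeTypesExample {
    public static void main(String[] args) throws Exception {
        // Establish a session; the Watcher receives session and znode events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
                event -> System.out.println("event: " + event));

        // Regular znode: must be deleted explicitly.
        zk.create("/app", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral + sequential znode: removed automatically when this session ends,
        // and the server appends a monotonically increasing counter to the name.
        String member = zk.create("/app/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("created " + member);

        // Leave a one-time watch on a znode; the Watcher above fires on its next change.
        Stat stat = zk.exists("/app/config", true);
        System.out.println("config exists: " + (stat != null));

        zk.close();
    }
}
```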
ZooKeeper Client API • Each client runs a ZooKeeper library: • exposes the ZooKeeper service interface thru client APIs • manages the network connection (“session”) between client & server • ZooKeeper APIs: create, delete, exists, getData, setData, getChildren, sync • Each API has both a synchronous and an asynchronous version
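A small sketch contrasting the synchronous and asynchronous forms of one call (getData) in the Java client; the path is illustrative:

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class SyncAsyncExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Synchronous version: blocks until the data (and Stat) comes back.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", false, stat);
        System.out.println("sync read, version " + stat.getVersion());

        // Asynchronous version: returns immediately; the callback runs when the
        // server's reply arrives, and replies arrive in FIFO order per client.
        zk.getData("/app/config", false,
                (rc, path, ctx, bytes, st) ->
                        System.out.println("async read of " + path + ", rc=" + rc),
                null);

        Thread.sleep(1000); // crude wait for the async callback in this sketch
        zk.close();
    }
}
```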
ZooKeeper Primitive Examples • Configuration Management: • E.g., two clients A & B share a configuration, and can communicate directly w/ each other • A makes a change to the configuration & notifies B (but the two servers’ configuration replicas may be out of sync!) • Rendezvous • Group Membership • Simple Locks (w & w/o the herd effect) • Read/Write Locks • Double Barrier • Yahoo! and other services using ZooKeeper: • Fetch Service (“Yahoo crawler”) • Katta: a distributed indexer • Yahoo! Message Broker (YMB)
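A sketch of the configuration-management primitive described above, using the Java client: the writer updates a configuration znode, and each reader re-reads it whenever its one-time watch fires (the znode path and class name are made up):

```java
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher {
    private final ZooKeeper zk;

    ConfigWatcher(ZooKeeper zk) { this.zk = zk; }

    // Writer side (client A): publish a new configuration; version -1 means "any version".
    void publish(byte[] newConfig) throws KeeperException, InterruptedException {
        zk.setData("/app/config", newConfig, -1);
    }

    // Reader side (client B): read the config and leave a watch; when the watch fires,
    // read again (watches are one-time triggers, so each read re-registers one).
    byte[] readAndWatch() throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        return zk.getData("/app/config", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                try {
                    readAndWatch(); // pick up the new configuration
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, stat);
    }
}
```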
ZooKeeper Implementation • convert writes into idempotent transactions • ensure linearizable writes • ensure FIFO client ordering via a pipelined architecture that allows multiple pending requests • each write is handled by a leader, which broadcasts the state change to the other servers via Zab, an atomic broadcast protocol • the server handling a client request uses a simple majority quorum to decide on a proposal before delivering the state change to the client
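A toy illustration of the idempotent-transaction idea (the types below are hypothetical, not ZooKeeper internals): the leader resolves a conditional setData request into the exact resulting data and version, so replicas can apply, or re-apply, the state change safely.

```java
// Hypothetical conversion of a client setData request into an idempotent transaction.
class SetDataRequest { String path; byte[] data; int expectedVersion; }

// Applying this transaction twice leaves the znode in the same state.
class SetDataTxn { String path; byte[] data; int newVersion; }

class TxnLeader {
    // The leader checks the version condition once and emits a transaction that
    // pins down the resulting data and version.
    SetDataTxn toTxn(SetDataRequest req, int currentVersion) {
        if (req.expectedVersion != -1 && req.expectedVersion != currentVersion) {
            return null; // request fails; no transaction is broadcast
        }
        SetDataTxn txn = new SetDataTxn();
        txn.path = req.path;
        txn.data = req.data;
        txn.newVersion = currentVersion + 1;
        return txn;
    }
}
```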
ZooKeeper and Zab • Zab: the atomic broadcast protocol used by ZooKeeper to ensure transaction integrity, primary-order (PO) causality, total order, and agreement (among replicated processes) • Leader (primary instance) based: only the leader can abcast • Atomic two-phase broadcast: abcast + abdeliver => transaction committed; otherwise it is considered “aborted”
More on Zab • Zab atomic broadcast ensures primary-order causality: • “causality” is defined only w.r.t. the primary instance • Zab also ensures strict causality (i.e., a total ordering): • if a process delivers two transactions, one must precede the other in the PO causality order • Zab assumes a separate leader election/selection process (with a leader selection oracle) • processes: a leader (starting w/ a new epoch #) and followers • Zab uses a 3-phase protocol w/ quorums (similar in spirit to Raft): • Phase 1 (Discovery): agree on a new epoch # and discover the history • Phase 2 (Synchronization): synchronize the history of all processes using a 2PC-like protocol, committing based on a quorum • Phase 3 (Broadcast): commit new transactions via a 2PC-like protocol, committing based on a quorum
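A minimal sketch of the broadcast phase’s quorum rule (names are hypothetical; the real Zab state machine is considerably more involved):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical leader-side bookkeeping for Zab's broadcast phase: a proposal is
// committed once a quorum (majority, counting the leader itself) has acknowledged it.
class ZabLeader {
    private final int numProcesses;
    private final Map<Long, Integer> acks = new HashMap<>(); // zxid -> follower ack count

    ZabLeader(int numProcesses) { this.numProcesses = numProcesses; }

    // Called when a follower acknowledges proposal zxid; returns true once it commits.
    synchronized boolean onAck(long zxid) {
        int count = acks.merge(zxid, 1, Integer::sum) + 1; // +1 for the leader's own copy
        if (count > numProcesses / 2) {
            acks.remove(zxid);
            return true; // send COMMIT to followers; the transaction is abdelivered
        }
        return false;
    }
}
```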
PO Causality & Strict Causality [slide figure: (a) an execution in PO causality order, but not in “causal order”; (b) an execution in PO causality order, but not in “strict causality” order]