Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore Jun Rao, Eugene J. Shekita, Sandeep Tata IBM Almaden Research Center PVLDB, Jan. 2011, Vol. 4, No. 4 2011-03-25 Presented by Yongjin Kwon
Outline • Introduction • Spinnaker • Data Model and API • Architecture • Replication Protocol • Leader Election • Recovery • Follower Recovery • Leader Takeover • Experiments • Conclusion
Introduction • Cloud computing applications have aggressive requirements. • Scalability • High and continuous availability • Fault tolerance • The CAP Theorem [Brewer 2000] argues that among Consistency, Availability, and Partition tolerance, only two of the three can be guaranteed at once. • Recent distributed systems such as Dynamo or Cassandra provide high availability and partition tolerance by sacrificing consistency. • They guarantee only eventual consistency. • Replicas may temporarily diverge into different versions.
Introduction (Cont’d) • Most applications desire stronger consistency guarantees. • e.g., within a single datacenter, where network partitions are rare • How can consistency be preserved? • Two-Phase Commit • Blocks when the coordinator fails • Three-Phase Commit [Skeen 1981] • Seldom used because of poor performance • Paxos Algorithm • Generally perceived as too complex and slow
Introduction (Cont’d) • Timeline Consistency [Cooper 2008] • Stops short of full serializability. • All replicas of a record apply all updates in the same order, forming a single timeline (e.g. Insert → Update → Update → Update → Delete). • At any point in time, a replica holds some version from that timeline, possibly a stale one. • With some modifications to Paxos, it is possible to provide high availability while ensuring at least timeline consistency, with only a very small loss of performance.
Spinnaker • Experimental datastore • Designed to run on a large cluster of commodity servers in a single datacenter • Key-based range partitioning • 3-way replication • Strong or timeline consistency • Paxos-based protocol for replication • Example of a CA system
Data Model and API • Data Model • Similar to Bigtable and Cassandra • Data is organized into rows and tables. • Each row in a table is uniquely identified by its key. • A row may contain any number of columns with corresponding values and version numbers. • API • get(key, colname, consistent) • put(key, colname, colvalue) • delete(key, colname) • conditionPut(key, colname, value, version) • conditionDelete(key, colname, version) • The consistent flag selects a strong or timeline-consistent read; the conditional operations take effect only if the supplied version number still matches. (A toy sketch of the API follows.)
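A minimal, in-memory stand-in for the API on this slide, just to make the call shapes concrete. Real Spinnaker replicates every write through the Paxos-based protocol described later; this toy class only mimics the signatures and the version check, and all implementation details are assumptions.

```python
# Toy, single-node stand-in for the Spinnaker API (illustrative only).
class ToySpinnaker:
    def __init__(self):
        self._rows = {}  # (key, colname) -> (value, version)

    def get(self, key, colname, consistent=True):
        # consistent=True would be routed to the cohort leader in Spinnaker;
        # this toy version reads the same local dict either way.
        return self._rows.get((key, colname))

    def put(self, key, colname, colvalue):
        _, version = self._rows.get((key, colname), (None, 0))
        self._rows[(key, colname)] = (colvalue, version + 1)

    def delete(self, key, colname):
        self._rows.pop((key, colname), None)

    def conditionPut(self, key, colname, value, version):
        current = self._rows.get((key, colname))
        if current is None or current[1] != version:
            return False                      # version mismatch: reject
        self._rows[(key, colname)] = (value, version + 1)
        return True

    def conditionDelete(self, key, colname, version):
        current = self._rows.get((key, colname))
        if current is None or current[1] != version:
            return False
        del self._rows[(key, colname)]
        return True

store = ToySpinnaker()
store.put("user:42", "email", "a@example.com")
_, v = store.get("user:42", "email", consistent=True)
store.conditionPut("user:42", "email", "b@example.com", version=v)
```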
Architecture • System Architecture • Rows in a table are distributed across the cluster using key-range partitioning. • The group of nodes replicating a key range is called a cohort. • Cohort for [0, 199] : { A, B, C } • Cohort for [200, 399] : { B, C, D } • (A small routing sketch follows.)
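A sketch of key-range partitioning as described above. The two ranges and their cohorts mirror the slide's example; the third range and the lookup logic are assumptions about how a router might map a key to its cohort.

```python
from bisect import bisect_right

# Each entry: (inclusive start of the key range, cohort of 3 replicas).
RANGES = [
    (0,   ["A", "B", "C"]),   # keys [0, 199]
    (200, ["B", "C", "D"]),   # keys [200, 399]
    (400, ["C", "D", "E"]),   # further range, invented for illustration
]

def cohort_for(key: int) -> list[str]:
    """Return the cohort responsible for `key` under range partitioning."""
    starts = [start for start, _ in RANGES]
    idx = bisect_right(starts, key) - 1
    if idx < 0:
        raise KeyError(f"key {key} is below the first range")
    return RANGES[idx][1]

assert cohort_for(150) == ["A", "B", "C"]
assert cohort_for(250) == ["B", "C", "D"]
```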
Architecture (Cont’d) • Node Architecture (components in the figure: commit queue, memtables, SSTables; modules for logging and local recovery, replication and remote recovery, and failure detection / group membership / leader selection) • All the components are thread safe. • Logging • A shared write-ahead log is used for performance. • Each log record is uniquely identified by an LSN (log sequence number). • Each cohort on a node uses its own logical LSNs. • (A sketch of the shared log follows.)
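A sketch of the shared write-ahead log idea: one physical log per node, with every record tagged by its cohort and a per-cohort logical LSN. The record layout and the two-part epoch.sequence LSN (matching values such as 1.20 and 2.30 on the later recovery slides) are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class LogRecord:
    cohort: str        # e.g. the key range "[0,199]"
    epoch: int         # assumed to advance when a new leader is elected
    seq: int           # per-cohort logical sequence number
    payload: bytes

    @property
    def lsn(self) -> str:
        return f"{self.epoch}.{self.seq}"

@dataclass
class SharedWAL:
    records: list = field(default_factory=list)
    next_seq: dict = field(default_factory=dict)

    def append(self, cohort: str, epoch: int, payload: bytes) -> LogRecord:
        seq = self.next_seq.get(cohort, 0) + 1
        self.next_seq[cohort] = seq
        rec = LogRecord(cohort, epoch, seq, payload)
        self.records.append(rec)   # a real log would also force this to disk
        return rec

wal = SharedWAL()
print(wal.append("[0,199]", 1, b"put user:42").lsn)     # -> 1.1
print(wal.append("[200,399]", 1, b"put user:250").lsn)  # -> 1.1 (own LSN space)
```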
Replication Protocol • Each cohort consists of an elected leader and two followers. • Spinnaker’s Replication Protocol • A modification of the basic Multi-Paxos protocol • Shared write-ahead log with no gaps (no missing log entries) • Reliable, in-order messages based on TCP sockets • Distributed coordination service for leader election (Zookeeper) • Two Phases of the Replication Protocol • Leader Election Phase • A leader is chosen among the nodes in a cohort. • Quorum Phase • The leader proposes a write. • The followers accept it.
Replication Protocol (Cont’d) (Figure: the client sends write W to the cohort leader, which sends propose W to both followers.) • Quorum Phase • The client submits a write W. • The leader, in parallel, • appends a log record for W and forces it to disk, • appends W to its commit queue, and • sends a propose message for W to its followers.
Replication Protocol (Cont’d) (Figure: both followers send an ACK back to the leader.) • Quorum Phase • After receiving the propose message, each follower • appends a log record for W and forces it to disk, • appends W to its commit queue, and • sends an ACK to the leader.
Replication Protocol (Cont’d) (Figure: W is committed on the leader.) • Quorum Phase • After the leader gets an ACK from at least one follower, the leader • applies W to its memtable, effectively committing W, and • sends a response to the client. • There is no separate commit record that needs to be logged.
Replication Protocol (Cont’d) (Figure: the leader periodically sends commit-up-to-LSN messages to both followers.) • Quorum Phase • Periodically, the leader sends an asynchronous commit message with a certain LSN to the followers, asking them to apply all pending writes up to that LSN to their memtables. • For recovery, the leader and followers save this LSN, referred to as the last committed LSN. • (A compact code sketch of the whole quorum phase follows.)
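A compact, single-threaded sketch of the quorum phase described on the last few slides: log + commit queue + propose, commit after the first follower ACK, and periodic asynchronous "commit up to this LSN" messages. All class and method names are invented for illustration; the real system is multi-threaded and sends these messages over TCP.

```python
class Follower:
    def __init__(self):
        self.log, self.commit_queue, self.memtable = [], [], {}

    def propose(self, lsn, key, value):
        self.log.append((lsn, key, value))           # append + force its own log
        self.commit_queue.append((lsn, key, value))
        return True                                  # the ACK

    def commit_upto(self, lsn):
        for rec_lsn, key, value in [w for w in self.commit_queue if w[0] <= lsn]:
            self.memtable[key] = value
        self.commit_queue = [w for w in self.commit_queue if w[0] > lsn]

class CohortLeader:
    def __init__(self, followers):
        self.followers = followers
        self.log, self.commit_queue, self.memtable = [], [], {}
        self.last_committed_lsn = 0
        self.next_lsn = 0

    def write(self, key, value):
        self.next_lsn += 1
        lsn = self.next_lsn
        self.log.append((lsn, key, value))               # 1. append + force log
        self.commit_queue.append((lsn, key, value))      # 2. enqueue the write
        acks = sum(f.propose(lsn, key, value) for f in self.followers)  # 3. propose
        if acks >= 1:                                    # leader + 1 follower = majority of 3
            self.memtable[key] = value                   # apply W: effectively committed
            self.last_committed_lsn = lsn
            self.commit_queue = [w for w in self.commit_queue if w[0] > lsn]
            return "ok"                                  # respond to the client
        raise RuntimeError("no follower ACKed the propose message")

    def periodic_commit(self):
        # Asynchronously tell followers to apply everything up to this LSN.
        for f in self.followers:
            f.commit_upto(self.last_committed_lsn)

leader = CohortLeader([Follower(), Follower()])
leader.write("user:42", "a@example.com")
leader.periodic_commit()
```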
Replication Protocol (Cont’d) • For strong consistency, • Reads are always routed to the cohort’s leader. • Reads are guaranteed to see the latest value. • For timeline consistency, • Reads can be routed to any node in the cohort. • Reads may see a stale value.
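A tiny sketch of the read-routing rule on this slide: strong reads go to the cohort leader, timeline reads may go to any replica. How a node is picked for timeline reads is an assumption; the slide only says "any node in the cohort."

```python
import random

def route_read(cohort, leader, consistent):
    """Return the node that should serve the read."""
    if consistent:
        return leader              # strong: always the leader, never stale
    return random.choice(cohort)   # timeline: any replica, possibly stale

cohort = ["A", "B", "C"]
print(route_read(cohort, leader="A", consistent=True))   # -> A
print(route_read(cohort, leader="A", consistent=False))  # -> A, B, or C
```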
Leader Election • The leader election protocol has to guarantee that • a majority of the cohort (i.e. at least two nodes) participates, and • the new leader is chosen in a way that no committed writes are lost. • With the aid of Zookeeper, this task is greatly simplified. • Each node includes a Zookeeper client. • Zookeeper [Hunt 2010] • A fault-tolerant, distributed coordination service • It is used only to exchange coordination messages between nodes. • Ref : http://hadoop.apache.org/zookeeper/
Leader Election (Cont’d) (Figure: a znode tree with nodes a, b, c.) • Zookeeper’s Data Model • Resembles a directory tree in a file system. • Each node, called a znode, is identified by its path from the root. • e.g. /a/b/c • A znode can carry a sequential attribute. • Persistent znodes vs. ephemeral znodes
Leader Election (Cont’d) • Note that the information needed for leader election is stored in Zookeeper under “/r”. • Leader Election Phase • One of the cohort’s nodes cleans up any state under /r. • Each node of the cohort adds a sequential ephemeral znode to /r/candidates with its last LSN as the value. • After a majority appears under /r/candidates, the new leader is chosen as the candidate with the maximum last LSN. • The leader adds an ephemeral znode under /r/leader with its hostname as the value, and executes leader takeover. • The followers learn about the new leader by reading /r/leader. • (A sketch of these steps using ZooKeeper follows.)
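A sketch of these election steps, assuming the kazoo ZooKeeper client for Python. The /r/candidates and /r/leader paths come from the slide; the value encoding, the polling loops, and the omitted clean-up of /r are simplifications, and running this requires a live ZooKeeper ensemble (e.g. `zk = KazooClient(hosts="zk1:2181"); zk.start()`).

```python
import socket
import time
from kazoo.client import KazooClient

MAJORITY = 2   # two out of the three cohort members

def lsn_key(lsn: str):
    # Compare "epoch.sequence" strings such as "1.20" numerically, not textually.
    return tuple(int(part) for part in lsn.split("."))

def elect_leader(zk: KazooClient, my_last_lsn: str) -> str:
    me = socket.gethostname()
    zk.ensure_path("/r/candidates")

    # Each node announces itself with a sequential ephemeral znode whose
    # value carries its hostname and last LSN.
    zk.create("/r/candidates/c_", value=f"{me} {my_last_lsn}".encode(),
              ephemeral=True, sequence=True)

    # Wait until a majority of the cohort has shown up.
    while len(zk.get_children("/r/candidates")) < MAJORITY:
        time.sleep(0.1)

    # The new leader is the candidate with the maximum last LSN.
    candidates = []
    for child in zk.get_children("/r/candidates"):
        data, _ = zk.get(f"/r/candidates/{child}")
        host, lsn = data.decode().split(" ", 1)
        candidates.append((lsn_key(lsn), host))
    _, leader_host = max(candidates)

    if leader_host == me:
        # Publish leadership, then run leader takeover (not shown here).
        zk.create("/r/leader", value=me.encode(), ephemeral=True)
    else:
        # Followers learn the leader by reading /r/leader once it exists.
        while not zk.exists("/r/leader"):
            time.sleep(0.1)
        leader_host = zk.get("/r/leader")[0].decode()
    return leader_host
```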
Leader Election (Cont’d) • Verification that no committed writes are lost • A committed write has been forced to the logs of at least 2 nodes. • At least 2 nodes have to participate in leader election. • Hence, at least one of the nodes participating in leader election has the last committed write in its log. • Choosing the node with the maximum last LSN ensures that the new leader has this committed write in its log. • If a committed write is still unresolved (proposed but not yet committed) on the other nodes, leader takeover makes sure that it is re-proposed.
Recovery • When a cohort’s leader or one of its followers fails, recovery must be performed, using the log records, after the failed node comes back up. • Two Recovery Processes • Follower Recovery • When a follower (or even the leader) fails, how is that node recovered after it comes back up? • Leader Takeover • When the leader has failed, what must the new leader do after leader election?
Follower Recovery (Figure: the follower’s log with markers for the checkpoint, the last committed LSN, and the last LSN; local recovery covers checkpoint → last committed LSN, catch up covers everything after.) • Follower recovery is executed whenever a node comes back up after a failure. • Two Phases of Follower Recovery • Local Recovery Phase • Re-apply log records from the most recent checkpoint through the last committed LSN. • If the follower has lost all its data due to a disk failure, it moves to the catch-up phase immediately. • Catch Up Phase • Send the follower’s last committed LSN to the leader. • The leader responds by sending all committed writes after that LSN. • (A sketch of both phases follows.)
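A sketch of the two follower-recovery phases just described, using plain lists of (lsn, key, value) tuples in place of real log files and RPCs. All names are invented; only the two-phase structure comes from the slide.

```python
def follower_recover(local_log, checkpoint_lsn, last_committed_lsn,
                     leader_committed_log, lost_all_data=False):
    memtable = {}

    # Phase 1: local recovery. Replay local log records from the most recent
    # checkpoint through the last committed LSN (skipped after a disk loss).
    if not lost_all_data:
        for lsn, key, value in local_log:
            if checkpoint_lsn < lsn <= last_committed_lsn:
                memtable[key] = value

    # Phase 2: catch up. The leader ships every committed write after the
    # follower's last committed LSN; apply them in order.
    for lsn, key, value in leader_committed_log:
        if lsn > last_committed_lsn:
            memtable[key] = value
            last_committed_lsn = lsn
    return memtable, last_committed_lsn

# Tiny example: the follower committed up to LSN 2, the leader up to LSN 4.
# The follower's uncommitted local record (LSN 3) is never replayed here;
# handling such records is the logical truncation on the next slide.
local = [(1, "a", "v1"), (2, "b", "v1"), (3, "c", "v1")]
leader = [(1, "a", "v1"), (2, "b", "v1"), (3, "b", "v2"), (4, "d", "v1")]
print(follower_recover(local, checkpoint_lsn=0, last_committed_lsn=2,
                       leader_committed_log=leader))
```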
Follower Recovery (Cont’d) • If the leader went down and a new leader was elected, the failed node may hold log records after its last committed LSN that the new leader never committed. • These discarded log records must be removed so that they are never re-applied by a future recovery; because the log is shared, they are removed logically rather than physically. • Logical truncation of the follower’s log • The LSNs of the discarded log records are stored in a skipped-LSN list. • Before processing a log record, the skipped-LSN list is checked to decide whether the record should be ignored. • (A sketch of the check follows.)
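A sketch of logical truncation: records that a recovering node wrote but that were never committed are remembered in a skipped-LSN list and ignored whenever the log is replayed. The data structures are assumptions for illustration.

```python
def replay(shared_log, cohort, upto_lsn, skipped_lsns):
    """Replay one cohort's committed records, ignoring logically truncated ones."""
    memtable = {}
    for rec_cohort, lsn, key, value in shared_log:
        if rec_cohort != cohort or lsn > upto_lsn:
            continue
        if lsn in skipped_lsns:      # logically truncated: never re-apply
            continue
        memtable[key] = value
    return memtable

log = [("[0,199]", 20, "x", "old"),
       ("[0,199]", 21, "x", "uncommitted"),
       ("[0,199]", 22, "y", "new")]
# LSN 21 was written locally but never committed, so it is skipped on replay.
print(replay(log, "[0,199]", upto_lsn=22, skipped_lsns={21}))
# -> {'x': 'old', 'y': 'new'}
```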
Leader Takeover (Figure: catch up covers the range from a follower’s last committed LSN to the leader’s last committed LSN; re-proposal covers the range from the leader’s last committed LSN to the leader’s last LSN.) • When the leader fails, the corresponding cohort becomes unavailable for writes. • Leader election is executed to choose a new leader. • After the new leader is elected, leader takeover occurs. • Leader Takeover • Catch up each follower to the new leader’s last committed LSN. • A follower that is already caught up can skip this step. • Re-propose the writes between the leader’s last committed LSN and the leader’s last LSN, and commit them using the normal replication protocol. • (A sketch of both steps follows.)
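A sketch of leader takeover, again with plain tuples instead of real logs and RPCs. The two steps (catch up each follower, then re-propose the tail of the log) follow the slide; everything else is an illustrative assumption, and the re-proposal here simply re-applies the writes instead of running a real propose/ACK round.

```python
def leader_takeover(leader_log, leader_committed_lsn, leader_last_lsn, followers):
    # Step 1: bring every follower up to the new leader's last committed LSN.
    for follower in followers:
        for lsn, key, value in leader_log:
            if follower["committed_lsn"] < lsn <= leader_committed_lsn:
                follower["memtable"][key] = value
                follower["committed_lsn"] = lsn
        # A follower that is already caught up simply skips this loop.

    # Step 2: re-propose the writes between the leader's last committed LSN
    # and its last LSN; in the real protocol each becomes a normal quorum-phase
    # propose that commits once at least one follower ACKs.
    for lsn, key, value in leader_log:
        if leader_committed_lsn < lsn <= leader_last_lsn:
            for follower in followers:
                follower["memtable"][key] = value
                follower["committed_lsn"] = lsn
            leader_committed_lsn = lsn
    return leader_committed_lsn

followers = [{"committed_lsn": 10, "memtable": {}},
             {"committed_lsn": 20, "memtable": {}}]
log = [(21, "k", "v21"), (25, "k", "v25")]
print(leader_takeover(log, leader_committed_lsn=21, leader_last_lsn=25,
                      followers=followers))   # -> 25
```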
Recovery (Cont’d) (Figure: example states showing each node’s last committed LSN (cmt) and last LSN (lst) at each step, with values such as cmt 1.10 / lst 1.20 before recovery and cmt 1.25 / lst 1.25 afterwards.) • Follower Recovery • A follower goes down while the other nodes are still alive. • The cohort keeps accepting new writes. • When the follower comes back up, it is recovered.
Recovery (Cont’d) (Figure: example LSN states during leader takeover; the old leader’s uncommitted log record at LSN 1.21 is logically truncated when it rejoins as a follower, and the new leader’s writes carry a new epoch, e.g. cmt 2.30 / lst 2.30.) • Leader Takeover • The leader goes down while the other nodes are still alive. • A new leader is elected and leader takeover is executed. • The cohort keeps accepting new writes. • When the old leader comes back up, it is recovered.
Experiments • Experimental Setup • Two clusters of 10 nodes each (one for the datastore, the other for clients); each node has • Two quad-core 2.1 GHz AMD processors • 16 GB memory • 5 SATA disks, with 1 disk dedicated to logging (without write-back cache) • Rack-level 1 Gbit Ethernet switch • Cassandra trunk as of October 2009 • Zookeeper version 3.2.0
Experiments (Cont’d) • In these experiments, Spinnaker was compared with Cassandra. • In common • Implementation of SSTables, memtables, and the log manager • 3-way replication • Differences • Replication protocol, recovery algorithms, commit queue • Cassandra’s weak/quorum reads • A weak read accesses just 1 replica. • A quorum read accesses 2 replicas to check for conflicts. • Cassandra’s weak/quorum writes • Both are sent to all 3 replicas. • A weak write waits for an ACK from just 1 replica. • A quorum write waits for ACKs from any 2 replicas.
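Not stated on the slide, but the standard quorum-overlap condition is the usual reasoning behind why the quorum settings above return fresh data while the weak ones may not:

```latex
\[
  R + W > N \;\Longrightarrow\; \text{every read quorum intersects every write quorum.}
\]
\[
  N = 3,\ W = 2:\quad
  R = 2 \Rightarrow 2 + 2 > 3 \ \text{(reads see the latest acknowledged write)},\qquad
  R = 1 \Rightarrow 1 + 2 \not> 3 \ \text{(reads may be stale).}
\]
```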
Conclusion • Spinnaker • Paxos-based replication protocol • Scalable, consistent, and highly available datastore • Future Work • Support for multi-operation transactions • Load balancing • Detailed comparison to other datastores
References
[Brewer 2000] E. A. Brewer, “Towards Robust Distributed Systems,” In PODC, pp. 7-7, 2000.
[Cooper 2008] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, R. Yerneni, “PNUTS: Yahoo!’s Hosted Data Serving Platform,” In PVLDB, 1(2), pp. 1277-1288, 2008.
[Hunt 2010] P. Hunt, M. Konar, F. P. Junqueira, B. Reed, “ZooKeeper: Wait-Free Coordination for Internet-scale Systems,” In USENIX, 2010.
[Skeen 1981] D. Skeen, “Nonblocking Commit Protocols,” In SIGMOD, pp. 133-142, 1981.