Virtual Synchrony

Virtual Synchrony Justin W. Hart CS 614 11/17/2005

Papers • The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53. • Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.

Background • Lamport Logical Clocks • Consistent Cuts • Chandy-Lamport Distributed Snapshots • Publish/Subscribe • Fail-Stop

Fail-Stop • Group Membership Service • Processes appear to fail by halting • How does this affect the FLP result?

Motivation • Information Backplane • Customization • Hierarchical Structure • Fault-Tolerance • Reliability

Process Groups • Types of groups • Anonymous groups • Explicit groups • Implementation requirements • Group communication • Group membership as input • Synchronization

Anonymous Groups • Group addressing • Messages sent exactly once to all or no recipients • Ordering • Logging

Explicit Groups • Group members cooperate directly • May execute algorithms based on membership knowledge • Communication is sensitive to membership changes

Building groups over conventional technology • Conventional message passing technologies • Group addressing • Logical time & causal dependency • Message delivery ordering • State transfer • Fault tolerance

Close Synchrony • 100% lock-step execution model

A synchronous execution • [Figure: timelines of processes p, q, r, s, t, u] • With true synchrony, executions run in genuine lock-step.

So… what’s wrong with that? • Under close synchrony, execution is limited by the slowest process in the group!

Virtual Synchrony • Relax synchronization requirements where possible • Benefit by allowing for asynchronous interactions • Do this where the result is identical to close synchrony

A few protocols… • fbcast • cbcast • abcast • gbcast

Four protocols!?!? • …but Justin. The paper only discussed 2 protocols… you’re getting off-topic!

A few protocols… • fbcast • Simple protocol upon which we’ll build the others. • Delivery is FIFO ordered, with respect to the original sender • Accomplished easily with a logical timestamp • cbcast • abcast • gbcast

Single updater • If p is the only update source, the need is a bit like the TCP “fifo” ordering • fbcast is a good choice for this case • [Figure: p multicasts updates 1, 2, 3, 4 to r, s, and t]
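The FIFO rule on this slide can be sketched in a few lines. This is a minimal illustration, not the ISIS implementation: the class name `FifoReceiver`, the per-sender sequence numbers starting at 0, and the payload strings are all assumptions made for the example.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order they were sent (hypothetical sketch)."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next sequence number expected per sender
        self.pending = defaultdict(dict)   # sender -> {seq: payload} held back
        self.delivered = []                # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender]:
            return                         # duplicate of something already delivered
        self.pending[sender][seq] = payload
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1


r = FifoReceiver()
for seq in (1, 0, 3, 2):                   # p's updates arrive out of order
    r.receive("p", seq, f"update-{seq + 1}")
print([payload for _, payload in r.delivered])
```

Messages from different senders are not ordered relative to one another here, which is why this is enough only in the single-updater case.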

A few protocols… • fbcast • cbcast • Delivery is causally ordered • Protocol in paper uses token passing • Another simple protocol uses vector timestamps • abcast • gbcast

Causally ordered updates • Simple protocol based on token passing

Causally ordered updates • Example: messages from p and s arrive out of order at t • Vector timestamps are indexed [p, r, s, t]: VT(a) = [0,0,0,1], VT(b) = [1,0,0,1], VT(c) = [1,0,1,1] • c is early: VT(c) = [1,0,1,1] but VT(t) = [0,0,0,1], so t is clearly missing one message from p (namely b) • When b arrives, we can deliver both it and message c, in order
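The vector-timestamp delivery test in this example can be written out as follows. This is a sketch of the standard rule (deliver a message from sender j once it is the next one from j and every other entry of its timestamp is already covered locally), assuming the vector is indexed [p, r, s, t]; the class name `CausalReceiver` is made up for the illustration.

```python
class CausalReceiver:
    """Holds back multicasts until their causal predecessors are delivered."""

    def __init__(self, initial_vt):
        self.vt = list(initial_vt)   # messages delivered so far, per sender
        self.held = []               # (name, sender, vt) waiting for predecessors
        self.delivered = []          # message names in delivery order

    def _deliverable(self, sender, vt_msg):
        # Must be the next message from its sender ...
        if vt_msg[sender] != self.vt[sender] + 1:
            return False
        # ... and depend on nothing we have not delivered yet.
        return all(vt_msg[k] <= self.vt[k]
                   for k in range(len(self.vt)) if k != sender)

    def receive(self, name, sender, vt_msg):
        self.held.append((name, sender, vt_msg))
        progress = True
        while progress:              # one delivery may unblock held messages
            progress = False
            for msg in list(self.held):
                m_name, m_sender, m_vt = msg
                if self._deliverable(m_sender, m_vt):
                    self.held.remove(msg)
                    self.vt[m_sender] = m_vt[m_sender]
                    self.delivered.append(m_name)
                    progress = True


P, R, S, T = range(4)                 # assumed index order of the vector
t = CausalReceiver([0, 0, 0, 1])      # t has already sent a
t.receive("c", S, [1, 0, 1, 1])       # early: depends on b from p
held_after_c = [m[0] for m in t.held]
t.receive("b", P, [1, 0, 0, 1])       # unblocks c as well
print(held_after_c, t.delivered)
```

Process t holds c until b arrives, then delivers b followed by c, matching the slide.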

Causally ordered updates • Each thread corresponds to a different lock • In effect: red “events” never conflict with green ones! • [Figure: timelines of p, r, s, t with two independently numbered threads of updates]

Hey… that sped things up! • Now I get it! Processes only have to wait for processes that they depend on. Not the slowest in the group!

A few protocols… • fbcast • cbcast • abcast • Atomic delivery ordering • With respect to other abcasts • More costly than cbcast, but with a stronger ordering property • ISIS builds abcast over cbcast • gbcast
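One common way to obtain the abcast ordering property is a sequencer that assigns a global order to messages after they are multicast. The sketch below assumes that approach purely for illustration; it is not claimed to be how ISIS layers abcast over cbcast, and `Sequencer`, `TotalOrderReceiver`, and the message ids are hypothetical names.

```python
import itertools


class Sequencer:
    """Assigns a single global sequence number to each message id."""

    def __init__(self):
        self._counter = itertools.count()

    def order(self, msg_id):
        return next(self._counter), msg_id


class TotalOrderReceiver:
    """Delivers payloads strictly in the sequencer's global order."""

    def __init__(self):
        self.payloads = {}    # msg_id -> payload, received but not yet delivered
        self.order = {}       # global sequence number -> msg_id
        self.next_seq = 0
        self.delivered = []

    def on_data(self, msg_id, payload):
        self.payloads[msg_id] = payload
        self._try_deliver()

    def on_order(self, seq, msg_id):
        self.order[seq] = msg_id
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only when both the ordering decision and the data are present.
        while (self.next_seq in self.order
               and self.order[self.next_seq] in self.payloads):
            msg_id = self.order.pop(self.next_seq)
            self.delivered.append(self.payloads.pop(msg_id))
            self.next_seq += 1


seqr = Sequencer()
o1, o2 = seqr.order("m1"), seqr.order("m2")

x, y = TotalOrderReceiver(), TotalOrderReceiver()
x.on_data("m2", "B"); x.on_data("m1", "A"); x.on_order(*o2); x.on_order(*o1)
y.on_order(*o1); y.on_data("m1", "A"); y.on_order(*o2); y.on_data("m2", "B")
print(x.delivered, y.delivered)
```

Both receivers see the data and ordering messages in different arrival orders yet deliver the same sequence, which is the extra guarantee that makes abcast costlier than cbcast.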

A few protocols… • fbcast • cbcast • abcast • gbcast • Atomic delivery ordering • With respect to everything

Three Round Multicast

As a time-line picture • [Figure: two-phase commit among p, q, r, s, t. Phase 1: the 2PC initiator asks “Vote?” and all vote “commit”. Phase 2: the initiator sends “Commit!”]

Just one more…

Flush protocol • We say that a message is unstable if some receiver has it but (perhaps) others don’t • For example, q’s message is unstable at process r • If q fails we want to “flush” unstable messages out of the system
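The effect of a flush can be sketched as a set computation. This is a deliberately simplified model, assuming each process's received messages are known as a set and at least one process survives; the function name `flush` and the message ids are invented for the example, and a real protocol exchanges these messages over the network before installing the new view.

```python
def flush(received_by, failed):
    """Return (new_view, unstable, after) once `failed` is removed from the group.

    received_by maps each process name to the set of message ids it has received.
    """
    survivors = sorted(p for p in received_by if p != failed)
    have = [received_by[p] for p in survivors]
    everything = set().union(*have)
    stable = set.intersection(*have)       # received by every survivor
    unstable = everything - stable         # some survivor has it, others may not
    # Each survivor re-sends the unstable messages it holds, so afterwards
    # every survivor has all of them and the new view can be installed.
    after = {p: set(received_by[p]) | unstable for p in survivors}
    return survivors, unstable, after


# q's message "mq" reached r (and q itself) before q failed.
received = {"p": {"m1"}, "q": {"m1", "mq"}, "r": {"m1", "mq"}, "s": {"m1"}}
new_view, unstable, after = flush(received, failed="q")
print(new_view, unstable, after)
```

After the flush, p and s also hold q's message, so every member of the new view has delivered the same set of messages from the old one.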

Styles of groups • Peer Groups • Processes cooperate closely • Client-Server Groups • Group acts as a server • Client multicasts repeatedly to the group • Diffusion Groups • Group serves information • Clients connect to receive data from group • Hierarchical Groups • Offer scalability through a hierarchy of connected groups

Historical Aside • Two major classes of real systems • Virtual synchrony • Weaker properties – not quite “FLP consensus” • Much higher performance (orders of magnitude) • Requires that majority of system remain connected. Partitioning failures force protocols to wait for repair • Quorum-based state machine protocols are • Closer to FLP definition of consensus • Slower (by orders of magnitude) • Sometimes can make progress in partitioning situations where virtual synchrony can’t

Names of some famous systems • Isis was the first practical virtual synchrony system • Later followed by Transis, Totem, Horus • Today: Best options are JGroups, Spread, Ensemble • Technology is now used in IBM WebSphere and Microsoft Windows Clusters products! • Paxos was the first major state machine system • BASE and other Byzantine Quorum systems now getting attention from the security community • (End of Historical aside)

Sounds good… what’s wrong with it? • Tries to solve state problems at communication level • This violates the end-to-end argument! • Consistency requirements are typically stated with respect to application state

Stable vs Durable • Stable – messages are buffered until received by all group members • Durable – message will be delivered, even if the sender dies

Ordering semantics • Incidental Ordering • Semantic Ordering • Prescriptive Ordering

The problem with CATOCS • CATOCS = causally and totally ordered communication support • It can’t say “for sure” • It can’t say the “whole story” • It can’t say “together” • It can’t say it efficiently

It can’t say “for sure” • Processes communicating over a “hidden” channel • Common database • Shared memory • Two threads reacting to external event

It can’t say “together” • Standard solution – locking • Transaction models allow for abort and rollback • Higher level conditions… what happens if a message arrives, but is not successfully processed

Stock trading example

Can’t say the “whole story” • Not everything can be expressed through the “happens-before” relationship • Semantic ordering constraints • Causal memory, the weakest of these, cannot be expressed in causal multicast • Total ordering helps some of these, but is far too expensive • Inexpensive, state-level protocols with logical clocks can solve these
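A state-level alternative can be sketched for a trading-style monitor: each derived value names the version of the data it was computed from, so the receiver can detect a mismatch regardless of message arrival order. The classes `OptionPrice`, `TheoreticalPrice`, and `Monitor`, and the prices used, are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OptionPrice:
    version: int              # logical clock maintained by the price source
    price: float


@dataclass
class TheoreticalPrice:
    based_on_version: int     # version of the option price used as input
    value: float


class Monitor:
    """Keeps the newest of each value and checks that they belong together."""

    def __init__(self):
        self.option: Optional[OptionPrice] = None
        self.theoretical: Optional[TheoreticalPrice] = None

    def on_option(self, msg):
        if self.option is None or msg.version > self.option.version:
            self.option = msg

    def on_theoretical(self, msg):
        if (self.theoretical is None
                or msg.based_on_version > self.theoretical.based_on_version):
            self.theoretical = msg

    def consistent(self):
        return (self.option is not None and self.theoretical is not None
                and self.theoretical.based_on_version == self.option.version)


m = Monitor()
m.on_option(OptionPrice(1, 25.5))
m.on_option(OptionPrice(2, 26.0))
m.on_theoretical(TheoreticalPrice(1, 26.75))   # derived from the old price
stale = m.consistent()
m.on_theoretical(TheoreticalPrice(2, 27.25))   # derived from the current price
print(stale, m.consistent())
```

The check lives in application state, so it works the same whether the underlying channel delivers messages in causal order or not.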

It can’t say it efficiently • False causality • Potential causality != Actual causality • Memory requirements for buffering “unstable” messages • Ordering information during transmission and reception

And… what of the end to end argument? • All of this considers our communication channels… isn’t the application-level check far more important?

Classes of distributed applications • Data dissemination • Netnews • Trading application example • Global predicate evaluation • Transactional applications • Replicated data • Replication in the large • Distributed real-time applications

Implementing only part of the messaging? • Can you cut down on overhead by implementing only part of the messaging using CATOCS?

Semantics • Are the semantics of state-based approaches superior to those of virtual synchrony?

Scalability • N processes • Time T to propagate a message across the system • T grows roughly in proportion to the square root of N • Arcs in the active causal graph grow quadratically • Quadratic causal graph

Buffering grows • Quadratic arcs • Linear communication of causal dependencies • Linear growth in required buffering • Changing topologies doesn’t help • CATOCS would require separate process groups for read and write to accomplish optimization of updates vs queries

Group membership protocols • Must enforce atomic delivery semantics • Run our most expensive protocol… gbcast • Failures increase with the size of the system, increasing load on the GMS

Who uses ISIS? • Brokerage • Database replication and triggers

ISIS-based utilities • NEWS • A pub/sub application that will replay histories • NMGR • Manages batch-style jobs and performs load sharing • Parallel make

ISIS-based utilities • DECEIT • NFS compatible file system • META/LOMITA • Sensors & actuators • Abstract sensors • Specify control actions in high-level terms • SPOOLER/LONG-HAUL FACILITY
