Virtual Synchrony Krzysztof Ostrowski krzys@cs.cornell.edu
A motivating example (diagram): SENSORS detect DANGER and send notifications; once a detection is made and a decision reached, the GENERALS issue orders to the WEAPON systems: a well-coordinated response
Requirements for a distributed system (or a replicated service) Consistent views across components • E.g. vice-generals see the same events as the chief general • Agreement on what messages have been received or delivered • E.g. each general has the same view of the world (consistent state) • Replicas of the distributed service do not diverge • E.g. everyone should have the same view of membership • If a component is unavailable, all others decide up or down together
Requirements for a distributed system (or a replicated service) Consistent actions • E.g. generals don’t contradict each other (don’t issue conflicting orders) • A single service may need to respond to a single request • Responses to independent requests may need to be consistent • But: Consistent ≠ Same (same actions ⇒ determinism ⇒ no fault tolerance)
System as a set of groups (diagram): client-server groups, a peer group multicasting a decision, an event diffusion group
A process group • A notion of process group • Members know each other, cooperate • On suspected failures, the group restructures itself, consistently • A failed node is excluded, and eventually learns of its failure • A recovered node rejoins • A group maintains a consistent common state • Consistency refers only to members • Membership is a problem of its own... but it CAN be solved
A model of a dynamic process group (diagram): membership views evolve as members CRASH, RECOVER, and JOIN; state transfers bring joining members up to the consistent state
The lifecycle of a member (a replica), as a state diagram: a node comes up alive but not in the group; it joins (state is transferred to it here) and is then in the group, processing requests; it leaves by unjoining, or by failing or becoming unreachable (dead, or suspected to be dead); once out, it is assumed tabula rasa: all information is lost
The Idea (Roughly) • Take some membership protocol, or an external service • Guarantee consistency in an inductive manner • Start in an identical replicated state • Apply any changes • Atomically, that is, either everywhere or nowhere • In the exact same order at all replicas • Consistency of all actions / responses comes as a result • Same events seen: • Rely on ordering + atomicity of failures and message delivery
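The inductive argument above can be made concrete with a tiny sketch (illustrative names, not an ISIS API): replicas that start identical and apply the same non-commuting updates in the same order stay identical, while a replica that reorders them diverges.

```python
# Sketch: consistency by induction over an ordered update stream.
# Replicas start identical; identical updates in identical order
# keep them identical, reordering makes them diverge.

def apply_all(state, updates):
    """Apply a sequence of update functions to a copy of the state."""
    s = dict(state)
    for u in updates:
        u(s)
    return s

set_x = lambda s: s.__setitem__("x", 1)              # x := 1
dbl_x = lambda s: s.__setitem__("x", s.get("x", 0) * 2)  # x := 2x

initial = {"x": 0}
# Same order everywhere -> identical replicas.
r1 = apply_all(initial, [set_x, dbl_x])
r2 = apply_all(initial, [set_x, dbl_x])
# Different order at one replica -> divergence.
r3 = apply_all(initial, [dbl_x, set_x])

assert r1 == r2 == {"x": 2}
assert r3 == {"x": 1}
```

This is exactly why atomicity alone is not enough: the updates must also be applied in the same order at every replica.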
The Idea (Roughly) • We achieve it by the following primitives: • Lower-level • Create / join / leave a group • Multicasting: FBCAST, CBCAST / ABCAST (the "CATOCS") • Higher-level • Download current state from the existing active replicas • Request / release locks (read-only / read-write) • Update • Read (locally)
Why another approach, though? • We have the whole range of other tools • Transactions: ACID; one-copy serializability with durability • Paxos, Chandra-Toueg (FLP-style consensus schemes) • All kinds of locking schemes, e.g. two-phase locking (2PL) • Virtual Synchrony is a point in the space of solutions • Why are other tools not perfect? • Some are very slow: lots of messages, roundtrip latencies • Some limit asynchrony (e.g. transactions at commit time) • Have us pay a very high cost for features we may not need
A special class of applications • Command / Control • Joint Battlespace Infosphere, telecommunications • Distribution / processing / filtering of data streams • Trading systems, air traffic control, stock exchanges, real-time data for banking, risk management • Real-Time Systems • Shop floor process control, medical decision support, power grid • What do they have in common? • Distributed, but coordinated, processing and control • Highly-efficient, pipelined distributed data processing
Distributed trading system (diagram) • 1. Historical data: pricing DBs • 2. Analytics • 3. Current pricing: market data feeds, trader clients; a spooler over a long-haul WAN links Tokyo, London, Zurich, ... • Availability for historical data • Load balancing and consistent message delivery for price distribution • Parallel execution for analytics
What’s special about these systems? • Need high performance: we must weaken consistency • Data is of different nature: more dynamic • More relevant online, in context • Storing it persistently often doesn’t make that much sense • Communication-oriented • Online progress: nobody cares about faulty nodes • Faulty nodes can be rebooted • Rebooted nodes are just spare replicas in the common pool
Back to virtual synchrony • Our plan: • Group membership • Ordered multicast within group membership views • Delivery of new views synchronized with multicast • Higher-level primitives
A process group: joining / leaving (diagram) • V1 = {A,B,C}; D’s request to join leads to V2 = {A,B,C,D}; A’s request to leave leads to V3 = {B,C,D}; the group membership protocol sends a new view on each change • ...OR: use a Group Membership Service
A process group: joining / leaving • How it looks in practice • Application makes a call to the virtual synchrony library • Node communicates with other nodes • Locates the appropriate group, follows the join protocol etc. • Obtains state or whatever it needs (see later) • Application starts communicating, e.g. joins the replica pool (Diagram: all the virtual synchrony just fits into the protocol stack: Application / V.S. Module / Network on each node)
A process group: handling failures • We rely on a failure detector (it doesn’t concern us here) • A faulty or unreachable node is simply excluded from the group • Primary partition model: the group cannot split into two active parts (Diagram: V1 = {A,B,C,D}; A CRASHes; "B" realizes that something is wrong with "A" and initiates the membership change protocol, yielding V2 = {B,C,D}; after recovery A rejoins, yielding V3 = {A,B,C,D})
Causal delivery and vector clocks (diagram, processes A..E): C multicasts with vector timestamp (0,0,1,0,0); after delivering it, B multicasts with (0,1,1,0,0); a process that receives B’s message before C’s cannot deliver it yet and must delay it. CBCAST = FBCAST + vector clocks
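The delivery rule behind the diagram can be written down directly. A minimal sketch of the standard CBCAST test (function names and the message representation are assumptions): a message from sender j is deliverable iff it is the next message from j in FIFO order and all of its causal predecessors have already been delivered locally.

```python
# CBCAST delivery test with vector clocks.
# Deliver message m from sender j at a receiver with clock VC iff:
#   VT(m)[j] == VC[j] + 1           (next message from j, FIFO)
#   VT(m)[k] <= VC[k] for k != j    (all causal predecessors delivered)

def can_deliver(local_vc, sender, msg_vt):
    if msg_vt[sender] != local_vc[sender] + 1:
        return False
    return all(msg_vt[k] <= local_vc[k]
               for k in range(len(local_vc)) if k != sender)

def deliver(local_vc, sender, msg_vt):
    assert can_deliver(local_vc, sender, msg_vt)
    local_vc[sender] += 1

# Replaying the diagram: a process with clock (0,0,0,0,0) receives
# B's message (0,1,1,0,0), which causally follows C's (0,0,1,0,0).
vc = [0, 0, 0, 0, 0]
assert not can_deliver(vc, 1, [0, 1, 1, 0, 0])  # delayed: C's msg missing
deliver(vc, 2, [0, 0, 1, 0, 0])                 # C's multicast arrives
assert can_deliver(vc, 1, [0, 1, 1, 0, 0])      # now B's is deliverable
```

The delayed message in the diagram is exactly the case where the second condition fails: the receiver has not yet delivered C’s multicast.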
What’s great about fbcast / cbcast ? • Can deliver to itself immediately • Asynchronous sender: no need to wait for anything • No need to wait for delivery order on receiver • Can issue many sends in bursts • Since processes are less synchronous... • ...the system is more resilient to failures • Very efficient, overheads are comparable with TCP
Asynchronous pipelining (diagram, processes A, B, C): the sender never needs to wait and can send requests at a high rate; buffering may reduce overhead
What’s to be careful with? • Asynchrony: • Data accumulates in buffers at the sender • Must put limits on it! • Explicit flushing: send data to the others, force it out of buffers; if it completes, the data is safe; needed as a safety measure • A failure of the sender causes lots of updates to be lost • Sender gets ahead of everybody else... good if the others are doing something that doesn’t conflict • Cannot do multiple conflicting tasks without a form of locking, while ABCAST can
Why use causality? • Sender need not include context for every message • One of the very reasons why we use FIFO delivery, TCP • Could "encode" context in the message, but: costly or complex • Causal delivery simply extends FIFO to multiple processes • Sender knows that the receiver will have received the same msgs • Think of a thread "migrating" between servers, causing msgs
A migrating thread and the FIFO analogy (diagram, processes A..E): a way to think about the above... which might explain why causal delivery is analogous to FIFO
Why use causality? • Receiver does not need to worry about "gaps" • Could be done by looking at "state", but may be quite hard • Ordering may simply be driven by correctness • Synchronizing after every message could be unacceptable • Reduces inconsistency • Doesn’t prevent it altogether, but it isn’t always necessary • State-level synchronization can thus be done more rarely! • All this said... causality makes most sense in context
Causal vs. total ordering • Note: unlike causal ordering, total ordering may require that local delivery be postponed! (Diagram: causal but not total ordering: some processes deliver A,E while others deliver E,A; causal and total ordering: all processes deliver in the same order)
Total ordering: atomic, synchronous A B C D E A B C D E
Why total ordering? • State machine approach • Natural, easy to understand, still cheaper than transactions • Guarantees atomicity, which is sometimes desirable • Consider a primary-backup scheme (diagram: requests go to the primary server, traces to the backup server, results back to the client)
Implementing totally ordered multicast (diagram, processes A..E): senders multicast the original messages; a coordinator multicasts ordering info; actual message delivery is postponed until the ordering is determined
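One common way to realize the coordinator scheme above is sequencer-based ordering. A minimal sketch (class and method names are illustrative, not an ISIS API): data messages may arrive in different orders at different receivers, but each receiver delivers only in the order announced by the coordinator, so all deliver identically.

```python
# Sequencer-based total order: data messages + ordering messages.
from collections import deque

class Receiver:
    def __init__(self):
        self.pending = {}       # msg_id -> payload, awaiting order info
        self.order = deque()    # msg_ids in coordinator-assigned order
        self.delivered = []

    def on_data(self, msg_id, payload):
        self.pending[msg_id] = payload
        self._try_deliver()

    def on_order(self, msg_id):     # ordering info from the coordinator
        self.order.append(msg_id)
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only when the next id in the total order has arrived.
        while self.order and self.order[0] in self.pending:
            self.delivered.append(self.pending.pop(self.order.popleft()))

r1, r2 = Receiver(), Receiver()
# Data messages arrive in different orders at different receivers...
r1.on_data("a", "A"); r1.on_data("b", "B")
r2.on_data("b", "B"); r2.on_data("a", "A")
# ...but the coordinator's ordering info makes delivery identical.
for r in (r1, r2):
    r.on_order("b"); r.on_order("a")
assert r1.delivered == r2.delivered == ["B", "A"]
```

Note how this exhibits the cost mentioned in the slides: local delivery is postponed until the ordering message arrives, even for a receiver that already holds the data.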
Atomicity of failures • Delivery guarantees: among all the surviving processes, delivery is all or none (in both cases) • Uniform: here additionally if (but not only if) the message was delivered to a node that then crashed • No guarantees for the newly joined (Diagram: uniform, nonuniform, and wrong executions, each around a CRASH)
Why atomicity of failures? • Reduce complexity • After we hear about a failure, we need to quickly "reconfigure" • We like to think in stable epochs... • During epochs, failures don’t occur • Any failure or membership change begins a new epoch • Communication does not cross epoch boundaries • The system does not begin a new epoch before all messages are either consistently delivered or consistently forgotten • We want to know we got everything the faulty process sent, to completely finish the old epoch and open a "fresh" one
Atomicity: message flushing (diagram, processes A..E): on a (logical) partitioning, the surviving members retransmit the failed nodes’ messages and their own, and then change membership
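The effect of flushing can be sketched in a few lines (a deliberately simplified model, not the real protocol): before the new view is installed, survivors exchange the messages they hold, so every survivor ends the epoch with the same message set and delivery of a failed sender’s multicast becomes all-or-none.

```python
# Sketch: flushing as a union of the survivors' message sets.

def flush(survivors):
    """survivors: dict name -> set of message ids received so far."""
    union = set()
    for msgs in survivors.values():
        union |= msgs            # retransmission closes the gaps
    for name in survivors:
        survivors[name] = set(union)
    return survivors

# E crashed after its multicast reached only B; flushing makes the
# delivery of that message atomic across the surviving members.
state = {"A": {"m1"}, "B": {"m1", "m2-from-E"}, "C": {"m1"}, "D": {"m1"}}
flush(state)
assert all(msgs == {"m1", "m2-from-E"} for msgs in state.values())
```

Only after this exchange completes can the membership change be installed, which is why the old epoch is cleanly closed before the new one begins.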
A multi-phase failure-atomic protocol (diagram, processes A..E, phases 1-4: save message, OK to deliver, all have seen, garbage collect): messages are delivered to the applications at a fixed point in the protocol; three of the phases are always present, and the remaining one is needed only for uniform atomicity
Simple tools • Replicated services: locking for updates, state transfer • Divide responsibility for requests: load balancing • Simpler because of all the communication guarantees we get • Work partitioning: subdivide tasks within a group • Simple schemes for fault-tolerance: Primary-Backup, Coordinator-Cohort
Simple replication: state machine • Replicate all data and actions • Simplest pattern of usage for v.s. groups • Same state everywhere (state transfer) • Updates propagated atomically using ABCAST • Updates applied in the exact same order • Reads or queries: can always be served locally • Not very efficient, updates too synchronous • A little too slow • We try to sneak CBCAST into the picture... • ...and use it for data locking and for updates
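A minimal sketch of this pattern, with the totally ordered ABCAST stream modeled as a shared log (the Replica class and log representation are assumptions for illustration): updates are appended to the total order and applied in sequence at every replica, while reads are served locally with no communication.

```python
# State-machine replication: updates via a total order, local reads.

class Replica:
    def __init__(self, log):
        self.log = log          # shared total order of updates
        self.applied = 0        # how far this replica has applied
        self.data = {}

    def _catch_up(self):
        while self.applied < len(self.log):
            key, value = self.log[self.applied]
            self.data[key] = value
            self.applied += 1

    def update(self, key, value):   # "ABCAST": append to the total order
        self.log.append((key, value))
        self._catch_up()

    def read(self, key):            # served locally
        self._catch_up()
        return self.data.get(key)

log = []
a, b = Replica(log), Replica(log)
a.update("price", 100)
b.update("price", 101)
assert a.read("price") == b.read("price") == 101
```

The inefficiency the slide mentions is visible here: every update is synchronous with the total order, which is what the CBCAST-plus-locking refinement tries to relax.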
Updates with token-style locking (diagram, processes A..E): requesting the lock, granting lock ownership of a shared resource, performing an update; a CRASH is shown as well. We may not need anything more than just causality...
Updates with token-style locking (diagram, processes A..E): a process requests the lock; the others must confirm (releasing any read locks) with individual confirmations; a message from the token owner grants the lock; the new owner then sends its updates
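The token scheme can be sketched as follows (a single-process model with illustrative names; real lock transfers and updates would travel as causally ordered multicasts): only the current token holder may issue updates, so successive holders’ updates are causally ordered and reach every member consistently.

```python
# Sketch: a circulating lock token serializes updates to shared data.

class Group:
    def __init__(self, members):
        self.members = {m: {} for m in members}  # each member's replica
        self.token = None                        # current lock holder

    def grant(self, member):                     # transfer lock ownership
        assert self.token is None, "lock already held"
        self.token = member

    def update(self, member, key, value):
        assert self.token == member, "must hold the token to update"
        for state in self.members.values():      # causal multicast of update
            state[key] = value

    def release(self, member):
        assert self.token == member
        self.token = None

g = Group(["A", "B", "C"])
g.grant("A"); g.update("A", "x", 1); g.release("A")
g.grant("B"); g.update("B", "x", 2); g.release("B")
assert all(state == {"x": 2} for state in g.members.values())
```

Because each grant happens-before the next holder’s updates, causal (CBCAST) delivery is enough to order conflicting updates, with no need for ABCAST.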
Multiple locks on unrelated data (diagram, processes A..E; a CRASH is shown)
Replicated services (diagram): a load balancing scheme distributing queries and updates across replicas
Replicated services (diagram): the Primary-Backup scheme and the Coordinator-Cohort scheme; requests go to the primary server, traces to the backup server, results back to the client
Other types of tools • Publish-Subscribe • Every topic is a separate group • subscribe = join the group • publish = multicast • state transfer = load prior postings • Rest of the ISIS toolkit • News, file storage, job scheduling with load sharing, a framework for reactive control apps, etc.
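The publish-subscribe mapping can be sketched directly (an illustrative class, not the ISIS news tool): each topic behaves like a process group, publish is a multicast to its members, and a late subscriber is brought up to date by a state transfer of prior postings.

```python
# Sketch: topic = group, subscribe = join, publish = multicast,
# state transfer = replay of prior postings to a new member.

class Topic:
    def __init__(self):
        self.members = {}       # subscriber name -> received postings
        self.history = []       # prior postings, kept for state transfer

    def subscribe(self, name):                   # join the group
        self.members[name] = list(self.history)  # state transfer

    def publish(self, msg):                      # multicast to the group
        self.history.append(msg)
        for inbox in self.members.values():
            inbox.append(msg)

news = Topic()
news.subscribe("alice")
news.publish("headline 1")
news.subscribe("bob")            # late joiner gets prior postings
news.publish("headline 2")
assert news.members["alice"] == news.members["bob"] \
       == ["headline 1", "headline 2"]
```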
Complaints (Cheriton / Skeen) • Other techniques are better: transactions, publish/subscribe • Depends on the application... we don’t compete w. ACID! • A matter of taste • Too many costly features, which we may not need • Indeed: and we need to be able to use them selectively • Stackable microprotocols: ideas picked up in Horus • End-to-End argument • Here taken to the extreme, could be used against TCP • But indeed, there are overheads: use this stuff wisely
At what level to apply causality? • Communication level (ISIS) • An efficient technique that usually captures all that matters • Speeds up implementation, simplifies design • May not always be most efficient: might sometimes over-order (might lead to incidental causality) • Not a complete or accurate solution, but often just enough • What kinds of causality really matter to us?
At what level to apply causality? • Communication level (ISIS) • May not recognize all sources of causality • Existence of external channels (shared data, external systems) • Semantic ordering, recognized / understood only by applications • Semantics- or state-level • Ordering prescribed by the senders (prescriptive causality) • Timestamping, version numbers
(skip) Overheads • Causality information in messages • With unicast as the dominant communication pattern, it could be a graph, but only for unlikely patterns of communication • With multicast, it’s simply one vector (possibly compressed) • Usually we have only a few active senders and bursty traffic • Buffering • Overhead linear in N, but w. a small constant, similar to TCP • Buffering is bounded together with the communication rate • Needed anyway for failure atomicity (an essential feature) • Can be efficiently traded for control traffic via explicit flushing • Can be greatly reduced by introducing hierarchical structures
(skip) Overheads • Overheads on the critical path • Delays in delivery, but in reality: comparable to those in TCP • Arriving out of order is uncommon; a window with a few messages • Checking / updating causality info + maintaining msg buffers • False (non-semantic) causality: messages unnecessarily delayed • Group membership changes • Require agreement and slow multi-phase protocols • A tension between latency and bandwidth • Do introduce a disruption: delivery of new messages is suppressed • Costly flushing protocols place load on the network
(skip) Overheads • Control traffic • Acknowledgements, 2nd / 3rd phase: additional messages • Not on critical path, but latency matters, as it affects buffering • Can be piggybacked on other communication • Atomic communication, flush, membership changes slowed down to the slowest participant • Heterogeneity • Need sophisticated protocols to avoid overloading nodes • Scalability • Group size: asymmetric load on senders, large timestamps • Number of groups: complex protocols, not easy to combine
Conclusions • Virtual synchrony is an intermediate solution... • ...less than consistency w. serializability, better than nothing. • Strong enough for some classes of systems • Effective in practice, successfully used in many real systems • Inapplicable or inefficient in database-style settings • Not a monolithic scheme... • Selected features should be used only based on need • At the same time, a complete design paradigm • Causality isn’t so much helpful as an isolated feature, but... • ...it is a key piece of a much larger picture