Distributed Systems Fundamentals Stream Session 4: Group Communication CSC 253 Gordon Blair, François Taïani
Overview of the Session • What is Group Communication about • Reliability Guarantees • Unreliable Multicast • Reliable Multicast • Atomic Multicast • Ordering Guarantees • Unordered • FIFO ordering • Total Ordering • Causal Ordering • Total Ordering and the Stabilisation Problem G. Blair/ F. Taiani
Introducing Group Communication • What is group communication? • Enables the multicasting of a message to a group of processes as a single action • Sender is unaware of the destinations for the message • Why group communication? • Support for replication • fault tolerance & performance (see W8/9 Lectures on FT) • Support for the efficient dissemination of data • Service discovery, publish/subscribe Associated reading: section 7.4 of Tanenbaum and Van Steen
A Typical Group Service

interface GroupCommunicationService {
    // Creates a new group and returns the group's ID
    public GroupID groupCreate();
    // Adding & removing a member to/from a group
    public void groupJoin(GroupID group, Participant member);
    public void groupLeave(GroupID group, Participant member);
    // Multicasts a message to the named group with the specified delivery
    // semantics, and optionally collects a number of replies
    public Message[] multicast(GroupID group, OrderType order, Message message, int nbReplies);
}

Crucial issue of delivery semantics: reliability & ordering
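The call pattern behind this interface can be exercised with a toy, purely local stand-in. A minimal sketch, assuming a hypothetical ToyGroupService with string group IDs and member names (none of these names are part of the slides' API; a real middleware would deliver across machines):

```java
import java.util.*;

// Toy, in-memory sketch of a group service (illustrative names only).
// It shows the create/join/leave/multicast call pattern, not real delivery.
class ToyGroupService {
    private final Map<String, List<String>> groups = new HashMap<>();
    private int nextId = 0;

    public String groupCreate() {
        String id = "g" + (nextId++);
        groups.put(id, new ArrayList<>());
        return id;
    }

    public void groupJoin(String group, String member) { groups.get(group).add(member); }

    public void groupLeave(String group, String member) { groups.get(group).remove(member); }

    // "Multicast" here simply reports which members would receive the message
    public List<String> multicast(String group, String message) {
        return new ArrayList<>(groups.get(group));
    }
}
```

Usage: after creating a group and joining two members, a multicast reaches both; after one leaves, only the remaining member is a destination.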
Reliability Guarantees • Unreliable multicast • Message sent to all members and may or may not arrive • Reliable multicast • Reasonable efforts are made to ensure delivery in spite of message losses • Can be based on positive or negative acknowledgements • No guarantees if the sender crashes during multicast • Atomic multicast • All members receive the message, or none do • Main issue: tolerating the sender’s crash
Implementing Atomic Multicast • Originator sends a message to each member of the group, and awaits acknowledgements • If some acknowledgements are not received in a given period of time, re-send the message; repeat this n times if necessary • If all acknowledgements received • then report success to caller • else remove offending member(s) from group • Each member must monitor the originator • On failure, one of the members must become 'originator' • Each member must also monitor for completion • Awaits arrival of the next message from the same originator to indicate the previous multicast completed
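The originator's side of the steps above can be sketched as a retry loop. This is a sketch under stated assumptions: the actual network send and ACK arrival are abstracted into a predicate, and every name here is illustrative rather than taken from the slides:

```java
import java.util.*;

// Originator-side sketch of the atomic multicast above: send to all members,
// retry up to 'retries' rounds, then drop persistent non-responders.
class AtomicMulticastOriginator {
    // Returns the surviving membership: members that acked within 'retries'
    // rounds stay; offending members are removed from the group.
    static Set<String> multicast(Set<String> members,
                                 String msg,
                                 int retries,
                                 // stands in for "send msg; did an ACK arrive in time this round?"
                                 java.util.function.BiPredicate<String, Integer> ackedOnRound) {
        Set<String> pending = new HashSet<>(members);
        for (int round = 0; round < retries && !pending.isEmpty(); round++) {
            Iterator<String> it = pending.iterator();
            while (it.hasNext()) {
                String m = it.next();
                if (ackedOnRound.test(m, round)) it.remove(); // ACK received
            }
        }
        Set<String> survivors = new HashSet<>(members);
        survivors.removeAll(pending); // remove offending member(s)
        return survivors;
    }
}
```

Note what the sketch deliberately leaves out: the members monitoring the originator and taking over on its failure, which is exactly the hard part discussed on the next slides.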
Atomic Multicast: Notes (I) • Crashed members never receive the multicast message • Solved by deciding that crashed hosts don’t belong to any group • Importance of message ordering: • No guarantees of message ordering for parallel multicasts (the same messages may be delivered in different orders at different sites)
Atomic Multicast: Notes (II) • What happens if a process joins or leaves during the multicast operation? • Everybody needs to agree about group membership to be able to take over the role of originator • But agreeing requires a new multicast operation! • Multiple “originators” might be active • The previous algorithm does not handle this • For this we need ordering guarantees + something called virtual synchrony • The previous protocol does not scale well • The originator must wait for the ACKs of all group members • Known as ACK explosion
Avoiding ACK Explosion • Using negative ACKs (abbreviated as NACKs) • If everything is fine the receiver does not say anything • If a message is lost the receiver complains to the sender • Problem: how do we know that a message should be there? • Solution: the sender numbers its messages consecutively (counter = 0, 1, 2, …); each receiver tracks a per-sender counter (“last msg from B = …”), and a gap in the numbering means “I’ve missed a message!” — it then sends NACK(m1) for the missing message Homework: represent the above protocol on a time-sequence diagram
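The receiver's side of this gap detection can be sketched in a few lines. A minimal sketch, assuming consecutive per-sender sequence numbers starting at 1 (the class and method names are illustrative, not from the slides):

```java
import java.util.*;

// Receiver-side NACK logic: track the highest sequence number seen per
// sender; a jump in the numbering reveals missing messages to NACK.
class NackReceiver {
    private final Map<String, Integer> lastSeen = new HashMap<>();

    // Called when message 'seq' arrives from 'sender'.
    // Returns the sequence numbers to NACK (empty if nothing is missing).
    List<Integer> onReceive(String sender, int seq) {
        int last = lastSeen.getOrDefault(sender, 0);
        List<Integer> missing = new ArrayList<>();
        for (int s = last + 1; s < seq; s++) missing.add(s); // gap detected
        lastSeen.put(sender, Math.max(last, seq));
        return missing;
    }
}
```

For example, receiving message 1 and then message 3 from B produces a NACK for message 2, mirroring the "last msg from B" counters on the slide.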
NACK Mechanism: Notes • ACK explosion is avoided • A NACK explosion might still occur, but is less likely • Garbage collection problem • In theory the sender should keep all past messages indefinitely • In practice past messages are only kept for a “long enough” period • More advanced schemes possible • Limiting NACK instances using NACK broadcast • Hierarchical feedback control
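The garbage collection trade-off above can be made concrete with a bounded retransmission buffer. A sketch under an assumption the slide leaves open: here "long enough" is approximated by a fixed message count rather than a time window, and all names are illustrative:

```java
import java.util.*;

// Sender-side retransmission buffer for the NACK scheme: past messages are
// kept only up to a bounded window, then garbage-collected. A NACK for a
// message that was already dropped can no longer be served.
class RetransmissionBuffer {
    private final int window;                          // how many past messages to keep
    private final TreeMap<Integer, String> buffer = new TreeMap<>();

    RetransmissionBuffer(int window) { this.window = window; }

    void store(int seq, String msg) {
        buffer.put(seq, msg);
        while (buffer.size() > window) buffer.pollFirstEntry(); // GC the oldest entry
    }

    // Returns the message for a NACKed sequence number, or null if already GC'd.
    String retransmit(int seq) { return buffer.get(seq); }
}
```

This makes visible why the window must be "long enough": a receiver that NACKs too late finds the message already collected.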
Ordering Guarantees • Unordered multicast • No guarantees • FIFO (First In, First Out) • Messages sent from the same process are delivered at every site in the order they were sent • Messages sent from different processes may be delivered in different orders at different sites • Totally ordered multicast • Consider messages m1 and m2 sent to the group by (potentially) different processes • Either m1 is delivered before m2, or vice versa, for all members of the group • Causally ordered multicast • As above, except the ordering of m1 and m2 only matters if a “happened-before” relationship exists between the messages
A Matter of Vocabulary • For some, “Atomic Multicast” means • Atomic delivery • And total ordering • Tanenbaum and Van Steen use this definition • We use the meaning of Coulouris et al. • “A-tomic” originally means “that can’t be cut”: all-or-nothing • For us it does not imply total ordering • It’s only a matter of vocabulary • The concepts and algorithms are the same, only the names change • But be sure to know what someone means when they say “atomic multicast”
Implementing Total Ordering • The sequencer approach (centralised) • All requests sent to a sequencer, where they are given an ID • The sequencer assigns consecutive increasing IDs • Requests arriving at sites are held back until they are next in sequence • Problems: sequencer = bottleneck + single point of failure • Other approaches • Distributed agreement to generate IDs (as in Isis) • Assign timestamps from a (global) logical or physical clock
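The sequencer approach can be sketched in two small pieces: the sequencer stamping IDs, and a receiver holding messages back until they are next in sequence. A minimal sketch with illustrative names (a real system would also route every multicast through the sequencer over the network):

```java
import java.util.*;

// The sequencer hands out consecutive increasing IDs, one per multicast.
class Sequencer {
    private int nextId = 1;
    int assign() { return nextId++; }
}

// Each site delivers strictly in ID order, holding back out-of-order arrivals.
class OrderedReceiver {
    private int expected = 1;
    private final TreeMap<Integer, String> holdBack = new TreeMap<>();

    // Called when message (id, msg) arrives; returns whatever becomes deliverable.
    List<String> onReceive(int id, String msg) {
        holdBack.put(id, msg);
        List<String> deliverable = new ArrayList<>();
        while (holdBack.containsKey(expected)) {     // next in sequence?
            deliverable.add(holdBack.remove(expected));
            expected++;
        }
        return deliverable;
    }
}
```

Because every site applies the same rule against the same IDs, all sites deliver in the same total order; the cost is that the single Sequencer is both a bottleneck and a single point of failure, as the slide notes.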
Total Ordering with Time-Stamps • Based on Lamport’s logical clocks (cf. last week’s lecture) • Messages belonging to the same multicast operation are given the same timestamp • At the destination, messages are delivered according to their timestamp • Logical clocks are needed to make sure the delivery order is consistent with the “happened-before” relation • The key to the algorithm is solving the stabilisation problem
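The "5.2"-style timestamps on the following slides combine a Lamport clock with a process-ID tiebreak: the integer part is the logical time, the fractional part the process ID, so no two processes ever issue the same timestamp. A sketch under that reading (class and method names are illustrative; the ".pid" trick as written assumes single-digit process IDs):

```java
// Lamport logical clock with a process-ID tiebreak, matching the
// "clock.pid" timestamps (e.g. 5.2 = logical time 5 at process 2).
class LamportClock {
    private int time = 0;
    private final int pid;                 // assumed 1..9 for the ".pid" encoding

    LamportClock(int pid) { this.pid = pid; }

    // Local event or message send: advance the clock and stamp.
    double tick() {
        time++;
        return time + pid / 10.0;
    }

    // Message receipt: jump past the incoming timestamp (Lamport's rule).
    double onReceive(double ts) {
        time = Math.max(time, (int) ts) + 1;
        return time + pid / 10.0;
    }
}
```

With this rule, any timestamp a process issues after receiving a message is strictly larger than the message's own timestamp, which is exactly what the stabilisation argument on the later slides relies on.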
Stabilisation Problem • A message is stable when the local middleware is sure no other message with a lower timestamp will be received • Messages are delivered by the local middleware in timestamp order • Example: B multicasts mB with timestamp 5.2 while D multicasts mD with timestamp 7.4 • We know that A should not deliver mD yet, but A itself has no way to know that mB is coming!
Stabilisation Problem: Solution • Assumption: underlying network provides FIFO delivery • Messages from the same sender are delivered in the order they were sent • If not provided, can be implemented with consecutive IDs • On receiving a message mX, a process • Puts the message in its local middleware queue • Multicasts ACK(mX) to all other group members • The acknowledgement is time-stamped with a higher timestamp than mX • A process delivers a message mX to the application when • The message is at the top of the local middleware queue • I.e. it has the lowest timestamp of all the messages in the queue • And all ACK(mX) for this message have been received
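The delivery condition above can be sketched as a hold-back queue sorted by timestamp. A minimal sketch with illustrative names, parameterised by the number of ACKs a message needs before it counts as stable (the slides leave open whether that is N or N−1 for a group of N):

```java
import java.util.*;

// Hold-back queue for timestamp-based total ordering: a message is delivered
// only when it has the lowest timestamp in the queue AND enough ACKs.
class TotalOrderQueue {
    private final int acksNeeded;                          // ACKs required for stability
    private final TreeMap<Double, String> queue = new TreeMap<>(); // sorted by timestamp
    private final Map<Double, Set<Integer>> acks = new HashMap<>();

    TotalOrderQueue(int acksNeeded) { this.acksNeeded = acksNeeded; }

    void onMessage(double ts, String msg) {
        queue.put(ts, msg);
        acks.putIfAbsent(ts, new HashSet<>());
    }

    void onAck(double ts, int fromPid) {
        acks.computeIfAbsent(ts, k -> new HashSet<>()).add(fromPid);
    }

    // Delivers every stable message at the head of the queue, in timestamp order.
    List<String> deliverStable() {
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            double head = queue.firstKey();
            if (acks.getOrDefault(head, Set.of()).size() < acksNeeded) break; // head not stable yet
            out.add(queue.remove(head));
            acks.remove(head);
        }
        return out;
    }
}
```

Note that a fully acknowledged message deeper in the queue still waits: until the head is stable, nothing behind it may be delivered, which is the stabilisation problem in code.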
Total Ordering with Time-Stamps: Final Algorithm (ACKs for mB not represented) • Example run: mB is multicast with timestamp 5.2 and mD with timestamp 7.4; the ACKs for mD carry timestamps 8.2 (B), 8.4 (D), 9.1 (A) and 11.3 (C) • Each process keeps a queue sorted by timestamp: mB:5.2, then mD:7.4, then the ACKs for mD • Key point: having received ackB(mD) with timestamp 8.2, A knows that all future messages from B will have a timestamp bigger than 8.2!
Notes on Previous Algo • Completely decentralised • Rapidly becomes messy • Distributed algorithms are hard to design and understand • Still not fault tolerant • All ACKs are needed: any process crash blocks the system • Can be dealt with by using even more complex and more costly algorithms providing fault-tolerant consensus • Not scalable • ACKs are needed from all group members! • Can be improved if we assume message latency is bounded See section 5.2.1 of Tanenbaum and Van Steen for revision
Example: Isis • What is Isis? • A framework for reliable distributed computing based on process groups • Offers a range of delivery semantics • Unordered (FBCAST), causally ordered (CBCAST), totally ordered (ABCAST), etc • Also provides group-view management and state transfer on join • Status • A successful product • Ongoing research on Horus, Ensemble, Java Groups
Expected Learning Outcomes At the end of this 4th Unit: • You should be able to define what a group communication service is • You should be able to explain the different reliability and ordering guarantees discussed today • You should appreciate the scalability issues involved in reliable multicast • You should be able to distinguish total from causal ordering • You should understand the problem of stabilisation for total ordering