Multicast Protocols Jed Liu 28 February 2002
Introduction • Recall Atomic Broadcast: • All correct processors receive same set of messages. • All messages delivered in same order to all processors. • Any message sent by a correct processor is eventually delivered to all processors.
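The recap above can be read as an abstract interface. The sketch below is only illustrative and is not part of the talk; the class and method names (AtomicBroadcast, broadcast, deliver) are assumptions.

    from abc import ABC, abstractmethod

    class AtomicBroadcast(ABC):
        """Illustrative Atomic Broadcast interface (names assumed).

        The slide's guarantees:
          * all correct processors receive the same set of messages,
          * all messages are delivered in the same order at all processors,
          * any message sent by a correct processor is eventually delivered
            to all processors.
        """

        @abstractmethod
        def broadcast(self, message: bytes) -> None:
            """Send a message to all processors."""

        @abstractmethod
        def deliver(self) -> bytes:
            """Return the next message, in the globally agreed order."""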
Introduction (cont’d) • But what happens if the network partitions? • Atomic Broadcast becomes unsolvable! • Define Totally Ordered Broadcast • If a majority of the processes form a connected component, guarantee Atomic Broadcast for this component only. • COReL is an implementation of this.
The Model • Network uses datagram message delivery. • Asynchronous fail-stop model. • Stable storage. • Communication links are transient. • Message Integrity: Messages cannot be corrupted or generated by the network spontaneously.
The System Architecture • Application: passes application messages down to COReL and receives them back totally ordered. • COReL (Totally Ordered Broadcast): passes COReL messages down to the GCS and receives messages with timestamps and delivered views. • Group Communication Service: the bottom layer.
Properties of the GCS • No Duplication: Every message delivered at a process p is delivered only once at p. • Total Order: A logical, globally unique timestamp is attached to every message when it is delivered. Causal order is preserved. The GCS delivers messages in TS order. • Virtual Synchrony: Any two processes undergoing the same two consecutive views in a group G deliver the same set of messages in G within the former view.
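As a rough sketch of what the GCS hands up to COReL, the snippet below models a delivery record carrying a globally unique logical timestamp and the view it was delivered in. The class and field names are illustrative assumptions, not the actual GCS API.

    from dataclasses import dataclass

    @dataclass(frozen=True, order=True)
    class Timestamp:
        # Ordering by (counter, sender) gives a globally unique logical timestamp;
        # preserving causality is the GCS's job (assumption: a Lamport-style counter).
        counter: int
        sender: int

    @dataclass
    class GCSDelivery:
        view_id: int      # the view in which the GCS delivered the message
        ts: Timestamp     # attached by the GCS at delivery time
        payload: bytes

    # The GCS delivers messages to COReL in timestamp order and delivers each
    # message at most once at any given process (No Duplication, Total Order).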
Properties of the GCS (cont’d) • Example (timeline diagram on the slide): P and Q are in the same view. P delivers m and m’ and then sends m”; Q delivers m’ and m”. By Virtual Synchrony, Q also delivers m.
Guarantees Made by COReL • Safety: • At each process, messages become totally ordered in an order which is a prefix of some common global total order. • Total ordering of messages preserves the causal partial order. • Liveness: • Messages are eventually totally ordered by the members of a view.
The COReL Algorithm • GCS supplies a unique timestamp for each message that gets delivered to COReL. • On delivery, the message gets written to stable storage, and an acknowledgement is sent. • Within a majority component, messages are ordered in TS order. Concurrent messages are ordered such that messages from the majority component come first.
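A minimal sketch of the delivery path described above, assuming a simple append-only file as stable storage and a send_ack callback; the class and method names are invented for illustration.

    import pickle
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Entry:
        ts: tuple                          # (counter, sender) timestamp from the GCS
        payload: bytes = field(compare=False)

    class CORelLog:
        """Minimal sketch of COReL's per-process message handling (names assumed)."""

        def __init__(self, log_path: str, send_ack):
            self.log_path = log_path       # stable storage: an append-only file here
            self.send_ack = send_ack       # callback that multicasts an acknowledgement
            self.queue = []                # local copy, kept sorted by GCS timestamp

        def on_gcs_delivery(self, ts: tuple, payload: bytes) -> None:
            entry = Entry(ts, payload)
            # 1. Write the message to stable storage before acknowledging it.
            with open(self.log_path, "ab") as log:
                log.write(pickle.dumps(entry))
            # 2. Send an acknowledgement to the other group members.
            self.send_ack(ts)
            # 3. Order messages within the component by their GCS timestamp.
            self.queue.append(entry)
            self.queue.sort()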
The Primary Component • Use the notion of a primary component to allow members of one network component to continue ordering messages when a partition occurs. (Can be a majority, or in general, a quorum.) • Ordering Rule: Members of the current primary component PM are allowed to totally order a message once it has been acknowledged by all members of PM.
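The Ordering Rule amounts to a membership test: a message may be totally ordered once every member of the current primary component has acknowledged it. A hedged sketch, with invented names:

    def can_order(acks_for_message: set, primary_members: set) -> bool:
        """Ordering Rule: a message may be totally ordered (marked green) once it
        has been acknowledged by every member of the current primary component."""
        return primary_members <= acks_for_message

    # Example: with PM = {1, 2, 3}, the message is ordered only after all three ack.
    assert not can_order({1, 2}, {1, 2, 3})
    assert can_order({1, 2, 3}, {1, 2, 3})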
The Colours Model • Green: messages that have been totally ordered according to the Ordering Rule. • Yellow: messages received and acknowledged in the context of a primary component. May have become green at other members of the primary component. • Red: no knowledge about message’s total order.
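A small data-model sketch of the colours, assuming Python enums; the names are illustrative. Note that a yellow message can be re-coloured red during recovery (see below), but green is final.

    from dataclasses import dataclass
    from enum import Enum, auto

    class Colour(Enum):
        RED = auto()     # no knowledge yet about the message's total order
        YELLOW = auto()  # acknowledged within a primary component; may already be
                         # green at other members of that component
        GREEN = auto()   # totally ordered according to the Ordering Rule

    @dataclass
    class LoggedMessage:
        ts: tuple
        payload: bytes
        colour: Colour = Colour.RED

    # A message moves RED -> YELLOW -> GREEN; during recovery a yellow message may
    # be re-coloured red, but once green its place in the total order is final.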
Invariants • Order of green messages determines the global total order of those messages. • Order of such messages cannot change, and processes have to agree on the order. • Causal order of messages is preserved.
View Changes • Set the primary component bit to FALSE. • Stop handling regular messages and stop sending regular messages. • If new view v contains new members, run a Recovery Procedure. • If v is a majority, establish a new primary component. • Continue handling regular messages and sending regular messages.
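The five steps above might look roughly like the following skeleton, where the helper methods are stubs standing in for the real protocol actions and a simple majority test stands in for the quorum check; all names are assumptions.

    class CORelNode:
        """Skeleton of view-change handling; helper methods are stubs (names assumed)."""

        def __init__(self, total_processes: int):
            self.total_processes = total_processes
            self.primary_bit = False
            self.previous_members = set()

        # Stubs standing in for the real protocol actions.
        def pause_regular_messages(self): pass
        def resume_regular_messages(self): pass
        def run_recovery(self, members): pass
        def establish_primary(self, members): pass

        def on_view_change(self, members: set) -> None:
            self.primary_bit = False             # 1. drop the primary component bit
            self.pause_regular_messages()        # 2. stop regular traffic
            if members - self.previous_members:  # 3. view contains new members
                self.run_recovery(members)
            if len(members) > self.total_processes // 2:
                self.establish_primary(members)  # 4. majority => new primary component
            self.resume_regular_messages()       # 5. resume regular traffic
            self.previous_members = members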
State Variables • Last_Committed_Primary • Number of last primary component that the process has committed to establish. • Last_Attempted_Primary • Number of last primary component that the process has attempted to establish.
Recovery Procedure • Send state message to members of new group. • Wait for state messages from all other group members. • Find a set of Representatives in the group. • Set of processes with the largest Last_Committed_Primary in the group. • Get Representatives to agree on the set of green messages and the set of yellow messages. • Set of green messages determined by the union. Set of yellow messages determined by the intersection.
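A sketch of the representative selection and the union/intersection step, assuming state messages carry a Last_Committed_Primary number plus the identifiers of green and yellow messages; the names and message format are invented for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class StateMessage:
        sender: int
        last_committed_primary: int
        green: set = field(default_factory=set)   # ids of messages that are green
        yellow: set = field(default_factory=set)  # ids of messages that are yellow

    def choose_representatives(states):
        """Representatives: the processes with the largest Last_Committed_Primary."""
        best = max(s.last_committed_primary for s in states)
        return [s for s in states if s.last_committed_primary == best]

    def agree_on_colours(states):
        """Green set = union over the representatives; yellow set = intersection."""
        reps = choose_representatives(states)
        green = set().union(*(r.green for r in reps))
        yellow = set(reps[0].yellow).intersection(*(r.yellow for r in reps))
        return green, yellow

    # Example: sender 3 has a smaller Last_Committed_Primary and is not a
    # representative, so its sets do not influence the outcome.
    states = [StateMessage(1, 7, green={"a"}, yellow={"b", "c"}),
              StateMessage(2, 7, green={"a", "b"}, yellow={"c"}),
              StateMessage(3, 6, green=set(), yellow={"d"})]
    print(agree_on_colours(states))   # ({'a', 'b'}, {'c'})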
Recovery Procedure (cont’d) • A deterministically chosen representative retransmits green and yellow messages to get all group members to agree on the set of green and yellow messages. • Non-representatives re-colour yellow messages as red if the message is not yellow at any representative. • Retransmit red messages as necessary to get all group members to agree on the state and colour of their message queues.
View Change During Recovery? • If in the middle of recovery and we get a view change, we immediately restart recovery with the new view. No need to undo anything. • If view change only removes processes from group, no need to retransmit messages.
Establishing a New Primary Component • Attempt: Record attempt on stable storage and send attempt message to all other members. Wait for attempt messages from all other members. • Commit: Record commit on stable storage. Mark all non-green messages as yellow. Send a commit message. • Establish: When commit messages from all other members arrive, set primary component bit to TRUE and mark all messages as green.
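A sketch of the attempt/commit/establish phases, assuming a JSON line appended to a file as the stable-storage record and callbacks for re-colouring messages; all names are assumptions.

    import json

    class PrimaryEstablisher:
        """Sketch of the attempt / commit / establish phases (names assumed)."""

        def __init__(self, members, storage_path, send):
            self.members = set(members)   # members of the prospective primary component
            self.storage_path = storage_path
            self.send = send              # callback: multicast a control message
            self.primary_bit = False

        def _record(self, phase: str, number: int) -> None:
            # Record the step on stable storage before telling anyone else.
            with open(self.storage_path, "a") as log:
                log.write(json.dumps({"phase": phase, "primary": number}) + "\n")

        def attempt(self, number: int) -> None:
            self._record("attempt", number)   # updates Last_Attempted_Primary
            self.send({"type": "attempt", "primary": number})
            # ...then wait for attempt messages from all other members.

        def commit(self, number: int, mark_non_green_yellow) -> None:
            self._record("commit", number)    # updates Last_Committed_Primary
            mark_non_green_yellow()           # all non-green messages become yellow
            self.send({"type": "commit", "primary": number})

        def establish(self, mark_all_green) -> None:
            # Called once commit messages have arrived from all other members.
            self.primary_bit = True
            mark_all_green()                  # every logged message becomes green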
View Change while Establishing? • A process marks the messages in its message queue as green only when it knows that all other members have marked them as yellow. • If a failure occurs during the protocol, the invariants are not violated.
COReL Summary • An algorithm for totally-ordered multicast in an asynchronous environment. • Resilient to network partitions and communication link failures. • But only live in the primary component! • Allows members of minority components to initiate messages. These messages can become totally ordered even if the originating process is never a member of the primary component.
Transis • Another multicast protocol that deals with network partitions. • Regulates network flow to avoid flooding and message loss. • Uses a sliding-window algorithm similar to that used in TCP.
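The slide compares Transis's flow control to TCP's sliding window. The toy sender below illustrates the general idea only (it is not Transis code): at most window_size messages may be outstanding, and further messages are held back until acknowledgements arrive.

    from collections import deque

    class SlidingWindowSender:
        """Toy sliding-window flow control in the spirit of TCP (illustrative only)."""

        def __init__(self, window_size: int, transmit):
            self.window_size = window_size
            self.transmit = transmit      # callback that puts (seq, payload) on the wire
            self.next_seq = 0
            self.unacked = deque()        # sequence numbers sent but not yet acked
            self.backlog = deque()        # messages held back by flow control

        def send(self, payload: bytes) -> None:
            self.backlog.append(payload)
            self._pump()

        def on_ack(self, seq: int) -> None:
            # Cumulative ack: everything up to and including seq is acknowledged.
            while self.unacked and self.unacked[0] <= seq:
                self.unacked.popleft()
            self._pump()

        def _pump(self) -> None:
            # Transmit while the window has room; holding the rest back is what
            # avoids flooding the network and losing messages.
            while self.backlog and len(self.unacked) < self.window_size:
                seq, payload = self.next_seq, self.backlog.popleft()
                self.unacked.append(seq)
                self.next_seq += 1
                self.transmit(seq, payload)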
The Persistent Replication Services Layer (PRSL) • Built on top of Transis. • Provides applications with long term services such as message logging and replaying, and reconciliation of states among recovered and reconnected endpoints. • With just Transis: • Message delivery only guaranteed within the current group. • No end-to-end acknowledgement at application level, so no guarantee that any destination actually acted on the message.
Replication Groups • The basis of PRSL operations. • A static set of processes defined at startup time. • Different from multicast groups: membership can only change through the startup and shutdown of members.
Replication Group Operations • Uniform multicast. • Totally ordered uniform multicast. • Stable multicast. • Explicit application-level acknowledgement. • Startup/shutdown for adding/removing a member to/from the replication group.
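The operations above suggest an interface along the following lines. This is an illustrative sketch; the method names and one-line descriptions are assumptions, not the actual PRSL API.

    from abc import ABC, abstractmethod

    class ReplicationGroup(ABC):
        """Illustrative PRSL-style interface (names assumed)."""

        @abstractmethod
        def uniform_multicast(self, payload: bytes) -> None:
            """Uniform multicast to the replication group."""

        @abstractmethod
        def ordered_uniform_multicast(self, payload: bytes) -> None:
            """Totally ordered uniform multicast."""

        @abstractmethod
        def stable_multicast(self, payload: bytes) -> None:
            """Stable multicast."""

        @abstractmethod
        def acknowledge(self, message_id: int) -> None:
            """Explicit application-level acknowledgement."""

        @abstractmethod
        def startup(self, member_id: int) -> None:
            """Add a member to the replication group."""

        @abstractmethod
        def shutdown(self, member_id: int) -> None:
            """Remove a member from the replication group."""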