Asynchronous Point-to-Point Message Passing. Interface is: • inputs: sendi(M) • models pi sending set of msgs M • each msg indicates sender and recipient (must be consistent with assumed topology) • outputs: recvi(M) • models pi receiving set of msgs M • each msg in M must have pi as its recipient
Asynch Message Passing • For a sequence of inputs and outputs (sends and receives) to be allowable, there must exist a mapping from the msgs in recv events to msgs in send events s.t. • each msg in a recv event is mapped to a msg in a preceding send event • the mapping is well-defined: every msg received was previously sent (no corruption or spurious msgs) • the mapping is one-to-one: no duplicates • the mapping is onto: every msg sent is received
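The allowability condition above can be checked mechanically on a finite trace. The following is a minimal sketch, not part of the original slides: the function name `is_allowable` and the event encoding are assumptions made for illustration, and a finite trace is treated as the complete execution so that the "onto" condition can be tested.

```python
from collections import Counter

def is_allowable(events):
    """Check a finite trace of point-to-point events.

    events: list of ("send", msg) or ("recv", msg) pairs in execution order.
    msg is any hashable value (hypothetical encoding, e.g. a
    (sender, recipient, payload) tuple).
    """
    in_transit = Counter()  # msgs sent but not yet matched to a recv
    for kind, msg in events:
        if kind == "send":
            in_transit[msg] += 1
        elif kind == "recv":
            if in_transit[msg] == 0:
                # No unmatched earlier send: the msg is spurious or a
                # duplicate, so no well-defined one-to-one mapping exists.
                return False
            in_transit[msg] -= 1
        else:
            raise ValueError(f"unknown event kind: {kind}")
    # Onto: every msg sent must also have been received.
    return all(count == 0 for count in in_transit.values())
```

Matching each recv greedily to any unmatched earlier send of the same msg is enough here, because identical msgs are interchangeable for the purposes of the mapping.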
Broadcast Slides by Prof. Jennifer Welch
Broadcast Specifications • Specification of a broadcast service: • Inputs: bc-sendi(m) • an input to the broadcast service • pi wants to use the broadcast service to send m to all the procs • Outputs: bc-recvi(m,j) • an output of the broadcast service • broadcast service is delivering msg m, sent by pj, to pi
Broadcast Specifications • A sequence of inputs and outputs (bc-sends and bc-recvs) is allowable iff there exists a mapping from each bc-recvi(m,j) event to an earlier bc-sendj(m) event s.t. • the mapping is well-defined: every msg bc-recv'ed was previously bc-sent (Integrity) • the mapping restricted to bc-recvi events, for each i, is one-to-one: no msg is bc-recv'ed more than once at any single proc. (No Duplicates) • the mapping restricted to bc-recvi events, for each i, is onto: every msg bc-sent is bc-recv'ed at every proc. (Liveness)
Ordering Properties • Sometimes we might want a broadcast service that also provides some kind of guarantee on the order in which messages are delivered. • We can add additional constraints on the mapping: • single-source FIFO or • totally ordered or • causally ordered
Single-Source FIFO Ordering • For all messages m1 and m2 and all pi and pj, if pi sends m1 before it sends m2, and if pj receives m1 and m2, then pj receives m1 before it receives m2. • Phrased carefully to avoid requiring that both messages are received. • that is the responsibility of a liveness property
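As an illustration of the definition, the sketch below checks the single-source FIFO property on a recorded finite trace. The function name and the dictionary-based trace encoding are assumptions for this example, and each sender's msgs are assumed to be distinct.

```python
def is_single_source_fifo(send_order, recv_order):
    """send_order: dict sender -> list of msgs in the order they were bc-sent.
    recv_order: dict receiver -> list of (msg, sender) in bc-recv order.
    Assumes each sender's msgs are distinct (hypothetical encoding)."""
    for deliveries in recv_order.values():
        last_pos = {}  # sender -> largest send position received so far
        for msg, sender in deliveries:
            pos = send_order[sender].index(msg)
            if pos < last_pos.get(sender, -1):
                # An earlier-sent msg arrived after a later-sent one.
                return False
            last_pos[sender] = pos
    return True
```

Gaps are allowed: a receiver that gets only m2 does not violate the property, matching the careful phrasing on the slide.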
Totally Ordered • For all messages m1 and m2 and all pi and pj, if both pi and pj receive both messages, then they receive them in the same order. • Phrased carefully to avoid requiring that both messages are received by both procs. • that is the responsibility of a liveness property
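The total-ordering property can likewise be checked on a finite trace. This is a minimal sketch under the assumption that msgs are distinct; the function name and trace encoding are hypothetical.

```python
from itertools import combinations

def is_totally_ordered(recv_order):
    """recv_order: dict proc -> list of distinct msgs in bc-recv order."""
    position = {
        p: {m: k for k, m in enumerate(msgs)} for p, msgs in recv_order.items()
    }
    for p, q in combinations(position, 2):
        # Msgs received by both procs, listed in p's receive order.
        common = [m for m in recv_order[p] if m in position[q]]
        q_positions = [position[q][m] for m in common]
        # q must have received the common msgs in the same relative order.
        if q_positions != sorted(q_positions):
            return False
    return True
```

Only msgs received by both procs are compared, so a proc that misses a msg does not cause a violation.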
Happens Before for Broadcast Messages • Earlier we defined "happens before" relation for events. • Now extend this definition to broadcast messages. • Assume all communication is through broadcast sends and receives. • Msg m1 happens before msg m2 if • some bc-recv event for m1 happens before the bc-send event for m2, or • m1 and m2 are bc-sent by the same proc. and m1 is bc-sent before m2 is bc-sent.
Example of Happens Before for Broadcast Messages [Figure: space-time diagram of broadcast messages m1, m2, m3 and m4] • m1 happens before m3 and m4 • m2 happens before m4 • m3 happens before m4
Causally Ordered • For all messages m1 and m2 and all pi, if m1 happens before m2, and if pi receives both m1 and m2, then pi receives m1 before it receives m2. • Phrased carefully to avoid requiring that both messages are received. • that is the responsibility of a liveness property
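Example [Figure: space-time diagram with broadcast messages a and b] • single-source FIFO? Yes. • totally ordered? No. • causally ordered? Yes.
Example [Figure: space-time diagram with broadcast messages a and b] • single-source FIFO? No. • totally ordered? Yes. • causally ordered? No.
Example [Figure: space-time diagram with broadcast messages a and b] • single-source FIFO? Yes. • totally ordered? No. • causally ordered? No.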
Algorithm BB to Simulate Basic Broadcast on Top of Point-to-Point • When bc-sendi(m) occurs: • pi sends a separate copy of m to every processor (including itself) using the underlying point-to-point message passing communication system • When can pi perform bc-recvi(m)? • when it receives m from the underlying point-to-point message passing communication system
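Basic Broadcast Simulation [Figure: layering diagram. Each user invokes bc-sendi and gets bc-recvi from its local process BBi of Alg BB (BB0 … BBn-1), which together provide basic broadcast; each BBi uses sendi and recvi of the underlying asynch pt-to-pt message passing system.]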
Correctness of Basic Broadcast Algorithm • Assume the underlying point-to-point message passing system is correct (i.e., conforms to the spec given earlier). • Check that the simulated broadcast service satisfies: • Integrity • No Duplicates • Liveness
Single-Source FIFO Algorithm • Assume the underlying communication system is basic broadcast. • when ssf-bc-sendi(m) occurs: • pi uses the underlying basic broadcast service to bcast m together with a sequence number • pi increments sequence number by 1 each time it initiates a bcast • when can pi perform ssf-bc-recvi(m)? • when pi has bc-recv'ed m with sequence number T and has ssf-bc-recv'ed messages from pj (the ssf-bc-sender of m) with all smaller sequence numbers
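The receive rule above amounts to a per-sender hold-back buffer. Here is a minimal sketch of the receiving side only; the class name `SsfReceiver` and the choice to start sequence numbers at 0 are assumptions for illustration, not details given on the slide.

```python
class SsfReceiver:
    """Receiving side of the single-source FIFO layer at one proc."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> msg bc-recv'ed but held back
        self.delivered = []  # (msg, sender) in ssf-bc-recv order

    def bc_recv(self, msg, seq, sender):
        """Called when the basic broadcast layer delivers (msg, seq)."""
        self.held[(sender, seq)] = msg
        expected = self.next_seq.get(sender, 0)  # assumption: numbering starts at 0
        # Deliver as long as the next expected msg from this sender is held.
        while (sender, expected) in self.held:
            msg_out = self.held.pop((sender, expected))
            self.delivered.append((msg_out, sender))
            expected += 1
        self.next_seq[sender] = expected
```

For example, if the msg with sequence number 1 arrives first, it is held until the msg with sequence number 0 arrives, and then both are delivered in send order.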
Single-Source FIFO Algorithm [Figure: layering diagram, top to bottom: user of SSF bcast; ssf-bc-send / ssf-bc-recv; SSF alg (timestamps), providing ssf bcast; bc-send / bc-recv; basic bcast alg (n copies), providing basic bcast; send / recv; point-to-point message passing.]
Asymmetric Algorithm for Totally Ordered Broadcast • Assume underlying communication service is basic broadcast. • There is a distinguished proc. pc • when to-bc-sendi(m) occurs: • pi sends m to pc (either assume the basic broadcast service also has a point-to-point mechanism, or have recipients other than pc ignore the msg) • when pc receives m from pi from the basic broadcast service: • append a sequence number to m and bc-send it
Asymmetric Algorithm for Totally Ordered Broadcast • when can pi perform to-bc-recvi(m)? • when pi has bc-recv'ed m with sequence number T and has to-bc-recv'ed messages with all smaller sequence numbers
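The two roles in the asymmetric algorithm can be sketched as follows. The class names `Sequencer` and `ToReceiver`, and the choice to start sequence numbers at 0, are assumptions for illustration; the network that carries msgs to pc and back is left out.

```python
class Sequencer:
    """Plays the role of the distinguished proc. pc."""

    def __init__(self):
        self.next_seq = 0  # assumption: numbering starts at 0

    def stamp(self, msg):
        """Attach the next sequence number; the result is what pc bc-sends."""
        stamped = (self.next_seq, msg)
        self.next_seq += 1
        return stamped


class ToReceiver:
    """Receiving side at any proc: deliver strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.held = {}       # seq -> msg bc-recv'ed but not yet deliverable
        self.delivered = []  # msgs in to-bc-recv order

    def bc_recv(self, seq, msg):
        self.held[seq] = msg
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
```

Because every proc delivers in the single order chosen by pc, two receivers that see the stamped msgs in different arrival orders still produce the same delivery sequence.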
Asymmetric Algorithm Discussion • Simple • Only requires basic broadcast • But pc is a bottleneck • Alternative approach next…
Symmetric Algorithm for Totally Ordered Broadcast • Assume the underlying communication service is single-source FIFO broadcast. • Each proc. tags each msg it sends with a timestamp (increasing). • Break ties using proc. ids. • Each proc. keeps a vector of estimates of the other procs' timestamps: • If pi's estimate for pj is k, then pi will not receive any later msg from pj with timestamp ≤ k. • Estimates are updated based on msgs received and "timestamp update" msgs
Symmetric Algorithm for Totally Ordered Broadcast • Each proc. keeps its timestamp to be ≥ all its estimates: • when pi has to increase its timestamp because of the receipt of a message, it sends a timestamp update msg • A proc. can deliver a msg with timestamp T once every entry in the proc's vector of estimates is at least T.
Symmetric Algorithm

when to-bc-sendi(m) occurs:
    ts[i]++
    add (m,ts[i],i) to pending
    invoke ssf-bc-sendi((m,ts[i]))

when ssf-bc-recvi((m,T)) from pj occurs:
    ts[j] := T
    add (m,T,j) to pending
    if T > ts[i] then
        ts[i] := T
        invoke ssf-bc-sendi("ts-up",T)

when ssf-bc-recvi("ts-up",T) from pj occurs:
    ts[j] := T

invoke to-bc-recvi(m,j) when:
    (m,T,j) is entry in pending with smallest (T,j), &
    T ≤ ts[k] for all k
    result: remove (m,T,j) from pending
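The pseudocode can be exercised with a small simulation. The sketch below is an illustration under stated assumptions, not the slides' implementation: `SymProc` and `flush` are hypothetical names, `flush` stands in for a single-source FIFO broadcast service, and a proc ignores the returning copy of its own broadcast because it already recorded the entry at send time.

```python
class SymProc:
    def __init__(self, i, n):
        self.i = i
        self.ts = [0] * n      # ts[i] is own timestamp, ts[k] estimates for pk
        self.pending = set()   # entries (T, j, m); tuple order gives (T, j) order
        self.outbox = []       # payloads handed to ssf-bc-send, in order
        self.delivered = []    # (m, j) in to-bc-recv order

    def to_bc_send(self, m):
        self.ts[self.i] += 1
        self.pending.add((self.ts[self.i], self.i, m))
        self.outbox.append(("msg", m, self.ts[self.i]))

    def ssf_recv(self, payload, j):
        if j == self.i:
            return  # assumption: own broadcast was recorded at send time
        kind, T = payload[0], payload[-1]
        self.ts[j] = T
        if kind == "msg":
            self.pending.add((T, j, payload[1]))
            if T > self.ts[self.i]:
                self.ts[self.i] = T
                self.outbox.append(("ts-up", T))
        self._deliver()

    def _deliver(self):
        while self.pending:
            T, j, m = min(self.pending)
            if any(T > t for t in self.ts):
                break  # some proc might still send a smaller timestamp
            self.pending.remove((T, j, m))
            self.delivered.append((m, j))


def flush(procs):
    """Hypothetical SSF network: relay each outbox, in order, to all procs."""
    while any(p.outbox for p in procs):
        for j, p in enumerate(procs):
            while p.outbox:
                payload = p.outbox.pop(0)
                for q in procs:
                    q.ssf_recv(payload, j)
```

With three procs, if p0 broadcasts "a" and p1 concurrently broadcasts "b", both msgs get timestamp 1 and the tie is broken by proc. id, so after `flush` every proc delivers "a" before "b". The timestamp-update msg from the silent proc p2 is what lets p0 and p1 deliver at all.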
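Symmetric Algorithm [Figure: layering diagram, top to bottom: user of TO bcast; to-bc-send / to-bc-recv; symmetric TO alg, providing TO bcast; ssf-bc-send / ssf-bc-recv; SSF alg (timestamps), providing ssf bcast; bc-send / bc-recv; basic bcast alg (n copies), providing basic bcast; send / recv; point-to-point message passing.]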
Correctness of Symmetric Algorithm Lemma (8.2): Timestamps assigned to msgs form a total order (break ties with id of sender). Theorem (8.3): Symmetric algorithm simulates totally ordered broadcast service. Proof: Must show top-level outputs of symmetric algorithm satisfy 4 properties, in every admissible execution (relies on underlying ssf-bcast service being correct).
Correctness of Symmetric Alg. Integrity: follows from same property for ssf-bcast. No Duplicates: follows from same property for ssf-bcast. Liveness: • Suppose in contradiction some pi has some entry (m,T,j) stuck in its pending set forever, where (T,j) is the smallest timestamp of all stuck entries. • Eventually (m,T,j) has the smallest timestamp of all entries in pi's pending set. • Why is (m,T,j) stuck at pi? Because pi's estimate of some pk's timestamp is stuck at some value T' < T. • But that would mean either pk never receives (m,T,j) or pk's timestamp-update msg resulting from pk receiving (m,T,j) is never received at pi, contradicting correctness of the SSF broadcast.
Correctness of Symmetric Alg. Total Ordering: Suppose pi does to-bc-recv for msg m with timestamp (T,j), and later it does to-bc-recv for msg m' with timestamp (T',j'). Show (T,j) < (T',j'). • By the code, if (m',T',j') is in pi's pending set when pi does the to-bc-recv for m, then (T,j) < (T',j'). • Suppose (m',T',j') is not yet in pi's pending set at that time. • When pi does the to-bc-recv for m, precondition ensures that T ≤ ts[j']. So pi has received a msg from pj' with timestamp ≥ T. • By the SSF property, every subsequent msg pi receives from pj' will have timestamp > T, so T' must be > T.
Causal Ordering Algorithms • The symmetric total ordering algorithm ensures causal ordering: • timestamp order extends the happens-before order on messages. • Causal ordering can also be attained without the overhead of total ordering using an algorithm based on vector clocks…
Causal Order Algorithm

when co-bc-sendi(m) occurs:
    vt[i]++
    invoke co-bc-recvi(m)
    invoke bc-sendi((m,vt))

when bc-recvi((m,w)) from pj occurs:
    add (m,w,j) to pending

invoke co-bc-recvi(m,j) when:
    (m,w,j) is in pending
    w[j] = vt[j] + 1
    w[k] ≤ vt[k] for all k ≠ j
    result: remove (m,w,j) from pending
            vt[j]++

Note: vt[j] records how many msgs from pj have been co-recv'ed
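The delivery condition is easiest to see in a small run. The following sketch is an illustration under assumptions: `CoProc` is a hypothetical class name, the basic broadcast layer is replaced by direct method calls, and a proc ignores the returning copy of its own msg since it already delivered it at send time.

```python
class CoProc:
    def __init__(self, i, n):
        self.i = i
        self.vt = [0] * n     # vt[j]: number of msgs from pj co-recv'ed so far
        self.pending = []     # (m, w, j) bc-recv'ed but not yet deliverable
        self.delivered = []   # (m, j) in co-bc-recv order

    def co_bc_send(self, m):
        self.vt[self.i] += 1
        self.delivered.append((m, self.i))   # sender delivers to itself at once
        return (m, list(self.vt), self.i)    # what is handed to bc-send

    def bc_recv(self, m, w, j):
        if j == self.i:
            return  # assumption: own msg was already delivered at send time
        self.pending.append((m, w, j))
        progress = True
        while progress:  # one delivery may unblock other pending msgs
            progress = False
            for entry in list(self.pending):
                m2, w2, j2 = entry
                next_from_sender = w2[j2] == self.vt[j2] + 1
                deps_met = all(
                    w2[k] <= self.vt[k] for k in range(len(self.vt)) if k != j2
                )
                if next_from_sender and deps_met:
                    self.pending.remove(entry)
                    self.vt[j2] += 1
                    self.delivered.append((m2, j2))
                    progress = True
```

In a three-proc run where p1 broadcasts "b" after receiving "a" from p0, a proc p2 that gets "b" first holds it back, because the vector attached to "b" shows one msg from p0 that p2 has not yet delivered.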
Causal Order Algorithm Discussion • Vector clocks are implemented slightly differently than in the point-to-point case. • In point-to-point case, we exploited indirect (transitive) information about messages received by other procs. • In the broadcast case, we don't need to do that, since every proc will eventually receive every message directly.
Causal Order Algorithm Example • Algorithm delays the delivery of the C.O. msgs until causal order property won't be violated. [Figure: space-time diagram whose messages carry the vector timestamps (0,1,0), (0,2,0), (0,3,0) and (1,3,0).]
Correctness of Causal Order Algorithm (Sketch) Lemma (8.6): The local array variables vt serve as vector clocks. Theorem (8.7): The algorithm simulates causally ordered broadcast, if the underlying communication system satisfies (basic) broadcast. Proof: Integrity and No Duplicates follow from the same properties of the basic broadcast. Liveness requires some arguing. Causal Ordering follows from the lemma.
Reliable Broadcast • What do we require of a broadcast service when some of the procs can be faulty? • The specification differs from the corresponding non-fault-tolerant spec in two ways: • proc indices are partitioned into "faulty" and "nonfaulty" • Liveness property is modified…
Reliable Broadcast Specification • Nonfaulty Liveness: Every msg bc-sent by a nonfaulty proc is eventually bc-recv'ed by all nonfaulty procs. • Faulty Liveness: Every msg bc-sent by a faulty proc is bc-recv'ed by either all the nonfaulty procs or none of them.
Discussion of Reliable Bcast Spec • Specification is independent of any particular fault model. • We will only consider implementations for crash faults. • No guarantee is given concerning which messages are received by faulty procs. • Can extend this spec to the various ordering variants: • msgs that are received by nonfaulty procs must conform to the relevant ordering property.
Spec of Failure-Prone Point-to-Point Message Passing System • Before we can design an algorithm to implement reliable (i.e., fault-tolerant) broadcast, we need to know what we can rely on from the lower layer communication system. • Modify the previous point-to-point spec from the no-fault case in two ways: • partition proc indices into "faulty" and "nonfaulty" • Liveness property is modified…
Spec of Failure-Prone Point-to-Point Message Passing System • Nonfaulty Liveness: every msg sent by a nonfaulty proc to any nonfaulty proc is eventually received. Note that this places no constraints on messages received by faulty procs.
Reliable Broadcast Algorithm

when rel-bc-sendi(m) occurs:
    invoke sendi(m) to all procs

when recvi(m) from pj occurs:
    if m has not already been recv'ed then
        invoke sendi(m) to all procs
        invoke rel-bc-recvi(m)
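The relay step is what gives the all-or-none guarantee for msgs from faulty senders. Below is a minimal sketch of one proc's logic; `RelProc` is a hypothetical name, msgs are assumed to be unique hashable values, and the point-to-point network is modelled only as an `outbox` of msgs to be sent to all procs.

```python
class RelProc:
    def __init__(self, i):
        self.i = i
        self.seen = set()     # msgs already recv'ed from the pt-to-pt layer
        self.delivered = []   # msgs in rel-bc-recv order
        self.outbox = []      # msgs to send to all procs via point-to-point

    def rel_bc_send(self, m):
        self.outbox.append(m)

    def recv(self, m):
        if m not in self.seen:
            self.seen.add(m)
            self.outbox.append(m)     # relay to all procs first
            self.delivered.append(m)  # then rel-bc-recv
```

Consider a crash scenario in which a faulty sender manages to send m only to p1 before crashing: p1 relays m on first receipt, so every other nonfaulty proc eventually receives and delivers it as well.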
Correctness of Reliable Bcast Alg • Integrity: follows from Integrity property of underlying point-to-point msg system. • No Duplicates: follows from No Duplicates property of underlying point-to-point msg system and the check that this msg was not already received. • Nonfaulty Liveness: follows from Nonfaulty Liveness property of underlying point-to-point msg system. • Faulty Liveness: follows from relaying and underlying Nonfaulty Liveness.
Total Ordering with Crash Failures • Totally ordered reliable broadcast cannot be achieved in asynchronous systems with crash failures • Total ordering can be used to achieve consensus, which is impossible in asynchronous systems with even one crash failure
Causal Ordering with Crash Failures • Can be achieved in async systems with crash failures • Use previous causal ordering algo, but with reliable broadcast replacing basic broadcast