Synchronization

Synchronization, Part II: Global State, Election, & Critical Sections (Chapter 5)

Global State

Global State – Motivation [Figure: resource allocation graphs for processes P1, P2 and resources R1, R2, showing request, allocate, and release edges]

Global State • We cannot determine the exact global state of the system • We can approximate it • Distributed Snapshot: a state the system might have been in [Chandy and Lamport]

Distributed Snapshots • System = n processes P1 to Pn • Complete (unidirectional) graph • State of Pi is si, drawn from a possibly infinite set of states • Entire contents of process address space! • State of all processes S = {s1, s2, …, sn} • Ci,j indicates the channel between Pi and Pj • Reliable FIFO channels • Contents of Ci,j = Li,j = (m1, m2, …, mk) • L = {Li,j | i, j ∈ {1, …, n}} • Global State of the system is G = (S, L)

Cuts • A consistent cut (meaningful global state) • An inconsistent cut
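A cut is consistent when every message it shows as received is also shown as sent. The sketch below checks that condition on a small, made-up event history. The representation (a per-process count of events inside the cut, and messages as `(sender, send_index, receiver, recv_index)` tuples) is an assumption chosen for illustration, not notation from the slides.

```python
def is_consistent(cut, messages):
    """Return True if no message is received inside the cut but sent outside it.

    cut[p]   = number of events of process p that lie inside the cut
    messages = (sender, send_index, receiver, recv_index), 1-based event indices
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False
    return True


# Hypothetical history: P1's 2nd event sends m, P2's 3rd event receives m.
history = [("P1", 2, "P2", 3)]

good_cut = {"P1": 2, "P2": 3}       # send and receive both inside the cut
in_flight_cut = {"P1": 2, "P2": 2}  # m is sent but still in the channel
bad_cut = {"P1": 1, "P2": 3}        # m is received but never sent

results = [is_consistent(c, history) for c in (good_cut, in_flight_cut, bad_cut)]
```

Note that the middle cut is still consistent: a message that has been sent but not yet received simply belongs to the channel state.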

Distributed Snapshot Algorithm – Description • Provides consistent cuts • Any process can request a snapshot • Processes can request snapshots concurrently • A special message token is used to request a snapshot • The snapshot consists of a global state of the system G = (S,L) • Taken at a consistent cut

Distributed Snapshot Algorithm – Requesting a Snapshot • When a process P requests a snapshot it sends a token(P) to each other process Q • When a process Q receives a token(P) message its action depends on: • If Q receives the token for the first time • Q did not save its state for this token yet • If Q received the token before • Q has already saved its state for this token

Distributed Snapshot Algorithm – Receive Token for the First Time • When Q receives token(X) (from P) for the first time: • Save its state in sQ • Consider LP,Q to be empty • Reliable, FIFO channels: all messages sent before the token have been received and included in sQ • Cut takes effect at token • Send token(X) to everybody else • Note: Q must save state before receiving any subsequent messages from P. Why?

Distributed Snapshot Algorithm – Receive Token Again • When Q receives token(X) (from P) NOT for the first time : • Consider all messages received from P: • After Q has saved its state • Before receiving this token • These messages are part of LP,Q • Why? • Termination: When Q receives token(X) n-1 times, Q finished its part for the snapshot requested by X.

Distributed Snapshot Example (1) • Organization of a process and channels for a distributed snapshot

Distributed Snapshot Example (2) • Process Q receives a marker for the first time and records its local state • Q records all incoming messages • Q receives a marker for its incoming channel and finishes recording the state of the incoming channel

Distributed Snapshot Algorithm – Variables for Process Pi
• int my_version = 0 /* my snapshot version */
• int current_snap[1 .. n] = [0 .. 0]
  /* current_snap[j] = k means Pi has saved its state for snapshot version k initiated by Pj */
• int tokens_received[1 .. n] = [0 .. 0]
  /* tokens_received[j] = k means Pi has received k tokens for the snapshot initiated by Pj; needed to detect termination */
• process_state Si /* Process Pi saves its state in Si */
• channel_state Li[1 .. n]
  /* Process Pi saves the contents of the channel from Pj in Li[j] */

Distributed Snapshot Algorithm – Code for Process Pi (1)
• Request a snapshot:
    my_version++
    save current state into Si
    current_snap[i] = my_version
    for each Pj ≠ Pi
        send token(i, my_version) to Pj
    /* token(i, ver): Pi is requesting a snapshot of version ver */

Distributed Snapshot Algorithm – Code for Process Pi (2)
• Receive token(j, ver) from Pk:
    if current_snap[j] < ver /* first token for this snapshot of Pj */
        save current state into Si
        current_snap[j] = ver /* saved state for this token */
        Li[k] = empty /* cut takes effect at the token */
        for each Pr ≠ Pi
            send token(j, ver) to Pr
        tokens_received[j] = 1 /* received first token */
    else (next slide) …

Distributed Snapshot Algorithm – Code for Process Pi (3)
    else if current_snap[j] == ver /* not the first token */
        tokens_received[j]++ /* how many tokens */
        Li[k] = all messages received from Pk since saving state for token(j, ver) /* essential for a consistent cut */
        if tokens_received[j] == n – 1
            local snapshot for (j, ver) is done /* my participation in this snapshot is done */

Distributed Snapshot Algorithm • When a process finishes its local snapshot, it collects its local state (S and L) and sends it to the initiator of the distributed snapshot • The initiator can then analyze the state
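The token rules above can be exercised with a small simulation. The sketch below is a simplification under stated assumptions: it handles a single snapshot from one initiator (so the version numbers and per-initiator counters are omitted), a process's "state" is just an integer balance, and all names (`Node`, `transfer`, `record_and_flood`, `deliver`, `drain`) are hypothetical. Because money is only moved, never created, the recorded states plus the recorded channel contents should add up to the true total if the cut is consistent.

```python
from collections import deque
from itertools import permutations


class Node:
    def __init__(self, pid, peers, balance):
        self.pid, self.peers, self.balance = pid, peers, balance
        self.saved = None   # recorded local state (s_i)
        self.chan = {}      # recorded contents of the incoming channel from k
        self.open = set()   # incoming channels still being recorded


def make_system(balances):
    ids = range(len(balances))
    nodes = [Node(i, [j for j in ids if j != i], b) for i, b in enumerate(balances)]
    net = {(i, j): deque() for i, j in permutations(ids, 2)}  # reliable FIFO channels
    return nodes, net


def transfer(nodes, net, src, dst, amount):
    nodes[src].balance -= amount
    net[(src, dst)].append(("money", amount))


def record_and_flood(node, net, closed=None):
    """Save local state, start recording incoming channels, send tokens to all peers."""
    node.saved = node.balance
    node.chan = {k: [] for k in node.peers}
    node.open = set(node.peers) - {closed}  # channel the first token came on stays empty
    for k in node.peers:
        net[(node.pid, k)].append(("token", None))


def deliver(nodes, net, src, dst):
    kind, amount = net[(src, dst)].popleft()
    q = nodes[dst]
    if kind == "money":
        q.balance += amount
        if src in q.open:              # arrived after saving state, before src's token
            q.chan[src].append(amount)
    elif q.saved is None:              # first token: save state, forward tokens
        record_and_flood(q, net, closed=src)
    else:                              # later token: this channel's recording is complete
        q.open.discard(src)


def drain(nodes, net):
    while any(net.values()):
        for (src, dst), channel in net.items():
            if channel:
                deliver(nodes, net, src, dst)


nodes, net = make_system([100, 100, 100])
transfer(nodes, net, 0, 1, 10)
record_and_flood(nodes[0], net)        # process 0 requests the snapshot
transfer(nodes, net, 1, 0, 20)
transfer(nodes, net, 2, 1, 5)
drain(nodes, net)
snapshot_total = sum(n.saved + sum(map(sum, n.chan.values())) for n in nodes)
```

In this run the snapshot total equals the 300 units actually in the system, even though no process ever held its recorded balance at the same instant as the others.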

Distributed Snapshot Algorithm – Correctness • Generates a consistent cut • If P receives a message m from Q after P has saved its state, then m cannot be part of P’s state • Otherwise, the cut would not be consistent • The algorithm considers m part of the channel • Any message sent to P before the token will be part of P’s state • FIFO, reliable channels • When P receives a token, it saves its state before receiving any subsequent messages

Distributed Computing Models

Synchronous Distributed Computing Model • Synchronous model: • Process execution speed: bounded • Message transmission delay: bounded • Clock drift rate: bounded • Useful for analysis of algorithms • Can be built if processes can be guaranteed • Enough CPU cycles and N/W capacity • Clocks with bounded drift rates • Can make use of time-outs to detect failures

Asynchronous Distributed Computing Model • Asynchronous model: • Process execution speed: not bounded • Message transmission delay: not bounded • Clock drift rate: not bounded • More realistic (e.g., the Internet) • Harder • More general • Cannot make use of time-outs to detect failures

Leader Election

Leader Election (1) • A central authority is often needed in a DS • Primary replica, scheduler, etc … • All processes have unique ids • Id could be any useful measure for election • Example ids: (1/load), name, priority, etc … • At the end, only one leader is elected • All processes agree on the same leader (unanimous) • Choose the leader with the highest id • System can be synchronous or asynchronous

Leader Election (2) • Several processes can call for an election • There could be at most n concurrent elections at once • A process does not call more than one election at a time • Correctness: • Safety: When election is over, each process Pi has leaderi = j for some process Pj

Ring Algorithm – Asynchronous Model • Messages: election(id), leader(id) • Local variables: runningi = false, leaderi • Initiate_Election(i): runningi = true; send election(i) to Successor(i) [Figure: a ring of processes showing Predecessor(x) and Successor(x) of a process x]

A Ring Algorithm – Process Pi
• Election(i):
    receive a message from Predecessor(i)
    case message is election(k):
        if k > i then send election(k) to Successor(i)
        if k < i & not runningi then runningi = true; send election(i) to Successor(i)
        if k = i then send leader(i) to Successor(i)
    case message is leader(k):
        leaderi = k
        runningi = false
        if k ≠ i then send leader(k) to Successor(i)
        quit election

Ring Algorithm – Example [Figure: a ring of processes with ids 1 to 10 exchanging election messages e(k) and leader messages l(10); process 10 is elected]

Ring Algorithm – Analysis • What if more than one process calls Initiate_Election()? • Exercise: trace the algorithm on an example ring • Message complexity (bandwidth utilization): O(n^2) • There are always n leader messages • Best case: n election messages, when Pn sends election(n) and nobody else sends a message; election(n) travels around the ring back to Pn • Worst case: O(n^2) messages • Exercise: find a ring arrangement that gives 1 + 2 + 3 + … + (n-2) + (n-1) + n = O(n^2) messages
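For tracing the ring algorithm and counting messages, a small simulator can help. The sketch below is one possible reading of the rules, with two assumptions made explicit: a process marks itself as running when it sends its own election message, and a leader message is forwarded around the ring until it returns to the winner. The function name `ring_election` and the single global FIFO delivery order are illustrative choices, not part of the slides.

```python
from collections import deque


def ring_election(ring, initiators):
    """Simulate the election on `ring`, a list of unique ids in ring order.

    The successor of ring[p] is ring[(p + 1) % n]. Returns (leader, counts),
    where leader maps each id to the leader it learned and counts holds the
    number of election and leader messages sent.
    """
    n = len(ring)
    pos = {pid: p for p, pid in enumerate(ring)}
    running = {pid: False for pid in ring}
    leader = {pid: None for pid in ring}
    counts = {"election": 0, "leader": 0}
    in_flight = deque()  # (destination, kind, k), delivered in FIFO order

    def send(src, kind, k):
        counts[kind] += 1
        in_flight.append((ring[(pos[src] + 1) % n], kind, k))

    for i in initiators:               # Initiate_Election(i)
        running[i] = True
        send(i, "election", i)

    while in_flight:
        i, kind, k = in_flight.popleft()
        if kind == "election":
            if k > i:
                send(i, "election", k)     # forward the larger id
            elif k < i and not running[i]:
                running[i] = True          # assumption: mark as running here
                send(i, "election", i)     # replace it with my own id
            elif k == i:
                send(i, "leader", i)       # my id survived a full circuit
            # k < i while running: the message is dropped
        else:
            leader[i] = k
            running[i] = False
            if k != i:                     # assumption: forward until back at the winner
                send(i, "leader", k)
    return leader, counts


# Best case from the analysis: only the highest id starts an election.
leader, counts = ring_election([1, 2, 3, 4, 5], initiators=[5])
```

Calling `ring_election` with different ring orders and initiator sets is a quick way to check a hand trace or a candidate answer to the worst-case exercise.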

Bully Algorithm – Synchronous Model (1) • Reliable channels • Processes can crash • Processes have minimal knowledge of each other: • Direct communication • Each process knows which processes have higher ids than itself

Bully Algorithm (2) • Message types: • election: start an election • leader: announce a winner • bully: bully a nominee to quit • Synchronous model: • Can make use of time-outs to detect failures • The process with the highest id, say P, can send a leader(P) message to all others

Bully Algorithm (3) • A process Q with a lower id can start an election by • sending an election message to all processes with higher ids • waiting for some time T • If no response is received after T (all bigger guys have crashed), then Q sends leader(Q) to all processes with lower ids • If Q receives a response (it must be bully) • It waits for a leader message

The Bully Algorithm – Example (1) • The bully election algorithm • Process 4 holds an election • Processes 5 and 6 respond, telling 4 to stop (OK = bully) • Now 5 and 6 each hold an election

The Bully Algorithm – Example (2) • Process 6 tells 5 to stop • Process 6 wins and tells everyone (coordinator = leader)

Bully Algorithm – Failure Detector
• Boolean Failure_Detector(int id)
    send message to id
    wait for T time units
    if no response return true /* time out */
    else return false
• T = 2Ttrans + Tproc (upper bound)
• Ttrans = upper bound on time to transmit a message
• Tproc = upper bound on time to process a message
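The time-out rule can be sketched with Python's standard `queue` module, where a blocking `get` with a timeout plays the role of "wait for T time units". The callable `send_probe` and the `replies` queue are hypothetical stand-ins for a real messaging layer, and the bounds passed in are made-up values in seconds.

```python
import queue


def failure_detector(send_probe, replies, t_trans, t_proc):
    """Return True if the probed process is suspected to have crashed."""
    timeout = 2 * t_trans + t_proc   # T = 2*Ttrans + Tproc: request, processing, reply
    send_probe()
    try:
        replies.get(timeout=timeout)  # wait at most T for a response
    except queue.Empty:
        return True                   # timed out: no response
    return False


replies = queue.Queue()

# A live peer answers immediately (simulated by putting the reply ourselves).
live_suspected = failure_detector(lambda: replies.put("ack"), replies, 0.02, 0.01)

# A crashed peer never answers, so the detector times out after T = 0.05 s.
dead_suspected = failure_detector(lambda: None, replies, 0.02, 0.01)
```

The detector is only trustworthy in the synchronous model: if the bounds on transmission or processing time do not actually hold, a slow process is indistinguishable from a crashed one.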

Bully Algorithm – Initiating an Election (1)
• Initiate_Election(int i) /* process Pi */
    runningi = true /* I am running in this election */
    if i is the highest id then
        send leader(i) to all Pj, where j ≠ i
    else
        send election(i) to all Pj, where j > i /* check if there are bigger guys out there */
        wait for T time units

Bully Algorithm – Initiating an Election (2)
        if no response /* time out */
            leaderi = i /* I am the leader */
            send leader(i) to all Pj, where j ≠ i
        else /* bully is received */
            wait for T’ time units
            if no leader(k) message then Initiate_Election(i)
            else /* leader(k) received from Pk */
                leaderi = k
                runningi = false /* leader elected */

The Bully Algorithm – Process Pi
• Upon receiving a message m from Pj:
    case m is leader(j): leaderi = j; runningi = false
    case m is election(j):
        if j < i then send bully to Pj
        if not runningi then Initiate_Election(i)
• Upon noticing that the leader has crashed:
    Initiate_Election(i)
• Upon introducing a new process Pi (replacing a crashed one):
    Initiate_Election(i)
• Is there a problem here? Exercise: find it!

Bully Algorithm – Analysis • Bandwidth utilization (message complexity) • Best case: • When the process with the next-highest id (after the leader) notices the leader crash • Sends n – 1 leader messages, O(n) • Worst case: • When the process with the lowest id notices the leader crash • Sends (n-1)[E] + (n-2)[B] + (n-2)[E] + (n-3)[B] + … + 2[E] + 1[B] + (n-1)[L] messages, O(n^2), where [E] = election, [B] = bully, [L] = leader
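To see why the worst-case sum is quadratic, it can simply be added up term by term. The sketch below evaluates the expression exactly as written above; the function name `bully_worst_case` is hypothetical, and the closed form in the comment is derived from this sum rather than taken from the slides.

```python
def bully_worst_case(n):
    """Evaluate (n-1)[E] + (n-2)[B] + ... + 2[E] + 1[B] + (n-1)[L] for n >= 2."""
    elections = sum(range(2, n))     # (n-1) + (n-2) + ... + 2 election messages
    bullies = sum(range(1, n - 1))   # (n-2) + (n-3) + ... + 1 bully messages
    leaders = n - 1                  # final leader announcement
    return elections + bullies + leaders


# The sum simplifies to n*n - n - 1, which grows as O(n^2).
totals = {n: bully_worst_case(n) for n in (3, 4, 5, 10)}
```

For comparison, the best case above costs only the n – 1 leader messages.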

Critical Sections

Critical Sections • Process P: • Enter(c) • CS c • Exit(c) • Remainder(c) • Critical Section correctness • Mutual Exclusion: safety • Deadlock-Freedom: progress • Starvation-Freedom: fairness

Critical Sections – Leader-based Algorithm • One leader process • Utilize leader election • Message types: • request(P,c) = P is requesting entry to CS c • release(P,c) = P is releasing CS c • acquire(P,c) = leader is telling P that it can enter c

Leader-based Algorithm – Example • Process 1 asks the leader for permission to enter a critical section. Permission is granted • Process 2 then asks permission to enter the same critical section. The leader does not reply. • When process 1 exits the critical section, it tells the leader, which then replies to 2

Leader-based Algorithm – Non-Leader Code • Enter(c) by process P • Send request(P,c) to the leader • Wait for acquire(P,c) message from the leader • Exit(c) by process P • Send release(P,c) to the leader

Leader-based Algorithm – Leader Code (1)
• Boolean mutex[M] = false /* M critical sections; mutex[c] is true means that some process is in CS c */
• Process_Queue CSQ[M] /* CSQ[c] is a FIFO queue of processes waiting to enter CS c */

Leader-based Algorithm – Leader Code (2)
• Wait for a message (from process P)
    case message is request(P,c): /* P wants to enter CS c */
        if mutex[c] then CSQ[c].add(P) /* CS c is busy, P must wait */
        else /* P can enter CS c */
            mutex[c] = true /* CS c is taken by P now */
            send acquire(P,c) to P /* inform P that it can enter CS c */

Leader-based Algorithm – Leader Code (3)
    case message is release(P,c): /* P exited CS c */
        if CSQ[c].empty() then /* no processes are waiting for CS c */
            mutex[c] = false /* CS c is available now */
        else /* at least one process is waiting for CS c */
            Q = CSQ[c].remove() /* take out process Q, the one at the head of CSQ[c] */
            send acquire(Q,c) to Q /* inform Q that it can enter CS c */
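The leader's two message handlers translate almost line for line into code. The sketch below keeps the slide names `mutex` and `CSQ` (as `csq`) but replaces real message passing with a log of the acquire messages the leader would send; the class name `CentralLeader` and the `sent` log are assumptions for illustration.

```python
from collections import deque


class CentralLeader:
    def __init__(self, num_sections):
        self.mutex = [False] * num_sections                 # mutex[c]: some process is in CS c
        self.csq = [deque() for _ in range(num_sections)]   # FIFO queue of waiters per CS
        self.sent = []                                       # log of messages the leader sends

    def on_request(self, p, c):
        """Handle request(P, c)."""
        if self.mutex[c]:
            self.csq[c].append(p)                  # CS c is busy, P must wait
        else:
            self.mutex[c] = True                   # CS c is taken by P now
            self.sent.append(("acquire", p, c))

    def on_release(self, p, c):
        """Handle release(P, c)."""
        if not self.csq[c]:
            self.mutex[c] = False                  # nobody is waiting, CS c is free
        else:
            q = self.csq[c].popleft()              # head of the queue enters next
            self.sent.append(("acquire", q, c))


# Replay of the example: process 1 enters, process 2 waits until 1 exits.
leader = CentralLeader(num_sections=1)
leader.on_request(1, 0)
leader.on_request(2, 0)
granted_before_release = list(leader.sent)
leader.on_release(1, 0)
granted_after_release = list(leader.sent)
```

Because `mutex[c]` stays true while the critical section is handed directly to the next waiter, there is no window in which a newly arriving request could jump the queue.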

Leader-based Algorithm – Analysis • What if the leader crashes? • Modify algorithm (exercise) • Correctness • Mutual Exclusion? • Progress? • Fairness? • Message Complexity • 3 messages per entry-exit

Critical Sections – Timestamps-based Algorithm • No leader process • Requires a total order on messages • Utilize Lamport’s timestamps • Message types: • request(P,c,ts) = P is requesting entry to CS c at ts • acquire(P,Q,c) = P is telling Q that it can acquire c
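A total order on requests can be obtained from Lamport timestamps by pairing each clock value with the process id, so that ties are broken deterministically. The sketch below shows that pairing; breaking ties by process id is a common convention and is an assumption here, as are the names `LamportClock`, `stamp`, and `observe`.

```python
class LamportClock:
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def stamp(self):
        """Advance the clock for a send event and return a (time, pid) timestamp."""
        self.time += 1
        return (self.time, self.pid)

    def observe(self, ts):
        """Advance past the timestamp carried by a received message."""
        self.time = max(self.time, ts[0]) + 1


# Two processes request the same critical section at the same logical time.
p0, p2 = LamportClock(0), LamportClock(2)
request_0 = p0.stamp()
request_2 = p2.stamp()

# Each process receives the other's request and updates its clock.
p0.observe(request_2)
p2.observe(request_0)

# Tuples compare by time first, then by pid, so every process picks the same winner.
winner = min(request_0, request_2)
```

Since every process compares the same pairs in the same way, all of them agree on which request comes first without any leader.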

Timestamps-based Algorithm – Example • Two processes want to enter the same critical section at the same moment. • Process 0 has the lowest timestamp, so it wins. • When process 0 is done, it sends an OK also, so 2 can now enter the critical region.

Timestamps-based Algorithm (1)
• Process_Queue CSQ[M] /* M critical sections; FIFO queues to wait for entry */
• Enter(c) by process P
    send request(P,c,ts) to all processes /* request entry from all other processes */
    wait for acquire(Q,P,c) message from all other processes Q /* when all processes permit, enter CS c */
