Exploration of Multiprocessor Synchronization Algorithms

Multiprocessor synchronization algorithms (20225241): Local-Spin Algorithms
Lecturer: Danny Hendler
This presentation is based on the book “Synchronization Algorithms and Concurrent Programming” by G. Taubenfeld and on the survey “Shared-memory mutual exclusion: major research trends since 1986” by J. Anderson, Y-J. Kim and T. Herman.

The Cache-Coherent (CC) and Distributed Shared Memory (DSM) models
[Figure: the CC and DSM architectures. This figure is taken from the survey “Shared-memory mutual exclusion: major research trends since 1986” by J. Anderson, Y-J. Kim and T. Herman.]

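Remote and local memory accesses
• In a DSM system: each shared variable is stored in the memory module of one processor; an access of v by p is local if v is stored in p’s memory module, and remote otherwise.
• In a cache-coherent system: an access of v by p is remote if it is p’s first access of v, or if v has been written by another process since p’s last access of it; otherwise it is served from p’s cache and is local.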

Local-spin algorithms
• In a local-spin algorithm, all busy waiting (‘await’) is done by read-only loops of local accesses, which do not cause interconnect traffic.
• The same algorithm may be local-spin on one architecture (DSM or CC) and not local-spin on the other.
• For local-spin algorithms, our complexity metric is the worst-case number of Remote Memory References (RMRs).
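To make the two definitions of a remote access concrete, here is a small Python sketch that counts RMRs for a trace of accesses under either model. The class RMRCounter and its access(p, v, write) method are hypothetical names made up for this illustration; the CC rule follows the slide’s definition literally rather than modeling any particular coherence protocol.

```python
class RMRCounter:
    """Counts remote memory references (RMRs) per process.

    CC model (home=None): an access of v by p is remote if p holds no valid
    cached copy of v, i.e. it is p's first access of v or another process
    wrote v since p's last access. DSM model: `home` maps each variable to
    the process whose memory module stores it; an access is remote iff the
    accessing process is not that owner.
    """

    def __init__(self, home=None):
        self.home = home
        self.cached = set()  # CC only: (process, variable) pairs with a valid copy
        self.rmrs = {}       # process -> number of remote accesses so far

    def access(self, p, v, write=False):
        if self.home is not None:                 # DSM model
            remote = self.home[v] != p
        else:                                     # CC model
            remote = (p, v) not in self.cached
            if write:                             # a write invalidates other copies
                self.cached = {(q, w) for (q, w) in self.cached if w != v}
            self.cached.add((p, v))
        if remote:
            self.rmrs[p] = self.rmrs.get(p, 0) + 1
        return remote


if __name__ == "__main__":
    # Process 0 spins 1000 times on a variable that nobody writes.
    cc = RMRCounter()
    for _ in range(1000):
        cc.access(0, "lock")
    print(cc.rmrs)   # {0: 1}: only the first read is remote

    dsm = RMRCounter(home={"lock": 1})  # lock lives in process 1's module
    for _ in range(1000):
        dsm.access(0, "lock")
    print(dsm.rmrs)  # {0: 1000}: every read crosses the interconnect
```

The same read-only spin loop costs one RMR on a CC machine but one RMR per iteration on a DSM machine when the variable lives in another process’s module, which is exactly why an algorithm can be local-spin on one architecture and not on the other.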

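Peterson’s 2-process algorithm
• Program for process 0
    b[0] := true
    turn := 0
    await (b[1] = false or turn = 1)
    CS
    b[0] := false
• Program for process 1
    b[1] := true
    turn := 1
    await (b[0] = false or turn = 0)
    CS
    b[1] := false
• Is this algorithm local-spin on a DSM machine? No: both processes spin on turn, which is stored in a single memory module, so at least one of them spins remotely.
• Is this algorithm local-spin on a CC machine? Yes: a waiting process rereads its cached copies of b and turn, which are invalidated only when the other process writes them.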

Peterson’s 2-process algorithm (cont’d)
• What is the RMR complexity on a DSM machine? Unbounded.
• What is the RMR complexity on a CC machine? Constant.
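A minimal Python sketch of Peterson’s algorithm as written on the slides, for illustration only. It assumes CPython, whose global interpreter lock makes these plain list and attribute accesses behave like atomic, sequentially consistent registers; on real hardware you would need atomic variables with suitable memory ordering. The names PetersonLock and run_demo are hypothetical, and time.sleep(0) merely yields the interpreter inside the busy-wait.

```python
import threading
import time


class PetersonLock:
    """Peterson's 2-process lock; process ids are 0 and 1."""

    def __init__(self):
        self.b = [False, False]  # b[p] is true while process p wants the CS
        self.turn = 0            # tie-breaker: the last process to write it waits

    def acquire(self, p):
        other = 1 - p
        self.b[p] = True
        self.turn = p
        # await (b[other] = false or turn = other)
        while self.b[other] and self.turn == p:
            time.sleep(0)  # yield the GIL while busy-waiting

    def release(self, p):
        self.b[p] = False


def run_demo(iterations=1000):
    """Both processes increment a shared counter inside the CS."""
    lock = PetersonLock()
    counter = [0]

    def worker(p):
        for _ in range(iterations):
            lock.acquire(p)
            value = counter[0]
            time.sleep(0)            # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release(p)

    threads = [threading.Thread(target=worker, args=(p,)) for p in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(run_demo())  # 2000: no increment is lost
```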

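Recall the following simple test-and-set based algorithm
Shared: lock, initially 0
• while (! lock.test-and-set()) // entry section
• Critical Section
• lock := 0 // exit section
Here test-and-set() succeeds (returns true) only if it changes lock from 0 to 1.
Is this algorithm local-spin on either a DSM or CC machine? No: every iteration of the entry loop performs a test-and-set, i.e., a write to the shared lock, so waiting processes generate interconnect traffic on both architectures.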
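The following Python sketch mirrors the test-and-set lock. Python has no hardware test-and-set, so the hypothetical AtomicBit class simulates one with a threading.Lock (an assumption of the sketch, not part of the algorithm); its test_and_set returns the old value, so the entry loop retries while it returns 1. The rmw_ops counter makes the point of the slide visible: every retry is another read-modify-write of the shared bit.

```python
import threading
import time


class AtomicBit:
    """Simulated shared bit; the guard lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()
        self.rmw_ops = 0  # number of test-and-set operations performed

    def test_and_set(self):
        """Atomically set the bit to 1 and return its previous value."""
        with self._guard:
            self.rmw_ops += 1
            old = self._value
            self._value = 1
            return old

    def write(self, value):
        self._value = value


class TASLock:
    def __init__(self):
        self.bit = AtomicBit(0)

    def acquire(self):
        # entry section: each failed attempt is another write to the shared bit
        while self.bit.test_and_set() == 1:
            time.sleep(0)  # yield the GIL between attempts

    def release(self):
        self.bit.write(0)  # exit section


def run_demo(n_threads=4, iterations=200):
    lock = TASLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            value = counter[0]
            time.sleep(0)          # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock.bit.rmw_ops


if __name__ == "__main__":
    total, attempts = run_demo()
    print(total)     # 800
    print(attempts)  # at least 800; how many more depends on contention
```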

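A better algorithm: test-and-test-and-set
Shared: lock, initially 0
• while (! lock.test-and-set()) // entry section
    await (lock = 0)
• Critical Section
• lock := 0 // exit section
This creates less traffic on CC machines, since waiting processes spin on their cached copy of lock, but it is still not local-spin: every release invalidates all cached copies, and a process that keeps losing the race to test-and-set can incur an unbounded number of RMRs. On a DSM machine, lock resides in a single memory module, so most processes spin on it remotely.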
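The test-and-test-and-set variant differs only in its entry section: after a failed test-and-set, a process waits with plain reads, which on a CC machine hit its cached copy until the holder releases the lock. Below is a Python sketch assuming CPython threads; the hypothetical AtomicBit class simulates the atomic test-and-set (returning the old value) with a threading.Lock, and time.sleep(0) only yields the interpreter while spinning.

```python
import threading
import time


class AtomicBit:
    """Simulated shared bit; the guard lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def test_and_set(self):
        """Atomically set the bit to 1 and return its previous value."""
        with self._guard:
            old = self._value
            self._value = 1
            return old

    def read(self):
        return self._value

    def write(self, value):
        self._value = value


class TTASLock:
    def __init__(self):
        self.bit = AtomicBit(0)

    def acquire(self):
        while self.bit.test_and_set() == 1:   # entry section
            while self.bit.read() != 0:       # await (lock = 0): read-only spin
                time.sleep(0)

    def release(self):
        self.bit.write(0)                     # exit section


def run_demo(n_threads=4, iterations=200):
    lock = TTASLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            value = counter[0]
            time.sleep(0)          # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(run_demo())  # 800
```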

Local Spinning Mutual Exclusion Using Strong Primitives

Anderson’s queue-based algorithm (Anderson, 1990)
Shared: integer ticket, an RMW object, initially 0
        bit valid[0..n-1], initially valid[0]=1 and valid[i]=0 for i ∈ {1,..,n-1}
Local:  integer myTicket
[Figure: ticket = 0; valid[0..n-1] = 1 0 0 … 0]
• Program for process i
    myTicket := fetch-and-inc-modulo-n(ticket) ; take a ticket
    await valid[myTicket] = 1 ; wait for your turn
    CS
    valid[myTicket] := 0 ; dequeue
    valid[(myTicket+1) mod n] := 1 ; signal successor

Anderson’s queue-based algorithm (cont’d)
[Figure: an example run.
Initial configuration: ticket = 0, valid = 1 0 0 … 0.
After the entry section of p3: myTicket3 = 0, ticket = 1, valid = 1 0 0 … 0 (p3 may enter the CS).
After p1 performs its entry section: myTicket1 = 1, ticket = 2, valid = 1 0 0 … 0 (p1 waits on valid[1]).
After p3 exits: ticket = 2, valid = 0 1 0 … 0 (p1 may now enter).]

Anderson’s queue-based algorithm (cont’d)
• What is the RMR complexity on a DSM machine? Unbounded: the entry valid[myTicket] that a process spins on depends on the ticket it draws, so it cannot be placed in that process’s local memory module.
• What is the RMR complexity on a CC machine? Constant.
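A Python sketch of Anderson’s array-based queue lock, assuming CPython threads. Python lacks an atomic fetch-and-increment, so the hypothetical FetchAndIncModN class simulates it with a threading.Lock; the valid array uses plain list reads and writes, as on the slide. The array size n must be at least the number of processes that may compete at once.

```python
import threading
import time


class FetchAndIncModN:
    """Simulated RMW object: atomically returns old and stores (old + 1) mod n."""

    def __init__(self, n):
        self.n = n
        self.value = 0
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_inc(self):
        with self._guard:
            old = self.value
            self.value = (old + 1) % self.n
            return old


class AndersonLock:
    def __init__(self, n):
        self.n = n
        self.ticket = FetchAndIncModN(n)
        self.valid = [1] + [0] * (n - 1)  # valid[0] = 1, all others 0

    def acquire(self):
        my_ticket = self.ticket.fetch_and_inc()   # take a ticket
        while self.valid[my_ticket] != 1:         # wait for your turn
            time.sleep(0)
        return my_ticket                          # release needs it

    def release(self, my_ticket):
        self.valid[my_ticket] = 0                     # dequeue
        self.valid[(my_ticket + 1) % self.n] = 1      # signal successor


def run_demo(n=4, iterations=100):
    lock = AndersonLock(n)
    counter = [0]
    order = []  # tickets in the order their holders entered the CS

    def worker():
        for _ in range(iterations):
            t = lock.acquire()
            order.append(t)
            value = counter[0]
            time.sleep(0)          # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release(t)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], order


if __name__ == "__main__":
    total, order = run_demo()
    print(total)  # 400
```

Because tickets are handed out in order and each holder signals the next slot, the k-th process to enter the CS holds ticket k mod n, i.e., the lock is FIFO.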

Graunke and Thakkar’s algorithm (Graunke and Thakkar, 1990)
Uses the more common swap (a.k.a. fetch-and-store) primitive:
swap(w, new)
    do atomically
        prev := *w
        *w := new
        return prev

Graunke and Thakkar’s algorithm (cont’d)
Shared: bit slots[0..n-1], initially slots[i]=1 for i ∈ {0,..,n-1}
        structure {bit value, bit *slot} tail, initially {0, &slots[0]}
Local:  structure {bit value, bit *slot} myRecord, prev
        bit temp
[Figure: tail = {0, &slots[0]}; slots[0..n-1] all equal 1]

Graunke and Thakkar’s algorithm (cont’d)
• Program for process i
    myRecord.value := slots[i] ; prepare to thread yourself to queue
    myRecord.slot := &slots[i]
    prev := swap(&tail, myRecord) ; prev now holds the predecessor’s record
    await (*(prev.slot) ≠ prev.value) ; local spin until predecessor’s value changes
    CS
    temp := 1 - slots[i]
    slots[i] := temp ; signal successor

Graunke and Thakkar’s algorithm (cont’d)

Graunke and Thakkar’s algorithm (cont’d)
• What is the RMR complexity on a DSM machine? Unbounded: a process spins on its predecessor’s slot, which is not in its own memory module.
• What is the RMR complexity on a CC machine? Constant.
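A Python sketch of the Graunke and Thakkar lock, assuming CPython threads. The record {value, slot} is represented as a (value, index) tuple, with the index standing in for the pointer &slots[i]; the hypothetical SwapRegister class simulates the atomic swap on tail with a threading.Lock, and the GIL is assumed to make the plain list accesses to slots atomic.

```python
import threading
import time


class SwapRegister:
    """Simulated shared register with an atomic swap (fetch-and-store)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def swap(self, new):
        with self._guard:
            prev = self._value
            self._value = new
            return prev


class GraunkeThakkarLock:
    def __init__(self, n):
        self.slots = [1] * n                 # slots[i] = 1 for every i
        # tail = {0, &slots[0]}, with the pointer modelled as an index
        self.tail = SwapRegister((0, 0))

    def acquire(self, i):
        my_record = (self.slots[i], i)       # {value, slot} of my own slot
        prev_value, prev_slot = self.tail.swap(my_record)
        # spin until the predecessor's slot differs from its recorded value
        while self.slots[prev_slot] == prev_value:
            time.sleep(0)

    def release(self, i):
        temp = 1 - self.slots[i]
        self.slots[i] = temp                 # signal successor


def run_demo(n=4, iterations=100):
    lock = GraunkeThakkarLock(n)
    counter = [0]

    def worker(i):
        for _ in range(iterations):
            lock.acquire(i)
            value = counter[0]
            time.sleep(0)                    # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(run_demo())  # 400: no increment is lost
```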

The MCS queue-based algorithm (Mellor-Crummey and Scott, 1991)
• Has constant RMR complexity under both the DSM and CC models
• Uses swap and CAS
Type:   Qnode: structure {bit locked, Qnode *next}
Shared: Qnode nodes[0..n-1]
        Qnode *tail, initially nil
Local:  Qnode *myNode, initially &nodes[i]
        Qnode *prev, *successor
[Figure: tail points to the last node of a queue of Qnodes; the locked fields of the queued nodes are F, T, T.]

The MCS queue-based algorithm (cont’d)
• Program for process i
    myNode->next := nil                 ; prepare to be last in queue
    prev := swap(&tail, myNode)         ; tail now points to myNode
    if (prev ≠ nil)                     ; I need to wait for a predecessor
        myNode->locked := true          ; prepare to wait
        prev->next := myNode            ; let my predecessor know it has to unlock me
        await (myNode->locked = false)
    CS
    if (myNode->next = nil)             ; if not sure there is a successor
        if (compare-and-swap(&tail, myNode, nil) = false)   ; if there is a successor
            await (myNode->next ≠ nil)  ; spin until successor lets me know its identity
            successor := myNode->next   ; get a pointer to my successor
            successor->locked := false  ; unlock my successor
    else                                ; for sure, I have a successor
        successor := myNode->next       ; get a pointer to my successor
        successor->locked := false      ; unlock my successor
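A Python sketch of the MCS lock, assuming CPython threads. The hypothetical AtomicRef class simulates swap and compare-and-swap on tail with a threading.Lock; compare-and-swap compares by object identity, matching pointer comparison. Each process i owns nodes[i], as on the slide.

```python
import threading
import time


class QNode:
    def __init__(self):
        self.locked = False
        self.next = None


class AtomicRef:
    """Simulated shared pointer with atomic swap and compare-and-swap."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def swap(self, new):
        with self._guard:
            prev = self._value
            self._value = new
            return prev

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class MCSLock:
    def __init__(self, n):
        self.nodes = [QNode() for _ in range(n)]  # nodes[i] belongs to process i
        self.tail = AtomicRef(None)

    def acquire(self, i):
        my_node = self.nodes[i]
        my_node.next = None                  # prepare to be last in queue
        prev = self.tail.swap(my_node)       # tail now points to my_node
        if prev is not None:                 # I must wait for a predecessor
            my_node.locked = True            # prepare to wait
            prev.next = my_node              # let predecessor know to unlock me
            while my_node.locked:            # spin on my own node only
                time.sleep(0)

    def release(self, i):
        my_node = self.nodes[i]
        if my_node.next is None:             # not sure there is a successor
            if self.tail.compare_and_swap(my_node, None):
                return                       # no successor: queue is now empty
            while my_node.next is None:      # successor is still linking in
                time.sleep(0)
        my_node.next.locked = False          # unlock my successor


def run_demo(n=4, iterations=100):
    lock = MCSLock(n)
    counter = [0]

    def worker(i):
        for _ in range(iterations):
            lock.acquire(i)
            value = counter[0]
            time.sleep(0)                    # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(run_demo())  # 400: no increment is lost
```

Every wait spins on a field of the waiting process’s own node (locked or next), which is what lets the algorithm be local-spin on DSM as well as CC.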

The MCS queue-based algorithm (cont’d)

Local Spinning Mutual Exclusion Using Reads and Writes

A local-spin tournament-tree algorithm (Anderson, Yang, 1993)
[Figure: a binary tournament tree for processes 0 to 7; level 0 has nodes 0 to 3, level 1 has nodes 0 and 1, and level 2 has the root, node 0. Each node is identified by (level, number).]
• O(log n) RMR complexity for both DSM and CC systems
• This is optimal (Attiya, Hendler, Woelfel, 2008)
• Uses O(n log n) registers

A local-spin tournament-tree algorithm (cont’d)
Shared:
• Per each node (level, node), there are 3 registers: name[level, 2node] and name[level, 2node+1], initially -1, and turn[level, node]
• Per each level l and process i, a spin flag: flag[l, i], initially 0
Local: level, node, id, rival

A local-spin tournament-tree algorithm (cont’d)
• Program for process i
    node := i
    for level := 0 to log n - 1 do              ; from leaf to root
        id := node mod 2                        ; compute ID for 2-process mutex algorithm (0 or 1)
        node := ⌊node/2⌋                        ; compute node in new level
        name[level, 2node + id] := i            ; identify yourself
        turn[level, node] := i                  ; update the tie-breaker
        flag[level, i] := 0                     ; initialize my locally-accessible spin flag
        rival := name[level, 2node + 1 - id]
        if ((rival ≠ -1) and (turn[level, node] = i))   ; if not sure I should precede rival
            if (flag[level, rival] = 0)         ; if rival may be waiting on its spin flag
                flag[level, rival] := 1         ; release rival by letting it know I updated the tie-breaker
            await (flag[level, i] ≠ 0)          ; wait until signaled by rival (so it updated the tie-breaker)
            if (turn[level, node] = i)          ; if I lost
                await (flag[level, i] = 2)      ; wait till rival notifies me it’s my turn
    endfor
    CS
    for level := log n - 1 downto 0 do          ; begin exit code
        id := ⌊i/2^level⌋ mod 2, node := ⌊i/2^(level+1)⌋   ; set node and id
        name[level, 2node + id] := -1           ; erase name
        rival := turn[level, node]              ; find who rival is (if there is one)
        if (rival ≠ i)                          ; if there is a rival
            flag[level, rival] := 2             ; notify rival
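A Python sketch of the tournament-tree lock, using only reads and writes of shared registers. It assumes CPython threads (the GIL makes list reads and writes behave like atomic, sequentially consistent registers) and that n is a power of two; the class TournamentLock is a hypothetical name, and the variable side plays the role of the slides’ id. Each level keeps its own name, turn and flag arrays, indexed as on the slides.

```python
import threading
import time


class TournamentLock:
    """Tournament tree of 2-process local-spin locks (reads and writes only)."""

    def __init__(self, n):
        assert n >= 2 and n & (n - 1) == 0, "sketch assumes n is a power of two"
        self.levels = n.bit_length() - 1                  # log n
        # name[level][2*node + side]: process competing from that side, or -1
        self.name = [[-1] * (n >> lvl) for lvl in range(self.levels)]
        # turn[level][node]: tie-breaker of each node
        self.turn = [[-1] * (n >> (lvl + 1)) for lvl in range(self.levels)]
        # flag[level][i]: spin flag of process i at that level
        self.flag = [[0] * n for _ in range(self.levels)]

    def acquire(self, i):
        node = i
        for level in range(self.levels):                   # from leaf to root
            name, turn, flag = self.name[level], self.turn[level], self.flag[level]
            side = node % 2                                # my id (0 or 1) here
            node //= 2                                     # node in this level
            name[2 * node + side] = i                      # identify yourself
            turn[node] = i                                 # update the tie-breaker
            flag[i] = 0                                    # reset my spin flag
            rival = name[2 * node + 1 - side]
            if rival != -1 and turn[node] == i:            # not sure I go first
                if flag[rival] == 0:
                    flag[rival] = 1                        # release rival
                while flag[i] == 0:                        # await flag != 0
                    time.sleep(0)
                if turn[node] == i:                        # I lost
                    while flag[i] != 2:                    # await my turn
                        time.sleep(0)

    def release(self, i):
        for level in reversed(range(self.levels)):         # from root to leaf
            side = (i >> level) % 2
            node = i >> (level + 1)
            self.name[level][2 * node + side] = -1         # erase name
            rival = self.turn[level][node]
            if rival != i:                                 # there is a rival
                self.flag[level][rival] = 2                # notify rival


def run_demo(n=4, iterations=100):
    lock = TournamentLock(n)
    counter = [0]

    def worker(i):
        for _ in range(iterations):
            lock.acquire(i)
            value = counter[0]
            time.sleep(0)            # invite a context switch inside the CS
            counter[0] = value + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(run_demo())  # 400: no increment is lost
```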

Local-Spin Leader Election
• Exactly one process is elected
• All other processes are not-elected
• Processes may busy-wait

Choy and Singh’s filter
[Figure: m processes enter a filter; between 1 and ⌈m/2⌉ processes “exit”, the rest are “halted”.]
• Filter guarantees:
    • Safety: if m processes enter a filter, at most ⌈m/2⌉ exit (for example, 2 of 3 processes may exit).
    • Progress: if some processes enter a filter, at least one exits.

Choy and Singh’s filter (cont’d)
Shared: integer turn
        Boolean b, initially false // the barrier is open while b = false
• Program for process i
    turn := i
    await (b = false) // wait for barrier to open
    b := true // close barrier
    if (turn ≠ i) // not last to cross the barrier
        b := false // open barrier
        halt
    else
        exit
• Why does the barrier have to be re-opened?
• Why are the filter guarantees satisfied?
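The filter guarantees can be explored with a small simulation. This Python sketch models each process as a generator that yields after every shared-memory access, and a seeded random scheduler interleaves them; the names Filter, filter_process and run_filter are hypothetical. The sketch treats the barrier as open while b is false, since b := true closes it, and it checks the safety bound in the form “at most ⌈m/2⌉ exit” (with m = 1 the single process must exit). Processes that arrive after a winner has left the barrier closed keep spinning and are reported as waiting.

```python
import math
import random


class Filter:
    def __init__(self):
        self.turn = None
        self.b = False  # barrier: open while b is False


def filter_process(f, i, outcome):
    """Process i runs the filter; every shared access is followed by a yield,
    so the scheduler can interleave processes between accesses."""
    f.turn = i                      # turn := i
    yield
    while True:                     # await (b = false), one read per step
        closed = f.b
        yield
        if not closed:
            break
    f.b = True                      # close barrier
    yield
    last = f.turn == i              # read turn
    yield
    if last:
        outcome[i] = "exit"
    else:
        f.b = False                 # re-open barrier
        outcome[i] = "halted"


def run_filter(m, seed, max_steps=2000):
    """Run m processes under a seeded random scheduler for max_steps steps."""
    rng = random.Random(seed)
    f = Filter()
    outcome = {i: "waiting" for i in range(m)}
    running = {i: filter_process(f, i, outcome) for i in range(m)}
    for _ in range(max_steps):
        if not running:
            break
        i = rng.choice(list(running))
        try:
            next(running[i])
        except StopIteration:
            del running[i]
    return outcome


if __name__ == "__main__":
    for m in range(1, 8):
        exits = [list(run_filter(m, s).values()).count("exit") for s in range(30)]
        print(m, min(exits), max(exits), math.ceil(m / 2))
```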

Choy and Singh’s filter algorithm
[Figure: a chain of filters, Filter #1, Filter #2, …, Filter #i; the processes that exit one filter enter the next.]
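The chaining idea can be checked with a simulation: since a filter entered by m processes lets through between 1 and ⌈m/2⌉ of them, after ⌈log₂ n⌉ filters at most one process remains, so at most one process can get past the last of ⌈log₂ n⌉ + 1 filters and, by the progress guarantee, at least one does. The Python sketch below (hypothetical names cascade_process and run_cascade) simulates only this plain cascade under a seeded random scheduler; it omits the c flags and the early-return rule of the full algorithm, and processes stuck behind a closed barrier are reported as waiting.

```python
import math
import random


def cascade_process(filters, i, outcome):
    """Process i goes through the filters in order; one yield per shared access."""
    for level, f in enumerate(filters):
        f["turn"] = i                    # turn := i
        yield
        while True:                      # await (b = false)
            closed = f["b"]
            yield
            if not closed:
                break
        f["b"] = True                    # close barrier
        yield
        last = f["turn"] == i            # read turn
        yield
        if not last:
            f["b"] = False               # re-open barrier
            outcome[i] = f"halted at filter {level}"
            return
    outcome[i] = "elected"               # exited the last filter


def run_cascade(n, seed, max_steps=3000):
    rng = random.Random(seed)
    k = math.ceil(math.log2(n)) + 1      # number of filters
    filters = [{"turn": None, "b": False} for _ in range(k)]
    outcome = {i: "waiting" for i in range(n)}
    running = {i: cascade_process(filters, i, outcome) for i in range(n)}
    for _ in range(max_steps):
        if not running:
            break
        i = rng.choice(list(running))
        try:
            next(running[i])
        except StopIteration:
            del running[i]
    return outcome


if __name__ == "__main__":
    print(run_cascade(8, seed=1))
```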

Choy and Singh’s filter algorithm (cont’d)
Shared: typedef struct {integer turn; boolean b, c, initially false} filter
        filter A[log n + 1]
• Program for process i
    curr := 0
    while (curr < log n + 1) do
        A[curr].turn := i
        await (A[curr].b = false)
        A[curr].b := true
        if (A[curr].turn ≠ i)
            A[curr].c := true // mark that some process failed on filter
            A[curr].b := false
            return not-elected
        else if (curr > 0 and A[curr-1].c)
            return elected // Other processes will never reach this filter
        else
            curr := curr + 1
    endwhile
• Do you see any problem with this algorithm? How can this be fixed?

Choy and Singh’s filter algorithm (cont’d)
• What is the DSM RMR complexity?
• What is the CC RMR complexity?
• What is the worst-case average (CC) RMR complexity?
