Multiprocessor synchronization algorithms (20225241). Local-Spin Algorithms. Lecturer: Danny Hendler.
This presentation is based on the book “Synchronization Algorithms and Concurrent Programming” by G. Taubenfeld and on the survey “Shared-memory mutual exclusion: major research trends since 1986” by J. Anderson, Y-J. Kim and T. Herman.
The Cache-Coherent (CC) and Distributed Shared Memory (DSM) models
(Figure: the CC and DSM models. Taken from the survey “Shared-memory mutual exclusion: major research trends since 1986” by J. Anderson, Y-J. Kim and T. Herman.)
Remote and local memory accesses
• In a DSM system: (figure labeling each memory access as local or remote)
• In a cache-coherent system: an access of v by p is remote if it is the first access of v, or if v has been written by another process since p’s last access of it.
Local-spin algorithms
• In a local-spin algorithm, all busy waiting (‘await’) is done by read-only loops of local accesses, which do not cause interconnect traffic.
• The same algorithm may be local-spin on one architecture (DSM or CC) and non-local-spin on the other.
For local-spin algorithms, our complexity metric is the worst-case number of Remote Memory References (RMRs).
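As a rough illustration of the RMR metric, the sketch below counts remote accesses in a short trace under both models. The function names, the trace format and the `home` map are hypothetical. The CC rule is the one stated above; the DSM rule is an assumption (the usual convention that each variable lives in one process's memory module and every other process accesses it remotely), since the slides give it only as a figure.

```python
def count_rmrs_cc(trace):
    """Count remote accesses per process under the CC rule.

    trace is a list of (process, op, variable) with op in {"read", "write"}.
    """
    fresh = {}   # variable -> processes that accessed it since the last foreign write
    rmrs = {}
    for p, op, v in trace:
        holders = fresh.setdefault(v, set())
        remote = p not in holders      # first access, or overwritten by another process
        rmrs[p] = rmrs.get(p, 0) + int(remote)
        if op == "write":
            holders.clear()            # the other processes' next access becomes remote
        holders.add(p)
    return rmrs


def count_rmrs_dsm(trace, home):
    """Count remote accesses per process under DSM; home maps variable -> owner."""
    rmrs = {}
    for p, _op, v in trace:
        rmrs[p] = rmrs.get(p, 0) + int(home[v] != p)
    return rmrs


# Process 0 spins three times on b1, process 1 writes it, process 0 reads once more.
trace = [(0, "read", "b1")] * 3 + [(1, "write", "b1"), (0, "read", "b1")]
print(count_rmrs_cc(trace))                # {0: 2, 1: 1}
print(count_rmrs_dsm(trace, {"b1": 1}))    # {0: 4, 1: 0}
```

In the CC count the spin costs process 0 only two RMRs however long it lasts, while in the DSM count every iteration of the spin on a variable owned by process 1 is remote.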
Peterson’s 2-process algorithm
Program for process 0
• b[0]:=true
• turn:=0
• await (b[1]=false or turn=1)
• CS
• b[0]:=false
Program for process 1
• b[1]:=true
• turn:=1
• await (b[0]=false or turn=0)
• CS
• b[1]:=false
Is this algorithm local-spin on a DSM machine? No
Is this algorithm local-spin on a CC machine? Yes
Peterson’s 2-process algorithm
Program for process 0
• b[0]:=true
• turn:=0
• await (b[1]=false or turn=1)
• CS
• b[0]:=false
Program for process 1
• b[1]:=true
• turn:=1
• await (b[0]=false or turn=0)
• CS
• b[1]:=false
What is the RMR complexity on a DSM machine? Unbounded
What is the RMR complexity on a CC machine? Constant
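A minimal threaded sketch of the two programs above, written as one function parameterized by the process index. The name `peterson_demo`, the shared counter and the `time.sleep(0)` yields are illustration-only assumptions. The sketch relies on CPython executing list reads and writes one at a time in program order, and it says nothing about RMR counts, which Python cannot expose.

```python
import threading
import time


def peterson_demo(iterations=200):
    b = [False, False]      # b[i]: process i wants to enter
    turn = [0]              # tie-breaker, kept in a list so both threads share it
    counter = [0]           # shared data protected by the algorithm

    def process(i):
        other = 1 - i
        for _ in range(iterations):
            b[i] = True
            turn[0] = i
            # await (b[other] = false or turn = other)
            while b[other] and turn[0] == i:
                time.sleep(0)           # yield so the other thread can run
            value = counter[0]          # critical section: a deliberately
            time.sleep(0)               # non-atomic read-modify-write
            counter[0] = value + 1
            b[i] = False                # exit section

    threads = [threading.Thread(target=process, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

If mutual exclusion holds, `peterson_demo(k)` returns `2 * k`, because no increment is lost.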
Recall the following simple test-and-set based algorithm
Shared: lock, initially 0
• while (! lock.test-and-set() ) ; // entry section
• Critical Section
• lock := 0 // exit section
Is this algorithm local-spin on either a DSM or CC machine? Nope.
A better algorithm: test-and-test-and-set
Shared: lock, initially 0
• while (! lock.test-and-set() ) // entry section
•     await (lock == 0)
• Critical Section
• lock := 0 // exit section
Creates less traffic in CC machines, still not local-spin.
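The test-and-test-and-set loop can be sketched in Python as follows. This is a sketch under stated assumptions: `threading.Lock.acquire(blocking=False)` stands in for the hardware test-and-set (it returns True to the one caller that sets the bit, matching the slide's convention), `locked()` stands in for the plain read, and the class and function names are hypothetical.

```python
import threading
import time


class TTASLock:
    def __init__(self):
        # Stand-in for the shared bit: held means lock = 1, free means lock = 0.
        self._bit = threading.Lock()

    def acquire(self):
        # Entry section: retry test-and-set, but spin on a plain read in between.
        while not self._bit.acquire(blocking=False):
            while self._bit.locked():    # await (lock == 0): read-only spin
                time.sleep(0)

    def release(self):
        self._bit.release()              # exit section: lock := 0


def ttas_demo(num_threads=4, iterations=50):
    lock, counter = TTASLock(), [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            value = counter[0]           # critical section: non-atomic increment
            time.sleep(0)
            counter[0] = value + 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Every waiter spins on the same shared bit, so each release invalidates all spinning caches at once; this is why the algorithm reduces traffic without being local-spin.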
Anderson’s queue-based algorithm (Anderson, 1990)
Shared:
  integer ticket: an RMW object, initially 0
  bit valid[0..n-1], initially valid[0]=1 and valid[i]=0, for i ∈ {1,..,n-1}
Local:
  integer myTicket
(Figure: initial state, with ticket = 0 and valid = 1 0 0 ... 0 over slots 0..n-1)
Program for process i
• myTicket := fetch-and-inc-modulo-n(ticket) ; take a ticket
• await valid[myTicket]=1 ; wait for your turn
• CS
• valid[myTicket]:=0 ; dequeue
• valid[(myTicket+1) mod n]:=1 ; signal successor
Anderson’s queue-based algorithm (cont’d)
(Figure: four snapshots of an execution)
• Initial configuration: ticket = 0, valid = 1 0 0 0 0
• After entry section of p3: myTicket3 = 0, ticket = 1, valid = 1 0 0 0 0
• After p1 performs entry section: myTicket3 = 0, myTicket1 = 1, ticket = 2, valid = 1 0 0 0 0
• After p3 exits: myTicket1 = 1, ticket = 2, valid = 0 1 0 0 0
Anderson’s queue-based algorithm (cont’d)
Program for process i
• myTicket := fetch-and-inc-modulo-n(ticket) ; take a ticket
• await valid[myTicket]=1 ; wait for your turn
• CS
• valid[myTicket]:=0 ; dequeue
• valid[(myTicket+1) mod n]:=1 ; signal successor
What is the RMR complexity on a DSM machine? Unbounded
What is the RMR complexity on a CC machine? Constant
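A minimal Python sketch of the program above. Python has no fetch-and-increment instruction, so the internal `_rmw` lock is an assumption used only to make that one step atomic. The class and function names are hypothetical, and the sketch assumes at most n threads use the lock at once.

```python
import threading
import time


class AndersonLock:
    def __init__(self, n):
        self.n = n
        self.ticket = 0
        self.valid = [1] + [0] * (n - 1)
        self._rmw = threading.Lock()   # Python-only: makes fetch-and-inc atomic

    def _fetch_and_inc_mod_n(self):
        with self._rmw:
            old = self.ticket
            self.ticket = (old + 1) % self.n
            return old

    def acquire(self):
        my_ticket = self._fetch_and_inc_mod_n()    # take a ticket
        while self.valid[my_ticket] != 1:          # wait for your turn
            time.sleep(0)
        return my_ticket

    def release(self, my_ticket):
        self.valid[my_ticket] = 0                      # dequeue
        self.valid[(my_ticket + 1) % self.n] = 1       # signal successor


def anderson_demo(num_threads=4, iterations=50):
    lock, counter = AndersonLock(num_threads), [0]

    def worker():
        for _ in range(iterations):
            my_ticket = lock.acquire()
            value = counter[0]           # critical section: non-atomic increment
            time.sleep(0)
            counter[0] = value + 1
            lock.release(my_ticket)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Each waiter spins on its own `valid` slot, but the slot is chosen by the ticket rather than by the process, which is why the spin location cannot be fixed in a process's local memory on a DSM machine.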
Graunke and Thakkar’s algorithm (Graunke and Thakkar, 1990)
Uses the more common swap (a.k.a. fetch-and-store) primitive:
swap(w, new)
  do atomically
    prev := *w
    *w := new
    return prev
Graunke and Thakkar’s algorithm (cont’d)
Shared:
  bit slots[0..n-1], initially slots[i]=1, for i ∈ {0,..,n-1}
  structure {bit value, bit *slot} tail, initially {0, &slots[0]}
Local:
  structure {bit value, bit *slot} myRecord, prev
  bit temp
(Figure: initial state, with tail = {0, &slots[0]} and slots = 1 1 1 ... 1 over indices 0..n-1)
Graunke and Thakkar’s algorithm (cont’d)
Shared:
  bit slots[0..n-1], initially slots[i]=1, for i ∈ {0,..,n-1}
  structure {bit value, bit *slot} tail, initially {0, &slots[0]}
Local:
  structure {bit value, bit *slot} myRecord, prev
  bit temp
Program for process i
• myRecord.value:=slots[i] ; prepare to thread yourself to queue
• myRecord.slot:=&slots[i]
• prev:=swap(&tail, myRecord) ; prev now holds the predecessor’s record
• await (*prev.slot≠prev.value) ; local spin until predecessor’s value changes
• CS
• temp:=1-slots[i]
• slots[i]:=temp ; signal successor
Graunke and Thakkar’s algorithm (cont’d)
Program for process i
• myRecord.value:=slots[i] ; prepare to thread yourself to queue
• myRecord.slot:=&slots[i]
• prev:=swap(&tail, myRecord) ; prev now holds the predecessor’s record
• await (*prev.slot≠prev.value) ; local spin until predecessor’s value changes
• CS
• temp:=1-slots[i]
• slots[i]:=temp ; signal successor
What is the RMR complexity on a DSM machine? Unbounded
What is the RMR complexity on a CC machine? Constant
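The program above can be sketched in Python as follows. The sketch represents a record as a `(value, slot index)` tuple instead of a structure with a pointer, and uses an internal `_rmw` lock only to make swap atomic; both choices, and all names, are assumptions for illustration.

```python
import threading
import time


class GraunkeThakkarLock:
    def __init__(self, n):
        self.slots = [1] * n
        self.tail = (0, 0)             # {0, &slots[0]}: differs from slots[0], so
                                       # the first process to arrive does not wait
        self._rmw = threading.Lock()   # Python-only: makes swap atomic

    def _swap_tail(self, new):
        with self._rmw:
            prev, self.tail = self.tail, new
            return prev

    def acquire(self, i):
        my_record = (self.slots[i], i)                 # thread yourself to queue
        prev_value, prev_slot = self._swap_tail(my_record)
        while self.slots[prev_slot] == prev_value:     # spin until predecessor's
            time.sleep(0)                              # value changes

    def release(self, i):
        self.slots[i] = 1 - self.slots[i]              # signal successor


def graunke_thakkar_demo(num_threads=4, iterations=50):
    lock, counter = GraunkeThakkarLock(num_threads), [0]

    def worker(i):
        for _ in range(iterations):
            lock.acquire(i)
            value = counter[0]           # critical section: non-atomic increment
            time.sleep(0)
            counter[0] = value + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

A process spins on its predecessor's slot rather than its own, so on a DSM machine the spin variable generally lives in another process's memory.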
The MCS queue-based algorithm (Mellor-Crummey and Scott, 1991)
• Has constant RMR complexity under both the DSM and CC models
• Uses swap and CAS
Type:
  Qnode: structure {bit locked, Qnode *next}
Shared:
  Qnode nodes[0..n-1]
  Qnode *tail, initially nil
Local:
  Qnode *myNode, initially &nodes[i]
  Qnode *prev, *pred, *successor
(Figure: tail pointing into a queue of nodes whose locked bits are F, T, T)
The MCS queue-based algorithm (cont’d)
Program for process i
• myNode->next := nil ;prepare to be last in queue
• prev := myNode ;prepare to thread yourself
• pred := swap(&tail, prev) ;tail now points to myNode
• if (pred ≠ nil) ;I need to wait for a predecessor
•     myNode->locked := true ;prepare to wait
•     pred->next := myNode ;let my predecessor know it has to unlock me
•     await (myNode->locked = false)
• CS
• if (myNode->next = nil) ;if not sure there is a successor
•     if (compare-and-swap(&tail, myNode, nil) = false) ;if there is a successor
•         await (myNode->next ≠ nil) ;spin until successor lets me know its identity
•         successor := myNode->next ;get a pointer to my successor
•         successor->locked := false ;unlock my successor
• else ;for sure, I have a successor
•     successor := myNode->next ;get a pointer to my successor
•     successor->locked := false ;unlock my successor
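A minimal Python sketch of the MCS entry and exit code above. Object references play the role of the Qnode pointers, and an internal `_rmw` lock is an assumption used only to make swap and compare-and-swap on `tail` atomic; all class and function names are hypothetical.

```python
import threading
import time


class QNode:
    def __init__(self):
        self.locked = False
        self.next = None


class MCSLock:
    def __init__(self):
        self.tail = None
        self._rmw = threading.Lock()   # Python-only: makes swap and CAS atomic

    def _swap_tail(self, new):
        with self._rmw:
            prev, self.tail = self.tail, new
            return prev

    def _cas_tail(self, expected, new):
        with self._rmw:
            if self.tail is expected:
                self.tail = new
                return True
            return False

    def acquire(self, my_node):
        my_node.next = None                    # prepare to be last in queue
        pred = self._swap_tail(my_node)        # tail now points to my_node
        if pred is not None:                   # I need to wait for a predecessor
            my_node.locked = True
            pred.next = my_node                # let my predecessor know about me
            while my_node.locked:              # spin on my own node only
                time.sleep(0)

    def release(self, my_node):
        if my_node.next is None:               # not sure there is a successor
            if self._cas_tail(my_node, None):
                return                         # queue is empty again
            while my_node.next is None:        # successor is about to link itself
                time.sleep(0)
        my_node.next.locked = False            # unlock my successor


def mcs_demo(num_threads=4, iterations=50):
    lock, counter = MCSLock(), [0]

    def worker():
        my_node = QNode()                      # one node per thread, as nodes[i]
        for _ in range(iterations):
            lock.acquire(my_node)
            value = counter[0]                 # critical section
            time.sleep(0)
            counter[0] = value + 1
            lock.release(my_node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Both waiting loops read fields of the caller's own node, which is what allows the spin to stay local under the DSM model as well as the CC model.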
A local-spin tournament-tree algorithm (Anderson, Yang, 1993)
(Figure: a binary tournament tree with levels 0 to 2 over processes 0..7; each node is identified by (level, number))
• O(log n) RMR complexity for both DSM and CC systems
• This is optimal (Attiya, Hendler, Woelfel, 2008)
• Uses O(n log n) registers
A local-spin tournament-tree algorithm (cont’d)
Shared:
- For each node v, there are 3 registers: name[level, 2node] and name[level, 2node+1], initially -1, and turn[level, node]
- For each level l and process i, a spin flag: flag[level, i], initially 0
Local: level, node, id
A local-spin tournament-tree algorithm (cont’d)
Program for process i
• id:=i
• for level = 0 to log n - 1 do ;from leaf to root
•     node:= ⌊id/2⌋ ;compute node in new level
•     id:= id mod 2 ;compute ID for 2-process mutex algorithm (0 or 1)
•     name[level, 2node+id]:=i ;identify yourself
•     turn[level, node]:=i ;update the tie-breaker
•     flag[level, i]:=0 ;initialize my locally-accessible spin flag
•     rival:=name[level, 2node+1-id]
•     if ( (rival ≠ -1) and (turn[level, node] = i) ) ;if not sure I should precede rival
•         if (flag[level, rival] = 0) ;if rival may get to wait at line 14
•             flag[level, rival]:=1 ;release rival by letting it know I updated tie-breaker
•         await flag[level, i] ≠ 0 ;await until signaled by rival (so it updated tie-breaker)
•         if (turn[level, node] = i) ;if I lost
•             await flag[level, i]=2 ;wait till rival notifies me it's my turn
•     id:=node ;move to the next level
• EndFor
• CS
• for level = log n - 1 downto 0 do ;begin exit code
•     id:= ⌊i/2^level⌋, node:= ⌊id/2⌋ ;set node and id
•     name[level, 2node + (id mod 2)]:=-1 ;erase name
•     rival := turn[level, node] ;find who rival is (if there is one)
•     if rival ≠ i ;if there is a rival
•         flag[level, rival]:=2 ;notify rival
Local-Spin Leader Election
• Exactly one process is elected
• All other processes are not-elected
• Processes may busy-wait
Choy and Singh's filter
(Figure: m processes enter the filter; between 1 and ⌈m/2⌉ processes “exit”; the rest are “halted”)
Filter guarantees:
• Safety: if m processes enter a filter, at most ⌈m/2⌉ exit.
• Progress: if some processes enter a filter, at least one exits.
Choy and Singh's filter (cont’d)
Shared:
  integer turn
  boolean b, initially false
Program for process i
• turn := i
• await (b = false) // wait for barrier to open
• b := true // close barrier
• if turn ≠ i // not last to cross the barrier
•     b := false // open barrier
•     halt
• else
•     exit
Why does the barrier have to be re-opened? Why are filter guarantees satisfied?
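One way to probe the two questions above is to enumerate every interleaving of the filter for a small number of processes. The sketch below does that with a tiny state explorer. It assumes that each line of the program is one atomic step, that the await waits for b to be false (the barrier starts open), and that the m/2 bound is rounded up; the function name and the state encoding are hypothetical.

```python
from math import ceil

EXIT, HALT = "exit", "halt"


def explore_filter(m):
    """Return the set of possible numbers of exiting processes over all runs."""
    start = (None, False, (0,) * m)      # (turn, b, program counter per process)
    seen, stack, outcomes = {start}, [start], set()
    while stack:
        turn, b, pcs = stack.pop()
        successors = []
        for i, pc in enumerate(pcs):
            new_turn, new_b = turn, b
            if pc == 0:                  # turn := i
                new_turn, new_pc = i, 1
            elif pc == 1:                # await (b = false)
                if b:
                    continue             # barrier closed: process i is blocked
                new_pc = 2
            elif pc == 2:                # b := true
                new_b, new_pc = True, 3
            elif pc == 3:                # if turn != i ... else exit
                new_pc = 4 if turn != i else EXIT
            elif pc == 4:                # b := false; halt
                new_b, new_pc = False, HALT
            else:
                continue                 # process i already exited or halted
            successors.append((new_turn, new_b, pcs[:i] + (new_pc,) + pcs[i + 1:]))
        if not successors:               # nobody can move: the run is over
            outcomes.add(pcs.count(EXIT))
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return outcomes


for m in range(1, 5):
    counts = explore_filter(m)
    print(m, sorted(counts), 1 <= min(counts) and max(counts) <= ceil(m / 2))
```

In this model a process that arrives after an exiting process has closed the barrier stays blocked at the await; it is counted as not exiting.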
Choy and Singh’s filter algorithm
(Figure: a chain of filters: Filter #1, Filter #2, ..., Filter #i)
Choy and Singh’s filter algorithm (cont’d)
Shared:
  typedef struct {integer turn; boolean b, c, initially false} filter
  filter A[log n + 1]
Program for process i
• for (curr=0; curr < log n + 1; curr++)
•     A[curr].turn := i
•     await (A[curr].b = false)
•     A[curr].b := true
•     if (A[curr].turn ≠ i)
•         A[curr].c := true // mark that some process failed on filter
•         A[curr].b := false
•         return not-elected
•     else if (curr > 0) A[curr-1].c
•         return elected // Other processes will never reach this filter
•     else
•         curr := curr+1
• EndFor
Do you see any problem with this algorithm? How can this be fixed?
Choy and Singh’s filter algorithm (cont’d)
• What is the DSM RMR complexity?
• What is the CC RMR complexity?
• What is the worst-case average (CC) RMR complexity?