ECE 259 / CPS 221 Advanced Computer Architecture II
Synchronization without Contention
John M. Mellor-Crummey and Michael L. Scott
Presenter: Tae Jun Ham
2012. 2. 16
Problem
• Busy-waiting synchronization incurs high memory/network contention
  - Creates hot spots, which degrade performance
  - Causes a cache-line invalidation for every write to the lock
• Possible approach: add special-purpose hardware for synchronization
  - Add synchronization variables to the switching nodes of the interconnection network
  - Implement lock-queuing mechanisms in the cache controller
• Suggestion in this paper: use a scalable synchronization algorithm (MCS) instead of special-purpose hardware
Review of Synchronization Algorithms
• Test and Set
  - Requires: test&set (atomic operation)
  - Problems:
    - Heavy contention on cache / memory
    - Lack of fairness: the lock is granted in arbitrary order
  LOCK:   while (test&set(x) == 1);
  UNLOCK: x = 0;
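The LOCK/UNLOCK pair above can be sketched in C11, assuming `atomic_flag` stands in for the hardware test&set instruction. The names `tas_lock_t`, `tas_init`, `tas_try_acquire`, `tas_acquire` and `tas_release` are hypothetical and chosen only for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type name: one flag is the whole lock. */
typedef struct { atomic_flag held; } tas_lock_t;

static void tas_init(tas_lock_t *l) {
    atomic_flag_clear(&l->held);
}

/* One test&set attempt; returns true if the caller now holds the lock.
   atomic_flag_test_and_set returns the PREVIOUS value of the flag. */
static bool tas_try_acquire(tas_lock_t *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* LOCK: every failed attempt is still a write to the shared flag,
   which is the source of the contention listed on the slide. */
static void tas_acquire(tas_lock_t *l) {
    while (!tas_try_acquire(l)) { /* spin */ }
}

/* UNLOCK: x = 0 */
static void tas_release(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Nothing in this sketch orders the waiters, which matches the fairness problem noted above.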
Review of Synchronization Algorithms
• Test and Set with Backoff
  - Same as Test and Set, but waits between attempts
  - Delay growth after each failed attempt:
    - Linear: time = time + constant
    - Exponential: time = time * constant
  - Performance: reduced contention, but still not fair
  LOCK:   while (test&set(x) == 1) { delay(time); /* then grow time by one of the rules above */ }
  UNLOCK: x = 0;
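A minimal C11 sketch of the exponential variant follows. The cap `BACKOFF_MAX`, the starting value `BACKOFF_MIN`, the doubling factor and the busy-loop `pause_for` are all assumptions made for illustration; the slide only says that the delay is multiplied by some constant.

```c
#include <stdatomic.h>

/* Assumed tuning values, not taken from the slides. */
enum { BACKOFF_MIN = 1, BACKOFF_MAX = 1024 };

typedef struct { atomic_flag held; } tasb_lock_t;   /* hypothetical name */

static void tasb_init(tasb_lock_t *l) { atomic_flag_clear(&l->held); }

/* Exponential rule: time = time * 2, clamped to BACKOFF_MAX. */
static unsigned next_delay(unsigned delay) {
    unsigned doubled = delay * 2;
    return doubled > BACKOFF_MAX ? BACKOFF_MAX : doubled;
}

/* Crude stand-in for delay(time): spin locally without touching the lock. */
static void pause_for(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* LOCK with backoff; returns how many attempts failed before success. */
static unsigned tasb_acquire(tasb_lock_t *l) {
    unsigned delay = BACKOFF_MIN, failures = 0;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        pause_for(delay);            /* stay off the shared flag for a while */
        delay = next_delay(delay);   /* wait longer after each failure */
        failures++;
    }
    return failures;
}

static void tasb_release(tasb_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The clamp keeps a long-waiting processor from sleeping far past the moment the lock becomes free.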
Review of Synchronization Algorithms
• Ticket Lock
  - Requires: fetch&increment (atomic operation)
  - Advantage: fair (FIFO)
  - Disadvantage: memory/network contention, because every waiter polls the shared now_serving
  LOCK:   myticket = fetch&increment(&(L->next_ticket));
          while (myticket != L->now_serving) { delay(time * (myticket - L->now_serving)); }
  UNLOCK: L->now_serving = L->now_serving + 1;
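The ticket lock above maps fairly directly onto C11 atomics. In this sketch `atomic_fetch_add` plays the role of fetch&increment; `BASE_DELAY` and `pause_for` are assumed helpers for the proportional backoff, and returning the ticket from `ticket_acquire` is only a convenience for inspection.

```c
#include <stdatomic.h>

enum { BASE_DELAY = 16 };   /* assumed backoff unit, not from the slides */

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket that currently owns the lock */
} ticket_lock_t;

static void ticket_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void pause_for(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* LOCK: take a ticket, then wait until it is being served.
   Returns the ticket number that was drawn. */
static unsigned ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    for (;;) {
        unsigned serving = atomic_load_explicit(&l->now_serving,
                                                memory_order_acquire);
        if (serving == my)
            return my;
        /* Proportional backoff: wait longer the further back in line we are. */
        pause_for(BASE_DELAY * (my - serving));
    }
}

/* UNLOCK: only the holder writes now_serving, so load + store suffices. */
static void ticket_release(ticket_lock_t *l) {
    unsigned cur = atomic_load_explicit(&l->now_serving, memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

Tickets are served strictly in the order they were drawn, which is where the FIFO fairness comes from.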
Review of Synchronization Algorithms
• Anderson Lock (array-based queue lock)
  - Requires: fetch&increment (atomic operation)
  - Advantages: fair (FIFO); each waiter spins on its own array slot, so there is no cache contention
  - Disadvantages: requires coherent caches; needs one array slot per processor for every lock
  LOCK:   myplace = fetch&increment(&(L->next_location)) mod numprocs;
          while (L->location[myplace] == must_wait);
          L->location[myplace] = must_wait;
  UNLOCK: L->location[(myplace + 1) mod numprocs] = has_lock;
  (numprocs is the number of slots in the location array)
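A C11 sketch of the array-based queue lock, under a few assumptions: `NSLOTS` is a hypothetical compile-time bound on the number of contending processors, it is a power of two so that wraparound of the counter stays consistent with the modulo, and the slots are left unpadded for brevity (a practical version would give each slot its own cache line).

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { NSLOTS = 8 };   /* assumed processor bound; power of two */

typedef struct {
    atomic_bool has_lock[NSLOTS];   /* true = has_lock, false = must_wait */
    atomic_uint next_slot;          /* fetch&increment target */
} array_lock_t;

static void array_init(array_lock_t *l) {
    for (int i = 0; i < NSLOTS; i++)
        atomic_init(&l->has_lock[i], i == 0);   /* slot 0 starts as owner */
    atomic_init(&l->next_slot, 0);
}

/* LOCK: claim a slot, spin only on that slot. Returns the slot index,
   which the caller must pass to array_release. */
static unsigned array_acquire(array_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_slot, 1) % NSLOTS;
    while (!atomic_load_explicit(&l->has_lock[my], memory_order_acquire)) { }
    /* Reset the slot to must_wait so it can be reused on the next lap. */
    atomic_store_explicit(&l->has_lock[my], false, memory_order_relaxed);
    return my;
}

/* UNLOCK: hand the lock to the next slot, wrapping around the array. */
static void array_release(array_lock_t *l, unsigned my) {
    atomic_store_explicit(&l->has_lock[(my + 1) % NSLOTS], true,
                          memory_order_release);
}
```

The modulo on both the acquire and release side is what keeps the indices inside the array once more than `NSLOTS` acquisitions have happened.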
MCS Lock
• Based on a linked list (queue) of per-processor nodes
• Acquire
  - Fetch&store the arriving processor's node into tail (this returns the predecessor and sets the new tail)
  - If there is no predecessor, the lock is free and the processor proceeds at once
  - Otherwise, set the arriving node's locked flag to true
  - Point the predecessor's next field at the arriving node
  - Spin until locked becomes false
[Figure: queue of nodes 1 to 4 with a tail pointer, before and after node 4 arrives; the running node shows Locked: False (Run), each waiting node shows Locked: True (Spin)]
MCS Lock
• Based on a linked list
• Release
  - Check whether this node's next pointer is set (that is, whether a successor has finished enqueuing itself)
  - If it is set, clear the successor's locked flag
[Figure: queue of nodes 1 to 4 with a tail pointer, before and after a release; the releasing node becomes Locked: False (Finished) and its successor changes from Locked: True (Spin) to Locked: False (Run)]
MCS Lock
• Based on a linked list
• Release
  - Check whether this node's next pointer is set (that is, whether a successor has finished enqueuing itself)
  - If it is not set, check whether tail still points to this node (compare&swap tail with null); if so, the queue is now empty
  - If tail points elsewhere, a successor is still enqueuing: wait until the next pointer is set
  - Then clear the successor's locked flag
[Figure: three snapshots of the tail pointer and nodes 1 and 2 during this release path]
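The acquire and release steps on the MCS Lock slides can be put together as the following C11 sketch. It assumes `atomic_exchange` for fetch&store and `atomic_compare_exchange_strong` for compare&swap; the type and function names are hypothetical, and the memory orderings are one reasonable choice rather than anything stated in the slides. Each caller supplies its own queue node, which is what lets it spin on a local flag.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, or NULL */
    atomic_bool locked;                /* true while this node must spin */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

static void mcs_init(mcs_lock_t *l) { atomic_init(&l->tail, NULL); }

static void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    /* Set locked before publishing the node; it is ignored if we turn
       out to have no predecessor. */
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* fetch&store: become the new tail and learn who was there before. */
    mcs_node_t *pred = atomic_exchange_explicit(&l->tail, me,
                                                memory_order_acq_rel);
    if (pred == NULL)
        return;                        /* queue was empty: lock is ours */
    atomic_store_explicit(&pred->next, me, memory_order_release);
    /* Local spin: only our own node's flag is polled. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
}

static void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing tail from us back to NULL. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;                    /* queue is empty again */
        /* Someone swapped tail but has not linked in yet: wait for them. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) { }
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

A caller would typically keep one `mcs_node_t` per thread (for example on its stack) and pass the same node to `mcs_acquire` and the matching `mcs_release`.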
MCS Lock: Concurrent Read Version
• Start_Read:
  - If the predecessor is nil or an active reader, reader_count++ (atomic) and proceed
  - Else, spin until another Start_Read or an End_Write unblocks this reader
  - Then unblock the successor, if it is a reader
• End_Read:
  - If the successor is a writer, set next_writer = successor
  - reader_count-- (atomic)
  - If this was the last reader (reader_count == 0), check next_writer and unblock it
• Start_Write:
  - If the predecessor is nil and there is no active reader (reader_count == 0), proceed
  - Else, spin until the last End_Read unblocks this writer
• End_Write:
  - If the successor is a reader, reader_count++ (atomic) and unblock it
Review of Barriers
• Centralized Counter Barrier
  - Every processor updates and then keeps polling a single centralized counter
  - Advantage: simplicity
  - Disadvantage: hot spot, contention
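The slide gives no pseudocode for the centralized barrier, so the following C11 sketch uses one common form, a sense-reversing counter barrier. That choice, and the names `central_barrier_t`, `central_init` and `central_wait`, are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint count;   /* processors that have not yet arrived */
    atomic_bool sense;   /* flips once per barrier episode */
    unsigned nprocs;
} central_barrier_t;

static void central_init(central_barrier_t *b, unsigned nprocs) {
    b->nprocs = nprocs;
    atomic_init(&b->count, nprocs);
    atomic_init(&b->sense, false);
}

/* local_sense is per-processor state and must start out false. */
static void central_wait(central_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub_explicit(&b->count, 1, memory_order_acq_rel) == 1) {
        /* Last arriver: reset the counter first, then release everyone. */
        atomic_store_explicit(&b->count, b->nprocs, memory_order_relaxed);
        atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
    } else {
        /* All waiters poll the same shared word: this is the hot spot. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != *local_sense) { }
    }
}
```

Every processor both decrements and polls shared state here, which is the contention the slide lists as the disadvantage.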
Review of Barriers
• Combining Tree Barrier
  - Advantage: simplicity, less contention, parallelized fetch&increment
  - Disadvantage: still spins on non-local locations
Review of Barriers
• Bidirectional Tournament Barrier
  - The winner of each match is statically determined
  - Advantage: no need for fetch&op; local spinning
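"Statically determined" can be illustrated with a small C sketch of the usual pairing rule, in which processor numbers alone decide each match: in round k (counting from 0) the processor whose index is a multiple of 2^(k+1) wins against the one 2^k above it. The slide does not spell this rule out, so treat it and the names `role_t`, `tournament_role` and `tournament_opponent` as assumptions.

```c
typedef enum { ROLE_IDLE, ROLE_WINNER, ROLE_LOSER } role_t;

/* Role of processor i in round `round` (0 is the first round). */
static role_t tournament_role(unsigned i, unsigned round) {
    unsigned span = 1u << round;
    if (i % span != 0)
        return ROLE_IDLE;              /* already lost in an earlier round */
    return (i % (2 * span) == 0) ? ROLE_WINNER : ROLE_LOSER;
}

/* Opponent of a processor that plays in this round. A winner's opponent
   may be >= the processor count, in which case the winner gets a bye. */
static unsigned tournament_opponent(unsigned i, unsigned round) {
    unsigned span = 1u << round;
    return (i % (2 * span) == 0) ? i + span : i - span;
}
```

Because each processor can compute its opponent in advance, a loser knows exactly which flag to set and a winner knows exactly which local flag to spin on, so no fetch&op is needed.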
Review of Barriers
• Dissemination Barrier
  - Can be understood as a variation of the tournament barrier (partners are statically determined)
  - Suitable for MPI systems
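The static partner schedule of a dissemination barrier can be sketched as below, assuming the standard rule that in round k processor i signals processor (i + 2^k) mod P and that ceil(log2 P) rounds are used. The helper `dissemination_all_informed` is a hypothetical single-threaded check (limited to 32 processors) that simulates who has heard from whom; it is not a barrier implementation.

```c
/* Number of rounds: smallest r with 2^r >= nprocs. */
static unsigned dissemination_rounds(unsigned nprocs) {
    unsigned rounds = 0;
    for (unsigned span = 1; span < nprocs; span *= 2)
        rounds++;
    return rounds;
}

/* Processor that i signals in the given round. */
static unsigned dissemination_partner(unsigned i, unsigned round,
                                      unsigned nprocs) {
    return (i + (1u << round)) % nprocs;
}

/* Simulate the schedule with one bitmask per processor: bit j of know[i]
   means "i has (transitively) heard that j arrived". Returns 1 if every
   processor has heard from every other after the last round. */
static int dissemination_all_informed(unsigned nprocs) {
    unsigned long long know[32], next[32];
    if (nprocs == 0 || nprocs > 32)
        return 0;
    for (unsigned i = 0; i < nprocs; i++)
        know[i] = 1ull << i;
    for (unsigned k = 0; k < dissemination_rounds(nprocs); k++) {
        for (unsigned i = 0; i < nprocs; i++)
            next[i] = know[i];
        for (unsigned i = 0; i < nprocs; i++)
            next[dissemination_partner(i, k, nprocs)] |= know[i];
        for (unsigned i = 0; i < nprocs; i++)
            know[i] = next[i];
    }
    for (unsigned i = 0; i < nprocs; i++)
        if (know[i] != (1ull << nprocs) - 1)
            return 0;
    return 1;
}
```

Unlike the tournament, there is no separate wakeup phase: after the last round every processor already knows that all others have arrived.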
MCS Barriers
• MCS Barrier (Arrival)
  - Similar to the Combining Tree Barrier
  - Local spinning / O(P) space / 2P-2 network transactions / O(log P) critical path
MCS Barriers
• MCS Barrier (Wakeup)
  - Similar to the Combining Tree Barrier
  - Local spinning / O(P) space / 2P-2 network transactions / O(log P) critical path
[Figure: wakeup tree over processors 0 to 5]
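The transaction count for the tree barrier can be checked with a small counting sketch in C. The fan-in of 4 for the arrival tree and fan-out of 2 for the wakeup tree are taken from the published algorithm rather than from these slides, so treat them, along with the implicit heap-style numbering and the function names, as assumptions. The code only counts signals; it does not synchronize anything.

```c
enum { ARRIVAL_FANIN = 4, WAKEUP_FANOUT = 2 };   /* assumed tree shapes */

/* Parent of processor i in each tree; the root (0) has no parent. */
static int arrival_parent(int i) {
    return i == 0 ? -1 : (i - 1) / ARRIVAL_FANIN;
}
static int wakeup_parent(int i) {
    return i == 0 ? -1 : (i - 1) / WAKEUP_FANOUT;
}

/* Total signals sent in one barrier episode. */
static int tree_barrier_signals(int nprocs) {
    int signals = 0;
    /* Arrival: every non-root processor reports once to its parent. */
    for (int i = nprocs - 1; i >= 1; i--)
        signals++;
    /* Wakeup: every processor releases the children that exist. */
    for (int i = 0; i < nprocs; i++)
        for (int c = 1; c <= WAKEUP_FANOUT; c++)
            if (WAKEUP_FANOUT * i + c < nprocs)
                signals++;
    return signals;
}
```

For six processors this gives 5 arrival signals plus 5 wakeup signals, that is 2P-2 = 10.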
Spin Lock Evaluation
• Butterfly machine results
  - Three of the locks scaled badly; four scaled well, and MCS was the best
  - Backoff was effective
Spin Lock Evaluation
• Butterfly machine results
  - Measured the time between consecutive lock acquisitions on separate processors, rather than the time for an acquire/release pair from start to finish
Spin Lock Evaluation
• Symmetry machine results
  - MCS and Anderson scale well
  - The ticket lock cannot be implemented on the Symmetry because it lacks a fetch&increment operation
  - The Symmetry results seem to be more reliable
Spin Lock Evaluation
• Network latency
  - MCS greatly reduces the increase in network latency
  - Local spinning reduces contention
Barrier Evaluation
• Butterfly machine
  - Dissemination was the best
  - Bidirectional tournament and MCS tree were acceptable
  - Remote memory access degrades performance considerably
Barrier Evaluation
• Symmetry machine
  - Counter method was the best
  - Dissemination was the worst
  - Bus-based architecture: cheap broadcast
  - MCS arrival tree outperforms the counter for more than 16 processors
Local Memory Evaluation
• Having local memory is extremely important
• It affects both performance and network contention
• A dance-hall system is not really scalable
Summary
• This paper proposed a scalable spin-lock synchronization algorithm without network contention
• This paper proposed a scalable barrier algorithm
• This paper argued that network contention due to busy-wait synchronization need not be a problem
• This paper argued that hardware for the QOSB lock would not be cost-effective compared with the MCS lock
• This paper suggests the use of distributed memory or coherent caches rather than dance-hall memory without coherent caches
Discussion
• What would be the primary disadvantage of the MCS lock?
• In what cases would the MCS lock perform worse than other locks?
• What do you think about locks based on special-purpose hardware?
• Is the space usage of a lock important?
• Can we benefit from a dance-hall style memory architecture (disaggregated memory)?
• Is there a way to implement an energy-efficient spin lock?