Efficient Synchronization for Non-Uniform Communication Architecture
Uppsala University
Department of Information Technology
Uppsala Architecture Research Team [UART]
Zoran Radovic and Erik Hagersten
{zoran.radovic, erik.hagersten}@it.uu.se
Synchronization Basics
    A := 0
    BARRIER
    LOCK(L)  A := A + 1  UNLOCK(L)
    LOCK(L)  B := A + 5  UNLOCK(L)
• Locks are used to protect the shared critical section data
Uppsala Architecture Research Team (UART)
Simple Spin Locks
Busy-wait/backoff
• test_and_test&set (TATAS), '84
• TATAS with exponential backoff (TATAS_EXP), '90
• Many variations
    TATAS_LOCK(L) {
      while (tas(L)) {      // atomic test&set; returns nonzero if L was BUSY
        while (*L)          // spin on plain reads until L looks FREE
          ;
      }
    }
    TATAS_UNLOCK(L) {
      *L = 0;               // 0 = FREE
    }
[Figure: processors P1 … Pn, each with a cache ($), share one lock word in memory; the cached copies and the memory copy switch between FREE and BUSY]
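The TATAS_EXP variant named on this slide can be sketched with C11 atomics as follows. This is a minimal illustration, not the authors' code: the type `tatas_lock_t`, the `tatas_exp_*` function names, the `spin_delay` helper and the two backoff constants are all assumptions made for the example, and the default sequentially consistent atomics are used for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning constants, chosen only for illustration. */
enum { BACKOFF_BASE = 64, BACKOFF_CAP = 4096 };

typedef struct { atomic_bool busy; } tatas_lock_t;   /* false = FREE, true = BUSY */

void tatas_exp_init(tatas_lock_t *l) {
    assert(l != NULL);
    atomic_init(&l->busy, false);
}

/* Burn roughly n loop iterations without touching shared memory. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* One test&set attempt: returns true if the lock was FREE and is now ours. */
bool tatas_exp_trylock(tatas_lock_t *l) {
    return !atomic_exchange(&l->busy, true);
}

void tatas_exp_lock(tatas_lock_t *l) {
    unsigned backoff = BACKOFF_BASE;
    while (!tatas_exp_trylock(l)) {
        spin_delay(backoff);                      /* back off after a failed attempt */
        if (backoff < BACKOFF_CAP) backoff *= 2;  /* exponential growth, capped */
        while (atomic_load(&l->busy)) { }         /* "test": spin on reads only */
    }
}

void tatas_exp_unlock(tatas_lock_t *l) {
    atomic_store(&l->busy, false);                /* FREE */
}
```

A thread would call `tatas_exp_lock(&l)` before its critical section and `tatas_exp_unlock(&l)` after it. The backoff spreads out retries so that fewer processors hit the lock word at the same moment.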
Performance Under Contention
[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks w/ backoff]
IF (more contention) THEN less efficient CS …
Making it Scalable: Queues …
• First-come, first-served order
• Starvation avoidance
• Maximal fairness
• Reduced traffic
• Queue-based locks
  • HW: QOLB '89
  • SW: MCS '91
  • SW: CLH '93
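To make the queue idea concrete, here is a minimal sketch in the style of the MCS software queue lock listed above, written with C11 atomics. It is an illustration under assumptions, not code from the talk: the names `mcs_node_t`, `mcs_lock_t` and `mcs_*` are chosen for this example, and default sequentially consistent atomics are used throughout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Each waiting thread supplies its own queue node and spins only on it. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, if any */
    atomic_bool locked;                /* true while this thread must wait */
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;   /* NULL = FREE */

void mcs_init(mcs_lock_t *l) {
    assert(l != NULL);
    atomic_init(&l->tail, NULL);
}

void mcs_lock(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* Joining the tail fixes our place in line: first come, first served. */
    mcs_node_t *pred = atomic_exchange(&l->tail, me);
    if (pred == NULL) return;                 /* queue was empty: lock is ours */
    atomic_store(&pred->next, me);            /* link behind the predecessor */
    while (atomic_load(&me->locked)) { }      /* spin on our own node only */
}

void mcs_unlock(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No visible successor: try to mark the lock FREE. */
        if (atomic_compare_exchange_strong(&l->tail, &expected, NULL)) return;
        /* Someone is enqueueing right now: wait for the link to appear. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, false);       /* hand the lock to the successor */
}
```

Because every waiter spins on its own node, a release touches one waiter's flag rather than a word that all processors are polling, which is where the reduced traffic comes from. The strict queue order also means the next holder may sit in another node.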
Queue Locks Under Contention
[Figure: CS cost vs. amount of contention; curves: queue-based locks, spin locks, spin locks w/ backoff]
IF (more contention) THEN constant CS cost …
Non-Uniform Memory Architecture (NUMA)
• Many NUMA optimizations are proposed
  • Page migration
  • Page replication
[Figure: two nodes, each with memory, caches ($) and processors P1 … Pn, connected by a switch; labels 1 and 2 – 10]
Non-Uniform Communication Architecture (NUCA)
[Figure: two nodes, each with memory, caches ($) and processors P1 … Pn, connected by a switch; NUCA ratio labels 1 and 2 – 10]
• NUCA examples (NUCA ratios):
  • 1992: Stanford DASH (~ 4.5)
  • 1996: Sequent NUMA-Q (~ 10)
  • 1999: Sun WildFire (~ 6), our NUCA
  • 2000: Compaq DS-320 (~ 3.5)
  • Future: CMP, SMT (~ 10)
Our Goals
• Design a scalable spin lock that exploits the NUCAs
• Creating node affinity
  • For lock handover
  • For CS data
  • “Stable lock”
• Reducing the traffic compared with the test&set locks
Outline
• Background & Motivation
• NUMA vs. NUCA
• The RH Lock
• Performance Results
• Application Study
• Conclusions
Key Ideas Behind RH Lock
• Minimizing global traffic at lock handover
  • Only one thread per node will try to acquire a remotely owned lock
• Maximizing node locality of NUCAs
  • Hand over the lock to a neighbor in the same node
  • Creates locality for the critical section (CS) data as well
  • Especially good for large CS and high contention
• RH lock in a nutshell:
  • Double TATAS_EXP: one node-local lock + one “global”
The RH Lock Algorithm
Acquire:
    SWAP(my_TID, Lock)
    if (FREE or L_FREE): You've got it!
    else: TATAS(my_TID, Lock) until FREE or L_FREE
    if “REMOTE”: Spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
Release:
    CAS(my_TID, FREE), else L_FREE
[Figure: Cabinet 1 (P1 … P16) and Cabinet 2 (P17 … P32), each with caches ($) and a memory holding its own copies of Lock1 and Lock2; the copies step through FREE, L_FREE, REMOTE and thread IDs as the lock is handed over]
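The acquire and release steps on this slide can be read as the following two-node sketch in C11. It is a reconstruction from the slide's pseudocode under stated assumptions, not the authors' implementation: the numeric encoding of FREE, L_FREE and REMOTE, the `FIRST_TID` offset, the backoff constants, and the names `rh_lock_t`, `rh_init`, `rh_acquire` and `rh_release` are all invented for illustration. Each node has its own copy of the lock word, and the sketch assumes the lock starts out in node 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed encoding: three special values, thread IDs start at FIRST_TID. */
enum { FREE = 0, L_FREE = 1, REMOTE = 2, FIRST_TID = 3 };
enum { BACKOFF_BASE = 64, BACKOFF_CAP = 4096 };   /* hypothetical tuning */

typedef struct { atomic_uint copy[2]; } rh_lock_t;   /* one lock word per node */

void rh_init(rh_lock_t *l) {
    assert(l != NULL);
    atomic_init(&l->copy[0], FREE);      /* the lock starts out in node 0 */
    atomic_init(&l->copy[1], REMOTE);
}

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

static void backoff_step(unsigned *b) {
    spin_delay(*b);
    if (*b < BACKOFF_CAP) *b *= 2;
}

void rh_acquire(rh_lock_t *l, unsigned node, unsigned tid) {
    assert(node < 2);
    unsigned me = FIRST_TID + tid;
    atomic_uint *mine = &l->copy[node];
    atomic_uint *other = &l->copy[1 - node];
    unsigned backoff = BACKOFF_BASE;
    for (;;) {
        unsigned old = atomic_exchange(mine, me);        /* SWAP(my_TID, Lock) */
        if (old == FREE || old == L_FREE) return;        /* got it */
        if (old == REMOTE) {
            /* We are the one thread of this node that spins remotely. */
            unsigned expected = FREE;
            while (!atomic_compare_exchange_strong(other, &expected, REMOTE)) {
                expected = FREE;
                backoff_step(&backoff);
            }
            return;
        }
        /* A neighbor holds or awaits the lock: spin locally while a TID is stored. */
        do { backoff_step(&backoff); } while (atomic_load(mine) >= FIRST_TID);
    }
}

void rh_release(rh_lock_t *l, unsigned node, unsigned tid) {
    unsigned expected = FIRST_TID + tid;
    /* Nobody in our node swapped in a TID: free the lock for everyone. */
    if (!atomic_compare_exchange_strong(&l->copy[node], &expected, FREE))
        atomic_store(&l->copy[node], L_FREE);   /* otherwise hand over locally */
}
```

The release path is what creates node affinity: if a neighbor has swapped its thread ID into the local copy, the CAS fails and the lock is marked L_FREE, which only threads of the same node can pick up. The sketch has no fairness mechanism, so a remote spinner can wait for a long time while the other node stays busy.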
Our NUCA: Sun WildFire
[Figure: two nodes (labeled 14 and 14), each with memory, caches ($) and processors P1 … Pn, connected by a switch; NUCA ratio labels 1 and 6]
NUCA-performance
[Figure: plot not recoverable from the extracted text; node labels 14 and 14]
New Microbenchmark
• More realistic node handoffs for queue-based locks
• Constant number of processors
• Amount of Critical Section (CS) work can be increased
  • we can control the “amount of contention”
    for (i = 0; i < iterations; i++) {
      LOCK(L);
      delay(critical_work);   // CS
      UNLOCK(L);
      static_delay();
      random_delay();
    }
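A runnable approximation of one benchmark thread is sketched below. Everything beyond the loop shape on the slide is an assumption made for the example: the `bench_args_t` struct and its fields, the busy-loop `delay`, the small linear congruential generator used for the random delay, the shared counter (added only so the result can be checked), and the plain test&set lock built on `atomic_flag` that stands in for LOCK/UNLOCK.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-thread parameters; none of these names come from the slides. */
typedef struct {
    atomic_flag *lock;          /* stand-in for L: a plain test&set lock */
    long *shared_counter;       /* data protected by the lock */
    int iterations;
    unsigned critical_work;     /* length of the critical section */
    unsigned static_work;       /* fixed think time outside the CS */
    unsigned max_random_work;   /* upper bound for the random think time */
    unsigned seed;              /* per-thread state of the random generator */
} bench_args_t;

static void delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* Small linear congruential generator so the sketch needs no library RNG. */
static unsigned next_random(unsigned *state) {
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Signature matches a thread start routine so it could be passed to a thread API. */
void *bench_worker(void *p) {
    bench_args_t *a = p;
    assert(a != NULL && a->lock != NULL && a->shared_counter != NULL);
    for (int i = 0; i < a->iterations; i++) {
        while (atomic_flag_test_and_set(a->lock)) { }   /* LOCK(L) */
        delay(a->critical_work);                        /* CS */
        ++*a->shared_counter;
        atomic_flag_clear(a->lock);                     /* UNLOCK(L) */
        delay(a->static_work);                          /* static_delay() */
        delay(next_random(&a->seed) % (a->max_random_work + 1));  /* random_delay() */
    }
    return NULL;
}
```

Raising `critical_work` while keeping the thread count fixed increases the share of time spent inside the lock, which is how the amount of contention is controlled. To compare locks, the two lock calls would be replaced by the lock under test.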
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: plot not recoverable from the extracted text; node labels 14 and 14]
Traffic Measurements
New microbenchmark; critical_work = 1500
[Figure: plot not recoverable from the extracted text]
Application Performance
Raytrace Speedup
[Figure: plot not recoverable from the extracted text]
RH Lock Under Contention
[Figure: CS cost vs. amount of contention; curves: RH lock, spin locks, spin locks w/ backoff, queue-based locks]
Total Traffic: Raytrace
[Figure: plot not recoverable from the extracted text]
Application Performance
28-processor runs
[Figure: plot not recoverable from the extracted text]
Conclusions
• First-come, first-served not desirable for NUCAs
• The RH lock exploits NUCAs by
  • creating locality through CS affinity (stable lock)
  • reducing traffic compared with the test&set locks
• The first lock that performs better under contention
• Global traffic is significantly reduced
• Applications with contended locks scale better with RH locks on NUCAs
Any Drawbacks?
• Proof-of-concept NUCA-aware lock for 2 nodes
• Hard to port to some architectures
  • Memory needs to be allocated/placed in different nodes
  • Lock storage is proportional to #NUCA nodes
• Sensitive to starvation
  • “Non-uniform nature” of the algorithm
  • No mechanism for lowering the risk of starvation
Can We Fix It?
• We propose a new set of NUCA-aware locks
  • Hierarchical Backoff Locks (HBO)
  • Teaser … HPCA-9: Anaheim, California, February 2003
• Portable
• Scalable to many NUCA nodes
• Only cas atomic operations are used
• Only node_id is needed
• Lowers the risk of starvation
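The slide gives only the properties of HBO, so the following is a guess at the general shape such a lock could take, not the published algorithm. It uses a single lock word, only compare-and-swap to acquire, and only a node ID per thread. It assumes that a waiter backs off briefly when the holder is in its own node and longer when the holder is in another node. The names `hbo_lock_t` and `hbo_*` and all backoff constants are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { HBO_FREE = 0 };   /* any other value is the holder's node_id + 1 */
/* Hypothetical backoff bounds: short waits for local holders, long for remote. */
enum { LOCAL_BASE = 32, LOCAL_CAP = 1024, REMOTE_BASE = 256, REMOTE_CAP = 16384 };

typedef struct { atomic_uint owner; } hbo_lock_t;

void hbo_init(hbo_lock_t *l) {
    assert(l != NULL);
    atomic_init(&l->owner, HBO_FREE);
}

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

void hbo_acquire(hbo_lock_t *l, unsigned node_id) {
    unsigned me = node_id + 1;
    unsigned local = LOCAL_BASE, remote = REMOTE_BASE;
    for (;;) {
        unsigned seen = HBO_FREE;
        /* cas(FREE -> my node): on failure, seen holds the current owner's tag. */
        if (atomic_compare_exchange_strong(&l->owner, &seen, me)) return;
        if (seen == me) {                     /* holder is a neighbor: retry soon */
            spin_delay(local);
            if (local < LOCAL_CAP) local *= 2;
        } else {                              /* holder is remote: stay away longer */
            spin_delay(remote);
            if (remote < REMOTE_CAP) remote *= 2;
        }
    }
}

void hbo_release(hbo_lock_t *l) {
    atomic_store(&l->owner, HBO_FREE);
}
```

Because neighbors of the current holder retry sooner than remote threads, the lock tends to stay within a node, while any thread can still win the CAS once the word is FREE. Storage is a single word regardless of the number of nodes.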
UART’s Home Page
http://www.it.uu.se/research/group/uart