Licentiate Thesis Seminar, Uppsala University, 25/9 – 2003
Efficient Synchronization and Coherence for Nonuniform Communication Architectures
Zoran Radovic, zoran.radovic@it.uu.se
Introduction: Cache
• The cache ($) is a “scratch pad” (Swedish: kladdpapper) for the processor (P)
• Program: A = 5; B = A + 75
• Values: A: 5, B: 80
[Diagram: P, $, and Memory holding A and B]
Introduction: Cache Coherence
• Workloads: Web server, Database server, etc.
• Programming constructs: BARRIER, LOCK (CS) UNLOCK, Y:=X, A:=0, A:=56
• Program: A = 5; B = A + 75; A = A + 1
• Cache coherence: A: 5 moves by cache-to-cache transfer and becomes A: 6; B: 80
[Diagram: P1, P2, P3 with caches sharing Memory holding A and B]
Inside a Real Thing ...
Nonuniform Memory Access Architecture (NUMA)
• Many NUMA optimizations have been proposed
  • Page migration speeds up accesses to “private” data
  • Page replication speeds up reads of “shared” data
• These do not help communication…
  • E.g., cache-to-cache transfers
[Diagram: two nodes, each with Memory and processors (P) with caches ($), connected by a Switch. Access time ratio: 1 vs. 2 – 10]
Nonuniform Communication Architecture (NUCA)
• A “new” property of NUMAs… NUCA
• NUCA examples (NUCA ratios):
  • 1992: Stanford DASH (~ 4.5)
  • 1996: Sequent NUMA-Q (~ 10)
  • 1999: Sun WildFire (~ 6)
  • 2000: Compaq DS-320 (~ 3.5)
  • Future (Today): CMP, SMT (~ 10)
• NUCA optimizations are becoming important for future architectures!
[Diagram: two nodes, each with Memory and processors (P) with caches ($), connected by a Switch. NUCA ratio: 1 vs. 2 – 10]
Outline
• Introduction
• NUCA Locks
  • Paper A: RH Lock
  • Paper B: HBO Locks
• Beating the Real Thing …
  • Paper C: DSZOOM – Software-based Shared Memory
  • Paper D: THROOM – POSIX Front-end
  • Paper E: SAIT & Write Permission Cache (WPC)
• Contributions
• Future Work
Synchronization Basics
• Example code: A:=0; BARRIER; LOCK(L); A:=A+1; UNLOCK(L) and LOCK(L); B:=A+5; UNLOCK(L)
• Locks are used to protect critical section (CS) data
• CS examples:
  • Bank account status
  • Global counters
  • Number of on-line visitors
  • …
Synchronization Example
• Locks are used to protect critical section (CS) data
• Lock holder: Lock CS flag (write BUSY token to the flag…), Update CS data, Unlock
• Waiters: Test / Spin, Test / Spin, Test / Spin, then Lock CS flag, Update CS data, Unlock
• Timeline annotations: “lock handover”, “CS efficiency”
[Diagram: P1, P2, … P4 with caches ($) sharing Memory, which holds the CS flag and the CS data]
Large System Synchronization
Three problems under contention with Spin (Test&Set) locks:
1) Test and invalidation traffic
2) Lock handover
3) CS efficiency
[Diagram: three nodes (P1 P2 … P4, P5 P6 … P8, P9 P10 … P12), each with caches ($) and Memory, connected by a Switch. One processor at a time performs Lock, Update, Unlock while the others repeat Test]
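The Spin (Test&Set) lock named on this slide can be sketched in C11 atomics as follows. This is a minimal illustration, not code from the thesis; the type and function names (`tas_lock_t`, `tas_try_lock`, and so on) are hypothetical. Every failed attempt is an atomic write to the shared flag, which is the source of the test and invalidation traffic listed above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: false = FREE, true = BUSY. */
typedef struct { atomic_bool busy; } tas_lock_t;

void tas_init(tas_lock_t *l) { atomic_init(&l->busy, false); }

/* One Test&Set attempt: atomically write BUSY to the flag and report
 * whether the flag was FREE before, i.e. whether the caller now owns it. */
bool tas_try_lock(tas_lock_t *l) {
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

/* Spin until an attempt succeeds. Each failed attempt still writes
 * the flag, so every waiter keeps invalidating the other caches. */
void tas_lock(tas_lock_t *l) {
    while (!tas_try_lock(l)) { /* spin */ }
}

/* Release the lock by writing FREE to the flag. */
void tas_unlock(tas_lock_t *l) {
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

A thread would bracket its critical section with `tas_lock(&l)` and `tas_unlock(&l)`, matching LOCK(L) and UNLOCK(L) on the earlier slides.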
Vasaloppet: “Contention Problem in Sweden”
• Traditional cross-country ski race, 90 km …
• 85.6533 km to go…
[Photo with the label CS]
Spin Locks Under Contention
• IF (more contention) THEN less efficient CS …
• “The more important the slower it runs…”
[Plot: Critical Section (CS) Cost vs. Amount of Contention; curves: Spin locks, Spin locks w/ backoff]
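The “Spin locks w/ backoff” curve refers to waiters that pause between attempts. A minimal sketch of one common variant (test-and-test-and-set with exponential backoff) is shown here; the constants `BACKOFF_BASE` and `BACKOFF_CAP` and all names are assumptions chosen for illustration, not values from the thesis.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_BASE 4u     /* assumed initial delay, in loop iterations */
#define BACKOFF_CAP  1024u  /* assumed upper bound on the delay */

typedef struct { atomic_bool busy; } backoff_lock_t;

void backoff_init(backoff_lock_t *l) { atomic_init(&l->busy, false); }

/* Exponential backoff: double the delay, but never exceed the cap. */
unsigned next_backoff(unsigned delay) {
    return delay * 2u > BACKOFF_CAP ? BACKOFF_CAP : delay * 2u;
}

/* Busy-wait for roughly the given number of loop iterations. */
void spin_for(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

void backoff_lock(backoff_lock_t *l) {
    unsigned delay = BACKOFF_BASE;
    for (;;) {
        /* Test with a plain load first; only try the atomic write
         * when the flag looks FREE. */
        if (!atomic_load_explicit(&l->busy, memory_order_relaxed) &&
            !atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
            return;
        spin_for(delay);              /* stay off the interconnect for a while */
        delay = next_backoff(delay);
    }
}

void backoff_unlock(backoff_lock_t *l) {
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

Backoff reduces traffic, but a waiter that is asleep when the lock is released delays the handover, which is one reason the CS cost still grows with contention in the plot.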
Making it Scalable: Queues …
• First-come, first-served order
• Starvation avoidance
• Maximal fairness
• Reduced traffic
• Queue-based locks
  • HW: QOLB ‘89
  • SW: MCS ‘91
  • SW: CLH ‘93
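To make the queue idea concrete, here is a compact sketch of an MCS-style lock in C11 atomics. It follows the well-known structure (each waiter enqueues its own node and spins on a flag in that node), but the names and memory orderings are choices made for this illustration rather than anything taken from the slides.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiting thread; each thread spins on its own node. */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
} mcs_node_t;

typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

void mcs_init(mcs_lock_t *l) { atomic_init(&l->tail, NULL); }

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append this node to the queue: first come, first served. */
    mcs_node_t *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev != NULL) {
        /* Link behind the predecessor, then spin on the local flag only. */
        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) { }
    }
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to mark the queue empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is in the middle of enqueueing; wait for its link. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL) { }
    }
    /* Hand the lock directly to the next waiter in the queue. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Because the handover goes strictly to the next node in the queue, the lock ignores which node of the machine the successor runs on.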
Queue-based Locks
• IF (more contention) THEN constant CS cost …
[Plot: CS Cost vs. Amount of Contention; curves: Spin locks, Spin locks w/ backoff, Queue-based locks]
Raytrace Speedup on Sun WildFire (WF), NUCA Ratio = 6
[Figure: speedup plot; labels: 14, 14, WF]
This Thesis
• IF (more contention) THEN more efficient CS …
• “The more important the faster it runs…”
[Plot: CS Cost vs. Amount of Contention; curves: Spin locks, Spin locks w/ backoff, Queue-based locks, NUCA locks]
Raytrace Speedup on Sun WildFire (WF)
[Figure: speedup plot including NUCA Locks; labels: 14, 14, WF]
NUCA Locks
1) Reduces traffic (one CPU per node is testing…)
2) Improves lock handover
3) More efficient CS (local traffic is cheaper)
[Diagram: three nodes, each with processors (P), caches ($), and Memory, connected by a Switch. Lock/Unlock stays within a node while only a few processors repeat Test]
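One way to picture a node-aware lock is to store the holder's node id in the lock word and let waiters back off for a shorter time when the holder is in their own node. The sketch below is only an illustration loosely modeled on that hierarchical backoff idea; it is not the RH or HBO algorithm from the papers, and the names and the constants `LOCAL_BACKOFF` and `REMOTE_BACKOFF` are assumptions.

```c
#include <stdatomic.h>

#define NUCA_FREE      (-1)   /* lock word value when nobody holds the lock */
#define LOCAL_BACKOFF  16u    /* assumed delay when the holder is in my node */
#define REMOTE_BACKOFF 256u   /* assumed delay when the holder is in another node */

/* The lock word stores the node id of the current holder, or NUCA_FREE. */
typedef struct { atomic_int holder_node; } nuca_lock_t;

void nuca_init(nuca_lock_t *l) { atomic_init(&l->holder_node, NUCA_FREE); }

/* Returns NUCA_FREE on success, otherwise the node id of the current holder. */
int nuca_try_lock(nuca_lock_t *l, int my_node) {
    int expected = NUCA_FREE;
    if (atomic_compare_exchange_strong_explicit(&l->holder_node, &expected, my_node,
            memory_order_acquire, memory_order_relaxed))
        return NUCA_FREE;
    return expected;  /* the failed compare-and-swap reports who holds the lock */
}

/* Waiters in the holder's node retry sooner, so the lock tends to be
 * handed over inside a node, where traffic is cheaper. */
unsigned nuca_backoff(int holder_node, int my_node) {
    return holder_node == my_node ? LOCAL_BACKOFF : REMOTE_BACKOFF;
}

void nuca_lock(nuca_lock_t *l, int my_node) {
    int holder;
    while ((holder = nuca_try_lock(l, my_node)) != NUCA_FREE) {
        unsigned delay = nuca_backoff(holder, my_node);
        for (volatile unsigned i = 0; i < delay; i++) { }
    }
}

void nuca_unlock(nuca_lock_t *l) {
    atomic_store_explicit(&l->holder_node, NUCA_FREE, memory_order_release);
}
```

This sketch deliberately gives up strict first-come, first-served order in exchange for node-local handovers; it contains no mechanism against starvation of remote nodes.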
Application Performance: Raytrace Speedup
[Figure: speedup plot; label: WF]
Application Performance: Raytrace Speedup
[Figure: speedup plot; labels: RH Lock, WF]
Total Traffic: Raytrace
Servers vs. Clusters
[Diagram: the constructs BARRIER, LOCK (CS) UNLOCK, Y:=X, A:=0, A:=56 shown twice, once per system type, with a “?”]
Popular Solutions
• Solution 1: more hardware (HW-DSM)
  • Transparent for programmers
  • Usually good scalability
  • Expensive, hard verification, long time to market …
• Solution 2: simple HW + software (SW-DSM)
  • Can use more complex (adaptive) protocols
  • Traditionally poor scalability for many programs
  • Shorter time to market, simple to upgrade/customize
The DSZOOM proposal
DSZOOM Cluster
• DSZOOM Nodes:
  • Each node consists of an unmodified workstation/server
  • The server’s hardware provides memory protocols for caches and memory within each machine
+
• DSZOOM Cluster Network:
  • “Standard” and fast cluster interconnect
  • Inexpensive user-level remote memory access
+
• DSZOOM software
  • Memory protocols between nodes, synchronization
Problems with Traditional SW-DSMs
• Large coherence units (4-8 kB)
  • False sharing! Weaker memory models [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]
• Protocol agent messaging is slow
  • Most efficiency lost in interrupt/poll
[Diagram: two nodes, each with CPUs, Mem, and a protocol agent (Prot. agent); a CPU issues LD a]
Our proposal: DSZOOM
• Run the entire protocol in the requesting processor
  • No protocol agent communication!
• Assumes user-level remote memory access
  • put, get, and atomics [InfiniBand]
• Fine-grain memory protocols (64 bytes)
  • Hardware-like memory models [Shasta, Blizzard, Sirocco]
Global Coherency Action
Read data modified in a third node: 3-hop read
• “Blocking directory” protocol
• Steps: 1. atomic; 2a. atomic; 2b. get data; 3a. put; 3b. put
[Diagram labels: Node 1, Node 2, Node 3, Requestor (LD a), DIR, Mem, Write Perm.]
Squeezing protocols into binaries…
Original Program:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...
    ld [%o1 + 64], %o0
DSZOOM Program:
    ...
    cmp %g0, %l5
    bne 0x24431
    nop
    ldd [%o0 + 16], %f4
    clr %l5
    ...
    ld [%o1 + 64], %o0      (Fast-path Protocol Code starts here)
    mov 255, %g6
    and %g6, %o0, %g6
    cmp %g6, 170
    bne 0x24450
    nop
    Slow-path Protocol Code (C-code)
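Read as C, the fast-path sequence on this slide masks the loaded value with 255 and compares the result with 170; only when they match does execution fall through to the slow path. The sketch below is one reading of that instruction sequence, not the actual DSZOOM source. The function names are hypothetical, and `demo_slow_path` is a stand-in that only counts calls, where the real slow path would run the coherence protocol.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAGIC_MASK  255u  /* from "mov 255, %g6" and "and %g6, %o0, %g6" */
#define MAGIC_VALUE 170u  /* from "cmp %g6, 170" */

/* Fast-path test: does the low byte of the loaded value equal the magic value? */
bool needs_slow_path(uint32_t loaded) {
    return (loaded & MAGIC_MASK) == MAGIC_VALUE;
}

/* Hypothetical signature for the slow-path protocol code (C-code on the slide). */
typedef uint32_t (*slow_path_fn)(const uint32_t *addr);

/* Instrumented load: perform the original load, then run the fast-path check. */
uint32_t checked_load(const uint32_t *addr, slow_path_fn slow_path) {
    uint32_t value = *addr;          /* ld [%o1 + 64], %o0 */
    if (!needs_slow_path(value))     /* bne 0x24450: skip the slow path */
        return value;
    return slow_path(addr);          /* fall through to the protocol code */
}

/* Stand-in slow path for demonstration only: counts calls and re-reads memory. */
unsigned slow_path_calls = 0;
uint32_t demo_slow_path(const uint32_t *addr) {
    slow_path_calls++;
    return *addr;
}
```

Under this reading, the common case costs only a mask, a compare, and a branch per load, and a value that merely happens to end in the magic byte is resolved by the slow path.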
Compilation Process
[Flow diagram labels: Unmodified Application, Parallel Programming Constructs, GNU gcc, link, DSZOOM Run-Time Library, (Un)executable, Memory Protocols (C-code), EEL, a.out]
Results: Execution Times in Seconds (16 CPUs)
[Figure: bar chart; legend: HW, SW, DSZOOM; configuration labels: 8 8 8 8 16]
THROOM: Towards Higher Transparency …
[Flow diagram labels, DSZOOM: Unmodified Application, Parallel Programming Constructs, GNU gcc, link, DSZOOM Run-Time Library, (Un)executable, Memory Protocols (C-code), EEL, a.out]
[Flow diagram labels, THROOM: Unmodified POSIX thread (Pthread) Application, Memory Protocols (C-code), EEL, a.out]
Transparent runtime support:
-- memory allocation
-- thread creation / termination
-- synchronization
-- I/O …
SAIT Overview
• SAIT = SPARC Assembler Instrumentation Tool
• Instrument assembler files
• More information about programs is available
• Support for liveness analysis
• Used in several UART projects!
[Flow diagram labels: Source File, cc, calls, .s assembler output, SAIT, snippets.txt, .s instrumented assembler, User Library (e.g., protocols), link, ld, a.out]
Write Permission Cache (WPC)
• Store instrumentation is expensive…
• Example: Store A raises the question “Write Permission?” for A; Memory records Write permission: A, B, D
[Diagram: processors (P) with WPCs, and Memory]
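As a rough illustration of why a small write permission cache can help, the sketch below keeps one remembered line for which write permission is already held, so repeated stores to that line skip the expensive check. This is an assumption-laden model, not the thesis implementation: the names are hypothetical, the 64-byte line size is taken from the DSZOOM slide, and the expensive path is only counted rather than performed.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u         /* coherence unit size from the DSZOOM slide */
#define WPC_EMPTY  UINT64_MAX  /* marker for "no line cached yet" */

/* Hypothetical 1-entry WPC: remembers one line that may be written freely. */
typedef struct {
    uint64_t line;         /* line number currently held with write permission */
    unsigned slow_checks;  /* how often the expensive permission path ran */
} wpc_t;

void wpc_init(wpc_t *w) {
    w->line = WPC_EMPTY;
    w->slow_checks = 0;
}

/* Called before a store to addr. Returns true on a WPC hit. */
bool wpc_store(wpc_t *w, uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    if (w->line == line)
        return true;       /* hit: permission already held, cheap compare only */
    /* Miss: a real system would give up the old line and obtain write
     * permission for the new one; this sketch only counts the event. */
    w->slow_checks++;
    w->line = line;
    return false;
}
```

With a single entry, stores that alternate between two lines miss every time, which is the kind of pattern a second entry would absorb.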
Contributions
• Nonuniform Communication Architecture (NUCA)
• Several NUCA locks that exploit NUCAs:
  • RH lock
  • Three HBO locks
• DSZOOM: Novel SW-DSM system
• THROOM: Supporting POSIX binaries on clusters
• SAIT: SPARC Assembler Instrumentation Tool
• WPC: Write Permission Cache
Future Work
• NUCA locks for the DSZOOM system
• Instrumentation optimizations
  • Compiler support
  • Optimizing backend
• Further WPC studies/optimizations
• Protocol optimizations
  • Adaptive Invalidate/Update
  • “Push based” protocols
Fairness Study: 2-node Sun WildFire, 28 CPUs
[Figure]
Traditional Microbenchmark
• For each thread:
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        /* null/small Critical Section */
        UNLOCK(L);
    }
Lock Performance: Traditional microbenchmark
[Figure; label: WF]
New Microbenchmark
• More realistic node handoffs for queue-locks
• Constant number of processors
• Control the “amount of contention”
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        delay(critical_work); // CS
        UNLOCK(L);
        static_delay();
        random_delay();
    }
Performance Results: New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure; labels: 14, 14, WF]
Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Figure; labels: 8 8, SW, EEL]
Instrumentation Performance
1-entry WPC
2-entry WPC