
Licentiate Thesis Seminar, Uppsala University, 25/9 – 2003. Efficient Synchronization and Coherence for Nonuniform Communication Architectures. Zoran Radovic, zoran.radovic@it.uu.se


Presentation Transcript


  1. Licentiate Thesis Seminar, Uppsala University, 25/9 – 2003. Efficient Synchronization and Coherence for Nonuniform Communication Architectures. Zoran Radovic, zoran.radovic@it.uu.se

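  2. Introduction: Cache • The cache ($) acts as a “scratch pad” (Swedish: kladdpapper) [Diagram: processor P with a cache ($) in front of memory holding A and B. Code shown: A = 5; B = A + 75. Values shown: A: 5, B: 80.]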

  3. Introduction: Cache Coherence [Diagram: processors P1, P2, and P3 (web server, database server, etc.) over shared memory holding A and B. Code shown: A = 5; B = A + 75; A = A + 1. Values shown: A: 5 moved by a cache-to-cache transfer, then A: 6; B: 80. Program constructs shown: BARRIER, LOCK (CS) UNLOCK, Y:=X, A:=0, A:=56.]

  4. Inside a Real Thing ...

  5. Nonuniform Memory Access Architecture (NUMA) • Many NUMA optimizations have been proposed • Page migration → speeds up accesses to “private” data • Page replication → speeds up reads of “shared” data • These do not help communication… • E.g., cache-to-cache transfers [Diagram: two groups of processors (P) with caches ($), each group with its own memory, connected by a switch. Access time ratio: 1 vs. 2 – 10.]

  6. Nonuniform Communication Architecture (NUCA) • A “new” property of NUMAs… → NUCA • NUCA examples (NUCA ratios): • 1992: Stanford DASH (~ 4.5) • 1996: Sequent NUMA-Q (~ 10) • 1999: Sun WildFire (~ 6) • 2000: Compaq DS-320 (~ 3.5) • Future (Today): CMP, SMT (~ 10) • NUCA optimizations are becoming important for future architectures! [Diagram: two groups of processors (P) with caches ($), each group with its own memory, connected by a switch. NUCA ratio: 1 vs. 2 – 10.]

  7. Outline • Introduction • NUCA Locks • Paper A: RH Lock • Paper B: HBO Locks • Beating the Real Thing … • Paper C: DSZOOM – Software-based Shared Memory • Paper D: THROOM – POSIX Front-end • Paper E: SAIT & Write Permission Cache (WPC) • Contributions • Future Work

  8. Synchronization Basics • Locks are used to protect critical section (CS) data • CS examples: • Bank account status • Global counters • Number of on-line visitors • … [Code shown: A:=0; BARRIER; LOCK(L); A:=A+1; UNLOCK(L); LOCK(L); B:=A+5; UNLOCK(L)]

  9. Synchronization Example • Locks are used to protect critical section (CS) data • Taking the lock: write a BUSY token to the CS flag… [Timeline: one processor runs Lock CS flag, Update CS data, Unlock while the others Test / Spin; after the lock handover the next processor runs Lock CS flag, Update CS data, Unlock. Labels: lock handover, “CS efficiency”.] [Diagram: P1, P2, … P4 with caches ($) over memory holding the CS flag and the CS data.]

  10. Large System Synchronization • Three problems under contention with spin (Test&Set) locks: 1) Test and invalidation traffic 2) Lock handover 3) CS efficiency [Diagram: three nodes (P1, P2, … P4; P5, P6, … P8; P9, P10, … P12), each with caches ($) and memory, connected by a switch. Timeline: one processor at a time runs Lock, Update, Unlock while all the others repeatedly Test.]
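A minimal sketch of the kind of spin lock this slide discusses, assuming C11 atomics. It is a test-and-test&set variant (read first, then atomically set), not code from the thesis; the names `tatas_lock`, `tatas_try_lock`, `tatas_lock_acquire`, and `tatas_unlock` are hypothetical. The "Test" step is the read every waiting processor repeats, and the "Set" step is the write that causes invalidation traffic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 = FREE, 1 = BUSY (the BUSY token in the CS flag). */
typedef struct { atomic_int flag; } tatas_lock;

static void tatas_init(tatas_lock *l) { atomic_init(&l->flag, 0); }

/* One acquisition attempt; returns true if the caller now holds the lock. */
static bool tatas_try_lock(tatas_lock *l) {
    /* "Test": a plain read that can be served from the local cache. */
    if (atomic_load_explicit(&l->flag, memory_order_relaxed) != 0)
        return false;
    /* "Set": the atomic exchange writes BUSY and invalidates other copies. */
    return atomic_exchange_explicit(&l->flag, 1, memory_order_acquire) == 0;
}

/* Spin until the lock is acquired. */
static void tatas_lock_acquire(tatas_lock *l) {
    while (!tatas_try_lock(l)) {
        /* spin */
    }
}

/* Release: write FREE so that a spinning processor can take over. */
static void tatas_unlock(tatas_lock *l) {
    assert(atomic_load_explicit(&l->flag, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->flag, 0, memory_order_release);
}
```

Under contention every release makes all spinners retry at once, which is one way to read problems 1) and 2) on the slide.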

  11. Vasaloppet: “Contention Problem in Sweden” • Traditional cross-country ski race, 90 km … [Photo labels: CS; 85.6533 km to go…]

  12. Spin Locks Under Contention • IF (more contention) THEN less efficient CS … • “The more important the slower it runs…” [Plot: critical section (CS) cost vs. amount of contention. Curves: spin locks, spin locks w/ backoff.]
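The "spin locks w/ backoff" curve refers to locks that wait between failed attempts. A minimal sketch of exponential backoff follows, assuming C11 atomics; the constants `BACKOFF_BASE` and `BACKOFF_CAP` and all function names are illustrative assumptions, not values or code from the thesis.

```c
#include <assert.h>
#include <stdatomic.h>

#define BACKOFF_BASE 64u    /* assumed initial delay, in busy-loop iterations */
#define BACKOFF_CAP  4096u  /* assumed upper bound on the delay */

/* Double the delay after each failed attempt, up to the cap. */
static unsigned next_backoff(unsigned b) {
    assert(b >= BACKOFF_BASE && b <= BACKOFF_CAP);
    return (b * 2u > BACKOFF_CAP) ? BACKOFF_CAP : b * 2u;
}

/* Busy-wait for roughly n loop iterations. */
static void pause_for(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) {
    }
}

/* Acquire the lock (0 = FREE, 1 = BUSY); returns the number of failed attempts. */
static unsigned backoff_lock(atomic_int *flag) {
    unsigned b = BACKOFF_BASE;
    unsigned failures = 0;
    while (atomic_exchange_explicit(flag, 1, memory_order_acquire) != 0) {
        failures++;
        pause_for(b);        /* stay off the interconnect for a while */
        b = next_backoff(b);
    }
    return failures;
}

static void backoff_unlock(atomic_int *flag) {
    atomic_store_explicit(flag, 0, memory_order_release);
}
```

Backoff reduces test traffic, but a waiter may still be sleeping when the lock is released, so the lock handover can get slower as the delay grows.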

  13. Making it Scalable: Queues … • Queue-based locks: HW: QOLB ‘89, SW: MCS ‘91, SW: CLH ‘93 • First-come, first-served order • Starvation avoidance • Maximal fairness • Reduced traffic
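As an illustration of the queue idea, here is a compact sketch in the style of the MCS software queue lock named on this slide, written with C11 atomics. It is a textbook-style rendering under that assumption, not code from the thesis; the type and function names are hypothetical. Each waiter spins on a flag in its own queue node, which gives first-come, first-served order and keeps the spinning local.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiting thread (hypothetical names). */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;               /* true while this waiter must spin */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

static void mcs_init(mcs_lock *l) { atomic_init(&l->tail, NULL); }

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append this node to the queue; the previous tail is the predecessor. */
    mcs_node *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                       /* queue was empty: lock acquired */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    /* Spin on our own node only. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is linking itself in; wait until it is visible. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL) {
        }
    }
    /* Hand the lock directly to the next waiter in line. */
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

Strict queue order means the next holder is whoever arrived next, regardless of which node it runs on.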

  14. Queue-based Locks • IF (more contention) THEN constant CS cost … [Plot: CS cost vs. amount of contention. Curves: spin locks, spin locks w/ backoff, queue-based locks.]

  15. Raytrace Speedup [Plot on Sun WildFire (WF), NUCA ratio = 6. Diagram labels: 14, 14, WF.]

  16. This Thesis • NUCA locks • IF (more contention) THEN more efficient CS … • “The more important the faster it runs…” [Plot: CS cost vs. amount of contention. Curves: spin locks, spin locks w/ backoff, queue-based locks, NUCA locks.]

  17. Raytrace Speedup [Plot on Sun WildFire (WF), with NUCA locks added. Diagram labels: 14, 14, WF.]

  18. NUCA Locks 1) Reduce traffic (one CPU per node is testing…) 2) Improve lock handover 3) More efficient CS (local traffic is cheaper) [Diagram: three nodes of processors (P) with caches ($) and memory, connected by a switch. Timeline labels: Lock/Unlock, Test.]
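One way to picture a lock that favors local handover is a backoff lock whose lock word records the node of the current holder, so that waiters in the same node retry sooner than remote waiters. The sketch below is a simplified illustration of that idea under stated assumptions, using C11 atomics; it does not reproduce the RH or HBO algorithms of the thesis, and the constants and names (`hbo_lock`, `hbo_try`, `hbo_backoff_cap`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define HBO_FREE           (-1)   /* lock word value when nobody holds the lock */
#define BACKOFF_START      16u    /* assumed initial delay */
#define BACKOFF_LOCAL_CAP  256u   /* assumed cap when the holder is in our node */
#define BACKOFF_REMOTE_CAP 8192u  /* assumed cap when the holder is remote */

/* The lock word holds HBO_FREE or the node id of the current holder. */
typedef struct { atomic_int holder; } hbo_lock;

static void hbo_init(hbo_lock *l) { atomic_init(&l->holder, HBO_FREE); }

/* One attempt: returns HBO_FREE on success, else the holder's node id. */
static int hbo_try(hbo_lock *l, int my_node) {
    int expected = HBO_FREE;
    if (atomic_compare_exchange_strong_explicit(&l->holder, &expected, my_node,
            memory_order_acquire, memory_order_relaxed))
        return HBO_FREE;
    return expected;  /* the failed CAS stored the observed holder here */
}

/* Waiters in the holder's node back off less, so they tend to win next. */
static unsigned hbo_backoff_cap(int holder_node, int my_node) {
    return (holder_node == my_node) ? BACKOFF_LOCAL_CAP : BACKOFF_REMOTE_CAP;
}

static void hbo_acquire(hbo_lock *l, int my_node) {
    unsigned b = BACKOFF_START;
    int holder;
    while ((holder = hbo_try(l, my_node)) != HBO_FREE) {
        unsigned cap = hbo_backoff_cap(holder, my_node);
        for (volatile unsigned i = 0; i < b; i++) {
        }
        b = (b * 2u > cap) ? cap : b * 2u;
    }
}

static void hbo_release(hbo_lock *l) {
    atomic_store_explicit(&l->holder, HBO_FREE, memory_order_release);
}
```

Biasing the handover toward the local node trades fairness for cheaper transfers of the lock and the CS data, which matches points 2) and 3) on the slide.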

  19. Application Performance: Raytrace Speedup [Plot; label: WF.]

  20. Application Performance: Raytrace Speedup [Plot; labels: RH Lock, WF.]

  21. Total Traffic: Raytrace

  22. Outline • Introduction • NUCA Locks • Paper A: RH Lock • Paper B: HBO Locks • Beating the Real Thing … • Paper C: DSZOOM – Software-based Shared Memory • Paper D: THROOM – POSIX Front-end • Paper E: SAIT & Write Permission Cache (WPC) • Contributions • Future Work

  23. Servers vs. Clusters [Diagram: the same program constructs (BARRIER, LOCK (CS) UNLOCK, Y:=X, A:=0, A:=56) drawn twice, once per system type, followed by “?”.]

  24. Popular Solutions • Solution 1: more hardware (HW-DSM) • Transparent for programmers • Usually good scalability • Expensive, hard verification, long time to market … • Solution 2: simple HW + software (SW-DSM) • Can use more complex (adaptive) protocols • Traditionally poor scalability for many programs • Shorter time to market, simple to upgrade/customize

  25. The DSZOOM proposal

  26. DSZOOM Cluster • DSZOOM Nodes: • Each node consists of an unmodified workstation/server • The server’s hardware provides memory protocols for caches and memory within each machine + • DSZOOM Cluster Network: • “Standard” and fast cluster interconnect • Inexpensive user-level remote memory access + • DSZOOM software: • Memory protocols between nodes, synchronization

  27. Problems with Traditional SW-DSMs • Large coherence units (4-8 kB) • False sharing! → Weaker memory models [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …] • Protocol agent messaging is slow • Most efficiency is lost in interrupt/poll [Diagram: two nodes, each with CPUs, memory (Mem), and a protocol agent (Prot. agent); a CPU issues LD a.]

  28. Our proposal: DSZOOM • Run the entire protocol in the requesting processor • No protocol agent communication! • Assumes user-level remote memory access • put, get, and atomics [→ InfiniBand] • Fine-grain memory protocols (64 bytes) • Hardware-like memory models [Shasta, Blizzard, Sirocco]

  29. Global Coherency Action: Read data modified in a third node (3-hop read) • “Blocking directory” protocol • Steps shown: 1. atomic, 2a. atomic, 2b. get data, 3a. put, 3b. put [Diagram labels: Node 1, Node 2, Node 3, Requestor, LD a, DIR, Mem, Write Perm.]

  30. Squeezing protocols into binaries…
      Original Program and DSZOOM Program (both listings show):
        ...
        cmp %g0, %l5
        bne 0x24431
        nop
        ldd [%o0 + 16], %f4
        clr %l5
        ...
      Load being instrumented:
        ld [%o1 + 64], %o0
      Fast-path Protocol Code (added in the DSZOOM Program):
        mov 255, %g6
        and %g6, %o0, %g6
        cmp %g6, 170
        bne 0x24450
        nop
      Slow-path Protocol Code (C-code)
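The fast-path snippet can be read as a cheap check on the value just loaded. The C sketch below mirrors the mask (255) and the compared constant (170) from the listing. It assumes that a match sends execution to the slow-path protocol code and that a mismatch (the `bne`) skips it; the slide itself does not state that. `demo_slow_path` is a hypothetical stand-in for the C protocol code, and all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAST_PATH_MASK 255u  /* from "mov 255, %g6" and "and %g6, %o0, %g6" */
#define FAST_PATH_MARK 170u  /* from "cmp %g6, 170" */

/* Assumed meaning: a low byte equal to 170 means the slow path must run. */
static bool load_needs_slow_path(uint32_t loaded) {
    return (loaded & FAST_PATH_MASK) == FAST_PATH_MARK;
}

/* Signature of a slow-path handler that returns the valid value for addr. */
typedef uint32_t (*slow_path_fn)(const uint32_t *addr);

/* Hypothetical stand-in for the slow-path protocol code; returns a dummy value. */
static uint32_t demo_slow_path(const uint32_t *addr) {
    (void)addr;
    return 7u;
}

/* An instrumented load: the original load followed by the fast-path check. */
static uint32_t instrumented_load(const uint32_t *addr, slow_path_fn slow) {
    assert(slow != NULL);
    uint32_t v = *addr;            /* corresponds to "ld [%o1 + 64], %o0" */
    if (load_needs_slow_path(v))   /* the bne skips this when it does not match */
        v = slow(addr);
    return v;
}
```

In the common case the added cost is a mask, a compare, and an untaken call, which is why the check is kept as inline assembly while the rest of the protocol stays in C.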

  31. Compilation Process [Diagram labels: Unmodified Application, Parallel Programming Constructs, GNU gcc, link, DSZOOM Run-Time Library, (Un)executable, Memory Protocols (C-code), EEL, a.out.]

  32. Results: Execution Times in Seconds (16 CPUs) [Plot; labels: 8, 8, 8, 8, 16, HW, SW, DSZOOM.]

  33. Outline • Introduction • NUCA Locks • Paper A: RH Lock • Paper B: HBO Locks • Beating the Real Thing … • Paper C: DSZOOM – Software-based Shared Memory • Paper D: THROOM – POSIX Front-end • Paper E: SAIT & Write Permission Cache (WPC) • Contributions • Future Work

  34. THROOM: Towards Higher Transparency … • Transparent runtime support: • memory allocation • thread creation / termination • synchronization • I/O … [Diagram, DSZOOM flow labels: Unmodified Application, Parallel Programming Constructs, GNU gcc, link, DSZOOM Run-Time Library, (Un)executable, Memory Protocols (C-code), EEL, a.out. THROOM flow labels: Unmodified POSIX thread (Pthread) Application, Memory Protocols (C-code), EEL, a.out.]

  35. SAIT Overview • SAIT = SPARC Assembler Instrumentation Tool • Instruments assembler files • More information about programs is available • Support for liveness analysis • Used in several UART projects! [Diagram labels: Source File, cc, .s assembler output, SAIT, snippets.txt, .s instrumented assembler, calls, User Library (e.g., protocols), ld, link, a.out.]

  36. Write Permission Cache (WPC) • Store instrumentation is expensive… [Diagram: processors (P), each with a WPC, over memory. A “Store A” raises the question “Write permission? A”. Write permission shown: A, B, D.]
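A rough sketch of how a 1-entry write permission cache could cut the cost of store instrumentation: remember the last coherence unit for which write permission was obtained, and skip the expensive check when the next store hits the same unit. This is an assumption-laden illustration, not the thesis design. The 64-byte line size is taken from the DSZOOM slide, while the structure, names, and the omitted acquire/release steps are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u           /* fine-grain coherence unit (64 bytes) */
#define WPC_EMPTY  UINTPTR_MAX   /* marker for "no permission cached" */

/* Hypothetical 1-entry write permission cache, one per thread. */
typedef struct {
    uintptr_t line;   /* line number whose write permission is held */
    unsigned hits;    /* stores that skipped the full permission check */
    unsigned misses;  /* stores that needed the full permission check */
} wpc1;

static void wpc_init(wpc1 *c) {
    c->line = WPC_EMPTY;
    c->hits = 0;
    c->misses = 0;
}

/* Called before a store to addr; returns true on a WPC hit. */
static bool wpc_store_check(wpc1 *c, uintptr_t addr) {
    uintptr_t line = addr / LINE_BYTES;
    if (c->line == line) {
        c->hits++;    /* permission already held: store can proceed directly */
        return true;
    }
    /* Miss: a real system would release the permission held for c->line
       and acquire it for the new line here (omitted in this sketch). */
    c->misses++;
    c->line = line;
    return false;
}
```

Stores with spatial locality hit the same line repeatedly, so most of them would pay only for a compare.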

  37. Contributions • Nonuniform Communication Architecture (NUCA) • Several NUCA locks that exploit NUCAs: • RH lock • Three HBO locks • DSZOOM: Novel SW-DSM system • THROOM: Supporting POSIX binaries on clusters • SAIT: SPARC Assembler Instrumentation Tool • WPC: Write Permission Cache

  38. Future Work • NUCA locks for the DSZOOM system • Instrumentation optimizations • Compiler support • Optimizing backend • Further WPC studies/optimizations • Protocol optimizations • Adaptive Invalidate/Update • “Push based” protocols

  39. Licentiate Thesis Seminar, Uppsala University, 25/9 – 2003. Efficient Synchronization and Coherence for Nonuniform Communication Architectures. Zoran Radovic, zoran.radovic@it.uu.se

  40. Fairness Study: 2-node Sun WildFire, 28 CPUs

  41. Traditional Microbenchmark • For each thread:
        for (i = 0; i < iterations; i++) {
          LOCK(L);
          /* null/small Critical Section */
          UNLOCK(L);
        }

  42. Lock Performance: Traditional microbenchmark [Plot; label: WF.]

  43. New Microbenchmark • More realistic node handoffs for queue-locks • Constant number of processors • Control the “amount of contention”
        for (i = 0; i < iterations; i++) {
          LOCK(L);
          delay(critical_work); // CS
          UNLOCK(L);
          static_delay();
          random_delay();
        }
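A self-contained sketch of the per-thread loop from the new microbenchmark, assuming C11 atomics. The slide does not define `LOCK`, `UNLOCK`, `delay`, `static_delay`, or `random_delay`, so here the lock is a plain exchange-based spin lock, the delays are busy loops, and the random delay uses a small linear congruential generator. All of these are stand-ins rather than the thesis implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared benchmark state (hypothetical layout). */
typedef struct {
    atomic_int lock;        /* 0 = FREE, 1 = BUSY; stand-in for L */
    unsigned long counter;  /* CS data, only touched while holding the lock */
} bench_state;

static void bench_init(bench_state *s) {
    atomic_init(&s->lock, 0);
    s->counter = 0;
}

/* Busy-wait for roughly n loop iterations. */
static void delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) {
    }
}

/* Small LCG so that each thread can have its own random delay stream. */
static uint32_t next_random(uint32_t *seed) {
    *seed = *seed * 1664525u + 1013904223u;
    return *seed >> 16;
}

/* Body that each benchmark thread would run. */
static void bench_thread(bench_state *s, unsigned iterations,
                         unsigned critical_work, unsigned static_work,
                         unsigned random_max, uint32_t seed) {
    assert(random_max > 0);
    for (unsigned i = 0; i < iterations; i++) {
        while (atomic_exchange_explicit(&s->lock, 1, memory_order_acquire) != 0) {
        }                                         /* LOCK(L) */
        delay(critical_work);                     /* CS */
        s->counter++;
        atomic_store_explicit(&s->lock, 0, memory_order_release); /* UNLOCK(L) */
        delay(static_work);                       /* static_delay() */
        delay(next_random(&seed) % random_max);   /* random_delay() */
    }
}
```

Raising `critical_work` relative to the two non-critical delays increases the fraction of time threads spend wanting the lock, which is one way to control the "amount of contention" with a constant number of processors.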

  44. Performance Results: New microbenchmark, 2-node Sun WildFire, 28 CPUs [Diagram labels: 14, 14, WF.]

  45. Results (2): Normalized Execution Time Breakdowns (16 CPUs) [Plot; labels: 8, 8, SW, EEL.]

  46. Instrumentation Performance

  47. 1-entry WPC

  48. 2-entry WPC
