
Dissertation Seminar, 18/11 – 2005, Auditorium Minus, Museum Gustavianum. Software Techniques for Distributed Shared Memory. Zoran Radovic, zoran.radovic@it.uu.se. Outline: NUCA Locks; DSZOOM – Software-based Shared Memory; TMA – Trap-based Memory Architecture.


Presentation Transcript


  1. Dissertation Seminar, 18/11 – 2005 Auditorium Minus, Museum Gustavianum Software Techniques for Distributed Shared Memory Zoran Radovic zoran.radovic@it.uu.se Dissertation Seminar

  2. Outline
     • NUCA Locks
     • DSZOOM – Software-based Shared Memory
     • TMA – Trap-based Memory Architecture

  3. Vasaloppet: “Contention Problem in Sweden”
     • Traditional cross-country ski race, 90 km
     [Figure labels: “85.6533 km to go…”, CS]

  4. Spin Locks under Contention
     • IF (more contention) → THEN less efficient CS …
     • “The more important the slower it runs…”
     [Figure: critical section (CS) cost vs. amount of contention; curves: spin locks, spin locks with backoff]

  5. Queue-based Locks
     • IF (more contention) → THEN constant CS cost …
     [Figure: CS cost vs. amount of contention; curves: queue-based locks, spin locks, spin locks with backoff]
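
The slide does not say which queue-based lock was measured. As an illustration only, the sketch below shows the well-known MCS queue lock in C11 atomics. Each waiter spins on a flag in its own queue node, so a handover touches one waiter instead of all of them, which is the usual explanation for a roughly constant critical-section cost under contention. The names `mcs_lock`, `mcs_node`, `mcs_init`, `mcs_acquire`, and `mcs_release` are chosen for this example and do not come from the talk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiting thread (illustrative names, not from the talk). */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the queue, if any */
    atomic_bool locked;                /* true while this waiter must spin */
} mcs_node;

typedef struct {
    _Atomic(mcs_node *) tail;          /* last node in the queue, NULL if free */
} mcs_lock;

void mcs_init(mcs_lock *l) { atomic_init(&l->tail, (mcs_node *)NULL); }

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    assert(me != NULL);
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    /* Append ourselves to the queue; the previous tail is our predecessor. */
    mcs_node *prev = atomic_exchange(&l->tail, me);
    if (prev != NULL) {
        atomic_store(&prev->next, me);
        /* Spin on our own flag only; other waiters spin on theirs. */
        while (atomic_load(&me->locked)) { }
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* No visible successor: try to mark the lock free. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&l->tail, &expected, (mcs_node *)NULL))
            return;
        /* Someone is enqueueing; wait until it links itself behind us. */
        while ((succ = atomic_load(&me->next)) == NULL) { }
    }
    atomic_store(&succ->locked, false);   /* hand the lock to the successor */
}
```

Each thread passes its own `mcs_node` to both calls; the node must stay alive until `mcs_release` returns.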

  6. This Dissertation
     • IF (more contention) → THEN more efficient CS …
     • “The more important the faster it runs…”
     [Figure: CS cost vs. amount of contention; curves: NUCA locks, spin locks, spin locks with backoff, queue-based locks]

  7. NUCA Locks (Basic Idea)
     1) Reduce traffic - one CPU per node is testing…
     2) Improve lock handover
     3) More efficient CS - local traffic is cheaper
     [Figure: a switch connecting several nodes, each with memory (Memory), caches ($), and processors (P); processors are labeled Lock/Unlock or Test]

  8. The HBO Lock (the simplest HBO)
     Creates communication affinity
     • What do we need?
        • node_id
        • Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
     • lock-acquire:
        • If the lock-value is in the state FREE: the node_id is CAS-ed into the lock location
        • Else, 2 cases:
           • The lock is “local” → spin with small backoff
           • The lock is “remote” → spin with large backoff
     • Simple but fairly effective…
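
The lock-acquire steps on this slide can be sketched in C11 atomics as follows. This is a simplified illustration under stated assumptions, not the dissertation's implementation: the backoff constants, the `FREE` encoding (-1, assuming node ids are non-negative), and the names `hbo_lock`, `hbo_init`, `hbo_acquire`, `hbo_release`, and `spin_delay` are all hypothetical. The lock word holds either `FREE` or the node id of the current holder, so a failed CAS tells the caller whether the holder is on its own node.

```c
#include <assert.h>
#include <stdatomic.h>

#define FREE           (-1)   /* assumption: real node ids are >= 0 */
#define BACKOFF_LOCAL   16    /* hypothetical tuning constants */
#define BACKOFF_REMOTE 256

typedef struct {
    atomic_int owner_node;    /* FREE, or node id of the current holder */
} hbo_lock;

/* Busy-wait for roughly `iters` loop iterations. */
static void spin_delay(int iters) {
    for (volatile int i = 0; i < iters; i++) { }
}

void hbo_init(hbo_lock *l) { atomic_init(&l->owner_node, FREE); }

void hbo_acquire(hbo_lock *l, int my_node) {
    assert(my_node >= 0);
    for (;;) {
        int seen = FREE;
        /* CAS(Lock_address, FREE, node_id): succeeds only if the lock is free. */
        if (atomic_compare_exchange_strong(&l->owner_node, &seen, my_node))
            return;
        /* On failure `seen` holds the holder's node id: back off briefly if
           the holder is on our node, and for longer if it is remote. */
        spin_delay(seen == my_node ? BACKOFF_LOCAL : BACKOFF_REMOTE);
    }
}

void hbo_release(hbo_lock *l) { atomic_store(&l->owner_node, FREE); }
```

Because CPUs on the holder's node retry sooner than remote CPUs, the lock tends to stay within one node for several handovers, which matches the "communication affinity" the slide mentions.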

  9. Performance Results: realistic microbenchmark, 2-node WildFire, 28 CPUs
     • Fairness?
     [Figure labels: 14, 14, WF]

  10. Fairness Study: realistic microbenchmark, 2-node WildFire, 28 CPUs

  11. Application Performance: 28-processor runs
     [Figure annotation: ≈ 4x]

  12. Total Traffic: Raytrace

  13. HBO Locks inside Linux Kernel
     • Patch provided by Silicon Graphics, Inc.
     • Linux-IA64 kernel implementation, May 2005
     • Page-fault handler runs 3x faster
     • 60 processors

  14. Outline
     • NUCA Locks
     • DSZOOM – Software-based Shared Memory
     • TMA – Trap-based Memory Architecture

  15. The DSZOOM Proposal

  16. The DSZOOM Proposal
     • Run entire protocol in requesting-processor
        • No protocol agent communication!
     • Assumes user-level remote memory access
        • put, get, and atomics [→ InfiniBand]
     • Fine-grain memory protocols (e.g., 64 bytes)
     • Hardware-like memory models [Shasta, Blizzard, Sirocco]

  17. “Squeezing” Protocols into Binaries…
     Binary/assembler-level instrumentation
     Surrounding code, shown for both the original program and the DSZOOM program:
        ...
        cmp %g0, %l5
        bne 0x24431
        nop
        ldd [%o0 + 16], %f4
        clr %l5
        ...
     Original program, the load:
        ld [%o1 + 64], %o0
     DSZOOM program, the same load followed by fast-path protocol code:
        ld [%o1 + 64], %o0
        mov 255, %g6
        and %g6, %o0, %g6
        cmp %g6, 170
        bne 0x24450
        nop
     Slow-path protocol code (C-code)
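
The fast-path snippet on this slide masks the loaded value with 255 and compares it with 170. A rough C rendering of that check is sketched below. It assumes that a match with the constant sends execution to the slow path, which the slide does not state explicitly. The names `checked_load`, `slow_path_load`, and `slow_path_calls` are invented for this example, and the slow path here is only a placeholder for the real protocol code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAGIC_BYTE 170u        /* the constant compared in the slide's snippet */

static int slow_path_calls;    /* illustrative counter, not part of DSZOOM */

/* Hypothetical stand-in for the slow-path protocol code (C-code on the
   slide). A real protocol would obtain a valid copy of the data here. */
static uint32_t slow_path_load(const uint32_t *addr) {
    slow_path_calls++;
    return *addr;              /* placeholder only */
}

/* Instrumented load: the original load plus a fast-path check. */
uint32_t checked_load(const uint32_t *addr) {
    assert(addr != NULL);
    uint32_t v = *addr;                   /* ld  [%o1 + 64], %o0            */
    if ((v & 255u) == MAGIC_BYTE)         /* mov 255 / and / cmp ..., 170   */
        return slow_path_load(addr);      /* assumed: a match enters the slow path */
    return v;                             /* common case: no protocol work  */
}
```

Under this reading, ordinary data whose low byte happens to equal the constant would also take the slow path, so the slow path has to cope with such false positives.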

  18. Write Permission Caching
     • Problem: store instrumentation relies on locking
        • More complex instrumentation
     • Solution: write permission cache (WPC)
        • Small and fast software-managed cache
        • Keeps write permissions
     • The WPC idea:
        • Exploit store locality
        • Dynamically reduce the number of memory references in store checking code
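
The WPC idea can be illustrated with a single-entry cache, sketched below in C. This is a minimal illustration under stated assumptions, not DSZOOM's code: it assumes a 64-byte coherence unit and one WPC per thread, and the names `checked_store`, `acquire_write_permission`, `release_write_permission`, `wpc_line`, and `slow_path_calls` are hypothetical. Consecutive stores to the same coherence unit hit in the WPC and skip the protocol entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SHIFT 6                 /* assumption: 64-byte coherence unit */
#define WPC_EMPTY  UINTPTR_MAX       /* marks an unused WPC entry */

static uintptr_t wpc_line = WPC_EMPTY;  /* single-entry WPC (one per thread assumed) */
static int slow_path_calls;             /* illustrative counter */

/* Hypothetical stand-ins for the protocol's slow path. The real code would
   lock the coherence unit's state and obtain or give up write permission. */
static void acquire_write_permission(uintptr_t line) { (void)line; slow_path_calls++; }
static void release_write_permission(uintptr_t line) { (void)line; }

/* Instrumented store: one compare on a WPC hit, protocol work on a miss. */
void checked_store(uint32_t *addr, uint32_t value) {
    assert(addr != NULL);
    uintptr_t line = (uintptr_t)addr >> LINE_SHIFT;
    if (line != wpc_line) {                       /* WPC miss */
        if (wpc_line != WPC_EMPTY)
            release_write_permission(wpc_line);   /* evict the old entry */
        acquire_write_permission(line);
        wpc_line = line;
    }
    *addr = value;                                /* the original store */
}
```

With good store locality, most stores take only the compare and the store itself, which is how the WPC reduces memory references in the store-checking code.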

  19. Other “Features”
     • Two kinds of protocols
        • Invalidate
        • Update
     • Many optimizations
        • Instrumentation scheduling (update and invalidate)
        • Instrumentation batching (invalidate)
        • WPC-based write batching (update)
        • WPC-based dirty-data filtering (update)
        • Private-data filtering (update)
        • # of WPC entries (update and invalidate)
        • Coherence unit size (update and invalidate)

  20. Coherence Flags and Profiling
     • Coherence flags
        • Similar to optimization flags of compilers
        • Possible scenario: gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c
     • Execution profiling
        • Similar to profile feedback of compilers
        • Helps find appropriate coherence flag settings
        • Low-overhead implementation in DSZOOM
           • Less than 30 percent overhead
           • Works for both small and large input sets

  21. DSZOOM Results: 2-node WildFire, 16 CPUs
     [Figure annotations: 1.45x, 1.11x]

  22. Outline
     • NUCA Locks
     • DSZOOM – Software-based Shared Memory
     • TMA – Trap-based Memory Architecture

  23. Instrumentation Drawbacks
     • Binary transparency?
     • Run-time execution overhead
     Surrounding code, shown for both the original program and the DSZOOM program:
        ...
        cmp %g0, %l5
        bne 0x24431
        nop
        ldd [%o0 + 16], %f4
        clr %l5
        ...
     Original program, the load:
        ld [%o1 + 64], %o0
     DSZOOM program, the same load followed by fast-path protocol code:
        ld [%o1 + 64], %o0
        mov 255, %g6
        and %g6, %o0, %g6
        cmp %g6, 170
        bne 0x24450
        nop
     Slow-path protocol code (C-code)

  24. Trap-Based Memory Architectures
     • Basic idea
        • Detect fine-grained coherence violations in hardware
        • Trigger a coherence trap when one occurs
        • Maintain coherence by software protocols
     • No memory system modifications
     • Minimal processor modifications
     • Binary transparency
        • No need to instrument binaries/applications

  25. TMA Lite: Proof-of-concept Implementation
     • Load permission check
        • Hardware implementation of software check
        • Predefined “magic-value” convention
     • Store permission check
        • Hardware WPC
        • Can be seen as a very small cache
        • Operates on virtual addresses
        • Accessed in parallel with the data TLB

  26. TMA Lite Performance [TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]
     [Figure annotations: 1.75x, 1.01x]

  27. Topics not Presented
     • RH lock algorithm
        • Controlled (un)fairness
     • HBO_GT and HBO_GT_SD algorithms
        • Global throttling and starvation detection
     • DSZOOM implementation details
        • Instrumentation challenges: scheduling, batching, etc.
        • Bandwidth filtering techniques: dirty- & private-data
     • Innovative TMA simulation tricks
        • Low-level “good days” hacks
        • Reusing Simics checkpoints

  28. Dissertation Seminar, 18/11 – 2005, Auditorium Minus, Museum Gustavianum. Software Techniques for Distributed Shared Memory. Zoran Radovic, zoran.radovic@it.uu.se
