
Removing the Overhead from Software-Based Shared Memory



Presentation Transcript


  1. Removing the Overhead from Software-Based Shared Memory Uppsala University, Information Technology, Department of Computer Systems, Uppsala Architecture Research Team [UART] Zoran Radovic and Erik Hagersten {zoranr, eh}@it.uu.se

  2. Problems with Traditional SW-DSMs • Page-sized coherence unit → false sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …] • Protocol agent messaging is slow • Most of the efficiency is lost in interrupt/poll [Figure: an LD x on one node triggers messaging between protocol agents attached to each node's CPUs and memory] Uppsala Architecture Research Team (UART)

  3. Our Proposal: DSZOOM • Run the entire protocol in the requesting processor • No protocol agent communication! • Assumes user-level remote memory access: put, get, and atomics [e.g., InfiniBand℠] • Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S] [Figure: the requestor's LD x runs the protocol itself, issuing atomic, get, and put operations against the home node's memory and directory (DIR)]
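The user-level remote memory operations the slide assumes can be sketched as follows. This is a minimal local simulation: each "node" is just a byte array, and the names `remote_get`, `remote_put`, and `remote_fetch_and_set` are illustrative, not a real interconnect API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for the user-level remote memory operations
   DSZOOM assumes (put, get, and atomics). Here each "node" is a local
   byte array; a real cluster would issue these over a non-coherent
   interconnect such as InfiniBand. */
enum { NODES = 2, NODE_MEM = 1024 };
static uint8_t node_mem[NODES][NODE_MEM];

static void remote_get(int node, size_t addr, void *dst, size_t len)
{
    memcpy(dst, &node_mem[node][addr], len);   /* read remote memory */
}

static void remote_put(int node, size_t addr, const void *src, size_t len)
{
    memcpy(&node_mem[node][addr], src, len);   /* write remote memory */
}

static uint8_t remote_fetch_and_set(int node, size_t addr)
{
    uint8_t old = node_mem[node][addr];        /* atomic on real hardware */
    node_mem[node][addr] = 1;
    return old;
}
```

Because these three primitives complete without involving a CPU on the remote node, the requesting processor can run the whole coherence protocol itself, which is the point of the slide.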

  4. Outline • Motivation • General DSZOOM Overview • DSZOOM-WF Implementation Details • Experimentation Environment • Performance Results • Conclusions

  5. DSZOOM Cluster • DSZOOM Nodes: • Each node consists of an unmodified SMP multiprocessor • SMP hardware keeps coherence among the caches and the memory within each SMP node • DSZOOM Cluster Network: • Non-coherent cluster interconnect • Inexpensive user-level remote memory access • Remote atomic operations [e.g., InfiniBand℠]

  6. Squeezing Protocols into Binaries … • Static binary instrumentation • EEL — machine-independent Executable Editing Library, implemented in C++ • Instrument global LOADs with snippets containing fine-grain access-control checks • Instrument global STOREs with MTAG snippets • Insert calls to coherence protocols implemented in C

  7. Fine-grain Access Control Checks • The “magic” value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S] • Floating-point load example:
1: ld [address], %reg      // original LOAD
2: fcmps %fcc0, %reg, %reg // compare reg with itself
3: fbe,pt %fcc0, hit       // if (reg == reg) goto hit
4: nop
5:                         // call global coherence load routine (C code)
hit: …
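The same check can be sketched in C. A line that is not valid locally holds a "magic" NaN bit pattern, so a self-comparison of the loaded value fails (NaN != NaN) and the slow path runs; the exact bit pattern and the placeholder value fetched by the stand-in slow path are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative magic bit pattern: a quiet NaN marking data that is
   not valid in local memory. The real magic value is chosen by the
   system; this particular pattern is an assumption. */
static const uint64_t MAGIC_BITS = 0x7ff8000000000000ULL;

/* Stand-in for the global coherence load routine: a real one would
   fetch the line from a remote node. 42.0 is just a placeholder. */
static double coherence_load(double *addr)
{
    *addr = 42.0;
    return *addr;
}

static double checked_load(double *addr)
{
    double v = *addr;            /* 1: original LOAD */
    if (v != v)                  /* 2-3: self-compare fails only for NaN */
        v = coherence_load(addr);/* 5: slow path, global coherence load */
    return v;                    /* hit: value was valid locally */
}
```

Note that the `v != v` trick relies on strict IEEE semantics, so such code must not be compiled with options like `-ffast-math` that assume NaNs never occur.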

  8. Blocking Directory Protocols • Originally proposed to simplify the design and verification of HW-DSMs • Eliminates race conditions • DSZOOM implements a distributed version of a blocking protocol [Figure: one DIR_ENTRY per cache line, holding a LOCK bit and per-node presence bits; a MEM_STORE changes the presence bits from 0 1 1 0 1 0 0 0 (several sharers) to 0 0 0 0 0 0 0 1 (the writer is the sole owner); the directory (DIR) is distributed across the nodes' global memory (G_MEM)]
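The directory entry from the figure can be sketched as a lock plus a presence bitmap (an 8-node system is assumed here for illustration). Holding the lock for the whole coherence action is what makes the protocol "blocking" and eliminates races.

```c
#include <stdint.h>

/* Sketch of the per-cache-line directory entry from the slide:
   a lock plus one presence bit per node (8 nodes assumed). */
typedef struct {
    uint8_t lock;      /* held while a coherence action is in flight */
    uint8_t presence;  /* bit i set => node i holds a copy of the line */
} dir_entry;

/* MEM_STORE transition shown on the slide: invalidate all sharers
   and leave the writing node as the only one with a copy. */
static void mem_store(dir_entry *e, int writer_node)
{
    while (__sync_lock_test_and_set(&e->lock, 1))
        ;                                        /* f&s until lock held */
    e->presence = (uint8_t)(1u << writer_node);  /* writer = sole owner */
    __sync_lock_release(&e->lock);               /* end of the action */
}
```

Because every coherence action on a line runs entirely under that line's lock, no other request can observe an intermediate directory state, which is what simplifies verification.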

  9. Global Coherency Action • Read data from the home node: 2-hop read • 1a: f&s on the home directory, 1b: get data, 2: put to update and release the directory [Figure legend: small packet (~10 bytes) vs. large packet (~68 bytes); messages on vs. off the critical path. The requestor's LD x issues 1a f&s and 1b get to the home node's DIR and memory, followed by 2 put]
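The 2-hop read can be sketched as follows. This is a local simulation: the remote f&s, get, and put are modeled as plain accesses to the home node's state, and the struct layout is an assumption for illustration.

```c
#include <stdint.h>

/* Local model of the home node's state for one cache line. */
typedef struct {
    uint8_t lock;       /* directory lock, grabbed with f&s (1a) */
    uint8_t presence;   /* presence bits */
    double  data;       /* home memory copy of the line */
} home_line;

/* 2-hop read: the requesting processor locks the home directory,
   fetches the data, records itself as a sharer, and releases. */
static double two_hop_read(home_line *home, int requestor)
{
    while (__sync_lock_test_and_set(&home->lock, 1))
        ;                                      /* 1a: f&s on the DIR */
    double v = home->data;                     /* 1b: get data */
    home->presence |= (uint8_t)(1u << requestor);
    __sync_lock_release(&home->lock);          /* 2: put, releases DIR */
    return v;
}
```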

  10. Global Coherency Action • Read data modified in a third node: 3-hop read • 1: f&s on the home directory, 2a: f&s on the dirty node's MTAG, 2b: get data from the dirty node, 3a and 3b: puts releasing the MTAG and the directory [Figure: the requestor's LD x goes to the home node's DIR, then to the third node's memory and MTAG]
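The 3-hop case extends the previous sketch: the home directory says the line is dirty in a third node, so the requestor also locks that node's MTAG and gets the data from there. Again a local simulation with an assumed layout.

```c
#include <stdint.h>

/* Home directory for one line: lock plus the id of the dirty node. */
typedef struct { uint8_t lock; int dirty_node; } dir_t;
/* Per-node state: MTAG lock plus that node's copy of the line. */
typedef struct { uint8_t mtag_lock; double data; } node_t;

static double three_hop_read(dir_t *dir, node_t nodes[])
{
    while (__sync_lock_test_and_set(&dir->lock, 1))
        ;                                    /* 1: f&s on the home DIR */
    node_t *owner = &nodes[dir->dirty_node];
    while (__sync_lock_test_and_set(&owner->mtag_lock, 1))
        ;                                    /* 2a: f&s on the MTAG */
    double v = owner->data;                  /* 2b: get data */
    __sync_lock_release(&owner->mtag_lock);  /* 3a: put, releases MTAG */
    __sync_lock_release(&dir->lock);         /* 3b: put, releases DIR */
    return v;
}
```

All three hops are driven by the requesting processor alone; neither the home node nor the dirty node runs any protocol code, which is the contrast with the traditional SW-DSMs of slide 2.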

  11. Compilation Process • The unmodified SPLASH-2 application and the DSZOOM-WF implementation of the PARMACS macros are expanded with m4 and compiled with GNU gcc into an “(un)executable” • EEL then instruments it and links in the DSZOOM-WF run-time library and the coherence protocols (C code), producing the final a.out

  12. Instrumentation Performance [chart not transcribed]

  13. Instrumentation Breakdown — Sequential Execution [chart not transcribed]

  14. Current DSZOOM Hardware • Two E6000s connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction • Data migration and coherent memory replication (CMR) are kept inactive • 16 UltraSPARC II (250 MHz) CPUs and 8 GB of memory per node • Memory access times: 330 ns local / 1700 ns remote (lmbench latency) • Run as a 16-way SMP, as a 2×8 CC-NUMA, and as a 2×8 SW-DSM

  15. Process and Memory Distribution [Figure: each process's address space holds private Text & Data, Heap, Stack, and PRIVATE_DATA, plus the global segment G_MEM mapped at 0x80000000; G_MEM is created with shmget (one shmid per cabinet, A and B) and attached to every process with shmat, so Cabinet_1_G_MEM and Cabinet_2_G_MEM are “aliased” into the physical memory of both cabinets; worker processes are created with fork and bound to processors with pset_bind]
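The shmget/shmat/fork pattern in the figure can be sketched with standard System V calls. For portability this sketch lets the kernel choose the attach address and skips the Solaris-specific `pset_bind` processor binding; the slide's fixed 0x80000000 base is not reproduced here.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of the G_MEM setup: create a System V shared segment,
   attach it, and fork a worker that inherits the mapping, so both
   processes see the same global data. Returns 0 on success. */
static int shared_hello(void)
{
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;
    char *g_mem = (char *)shmat(shmid, NULL, 0);  /* attach G_MEM */
    strcpy(g_mem, "hello");                       /* parent writes it */
    pid_t pid = fork();
    if (pid == 0)                                 /* child sees same data */
        _exit(strcmp(g_mem, "hello") == 0 ? 0 : 1);
    int status = 0;
    waitpid(pid, &status, 0);
    shmdt(g_mem);
    shmctl(shmid, IPC_RMID, NULL);                /* free the segment */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A real DSZOOM setup attaches each cabinet's segment at the same fixed virtual address in every process, so global pointers are valid cluster-wide; the kernel-chosen address above is a simplification.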

  16. Results (1) — Execution Times in Seconds (16 CPUs) [chart not transcribed]

  17. Results (2) — Normalized Execution Time Breakdowns (16 CPUs) [chart not transcribed]

  18. Conclusions • DSZOOM completely eliminates asynchronous messaging between protocol agents • Consistently competitive and stable performance in spite of high instrumentation overhead • ~30% slowdown compared to hardware • State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 3–59%

  19. DSZOOM’s Home Page • http://www.it.uu.se/research/group/uart
