Removing the Overhead from Software-Based Shared Memory
Uppsala University, Information Technology, Department of Computer Systems
Uppsala Architecture Research Team [UART]
Zoran Radovic and Erik Hagersten
{zoranr, eh}@it.uu.se

Problems with Traditional SW-DSMs
• Page-sized coherence unit
• False sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]
• Protocol agent messaging is slow
• Most efficiency lost in interrupt/poll
[Figure: two nodes, each with Mem, a protocol agent (Prot. agent), and CPUs; one CPU issues LD x]
Uppsala Architecture Research Team (UART)

Our proposal: DSZOOM
• Run the entire protocol in the requesting processor
• No protocol agent communication!
• Assumes user-level remote memory access
• put, get, and atomics [InfiniBand℠]
• Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two nodes, each with Mem, DIR, and CPUs; the requesting CPU (LD x) runs the protocol; arrows labeled "atomic, get/put" and "get"]

Outline
• Motivation
• General DSZOOM Overview
• DSZOOM-WF Implementation Details
• Experimentation Environment
• Performance Results
• Conclusions

DSZOOM Cluster
• DSZOOM Nodes:
  • Each node consists of an unmodified SMP multiprocessor
  • SMP hardware keeps coherence among the caches and the memory within each SMP node
• DSZOOM Cluster Network:
  • Non-coherent cluster interconnect
  • Inexpensive user-level remote memory access
  • Remote atomic operations [e.g., InfiniBand℠]

Squeezing Protocols into Binaries …
• Static Binary Instrumentation
• EEL: machine-independent Executable Editing Library, implemented in C++
• Instrument global LOADs with snippets containing fine-grain access control checks
• Instrument global STOREs with MTAG snippets
• Insert calls to coherence protocols implemented in C

Fine-grain Access Control Checks
• The “magic” value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S]
• Floating-point load example:
        ld      [address], %reg     // original LOAD
        fcmps   %fcc0, %reg, %reg   // compare reg with itself
        fbe,pt  %fcc0, hit          // if (reg == reg) goto hit
        nop
        // call global coherence load routine
    hit:
[Figure: the miss path calls into the box labeled Coherence Protocols (C-code)]

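The self-compare check above can be sketched in C. This is a minimal illustration, not the DSZOOM snippet itself: the bit pattern `MAGIC_BITS`, the helper names, and the stand-in `coherence_load` are all assumptions introduced here. The pattern 0xFFFFFFFF is chosen only because it reads as the small integer -1 and as a NaN when viewed as a single-precision float; the slides do not give the actual value.

```c
#include <stdint.h>
#include <string.h>

/* Assumed magic pattern: -1 as a 32-bit integer, a NaN as an IEEE float. */
#define MAGIC_BITS 0xFFFFFFFFu

/* Counts how often the slow path runs (for illustration only). */
int miss_count = 0;

/* Mark a shared word as invalid by storing the magic NaN pattern. */
void invalidate_word(float *addr) {
    uint32_t bits = MAGIC_BITS;
    memcpy(addr, &bits, sizeof bits);
}

/* Hypothetical stand-in for the global coherence load routine:
 * it pretends to fetch valid data (1.5) and installs it locally. */
float coherence_load(float *addr) {
    miss_count++;
    *addr = 1.5f;
    return *addr;
}

/* Instrumented load: a NaN never compares equal to itself, so the
 * comparison fails exactly when the loaded word holds a NaN. */
float checked_load(float *addr) {
    float v = *addr;              /* original LOAD */
    if (v == v) {
        return v;                 /* hit: ordinary data */
    }
    return coherence_load(addr);  /* miss: enter the protocol */
}
```

Two caveats apply. The comparison fires for any NaN, so a slow path would have to tell genuine NaN data from the magic value. And the sketch must be compiled without options such as `-ffast-math`, which allow the compiler to assume `v == v` is always true.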
Blocking Directory Protocols
• Originally proposed to simplify the design and verification of HW-DSMs
• Eliminates race conditions
• DSZOOM implements a distributed version of a blocking protocol
[Figure: distributed DIR over G_MEM (Node 0 shown), one DIR_ENTRY per cache line, each with a LOCK field and presence bits. MEM_STORE example: presence bits 0 1 1 0 1 0 0 0 before MEM_STORE, 0 0 0 0 0 0 0 1 after MEM_STORE]

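A blocking directory entry of this shape can be sketched in C as follows. The layout is an assumption made for illustration (bit 8 as the lock, bits 0 to 7 as presence bits, read most-significant bit first from the figure), and C11 `atomic_fetch_or` stands in for a remote fetch-and-set; none of the names come from DSZOOM.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout: bit 8 = LOCK, bits 0..7 = one presence bit per node. */
#define DIR_LOCKED   0x100u
#define DIR_PRESENCE 0x0FFu

typedef _Atomic uint32_t dir_entry_t;

/* Block until this caller owns the entry; returns the presence bits seen.
 * While the lock bit is set, every other requester spins here, which is
 * what removes protocol races on this cache line. */
uint32_t dir_lock(dir_entry_t *e) {
    uint32_t old;
    do {
        old = atomic_fetch_or(e, DIR_LOCKED);
    } while (old & DIR_LOCKED);
    return old & DIR_PRESENCE;
}

/* Publish the new presence bits and release the lock in one store. */
void dir_unlock(dir_entry_t *e, uint32_t presence) {
    atomic_store(e, presence & DIR_PRESENCE);
}

/* MEM_STORE by `node`: afterwards only the writer holds the line.
 * Returns the set of other nodes whose copies had to be invalidated
 * (the invalidation itself is omitted from this sketch). */
uint32_t mem_store_transition(dir_entry_t *e, unsigned node) {
    uint32_t sharers = dir_lock(e);
    uint32_t others = sharers & ~(1u << node);
    dir_unlock(e, 1u << node);
    return others;
}
```

With the figure's example, an entry holding 0 1 1 0 1 0 0 0 (0x68 under the assumed bit order) becomes 0 0 0 0 0 0 0 1 after node 0 stores.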
Global Coherency Action
Read data from home node: 2-hop read
• 1a. f&s
• 1b. get data
• 2. put
[Figure: Requestor (LD x) and home node (DIR, Mem). Legend: small packet (~10 bytes), large packet (~68 bytes), message on the critical path, message off the critical path]

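The 2-hop read can be sketched as a single-process C simulation in which plain functions stand in for the remote operations. Everything named here is an assumption for illustration: the `remote_*` helpers, the 64-byte line size, and the directory layout (bit 8 as lock, bits 0 to 7 as presence bits). The sketch also assumes that the final put both records the new sharer and releases the lock, which the slide does not spell out.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64      /* assumed cache-line size */
#define DIR_LOCKED 0x100u  /* assumed lock bit */

/* One cache line as seen at its home node. */
typedef struct {
    uint32_t dir;              /* lock bit + presence bits */
    uint8_t  mem[LINE_BYTES];  /* home copy of the data */
} home_line_t;

/* Stand-ins for remote operations; on real hardware the fetch-and-set
 * would be a single atomic operation executed at the remote node. */
uint32_t remote_fas(uint32_t *word, uint32_t set_bits) {
    uint32_t old = *word;
    *word = old | set_bits;
    return old;
}
void remote_get(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
void remote_put(uint32_t *word, uint32_t value) { *word = value; }

/* The requester runs the whole protocol itself.
 * Returns 0 on success, -1 if the entry was locked (caller retries). */
int read_2hop(home_line_t *home, unsigned requester, uint8_t *local_copy) {
    uint32_t old = remote_fas(&home->dir, DIR_LOCKED);          /* 1a. f&s  */
    if (old & DIR_LOCKED) {
        return -1;
    }
    remote_get(local_copy, home->mem, LINE_BYTES);              /* 1b. get  */
    remote_put(&home->dir, (old & 0xFFu) | (1u << requester));  /* 2. put   */
    return 0;
}
```

No code runs at the home node in this sketch: the directory is only read and written by the requester, which is the point of removing the protocol agents.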
Global Coherency Action
Read data modified in a third node: 3-hop read
• 1. f&s
• 2a. f&s
• 2b. get data
• 3a. put
• 3b. put
[Figure: Requestor (LD x), home node (DIR), and third node (Mem, MTAG)]

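One possible reading of the 3-hop sequence, sketched as a single-process C simulation. The mapping of steps to targets is an assumption drawn from the figure labels: step 1 locks the directory entry at the home node, steps 2a and 2b lock the MTAG and fetch the data at the node holding the modified copy, and steps 3a and 3b release the MTAG and the directory entry. The MTAG state values, the `remote_*` helpers, and the bit layout are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64   /* assumed cache-line size */
#define LOCKED 0x100u   /* assumed lock bit in DIR and MTAG words */

enum { MTAG_INVALID = 0, MTAG_SHARED = 1, MTAG_MODIFIED = 2 };  /* hypothetical */

typedef struct { uint32_t dir; } home_t;   /* directory entry at the home node */
typedef struct {
    uint32_t mtag;                         /* per-node tag for this line */
    uint8_t  data[LINE_BYTES];
} node_line_t;

/* Stand-ins for remote operations (atomic on real hardware). */
uint32_t remote_fas(uint32_t *word, uint32_t set_bits) {
    uint32_t old = *word;
    *word = old | set_bits;
    return old;
}
void remote_get(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
void remote_put(uint32_t *word, uint32_t value) { *word = value; }

/* Returns the owner's node number on success, -1 if the caller must retry. */
int read_3hop(home_t *home, node_line_t nodes[8], unsigned requester,
              uint8_t *local_copy) {
    uint32_t dir = remote_fas(&home->dir, LOCKED);           /* 1. f&s on DIR   */
    if (dir & LOCKED) return -1;

    unsigned owner = 0;                                      /* find the holder */
    while (owner < 8 && !((dir >> owner) & 1u)) owner++;
    if (owner == 8) { remote_put(&home->dir, dir); return -1; }

    uint32_t tag = remote_fas(&nodes[owner].mtag, LOCKED);   /* 2a. f&s on MTAG */
    if (tag & LOCKED) { remote_put(&home->dir, dir); return -1; }
    remote_get(local_copy, nodes[owner].data, LINE_BYTES);   /* 2b. get data    */

    remote_put(&nodes[owner].mtag, MTAG_SHARED);             /* 3a. put         */
    remote_put(&home->dir, dir | (1u << requester));         /* 3b. put         */
    return (int)owner;
}
```

As in the 2-hop case, neither the home node nor the owner executes protocol code; the requester drives every step with remote atomics, gets, and puts.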
Compilation Process
[Figure: build flow. Unmodified SPLASH-2 application + DSZOOM-WF implementation of PARMACS macros -> m4 -> GNU gcc, with DSZOOM-WF run-time library -> (un)executable -> EEL, with coherence protocols (C-code) -> a.out]

Instrumentation Performance

Instrumentation Breakdown: Sequential Execution

Current DSZOOM Hardware
• Two E6000s connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction
• Data migration and coherent memory replication (CMR) are kept inactive
• 16 UltraSPARC II (250 MHz) CPUs per node and 8 GB memory
• Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
• Run as 16-way SMP, 2×8 CC-NUMA, and 2×8 SW-DSM

Process and Memory Distribution
[Figure: per-process address-space layout in Cabinet 1 and Cabinet 2, bottom to top: Text & Data, Heap, PRIVATE_DATA, G_MEM (at 0x80000000), Cabinet_1_G_MEM, Cabinet_2_G_MEM, Stack. Two shmget segments (shmid = A and shmid = B) reside in the physical memory of the two cabinets; shmat maps them into every process (”aliasing”); processes are created with fork and bound with pset_bind]

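The shmget/shmat/fork pattern named in the figure can be sketched with the System V shared-memory calls. This is a minimal illustration under stated assumptions, not the DSZOOM-WF start-up code: the function name is hypothetical, the segment is attached wherever the kernel chooses rather than at 0x80000000, and the Solaris-specific pset_bind step is left out.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates one shared segment, maps it twice ("aliasing"), forks a child
 * that writes through the first mapping, and reads the value back in the
 * parent through the second mapping.
 * Returns the value read (42 on success) or -1 on any error. */
int shared_segment_demo(void) {
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) return -1;

    /* Two attachments of the same segment: two virtual address ranges
     * backed by the same physical memory. */
    int *g_mem = shmat(shmid, NULL, 0);
    int *alias = shmat(shmid, NULL, 0);
    int result = -1;

    if (g_mem != (void *)-1 && alias != (void *)-1) {
        g_mem[0] = 0;
        pid_t pid = fork();           /* child inherits both mappings */
        if (pid == 0) {
            g_mem[0] = 42;            /* write via the first mapping */
            _exit(0);
        }
        if (pid > 0) {
            waitpid(pid, NULL, 0);
            result = alias[0];        /* read via the second mapping */
        }
    }

    if (g_mem != (void *)-1) shmdt(g_mem);
    if (alias != (void *)-1) shmdt(alias);
    shmctl(shmid, IPC_RMID, NULL);    /* remove the segment */
    return result;
}
```

Attaching before the fork means every child sees the segment at the same virtual address, so pointers into the shared region stay valid across processes.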
Results (1): Execution Times in Seconds (16 CPUs)
[Figure: bar chart; recoverable labels: SW, HW, EEL, 8, 16]

Results (2): Normalized Execution Time Breakdowns (16 CPUs)
[Figure: bar chart; recoverable labels: SW, EEL, 8]

Conclusions
• DSZOOM completely eliminates asynchronous messaging between protocol agents
• Consistently competitive and stable performance in spite of high instrumentation overhead
• 30% slowdown compared to hardware
• State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta), DSZOOM: 3–59%

DSZOOM’s Home Page
http://www.it.uu.se/research/group/uart