Implementing Low Latency Distributed Software-Based Shared Memory
Zoran Radović and Erik Hagersten
{zoranr, eh}@it.uu.se
Uppsala University, Information Technology, Department of Computer Systems
Uppsala Architecture Research Team (UART)
Problems with Traditional SW-DSMs
• Page-sized coherence unit
  • False sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, Shasta, GeNIMA, …]
• Protocol agent messaging is slow
  • Most efficiency lost in interrupt/poll
[Figure: two nodes, each with Mem, a protocol agent, and CPUs; a CPU issues LD x.]
Our proposal: DSZOOM
• Run the entire protocol in the requesting processor
  • No protocol agent communication!
• Assumes user-level remote memory access
  • put, get, and atomics [e.g., InfiniBand]
• Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]
[Figure: two nodes, each with Mem, DIR, and CPUs; the requesting CPU issues LD x, runs the protocol itself, and reaches the remote node with atomic and get operations.]
Outline
• Motivation
• General DSZOOM Overview
• Experimentation Environment
• DSZOOM-WF Implementation Details
• Performance Results
• Improved DSZOOM… [SC2001]
DSZOOM Cluster
• DSZOOM Nodes:
  • Each node consists of an unmodified SMP multiprocessor
  • SMP hardware keeps coherence among the caches and the memory within each SMP node
• DSZOOM Cluster Network:
  • Non-coherent cluster interconnect
  • Inexpensive user-level remote memory access
  • Remote atomic operations [e.g., InfiniBand]
Current DSZOOM Hardware
• Two E6000 nodes connected through a hardware-coherent interface (Sun WildFire) with a raw bandwidth of 800 MB/s in each direction
• Data migration and coherent memory replication (CMR) are kept inactive
• 16 UltraSPARC II (250 MHz) CPUs and 8 GB of memory per node
• Memory access times: 330 ns local / 1700 ns remote (lmbench latency)
• Run as 16-way SMP, 2×8 HW-ccNUMA, and 2×8 SW-DSM
Compilation Process
[Figure: compilation flow. Boxes: Unmodified SPLASH-2 Application; DSZOOM-WF Implementation of PARMACS Macros; m4; GNU gcc; DSZOOM-WF Run-Time Library; Binary; Coherence Protocols; EEL; a.out.]
Process and Memory Distribution
[Figure: process and memory layout across Cabinet 1 and Cabinet 2. The physical memory of each cabinet backs one shared segment created with shmget (shmid = A for Cabinet 1, shmid = B for Cabinet 2). Processes are created with fork and bound with pset_bind. From top to bottom, each process address space shows Stack, Cabinet_2_G_MEM, Cabinet_1_G_MEM, G_MEM (the address 0x80000000 is marked next to it), PRIVATE_DATA, Heap, and Text & Data. The global regions are attached with shmat, and the figure labels the resulting mappings "aliasing".]
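The "aliasing" label in the figure suggests that one shared segment is visible at more than one virtual address. A minimal sketch of that idea with System V shared memory follows. It is an illustration, not the DSZOOM-WF source: the function name `alias_roundtrip`, the 4096-byte segment size, and the choice to let the kernel pick both attach addresses are assumptions, and the fork and pset_bind steps from the figure are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: attach one System V segment at two addresses and
 * check that a store through one mapping is visible through the other.
 * Returns the value read through the alias, or -1 if the segment could
 * not be created or attached. `value` must be non-negative so that -1
 * stays unambiguous. */
static int alias_roundtrip(int value) {
    assert(value >= 0);
    int result = -1;

    /* Create a private segment (stands in for one cabinet's G_MEM). */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;

    /* Attach the same segment twice; the kernel chooses both addresses. */
    int *g_mem = shmat(shmid, NULL, 0);
    int *cabinet_g_mem = shmat(shmid, NULL, 0);

    if (g_mem != (void *)-1 && cabinet_g_mem != (void *)-1) {
        g_mem[0] = value;            /* store through the first mapping */
        result = cabinet_g_mem[0];   /* load through the alias */
    }

    /* Detach whatever was attached and mark the segment for removal. */
    if (g_mem != (void *)-1)
        shmdt(g_mem);
    if (cabinet_g_mem != (void *)-1)
        shmdt(cabinet_g_mem);
    shmctl(shmid, IPC_RMID, NULL);
    return result;
}
```

Calling `alias_roundtrip(42)` returns 42 on a system that permits System V shared memory. A real setup would attach at fixed addresses so that every process sees the global regions at the same place.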
So far …
[Figure: compilation flow as before, with the gcc output now labelled "(Un)executable". Boxes: Unmodified SPLASH-2 Application; DSZOOM-WF Implementation of PARMACS Macros; m4; GNU gcc; DSZOOM-WF Run-Time Library; (Un)executable; Coherence Protocols; EEL; a.out.]
Squeezing Protocols into Binaries …
• Static Binary Instrumentation
  • EEL: Machine-independent Executable Editing Library implemented in C++
• Replace global loads with snippets containing fine-grain access control checks
• Insert coherence protocols
Fine-grain Access Control Checks
• The “magic” value is a small integer corresponding to an IEEE floating-point NaN [Blizzard-S, Sirocco-S]
• A NaN never compares equal to itself, so the check falls through to the coherence routine only when the loaded word is a NaN
• Floating-point load example:
    ld     [address],%reg     // original LD
    fcmps  %fcc0,%reg,%reg    // compare reg with itself
    fbe,pt %fcc0,hit          // if (reg == reg) goto hit
    nop
    // call global coherence load routine
  hit:
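The check sequence above can be sketched in C. This is an illustration under stated assumptions, not the instrumented snippet itself: the magic value is taken to be the integer -1 (bit pattern 0xFFFFFFFF, a quiet NaN in single precision), and `invalidate_word`, `coherence_load`, and `checked_load` are hypothetical names. The slow path here just copies from a simulated home location.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed magic value: the integer -1, whose bit pattern 0xFFFFFFFF is an
 * IEEE 754 single-precision NaN. The slides do not give the actual value. */
#define MAGIC_INT ((int32_t)-1)

/* Counts how often the slow path runs (for demonstration only). */
static int slow_path_calls = 0;

/* Mark a word as invalid by storing the magic bit pattern into it. */
static void invalidate_word(float *addr) {
    int32_t magic = MAGIC_INT;
    memcpy(addr, &magic, sizeof magic);
}

/* Hypothetical stand-in for the global coherence load routine: fetch the
 * value from a simulated home copy and install it locally. */
static float coherence_load(float *addr, const float *home_copy) {
    slow_path_calls++;
    *addr = *home_copy;
    return *addr;
}

/* Instrumented floating-point load. */
static float checked_load(float *addr, const float *home_copy) {
    assert(addr != NULL && home_copy != NULL);
    float reg = *addr;          /* original load */
    if (reg == reg)             /* only a NaN is unequal to itself */
        return reg;             /* hit: data is valid */
    return coherence_load(addr, home_copy);   /* miss: run the protocol */
}
```

The sketch relies on IEEE comparison semantics, so it should not be built with options such as `-ffast-math` that let the compiler fold `reg == reg` to true. Application data that legitimately holds a NaN would also take the slow path, where a real protocol would have to tell the two cases apart.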
MEM_STORE
• One DIR_ENTRY per cache line
• DIR_ENTRY: a LOCK field plus presence bits
  • Before MEM_STORE: LOCK | 0 0 0 0 0 0 0 1
  • After MEM_STORE: LOCK | 0 0 0 0 0 0 1 0
• Modified-Shared-Invalid (MSI)
[Figure: distributed DIR with a shared cache line and an invalid cache line in G_MEM, Cabinet_1_G_MEM, and Cabinet_2_G_MEM ("aliasing").]
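The before/after presence bits can be reproduced with a small directory-entry sketch. The struct layout, the name `dir_mem_store`, and the mapping of node i to bit i are assumptions made for illustration; the slides only show a LOCK field and eight presence bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 8   /* eight presence bits are shown on the slide */

/* Hypothetical directory entry: one per cache line. */
typedef struct {
    bool    locked;    /* LOCK field */
    uint8_t presence;  /* bit i set => node i holds a copy (assumed mapping) */
} dir_entry_t;

/* Update the entry for a store by `writer`. Afterwards only the writer's
 * presence bit is set. Returns the mask of other nodes whose copies must
 * be invalidated. */
static uint8_t dir_mem_store(dir_entry_t *e, unsigned writer) {
    assert(writer < MAX_NODES);
    uint8_t writer_bit = (uint8_t)(1u << writer);
    uint8_t to_invalidate = (uint8_t)(e->presence & ~writer_bit);
    e->presence = writer_bit;
    return to_invalidate;
}
```

Starting from presence bits 0000 0001 and storing from node 1 yields 0000 0010, with node 0 reported for invalidation, which matches the before/after pair on the slide under the assumed bit mapping.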
Read Data from Home Node: 2-hop read
• 1a. f&s
• 1b. get data
• 2. put
[Figure: Requestor exchanging messages with the DIR and Mem of the home node. Legend: small packet (~10 bytes); large packet (~68 bytes); message on the critical path; message off the critical path.]
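The three steps of the read can be sketched with C11 atomics, using local memory as a stand-in for the home node. Everything named here is hypothetical: `home_line_t`, the 64-byte line size, the lock bit position, and modelling f&s as an atomic fetch-or that sets the lock bit and returns the previous directory word. A real system would issue these as remote operations over the interconnect.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64            /* assumed cache-line size */
#define DIR_LOCK   (1u << 31)    /* assumed position of the LOCK bit */

/* Simulated home-node state for one cache line. */
typedef struct {
    _Atomic uint32_t dir;               /* LOCK bit plus presence bits */
    unsigned char    data[LINE_BYTES];  /* home copy of the line */
} home_line_t;

/* 1a. f&s: atomically set the lock bit and return the old directory word,
 * retrying while another requester holds the lock. */
static uint32_t remote_fetch_and_set(home_line_t *home) {
    uint32_t old;
    do {
        old = atomic_fetch_or(&home->dir, DIR_LOCK);
    } while (old & DIR_LOCK);
    return old;
}

/* 1b. get: copy the line from the home node. */
static void remote_get(const home_line_t *home, unsigned char *dst) {
    memcpy(dst, home->data, LINE_BYTES);
}

/* 2. put: write the new directory word, which also clears the lock. */
static void remote_put_dir(home_line_t *home, uint32_t new_dir) {
    atomic_store(&home->dir, new_dir & ~DIR_LOCK);
}

/* The whole read runs in the requesting processor; no agent is involved. */
static void two_hop_read(home_line_t *home, unsigned requester,
                         unsigned char *dst) {
    assert(requester < 31);
    uint32_t old = remote_fetch_and_set(home);        /* 1a */
    remote_get(home, dst);                            /* 1b */
    remote_put_dir(home, old | (1u << requester));    /* 2  */
}
```

After `two_hop_read` returns, the requester's presence bit is set in the directory word and the lock is released, so a later requester can proceed.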
Instrumentation Performance
Normalized Instrumentation Overhead Breakdown (Seq. Exec.)
Results (1): Execution Times in Seconds (16 CPUs)
Results (2): Normalized Execution Time Breakdowns (16 CPUs)
Conclusions
• DSZOOM completely eliminates asynchronous messaging between protocol agents
• Consistently competitive and stable performance in spite of high instrumentation overhead
  • 35% slowdown compared to hardware
• State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM's are currently 11–68%
Improved DSZOOM… [SC2001]
• Protocol/Overall optimizations
  • Coherency unit variations
  • Synchronization improvements
  • More balanced execution between cabinets
• Better instrumentation
  • More detailed backward slice algorithm
SC2001 Teaser: Execution Times in Seconds (16 CPUs)
DSZOOM’s Home Page: http://www.it.uu.se/research/group/uart