Exploiting Store Locality through Permission Caching in Software DSMs
Uppsala University, Dept. of Information Technology, Div. of Computer Systems
Uppsala Architecture Research Team [UART]
Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten
zeffer@it.uu.se
Traditional Software DSMs
• Page-based coherence [e.g., Ivy, Munin, TreadMarks]
• Virtual memory hardware for coherence checks
  • Expensive TLB traps
• Large coherence unit size
  • Problem: False sharing
  • Solution: Weak memory consistency models
[Figure: a CPU store request ("ST req.") misses and is checked against the directory ("dir") guarding the DATA page]
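To make the trap-based check concrete, below is a minimal sketch, not taken from any of the cited systems, of how a page-based SW-DSM can use the virtual-memory hardware: shared pages are mapped without write permission, the first store to a page traps, and the SIGSEGV handler runs the coherence protocol before re-enabling access. fetch_page_and_upgrade() and install_page_coherence() are hypothetical names.

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    extern void fetch_page_and_upgrade(void *page);   /* hypothetical protocol hook */

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page: the coherence unit. */
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));
        fetch_page_and_upgrade(page);                  /* expensive: trap + protocol */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void install_page_coherence(void)
    {
        struct sigaction sa;
        sa.sa_sigaction = fault_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }

Because the check granularity is a whole page, two unrelated variables on the same page trigger coherence traffic against each other, which is the false-sharing problem noted above.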
Fine-Grain Software DSMs
• Fine-grain access-control checks [Shasta, Blizzard]
• Relies on binary instrumentation
  • Avoids operating system trapping
  • Less false sharing
  • Extra instructions introduce overhead
[Figure: checking code instrumented into the application; before the store ("ST req."), "if (miss) goto st_protocol" consults the directory ("dir") for the DATA block]
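For comparison with the trap-based approach, here is a hedged C sketch of the kind of check an instrumentation tool inserts in front of a global store; state_table, cu_index(), st_protocol() and checked_store() are illustrative names, not Shasta or Blizzard code.

    #include <stddef.h>
    #include <stdint.h>

    enum cu_state { INVALID, READ_ONLY, WRITABLE };

    extern uint8_t state_table[];                 /* per-coherence-unit state      */
    extern void st_protocol(void *addr);          /* obtain write permission       */

    static inline size_t cu_index(const void *addr)
    {
        return (uintptr_t)addr >> 6;              /* e.g., 64-byte coherence units */
    }

    static inline void checked_store(int *addr, int value)
    {
        if (state_table[cu_index(addr)] != WRITABLE)  /* inserted check            */
            st_protocol(addr);                        /* store miss: run protocol  */
        *addr = value;                                /* original store            */
    }

The check runs in user mode on every instrumented access, so it avoids the trap cost but adds a few instructions to each global load and store.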
Fine-Grain Pros and Cons
• Pros
  • Small coherence unit
  • Hardware-like memory consistency model
• Cons
  • Extra check instructions to execute
• Our proposal: Write Permission Cache (WPC)
  • Exploits store locality
  • Caches write permission
  • Effectively reduces the store instrumentation cost
Outline
• Motivation
• Problem: Instrumentation Overhead
• Solution: Write Permission Cache
• Experimental Setup
• Results on Real HW- and SW-DSM Systems
• Conclusions
Software Fine-Grain Coherence
• Binary instrumentation of global loads and stores
• Inserted code “snippet” maintains coherence

Original program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // G_LD1
          ld  [R2 + R4] -> R7        // G_LD2
          sub R9, 1 -> R9
          add R6, R7 -> R8
          st  R8 -> [R3 + R4]        // G_ST1
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]

Instrumented program:
          add R1, R2 -> R3
    loop: load snippet for G_LD1     // call coherence protocol if load miss
          load snippet for G_LD2     // call coherence protocol if load miss
          sub R9, 1 -> R9
          add R6, R7 -> R8
          store snippet for G_ST1    // call coherence protocol if store miss
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]
The Lock Problem (original DSZOOM)
• Example store access pattern (array traversal)

    Operation        CUID   Original snippet handling
    ST 0xE22F0000      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0008      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0010      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0018      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0020      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0028      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0030      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0038      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0040      99   lock dir entry 99; store; unlock dir entry 99
    ST 0xE22F0048      99   lock dir entry 99; store; unlock dir entry 99
DSZOOM Fine-Grain Coherence
• Magic value (load), atomic operations (store)

Original program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // G_LD1
          ld  [R2 + R4] -> R7        // G_LD2
          sub R9, 1 -> R9
          add R6, R7 -> R8
          st  R8 -> [R3 + R4]        // G_ST1
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]

Instrumented program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // original load
          if (R6 == MAGIC)           // test permission
            LD_PROTOCOL();           // protocol if miss
          ld  [R2 + R4] -> R7        // original load
          if (R7 == MAGIC)           // test permission
            LD_PROTOCOL();           // protocol if miss
          sub R9, 1 -> R9
          add R6, R7 -> R8
          LOCK(LOCAL_DIR);           // lock local dir
          if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();           // protocol if miss
          st  R8 -> [R3 + R4]        // original store
          UNLOCK(LOCAL_DIR);         // unlock local dir
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]
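The same checks expressed in C, as a rough sketch rather than actual DSZOOM code (MAGIC, cu_id(), dir_lock(), dir_unlock(), dir_has_write_permission(), ld_protocol() and st_protocol() are assumed helper names): loads test for a magic value, while stores lock the local directory entry, check write permission, store and unlock.

    #include <stddef.h>
    #include <stdint.h>

    #define MAGIC ((int32_t)0xDEADBEEF)        /* "no read permission" marker   */

    extern void ld_protocol(const int32_t *addr);  /* fetch data on a load miss */
    extern void st_protocol(int32_t *addr);        /* obtain write permission   */
    extern void dir_lock(size_t cu);               /* atomic lock on dir entry  */
    extern void dir_unlock(size_t cu);
    extern int  dir_has_write_permission(size_t cu);

    static inline size_t cu_id(const void *a) { return (uintptr_t)a >> 6; }

    static inline int32_t checked_load(const int32_t *addr)
    {
        int32_t v = *addr;                 /* original load                      */
        if (v == MAGIC) {                  /* magic value: permission may be missing */
            ld_protocol(addr);             /* protocol if miss                   */
            v = *addr;
        }
        return v;
    }

    static inline void checked_store(int32_t *addr, int32_t value)
    {
        size_t cu = cu_id(addr);
        dir_lock(cu);                      /* LOCK(LOCAL_DIR)                    */
        if (!dir_has_write_permission(cu))
            st_protocol(addr);             /* ST_PROTOCOL() on a store miss      */
        *addr = value;                     /* original store                     */
        dir_unlock(cu);                    /* UNLOCK(LOCAL_DIR)                  */
    }

Note that even a store that already has write permission pays for one lock and one unlock of the directory entry, which is exactly the overhead the next slides quantify and attack.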
Sequential Instrumentation Overhead
Average instrumentation overhead when run on a single processor (SPLASH2, -O3):
• Integer load instrumentation overhead: 3%
  • Overhead when only integer loads are instrumented
• Float load instrumentation overhead: 31%
  • Only floating-point loads instrumented
• Store instrumentation overhead: 61%
  • Only stores instrumented
Write Permission Caching in Action
• Example store access pattern (array traversal)
[Figure: two-entry Write Permission Cache holding CUIDs 98 and 99]

    Operation        CUID   WPC snippet handling
    ST 0xE22F0000      98   check WPC; miss; upd. WPC; lock dir entry 98; store
    ST 0xE22F0008      98   check WPC; hit; store
    ST 0xE22F0010      98   check WPC; hit; store
    ST 0xE22F0018      98   check WPC; hit; store
    ST 0xE22F0020      98   check WPC; hit; store
    ST 0xE22F0028      98   check WPC; hit; store
    ST 0xE22F0030      98   check WPC; hit; store
    ST 0xE22F0038      98   check WPC; hit; store
    ST 0xE22F0040      99   check WPC; miss; unlock 98; upd. WPC; lock 99; store
    ST 0xE22F0048      99   check WPC; hit; store
The Write Permission Cache Idea
• Keep the lock
• Rely on store locality
• SPARC application registers

Original program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // G_LD1
          ld  [R2 + R4] -> R7        // G_LD2
          sub R9, 1 -> R9
          add R6, R7 -> R8
          st  R8 -> [R3 + R4]        // G_ST1
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]

Write Permission Cache snippet:
    WPC_FASTPATH:
          if (WPC != CU_ID(ADDR))
            WPC_SLOWPATH();
          st R8 -> [R3 + R4]         // original store

    WPC_SLOWPATH:
          UNLOCK(WPC);
          WPC = CU_ID(ADDR);
          LOCK(WPC);
          if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();
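A rough C-level sketch of the same idea, with per-thread variables standing in for the SPARC application registers (WPC_ENTRIES, wpc[], dir_lock(), dir_unlock(), dir_has_write_permission(), st_protocol() and cu_id() are assumed names): a store first probes the small write permission cache; only on a WPC miss does it release the lock on the evicted entry and acquire the lock, and if needed write permission, for the new coherence unit.

    #include <stddef.h>
    #include <stdint.h>

    #define WPC_ENTRIES 2
    #define WPC_EMPTY   ((size_t)-1)

    extern void dir_lock(size_t cu);
    extern void dir_unlock(size_t cu);
    extern int  dir_has_write_permission(size_t cu);
    extern void st_protocol(int32_t *addr);

    static inline size_t cu_id(const void *a) { return (uintptr_t)a >> 6; }

    static __thread size_t wpc[WPC_ENTRIES] = { WPC_EMPTY, WPC_EMPTY };
    static __thread unsigned wpc_victim;           /* round-robin replacement    */

    static void wpc_slowpath(int32_t *addr, size_t cu)
    {
        unsigned v = wpc_victim++ % WPC_ENTRIES;
        if (wpc[v] != WPC_EMPTY)
            dir_unlock(wpc[v]);                    /* give up old permission     */
        wpc[v] = cu;
        dir_lock(cu);                              /* keep the lock = permission */
        if (!dir_has_write_permission(cu))
            st_protocol(addr);                     /* protocol on a store miss   */
    }

    static inline void wpc_store(int32_t *addr, int32_t value)
    {
        size_t cu = cu_id(addr);
        if (wpc[0] != cu && wpc[1] != cu)          /* fast path: WPC check       */
            wpc_slowpath(addr, cu);
        *addr = value;                             /* original store             */
    }

Because the directory lock is kept for as long as the entry stays in the WPC, a WPC hit needs no atomic operation at all; repeated stores to the same coherence unit pay only the cheap compare on the fast path.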
Experimental Setup: Software
• Benchmarks: unmodified SPLASH2
• Compiler: GCC 3.3.3 (-O0 and -O3)
• Instrumentation tool: custom-made
Experimental Setup: Hardware
• SMP: Sun Enterprise E6000 server
  • 16 UltraSPARC II (250 MHz)
  • Memory access time 330 ns [lmbench]
• HW-DSM: Sun WildFire (2 E6000 nodes)
  • Remote memory access time 1700 ns [lmbench]
  • Hardware-coherent interconnect, bandwidth 800 MB/s
• DSZOOM: runs in user space on the WildFire system
  • put (get) = uncacheable block store (load) operation
  • atomic = ldstub (load-store unsigned byte, SPARC V9); see the lock sketch below
  • Maintains coherence between private copies of G_MEM
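As a hedged illustration of how the per-entry directory locks can be built on that atomic, here is a test-and-set spinlock sketch; on the SPARC/WildFire system this corresponds to ldstub, approximated below with a GCC builtin rather than the actual DSZOOM code.

    #include <stdint.h>

    typedef volatile uint8_t dir_lock_t;       /* one byte per directory entry   */

    static inline void dir_entry_lock(dir_lock_t *l)
    {
        /* ldstub semantics: atomically read the byte and set it to 0xFF;
         * the lock is acquired when the value read back was 0. */
        while (__sync_lock_test_and_set(l, 0xFF) != 0)
            ;                                  /* spin until the holder releases */
    }

    static inline void dir_entry_unlock(dir_lock_t *l)
    {
        __sync_lock_release(l);                /* store 0 with release semantics */
    }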
Execution Time, 16 processors (2x8)
(Note: performance bug in paper (popc).)
Conclusions
• Write permission cache (WPC)
  • Effectively reduces store instrumentation overhead
  • Two entries are sufficient
• Store instrumentation overhead reduction: 42%
• HW-/SW-DSM gap reduction: 28%
• Parallel performance improvement: 9%
Thanks and Questions http://www.it.uu.se/research/group/uart
Memory Consistency
• The base architecture implements sequential consistency by requiring acknowledgments from all sharing nodes before a global store request is granted
• Introducing the WPC in an invalidation-based environment does not weaken the memory model
  • The WPC merely extends the duration of the permission tenure before write permission is given up
• If the memory model of each node is weaker than SC, it determines the memory model of the system
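A hedged sketch of the invariant in the first bullet (MAX_NODES, num_sharers(), send_invalidate(), wait_for_ack() and grant_write_permission() are hypothetical names): write permission is handed out only after every sharer has acknowledged the invalidation, so holding that permission longer in the WPC does not change what other nodes can observe.

    #include <stddef.h>

    #define MAX_NODES 16                          /* illustrative node count     */

    extern int  num_sharers(size_t cu, int sharers[]);  /* fill list of sharers  */
    extern void send_invalidate(int node, size_t cu);
    extern void wait_for_ack(int node, size_t cu);

    void grant_write_permission(size_t cu)
    {
        int sharers[MAX_NODES];
        int n = num_sharers(cu, sharers);

        for (int i = 0; i < n; i++)
            send_invalidate(sharers[i], cu);
        for (int i = 0; i < n; i++)
            wait_for_ack(sharers[i], cu);         /* all acks before the grant   */
        /* only now may the requesting node perform its global store */
    }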
Deadlock
• WPC entries are flushed at (see the flush sketch below):
  • Synchronization points
  • Failures to acquire directory locks
  • Thread termination
• WPC + flag synchronization can lead to deadlock
  • Avoidance: timers interrupt other CPUs on lack of forward progress
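A minimal sketch of the flush rule above, reusing the hypothetical wpc[], WPC_ENTRIES and WPC_EMPTY from the earlier WPC sketch (dsm_barrier() and barrier_wait() are assumed wrappers): before a thread blocks at a synchronization point it gives back every cached write permission, so no other node can end up waiting forever on a directory lock this thread holds.

    #include <stddef.h>

    extern void dir_unlock(size_t cu);            /* as in the earlier sketches  */
    extern void barrier_wait(void);               /* underlying barrier primitive */

    static void wpc_flush(void)
    {
        for (int i = 0; i < WPC_ENTRIES; i++) {
            if (wpc[i] != WPC_EMPTY) {
                dir_unlock(wpc[i]);               /* release the directory lock  */
                wpc[i] = WPC_EMPTY;
            }
        }
    }

    void dsm_barrier(void)                        /* assumed synchronization wrapper */
    {
        wpc_flush();                              /* flush WPC before blocking   */
        barrier_wait();
    }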
Directory Collisions
• A directory collision occurs when a requesting processor fails to acquire a directory lock
• The number of directory collisions does not increase when fewer than 32 WPC entries are used
• More information in the paper