Exploiting Store Locality through Permission Caching in Software DSMs
Uppsala University, Dept. of Information Technology, Div. of Computer Systems
Uppsala Architecture Research Team [UART]
Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten
zeffer@it.uu.se
Traditional Software DSMs
• Page-based coherence [e.g., Ivy, Munin, TreadMarks]
• Virtual memory hardware for coherence checks
  • Expensive TLB traps
• Large coherence unit size
  • Problem: False sharing
  • Solution: Weak memory consistency models
[Figure: a CPU store request ("ST req.") misses and is checked against the directory ("dir") guarding the DATA page]
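To make the trap-based check concrete, below is a minimal sketch, not taken from any of the cited systems, of how a page-based SW-DSM can use the virtual-memory hardware: shared pages are mapped without write permission, the first store to a page traps, and the SIGSEGV handler runs the coherence protocol before re-enabling access. fetch_page_and_upgrade() and install_page_coherence() are hypothetical names.

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    extern void fetch_page_and_upgrade(void *page);   /* hypothetical protocol hook */

    static void fault_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page: the coherence unit. */
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));
        fetch_page_and_upgrade(page);                  /* expensive: trap + protocol */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void install_page_coherence(void)
    {
        struct sigaction sa;
        sa.sa_sigaction = fault_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }

Because the check granularity is a whole page, two unrelated variables on the same page trigger coherence traffic against each other, which is the false-sharing problem noted above.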
Fine-Grain Software DSMs
• Fine-grain access-control checks [Shasta, Blizzard]
• Relies on binary instrumentation
  • Avoids operating system trapping
  • Less false sharing
  • Extra instructions introduce overhead
[Figure: checking code instrumented into the application; before the store ("ST req."), "if (miss) goto st_protocol" consults the directory ("dir") for the DATA block]
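For comparison with the trap-based approach, here is a hedged C sketch of the kind of check an instrumentation tool inserts in front of a global store; state_table, cu_index(), st_protocol() and checked_store() are illustrative names, not Shasta or Blizzard code.

    #include <stddef.h>
    #include <stdint.h>

    enum cu_state { INVALID, READ_ONLY, WRITABLE };

    extern uint8_t state_table[];                 /* per-coherence-unit state      */
    extern void st_protocol(void *addr);          /* obtain write permission       */

    static inline size_t cu_index(const void *addr)
    {
        return (uintptr_t)addr >> 6;              /* e.g., 64-byte coherence units */
    }

    static inline void checked_store(int *addr, int value)
    {
        if (state_table[cu_index(addr)] != WRITABLE)  /* inserted check            */
            st_protocol(addr);                        /* store miss: run protocol  */
        *addr = value;                                /* original store            */
    }

The check runs in user mode on every instrumented access, so it avoids the trap cost but adds a few instructions to each global load and store.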
Fine-Grain Pros and Cons
• Pros
  • Small coherence unit
  • Hardware-like memory consistency model
• Cons
  • Extra check instructions to execute
• Our proposal: Write Permission Cache (WPC)
  • Exploits store locality
  • Caches write permission
  • Effectively reduces the store instrumentation cost
Outline
• Motivation
• Problem: Instrumentation Overhead
• Solution: Write Permission Cache
• Experimental Setup
• Results on Real HW- and SW-DSM Systems
• Conclusions
Software Fine-Grain Coherence
• Binary instrumentation of global loads and stores
• Inserted code “snippet” maintains coherence

Original program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // G_LD1
          ld  [R2 + R4] -> R7        // G_LD2
          sub R9, 1 -> R9
          add R6, R7 -> R8
          st  R8 -> [R3 + R4]        // G_ST1
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]

Instrumented program:
          add R1, R2 -> R3
    loop: load snippet for G_LD1     // call coherence protocol if load miss
          load snippet for G_LD2     // call coherence protocol if load miss
          sub R9, 1 -> R9
          add R6, R7 -> R8
          store snippet for G_ST1    // call coherence protocol if store miss
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]
The Lock Problem (original DSZOOM)
• Example store access pattern (array traversal)

    Operation        CUID   Original snippet handling
    ST 0xE22F0000      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0008      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0010      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0018      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0020      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0028      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0030      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0038      98   lock dir entry 98; store; unlock dir entry 98
    ST 0xE22F0040      99   lock dir entry 99; store; unlock dir entry 99
    ST 0xE22F0048      99   lock dir entry 99; store; unlock dir entry 99
DSZOOM Fine-Grain Coherence
• Magic value (load), atomic operations (store)

Original program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // G_LD1
          ld  [R2 + R4] -> R7        // G_LD2
          sub R9, 1 -> R9
          add R6, R7 -> R8
          st  R8 -> [R3 + R4]        // G_ST1
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]

Instrumented program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // original load
          if (R6 == MAGIC)           // test permission
            LD_PROTOCOL();           // protocol if miss
          ld  [R2 + R4] -> R7        // original load
          if (R7 == MAGIC)           // test permission
            LD_PROTOCOL();           // protocol if miss
          sub R9, 1 -> R9
          add R6, R7 -> R8
          LOCK(LOCAL_DIR);           // lock local dir
          if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();           // protocol if miss
          st  R8 -> [R3 + R4]        // original store
          UNLOCK(LOCAL_DIR);         // unlock local dir
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]
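The same checks expressed in C, as a rough sketch rather than actual DSZOOM code (MAGIC, cu_id(), dir_lock(), dir_unlock(), dir_has_write_permission(), ld_protocol() and st_protocol() are assumed helper names): loads test for a magic value, while stores lock the local directory entry, check write permission, store and unlock.

    #include <stddef.h>
    #include <stdint.h>

    #define MAGIC ((int32_t)0xDEADBEEF)        /* "no read permission" marker   */

    extern void ld_protocol(const int32_t *addr);  /* fetch data on a load miss */
    extern void st_protocol(int32_t *addr);        /* obtain write permission   */
    extern void dir_lock(size_t cu);               /* atomic lock on dir entry  */
    extern void dir_unlock(size_t cu);
    extern int  dir_has_write_permission(size_t cu);

    static inline size_t cu_id(const void *a) { return (uintptr_t)a >> 6; }

    static inline int32_t checked_load(const int32_t *addr)
    {
        int32_t v = *addr;                 /* original load                      */
        if (v == MAGIC) {                  /* magic value: permission may be missing */
            ld_protocol(addr);             /* protocol if miss                   */
            v = *addr;
        }
        return v;
    }

    static inline void checked_store(int32_t *addr, int32_t value)
    {
        size_t cu = cu_id(addr);
        dir_lock(cu);                      /* LOCK(LOCAL_DIR)                    */
        if (!dir_has_write_permission(cu))
            st_protocol(addr);             /* ST_PROTOCOL() on a store miss      */
        *addr = value;                     /* original store                     */
        dir_unlock(cu);                    /* UNLOCK(LOCAL_DIR)                  */
    }

Note that even a store that already has write permission pays for one lock and one unlock of the directory entry, which is exactly the overhead the next slides quantify and attack.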
Sequential Instrumentation Overhead
Average instrumentation overhead when run on a single processor (SPLASH2, -O3):
• Integer load instrumentation overhead: 3%
  • Overhead when only integer loads are instrumented
• Float load instrumentation overhead: 31%
  • Only floating-point loads instrumented
• Store instrumentation overhead: 61%
  • Only stores instrumented
Write Permission Caching in Action
• Example store access pattern (array traversal)
[Figure: two-entry Write Permission Cache holding CUIDs 98 and 99]

    Operation        CUID   WPC snippet handling
    ST 0xE22F0000      98   check WPC; miss; upd. WPC; lock dir entry 98; store
    ST 0xE22F0008      98   check WPC; hit; store
    ST 0xE22F0010      98   check WPC; hit; store
    ST 0xE22F0018      98   check WPC; hit; store
    ST 0xE22F0020      98   check WPC; hit; store
    ST 0xE22F0028      98   check WPC; hit; store
    ST 0xE22F0030      98   check WPC; hit; store
    ST 0xE22F0038      98   check WPC; hit; store
    ST 0xE22F0040      99   check WPC; miss; unlock 98; upd. WPC; lock 99; store
    ST 0xE22F0048      99   check WPC; hit; store
The Write Permission Cache Idea
• Keep the lock
• Rely on store locality
• SPARC application registers

Original program:
          add R1, R2 -> R3
    loop: ld  [R1 + R4] -> R6        // G_LD1
          ld  [R2 + R4] -> R7        // G_LD2
          sub R9, 1 -> R9
          add R6, R7 -> R8
          st  R8 -> [R3 + R4]        // G_ST1
          add R4, 4 -> R4
          bnz R9, loop
    L134: st  R3 -> [R7 + 4]

Write Permission Cache snippet:
    WPC_FASTPATH:
          if (WPC != CU_ID(ADDR))
            WPC_SLOWPATH();
          st R8 -> [R3 + R4]         // original store

    WPC_SLOWPATH:
          UNLOCK(WPC);
          WPC = CU_ID(ADDR);
          LOCK(WPC);
          if (LOCAL_DIR != WRITE_PERMISSION)
            ST_PROTOCOL();
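A rough C-level sketch of the same idea, with per-thread variables standing in for the SPARC application registers (WPC_ENTRIES, wpc[], dir_lock(), dir_unlock(), dir_has_write_permission(), st_protocol() and cu_id() are assumed names): a store first probes the small write permission cache; only on a WPC miss does it release the lock on the evicted entry and acquire the lock, and if needed write permission, for the new coherence unit.

    #include <stddef.h>
    #include <stdint.h>

    #define WPC_ENTRIES 2
    #define WPC_EMPTY   ((size_t)-1)

    extern void dir_lock(size_t cu);
    extern void dir_unlock(size_t cu);
    extern int  dir_has_write_permission(size_t cu);
    extern void st_protocol(int32_t *addr);

    static inline size_t cu_id(const void *a) { return (uintptr_t)a >> 6; }

    static __thread size_t wpc[WPC_ENTRIES] = { WPC_EMPTY, WPC_EMPTY };
    static __thread unsigned wpc_victim;           /* round-robin replacement    */

    static void wpc_slowpath(int32_t *addr, size_t cu)
    {
        unsigned v = wpc_victim++ % WPC_ENTRIES;
        if (wpc[v] != WPC_EMPTY)
            dir_unlock(wpc[v]);                    /* give up old permission     */
        wpc[v] = cu;
        dir_lock(cu);                              /* keep the lock = permission */
        if (!dir_has_write_permission(cu))
            st_protocol(addr);                     /* protocol on a store miss   */
    }

    static inline void wpc_store(int32_t *addr, int32_t value)
    {
        size_t cu = cu_id(addr);
        if (wpc[0] != cu && wpc[1] != cu)          /* fast path: WPC check       */
            wpc_slowpath(addr, cu);
        *addr = value;                             /* original store             */
    }

Because the directory lock is kept for as long as the entry stays in the WPC, a WPC hit needs no atomic operation at all; repeated stores to the same coherence unit pay only the cheap compare on the fast path.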
Experimental Setup: Software
• Benchmarks: unmodified SPLASH2
• Compiler: GCC 3.3.3 (-O0 and -O3)
• Instrumentation tool: custom-made
Experimental Setup: Hardware
• SMP: Sun Enterprise E6000 server
  • 16 UltraSPARC II (250 MHz)
  • Memory access time 330 ns [lmbench]
• HW-DSM: Sun WildFire (2 E6000 nodes)
  • Remote memory access time 1700 ns [lmbench]
  • Hardware-coherent interconnect, bandwidth 800 MB/s
• DSZOOM: runs in user space on the WildFire system
  • put (get) = uncacheable block store (load) operation
  • atomic = ldstub (load-store unsigned byte, SPARC V9); see the lock sketch below
  • Maintains coherence between private copies of G_MEM
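As a hedged illustration of how the per-entry directory locks can be built on that atomic, here is a test-and-set spinlock sketch; on the SPARC/WildFire system this corresponds to ldstub, approximated below with a GCC builtin rather than the actual DSZOOM code.

    #include <stdint.h>

    typedef volatile uint8_t dir_lock_t;       /* one byte per directory entry   */

    static inline void dir_entry_lock(dir_lock_t *l)
    {
        /* ldstub semantics: atomically read the byte and set it to 0xFF;
         * the lock is acquired when the value read back was 0. */
        while (__sync_lock_test_and_set(l, 0xFF) != 0)
            ;                                  /* spin until the holder releases */
    }

    static inline void dir_entry_unlock(dir_lock_t *l)
    {
        __sync_lock_release(l);                /* store 0 with release semantics */
    }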
Execution Time, 16 processors (2x8)
(Note: performance bug in paper (popc).)
Conclusions
• Write permission cache (WPC)
  • Effectively reduces store instrumentation overhead
  • Two entries are sufficient
• Store instrumentation overhead reduction: 42%
• HW-/SW-DSM gap reduction: 28%
• Parallel performance improvement: 9%
Thanks and Questions http://www.it.uu.se/research/group/uart
Memory Consistency
• The base architecture implements sequential consistency by requiring acknowledgments from all sharing nodes before a global store request is granted
• Introducing the WPC in an invalidation-based environment does not weaken the memory model
  • The WPC merely extends the duration of the permission tenure before write permission is given up
• If the memory model of each node is weaker than SC, it determines the memory model of the system
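A hedged sketch of the invariant in the first bullet (MAX_NODES, num_sharers(), send_invalidate(), wait_for_ack() and grant_write_permission() are hypothetical names): write permission is handed out only after every sharer has acknowledged the invalidation, so holding that permission longer in the WPC does not change what other nodes can observe.

    #include <stddef.h>

    #define MAX_NODES 16                          /* illustrative node count     */

    extern int  num_sharers(size_t cu, int sharers[]);  /* fill list of sharers  */
    extern void send_invalidate(int node, size_t cu);
    extern void wait_for_ack(int node, size_t cu);

    void grant_write_permission(size_t cu)
    {
        int sharers[MAX_NODES];
        int n = num_sharers(cu, sharers);

        for (int i = 0; i < n; i++)
            send_invalidate(sharers[i], cu);
        for (int i = 0; i < n; i++)
            wait_for_ack(sharers[i], cu);         /* all acks before the grant   */
        /* only now may the requesting node perform its global store */
    }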
Deadlock
• WPC entries are flushed at (see the flush sketch below):
  • Synchronization points
  • Failures to acquire directory locks
  • Thread termination
• WPC + flag synchronization can lead to deadlock
  • Avoidance: timers interrupt other CPUs on lack of forward progress
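A minimal sketch of the flush rule above, reusing the hypothetical wpc[], WPC_ENTRIES and WPC_EMPTY from the earlier WPC sketch (dsm_barrier() and barrier_wait() are assumed wrappers): before a thread blocks at a synchronization point it gives back every cached write permission, so no other node can end up waiting forever on a directory lock this thread holds.

    #include <stddef.h>

    extern void dir_unlock(size_t cu);            /* as in the earlier sketches  */
    extern void barrier_wait(void);               /* underlying barrier primitive */

    static void wpc_flush(void)
    {
        for (int i = 0; i < WPC_ENTRIES; i++) {
            if (wpc[i] != WPC_EMPTY) {
                dir_unlock(wpc[i]);               /* release the directory lock  */
                wpc[i] = WPC_EMPTY;
            }
        }
    }

    void dsm_barrier(void)                        /* assumed synchronization wrapper */
    {
        wpc_flush();                              /* flush WPC before blocking   */
        barrier_wait();
    }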
Directory Collisions
• A directory collision occurs when a requesting processor fails to acquire a directory lock
• The number of directory collisions does not increase when fewer than 32 WPC entries are used
• More information in the paper