Support for Symmetric Shadow Memory in Multiprocessors Vijay Nagarajan Rajiv Gupta University of California, Riverside
Runtime Monitoring
• Applications of monitoring
  • Security
    • DIFT (dynamic information flow tracking)
  • Debugging
    • Memcheck, Redux, OnTrac
  • Performance
    • Speculation
• Requirements of monitoring
  • Shadow Memory (SM)
    • Meta-data associated with memory locations
  • Shadow memory instructions (SMIs)
    • Instructions for maintenance of meta-data
DIFT: Example
• Each word/register is associated with a “taint” value
• Data from input channels is considered tainted
• Flow of tainted data is tracked
• Use of tainted data in a “malicious” fashion is detected
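A minimal sketch of the taint-tracking idea above, assuming one taint bit per register and per memory word. The class and method names (`TaintTracker`, `read_input`, `jump_indirect`, and so on) are hypothetical and chosen only for illustration; treating an indirect jump on tainted data as the "malicious" use is also an assumption.

```python
class TaintError(Exception):
    """Raised when tainted data is used in a sensitive way (assumed policy)."""


class TaintTracker:
    def __init__(self):
        self.mem = {}        # original memory: address -> value
        self.mem_taint = {}  # shadow memory: address -> taint bit
        self.reg = {}        # registers: name -> value
        self.reg_taint = {}  # shadow registers: name -> taint bit

    def read_input(self, addr, value):
        # Data arriving from an input channel is marked tainted.
        self.mem[addr] = value
        self.mem_taint[addr] = True

    def store_const(self, addr, value):
        # Program constants are untainted.
        self.mem[addr] = value
        self.mem_taint[addr] = False

    def load(self, reg, addr):
        # Original load paired with a shadow load.
        self.reg[reg] = self.mem[addr]
        self.reg_taint[reg] = self.mem_taint.get(addr, False)

    def store(self, addr, reg):
        # Original store paired with a shadow store.
        self.mem[addr] = self.reg[reg]
        self.mem_taint[addr] = self.reg_taint[reg]

    def add(self, dst, a, b):
        # The result is tainted if either operand is tainted.
        self.reg[dst] = self.reg[a] + self.reg[b]
        self.reg_taint[dst] = self.reg_taint[a] or self.reg_taint[b]

    def jump_indirect(self, reg):
        # Assumed policy: a jump target must not derive from input data.
        if self.reg_taint[reg]:
            raise TaintError(f"tainted jump target in {reg}")
        return self.reg[reg]
```

Each `load` and `store` here performs the original access and its shadow access back to back, which is the pairing the following slides call symmetric SMIs.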
Shadow Memory Observations
• Single vs. multiple shadow values
  • DIFT associates one taint value with each location
  • Other applications associate multiple shadow values
  • DDG computes the dynamic dependence graph on the fly
    • For each memory word, it maintains the (instruction, instance) pair that last wrote to it
• Symmetric SMIs
  • Original stores (loads) are associated with shadow stores (loads)
• Atomic SMIs
  • An OMI (original memory instruction) and its SMIs must be executed atomically
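The DDG shadow state described above can be sketched as follows, assuming the (instruction, instance) pair is kept in a per-address map and that an instance number is simply the execution count of that static instruction. `DDGShadow` and its methods are hypothetical names, not taken from the slides.

```python
class DDGShadow:
    def __init__(self):
        self.last_writer = {}  # shadow memory: addr -> (instruction, instance)
        self.instances = {}    # instruction -> number of executions so far
        self.edges = []        # dependence edges: (reader, writer)

    def _next_instance(self, instr):
        # Number each dynamic execution of a static instruction.
        count = self.instances.get(instr, 0) + 1
        self.instances[instr] = count
        return (instr, count)

    def on_store(self, instr, addr):
        # Shadow store: remember which dynamic instruction wrote this word.
        self.last_writer[addr] = self._next_instance(instr)

    def on_load(self, instr, addr):
        # Shadow load: look up the last writer and record a dependence edge.
        reader = self._next_instance(instr)
        writer = self.last_writer.get(addr)
        if writer is not None:
            self.edges.append((reader, writer))
        return reader
```

For example, two executions of store `s1` to the same word followed by a load `l1` yield the single edge `(("l1", 1), ("s1", 2))`.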
Atomic SMIs
[Figure: Proc A executes St1, S St1, St2, S St2; Proc B executes Ld, S Ld. Panel labels: Inconsistent View, Atomicity.]
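To illustrate why atomicity matters for the Proc A / Proc B scenario, the sketch below enumerates every interleaving of the two instruction streams and counts how often Proc B observes a value whose shadow does not match. It assumes that store i writes value i and shadow value i, and that both start at 0; the function names are hypothetical.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of lists a and b that preserves each list's order."""
    n = len(a) + len(b)
    for b_slots in combinations(range(n), len(b)):
        slots = set(b_slots)
        ai, bi = iter(a), iter(b)
        yield [next(bi) if i in slots else next(ai) for i in range(n)]


def run(schedule):
    """Execute a flat schedule; return the (value, shadow) pair Proc B read."""
    mem, shadow = 0, 0
    seen_value = seen_shadow = None
    for op, arg in schedule:
        if op == "st":
            mem = arg
        elif op == "sst":
            shadow = arg
        elif op == "ld":
            seen_value = mem
        elif op == "sld":
            seen_shadow = shadow
    return seen_value, seen_shadow


def count_inconsistent(a_units, b_units):
    """Count schedules in which Proc B sees a value with a stale or early shadow."""
    bad = total = 0
    for schedule in interleavings(a_units, b_units):
        flat = [op for unit in schedule for op in unit]
        value, shadow = run(flat)
        total += 1
        if value != shadow:
            bad += 1
    return bad, total


PROC_A = [("st", 1), ("sst", 1), ("st", 2), ("sst", 2)]
PROC_B = [("ld", None), ("sld", None)]

# Without atomicity every instruction is its own schedulable unit.
non_atomic = count_inconsistent([[op] for op in PROC_A], [[op] for op in PROC_B])
# With atomicity each OMI and its SMI form one indivisible unit.
atomic = count_inconsistent([PROC_A[0:2], PROC_A[2:4]], [PROC_B])
```

Under these assumptions, 7 of the 15 unconstrained interleavings give Proc B an inconsistent view, while none of the 3 atomic schedules do.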
Robust & Efficient SM
• Each SM access involves
  • Calculating the effective and shadow addresses
  • Accessing the shadow values
• Half-and-Half scheme
  • Reserves half of the virtual address space for shadow memory
  • Efficient SM access
  • Not robust [Nethercote and Seward VEE ’07]
• Valgrind’s software page-table-like scheme
  • Robust
  • Inefficient (Valgrind’s Memcheck causes a 22x slowdown)
• SM needs to be both efficient and robust!
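The two addressing schemes contrasted above can be sketched roughly as follows. The 32-bit address width, the 64 KB chunk size, and all names are assumptions made for illustration; this is not Valgrind's actual data structure.

```python
ADDR_BITS = 32                 # assumed address width
HALF = 1 << (ADDR_BITS - 1)    # start of the reserved shadow half


def half_and_half_shadow(addr):
    """Shadow address is one addition away, but the upper half is off limits."""
    if not 0 <= addr < HALF:
        # The program cannot use this address: the scheme is not robust.
        raise ValueError("address falls inside the reserved shadow half")
    return addr + HALF


class TwoLevelShadow:
    """Software page-table-like shadow map: robust, but every access pays
    for a table lookup (illustrative sketch only)."""

    CHUNK_BITS = 16  # assumed: each second-level table covers 64 KB

    def __init__(self):
        self.primary = {}  # chunk number -> second-level table

    def _split(self, addr):
        return addr >> self.CHUNK_BITS, addr & ((1 << self.CHUNK_BITS) - 1)

    def store(self, addr, shadow_value):
        chunk, offset = self._split(addr)
        self.primary.setdefault(chunk, {})[offset] = shadow_value

    def load(self, addr, default=0):
        chunk, offset = self._split(addr)
        return self.primary.get(chunk, {}).get(offset, default)
```

The first function is a single addition but rejects half of the address space; the second accepts any address at the cost of an extra indirection on every shadow access.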
Research Question
• Can we make SMIs and OMIs atomic?
• Can we make SM accesses efficient without sacrificing robustness?
• Can we do the above with minimal HW support?
Our Approach
• Convey the atomic block to the processor
  • Simple ISA support: shadow-start, shadow-end
  • SMIs implicitly identified
• Coupled Coherence
  • Coherence of SMIs and OMIs is coupled
  • Enforces the effect of atomicity
• OS Support
  • Couples allocation of original and shadow pages
  • Efficient addressing without sacrificing robustness
ISA Support
EXAMPLE
    shadow-start
    ld reg1, vaddr    // original load
    ld reg2, vaddr    // 1st shadow load
    ld reg3, vaddr    // 2nd shadow load
    shadow-end
• Shadow-start / shadow-end instructions
  • Enclose the OMI and its SMIs
  • Convey the atomic block to the processor
  • Guide the actions of the cache-coherence protocol
• Implicitly distinguishing SMIs
  • The first instruction is the OMI
  • All others with the same VA are treated as SMIs
  • Multiple accesses are implicitly assumed to access different shadow values
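The implicit identification rule can be sketched as a small classifier over the instructions of one block. This is a sketch under stated assumptions: instructions are given as text in the form used in the example, `classify_block` is a hypothetical name, and rejecting an access to a different address is an assumption, since the slides do not say how that case is handled.

```python
def classify_block(instructions):
    """Label the memory instructions between shadow-start and shadow-end.

    Returns (instruction, role, shadow_index) tuples; role is "OMI" or "SMI",
    and shadow_index is None for the OMI and 1, 2, ... for successive SMIs.
    """
    if (len(instructions) < 2 or instructions[0] != "shadow-start"
            or instructions[-1] != "shadow-end"):
        raise ValueError("block must be delimited by shadow-start / shadow-end")

    labelled = []
    omi_addr = None
    shadow_count = 0
    for text in instructions[1:-1]:
        _opcode, operands = text.split(None, 1)
        _reg, addr = [part.strip() for part in operands.split(",")]
        if omi_addr is None:
            # The first memory instruction in the block is the OMI.
            omi_addr = addr
            labelled.append((text, "OMI", None))
        elif addr == omi_addr:
            # Later accesses to the same VA are SMIs, each for a new shadow value.
            shadow_count += 1
            labelled.append((text, "SMI", shadow_count))
        else:
            # Assumption: other addresses are not expected inside the block.
            raise ValueError(f"unexpected address in shadow block: {addr}")
    return labelled
```

Applied to the example block, the first `ld` is labelled as the OMI and the next two as SMIs with shadow indices 1 and 2.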
Coupled Coherence
• Dependence Mirroring
  • Dependences among SMIs mirror those of the OMIs
  • If OMI2 depends on OMI1, then SMI2 depends on SMI1
  • Coupled coherence enforces this
[Figure: Proc B executes Ld, S Ld; Proc A executes St1, S St1, St2, S St2.]
Coupled Coherence
• Coupled Coherence involves
  • No explicit shadow coherence messages
    • SMIs do not trigger coherence messages
    • Shadow stores do not trigger invalidates
    • Shadow loads do not cause misses
  • Co-transfer
    • Data replies for original blocks are piggybacked with shadow blocks
  • Co-existence
    • Original blocks and shadow blocks co-exist in the cache
    • Brought in together
    • Replaced together
Dependence Mirroring: RAW
[Figure: Proc A executes St, S St; Proc B executes Ld, S Ld. Block B and shadow block B’ move through the same states (Shared, Exc, Inv) in both caches.]
• Proc A sends an invalidate for B and B’
• Proc B sends a read miss for B and B’
• Proc A sends blocks B and B’
Dependence Mirroring: RAW
[Figure: Proc A executes shadow-start, St, S St, shadow-end; Proc B executes Ld, S Ld. Block B carries a ready bit (0, then 1); states shown: Exc, Inv.]
• Proc A sends an invalidate for B and B’
• Proc B sends a read miss for B and B’
• Proc A waits until the ready bit is set
• Proc A sends blocks B and B’
Dependence Mirroring: WAR
[Figure: Proc A executes St1, S St1, St2, S St2; Proc B executes Ld, S Ld.]
• Proc A sends invalidates
• Proc B sends a read miss for B and B’
• Proc A sends blocks B and B’
Coupled Coherence
• On a cache miss
  • Original Ld / St
    • Place read miss for original and shadow block(s)
    • Write back dirty blocks
  • Shadow Ld / St
    • No coherence events
• Shadow-start
  • Set ready bit to 0
• Shadow-end
  • Set ready bit to 1
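The protocol actions listed above can be sketched as a toy two-cache model. This is a simplification under stated assumptions: the ready bit is kept per cache rather than per block, a stalled load is modelled by returning `None`, write misses and write-backs are not modelled, and the `Cache` class with its methods is hypothetical.

```python
class Cache:
    def __init__(self):
        self.lines = {}    # block -> [state, data, shadow]; state is "E" or "S"
        self.ready = True  # simplification: one ready bit per cache

    def shadow_start(self):
        self.ready = False  # set ready bit to 0

    def shadow_end(self):
        self.ready = True   # set ready bit to 1

    def store(self, block, value, others):
        # Original store: invalidating B also removes B' in the other caches.
        for other in others:
            other.lines.pop(block, None)
        line = self.lines.setdefault(block, ["E", None, None])
        line[0] = "E"
        line[1] = value

    def shadow_store(self, block, shadow_value):
        # Shadow store: no coherence events, just update the co-resident B'.
        self.lines[block][2] = shadow_value

    def serve_read_miss(self, block):
        # The owner replies only once its atomic block has finished.
        if not self.ready:
            return None
        line = self.lines[block]
        line[0] = "S"
        return line[1], line[2]  # co-transfer of B and B'

    def load(self, block, owner):
        # Original load: on a miss, request B and B' together.
        if block not in self.lines:
            reply = owner.serve_read_miss(block)
            if reply is None:
                return None  # requester stalls and retries later
            self.lines[block] = ["S", reply[0], reply[1]]
        return self.lines[block][1]

    def shadow_load(self, block):
        # Shadow load: always a hit because B' arrived with B.
        return self.lines[block][2]
```

In the RAW scenario, a load by the second cache issued between the owner's `store` and `shadow_end` stalls; after `shadow_end` it receives the value and its matching shadow value together.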
Symmetric/General SM
• Symmetric SM
  • Original loads (stores) are accompanied by shadow loads (stores)
• General SM
  • An original load can be accompanied by both shadow loads and stores
  • E.g., Eraser: online race detection
  • Need to enforce shadow coherence for RAR
    • Typically there are no coherence events for RAR
  • Future Work
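The Eraser case can be illustrated with a basic lockset refinement, in which every original access, including a load, both reads and writes the shadow state. This sketch omits Eraser's per-location state machine, and the class and method names are hypothetical.

```python
class LocksetShadow:
    def __init__(self, all_locks):
        self.all_locks = frozenset(all_locks)
        self.candidates = {}  # shadow memory: addr -> candidate lock set
        self.held = {}        # thread -> locks currently held

    def acquire(self, thread, lock):
        self.held.setdefault(thread, set()).add(lock)

    def release(self, thread, lock):
        self.held.get(thread, set()).discard(lock)

    def access(self, thread, addr):
        """Original load or store; returns True if a potential race is flagged."""
        current = self.candidates.get(addr, self.all_locks)        # shadow load
        refined = current & frozenset(self.held.get(thread, ()))
        self.candidates[addr] = refined                             # shadow store
        return not refined
```

Because two loads of the same word by different threads each update the candidate set, read-after-read accesses now need shadow coherence even though the original loads trigger no coherence events.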
Addressing Support
• Shadow pages are allocated adjacent to original pages
  • Virtual memory space unaffected
  • Retains robustness
  • OS treats them as a single “superpage”
  • Swapped in and swapped out together
• Address Translation
  • During address translation, an offset is added to access the shadow page
  • Provides efficiency
  • No separate TLB for shadow pages
[Figure: address translation through the TLB into memory. Labels: V.Page, Off, Ph.page, Ori.Page, Shadow Page 1, Shadow Page 2, OMI, SMI, Shadow Value cnt.]
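A rough sketch of coupled allocation and translation, assuming 4 KB pages, a dictionary standing in for the TLB, and shadow pages placed in the physical pages immediately after the original page. The function names and the `shadow_index` parameter are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


def allocate(tlb, vpage, first_free_ppage, n_shadow):
    """Map vpage and reserve n_shadow adjacent physical pages as one unit.

    Returns the next free physical page number.
    """
    tlb[vpage] = first_free_ppage
    return first_free_ppage + 1 + n_shadow


def translate(tlb, vaddr, shadow_index=0):
    """Translate vaddr; shadow_index 0 is the original page, k the k-th shadow page."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    ppage = tlb[vpage]  # a single TLB entry serves OMI and SMIs alike
    return (ppage + shadow_index) * PAGE_SIZE + offset
```

For instance, with virtual page 5 mapped to physical page 10 and two shadow pages, an SMI for the second shadow value at offset 12 resolves to physical page 12 at the same offset, without any extra lookup.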
Experiments
• Implementation in the SESC simulator
  • Cycle accurate, targets the MIPS architecture
    • Shadow-start, shadow-end instructions
  • Models the cache coherence protocol
    • Coupled Coherence implementation
    • Bus-based protocol
  • Models basic OS services
    • Coupled page allocation
• Monitoring applications
  • DIFT: detection of security attacks
  • DDG: computes the dynamic dependence graph online
• Benchmarks
  • SPLASH-2
Efficiency of SM
• Three versions:
  • SM
    • Our SM implementation
    • ISA support
    • OS support for address translation
    • Coupled Coherence protocol for atomicity
  • VAL:serial
    • Valgrind’s SM support
    • Address translation involves software page table accesses
    • Atomicity is enforced by thread serialization
  • VAL:lb
    • Valgrind’s SM support with no atomicity guarantees
    • Serves as a point of comparison for our address translation support
Efficiency of SM: DIFT
• VAL:serial causes a 41 times overhead on average
  • Effect of serialization
• SM causes only a 7 times overhead
  • Efficient address translation + coupled coherence
• Even without serialization, VAL:lb causes a 12 times overhead
  • With coupled coherence this reduces to 7 times
Efficiency of SM: DDG
• VAL:serial causes a 78 times overhead on average
  • Effect of serialization
• SM causes only a 23 times overhead
  • Efficient address translation + coupled coherence
• Even without serialization, VAL:lb causes a 27 times overhead
  • With coupled coherence this reduces to 23 times
  • Effect not as pronounced as in DIFT
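The relative gains can be worked out directly from the average overheads quoted on the DIFT and DDG slides. The sketch below assumes that "N times overhead" is a slowdown factor, so the ratio of two factors gives the relative improvement; the names are hypothetical.

```python
# Average overheads as quoted on the slides.
OVERHEAD = {
    "DIFT": {"VAL:serial": 41, "VAL:lb": 12, "SM": 7},
    "DDG": {"VAL:serial": 78, "VAL:lb": 27, "SM": 23},
}


def relative_gain(app, baseline, ours="SM"):
    """How many times lower the SM overhead is than the baseline's."""
    return OVERHEAD[app][baseline] / OVERHEAD[app][ours]


for app in OVERHEAD:
    for baseline in ("VAL:serial", "VAL:lb"):
        print(f"{app}: SM vs {baseline}: {relative_gain(app, baseline):.2f}x lower overhead")
```

Under that reading, SM's overhead is about 5.86x lower than VAL:serial and 1.71x lower than VAL:lb for DIFT, against about 3.39x and 1.17x for DDG, which matches the note that the effect is less pronounced for DDG.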
Effect of Coupled Coherence
• Performance overhead < 0.6% for DIFT and DDG
• Total amount of traffic is about the same
• Coupled coherence sees more bursts in traffic
Related Work
• Enforcing Atomicity
  • Valgrind [Nethercote et al. PLDI ’07] through thread serialization
    • Not efficient
  • TM [Chung et al. HPCA ’08] can be used
    • Requires additional HW changes
    • Support for rollback and re-execution
• Address Translation
  • Valgrind [Nethercote VEE ’07] software page table structure
    • Proposed application-specific optimizations
    • Still inefficient
  • Half-and-Half scheme [Qin et al. MICRO ’07]
    • Divides virtual address space
    • Not robust
Conclusion
• SM is used extensively for monitoring
  • Performance
  • Security
  • Debugging
• Support for improving SM performance
  • ISA support
  • Coupled coherence for atomicity
  • Coupled allocation for efficient addressing
  • Significant performance advantage
• Future Work
  • Extend the system beyond symmetric SMIs
  • Look at other techniques for providing atomicity without changes to the coherence protocol