Umbra: Efficient and Scalable Memory Shadowing

Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT) • CGO 2010, Toronto, Canada • April 26, 2010

Shadow Memory • Meta-data • Track properties of application memory • Synchronized Update • Application data and meta-data • (Figure: application memory regions a.out, heap, libc, and stack, each mirrored by a region in shadow memory)

Examples • Memory Error Detection • MemCheck [VEE’07] • Purify [USENIX’92] • Dr. Memory • MemTracker [HPCA’07] • Dynamic Information Flow Tracking • LIFT [MICRO’39] • TaintTrace [ISCC’06] • Multi-threaded Debugging • Eraser [TCS’97] • Helgrind • Others • Redux [TCS’03] • Software Watchpoint [CC’08]

Issues • Performance • Runtime overhead • Example: MemCheck 25x [VEE’07] • Scalability • 64-bit architecture • Dependence • OS • Hardware • Development • Implemented with specific analysis • Lack of a general framework

Memory Shadowing System • Dynamic Instrumentation • Context switch (application ↔ shadow) • Address calculation • Updating meta-data • Memory Management • Memory allocation / free • Monitor application memory management • Manage shadow memory • Mapping translation scheme (addrA → addrS) • DMS: Direct Mapping Scheme • SMS: Segmented Mapping Scheme

Direct Mapping Scheme (DMS) • Single memory region for the entire address space • Translation: lea [addr] → %r1; add %r1, disp → %r1 • Issue: address conflict between memA and memS • (Figure: application and shadow regions; chart of slowdown relative to native execution)
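The direct mapping above amounts to one addition per reference. A minimal sketch in C, assuming a 1:1 byte mapping and a made-up displacement; `DMS_DISP`, `dms_translate`, and `dms_conflict` are hypothetical names for illustration, not part of Umbra:

```c
#include <stdint.h>

/* Hypothetical fixed offset between application and shadow memory. */
#define DMS_DISP UINT64_C(0x20000000)

/* Direct mapping: a single add, mirroring the lea/add pair on the slide. */
static uint64_t dms_translate(uint64_t addr_app)
{
    return addr_app + DMS_DISP;
}

/* Returns 1 if the shadow of application range [a_lo, a_hi) overlaps
 * another application range [b_lo, b_hi). This is the address conflict
 * that makes a single displacement unusable for that layout. */
static int dms_conflict(uint64_t a_lo, uint64_t a_hi,
                        uint64_t b_lo, uint64_t b_hi)
{
    uint64_t s_lo = dms_translate(a_lo);
    uint64_t s_hi = dms_translate(a_hi);
    return s_lo < b_hi && b_lo < s_hi;
}
```

With this displacement, `dms_translate(0x1000)` yields `0x20001000`, and any application mapping that happens to cover that address collides with the shadow region.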

Segmented Mapping Scheme (SMS) • Shadow segment per application segment • Translation: • Segment lookup (address indexing) • Address translation • lea [addr] → %r1; mov %r1 → %r2; shr %r2, 16 → %r2; add %r1, disp[%r2] → %r1 • (Figure: segment table mapping addrA in App 1 and App 2 to addrS in Shd 1 and Shd 2; chart of slowdown relative to native execution)
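The segment lookup plus address translation can be sketched in C as follows. This assumes a 32-bit address space split into 64 KB units (matching the `shr 16` on the slide); `seg_disp`, `sms_map_segment`, and `sms_translate` are hypothetical names, and the table layout is an illustration rather than any tool's actual data structure:

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT 16                 /* 64 KB units, matching `shr 16` */
#define SEG_COUNT (1u << 16)         /* enough for a 32-bit address space */

/* One displacement per 64 KB unit, indexed by (address >> 16). */
static int64_t seg_disp[SEG_COUNT];

/* Record that [app_base, app_base + size) is shadowed at shd_base. */
static void sms_map_segment(uint32_t app_base, uint32_t size, uint32_t shd_base)
{
    assert(size > 0);
    int64_t disp = (int64_t)shd_base - (int64_t)app_base;
    uint32_t first = app_base >> SEG_SHIFT;
    uint32_t last = (app_base + (size - 1)) >> SEG_SHIFT;
    for (uint32_t i = first; i <= last; i++)
        seg_disp[i] = disp;
}

/* Segment lookup by indexing, then add that segment's displacement. */
static uint32_t sms_translate(uint32_t addr_app)
{
    return (uint32_t)((int64_t)addr_app + seg_disp[addr_app >> SEG_SHIFT]);
}
```

Unlike the direct scheme, each application segment carries its own displacement, so shadow segments can be placed wherever free space exists, at the cost of one extra table load per translation.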

Umbra • Mapping Scheme • Segmented mapping • Scale with actual memory usage • Implementation • DynamoRIO • Optimization • Translation optimization • Instrumentation optimization • Client API • Experimental Results • Performance evaluation • Statistics collection

Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS • Infeasible due to memory layout • (Figure: 64-bit address-space layout with a.out and stack in user space below 2^47, unusable space, and kernel space with vsyscall below 2^64)

Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS • Infeasible due to memory layout • Single-Level SMS • Too big (~4 billion entries)

Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS • Infeasible due to memory layout • Single-Level SMS • Too big (~4 billion entries) • Multi-Level SMS • Even more expensive • Fast path on lower 32 GB (MemCheck) • (Figure: chart of slowdown relative to native execution)
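The "~4 billion entries" figure can be reproduced with a short calculation. The sketch below assumes a single-level table indexed by address >> 16 (64 KB units, as in the SMS translation code) and 48 significant address bits; both parameters are assumptions chosen because they yield the stated magnitude, and `sms_table_entries` is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in a single-level segment table that is indexed
 * directly by (address >> unit_bits). */
static uint64_t sms_table_entries(unsigned addr_bits, unsigned unit_bits)
{
    assert(addr_bits > unit_bits && addr_bits - unit_bits < 64);
    return (uint64_t)1 << (addr_bits - unit_bits);
}
```

For a 32-bit address space this gives 2^16 = 65,536 entries, which is easy to keep resident; for 48 bits it gives 2^32 = 4,294,967,296 entries, nearly all of them empty.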

Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS is infeasible • Single-Level SMS is too sparse • Multi-Level SMS is too expensive • Umbra Solution • Eliminate empty entries • Compact table • Walk the table to find the entry
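A compact table with no empty entries can be sketched as an array of mapped regions that is walked linearly. The struct layout, the names `map_entry_t`, `demo_table`, and `table_translate`, and the sample addresses are all assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per mapped application region; unmapped space has no entry. */
typedef struct {
    uint64_t app_base;   /* start of the application region */
    uint64_t app_end;    /* one past its last byte */
    uint64_t shd_base;   /* start of the matching shadow region */
} map_entry_t;

/* Made-up example layout. */
static const map_entry_t demo_table[] = {
    { 0x0000000000400000, 0x0000000000600000, 0x0000100000400000 },  /* a.out */
    { 0x00007fff00000000, 0x00007fff00100000, 0x0000100000600000 },  /* stack */
};

/* Walk the table to find the entry covering addr.
 * Returns the shadow address, or 0 if addr is not in any mapped region. */
static uint64_t table_translate(const map_entry_t *tab, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].app_base && addr < tab[i].app_end)
            return addr - tab[i].app_base + tab[i].shd_base;
    }
    return 0;
}
```

The table size now tracks the number of mapped regions rather than the size of the address space; the price is a walk instead of a single indexed load, which is what the later translation optimizations try to avoid.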

Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation • DynamoRIO • Optimization • Translation optimization • Instrumentation optimization • Client API • Experimental Result • Performance evaluation • Statistics collection

Implementation • Memory Manager • Monitor and control application memory allocation • brk, mmap, munmap, mremap • Allocate shadow memory • Maintain translation table • Instrumenter • Instrument every memory reference • Context save • Address calculation • Address translation • Shadow memory update • Context restore • (Figure: application segments App 1 and App 2 with shadow segments Shd 1 and Shd 2)

Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation √ • DynamoRIO • Optimization • Translation optimization • Instrumentation optimization • Client API • Experimental Result • Performance evaluation • Statistics collection

Unoptimized System • Small overhead from DynamoRIO • Slower than SMS-64 • Need to walk the global translation table • Why so slow? • 41.79% of instructions are memory references • For each of these instructions: • Full context switch • Table lookup • Call-out instrumentation • (Figure: slowdown chart, label ~100; global translation table)

Optimization • Translation Optimization • Thread-local translation cache • Hashtable lookup • Memoization mini-cache • Reference uni-cache • Instrumentation Optimization • Context switch reduction • Reference grouping • 3-stage code layout • (Figure: slowdown chart, label ~100; global translation table)

1. Thread-Local Translation Cache • Local translation table per thread • Synchronize with global translation table when necessary • Avoid lock contention • Walk table to find the matching entry • Walk global table if not found in thread-local cache • Inlined instrumentation • (Figure: slowdown chart, label ~100; Thread 1 and Thread 2, each with a thread-local translation cache in front of the global translation table)

2. Hashtable Lookup • Hashtable per thread • Fixed number of slots • Hash(addrA) → entry in thread-local cache • If match, found • If no match, walk the local cache • (Figure: slowdown chart, label ~100; per-thread hashtable in front of the thread-local translation cache and global translation table)
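The hashtable step can be sketched as a fixed-size array of pointers into the thread-local cache, with a walk of that cache on a miss. The hash function, slot count, struct layouts, and the choice to refill the slot after a successful walk are assumptions for illustration; `thread_xlat_t`, `hash_addr`, and `lookup` are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16          /* assumed 64 KB units */
#define HASH_SLOTS 64          /* fixed number of slots; value is arbitrary */

typedef struct {
    uint64_t app_base, app_end, shd_base;
} map_entry_t;

typedef struct {
    const map_entry_t *cache;             /* thread-local translation cache */
    size_t n;                             /* number of entries in the cache */
    const map_entry_t *slot[HASH_SLOTS];  /* hash -> entry in the cache */
} thread_xlat_t;

static size_t hash_addr(uint64_t addr)
{
    return (size_t)((addr >> UNIT_SHIFT) % HASH_SLOTS);
}

/* Returns the entry covering addr, or NULL if the local cache has none
 * (a real system would then consult the global table). */
static const map_entry_t *lookup(thread_xlat_t *t, uint64_t addr)
{
    size_t h = hash_addr(addr);
    const map_entry_t *e = t->slot[h];
    if (e != NULL && addr >= e->app_base && addr < e->app_end)
        return e;                          /* match: found */
    for (size_t i = 0; i < t->n; i++) {    /* no match: walk the local cache */
        e = &t->cache[i];
        if (addr >= e->app_base && addr < e->app_end) {
            t->slot[h] = e;                /* remember for the next lookup */
            return e;
        }
    }
    return NULL;
}

/* Made-up example data for one thread. */
static const map_entry_t demo_cache[] = {
    { 0x00400000, 0x00700000, 0x10400000 },
    { 0x7fff0000, 0x80000000, 0x20000000 },
};
static thread_xlat_t demo_thread = { demo_cache, 2, { 0 } };
```

In a multi-threaded tool each thread would own one `thread_xlat_t`, for example through thread-local storage, so lookups need no locking.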

3. Memoization Mini-Cache • Four-entry table per thread • Stack • Heap • Application (a.out) • Units found in last table lookup • If no match, hashtable lookup • 68.93% hit ratio • (Figure: slowdown chart, label ~100; per-thread memoization mini-cache in front of the hashtable, thread-local translation cache, and global translation table)

4. Reference Uni-Cache • Software uni-cache per instruction per thread • Last reference unit tag • Last translation displacement • If no match, memoization mini-cache check • 99.93% hit ratio • (Figure: slowdown chart, label ~100; uni-caches attached to example instructions in Thread 1 and Thread 2, such as ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX, in front of the memoization mini-cache, hashtable, thread-local translation cache, and global translation table)
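A uni-cache can be sketched as a single tag and displacement pair that is checked before anything else. The 64 KB unit size, the `valid` flag, and the stand-in slow path (`slow_lookup_disp`, which returns a made-up constant and counts its calls) are assumptions for illustration; in the design described above the slow path would be the memoization mini-cache check and the levels behind it:

```c
#include <stdint.h>

#define UNIT_SHIFT 16                        /* assumed 64 KB units */
#define DEMO_DISP  INT64_C(0x100000000000)   /* made-up displacement */

/* One uni-cache per instrumented instruction per thread. */
typedef struct {
    uint64_t tag;     /* unit (addr >> UNIT_SHIFT) of the last reference */
    int64_t  disp;    /* displacement used for that unit */
    int      valid;   /* 0 until the first reference fills the cache */
} uni_cache_t;

static unsigned slow_calls;   /* counts how often the slow path runs */

/* Stand-in for the slower lookup levels. */
static int64_t slow_lookup_disp(uint64_t addr)
{
    (void)addr;
    slow_calls++;
    return DEMO_DISP;
}

/* Translate addr, taking the slow path only when the unit changed. */
static uint64_t uni_translate(uni_cache_t *uc, uint64_t addr)
{
    uint64_t unit = addr >> UNIT_SHIFT;
    if (!uc->valid || uc->tag != unit) {
        uc->disp = slow_lookup_disp(addr);
        uc->tag = unit;
        uc->valid = 1;
    }
    return addr + (uint64_t)uc->disp;
}

static uni_cache_t demo_uc;   /* zero-initialized, so initially invalid */
```

Because a given instruction tends to touch the same unit repeatedly, the compare on `tag` almost always succeeds, which is consistent with the very high hit ratio reported on the slide.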

5. Context Switch Reduction • Register liveness analysis • Use dead register • Avoid flags save/restore • (Figure: slowdown chart, label ~100; same uni-cache and translation table hierarchy as the previous slide)

6. Reference Grouping • One reference cache for multiple references • Stack local variables • Different members of the same object • (Figure: slowdown chart, label ~100; example instructions such as ADD $1, (%RAX) and MOV %RBX, 48(%RAX) sharing one reference uni-cache, in front of the same translation table hierarchy)

3-stage Code Layout • Inline stub (<10 instructions) • Quick inline check code with minimal context switch • Lean procedure (~50 instructions) • Simple assembly procedure with partial context switch • Callout (C function) • C function with complete context switch • (Figure: lookup order is uni-cache check, memoization check, hashtable lookup, local cache lookup, then a C function, wrapped in full context switches, that looks up the global table; the app instruction follows)

Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation √ • DynamoRIO • Optimization √ • Translation optimization • Instrumentation optimization • Client API • Experimental Result • Performance evaluation • Statistics collection

Client API

Umbra Client: Shared Memory Detection

    static void instrument_update(void *drcontext, umbra_info_t *umbra_info,
                                  mem_ref_t *ref, instrlist_t *ilist,
                                  instr_t *where)
    {
        …
        /* lock or [%r1], tid_map → [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }

• Meta-data maintains a bit map to store which threads access the associated memory
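The effect of the inserted `lock or` can be sketched in plain C11 with an atomic OR on a 32-bit shadow word. This is an illustration of the bitmap idea only, not the client's code: `shadow_word_t`, `record_access`, and `is_shared` are hypothetical names, and folding thread IDs into 32 bits with a modulo is an assumption:

```c
#include <stdatomic.h>
#include <stdint.h>

/* One 32-bit shadow word: bit i set means thread i touched the location. */
typedef _Atomic uint32_t shadow_word_t;

/* Atomically OR the accessing thread's bit into the shadow word,
 * which is what a `lock or` on the shadow address achieves. */
static void record_access(shadow_word_t *shadow, unsigned tid)
{
    atomic_fetch_or(shadow, (uint32_t)1 << (tid % 32));
}

/* A location is shared once more than one bit is set. */
static int is_shared(shadow_word_t *shadow)
{
    uint32_t m = atomic_load(shadow);
    return (m & (m - 1)) != 0;
}

static shadow_word_t demo_shadow;   /* zero-initialized: no accesses yet */
```

Repeated accesses by the same thread leave the word unchanged, so only the first access by a second thread flips the location to shared.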

Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation √ • DynamoRIO • Optimization √ • Translation optimization • Instrumentation optimization • Client API √ • Experimental Result • Performance evaluation • Statistics collection

Performance Evaluation • (Figure: chart of slowdown relative to native execution)

EMS64: Efficient Memory Shadowing for 64-bit Architectures • Translation • Reference uni-cache hit rate: 99.93% • Still need a costly check to catch the remaining 0.07% • Register steal; save flags; compare & jump; restore • EMS64 (ISMM’10) • Speculatively use a displacement without a check • Notified by a memory access violation fault when the displacement is incorrect

EMS64 Preliminary Result • (Figure: chart of slowdown relative to native execution)

Thanks • Download • http://people.csail.mit.edu/qin_zhao/umbra/ • Q & A
