Umbra: Efficient and Scalable Memory Shadowing. Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT). CGO 2010, Toronto, Canada, April 26, 2010.
Shadow Memory • Meta-data • Track properties of application memory • Synchronized Update • Application data and meta-data [Figure: application memory regions (a.out, heap, libc, stack), each paired with a corresponding shadow memory region]
Examples • Memory Error Detection • MemCheck [VEE’07] • Purify [USENIX’92] • Dr. Memory • MemTracker [HPCA’07] • Dynamic Information Flow Tracking • LIFT [MICRO’39] • TaintTrace [ISCC’06] • Multi-threaded Debugging • Eraser [TCS’97] • Helgrind • Others • Redux [TCS’03] • Software Watchpoint [CC’08]
Issues • Performance • Runtime overhead • Example: MemCheck, 25x slowdown [VEE’07] • Scalability • 64-bit architecture • Dependence • OS • Hardware • Development • Implemented with specific analysis • Lack of a general framework
Memory Shadowing System • Dynamic Instrumentation • Context switch (application ↔ shadow) • Address calculation • Updating meta-data • Memory Management • Memory allocation / free • Monitor application memory management • Manage shadow memory • Mapping translation scheme (addrA → addrS) • DMS: Direct Mapping Scheme • SMS: Segmented Mapping Scheme
Direct Mapping Scheme (DMS) • Single memory region for the entire address space • Translation: lea [addr] → %r1; add %r1, disp → %r1 • Issue: address conflict between memA and memS [Figure: application and shadow memory regions; chart of slowdown relative to native execution]
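The direct mapping above amounts to one addition per reference. A minimal sketch in C, assuming a hypothetical layout in which application addresses lie below a constant `DMS_DISP` and the shadow region starts at that constant (the value and the names `dms_translate` and `dms_conflicts` are illustrative, not taken from the slides):

```c
#include <stdint.h>

/* Hypothetical layout: application addresses lie below DMS_DISP and the
 * shadow region starts at DMS_DISP. Both choices are assumptions. */
#define DMS_DISP UINT64_C(0x20000000)

/* Direct mapping: a single addition, matching "add %r1, disp". */
static inline uint64_t dms_translate(uint64_t addr_app)
{
    return addr_app + DMS_DISP;
}

/* The conflict noted on the slide: an application address that already
 * falls inside the shadow region cannot be shadowed safely. */
static inline int dms_conflicts(uint64_t addr_app)
{
    return addr_app >= DMS_DISP;
}
```

Under these assumptions, `dms_translate(0x1000)` yields `0x20001000`, while any application address at or above `DMS_DISP` is reported as a conflict.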
Segmented Mapping Scheme (SMS) • Shadow segment per application segment • Translation: • Segment lookup (address indexing) • Address translation • Code: lea [addr] → %r1; mov %r1 → %r2; shr %r2, 16 → %r2; add %r1, disp[%r2] → %r1 [Figure: application segments App 1 and App 2 mapped to shadow segments Shd 1 and Shd 2 through a segment table (addrA → addrS); chart of slowdown relative to native execution]
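The segment lookup and translation can be sketched in C for a 32-bit address space. The 16-bit shift comes from the slide's `shr`; the table size, the helper `sms_map`, and the requirement that bases be 64 KB aligned are assumptions made for illustration:

```c
#include <stdint.h>

#define SEG_SHIFT 16                      /* from "shr %r2, 16" */
#define SEG_COUNT (1u << 16)              /* 2^32 / 2^16 segments (assumed 32-bit) */

/* One displacement per 64 KB application segment. */
static int64_t seg_disp[SEG_COUNT];

/* Hypothetical helper: record that [app_base, app_base + size) is shadowed
 * at shd_base. Bases are assumed 64 KB aligned and the range is assumed
 * to stay within the 32-bit address space. */
static void sms_map(uint32_t app_base, uint32_t shd_base, uint32_t size)
{
    int64_t disp = (int64_t)shd_base - (int64_t)app_base;
    uint64_t end = (uint64_t)app_base + size;
    for (uint64_t a = app_base; a < end; a += (UINT64_C(1) << SEG_SHIFT))
        seg_disp[a >> SEG_SHIFT] = disp;
}

/* Segment lookup by address indexing, then add the segment's displacement. */
static uint32_t sms_translate(uint32_t addr)
{
    return (uint32_t)((int64_t)addr + seg_disp[addr >> SEG_SHIFT]);
}
```

Each application segment can carry a different displacement, which is what lets shadow segments be placed wherever free address space exists.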
Umbra • Mapping Scheme • Segmented mapping • Scale with actual memory usage • Implementation • DynamoRIO • Optimization • Translation optimization • Instrumentation optimization • Client API • Experimental Results • Performance evaluation • Statistics collection
Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS • Infeasible due to memory layout [Figure: 64-bit address space layout with a.out and stack in user space below 2^47, unusable space above it, and kernel space with vsyscall up to 2^64]
Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS • Infeasible due to memory layout • Single-Level SMS • Too big (~4 billion entries)
Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS • Infeasible due to memory layout • Single-Level SMS • Too big (~4 billion entries) • Multi-Level SMS • Even more expensive • Fast path on lower 32 GB (MemCheck) [Figure: chart of slowdown relative to native execution]
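The "~4 billion entries" figure is consistent with the following arithmetic, assuming 48 bits of virtual address and the 64 KB (16-bit shift) segment size used in the SMS translation code. The 8-byte entry size in the second line is an additional assumption, used only to give a sense of the table's footprint:

```latex
\[
\frac{2^{48}}{2^{16}} = 2^{32} \approx 4.3 \times 10^{9} \ \text{entries}
\]
\[
2^{32} \ \text{entries} \times 2^{3} \ \text{bytes/entry} = 2^{35} \ \text{bytes} = 32\ \text{GiB}
\]
```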
Shadow Memory Mapping • Scaling to 64-bit Architecture • DMS is infeasible • Single-Level SMS is too sparse • Multi-Level SMS is too expensive • Umbra Solution • Eliminate empty entries • Compact table • Walk the table to find the entry
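A compact table of this kind can be sketched in C as an array holding one entry per application memory unit that actually exists, searched linearly. The `unit_t` layout and the function names are assumptions for illustration; the slides do not show Umbra's actual data structures:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: one per application memory unit in use. */
typedef struct {
    uint64_t app_base;  /* first application address in the unit */
    uint64_t app_end;   /* one past the last application address */
    int64_t  disp;      /* shadow address = application address + disp */
} unit_t;

/* Walk the compact table to find the entry covering addr, or NULL. */
static const unit_t *table_walk(const unit_t *tbl, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tbl[i].app_base && addr < tbl[i].app_end)
            return &tbl[i];
    return NULL;
}

/* Translate addr; returns 1 and stores the shadow address on success. */
static int umbra_translate(const unit_t *tbl, size_t n,
                           uint64_t addr, uint64_t *shd)
{
    const unit_t *u = table_walk(tbl, n, addr);
    if (u == NULL)
        return 0;
    *shd = addr + (uint64_t)u->disp;  /* unsigned wraparound handles negative disp */
    return 1;
}
```

The table grows with the number of mapped units rather than with the size of the address space, at the cost of a walk on every lookup.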
Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation • DynamoRIO • Optimization • Translation optimization • Instrumentation optimization • Client API • Experimental Result • Performance evaluation • Statistics collection
Implementation • Memory Manager • Monitor and control application memory allocation • brk, mmap, munmap, mremap • Allocate shadow memory • Maintain translation table • Instrumenter • Instrument every memory reference • Context save • Address calculation • Address translation • Shadow memory update • Context restore [Figure: application units App 1 and App 2 with shadow units Shd 1 and Shd 2]
Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation √ • DynamoRIO • Optimization • Translation optimization • Instrumentation optimization • Client API • Experimental Result • Performance evaluation • Statistics collection
Unoptimized System • Small overhead from DynamoRIO • Slower than SMS-64 • Need to walk the global translation table • Why so slow? • 41.79% of instructions are memory references • For each of these instructions • Full context switch • Table lookup • Call-out instrumentation [Figure: global translation table]
Optimization • Translation Optimization • Thread-local translation cache • Hashtable lookup • Memoization mini-cache • Reference uni-cache • Instrumentation Optimization • Context switch reduction • Reference grouping • 3-stage code layout [Figure: global translation table]
1. Thread-Local Translation Cache • Local translation table per thread • Synchronize with global translation table when necessary • Avoid lock contention • Walk table to find matching entry • Walk global table if not found in thread-local cache • Inlined instrumentation [Figure: Thread 1 and Thread 2, each with a thread-local translation cache in front of the global translation table]
2. Hashtable Lookup • Hashtable per thread • Fixed number of slots • Hash(addrA) → entry in thread-local cache • If match, found • If no match, walk the local cache [Figure: Thread 1 and Thread 2, each with a hashtable indexing its thread-local translation cache, in front of the global translation table]
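The hashtable step can be sketched in C as a fixed array of slots, each pointing at an entry in the thread-local cache, with a walk of the cache on a mismatch. The slot count, cache capacity, hash function, and all names here are assumptions; the slides specify only a fixed number of slots per thread:

```c
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16   /* assumed hashing granularity (64 KB) */
#define HASH_SLOTS 64   /* fixed number of slots (assumed value) */
#define CACHE_MAX  32   /* assumed capacity of the local cache */

typedef struct { uint64_t base, end; int64_t disp; } unit_t;

typedef struct {
    unit_t cache[CACHE_MAX];  /* thread-local translation cache */
    size_t count;
    int    slot[HASH_SLOTS];  /* index into cache, or -1 if empty */
} tcache_t;

static size_t hash_addr(uint64_t addr)
{
    return (size_t)((addr >> UNIT_SHIFT) % HASH_SLOTS);
}

static void tcache_init(tcache_t *tc)
{
    tc->count = 0;
    for (size_t i = 0; i < HASH_SLOTS; i++)
        tc->slot[i] = -1;
}

/* Append a unit to the local cache; returns its index or -1 if full. */
static int tcache_add(tcache_t *tc, unit_t u)
{
    if (tc->count == CACHE_MAX)
        return -1;
    tc->cache[tc->count] = u;
    return (int)tc->count++;
}

/* Hash first; on a mismatch walk the local cache and refresh the slot.
 * NULL means the caller would fall back to the global table. */
static const unit_t *tcache_lookup(tcache_t *tc, uint64_t addr)
{
    size_t h = hash_addr(addr);
    int i = tc->slot[h];
    if (i >= 0 && addr >= tc->cache[i].base && addr < tc->cache[i].end)
        return &tc->cache[i];
    for (size_t k = 0; k < tc->count; k++) {
        if (addr >= tc->cache[k].base && addr < tc->cache[k].end) {
            tc->slot[h] = (int)k;
            return &tc->cache[k];
        }
    }
    return NULL;
}
```

Because each thread owns its `tcache_t`, lookups in this sketch need no locking; only the fallback to the global table would.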
3. Memoization Mini-Cache • Four-entry table per thread • Stack • Heap • Application (a.out) • Unit found in last table lookup • If no match, hashtable lookup • 68.93% hit ratio [Figure: Thread 1 and Thread 2, each with a memoization mini-cache in front of its hashtable and thread-local translation cache, backed by the global translation table]
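A four-entry mini-cache of this shape can be sketched in C as a tiny array checked before any hashing. The entry roles follow the slide (stack, heap, a.out, last lookup); the struct layout and function names are hypothetical:

```c
#include <stdint.h>

typedef struct { uint64_t base, end; int64_t disp; } unit_t;

/* Entry roles taken from the slide; the enum itself is illustrative. */
enum { MINI_STACK, MINI_HEAP, MINI_APP, MINI_LAST, MINI_ENTRIES };

typedef struct { unit_t e[MINI_ENTRIES]; } minicache_t;

/* Check the four entries; returns 1 and stores the shadow address on a hit.
 * An all-zero entry has base == end and therefore never matches. */
static int mini_lookup(const minicache_t *mc, uint64_t addr, uint64_t *shd)
{
    for (int i = 0; i < MINI_ENTRIES; i++) {
        const unit_t *u = &mc->e[i];
        if (addr >= u->base && addr < u->end) {
            *shd = addr + (uint64_t)u->disp;
            return 1;
        }
    }
    return 0;  /* miss: the caller would try the hashtable next */
}

/* After a slower lookup succeeds, memoize the unit it found. */
static void mini_remember(minicache_t *mc, unit_t found)
{
    mc->e[MINI_LAST] = found;
}
```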
4. Reference Uni-Cache • Software uni-cache per instruction per thread • Last reference unit tag • Last translation displacement • If no match, memoization mini-cache check • 99.93% hit ratio [Figure: per-instruction reference uni-caches for example instructions (ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX) in Thread 1 and Thread 2, backed by the memoization mini-cache, hashtable, thread-local translation cache, and global translation table]
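The uni-cache check reduces to comparing one tag and reusing one displacement. A minimal sketch in C, assuming the tag is the address shifted right by 16 bits and that a slower lookup (here the stand-in `demo_slow_lookup`, which is purely hypothetical) supplies the displacement on a miss:

```c
#include <stdint.h>

#define UNIT_SHIFT 16  /* assumption: tag = address >> 16 */

/* One of these per instrumented instruction, per thread. */
typedef struct {
    uint64_t tag;    /* last reference unit tag */
    int64_t  disp;   /* last translation displacement */
    int      valid;
} unicache_t;

/* Signature of the slower path (mini-cache, hashtable, ...). */
typedef int64_t (*slow_lookup_fn)(uint64_t addr);

/* Hypothetical stand-in for the slower path: a constant displacement. */
static int64_t demo_slow_lookup(uint64_t addr)
{
    (void)addr;
    return INT64_C(0x100000000);
}

/* Reuse the cached displacement when the tag matches; otherwise refill
 * from the slower path and count the miss. */
static uint64_t unicache_translate(unicache_t *uc, uint64_t addr,
                                   slow_lookup_fn slow, unsigned *misses)
{
    uint64_t tag = addr >> UNIT_SHIFT;
    if (!uc->valid || uc->tag != tag) {
        uc->disp = slow(addr);
        uc->tag = tag;
        uc->valid = 1;
        (*misses)++;
    }
    return addr + (uint64_t)uc->disp;
}
```

An instruction that keeps touching the same unit, as most do according to the reported hit ratio, takes only the compare-and-add path after its first execution.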
5. Context Switch Reduction • Register liveness analysis • Use dead register • Avoid flags save/restore [Figure: same instruction and cache hierarchy diagram as for the reference uni-cache]
6. Reference Grouping • One reference cache for multiple references • Stack local variables • Different members of the same object [Figure: same instruction and cache hierarchy diagram as for the reference uni-cache]
3-stage Code Layout • Inline stub (<10 instructions) • Quick inline check code with minimal context switch • Lean procedure (~50 instructions) • Simple assembly procedure with partial context switch • Callout (C function) • C function with complete context switch [Figure: code flow through inline stub, lean procedure, and callout, showing in order: uni-cache check, memoization check, hashtable lookup, local cache lookup, <full context switch>, c_function() { /* global table lookup */ ... }, <full context switch>, followed by the app instruction]
Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation √ • DynamoRIO • Optimization √ • Translation optimization • Instrumentation optimization • Client API • Experimental Result • Performance evaluation • Statistics collection
Umbra Client: Shared Memory Detection

static void
instrument_update(void *drcontext, umbra_info_t *umbra_info, mem_ref_t *ref,
                  instrlist_t *ilist, instr_t *where)
{
    …
    /* lock or [%r1], tid_map → [%r1] */
    opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
    opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
    instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
    LOCK(instr);
    instrlist_meta_preinsert(ilist, label, instr);
}

• The meta-data is a bitmap recording which threads have accessed the associated memory
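The effect of the inserted `lock or` can be sketched in plain C11 with an atomic OR on a 32-bit shadow word. This is an illustration of the bitmap idea only, not DynamoRIO or Umbra code; the function name, the one-bit-per-thread encoding, and the "more than one bit set" sharing test are assumptions:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Record that thread `tid` touched the memory shadowed by *shadow.
 * Returns 1 if the bitmap now shows more than one thread, else 0.
 * Assumption: one bit per thread, thread ids folded modulo 32. */
static int shadow_mark_access(_Atomic uint32_t *shadow, unsigned tid)
{
    uint32_t bit = UINT32_C(1) << (tid % 32);
    uint32_t old = atomic_fetch_or(shadow, bit);  /* atomic, like "lock or" */
    uint32_t now = old | bit;
    return (now & (now - 1)) != 0;                /* more than one bit set */
}
```

A tool built on this would typically read the bitmaps at exit, or react as soon as the function returns 1, to report memory touched by several threads.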
Umbra • Mapping Scheme √ • Segmented mapping • Scale with actual memory usage • Implementation √ • DynamoRIO • Optimization √ • Translation optimization • Instrumentation optimization • Client API √ • Experimental Result • Performance evaluation • Statistics collection
Performance Evaluation [Figure: chart of slowdown relative to native execution]
EMS64: Efficient Memory Shadowing for 64-bit • Translation • Reference uni-cache hit rate: 99.93% • Still need a costly check to catch the 0.07% • Register steal; save flags; compare & jump; restore • EMS64 (ISMM’10) • Speculatively use a disp without a check • Notified of an incorrect disp by a memory access violation fault
EMS64 Preliminary Result [Figure: chart of slowdown relative to native execution]
Thanks • Download • http://people.csail.mit.edu/qin_zhao/umbra/ • Q & A