
Umbra: Efficient and Scalable Memory Shadowing



Presentation Transcript


  1. Umbra: Efficient and Scalable Memory Shadowing Qin Zhao (MIT) Derek Bruening (VMware) Saman Amarasinghe (MIT) CGO 2010, Toronto, Canada April 26, 2010

  2. Shadow Memory
  • Meta-data: track properties of application memory
  • Synchronized update: meta-data is updated in step with the application data it describes
  [Figure: application memory (a.out, heap, libc, stack) with each segment mapped to a corresponding shadow memory segment]

  3. Examples
  • Memory error detection: MemCheck [VEE’07], Purify [USENIX’92], Dr. Memory, MemTracker [HPCA’07]
  • Dynamic information flow tracking: LIFT [MICRO’39], TaintTrace [ISCC’06]
  • Multi-threaded debugging: Eraser [TCS’97], Helgrind
  • Others: Redux [TCS’03], software watchpoints [CC’08]

  4. Issues
  • Performance: runtime overhead (e.g., MemCheck: 25x [VEE’07])
  • Scalability: 64-bit architectures
  • Dependence: on the OS and on hardware
  • Development: shadowing is typically implemented inside a specific analysis; there is no general framework

  5. Memory Shadowing System
  • Dynamic instrumentation: context switch (application ↔ shadow), address calculation, updating meta-data
  • Memory management: monitor application memory allocation / free, manage shadow memory
  • Mapping translation scheme (addr_A → addr_S)
  • DMS: Direct Mapping Scheme
  • SMS: Segmented Mapping Scheme

  6. Direct Mapping Scheme (DMS)
  • Single shadow region for the entire address space
  • Translation:
      lea [addr] → %r1
      add %r1, disp → %r1
  • Issue: address conflicts between mem_A and mem_S
  [Figure: application vs. shadow slowdown relative to native execution]

  7. Segmented Mapping Scheme (SMS)
  • One shadow segment per application segment
  • Translation: segment lookup (address indexing), then address translation
      lea [addr] → %r1
      mov %r1 → %r2
      shr %r2, 16 → %r2
      add %r1, disp[%r2] → %r1
  [Figure: addr_A in App 1 / App 2 translated through a segment table to addr_S in Shd 1 / Shd 2; slowdown relative to native execution]

  8. Umbra
  • Mapping scheme: segmented mapping that scales with actual memory usage
  • Implementation: DynamoRIO
  • Optimization: translation and instrumentation optimizations
  • Client API
  • Experimental results: performance evaluation and statistics collection

  9. Shadow Memory Mapping
  • Scaling to 64-bit architectures
  • DMS: infeasible due to the memory layout
  [Figure: 64-bit address space: a.out and stack in user space (up to 2^47), unusable space, then kernel space and vsyscall (up to 2^64)]

  10. Shadow Memory Mapping
  • Scaling to 64-bit architectures
  • DMS: infeasible due to the memory layout
  • Single-level SMS: table is too big (~4 billion entries)

  11. Shadow Memory Mapping
  • Scaling to 64-bit architectures
  • DMS: infeasible due to the memory layout
  • Single-level SMS: table is too big (~4 billion entries)
  • Multi-level SMS: even more expensive; MemCheck adds a fast path for the lower 32GB
  [Figure: slowdown relative to native execution]

  12. Shadow Memory Mapping
  • Scaling to 64-bit architectures
  • DMS is infeasible
  • Single-level SMS is too sparse
  • Multi-level SMS is too expensive
  • Umbra's solution: eliminate empty entries, keep a compact table, and walk the table to find the matching entry

  13. Umbra
  • Mapping scheme √ (segmented mapping that scales with actual memory usage)
  • Implementation: DynamoRIO
  • Optimization: translation and instrumentation optimizations
  • Client API
  • Experimental results: performance evaluation and statistics collection

  14. Implementation
  • Memory manager
  • Monitors and controls application memory allocation (brk, mmap, munmap, mremap)
  • Allocates shadow memory and maintains the translation table
  • Instrumenter: instruments every memory reference
  • Context save, address calculation, address translation, shadow memory update, context restore
  [Figure: application units App 1 / App 2 and their shadow units Shd 1 / Shd 2]

  15. Umbra
  • Mapping scheme √
  • Implementation √ (DynamoRIO)
  • Optimization: translation and instrumentation optimizations
  • Client API
  • Experimental results: performance evaluation and statistics collection

  16. Unoptimized System (~100x slowdown)
  • Small overhead from DynamoRIO itself
  • Slower than SMS-64: every lookup walks the global translation table
  • Why so slow? 41.79% of instructions are memory references, and each one pays for a full context switch, a table lookup, and call-out instrumentation
  [Figure: global translation table]

  17. Optimization
  • Translation optimization: thread-local translation cache, hashtable lookup, memoization mini-cache, reference uni-cache
  • Instrumentation optimization: context switch reduction, reference grouping, 3-stage code layout
  [Figure: global translation table]

  18. Optimization 1: Thread-Local Translation Cache
  • Local translation table per thread, synchronized with the global table when necessary
  • Avoids lock contention
  • Walk the local cache for a matching entry; walk the global table only if it is not found locally
  • Inlined instrumentation
  [Figure: per-thread local caches (Thread 1, Thread 2) in front of the global translation table]

  19. Optimization 2: Hashtable Lookup
  • Per-thread hashtable with a fixed number of slots
  • hash(addr_A) selects an entry in the thread-local cache
  • On a match, done; on a miss, walk the local cache
  [Figure: hashtable in front of the thread-local translation cache and global translation table]

  20. Optimization 3: Memoization Mini-Cache
  • Four-entry table per thread: stack, heap, application (a.out), and the unit found by the last table lookup
  • On a miss, fall back to the hashtable lookup
  • 68.93% hit ratio
  [Figure: memoization mini-cache in front of the hashtable, thread-local translation cache, and global translation table]

  21. Optimization 4: Reference Uni-Cache
  • Software uni-cache per instruction per thread: the last referenced unit's tag and the last translation displacement
  • On a miss, check the memoization mini-cache
  • 99.93% hit ratio
  [Figure: per-instruction uni-caches (e.g. ADD $1, (%RAX); MOV %RBX → 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX) in front of the mini-cache, hashtable, thread-local cache, and global table]

  22. Optimization 5: Context Switch Reduction
  • Register liveness analysis: use dead registers and avoid saving/restoring the flags
  [Figure: same cache hierarchy as the previous slide]

  23. Optimization 6: Reference Grouping
  • One reference cache shared by multiple nearby references, e.g. stack local variables or different members of the same object
  [Figure: same cache hierarchy as the previous slide]

  24. 3-Stage Code Layout
  • Inline stub (<10 instructions): quick inline uni-cache check with minimal context switch
  • Lean procedure (~50 instructions): simple assembly procedure with partial context switch (memoization check, hashtable lookup, local cache walk)
  • Callout: C function with complete context switch (global table lookup)

  25. Umbra
  • Mapping scheme √
  • Implementation √
  • Optimization √
  • Client API
  • Experimental results: performance evaluation and statistics collection

  26. Client API

  27. Umbra Client: Shared Memory Detection
  • Meta-data maintains a bitmap recording which threads access the associated memory

    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map → [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }

  28. Umbra
  • Mapping scheme √
  • Implementation √
  • Optimization √
  • Client API √
  • Experimental results: performance evaluation and statistics collection

  29. Performance Evaluation
  [Figure: slowdown relative to native execution across benchmarks]

  30. EMS64: Efficient Memory Shadowing for 64-bit
  • Translation: the reference uni-cache hits 99.93% of the time, but a costly check is still needed to catch the remaining 0.07% (register steal; save flags; compare and jump; restore)
  • EMS64 [ISMM’10]: speculatively use a displacement without any check; an incorrect displacement triggers a memory access violation fault, which notifies the runtime

  31. EMS64 Preliminary Results
  [Figure: slowdown relative to native execution]

  32. Thanks
  • Download: http://people.csail.mit.edu/qin_zhao/umbra/
  • Q & A
