
Day 2: Building Process Virtualization Systems



Presentation Transcript


  1. Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009

  2. Course Outline • Day 1 – What is Process Virtualization? • Day 2 – Building Process Virtualization Systems • Day 3 – Using Process Virtualization Systems • Day 4 – Symbiotic Optimization

  3. JIT-Based Process Virtualization [Diagram: Application → Transform → Profile → Code Cache → Execute]

  4. What are the Challenges? • Performance! Solutions: code caches (only transform code once), trace selection (focus on hot paths), branch linking (only perform cache lookup once), indirect branch hash tables / chaining • Memory “management” • Correctness: self-modifying code, munmaps, multithreading • Transparency: context switching, eflags

  5. What is the Overhead? • The latest Pin overhead numbers …

  6. Sources of Overhead • Internal • Compiling code & exit stubs (region detection, region formation, code generation) • Managing code (eviction, linking) • Managing directories and performing lookups • Maintaining consistency (SMC, DLLs) • External • User-inserted instrumentation

  7. Improving Performance: Code Caches [Flowchart: Start → hash table lookup on the branch target address. Hit → execute from the code cache until an exit stub returns control. Miss → interpret and increment a counter; once the code is hot, perform region formation & optimization, evict code if there is no room in the code cache, then insert the new code and update the hash table]
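
Read as pseudocode, the flowchart above is the VM dispatch loop. A minimal C++ sketch follows; every name in it (lookupTable, HOT_THRESHOLD, formRegion, and the stubbed helpers) is a hypothetical stand-in, not Pin's actual internals:

      #include <cstdint>
      #include <unordered_map>

      using Addr = std::uint64_t;
      struct CachedCode { };                       // translated region + exit stubs

      static std::unordered_map<Addr, CachedCode*> lookupTable;
      static std::unordered_map<Addr, int> counter;
      static const int HOT_THRESHOLD = 50;         // assumed value; real systems tune this

      static bool roomInCache()        { return true; }               // stub
      static void evictCode()          { }                            // stub
      static CachedCode* formRegion(Addr) { return new CachedCode; }  // stub: region formation + codegen
      static void execute(CachedCode*) { }                            // stub: run translated code
      static void interpret(Addr)      { }                            // stub: run one block untranslated

      void dispatch(Addr target) {
          auto hit = lookupTable.find(target);
          if (hit != lookupTable.end()) {          // hash table hit:
              execute(hit->second);                //   run straight from the code cache
              return;
          }
          if (++counter[target] < HOT_THRESHOLD) { // miss: profile until the code is hot
              interpret(target);
              return;
          }
          if (!roomInCache())                      // make room before inserting
              evictCode();
          CachedCode* code = formRegion(target);   // region formation & optimization
          lookupTable[target] = code;              // insert and update the hash table
          execute(code);
      }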

  8. Software-Managed Code Caches • Store transformed code at run time to amortize overhead of process VMs • Contain a (potentially altered) copy of application code [Diagram: the Application → Transform → Profile → Code Cache → Execute loop from slide 3]

  9. Code Cache Contents • Every application instruction executed is stored in the code cache (at least) • Code Regions • Altered copies of application code • Basic blocks and/or traces • Exit stubs • Swap application ↔ VM state • Return control to the process VM

  10. Code Regions • Basic Blocks • Traces [Diagram: BBL A (Inst1, Inst2, Inst3, Branch B) and blocks A–D shown as a CFG, as laid out in memory, and as a selected trace]

  11. Exit Stubs • One exit stub exists for every exit from every trace or basic block • Functionality • Prepare for context switch • Return control to VM dispatch • Details • Each exit stub ≈ 3 instructions [Diagram: a cached trace A→B→D with side exits “Exit to C” and “Exit to E”]
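
One hypothetical way to picture the contents described on slides 9–11 as data structures (an illustration only, not Pin's real representation):

      #include <cstdint>
      #include <vector>

      // One stub per trace/basic-block exit. The stub body is roughly three
      // instructions: spill enough state for a context switch, record which
      // exit fired, and branch to the VM dispatch.
      struct ExitStub {
          std::uint64_t targetSPC;            // original address this exit leads to
          std::uint8_t  code[16];             // the ~3-instruction stub body
      };

      struct CodeRegion {                     // a basic block or a trace
          std::uint64_t startSPC;             // original address of the region
          std::vector<std::uint8_t> code;     // altered copy of application code
          std::vector<ExitStub>     exits;    // one exit stub per region exit
      };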

  12. Performance: Trace Selection • Interprocedural path • Single entry, multiple exit [Diagram: blocks A–I with calls and returns, shown as a CFG and as their layout in memory, versus the selected trace (superblock) laid out contiguously in the code cache with exit stubs such as “Exit to C” and “Exit to F”]
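
A sketch of how a greedy selector might grow such a single-entry, multiple-exit trace from a hot block, in the spirit of next-executing-tail selection; the types and the stopping rule here are assumptions, not the actual Pin algorithm:

      #include <cstddef>
      #include <cstdint>
      #include <vector>

      struct BasicBlock {
          std::uint64_t addr;
          BasicBlock* likelySucc;   // most frequent successor, from profiling
          bool endsTrace;           // e.g., indirect branch or loop back edge
      };

      std::vector<BasicBlock*> selectTrace(BasicBlock* hotEntry, std::size_t maxBlocks) {
          std::vector<BasicBlock*> trace;          // single entry, multiple exits
          for (BasicBlock* bb = hotEntry;
               bb && trace.size() < maxBlocks;
               bb = bb->likelySucc) {
              trace.push_back(bb);                 // laid out contiguously in the cache
              if (bb->endsTrace) break;            // stop at indirect branches etc.
          }
          return trace;
      }

Stopping at indirect branches and back edges keeps speculation bounded, and the maxBlocks cap trades longer traces against unexecuted tail code (see slide 29).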

  13. Performance: Cache Linking [Diagram: Trace #1's exits Exit #1a and Exit #1b branch to the Dispatch until they are linked directly to Trace #2 and Trace #3]

  14. Linking Traces • Proactive linking • Lazy linking [Diagram: the traces from slide 12 in the code cache, with exits such as “Exit to C”, “Exit to F”, and “Exit to A” either linked directly to their target traces or left to exit through their stubs]
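
A sketch of the linking step itself, with hypothetical names: proactive linking runs this logic over a trace's exits as soon as the trace is compiled, while lazy linking runs it the first time an exit actually fires.

      #include <cstdint>
      #include <unordered_map>

      static std::unordered_map<std::uint64_t, void*> traceBySPC;  // cached traces

      static void patchBranch(void* branchSite, void* newTarget) {
          // atomically rewrite the exit branch in place (see slide 17 for how)
          (void)branchSite; (void)newTarget;
      }

      // Called from the dispatcher when an unlinked exit fires (lazy linking).
      void* linkExit(void* exitBranchSite, std::uint64_t targetSPC) {
          auto it = traceBySPC.find(targetSPC);
          if (it == traceBySPC.end())
              return nullptr;                       // target not cached yet: compile first
          patchBranch(exitBranchSite, it->second);  // future executions skip the dispatch
          return it->second;
      }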

  15. Are Links Highly Beneficial?

  16. Code Cache Visualization

  17. Challenge: Rewriting Instructions • We must regularly rewrite branches • No atomic branch write on x86 • Pin uses a neat trick*: atomically overwrite the first 2 bytes of the “old” 5-byte branch with a 2-byte self branch (so any thread that arrives spins in place), then write the remaining n−2 bytes of the “new” branch, then atomically replace the self branch with the first 2 bytes of the “new” branch * Sundaresan et al. 2006
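
A C++ illustration of the three steps, assuming the branch's first two bytes can be stored atomically (in practice they must not straddle an atomicity boundary such as a cache line); this shows the idea, not Pin's literal code:

      #include <atomic>
      #include <cstdint>
      #include <cstring>

      // Rewrite a live 5-byte branch that other threads may be executing.
      void rewriteBranch(std::uint8_t* site, const std::uint8_t newBranch[5]) {
          auto* first2 = reinterpret_cast<std::atomic<std::uint16_t>*>(site);

          // Step 1: park arriving threads on a 2-byte self branch ("jmp -2",
          // encoded EB FE; stored little-endian as 0xFEEB).
          first2->store(0xFEEB, std::memory_order_release);

          // Step 2: write the last n-2 = 3 bytes of the new branch. No thread
          // can execute them while the self branch is in place.
          std::memcpy(site + 2, newBranch + 2, 3);

          // Step 3: atomically replace the self branch with the new branch's
          // first 2 bytes; spinning threads now fall into the new branch.
          std::uint16_t head;
          std::memcpy(&head, newBranch, 2);
          first2->store(head, std::memory_order_release);
      }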

  18. Challenge: Achieving Transparency (SPC = source program counter, TPC = translated program counter) • Pretend as though the original program is executing Original code: 0x1000 call 0x4000 (push 0x1006 on the stack, then jump to 0x4000) Translated code: 0x7000 push 0x1006; 0x7006 jmp 0x8000 Code cache address mapping: 0x1000 → 0x7000 (“caller”), 0x4000 → 0x8000 (“callee”)
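
The mapping in the last line is the core data structure. A tiny sketch using the slide's example addresses and a plain hash map (Pin's real table is more elaborate):

      #include <cstdint>
      #include <unordered_map>

      // SPC (the address the application believes it is running at) to
      // TPC (where the translated copy actually lives in the code cache).
      static std::unordered_map<std::uint64_t, std::uint64_t> spcToTpc = {
          {0x1000, 0x7000},   // "caller"
          {0x4000, 0x8000},   // "callee"
      };

      // Any SPC the application exposes (e.g., a return address popped off the
      // stack) must be mapped back to its TPC before execution can continue.
      std::uint64_t translate(std::uint64_t spc) {
          auto it = spcToTpc.find(spc);
          return it != spcToTpc.end() ? it->second : 0;  // 0: not cached yet
      }

Because the translated call pushed the original return address 0x1006, anything the application does with that value (compare it, print it, unwind through it) behaves as it would natively; only the VM pays the cost of mapping it back into the cache.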

  19. Challenge: Self-Modifying Code • The problem • Code cache must detect SMC and invalidate corresponding cached traces • Solutions • Many proposed … but without HW support, they are very expensive! • Changing page protection • Memory diff prior to execution • On ARM, there is an explicit instruction for SMC!

  20. Self-Modifying Code Handler (written by Alex Skaletsky)

      int main (int argc, char **argv) {
          PIN_Init(argc, argv);
          TRACE_AddInstrumentFunction(InsertSmcCheck, 0);
          PIN_StartProgram();  // Never returns
          return 0;
      }

      void InsertSmcCheck (TRACE trace, VOID *v) {
          . . .
          memcpy(traceCopyAddr, traceAddr, traceSize);
          TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck,
                           IARG_PTR, traceAddr, IARG_PTR, traceCopyAddr,
                           IARG_UINT32, traceSize, IARG_CONTEXT, IARG_END);
      }

      void DoSmcCheck (VOID *traceAddr, VOID *traceCopyAddr,
                       USIZE traceSize, CONTEXT *ctxP) {
          if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) {
              CODECACHE_InvalidateTrace((ADDRINT)traceAddr);
              PIN_ExecuteAt(ctxP);
          }
      }

  21. Challenge: Parallel Applications [Diagram: the Pin tool (instrumentation code, call-back handlers, analysis code) sits above Pin itself (JIT compiler, dispatcher, code cache, syscall emulator, signal emulator); threads T1 and T2 run in parallel through the code cache, while some VM components are serialized]

  22. Challenge: Code Cache Consistency • Cached code must be removed for a variety of reasons: • Dynamically unloaded code • Ephemeral/adaptive instrumentation • Self-modifying code • Bounded code caches [Diagram: the EXE → Transform → Profile → Code Cache → Execute loop from slide 3]
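
A sketch of range invalidation, the mechanism behind the first three bullets; the structures are hypothetical stand-ins, and a real system must also handle traces that straddle the range boundary:

      #include <cstdint>
      #include <map>

      struct Trace { /* translated code + exit stubs */ };

      // Each cached trace, keyed by the source address it was translated from.
      static std::map<std::uint64_t, Trace*> traceBySPC;

      // Called on munmap, library unload, or a detected SMC write to [lo, hi).
      void invalidateRange(std::uint64_t lo, std::uint64_t hi) {
          auto it = traceBySPC.lower_bound(lo);
          while (it != traceBySPC.end() && it->first < hi) {
              // Must also unlink any branches that point into this trace,
              // so no thread can re-enter the freed code.
              delete it->second;
              it = traceBySPC.erase(it);
          }
      }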

  23. Motivating a Bounded Code Cache • The Perl Benchmark

  24. Flushing the Code Cache • Option 1: All threads have a private code cache (oops, doesn’t scale) • Option 2: Shared code cache across threads • If one thread flushes the code cache, other threads may resume in stale memory

  25. Naïve Flush • Wait for all threads to exit the code cache (return to the VM) • Could wait indefinitely! [Timeline: Thread1–Thread3 each run VM → CC1 → VM, stall during the flush delay, then resume in CC2]

  26. Generational Flush • Allow threads to continue to make progress in a separate area of the code cache • Requires a high water mark [Timeline: Thread1–Thread3 each move from VM/CC1 to VM/CC2 with no stall]
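
A sketch of the bookkeeping behind a generational flush, with hypothetical structures; the point is that no thread ever has to stall waiting for stragglers:

      #include <atomic>
      #include <cstdint>

      struct Generation {
          std::uint8_t*    base = nullptr;    // this generation's cache area
          std::atomic<int> threadsInside{0};  // threads currently executing in it
      };

      static Generation gen[2];
      static std::atomic<int> current{0};

      // Flush: retire the current generation and direct all new fills (and all
      // threads, as they next pass through the dispatcher) into the other one.
      void generationalFlush() {
          int old = current.load();
          current.store(1 - old);
          // gen[old] is reclaimed later, once gen[old].threadsInside reaches 0.
      }

The design trades memory for latency: the high water mark must leave enough headroom that the new generation can absorb traces while the old one drains.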

  27. Eviction Granularities • Entire Cache • One Cache Block • One Trace • Address Range

      Build-Your-Own Cache Replacement:

      #include <iostream>
      #include "pin.H"

      void FlushOnFull() {
          CODECACHE_FlushCache();
          std::cout << "SWOOSH!" << std::endl;
      }

      int main(int argc, char **argv) {
          PIN_Init(argc, argv);
          CODECACHE_CacheIsFull(FlushOnFull);
          PIN_StartProgram();  // Never returns
          return 0;
      }

      % pin -cache_size 40960 -t flusher -- /bin/ls
      SWOOSH!
      SWOOSH!

  28. A Graphical Front-End

  29. Memory Scalability of the Code Cache • Ensuring scalability also requires carefully configuring the code stored in the cache • Trace Lengths • First basic block is non-speculative, others are speculative • Longer traces = fewer entries in the lookup table, but more unexecuted code • Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code

  30. Effect of Trace Length on Trace Count

  31. Effect of Trace Length on Memory

  32. Sources of Overhead • Internal • Compiling code & exit stubs (region detection, region formation, code generation) • Managing code (eviction, linking) • Managing directories and performing lookups • Maintaining consistency (SMC, DLLs) • External • User-inserted instrumentation

  33. Adding Instrumentation

  34. “Normal Pin” Execution Flow • Instrumentation is interleaved with application [Timeline: the uninstrumented application versus the “Pinned” application, whose running time grows by Pin overhead plus instrumentation overhead to give the instrumented application's total]

  35. “SuperPin” Execution Flow • SuperPin creates instrumented slices [Timeline: the SuperPinned application runs nearly as fast as the uninstrumented application, while instrumented slices execute alongside it]

  36. Issues and Design Decisions • Creating slices • How/when to start a slice • How/when to end a slice • System calls • Merging results

  37. Execution Timeline [Diagram: the original application runs on CPU1, forking instrumented slices S1–S6 onto CPU2–CPU4; each slice runs, records its result (sigr2–sigr6), and sleeps while the original resumes, with each result detected in turn until the application exits]
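
A highly simplified sketch of the fork-based slicing idea behind slides 35–37, with hypothetical names; the real SuperPin must additionally virtualize system calls within slices and merge the per-slice results:

      #include <sys/types.h>
      #include <unistd.h>

      static void runInstrumentedSlice() {
          // run under instrumentation until the slice end point, record results
      }

      // At each slice boundary the uninstrumented parent forks: the child
      // replays the next interval with instrumentation on a spare CPU while
      // the parent immediately runs ahead natively.
      void startSlice() {
          pid_t child = fork();
          if (child == 0) {
              runInstrumentedSlice();
              _exit(0);   // child exits once its slice results are recorded
          }
      }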

  38. Performance – icount1 % pin -t icount1 -- <benchmark>

  39. What Did We Learn Today? • Building Process VMs is only half the battle • Robustness, correctness, performance are paramount • Lots of “tricks” are in play • Code caches, trace selection, etc. • Knowing about these tricks is beneficial • Lots of research opportunities • Understanding the inner workings often helps you write better tools

  40. Want More Info? • Read the seminal Dynamo paper • See the more recent papers by the Pin, DynamoRIO, Valgrind teams • Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT Day 1 – What is Process Virtualization? Day 2 – Building Process Virtualization Systems Day 3 – Using Process Virtualization Systems Day 4 – Symbiotic Optimization
