400 likes | 506 Views
Day 2: Building Process Virtualization Systems. Kim Hazelwood ACACES Summer School July 2009. Course Outline. Day 1 – What is Process Virtualization? Day 2 – Building Process Virtualization Systems Day 3 – Using Process Virtualization Systems Day 4 – Symbiotic Optimization.
E N D
Day 2: Building Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009
Course Outline • Day 1 – What is Process Virtualization? • Day 2 – Building Process Virtualization Systems • Day 3 – Using Process Virtualization Systems • Day 4 – Symbiotic Optimization
JIT-Based Process Virtualization Application Transform Profile Code Cache Execute
What are the Challenges? • Performance! Solutions: • Code caches – only transform code once • Trace selection – focus on hot paths • Branch linking – only perform cache lookup once • Indirect branch hash tables / chaining • Memory “management” • Correctness – self-modifying code, munmaps, multithreading • Transparency – context switching, eflags
What is the Overhead? • The latest Pin overhead numbers …
Sources of Overhead • Internal • Compiling code & exit stubs (region detection, region formation, code generation) • Managing code (eviction, linking) • Managing directories and performing lookups • Maintaining consistency (SMC, DLLs) • External • User-inserted instrumentation
Improving Performance: Code Caches Exit Stub Hash Table Lookup Hit Code Cache Interpret Branch Target Address Start Counter++ Miss Delete Insert Update Hash Table Room in Code Cache? Yes Code is Hot? No Region Formation & Optimization Yes Evict Code No
Software-Managed Code Caches • Store transformed code at run time to amortize overhead of process VMs • Contain a (potentially altered) copy of application code Application Transform Profile Code Cache Execute
Code Cache Contents • Every application instruction executed is stored in the code cache (at least) • Code Regions • Altered copies of application code • Basic blocks and/or traces • Exit stubs • Swap applicationVM state • Return control to the process VM
BBL A: Inst1 Inst2 Inst3 Branch B A A A A B B B C C C D D D D CFG In Memory Trace Code Regions • Basic Blocks • Traces
Exit Stubs • One exit stub exists for every exit from every trace or basic block • Functionality • Prepare for context switch • Return control to VM dispatch • Details • Each exit stub ≈ 3 instructions A B Exit to C D Exit to E
A A A B B Exit to C B C C D Call D E D E I G Exit to F Call H F G I I H E F Return Return G H Performance: Trace Selection • Interprocedural path • Single entry, multiple exit Trace (superblock) Layout in Code Cache CFG Layout in Memory
Performance: Cache Linking Trace #1 Trace #2 Trace #3 Dispatch Exit #1a Exit #1b
F H A A I A B C B C B Exit to C D D Exit to A D Call E E D E E G G G F C Exit to F H H F G H H D I I I I E I H G Return Exit to F H I Exit to A Linking Traces • Proactive linking • Lazy linking
Challenge: Rewriting Instructions • We must regularly rewrite branches • No atomic branch write on x86 • Pin uses a neat trick*: “old” 5-byte branch 2-byte self branch n-2 bytes of “new” branch “new” 5-byte branch * Sundaresan et al. 2006
SPC TPC Challenge: Achieving Transparency • Pretend as though the original program is executing Push 0x1006 on stack, then jump to 0x4000 Original Code: 0x1000call 0x4000 Translated Code: 0x7000push 0x1006 0x7006jmp 0x8000 Code cache address mapping: 0x1000 0x7000 “caller” 0x4000 0x8000 “callee”
Challenge: Self-Modifying Code • The problem • Code cache must detect SMC and invalidate corresponding cached traces • Solutions • Many proposed … but without HW support, they are very expensive! • Changing page protection • Memory diff prior to execution • On ARM, there is an explicit instruction for SMC!
Self-Modifying Code Handler • void main (int argc, char **argv) { • PIN_Init(argc, argv); • TRACE_AddInstrumentFunction(InsertSmcCheck,0); • PIN_StartProgram(); // Never returns • } • void InsertSmcCheck () { • . . . • memcpy(traceCopyAddr, traceAddr, traceSize); • TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck, • IARG_PTR, traceAddr, IARG_PTR, traceCopyAddr, • IARG_UINT32, traceSize, IARG_CONTEXT, IARG_END); • } • void DoSmcCheck (VOID* traceAddr, VOID *traceCopyAddr, • USIZE traceSize, CONTEXT* ctxP) { • if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) { • CODECACHE_InvalidateTrace((ADDRINT)traceAddr); • PIN_ExecuteAt(ctxP); • } • } (Written by Alex Skaletsky)
Challenge: Parallel Applications Pin Tool Instrumentation Code Call-Back Handlers Analysis Code Pin T1 T1 T2 JIT Compiler Dispatcher Code Cache T1 T1 Syscall Emulator T2 Signal Emulator Serialized Parallel
Challenge: Code Cache Consistency • Cached code must be removed for a variety of reasons: • Dynamically unloaded code • Ephemeral/adaptive instrumentation • Self-modifying code • Bounded code caches EXE Transform Profile Code Cache Execute
Motivating a Bounded Code Cache • The Perl Benchmark
Flushing the Code Cache • Option 1: All threads have a private code cache (oops, doesn’t scale) • Option 2: Shared code cache across threads • If one thread flushes the code cache, other threads may resume in stale memory
Naïve Flush • Wait for all threads to return to the code cache • Could wait indefinitely! Flush Delay VM CC1 VM stall CC2 Thread1 VM CC1 VM stall CC2 Thread2 VM CC1 VM CC2 Thread3 Time
VM CC1 VM CC2 VM CC1 VM CC2 VM CC1 VM CC2 Generational Flush • Allow threads to continue to make progress in a separate area of the code cache Thread1 Thread2 Thread3 Time Requires a high water mark
Eviction Granularities • Entire Cache • One Cache Block • One Trace • Address Range Build-Your-Own Cache Replacement • void main(int argc, char **argv) { • PIN_Init(argc,argv); • CODECACHE_CacheIsFull(FlushOnFull); • PIN_StartProgram(); //Never returns • } • void FlushOnFull() { • CODECACHE_FlushCache(); • cout << “SWOOSH!” << endl; • } % pin –cache_size 40960 –t flusher -- /bin/ls SWOOSH! SWOOSH!
Memory Scalability of the Code Cache • Ensuring scalability also requires carefully configuring the code stored in the cache • Trace Lengths • First basic block is non-speculative, others are speculative • Longer traces = fewer entries in the lookup table, but more unexecuted code • Shorter traces = two off-trace paths at ends of basic blocks with conditional branches = more exit stub code
Sources of Overhead • Internal • Compiling code & exit stubs (region detection, region formation, code generation) • Managing code (eviction, linking) • Managing directories and performing lookups • Maintaining consistency (SMC, DLLs) • External • User-inserted instrumentation
“Normal Pin” Execution Flow • Instrumentation is interleaved with application time Uninstrumented Application “Pinned” Application Instrumented Application Pin Overhead Instrumentation Overhead
“SuperPin” Execution Flow • SuperPin creates instrumented slices Uninstrumented Application SuperPinned Application Instrumented Slices
Issues and Design Decisions • Creating slices • How/when to start a slice • How/when to end a slice System calls Merging results
fork fork fork fork fork S1+ S4+ record sigr4, sleep resume resume sleep S2+ S5+ record sigr5, sleep record sigr2, sleep resume resume S3+ S6+ record sigr6, sleep record sigr3, sleep resume resume Execution Timeline S2 S3 S4 S5 S6 S1 original application fork CPU1 instrumentedapplication slices detect sigr2 detect sigr5 CPU2 detect sigr3 detect sigr6 CPU3 detect sigr4 detect exit CPU4 time
Performance – icount1 % pin –t icount1 -- <benchmark>
What Did We Learn Today? • Building Process VMs is only half the battle • Robustness, correctness, performance are paramount • Lots of “tricks” are in play • Code caches, trace selection, etc. • Knowing about these tricks is beneficial • Lots of research opportunities • Understanding the inner workings often helps you write better tools
Want More Info? • Read the seminal Dynamo paper • See the more recent papers by the Pin, DynamoRIO, Valgrind teams • Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT Day 1 – What is Process Virtualization? Day 2 – Building Process Virtualization Systems Day 3 – Using Process Virtualization Systems Day 4 – Symbiotic Optimization