Day 2: Building Process Virtualization Systems
Kim Hazelwood, ACACES Summer School, July 2009
Course Outline
• Day 1 – What is Process Virtualization?
• Day 2 – Building Process Virtualization Systems
• Day 3 – Using Process Virtualization Systems
• Day 4 – Symbiotic Optimization
JIT-Based Process Virtualization
[Diagram: Application → Transform → Profile → Code Cache → Execute]
What are the Challenges?
• Performance! Solutions:
  • Code caches – only transform code once
  • Trace selection – focus on hot paths
  • Branch linking – only perform the cache lookup once
  • Indirect branch hash tables / chaining
• Memory "management"
• Correctness – self-modifying code, munmaps, multithreading
• Transparency – context switching, eflags
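Indirect branches are the hard case for linking, since the target varies at run time. A minimal sketch of the hash-table idea (all names here are invented for illustration, not Pin's real data structures): the generated code probes a table mapping original target PCs to their code-cache addresses, and falls back to the dispatcher on a miss.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;

// Hypothetical table: source (original-program) PC -> code-cache PC.
std::unordered_map<Addr, Addr> ibTargets;

Addr dispatcher(Addr sourcePC) {
    // Slow path placeholder: look up or compile the target, then
    // remember it so the next indirect-branch probe hits.
    Addr translated = sourcePC + 0x6000;  // fake "translation" for the sketch
    ibTargets[sourcePC] = translated;
    return translated;
}

Addr resolveIndirect(Addr sourcePC) {
    auto it = ibTargets.find(sourcePC);
    if (it != ibTargets.end()) return it->second;  // fast path: hash hit
    return dispatcher(sourcePC);                   // miss: back to dispatch
}
```

The real fast path is emitted as inline machine code in the cache; chaining a small per-branch list of recent targets is a common refinement.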
What is the Overhead?
• The latest Pin overhead numbers …
[Performance chart not preserved]
Sources of Overhead
• Internal
  • Compiling code & exit stubs (region detection, region formation, code generation)
  • Managing code (eviction, linking)
  • Managing directories and performing lookups
  • Maintaining consistency (SMC, DLLs)
• External
  • User-inserted instrumentation
Improving Performance: Code Caches
[Flowchart: Start → Branch Target Address → Hash Table Lookup. Hit → execute from the Code Cache. Miss → Interpret, Counter++; when Code is Hot → Region Formation & Optimization → Room in Code Cache? Yes → Insert, Update Hash Table; No → Evict Code, Delete]
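The flowchart above can be sketched as a dispatch loop (a hypothetical simplification; the names, threshold, and "evict everything" policy are invented, and a real system executes natively inside the cache rather than returning addresses):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;
const int HOT_THRESHOLD = 50;          // assumed hotness threshold
const std::size_t CACHE_CAPACITY = 1024;

std::unordered_map<Addr, Addr> cacheIndex;  // source PC -> cache PC
std::unordered_map<Addr, int> counters;     // profiling counters
std::size_t cacheUsed = 0;

void evictAll() {                      // simplest policy: flush everything
    cacheIndex.clear();
    cacheUsed = 0;
}

Addr translate(Addr pc) {              // region formation + codegen (stub)
    if (cacheUsed + 16 > CACHE_CAPACITY) evictAll();
    cacheUsed += 16;                   // pretend each region is 16 bytes
    return cacheIndex[pc] = pc + 0x6000;  // fake cache address
}

// Returns the cache address to run, or 0 meaning "interpreted this time".
Addr dispatch(Addr pc) {
    auto hit = cacheIndex.find(pc);
    if (hit != cacheIndex.end()) return hit->second;  // hit: run cached code
    if (++counters[pc] < HOT_THRESHOLD) return 0;     // miss: interpret
    return translate(pc);              // hot: compile, insert, update table
}
```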
Software-Managed Code Caches
• Store transformed code at run time to amortize the overhead of process VMs
• Contain a (potentially altered) copy of application code
[Diagram: Application → Transform → Profile → Code Cache → Execute]
Code Cache Contents
• Every executed application instruction is stored in the code cache (at least once)
• Code Regions
  • Altered copies of application code
  • Basic blocks and/or traces
• Exit stubs
  • Swap application ↔ VM state
  • Return control to the process VM
Code Regions
• Basic Blocks (e.g., BBL A: Inst1, Inst2, Inst3, Branch B)
• Traces
[Diagram: blocks A–D shown as a CFG, as laid out in memory, and as a selected trace]
Exit Stubs
• One exit stub exists for every exit from every trace or basic block
• Functionality
  • Prepare for context switch
  • Return control to VM dispatch
• Details
  • Each exit stub ≈ 3 instructions
[Diagram: a trace A → B → D with side exits "Exit to C" and "Exit to E"]
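A rough model of what an exit stub does (invented names; in generated code the "≈ 3 instructions" are roughly: spill a register, load a stub id, jump to the dispatcher):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

struct ExitStub {
    Addr sourceTarget;   // original-program PC this exit leads to
};

std::vector<ExitStub> stubs;
Addr lastRequestedTarget = 0;

// Called at code-generation time; the returned id is baked into the
// trace's exit branch.
int makeStub(Addr sourceTarget) {
    stubs.push_back({sourceTarget});
    return (int)stubs.size() - 1;
}

// Models taking a side exit: save application state (elided here),
// record where the application wanted to go, and re-enter dispatch.
Addr takeExit(int stubId) {
    lastRequestedTarget = stubs[stubId].sourceTarget;
    return lastRequestedTarget;   // the dispatcher resolves this PC next
}
```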
Performance: Trace Selection
• Interprocedural path
• Single entry, multiple exits
[Diagram: a CFG (blocks A–I with calls and returns) as laid out in memory, and the resulting trace (superblock) layout in the code cache, with calls inlined and side exits such as "Exit to C" and "Exit to F" going through exit stubs]
Performance: Cache Linking
[Diagram: Trace #1's exits #1a and #1b, which initially branch to Dispatch, are linked to branch directly to Trace #2 and Trace #3]
Linking Traces
• Proactive linking
• Lazy linking
[Diagram: the trace layout in the code cache with exit stubs (e.g., "Exit to A", "Exit to F") patched to branch directly to the traces for those targets]
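Lazy linking can be sketched as follows (hypothetical names): an exit initially branches to the dispatcher; the first time it is taken while its target already exists in the cache, the branch is patched to jump straight to that trace.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;
const Addr DISPATCH = 0;   // sentinel for "branch goes to the dispatcher"

std::unordered_map<Addr, Addr> cacheIndex;  // source PC -> trace address

struct Exit {
    Addr target;    // source PC this exit leads to
    Addr branchTo;  // where the generated exit branch currently goes
};

// Models executing an exit branch; patches ("links") it on first use.
Addr followExit(Exit& e) {
    if (e.branchTo != DISPATCH) return e.branchTo;  // already linked
    auto it = cacheIndex.find(e.target);
    if (it == cacheIndex.end()) return DISPATCH;    // target not compiled yet
    e.branchTo = it->second;                        // patch: lazy link
    return e.branchTo;
}
```

Proactive linking does the same patching eagerly, at trace-generation time, for every exit whose target is already cached.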
Challenge: Rewriting Instructions
• We must regularly rewrite branches
• There is no atomic 5-byte branch write on x86
• Pin uses a neat trick*:
  1. Start from the "old" 5-byte branch
  2. Overwrite its first 2 bytes with a 2-byte branch-to-self
  3. Write the last n−2 bytes of the "new" branch
  4. Overwrite the branch-to-self with the first 2 bytes of the "new" branch
* Sundaresan et al. 2006
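The ordering above can be simulated on a plain byte buffer (a sketch only: a real implementation patches executable memory in place and relies on the 2-byte store being atomic, so any thread arriving mid-rewrite spins on the self-branch):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Rewrite a 5-byte branch at `code` to `newBranch` without ever
// exposing a half-written instruction.
void rewriteBranch(std::uint8_t* code, const std::uint8_t newBranch[5]) {
    const std::uint8_t selfBranch[2] = {0xEB, 0xFE};  // x86: jmp -2 (spin)
    std::memcpy(code, selfBranch, 2);         // 1) park concurrent threads
    std::memcpy(code + 2, newBranch + 2, 3);  // 2) write last n-2 bytes
    std::memcpy(code, newBranch, 2);          // 3) install first 2 bytes
}
```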
Challenge: Achieving Transparency
• Pretend as though the original program is executing

Original code:
  0x1000  call 0x4000
Translated code:
  0x7000  push 0x1006   ; push 0x1006 (the original return address) on the stack
  0x7006  jmp 0x8000    ; then jump to the translation of 0x4000
Code cache address mapping (SPC → TPC):
  0x1000 → 0x7000  ("caller")
  0x4000 → 0x8000  ("callee")
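The translation above can be expressed as a small sketch (the SPC→TPC pairs come from the slide; the struct and function names are invented). The key transparency point: the pushed return address is the original (source) PC, so the application sees its own addresses on the stack, while control actually flows to the callee's translation.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;

// Code cache address mapping from the slide: SPC -> TPC.
std::unordered_map<Addr, Addr> spcToTpc = {
    {0x1000, 0x7000},   // "caller"
    {0x4000, 0x8000},   // "callee"
};

struct TranslatedCall {
    Addr pushedReturnAddr;  // value pushed on the application stack (an SPC)
    Addr jumpTarget;        // code-cache address actually jumped to (a TPC)
};

TranslatedCall translateCall(Addr callSPC, Addr calleeSPC, int callLen) {
    return {callSPC + callLen,        // original return address, not a TPC
            spcToTpc.at(calleeSPC)};  // callee's translation
}
```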
Challenge: Self-Modifying Code
• The problem
  • The code cache must detect SMC and invalidate the corresponding cached traces
• Solutions
  • Many proposed … but without HW support, they are very expensive!
  • Changing page protection
  • Memory diff prior to execution
• On ARM, there is an explicit instruction for SMC!
Self-Modifying Code Handler • void main (int argc, char **argv) { • PIN_Init(argc, argv); • TRACE_AddInstrumentFunction(InsertSmcCheck,0); • PIN_StartProgram(); // Never returns • } • void InsertSmcCheck () { • . . . • memcpy(traceCopyAddr, traceAddr, traceSize); • TRACE_InsertCall(trace, IPOINT_BEFORE, (AFUNPTR)DoSmcCheck, • IARG_PTR, traceAddr, IARG_PTR, traceCopyAddr, • IARG_UINT32, traceSize, IARG_CONTEXT, IARG_END); • } • void DoSmcCheck (VOID* traceAddr, VOID *traceCopyAddr, • USIZE traceSize, CONTEXT* ctxP) { • if (memcmp(traceAddr, traceCopyAddr, traceSize) != 0) { • CODECACHE_InvalidateTrace((ADDRINT)traceAddr); • PIN_ExecuteAt(ctxP); • } • } (Written by Alex Skaletsky)
Challenge: Parallel Applications
[Diagram: threads T1 and T2 run in parallel in the code cache; the Pin tool (instrumentation code, call-back handlers, analysis code) and Pin's JIT compiler, dispatcher, syscall emulator, and signal emulator are serialized]
Challenge: Code Cache Consistency
• Cached code must be removed for a variety of reasons:
  • Dynamically unloaded code
  • Ephemeral/adaptive instrumentation
  • Self-modifying code
  • Bounded code caches
[Diagram: EXE → Transform → Profile → Code Cache → Execute]
Motivating a Bounded Code Cache
• The Perl Benchmark
[Chart not preserved]
Flushing the Code Cache
• Option 1: All threads have a private code cache (oops, doesn't scale)
• Option 2: Shared code cache across threads
  • If one thread flushes the code cache, other threads may resume in stale memory
Naïve Flush
• Wait for all threads to exit the code cache before flushing
• Could wait indefinitely!
[Timeline: each thread alternates between the VM and the code cache (CC1, CC2); during a flush, threads stall in the VM until every thread has left CC1, adding flush delay]
Generational Flush
• Allow threads to continue to make progress in a separate area of the code cache
• Requires a high water mark
[Timeline: on a flush, each thread moves from CC1 to CC2 (the new generation) on its own schedule, with no global stall]
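A minimal sketch of the generational idea (invented names and bookkeeping): a flush opens a new generation for freshly compiled code; threads still executing in the old generation drain out on their own, and the old generation is reclaimed only once empty. The high water mark is what guarantees there is room left to start the new generation.

```cpp
#include <cassert>
#include <vector>

struct CodeCache {
    int currentGen = 0;
    std::vector<int> threadsInGen = {0};  // threads executing in each generation

    void enter(int gen) { threadsInGen[gen]++; }   // thread enters cached code
    void leave(int gen) { threadsInGen[gen]--; }   // thread returns to the VM

    int flush() {                 // open a new generation; old one drains
        currentGen++;
        threadsInGen.push_back(0);
        return currentGen;
    }

    bool canReclaim(int gen) {    // old and empty -> safe to free its memory
        return gen < currentGen && threadsInGen[gen] == 0;
    }
};
```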
Eviction Granularities
• Entire Cache
• One Cache Block
• One Trace
• Address Range

Build-Your-Own Cache Replacement

void main(int argc, char **argv) {
    PIN_Init(argc, argv);
    CODECACHE_CacheIsFull(FlushOnFull);
    PIN_StartProgram();  // Never returns
}

void FlushOnFull() {
    CODECACHE_FlushCache();
    cout << "SWOOSH!" << endl;
}

% pin -cache_size 40960 -t flusher -- /bin/ls
SWOOSH!
SWOOSH!
Memory Scalability of the Code Cache
• Ensuring scalability also requires carefully configuring the code stored in the cache
• Trace Lengths
  • The first basic block is non-speculative; the others are speculative
  • Longer traces = fewer entries in the lookup table, but more unexecuted code
  • Shorter traces = two off-trace paths at the ends of basic blocks with conditional branches = more exit stub code
Sources of Overhead
• Internal
  • Compiling code & exit stubs (region detection, region formation, code generation)
  • Managing code (eviction, linking)
  • Managing directories and performing lookups
  • Maintaining consistency (SMC, DLLs)
• External
  • User-inserted instrumentation
"Normal Pin" Execution Flow
• Instrumentation is interleaved with the application over time
[Diagram: Uninstrumented Application vs. "Pinned" Application, separating Pin overhead from instrumentation overhead]
"SuperPin" Execution Flow
• SuperPin creates instrumented slices
[Diagram: Uninstrumented Application vs. SuperPinned Application running instrumented slices in parallel]
Issues and Design Decisions
• Creating slices
  • How/when to start a slice
  • How/when to end a slice
• System calls
• Merging results
Execution Timeline
[Diagram: the original application runs on CPU1, forking instrumented slices S1–S6 onto CPUs 2–4; each slice records its results plus an end signature (sigr2–sigr6) and sleeps; the original application detects each signature, lets the slice resume, and continues until exit]
Performance – icount1
% pin -t icount1 -- <benchmark>
[Performance chart not preserved]
What Did We Learn Today?
• Building Process VMs is only half the battle
  • Robustness, correctness, and performance are paramount
• Lots of "tricks" are in play
  • Code caches, trace selection, etc.
  • Knowing about these tricks is beneficial
• Lots of research opportunities
• Understanding the inner workings often helps you write better tools
Want More Info?
• Read the seminal Dynamo paper
• See the more recent papers by the Pin, DynamoRIO, and Valgrind teams
• Relevant conferences: VEE, CGO, ASPLOS, PLDI, PACT

Course Outline
• Day 1 – What is Process Virtualization?
• Day 2 – Building Process Virtualization Systems
• Day 3 – Using Process Virtualization Systems
• Day 4 – Symbiotic Optimization