Learn about the performance optimizations in Dyninst, a complex instrumentation tool that relocates code, improves register usage, and optimizes code generation.
Performance Optimizations in Dyninst
Andrew Bernat, Matthew Legendre
Instrumentation is Complicated
• User perspective:
  • “Insert some new code here, here, and here.”
• Dyninst’s perspective:
  • Relocation: move code to make space for instrumentation
  • Infrastructure: save/restore machine state
  • Instrumentation: generate user-provided code
Sources of Overhead
• Relocation
  • Extra jumps
  • Unnecessary emulation
  • Traps
• Infrastructure
  • Extra register saves
  • Tramp guards
• Instrumentation
  • Inefficient register usage
  • Poor code generation
• Optimizations
  • Inlining instrumentation
  • Compiler optimizations of generated code
  • 665% -> 32%
History
• Enable fast (and frequent) insertion and removal of code
  • “Linked list” model
  • Insert/remove by patching branches
• Model has evolved over time
  • Long-lived instrumentation (particularly with static rewriter)
  • Focus on speed of execution instead of speed of insertion
Outlined Instrumentation
[Figure: three regions labeled Original Code, Relocated Code, and Instrumentation/Infrastructure. Relocated blocks and relocated functions are linked by branches to basetramps and minitramps.]
Outlined System
• Fast insertion and removal
  • Simple to update
  • Original serves as a “handle”
• Reduced code relocation
  • Block or instruction
• Hard to optimize
  • New code can be inserted without warning
• Poor code locality
Partial Inlining
[Figure: same layout as the outlined diagram, with labels Original Code, Relocated Code, Instrumentation, Relocated Block, Relocated Function, Basetramp, Minitramp, and Branch.]
Full Inlining
[Figure: two regions labeled Original Code and Relocated Code & Instrumentation. Relocated blocks and relocated functions are reached by branches, one of which is marked “?”.]
Branch Reduction
• Inlining removed three levels of branching
  • Function to block to basetramp to minitramp
• One level is left
  • Function original to relocated copy
• Can we remove this branch as well?
  • Identify and rewrite calls to relocated functions
  • Regenerate whenever target is moved
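A minimal sketch of "identify and rewrite calls to relocated functions", assuming call sites and relocations are kept in plain dictionaries. The function name and data layout are hypothetical and are not Dyninst's API.

```python
def retarget_calls(call_targets, relocated):
    """Point each call at the relocated copy of its callee, if one exists.

    call_targets: {call-site address: callee address}  (hypothetical layout)
    relocated:    {original function address: relocated address}
    """
    return {site: relocated.get(callee, callee)
            for site, callee in call_targets.items()}


calls = {0x1000: 0x2000, 0x1010: 0x3000}
moved = {0x2000: 0x9000}
print(retarget_calls(calls, moved))
```

Because the mapping changes whenever a target is moved again, a rewrite like this would have to be rerun after each relocation, which matches the "regenerate whenever target is moved" bullet.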
Optimizing BaseTramps and MiniTramps
• DyninstAPI contains a built-in compiler
  • Converts ASTs to machine code
  • Used for BaseTramps and MiniTramps
  • Designed to be cross-platform (x86, x86_64, ppc32, ppc64, IA-64, Sparc)
• Build new optimizations into compiler
  • Some optimizations from classic compilers
  • Some optimizations are instrumentation specific
Optimizing Code Generation
Register Saves
    pusha
    pushf
Stack frame (Setup)
    push %ebp
    mov %esp,%ebp
    sub $128,%esp
Trampoline Guard (Check)
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
Instrumentation
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 0x805a494,%ebx
    mov 4(%ebp),%eax
    add %eax,%ebx
    mov %ebx,0x805a494
Trampoline Guard (Restore)
    mov 0x805a490,%eax
    mov $0x1,(%eax)
Stack frame (Clean)
done:
    leave
Register Restores
    popf
    popa
Issues flagged on the slide:
• Saving too many registers
• Stack frame unnecessary
• Tramp guards unnecessary
• Extraneous register usage
• “Virtual” registers unnecessary
• Inefficient instrumentation
• Recalculating old value
Register Saves
• Calculate live registers at inst point
• Calculate registers used by instrumentation
• Save intersection
• Use more efficient flag saves
Before:
    pusha
    pushf
After:
    push %eax
    lahf
    push %eax
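The "save intersection" rule can be sketched as a set operation. This is an illustration only: the function name and the register sets below are made up, and how Dyninst computes liveness is not shown here.

```python
def registers_to_save(live_at_point, used_by_instrumentation):
    """Registers that are both live at the instrumentation point and
    written by the instrumentation; only these need a save/restore."""
    return sorted(set(live_at_point) & set(used_by_instrumentation))


live = {"eax", "ecx", "esi"}   # hypothetical liveness result
used = {"eax", "ebx"}          # hypothetical registers the snippet touches
print(registers_to_save(live, used))  # ['eax']
```

A register that is live but untouched by the instrumentation keeps its value anyway, and a register that is touched but dead holds nothing worth preserving, so neither needs a save.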
Virtual Registers
• “Virtual Registers” were stack slots on x86
  • Load from virtual register to eax
  • Operate on eax
  • Store from eax to virtual register
• Now use real register allocation algorithm, with spilling
Before:
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 4(%ebp),%eax
After:
    mov $1,%eax
AST to Machine Code Compilation
• Each AST node is converted to an instruction
• Not optimal on CISC systems
• Recognize sequences of ASTs, emit optimized code
AST: 0x805a494 = 0x805a494 + 1
One instruction per AST node:
    mov $1,%eax
    mov 0x805a494,%ebx
    add %eax,%ebx
    mov $0x805a494,%ecx
    mov %ebx,(%ecx)
Recognized sequence:
    incl 0x805a494
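A small sketch of recognizing an AST shape and emitting a single instruction for it. The node classes and both emit functions are hypothetical stand-ins, not Dyninst's AST types, and the naive generator handles only the one shape used on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Load:
    addr: int


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Store:
    addr: int
    value: object


def emit_naive(node):
    """One instruction per AST node, for Store(addr, Add(Load, Const)) only."""
    add = node.value
    return [
        f"mov ${add.right.value},%eax",
        f"mov {add.left.addr:#x},%ebx",
        "add %eax,%ebx",
        f"mov ${node.addr:#x},%ecx",
        "mov %ebx,(%ecx)",
    ]


def emit_optimized(node):
    """Emit incl for 'mem = mem + 1'; otherwise fall back to naive code."""
    if (isinstance(node, Store) and isinstance(node.value, Add)
            and node.value.left == Load(node.addr)
            and node.value.right == Const(1)):
        return [f"incl {node.addr:#x}"]
    return emit_naive(node)


counter = Store(0x805a494, Add(Load(0x805a494), Const(1)))
print(emit_naive(counter))
print(emit_optimized(counter))
```

The pattern check runs before the generic path, so any AST that does not match exactly (a different constant, a different source address) still gets the node-by-node code.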
Optional Infrastructure
• Some tramp infrastructure is not always required, e.g.,
  • Stack frame only needed for register spilling
  • Tramp guard only needed for function calls
• Save only necessary infrastructure
Tramp Guard
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
Stack Frame
    push %ebp
    mov %esp,%ebp
    sub $0x32,%esp
    ...
FP Saves
    mov %esp,%eax
    sub $512,%esp
    and 0xfffffff0,%esp
    fxsave (%esp)
    push %eax
Stack Shift
    lea 0x128(%rsp),%rsp
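The two rules stated on this slide can be written as a small decision function. This is a sketch under the assumption that the code generator already knows whether the snippet spills registers or makes a call. The function and its parameters are hypothetical, and the conditions for FP saves and the stack shift are not given on the slide, so they are left out.

```python
def required_infrastructure(spills_registers, makes_function_call):
    """Return only the tramp infrastructure the snippet actually needs."""
    needed = []
    if spills_registers:
        needed.append("stack frame")   # frame only holds spilled registers
    if makes_function_call:
        needed.append("tramp guard")   # guard only matters when calling out
    return needed


# A counter increment neither spills nor calls, so nothing extra is emitted.
print(required_infrastructure(spills_registers=False, makes_function_call=False))
```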
Fixed Point Code Generation
• Optimizations may be interlinked, e.g.,
  • Removing code may leave registers unused
  • Removing unused registers eliminates saves
  • Eliminating saves removes stack access
  • Removing stack accesses may eliminate stack shift
• Typical code generation requires 2 passes
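The chain of dependent optimizations can be sketched as passes that are rerun until nothing changes. The state dictionary, the three passes, and the driver below are hypothetical simplifications of the bullets above, not Dyninst's code generator.

```python
def drop_unused_saves(s):
    # Registers the snippet no longer uses do not need to be saved.
    return {**s, "saved": s["saved"] & s["used"]}


def drop_stack_access(s):
    # With no saved registers, nothing touches the stack.
    return {**s, "stack_access": s["stack_access"] and bool(s["saved"])}


def drop_stack_shift(s):
    # With no stack accesses, the stack shift is unnecessary.
    return {**s, "stack_shift": s["stack_shift"] and s["stack_access"]}


def run_to_fixed_point(state, passes, max_rounds=10):
    """Apply every pass repeatedly until a full round changes nothing."""
    rounds = 0
    changed = True
    while changed and rounds < max_rounds:
        changed = False
        for p in passes:
            new_state = p(state)
            if new_state != state:
                state, changed = new_state, True
        rounds += 1
    return state, rounds


start = {"used": frozenset(), "saved": frozenset({"eax"}),
         "stack_access": True, "stack_shift": True}
final, rounds = run_to_fixed_point(
    start, [drop_stack_shift, drop_stack_access, drop_unused_saves])
print(final, rounds)
```

The passes are listed in a deliberately unfavorable order here, so each round unlocks only one further simplification. With a better ordering the same result is reached in fewer rounds.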
Optimizing Code Generation
Register Saves
    pusha
    pushf
Stack frame (Setup)
    push %ebp
    mov %esp,%ebp
    sub $128,%esp
Trampoline Guard (Check)
    mov 0x805a490,%eax
    mov (%eax),%ecx
    test %ecx,%ecx
    je done
    mov $0x0,(%ecx)
Instrumentation
    mov $1,%eax
    mov %eax,4(%ebp)
    mov 0x805a494,%ebx
    mov 4(%ebp),%eax
    add %eax,%ebx
    mov %ebx,0x805a494
Trampoline Guard (Restore)
    mov 0x805a490,%eax
    mov $0x1,(%eax)
Stack frame (Clean)
done:
    leave
Register Restores
    popf
    popa
Instrumentation after optimization:
    incl 0x805a494
Results
• Basic block instrumentation on ‘go’ from SPEC2000
[Charts: instrumented run time (base: 12.25s); instrumentation time]
Conclusions
• Optimizations in DyninstAPI instrumentation
  • Inline instrumentation levels
  • Generate more efficient code
• Significant performance gains
  • Instrumentation code runs faster
  • More time spent generating instrumentation
Questions?