Efficient, Transparent, and Comprehensive Managed Program Execution. Derek Bruening. Determina Corporation. Typical Modern Application: IIS. Design Goals. Efficient Near-native performance Transparent Match native behavior Comprehensive Control every instruction, in any application
Efficient, Transparent, and Comprehensive Managed Program Execution • Derek Bruening • Determina Corporation
Design Goals • Efficient • Near-native performance • Transparent • Match native behavior • Comprehensive • Control every instruction, in any application • Customizable
Managed Program Execution Engine • First software system that can manipulate, at runtime, every instruction an arbitrary application executes, with: • Minimal performance penalty • Full transparency • Exports interface for building custom tools • No modifications to the hardware, operating system, or application
Challenges of Real-World Apps • Multiple threads • Cache management • Application introspection • Inter-process manipulation: hooks • Transparency corner cases are the norm • Scalability • Must adapt to varying code sizes, thread counts, etc.
Outline • Efficient • Transparent • Comprehensive • Customizable
Outline • Efficient • Software code cache • Traces • Base performance • Transparent • Comprehensive • Customizable
Basic Interpreter START fetch decode execute Slowdown: ~300x
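The fetch, decode, execute cycle on this slide can be sketched as a small C loop. This is a minimal illustration only: the toy instruction set (TOY_LOADI, TOY_ADD, TOY_JNZ, TOY_HALT) and the ToyInstr layout are hypothetical and are not IA-32 or part of the engine. The point is that every executed instruction pays for a fetch and a decode, which is where the large slowdown comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA, used only to illustrate the interpreter loop. */
enum { TOY_HALT, TOY_LOADI, TOY_ADD, TOY_JNZ };

typedef struct {
    uint8_t op;   /* one of the TOY_* opcodes */
    uint8_t a;    /* destination / tested register */
    uint8_t b;    /* source register (TOY_ADD only) */
    int8_t imm;   /* immediate (TOY_LOADI) or relative jump (TOY_JNZ) */
} ToyInstr;

/* Interprets code until TOY_HALT; returns how many instructions ran. */
static size_t interpret(const ToyInstr *code, int32_t regs[4]) {
    size_t pc = 0, executed = 0;
    for (;;) {
        ToyInstr in = code[pc];                 /* fetch */
        executed++;
        switch (in.op) {                        /* decode */
        case TOY_LOADI:                         /* execute */
            regs[in.a] = in.imm;
            pc++;
            break;
        case TOY_ADD:
            regs[in.a] += regs[in.b];
            pc++;
            break;
        case TOY_JNZ:
            pc = (regs[in.a] != 0) ? (size_t)((long)pc + in.imm) : pc + 1;
            break;
        case TOY_HALT:
            return executed;
        default:
            assert(!"unknown opcode");
            return executed;
        }
    }
}
```

A loop body that runs three times is fetched and decoded three times here, which is the cost the later slides remove by caching.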
Interpreter + Basic Block Cache • Non-control-flow instructions executed from software code cache • Slowdown: 300x → 25x • Shade [Cmelik 1994] [Figure labels: START, basic block builder, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions)]
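A rough sketch of the dispatch step, assuming a direct-mapped cache keyed by application address. The names (CachedBlock, block_cache, build_block, dispatch) and the table size are hypothetical, and the builder is a stub: a real one would copy the application's instructions up to the next control transfer into the cache.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t app_addr;   /* hypothetical stand-in for an application address */

typedef struct {
    app_addr tag;     /* application address the block was built from */
    int valid;
    unsigned runs;    /* times the cached copy was entered */
} CachedBlock;

#define CACHE_SLOTS 64
static CachedBlock block_cache[CACHE_SLOTS];
static unsigned blocks_built;

/* Stub for the basic block builder (the expensive step on a miss). */
static void build_block(CachedBlock *slot, app_addr tag) {
    slot->tag = tag;
    slot->valid = 1;
    slot->runs = 0;
    blocks_built++;
}

/* Dispatch: reuse the cached block for tag if present, else build it. */
static CachedBlock *dispatch(app_addr tag) {
    assert(tag != 0);
    CachedBlock *slot = &block_cache[(tag >> 2) % CACHE_SLOTS];
    if (!slot->valid || slot->tag != tag)
        build_block(slot, tag);   /* miss (or conflicting tag): rebuild */
    slot->runs++;
    return slot;
}
```

Calling dispatch twice with the same address builds the block once and reuses it the second time. Every block exit still returns to dispatch in this scheme.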
Linking Direct Branches • Direct branch to existing block can bypass dispatch • Slowdown: 300x → 25x → 3x • Shade [Cmelik 1994] [Figure labels: START, basic block builder, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions)]
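The effect of linking can be sketched by counting how often control returns to dispatch. This is a simplified model, not the engine's code: LinkedBlock, run_blocks, and the two-block loop are hypothetical, and a real system patches the branch instruction in the cache rather than storing a pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct LinkedBlock LinkedBlock;
struct LinkedBlock {
    int target_id;       /* block that this block's direct branch targets */
    LinkedBlock *link;   /* non-NULL once the exit has been linked */
};

#define NUM_BLOCKS 2
static LinkedBlock blocks[NUM_BLOCKS];
static unsigned dispatch_entries;   /* each entry implies a context switch */

static LinkedBlock *dispatch(int id) {
    assert(id >= 0 && id < NUM_BLOCKS);
    dispatch_entries++;
    return &blocks[id];
}

/* Two blocks that branch to each other: 0 -> 1 -> 0 -> ... */
static void reset_two_block_loop(void) {
    blocks[0] = (LinkedBlock){ .target_id = 1, .link = NULL };
    blocks[1] = (LinkedBlock){ .target_id = 0, .link = NULL };
    dispatch_entries = 0;
}

/* Executes count blocks starting at start, optionally linking exits. */
static void run_blocks(int start, int count, int enable_linking) {
    LinkedBlock *cur = dispatch(start);
    for (int i = 1; i < count; i++) {
        if (cur->link != NULL) {        /* linked: jump block to block */
            cur = cur->link;
            continue;
        }
        LinkedBlock *next = dispatch(cur->target_id);
        if (enable_linking)
            cur->link = next;           /* later exits bypass dispatch */
        cur = next;
    }
}
```

With linking disabled, running ten blocks enters dispatch ten times. With linking enabled, it enters dispatch three times: once at start and once per exit until both exits are linked.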
Linking Indirect Branches • Application address mapped to code cache • Slowdown: 300x → 25x → 3x → 1.2x • Dynamo [Bala 2000] [Figure labels: START, basic block builder, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions, indirect branch lookup)]
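An indirect branch target is only known at run time, so the application address has to be mapped to a code cache address on each execution. One common way to do that is a hash table. The sketch below assumes open addressing with linear probing; the table size, hash function, and names (IblEntry, ibl_insert, ibl_lookup) are hypothetical choices, not the engine's actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IBL_SIZE 256   /* power of two so the hash can be masked */

typedef struct {
    uintptr_t app_pc;       /* application address; 0 marks an empty slot */
    const void *cache_pc;   /* where the translated code lives in the cache */
} IblEntry;

static IblEntry ibl_table[IBL_SIZE];

static size_t ibl_hash(uintptr_t app_pc) {
    return (size_t)((app_pc ^ (app_pc >> 8)) & (IBL_SIZE - 1));
}

/* Records app_pc -> cache_pc; returns 0 only if the table is full. */
static int ibl_insert(uintptr_t app_pc, const void *cache_pc) {
    assert(app_pc != 0);
    size_t i = ibl_hash(app_pc);
    for (size_t probes = 0; probes < IBL_SIZE; probes++) {
        if (ibl_table[i].app_pc == 0 || ibl_table[i].app_pc == app_pc) {
            ibl_table[i].app_pc = app_pc;
            ibl_table[i].cache_pc = cache_pc;
            return 1;
        }
        i = (i + 1) & (IBL_SIZE - 1);
    }
    return 0;
}

/* Returns the cache address for app_pc, or NULL on a miss. */
static const void *ibl_lookup(uintptr_t app_pc) {
    size_t i = ibl_hash(app_pc);
    for (size_t probes = 0; probes < IBL_SIZE; probes++) {
        if (ibl_table[i].app_pc == app_pc)
            return ibl_table[i].cache_pc;
        if (ibl_table[i].app_pc == 0)
            return NULL;               /* empty slot ends the probe chain */
        i = (i + 1) & (IBL_SIZE - 1);
    }
    return NULL;
}
```

On a miss, control would fall back to dispatch, which builds the target block and inserts it so that the next lookup hits.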
Picking Traces • Indirect branch stays on trace? • Slowdown: 300x → 25x → 3x → 1.2x → <1.1x • Dynamo [Bala 2000] [Figure labels: START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions, indirect branch lookup), TRACE CACHE (non-control-flow instructions, indirect branch stays on trace?)]
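The slide does not say how the trace selector decides where a trace starts. A common scheme, assumed here purely for illustration, keeps an execution counter on each candidate trace head and starts building a trace once the counter reaches a threshold. The names (TraceHead, note_execution) and the threshold value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define HOT_THRESHOLD 50   /* assumed value; a real system would tune this */

typedef struct {
    unsigned count;   /* executions seen while still a plain basic block */
    int has_trace;    /* set once a trace has been started from this head */
} TraceHead;

/* Called each time a trace head executes.
 * Returns 1 exactly once: when the head becomes hot and a trace
 * should be built starting from it. */
static int note_execution(TraceHead *head) {
    assert(head != NULL);
    if (head->has_trace)
        return 0;
    if (++head->count < HOT_THRESHOLD)
        return 0;
    head->has_trace = 1;
    return 1;
}
```

After the trace is built, the blocks on the hot path sit contiguously in the trace cache, and an inlined check on each indirect branch tests whether execution stays on the trace.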
Outline • Efficient • Transparent • Rules of transparency • Cache consistency • Comprehensive • Customizable
Transparency • Do not want to interfere with the semantics of the program • Dangerous to make any assumptions about: • Register usage • Calling conventions • Stack layout • Memory/heap usage • I/O and other system call use
Painful, But Necessary • Difficult and costly to handle corner cases • Many applications will not notice… • …but some will! • Non-exceptional exceptions: Adobe Photoshop • Stack convention violations: Microsoft Office • Self-modifying code: Adobe Premiere
Rule 1: Avoid resource conflicts [Figure panels: Windows, Linux]
Rule 2: If it’s not broken, don’t change it • Threads • Executable on disk • Application data • Including the stack!
Example Transparency Violation [Figure: error markers across three application groups: SPEC CPU2000, Server, Desktop]
Rule 3: If you change it, emulate original behavior’s visible effects • Application addresses • Address space • Error transparency • Code cache consistency
Detecting Code Changes • Memory unmap • Example: shared library being unloaded • Detect by monitoring system calls (munmap, NtUnmapViewOfSection) • Memory modification • Dynamically modified code • IA-32 keeps icache consistent in hardware...
Detecting Code Changes • Solution: • Page protection when rarely written • Instrumentation when frequently written or when writer and target on same page
Outline • Efficient • Transparent • Comprehensive • Kernel-mediated control transfers • Customizable
Kernel-Mediated Control Transfers [Figure: timeline across user mode and kernel mode. Labels: message pending, save user context; message handler; no message pending, restore context; time. Callout: majority of executed code in a typical Windows application]
Challenges • Interception • Set up own handler in place of original • Continuation • May never return to interrupted state • Self-interruption • Kernel emulation
Intercept and Re-direct Messages • Mojo [Chen 2000] [Figure: timeline across user mode and kernel mode. Labels: message pending, save user context; intercept; message handler; no message pending, restore context; time]
Kernel Emulation • Exception and signal handlers are passed machine context of the faulting instruction • For transparency, that context must be translated from the code cache to the original code location [Figure labels, each shown twice: user context, faulting instr.]
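Translating the faulting program counter can be sketched as a lookup in a per-fragment table that records, for each instruction emitted into the cache, the application address it came from. The table layout and names (PcMapEntry, translate_to_app_pc) are hypothetical. A full translation would also have to restore registers and other state, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t cache_pc;   /* start of an instruction in the code cache */
    uintptr_t app_pc;     /* application instruction it was copied from */
} PcMapEntry;

/* map must be sorted by cache_pc. Returns the application address for the
 * cache instruction containing fault_pc, or 0 if fault_pc lies before the
 * first entry. */
static uintptr_t translate_to_app_pc(const PcMapEntry *map, size_t n,
                                     uintptr_t fault_pc) {
    size_t lo = 0, hi = n;
    assert(map != NULL || n == 0);
    while (lo < hi) {                    /* find the last entry <= fault_pc */
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].cache_pc <= fault_pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? 0 : map[lo - 1].app_pc;
}
```

The handler then sees the application address in its context argument instead of a code cache address it has never heard of.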
Outline • Efficient • Transparent • Comprehensive • Customizable • Client Hooks • API • Examples
Clients • The engine exports an API for building a client • System details abstracted away: client focuses on manipulating the code stream
Client Hooks [Figure: the execution diagram (START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE, TRACE CACHE, indirect branch lookup, indirect branch stays on trace?) with three points marked "client"]
Client Hooks: Code Stream • Application code stream • Basic block creation • Trace creation • Client has opportunity to inspect and potentially modify every single application instruction, immediately before it executes
Client Hooks: Bookkeeping • Initialization and Exit • Entire process • Each thread • Basic block and trace deletion during cache management
Client API • Code manipulation • IR • Saving eflags, spilling registers • Processor feature identification • Transparency support • Separate I/O and memory allocation • Thread support • Thread-local memory, simple mutexes
Instruction Representation • Costly to decode and encode IA-32 • Variable length • Specialized instruction templates • Complex decoding/encoding heuristics • Often only interested in high-level information for subset of instructions • Many instructions copied to cache unmodified • Solution: adaptive level of detail
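Adaptive level of detail can be sketched as an instruction record that starts as raw bytes and is decoded further only when a caller asks for more. The three levels, the toy encoding (byte 0 is the opcode, remaining bytes are operands), and all names below are hypothetical simplifications, not the engine's actual representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DETAIL_RAW, DETAIL_OPCODE, DETAIL_OPERANDS } DetailLevel;

typedef struct {
    const uint8_t *raw;    /* undecoded bytes; always available */
    size_t length;
    DetailLevel level;     /* how much has been decoded so far */
    int opcode;
    uint8_t operands[4];
    size_t num_operands;
} LazyInstr;

static unsigned operand_decodes;   /* counts the expensive decoding step */

static LazyInstr lazy_instr_from_bytes(const uint8_t *raw, size_t length) {
    assert(raw != NULL && length >= 1);
    LazyInstr in = { raw, length, DETAIL_RAW, -1, {0}, 0 };
    return in;
}

/* Cheap step: learn the opcode only. */
static int lazy_opcode(LazyInstr *in) {
    if (in->level < DETAIL_OPCODE) {
        in->opcode = in->raw[0];
        in->level = DETAIL_OPCODE;
    }
    return in->opcode;
}

/* Expensive step: decode operands, at most once per instruction. */
static size_t lazy_num_operands(LazyInstr *in) {
    if (in->level < DETAIL_OPERANDS) {
        lazy_opcode(in);
        in->num_operands = (in->length - 1 > 4) ? 4 : in->length - 1;
        memcpy(in->operands, in->raw + 1, in->num_operands);
        in->level = DETAIL_OPERANDS;
        operand_decodes++;
    }
    return in->num_operands;
}
```

An instruction that is only copied into the cache never leaves DETAIL_RAW, so its bytes can be emitted with a plain copy and no encoding work.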
API Highlights • Clean calls • Branch instrumentation • Adaptive code transformation • Custom traces • Custom exit stubs and prefixes • Standalone library support
Adaptive Code Transformation • Re-decode fragment in cache • Replace fragment in cache • Even while executing inside of it • Works by creating a new fragment and shifting all incoming links to it
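Shifting incoming links can be sketched with fragments that remember which exit slots point at them. The structure and names (Fragment, link_exit, replace_fragment) and the fixed-size incoming list are hypothetical. The sketch also assumes each exit slot is being linked for the first time, so it does not unlink a slot from a previous target.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INCOMING 8

typedef struct Fragment Fragment;
struct Fragment {
    int version;
    /* Addresses of exit slots (in other fragments) that target this one. */
    Fragment **incoming[MAX_INCOMING];
    size_t num_incoming;
};

/* Points exit_slot at target and records the slot as an incoming link. */
static void link_exit(Fragment **exit_slot, Fragment *target) {
    assert(target->num_incoming < MAX_INCOMING);
    *exit_slot = target;
    target->incoming[target->num_incoming++] = exit_slot;
}

/* Redirects every link that entered old_frag so it enters new_frag. */
static void replace_fragment(Fragment *old_frag, Fragment *new_frag) {
    for (size_t i = 0; i < old_frag->num_incoming; i++)
        link_exit(old_frag->incoming[i], new_frag);
    old_frag->num_incoming = 0;   /* nothing enters the old copy any more */
}
```

Because only the entry links move, a thread that is already inside the old fragment can finish running it, and its next entry lands in the new fragment. That is one plausible reading of how replacement works "even while executing inside of it".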
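Example Client
/* OF_slot and counter are client globals whose declarations are not shown on the slide. */
EXPORT void dynamorio_basic_block(void *cxt, app_pc tag, InstrList *bb) {
    Instr *instr;
    /* Walk every instruction of the basic block being built. */
    for (instr = instrlist_first(bb); instr != NULL; instr = instr_get_next(instr)) {
        if (instr_is_syscall(instr)) {
            /* inc modifies the arithmetic flags, so save and restore them around it. */
            dr_save_arith_flags(cxt, bb, instr, &OF_slot);
            /* Count system calls: increment counter just before each one. */
            instrlist_preinsert(bb, instr,
                INSTR_CREATE_inc(cxt, OPND_CREATE_MEM32(REG_NULL, &counter)));
            dr_restore_arith_flags(cxt, bb, instr, &OF_slot);
        }
    }
}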
Dynamic Optimization Examples • Adaptive • Tune for current behavior, not single profile run • Microarchitecture-specific • Specialize to underlying processor • Inter-module • All code is available • Traditional static optimizations • Vendor may not have applied all optimizations
Pentium 4?
static bool enable; /* declaration not shown on the slide */
static bool replace_inc_with_add(void *drcontext, Instr *instr, InstrList *trace);

EXPORT void dynamorio_init() {
    enable = (proc_get_family() == FAMILY_PENTIUM_IV);
}

EXPORT void dynamorio_trace(void *drcontext, app_pc tag, InstrList *trace) {
    Instr *instr, *next_instr;
    int opcode;
    if (!enable)
        return;
    /* Look for inc / dec */
    for (instr = instrlist_first_expanded(trace); instr != NULL; instr = next_instr) {
        /* Fetch the successor first: instr may be replaced and destroyed below. */
        next_instr = instr_get_next_expanded(instr);
        opcode = instr_get_opcode(instr);
        if (opcode == OP_inc || opcode == OP_dec)
            replace_inc_with_add(drcontext, instr, trace);
    }
}

static bool replace_inc_with_add(void *drcontext, Instr *instr, InstrList *trace) {
    Instr *in;
    uint eflags;
    int opcode = instr_get_opcode(instr);
    bool ok_to_replace = false;
    /* Ensure eflags change ok: add/sub write CF but inc/dec do not, so the
     * swap is safe only if CF is overwritten before anything reads it. */
    for (in = instr; in != NULL; in = instr_get_next_expanded(in)) {
        eflags = instr_get_arith_flags(in);
        if ((eflags & EFLAGS_READ_CF) != 0)
            return false;
        if ((eflags & EFLAGS_WRITE_CF) != 0) {
            ok_to_replace = true;
            break;
        }
        if (instr_is_exit_cti(in))
            return false; /* CF might be read after the trace exits */
    }
    if (!ok_to_replace)
        return false;
    /* Replace with add / sub */
    if (opcode == OP_inc)
        in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1));
    else
        in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1));
    instr_set_prefixes(in, instr_get_prefixes(instr));
    instrlist_replace(trace, instr, in);
    instr_destroy(drcontext, instr);
    return true;
}
Summary • First software system that can manipulate code at runtime in a manner that is: • Efficient: minimal performance penalty • Transparent: unperturbed native behavior • Comprehensive: every instruction an arbitrary application executes • Customizable: can build runtime tools