“Dynamo: A Transparent Dynamic Optimization System”
V. Bala, E. Duesterwald, and S. Banerjia, PLDI 2000
Presented by: David Carrillo
CS 295: Run-time Techniques for Efficient and Reliable Program Execution. Guoqing Xu
Motivation • Recent developments decrease the effectiveness of static compiler techniques • Software is shipped as a collection of DLLs, not a single executable • Object-oriented languages use delayed binding • Dynamic code generation • System vendors must rely on software vendors to exploit powerful static compiler optimizations • Still, the trend is to offload complexity from hardware to the compiler • CISC vs. RISC • Superscalar vs. VLIW (EPIC)
Motivation (contd.) • Legacy binaries may not take advantage of modern compiler/chip innovations • Two opposing developments • Static compilers taking on an increased performance burden • Obstacles to static compiler analysis increasing • Choices • Build complex, specialized compilers • Analyze code dynamically and optimize it at run time (run-time optimization) ← Dynamo’s choice
Overview • Dynamo is a dynamic “native-to-native” optimization system • Input: an executing native instruction stream • No involvement of the software vendor • Transparent operation • The input instruction stream may be • A statically created binary • A dynamically generated binary (e.g., from a JIT or emulator) • The present work focuses on the static binary case
Overview (contd.) • Dynamo works like a Virtual Machine • Can observe execution behavior • Tries to identify hot instruction sequences (traces) • After identification of hot traces • Generates optimized fragments for those traces • Stored into fragment cache (a software code cache) • Links successive traces (trace tree). • Subsequent occurrences of a hot trace • Stops interpreting • Executes corresponding fragment from “fragment cache” • Upon end of fragment, Dynamo resumes interpreting code
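As a hedged sketch of this dispatch structure (all names here are illustrative stand-ins, not Dynamo’s actual internals), the core loop alternates between interpretation and native fragment execution:

```c
#include <stdint.h>

typedef uint64_t pc_t;                        /* native program counter   */
typedef pc_t (*fragment_fn)(void);            /* optimized trace; returns
                                                 the PC it exits to       */

extern fragment_fn cache_lookup(pc_t pc);     /* NULL if trace not cached */
extern pc_t        interpret_block(pc_t pc);  /* interpret + profile      */

/* Core control loop: stay in the fragment cache while traces hit,
   drop back to interpretation on a miss. Program exit is assumed to
   happen inside interpret_block (omitted here). */
void dynamo_loop(pc_t pc) {
    for (;;) {
        fragment_fn frag = cache_lookup(pc);
        if (frag)
            pc = frag();                 /* run the optimized fragment */
        else
            pc = interpret_block(pc);    /* slow path: interpret       */
    }
}
```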
Overview (contd.) • Associated overheads • Interpreting code is very slow • Identifying hot traces and tracking them • Generating optimized fragments • Advantages of the approach • The application’s working set gradually materializes into the cache • Code is optimized using run-time information (expected to be much faster than the original) • Beneficial only if • Performance gains from fragment reuse offset the overheads • Assumes that the application’s major working set is small
Identification of Hot Traces • Uses the MRET (Most Recently Executed Tail) scheme • The interpreter associates counters with possible start-of-trace points (e.g., targets of backward taken branches) • Incremented whenever these points execute • Profiling is kept minimal • Assumes that the instructions following a hot start point are also hot • Estimates the end-of-trace point using simple conditions • When a count exceeds a threshold • Stops interpreting • Records the executed instruction sequence until an end-of-trace condition is met • This sequence is the hot trace (see the sketch below)
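A minimal sketch of the counter logic, assuming a hash-indexed counter table and a small hot threshold (the table size, hashing, and the value 50 are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define HOT_THRESHOLD 50      /* small threshold; value assumed here */
#define NBUCKETS      4096    /* table size is an assumption         */

static uint32_t counters[NBUCKETS];

/* Start-of-trace candidates: targets of backward taken branches
   (loop heads) — fragment-cache exits also qualify (not shown). */
static bool is_trace_head(uint64_t pc, uint64_t prev_pc) {
    return pc < prev_pc;
}

/* Called by the interpreter at every taken branch; returns true
   when interpretation should switch to trace recording. */
bool profile_and_check(uint64_t pc, uint64_t prev_pc) {
    if (!is_trace_head(pc, prev_pc))
        return false;
    uint32_t *c = &counters[(pc >> 2) % NBUCKETS];  /* cheap hash */
    return ++*c >= HOT_THRESHOLD;
}
```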
Trace Optimization and Fragment Generation • The selected hot trace is converted to an intermediate representation (IR) • The IR is close to the underlying machine instruction set • Optimizations are performed on the IR • Unconditional branches are removed • Other optimizations applied to branch instructions • For fragment generation • A trace may be split into multiple fragments due to branches • The register allocator tries to preserve the original mappings • Exit stubs are inserted for leaving the fragment (sketched below) • Fragment linking couples contiguous fragments (the largest single performance gain)
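A sketch of the exit-stub bookkeeping just described; the trace and stub structures are invented for illustration (real Dynamo emits PA-RISC machine code):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t pc;             /* original instruction address     */
    int      is_cond_branch; /* nonzero for conditional branches */
    uint64_t taken_target;   /* branch destination if taken      */
} Insn;

typedef struct {
    uint64_t exit_pc;        /* PC the interpreter resumes at    */
} ExitStub;

/* Lay out the linearized fragment's exit stubs: each branch whose
   taken side leaves the hot path gets a stub recording the off-trace
   PC. (Emitting the inverted branch itself is omitted; when the hot
   side is the taken side, the branch is inverted so it falls through.) */
size_t build_exit_stubs(const Insn *trace, size_t n, ExitStub *stubs) {
    size_t nstubs = 0;
    for (size_t i = 0; i < n; i++) {
        if (trace[i].is_cond_branch)
            stubs[nstubs++].exit_pc = trace[i].taken_target;
    }
    return nstubs;
}
```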
[Figure: speculative trace construction — the hot path A C D G I J E is recorded across a call and return; off-trace exits B*, F*, H* are stubs that send control back to the interpreter]
Effects of Fragment Construction • Hot path is linearized • Eliminates branches • Creates a superblock • Applies local optimization • Cold-path branches remain • Targets are stubs • Send control to the interpreter • Path includes call & return • Jumps, not branches • Interprocedural effects • Indirect branches • Speculate on the target • Fall back on a hash table of branch targets (see the sketch below)
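For the indirect-branch case in the last bullets, a hedged sketch of the speculate-then-hash-lookup dispatch (the table layout and names are assumptions):

```c
#include <stdint.h>

#define TBL 1024
extern void *target_table[TBL];  /* cached fragment entry points,
                                    indexed by a hash of the target PC;
                                    a real table would also verify the key */

/* Inline check emitted at an indirect branch inside a fragment:
   stay on trace if the run-time target matches the one seen when the
   trace was recorded; otherwise try the hash table; NULL means fall
   back to the interpreter. */
void *dispatch_indirect(uint64_t actual, uint64_t predicted,
                        void *on_trace_continuation) {
    if (actual == predicted)
        return on_trace_continuation;
    return target_table[actual % TBL];   /* may be NULL */
}
```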
Sources of Improvement Many small things combine to make Dynamo profitable • Linearization eliminates branches • Improves TLB & I-cache behavior • 2 passes of local optimization • Redundancy elimination, copy propagation, constant folding, simple strength reduction, loop unrolling, loop-invariant code motion, redundant load removal • Keep in mind that “local” includes interprocedural in Dynamo • Focuses on optimizations that are hard to do at compile time, at least without profile information; aims to complement rather than compete with the static compiler • Nevertheless, engineering detail makes a difference (a toy example follows)
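A toy before/after showing two of the listed passes on straight-line code (constant folding, then copy propagation); the function is invented purely for illustration:

```c
/* Before optimization (as it might appear on a linearized trace): */
int before(int a) {
    int x = 4;
    int y = x * 2;      /* constant folding:  y = 8          */
    int z = y;          /* copy propagation:  uses of z -> y */
    return z + a;
}

/* After both passes, the trace body reduces to: */
int after(int a) {
    return a + 8;
}
```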
Fragment Linking • What happens if another path becomes hot? (Say A B D G I J E) • Block B meets the “start of trace” condition (it is an exit from the existing fragment)
Fragment Linking (contd.) When B’s counter reaches the hot threshold, Dynamo: • Builds a fragment for the new trace • Links exit A→B to the new fragment • Links exit E→A back to the old fragment What if B* held a redundant op? • We have LIVE on entry to B • Can test LIVE sets for both the exit from A and the entry to B • May show the op is dead • Enables link-time redundancy elimination (sketched below)
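A minimal sketch of that link-time liveness test, assuming registers are tracked as a bitmask (the representation and names are my assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t RegSet;          /* bit i set => register r<i> live */

/* Once exit A->B is linked, liveness at A's exit is exactly B's
   LIVE-in set. A compensation op in the stub that writes dest_reg is
   dead if that register is not live on entry to B. */
bool op_dead_at_link(unsigned dest_reg, RegSet live_in_of_B) {
    return ((live_in_of_B >> dest_reg) & 1u) == 0;
}
```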
Fragment Cache Management • Cold entries need to be flushed out regularly • The replacement mechanism cannot impose much overhead • Schemes such as LRU may not be feasible • Dynamo’s approach (sketched below) • Looks for a sharp increase in the fragment creation rate • Indicates a significant change in the program’s working set • A complete fragment cache flush is triggered
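A hedged sketch of the flush heuristic; the window mechanism and spike ratio are invented, since the paper only says a sharp rise in the fragment creation rate triggers a full flush:

```c
#include <stddef.h>

#define SPIKE_RATIO 4   /* "sharp increase" factor (assumption) */

static size_t created_this_window, created_last_window = 1;

extern void flush_fragment_cache(void);   /* drops every fragment */

void on_fragment_created(void) { created_this_window++; }

/* Invoked periodically (e.g., every N fragment executions): compare
   creation counts across windows and flush everything on a spike. */
void on_window_end(void) {
    if (created_this_window > SPIKE_RATIO * created_last_window)
        flush_fragment_cache();           /* working set changed */
    created_last_window = created_this_window ? created_this_window : 1;
    created_this_window = 0;
}
```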
Signal handling in Dynamo • Signals/interrupts come in two types • Asynchronous • Synchronous • Asynchronous signals are queued on arrival and processed at the end of the current fragment’s execution (sketched below) • Synchronous signals arriving in optimized code can create problems • They cannot be postponed • They may assume an original program context that no longer exists due to optimization
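A sketch of the deferred-delivery scheme for asynchronous signals (the queue size and function names are assumptions):

```c
#include <signal.h>

#define QMAX 64
static volatile sig_atomic_t pending[QMAX];
static volatile sig_atomic_t nqueued;

extern void deliver_to_application(int signo);  /* runs the app's handler */

/* Real handler: record the signal and return immediately. */
static void enqueue_signal(int signo) {
    if (nqueued < QMAX)
        pending[nqueued++] = signo;
}

void install_deferred_handler(int signo) {
    struct sigaction sa = {0};
    sa.sa_handler = enqueue_signal;
    sigaction(signo, &sa, NULL);
}

/* Called by Dynamo at each fragment exit, where program state is
   consistent, to deliver anything that arrived mid-fragment. */
void drain_signal_queue(void) {
    while (nqueued > 0)
        deliver_to_application(pending[--nqueued]);
}
```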
Signal handling in Dynamo (contd.) • Conservative optimizations • Constant propagation, constant folding, copy propagation, etc. • Allow the precise signal context to be reconstructed • Aggressive optimizations • Dead code removal, code sinking, loop-invariant code motion, etc. • Cannot guarantee that synchronous signals are handled properly (contrast below)
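A toy contrast of the two classes, with invented code: a conservatively optimized sequence still lets the original values be reconstructed at a fault, while an aggressively optimized one does not.

```c
/* Conservative: copy propagation replaces the use of t with a, but
   keeps the assignment, so if the load of *p faults the handler can
   still reconstruct t's original value. */
int conservative(int *p, int a) {
    int t = a;      /* kept: context remains reconstructable */
    (void)t;
    return *p + a;  /* was: *p + t */
}

/* Aggressive: t was never used, so dead code removal deletes
   "t = a" — semantics are preserved, but a handler for a fault at
   the load of *p can no longer find t's value in the context. */
int aggressive(int *p, int a) {
    (void)a;        /* was: int t = a;  (removed as dead) */
    return *p;
}
```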
Results Figure. Speedup of +O2-optimized PA-8000 binaries running on Dynamo, relative to the identical binaries running standalone. The contributions from Dynamo inlining due to trace selection, conservative trace optimization, and aggressive trace optimization are shown. Dynamo bails out to direct native execution on go and vortex.
Results • Dynamo runs faster than native code in the average case • Overhead averages 1.5% (SPEC95 on HP PA-8000) • The most significant source of speedup is trace selection • Most of the overhead is also spent in trace selection • Interpret, bump target counters, test against the threshold • Optimization & code generation have minor cost
Results • The performance gain shrinks when compared against highly optimized binaries.
Explanation of results • Gains from trace selection come from • Selecting a trace and forming fragments out of it • Partial procedure inlining and improved code layout in the cache • Large improvements in • perl, compress, li, and m88ksim • These have stable working sets • No performance improvement on • go, ijpeg, and vortex • Startup time is a non-negligible fraction of total runtime in these programs • They don’t run long enough to recoup Dynamo’s startup overhead • They lack a stable working set
Remarks • The Dynamo project • Started in 1996 • Aims at dynamic performance improvement • A novel native-to-native online optimization approach • Complements the static compiler • A very influential paper • 279 citations tracked in the ACM Digital Library
Remarks • Starting up • A single call to a library routine • It copies the stack, creates space for the interpreter’s state, and begins Dynamo’s execution • This model simplifies program initialization and gives Dynamo local access to the binary’s memory space • Counter management (sketched below) • The 1st time at a branch target allocates & initializes a counter • The 2nd & subsequent times bump that counter • Optimization & code generation recycle the counter’s space • With a cheap breakpoint mechanism, Dynamo could forgo the interpreter and use breakpoints to gain control (much like a debugger)
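A sketch of that counter lifecycle, assuming a simple free list for recycling (the allocation scheme is invented):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Counter {
    uint32_t        count;
    struct Counter *next_free;
} Counter;

static Counter *free_list;

/* 1st visit to a branch target: allocate & initialize a counter,
   preferring space recycled from already-compiled traces. */
Counter *counter_for_target(void) {
    Counter *c = free_list;
    if (c)
        free_list = c->next_free;
    else
        c = malloc(sizeof *c);   /* NULL check omitted in this sketch */
    c->count = 1;
    return c;
}

/* After optimization & code generation, the counter is recycled. */
void counter_recycle(Counter *c) {
    c->next_free = free_list;
    free_list = c;
}
```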
Caveats • Runs on the PA-8000 (introduced in November 1995) • Very simple instructions; easier to write an optimizer for than CISC (recall the Itanium fiasco) • This chip lacks most of the hardware optimization features common on modern chips • Low performance gain on already highly optimized binaries • Poor handling of interrupts may cause problems in I/O-intensive scenarios; the experiments do not test this scenario • Later documents from the authors market Dynamo as an interpreter for emulators
References • Dynamo: A Transparent Dynamic Optimization System • Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia • Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation • Ars Technica report on Dynamo • http://www.arstechnica.com/reviews/1q00/dynamo/dynamo-1.html
Slides based on previous slides of: • Dynamic Optimization as Typified by the Dynamo System • Keith D. Cooper & Linda Torczon • COMP 512, Rice University • Mayank Agarwal. CS 812: High Level Design and Modeling student presentation