Dynamic Optimization David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu
What is Dynamic Optimization?
• Allow a running binary to adapt to the underlying hardware system dynamically
• Perform optimization without sacrificing performance
[Diagram: a static source is compiled once into a "fluid" binary that runs on the OS/HW platform under a runtime dynamic optimization system, adapting to each input]
Why Dynamic versus Static?
• Allows code to adapt to:
  • Changes in the microarchitecture of the underlying platform (related to binary translation)
  • Changes in program input
  • Environment dynamics (e.g., system load, system availability)
• Involves very little user interaction (optimization should be applied transparently)
• Source code is not needed
• Language independent
Challenges with Dynamic Optimization
• Reducing the associated overhead and maintaining transparency
• Addressing a range of workloads
• Selecting appropriate optimizations
Dynamic Optimization Systems
• Dynamo
  • HP Labs, PA-RISC/HP-UX
  • Runtime optimization
• Vulcan/Mojo
  • MS Research, x86-IA64/Win2K
  • Desktop instrumentation, profiling, and optimization
• Jalapeno
  • IBM Research, JVM-PPC-SMPs/AIX
  • Java JIT designed for research
• Latte
  • Seoul National University, Korea
  • Java JIT designed for efficient register allocation
Dynamo
[Diagram: normal execution model — Application + Libs (native binary) running directly on the CPU platform; Dynamo execution model — the same Application + Libs running on Dynamo, which runs on the CPU platform]
To the application, Dynamo looks like a software interpreter that executes the same instruction set as the underlying hardware (the CPU).
* Many of these slides were provided by Evelyn Duesterwald
Elements of Dynamo
• A novel performance delivery mechanism:
  • Optimize the code when it executes, not when it is created
  • A client-enabled performance mechanism
• Dynamic code re-layout
• Partial dynamic inlining/superblock formation
• Path-specific optimization
• Adaptive: machine- and input-specific
• Complementary to static optimization
• Transparent: requires no compiler support
Flow within Dynamo
[Flowchart: the input native instruction stream enters an interpretation/profiling loop — look up the next PC in the trace cache; on a miss, interpret until a taken branch; if the branch target is a hot start-of-trace, the Trace Selector builds a trace, the Trace Optimizer optimizes it, and the trace is emitted into the Dynamo code cache by the Trace Linker (the hotness counter is then recycled); on a hit, execute from the code cache until an exit branch returns control to the loop]
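The interpret/profile/execute loop can be sketched as below. All names (`HOT_THRESHOLD`, `run`, the two callback parameters) are hypothetical, and the toy "trace" is a list of PCs standing in for optimized native code — this is an illustration of the control flow, not Dynamo's implementation.

```python
# Toy sketch of Dynamo's main loop: cache lookup, interpret on a miss,
# profile candidate start-of-trace points, and form a trace when one gets hot.
HOT_THRESHOLD = 50  # illustrative; a start-of-trace counter must reach this

trace_cache = {}    # start PC -> optimized trace (here: just a list of PCs)
hot_counters = {}   # candidate start-of-trace PC -> execution count

def run(pc, interpret_until_taken_branch, select_and_optimize_trace):
    """One iteration of the interpret/profile/execute loop."""
    if pc in trace_cache:                     # hit: run cached optimized code
        return ("execute_cached", trace_cache[pc])
    # miss: bump the hotness counter and interpret until the next taken branch
    hot_counters[pc] = hot_counters.get(pc, 0) + 1
    next_pc = interpret_until_taken_branch(pc)
    if hot_counters[pc] >= HOT_THRESHOLD:     # hot start-of-trace detected
        trace_cache[pc] = select_and_optimize_trace(pc)
        hot_counters[pc] = 0                  # recycle the counter
    return ("interpreted", next_pc)
```

After enough visits to the same start PC, the lookup switches from interpretation to cached execution, which is where Dynamo recovers its overhead.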
Traces in Dynamo
Trace = a single-entry, join-free dynamic sequence of basic blocks
[Figure: the same code shown as a control flow graph, in memory layout, and in trace cache layout — in the trace cache the hot path through blocks A–E is laid out contiguously with the call/return folded in, a trampoline connecting to other traces, and side exits back to the interpreter]
Traces in Dynamo
• Interprocedural forward path:
  • start-of-trace = target of a backward branch
  • end-of-trace = a taken backward branch
[CFG with blocks A–O] 11 paths through the loop: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO, ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO
Traces in Dynamo – typical path profiles
• Approach:
  • profile all edge frequencies
  • select the hot trace by following the highest-frequency branch outcome at each conditional
• Disadvantages:
  • Infeasible paths: ignores branch correlation
  • Overhead: need to profile every conditional branch
[Same CFG with blocks A–O]
Traces in Dynamo – Next Executing Tail (NET) Prediction
• Minimal profiling:
  • profile only start-of-trace points (e.g., block A)
• Optimistic:
  • at a hot start-of-trace, select the next executing tail as the trace
• Advantages:
  • very lightweight: #instrumentation points = #counters = #targets of backward branches
  • statistically likely to pick the hottest path
  • selects only feasible paths
  • easy to implement
[Same CFG with blocks A–O]
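A minimal sketch of NET-style selection over a dynamic stream of block labels follows. The function name, the hot threshold, and the list-of-labels representation are all invented for illustration; real Dynamo instruments native backward-branch targets, not Python strings.

```python
# Next-Executing-Tail sketch: counters only at targets of backward branches;
# when one gets hot, the very next executed path (until the next taken
# backward branch) is recorded as the trace.
HOT = 3  # illustrative threshold

def net_select(executed_blocks, backward_targets, hot=HOT):
    """Scan a dynamic block stream; return {start_block: selected trace}."""
    counters = {b: 0 for b in backward_targets}
    traces, recording, current = {}, None, []
    for b in executed_blocks:
        if recording is not None:
            if b in backward_targets:      # taken backward branch: end of trace
                traces.setdefault(recording, list(current))
                recording, current = None, []
            else:
                current.append(b)
                continue
        if b in backward_targets and b not in traces:
            counters[b] += 1               # the only instrumentation point
            if counters[b] >= hot:         # hot: record the next executing tail
                recording, current = b, [b]
    return traces
```

Note how little bookkeeping is needed: one counter per backward-branch target, and no per-branch edge profiling.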
When to stop creating new traces
• Excessively high trace selection rates cause unacceptable overhead and potential thrashing in the Dynamo code cache
• We need the opportunity to amortize the cost of creating traces, so trace creation must sometimes be turned off
• "Bail out" is entered when the creation rate per unit time is excessively high
Trace Optimization
[Pipeline: list of trace blocks → build a lightweight intermediate representation ("lite IR" with symbolic labels and an extended virtual register set) → forward-pass optimization with integrated demand-driven analysis → backward pass → scheduling & register allocation (retaining previous mappings) → linker]
Trace Optimization
Are there any runtime optimization opportunities in statically optimized code? Limitations of static compiler optimization:
• cost of call-specific interprocedural optimization
• cost of path-specific optimization in the presence of complex control flow
• difficulty of predicting past indirect branches
• lack of access to shared libraries
• sub-optimal register allocation decisions
• register allocation for individual array elements or pointers
Path-specific optimizations
• Conservative optimizations (precise signal delivery, memory-safe):
  • partial procedure inlining
  • redundant branch removal
  • constant propagation
  • constant folding
  • copy propagation
• Aggressive optimizations:
  • redundant load removal
  • runtime-disambiguated (guarded) load removal
  • dead code elimination
  • partially dead code sinking
  • loop unrolling
  • loop invariant hoisting
• Aggressive optimization can be made memory- and signal-safe via compiler hints and de-optimization
Dynamo Optimizations
• Constant propagation
  • Given x <- c for variable x and constant c, replace all later uses of x with c, assuming that x will not be modified
[Before/after example: the sequence b <- 3; c <- 4 * b; if c > b; d <- b + 2; e <- a + b becomes b <- 3; c <- 4 * 3; if c > 3; d <- 3 + 2; e <- a + 3]
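Because a trace is join-free, constant propagation reduces to a single forward pass, which can be sketched as below. The tuple instruction format — `("assign", dst, src)` and `("op", dst, symbol, a, b)` with string variable names and int constants — is invented for illustration.

```python
# Forward-pass constant propagation over a straight-line (join-free) trace.
def subst(x, consts):
    """Replace a variable operand with its known constant value, if any."""
    return consts.get(x, x) if isinstance(x, str) else x

def propagate_constants(trace):
    """trace: list of ("assign", dst, src) or ("op", dst, symbol, a, b)."""
    consts, out = {}, []
    for inst in trace:
        if inst[0] == "assign":
            _, dst, src = inst
            src = subst(src, consts)
            out.append(("assign", dst, src))
            if isinstance(src, int):
                consts[dst] = src          # dst is now a known constant
            else:
                consts.pop(dst, None)      # dst no longer a known constant
        else:
            _, dst, sym, a, b = inst
            out.append(("op", dst, sym, subst(a, consts), subst(b, consts)))
            consts.pop(dst, None)          # op result unknown (folding is separate)
    return out
```

Running this over the slide's example replaces every use of b with 3, exactly as shown in the before/after figure.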
Dynamo Optimizations
• Constant folding
  • Identifying that all operands in an assignment are constant after macro expansion and constant propagation
  • Easy for booleans; a little trickier for integers (exceptions such as divide-by-zero and overflow); for FP this can be very tricky due to multiple FP formats
[Before/after example: c <- 4 * 3 folds to c <- 12; e <- a + 3 cannot fold because a is not constant]
Dynamo Optimizations
• Partial load removal – see the LRE paper
• Dead code elimination
  • A variable is dead if it is not used on any path from where it is defined to where the function exits
  • An instruction is dead if it computes only values that are not used on any executable path leading from the instruction
  • Dead code is often created by other optimizations (e.g., strength reduction: replacing expensive ops with less expensive ops)
• Loop invariant hoisting – moving invariant operations out of the loop body
• Fragment link-time optimizations – apply peephole optimization around links, looking for dead code to remove
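On a join-free trace, the dead-code definition above becomes a single backward liveness pass, sketched below. The tuple format — `("assign", dst, src)` and `("op", dst, symbol, a, b)` — and the `live_out` parameter are invented for illustration; side-effecting instructions (stores, calls) are deliberately left out of the toy model.

```python
# Backward-pass dead code elimination over a straight-line trace: drop any
# instruction whose destination is not live at that point.
def eliminate_dead_code(trace, live_out):
    """live_out: variables still needed after the trace (e.g., at trace exits)."""
    live, kept = set(live_out), []
    for inst in reversed(trace):
        dst = inst[1]
        operands = inst[2:] if inst[0] == "assign" else inst[3:]  # skip op symbol
        if dst not in live:
            continue                  # value never read on any later path: drop
        live.discard(dst)             # dst is defined (killed) here ...
        live.update(x for x in operands if isinstance(x, str))  # ... uses live
        kept.append(inst)
    kept.reverse()
    return kept
```

This is also where residue from other optimizations gets cleaned up: an instruction made redundant by, say, strength reduction simply stops appearing in anyone's live set.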
Implementation Issues
Problem: A signal arrives while executing in the code cache – how can we achieve transparent signal delivery? How can the original signal context be reconstructed?
Dynamo approach: intercept all signals. Upon arrival of a signal at code cache location L, Dynamo first gains control:
• Save the code cache context
• Retranslate the trace and record:
  • any changes in register mapping up to position L
  • the original code address of L
  • all context-modifying optimizations and the steps for de-optimization
• Update the code cache context to obtain the native context
• Load the native context and execute the original signal handler
Dynamic Code Cache
Problem: How do we control the size of the dynamically recompiled code? How do we react to phase changes?
Adaptive flushing-based cache management scheme:
• Preemptive cache flushes
• Fast allocation/de-allocation of traces
• Removal of old and cold traces
• Branch re-biasing to improve locality in the cache
• Configurable for various performance/memory-footprint trade-offs
• Code cache default size: 300 KB
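A minimal sketch of the flush-based idea: a fixed budget, and when a new trace does not fit, flush everything rather than evict piecemeal — a likely phase change has made the old traces cold anyway. The class name, API, and the wholesale-flush policy shown here are a simplification for illustration; the slide's 300 KB figure is used as the default budget.

```python
# Toy flush-based code cache with a fixed byte budget.
class CodeCache:
    def __init__(self, capacity=300 * 1024):   # Dynamo's default: 300 KB
        self.capacity, self.used = capacity, 0
        self.traces = {}                       # start PC -> trace bytes
        self.flushes = 0

    def insert(self, start_pc, code_bytes):
        if self.used + len(code_bytes) > self.capacity:
            self.traces.clear()                # preemptive flush of everything
            self.used = 0
            self.flushes += 1
        self.traces[start_pc] = code_bytes
        self.used += len(code_bytes)

    def lookup(self, start_pc):
        return self.traces.get(start_pc)
```

The appeal of flushing over fine-grained eviction is simplicity: no free-list fragmentation, and trace re-selection after the flush naturally rebuilds only the traces the new phase actually needs.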
Dynamo Performance
[Chart: speedups of +O2-compiled native binaries running under Dynamo on a PA-8000]
Bailout
• Bail out if the trace selection rate exceeds a tolerable threshold
Bailout
• To prevent degradation, Dynamo keeps track of the current trace selection rate
• Virtual time is measured by counting the number of interpreted BBs elapsed while selecting N traces
• A threshold judges whether a rate is "high"
• The trace selection rate is considered excessive if k consecutive high-rate time intervals have been encountered
• Bailout turns off trace selection and optimization; execution resumes in the original binary
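The bailout test can be sketched as a small monitor. Virtual time is interpreted BBs; an interval ends whenever N traces have been selected, and an interval is "high-rate" when it took fewer than a threshold number of BBs. N, k, and the threshold values here are illustrative parameters, not Dynamo's actual settings.

```python
# Toy bailout monitor: bail out after k consecutive high-rate intervals.
class BailoutMonitor:
    def __init__(self, n_traces=10, min_bbs_per_interval=1000, k=3):
        self.n, self.min_bbs, self.k = n_traces, min_bbs_per_interval, k
        self.bbs = self.traces = self.high_streak = 0
        self.bailed_out = False

    def on_interpreted_bb(self):
        self.bbs += 1                        # virtual time advances

    def on_trace_selected(self):
        self.traces += 1
        if self.traces == self.n:            # one interval has elapsed
            high = self.bbs < self.min_bbs   # few BBs per N traces = high rate
            self.high_streak = self.high_streak + 1 if high else 0
            if self.high_streak >= self.k:
                self.bailed_out = True       # stop selecting; run original binary
            self.bbs = self.traces = 0       # start the next interval
```

Requiring k consecutive high intervals (rather than one) keeps a single burst of trace creation, such as program startup, from triggering a premature bailout.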
Performance speedups with bailout
[Chart: +O2-compiled native binaries running under Dynamo on a PA-8000]
Memory Overhead – Dynamo text
• Total size = 273 KB
• PA-RISC-dependent portion = 179 KB (66%)
Summary of Dynamo
• Demonstrated the potential for dynamic optimization through an actual implementation
• Optimization impact tends to be program dependent
• More sophisticated bailout algorithms need to be devised
• Static compile-time hints should be used to help guide a dynamic optimization system
Vulcan – A. Srivastava
• Provides both static and dynamic code modification
• Performs optimization on x86, IA64, and MSIL binaries
• Works in the presence of multithreading and variable-length instructions (x86)
• Designed to perform modifications on a remote machine using a Distributed Component Object Model (DCOM) interface
• Can also serve as a binary translator
Mojo – Dynamic Optimization using Vulcan (Chaiken & Gillies)
• Targets a desktop x86/Windows 2000 environment
• Supports large, multithreaded applications that use exception handlers
• Requires no OS support
• Allows optimization across shared library boundaries
• Can be aided by information provided by a static compiler
Mojo Structure
[Diagram: Original Code and the NT DLL (exception handling) feed the Mojo Dispatcher, which coordinates the Basic Block Cache, the Path Cache, and the Path Builder]
1. Interrogate the Path Cache for a hit.
2. On a hit, execute from the Path Cache directly; otherwise interrogate the Basic Block Cache for a hit.
3. On a hit in the BBC, execute directly; otherwise load the block from the original code.
4. Each time control returns to the Mojo Dispatcher, BBs are checked for "hotness".
5. If a BB is hot enough, Mojo turns on path building. Once a complete path has been built and optimized, it is placed in the Path Cache.
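The dispatcher's lookup order in steps 1–3 can be sketched as a single function. The function name, return convention, and dict-based caches are invented for illustration; the real dispatcher works on decoded x86 instruction bytes.

```python
# Toy Mojo dispatch: Path Cache first, then Basic Block Cache, then the
# original code (which also populates the BBC on a miss).
def dispatch(pc, path_cache, bb_cache, load_from_original):
    """Return (source, code) for the next thing to execute at pc."""
    if pc in path_cache:
        return ("path_cache", path_cache[pc])   # 1. PC hit: run optimized path
    if pc in bb_cache:
        return ("bb_cache", bb_cache[pc])       # 2. BBC hit: run cached block
    block = load_from_original(pc)              # 3. miss: decode from original
    bb_cache[pc] = block                        # cache it for next time
    return ("bb_cache", block)
```

The ordering matters: once a hot path has been placed in the Path Cache, it shadows the individual cached blocks it was built from.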
Mojo Components
• Mojo Dispatcher
  • The control point in the dynamic optimization system
  • Manages execution context using its own stack space
• Basic Block Cache
  • Holds basic blocks that have not yet become hot
  • Identifies basic block boundaries by dynamically decoding instruction bytes
  • Branches are modified to pass control to the dispatcher, passing along the address of the next basic block to execute
  • Additional information is kept in the BBC for use when constructing paths
Mojo Components
• Path Builder
  • Responsible for selecting, building, and optimizing hot paths
  • Maintains "hotness" information for basic blocks
  • Uses the same heuristic for building hot paths as Dynamo (the next path executed after counter overflow)
  • Uses separate thresholds for back-edge targets and path-exit targets (needed to detect hot side exits when constructing a dynamic path)
  • Instructions are laid out contiguously (reordered), eliminating many taken conditional branches
Mojo Components – Path Builder
• Path termination: Dynamo only terminates paths on back edges
[Figure: for an original nested loop over blocks A, B, C — Dynamo's back-edge-only profiling vs. Mojo's back-edge and side-exit profiling, which yields a longer path]
Exception Handling and Threads
• Mojo patches ntdll.dll
• Mojo captures the state of the machine before passing exceptions off to the dispatcher
• The dispatcher prevents the exception handler from polluting the Path Cache
• To handle multithreading, Mojo allocates a basic block cache per thread but uses a shared Path Cache
• Locking mechanisms are provided so the shared Path Cache can be accessed and updated reliably
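The per-thread/shared split can be sketched with thread-local storage and a lock. The class and method names are invented, and a single coarse lock is a simplification; the point is only which cache is private and which is shared.

```python
# Toy model of Mojo's cache organization under multithreading:
# one basic block cache per thread, one lock-guarded shared Path Cache.
import threading

class MojoCaches:
    def __init__(self):
        self.local = threading.local()       # per-thread basic block cache
        self.path_cache = {}                 # shared across all threads
        self.path_lock = threading.Lock()

    def bb_cache(self):
        if not hasattr(self.local, "bbc"):   # lazily create this thread's BBC
            self.local.bbc = {}
        return self.local.bbc

    def publish_path(self, start_pc, code):
        with self.path_lock:                 # reliable update of shared state
            self.path_cache.setdefault(start_pc, code)

    def lookup_path(self, start_pc):
        with self.path_lock:
            return self.path_cache.get(start_pc)
```

Keeping block caches private avoids synchronization on the common (not-yet-hot) path, while hot optimized paths, which are expensive to build, are shared once via the lock.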
Mojo Performance
[Chart: qsort, acker, and fib are recursive programs]
Comments
• For simple programs with simple control flow, Mojo shows good improvement
• For larger programs with more dynamic control flow, Mojo is overwhelmed by the amount of path creation (the same problem encountered with Dynamo)
• A bailout strategy is needed, along with a better hot-path detection algorithm
• Future work is investigating how to use hints obtained during static compilation to aid dynamic optimization of the code
What is a JIT?
• Just-in-Time compiler – developed to address the performance issues of Java interpretation/translation
• Portability generally means lower performance; JITs attempt to bridge this gap
• JITs dynamically cache translated Java bytecodes and perform extensive optimization on the native instructions
• Given the overhead of the OO programming model (frequent method calls), extensive exception checking, and dynamic translation/compilation, the quality of the JIT must be high
Common JITs
• Sun Java Development Kit (Sun)
• HotSpot JIT (Sun)
• Kaffe (Transvirtual Technologies)
• Jalapeno (IBM Research)
• Latte (Seoul National University)
IBM Jalapeno JVM and JIT
• Designed specifically for servers
  • Shared-memory multiprocessor scalability
  • Manages a large number of concurrent threads
• High availability
  • Rapid response and graceful degradation (an issue when garbage collection is involved)
• Mainly developed in Java (reliability?)
• Designed specifically for extensive dynamic optimization
The Jalapeno Adaptive Optimization System (AOS)
• Translates bytecodes directly to the native ISA
• Recompilation is performed in a separate thread from the application, and thus can proceed in parallel with program execution
• AOS has three components:
  • Runtime measurement system
  • Controller
  • Recompilation system
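The measurement → controller → recompilation loop can be sketched as below. The thresholds, level numbering, and all names are invented for illustration — they are not Jalapeno's actual heuristics — and the compilation queue stands in for the hand-off to separate compiler threads.

```python
# Toy AOS controller: count method samples, promote hot methods to higher
# optimization levels, and queue recompilation plans for compiler threads.
from collections import Counter

# (opt level, sample count needed) — illustrative, checked highest first
LEVEL_THRESHOLDS = [(2, 1000), (1, 100), (0, 10)]

class Controller:
    def __init__(self):
        self.samples = Counter()     # runtime measurement data, per method
        self.level = {}              # current compiled level per method
        self.queue = []              # recompilation plans for compiler threads

    def on_sample(self, method):
        self.samples[method] += 1
        for level, threshold in LEVEL_THRESHOLDS:
            if self.samples[method] >= threshold and self.level.get(method, -1) < level:
                self.level[method] = level
                self.queue.append((method, level))  # hand off; app keeps running
                break
```

Because the queue is drained by separate compilation threads, the application never blocks on optimization — the whole point of the AOS design.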
Jalapeno AOS Architecture
[Diagram: the executing code and hardware/VM performance monitors feed raw profile data to the measurement subsystem; organizers format the data into the AOS database and post events to the controller's event queue; the controller forms instrumentation/compilation plans and places them on the compilation queue; compilation threads run the compilers (Base, Opt, …) and install new code into the executing system]
Three Optimization Levels
• Level 0 – On-the-fly optimizations performed during translation (constant propagation, constant folding, dead code detection)
• Level 1 – Adds to Level 0: common subexpression elimination, redundant load elimination, aggressive inlining
• Level 2 – Adds to Level 1: flow-sensitive optimizations, array bounds check elimination