Dynamic Optimization David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu
What is Dynamic Optimization?
• Allow a running binary to adapt to the underlying hardware system dynamically
• Perform optimization without sacrificing performance
[Diagram: a static source is compiled once into a "fluid" binary that runs on the OS/HW platform under a runtime dynamic optimization system, adapting to each input]
Why Dynamic versus Static?
• Allows code to adapt to:
  • Changes in the microarchitecture of the underlying platform (related to binary translation)
  • Changes in program input
  • Environment dynamics (e.g., system load, system availability)
• Involves very little user interaction (optimization should be applied transparently)
• Source code is not needed
• Language independent
Challenges with Dynamic Optimization
• Reducing the associated overhead and maintaining transparency
• Addressing a range of workloads
• Selecting appropriate optimizations
Dynamic Optimization Systems
• Dynamo
  • HP Labs, PA-RISC/HP-UX
  • Runtime optimization
• Vulcan/Mojo
  • MS Research, x86-IA64/Win2K
  • Desktop instrumentation, profiling, and optimization
• Jalapeno
  • IBM Research, JVM-PPC-SMPs/AIX
  • Java JIT designed for research
• Latte
  • Seoul National University, Korea
  • Java JIT designed for efficient register allocation
Dynamo
[Diagram: normal execution model — Application + Libs (native binary) running directly on the CPU platform; Dynamo execution model — the same Application + Libs running on Dynamo, which runs on the CPU platform]
To the application, Dynamo looks like a software interpreter that executes the same instruction set as the underlying hardware (the CPU).
* Many of these slides were provided by Evelyn Duesterwald
Elements of Dynamo
• A novel performance delivery mechanism:
  • Optimize the code when it executes, not when it is created
  • A client-enabled performance mechanism
• Dynamic code re-layout
• Partial dynamic inlining/superblock formation
• Path-specific optimization
• Adaptive: machine- and input-specific
• Complementary to static optimization
• Transparent: requires no compiler support
Flow within Dynamo
[Flowchart: the input native instruction stream enters an interpretation/profiling loop — look up the next PC in the trace cache; on a miss, interpret until a taken branch; if the branch target is a hot start-of-trace, the Trace Selector builds a trace, the Trace Optimizer optimizes it, and the trace is emitted into the Dynamo code cache by the Trace Linker (the hotness counter is then recycled); on a hit, execute from the code cache until an exit branch returns control to the loop]
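The interpret/profile/execute loop can be sketched as below. All names (`HOT_THRESHOLD`, `run`, the two callback parameters) are hypothetical, and the toy "trace" is a list of PCs standing in for optimized native code — this is an illustration of the control flow, not Dynamo's implementation.

```python
# Toy sketch of Dynamo's main loop: cache lookup, interpret on a miss,
# profile candidate start-of-trace points, and form a trace when one gets hot.
HOT_THRESHOLD = 50  # illustrative; a start-of-trace counter must reach this

trace_cache = {}    # start PC -> optimized trace (here: just a list of PCs)
hot_counters = {}   # candidate start-of-trace PC -> execution count

def run(pc, interpret_until_taken_branch, select_and_optimize_trace):
    """One iteration of the interpret/profile/execute loop."""
    if pc in trace_cache:                     # hit: run cached optimized code
        return ("execute_cached", trace_cache[pc])
    # miss: bump the hotness counter and interpret until the next taken branch
    hot_counters[pc] = hot_counters.get(pc, 0) + 1
    next_pc = interpret_until_taken_branch(pc)
    if hot_counters[pc] >= HOT_THRESHOLD:     # hot start-of-trace detected
        trace_cache[pc] = select_and_optimize_trace(pc)
        hot_counters[pc] = 0                  # recycle the counter
    return ("interpreted", next_pc)
```

After enough visits to the same start PC, the lookup switches from interpretation to cached execution, which is where Dynamo recovers its overhead.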
Traces in Dynamo
Trace = a single-entry, join-free dynamic sequence of basic blocks
[Figure: the same code shown as a control flow graph, in memory layout, and in trace cache layout — in the trace cache the hot path through blocks A–E is laid out contiguously with the call/return folded in, a trampoline connecting to other traces, and side exits back to the interpreter]
Traces in Dynamo
• Interprocedural forward path:
  • start-of-trace = target of a backward branch
  • end-of-trace = a taken backward branch
[CFG with blocks A–O] 11 paths through the loop: ABCEH, ABCEHKMO, ABCEHKNO, ABCEIKMO, ABCEIKNO, ABCFJL, ABCFJLNO, ABDFJL, ABDFJLNO, ABDGJL, ABDGJLNO
Traces in Dynamo – typical path profiles
• Approach:
  • profile all edge frequencies
  • select the hot trace by following the highest-frequency branch outcome at each conditional
• Disadvantages:
  • Infeasible paths: ignores branch correlation
  • Overhead: need to profile every conditional branch
[Same CFG with blocks A–O]
Traces in Dynamo – Next Executing Tail (NET) Prediction
• Minimal profiling:
  • profile only start-of-trace points (e.g., block A)
• Optimistic:
  • at a hot start-of-trace, select the next executing tail as the trace
• Advantages:
  • very lightweight: #instrumentation points = #counters = #targets of backward branches
  • statistically likely to pick the hottest path
  • selects only feasible paths
  • easy to implement
[Same CFG with blocks A–O]
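A minimal sketch of NET-style selection over a dynamic stream of block labels follows. The function name, the hot threshold, and the list-of-labels representation are all invented for illustration; real Dynamo instruments native backward-branch targets, not Python strings.

```python
# Next-Executing-Tail sketch: counters only at targets of backward branches;
# when one gets hot, the very next executed path (until the next taken
# backward branch) is recorded as the trace.
HOT = 3  # illustrative threshold

def net_select(executed_blocks, backward_targets, hot=HOT):
    """Scan a dynamic block stream; return {start_block: selected trace}."""
    counters = {b: 0 for b in backward_targets}
    traces, recording, current = {}, None, []
    for b in executed_blocks:
        if recording is not None:
            if b in backward_targets:      # taken backward branch: end of trace
                traces.setdefault(recording, list(current))
                recording, current = None, []
            else:
                current.append(b)
                continue
        if b in backward_targets and b not in traces:
            counters[b] += 1               # the only instrumentation point
            if counters[b] >= hot:         # hot: record the next executing tail
                recording, current = b, [b]
    return traces
```

Note how little bookkeeping is needed: one counter per backward-branch target, and no per-branch edge profiling.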
When to stop creating new traces
• Excessively high trace selection rates cause unacceptable overhead and potential thrashing in the Dynamo code cache
• We need the opportunity to amortize the cost of creating traces, so trace creation must sometimes be turned off
• "Bail out" is entered when the creation rate per unit time is excessively high
Trace Optimization
[Pipeline: list of trace blocks → build a lightweight intermediate representation ("lite IR" with symbolic labels and an extended virtual register set) → forward-pass optimization with integrated demand-driven analysis → backward pass → scheduling & register allocation (retaining previous mappings) → linker]
Trace Optimization
Are there any runtime optimization opportunities in statically optimized code? Limitations of static compiler optimization:
• cost of call-specific interprocedural optimization
• cost of path-specific optimization in the presence of complex control flow
• difficulty of predicting past indirect branches
• lack of access to shared libraries
• sub-optimal register allocation decisions
• register allocation for individual array elements or pointers
Path-specific optimizations
• Conservative optimizations (precise signal delivery, memory-safe):
  • partial procedure inlining
  • redundant branch removal
  • constant propagation
  • constant folding
  • copy propagation
• Aggressive optimizations:
  • redundant load removal
  • runtime-disambiguated (guarded) load removal
  • dead code elimination
  • partially dead code sinking
  • loop unrolling
  • loop invariant hoisting
• Aggressive optimization can be made memory- and signal-safe via compiler hints and de-optimization
Dynamo Optimizations
• Constant propagation
  • Given x <- c for variable x and constant c, replace all later uses of x with c, assuming that x will not be modified
[Before/after example: the sequence b <- 3; c <- 4 * b; if c > b; d <- b + 2; e <- a + b becomes b <- 3; c <- 4 * 3; if c > 3; d <- 3 + 2; e <- a + 3]
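Because a trace is join-free, constant propagation reduces to a single forward pass, which can be sketched as below. The tuple instruction format — `("assign", dst, src)` and `("op", dst, symbol, a, b)` with string variable names and int constants — is invented for illustration.

```python
# Forward-pass constant propagation over a straight-line (join-free) trace.
def subst(x, consts):
    """Replace a variable operand with its known constant value, if any."""
    return consts.get(x, x) if isinstance(x, str) else x

def propagate_constants(trace):
    """trace: list of ("assign", dst, src) or ("op", dst, symbol, a, b)."""
    consts, out = {}, []
    for inst in trace:
        if inst[0] == "assign":
            _, dst, src = inst
            src = subst(src, consts)
            out.append(("assign", dst, src))
            if isinstance(src, int):
                consts[dst] = src          # dst is now a known constant
            else:
                consts.pop(dst, None)      # dst no longer a known constant
        else:
            _, dst, sym, a, b = inst
            out.append(("op", dst, sym, subst(a, consts), subst(b, consts)))
            consts.pop(dst, None)          # op result unknown (folding is separate)
    return out
```

Running this over the slide's example replaces every use of b with 3, exactly as shown in the before/after figure.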
Dynamo Optimizations
• Constant folding
  • Identifying that all operands in an assignment are constant after macro expansion and constant propagation
  • Easy for booleans; a little trickier for integers (exceptions such as divide-by-zero and overflow); for FP this can be very tricky due to multiple FP formats
[Before/after example: c <- 4 * 3 folds to c <- 12; e <- a + 3 cannot fold because a is not constant]
Dynamo Optimizations
• Partial load removal – see the LRE paper
• Dead code elimination
  • A variable is dead if it is not used on any path from where it is defined to where the function exits
  • An instruction is dead if it computes only values that are not used on any executable path leading from the instruction
  • Dead code is often created by other optimizations (e.g., strength reduction: replacing expensive ops with less expensive ops)
• Loop invariant hoisting – moving invariant operations out of the loop body
• Fragment link-time optimizations – apply peephole optimization around links, looking for dead code to remove
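On a join-free trace, the dead-code definition above becomes a single backward liveness pass, sketched below. The tuple format — `("assign", dst, src)` and `("op", dst, symbol, a, b)` — and the `live_out` parameter are invented for illustration; side-effecting instructions (stores, calls) are deliberately left out of the toy model.

```python
# Backward-pass dead code elimination over a straight-line trace: drop any
# instruction whose destination is not live at that point.
def eliminate_dead_code(trace, live_out):
    """live_out: variables still needed after the trace (e.g., at trace exits)."""
    live, kept = set(live_out), []
    for inst in reversed(trace):
        dst = inst[1]
        operands = inst[2:] if inst[0] == "assign" else inst[3:]  # skip op symbol
        if dst not in live:
            continue                  # value never read on any later path: drop
        live.discard(dst)             # dst is defined (killed) here ...
        live.update(x for x in operands if isinstance(x, str))  # ... uses live
        kept.append(inst)
    kept.reverse()
    return kept
```

This is also where residue from other optimizations gets cleaned up: an instruction made redundant by, say, strength reduction simply stops appearing in anyone's live set.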
Implementation Issues
Problem: A signal arrives while executing in the code cache – how can we achieve transparent signal delivery? How can the original signal context be reconstructed?
Dynamo approach: intercept all signals. Upon arrival of a signal at code cache location L, Dynamo first gains control:
• Save the code cache context
• Retranslate the trace and record:
  • any changes in register mapping up to position L
  • the original code address of L
  • all context-modifying optimizations and the steps for de-optimization
• Update the code cache context to obtain the native context
• Load the native context and execute the original signal handler
Dynamic Code Cache
Problem: How do we control the size of the dynamically recompiled code? How do we react to phase changes?
Adaptive flushing-based cache management scheme:
• Preemptive cache flushes
• Fast allocation/de-allocation of traces
• Removal of old and cold traces
• Branch re-biasing to improve locality in the cache
• Configurable for various performance/memory-footprint trade-offs
• Code cache default size: 300 KB
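A minimal sketch of the flush-based idea: a fixed budget, and when a new trace does not fit, flush everything rather than evict piecemeal — a likely phase change has made the old traces cold anyway. The class name, API, and the wholesale-flush policy shown here are a simplification for illustration; the slide's 300 KB figure is used as the default budget.

```python
# Toy flush-based code cache with a fixed byte budget.
class CodeCache:
    def __init__(self, capacity=300 * 1024):   # Dynamo's default: 300 KB
        self.capacity, self.used = capacity, 0
        self.traces = {}                       # start PC -> trace bytes
        self.flushes = 0

    def insert(self, start_pc, code_bytes):
        if self.used + len(code_bytes) > self.capacity:
            self.traces.clear()                # preemptive flush of everything
            self.used = 0
            self.flushes += 1
        self.traces[start_pc] = code_bytes
        self.used += len(code_bytes)

    def lookup(self, start_pc):
        return self.traces.get(start_pc)
```

The appeal of flushing over fine-grained eviction is simplicity: no free-list fragmentation, and trace re-selection after the flush naturally rebuilds only the traces the new phase actually needs.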
Dynamo Performance
[Chart: speedups of +O2-compiled native binaries running under Dynamo on a PA-8000]
Bailout
• Bail out if the trace selection rate exceeds a tolerable threshold
Bailout
• To prevent degradation, Dynamo keeps track of the current trace selection rate
• Virtual time is measured by counting the number of interpreted BBs elapsed while selecting N traces
• A threshold judges whether a rate is "high"
• The trace selection rate is considered excessive if k consecutive high-rate time intervals have been encountered
• Bailout turns off trace selection and optimization; execution resumes in the original binary
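The bailout test can be sketched as a small monitor. Virtual time is interpreted BBs; an interval ends whenever N traces have been selected, and an interval is "high-rate" when it took fewer than a threshold number of BBs. N, k, and the threshold values here are illustrative parameters, not Dynamo's actual settings.

```python
# Toy bailout monitor: bail out after k consecutive high-rate intervals.
class BailoutMonitor:
    def __init__(self, n_traces=10, min_bbs_per_interval=1000, k=3):
        self.n, self.min_bbs, self.k = n_traces, min_bbs_per_interval, k
        self.bbs = self.traces = self.high_streak = 0
        self.bailed_out = False

    def on_interpreted_bb(self):
        self.bbs += 1                        # virtual time advances

    def on_trace_selected(self):
        self.traces += 1
        if self.traces == self.n:            # one interval has elapsed
            high = self.bbs < self.min_bbs   # few BBs per N traces = high rate
            self.high_streak = self.high_streak + 1 if high else 0
            if self.high_streak >= self.k:
                self.bailed_out = True       # stop selecting; run original binary
            self.bbs = self.traces = 0       # start the next interval
```

Requiring k consecutive high intervals (rather than one) keeps a single burst of trace creation, such as program startup, from triggering a premature bailout.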
Performance speedups with bailout
[Chart: +O2-compiled native binaries running under Dynamo on a PA-8000]
Memory Overhead – Dynamo text
• Total size = 273 KB
• PA-RISC-dependent portion = 179 KB (66%)
Summary of Dynamo
• Demonstrated the potential for dynamic optimization through an actual implementation
• Optimization impact tends to be program dependent
• More sophisticated bailout algorithms need to be devised
• Static compile-time hints should be used to help guide a dynamic optimization system
Vulcan – A. Srivastava
• Provides both static and dynamic code modification
• Performs optimization on x86, IA64, and MSIL binaries
• Works in the presence of multithreading and variable-length instructions (x86)
• Designed to perform modifications on a remote machine using a Distributed Component Object Model (DCOM) interface
• Can also serve as a binary translator
Mojo – Dynamic Optimization using Vulcan (Chaiken & Gillies)
• Targets a desktop x86/Windows 2000 environment
• Supports large, multithreaded applications that use exception handlers
• Requires no OS support
• Allows optimization across shared library boundaries
• Can be aided by information provided by a static compiler
Mojo Structure
[Diagram: Original Code and the NT DLL (exception handling) feed the Mojo Dispatcher, which coordinates the Basic Block Cache, the Path Cache, and the Path Builder]
1. Interrogate the Path Cache for a hit.
2. On a hit, execute from the Path Cache directly; otherwise interrogate the Basic Block Cache for a hit.
3. On a hit in the BBC, execute directly; otherwise load the block from the original code.
4. Each time control returns to the Mojo Dispatcher, BBs are checked for "hotness".
5. If a BB is hot enough, Mojo turns on path building. Once a complete path has been built and optimized, it is placed in the Path Cache.
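The dispatcher's lookup order in steps 1–3 can be sketched as a single function. The function name, return convention, and dict-based caches are invented for illustration; the real dispatcher works on decoded x86 instruction bytes.

```python
# Toy Mojo dispatch: Path Cache first, then Basic Block Cache, then the
# original code (which also populates the BBC on a miss).
def dispatch(pc, path_cache, bb_cache, load_from_original):
    """Return (source, code) for the next thing to execute at pc."""
    if pc in path_cache:
        return ("path_cache", path_cache[pc])   # 1. PC hit: run optimized path
    if pc in bb_cache:
        return ("bb_cache", bb_cache[pc])       # 2. BBC hit: run cached block
    block = load_from_original(pc)              # 3. miss: decode from original
    bb_cache[pc] = block                        # cache it for next time
    return ("bb_cache", block)
```

The ordering matters: once a hot path has been placed in the Path Cache, it shadows the individual cached blocks it was built from.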
Mojo Components
• Mojo Dispatcher
  • The control point in the dynamic optimization system
  • Manages execution context using its own stack space
• Basic Block Cache
  • Holds basic blocks that have not yet become hot
  • Identifies basic block boundaries by dynamically decoding instruction bytes
  • Branches are modified to pass control to the dispatcher, passing along the address of the next basic block to execute
  • Additional information is kept in the BBC for use when constructing paths
Mojo Components
• Path Builder
  • Responsible for selecting, building, and optimizing hot paths
  • Maintains "hotness" information for basic blocks
  • Uses the same heuristic for building hot paths as Dynamo (the next path executed after counter overflow)
  • Uses separate thresholds for back-edge targets and path-exit targets (needed to detect hot side exits when constructing a dynamic path)
  • Instructions are laid out contiguously (reordered), eliminating many taken conditional branches
Mojo Components – Path Builder
• Path termination: Dynamo only terminates paths on back edges
[Figure: for an original nested loop over blocks A, B, C — Dynamo's back-edge-only profiling vs. Mojo's back-edge and side-exit profiling, which yields a longer path]
Exception Handling and Threads
• Mojo patches ntdll.dll
• Mojo captures the state of the machine before passing exceptions off to the dispatcher
• The dispatcher prevents the exception handler from polluting the Path Cache
• To handle multithreading, Mojo allocates a basic block cache per thread but uses a shared Path Cache
• Locking mechanisms are provided so the shared Path Cache can be accessed and updated reliably
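The per-thread/shared split can be sketched with thread-local storage and a lock. The class and method names are invented, and a single coarse lock is a simplification; the point is only which cache is private and which is shared.

```python
# Toy model of Mojo's cache organization under multithreading:
# one basic block cache per thread, one lock-guarded shared Path Cache.
import threading

class MojoCaches:
    def __init__(self):
        self.local = threading.local()       # per-thread basic block cache
        self.path_cache = {}                 # shared across all threads
        self.path_lock = threading.Lock()

    def bb_cache(self):
        if not hasattr(self.local, "bbc"):   # lazily create this thread's BBC
            self.local.bbc = {}
        return self.local.bbc

    def publish_path(self, start_pc, code):
        with self.path_lock:                 # reliable update of shared state
            self.path_cache.setdefault(start_pc, code)

    def lookup_path(self, start_pc):
        with self.path_lock:
            return self.path_cache.get(start_pc)
```

Keeping block caches private avoids synchronization on the common (not-yet-hot) path, while hot optimized paths, which are expensive to build, are shared once via the lock.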
Mojo Performance
[Chart: qsort, acker, and fib are recursive programs]
Comments
• For simple programs with simple control flow, Mojo shows good improvement
• For larger programs with more dynamic control flow, Mojo is overwhelmed by the amount of path creation (the same problem encountered with Dynamo)
• A bailout strategy is needed, along with a better hot-path detection algorithm
• Future work is investigating how to use hints obtained during static compilation to aid dynamic optimization of the code
What is a JIT?
• Just-in-Time compiler – developed to address the performance issues of Java interpretation/translation
• Portability generally means lower performance; JITs attempt to bridge this gap
• JITs dynamically cache translated Java bytecodes and perform extensive optimization on the native instructions
• Given the overhead of the OO programming model (frequent method calls), extensive exception checking, and dynamic translation/compilation, the quality of the JIT must be high
Common JITs
• Sun Java Development Kit (Sun)
• HotSpot JIT (Sun)
• Kaffe (Transvirtual Technologies)
• Jalapeno (IBM Research)
• Latte (Seoul National University)
IBM Jalapeno JVM and JIT
• Designed specifically for servers
  • Shared-memory multiprocessor scalability
  • Manages a large number of concurrent threads
• High availability
  • Rapid response and graceful degradation (an issue when garbage collection is involved)
• Mainly developed in Java (reliability?)
• Designed specifically for extensive dynamic optimization
The Jalapeno Adaptive Optimization System (AOS)
• Translates bytecodes directly to the native ISA
• Recompilation is performed in a separate thread from the application, and thus can proceed in parallel with program execution
• AOS has three components:
  • Runtime measurement system
  • Controller
  • Recompilation system
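The measurement → controller → recompilation loop can be sketched as below. The thresholds, level numbering, and all names are invented for illustration — they are not Jalapeno's actual heuristics — and the compilation queue stands in for the hand-off to separate compiler threads.

```python
# Toy AOS controller: count method samples, promote hot methods to higher
# optimization levels, and queue recompilation plans for compiler threads.
from collections import Counter

# (opt level, sample count needed) — illustrative, checked highest first
LEVEL_THRESHOLDS = [(2, 1000), (1, 100), (0, 10)]

class Controller:
    def __init__(self):
        self.samples = Counter()     # runtime measurement data, per method
        self.level = {}              # current compiled level per method
        self.queue = []              # recompilation plans for compiler threads

    def on_sample(self, method):
        self.samples[method] += 1
        for level, threshold in LEVEL_THRESHOLDS:
            if self.samples[method] >= threshold and self.level.get(method, -1) < level:
                self.level[method] = level
                self.queue.append((method, level))  # hand off; app keeps running
                break
```

Because the queue is drained by separate compilation threads, the application never blocks on optimization — the whole point of the AOS design.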
Jalapeno AOS Architecture
[Diagram: the executing code and hardware/VM performance monitors feed raw profile data to the measurement subsystem; organizers format the data into the AOS database and post events to the controller's event queue; the controller forms instrumentation/compilation plans and places them on the compilation queue; compilation threads run the compilers (Base, Opt, …) and install new code into the executing system]
Three Optimization Levels
• Level 0 – On-the-fly optimizations performed during translation (constant propagation, constant folding, dead code detection)
• Level 1 – Adds to Level 0: common subexpression elimination, redundant load elimination, aggressive inlining
• Level 2 – Adds to Level 1: flow-sensitive optimizations, array bounds check elimination