Adaptive Optimization with On-Stack Replacement
Stephen J. Fink, IBM T.J. Watson Research Center
Feng Qian (presenter), Sable Research Group, McGill University
http://www.sable.mcgill.ca
Motivation • Modern VMs use adaptive recompilation strategies • The VM replaces an entry in the dispatch table with newly compiled code • Switching to the new code can happen only at the next invocation • On-stack replacement (OSR) allows the transition to happen in the middle of a method's execution
What is On-stack Replacement? • Transfer execution from compiled code m1 to compiled code m2 even while m1 runs on some thread's stack (diagram: a stack frame and PC for m1 become a stack frame and PC for m2)
Why On-Stack Replacement (OSR)? • Debugging optimized code via dynamic de-optimization [SELF-93] • Deferred compilation of cold paths in a method [SELF-91, HotSpot, Whaley 2001] • Promotion of long-run activations [SELF-93] • Safe invalidation for speculative optimization [HotSpot, SELF-91]
Related Work • Hölzle, Chambers, and Ungar (SELF-91, SELF-93): deferred compilation, de-optimization for debugging, promotion of long-running loops, safe invalidation [OOPSLA'91, PLDI'92, OOPSLA'94] • HotSpot server compiler [JVM'01] • Partial method compilation [OOPSLA'01]
OSR Challenges • Engineering Complexity • How to minimize disruption to VM code base? • How to constrain optimizations? • Policies for applying OSR • How to make rational decisions for applying OSR? • Effectiveness • How does OSR improve/constrain dataflow optimizations? • How effective are online OSR-based optimizations?
Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion
OSR Mechanism Overview • (1) Extract compiler-independent state from a suspended activation of m1 • (2) Generate specialized code m2 for the suspended activation • (3) Compile and transfer execution to the new code m2
JVM Scope Descriptor • Compiler-independent state of a running activation • Based on the Java Virtual Machine architecture • Five components: • the thread running the activation • a reference to the activation's stack frame • the program counter (as a bytecode index) • the value of each local variable • the value of each stack location
JVM Scope Descriptor Example

Source:

    class C {
      static int sum(int c) {
        int y = 0;
        for (int i = 0; i < c; i++) {
          y += i;
        }
        return y;
      }
    }

Bytecode:

    0  iconst_0
    1  istore_1
    2  iconst_0
    3  istore_2
    4  goto 14
    7  iload_1
    8  iload_2
    9  iadd
    10 istore_1
    11 iinc 2 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

JVM Scope Descriptor, suspended after 50 loop iterations (i = 50):

    Running thread: MainThread
    Frame Pointer: 0xSomeAddress
    Program Counter: 16
    Local variables: L0(c) = 100; L1(y) = 1225; L2(i) = 50
    Stack Expressions: S0 = 50; S1 = 100
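To make the descriptor concrete, here is a minimal Java sketch of such a record, populated with the values from the example above; the class name, field names, and frame-pointer value are illustrative only, not the actual Jikes RVM data structures.

    import java.util.Arrays;

    public class JvmScopeDescriptor {
        final String thread;      // thread running the activation
        final long framePointer;  // reference to the activation's stack frame
        final int bytecodeIndex;  // program counter as a bytecode index
        final int[] locals;       // value of each local variable (ints only, for brevity)
        final int[] stack;        // value of each operand-stack slot

        JvmScopeDescriptor(String thread, long framePointer, int bytecodeIndex,
                           int[] locals, int[] stack) {
            this.thread = thread;
            this.framePointer = framePointer;
            this.bytecodeIndex = bytecodeIndex;
            this.locals = locals;
            this.stack = stack;
        }

        public static void main(String[] args) {
            // The sum(100) activation from the slide, suspended after 50 iterations:
            JvmScopeDescriptor d = new JvmScopeDescriptor(
                "MainThread",
                0xCAFEL,                      // stand-in for 0xSomeAddress
                16,
                new int[] {100, 1225, 50},    // L0(c), L1(y), L2(i)
                new int[] {50, 100});         // S0, S1
            System.out.println("PC=" + d.bytecodeIndex
                + " locals=" + Arrays.toString(d.locals)
                + " stack=" + Arrays.toString(d.stack));
        }
    }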
Extracting the JVM Scope Descriptor • Trivial from the interpreter • Optimizing compiler: • insert OSR point (safe-point) instructions in the initial IR • an OSR point uses the stack and local state needed to recover the scope descriptor • an OSR point is treated as a call and transfers control to the exit block • aggregate OSR points into an OSR map when generating machine instructions (see the sketch below)
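As a rough illustration, the OSR map can be thought of as associating each OSR point's machine-code offset with the locations of the live values needed to rebuild the scope descriptor. The sketch below uses invented names, offsets, and register assignments; it does not reflect Jikes RVM's actual encoding.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class OsrMapSketch {
        // One entry per OSR point: the bytecode index to resume at, plus
        // where each live local and stack value lives in the compiled frame.
        record OsrEntry(int bytecodeIndex,
                        List<String> localLocations,
                        List<String> stackLocations) {}

        public static void main(String[] args) {
            // Keyed by the machine-code offset of the OSR point.
            Map<Integer, OsrEntry> osrMap = new HashMap<>();
            // Invented example: an OSR point at offset 0x40 for bci 16, with
            // c in R3, y in R4, i spilled at FP-8, and two stack temps in R5, R6.
            osrMap.put(0x40, new OsrEntry(16,
                List.of("R3", "R4", "FP-8"),
                List.of("R5", "R6")));
            System.out.println(osrMap.get(0x40));
        }
    }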
Specialized Code Generation • Prepend a specialized prologue to the original bytecode • The prologue will • save JVM scope descriptor values into local variables • push JVM scope descriptor values onto the stack • jump to the desired program counter • (a generation sketch follows the transition example below)
Transition Example

JVM Scope Descriptor:

    Running thread: MainThread
    Frame Pointer: 0xSomeAddress
    Program Counter: 16
    Local variables: L0(c) = 100; L1(y) = 1225; L2(i) = 50
    Stack Expressions: S0 = 50; S1 = 100

Original bytecode:

    0  iconst_0
    1  istore_1
    2  iconst_0
    3  istore_2
    4  goto 14
    7  iload_1
    8  iload_2
    9  iadd
    10 istore_1
    11 iinc 2 1
    14 iload_2
    15 iload_0
    16 if_icmplt 7
    19 iload_1
    20 ireturn

Specialized bytecode (prologue, then the original body):

    ldc 100
    istore_0
    ldc 1225
    istore_1
    ldc 50
    istore_2
    ldc 50
    ldc 100
    goto 16
    0  iconst_0
    ...
    16 if_icmplt 7
    ...
    20 ireturn
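Generating the specialized prologue is mechanical given the descriptor. Below is a minimal sketch that reproduces the prologue above, assuming int-typed values for brevity; a real implementation must handle all JVM types, the wide istore forms, and the pseudo-bytecodes described later for literals and addresses.

    import java.util.ArrayList;
    import java.util.List;

    public class PrologueSketch {
        static List<String> specializedPrologue(int[] locals, int[] stack, int targetBci) {
            List<String> code = new ArrayList<>();
            for (int i = 0; i < locals.length; i++) {  // restore local variables
                code.add("ldc " + locals[i]);
                code.add("istore_" + i);               // istore_n form shown only
            }
            for (int s : stack) {                      // rebuild the operand stack
                code.add("ldc " + s);
            }
            code.add("goto " + targetBci);             // resume at the suspended PC
            return code;
        }

        public static void main(String[] args) {
            // Values from the slide's sum() example; prints the prologue shown above.
            specializedPrologue(new int[] {100, 1225, 50}, new int[] {50, 100}, 16)
                .forEach(System.out::println);
        }
    }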
Transfer Execution to the New Code • Compile m2 as a normal method • The system unwinds the stack frame of m1 • Reschedule the thread to execute m2 • By construction, executing the specialized m2 sets up the target stack frame and continues execution
Recovering from Inlining • Suppose the optimizer inlines A -> B -> C • Extract one JVM scope descriptor per inlined scope: A, B, and C • Generate specialized methods A', B', and C' that together replace the single inlined frame with a separate frame for each scope and resume execution (diagram: the frame for A becomes frames for A', B', and C')
Inlining Example

Original methods:

    void foo() {
      bar();
    A: ...
    }

    void bar() {
      ...
    B: ...
    }

Suspend at B: in A -> B (bar inlined into foo). Wipe the stack down to foo's caller C and call foo_prime:

    foo_prime() {
      <specialized foo prologue>
      call bar_prime();
      goto A;
      ...
      bar();
    A: ...
    }

    bar_prime() {
      <specialized bar prologue>
      goto B;
      ...
    B: ...
    }
Implementation Details • Target compiler unmodified, except for ... • New pseudo-bytecodes • load literals (to avoid inserting new constants into the constant pool) • load an address/bytecode index (a JSR return address on the stack) • Fix bytecode indices for GC maps, exception tables, and line number tables
Pros and Cons • Advantages • mostly compiler-independent • avoids multiple entry points in compiled code • target compiler can exploit run-time constants • Disadvantage • must compile the target method twice (once for the transition, once for the next invocation)
Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion
Two OSR Applications • Promotion (see the paper for details): recompile a long-running activation • Deferred Compilation: don't compile uncommon paths; saves compile time

Example (deferred uncommon path):

    if (foo is currently final)
        x = 1;
    else
        trap/OSR;
    return x;
Deferred Compilation • What's "infrequent"? • static heuristics • profile data • The adaptive recompilation decision is modified to consider OSR factors
Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion
Online Experiments • Eager (default): no deferred compilation • OSR/static: deferred compilation for CHA-based inlining only • OSR/edge counts: deferred compilation with online profile data and CHA-based inlining
Adaptive System Performance: charts (arrows mark the "better" direction)
OSR Activities (SPECjvm98, size 100, first run)

    Benchmark   Promotions   Invalidations
    compress         3             6
    jess             0             0
    db               0             1
    javac            0            10
    mpegaudio        0             1
    mtrt             0             5
    jack             0             1
    total            3            24
Outline • Motivation • OSR Mechanism • Applications • Experimental Results • Conclusion
Summary • A new on-stack replacement mechanism • Online profile-directed deferred compilation • An evaluation of OSR applications in Jikes RVM
Conclusion • Should a VM implement OSR? • Can be done with minimal intrusion into the code base • Modest gains from deferred compilation • No benefit for class-hierarchy-based inlining • Debugging with dynamic de-optimization is valuable • TODO: more advanced speculative optimizations • The implementation is publicly available in Jikes RVM under the CPL (Linux/x86, Linux/PPC, and AIX/PPC): http://www-124.ibm.com/developerworks/oss/jikesrvm/
Jikes RVM Analytic Recompilation Model • Define: cur, the current optimization level for method m; Tj, the expected future execution time at level j; Cj, the compilation cost at optimization level j • Choose the j > cur that minimizes Tj + Cj • If Tj + Cj < Tcur, recompile at level j • Assumptions: the method will execute for twice its current duration; compilation cost and speedup are based on offline averages; sample data determines how long a method has executed
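The decision rule fits in a few lines of code. A sketch follows, with a hypothetical method name and invented per-level estimates standing in for the model's Tj and Cj values, which the real system derives from samples and offline averages.

    public class RecompileModel {
        // Return the level to recompile at, or -1 to keep the current code.
        static int chooseLevel(int cur, double[] T, double[] C) {
            int best = -1;
            double bestCost = T[cur];              // cost of doing nothing: Tcur
            for (int j = cur + 1; j < T.length; j++) {
                if (T[j] + C[j] < bestCost) {      // recompile only if it pays off
                    bestCost = T[j] + C[j];
                    best = j;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            double[] T = {10.0, 6.0, 5.0};  // expected future execution time per level
            double[] C = {0.0, 1.0, 6.0};   // compilation cost per level
            // Prints 1: T1 + C1 = 7 beats Tcur = 10, while T2 + C2 = 11 does not.
            System.out.println(chooseLevel(0, T, C));
        }
    }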
Jikes RVM OSR Promotion Model • Given: an outdated activation A of method m • Define: L, the last optimization level for any compiled version of m; cur, the current optimization level for activation A; Tcur, the expected future execution time of A at level cur; CL, the compilation cost for method m at optimization level L; TL, the expected future execution time of A at level L • If TL + CL < Tcur, specialize A at level L • Assumption: the outdated activation will execute for twice its current duration
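For illustration, with invented numbers: if the outdated activation A is expected to run Tcur = 10 more seconds at its current level, while TL = 4 and CL = 2, then TL + CL = 6 < 10 = Tcur, so the model promotes A to level L.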
Jikes RVM Recompilation Model, with Profile-Driven Deferred Compilation • Define: cur, the current optimization level for method m; Tj, the expected future execution time at level j; Cj, the compilation cost at level j; P, the percentage of code in m that profile data indicates was reached • Choose the j > cur that minimizes Tj + P*Cj • If Tj + P*Cj < Tcur, recompile at level j • Assumptions: the method will execute for twice its current duration; compilation cost and speedup are based on offline averages; sample data determines how long a method has executed
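To see how P changes the decision, again with invented numbers: suppose Tcur = 10, T2 = 5, and C2 = 6. The unmodified model declines to recompile (5 + 6 = 11 > 10), but if profile data shows only half of m's code was ever reached (P = 0.5), the modeled cost falls to 5 + 0.5*6 = 8 < 10, and m is recompiled at level 2 with the unreached blocks deferred.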
Offline Profile Experiments • Collect "perfect" profile data offline • Mark any block never reached as "uncommon" • Defer compilation of "uncommon" blocks • Four configurations • Ideal: deferred compilation trap keeps no state live • Ideal-OSR: deferred compilation trap is a valid OSR point • Static-OSR: no profile data; defer compilation for CHA-based inlining; trap is a valid OSR point • Eager (default): no deferred compilation
Compile Rate (Offline Profile): chart
Machine Code Size (Offline Profile): chart
Code Quality (Offline Profile): chart
OSR Challenges • Engineering Complexity • How to minimize disruption to VM code base? • How to constrain optimizations? • Policies for applying OSR • How to make rational decisions for applying OSR? • Effectiveness • How does OSR improve/constrain dataflow optimizations? • How effective are online OSR-based optimizations?
Recompilation Activities (first run)

                    With OSR               Without OSR
    Benchmark   O0   O1   O2  total    O0   O1   O2  total
    compress    17    7    2    26     13    9    6    28
    jess        49   20    1    70     39   17    4    60
    db           8    4    2    14      8    4    5    17
    javac      171   19    2   192    168   16    3   187
    mpegaudio   68   32    7   107     66   29    6   101
    mtrt        57   14    3    74     61   11    3    75
    jack        59   25    8    92     54   26    5    85
    total      429  121   25   575    409  112   32   553
Summary of Study (1) • Engineering complexity • How to minimize disruption to the VM code base? • Compiler-independent specialized source code (bytecode) manages the transition transparently • How to constrain optimizations? • Model OSR points like calls in standard transformations • Policies for applying OSR • How to make rational decisions for applying OSR? • Simple modifications to the cost-benefit analytic model
Summary of Study (2) • Effectiveness (for an implementation of online profile-directed deferred compilation) • How does OSR improve/constrain dataflow optimizations? • small ideal benefit from dataflow merges (0.5 - 2.2%) • negligible benefit when constraining optimization for potential invalidation • negligible benefit for just CHA-based inlining: patch points + splitting + pre-existence are good enough • How effective are online OSR-based optimizations? • average performance improvement of 2.6% on the first run of SPECjvm98 (size 100) • individual benchmarks range from +8% to -4% • negligible impact on steady-state performance (best of 10 iterations) • the adaptive recompilation model is relatively insensitive, compiling 4% more methods
Experimental Details • SPECjvm98, size 100 • Jikes RVM 2.1.1 • FastAdaptiveSemispace configuration • one virtual processor • 500MB heap • separate VM instance for each benchmark • IBM RS/6000 Model F80 • six 500 MHz PowerPC 630's • AIX 4.3.3 • 4 GB memory
Specialized Code Generation • Generate a specialized m2 that sets up the new stack frame and continues execution, preserving semantics • Express the transition to the new stack frame in source code (bytecode)
Deferred Compilation • Don't compile "infrequent" blocks

Without deferred compilation:

    if (foo is currently final)
        x = 1;
    else
        x = foo();
    return x;

With deferred compilation:

    if (foo is currently final)
        x = 1;
    else
        trap/OSR;
    return x;