240 likes | 442 Views
Optimizing Java Bytecode for Embedded Systems Stefan Hepp. Java Bytecode Optimization. Overview. Toolchain JOP, JIT vs. ahead-of-time compilation Existing open source tools JOPtimizer framework and code representations Inlining Results. Toolchain Overview.
E N D
Optimizing Java Bytecode for Embedded Systems Stefan Hepp Java Bytecode Optimization
Overview • Toolchain JOP, JIT vs. ahead-of-time compilation • Existing open source tools • JOPtimizer framework and code representations • Inlining • Results
Toolchain Overview • Sourcecode compiled withjavac to Java bytecode • Optimization defered to JVM, profiling information and JIT compiler is used • Not feasable on embedded processors like JOP
Toolchain Overview • Ahead-of-timeoptimization needed • Optimization of bytecode for target platform • Output is Java bytecode • Profiling vs. static WCET
Toolchain Overview • Advantages over JIT • Runtime is not critical • No warm-up phase to gather profiling information and to do JIT compiling • Disadvantages • Less accurate/no profiling information available at design-time, class hierarchy may change dynamically • Target platform must be known
Existing Tools • Soot framework looks promising, but not designed for embedded systems and very complex • Other open source tools usually only remove unused methods and obfuscate code
JOPtimizer • JOPtimizer: a new framework for optimizations • Intermediate code representations • Inlining which respects method size restrictions
Assumtions • Assumptions about embedded applications • No dynamic class loading or class modifications at runtime • Reflection is not used • All class files are available at compile-time (except native classes) • Allows more optimizations (but assumtions can be disables) • Exclude “library” code (like java.*) • Define: library classes must not extend/reference application classes
Java Class Files • A class consists of: • ConstantPool: indexed table of constants (numbers, Strings, class names, method names, signatures, .. ) • Classname, super-class, interfaces (references to CP) • Fields, methods: name, signature, flags • Method code as attribute of methods • Stack architecture with variable length encoding • Parsing and compiling of classfiles done by existing Libraries (BCEL, ASM, ...)
The JVM Instruction Set • (partially) typed stack instructions • 32bit (int, float, reference, byte, short, ..) and 64bit (long, double) variables • exception-handling, synchronization, subroutines • Stack- and variable table entries always 32bit • No indirect jumps, stack size must be static private test(I)V ICONST_2 NEWARRAY T_INT ASTORE 2 FCONST_2 FSTORE 3 ALOAD 2 ICONST_0 ILOAD 1 FLOAD 3 F2I private Map m; private void test(int i) { int j[] = new int[2]; float a = 2.0f; j[0] = i * (int) a; m.put(this, j); } IMUL IASTORE ALOAD 0 GETFIELD #4 ALOAD 0 ALOAD 2 INVOKEINTERFACE #7 POP RETURN
Stackcode Representation • Internal representation (“stackcode”) • Types and constant values as parameters of instructions to reduce number of different instructions (~40 stackcode instructions) • Stack emulation to determine operand types for all instructions (swap, dup, ..) • Variables and types instead of 32-bit slots • Constant values instead of references into CP • split basic blocks at exception handler ranges too • Still a stack architecture • Stackcode can be mapped directly to bytecode (allows analysis of code size and execution time)
Quadcode Representation • Stack creates implicit dependencies between instructions and blocks, makes optimizations more complex • Quadruple form of code (“quadcode”) • Create local variable per stack slot, emulate stack to determine the arguments of instructions • Instructions with types and constants as parameter • Instructions to manupulate stack not needed (pop) or replaced with copy instructions (load, swap, dup, ..)
Quadcode Representation • Quadcode representation enables simpler implementation of optimizations, but code cannot be mapped to bytecode directly • Stackcode and Quadcode similar to Soot internal representations (Baf, Jimple, Shimple) public int calc(int a, int b) { copy.ref s0, l0 // load.ref l0 // aload_0 getfield.'Test.fField' s0, s0 // getfield 'Test.fField' // getfield #3 copy.float s1, 2.0f // push.float 2.0f // fconst_2 binop.float.div s0, s0, s1 // binop.float.div // fdiv copy.float l3, s0 // store.float l3 // fstore_3 copy.int s0, l1 // load.int l1 // iload_1 return.int s0 // return.int // ireturn }
Creation of Bytecode • Transformation back from quadcode to bytecode • Create complete expressions from instructions (“Decompile” code), compile expression trees to JVM instructions like javac (Soot does this (Grimp)) • Create stack form of quadruple instructions, compile to bytecode (JOPtimizer does this, optional in Soot) • Per quadcode instruction: load parameters on stack, execute operation and store result back • load/store elimination and local variable allocation for stackcode needed before bytecode can be created • Decompilation method of Soot gets slightly better results
Inlining • Invocations are expensive on JOP • Inline methods to eliminate invokation overhead • Inlining is not always possible • Callee code restrictions • Code size and variable table size restrictions of JOP • Inlining comes at a price • Caller code size increases, makes caller cache miss more costly • Overall program size increases if callee is not removed (p.e. is called somewhere else)
Inlining methods • Traverse callgraph bottom-up (leaves first) • Find and devirtualize invocations • static, final, private invokations not virtual • Check class hierarchy for overloading methods • Check if inlining is admissible • Estimate gain • Replace invocation with copy of callee • insert nullpointer-check for callee class reference • map local variables of callee above caller variables
Inlining Checks • Inlining is not possible if • new code size or variable table size of caller exceeds platform limits • the callee uses exception handlers or synchronized code • throwing an exception clears the stack • stack of the caller needs to be saved and restored if an exception is handled within the inlined method (NYI in JOPtimizer) • the method or class is excluded from inlining by configuration (caller or callee, p.e. Native class)
Inlining Checks (cont.) class A public a() tmp = new C() invoke tmp.b() class B public b() if (v == null) invoke B.c() private c() class C extends B private c() • Check field- and method references in callee code • Must be accessible from caller • Else make field or method public if possible • Always possible for fields as they are not virtual in Java • All overloading methods must be made public too • If a private method is made public, all invocations have to be changed from invokespecial to invokevirtual (luckily only methods of callee class have to be searched) • Naming conflicts or dynamic class loading can prevent changes, thus preventing inlining
Inlining Checks (cont.) • Estimate gain of inlining • Depends on cache state • Possible degredation of performance if inlined method is seldom invoked • Calculate gain based on invocation frequency and cache state estimations • Decrease weight of callees with multiple call sites to reduce increase of application code size • Select method with highest (positive) weight for inlining • Add inlined invocations to inlining candidate list, repeat inlining (check with new codesize)
Benchmark Results • Inlining of stackcode, jbe @60Mhz • Inlining limited by maximal code size imposed by JOP's memory cache • Removing of unused code should be implemented
Inlining Improvements • Many improvements possible • Type analysis/callgraph thinning for better devirtualization • Better cache state and invocation frequency estimation(WCET-driven?) • Run optimizations to reduce code size prior to inlining • Allow inlining of synchronized code/exception handlers • Try to find invocations with highest gain application-wide • ...
Summary • Optimizing code at runtime not feasible for (realtime) embedded systems • Existing open source tools not designed for embedded systems • Inlining implemented in JOPtimizer which takes target platform into account (code restrictions, caching, ..), up to 14% speedup of JBE benchmark • load/store elimination and local variable allocation needed for further optimizations to be implemented • Still many improvements possible ..
Thanks for your attention! Questions? Q&A