Targeting Dynamic Compilation for Embedded Systems Michael Chen, Kunle Olukotun — Computer Systems Laboratory, Stanford University
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
Challenges of Running Java on Embedded Devices • J2ME (Micro Edition) on CDC (Connected Device Configuration) • PDAs, thin clients, and high-end cellphones • Highly resource constrained • 30MHz - 200MHz embedded processors • 2MB - 32MB RAM • < 4MB ROM • Differences from running Java on desktop machines • Satisfying performance requirements is difficult with slower processors • Virtual machine footprint matters • Limited dynamic memory available for the runtime system [Platform spectrum: J2ME/CLDC → J2ME/CDC → J2SE → J2EE, spanning embedded to server/desktop]
Java Execution Models • Interpretation • Decode and execute bytecodes in software • Incurs high performance penalty • Fast code generators • Dynamic compilation without aggressive optimization • Sacrifices code quality for compilation speed • Lazy compilation • Interpret bytecodes and translate code with optimizing compiler for frequently executed methods • Adds complexity and total ROM footprint of interpreter + compiler large • Alternative approach?
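The lazy-compilation model above can be sketched as a hotness counter that triggers the optimizing compiler once a method has been interpreted often enough. This is a minimal illustration, not microJIT's or any real VM's implementation; the class name, threshold, and method-key scheme are all assumed for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy compilation: interpret a method until its invocation
// count crosses a hotness threshold, then hand it to the optimizing
// compiler. All names and the threshold value are illustrative.
class LazyDispatcher {
    static final int HOT_THRESHOLD = 1000;          // assumed tuning knob
    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Boolean> compiled = new HashMap<>();

    /** Returns true once the method should run as compiled native code. */
    boolean onInvoke(String method) {
        if (compiled.getOrDefault(method, false)) return true;
        int c = counts.merge(method, 1, Integer::sum);
        if (c >= HOT_THRESHOLD) {
            compiled.put(method, true);             // would invoke the JIT here
            return true;
        }
        return false;                               // keep interpreting
    }
}
```

The downside the slide points out follows directly: both the interpreter loop and the compiler must ship in ROM, which is what microJIT's compile-everything configuration avoids.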
microJIT: An Efficient Optimizing Compiler • Minimize major compiler passes while optimizing aggressively • Perform several optimizations concurrently • Pipeline information from one pass to drive optimizations in subsequent passes • Budget overheads for dataflow analysis • Efficient implementations of straightforward optimizations • Use good heuristics for difficult optimizations • Manage compiler dynamic memory requirements • Efficient dataflow representation
Using microJIT in Embedded Systems • Configuration • Compile everything to native code • Potential advantages over other execution models • Lower total system cost • Multiple execution engines require more ROM • Reduced complexity • Only need to maintain one compiler • Doesn't sacrifice long or short running performance • Generates fast code while minimizing overheads
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
microJIT Compiler Overview [Pipeline diagram] • Pass 1 — CFG Construction: locals & field accesses, loop identification • Pass 2 — DFG Generation: register reservations, IR expression optimizations, IR expression use counts • Pass 3 — Native Code Generation: register allocator, instruction scheduler • Dataflow information flows forward from each pass to the next; ISA-dependent optimizations (assembler macros, instruction delays, machine idioms) plug into the passes
Pass 1: CFG Construction • Quickly scan bytecodes in one pass • Partially decode bytecodes to extract desired information • Decompose method into extended basic blocks (EBBs) • Build blocks and arcs as branches and targets are encountered • Compute block-level dataflow information • Identify loops • Record local and field accesses for blocks and loops
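The single-scan block decomposition above can be illustrated with a classic leader-finding pass: one walk over the bytecode that marks a block boundary at every branch target and at every instruction following a branch. This is a generic sketch with an assumed instruction shape, not microJIT's actual data structures.

```java
import java.util.List;
import java.util.TreeSet;

// One-pass basic-block boundary scan (illustrative): an instruction is a
// (pc, isBranch, target) triple; leaders are pcs where a block begins.
class CfgScan {
    record Insn(int pc, boolean isBranch, int target) {}

    static TreeSet<Integer> leaders(List<Insn> code) {
        TreeSet<Integer> leaders = new TreeSet<>();
        if (!code.isEmpty()) leaders.add(code.get(0).pc());
        for (int i = 0; i < code.size(); i++) {
            Insn in = code.get(i);
            if (in.isBranch()) {
                leaders.add(in.target());     // branch target starts a block
                if (i + 1 < code.size())
                    leaders.add(code.get(i + 1).pc());  // fall-through starts a block
            }
        }
        return leaders;
    }
}
```

A real pass-1 scan would additionally build the arcs, tag loop headers via back edges, and accumulate the per-block local/field access sets the slides mention.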
Pass 2: DFG Generation • Intermediate representation (IR) • Closer to machine instructions than bytecodes (LIR) • Triples representation – unnamed destination • Source arguments are pointers to other IR expression nodes • Complex bytecodes decompose into several IR expressions Example (computes -(1 + L0)): [1] const 1 [2] add [1] [L0] [3] neg [2]
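The triples idea can be shown as a tiny node class: an expression has no named destination, it *is* its node, and operands are references to other nodes. The field and factory names here are illustrative, not from the microJIT source.

```java
// Minimal triples-style IR node: unnamed destination, operand fields are
// pointers to other IR nodes. Shape is an assumption for illustration.
class IrNode {
    final String op;        // e.g. "const", "add", "neg", "local"
    final int imm;          // immediate payload for consts / local index
    final IrNode left, right;

    IrNode(String op, int imm, IrNode left, IrNode right) {
        this.op = op; this.imm = imm; this.left = left; this.right = right;
    }

    static IrNode cst(int v)              { return new IrNode("const", v, null, null); }
    static IrNode local(int idx)          { return new IrNode("local", idx, null, null); }
    static IrNode add(IrNode a, IrNode b) { return new IrNode("add", 0, a, b); }
    static IrNode neg(IrNode a)           { return new IrNode("neg", 0, a, null); }
}
```

Because operands are shared pointers, later passes can count uses per node, which is exactly what the pass-2 use-count tables in the code-generation example rely on.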
Block-local Optimizations (Pass 2: DFG Generation) Java source: L0.count++; Bytecode: 0 aload_0 · 1 dup · 2 getfield count · 4 iconst_1 · 5 iadd · 6 putfield count IR: [1] load @ [L0]+16 [2] const 1 [3] add [1] [2] [4] store [3] @ [L0]+16 • Maintain mimic stack when translating into IR expressions • Manipulate pointers in place of locals and stack accesses, which do not generate IR expressions • Immediately eliminates copy expressions • Optimizations immediately applied to newly created IR expressions • Check source arguments for constant propagation and algebraic simplifications • Search backwards in EBB for available matching expression (CSE)
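The mimic-stack idea above can be sketched directly: operand-stack slots hold pointers to IR nodes, so `dup` and loads move pointers without emitting IR, and constant operands can be folded at the moment an expression is created. Class and record names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of a mimic stack: stack slots are IR-node pointers; dup emits
// nothing, and add folds constants immediately. Node shape is assumed.
class MimicStack {
    record Node(String op, int value, Node a, Node b) {}

    final Deque<Node> stack = new ArrayDeque<>();
    final List<Node> emitted = new ArrayList<>();

    void pushConst(int v) { stack.push(new Node("const", v, null, null)); }
    void dup()            { stack.push(stack.peek()); }          // pointer copy, no IR

    void add() {
        Node b = stack.pop(), a = stack.pop();
        if (a.op().equals("const") && b.op().equals("const")) {  // fold on creation
            stack.push(new Node("const", a.value() + b.value(), null, null));
        } else {
            Node n = new Node("add", 0, a, b);
            emitted.add(n);                                      // real IR expression
            stack.push(n);
        }
    }
}
```

A fuller sketch would also walk backwards through the EBB's emitted list looking for a matching expression before creating a new one, which is the CSE step the slide describes.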
Global Optimizations (Pass 2: DFG Generation) • Global optimizations also immediately applied to newly created IR expressions • Global forward flow information available for every new IR expression • Blocks processed in reverse post-order (predecessors first) • Use loop field and locals access statistics from the previous pass to calculate a fixed-point solution at the loop header • Restricted to dataflow optimizations that rely primarily on forward flow information • Global constant propagation, copy propagation, and CSE [CFG diagram: blocks B1–B7 with a loop and its locals access table]
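Reverse post-order, the block order named above, is a DFS post-order over the CFG, reversed, so that (back edges aside) every block is visited after all its predecessors. A minimal sketch over an assumed adjacency-map CFG:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Reverse post-order block ordering: DFS post-order, then reverse.
// The adjacency-map CFG representation is illustrative.
class Rpo {
    static List<String> order(Map<String, List<String>> succ, String entry) {
        List<String> post = new ArrayList<>();
        dfs(succ, entry, new ArrayList<>(), post);
        Collections.reverse(post);               // predecessors now come first
        return post;
    }

    private static void dfs(Map<String, List<String>> succ, String b,
                            List<String> seen, List<String> post) {
        if (seen.contains(b)) return;
        seen.add(b);
        for (String s : succ.getOrDefault(b, List.of())) dfs(succ, s, seen, post);
        post.add(b);
    }
}
```

Visiting predecessors first is what makes forward flow facts (constants, copies, available expressions) valid the moment a block is translated; only loop headers need the separate fixed-point treatment the slide mentions.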
Loop Invariant Code Motion (Pass 2: DFG Generation) • Check loop statistics to make sure source arguments are not redefined in the loop • Can perform code motion on dependent instructions without iterating • Hoisted IR expressions immediately communicated to successive instructions and blocks in the loop [Diagram: preheader PH receives hoisted expressions [1] → [G0] and [3] → [G1]; loop header H contains [1] add [L0] [L1], [2] const 1, [3] sub [1] [2]; E marks the loop exit; the loop locals access table is consulted]
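The hoisting test the first bullet describes reduces to a set lookup once pass 1 has built the per-loop locals-access table: an expression is invariant if none of the locals it reads are redefined anywhere in the loop. A minimal sketch, with the table modeled as a plain set:

```java
import java.util.List;
import java.util.Set;

// Loop-invariance check against the pass-1 locals-access table
// (modeled here as a set of locals the loop redefines).
class Licm {
    static boolean invariant(List<Integer> localsRead, Set<Integer> loopWrites) {
        for (int l : localsRead)
            if (loopWrites.contains(l)) return false;  // operand redefined in loop
        return true;                                   // safe to hoist to preheader
    }
}
```

Because hoisted nodes are communicated forward immediately, an expression whose operands were themselves just hoisted also passes this test on the same visit, which is why no iteration is needed.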
Inlining (Pass 2: DFG Generation) • Optimized for small methods • Handles nested inlining • Important for object initializers with deep subclassing • Can inline non-final public virtual and interface methods with only one target found at runtime • Protected with a class check
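The class check guarding a speculative inline can be sketched as follows: if the receiver's class matches the single target observed so far, run the inlined body; otherwise fall back to a genuine virtual dispatch. The types and method here are invented for illustration.

```java
// Sketch of a guarded speculative inline of a virtual call with one
// observed target. All names are illustrative.
class GuardedInline {
    interface Shape { int area(); }
    static class Square implements Shape {
        final int s; Square(int s) { this.s = s; }
        public int area() { return s * s; }
    }

    static int areaCall(Shape r) {
        if (r.getClass() == Square.class) {   // the class check (guard)
            Square sq = (Square) r;
            return sq.s * sq.s;               // inlined body of Square.area
        }
        return r.area();                      // slow path: real virtual dispatch
    }
}
```

This is the same mechanism the later slide on performance factors credits Sun-client and Sun-server with, paired there with decompilation when class loading adds a second target.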
Pass 3: Code Generation • Registers allocated dynamically as code is generated • Instruction scheduling within a basic block • Use standard list scheduling techniques • Fills load and branch delay slots • Successfully ported to three different ISAs • MIPS, SPARC, StrongARM • Ports took only a few weeks to implement • Plans to port to x86
Fast Optimization of Machine Idioms (Pass 3: Code Generation) • Traditionally done using a peephole optimizer • Requires an additional pass over generated code • Compiler features allow optimization of machine idioms without an additional pass • Machine-specific code can be invoked from the earlier two passes • Configurable IR expressions • Deferred code generation of IR expressions • Optimized machine idioms • Register calling conventions • Mapping branch implementations • Immediate operands • Different addressing modes
Code Generation Example (Pass 3: Code Generation) IR expressions from DFG generation: [1] load @ [L0]+16 [2] const 5 [3] const &newarray [4] call [3] ([2] [1]) → [L1] [5] const 1 [6] add [1] [5] [7] store [6] @ [L0]+16 Generated SPARC code: ldw [%l0+16],%o1 · mov 5,%o0 · mov %o1,%l1 · call newarray · add %l1,1,%g1 · stw %g1,[%l0+16] Register conventions: %ln – call-preserved reg · %on – argument reg · %gn – temp reg [Table residue: per-expression {block, global} use counts, last-use and immediate-operand flags, and register assignments, updated as each instruction is emitted]
Global Register Allocation (Pass 3: Code Generation) [CFG diagram: blocks B0–B5 with join points J0–J2; registers holding values live out of a block (e.g. Out of B0) are reserved as incoming registers of its successors (In of B1, B2), shown as "Reserve outgoing registers" at B4 and B5]
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
Experiment Setup • SPARC VMs chosen for comparison • Large number of VMs with source code available • Required for timing and memory use instrumentation • Neutral RISC ISA • No embedded JITs available for comparison • Variety of benchmarks chosen • Benchmark suites – SPECjvm98, Java Grande, jBYTEmark • Other significant applications – MipsSimulator, h263 Decoder, jLex, jpeg2000
Compilation Speed • 30% faster than Sun-client • 2.5x faster than the nearest dataflow compiler (LaTTe) (Measured on an UltraSPARC II @ 200MHz, Sun Solaris 8)
Time spent in each compiler pass • CFG construction consistently < 10% of compile time • DFG generation grows in proportion for large methods • Can improve code generation time for large methods • Limit optimizations with costs that grow with method size • CSE time grows with increasing code size
Performance on Long Running Benchmarks • Compilation time is proportionally small relative to execution time • Collected times also include the Sun interpreter • Good performance for numerical programs • Performance suffers on object-oriented code (Chart: speedup normalized to microJIT)
Performance on Short Running Benchmarks • Compilation time is proportionally large relative to execution time • A fast optimizing compiler can compete against lazy compilation on total run time (Chart: speedup normalized to microJIT)
Factors limiting microJIT performance • Sun-client and Sun-server support speculative inlining • Inline non-final public virtual and interface calls that only have one target • Decompile and fix if class loading adds new targets • Garbage collection overheads are higher for our system • Impacted object-oriented programs
Dynamic Memory Usage • microJIT compiler requires 2x the memory of Sun-client, but less than ¼ that of the dataflow compilers • 250KB sufficient to compile a 1KB method • Can reduce memory requirements for compiling large methods by building the DFG and generating code for only subsections of the CFG per pass • 300KB native code buffer sufficient for the largest benchmark applications (pizza compiler and jpeg2000)
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
Conclusions • Proposed Java dynamic compilation scheme for embedded devices • Compile all code • Fast compiler which performs aggressive optimizations • Results show potential of this approach • Small dynamic and static memory footprint • Good compilation speed and generated code performance • Possible improvements • Memory usage and compilation performance on large methods • Implement additional optimizations • Aggressive array bounds check removal from loops