Targeting Dynamic Compilation for Embedded Systems • Michael Chen, Kunle Olukotun • Computer Systems Laboratory, Stanford University
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
Challenges of Running Java on Embedded Devices • J2ME (micro edition) on CDC (connected device configuration) • PDAs, thin clients, and high-end cellphones • Highly resource constrained • 30MHz - 200MHz embedded processors • 2MB - 32MB RAM • < 4MB ROM • Differences from running Java on desktop machines • Satisfying performance requirements is difficult with slower processors • Virtual machine footprint matters • Limited dynamic memory available for runtime system
[Figure: Java platform editions J2ME/CLDC, J2ME/CDC, J2SE, J2EE, labeled Embedded, Server, Desktop]
Java Execution Models • Interpretation • Decode and execute bytecodes in software • Incurs high performance penalty • Fast code generators • Dynamic compilation without aggressive optimization • Sacrifices code quality for compilation speed • Lazy compilation • Interpret bytecodes and translate frequently executed methods with an optimizing compiler • Adds complexity, and the total ROM footprint of interpreter + compiler is large • Alternative approach?
microJIT: An Efficient Optimizing Compiler • Minimize major compiler passes while optimizing aggressively • Perform several optimizations concurrently • Pipeline information from one pass to drive optimizations in subsequent passes • Budget overheads for dataflow analysis • Efficient implementations of straightforward optimizations • Use good heuristics for difficult optimizations • Manage compiler dynamic memory requirements • Efficient dataflow representation
Using microJIT in Embedded Systems • Configuration • Compile everything to native code • Potential advantages over other execution models • Lower total system cost • Multiple execution engines require more ROM • Reduced complexity • Only need to maintain one compiler • Doesn't sacrifice long or short running performance • Generates fast code while minimizing overheads
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
microJIT Compiler Overview
[Figure: three compiler passes, CFG Construction → DFG Generation → Native Code Generation, annotated with the labels below]
Dataflow Information: Locals & field accesses • Loop identification • Register reservations • IR expression optimizations • IR expression use counts
ISA Dependent Optimizations: Assembler macros • Instruction delays • Register allocator • Machine idioms • Instruction scheduler
Pass 1: CFG Construction • Quickly scan bytecodes in one pass • Partially decode bytecodes to extract desired information • Decompose method into extended basic blocks (EBBs) • Build blocks and arcs as branches and targets are encountered • Compute block-level dataflow information • Identify loops • Record local and field accesses for blocks and loops
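The block-splitting step above can be sketched roughly as follows. This is an illustration only, not microJIT's code: the toy instruction tuples, the opcode sets, and the name `build_ebbs` are all assumptions, instruction indices stand in for bytecode offsets, and the sketch uses two short scans where the slide describes a single one. It treats a backward arc as a loop, which is a simplification.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy instruction: (opcode, branch target index or None). Hypothetical encoding.
Insn = Tuple[str, Optional[int]]

COND_BRANCHES = {"ifeq", "ifne", "if_icmplt"}
UNCOND_EXITS = {"goto", "return"}


def build_ebbs(code: List[Insn]):
    """Split code into extended basic blocks (EBBs) and collect arcs."""
    # An EBB has one entry and possibly several exits, so a new block starts
    # at index 0, at every branch target, and after an unconditional exit.
    # A conditional branch does NOT end the block: fall-through stays inside.
    leaders: Set[int] = {0}
    for i, (op, target) in enumerate(code):
        if op in COND_BRANCHES or op == "goto":
            leaders.add(target)
        if op in UNCOND_EXITS and i + 1 < len(code):
            leaders.add(i + 1)

    starts = sorted(leaders)
    block_of: Dict[int, int] = {}
    for n, start in enumerate(starts):
        end = starts[n + 1] if n + 1 < len(starts) else len(code)
        for i in range(start, end):
            block_of[i] = start

    # Arcs are (source block start, target block start).
    arcs: Set[Tuple[int, int]] = set()
    for i, (op, target) in enumerate(code):
        src = block_of[i]
        if op in COND_BRANCHES or op == "goto":
            arcs.add((src, target))
        falls_through = op not in UNCOND_EXITS and i + 1 < len(code)
        if falls_through and block_of[i + 1] != src:
            arcs.add((src, i + 1))

    # Simplification: any arc that goes backwards marks its target as a loop header.
    loop_headers = {dst for (src, dst) in arcs if dst <= src}
    return starts, arcs, loop_headers


code = [
    ("iconst", None),  # 0
    ("iload", None),   # 1  target of the goto at index 5
    ("ifeq", 6),       # 2  conditional exit, block continues
    ("iinc", None),    # 3
    ("nop", None),     # 4
    ("goto", 1),       # 5
    ("return", None),  # 6
]
starts, arcs, loop_headers = build_ebbs(code)
print(starts, sorted(arcs), loop_headers)
```

Because the conditional branch at index 2 does not end its block, the loop body stays in one EBB with two exits.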
Pass 2: DFG Generation • Intermediate representation (IR) • Closer to machine instructions than bytecodes (LIR) • Triples representation – unnamed destination • Source arguments are pointers to other IR expression nodes • Complex bytecodes decompose into several IR expressions
Example IR:
  [L0]
  [1]   const 1
  [2]   add [1] [L0]
  [3]   neg [2]
Block-local Optimizations (Pass 2: DFG Generation) • Maintain mimic stack when translating into IR expressions • Locals and stack accesses are handled by manipulating pointers on the mimic stack, so they generate no IR expressions • Immediately eliminates copy expressions • Optimizations immediately applied to newly created IR expressions • Check source arguments for constant propagation and algebraic simplifications • Search backwards in EBB for available matching expression (CSE)
Java source: L0.count++;
  bpc  bytecode
  0    aload_0
  1    dup
  2    getfield count
  4    iconst_1
  5    iadd
  6    putfield count
  id    IR expression
  [L0]
  [1]   load @ [L0]+16
  [2]   const 1
  [3]   add [1] [2]
  [4]   store [3] @ [L0]+16
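A minimal sketch of the mimic-stack translation for the `L0.count++` example, assuming a toy bytecode encoding. The `Translator` class, its method names, and the tuple layout of the triples are hypothetical; the field offset 16 is taken from the slide. Only constants and adds are treated as candidates for CSE here, since reusing a load would also require checking for intervening stores.

```python
from typing import Dict, List, Tuple, Union

Ref = Union[int, str]  # IR node id (1-based) or a local name such as "L0"
PURE = {"const", "add"}  # ops this sketch is willing to fold or reuse


class Translator:
    def __init__(self, field_offsets: Dict[str, int]):
        self.ir: List[Tuple] = []    # ir[i] holds the triple for node id i + 1
        self.stack: List[Ref] = []   # mimic stack of pointers, not values
        self.field_offsets = field_offsets

    def node(self, ref: Ref):
        return self.ir[ref - 1] if isinstance(ref, int) else None

    def emit(self, op: str, *args) -> int:
        if op == "add":
            a, b = self.node(args[0]), self.node(args[1])
            # Constant propagation: fold an add of two constants immediately.
            if a and b and a[0] == "const" and b[0] == "const":
                return self.emit("const", a[1] + b[1])
        triple = (op, *args)
        if op in PURE:
            # CSE: search backwards for an identical available expression.
            for idx in range(len(self.ir) - 1, -1, -1):
                if self.ir[idx] == triple:
                    return idx + 1
        self.ir.append(triple)
        return len(self.ir)

    def translate(self, bytecodes):
        for op, arg in bytecodes:
            if op == "aload":
                self.stack.append(f"L{arg}")         # pointer push, no IR
            elif op == "dup":
                self.stack.append(self.stack[-1])    # copy eliminated, no IR
            elif op == "iconst":
                self.stack.append(self.emit("const", arg))
            elif op == "getfield":
                obj = self.stack.pop()
                self.stack.append(self.emit("load", obj, self.field_offsets[arg]))
            elif op == "iadd":
                b = self.stack.pop()
                a = self.stack.pop()
                self.stack.append(self.emit("add", a, b))
            elif op == "putfield":
                val = self.stack.pop()
                obj = self.stack.pop()
                self.emit("store", val, obj, self.field_offsets[arg])
            else:
                raise ValueError(f"unsupported opcode: {op}")
        return self.ir


increment = [("aload", 0), ("dup", None), ("getfield", "count"),
             ("iconst", 1), ("iadd", None), ("putfield", "count")]
ir = Translator({"count": 16}).translate(increment)

folded = Translator({})
folded.translate([("iconst", 2), ("iconst", 3), ("iadd", None)])
print(ir, folded.ir)
```

Six bytecodes become four IR expressions: `aload_0` and `dup` only move pointers on the mimic stack, and the store takes the add result, node 3, as its value.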
Global Optimizations (Pass 2: DFG Generation) • Global optimizations also immediately applied to newly created IR expressions • Global forward flow information available for every new IR expression • Blocks processed in reverse post-order (predecessors first) • Use loop field and locals access statistics from previous pass to calculate fixed point solution at loop header • Restricted to dataflow optimizations that rely primarily on forward flow information • Global constant propagation, copy propagation, and CSE
[Figure: control flow graph with blocks B1 to B7 and a loop locals access table]
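The traversal order and the loop-header shortcut might look like the sketch below. The CFG edges are invented for illustration (the slide's figure does not survive in this transcript), and `facts_at_loop_header` is one possible reading of how the loop locals access table yields a fixed point without iterating: any fact about a local written somewhere in the loop is dropped at the header.

```python
from typing import Dict, List, Set


def reverse_postorder(succs: Dict[str, List[str]], entry: str) -> List[str]:
    """Order blocks so that predecessors come first (ignoring back arcs)."""
    seen: Set[str] = set()
    post: List[str] = []

    def visit(block: str) -> None:
        seen.add(block)
        for succ in succs.get(block, ()):
            if succ not in seen:
                visit(succ)
        post.append(block)

    visit(entry)
    return post[::-1]


def facts_at_loop_header(incoming: Dict[str, int],
                         written_in_loop: Set[str]) -> Dict[str, int]:
    """Keep only constant facts about locals that the loop never writes."""
    return {local: value for local, value in incoming.items()
            if local not in written_in_loop}


# Hypothetical CFG: B2 heads a loop closed by the back arc B5 -> B2.
cfg = {
    "B1": ["B2"],
    "B2": ["B3", "B4"],
    "B3": ["B5"],
    "B4": ["B5"],
    "B5": ["B2", "B6"],
}
order = reverse_postorder(cfg, "B1")
header_facts = facts_at_loop_header({"L0": 4, "L1": 7}, written_in_loop={"L1"})
print(order, header_facts)
```

With this order, every block except a loop header has all of its predecessors translated before it, which is what makes forward flow information available when each new IR expression is created.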
Loop Invariant Code Motion (Pass 2: DFG Generation) • Check loop statistics to make sure source arguments are not redefined in loop • Can perform code motion on dependent instructions without iterating • Hoisted IR expressions immediately communicated to successive instructions and blocks in loop
[Figure: loop with blocks PH, H, and E, and a loop locals access table]
  PH:  [1] → [G0]    [3] → [G1]
  H:   [1]   add [L0] [L1]
       [2]   const 1
       [3]   sub [1] [2]
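A rough sketch of single-pass hoisting, assuming IR nodes are tuples of `(id, op, args...)` and that the loop locals access table is reduced to a set of locals written in the loop. The function name and data layout are hypothetical. Unlike the slide, which shows only nodes [1] and [3] receiving global temporaries, this sketch moves constants as well to keep the hoisted list self-contained.

```python
from typing import List, Set, Tuple

PURE_OPS = {"add", "sub", "neg"}


def hoist_invariants(body: List[Tuple], written_locals: Set[str]):
    """Split a loop body into (hoisted, kept) lists in one forward pass."""
    invariant: Set[int] = set()
    hoisted: List[Tuple] = []
    kept: List[Tuple] = []
    for node in body:
        node_id, op, *args = node
        if op == "const":
            is_invariant = True
        elif op in PURE_OPS:
            # A local is invariant if the loop never writes it; an IR node is
            # invariant if it was already hoisted earlier in this same pass.
            is_invariant = all(
                (a not in written_locals) if isinstance(a, str) else (a in invariant)
                for a in args
            )
        else:
            is_invariant = False  # loads, stores, calls stay in the loop here
        if is_invariant:
            invariant.add(node_id)
            hoisted.append(node)
        else:
            kept.append(node)
    return hoisted, kept


loop_body = [
    (1, "add", "L0", "L1"),
    (2, "const", 1),
    (3, "sub", 1, 2),
    (4, "add", 3, "L2"),   # extra node, not on the slide: uses a written local
]
hoisted, kept = hoist_invariants(loop_body, written_locals={"L2"})
hoisted_b, kept_b = hoist_invariants(loop_body, written_locals={"L1", "L2"})
print(hoisted, kept)
```

Node 3 is hoisted in the same pass as node 1 because node 1 is already recorded as invariant when node 3 is reached, which is why no iteration is needed for dependent instructions.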
Inlining (Pass 2: DFG Generation) • Optimized for small methods • Handles nested inlining • Important for object initializers with deep sub-classing • Can inline non-final public virtual and interface methods with only one target found at runtime • Protected with a class check
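The class-check guard can be illustrated at the semantic level as below. This models the behavior of a guarded inlined call rather than any generated machine code; the `Shape`, `Square`, and `Circle` classes and the `make_guarded_call` helper are invented for the example.

```python
class Shape:
    def area(self):
        return 0


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Circle(Shape):  # stands in for a class loaded after compilation
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # crude integer approximation


def make_guarded_call(expected_cls, inlined_body, method_name, taken):
    """Build a call site that inlines the single target seen so far."""
    def call(receiver):
        if type(receiver) is expected_cls:        # the class check
            taken.append("inlined")
            return inlined_body(receiver)
        taken.append("dispatched")                # guard failed: normal call
        return getattr(receiver, method_name)()
    return call


taken = []
# Assume Square.area was the only target observed when the caller was compiled.
area_site = make_guarded_call(Square, lambda sq: sq.side * sq.side, "area", taken)
results = [area_site(Square(3)), area_site(Circle(2))]
print(results, taken)
```

The guard keeps the inlined body correct even if another receiver class appears later, at the cost of one comparison on every call.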
Pass 3: Code Generation • Registers allocated dynamically as code is generated • Instruction scheduling within a basic block • Use standard list scheduling techniques • Fills load and branch delay slots • Successfully ported to three different ISAs • MIPS, SPARC, StrongARM • Ports took only a few weeks to implement • Plans to port to x86
Fast Optimization of Machine Idioms (Pass 3: Code Generation) • Traditionally done using a peephole optimizer • Requires additional pass over generated code • Compiler features allow optimization of machine idioms without additional pass • Machine specific code can be invoked in two passes • Configurable IR expressions • Deferred code generation of IR expressions • Optimized machine idioms • Register calling conventions • Mapping branch implementations • Immediate operands • Different addressing modes
Code Generation Example (Pass 3: Code Generation)
Register conventions: %ln – call preserved reg • %on – argument reg • %gn – temp reg

DFG generation:
  id    IR expression                 {blk,glb} uses   last use
  [L0]                                {2,0}            [7]
  [1]   load @ [L0]+16                {2,0}            [6]
  [2]   const 5                       {1,0}            [4]
  [3]   const &newarray               {1,0}            [4]
  [4]   call [3] ([2] [1]) → [L1]     {0,1}
  [5]   const 1                       {1,0}            [6]
  [6]   add [1] [5]                   {1,0}            [7]
  [7]   store [6] @ [L0]+16           {0,0}
  flags (as listed on the slide): %o1 %o0 %o0 imm

Remaining {blk,glb} uses as code generation proceeds (one row per snapshot; columns are [L0] [1] [2] [3] [4] [5] [6] [7]):
  {2,0} {2,0} {1,0} {1,0} {0,1} {1,0} {1,0} {0,0}
  {1,0} {2,0} {1,0} {1,0} {0,1} {1,0} {1,0} {0,0}
  {1,0} {1,0} {0,0} {0,0} {0,1} {1,0} {1,0} {0,0}
  {1,0} {0,0} {0,0} {0,0} {0,1} {0,0} {1,0} {0,0}
  {0,0} {0,0} {0,0} {0,0} {0,1} {0,0} {0,0} {0,0}

Code generation:
  reg alloc   generated code
  N %l0
  N %o1       ldw [%l0+16],%o1
  N %o0       mov 5, %o0
  N %l1       mov %o1,%l1
  F %o0       call newarray
  F %o1
  N %g1       add %l1,1,%g1
  F %l1
  F %g1       stw %g1,[%l0+16]
  F %l0
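A much-reduced sketch of allocating registers while code is generated, driven by use counts. It handles only the simpler `L0.count++` IR (load, const, add, store), leaves out calls, register reservations, and global uses, and the function names and tuple layout are assumptions. The `ldw`/`stw` mnemonics follow the slide's notation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def operands(op: str, args: list) -> list:
    """Return the arguments that name values (locals or IR node ids)."""
    if op == "const":
        return []
    if op == "load":
        return [args[0]]
    if op == "add":
        return list(args)
    if op == "store":
        return [args[0], args[1]]
    raise ValueError(f"unsupported op: {op}")


def generate(ir: List[Tuple], local_regs: Dict[str, str], temps: List[str]) -> List[str]:
    # Use counts per IR node, as would be gathered during DFG generation.
    uses = Counter(a for _, op, *args in ir
                   for a in operands(op, args) if isinstance(a, int))
    free = sorted(temps)
    where: Dict[int, str] = {}   # node id -> register name or immediate text
    code: List[str] = []

    def read(ref):
        if isinstance(ref, str):
            return local_regs[ref]
        loc = where[ref]
        uses[ref] -= 1
        if uses[ref] == 0 and loc in temps:   # last use: free the register
            free.append(loc)
            free.sort()
        return loc

    for node_id, op, *args in ir:
        if op == "const":
            # Deferred code generation: emit nothing, let the user of this
            # node consume the value as an immediate operand.
            where[node_id] = str(args[0])
        elif op == "load":
            base = read(args[0])
            dst = where[node_id] = free.pop(0)
            code.append(f"ldw [{base}+{args[1]}],{dst}")
        elif op == "add":
            a, b = read(args[0]), read(args[1])
            dst = where[node_id] = free.pop(0)
            code.append(f"add {a},{b},{dst}")
        elif op == "store":
            val, base = read(args[0]), read(args[1])
            code.append(f"stw {val},[{base}+{args[2]}]")
        else:
            raise ValueError(f"unsupported op: {op}")
    return code


ir = [(1, "load", "L0", 16), (2, "const", 1), (3, "add", 1, 2), (4, "store", 3, "L0", 16)]
asm = generate(ir, local_regs={"L0": "%l0"}, temps=["%g1", "%g2"])
print("\n".join(asm))
```

Freeing a source register at its last use before allocating the destination lets the add reuse `%g1`, and the constant never occupies a register at all.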
Global Register Allocation (Pass 3: Code Generation)
[Figure: control flow graph with blocks B0 to B5 and join points J0 to J2, annotated "Reserve outgoing registers"]
  J0:  Out – B0        In – B1, B2
  J1:  Out – B1, B3    In – B3, B4
  J2:  Out – B2, B4    In – B5
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
Experiment Setup • SPARC VMs chosen for comparison • Large number of VMs with source code available • Required for timing and memory use instrumentation • Neutral RISC ISA • No embedded JITs available for comparison • Variety of benchmarks chosen • Benchmark suites – SPECjvm98, Java Grande, jBYTEmark • Other significant applications – MipsSimulator, h263 Decoder, jLex, jpeg2000
Compilation Speed • 30% faster than Sun-client • 2.5x faster than nearest dataflow compiler (LaTTe) • Test platform: UltraSparcII @ 200MHz, Sun Solaris 8
Time spent in each compiler pass • CFG construction consistently < 10% of compile time • DFG generation grows in proportion for large methods • Can improve code generation time for large methods • Limit optimizations with costs that grow with method size • CSE time grows with increasing code size
Performance on Long Running Benchmarks • Compilation to execution time proportionally smaller • Collected times also include Sun interpreter • Good performance for numerical programs • Performance suffers on object-oriented code
[Chart: speedup normalized to microJIT]
Performance on Short Running Benchmarks • Compilation to execution time proportionally larger • Fast optimizing compiler can compete against lazy compilation on total run time
[Chart: speedup normalized to microJIT]
Factors limiting microJIT performance • Sun-client and Sun-server support speculative inlining • Inline non-final public virtual and interface calls that only have one target • Decompile and fix if class loading adds new targets • Garbage collection overheads are higher for our system • Impacted object-oriented programs
Dynamic Memory Usage • microJIT compiler requires 2x memory of Sun-client, but less than ¼ of dataflow compilers • 250KB sufficient to compile 1KB method • Can reduce memory requirements for compilation of large methods by building the DFG and generating code for only subsections of the CFG per pass • 300KB native code buffer sufficient for largest benchmark applications (pizza compiler and jpeg2000)
Outline • Motivating Problem • Compiler Design • Performance Results • Conclusions
Conclusions • Proposed Java dynamic compilation scheme for embedded devices • Compile all code • Fast compiler which performs aggressive optimizations • Results show potential of this approach • Small dynamic and static memory footprint • Good compilation speed and generated code performance • Possible improvements • Memory usage and compilation performance on large methods • Implement additional optimizations • Aggressive array bounds check removal from loops