Cross-Architectural Performance Portability of a Java Virtual Machine Implementation
Matthias Jacob, Princeton University
Keith Randall, Google, Inc.
JVM architecture
• Java bytecode is executed by the JVM either through the interpreter or via the JIT compiler, which produces native code that runs on the CPU.
Compaq FastVM
• State-of-the-art implementation of the JVM on Alpha
• True 64-bit implementation
• Efficient optimization mechanisms
• Not feedback-based (unlike HotSpot)
• Can we port the code generator to x86 and preserve the performance?
Differences Alpha – x86
• Reduced number of registers
  • 8 registers on x86 versus 31 on Alpha
• Instructions contain multiple operations
  • A single x86 instruction comprises several Alpha instructions
• Different addressing modes
  • Arithmetic x86 instructions operate on memory directly
• Non-orthogonality of the instruction set
  • Different registers require different instructions
• Source registers get overwritten
  • Operand registers are used to store results on x86
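To make the contrast concrete, here is a minimal sketch, not taken from the talk: the same array increment folds into a single memory-operand instruction on x86, while Alpha needs an explicit address computation plus separate load, add, and store. The instruction sequences in the comments are illustrative hand-written encodings, not FastVM output.

    /* Illustrative only: comments show plausible code for a[i] += 1. */
    void bump(int *a, int i)
    {
        a[i] += 1;
        /* x86 (one instruction, memory operand):
         *   addl $1, (%eax,%ecx,4)
         * Alpha (address computation, load, add, store):
         *   s4addq $17, $16, $1
         *   ldl    $2, 0($1)
         *   addl   $2, 1, $2
         *   stl    $2, 0($1)
         */
    }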
Outline
• Modified Optimizations for x86
  • Register Allocation
  • Instruction Selection
  • Instruction Patching
  • Method Inlining
• New Optimizations for x86
  • Calling Convention
  • Floating-Point Modes
• Results
• Conclusion
Register Allocation for JIT
• Traditional optimal register allocation too expensive
  • Graph coloring
• Use heuristics
  • LMAP structure
Register Allocation
• Java entities: local variables Lx and Java stack locations S(y)
• Assign every Java entity a home location H
• Temporary locations T for intermediate results
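The slides do not show the LMAP structure itself; the following C sketch is an assumption about its shape. It only illustrates the idea that every Java entity, local variable Lx or stack location S(y), gets a home location H that is either a register or a frame slot, with a few registers reserved as temporary locations T.

    #include <stdint.h>

    /* Sketch, not the FastVM source: map each Java entity to a home
     * location that is either a machine register or a frame offset. */
    typedef enum { HOME_REGISTER, HOME_FRAME_SLOT } home_kind;

    typedef struct {
        home_kind kind;
        int       reg;           /* register number if HOME_REGISTER    */
        int32_t   frame_offset;  /* offset from %esp if HOME_FRAME_SLOT */
    } home_location;

    typedef struct {
        home_location *locals;   /* one entry per local variable Lx      */
        home_location *stack;    /* one entry per Java stack slot S(y)   */
        int            temps[2]; /* registers reserved for temporaries T */
        int            num_locals;
        int            num_stack;
    } lmap;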
Register Allocation
• Limited number of registers
  • Flexible partitioning between H- and T-registers
  • No dedicated registers
  • Thread-local pointer kept in a segment register
Register Allocation
• Instructions limited to certain registers
  • Allocate only a subset of registers
Register Allocation
• Memory locations as arguments
  • Pick a different addressing mode instead of allocating a register
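A minimal sketch of that decision, with made-up helper names: if the x86 instruction form accepts a memory operand and the value already lives in its memory home, the allocator can skip giving it a register.

    /* Hypothetical predicate (names are assumptions, not FastVM API):
     * only allocate a register when the instruction cannot take the
     * operand directly from its memory home location. */
    static int needs_register(int insn_accepts_mem_operand,
                              int value_in_memory_home)
    {
        return !(insn_accepts_mem_operand && value_in_memory_home);
    }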
Instruction Selection
• Alpha/RISC:
  • ALU operations
  • Memory operations
  • Control operations
• x86/CISC:
  • A single instruction can combine ALU, memory, and control operations
  • Different addressing modes
  • Limited set of registers per instruction
  • 64-bit operations must be emulated
  • Floating-point stack
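For example, a Java long add has no single 32-bit x86 instruction. The sketch below is an illustration in C rather than the FastVM code generator: it splits the operation into a low-word add plus a high-word add that absorbs the carry, mirroring the add/adc pair the JIT would emit.

    #include <stdint.h>

    /* Emulating a 64-bit add with 32-bit halves, as a 32-bit x86 JIT must. */
    typedef struct { uint32_t lo; uint32_t hi; } long64;

    static long64 emulated_ladd(long64 a, long64 b)
    {
        long64 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low word */
        return r;
    }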
Instruction Patching
• Patching instructions
  • Class initializers
  • Fix up branches
  • Copying registers
  • Method inlining
• Needs to be atomic because of concurrency
• Alpha: every instruction is 4 bytes
  • A single write instruction is sufficient
Instruction Patching on x86
• Different instruction lengths
  • Patch instructions atomically using compare-and-exchange
  • Pad with NOPs
• Difficult to walk back in the code to rename registers (as done on Alpha)
  • Input registers are often output registers
  • Renaming output registers alone is not sufficient
• Retargeting by a forward-looking heuristic
  • Look for the nearest future use of a preferred register
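A sketch of the atomic patch step, under assumptions since the real patcher is not shown in the talk: the replacement bytes fill a fixed, aligned slot, shorter encodings are padded with 0x90 (NOP), and the write uses compare-and-exchange so concurrent threads never observe a torn instruction.

    #include <stdint.h>

    /* Hypothetical helper: atomically overwrite a 4-byte, naturally
     * aligned code slot.  Returns nonzero if *site still held the old
     * bytes and the swap succeeded.  Uses the GCC/Clang builtin. */
    static int patch_code(uint32_t *site, uint32_t old_bytes, uint32_t new_bytes)
    {
        return __sync_bool_compare_and_swap(site, old_bytes, new_bytes);
    }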
Outline
• Modified Optimizations for x86
  • Register Allocation
  • Instruction Selection
  • Instruction Patching
  • Method Inlining
• New Optimizations for x86
  • Calling Convention
  • Floating-Point Modes
• Results
• Conclusion
Optimizations for x86
• Calling convention on x86
  • Arguments are passed on the stack instead of in registers
  • Allocate registers for argument passing
  • Two registers for stack management: frame pointer and stack pointer
  • Constant stack frame size
• Detection of stack overflow is difficult
  • Check at the bottom of the stack frame in the method prolog
• 8-byte stack operations may be unaligned
  • Align stack frames to 8-byte boundaries (see the sketch after the frame layout below)
Optimized stack frame layout
Frame regions (top of frame to bottom): input arguments, return address, callee-save space (4 bytes), local variables, output stack arguments; %esp points to the bottom of the frame.

Method prolog:
  subl $24, %esp
  movl %ebx, (%esp)
  …

Method epilog:
  movl (%esp), %ebx
  addl $24, %esp
  ret
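A minimal sketch of the constant frame-size computation described above; the region names follow the layout slide, but the helper name and the exact formula are assumptions rather than FastVM code. The size is fixed per method and rounded up to an 8-byte boundary so 8-byte stack operations stay aligned, and the check at the bottom of the frame in the prolog then catches a stack overflow immediately.

    /* Hypothetical per-method frame-size computation (not FastVM code). */
    #define CALLEE_SAVE_BYTES 4   /* callee-save space from the layout slide */

    static unsigned frame_bytes(unsigned locals_bytes, unsigned out_args_bytes)
    {
        unsigned size = CALLEE_SAVE_BYTES + locals_bytes + out_args_bytes;
        return (size + 7u) & ~7u;   /* align the constant frame to 8 bytes */
    }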
Floating-Point Modes
• Alpha:
  • Floating-point precision is encoded in the instruction
• x86:
  • Toggle floating-point precision explicitly
  • Heuristically find a default setting
  • Reduce the number of toggles
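The counting scheme below is an assumption, not the talk's algorithm: one plausible form of the heuristic is to tally the single- and double-precision operations in a method and set the x87 precision-control default to the majority, so explicit mode toggles are only emitted around the minority operations.

    /* Hypothetical precision-default heuristic (names are assumptions). */
    typedef enum { PRECISION_SINGLE, PRECISION_DOUBLE } fp_precision;

    static fp_precision default_precision(int single_prec_ops, int double_prec_ops)
    {
        /* Choose the mode used most often; the JIT then toggles the
         * x87 control word only around the minority operations. */
        return (single_prec_ops >= double_prec_ops) ? PRECISION_SINGLE
                                                    : PRECISION_DOUBLE;
    }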
Results: average scenario [benchmark chart]
Results: best-case scenario [benchmark chart]
Conclusion
• FastVM port to x86 is competitive:
  • Fastest JVM implementation on javac and jack
  • Minimal effort on optimizations
• Pitfalls, but also advantages
  • Instruction selection on x86
  • Generally easier to generate efficient code for RISC
• More architecture-neutral optimizations possible
  • Register allocation