Just-In-Time Java Compilation for the Itanium Processor

Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai
Intel Labs

Introduction
• Itanium processor is a statically scheduled machine
• Aggressive compiler techniques are needed to extract ILP
• Just-In-Time (JIT) compiler must be fast
• Must consider time & space efficiency of optimizations
• Balance compilation time with code quality
• Light-weight compilation techniques
• Use heuristics for modeling the microarchitecture
• Leverage semantics and metadata of the JVM

Outline
• Introduction
• Compiler overview
• Register allocation
• Code scheduling
• Other optimizations
• Conclusions

Compiler Structure
[Figure: compiler pipeline, divided into a front-end and a back-end. Boxes: IR construction, Inlining, Global optimizations, Code Selection, Prepass, Register Allocation, Predication, Code Scheduling, GC Support, Code Emission.]

Register Allocation
• Compilation time vs. code quality tradeoff
• IPF architecture has large register files
  • 128 integer, 128 floating-point, 64 predicate, 8 branch
  • Register Stack Engine (RSE) provides 96 stack registers to each procedure
• Use linear scan register allocation
  • “Linear Scan Register Allocation” by Massimiliano Poletto and Vivek Sarkar
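
The linear scan idea can be sketched in a few lines. This is a minimal illustration in the spirit of the Poletto and Sarkar algorithm cited above, not the JIT's actual allocator: the interval representation, the name `linear_scan`, and the choice to spill whichever interval ends last are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[str, int, int]  # (variable, start, end), both ends inclusive


def linear_scan(intervals: List[Interval], num_regs: int) -> Dict[str, Optional[int]]:
    """Return a register number per variable, or None if it is spilled."""
    assignment: Dict[str, Optional[int]] = {}
    free = list(range(num_regs))   # registers not held by any active interval
    active: List[Interval] = []    # intervals holding a register, sorted by end

    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Expire intervals that ended before this one starts.
        while active and active[0][2] < start:
            expired = active.pop(0)
            free.append(assignment[expired[0]])
        if free:
            assignment[name] = free.pop(0)
            active.append((name, start, end))
        elif active and active[-1][2] > end:
            # No register free: spill the active interval that ends last.
            victim = active.pop()
            assignment[name] = assignment[victim[0]]
            assignment[victim[0]] = None
            active.append((name, start, end))
        else:
            assignment[name] = None  # spill the new interval itself
        active.sort(key=lambda iv: iv[2])
    return assignment
```

Each interval is visited once in order of its start point, which is what keeps the allocator cheap compared with graph coloring.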

Live Range vs. Live Interval
[Figure: control-flow graph with blocks B1 to B4, shown twice under the panel titles "Live Ranges" and "Live Intervals". B2 contains t1=... followed by v=t1; B3 contains t2=... followed by v=t2; B4 contains ...=v.]

Coalescing Algorithm
• Coalesce v and t in v = t iff
  • Live interval of t ends at v = t
  • Live interval of t does not intersect with live range of v
• Requires one additional reverse pass over IR
• O(NINST + NVAR * NBB)
[Figure: control-flow graph with blocks B1 to B4. B2 contains t1=... followed by v=t1; B3 contains t2=... followed by v=t2; B4 contains ...=v.]
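
The coalescing condition can be written as a small predicate. The sketch below is illustrative only: the function name `can_coalesce`, the representation of liveness as "live on entry to instruction i", and the instruction numbering of the example are assumptions, not details taken from the compiler.

```python
from typing import Set, Tuple


def can_coalesce(copy_index: int,
                 t_interval: Tuple[int, int],
                 v_live_range: Set[int]) -> bool:
    """Decide whether t and v in the copy `v = t` may share a register.

    t_interval   -- (first, last) instruction indices at which t is live on entry
    v_live_range -- exact set of instruction indices at which v is live on entry
    """
    first, last = t_interval
    ends_at_copy = (last == copy_index)
    disjoint = all(p not in v_live_range for p in range(first, last + 1))
    return ends_at_copy and disjoint


# Linearized example (hypothetical numbering): B1 = {0}, B2 = {1, 2, 3},
# B3 = {4, 5, 6}, B4 = {7}; B2 and B3 are the two arms of a branch.
#   1: t1 = ...   3: v = t1   4: t2 = ...   6: v = t2   7: ... = v
V_LIVE_RANGE = {7}               # v is live on entry only to its use in B4
V_LIVE_INTERVAL = {4, 5, 6, 7}   # the interval also spans the hole in B3

print(can_coalesce(6, (5, 6), V_LIVE_RANGE))     # t2 against v's live range
print(can_coalesce(6, (5, 6), V_LIVE_INTERVAL))  # t2 against v's live interval
```

Under this numbering, testing t2 against the live range of v permits coalescing, while testing it against the coarser live interval of v would reject it, which is why the condition is stated in terms of the live range.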

Coalescing Speedup

Code Scheduling
• Forward cycle-based list scheduling
• Scheduling unit is extended basic block
• Middle exits are due to run-time exceptions

  (p6,p7) = cmp.eq r35, 0
  (p6) br ThrowNullPointerException
  r10 = r35 + 16
  r11 = ld8 [r10]
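
Forward cycle-based list scheduling can be sketched as follows. This is a generic textbook-style illustration, not the scheduler described in the talk: the function name `list_schedule`, the use of program order as the priority function, the single `width` resource limit, and the edge latencies in the example are all assumptions.

```python
from typing import Dict, List, Tuple


def list_schedule(n: int,
                  edges: Dict[Tuple[int, int], int],
                  width: int) -> Dict[int, int]:
    """Assign an issue cycle to instructions 0..n-1.

    edges maps (pred, succ) to the latency the successor must wait after
    the predecessor issues; the dependence graph is assumed to be acyclic.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    preds: Dict[int, List[Tuple[int, int]]] = {i: [] for i in range(n)}
    for (p, s), lat in edges.items():
        preds[s].append((p, lat))

    issue_cycle: Dict[int, int] = {}
    cycle = 0
    while len(issue_cycle) < n:
        issued = 0
        for i in range(n):  # program order serves as the priority
            if i in issue_cycle or issued == width:
                continue
            ready = all(p in issue_cycle and issue_cycle[p] + lat <= cycle
                        for p, lat in preds[i])
            if ready:
                issue_cycle[i] = cycle
                issued += 1
        cycle += 1  # advance to the next cycle
    return issue_cycle


# Hypothetical dependences for a four-instruction block like the one above:
# 0: compare, 1: branch on the compare, 2: address add, 3: load.
EDGES = {(0, 1): 0, (2, 3): 1, (1, 3): 0}
print(list_schedule(4, EDGES, width=2))
```

A real scheduler would replace the single `width` limit with a per-unit resource model and use a priority such as critical-path length.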

Type-based memory disambiguation
• Use JVM metadata to disambiguate memory locations
• Type
  • Integer, floating-point, object reference …
• Kind
  • Object field, array element, virtual table address …
• Field id
  • putfield #10 vs. putfield #15
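
The three levels of the test (type, kind, field id) can be sketched as a may-alias predicate. The `MemRef` record and the `may_alias` function are hypothetical names for illustration, and the sketch assumes that different field ids always name different fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemRef:
    type: str                       # e.g. "int", "float", "ref"
    kind: str                       # e.g. "field", "array_element", "vtable"
    field_id: Optional[int] = None  # field identifier, for field accesses only


def may_alias(a: MemRef, b: MemRef) -> bool:
    """Conservatively answer whether two references may touch the same location."""
    if a.type != b.type:
        return False  # e.g. an int location never overlaps a float location
    if a.kind != b.kind:
        return False  # e.g. an object field never overlaps an array element
    if a.kind == "field" and a.field_id != b.field_id:
        return False  # e.g. putfield #10 vs. putfield #15
    return True       # otherwise assume the references may alias


print(may_alias(MemRef("int", "field", 10), MemRef("int", "field", 15)))
```

Whenever the predicate returns False, the scheduler does not need a memory dependence edge between the two references.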

Type-Based Disambiguation

Exception Dependencies
• Java exceptions are precise
• Naive approach
  • Exception checks end basic blocks
• Our approach
  • Instruction depends on exception check iff
    • Its destination is live at the exception handler, or
    • It is an exception check for a different exception type, or
    • It is a memory reference that may be guarded by the check

Exception Dependency Example

  1: (p6, p0) = cmp.eq r16, 0
  2: (p6) br ThrowNullPointerException
  3: r17 = add r16, 8
  4: r18 = ld [r17]    // load field
  5: r21 = movl 0x000F14E32019000
  6: f8 = fld [r21]    // load static
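
The dependency rule can be applied to this example with a small predicate. The sketch is illustrative: the `Instr` record, its field names, and the way a guarded memory reference is tagged with the register its check tests (`guard_reg`) are assumptions, not the compiler's IR.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Instr:
    dest: Optional[str] = None            # destination register, if any
    exception_type: Optional[str] = None  # set only for exception checks
    checked_reg: Optional[str] = None     # register tested by an exception check
    guard_reg: Optional[str] = None       # base object register of a guarded memory reference


def depends_on_check(instr: Instr, check: Instr, live_at_handler: Set[str]) -> bool:
    """Return True if instr must stay ordered after the exception check."""
    if instr.dest is not None and instr.dest in live_at_handler:
        return True  # its result would be visible in the handler
    if instr.exception_type is not None and instr.exception_type != check.exception_type:
        return True  # checks of different exception types keep their order
    if instr.guard_reg is not None and instr.guard_reg == check.checked_reg:
        return True  # the memory reference may be guarded by this check
    return False


null_check = Instr(exception_type="NullPointerException", checked_reg="r16")  # lines 1-2
add_r17 = Instr(dest="r17")                                                   # line 3
load_field = Instr(dest="r18", guard_reg="r16")                               # line 4
movl_r21 = Instr(dest="r21")                                                  # line 5
load_static = Instr(dest="f8")                                                # line 6

for ins in (add_r17, load_field, movl_r21, load_static):
    print(depends_on_check(ins, null_check, live_at_handler=set()))
```

Assuming none of the destinations is live at the handler, only the field load on line 4 depends on the null check, so lines 3, 5 and 6 are free to move above it.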

Exception Dependencies

IPF Architecture
• Execution (functional) unit type – M, I, F, B
• Instruction (syllable) type – M, A, I, F, B, IL
• Bundles, templates
  • .mii .mi;;i .mil .mmi .m;;mi .mfi .mmf .mib .mbb .bbb .mmb .mfb
• Instruction group – no RAW or WAW register dependencies, with some exceptions

  .mi;;i
  r10 = ld [r15]
  r9 = add r8, 1 ;;    // stop bit
  r16 = shr r9, r32

Template Selection
• Pack instructions into bundles
• Choose slot for each instruction
• Insert NOP instructions
• Assign instructions to functional units
Problems:
• Resource oversubscription
• Inaccurate bypass latencies

Algorithm
• Greedy slot assignment
• Sort instructions by syllable type
  • M < F < IL < I < A < B

  I1: r20 = sxt r14 (I-type)
  I2: r21 = movl ADDR (IL-type)
  I3: f15 = fadd f10, f11 (F-type)

  Unsorted: NOP I1 NOP NOP I2 NOP I3 NOP
  Sorted:   NOP I3 I1 NOP I2
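
The effect of sorting before greedy slot assignment can be reproduced with a toy packer. This is a simplified sketch under several assumptions: all instructions belong to one instruction group and are independent, only a subset of templates without stop bits is modeled, a new bundle takes the first template in the list that can hold the instruction, and an IL-type instruction occupies the last two slots of `.mil`. The names `pack`, `place`, and the `"L"` slot marker are hypothetical.

```python
from typing import Dict, List, Tuple

# Syllable-type order from the slide: M < F < IL < I < A < B.
PRIORITY = {"M": 0, "F": 1, "IL": 2, "I": 3, "A": 4, "B": 5}

# Slot letters each syllable type may occupy (A-type fits an M or an I slot).
SLOTS_FOR = {"M": "m", "F": "f", "I": "i", "A": "mi", "B": "b"}

# (template name, slot letters); "L" marks the two slots taken by an IL-type
# instruction. Illustrative subset only.
TEMPLATES = [("mii", "mii"), ("mmi", "mmi"), ("mfi", "mfi"),
             ("mil", "mLL"), ("mib", "mib")]

Instr = Tuple[str, str]  # (name, syllable type)


def place(bundle: Dict, instr: Instr) -> bool:
    """Put instr into the first free compatible slot of bundle, if any."""
    name, kind = instr
    letters, slots = bundle["letters"], bundle["slots"]
    if kind == "IL":
        idx = [i for i, c in enumerate(letters) if c == "L"]
        if idx and all(slots[i] is None for i in idx):
            for i in idx:
                slots[i] = name
            return True
        return False
    for i, c in enumerate(letters):
        if slots[i] is None and c in SLOTS_FOR[kind]:
            slots[i] = name
            return True
    return False


def pack(instrs: List[Instr], sort: bool = True) -> List[List[str]]:
    """Greedily assign instructions to bundle slots; empty slots become NOPs."""
    if sort:
        instrs = sorted(instrs, key=lambda ins: PRIORITY[ins[1]])
    bundles: List[Dict] = []
    for ins in instrs:
        if any(place(b, ins) for b in bundles):
            continue  # fitted into a bundle that is already open
        for tname, letters in TEMPLATES:
            b = {"template": tname, "letters": letters, "slots": [None] * 3}
            if place(b, ins):
                bundles.append(b)
                break
        else:
            raise ValueError(f"no template can hold {ins}")
    return [[s or "NOP" for s in b["slots"]] for b in bundles]


EXAMPLE = [("I1", "I"), ("I2", "IL"), ("I3", "F")]
print(pack(EXAMPLE, sort=False))  # three bundles
print(pack(EXAMPLE, sort=True))   # two bundles
```

With these assumptions the unsorted order needs three bundles and the sorted order needs two, because placing the F-type instruction first opens an `.mfi` bundle whose free I slot later absorbs I1.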

Template Selection Heuristics

Bypass Latency Accuracy
[Figure: r17 = add r16, 8 feeding r18 = ld [r17], with the add issued either on an I-unit or on an M-unit and the load on an M-unit; the two cases are labeled with latencies 1 and 2.]
• Phase ordering of functional unit assignment
  • Code selection time is too early: underutilizes resources
  • Template selection time is too late: inaccurate scheduling latencies
• Solution: assign to functional unit during scheduling
  • Assign to M-unit if available, else
  • Assign to I-unit and increment latency
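
The assignment rule for an address computation can be sketched as below. The function name `assign_unit`, the counters of free units, and the base latency of 1 cycle are assumptions for illustration; the only rule taken from the slide is "M-unit if available, else I-unit with the latency incremented".

```python
from typing import Optional, Tuple


def assign_unit(m_free: int, i_free: int, latency: int = 1) -> Optional[Tuple[str, int]]:
    """Pick a unit for an A-type add that feeds a load address.

    Returns (unit, latency to the dependent load), or None when no unit is
    free in the current cycle and the instruction has to wait.
    """
    if m_free > 0:
        return ("M", latency)      # preferred: no extra bypass delay
    if i_free > 0:
        return ("I", latency + 1)  # the I-unit result reaches the load one cycle later
    return None


print(assign_unit(m_free=1, i_free=2))
print(assign_unit(m_free=0, i_free=2))
```

Because the choice is made while scheduling, the latency seen by the dependent load already reflects the unit the add will actually use.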

Modeling of Address Computation Latency

Other optimizations
• Predication
  • Profitability depends on the benchmark
  • Performance variations within 2%
• Branch hints
  • Up to 50% speedup from using branch hints
• Sign-extension elimination
  • 1% potential gain for our compiler

Conclusions
• Light-weight optimization techniques for Itanium
• Considering the microarchitecture is important
  • Cannot ignore bypass latencies
  • Template selection should be resource sensitive
• Language semantics help to improve ILP
  • Type-based memory disambiguation
  • Exception dependency elimination
