Trace Fragment Selection within Method-based JVMs
Duane Merrill, Kim Hazelwood
VEE ’08
Overview
• Would trace fragment dispatch benefit VMs with JITs?
  • Fragment dispatch as a feedback-directed optimization
• Why?
  • Improve VM performance via better instruction layout
• Overview
  • Motivation
  • New scheme for trace selection
  • Viability in JikesRVM
    • Evaluate opportunities for code improvement
    • Evaluate trace selection overhead
Traditional VM Adaptive Code Generation
• Phase 3: More Advanced JIT Compilation
  • Update Class/TOC dispatch tables, perform OSR
• Phase 2: JIT Method Compilation
  • Compilation Shape: Source Method
  • Dispatch Shape: Corresponding MC Code Array & Machine Code Trace Fragment
• Phase 1: Interpreter
  • Compilation Shape: Source Instruction
  • Dispatch Shape: Corresponding MC Instruction(s)
  • Machine Code Trace Fragment
SDT/DBI/Embedded VM Adaptive Code Generation
• Phase 3: More Advanced JIT Compilation
  • Update Class/TOC dispatch tables, perform OSR
• Phase 2: JIT Method Compilation
  • Compilation Shape: Source Method
  • Dispatch Shape: Corresponding MC Code Array & Machine Code Trace Fragment
• Phase 1: Interpreter
  • Compilation Shape: Source Instruction
  • Dispatch Shape: Corresponding MC Instruction(s)
  • Machine Code Trace Fragment
Proposed VM Adaptive Code Generation
• Phase 3: More Advanced JIT Compilation
  • Update Class/TOC dispatch tables, perform OSR
• Phase 2: JIT Method Compilation
  • Compilation Shape: Source Method
  • Dispatch Shape(s): Corresponding MC Code Array & Machine Code Trace Fragment
• Phase 1: Interpreter
  • Compilation Shape: Source Instruction
  • Dispatch Shape: Corresponding MC Instruction(s)
  • Machine Code Trace Fragment
Trace Fragment Dispatch
• Trace
  • A specific sequence of instructions observed at runtime
  • Spans:
    • Branches
    • Procedure calls and returns
  • Potentially arbitrary number of instructions
• Trace Fragment
  • A finite, linear sequence of machine code instructions
  • Single-entry, multiple-exit (viz. superblock)
  • Cached, linked
[Figure: foo() with blocks A, B, C, D, E and bar() with blocks M, N, O, P; trace fragment A B D M O P E, with exits to C and to N]
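The single-entry, multiple-exit shape can be made concrete with a small sketch. The control-flow graph below is an assumption read off the slide's figure (edge directions are inferred, not stated in the talk), and `side_exits` is a hypothetical helper, not part of JikesRVM. It lists every successor edge that leaves the fragment:

```python
from typing import Dict, List, Tuple

# Assumed CFG for the slide's example: foo() holds A-E, bar() holds M-P.
# D calls into bar() at M; P returns to E; E loops back to A.
CFG: Dict[str, List[str]] = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["M"],
    "M": ["N", "O"], "N": ["P"], "O": ["P"], "P": ["E"], "E": ["A"],
}

def side_exits(trace: List[str], cfg: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (block, target) pairs for every edge that leaves the fragment.

    The last block is treated as linking back to the fragment head, so a
    cyclic trace has no exit on its closing edge.
    """
    exits = []
    for i, block in enumerate(trace):
        on_trace_next = trace[(i + 1) % len(trace)]
        for succ in cfg[block]:
            if succ != on_trace_next:
                exits.append((block, succ))
    return exits

FRAGMENT = ["A", "B", "D", "M", "O", "P", "E"]
EXITS = side_exits(FRAGMENT, CFG)  # exits to C (from A) and to N (from M)
```

Under these assumptions the fragment A B D M O P E has exactly the two exits shown on the slide, to C and to N.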
Trace Fragment Dispatch: The Good
• Location, Location, Location
  • “Inlining-like”:
    • Context sensitive
    • Partial
  • Spatial locality provides most of the achieved speedup
• Simple, low-cost “local” optimizations
  • Redundancy elimination
• Nimbly adjusts to changing behavior
  • Efficient
  • Lots of early exits? Discard fragment and re-trace
[Figure: foo() with blocks A, B, C, D, E and bar() with blocks M, N, O, P; trace fragment A B D M O P E, with exits to C and to N]
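One way to picture the "discard and re-trace" policy is a pair of counters per fragment. This is only a sketch: the class name, the minimum sample size, and the exit-ratio threshold are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

@dataclass
class FragmentStats:
    """Per-fragment counters (hypothetical bookkeeping, not the authors' design)."""
    entries: int = 0
    early_exits: int = 0

    def record(self, exited_early: bool) -> None:
        """Count one execution of the fragment and whether it left via a side exit."""
        self.entries += 1
        if exited_early:
            self.early_exits += 1

    def should_discard(self, min_entries: int = 100, max_exit_ratio: float = 0.5) -> bool:
        """Discard once enough samples exist and most of them exited early."""
        if self.entries < min_entries:
            return False
        return self.early_exits / self.entries > max_exit_ratio
```

A fragment that keeps leaving through its side exits no longer reflects the hot path, so dropping it and recording a fresh trace is cheaper than recompiling the method.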
Trace Fragment Dispatch: The Bad
• Lacks optimization power
  • Data flow analysis
  • Code motion & loop optimizations
• Code expansion
  • Tail duplication
  • Exponential growth (if all paths maintained indefinitely)
[Figure, built up over three steps: foo() with blocks A, B, C, D, E and bar() with blocks M, N, O, P; fragment A B D M O P E (exits to C and N), then fragment C D M O P E (exits to A and N), then fragment N P E (exit to A)]
Supplement Method Dispatch with Trace Dispatch
• Why?
  • Improve VM performance via better instruction layout
  • Easily disposable fragments reflect current program behavior
• How?
  • JIT compiler inserts instrumentation into method code arrays:
    • Monitor potential “hot trace headers”
    • Record control flow
  • VM runtime assembles & patches trace fragments:
    • Blocks “scavenged” from compiled code arrays
    • Conditionals adjusted for proper fallthroughs
    • Method code arrays patched to transfer control to fragments
    • New fragments linked to existing fragments
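The "conditionals adjusted for proper fallthroughs" step can be sketched as follows. The `Block` record and the condition mnemonics are assumptions made for illustration; real fragments would be assembled from machine code, not Python objects. When the on-trace successor is the branch's taken target, the condition is inverted so that the hot path falls through:

```python
from dataclasses import dataclass, replace
from typing import List, Optional

@dataclass(frozen=True)
class Block:
    name: str
    cond: Optional[str] = None         # e.g. "eq"; None for unconditional blocks
    taken: Optional[str] = None        # target when the condition holds
    fallthrough: Optional[str] = None  # target otherwise

# Hypothetical condition mnemonics and their inverses.
INVERSE = {"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt", "gt": "le", "le": "gt"}

def lay_out(trace: List[Block]) -> List[Block]:
    """Copy blocks in trace order, flipping branches whose taken target is on-trace."""
    out = []
    for i, block in enumerate(trace):
        nxt = trace[i + 1].name if i + 1 < len(trace) else None
        if block.cond is not None and block.taken == nxt:
            # Invert the test and swap targets: the next trace block now falls through.
            block = replace(block, cond=INVERSE[block.cond],
                            taken=block.fallthrough, fallthrough=block.taken)
        out.append(block)
    return out
```

For example, a block A that branches to B on `eq` and otherwise falls through to C becomes, on the trace A B, a block that branches to C on `ne` and falls through to B.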
Easy Fragment Management
• Improved trace selection
  • JIT to identify trace starting locations
  • VM to determine trace stopping locations
• “Friendly” encoding of instructions
  • Patch spots built in
  • Avoid pesky PC-relative jumps (e.g., switch statements)
• Knowledge of language implementation features:
  • Calling conventions
  • Stack layout
  • Virtual method dispatch tables
Efficient Fragment Management
• “Mixed-mode” scheme:
  • Execution in both method code arrays & trace fragments
  • Share the same register allocation
  • Control flows off-trace into method code arrays
• Fewer trace fragments
  • Manageable code expansion
• JVM control is already built into yield points
• Disposable trace fragments
  • No need to redo expensive analysis as behavior changes
Our Work: Trace Fragment Selection
• Develop new trace selection methodology
  • Leverage JIT global analysis, VM runtime
• Implement trace selection in JikesRVM and evaluate viability
  • Do recorded traces indicate room for code improvement?
  • Do the traces exhibit good characteristics?
  • Is instrumentation overhead reasonable?
Improved Trace Selection: Starting Locations
• Loop header locations
  • Identified by JIT loop analysis
  • More accurate than “target of backward branch” heuristic
• “Early exit” blocks
  • Allows trace fragments to be “layered”
• Method prologue
  • Catches recursive execution
[Figure, built up over three steps: (1) foo() with blocks A, B, C, D, E and bar() with blocks M, N, O, P; fragment A B D M O P E with exits to C and N; (2) layered fragment N P E with exit to A; (3) foo() with blocks A, B, C, D; fragment A B D with exits to the epilogue and to C]
Improved Trace Selection: Stopping Criteria
• Cycle: returned to the loop header
• Abutted: arrived at another loop header
• Length Limited (unusual): 128 basic blocks encountered
• Rejoined (unusual): returned to a basic block already in the trace
• Exited (unusual): exited the method without meeting the above conditions (identifiable by stack height)
[Figure: foo() with blocks A, B, C, D, E and bar() with blocks M, N, O, P; fragment A B D M O P E with exits to C and N; fragment N P E with exit to A]
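The five stopping criteria can be read as a single classification loop over the observed block stream. The sketch below is an assumption-laden model: the function name, the `(block, stack_depth)` event format, and the order in which the criteria are tested are all hypothetical; only the criteria themselves and the 128-block limit come from the slide.

```python
from typing import Iterable, List, Set, Tuple

MAX_BLOCKS = 128  # length limit quoted on the slide

def record_trace(head: str,
                 events: Iterable[Tuple[str, int]],
                 loop_headers: Set[str],
                 max_blocks: int = MAX_BLOCKS) -> Tuple[List[str], str]:
    """Consume (block, stack_depth) events seen after the head until a criterion fires.

    stack_depth is relative to the frame that contains the head (0 = same frame,
    positive = inside a callee, negative = the head's method has returned).
    """
    trace = [head]
    for block, depth in events:
        if depth < 0:
            return trace, "exited"        # left the method (stack height check)
        if block == head:
            return trace, "cycle"         # returned to the loop header
        if block in loop_headers:
            return trace, "abutted"       # arrived at another loop header
        if block in trace:
            return trace, "rejoined"      # revisited a block already recorded
        trace.append(block)
        if len(trace) >= max_blocks:
            return trace, "length-limited"
    return trace, "incomplete"            # event stream ended first
```

Feeding it the slide's example path through foo() and bar(), starting at loop header A, yields the trace A B D M O P E with the reason "cycle".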
JIT-Inserted Instrumentation
• (a) Assembly of original method code-block: A (loop header), B, C, D
• (b) Assembly of code-block to be used for tracing
  • Low-fidelity instrumentation (loop header counters): JUMP_BLOCK, TRACE_HEAD_A, A, TRACE_HEAD_B, B, C, D, TRAMPOLINE_A, TRAMPOLINE_B
  • High-fidelity instrumentation (paths through blocks): INSTRUM_A, A’, INSTRUM_B, B’, INSTRUM_C, C’, INSTRUM_D, D’, TRAMPOLINE_A’, TRAMPOLINE_B’, TRAMPOLINE_C’, TRAMPOLINE_D’
[Figure, built up over four steps; in the final step TRACE_HEAD_A no longer appears in the low-fidelity code]
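The two instrumentation levels can be modeled as a cheap counter at each trace head that, once hot, hands control to a copy of the code that records every block. This is a sketch only: the class, its method names, and the threshold value are hypothetical, and the actual mechanism on the slide works through trampolines in generated machine code.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional

HOT_THRESHOLD = 50  # hypothetical value; the talk does not state one

class TraceHeadMonitor:
    def __init__(self, threshold: int = HOT_THRESHOLD) -> None:
        self.threshold = threshold
        self.counters: DefaultDict[str, int] = defaultdict(int)
        self.recording: Optional[str] = None  # head currently being traced
        self.path: List[str] = []

    def on_trace_head(self, head: str) -> bool:
        """Low-fidelity hook (TRACE_HEAD_x): bump a counter.

        Returns True when execution should switch to the instrumented copy.
        """
        if self.recording is not None:
            return False
        self.counters[head] += 1
        if self.counters[head] >= self.threshold:
            self.recording = head
            self.path = []
            return True
        return False

    def on_block(self, block: str) -> None:
        """High-fidelity hook (INSTRUM_x): append the block to the recorded path."""
        if self.recording is not None:
            self.path.append(block)
```

The design point this illustrates is that the common case pays only for a counter increment; per-block recording runs only while a trace is actually being captured.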
Improvement Opportunity
[Figure, built up over three steps: foo() with blocks A, B, C, D, E and bar() with blocks M, N, O, P; blocks laid out in the order A B D E C M N P O within a 1 GB virtual address space running from 5B0480C6 (low) to 9BFE8D1F (high); transitions between blocks marked as either gap transitions or fallthrough transitions]
Trace Layouts in Address Space (_227_mtrt)
[Figure: traces plotted across the 1 GB virtual address space, from 5B0480C6 (low) to 9BFE8D1F (high)]
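The gap versus fallthrough distinction can be counted mechanically. In the sketch below, the layout order A B D E C M N P O is taken from the figure and is assumed to be address order; the function name is hypothetical, and blocks are treated as adjacent units with no padding between them.

```python
from typing import List, Tuple

def classify_transitions(trace: List[str], layout: List[str]) -> Tuple[int, int]:
    """Return (fallthrough, gap) counts for consecutive blocks of a trace.

    A transition is a fallthrough when the destination block sits immediately
    after the source block in the layout; anything else is a gap.
    """
    position = {block: i for i, block in enumerate(layout)}
    fallthrough = gap = 0
    for src, dst in zip(trace, trace[1:]):
        if position[dst] == position[src] + 1:
            fallthrough += 1
        else:
            gap += 1
    return fallthrough, gap

METHOD_LAYOUT = ["A", "B", "D", "E", "C", "M", "N", "P", "O"]  # assumed address order
TRACE = ["A", "B", "D", "M", "O", "P", "E"]
```

With the method-based layout, only A to B and B to D fall through, and the other four transitions are gaps. Laying the same blocks out in trace order would turn all six into fallthroughs, which is the improvement opportunity the slide points at.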
Trace Continuity: DaCapo & SpecJVM98 Benchmarks
• 1/3 of traces are necessarily fragmented (inter-procedural)
• Most intra-procedural traces are non-contiguous
Transitions between basic blocks
• Appropriate fallthrough block 80% of the time
• 15% misprediction rate for local control flow
• 20% of all transitions could benefit from trace fragment dispatch
Trace Characteristics
• Cycle and abutted traces make up the majority
• Few length-limited or rejoined traces
• Surprisingly large number of exited traces
  • Sporadic loops
Instrumentation Overhead (Startup)
• One-iteration tests (40x)
• Mixed slowdown results: 7.4% (jython), -6.5% (_227_mtrt)
• Average startup overhead: 1.7%
Instrumentation Overhead (Steady State)
• 40-iteration tests (8x)
• Average steady-state overhead: 1.7%
Summary
• Envision trace fragment dispatch as a feedback-directed optimization
  • Locality optimizations not addressed by JIT compiler
  • Adapt to changing behavior without recompilation
• More accurate trace selection
  • Enabled by the co-location with the JIT and VM runtime
• Evaluated opportunity and cost
  • 20% of basic block transitions do not use sequential fallthrough
  • 25% of taken branches/calls transfer control flow to locations outside the VM page
  • Minimal startup and maintenance overhead for trace selection
Improved Trace Selection: Starting Locations (additional figure)
[Figure, built up over two steps: foo() with blocks A, B, C, D; fragment B C with exit to D, then fragment D A with exit to A]