Lengthening Traces to Improve Opportunities for Dynamic Optimization
Chuck Zhao, Cristiana Amza, Greg Steffan, University of Toronto
Youfeng Wu, Intel Research
Feb. 16, 2007, Interact-12, HPCA

Intel’s StarDBT Project
• StarDBT
  • A Dynamic Binary Translation framework
  • Operates on traces, optimizes hot traces
• Long term goal: Use StarDBT to allow legacy apps to exploit TM support
  • (NOT by automatically parallelizing legacy apps)
  • Allow speculative sequential optimizations
  • Use hardware TM’s checkpoint/restore
• Problem: default traces are too small
  • TM overheads would overwhelm benefits
Challenge: lengthening traces can be tricky

Trace Formation
[Figure: a basic-block profile and a trace profile over blocks A to G, marking on-trace blocks and an off-trace stub]
Control flow that goes off-trace can be costly

Trade-offs when Lengthening Traces
• Completion ratio:
  • likelihood of execution staying on trace
  • percentage of execution reaching trace tail
[Figure: traces of increasing length built from blocks A, B, D, F, G; each side exit is labeled with a 5% side-exit ratio. Completion ratios shown: 100% - 10% = 90% and 100% - 25% = 75%]
Tradeoffs:
• longer traces have more optimization opportunities
• longer traces have more side-exit branches
A sweet spot exists in between. Can we find it?

Our Work So Far (i.e., this talk)
• Lengthening traces while maintaining completion ratios
  • Through unrolling and straightening
• A characterization of the impact on traces
  • length, completion ratio, unroll factor, …
• Improving optimization opportunities on longer traces
  • Improve Local Value Numbering (LVN) hits
  • Measurement of impact on performance is pending
• Performing on-the-fly actions by DBT system
  • Decisions made by instrumenting/sampling code online

Related Work
• Binary Translation Systems
  • Dynamo
  • DynamoRIO
  • PIN
  • StarDBT
    • transparent translation
    • x86 legacy code
• Trace Collection and Optimizations
  • Java JIT
  • Dynamo, DynamoRIO, Mojo
  • StarDBT
    • x86 binary level
    • MRET2 to improve trace formation
    • aggressive trace optimizations
First full analysis of trace-lengthening issues for DBT systems

StarDBT Trace Types
[Figure: traces a, b, c, d and the dispatcher, illustrating three trace types: self type, other trace type, elsewhere type]

Lengthening Traces Through Unrolling
[Figure: trace a with a 90% completion ratio, unrolled into repeated copies; the copies are labeled 90%, 81%, 72.9%]
Unrolling increases trace’s length, but reduces completion ratio

Finding the Sweet-Spot Unroll Factor
[Figure: trace a unrolled into repeated copies]
... chosen by system designer
given p_orig = 99% and p_target = 90%
Traces with 100% completion ratio: set N = 10

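The 90%, 81%, 72.9% labels on the unrolling slide are consistent with a simple model in which every unrolled copy stays on trace independently with probability p_orig, so N copies complete with probability p_orig^N. The slide's own formula did not survive in this transcript, so the following is only a minimal sketch under that independence assumption. The function names and the use of the cap for all traces are hypothetical; only the value N = 10 for traces with a 100% completion ratio comes from the slide.

```python
def unrolled_completion_ratio(p_orig: float, n: int) -> float:
    """Completion ratio of a trace unrolled n times, assuming independent copies."""
    return p_orig ** n


def max_unroll_factor(p_orig: float, p_target: float, cap: int = 10) -> int:
    """Largest n <= cap such that p_orig**n stays at or above p_target.

    Returns 1 (not unrolled) when even one extra copy would drop the
    completion ratio below the target. The cap keeps traces that never
    leave the trace (p_orig == 1.0) from being unrolled forever.
    """
    if not (0.0 < p_orig <= 1.0 and 0.0 < p_target <= 1.0):
        raise ValueError("ratios must be in (0, 1]")
    n = 1
    ratio = p_orig
    # Add one more copy only while the unrolled trace still meets the target.
    while n < cap and ratio * p_orig >= p_target:
        ratio *= p_orig
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, round(unrolled_completion_ratio(0.90, n), 4))
    print(max_unroll_factor(0.99, 0.90, cap=100))
```

Under this model, p_orig = 99% and p_target = 90% allow ten copies, since 0.99^10 is about 0.904 while 0.99^11 is about 0.895.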
Lengthening Traces Through Straightening
[Figure: straightening example over blocks b, c, and d]
We don’t yet implement/evaluate straightening

Distribution of Original Completion Ratios
[Figure: distribution of original completion ratios across hot traces]
Majority of hot traces have completion ratios in 90%-100%

Impact of Unrolling on Hot Trace Size
[Figure: average number of instructions per hot trace, by completion ratio; annotation: 36% longer]
Selected SPECint CPU2000 benchmarks with MinneSPEC input
Lengthening increases hot trace size by more than 36%

How Much are Traces Unrolled?
[Figure: average unroll factor by target completion ratio; annotations: 1.38-1.58x, not unrolled]
Hot traces are unrolled on average by 1.38x or more

Average Completion Ratio After Lengthening
[Figure: completion ratio after lengthening (axis: Completion Ratio, 10% to 90%); annotation: <0.5%]
Lengthening traces reduces completion ratio by < 0.5%

Local Value Numbering (LVN)
• No need to build Control Flow Graph (CFG)
  • Partial info
• No need to perform Data Flow Analysis (DFA)
  • Expensive, relies on CFG
• Can be arranged into a single-pass scan
  • Ease of implementation
  • Relatively lightweight algorithm
• Performs three optimizations:
  • Common Subexpression Elimination (CSE)
  • Copy Propagation (CP)
  • Dead-Code Elimination (DCE)
LVN is common in JIT optimizers

Ex: LVN On a Lengthened Trace
Original Traces:
  … c = a + b; d = a; e = b …
  f = d + e; d = x …
Lengthened Trace (with value numbers):
  c3 = a1 + b2
  d1 = a1
  e2 = b2
  f3 = d1 + e2   becomes   f3 = c3   (CSE hit)
  d4 = x4
Optimized Trace:
  … c = a + b; e = b; f = c; d = x …   (d = a removed: DCE hit)

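The example on this slide can be reproduced with a small local value numbering pass. This is a minimal sketch and not StarDBT's implementation: the tuple-based instruction format, the function names, and the treatment of the trace as straight-line code with no side exit between a dead definition and its overwrite are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# One instruction: (dest, op, src1, src2). op is "copy" (src2 is None)
# or a binary operator such as "+". This tuple IR is hypothetical.
Instr = Tuple[str, str, str, Optional[str]]
COMMUTATIVE = {"+", "*"}


def value_number(trace: List[Instr]) -> Tuple[List[Instr], int]:
    """Single forward scan: a recomputed expression becomes a copy (CSE hit)."""
    var_vn: Dict[str, int] = {}                    # variable -> value number
    expr_vn: Dict[Tuple[str, int, int], int] = {}  # (op, vn1, vn2) -> value number
    counter = 0
    out: List[Instr] = []
    hits = 0

    def vn_of(var: str) -> int:
        nonlocal counter
        if var not in var_vn:          # first sight of an input: fresh number
            counter += 1
            var_vn[var] = counter
        return var_vn[var]

    for dest, op, s1, s2 in trace:
        holder = None
        if op == "copy":
            vn = vn_of(s1)             # a copy shares its source's number
        else:
            a, b = vn_of(s1), vn_of(s2)
            if op in COMMUTATIVE and a > b:
                a, b = b, a
            key = (op, a, b)
            if key in expr_vn:
                vn = expr_vn[key]
                # Find a variable that still holds this value, if any.
                holder = next((v for v, n in var_vn.items() if n == vn), None)
            else:
                counter += 1
                vn = counter
                expr_vn[key] = vn
        if holder is not None:
            out.append((dest, "copy", holder, None))
            hits += 1
        else:
            out.append((dest, op, s1, s2))
        var_vn[dest] = vn
    return out, hits


def drop_overwritten(trace: List[Instr]) -> Tuple[List[Instr], int]:
    """Backward scan: drop a definition that is redefined before any use (DCE hit)."""
    overwritten: Set[str] = set()
    kept: List[Instr] = []
    hits = 0
    for dest, op, s1, s2 in reversed(trace):
        if dest in overwritten:
            hits += 1
            continue
        kept.append((dest, op, s1, s2))
        overwritten.add(dest)
        overwritten.discard(s1)        # sources are read, so earlier defs stay
        if s2 is not None:
            overwritten.discard(s2)
    kept.reverse()
    return kept, hits


# The lengthened trace from the slide.
SLIDE_TRACE: List[Instr] = [
    ("c", "+", "a", "b"),
    ("d", "copy", "a", None),
    ("e", "copy", "b", None),
    ("f", "+", "d", "e"),
    ("d", "copy", "x", None),
]

numbered, cse_hits = value_number(SLIDE_TRACE)
optimized, dce_hits = drop_overwritten(numbered)
```

On the slide's trace, `f = d + e` is rewritten to `f = c` and the first `d = a` is dropped. Neither hit is available while the two original traces are optimized separately, which is the point the slide makes about lengthening.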
LVN Hits Improvement (%)
[Figure: % increase in LVN hits (axis: 5% to 35%), by target completion ratio]
10+% more LVN hits are available through lengthening

Ongoing Work
• Complete DBT Optimization Framework
• Evaluate speculative optimizations on long hot traces with high completion ratios
• Automatically determine optimal transaction granularity
• Use HTM to support trace-based speculative optimizations

Control Speculation
A Compiler Framework for Speculative Analysis and Optimizations: Lin et al., PLDI 03
[Figure: a cmp branch with edges labeled 90+% and 10-%, and a load ld x=[y] …]
ld.s x = [y]
if(c){
  chk.s x, recovery
next:
  …
}
recovery:
  ld x=[y]
  jmp next

Use HTM to Support Trace-based Speculative Optimizations
[Figure: a cmp branch with edges labeled 90+% and 10-%, and a load ld x=[y] …]
start_tx
ld x = [y]
if(c){
  chk x, abort_tx
  …
}
commit_tx
Use longer traces with high completion ratio as tx granularity
HTM hardware support simplifies speculative optimization

Conclusion
• Traces can be effectively lengthened
  • increase in trace size by 36+%
  • decrease completion ratio by less than 0.5%
• Longer traces provide better opportunities for optimization
  • increase in LVN hits by 10%+

Complete StarDBT Optimization Framework
• x86 CISC ISA
  • code patching won’t work
  • Really need a code generator and IR
• Design + implement a low-level Runtime IR
  • close to hardware
  • capture + represent all necessary low-level info
  • easy to convert from/to machine code
  • easy to implement analysis and optimizations
• Starting point
  • Dynamo IR
  • LLVM IR
  • GCC RTL
  • …

Trace Formation Heuristics
• MRET: Most Recently Executed Tail
  • originally proposed by Dynamo
  • Trace head
    • loop head (backward branch target)
    • sampling counter reaches a certain threshold
  • Trace tail
    • satisfies certain trace-tail conditions
• MRET2: 2-pass MRET
  • perform 2 independent MRET trace formations
  • intersect traces with common head

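The trace-head rule above (a backward-branch target whose counter reaches a threshold) can be sketched as follows. This is an illustrative sketch only: the class name, the `on_branch` hook, and the threshold value are hypothetical, and a branch is treated as backward when its target address is not greater than its source address. Trace-tail conditions and the MRET2 intersection step are not modeled, since the slide does not spell them out.

```python
from collections import defaultdict
from typing import DefaultDict


class TraceHeadProfiler:
    """Counts executions of backward-branch targets (candidate loop heads)."""

    def __init__(self, threshold: int = 3) -> None:
        # The threshold value is an arbitrary placeholder, not StarDBT's setting.
        self.threshold = threshold
        self.counters: DefaultDict[int, int] = defaultdict(int)

    def on_branch(self, source_addr: int, target_addr: int) -> bool:
        """Return True exactly once, when target_addr becomes a trace head."""
        if target_addr > source_addr:
            return False               # forward branch: not a loop head
        self.counters[target_addr] += 1
        return self.counters[target_addr] == self.threshold


profiler = TraceHeadProfiler(threshold=3)
# Three trips around a loop whose back edge jumps from 0x120 to 0x100.
results = [profiler.on_branch(0x120, 0x100) for _ in range(3)]
```

In a real system the `True` result would start trace recording at the target; here it only marks the moment the counter reaches the threshold.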
Traces and Hot Traces
• Trace
  • MRET2 recognizes trace heads
  • Trace tails satisfy certain conditions
  • Blocks in between become a trace
• Hot Trace
  • Based on recognized traces
  • Put in additional software counters
    • head: head counter
    • each early-exit branch: off-trace counters
    • sampling: hot-trace’s completion ratio

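One way to read the counters above: the completion ratio is the fraction of head executions that did not leave through any early-exit branch. The sketch below assumes exactly that relationship, which matches the 100% - 10% = 90% and 100% - 25% = 75% arithmetic on the trade-offs slide. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HotTraceCounters:
    """Software counters attached to one hot trace (names are illustrative)."""

    head: int = 0                                        # times the trace head ran
    off_trace: List[int] = field(default_factory=list)   # one counter per early exit

    def completion_ratio(self) -> float:
        """Fraction of head executions that reached the trace tail."""
        if self.head == 0:
            raise ValueError("trace head has not executed yet")
        return 1.0 - sum(self.off_trace) / self.head


# Two side exits, each taken on 5% of 1000 head executions.
short_trace = HotTraceCounters(head=1000, off_trace=[50, 50])
# Five side exits, each taken on 5% of 1000 head executions.
long_trace = HotTraceCounters(head=1000, off_trace=[50] * 5)
```

With these sample counts the two traces come out at roughly 0.90 and 0.75, the same values as the slide's example.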