Feedback-directed optimizations with estimated edge profiles from hardware event sampling
Open64 workshop, CGO 2008, April 6, 2008
Vinodha Ramasamy, Robert Hundt (Google Inc.); Dehao Chen, Wenguang Chen (Tsinghua University)
Background
• Traditional FDO model: Instrument – Run – Recompile
• Usage model
  - Difficulties in generating representative training datasets
  - High overhead of profile collection
  - Requires dual compilation (tightly coupled builds)
• Benefits
  - Supports both value and edge profiling
  - High performance potential
[Figure: traditional FDO flow. Labels: Instrumentation Build, Instrumented Binary, Training Data, Profile Data, FDO Build, Optimized Binary]
Overview
Our methodology
• Skip the instrumentation step
• Use INST_RETIRED event samples for feedback
• Source position information used to correlate samples to basic blocks
• Generate traditional edge profiles from basic block samples
• Feedback data stored in same data structures as instrumented FDO
• Leverage feedback-directed optimizations, validation and propagation
[Figure: sampled FDO flow. Labels: Input Data, Optimized Binary, Sample Profile, FDO Build]
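The correlation step (samples to source positions) could be sketched roughly as follows. This is only an illustration: `line_table` is a hypothetical address-to-source-position map (in practice it would come from debug information), `samples_per_source_line` is a made-up helper name, and the addresses and counts are borrowed from the pbla.c:60 example in the Algorithm slide.

```python
from collections import defaultdict


def samples_per_source_line(address_samples, line_table):
    """Group per-instruction sample counts by source position.

    address_samples: {instruction address: sample count}
    line_table: {instruction address: (file, line)}  (hypothetical input)
    Instructions that received no samples are recorded as 0 so that a later
    step can still divide by the number of instructions on the line.
    """
    per_line = defaultdict(list)
    for address, location in line_table.items():
        per_line[location].append(address_samples.get(address, 0))
    return dict(per_line)


# Assumed inputs for illustration only.
line_table = {
    0x804A8B7: ("pbla.c", 60),
    0x804A8BA: ("pbla.c", 60),
    0x804A8BD: ("pbla.c", 60),
    0x804A8C0: ("pbla.c", 60),
}
address_samples = {
    0x804A8B7: 100,
    0x804A8BA: 30,
    0x804A8BD: 70,
    0x804A8C0: 80,
    0xDEADBEEF: 5,  # no source position known, so it is ignored
}

grouped = samples_per_source_line(address_samples, line_table)
```

Samples at addresses without a known source position are simply dropped in this sketch; a real tool would have to decide how to account for them.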
Algorithm
• Basic block counts
  - Scale samples per source line by # of instructions
  - Samples per source line stored in profile datafile
  - Annotate IR statements in basic blocks with source line sample counts
  - Scale basic block sample count: BB.count = (∑ IR.count) / num_IR_stmts

Example, samples per source line:
  pbla.c:60  iplus = iplus->pred;   // 280 ÷ 4 = 70
  100 : 804a8b7: mov 0x10(%ebp),%eax
   30 : 804a8ba: mov 0x8(%eax),%eax
   70 : 804a8bd: mov %eax,0x10(%ebp)
   80 : 804a8c0: jmp 804a94b <primal_iminus+0x137>

Example, basic block count:
  IR1 = 70, IR2 = 10, IR3 = 70, IR4 = 0, IR5 = 0
  ∑ IR.count = 70 + 10 + 70 + 0 + 0 = 150
  BB.count = 150 ÷ 5 = 30
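The two scaling steps on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only, not the compiler's implementation; the function names are invented for illustration and the inputs are the numbers shown on the slide.

```python
def line_sample_count(instruction_samples):
    """Scale the samples attributed to one source line by its instruction count."""
    if not instruction_samples:
        return 0.0
    return sum(instruction_samples) / len(instruction_samples)


def basic_block_count(ir_counts):
    """BB.count = (sum of IR.count) / num_IR_stmts."""
    if not ir_counts:
        return 0.0
    return sum(ir_counts) / len(ir_counts)


# pbla.c:60 has four instructions with 100, 30, 70 and 80 samples.
line_60 = line_sample_count([100, 30, 70, 80])   # 280 / 4

# A basic block with five IR statements annotated 70, 10, 70, 0, 0.
bb_count = basic_block_count([70, 10, 70, 0, 0])  # 150 / 5
```

With the slide's inputs, `line_60` evaluates to 70 and `bb_count` to 30, matching the worked example.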
Edge frequency estimation
• Edge counts from basic block counts
• Uses higher-level program structure (branch, loop, etc.)
• Recursive algorithm used to smooth sample counts
[Figure: example flow graph before and after smoothing. Labeled counts, before → after: ENTRY 0 → 500; BODY 0 → 7954; BR 7954 → 7954; BACK 0 → 7454; NT 30 → 32; T 7922 → 7922; JOIN 420 → 7954; EXIT 0 → 500. A standalone label of 500 also appears in the figure.]
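One simple way to derive edge counts from consistent basic block counts is flow conservation: a block's count equals the sum of its incoming edges and the sum of its outgoing edges, so any edge that is the only unknown around a block can be solved for. The sketch below uses that swapped-in technique; it is not the recursive smoothing algorithm from the talk. The graph shape is an assumption loosely modelled on the slide's loop example, with the back edge written as JOIN → BODY, and the block counts are the smoothed values from the slide.

```python
def estimate_edge_counts(block_counts, edges):
    """Solve edge counts by flow conservation.

    block_counts: {block name: execution count}
    edges: list of (source block, destination block)
    Returns {edge: count} for every edge that could be determined.
    """
    known = {}
    changed = True
    while changed:
        changed = False
        for node, count in block_counts.items():
            outgoing = [e for e in edges if e[0] == node]
            incoming = [e for e in edges if e[1] == node]
            for group in (outgoing, incoming):
                unknown = [e for e in group if e not in known]
                if len(unknown) == 1:
                    solved = count - sum(known[e] for e in group if e in known)
                    known[unknown[0]] = max(solved, 0)  # clamp negative estimates
                    changed = True
    return known


# Assumed graph: a loop whose body contains a two-way branch.
block_counts = {
    "ENTRY": 500, "BODY": 7954, "BR": 7954,
    "T": 7922, "NT": 32, "JOIN": 7954, "EXIT": 500,
}
edges = [
    ("ENTRY", "BODY"), ("BODY", "BR"),
    ("BR", "T"), ("BR", "NT"),
    ("T", "JOIN"), ("NT", "JOIN"),
    ("JOIN", "BODY"),  # back edge
    ("JOIN", "EXIT"),
]

edge_counts = estimate_edge_counts(block_counts, edges)
```

This only works when the block counts already obey flow conservation. Raw sampled counts generally do not (the "before" values on the slide are an example), which is presumably why a smoothing pass is applied first.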
Challenges
• Inaccuracies inherent to sampling
• Source position information issues
  - Missing information due to optimization transformations
  - Disambiguating samples per source line, e.g. if (cond) {stmt1; stmt2;}
• Edge estimation heuristics
  - Evaluate algorithm proposed by Levin et al.
• Inlining
  - Annotate early inlined functions with scaled sample counts
Results
• SPEC2006 C benchmarks
• Intel Core-2 platform using 64-bit binaries
• -O2 FDO with instrumented runs
  - 4–5% gain over default -O2 runs
• -O2 FDO with sampled profiles
  - Profile collection using -O2 binaries
  - ~60% of FDO instrumented gain
Q&A Thank You!