Rational Apex 4.0 Optimization “Beware the benchmark!”
Presentation Outline
• Outline Rational Apex optimization behaviour
• Demonstrate some of the optimization techniques used by modern compilers
• Show how these techniques defeat many of the assumptions made by traditional benchmarking suites
Rational Apex Optimization
• Optimization with Apex has 3 levels, controlled by the OPTIMIZATION_LEVEL switch
  • Level 0 – No optimization, maximize debuggability (this is the default)
  • Level 1 – Many optimizations performed, some debuggability maintained
  • Level 2 – All optimizations performed, debugging may be very limited in some code
• Optimization with Apex can have one of two objectives
  • Time – try to generate code that will execute in minimal time
  • Space – try to generate code that is as compact as possible
• These two objectives are not mutually exclusive!
Rational Apex Optimization
• Apex performs optimization in several different places
  • Front End – post semantics
    • Common sub-expression elimination
    • Code in-lining
    • Loop unrolling
    • Remove unused code from local scope
  • Machine independent instruction stream optimizer “optim”
    • Loop invariant hoisting
    • Range propagation
    • Constraint check elimination
    • Reduce memory movement
  • Machine specific code generator
    • Peep-hole optimization
• All optimization consumes extra CPU during compilation
• The default is off – OPTIMIZATION_LEVEL: 0
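As a rough illustration of two of the transformations listed above, the sketch below applies common sub-expression elimination and loop invariant hoisting by hand. It is written in Python purely for readability; the function names and the arithmetic are hypothetical, and it says nothing about how Apex implements either optimization.

```python
def scale_naive(values, a, b):
    """Straightforward version: evaluates (a * b) twice on every iteration."""
    out = []
    for v in values:
        # (a * b) is a common sub-expression and is also loop invariant
        out.append(v * (a * b) + (a * b))
    return out


def scale_optimized(values, a, b):
    """Hand-optimized version of scale_naive that returns the same results."""
    ab = a * b  # computed once, outside the loop (CSE plus hoisting)
    out = []
    for v in values:
        out.append(v * ab + ab)
    return out
```

Both functions return the same list for the same inputs; an optimizer performs this kind of rewrite so that the source can stay in the first form.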
Example Code – Summation of SQRT
• Simple routine that sums up square roots and prints the result
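The Ada source shown on this slide is not reproduced in the transcript. For orientation only, a routine of the kind described might look like the Python sketch below; the function name, the loop bounds, and the iteration count are assumptions, not values taken from the presentation.

```python
import math


def sum_sqrt(n):
    """Sum the square roots of 1..n (the 1..n range is an assumption)."""
    total = 0.0
    for i in range(1, n + 1):
        total += math.sqrt(float(i))  # integer to float conversion, then sqrt
    return total


if __name__ == "__main__":
    print(sum_sqrt(1000))  # 1000 is an arbitrary choice, not from the slides
```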
Optimization Level 0
• No inlining, no code elimination, no check elimination
• Disassembly of sum_sqrt.2.ada is 15845 lines long
• No unused code has been eliminated – all the code for generic_elementary_functions remains
Optimization Level 0 – Disassembly of sqrt
• 163 lines of assembly
• Slightly abridged
Optimization Level 0 – Disassembly of hardware
• 56 lines of disassembly for SQRT
• 10 instructions for SQRT_32
Optimization Level 0 – Summary
• Total of over 220 instructions generated for the code that we are interested in
  • Lots of it will be unused
  • Not to mention the rest of the code for the instantiation
• Code maps back to source easily
  • Code layout follows source
• Lots of overhead for this straightforward code
  • Subprogram prolog/epilog code
  • Stack checks
  • Register management
  • Subprogram call/return code (3 levels deep)
  • No delayed branch slots being filled
Optimization Level 2 – Observations
• Disassembly of sum_sqrt.2.ada is 85 lines long
• Entire loop and all the called subprogram code is now 12 instructions long
  • 5 instructions for “for” loop management
    • Includes 2 instructions for branching
  • 4 instructions for integer to float conversion
    • 2 are identical, as one copy is used to fill a delayed branch slot at the bottom of the loop
  • 1 instruction from the Text_Io code, used to fill a branch delay slot
  • 2 instructions to perform the actual Sqrt and summation
Optimization Level 2 – Observations
• The optimization objective was Time
  • Time is certainly optimized, but Space also benefited enormously
• Different optimization techniques combined to produce very effective code
  • Inlining of 3 levels of subprogram call eliminated a significant amount of subprogram prolog/epilog code
  • Range propagation determined that the argument to SQRT could never be less than zero, which allowed the argument check to be removed
  • Evaluation of compile-time static expressions resulted in a lot of code not being generated
    • Kind of floating point type – no case statement needed
    • Availability of hardware SQRT – no call needed to Has_Sqrt
  • Register lifetime analysis on the resulting code meant that the loop control variable and the summation variable could live in registers
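The interaction between inlining and range propagation can be sketched by writing out the "before" and "after" shapes by hand. The Python below is only an analogy: `checked_sqrt`, the 1..n loop range, and the two function names are hypothetical, and the real transformation happens on Ada code inside the compiler.

```python
import math


def checked_sqrt(x):
    """Library-style sqrt with an argument check, as a caller would see it."""
    if x < 0.0:
        raise ValueError("sqrt argument must be non-negative")
    return math.sqrt(x)


def sum_sqrt_level0(n):
    """Unoptimized shape: one checked subprogram call per iteration."""
    total = 0.0
    for i in range(1, n + 1):
        total += checked_sqrt(float(i))
    return total


def sum_sqrt_level2(n):
    """Shape after inlining and check removal.

    i ranges over 1..n, so float(i) >= 1.0 and the `x < 0.0` test can
    never succeed; the check is dead code and has been dropped.
    """
    total = 0.0
    for i in range(1, n + 1):
        total += math.sqrt(float(i))
    return total
```

Both versions compute the same sum; the second simply omits work that the value range of `i` proves unnecessary.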
Performing Benchmarks
• Benchmarks usually consist of two distinct loops
  • A “Null Timing” loop to determine the overhead of the loop code itself
  • The Code Under Test (C.U.T.) loop, which has the same structure as the Null Timing loop with the inside of the loop replaced by the C.U.T.
• Timing equation looks like
  • TCUT = (TCUT_loop – Tnull_loop) / n
  • Where n is the number of iterations
• Usually n has to be very high so that the resolution of the system clock is not significant in the result
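A minimal sketch of the two-loop scheme described above, in Python. The helper names, the stand-in code under test, and the iteration count are all assumptions made for illustration; only the equation itself comes from the slide.

```python
import time


def time_per_iteration(t_cut_loop, t_null_loop, n):
    """TCUT = (TCUT_loop - Tnull_loop) / n, all times in the same unit."""
    return (t_cut_loop - t_null_loop) / n


def time_loop(body, n):
    """Run body() n times and return the elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        body()
    return time.perf_counter() - start


def null_body():
    pass  # same loop structure, empty body


def cut_body():
    sum(range(100))  # hypothetical code under test


if __name__ == "__main__":
    n = 100_000  # large enough that clock resolution is not significant
    t_null = time_loop(null_body, n)
    t_cut = time_loop(cut_body, n)
    print(f"seconds per iteration: {time_per_iteration(t_cut, t_null, n):.3e}")
```

The subtraction only yields a meaningful figure if the null loop really costs the same as the loop scaffolding around the code under test.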
Performing Benchmarks
• One effect we notice is that sometimes a benchmark suite reports slower times for code even though we know we have improved our optimizations!
• What’s happening?
  • The Null Timing loops of benchmark suites attempt to defeat compiler optimizations that skew their results
  • Compilers are better at getting rid of unnecessary code, often defeating the smart null loop
  • So now the equation looks like: TCUT = (TCUT_loop – 0) / n
  • So the remaining loop overhead time gets included in the time of the Code Under Test, making it look worse than before
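A worked example of this skew, using made-up numbers (all timings below are hypothetical and in nanoseconds): the code under test gets faster, yet the reported figure gets worse because the null loop now measures as zero.

```python
def reported_time(t_cut_loop, t_null_loop, n):
    """Per-iteration time a benchmark reports: (TCUT_loop - Tnull_loop) / n."""
    return (t_cut_loop - t_null_loop) / n


N = 1_000_000  # iterations (hypothetical)

# Older compiler: loop overhead 2 ns/iteration, code under test 5 ns/iteration.
old_report = reported_time(t_cut_loop=7_000_000, t_null_loop=2_000_000, n=N)

# Newer compiler: code under test improved to 4 ns/iteration, but the null
# loop was optimized away entirely, so its measured time is 0.
new_report = reported_time(t_cut_loop=6_000_000, t_null_loop=0, n=N)

print(old_report, new_report)  # 5.0 6.0
```

In this scenario the real cost fell from 5 ns to 4 ns per iteration, while the reported cost rose from 5 ns to 6 ns.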
Performing Benchmarks
• One other effect we observe is that benchmarks often don’t do anything with the results they calculate
• Compilers can detect this and conclude that running the code has no effect and (very importantly) no side-effects
  • Range propagation concludes that overflow cannot be raised
  • Result is never used
  • Code is thrown away
• A good example is the Hennessy benchmark in the PIWG suite
  • Large matrix multiplications, using a range of values that will not result in overflow
  • Apex 4.0 reports zero time for that test
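One common response, not taken from the slides, is to make the benchmark consume what it computes, for example by returning and printing a checksum. The Python sketch below shows only the shape of such a harness: CPython does not discard unused computations the way an optimizing Ada compiler can, the function names and matrix contents are hypothetical, and whether a checksum is enough to keep the work alive under a given compiler has to be checked case by case.

```python
def matmul(a, b):
    """Plain triple-loop matrix multiply on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def benchmark_body(size):
    """Multiply two small matrices and return a checksum of the product."""
    a = [[i + j for j in range(size)] for i in range(size)]
    b = [[i - j for j in range(size)] for i in range(size)]
    product = matmul(a, b)
    return sum(sum(row) for row in product)  # result is consumed, not dropped


if __name__ == "__main__":
    print(benchmark_body(8))  # printing the checksum makes the work observable
```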
Performing Benchmarks
• When trying to compare different compiler technologies you need to look beyond the results printed by a benchmark program
  • Printed numbers can be very misleading
  • Look at absolute times and iteration counts
• Benchmarks don’t translate well between processor variants and processor types
• The best benchmark is your application
  • Or a sizable portion of it