Memory Optimizations & Post-Compilation Techniques

CS 671, April 3, 2008

The Problem
• The placement of program text in memory matters
  • Large working sets → excessive TLB & page misses
  • Bad placement increases instruction cache misses
  • Pettis & Hansen report that on some benchmarks, 1 of 3 cycles was a cache miss (for PA-RISC)
  • Random placement leaves these effects to chance
• The plan
  • Discover execution-time paths
  • Rearrange the code to keep those paths in contiguous memory
  • Make heavy use of execution profiles

Does this work?
• Motivating examples within HP
• Pascal compiler
  • Moved frequently executed blocks to top of procedure
  • 40% reduction in instruction cache misses
  • 5% improvement in running time
• Fortran compiler
  • Rearranged object files before linking
  • Attempt to improve locality on calls
  • 20% throughput improvement

Two Major Issues
• Procedure placement
  • If A calls B, would like A & B in adjacent locations
  • On the same page means a smaller working set
  • Adjacent locations limit I-cache conflicts
  • Unfortunately, many procedures might call B (& A)
  • This is an issue for the linker
• Block placement
  • Same effects occur on a smaller scale
  • Fall-through branches create an additional incentive
  • Rarely executed code fills up the cache, too!
  • This is an issue for the compiler & optimizer

Procedure Placement
• Simple principles
  • Build the call graph
  • Annotate edges with execution frequencies
  • Use "closest is best" placement
  • A calls B most often → place A next to B
  • Keeps branches short (advantage on PA-RISC)
  • Direct-mapped I-cache → A & B unlikely to overlap in I-cache
• Profiling the call graph
  • Linker inserts a stub for each call that bumps a counter
  • Counters are kept in statically initialized storage (set to zero)
  • Adds overhead to execution, but only in training runs

Procedure Placement
• Computing an order
  • Combine all edges from A to B
  • Select the highest-weight edge, say XY
  • Combine X & Y, along with their common edges, XZ & YZ
  • Place X next to Y
  • Repeat until the graph cannot be reduced further
• [Figure: call graph on nodes X, Y, Z with edge weights 10, 4, and 2]
• May have disconnected subgraphs
• Must add new procedures at end
• Combining chains W X and Y Z that are joined by edges WZ & XY
  • Use weights in original graph
  • Largest weight closest
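The ordering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the linker implementation the slides describe: the function name `place_procedures`, the dictionary input format, and the edge assignment in the example (XY = 10, XZ = 4, YZ = 2) are assumptions made for the sketch. Chains are merged along the heaviest combined edge, and the orientation of each merge is chosen from the weights in the original graph.

```python
from itertools import product


def place_procedures(calls):
    """Greedy 'closest is best' ordering.

    calls: dict mapping (caller, callee) -> profiled call count (assumed format).
    Returns a list of chains; each chain is a list of procedures to lay out
    contiguously. Disconnected subgraphs end up in separate chains.
    """
    # Combine all edges between the same pair of procedures into one weight.
    weight = {}
    for (a, b), n in calls.items():
        if a != b:
            key = frozenset((a, b))
            weight[key] = weight.get(key, 0) + n

    def w(a, b):
        return weight.get(frozenset((a, b)), 0)

    # Start with every procedure in its own chain.
    chains = [[p] for p in sorted({p for pair in calls for p in pair})]

    while True:
        # Find the heaviest combined edge between two different chains.
        best = None
        for i in range(len(chains)):
            for j in range(i + 1, len(chains)):
                cw = sum(w(a, b) for a, b in product(chains[i], chains[j]))
                if cw > 0 and (best is None or cw > best[0]):
                    best = (cw, i, j)
        if best is None:
            break  # graph cannot be reduced further
        _, i, j = best
        c1, c2 = chains[i], chains[j]
        # Try the four orientations; keep the one whose junction has the
        # largest weight in the ORIGINAL graph ("largest weight closest").
        options = [(x, y) for x in (c1, c1[::-1]) for y in (c2, c2[::-1])]
        x, y = max(options, key=lambda xy: w(xy[0][-1], xy[1][0]))
        chains[i] = x + y
        del chains[j]
    return chains


# Assumed example weights: X-Y = 10, X-Z = 4, Y-Z = 2.
order = place_procedures({("X", "Y"): 10, ("X", "Z"): 4, ("Y", "Z"): 2})
print(order)  # [['Y', 'X', 'Z']]
```

In the example, X and Y are merged first; Z is then attached next to X rather than Y because the original XZ edge (4) outweighs YZ (2).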

Block Placement
• Targets branches with unequal execution frequencies
  • Make the likely case the "fall through" case
  • Move the unlikely case out-of-line & out-of-sight
• Potential benefits
  • Longer branch-free code sequences
  • More executed operations per cache line
  • Denser instruction stream → fewer cache misses
  • Moving unlikely code → denser page use & fewer page faults

Block Placement
• Moving infrequently executed code
• [Figure: control-flow graph on blocks B1, B2, B3, B4 with edge frequencies 1 and 1000, shown beside its code layout B1, B2, B3, B4]
• In the original layout
  • Unlikely path gets the fall-through (cheap) case
  • Likely path gets an extra branch
• Would like this to become
  • The unlikely block moves a long distance away (in another page, ...)
  • This branch goes away
  • Denser instruction stream

Block Placement
• Principles
  • Goal is to eliminate taken branches
  • Build up traces (single paths)
  • Work from profile data
    • Edges are better than blocks
  • Use a greedy, bottom-up strategy to combine blocks
• Gathering profile data
  • Insert code to count edges
  • Split critical edges
  • Use name mangling to separate data for different procedures

Block Placement
• The Idea
  • Form chains that should be placed as straight-line code
• The Algorithm
    1. Make each block a degenerate chain & set its priority to # blocks
    2. P ← 1
    3. ∀ edge e = <x,y> in the CFG, in order by decreasing frequency
         if x is the tail of chain a and y is the head of chain b
           then merge a and b
           else set priority(y) to min(priority(y), P++)
• The point of the else case is to place targets after their sources, to make forward branches
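The chain-formation steps above can be sketched as follows. The function name `build_chains`, the input format, and the edge frequencies in the example are assumptions made for illustration; the sketch also adds one guard the pseudocode leaves implicit, namely that a chain is never merged with itself.

```python
def build_chains(blocks, edges):
    """Greedy, bottom-up chain formation.

    blocks: list of block names (assumed format).
    edges:  dict mapping (x, y) -> profiled edge frequency (assumed format).
    Returns (chains, priority): the list of chains and the per-block priority.
    """
    chain_of = {b: [b] for b in blocks}          # step 1: degenerate chains
    priority = {b: len(blocks) for b in blocks}  # priority = number of blocks
    p = 1                                        # step 2
    # Step 3: visit edges by decreasing frequency (ties keep input order).
    for (x, y), _freq in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[x], chain_of[y]
        if a is not b and a[-1] == x and b[0] == y:
            a.extend(b)                          # merge chain b onto chain a
            for blk in b:
                chain_of[blk] = a
        else:
            priority[y] = min(priority[y], p)
            p += 1
    # Collect each distinct chain once, in block order.
    chains = []
    for blk in blocks:
        if not any(chain_of[blk] is c for c in chains):
            chains.append(chain_of[blk])
    return chains, priority


# Assumed profile: the B1 -> B3 -> B4 path is hot, the path through B2 is cold.
cfg_edges = {("B1", "B2"): 1, ("B1", "B3"): 1000,
             ("B2", "B4"): 1, ("B3", "B4"): 1000}
chains, priority = build_chains(["B1", "B2", "B3", "B4"], cfg_edges)
print(chains)  # [['B1', 'B3', 'B4'], ['B2']]
```

With these frequencies the hot path becomes a single straight-line chain and the cold block B2 is left as its own chain, ready to be placed elsewhere.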

Block Placement
• Now, to lay out the code
    WorkList ← chain containing the entry node, n0
    While (WorkList ≠ Ø)
      Pick the chain c with lowest priority(c) from WorkList
      Place it next in the code
      ∀ edge <c,z> leaving c, add z to WorkList
• Intuition
  • Entry node first
  • Tries to make the edge from chain i to chain j a forward branch
    • Predicted not-taken on target machine
    • Edge remains only if it is the lower-probability choice
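A worklist layout along these lines might look like the sketch below. The function name `layout_chains` and the input format are assumptions. Two details are filled in that the pseudocode leaves open: a chain's priority is taken to be the minimum priority of its blocks, and a chain that has already been placed or queued is not added to the worklist again (otherwise the loop would not terminate).

```python
def layout_chains(chains, priority, edges, entry):
    """Order chains for emission, starting from the chain holding the entry block.

    chains:   list of chains, each a list of block names (assumed format).
    priority: dict block -> priority, as computed during chain formation.
    edges:    iterable of (x, y) CFG edges.
    """
    chain_of = {b: i for i, c in enumerate(chains) for b in c}

    def chain_priority(i):
        # Assumption: a chain inherits the lowest priority of its blocks.
        return min(priority[b] for b in chains[i])

    worklist = {chain_of[entry]}
    placed = set()
    order = []
    while worklist:
        # Pick the chain with the lowest priority (index breaks ties).
        c = min(worklist, key=lambda i: (chain_priority(i), i))
        worklist.remove(c)
        placed.add(c)
        order.extend(chains[c])          # place it next in the code
        for x, z in edges:               # every edge leaving chain c
            t = chain_of[z]
            if chain_of[x] == c and t not in placed:
                worklist.add(t)
    return order


# Hypothetical inputs of the kind chain formation would produce.
example_chains = [["B1", "B3", "B4"], ["B2"]]
example_priority = {"B1": 4, "B2": 1, "B3": 4, "B4": 2}
example_edges = [("B1", "B2"), ("B1", "B3"), ("B2", "B4"), ("B3", "B4")]
print(layout_chains(example_chains, example_priority, example_edges, "B1"))
# ['B1', 'B3', 'B4', 'B2']
```

The hot chain is emitted first, and the cold block B2 lands after it, so the branch from B1 to B2 becomes a forward branch.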

Going Further: Procedure Splitting
• Any code that has zero profile is "fluff"
• Move fluff into the distance
  • It rarely executes
  • Get more useful operations into the I-cache
  • Increase effective density of the I-cache
  • Slower execution for rarely executed code
• Implementation
  • Create a linkage-less procedure with an invented name
  • Give it a priority that the linker will sort to the code's end
  • Replace the branch with a call (a stub that does the call)
  • Branch to the call at the end of the procedure to maintain density

Putting It Together
• Procedure placement is done in the linker
• Block placement is done in the optimizer
  • Allows branch elision due to fluff, other tailoring
• Speedups averaged from 2 to 26%, depending on cache size
• This idea became popular on early 1990s PCs
  • Long cache lines
  • Slow page faults
  • Microsoft insiders suggested it was the most important optimization for codes like Office (Word, Excel)
• Why?

Peephole Optimization & Other Post-Compilation Techniques

The Problem
• After compilation, the code still has some flaws
  • Scheduling & allocation really are NP-Complete
  • Optimizer may not implement every needed transformation
• Curing the problem
  • More work on scheduling and allocation
  • Implement more optimizations
  • or
  • Optimize after compilation
    • Peephole optimization
    • Link-time optimization

Peephole Optimization
• The Basic Idea
  • Discover local improvements by looking at a window on the code
  • A tiny window (a peephole) is good enough
  • Slide the peephole over the code and examine the contents
  • Pattern match with a limited set of patterns
• Examples
  • Store followed by a load of the same location
      storeAI r1 ⇒ r0,8
      loadAI  r0,8 ⇒ r15
    becomes
      storeAI r1 ⇒ r0,8
      cp      r1 ⇒ r15
  • Algebraic identity
      addI r2,0 ⇒ r7
      mult r4,r7 ⇒ r10
    becomes
      mult r4,r2 ⇒ r10
  • Jump to a jump
      jumpI → l10
      l10: jumpI → l11
    becomes
      jumpI → l11
      l10: jumpI → l11
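The three example patterns can be sketched as a small hand-coded matcher. Everything about the representation is an assumption made for the sketch: operations are Python tuples, `peephole` is a hypothetical name, and the `dead` parameter stands in for liveness information (the add-of-zero rewrite is only safe when the intermediate register is not used later, which the slide's example takes for granted).

```python
def peephole(ops, dead=frozenset()):
    """Slide a small window over `ops` and apply three hand-coded patterns.

    Assumed tuple formats:
      ("storeAI", src, base, off)   ("loadAI", base, off, dst)
      ("cp", src, dst)              ("addI", src, imm, dst)
      ("mult", a, b, dst)           ("jumpI", label)   ("label", name)
    dead: registers assumed to be unused after the window.
    """
    ops = list(ops)
    i = 0
    while i < len(ops):
        a = ops[i]
        b = ops[i + 1] if i + 1 < len(ops) else None
        c = ops[i + 2] if i + 2 < len(ops) else None
        # Store followed by a load of the same address: replace load by copy.
        if (b and a[0] == "storeAI" and b[0] == "loadAI"
                and a[2:4] == b[1:3]):
            ops[i + 1] = ("cp", a[1], b[3])
            i += 1
            continue
        # Add of zero feeding a multiply: use the original register instead.
        if (b and a[0] == "addI" and a[2] == 0 and b[0] == "mult"
                and a[3] in b[1:3] and a[3] in dead):
            srcs = tuple(a[1] if r == a[3] else r for r in b[1:3])
            ops[i:i + 2] = [("mult",) + srcs + (b[3],)]
            continue  # re-examine the rewritten operation
        # Jump to a jump: retarget the first jump.
        if (b and c and a[0] == "jumpI" and b == ("label", a[1])
                and c[0] == "jumpI"):
            ops[i] = ("jumpI", c[1])
            i += 1
            continue
        i += 1
    return ops


code = [
    ("storeAI", "r1", "r0", 8), ("loadAI", "r0", 8, "r15"),
    ("addI", "r2", 0, "r7"), ("mult", "r4", "r7", "r10"),
    ("jumpI", "l10"), ("label", "l10"), ("jumpI", "l11"),
]
for op in peephole(code, dead={"r7"}):
    print(op)
```

A hand-coded matcher like this is quick because both the window and the pattern set are small; the price is that every new pattern must be written and checked by hand.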

Peephole Optimization
• Early Peephole Optimizers (McKeeman)
  • Used a limited set of hand-coded patterns
  • Matched with exhaustive search
  • Small window, small pattern set → quick execution
  • Window of 2 to 5 ops
• They proved effective at cleaning up the rough edges
  • Code generation is inherently local
  • Boundaries between local regions are trouble spots
• Improvements in code gen, opt, & architecture should have let these fade into obscurity
  • Much better allocation & scheduling today than in 1965
  • But, we have much more complex architectures

Peephole Optimization
• Modern Peephole Optimizers (Davidson, Fraser)
  • Larger, more complex ISAs → larger pattern sets
  • This has produced a more systematic approach
• [Figure: ASM → Expander (ASM→LLIR) → LLIR → Simplifier (LLIR→LLIR) → LLIR → Matcher (LLIR→ASM) → ASM]
• Expander
  • Operation-by-operation expansion into LLIR
  • Needs no context
  • Captures full effect of an operation

Peephole Optimization
• Modern Peephole Optimizers (Davidson, Fraser), continued
• Simplifier
  • Single pass over LLIR, moving the peephole
  • Forward substitution, algebraic simplification, constant folding, & eliminating useless effects (must know what is dead)
  • Eliminate as many LLIR operations as possible

Peephole Optimization
• Modern Peephole Optimizers (Davidson, Fraser), continued
• Matcher
  • Starts with reduced LLIR program
  • Compares LLIR from peephole against pattern library
  • Selects 1+ ASM patterns that "cover" the LLIR

Finding Dead Effects
• Example: expand, simplify, match
    ASM (input):
      mult r5,r9 ⇒ r12
      add  r12,r17 ⇒ r13
    LLIR (after expand):
      r12 ← r5 * r9
      cc  ← f(r5*r9)
      r13 ← r12 + r17
      cc  ← f(r12+r17)
    ASM (after simplify and match):
      madd r5,r9,r17 ⇒ r13
  • The condition-code effect cc ← f(r5*r9) would prevent multiply-add from matching
• The simplifier must know what is useless (i.e., dead)
• Expander works in a context-independent fashion
  • It can process the operations in any order
  • Use a backward walk and compute local LIVE information
  • Tag each operation with a list of useless values
• What about non-local effects?
  • Most useless effects are local: DEF & USE in same block
  • It can be conservative & assume LIVE until proven dead
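The backward walk that finds dead effects can be sketched as below. The representation (each LLIR operation as a `(dest, uses)` pair) and the name `tag_dead_effects` are assumptions for illustration. The `live_out` set plays the conservative role described above: anything that might be used after the block is listed as live.

```python
def tag_dead_effects(block, live_out):
    """Walk a block backward and tag operations whose result is never used.

    block:    list of (dest, uses) pairs in program order (assumed format).
    live_out: names that must be assumed live at the end of the block.
    Returns a list of booleans, True where the operation's effect is dead.
    """
    live = set(live_out)
    dead = [False] * len(block)
    for i in range(len(block) - 1, -1, -1):
        dest, uses = block[i]
        if dest not in live:
            dead[i] = True      # nobody reads this value before it is redefined
            continue            # a dead operation keeps none of its uses alive
        live.discard(dest)      # this definition satisfies the later uses
        live.update(uses)       # its operands are live above this point
    return dead


# Expanded LLIR for: mult r5,r9 => r12 ; add r12,r17 => r13
llir = [
    ("r12", ["r5", "r9"]),
    ("cc",  ["r5", "r9"]),     # condition code set by the multiply
    ("r13", ["r12", "r17"]),
    ("cc",  ["r12", "r17"]),   # condition code set by the add
]
# Conservatively assume both r13 and cc are live on exit from the block.
print(tag_dead_effects(llir, live_out={"r13", "cc"}))
# [False, True, False, False]
```

Even with `cc` assumed live on exit, the walk proves that the multiply's condition-code effect is dead, because the add overwrites `cc` before anything reads it.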

Peephole Optimization
• Can use it to perform instruction selection
  • Key issue in selection is effective pattern matching
• [Figure: ASM → Expander (ASM→LLIR) → LLIR → Simplifier (LLIR→LLIR) → LLIR → Matcher (LLIR→ASM) → ASM]
• Using peephole system for instruction selection
  • Have front-end generate LLIR directly
  • Eliminates need for the Expander
  • Keep Simplifier and Matcher
  • Add a simple register assigner, follow with real allocation
• This basic scheme is used in GCC

Peephole-Based Selection
• Basic Structure of Compilers like GCC
  • Uses RTL as its IR (very low level)
  • Numerous optimization passes
  • Quick translation into RTL limits what optimizer can do ...
  • Matcher generated from spec (hard-coded tree-pattern matcher)
• [Figure: Source → Front End (Source→LLIR) → Optimizer (LLIR→LLIR) → Simplifier (LLIR→LLIR) → Matcher (LLIR→ASM) → Allocator (ASM→ASM) → ASM]

An Example
• Original code: w ← x - 2 * y
• Translation: the original code (or the compiler's IR) passes through the Expander to produce LLIR
    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← M(r12)
    r14 ← r10 × r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← M(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    M(r20) ← r18

Simplification: 3-Operation Window
• Original code: the 12-operation LLIR sequence, r10 ← 2 through M(r20) ← r18
• Successive window contents as the simplifier slides over the code
     1.  r10 ← 2           | r11 ← @y          | r12 ← r0 + r11
     2.  r10 ← 2           | r12 ← r0 + @y     | r13 ← M(r12)
     3.  r10 ← 2           | r13 ← M(r0 + @y)  | r14 ← r10 × r13
     4.  r13 ← M(r0 + @y)  | r14 ← 2 × r13     | r15 ← @x
     5.  r14 ← 2 × r13     | r15 ← @x          | r16 ← r0 + r15
     6.  r14 ← 2 × r13     | r16 ← r0 + @x     | r17 ← M(r16)
     7.  r14 ← 2 × r13     | r17 ← M(r0 + @x)  | r18 ← r17 - r14
     8.  r17 ← M(r0 + @x)  | r18 ← r17 - r14   | r19 ← @w
     9.  r18 ← r17 - r14   | r19 ← @w          | r20 ← r0 + r19
    10.  r18 ← r17 - r14   | r20 ← r0 + @w     | M(r20) ← r18
    11.  r18 ← r17 - r14   | M(r0 + @w) ← r18
• No further improvement is found

Example, Continued
• Simplification shrinks the code significantly
    Before (12 operations):
      r10 ← 2
      r11 ← @y
      r12 ← r0 + r11
      r13 ← M(r12)
      r14 ← r10 × r13
      r15 ← @x
      r16 ← r0 + r15
      r17 ← M(r16)
      r18 ← r17 - r14
      r19 ← @w
      r20 ← r0 + r19
      M(r20) ← r18
    After Simplify (5 operations):
      r13 ← M(r0 + @y)
      r14 ← 2 × r13
      r17 ← M(r0 + @x)
      r18 ← r17 - r14
      M(r0 + @w) ← r18
  • Takes 5 operations instead of 12
  • Uses 4 registers instead of 11
• Match
      loadAI  r0,@y ⇒ r13
      multI   r13,2 ⇒ r14
      loadAI  r0,@x ⇒ r17
      sub     r17,r14 ⇒ r18
      storeAI r18 ⇒ r0,@w
• and, we're done ...

Other Considerations
• Control-flow operations
  • Can clear simplifier's window at branch or label
  • More aggressive approach: combine across branches
    • Must account for effects on all paths
    • Not clear that this pays off ...
  • Same considerations arise with predication
• Physical versus logical windows
  • Can run optimizer over a logical window
    • k operations connected by DEF-USE chains
  • Expander can link DEFs & USEs
  • Logical windows (within block) improve effectiveness
• Davidson & Fraser report 30% faster & 20% fewer ops with local logical window.

Peephole Optimization
• So, ...
• Peephole optimization remains viable
  • Post allocation improvements
  • Cleans up rough edges
• Peephole technology works for selection
  • Description driven matchers
  • Used in several important systems
• Simplification pays off late in process
  • Low-level substitution, identities, folding, & dead effects

Other Post-Compilation Techniques
• What else makes sense to do after compilation?
• Profile-guided code positioning
  • Allocation intact, schedule intact
• Cross-jumping
  • Allocation intact, schedule changed
• Hoisting
  • Changes allocation & schedule, needs data-flow analysis
• Procedure abstraction
  • Changes allocation & schedule, really needs an allocator
• Register scavenging
  • Changes allocation & schedule, purely local transformation
• Bit-transition reduction
  • Schedule & allocation intact, assignment changed

Register Scavenging
• Simple idea
  • Global allocation does a good job on the big picture items
  • Leaves behind blocks where some registers are unused
• Let's scavenge those unused registers
  • Compute LIVE information
  • Walk each block to find underallocated region
    • Find spilled local subranges
    • Opportunistically promote them to registers
• A note of realism:
  • Opportunities exist, but this is a 1% to 2% improvement
• T.J. Harvey, Reducing the Impact of Spill Code, MS Thesis, Rice University, May 1998

Bit-transition Reduction
• Inter-operation bit-transitions relate to power consumption
  • Large fraction of CMOS power is spent switching states
• Same op on same functional unit costs less power
  • All other things being equal
• Simple idea
  • Reassign registers to minimize inter-operation bit transitions
  • Build some sort of weighted graph
  • Use a greedy algorithm to pick names by distance
  • Should reduce power consumption in fetch & decode hardware
• Toburen's MS thesis
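To make "pick names by distance" concrete, the sketch below counts bit transitions between consecutive instruction words and greedily chooses, from a set of candidate register numbers, the one that minimizes that count. The encoding is a toy assumption (three 5-bit register fields, no opcode bits), not any real ISA, and `encode`, `bit_transitions`, and `pick_register` are hypothetical names; it is not the algorithm from the thesis cited above.

```python
def encode(dst, src1, src2):
    """Toy instruction word: three 5-bit register fields (assumed encoding)."""
    return (dst << 10) | (src1 << 5) | src2


def bit_transitions(words):
    """Total Hamming distance between each pair of consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def pick_register(candidates, make_words):
    """Greedy choice: the candidate whose encoding flips the fewest bits.

    make_words: function mapping a register number to the list of
    instruction words that result from using it.
    """
    return min(candidates, key=lambda r: bit_transitions(make_words(r)))


# Two operations sharing a temporary v:   v <- r2 + r3 ;  r1 <- v + r3
def words_for(v):
    return [encode(v, 2, 3), encode(1, v, 3)]


print(bit_transitions(words_for(12)))      # 6 bits flip if v is r12
print(bit_transitions(words_for(3)))       # 2 bits flip if v is r3
print(pick_register([12, 3], words_for))   # 3
```

A real reassignment pass would also have to respect interference between live ranges; this sketch only shows the cost function and the greedy choice.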

Bit-transition Reduction
• Other transformations
  • Swap operands on commutative operators
    • More complex than it sounds
    • Shoot for zero-transition pairs
  • Swap operations within "fetch packets"
    • Works for superscalar, not VLIW
  • Consider bit transitions in scheduling
    • Same ops to same functional unit
    • Nearby (Hamming distance) ops next, and so on ...
  • Factor bit transitions into instruction selection
    • Maybe use a BURS model with dynamic costs
• Again, most of this fits into a post-compilation framework ...

Summary
• Memory hierarchy is often the bottleneck
  • Memory optimizations are very important
  • We've only scratched the surface
• Many optimizations can be applied "post-compile time"
  • Procedure placement
  • Peephole optimizations
