Register Allocation

By the way, there are lots of examples, but I will probably forget to stop and let everyone digest or ask questions. Please feel free to stop me if I do.

Overview • Base case algorithm • Optimization goals • Improvements at basic block level • Improvements at function level • Practicalities • Going further

Register allocation • Registers hold values • Sometimes only certain types • Used for the input and output of instructions • Exclusively, in the case of RISC architectures • There are not many • Generally fewer for CISCs than RISCs • Need to swap values in and out of memory • “Fills” and “spills” • Optimization problem: minimize # fills, spills • Hard (in the human and theoretical sense)

Pseudo-registers • Isolate the complexity of registers with p-regs • Postpone the decision of what to put in registers • The idea: pretend you have infinite registers • Simplifies AST → IR • Simplifies high-level IR optimizations • Infinite, but not dynamically addressable • Each use refers to one static p-reg, no indirection: add $1001, $999, $1000; mul $1002, $1001, $1000 • Contrast with “the store” • Allows loads from dynamic addresses: lw $2, 4($1)

[Figure: pseudo-registers (..., $41, $42, ...) beside the store (addresses 0xffff0, 0xffff4, 0xffff8, 0xffffc). $41 holds 0xffff0; lw $42, 4($41) copies the value 7 from the store into $42.]

High-level organization • Lex, parse, check, IR codegen • Optimization (use p-regs; put as much work as possible here…) • Register allocation • Low-level optimization / ISA codegen (… and not here)

Pseudo-registers • After high-level optimization, IR → ISA • No more pseudo-registers • The idea: • Not enough registers: “spill” to memory • Need spilled contents: “fill” from memory • Important distinction: • Loads/stores (IR) vs. fills/spills (regalloc) • For example: • [$1] := $2 • IR: 1 store, 2 reads • Regalloc: 1 store, 0-2 fills • $2 := [$1] • IR: 1 load, 1 read, 1 write • Regalloc: 1 load, 0-1 fills, 0-1 spills

Simplest approach • Give every pseudo-register a home • The home is in memory, but separate from the store • Keep things in registers for as little time as possible • Every write to a p-reg spills • Every read from a p-reg fills • Efficient? • Extremely • Just kidding, this is the worst possible

Example
Source: A[x] = 3
IR: mul p, x, 4; add q, p, A; [q] := 3
Code generated:
    lw $t1, x($fp)
    muli $t2, $t1, #4
    sw $t2, p($fp)
    lw $t1, p($fp)
    lw $t2, A($fp)
    add $t3, $t1, $t2
    sw $t3, q($fp)
    ldi $t1, 3
    lw $t2, q($fp)
    sw $t1, 0($t2)
But on the bright side, there only need be as many registers as there are operands.
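A minimal Python sketch of the simplest approach, for illustration only. The function name `naive_codegen`, the tuple shape `(op, dst, src1, src2)`, and the `home` table of frame offsets are assumptions, not part of the slides. Every source operand is filled from its home and every result is spilled straight back.

```python
def naive_codegen(instrs, home):
    """Translate (op, dst, src...) tuples; every read fills, every write spills.

    `home` maps each p-reg to a hypothetical frame offset (its "home").
    """
    out = []
    for op, dst, *srcs in instrs:
        regs = []
        for i, src in enumerate(srcs):
            reg = f"$t{i + 1}"
            out.append(f"lw {reg}, {home[src]}($fp)")  # fill the operand
            regs.append(reg)
        dreg = f"$t{len(srcs) + 1}"
        out.append(f"{op} {dreg}, {', '.join(regs)}")
        out.append(f"sw {dreg}, {home[dst]}($fp)")     # spill the result
    return out


# Hypothetical homes for the p-regs of the A[x] = 3 example.
home = {"x": 0, "A": 4, "p": 8, "q": 12}
code = naive_codegen([("add", "q", "p", "A")], home)
for line in code:
    print(line)
```

Only three scratch registers are ever needed here, which is the "bright side" noted above, at the price of a memory access for every operand.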

Optimization goals • Clearly room for improvement • Want to minimize fills and spills • Same constraint as any optimization: • Preserve the observable behavior • That means for any execution path

Basic block level • Keep track of what p-reg is in what register • Avoid obviously redundant fill/fill, spill/fill • Handle multiple incoming paths to BB entry: • Reset records at beginning of BB • Handle multiple paths from BB to exit: • Spill every p-reg currently in a register • Described in more detail in the Dragon Book

Function level • Global register allocation • Old term, “global” means intra-procedural • Minor improvement: • Use dataflow analysis • Live variables • Avoid useless spills: • Do not spill if not live • At end of BB • When spilling to make room

Function level • ...and now the big algorithm • Major improvement • Same idea at the heart of modern compilers • Actually, two approaches: • Top-down • Bottom-up (a.k.a. graph coloring)

Global register allocation • Top-down register allocation [Chow, 84] • “Use high-level information to make allocation decisions.” • Priority function determines ordering • More pessimistic: assumes nothing live at the start • More conservative: coarser definition of interference • [Briggs, 92] found O(n log n) for bottom-up and O(n^2) for top-down • Research appears to favor bottom-up (# papers) • Industry too?

Global register allocation • Bottom-up register allocation [Chaitin, 81][Briggs, 94] • Step 0: dataflow analyses • Step 1: build webs • Step 2: build interference graph • Step 3: coalesce • Step 4: compute spill costs • Step 5: color • Step 6: spill

Step 0: dataflow analyses • Given: IR • Build the CFG • Find reaching definitions • Find live variables

[Figure: Step 0: dataflow analyses. An example CFG whose statements define and use x, shown three times: plain, annotated with live variables, and annotated with reaching defs.]
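As an illustration of Step 0, here is a minimal Python sketch of live-variable analysis as an iterative backward dataflow problem. The representation is an assumption: each block is summarized by a `(use, def)` pair of sets and `succ` lists its successors. Reaching definitions can be computed the same way in the forward direction.

```python
def live_variables(blocks, succ):
    """blocks: name -> (use set, def set); succ: name -> successor names."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for b, (use, defs) in blocks.items():
            # live-out is the union of the successors' live-in sets
            out = set().union(*(live_in[s] for s in succ[b]))
            # live-in = used here, or live after and not redefined here
            inn = use | (out - defs)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out


# Hypothetical CFG: entry defines x, a self-looping block uses x, exit redefines x.
blocks = {
    "entry": (set(), {"x"}),
    "loop": ({"x"}, set()),
    "exit": (set(), {"x"}),
}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = live_variables(blocks, succ)
print(live_in, live_out)
```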

Step 1: build webs • A web is: • a set of statements whose definitions and uses of a given pseudo-register must share a physical register • (In the classic approach) all or nothing: • All reads/writes fill/spill or none do [Diagram: one pseudo-register can have many webs; many webs can share one physical register.]

Step 1: build webs Web building approach: • Initially: • Each use and def points to a web containing only itself • For each statement: • For each use U of a p-reg in the statement: • For each reaching definition D (from step 0): • Merge D’s and U’s webs

[Figure: Step 1: build webs. The defs and uses of x in the example CFG, grouped into webs.]
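The web-building loop maps naturally onto a union-find structure. The Python sketch below is one possible rendering, not the slides' own code; the identifiers `d1`, `u1`, and so on are hypothetical labels for individual defs and uses of one p-reg, and `reaching` is assumed to come from Step 0.

```python
def build_webs(defs, reaching):
    """defs: def ids; reaching: use id -> def ids that reach that use."""
    parent = {}

    def find(n):
        parent.setdefault(n, n)  # initially each def/use is its own web
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for d in defs:
        find(d)
    for use, reaching_defs in reaching.items():
        for d in reaching_defs:
            parent[find(d)] = find(use)  # merge D's and U's webs

    webs = {}
    for n in list(parent):
        webs.setdefault(find(n), set()).add(n)
    return sorted(webs.values(), key=sorted)


# u1 is reached by d1 and d2, so d1, d2, u1, u2 end up in one web.
webs = build_webs(
    ["d1", "d2", "d3"],
    {"u1": ["d1", "d2"], "u2": ["d2"], "u3": ["d3"]},
)
print(webs)
```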

Step 2: interference graph • An interference graph is a graph where: • Nodes are webs (step 1) • Edges join webs that cannot occupy the same physical register • Overly conservative approach • Two webs interfere if they are both live at any statement • Better: • Two webs interfere if one is live at the other’s definition

Step 2: interference graph Interference graph building approach: • For each web W: • For each defining statement S in W: • For each reaching and live (step 0) definition D at statement S that is not in W: • W interferes with the web containing D Store results as both: • Triangular adjacency matrix: • Efficient form for coalescing step • Adjacency list: • Efficient form for coloring step

[Figure: Step 2: interference graph. An example CFG with defs and uses of x, y, and z and its interference graph, comparing alive-at-def with alive-at-same-statement.]
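A small Python sketch of the "live at the other's definition" rule. It assumes the dataflow results have already been reduced to a list of `(web defined, webs live at that definition)` pairs, which is a simplification of the loop described in Step 2, and it keeps only the adjacency-list form.

```python
def build_interference(def_sites):
    """def_sites: list of (web defined, set of webs live at that definition)."""
    adj = {}
    for web, live in def_sites:
        adj.setdefault(web, set())
        for other in live:
            if other != web:
                # record the edge in both directions
                adj[web].add(other)
                adj.setdefault(other, set()).add(web)
    return adj


# Hypothetical webs: x is live when y is defined, y is live when z is defined.
adj = build_interference([("x", set()), ("y", {"x"}), ("z", {"y"})])
print(adj)
```

Here x and z never interfere, so they could share a register even though all three webs belong to the same function.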

Step 3: coalesce • Given a copy statement: a := b • If a’s web and b’s web do not interfere: • All uses of a can be replaced with b or vice-versa • a or b could be fixed (parameter or return register) • Eliminate copy instruction • Redundant copies often are introduced by optimizations • Can have a negative effect: • Live range is longer → less coloring flexibility → more spilling • Optimistic coalescing [Park, 98] • Changes interference graph • Just merging edges is too conservative • Need to go back to step 2
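For illustration, a Python sketch of the coalescing test on an adjacency-set graph. It updates the graph by merging edges, which the slide notes is too conservative; a real allocator would rebuild the graph from Step 2 instead. The function name and the `(dst, src)` copy format are assumptions.

```python
def coalesce(adj, copies):
    """Merge the webs of each copy (dst, src) when they do not interfere."""
    adj = {n: set(ns) for n, ns in adj.items()}  # work on a copy
    alias = {}

    def rep(n):
        while n in alias:  # follow earlier merges
            n = alias[n]
        return n

    removed = []
    for dst, src in copies:
        a, b = rep(dst), rep(src)
        if a == b:
            removed.append((dst, src))  # already the same web
            continue
        if b in adj[a]:
            continue  # webs interfere: the copy must stay
        for n in adj.pop(b):  # conservative update: merge b's edges into a
            adj[n].discard(b)
            adj[n].add(a)
            adj[a].add(n)
        alias[b] = a
        removed.append((dst, src))
    return adj, alias, removed


graph = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
new_graph, alias, removed = coalesce(graph, [("a", "b"), ("c", "a")])
print(new_graph, alias, removed)
```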

Step 4: compute spill costs • Order webs by how expensive it is to spill • Take into account: • Number of uses and defs • Loop nesting depth • Possibility of rematerialization • This is a heuristic • Cannot generally know branch frequency, loop trip count (without profiling data)
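One common static estimate, sketched in Python. The weighting is an assumption rather than something the slides specify: each def or use counts as 10 raised to its loop nesting depth, as a stand-in for the unknown trip count.

```python
def spill_cost(ref_depths, loop_weight=10):
    """ref_depths: loop nesting depth of each def/use in the web."""
    return sum(loop_weight ** depth for depth in ref_depths)


# Hypothetical webs: w1 has two references inside a loop, w2 has none.
costs = {
    "w1": spill_cost([0, 1, 1]),     # 1 + 10 + 10
    "w2": spill_cost([0, 0, 0, 0]),  # 1 + 1 + 1 + 1
}
order = sorted(costs, key=costs.get)  # cheapest to spill first
print(costs, order)
```

Even though w2 has more references, w1 is the more expensive web to spill because its references sit inside a loop.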

Step 5: color • Problem: assign physical registers to webs • Reduces to map maker’s coloring problem: • Give each node of the interference graph a color property • Color = physical register • No two adjacent nodes can have the same color • Adjacent → edge between → cannot share a register • # available colors = # available registers • To address ISA restrictions: • Register classes: • Separate graphs • Other: • Add a node to the graph for every register • Register nodes are fully connected • Add an edge between a register and every web that cannot be allocated to that register

Step 5: color • Graph coloring is NP-complete for k >= 3 colors • But there are heuristics: • Don’t try to find the minimum: • Given k registers, try to use k colors • Just pick the best looking at the time • Might not be best overall • May not find a solution, even if it exists • Acceptable and fast

Step 5: color Optimistic graph coloring approach: [Chaitin, 81][Briggs, 94] • Initially: • Each node’s degree is the number of adjacent nodes • Until there are no more uncolored nodes: • If there is a node with degree < k • Choose it • Otherwise • Choose node with lowest spill cost (Step 4) (optimism here) • Lower degree of chosen node’s neighbors • Push onto stack • For each node of the stack (LIFO) • If there is a color not yet assigned to neighbors: • Use that color • Else (optimism failed; cold, hard reality) • Mark as spilled, keep uncolored

[Figure: Step 5: color. The example CFG (defs and uses of x, y, and z) and its interference graph, colored. Trivial if # registers >= # webs. A final variant adds register nodes to the graph to express some ISA restrictions.]
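The optimistic coloring loop can be sketched in Python as follows. This is a simplified reading of the steps above, not a full allocator: `adj` is the adjacency-set graph, `cost` the spill costs from Step 4, and colors are the integers 0 to k-1. Ties are broken by name only to keep the result deterministic.

```python
def color_graph(adj, cost, k):
    """Return (colors, spilled) for an interference graph and k registers."""
    degree = {n: len(adj[n]) for n in adj}
    remaining = set(adj)
    stack = []
    while remaining:
        easy = [n for n in remaining if degree[n] < k]
        if easy:
            node = min(easy)
        else:
            # optimism: push the cheapest-to-spill node anyway
            node = min(remaining, key=lambda m: (cost[m], m))
        remaining.remove(node)
        stack.append(node)
        for m in adj[node]:
            if m in remaining:
                degree[m] -= 1
    colors, spilled = {}, set()
    while stack:  # LIFO
        node = stack.pop()
        used = {colors[m] for m in adj[node] if m in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[node] = free[0]
        else:
            spilled.add(node)  # optimism failed: keep uncolored
    return colors, spilled


# A 4-cycle: every node has degree 2, yet 2 colors suffice.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
cycle_colors, cycle_spilled = color_graph(cycle, dict.fromkeys(cycle, 1), 2)

# A triangle cannot be 2-colored: the cheapest web is spilled.
tri = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
tri_colors, tri_spilled = color_graph(tri, {"x": 5, "y": 1, "z": 3}, 2)
print(cycle_colors, cycle_spilled, tri_colors, tri_spilled)
```

The 4-cycle shows why optimism helps: no node has degree < 2, so a pessimistic allocator would spill immediately, but deferring the decision to the coloring phase finds a valid 2-coloring.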

Step 6: spill • Maybe do not even need to spill: • Rematerialization (should be chosen first, lowest spill cost) • Better register usage: • Insert new load and store instructions • Creates new, very short webs • Interference graph changed, need to restart at Step 2 • Need to modify spill cost to make new web’s cost = ∞ • Simple approach: • Keep a set of registers reserved for filling/spilling • Add “spilt” flag to web • When emitting an instruction: • Load spilt webs of input into reserved registers • Execute • Use reserved register as destination of spilt web, then store

Epilogue: codegen • Now we know: • For each use/def, its web • For each web, whether it spills or not • If the web spills: • Same as the base case: reads fill, writes spill • Otherwise • Just use the web’s register as the operand • Example, for the IR instruction neg $1, $2: • If $1 → $s1 and $2 → $s2: neg $s1, $s2 • If $1 → $s1 and $2 spills to 28($sp): lw $t1, 28($sp); neg $s1, $t1
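A Python sketch of this final emission step, using the reserved-register scheme from Step 6. The `assign` table is a hypothetical encoding: a string names the web's register, while an integer is the stack offset of a spilled web. The scratch registers `$t1` and `$t2` are assumed to be reserved for fills and spills.

```python
def emit(op, dst, srcs, assign, scratch=("$t1", "$t2")):
    """Emit code for one IR instruction given each p-reg's location."""
    out, regs = [], []
    spare = iter(scratch)
    for p in srcs:
        loc = assign[p]
        if isinstance(loc, int):  # spilled web: fill into a reserved register
            reg = next(spare)
            out.append(f"lw {reg}, {loc}($sp)")
            loc = reg
        regs.append(loc)
    loc = assign[dst]
    # sources are read before the result is written, so scratch[0] can be reused
    dreg = scratch[0] if isinstance(loc, int) else loc
    out.append(f"{op} {dreg}, {', '.join(regs)}")
    if isinstance(loc, int):  # spilled destination: store it back
        out.append(f"sw {dreg}, {loc}($sp)")
    return out


both_in_regs = emit("neg", "$1", ["$2"], {"$1": "$s1", "$2": "$s2"})
src_spilled = emit("neg", "$1", ["$2"], {"$1": "$s1", "$2": 28})
print(both_in_regs, src_spilled)
```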

Practicalities • Webs that span calls: • Doesn’t appear to be addressed much [at all?] in literature • Caller- and callee-preserved registers • If a web spans a call, does it get split in two? • Not if it is callee-saved • Chicken-egg problem: • Splitting a caller-saved web changes the interference graph, makes it more colorable, which could change whether this web spans the call... • Simple heuristic: • During allocation: mark webs as either call-spanning or not • When picking registers: • Prefer caller-saved for non-spanning • Prefer callee-saved for spanning • If no callee-saved left, use caller-saved and spill/fill

Practical details • Parameters (passed on the stack) and globals • Want to keep in registers • Cannot for globals unless: • Simple: no calls • 10x harder: inter-procedural analysis says it’s ok • Need to insert “import” statements • Otherwise there will be use statements for a variable with no reaching defs; messes up algorithms • Where to put the imports? • CFG head • Makes long webs • Especially if variable only used near the end • As late as possible • Requires an analysis like Partial Redundancy Elimination

Further optimizations • Live-range splitting • Create contains graph • During coloring, use contains graph to split before resorting to spilling • Other variations in [Cooper, 04] [Figure: four versions of a CFG with overlapping defs and uses of x and y. Panels: “Need to spill x or y” (original), “Bad: spill x” (every def of x spills, every use fills), “Bad: spill y” (every def of y spills, every use fills), and “Good: split x” (x is spilled before the defs and uses of y and filled afterwards).]

Further optimizations • Stack allocation for fills/spills • Goal: minimize stack usage • Essentially, it’s the same problem we just solved: • P-reg is to register file as home is to stack memory

Further optimizations • Alias analysis for heap • a and y have pseudo-registers, so they may be kept live • a->x does not have a pseudo-register: it has a dynamic location. Loads and stores generated during code generation, even before register allocation. • Start by creating indirect pseudo-registers and postponing loads/stores • More problems...
    void foo(A *a) {
        int y = 1;
        for (; a->x < 1000; ++a->x)
            y += a->x;
    }

Further optimizations • Alias analysis for heap • Need to generate import of a->x before first use. • Only when original program would have. Cannot introduce new memory accesses. • Aliasing is a problem: • Does a2 point to the same object as a? • Inter-procedural calls: • Does bar() modify the object a points to? • Solutions: • Easy: very conservative • 10x harder: points-to analysis
    void foo(A *a, A *a2) {
        int y = 1;
        for (; a->x < 1000; ++a->x) {
            bar();
            y += a->x;
            a2->x *= a->x;
        }
    }

Further optimizations • Interaction with instruction scheduling and selection • Naïve approach: select, allocate, schedule • Not orthogonal • Scheduling goal: put as much space as possible between reads and writes. • Allocation goal: want short live ranges, so put definitions and uses close together. • Need to balance both interests • GCC: select, allocate 1, schedule 1, allocate 2, schedule 2

Further optimizations • List of techniques at end of chapter 13 in: Keith D. Cooper and Linda Torczon. Engineering a Compiler. 2004

References
[Briggs, 92] Preston Briggs. Register Allocation via Graph Coloring. Tech. Rept. CRPC-TR92218, Ctr. for Research on Parallel Computation, Rice Univ., Houston, TX, Apr. 1992.
[Briggs, 94] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, 16(3):428-455, May 1994.
[Chaitin, 81] Gregory J. Chaitin. Register allocation and spilling via graph coloring. United States Patent 4,571,678, February 1986.
[Chow, 84] Frederick C. Chow and John L. Hennessy. Register allocation by priority-based coloring. SIGPLAN Notices, 19(6):222-232, June 1984. Proceedings of the ACM SIGPLAN ’84 Symposium on Compiler Construction.
[Park, 98] Jinpyo Park and Soo-Mook Moon. Optimistic register coalescing. In Proceedings of the 1998 International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 196-204, October 1998.
