190 likes | 207 Views
Learn about architecture-dependent optimizations, functional units, and the importance of instruction scheduling in RISC architectures to minimize stalls and fill delay slots effectively.
E N D
Architecture-dependent optimizations Functional units, delay slots and dependency analysis
RISC architectures • The pipeline structure of modern architectures requires careful instruction scheduling. • If instruction I1 creates a value, there may be a latency that has to elapse before another instruction I2 can use this value • If an instruction awaits the result of a previous computation, the pipeline may have to stall until the result becomes available • A branch instruction is time-consuming and affects the contents of the instruction cache. Execution cannot start at destination before one or more cycles have elapsed.
Instruction scheduling • Purpose: minimize stalls and delays, fill delay slots with useful computations, minimize execution time of basic block. • Tool: dependency analysis. Uncover legal reorderings of instructions, available parallelism in basic blocks and beyond • Applications: • Filling delay slots is important for all programs • Dependency analysis is critical for reordering of loop computations on vector processors and others
Dependence relations • A data dependence is a constraint that arises from the flow of data between statements. Violating a data dependence by reordering may lead to incorrect results. • If S1 sets a value that S2 uses, this is flow dependence or true dependence between S1 and S2 • If S1 uses some variable’s value and S2 sets it, there is an antidependence between them. • If both S1 and S2 set the value of some variable, there is an output dependence between them. • If both S1 and S2 read the value of some variable there is an input dependence between then. This does not impose an ordering.
The dependence DAG of a basic block • There is an edge in the dependence dag if: • I1 writes a register or location that I2 uses: I1 fd I2 • I1 uses a register or location that I2 changes: I1 ad I2 • I1 and I2 write to the same register or location: I1 od I2 • I1 and I2 exhibit a structural hazard: a load followed by a store cannot be interchanged unless the addresses are known to be distinct: X := A[I]; A[J] := Y; -- cannot interchange, X might get Y • if there is an edge between I1 and I2, I2 must not start executing until I1 has executed for some number of cycles.
Example 1: R3 = [R15] 2: R4 = [R15 + 4] 3: R2 = R3 – R4 -- needs R4: stall one cycle 4: R5 = [R12] 5: R12 = R12 + 4 6: R6 = R3 * R5 7: [R15+4] = R3 8: R5 = R6 + 2 1 2 4 3 7 5 6 8
Contention for resources • Functional unit is pipelined, consists of multiple resources. Instructions through the pipeline may conflict on use of resources. • Eg: floating-point unit on MIPS: • A: Mantissa add • E: Exception test • M: Multiplier first stage • N: Multiplier second stage • R: Adder Round • S: Operand shift • U: unpack • Add uses successively U, S and A, A and R, R and S. (4 cycles) • Mul uses U, M, M, M, M, M and A, R. (7 cycles) • Conflict depends on relative starting time of two instructions. • Edges in dependency graph are labelled with latencies (>= 1).
Branch scheduling • Important use of dependency graph: fill delay slots (branch takes two cycles to reach destination) R2 = [R1] R2 = [R1] R3 = [R1+4] R3 = [R1+4] R4 = R2 + R3 (stall) R5 = R2 -1 R5 = R2 -1 goto L1 goto L1 R4 = R2 + R3 nop
Conditional jumps and delay slots • Instruction in delay slot is executed while jump is in progress. What if jump is not taken? Need mechanism to annull instruction. • Branch prediction: assume target is known, fill delay slot with first instruction in target block • If both destinations start with same instruction, ideal choice for delay slot • Good heuristics for loops: assume that a backwards conditional jump is usually taken. Move first instruction in loop to delay slot for branch at end • Call instruction has delay slot: fill with parameter push
A greedy algorithm: list scheduling • Finding optimal schedule for DAG is NP-complete • Simple algorithm is O (N2) at worst, usually linear • Roots of DAG are instructions without predecessors • First pass: from leaves to roots: compute latest possible starting time for each instruction to end of block • For leaf: execution time of instruction • For inner node: maximum delay imposed by successors • E.g. if In is followed by Im, Im can start at T – 4, and there is a latency of 2 between In and Im, then In must start by T – 6.
List scheduling: second pass • Second pass: from roots to leaves: schedule instructions with the greatest slack (farthest from block end) and that can start as early as possible from now. • At each step: • D1: candidates with the largest remaining delay • D2: candidates with the earliest possible starting time (computed from starting time of their predecessors) • Choose from D1 if unique, else from D2 if unique, else use heuristics: • Choose earliest starting time, or • Choose instruction that uses least used pipeline, or • Choose instruction that frees register.
Procedure integration: inlining • Calls make optimizations harder. There is a large payoff to local optimizations over large basic blocks: inlining subprogram bodies is often very effective: • It exposes the values of the actuals in the body • It creates larger basic blocks • It saves the cost of the call • Can be done at the tree level or at the RTL level. In both cases it can enable other optimizations. • Possible disadvantages: code size increase, debugging is harder
Inlining as a tree transformation • Treat body of subprogram as a generic unit • Each inlined body needs its own local variables • Global references are captured at the point of definition • Inlining works like instantiation: replace formals with actuals, complete analysis and expansion of inserted body • Replace multiple return statements where needed • Introduce temporary to hold return value of function
Name capture: recognize global entities function memo (x : integer) return integer is local : integer := x **2; begin Saved := Saved + local + x; -- Saved is global return Saved; end memo; … Val := memo (15); • Becomes declare local : integer := 15 ** 2; -- each inlining has its own result : integer; -- maybe superfluous if context is assignment begin Saved := Saved + local + 15l; -- Saved is the same entity in all inlinings result := Saved; end; Val := result;
Handling return statements • Subprogram needs a label to serve a single exit point. • In a function: identify target of result, or create temporary for it; replace return with assignment to target, followed by goto to exit label • In a procedure: replace return with goto to exit label • Optimizations • if body of function is single return statement and context is assignment, can replace right-hand side with expression • If procedure has no return statement, exit label is superfluous
Parameter passing • If actual is an expression, it is evaluated once: create temporary in block and replace formal with temporary Val2 := memo (x * f (z)); • Becomes: declare C1 : integer := x * f (z); local : integer := C1 ** 2; begin Saved := Saved + local + C1; Val2 := Saved; -- context is assignment end;
Parameter passing: variables • An in-out parameter is a location, cannot create a temporary for it: must use a renaming declaration. procedure incr (x : in out integer) is begin x := x + 1; end; … incr (a (i)); • Becomes declare c1 : integer renames a (i); begin c1 := c1 + 1; end;
Context includes more than global names • semantics of inlined call must be identical to original program • If constraint checks are not suppressed in the body, they must not be suppressed in the inlined block, even if suppressed at the point of call. • Status of constraint checks is part of closure of body to inline, must be applied when analyzing inlined block
Specialized inlining: loop unrolling • Create successive copies of body of loop: saves tests, makes bigger basic block, increases instruction level parallelism for j in 1 .. N loop loop_body; end loop; • Becomes for k in 1 .. N / r loop -- unroll r times loop_ body loop_body [j -> j +1] -- replace loop variable for each unrolling … loop_ body [j -> j +r -1] end; for k in N / r +1.. N loop loop_body end loop; -- leftover iterations