
Architecture-dependent optimizations

Learn about architecture-dependent optimizations, functional units, and the importance of instruction scheduling in RISC architectures to minimize stalls and fill delay slots effectively.


  1. Architecture-dependent optimizations Functional units, delay slots and dependency analysis

  2. RISC architectures • The pipeline structure of modern architectures requires careful instruction scheduling. • If instruction I1 creates a value, a latency may have to elapse before another instruction I2 can use that value. • If an instruction awaits the result of a previous computation, the pipeline may have to stall until the result becomes available. • A branch instruction is time-consuming and affects the contents of the instruction cache: execution cannot start at the destination before one or more cycles have elapsed.

  3. Instruction scheduling • Purpose: minimize stalls and delays, fill delay slots with useful computations, minimize execution time of basic block. • Tool: dependency analysis. Uncover legal reorderings of instructions, available parallelism in basic blocks and beyond • Applications: • Filling delay slots is important for all programs • Dependency analysis is critical for reordering of loop computations on vector processors and others

  4. Dependence relations • A data dependence is a constraint that arises from the flow of data between statements. Violating a data dependence by reordering may lead to incorrect results. • If S1 sets a value that S2 uses, there is a flow dependence (true dependence) between S1 and S2. • If S1 uses some variable’s value and S2 sets it, there is an antidependence between them. • If both S1 and S2 set the value of some variable, there is an output dependence between them. • If both S1 and S2 read the value of some variable, there is an input dependence between them. This does not impose an ordering.

  5. The dependence DAG of a basic block • There is an edge in the dependence DAG if: • I1 writes a register or location that I2 uses: I1 fd I2 (flow dependence) • I1 uses a register or location that I2 changes: I1 ad I2 (antidependence) • I1 and I2 write to the same register or location: I1 od I2 (output dependence) • I1 and I2 exhibit a structural hazard: a load followed by a store cannot be interchanged unless the addresses are known to be distinct: X := A[I]; A[J] := Y; -- cannot interchange, X might get Y • If there is an edge between I1 and I2, I2 must not start executing until I1 has executed for some number of cycles.
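The classification above can be made mechanical: given each instruction's read and write sets, pairwise intersection yields the edge kinds of the dependence DAG. A minimal sketch follows; the (reads, writes) encoding is an illustrative assumption, not notation from the slides.

```python
# Sketch: classify dependences between instructions of a basic block
# from read/write sets, yielding the edges of the dependence DAG.
# The (reads, writes) pair encoding is illustrative, not from the slides.

def dependences(instrs):
    """instrs: list of (reads, writes) sets in program order.
    Returns edges (i, j, kind) with i < j."""
    edges = []
    for i, (ri, wi) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            rj, wj = instrs[j]
            if wi & rj:
                edges.append((i, j, "flow"))    # i writes what j reads
            if ri & wj:
                edges.append((i, j, "anti"))    # i reads what j overwrites
            if wi & wj:
                edges.append((i, j, "output"))  # both write the same location
    return edges

# First three instructions of the slide 6 example:
# 1: R3 = [R15]   2: R4 = [R15 + 4]   3: R2 = R3 - R4
block = [({"R15"}, {"R3"}),
         ({"R15"}, {"R4"}),
         ({"R3", "R4"}, {"R2"})]
print(dependences(block))   # both loads feed instruction 3
```

Memory disambiguation (the structural-hazard bullet) is deliberately left out here: it needs address analysis, not just name sets.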

  6. Example
  1: R3 = [R15]
  2: R4 = [R15 + 4]
  3: R2 = R3 – R4 -- needs R4: stall one cycle
  4: R5 = [R12]
  5: R12 = R12 + 4
  6: R6 = R3 * R5
  7: [R15+4] = R3
  8: R5 = R6 + 2
  [Figure: dependence DAG over instructions 1–8]

  7. Contention for resources • A functional unit is pipelined and consists of multiple resources. Instructions moving through the pipeline may conflict on the use of resources. • E.g., the floating-point unit on MIPS: • A: Mantissa add • E: Exception test • M: Multiplier first stage • N: Multiplier second stage • R: Adder round • S: Operand shift • U: Unpack • Add uses successively U, S and A, A and R, R and S (4 cycles). • Mul uses U, M, M, M, M, M and A, R (7 cycles). • Whether two instructions conflict depends on their relative starting times. • Edges in the dependency graph are labelled with latencies (>= 1).
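The conflict test the slide describes can be sketched with reservation tables: a set of stages occupied per cycle, intersected at every relative offset. The per-cycle stage assignments below are an assumed reading of the slide's Add and Mul descriptions, not an authoritative model of any real MIPS implementation.

```python
# Sketch: structural-hazard check between two pipelined FP operations
# via reservation tables.  The stage-per-cycle tables are an assumed
# reading of the slide's "Add uses U, S+A, A+R, R+S" and
# "Mul uses U, M, M, M, M, M+A, R"; they are illustrative only.

ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]             # 4 cycles
MUL = [{"U"}, {"M"}, {"M"}, {"M"}, {"M"}, {"M", "A"}, {"R"}]  # 7 cycles

def conflicts(first, second, offset):
    """True if `second`, issued `offset` cycles after `first`,
    needs a stage in a cycle where `first` still occupies it."""
    for c, stages in enumerate(second):
        t = c + offset                 # which cycle of `first` overlaps
        if 0 <= t < len(first) and first[t] & stages:
            return True
    return False

print(conflicts(ADD, MUL, 0))   # True: both need U in the same cycle
print(conflicts(ADD, MUL, 1))   # False: one cycle apart, no shared stage
```

This is why the conflict "depends on the relative starting time of two instructions": the same pair may be legal at one offset and illegal at another.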

  8. Branch scheduling • Important use of the dependency graph: fill delay slots (a branch takes two cycles to reach its destination)
  Before scheduling:        After scheduling:
  R2 = [R1]                 R2 = [R1]
  R3 = [R1+4]               R3 = [R1+4]
  R4 = R2 + R3 (stall)      R5 = R2 - 1
  R5 = R2 - 1               goto L1
  goto L1                   R4 = R2 + R3 -- fills the delay slot
  nop

  9. Conditional jumps and delay slots • The instruction in the delay slot is executed while the jump is in progress. What if the jump is not taken? Need a mechanism to annul the instruction. • Branch prediction: assume the target is known, fill the delay slot with the first instruction of the target block. • If both destinations start with the same instruction, it is the ideal choice for the delay slot. • Good heuristic for loops: assume that a backwards conditional jump is usually taken. Move the first instruction of the loop into the delay slot of the branch at the end. • The call instruction has a delay slot: fill it with a parameter push.

  10. A greedy algorithm: list scheduling • Finding an optimal schedule for a DAG is NP-complete • A simple algorithm is O(N²) at worst, usually linear • Roots of the DAG are instructions without predecessors • First pass, from leaves to roots: compute the latest possible starting time for each instruction, relative to the end of the block • For a leaf: the execution time of the instruction • For an inner node: the maximum delay imposed by its successors • E.g., if In is followed by Im, Im must start by T – 4, and there is a latency of 2 between In and Im, then In must start by T – 6.

  11. List scheduling: second pass • Second pass, from roots to leaves: schedule instructions with the greatest remaining delay (farthest from the block end) that can start as early as possible. • At each step: • D1: candidates with the largest remaining delay • D2: candidates with the earliest possible starting time (computed from the starting times of their predecessors) • Choose from D1 if unique, else from D2 if unique, else use heuristics: • Choose the earliest starting time, or • Choose the instruction that uses the least-used pipeline, or • Choose an instruction that frees a register.
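The two passes can be sketched as follows. The DAG encoding, the one-instruction-per-cycle issue model, and the single tie-break (largest remaining delay only) are simplifying assumptions; the full heuristic cascade of the slide is omitted.

```python
# Compact sketch of list scheduling: pass 1 computes each instruction's
# longest latency path to the end of the block (its remaining delay);
# pass 2 greedily issues, each cycle, the ready instruction with the
# greatest remaining delay.  One issue per cycle is assumed.

def list_schedule(n, edges):
    """n instructions 0..n-1 in program order;
    edges = [(i, j, lat)]: j may start no earlier than lat cycles after i."""
    succs = {i: [] for i in range(n)}
    preds = {i: [] for i in range(n)}
    for i, j, lat in edges:
        succs[i].append((j, lat))
        preds[j].append((i, lat))

    # Pass 1, leaves to roots (edges point forward in program order).
    delay = [1] * n
    for i in reversed(range(n)):
        for j, lat in succs[i]:
            delay[i] = max(delay[i], lat + delay[j])

    # Pass 2, roots to leaves: issue the ready instruction with the
    # greatest remaining delay; advance the clock when nothing is ready.
    start, order, cycle = {}, [], 0
    remaining = set(range(n))
    while remaining:
        ready = [i for i in remaining
                 if all(p in start and start[p] + lat <= cycle
                        for p, lat in preds[i])]
        if ready:
            pick = max(ready, key=lambda i: delay[i])
            start[pick] = cycle
            order.append(pick)
            remaining.remove(pick)
        cycle += 1
    return order

# Instruction 1 has the longer path to the block end, so it issues first.
print(list_schedule(3, [(0, 2, 1), (1, 2, 2)]))   # [1, 0, 2]
```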

  12. Procedure integration: inlining • Calls make optimizations harder. There is a large payoff to local optimizations over large basic blocks: inlining subprogram bodies is often very effective: • It exposes the values of the actuals in the body • It creates larger basic blocks • It saves the cost of the call • Can be done at the tree level or at the RTL level. In both cases it can enable other optimizations. • Possible disadvantages: code size increase, debugging is harder

  13. Inlining as a tree transformation • Treat body of subprogram as a generic unit • Each inlined body needs its own local variables • Global references are captured at the point of definition • Inlining works like instantiation: replace formals with actuals, complete analysis and expansion of inserted body • Replace multiple return statements where needed • Introduce temporary to hold return value of function

  14. Name capture: recognize global entities
  function memo (x : integer) return integer is
     local : integer := x ** 2;
  begin
     Saved := Saved + local + x; -- Saved is global
     return Saved;
  end memo;
  …
  Val := memo (15);
  • Becomes
  declare
     local : integer := 15 ** 2; -- each inlining has its own
     result : integer; -- maybe superfluous if context is assignment
  begin
     Saved := Saved + local + 15; -- Saved is the same entity in all inlinings
     result := Saved;
  end;
  Val := result;

  15. Handling return statements • The subprogram needs a label to serve as a single exit point. • In a function: identify the target of the result, or create a temporary for it; replace each return with an assignment to the target, followed by a goto to the exit label • In a procedure: replace each return with a goto to the exit label • Optimizations: • If the body of the function is a single return statement and the context is an assignment, the right-hand side can be replaced with the expression • If the procedure has no return statement, the exit label is superfluous

  16. Parameter passing • If the actual is an expression, it is evaluated once: create a temporary in the block and replace the formal with the temporary
  Val2 := memo (x * f (z));
  • Becomes:
  declare
     C1 : integer := x * f (z);
     local : integer := C1 ** 2;
  begin
     Saved := Saved + local + C1;
     Val2 := Saved; -- context is assignment
  end;

  17. Parameter passing: variables • An in-out parameter is a location, cannot create a temporary for it: must use a renaming declaration.
  procedure incr (x : in out integer) is
  begin
     x := x + 1;
  end;
  …
  incr (a (i));
  • Becomes
  declare
     c1 : integer renames a (i);
  begin
     c1 := c1 + 1;
  end;

  18. Context includes more than global names • The semantics of the inlined call must be identical to the original program. • If constraint checks are not suppressed in the body, they must not be suppressed in the inlined block, even if suppressed at the point of call. • The status of constraint checks is part of the closure of the body to inline, and must be applied when analyzing the inlined block.

  19. Specialized inlining: loop unrolling • Create successive copies of the body of the loop: saves tests, makes bigger basic blocks, increases instruction-level parallelism
  for j in 1 .. N loop
     loop_body;
  end loop;
  • Becomes (unrolled r times):
  for k in 1 .. N / r loop
     loop_body;
     loop_body [j -> j + 1]; -- replace loop variable for each unrolling
     …
     loop_body [j -> j + r - 1];
  end loop;
  for j in r * (N / r) + 1 .. N loop -- leftover iterations
     loop_body;
  end loop;
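The transformation above, rendered as an executable sketch (Python standing in for the slide's pseudocode, with the unrolling factor fixed at r = 4; the function name is illustrative):

```python
# Sketch: a summation loop unrolled by a factor of 4.  The first loop
# performs four body copies per test-and-branch; the second loop
# handles the leftover iterations, as on the slide.

def sum_unrolled4(a):
    total = 0
    n = len(a)
    limit = (n // 4) * 4        # iterations covered by the unrolled loop
    i = 0
    while i < limit:            # one loop test per 4 body copies
        total += a[i]
        total += a[i + 1]
        total += a[i + 2]
        total += a[i + 3]
        i += 4
    while i < n:                # leftover iterations
        total += a[i]
        i += 1
    return total

print(sum_unrolled4(list(range(10))))   # 45, same as sum(range(10))
```

Besides saving loop tests, the straight-line run of four additions is exactly the "bigger basic block" the slide mentions: it gives the instruction scheduler more independent work to overlap.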
