
Static Code Scheduling


Presentation Transcript


  1. Static Code Scheduling CS 671 April 1, 2008

  2. Code Scheduling • Scheduling or reordering instructions to improve performance and/or guarantee correctness • Important for dynamically-scheduled architectures • Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC • Takes into account anticipated latencies • Machine-specific, performed later in the optimization pass • How does this contrast with our earlier exploration of code motion?

  3. Why Must the Compiler Schedule? • Many machines are pipelined and expose some aspects of pipelining to the user (compiler) • Examples: • Branch delay slots! • Memory-access delays • Multi-cycle operations • Some machines don’t have scheduling hardware

  4. Example • Assume loads take 2 cycles and branches have a delay slot. • ____cycles

  5. Example • Assume loads take 2 cycles and branches have a delay slot. • ____cycles

  6. Code Scheduling Strategy • [diagram: a start op, the slots to try to fill, and the op that uses the result] • Get resources operating in parallel • Integer data path • Integer multiply / divide hardware • FP adder, multiplier, divider • Method • Fill with computations that do not require the result or the same hardware resources • Drawbacks • Highly hardware dependent

  7. Scheduling Approaches • Local • Branch scheduling • Basic-block scheduling • Global • Cross-block scheduling • Software pipelining • Trace scheduling • Percolation scheduling

  8. Branch Scheduling • Two problems: • Branches often take some number of cycles to complete • There can be a delay between a compare and its associated branch • A compiler will try to fill these slots with valid instructions (rather than nops) • Delay slots – present in PA-RISC, SPARC, MIPS • Condition delay – PowerPC, Pentium

  9. Recall from Architecture… • IF – Instruction Fetch • ID – Instruction Decode • EX – Execute • MA – Memory access • WB – Write back • [pipeline diagram: three overlapped instructions, each passing through IF ID EX MA WB one cycle apart]

  10. Control Hazards • [pipeline diagram for a taken branch:]
      Taken Branch         IF ID EX MA WB
      Instr + 1               IF --- --- --- ---   (fetched, then squashed)
      Branch Target              IF ID EX MA WB
      Branch Target + 1             IF ID EX MA WB

  11. Data Dependences • If two operations access the same register, they are dependent • Types of data dependences Output Anti Flow r1 = r2 + r3 r2 = r5 * 6 r1 = r2 + r3 r1 = r4 * 6 r1 = r2 + r3 r4 = r1 * 6
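
To make the three kinds concrete, here is a minimal sketch (an illustration, not from the slides) that detects flow, anti, and output dependences between two instructions in program order; the Instr record and register names are assumptions.

    # Illustrative sketch: classify the dependence(s) between an earlier and
    # a later instruction. The Instr shape and register names are assumptions.
    from collections import namedtuple

    Instr = namedtuple("Instr", ["dst", "srcs"])

    def classify(early, late):
        deps = set()
        if early.dst in late.srcs:
            deps.add("flow")        # read-after-write (true dependence)
        if late.dst in early.srcs:
            deps.add("anti")        # write-after-read
        if late.dst == early.dst:
            deps.add("output")      # write-after-write
        return deps

    # r1 = r2 + r3  followed by  r4 = r1 * 6  is a flow dependence on r1
    print(classify(Instr("r1", ["r2", "r3"]), Instr("r4", ["r1"])))  # {'flow'}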

  12. Data Hazards • Memory latency: data not ready
      lw  R1,0(R2)    IF ID EX MA WB
      add R3,R1,R4       IF ID EX stall MA WB

  13. Data Hazards • Instruction latency: execute takes > 1 cycle
      addf R3,R1,R2   IF ID EX EX MA WB
      addf R3,R3,R4      IF ID stall EX EX MA WB
  • Assumes floating point ops take 2 execute cycles

  14. Multi-cycle Instructions • Scheduling is particularly important for multi-cycle operations • Alpha instructions with > 1 cycle latency (partial list):
      mull (32-bit integer multiply)       8
      mulq (64-bit integer multiply)      16
      addt (fp add)                        4
      mult (fp multiply)                   4
      divs (fp single-precision divide)   10
      divt (fp double-precision divide)   23

  15. Avoiding data hazards • Move loads earlier and stores later (assuming this does not violate correctness) • Other stalls may require more sophisticated re-ordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d) • How can we do this in a systematic way?
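
To see why the re-association helps, here is a small sketch (assuming unit-latency adds and expressions written as nested tuples, purely for illustration) comparing the dependence height of the two forms:

    # Illustrative sketch: dependence height of the two associations.
    # Leaves are operand names; internal nodes are ('+', left, right).
    def depth(expr):
        if isinstance(expr, str):
            return 0                              # operand already available
        _, lhs, rhs = expr
        return 1 + max(depth(lhs), depth(rhs))    # an add must wait for both inputs

    left_assoc = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')   # ((a+b)+c)+d
    balanced   = ('+', ('+', 'a', 'b'), ('+', 'c', 'd'))   # (a+b)+(c+d)
    print(depth(left_assoc), depth(balanced))              # 3 2: shorter chain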

  16. Example: Without Scheduling • Assume: • memory instrs take 3 cycles • mult takes 2 cycles (to have result in register) • rest take 1 cycle • ____cycles

  17. Basic Block Dependence DAGs • Nodes - instructions • Edges - dependence between I1 and I2 • When we cannot determine whether there is a dependence, we must assume there is one
      a) lw R2, (R1)
      b) lw R3, 4(R1)
      c) R4 ← R2 + R3
      d) R5 ← R2 - 1
      [DAG: edges a→c, a→d, b→c, each with latency 2]
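
A sketch of how a dependence DAG for this block could be built mechanically; the (name, opcode, dest, sources) encoding and the latency table are assumptions for illustration, and only register dependences are checked here (a real pass would also handle memory conservatively, as the slide notes).

    # Illustrative sketch: build dependence edges for the block above.
    instrs = [
        ("a", "lw",  "R2", ["R1"]),
        ("b", "lw",  "R3", ["R1"]),
        ("c", "add", "R4", ["R2", "R3"]),
        ("d", "sub", "R5", ["R2"]),
    ]
    latency = {"lw": 2, "add": 1, "sub": 1}

    edges = []
    for i, (n1, op1, d1, s1) in enumerate(instrs):
        for (n2, op2, d2, s2) in instrs[i + 1:]:
            flow, anti, output = d1 in s2, d2 in s1, d1 == d2
            if flow or anti or output:
                # edge weight: cycles the later op must wait for the earlier one
                edges.append((n1, n2, latency[op1] if flow else 0))
    print(edges)   # [('a', 'c', 2), ('a', 'd', 2), ('b', 'c', 2)]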

  18. Example – Build the DAG • Assume: • memory instrs = 3 cycles • mult = 2 (to have result in register) • rest = 1 cycle

  19. Creating a schedule • Create a DAG of dependences • Determine priority • Schedule instructions with • Ready operands • Highest priority • Heuristics: If multiple possibilities, fall back on other priority functions

  20. Operation Priority • Priority – Need a mechanism to decide which ops to schedule first (when you have choices) • Common priority functions • Height – Distance from exit node • Give priority to amount of work left to do • Slackness – inversely proportional to slack • Give priority to ops on the critical path • Register use – priority to nodes with more source operands and fewer destination operands • Reduces number of live registers • Uncover – high priority to nodes with many children • Frees up more nodes • Original order – when all else fails
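
One common way to combine these functions is a tuple-valued sort key that applies them in order as tie-breakers. The field names below are assumptions; a real scheduler would precompute them from the DAG.

    # Illustrative sketch: composite priority for ready ops (lower key = schedule first).
    def priority_key(op):
        return (
            -op["height"],       # most work remaining below this op
            -op["num_succs"],    # "uncover": frees the most children
            op["orig_order"],    # original order when all else ties
        )

    ready = [
        {"name": "b", "height": 3, "num_succs": 1, "orig_order": 1},
        {"name": "d", "height": 1, "num_succs": 0, "orig_order": 3},
        {"name": "a", "height": 3, "num_succs": 2, "orig_order": 0},
    ]
    print([op["name"] for op in sorted(ready, key=priority_key)])   # ['a', 'b', 'd']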

  21. Computing Priorities • Height(n) = exec(n), if n is a leaf • Height(n) = max(height(m)) + exec(n), taking the max over all successors m of n • Critical path(s) = path through the dependence DAG with longest latency
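
Applied to the four-instruction DAG from the earlier slide, the height recurrence can be computed directly; the dictionaries below are assumptions mirroring that example (2-cycle loads, 1-cycle ALU ops).

    # Illustrative sketch: height(n) and the critical-path length.
    from functools import lru_cache

    succs     = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
    exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}

    @lru_cache(maxsize=None)
    def height(n):
        if not succs[n]:
            return exec_time[n]                                  # leaf
        return exec_time[n] + max(height(m) for m in succs[n])   # interior node

    print({n: height(n) for n in succs})    # {'a': 3, 'b': 3, 'c': 1, 'd': 1}
    print(max(height(n) for n in succs))    # critical path length = 3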

  22. Example – Determine Height and CP • Assume: • memory instrs = 3 • mult = 2 (to have result in register) • rest = 1 cycle • Critical path: _______

  23. Example – List Scheduling _____cycles

  24. Scheduling vs. Register Allocation

  25. Register Renaming
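
A rough sketch of the idea behind renaming: give each definition a fresh virtual register so that anti and output dependences disappear, leaving only flow dependences. The (dest, sources, op) instruction format and virtual-register names are assumptions for illustration.

    # Illustrative sketch: rename destinations to fresh virtual registers.
    def rename(block):
        counter = 0
        current = {}                                  # arch reg -> latest virtual name
        renamed = []
        for dst, srcs, op in block:
            srcs = [current.get(r, r) for r in srcs]  # read the latest definition
            counter += 1
            new_dst = f"v{counter}"
            current[dst] = new_dst                    # later reads use the new name
            renamed.append((new_dst, srcs, op))
        return renamed

    block = [("r1", ["r2", "r3"], "+"),
             ("r4", ["r1"], "*"),
             ("r1", ["r5", "r6"], "+")]               # reuse of r1: output dependence
    print(rename(block))
    # [('v1', ['r2', 'r3'], '+'), ('v2', ['v1'], '*'), ('v3', ['r5', 'r6'], '+')]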

  26. VLIW • Very Long Instruction Word • Compiler determines exactly what is issued every cycle (before the program is run) • Schedules also account for latencies • All hardware changes result in a compiler change • Usually embedded systems (hence simple HW) • Itanium is actually an EPIC-style machine (the compiler accounts for most of the parallelism, but not latencies)

  27. Sample VLIW code • VLIW processor: 5 issue • 2 Add/Sub units (1 cycle) • 1 Mul/Div unit (2 cycle, unpipelined) • 1 LD/ST unit (2 cycle, pipelined) • 1 Branch unit (no delay slots)
      Add/Sub     Add/Sub     Mul/Div     Ld/St        Branch
      c = a + b   d = a - b   e = a * b   ld j = [x]   nop
      g = c + d   h = c - d   nop         ld k = [y]   nop
      nop         nop         i = j * c   ld f = [z]   br g

  28. Multi-Issue Scheduling Example • Machine: 2 issue, 1 memory port, 1 ALU • Memory port = 2 cycles, non-pipelined • ALU = 1 cycle • [figure: a 10-node dependence DAG (nodes 2, 3, 5, and 7 are memory ops), with blank RU_map (cycles 0–9 × ALU/MEM) and schedule (cycles 0–9 × ready/placed) tables to fill in]

  29. Earliest / Latest Sets • Machine: 2 issue, 1 memory port, 1 ALU • Memory port = 2 cycles, pipelined • ALU = 1 cycle • [figure: a 10-node dependence DAG (nodes 1, 2, 4, and 9 are memory ops) used to work out the earliest/latest sets]
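
The earliest and latest sets can be computed from the DAG alone, ignoring resources: earliest is an as-soon-as-possible time driven by predecessors, latest an as-late-as-possible time driven by successors, and their difference is the slack used by the priority functions. A sketch over the small DAG used earlier (the data representation is an assumption):

    # Illustrative sketch: earliest (ASAP) and latest (ALAP) start times.
    succs = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
    preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"]}
    lat   = {"a": 2, "b": 2, "c": 1, "d": 1}

    earliest, latest = {}, {}
    for n in ["a", "b", "c", "d"]:                        # topological order
        earliest[n] = max((earliest[p] + lat[p] for p in preds[n]), default=0)
    length = max(earliest[n] + lat[n] for n in earliest)  # unconstrained schedule length
    for n in ["d", "c", "b", "a"]:                        # reverse topological order
        latest[n] = min((latest[s] for s in succs[n]), default=length) - lat[n]

    print(earliest)   # {'a': 0, 'b': 0, 'c': 2, 'd': 2}
    print(latest)     # {'a': 0, 'b': 0, 'c': 2, 'd': 2}  -> zero slack everywhere here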

  30. List Scheduling Algorithm • Build dependence graph, calculate priority • Add all ops to UNSCHEDULED set • time = 0 • while (UNSCHEDULED is not empty) • time++ • READY = UNSCHEDULED ops whose incoming deps have been satisfied • Sort READY using priority function • For each op in READY (highest to lowest priority) • op can be scheduled at current time? (resources free?) • Yes: schedule it, op.issue_time = time • Mark resources busy in RU_map relative to issue time • Remove op from UNSCHEDULED/READY sets • No: continue
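
A compact, runnable rendering of this pseudocode for one simple machine model (one ALU plus one non-pipelined memory port with 2-cycle latency); the instruction table and resource names are assumptions, and the priority function here is just height.

    # Illustrative sketch of list scheduling; not the course's reference code.
    instrs = {
        "a": {"unit": "MEM", "lat": 2, "preds": []},
        "b": {"unit": "MEM", "lat": 2, "preds": []},
        "c": {"unit": "ALU", "lat": 1, "preds": ["a", "b"]},
        "d": {"unit": "ALU", "lat": 1, "preds": ["a"]},
    }

    def height(n):
        kids = [m for m, v in instrs.items() if n in v["preds"]]
        return instrs[n]["lat"] + (max(map(height, kids)) if kids else 0)

    order = list(instrs)                       # original program order (tie-breaker)
    unscheduled = set(instrs)
    issue, finish = {}, {}
    busy_until = {"ALU": 0, "MEM": 0}          # RU_map: unit -> first cycle it is free
    time = 0
    while unscheduled:
        time += 1
        ready = [n for n in order if n in unscheduled and
                 all(p in finish and finish[p] <= time for p in instrs[n]["preds"])]
        ready.sort(key=lambda n: -height(n))   # highest priority first (stable sort)
        for n in ready:
            unit = instrs[n]["unit"]
            if busy_until[unit] <= time:       # resource free at this cycle?
                issue[n] = time
                finish[n] = time + instrs[n]["lat"]
                busy_until[unit] = finish[n]   # non-pipelined: busy until it completes
                unscheduled.remove(n)

    print(issue)   # {'a': 1, 'b': 3, 'd': 3, 'c': 5}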

  31. Improving Basic Block Scheduling • Loop unrolling – creates longer basic blocks • Register renaming – can change register usage in blocks to remove immediate reuse of registers • Summary • Static scheduling complements (or replaces) dynamic scheduling by the hardware
