250 likes | 382 Views
Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling. John Cavazos University of Delaware. Instruction Scheduling. Reordering instructions to improve performance Takes into account anticipated latencies Machine-specific Performed late in optimization pass
E N D
Optimizing CompilersCISC 673Spring 2009Instruction Scheduling John Cavazos University of Delaware
Instruction Scheduling • Reordering instructions to improve performance • Takes into account anticipated latencies • Machine-specific • Performed late in optimization pass • Instruction-Level Parallelism (ILP)
Modern Architectures Features • Superscalar • Multiple logic units • Multiple issue • 2 or more instructions issued per cycle • Speculative execution • Branch predictors • Speculative loads • Deep pipelines
Types of Instruction Scheduling • Local Scheduling • Basic Block Scheduling • Global Scheduling • Trace Scheduling • Superblock Scheduling • Software Pipelining
Scheduling for different Computer Architectures • Out-of-order Issue • Scheduling is useful • In-order issue • Scheduling is very important • VLIW • Scheduling is essential!
Challenges to ILP • Structural hazards: • Insufficient resources to exploit parallelism • Data hazards • Instruction depends on result of previous instruction still in pipeline • Control hazards • Branches & jumps modify PC • affect which instructions should be in pipeline
Recall from Architecture… • IF – Instruction Fetch • ID – Instruction Decode • EX – Execute • MA – Memory access • WB – Write back IF ID EX MA WB IF ID EX MA WB IF ID EX MA WB
Structural Hazards Instruction latency: execute takes > 1 cycle addf R3,R1,R2 IF ID EX EX MA WB IF ID stall EX EX MA WB addf R3,R3,R4 Assumes floating point ops take 2 execute cycles
Data Hazards Memory latency: data not ready lw R1,0(R2) IF ID EX MA WB IF ID EX stall MA WB add R3,R1,R4
Control Hazards ID EX MA WB Taken Branch IF IF --- --- --- --- Instr + 1 Branch Target IF ID EX MA WB IF ID EX MA WB Branch Target + 1
Basic Block Scheduling • For each basic block: • Construct directed acyclic graph (DAG) using dependences between statements • Node = statement / instruction • Edge (a,b) = statement a must execute before b • Schedule instructions using the DAG
Data Dependences • If two operations access the same register and one access is a write, they are dependent • Types of data dependences RAW=Read after Write WAW WAR r1 = r2 + r3 r2 = r5 * 6 r1 = r2 + r3 r4 = r1 * 6 r1 = r2 + r3 r1 = r4 * 6 Cannot reorder two dependent instructions
Basic Block Scheduling Example Original Schedule Dependence DAG a) lw R2, (R1) b) lw R3, (R1) 4 c) R4 R2 + R3 d) R5 R2 - 1 a b 2 2 2 d c Schedule 1 (5 cycles) Schedule 2 (4 cycles) • a) lw R2, (R1) • lw R3, (R1) 4 • --- nop ----- • c) R4 R2 + R3 • d) R5 R2 - 1 • a) lw R2, (R1) • b) lw R3, (R1) 4 • R5 R2 - 1 • c) R4 R2 + R3
Scheduling Algorithm • Construct dependence dag on basic block • Put roots in candidate set • Use scheduling heuristics (in order) to select instruction • While candidate set not empty • Evaluate all candidates and select best one • Delete scheduled instruction from candidate set • Add newly-exposed candidates
Instruction Scheduling Heuristics • NP-complete = we need heuristics • Bias scheduler to prefer instructions: • Earliest execution time • Have many successors • More flexibility in scheduling • Progress along critical path • Free registers • Reduce register pressure • Can be a combination of heuristics
Computing Priorities Height(n) = • exec(n) if n is a leaf • max(height(m)) + exec(n) for m, where m is a successor of n Critical path(s) = path through the dependence DAG with longest latency
Example – Determine Height and CP Assume: memory instrs = 3 mult = 2 = (to have result in register) rest = 1 cycle a 3 b c 3 1 d e 2 3 f g 2 3 h 2 Critical path: _______ i
Example 13 a 3 10 b c 12 3 1 d e 10 9 2 3 f g 7 8 2 3 h 5 2 i 3 ___ cycles
Global Scheduling: Superblock • Definition: • single trace of contiguous, frequently executed blocks • a single entry and multiple exits • Formation algorithm: • pick a trace of frequently executed basic block • eliminate side entrance (tail duplication) • Scheduling and optimization: • speculate operations in the superblock • apply optimization to scope defined by superblock
A 100 A 100 B 90 C 10 B 90 C 10 D 0 E 90 E 90 D 0 F 100 F 90 F’ 10 Superblock Formation Tail duplicate Select a trace
r1 = r2*3 r1 = r2*3 r1 = r2*3 r2 = r2 +1 r2 = r2 +1 r2 = r2 +1 r3 = r2*3 r3 = r2*3 r3 = r1 r3 = r2*3 r3 = r2*3 trace selection tail duplication CSE within superblock (no merge since single entry) Optimizations within Superblock • By limiting the scope of optimization to superblock: • optimize for the frequent path • may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion,...) • For example: CSE
Scheduling Algorithm Complexity • Time complexity: O(n2) • n = max number of instructions in basic block • Building dependence dag: worst-case O(n2) • Each instruction must be compared to every other instruction • Scheduling then requires each instruction be inspected at each step = O(n2) • Average-case: small constant (e.g., 3)
Very Long Instruction Word (VLIW) • Compiler determines exactly what is issued every cycle (before the program is run) • Schedules also account for latencies • All hardware changes result in a compiler change • Usually embedded systems (hence simple HW) • Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)
Sample VLIW code VLIW processor: 5 issue 2 Add/Sub units (1 cycle) 1 Mul/Div unit (2 cycle, unpipelined) 1 LD/ST unit (2 cycle, pipelined) 1 Branch unit (no delay slots) Add/Sub Add/Sub Mul/Div Ld/St Branch c = a + b d = a - b e = a * b ld j = [x] nop g = c + d h = c - d nop ld k = [y] nop nop nop i = j * c ld f = [z] br g
Next Time • Phase-ordering