Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling

John Cavazos, University of Delaware

Instruction Scheduling
• Reordering instructions to improve performance
• Takes into account anticipated latencies
• Machine-specific
• Performed late in the optimization pipeline
• Exploits Instruction-Level Parallelism (ILP)

Modern Architecture Features
• Superscalar
  • Multiple logic units
• Multiple issue
  • 2 or more instructions issued per cycle
• Speculative execution
  • Branch predictors
  • Speculative loads
• Deep pipelines

Types of Instruction Scheduling
• Local Scheduling
  • Basic Block Scheduling
• Global Scheduling
  • Trace Scheduling
  • Superblock Scheduling
  • Software Pipelining

Scheduling for Different Computer Architectures
• Out-of-order issue
  • Scheduling is useful
• In-order issue
  • Scheduling is very important
• VLIW
  • Scheduling is essential!

Challenges to ILP
• Structural hazards
  • Insufficient resources to exploit parallelism
• Data hazards
  • Instruction depends on result of previous instruction still in pipeline
• Control hazards
  • Branches & jumps modify PC
  • Affect which instructions should be in pipeline

Recall from Architecture…
• IF – Instruction Fetch
• ID – Instruction Decode
• EX – Execute
• MA – Memory access
• WB – Write back

Three consecutive instructions in the pipeline (one column per cycle):
  IF    ID    EX    MA    WB
        IF    ID    EX    MA    WB
              IF    ID    EX    MA    WB

Structural Hazards
Instruction latency: execute takes > 1 cycle
  addf R3,R1,R2   IF    ID    EX    EX    MA    WB
  addf R3,R3,R4         IF    ID    stall EX    EX    MA    WB
Assumes floating point ops take 2 execute cycles

Data Hazards
Memory latency: data not ready
  lw R1,0(R2)     IF    ID    EX    MA    WB
  add R3,R1,R4          IF    ID    EX    stall MA    WB

Control Hazards
  Taken Branch        IF    ID    EX    MA    WB
  Instr + 1                 IF    ---   ---   ---   ---
  Branch Target                   IF    ID    EX    MA    WB
  Branch Target + 1                     IF    ID    EX    MA    WB

Basic Block Scheduling
• For each basic block:
  • Construct directed acyclic graph (DAG) using dependences between statements
    • Node = statement / instruction
    • Edge (a,b) = statement a must execute before b
  • Schedule instructions using the DAG

Data Dependences
• If two operations access the same register and at least one access is a write, they are dependent
• Types of data dependences:
  RAW (Read after Write)     WAW (Write after Write)     WAR (Write after Read)
  r1 = r2 + r3               r1 = r2 + r3                r1 = r2 + r3
  r4 = r1 * 6                r1 = r4 * 6                 r2 = r5 * 6
• Cannot reorder two dependent instructions
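The three dependence types can be checked mechanically. The following is a minimal sketch, assuming each instruction is modeled as a destination register plus a list of source registers; the `classify` helper and this tuple encoding are illustrative and do not come from the slides.

```python
def classify(first, second):
    """Return the dependence types from `first` to `second`.

    Each instruction is a (dest, sources) pair; `first` precedes `second`
    in program order. This encoding is a hypothetical one for illustration.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = []
    if dest1 in srcs2:      # second reads what first wrote
        kinds.append("RAW")
    if dest2 in srcs1:      # second overwrites what first read
        kinds.append("WAR")
    if dest1 == dest2:      # both write the same register
        kinds.append("WAW")
    return kinds


add = ("r1", ["r2", "r3"])              # r1 = r2 + r3
print(classify(add, ("r4", ["r1"])))    # r4 = r1 * 6
print(classify(add, ("r1", ["r4"])))    # r1 = r4 * 6
print(classify(add, ("r2", ["r5"])))    # r2 = r5 * 6
```

An empty result means the two instructions are independent with respect to registers and may be reordered; memory dependences are not modeled here.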

Basic Block Scheduling Example

Original Schedule
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1

Dependence DAG (edge labels are latencies)
  a → c (2)
  a → d (2)
  b → c (2)

Schedule 1 (5 cycles)
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  --- nop ---
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1

Schedule 2 (4 cycles)
  a) lw R2, (R1)
  b) lw R3, (R1) 4
  d) R5 ← R2 - 1
  c) R4 ← R2 + R3

Scheduling Algorithm
• Construct dependence DAG on basic block
• Put roots in candidate set
• Use scheduling heuristics (in order) to select instruction
• While candidate set not empty
  • Evaluate all candidates and select best one
  • Delete scheduled instruction from candidate set
  • Add newly-exposed candidates
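The loop above can be sketched as a small cycle-by-cycle list scheduler. This is only an illustration under stated assumptions: a single-issue machine, height (longest latency to a leaf) as the only heuristic with program order as tie-breaker, and a four-instruction DAG where loads take 2 cycles and ALU operations take 1. The names `list_schedule`, `LATENCY`, and `SUCCS` are hypothetical.

```python
# Assumed example block: a and b are loads, c and d are ALU ops.
LATENCY = {"a": 2, "b": 2, "c": 1, "d": 1}
SUCCS = {"a": ["c", "d"], "b": ["c"]}


def list_schedule(latency, succs):
    order = list(latency)                       # program order, for tie-breaks
    preds_left = {n: 0 for n in order}
    for n in order:
        for m in succs.get(n, []):
            preds_left[m] += 1

    memo = {}

    def height(n):                              # priority: latency to a leaf
        if n not in memo:
            below = max((height(m) for m in succs.get(n, [])), default=0)
            memo[n] = latency[n] + below
        return memo[n]

    earliest = {n: 1 for n in order}            # first cycle operands are ready
    candidates = {n for n in order if preds_left[n] == 0}   # roots
    schedule, cycle = [], 1
    while candidates:
        ready = [n for n in candidates if earliest[n] <= cycle]
        if ready:
            best = max(ready, key=lambda n: (height(n), -order.index(n)))
            schedule.append((cycle, best))
            candidates.remove(best)
            for m in succs.get(best, []):       # expose new candidates
                earliest[m] = max(earliest[m], cycle + latency[best])
                preds_left[m] -= 1
                if preds_left[m] == 0:
                    candidates.add(m)
        else:
            schedule.append((cycle, "nop"))     # nothing ready: stall
        cycle += 1
    return schedule


print(list_schedule(LATENCY, SUCCS))
```

Swapping the `key` function is the natural place to try other heuristics or combinations of them.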

Instruction Scheduling Heuristics
• Optimal scheduling is NP-complete, so we need heuristics
• Bias scheduler to prefer instructions that:
  • Have the earliest execution time
  • Have many successors
    • More flexibility in scheduling
  • Make progress along the critical path
  • Free registers
    • Reduce register pressure
• Can be a combination of heuristics

Computing Priorities
Height(n) =
  exec(n)                       if n is a leaf
  max(height(m)) + exec(n)      over all successors m of n
Critical path(s) = path(s) through the dependence DAG with longest latency

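Example – Determine Height and CP
Assume:
• memory instrs = 3 cycles
• mult = 2 cycles (to have result in register)
• rest = 1 cycle
[Figure: dependence DAG with nodes a through i. Latency labels as extracted, row by row: a 3; b, c: 3, 1; d, e: 2, 3; f, g: 2, 3; h 2; i]
Critical path: _______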

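Example
[Figure: the same DAG annotated with node heights. Labels as extracted, row by row: a: height 13, latency 3; b, c: heights 10, 12, latencies 3, 1; d, e: heights 10, 9, latencies 2, 3; f, g: heights 7, 8, latencies 2, 3; h: height 5, latency 2; i: height 3]
___ cycles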
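The height computation can be sketched in a few lines. The DAG below is an assumption: the edges of the figure did not survive in this transcript, so the edge set and the per-node latencies here were chosen to agree with the stated latency classes (memory 3, mult 2, rest 1) and to reproduce the set of heights shown on the slide. The helper names are hypothetical.

```python
# Assumed DAG: a, c, e, g are loads, b is an add, d, f, h are mults, i is a store.
LATENCY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"],
         "e": ["f"], "f": ["h"], "g": ["h"], "h": ["i"]}


def heights(latency, succs):
    """height(n) = exec(n) for a leaf, else exec(n) + max height of successors."""
    memo = {}

    def height(n):
        if n not in memo:
            below = max((height(m) for m in succs.get(n, [])), default=0)
            memo[n] = latency[n] + below
        return memo[n]

    return {n: height(n) for n in latency}


def critical_path(latency, succs):
    """Start at the tallest node and keep following the tallest successor."""
    h = heights(latency, succs)
    node = max(h, key=h.get)
    path = [node]
    while succs.get(node):
        node = max(succs[node], key=h.get)
        path.append(node)
    return path


print(heights(LATENCY, SUCCS))
print(critical_path(LATENCY, SUCCS))
```

Under these assumptions the tallest node is `a`, and its height equals the total latency along the path the second function returns.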

Global Scheduling: Superblock
• Definition:
  • a single trace of contiguous, frequently executed blocks
  • a single entry and multiple exits
• Formation algorithm:
  • pick a trace of frequently executed basic blocks
  • eliminate side entrances (tail duplication)
• Scheduling and optimization:
  • speculate operations in the superblock
  • apply optimizations to the scope defined by the superblock

Superblock Formation
[Figure: control-flow graph with block execution counts, in two panels]
• Select a trace: A 100, B 90, C 10, D 0, E 90, F 100
• Tail duplicate: A 100, B 90, C 10, D 0, E 90, F 90, F’ 10
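Trace selection and tail duplication can be sketched on a small control-flow graph. The block counts below are the ones on the slide, but the edges are an assumption (the arrows of the figure are not recoverable from this transcript), the CFG is assumed acyclic, and `select_trace` and `tail_duplicate` are hypothetical helper names, not a production algorithm.

```python
COUNT = {"A": 100, "B": 90, "C": 10, "D": 0, "E": 90, "F": 100}
# Assumed edges: A branches to B or C, B branches to D or E, all paths merge at F.
SUCCS = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": ["F"], "E": ["F"], "F": []}


def select_trace(succs, count, entry):
    """Greedily follow the most frequently executed successor."""
    trace, node = [entry], entry
    while succs.get(node):
        nxt = max(succs[node], key=lambda s: count[s])
        if nxt in trace:
            break
        trace.append(nxt)
        node = nxt
    return trace


def tail_duplicate(succs, trace):
    """Remove side entrances by copying the trace tail they lead into."""
    new = {b: list(s) for b, s in succs.items()}
    pos = {b: i for i, b in enumerate(trace)}

    def side_preds(i):
        return [p for p, ss in succs.items()
                if trace[i] in ss and p != trace[i - 1]]

    first = next((i for i in range(1, len(trace)) if side_preds(i)), None)
    if first is None:
        return new                       # already single-entry
    copy = {b: b + "'" for b in trace[first:]}
    for j in range(first, len(trace)):
        b = trace[j]
        # The copy falls through to the copy of the next trace block.
        new[copy[b]] = [copy[s] if pos.get(s) == j + 1 else s for s in succs[b]]
        for p in side_preds(j):          # redirect side entrances to the copy
            new[p] = [copy[b] if s == b else s for s in new[p]]
    return new


trace = select_trace(SUCCS, COUNT, "A")
print(trace)
print(tail_duplicate(SUCCS, trace))
```

After duplication, every block of the trace past its head is reachable only from its trace predecessor, which is the single-entry property the superblock definition requires.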

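Optimizations within Superblock
• By limiting the scope of optimization to the superblock:
  • optimize for the frequent path
  • may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion, ...)
• For example: CSE
[Figure: the same code in three panels]
  • trace selection: r1 = r2*3, r2 = r2+1, r3 = r2*3
  • tail duplication: r1 = r2*3, r2 = r2+1, r3 = r2*3, r3 = r2*3 (duplicate)
  • CSE within superblock (no merge since single entry): r1 = r2*3, r2 = r2+1, r3 = r1, r3 = r2*3
Inside the superblock r2 is not modified between r1 = r2*3 and r3 = r2*3, so that copy becomes r3 = r1; the duplicated copy reached through r2 = r2+1 keeps r3 = r2*3.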

Scheduling Algorithm Complexity
• Time complexity: O(n^2)
  • n = max number of instructions in basic block
• Building dependence DAG: worst-case O(n^2)
  • Each instruction must be compared to every other instruction
• Scheduling then requires each instruction be inspected at each step = O(n^2)
• Average-case: small constant (e.g., 3)
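The quadratic cost of building the DAG comes from the pairwise comparison, which can be sketched as a doubly nested loop. This is a minimal illustration assuming register-only dependences (memory dependences between loads and stores are ignored) and a hypothetical `(name, dest, sources)` encoding; the four-instruction block is modeled on the basic block scheduling example.

```python
def build_dag(instrs):
    """Compare every instruction with every later one: O(n^2) pairs.

    instrs is a list of (name, dest, sources) in program order.
    Returns the set of edges (earlier, later) that must keep their order.
    """
    edges = set()
    for i, (n1, d1, s1) in enumerate(instrs):
        for n2, d2, s2 in instrs[i + 1:]:
            raw = d1 in s2        # later reads what earlier wrote
            war = d2 in s1        # later overwrites what earlier read
            waw = d1 == d2        # both write the same register
            if raw or war or waw:
                edges.add((n1, n2))
    return edges


BLOCK = [
    ("a", "R2", ["R1"]),          # load into R2, address in R1
    ("b", "R3", ["R1"]),          # load into R3, address in R1
    ("c", "R4", ["R2", "R3"]),    # R4 = R2 + R3
    ("d", "R5", ["R2"]),          # R5 = R2 - 1
]
print(sorted(build_dag(BLOCK)))
```

A production scheduler would typically also prune transitive edges and attach latencies to the edges; both are left out here to keep the pairwise structure visible.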

Very Long Instruction Word (VLIW)
• Compiler determines exactly what is issued every cycle (before the program is run)
• Schedules also account for latencies
• All hardware changes result in a compiler change
• Usually embedded systems (hence simple HW)
• Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

Sample VLIW code
VLIW processor: 5 issue
• 2 Add/Sub units (1 cycle)
• 1 Mul/Div unit (2 cycle, unpipelined)
• 1 LD/ST unit (2 cycle, pipelined)
• 1 Branch unit (no delay slots)

  Add/Sub      Add/Sub      Mul/Div      Ld/St          Branch
  c = a + b    d = a - b    e = a * b    ld j = [x]     nop
  g = c + d    h = c - d    nop          ld k = [y]     nop
  nop          nop          i = j * c    ld f = [z]     br g

Next Time
• Phase-ordering
