Computer Architecture Out-of-order execution

Computer ArchitectureOut-of-order execution By Dan Tsafrir, 11/4/2011Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz

Out-of-order in a nutshell • Real processors • Do not really serially execute instruction by instruction • Instead, they have an instruction window • Moves across the program • Ready instructions get picked up from window • Not necessarily in program order • Need window because some instructions aren’t ready • User is unaware (except that the program runs faster)

What’s Next • Remember our goal: minimize CPU Time CPU Time = clock cycle  CPI  IC • So far we have learned • Minimize clock cycle add more pipe stages • Minimize CPI use pipeline • Minimize IC  architecture • In a pipelined CPU: • CPI w/o hazards is 1 • CPI with hazards is > 1 • Adding more pipe stages reduces clock cycle but increases CPI • Higher penalty due to control hazards • More data hazards • Beyond some point adding more pipe stages does not help • What can we do? • Further reduce the CPI!

IF ID EXE MEM WB IF ID EXE MEM WB A Superscalar CPU • Duplicating HW in one pipe stage won’t help • e.g., have 2 ALUs • the bottleneck moves to other stages • Getting IPC > 1 requires to fetch, decode, exe, and retire >1 instruction per clock:

U-pipe IF ID pairing V-pipe The Pentium Processor • Fetches and decodes 2 instructions per cycle • Before register file read, decide on pairing • can the two instructions be executed in parallel • Pairing decision is based on • Data dependencies: instructions must be independent • Resources: • Some instructions use resources from the 2 pipes • The second pipe can only execute part of the instructions

Misprediction Penalty in a Superscalar CPU • MPI : miss-per-instruction (MPR = miss-predict rate): #incorrectly predicted branches #predicted branches MPI = = MPR× total # of instructions total # of instructions • MPI correlates well with performance. E.g., assume: • MPR = 5%, %branches = 20%  MPI = 1% • Without hazards IPC=2 (2 instructions per cycles) • Flush penalty of 5 cycles • We get: • MPI = 1%  flush in every 100 instructions • IPC=2  flush every 100/2 = 50 cycles • 5 cycles flush penalty every 50 cycles • 10% in performance • For IPC=1 we would get • 5 cycles flush penalty per 100 cycles  5% in performance

Is Superscalar Good Enough ? • A superscalar processor can fetch, decode, execute and retire 2 instructions in parallel • Can execute only independent instructions in parallel • But … adjacent instructions are usually dependent • The utilization of the second pipe is usually low • There are algorithms in which both pipes are highly utilized • Solution: out-of-order execution • Execute instructions based on “data flow” rather than program order • Still need to keep the semantics of the original program

Out Of Order Execution • Look ahead in a window of instructions and find instructions that are ready to execute • Don’t depend on data from previous instructions still not executed • Resources are available • Out-of-order execution • Start instruction execution before execution of a previous instructions • Advantages: • Help exploit Instruction Level Parallelism (ILP) • Help cover latencies (e.g., L1 data cache miss, divide) • Can Compilers do the Work ? • Compilers can statically reschedule instructions • Compilers do not have run time information • Conditional branch direction → limited to basic blocks • Data values, which may affect calculation time and control • Cache miss / hit

Data Flow Graph r5 r6 r1 r8 r4 In-order (superscalar 3) 1 3 1 4 6 5 2 1 2 3 5 2 6 5 3 6 4 4 Out-of-order execution Data Flow Analysis • Example: (1)r1 r4 / r7 ; assume divide takes 20 cycles (2) r8r1 + r2 (3) r5 r5 + 1 (4) r6 r6 - r3 (5) r4 r5 + r6 (6) r7  r8 * r4 In-order execution 4 2 6 1 3 5

Retire (commit) Instruction pool Fetch & Decode In-order In-order Execute Out-of-order OOOE – General Scheme • Fetch & decode instructions in parallel but in order, to fill inst. pool • Execute ready instructions from the instructions pool • All the data required for the instruction is ready • Execution resources are available • Once an instruction is executed • signal all dependant instructions that data is ready • Commit instructions in parallel but in-order • Can commit an instruction only after all preceding instructions (in program order) have committed

1 3 4 2 5 Out Of Order Execution – Example • Assume that executing a divide operation takes 20 cycles (1) r1  r5 / r4 (2) r3r1 + r8 (3) r8 r5 + 1 (4) r3 r7 - 2 (5) r6  r6 + r7 • Inst2 has a RAW dependency on r1 with Inst1 • It cannot be executed in parallel with Inst1 • Can successive instructions pass Inst2 ? • Inst3 cannot since Inst2 must read r8before Inst3 writes to it (WAR) • Inst4 cannot since it must write to r3 after Inst2 (WAW) • Inst5 can

False Dependencies • OOOE creates new dependencies • WAR: write to a register which is read by an earlier inst. (1) r3 r2 + r1 (2) r2 r4 + 3 • WAW: write to a register which is written by an earlier inst. (1) r3 r1 + r2 (2) r3 r4 + 3 • These are false dependencies • There is no missing data • Still prevent executing instructions out-of-order • Solution: Register Renaming

Register Renaming • Hold a pool of physical registers • Map architectural registers into physical registers • Before an instruction can be sent for execution • Allocate a free physical register from a pool • The physical register points to the architectural register • When an instruction writes a result • Write the result value to the physical register • When an instruction needs data from a register • Read data from the physical register allocated to the latest inst which writes to the same arch register, and precedes the current inst • If no such instruction exists, read directly from the arch. register • When an instruction commits • Move the value from the physical register to the arch register it points

WAW WAW WAR OOOE with Register Renaming: Example cycle 1cycle 2 (1) r1  mem1 r1’  mem1 (2) r2  r2 + r1 r2’  r2 + r1’ (3) r1  mem2 r1”  mem2 (4) r3  r3 + r1 r3’  r3 + r1” (5) r1  mem3 r1”’ mem3 (6) r4 r5 + r1 r4’  r5 + r1”’ (7) r5 2 r5’  2 (8) r6  r5 + 2 r6’  r5’ + 2 Register Renaming Benefits • Removes false dependencies • Removes architecture limit for # of registers WAR WAR

Executing Beyond Branches • So far we do not look for instructions ready to execute beyond a branch • Limited to the parallelism within a “basic-block” • A basic-block is ~5 instruction long (1)r1 r4 / r7 (2) r2 r2 + r1 (3) r3r2 - 5 (4) beq r3,0,300 If the beq is predicted NT, (5) r8  r8 + 1 Inst 5 can be spec executed • We would like to look beyond branches • But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path ? Solution: Speculative Execution

Speculative Execution • Execution of instructions from a predicted (yet unsure) path • Eventually, path may turn wrong • Implementation: • Hold a pool of all not yet executed instructions • Fetch instructions into the pool from a predicted path • Instructions for which all operands are ready can be executed • An instruction may change the processor state (commit) only when it is safe • An instruction commits only when all previous (in-order) instructions have committed  instructions commit in-order • Instructions which follow a branch commit only after the branch commits • If a predicted branch is wrong all the instructions which follow it are flushed • Register Renaming helps speculative execution • Renamed registers are kept until speculation is verified to be correct

WAW WAW Speculative Execution WAR Speculative Execution - Example cycle 1cycle 2 (1) r1 mem1 r1’  mem1 (2) r2  r2 + r1 r2’  r2 + r1’ (3) r1 mem2 r1”  mem2 (4) r3  r3 + r1 r3’  r3 + r1” (5) jmp cond L2 predicted taken to L2 (6)L2 r1 mem3 r1”’ mem3 (7) r4 r5 + r1 r4’  r5 + r1”’ (8) r5 2 r5’  2 (9) r6  r5 + 2 r6’  r5’ + 2 • Instructions 6-9 are speculatively executed • If the prediction turns wrong, they will be flushed • If the branch was predicted not-taken • The instructions from the other path would have been speculatively executed

Out Of Order Execution – Summary • Advantages • Help exploit Instruction Level Parallelism (ILP) • Help cover latencies (e.g., cache miss, divide) • Superior/complementary to compiler scheduler • Dynamic instruction window • Regs renaming: can use more than the number of arch registers • Complex micro-architecture • Complex scheduler • Requires reordering mechanism (retirement) in the back-end for: • Precise interrupt resolution • Misprediction/speculation recovery • Memory ordering • Speculative Execution • Advantage: larger scheduling window  reveals more ILP • Issues: misprediction cost and misprediction recovery

Computer Architecture Out-of-order execution