Computer Architecture: Out-Of-Order Execution
Lihu Rappoport and Adi Yoaz
What’s Next • Remember our goal: minimize CPU Time: CPU Time = clock cycle × CPI × IC • So far we have learned • Minimize clock cycle → add more pipe stages • Minimize CPI → use pipeline • Minimize IC → architecture • In a pipelined CPU: • CPI w/o hazards is 1 • CPI with hazards is > 1 • Adding more pipe stages reduces the clock cycle but increases CPI • Higher penalty due to control hazards • More data hazards • Beyond some point adding more pipe stages does not help • What can we do? Further reduce the CPI!
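To make the tradeoff concrete, here is a minimal sketch of the CPU Time = clock cycle × CPI × IC formula; the clock-cycle and CPI values are illustrative assumptions, not numbers from the slides.

```python
# Illustrative sketch: CPU Time = clock cycle * CPI * IC.
# The pipeline configurations below are assumptions chosen for illustration.

def cpu_time(clock_cycle_ns, cpi, instruction_count):
    return clock_cycle_ns * cpi * instruction_count

IC = 1_000_000                            # instructions in the program
shallow   = cpu_time(1.00, 1.1, IC)       # few stages: slow clock, few hazard stalls
deep      = cpu_time(0.60, 1.5, IC)       # more stages: faster clock, more stalls
very_deep = cpu_time(0.45, 2.6, IC)       # beyond some point the CPI increase wins

print(f"shallow: {shallow:.0f} ns, deep: {deep:.0f} ns, very deep: {very_deep:.0f} ns")
```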
A Superscalar CPU • Duplicating HW in one pipe stage won't help • e.g., having 2 ALUs • the bottleneck moves to other stages • Getting IPC > 1 requires fetching, decoding, executing, and retiring >1 instruction per clock: [diagram: two parallel pipelines, each with IF ID EXE MEM WB stages]
The Pentium Processor [diagram: shared IF and ID stages, then pairing into a U-pipe and a V-pipe] • Fetches and decodes 2 instructions per cycle • Before register file read, decide on pairing • can the two instructions be executed in parallel? • Pairing decision is based on • Data dependencies: instructions must be independent • Resources: • Some instructions use resources from both pipes • The second pipe can only execute a subset of the instructions
Misprediction Penalty in a Superscalar CPU • MPI: miss-per-instruction: MPI = (# incorrectly predicted branches) / (total # of instructions) = MPR × (# predicted branches) / (total # of instructions) • MPI correlates well with performance. E.g., assume: • MPR = 5%, %branches = 20% → MPI = 1% • Without hazards IPC = 2 (2 instructions per cycle) • Flush penalty of 5 cycles • We get: • MPI = 1% → a flush every 100 instructions • IPC = 2 → 100 instructions take 100/2 = 50 cycles • 5 cycles of flush penalty every 50 cycles → 10% loss in performance • For IPC = 1 we would get • 5 cycles of flush penalty per 100 cycles → 5% loss in performance
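The arithmetic on this slide can be reproduced with a short sketch. The parameter values are the slide's own example numbers; the only assumption is expressing the overhead as flush cycles per instruction divided by useful cycles per instruction (i.e., MPI × penalty × IPC).

```python
# Sketch of the slide's misprediction-penalty arithmetic (same example numbers).

def flush_overhead(mpr, branch_fraction, ipc, flush_penalty_cycles):
    mpi = mpr * branch_fraction                 # mispredictions per instruction
    useful_cycles_per_inst = 1.0 / ipc          # cycles of useful work per instruction
    flush_cycles_per_inst = mpi * flush_penalty_cycles
    return flush_cycles_per_inst / useful_cycles_per_inst

# MPR = 5%, 20% branches, 5-cycle flush penalty
print(flush_overhead(0.05, 0.20, ipc=2, flush_penalty_cycles=5))   # 0.10 -> 10% loss
print(flush_overhead(0.05, 0.20, ipc=1, flush_penalty_cycles=5))   # 0.05 -> 5% loss
```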
Is Superscalar Good Enough? • A superscalar processor can fetch, decode, execute and retire 2 instructions in parallel • Can execute only independent instructions in parallel • But… adjacent instructions are usually dependent • The utilization of the second pipe is usually low • There are algorithms in which both pipes are highly utilized • Solution: out-of-order execution • Execute instructions based on “data flow” rather than program order • Still need to keep the semantics of the original program
Out Of Order Execution • Look ahead in a window of instructions and find instructions that are ready to execute • They do not depend on data from previous instructions that have not yet executed • Resources are available • Out-of-order execution • Start executing an instruction before previous instructions have executed • Advantages: • Help exploit Instruction Level Parallelism (ILP) • Help cover latencies (e.g., L1 data cache miss, divide) • Can Compilers do the Work? • Compilers can statically reschedule instructions • Compilers do not have run-time information • Conditional branch direction → limited to basic blocks • Data values, which may affect calculation time and control • Cache miss / hit
Data Flow Analysis • Example (assume divide takes 20 cycles):
(1) r1 ← r4 / r7
(2) r8 ← r1 + r2
(3) r5 ← r5 + 1
(4) r6 ← r6 - r3
(5) r4 ← r5 + r6
(6) r7 ← r8 * r4
[diagram: data flow graph over r1, r4, r5, r6, r8, comparing the in-order execution schedule with the out-of-order one]
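A minimal sketch of the data-flow analysis behind this example: it extracts the RAW dependences of the six instructions and groups them into levels whose members could start together. The (dest, sources) tuple encoding and the uniform latencies are simplifying assumptions for illustration.

```python
# Sketch: build RAW dependences for the slide's example and compute which
# instructions could start together (resource limits and latencies ignored).

insts = [                      # (dest, sources) for instructions (1)-(6)
    ("r1", ["r4", "r7"]),      # (1) r1 <- r4 / r7
    ("r8", ["r1", "r2"]),      # (2) r8 <- r1 + r2
    ("r5", ["r5"]),            # (3) r5 <- r5 + 1
    ("r6", ["r6", "r3"]),      # (4) r6 <- r6 - r3
    ("r4", ["r5", "r6"]),      # (5) r4 <- r5 + r6
    ("r7", ["r8", "r4"]),      # (6) r7 <- r8 * r4
]

deps = {i: set() for i in range(len(insts))}
last_writer = {}
for i, (dest, srcs) in enumerate(insts):
    for s in srcs:
        if s in last_writer:               # RAW: read a value produced earlier
            deps[i].add(last_writer[s])
    last_writer[dest] = i                  # only RAW is tracked here; WAR/WAW come later

# Group instructions into data-flow levels (instructions in a level are independent)
level = {}
for i in range(len(insts)):
    level[i] = 1 + max((level[d] for d in deps[i]), default=-1)
print(level)   # (1),(3),(4) at level 0; (2),(5) at level 1; (6) at level 2
```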
OOOE – General Scheme [diagram: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire/commit (in-order)] • Fetch & decode instructions in parallel but in order, to fill the instruction pool • Execute ready instructions from the instruction pool • All the data required for the instruction is ready • Execution resources are available • Once an instruction is executed • signal all dependent instructions that the data is ready • Commit instructions in parallel but in order • Can commit an instruction only after all preceding instructions (in program order) have committed
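A toy sketch of this scheme, under simplifying assumptions (single-cycle execution, unlimited execution resources, no renaming): instructions enter a pool in order, any instruction whose sources are ready may execute, and retirement is strictly in order. The four-instruction sequence is made up for the example.

```python
# Toy sketch of the OOOE scheme: in-order fill of an instruction pool,
# out-of-order execution of ready instructions, in-order retirement.

from collections import OrderedDict

insts = [("r1", ["r4"]), ("r2", ["r1"]), ("r3", ["r5"]), ("r4", ["r3"])]

pool = OrderedDict((i, {"dest": d, "srcs": s, "done": False})
                   for i, (d, s) in enumerate(insts))
produced_by = {}                           # register -> latest producer in the pool
for i, e in pool.items():
    e["waits_on"] = {produced_by[r] for r in e["srcs"] if r in produced_by}
    produced_by[e["dest"]] = i

cycle = 0
while pool:
    cycle += 1
    # an instruction is ready when every producer it waits on has executed (or retired)
    ready = [i for i, e in pool.items()
             if not e["done"] and all(pool[d]["done"] for d in e["waits_on"] if d in pool)]
    for i in ready:
        pool[i]["done"] = True
        print(f"cycle {cycle}: executed inst {i}")
    # retire (commit) strictly in program order
    while pool and next(iter(pool.values()))["done"]:
        i, _ = pool.popitem(last=False)
        print(f"cycle {cycle}: retired inst {i}")
```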
Out Of Order Execution – Example • Assume that executing a divide operation takes 20 cycles:
(1) r1 ← r5 / r4
(2) r3 ← r1 + r8
(3) r8 ← r5 + 1
(4) r3 ← r7 - 2
(5) r6 ← r6 + r7
• Inst2 has a RAW dependency on r1 with Inst1 • It cannot be executed in parallel with Inst1 • Can successive instructions pass Inst2? • Inst3 cannot, since Inst2 must read r8 before Inst3 writes to it • Inst4 cannot, since it must write to r3 after Inst2 • Inst5 can
False Dependencies • OOOE creates new dependencies • WAR: write to a register which is read by an earlier inst.:
(1) r3 ← r2 + r1
(2) r2 ← r4 + 3
• WAW: write to a register which is written by an earlier inst.:
(1) r3 ← r1 + r2
(2) r3 ← r4 + 3
• These are false dependencies • There is no missing data • They still prevent executing instructions out of order • Solution: Register Renaming
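A small sketch that classifies the hazard between an earlier and a later instruction as RAW, WAR, or WAW; the two calls reproduce the slide's WAR and WAW examples. The (dest, sources) tuple encoding is an assumption for illustration.

```python
# Sketch: classify the dependence between an earlier and a later instruction.

def classify(earlier, later):
    """earlier/later are (dest, sources) tuples; returns the set of hazards."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    hazards = set()
    if e_dest in l_srcs:
        hazards.add("RAW")     # true dependence: later reads what earlier wrote
    if l_dest in e_srcs:
        hazards.add("WAR")     # false dependence: later overwrites an input of earlier
    if l_dest == e_dest:
        hazards.add("WAW")     # false dependence: both write the same register
    return hazards

print(classify(("r3", ["r2", "r1"]), ("r2", ["r4"])))   # {'WAR'}
print(classify(("r3", ["r1", "r2"]), ("r3", ["r4"])))   # {'WAW'}
```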
Register Renaming • Hold a pool of physical registers • Map architectural registers into physical registers • Before an instruction can be sent for execution • Allocate a free physical register from the pool • The physical register points to the architectural register • When an instruction writes a result • Write the result value to the physical register • When an instruction needs data from a register • Read the data from the physical register allocated to the latest inst. which writes to the same arch. register and precedes the current inst. • If no such instruction exists, read directly from the arch. register • When an instruction commits • Move the value from the physical register to the arch. register it points to
OOOE with Register Renaming: Example [the slide marks the WAW and WAR dependencies in the original code that renaming removes]
        cycle 1              cycle 2
(1) r1 ← mem1                r1' ← mem1
(2) r2 ← r2 + r1             r2' ← r2 + r1'
(3) r1 ← mem2                r1'' ← mem2
(4) r3 ← r3 + r1             r3' ← r3 + r1''
(5) r1 ← mem3                r1''' ← mem3
(6) r4 ← r5 + r1             r4' ← r5 + r1'''
(7) r5 ← 2                   r5' ← 2
(8) r6 ← r5 + 2              r6' ← r5' + 2
Register Renaming Benefits • Removes false dependencies • Removes the architectural limit on the # of registers
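A minimal renaming sketch that reproduces the mapping in this example: every architectural destination gets a fresh name, and sources read the latest mapping. The prime-mark names (r1', r1'', ...) simply mirror the slide's notation; a real machine would allocate physical register numbers from a free pool.

```python
# Sketch: rename architectural destinations to fresh names, as on the slide.

def rename(insts):
    mapping = {}                       # arch register -> current renamed name
    versions = {}                      # arch register -> number of renames so far
    renamed = []
    for dest, srcs in insts:
        srcs = [mapping.get(r, r) for r in srcs]    # read latest version (or arch reg)
        versions[dest] = versions.get(dest, 0) + 1
        mapping[dest] = dest + "'" * versions[dest] # allocate a fresh destination name
        renamed.append((mapping[dest], srcs))
    return renamed

insts = [("r1", ["mem1"]), ("r2", ["r2", "r1"]), ("r1", ["mem2"]),
         ("r3", ["r3", "r1"]), ("r1", ["mem3"]), ("r4", ["r5", "r1"]),
         ("r5", ["2"]),        ("r6", ["r5", "2"])]

for dest, srcs in rename(insts):
    print(dest, "<-", ", ".join(srcs))
# r1' <- mem1 ; r2' <- r2, r1' ; r1'' <- mem2 ; r3' <- r3, r1'' ; r1''' <- mem3 ; ...
```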
Executing Beyond Branches • So far we do not look for instructions ready to execute beyond a branch • Limited to the parallelism within a basic block • A basic block is ~5 instructions long
(1) r1 ← r4 / r7
(2) r2 ← r2 + r1
(3) r3 ← r2 - 5
(4) beq r3, 0, 300
(5) r8 ← r8 + 1
• If the beq is predicted not-taken, Inst 5 can be speculatively executed • We would like to look beyond branches • But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path? Solution: Speculative Execution
Speculative Execution • Execution of instructions from a predicted (yet unsure) path • Eventually, the path may turn out to be wrong • Implementation: • Hold a pool of all not-yet-executed instructions • Fetch instructions into the pool from a predicted path • Instructions for which all operands are ready can be executed • An instruction may change the processor state (commit) only when it is safe • An instruction commits only when all previous (in-order) instructions have committed → instructions commit in order • Instructions which follow a branch commit only after the branch commits • If a predicted branch is wrong, all the instructions which follow it are flushed • Register Renaming helps speculative execution • Renamed registers are kept until speculation is verified to be correct
Speculative Execution – Example [the slide marks the WAW and WAR dependencies, and the region executed speculatively past the branch]
        cycle 1              cycle 2
(1) r1 ← mem1                r1' ← mem1
(2) r2 ← r2 + r1             r2' ← r2 + r1'
(3) r1 ← mem2                r1'' ← mem2
(4) r3 ← r3 + r1             r3' ← r3 + r1''
(5) jmp cond L2              predicted taken to L2
(6) L2: r1 ← mem3            r1''' ← mem3
(7) r4 ← r5 + r1             r4' ← r5 + r1'''
(8) r5 ← 2                   r5' ← 2
(9) r6 ← r5 + 2              r6' ← r5' + 2
• Instructions 6-9 are speculatively executed • If the prediction turns out to be wrong • Flush the instructions following the mispredicted branch • Restore the renaming table to its state at the mispredicted branch • Have each arch register be pointed to by the last phy register that writes to it prior to the speculative section (e.g. r1 → r1'' and not r1 → r1''')
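One way to realize the recovery described here, sketched under the assumption that the rename table is checkpointed at each predicted branch: on a misprediction the speculative mappings are discarded and the checkpoint is restored, so r1 maps back to r1'' rather than r1'''.

```python
# Sketch: checkpoint/restore of the rename table around a predicted branch.

rename_table = {"r1": "r1''", "r2": "r2'", "r3": "r3'"}   # state just before the branch

checkpoint = dict(rename_table)        # taken when the branch is predicted

# Speculative instructions past the branch update the table...
rename_table["r1"] = "r1'''"
rename_table["r4"] = "r4'"

branch_mispredicted = True             # assumed outcome for the example
if branch_mispredicted:
    rename_table = checkpoint          # flush speculation: restore pre-branch mapping

print(rename_table["r1"])              # r1'' , not r1'''
```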
OOO Requires an Accurate Branch Predictor • An accurate branch predictor increases the effective scheduling window size • Speculate across multiple branches (a branch every 5 – 10 instructions) [diagram: instruction pool in which the oldest instructions have high chances to commit, while instructions beyond several predicted branches have low chances to commit]
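A short sketch of why predictor accuracy matters when speculating across many branches: assuming independent predictions, the chance that an instruction deep in the pool is on the correct path (and will therefore commit) falls roughly as accuracy^N.

```python
# Sketch: probability that the fetched path is still correct after n predicted
# branches, assuming independent predictions with the given accuracy.

def on_correct_path(accuracy, n_branches):
    return accuracy ** n_branches

for acc in (0.90, 0.95, 0.99):
    print(acc, [round(on_correct_path(acc, n), 3) for n in (1, 5, 10, 20)])
# With 90% accuracy only ~12% of paths survive 20 branches;
# with 99% accuracy ~82% do, hence the need for an accurate predictor.
```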
Interrupt and Fault Handling • Complications for pipelined and OOO execution • Interrupts occur in the middle of an instruction • A speculative instruction can get a fault (divide by 0, page fault) • Faults are served in program order, at retirement only • Mark an instruction that takes a fault at execution • Instructions older than the faulting instruction are retired • Only when the faulting instruction retires – handle the fault • Flush subsequent instructions • Initiate the fault handling code according to the fault type • Restart the faulting and/or subsequent instructions • Interrupts are served when the next instruction retires • Let the instruction in the current cycle retire • Flush subsequent instructions and initiate the interrupt service code • Fetch the subsequent instructions
Out Of Order Execution – Summary • Advantages • Help exploit Instruction Level Parallelism (ILP) • Help cover latencies (e.g., cache miss, divide) • Superior/complementary to compiler scheduling • Dynamic instruction window • Reg Renaming: can use more registers than the number of architectural registers • Complex micro-architecture • Complex scheduler • Requires a reordering mechanism (retirement) in the back-end for: • Precise interrupt resolution • Misprediction/speculation recovery • Memory ordering • Speculative Execution • Advantage: a larger scheduling window reveals more ILP • Issues: misprediction cost and misprediction recovery