Computer Structure: Out-Of-Order Execution. Lihu Rappoport and Adi Yoaz
What’s Next
• Remember our goal: minimize CPU Time
  • CPU Time = clock cycle × CPI × IC
• So far we have learned
  • Minimize clock cycle → add more pipe stages
  • Minimize CPI → use pipelining
  • Minimize IC → architecture
• In a pipelined CPU:
  • CPI without hazards is 1
  • CPI with hazards is > 1
• Adding more pipe stages reduces the clock cycle but increases the CPI
  • Higher penalty due to control hazards
  • More data hazards
  • Beyond some point, adding more pipe stages does not help
• What can we do? Further reduce the CPI!
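To see the tradeoff concretely, here is a minimal sketch with made-up numbers (the cycle times, CPI values, and instruction count are illustrative only, not measurements) comparing a shallower and a deeper pipeline under CPU Time = clock cycle × CPI × IC:

```python
# Illustrative numbers only: CPU Time = clock cycle * CPI * IC
IC = 1_000_000                              # instruction count (fixed by program and ISA)

shallow = {"cycle_ns": 1.0, "cpi": 1.3}     # longer cycle, fewer hazard stalls
deep    = {"cycle_ns": 0.6, "cpi": 1.9}     # shorter cycle, more hazard stalls

for name, p in (("shallow pipeline", shallow), ("deep pipeline", deep)):
    cpu_time_ms = p["cycle_ns"] * p["cpi"] * IC / 1e6
    print(f"{name}: {cpu_time_ms:.2f} ms")
```

With these numbers the deeper pipeline still wins, but only marginally; once the extra hazards push the CPI high enough, adding stages stops paying off, which is why the next step is to attack the CPI itself.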
A Superscalar CPU
• Duplicating HW in one pipe stage won’t help
  • e.g., having 2 ALUs just moves the bottleneck to other stages
• Getting IPC > 1 requires fetching, decoding, executing, and retiring > 1 instruction per clock:
  [Figure: two parallel pipelines, each with IF, ID, EXE, MEM, WB stages]
The Pentium Processor
[Figure: a shared IF and ID front end makes a pairing decision and dispatches instructions to the U-pipe and the V-pipe]
• Fetches and decodes 2 instructions per cycle
• Before the register file read, decide on pairing
  • Can the two instructions be executed in parallel?
• The pairing decision is based on
  • Data dependencies: the instructions must be independent
  • Resources:
    • Some instructions use resources from both pipes
    • The second pipe can execute only part of the instructions
  • A pairing check along these lines is sketched below
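A minimal sketch of such a pairing decision, assuming a toy instruction record (destination, sources, and a flag saying whether the second pipe can execute it); the real Pentium pairing rules are more detailed, so this only captures the two criteria above:

```python
from dataclasses import dataclass

@dataclass
class Inst:
    dst: str                      # destination register
    srcs: tuple = ()              # source registers
    v_pipe_capable: bool = True   # can the second (V) pipe execute this instruction?

def can_pair(first: Inst, second: Inst) -> bool:
    """Can 'first' issue to the U-pipe and 'second' to the V-pipe in the same cycle?"""
    # Data dependencies: the second instruction must not read or write
    # the first instruction's destination (no RAW, no WAW within the pair).
    if first.dst in second.srcs or first.dst == second.dst:
        return False
    # Resources: the second pipe can execute only part of the instructions.
    return second.v_pipe_capable

# A load into r1 followed by an add that reads r1 cannot pair (RAW on r1)
load = Inst(dst="r1")
add  = Inst(dst="r2", srcs=("r2", "r1"))
print(can_pair(load, add))   # False
```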
Misprediction Penalty in a Superscalar CPU
• MPI, miss-per-instruction:
  MPI = (# incorrectly predicted branches) / (total # of instructions) = MPR × (# predicted branches) / (total # of instructions)
• MPI correlates well with performance. E.g., assume:
  • MPR = 5%, %branches = 20% → MPI = 1%
  • Without hazards IPC = 2 (2 instructions per cycle)
  • Flush penalty of 5 cycles
• We get:
  • MPI = 1% → a flush every 100 instructions
  • IPC = 2 → a flush every 100/2 = 50 cycles
  • 5 cycles of flush penalty every 50 cycles → ~10% loss in performance
• For IPC = 1 we would get
  • 5 cycles of flush penalty per 100 cycles → ~5% loss in performance
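The same arithmetic generalizes: the fraction of cycles lost is the flush cycles per instruction divided by the base cycles per instruction. A minimal sketch using the slide's numbers:

```python
def flush_overhead(mpr, branch_frac, ipc, flush_cycles):
    """Fraction of execution time lost to misprediction flushes."""
    mpi = mpr * branch_frac            # mispredictions per instruction
    base_cpi = 1.0 / ipc               # cycles per instruction without flushes
    wasted_cpi = mpi * flush_cycles    # flush cycles per instruction
    return wasted_cpi / base_cpi

print(flush_overhead(mpr=0.05, branch_frac=0.20, ipc=2, flush_cycles=5))  # 0.10 -> ~10%
print(flush_overhead(mpr=0.05, branch_frac=0.20, ipc=1, flush_cycles=5))  # 0.05 -> ~5%
```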
Is Superscalar Good Enough?
• A superscalar processor can fetch, decode, execute and retire 2 instructions in parallel
  • It can execute only independent instructions in parallel
  • But adjacent instructions are usually dependent
  • The utilization of the second pipe is therefore usually low
  • There are algorithms in which both pipes are highly utilized
• Solution: out-of-order execution
  • Execute instructions based on “data flow” rather than program order
  • Still need to keep the semantics of the original program
Out-Of-Order Execution
• Look ahead in a window of instructions
• Find instructions that are ready to execute
  • Do not depend on data from previous, not-yet-executed instructions
  • Have the required execution resources available
• Start executing an instruction before previous instructions have finished executing
• Advantages
  • Helps exploit Instruction Level Parallelism (ILP)
  • Helps cover latencies (e.g., L1 data cache miss, divide)
• Can compilers do the work?
  • Compilers can statically reschedule instructions
  • But compilers do not have run-time information
    • Conditional branch direction → limited to basic blocks
    • Data values, which may affect calculation time and control
    • Cache miss / hit
Data Flow Analysis
• Example (assume divide takes 20 cycles):
  (1) r1 ← r4 / r7
  (2) r8 ← r1 + r2
  (3) r5 ← r5 + 1
  (4) r6 ← r6 - r3
  (5) r4 ← r5 + r6
  (6) r7 ← r8 * r4
  [Figure: data flow graph over r1, r4, r5, r6, r8, comparing the in-order execution order with an out-of-order one]
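A minimal sketch of this data flow analysis: it recovers the RAW edges of the graph from a (destination, sources) encoding of the six instructions (false WAR/WAW dependencies are ignored here, anticipating register renaming):

```python
# (dst, srcs) for the six instructions above
prog = [("r1", ("r4", "r7")),   # (1) divide, 20 cycles
        ("r8", ("r1", "r2")),   # (2)
        ("r5", ("r5",)),        # (3)
        ("r6", ("r6", "r3")),   # (4)
        ("r4", ("r5", "r6")),   # (5)
        ("r7", ("r8", "r4"))]   # (6)

last_writer = {}   # register -> index of the latest instruction writing it
edges = []         # RAW edges (producer, consumer), 1-based
for i, (dst, srcs) in enumerate(prog, start=1):
    for s in srcs:
        if s in last_writer:
            edges.append((last_writer[s], i))
    last_writer[dst] = i

print(edges)   # [(1, 2), (3, 5), (4, 5), (2, 6), (5, 6)]
# Instructions 3, 4 and 5 do not depend on the divide,
# so out-of-order execution can run them while the divide is in flight.
```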
OOOE – General Scheme
[Figure: Fetch & Decode (in-order) → instruction pool → Execute (out-of-order) → Retire/commit (in-order)]
• Fetch & decode instructions in parallel but in order, to fill the instruction pool
• Execute ready instructions from the instruction pool
  • All the data required by the instruction is ready
  • Execution resources are available
• Once an instruction is executed
  • Signal all dependent instructions that the data is ready
• Commit instructions in parallel but in order
  • An instruction commits only after all preceding instructions (in program order) have committed
• A scheduler loop along these lines is sketched below
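A minimal sketch of this scheme on the same toy program, assuming unit latencies except for the divide, an idealized machine that can execute any number of ready instructions per cycle, and false dependencies already removed (as if registers were renamed); fetch is in order, execution is out of order, and commit is in order:

```python
from collections import deque

# Toy program: (name, dst, srcs, latency)
prog = [("div",  "r1", ("r4", "r7"), 20),
        ("add1", "r8", ("r1", "r2"),  1),
        ("inc",  "r5", ("r5",),       1),
        ("sub",  "r6", ("r6", "r3"),  1),
        ("add2", "r4", ("r5", "r6"),  1),
        ("mul",  "r7", ("r8", "r4"),  1)]

fetch, pool, done_at, cycle = deque(prog), [], {}, 0
value_ready = {}                    # register -> cycle its value becomes available

while fetch or pool:
    if fetch:                       # fetch & decode in order (one per cycle here)
        pool.append(fetch.popleft())
    for inst in list(pool):         # execute any pooled instruction whose data is ready
        name, dst, srcs, lat = inst
        if all(value_ready.get(s, 0) <= cycle for s in srcs):
            value_ready[dst] = cycle + lat      # "signal" dependent instructions
            done_at[name] = cycle + lat
            pool.remove(inst)
    cycle += 1

retire = 0                          # commit in order: wait for all older instructions
for name, *_ in prog:
    retire = max(retire, done_at[name])
    print(f"{name:4s}: finishes execution at cycle {done_at[name]:2d}, commits at cycle {retire}")
```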
Out-Of-Order Execution – Example
• Assume that executing a divide operation takes 20 cycles
  (1) r1 ← r5 / r4
  (2) r3 ← r1 + r8
  (3) r8 ← r5 + 1
  (4) r3 ← r7 - 2
  (5) r6 ← r6 + r7
• Inst 2 has a RAW dependency on r1 with Inst 1
  • It cannot be executed in parallel with Inst 1
• Can successive instructions pass Inst 2?
  • Inst 3 cannot, since Inst 2 must read r8 before Inst 3 writes to it
  • Inst 4 cannot, since it must write to r3 after Inst 2
  • Inst 5 can
False Dependencies
• OOOE creates new dependencies
  • WAR: write to a register which is read by an earlier instruction
    (1) r3 ← r2 + r1
    (2) r2 ← r4 + 3
  • WAW: write to a register which is written by an earlier instruction
    (1) r3 ← r1 + r2
    (2) r3 ← r4 + 3
• These are false dependencies
  • There is no missing data
  • They still prevent executing instructions out of order
• Solution: Register Renaming
Register Renaming
• Hold a pool of physical registers
• A renaming map maps architectural registers to physical registers
• Before an instruction is sent for execution
  • Map the architectural destination register to a newly allocated physical register
  • Replace each architectural source register with the physical register it is mapped to
• When an instruction needs data from a source register for execution
  • Read the data from the physical register allocated to the latest instruction that writes to the same architectural register and precedes the current instruction
  • If no such instruction exists, read directly from the architectural register
• When an instruction produces a result following its execution
  • Write the result value to the physical register
• When an instruction commits
  • Move the value from the physical register to the architectural register it is mapped to
• A renamer along these lines is sketched below
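A minimal sketch of such a renamer, assuming an unlimited pool of physical registers named pr1, pr2, ...; real hardware also frees physical registers at commit, which is omitted here:

```python
from itertools import count

class Renamer:
    def __init__(self):
        self.rename_map = {}       # architectural reg -> physical reg
        self._alloc = count(1)     # unlimited physical register pool (pr1, pr2, ...)

    def rename(self, dst, srcs):
        # Sources first: use the latest mapping, or the architectural register
        # itself if no in-flight instruction has written it yet.
        new_srcs = tuple(self.rename_map.get(s, s) for s in srcs)
        # Destination: always allocate a fresh physical register,
        # which removes WAR and WAW (false) dependencies.
        new_dst = f"pr{next(self._alloc)}"
        self.rename_map[dst] = new_dst
        return new_dst, new_srcs

r = Renamer()
print(r.rename("r1", []))             # ('pr1', ())              r1 <- mem1
print(r.rename("r2", ["r2", "r1"]))   # ('pr2', ('r2', 'pr1'))   r2 <- r2 + r1
print(r.rename("r1", []))             # ('pr3', ())              r1 <- mem2 (WAW on r1 removed)
```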
OOOE with Register Renaming: Example
• The three writes to r1 are WAW hazards, and writing r5 after r4 ← r5 + r1 is a WAR hazard; renaming removes both

  Inst            Alloc/Rename    EXE cyc 1       EXE cyc 2
  r1 ← mem1       r1:pr1          pr1 ← mem1
  r2 ← r2 + r1    r2:pr2                          pr2 ← r2 + pr1
  r1 ← mem2       r1:pr3          pr3 ← mem2
  r3 ← r3 + r1    r3:pr4                          pr4 ← r3 + pr3
  r1 ← mem3       r1:pr5          pr5 ← mem3
  r4 ← r5 + r1    r4:pr6                          pr6 ← r5 + pr5
  r5 ← 2          r5:pr7          pr7 ← 2
  r6 ← r5 + 2     r6:pr8                          pr8 ← pr7 + 2

• The rename map gives each destination a fresh physical register, so the independent operations execute in cycle 1 and the dependent ones in cycle 2
Executing Beyond Branches
• So far we do not look for instructions ready to execute beyond a branch
  • Limited to the parallelism within a basic block
  • A basic block is ~5 instructions long
  (1) r1 ← r4 / r7
  (2) r2 ← r2 + r1
  (3) r3 ← r2 - 5
  (4) beq r3, 0, 300
  (5) r8 ← r8 + 1
  • If the beq is predicted not-taken, Inst 5 can be speculatively executed
• We would like to look beyond branches
• But what if we execute an instruction beyond a branch and it then turns out that we predicted the wrong path?
• Solution: Speculative Execution
Speculative Execution
• Fetch instructions into the pool from the predicted path
  • Instructions for which all operands are ready can be executed
• Instructions commit in order
  • Instructions which follow a branch commit only after the branch commits
  • Branch resolved as correctly predicted → the instructions are safe
  • Branch resolved as wrongly predicted → flush the instructions
    • Reset the renaming map, so all registers are mapped to architectural registers
    • Redirect the instruction fetch to the correct address
• The branch is resolved at EXE → no need to wait until it commits
  • Start the misprediction recovery right after the branch executes
  • Start fetching instructions from the correct path
  • Block renaming of new instructions until the branch commits; once the branch commits, reset the renaming map and resume renaming
  • Better: recover the renaming map to its state right after the branch; once the renaming map is recovered, resume renaming (see the sketch below)
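A minimal sketch of the "better" option, assuming the renamer from the earlier sketch: take a snapshot (checkpoint) of the rename map when a branch is renamed, and restore it if the branch turns out to be mispredicted:

```python
import copy

class SpeculativeRenamer:
    def __init__(self):
        self.rename_map = {}       # architectural reg -> physical reg
        self.checkpoints = {}      # branch id -> copy of the rename map at that branch
        self._next_pr = 1

    def rename(self, dst, srcs):
        new_srcs = tuple(self.rename_map.get(s, s) for s in srcs)
        new_dst = f"pr{self._next_pr}"
        self._next_pr += 1
        self.rename_map[dst] = new_dst
        return new_dst, new_srcs

    def rename_branch(self, branch_id):
        # Checkpoint the map right after the branch, so recovery can start
        # as soon as the branch executes, without waiting for it to commit.
        self.checkpoints[branch_id] = copy.copy(self.rename_map)

    def recover(self, branch_id):
        # Branch executed and was mispredicted: drop the wrong-path mappings
        # and resume renaming from the checkpoint.
        self.rename_map = self.checkpoints.pop(branch_id)

r = SpeculativeRenamer()
r.rename("r1", ["r4"])        # correct-path instruction
r.rename_branch("beq#1")      # predicted branch
r.rename("r8", ["r8"])        # wrong-path instruction (will be flushed)
r.recover("beq#1")            # misprediction: r8's wrong-path mapping is gone
print(r.rename_map)           # {'r1': 'pr1'}
```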
OOOE with Register Renaming: Example (with a predicted branch)

  Inst            Alloc/Rename    EXE cyc 1                EXE cyc 2
  r1 ← mem1       r1:pr1          pr1 ← mem1
  r2 ← r2 + r1    r2:pr2                                   pr2 ← r2 + pr1
  r1 ← mem2       r1:pr3          pr3 ← mem2
  r3 ← r3 + r1    r3:pr4                                   pr4 ← r3 + pr3
  jcc L2                          predicted taken to L2
L2:
  r1 ← mem3       r1:pr5          pr5 ← mem3
  r4 ← r5 + r1    r4:pr6                                   pr6 ← r5 + pr5
  r5 ← 2          r5:pr7          pr7 ← 2
  r6 ← r5 + 2     r6:pr8                                   pr8 ← pr7 + 2

• If the branch mispredicts, there are 2 options
  • When the branch commits: reset the rename map, and then rename the correct-path instructions
  • When the branch executes: recover the rename map, and then rename the correct-path instructions
OOO Requires an Accurate Branch Predictor
• An accurate branch predictor increases the effective scheduling window size
• Speculate across multiple branches (a branch every 5 – 10 instructions)
  [Figure: instruction pool containing several branches; older instructions have high chances to commit, younger ones low chances]
Interrupt and Fault Handling
• Complications for pipelined and OOO execution
  • Interrupts occur in the middle of an instruction
  • A speculative instruction can take a fault (divide by 0, page fault)
• Faults are served in program order, at retirement only
  • Mark an instruction that takes a fault at execution
  • Instructions older than the faulting instruction are retired
  • Only when the faulting instruction retires, handle the fault
    • Flush subsequent instructions
    • Initiate the fault-handling code according to the fault type
    • Restart the faulting and/or subsequent instructions
• Interrupts are served when the next instruction retires
  • Let the instruction in the current cycle retire
  • Flush subsequent instructions and initiate the interrupt service code
  • Then re-fetch the subsequent instructions
• A retirement loop along these lines is sketched below
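A minimal sketch of precise fault handling at retirement, assuming a reorder buffer of already-executed entries in program order, each carrying an optional fault flag set at execution time; the fault is acted on only when its instruction is the oldest:

```python
from collections import deque

def retire(rob):
    """rob: deque of (name, fault) entries in program order, already executed."""
    while rob:
        name, fault = rob[0]
        if fault is None:
            rob.popleft()    # oldest instruction retires normally
            print(f"retired {name}")
        else:
            rob.clear()      # flush the faulting instruction and everything younger
            print(f"fault '{fault}' on {name}: flush younger insts, run handler, restart")
            break

# The third instruction took a page fault during (speculative) execution,
# but it is handled only after the two older instructions have retired.
retire(deque([("add", None), ("load", None), ("div", "divide-by-zero"), ("store", None)]))
```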
Out-Of-Order Execution – Summary
• Advantages
  • Helps exploit Instruction Level Parallelism (ILP)
  • Helps cover latencies (e.g., cache miss, divide)
  • Superior/complementary to the compiler scheduler
    • Dynamic instruction window
  • Register renaming: can use more registers than the number of architectural registers
• Complex micro-architecture
  • Complex scheduler
  • Requires a reordering mechanism (retirement) in the back-end for:
    • Precise interrupt resolution
    • Misprediction/speculation recovery
    • Memory ordering
• Speculative Execution
  • Advantage: a larger scheduling window reveals more ILP
  • Issues: misprediction cost and misprediction recovery