Computer Architecture

Computer Architecture Lecture 6 Overview of Branch Prediction

Prediction accuracy of a 4096-entry 2-bit prediction buffer vs. infinite buffer
[Figure: bar chart of the frequency of mispredictions (axis 0% to 18%) per benchmark, for 4096 entries at 2 bits per entry vs. unlimited entries at 2 bits per entry. Labeled values (4096-entry / unlimited): matrix300 0% / 0%; spice 9% / 9%; fpppp 9% / 9%; gcc 12% / 11%; espresso 5% / 5%; eqntott (values not shown); li 10% / 10%.]

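Comparison of 2-bit predictors
[Figure: bar chart of the frequency of mispredictions (%), axis 0 to 18, per benchmark for three predictors: local, 4096 entries at 2 bits per entry; unlimited entries at 2 bits per entry; 1024 entries (2,2). Labeled values (4096-entry / unlimited / (2,2)): spice 9% / 9% / 5%; fpppp 9% / 9% / 5%; gcc 12% / 11% / 11%; espresso 5% / 5% / 4%; li 10% / 10% / 5%. matrix300 is labeled 0% / 0% for the first two predictors, and eqntott is labeled 6% for (2,2).]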
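The (2,2) predictor in this comparison can be sketched as follows: 2 bits of global branch history select one of four 2-bit counters in each entry. This is a hedged illustration, not the exact hardware behind the figure; the class name, the indexing by word-aligned PC, and the alternating test branch are assumptions. Note that 1024 entries x 4 counters x 2 bits gives the same 8192 bits as the 4096-entry 2-bit buffer, so the two are of equal capacity.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m global-history bits pick one of 2**m n-bit counters per entry."""

    def __init__(self, entries=1024, m=2, n=2):
        self.entries, self.m, self.n = entries, m, n
        self.max_count = (1 << n) - 1
        self.history = 0  # last m branch outcomes, newest in bit 0
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def size_bits(self):
        return self.entries * (1 << self.m) * self.n

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the 2-bit byte offset.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        counter = self.table[self._index(pc)][self.history]
        return counter > self.max_count // 2  # upper half of the range means taken

    def update(self, pc, taken):
        row = self.table[self._index(pc)]
        c = row[self.history]
        row[self.history] = min(self.max_count, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


# Synthetic branch that alternates taken / not taken.
p = CorrelatingPredictor()
misses = []
for i in range(200):
    taken = (i % 2 == 0)
    misses.append(p.predict(0x40) != taken)
    p.update(0x40, taken)
print(p.size_bits(), sum(misses))
```

An alternating branch is a hard case for a plain 2-bit counter, but once the history-selected counters warm up, the previous outcomes identify which way the branch goes next.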

Tournament Predictor
[Figure: four-state selector diagram. States 00 and 01 use predictor P2; states 10 and 11 use predictor P1. Transitions between states are labeled by whether P1 and P2 predicted correctly.]
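The selector in the tournament predictor can be sketched as a 2-bit saturating counter. The state-to-predictor mapping below follows the slide (00 and 01 use P2, 10 and 11 use P1); the transition rule, moving only when exactly one predictor was right, is an assumption, as are the class and method names.

```python
class TournamentSelector:
    """2-bit chooser: states 0 (00) and 1 (01) use P2; states 2 (10) and 3 (11) use P1."""

    def __init__(self, state=0):
        self.state = state

    def use_p1(self):
        return self.state >= 2

    def choose(self, p1_pred, p2_pred):
        """Return the prediction of whichever predictor is currently selected."""
        return p1_pred if self.use_p1() else p2_pred

    def update(self, p1_correct, p2_correct):
        # Assumed rule: move toward the predictor that was right when the other
        # was wrong; stay put when both were right or both were wrong.
        if p1_correct and not p2_correct:
            self.state = min(3, self.state + 1)
        elif p2_correct and not p1_correct:
            self.state = max(0, self.state - 1)


sel = TournamentSelector()          # starts in state 00, trusting P2
sel.update(p1_correct=True, p2_correct=False)
sel.update(p1_correct=True, p2_correct=False)
print(sel.state, sel.use_p1())
```

Because the counter has two states per side, one disagreement in favor of the other predictor is not enough to switch away from a predictor in a strong state.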

Misprediction rate of three predictors
[Figure: conditional branch misprediction rate (0% to 8%) vs. total predictor size (0 to 512 Kbits) for the local 2-bit predictor, the correlating predictor, and the tournament predictor.]
• Note that predictors of equal capacity must be compared.
• The size of each level has to be selected to optimize prediction accuracy.
• Influencing factors: the degree of interference between branches, and whether the program is likely to benefit from local or global history.

Why Prediction
• Prediction reduces branch hazards in pipelined processors.
• Used in almost all pipelined processors.
[Figure: a two-input mux (0/1) with the labels PC+4, Branch Target Address Cache, and Actual Next PC; the Branch Prediction Buffer supplies the branch prediction (T/NT).]

A Branch Target Buffer
[Figure: the PC of the instruction to fetch is looked up in the branch target buffer. Each of the buffer's entries holds a predicted PC plus prediction hardware (counter, etc.) recording whether the branch is predicted taken or untaken.]
• No match: not a branch instruction; proceed normally.
• Match (=): the instruction is a branch; use the predicted PC as the new PC.

Handling an instruction with a branch-target buffer
• IF: send the PC to memory and to the branch-target buffer; check whether an entry is found in the branch-target buffer.
• Entry not found:
– ID: is the instruction a taken branch?
– EX, if yes: enter the branch instruction address and next PC into the branch-target buffer.
– EX, if no: normal instruction execution.
• Entry found:
– ID: send out the predicted PC.
– EX: taken branch? If yes, the branch was correctly predicted; continue execution with no stalls. If no, the branch was mispredicted; kill the fetched instruction.
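The branch-target buffer flow on this slide can be sketched in a few lines. This is a simplified model under stated assumptions: the buffer is an unbounded dictionary rather than a fixed number of entries, instructions are 4 bytes, and an entry is deleted when its branch turns out not taken (the slide only says the fetched instruction is killed). The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the PC of a taken branch to its predicted next PC (simplified model)."""

    def __init__(self):
        self.entries = {}  # assumption: unbounded; a real buffer has a fixed size

    def fetch(self, pc):
        """IF: return the predicted next PC, or fall through to pc + 4 on a miss."""
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, is_branch, taken, target):
        """Once the outcome is known, update the buffer.

        Returns True if the instruction fetched after pc was the wrong one.
        """
        predicted = self.fetch(pc)
        actual = target if (is_branch and taken) else pc + 4
        if is_branch and taken:
            # Enter branch instruction address and next PC into the buffer.
            self.entries[pc] = target
        elif pc in self.entries:
            # Assumption: drop an entry whose branch fell through.
            del self.entries[pc]
        return predicted != actual


btb = BranchTargetBuffer()
first = btb.resolve(0x100, True, True, 0x80)   # miss in buffer, branch taken
second = btb.resolve(0x100, True, True, 0x80)  # hit in buffer, branch taken
print(first, second)
```

The first encounter of a taken branch pays a penalty because nothing is in the buffer yet; the second encounter hits and continues with no stall, matching the two "taken" paths of the flowchart.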

Penalties for possible combinations of whether the branch is in the buffer

Static Super Scalar pipeline in operation
• Fetch 64 bits per clock cycle; Int on left, FP on right
– Can only issue the 2nd instruction if the 1st instruction issues
– More ports for FP registers to do an FP load and an FP op in a pair
Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
• A 1-cycle load delay causes a delay to 3 instructions in the superscalar:
– the instruction in the right half can't use the loaded value, nor can the instructions in the next slot

Dynamic Super Scalar pipeline in operation
[Figure: the instruction cache feeds two ISSUE / rename-to-RS stages (check for RS, check for RAW) over a wider bus, with Read Reg and Write Reg ports. Each functional unit has reservation stations that wait for operands: LD/ST (EX, TAC, Mem Access), Integer (EX), FP stages A1 to A4, FP stages M1 to M7, and Divide. Results are broadcast on CDB #1 and CDB #2.]

Example 1
Loop: L.D    F0,0(R1)    ; F0 = array element
      ADD.D  F4,F0,F2    ; F4 = F0 + F2
      S.D    F4,0(R1)    ; store result
      ADDIU  R1,R1,#-8   ; 8 bytes (per DW)
      BNE    R1,R2,Loop  ; branch if R1 != R2

Dual issue, 1 Integer Unit, FPMUL = 3 cc

Dual issue, 1 Integer Unit

Dual issue, 1 Integer Unit, FPMUL = 3 cc

Dual issue, 2 Integer Unit

Speculative Execution
• Need to overcome:
– Branch hazards
– Precise exceptions

Speculative Pipeline
[Figure: ISSUE / rename to RS (check for RS, check for RAW) with Read Reg and Write Reg ports. Each functional unit has reservation stations that wait for operands: LD/ST (EX, TAC, Mem Access), Integer (EX), FP stages A1 to A4, FP stages M1 to M7, and Divide. Results go over the CDB to the ROB.]

The Hardware: Reorder Buffer
• If instructions write their results in program order, registers and memory always get the correct values.
• Reorder buffer (ROB): reorders out-of-order instructions into program order at the time of writing registers/memory (commit).
• If an instruction goes wrong, handle it at commit time: just flush the instructions after it.
• An instruction cannot write registers/memory immediately after execution, so the ROB also buffers the results. There is no such place in the original Tomasulo design.
[Figure: block diagram with IM, Fetch Unit, Decode, Rename, Reorder Buffer, Regfile, S-buf, L-buf, two RS blocks feeding FU1 and FU2, and DM.]

Speculative Tomasulo Algorithm
• Issue: get an instruction from the FP op queue
– Condition: a free RS at the required FU
– Actions: (1) decode the instruction; (2) allocate an RS and a ROB entry; (3) do source register renaming; (4) do destination register renaming; (5) read the register file; (6) dispatch the decoded and renamed instruction to the RS and ROB
• Execution: operate on operands (EX)
– Condition: at a given FU, at least one instruction is ready
– Action: select a ready instruction and send it to the FU
• Write result: finish execution (WB)
– Condition: at a given FU, some instruction finishes FU execution
– Actions: (1) the FU writes to the CDB, which broadcasts to all RSs and to the ROB; (2) the FU broadcasts the tag (ROB index) to all RSs; (3) de-allocate the RS. Note: no register status update at this time

Speculative Tomasulo Algorithm
• Commit: update the register with the reordered result
– Condition: the ROB is not empty and the instruction at the ROB head has finished execution
– Actions if there is no misprediction/exception: (1) write the result to register/memory; (2) update the register status; (3) de-allocate the ROB entry
– Actions on a misprediction/exception: flush the pipeline, e.g. (1) flush the IFQ; (2) clear the register status; (3) flush all RSs and reset the FUs; (4) reset the ROB
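The commit rule can be illustrated with a small model of the reorder buffer. This is a sketch under assumptions, not the lecture's implementation: it retires at most one instruction per call, tracks only register results (stores to memory, the IFQ, RSs, and FUs are left out), and every class, method, and field name is hypothetical.

```python
from collections import deque


class ROBEntry:
    def __init__(self, dest):
        self.dest = dest           # destination register name, or None
        self.value = None
        self.done = False          # set at write result
        self.mispredicted = False  # set when a branch resolves the wrong way


class ReorderBuffer:
    def __init__(self):
        self.queue = deque()   # program order, head at the left
        self.regs = {}         # architectural register file
        self.reg_status = {}   # register -> ROB entry that will produce it

    def issue(self, dest=None):
        entry = ROBEntry(dest)
        self.queue.append(entry)
        if dest is not None:
            self.reg_status[dest] = entry  # destination register renaming
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        # Results are buffered in the ROB; register status is not updated here.
        entry.value, entry.done, entry.mispredicted = value, True, mispredicted

    def commit(self):
        """Try to retire the head instruction; report what happened."""
        if not self.queue or not self.queue[0].done:
            return "stall"
        head = self.queue.popleft()
        if head.mispredicted:
            self.queue.clear()       # reset the ROB
            self.reg_status.clear()  # clear the register status
            return "flush"
        if head.dest is not None:
            self.regs[head.dest] = head.value
            if self.reg_status.get(head.dest) is head:
                del self.reg_status[head.dest]
        return "commit"


rob = ReorderBuffer()
load = rob.issue("F0")
branch = rob.issue()
add = rob.issue("F4")
rob.write_result(add, 3.0)                  # finishes out of order
events = [rob.commit()]                     # head (load) is not done yet
rob.write_result(load, 1.0)
events.append(rob.commit())                 # load retires in order
rob.write_result(branch, mispredicted=True)
events.append(rob.commit())                 # speculative add is discarded
print(events, rob.regs)
```

In the usage above, the add finishes first but never reaches the register file: it sits in the ROB behind the branch, and the flush at commit discards it, which is how the ROB keeps register state correct under misprediction.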
