Branch Prediction

Static and Dynamic Branch Prediction Techniques

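Why Branch Prediction
[Figure: pipeline from PC, Fetch (I-cache), Fetch Buffer, Decode, Issue Buffer, Execute (Func. Units), Result Buffer, Commit to Arch. State. The next fetch is started at the PC stage and the branch is executed at the Execute stage; the distance between the two is the control flow penalty.]
• Modern processors have 10-14 pipeline stages between next-PC calculation and branch resolution
• Work lost if the pipeline makes a wrong prediction ~ loop length x pipeline width, where loop length is the number of stages between next-PC calculation and branch resolution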
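A quick way to get a feel for the "loop length x pipeline width" estimate is to plug in numbers. The sketch below is only an illustration: the function name and the 12-stage, 4-wide machine are assumptions, not figures from the slides (12 was picked from the 10-14 stage range).

```python
def wasted_issue_slots(loop_length_stages: int, pipeline_width: int) -> int:
    """Issue slots discarded by one misprediction (loop length x pipeline width)."""
    return loop_length_stages * pipeline_width


# Hypothetical machine: 12 stages from next-PC calculation to branch
# resolution, 4 instructions wide.
print(wasted_issue_slots(12, 4))  # 48
```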

Branch Penalties in a Superscalar Are Extensive

Reducing Control Flow Penalty
• Software solutions
  • Minimize branches - loop unrolling
    • Increases the run length
• Hardware solutions
  • Find something else to do - delay slots
  • Speculate - dynamic branch prediction
    • Speculative execution of instructions beyond the branch
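As a rough illustration of how loop unrolling minimizes branches, the sketch below counts how often the loop test is evaluated with and without 4x unrolling. The function names, the explicit branch counter, and the unroll factor are assumptions made for this example; a compiler would do this at the machine-code level.

```python
def sum_rolled(values):
    """One loop test per element (plus the final failing test)."""
    total, i, loop_tests = 0, 0, 0
    while True:
        loop_tests += 1              # the loop-closing branch is evaluated
        if i >= len(values):
            break
        total += values[i]
        i += 1
    return total, loop_tests


def sum_unrolled4(values):
    """Same sum, unrolled 4x: one loop test per four elements."""
    if len(values) % 4 != 0:
        raise ValueError("sketch assumes a length that is a multiple of 4")
    total, i, loop_tests = 0, 0, 0
    while True:
        loop_tests += 1
        if i >= len(values):
            break
        # Longer run of straight-line work between branches.
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    return total, loop_tests


data = list(range(16))
print(sum_rolled(data))     # (120, 17)
print(sum_unrolled4(data))  # (120, 5)
```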

Branch Prediction
• Motivation:
  • Branch penalties limit performance of deeply pipelined processors
  • Much worse for superscalar processors
  • Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
• Required hardware support:
  • Dynamic prediction HW:
    • Branch history tables, branch target buffers, etc.
  • Mispredict recovery mechanisms:
    • Keep computation result separate from commit
    • Kill instructions following branch
    • Restore state to state following branch

Static Branch Prediction - review
[Figure: a backward JZ branch and a forward JZ branch]
• Overall probability a branch is taken is ~60-70%, but:
  • backward: 90%
  • forward: 50%
• ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110
  • bne0 (preferred taken), beq0 (not taken)
• ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64
  • typically reported as ~80% accurate
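One static heuristic that follows from the backward/forward numbers is "predict backward branches taken, forward branches not taken". The sketch below is an illustration of that heuristic, not something the slides prescribe; the function names and the 50/50 mix of backward and forward branches are assumptions.

```python
def predict_taken_static(branch_pc: int, target_pc: int) -> bool:
    """Backward branch (target below the branch) -> predict taken."""
    return target_pc < branch_pc


def expected_accuracy(backward_fraction: float,
                      p_taken_backward: float = 0.9,
                      p_taken_forward: float = 0.5) -> float:
    """Expected hit rate of the heuristic for a given branch mix.

    Backward branches are predicted taken, so they are right with
    probability p_taken_backward; forward branches are predicted
    not taken, so they are right with probability 1 - p_taken_forward.
    """
    forward_fraction = 1.0 - backward_fraction
    return (backward_fraction * p_taken_backward
            + forward_fraction * (1.0 - p_taken_forward))


print(predict_taken_static(branch_pc=0x1040, target_pc=0x1000))  # True
print(predict_taken_static(branch_pc=0x1000, target_pc=0x1040))  # False
print(round(expected_accuracy(backward_fraction=0.5), 2))        # 0.7
```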

Branch Prediction Needs
• Target address generation
  • Get register: PC, link reg., GP reg.
  • Calculate: +/- offset, auto inc/dec
  • Target speculation
• Condition resolution
  • Get register: condition code reg., count reg., other reg.
  • Compare registers
  • Condition speculation

Target address generation takes time

Condition resolution takes time

Solution: Branch speculation

Branch Prediction Schemes
• 2-bit Branch-Prediction Buffer
• Branch Target Buffer
• Correlating Branch Prediction Buffer
• Tournament Branch Predictor
• Integrated Instruction Fetch Units
• Return Address Predictors (for subroutines, Pentium, Core Duo)
• Predicated Execution (Itanium)

Dynamic Branch Prediction: learning based on past behavior
• Incoming stream of addresses
• Fast outgoing stream of predictions
• Correction information returned from pipeline
[Figure: Branch Predictor holding History Information. Input: Incoming Branches { Address }. Output: Prediction { Address, Value }. Feedback: Corrections { Address, Value }.]

Branch History Table (BHT): table of predictors
[Figure: Branch PC indexes a table of entries Predictor 0, Predictor 1, ..., Predictor 7]
• Each branch given its own predictor
• BHT is a table of “predictors”
  • Could be 1-bit or more
  • Indexed by PC address of branch
• Problem: in a loop, a 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  • End of loop case: when it exits the loop
  • First iteration the next time the loop is entered: it predicts exit instead of looping
  • Most schemes use at least 2-bit predictors
• Performance = ƒ(accuracy, cost of misprediction)
  • Misprediction -> flush reorder buffer
• In fetch stage of branch:
  • Use predictor to make prediction
• When branch completes:
  • Update corresponding predictor
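The two-mispredictions-per-loop behavior of a 1-bit predictor can be checked with a small simulation. This is a minimal sketch under stated assumptions: a single predictor entry, and a loop-closing branch that is taken 9 times and then falls through once; the function and variable names are made up for the example.

```python
def simulate_one_bit(outcomes, initial_prediction=True):
    """Count mispredictions of a 1-bit predictor (predict the last outcome)."""
    prediction = initial_prediction
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken           # 1 bit of state: remember last outcome
    return mispredictions


# Assumed loop: branch taken 9 times, then not taken once (loop exit).
loop_run = [True] * 9 + [False]

# Three executions of the loop: 1 miss in the first run (exit only, because
# the initial prediction happens to be "taken"), then 2 misses per run.
print(simulate_one_bit(loop_run * 3))  # 5
```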

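Branch History Table Organization
[Figure: k bits of the fetch PC (above the two low-order 0 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which answers "Taken/¬Taken?". In parallel, the I-cache supplies the instruction; its opcode answers "Branch?" and its offset is added to the PC to form the target PC.]
• Target PC calculation takes time
• 4K-entry BHT, 2 bits/entry, ~80-90% correct predictions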
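The indexing step can be sketched in a few lines. The code below assumes 4-byte aligned instructions (so the two low PC bits are dropped) and k = 12 index bits for a 4K-entry table; the function name and example addresses are hypothetical. It also shows that two branches whose PCs differ only above the index bits share one entry.

```python
def bht_index(pc: int, index_bits: int = 12, align_bits: int = 2) -> int:
    """Select k PC bits above the alignment bits as the BHT index."""
    return (pc >> align_bits) & ((1 << index_bits) - 1)


entries = 1 << 12                  # 2^k entries with k = 12 -> 4096
storage_bits = entries * 2         # 2 bits per entry

print(entries, storage_bits)       # 4096 8192
print(bht_index(0x1000))           # 1024
print(bht_index(0x1004))           # 1025
# 0x5000 differs from 0x1000 only above the index bits, so it aliases.
print(bht_index(0x5000) == bht_index(0x1000))  # True
```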

2-bit Dynamic Branch Prediction: more accurate than 1-bit
• Better solution: 2-bit scheme where the prediction changes only after two mispredictions in a row:
  • Red: stop, not taken
  • Green: go, taken
• Adds hysteresis to decision making process
[Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; transitions labeled T (taken) and NT (not taken)]
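One common realization of a 2-bit scheme is a saturating counter; the state diagram on the slide may differ in its exact transitions, so treat the following as a sketch under that assumption. The class name, the state encoding (0-1 predict not taken, 2-3 predict taken), and the 9-taken-then-exit loop are all chosen for illustration.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Assumed loop: branch taken 9 times, then not taken once, executed 3 times.
loop_run = [True] * 9 + [False]
# Only the loop exit is mispredicted: one miss per execution of the loop.
print(count_mispredictions(loop_run * 3, TwoBitCounter()))  # 3
```

A single not-taken outcome moves the counter from 3 to 2, which still predicts taken, so the first iteration of the next loop execution is predicted correctly.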

BTB: Branch Address at Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC column; each entry also holds a Predicted PC and prediction state bits]
  • Yes: instruction is a branch; use predicted PC as next PC
  • No: branch not predicted, proceed normally (Next PC = PC+4)
• Only predicted-taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded
• Later: check prediction; if wrong, kill the instruction and update BPb (branch prediction buffer)
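The lookup can be sketched as a map from branch PC to predicted target. This is a simplification under stated assumptions: a Python dict stands in for the finite, tagged hardware table, there are no prediction state bits, and an entry is simply dropped when the branch resolves not taken. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: holds only branches currently predicted taken."""

    def __init__(self):
        self.entries = {}                    # branch PC -> predicted target PC

    def next_pc(self, pc: int) -> int:
        # Hit: use the predicted PC. Miss: fall through to PC+4.
        return self.entries.get(pc, pc + 4)

    def update(self, pc: int, taken: bool, target: int) -> None:
        # Called once the branch resolves.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))   # 0x104 (miss: proceed normally)
btb.update(0x100, taken=True, target=0x80)
print(hex(btb.next_pc(0x100)))   # 0x80  (hit: redirect fetch)
print(hex(btb.next_pc(0x104)))   # 0x108 (non-branch: PC+4)
```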

BTB contains only Branch & Jump Instructions
• BTB contains information for branch and jump instructions only
  • Not updated for other instructions
• For all other instructions the next PC is PC+4
  • Achieved without decoding the instruction

Combining BTB and BHT
• BTB entries considerably more expensive than BHT, but fetch is redirected earlier in the pipeline and indirect branches (JR) can be accelerated
• BHT can hold many more entries and is more accurate
• BHT in later pipeline stage corrects when BTB misses a predicted taken branch
• BTB/BHT only updated after branch resolves in E stage
[Figure: pipeline stages, with the BTB consulted in the early fetch stages and the BHT in stage B]
  A  PC Generation/Mux
  P  Instruction Fetch Stage 1
  F  Instruction Fetch Stage 2
  B  Branch Address Calc/Begin Decode
  I  Complete Decode
  J  Steer Instructions to Functional Units
  R  Register File Read
  E  Integer Execute

Subroutine Return Stack
• Small stack to accelerate subroutine returns
  • More accurate than BTBs
• Push return address when function call executed
• Pop return address when subroutine return decoded
• k entries (typically k=8-16)
[Figure: stack holding &nexta, &nextb, &nextc]
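The push/pop behavior can be sketched with a bounded stack. The code below is an illustration only: the class name, the 4-byte instruction size, and the choice to discard the oldest entry on overflow are assumptions, not details given on the slide.

```python
from collections import deque


class ReturnAddressStack:
    """Toy k-entry return stack; the oldest entry is lost on overflow."""

    def __init__(self, capacity: int = 8):
        # A deque with maxlen drops items from the left when full.
        self.stack = deque(maxlen=capacity)

    def on_call(self, call_pc: int, instr_bytes: int = 4) -> None:
        self.stack.append(call_pc + instr_bytes)     # push return address

    def on_return(self):
        # Pop the predicted return address; None means "no prediction".
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(capacity=8)
ras.on_call(0x100)   # call a
ras.on_call(0x200)   # call b
ras.on_call(0x300)   # call c
print([hex(ras.on_return()) for _ in range(3)])  # ['0x304', '0x204', '0x104']
print(ras.on_return())                           # None
```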

Mispredict Recovery
• In-order execution machines:
  • Instructions issued after branch cannot write back before branch resolves
  • All instructions in pipeline behind mispredicted branch are killed

Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions:
  if (x) then A = B op C else NOP
  • If false, then neither store result nor cause exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  • IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
  • This transformation is called “if-conversion”
• Drawbacks to conditional instructions
  • Still takes a clock even if “annulled”
  • Stall if condition evaluated late
  • Complex conditions reduce effectiveness; condition becomes known late in pipeline
[Figure: predicate x guarding A = B op C]
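If-conversion can be mimicked in a high-level language by replacing the branch with a select. The sketch below is only an analogy: Python has no predicated instructions, "op" is taken to be integer addition, and the function names are made up. Note that the converted version always computes B op C, which mirrors the drawback that an annulled instruction still takes a clock; unlike real predication, the sketch does not suppress exceptions from the unused computation.

```python
def with_branch(x, a, b, c):
    """if (x) then A = B op C else NOP, written with a branch."""
    if x:
        a = b + c
    return a


def if_converted(x, a, b, c):
    """Same result without a branch on x: compute, then select."""
    p = int(bool(x))                 # predicate as 0 or 1
    computed = b + c                 # always executed, even when p == 0
    return p * computed + (1 - p) * a


print(with_branch(True, 1, 2, 3), if_converted(True, 1, 2, 3))    # 5 5
print(with_branch(False, 1, 2, 3), if_converted(False, 1, 2, 3))  # 1 1
```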

Accuracy v. Size (SPEC89)

Dynamic Branch Prediction Summary
• Prediction becoming important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches correlated with next branch
• Tournament predictor: more resources to competitive solutions and pick between them
• Branch Target Buffer: include branch address & prediction
• Predicated execution can reduce number of branches, number of mispredicted branches
• Return address stack for prediction of indirect jump
