CSE 420/598 Computer Architecture Lec 10 – Chapter 2 - DynPred-BTB

Sandeep K. S. Gupta, School of Computing and Informatics, Arizona State University. Based on slides by David Patterson.

Agenda
• Dynamic Branch Prediction (Review)
• BTB

Applying the Prediction
• The earliest time we can begin using the prediction is when
  • the prediction bits are available
  • the branch target is available
• The earliest time we can know whether we have predicted correctly is when
  • the branch condition is resolved
• The difference between these times is roughly what is saved by a correct prediction
• If the branch target is available late, the window of savings is reduced

Correlating Predictors
• The prediction is a function of the last k branch outcomes
• The branch history buffer is indexed by
  • m bits taken from the address of the branch
  • k bits of branch history
  • i.e., m + k bits all told
• Each entry in the branch history buffer has q bits (i.e., is a q-bit predictor)
• The branch history buffer has 2^(m+k) × q bits of storage
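The indexing scheme above can be sketched in a few lines of Python. This is only an illustrative model, not the design of any particular machine: the class name `CorrelatingPredictor`, the use of saturating counters for the q-bit entries, the "weakly not-taken" initial value, and the choice to drop the two low PC bits (4-byte instructions) are all assumptions made for the example.

```python
class CorrelatingPredictor:
    """Toy model: m address bits + k global history bits index q-bit counters."""

    def __init__(self, m: int, k: int, q: int = 2):
        self.m, self.k, self.q = m, k, q
        self.max_count = (1 << q) - 1
        # 2^(m+k) entries; each starts just below the taken threshold (assumption).
        self.table = [self.max_count // 2] * (1 << (m + k))
        self.history = 0  # last k outcomes, newest outcome in bit 0

    def storage_bits(self) -> int:
        # Matches the slide's formula: 2^(m+k) entries of q bits each.
        return (1 << (self.m + self.k)) * self.q

    def _index(self, pc: int) -> int:
        # Assumes 4-byte instructions, so the two low PC bits carry no information.
        addr_bits = (pc >> 2) & ((1 << self.m) - 1)
        return (addr_bits << self.k) | self.history

    def predict(self, pc: int) -> bool:
        # Predict taken when the counter is in its upper half.
        return self.table[self._index(pc)] > self.max_count // 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # Shift the new outcome into the k-bit history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)
```

With m = 4, k = 2, q = 2, `storage_bits()` gives 2^6 × 2 = 128 bits. A branch that strictly alternates taken/not-taken is mispredicted by a plain 2-bit counter about half the time, but in this model it is predicted correctly once the two history patterns have trained their own counters.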

Correlating predictor with 2 history bits and 2 state bits: a (2,2) predictor

Local versus Global

Hashing Correlation
For the same amount of table storage, we can get better associativity when there are fewer branches but their behavior is highly correlated.

Tournament Predictor
• Move “toward” the other predictor when I am wrong and he is right
• Stay put when I am right and he is right, or I am wrong and he is wrong
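The selector rule above might be modeled as a small saturating counter. The sketch below assumes a 2-bit counter where low values favor predictor 0 and high values favor predictor 1; the function names `update_selector` and `choose` are hypothetical, and the counter width is an assumption rather than something the slide specifies.

```python
def update_selector(counter: int, p0_correct: bool, p1_correct: bool,
                    bits: int = 2) -> int:
    """Return the new selector value after one resolved branch."""
    top = (1 << bits) - 1
    if p0_correct == p1_correct:
        return counter                # both right or both wrong: stay put
    if p1_correct:
        return min(counter + 1, top)  # predictor 0 wrong, 1 right: move toward 1
    return max(counter - 1, 0)        # predictor 1 wrong, 0 right: move toward 0


def choose(counter: int, bits: int = 2) -> int:
    """Pick predictor 1 when the counter is in its upper half, else predictor 0."""
    return 1 if counter >= (1 << (bits - 1)) else 0
```

Because the counter saturates, a single disagreement does not flip the choice once one predictor has built up a lead; it takes two consecutive wins for the other side to take over from a saturated state.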

Tournament predictor local vs global

Alpha 21264 Branch Predictor
• Tournament predictor (4K x 2) chooses between global and local
• Global has 4K 2-bit entries indexed by last 12 branch outcomes XORed with address
• Local is also a two-level predictor
  • 1K x 10 branch history buffer (last 10 outcomes for indexed branch) indexed by address
  • The selected 10-bit history is XORed with address to index a table of 3-bit entries
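As a quick check on the table sizes listed above, the storage can be totaled as follows. One figure is an assumption: the slide gives only the width of the local 3-bit entries, so the sketch takes that table to have 2^10 entries because it is indexed with a 10-bit value.

```python
K = 1024

selector_bits = 4 * K * 2            # tournament selector: 4K x 2
global_bits = 4 * K * 2              # global predictor: 4K 2-bit entries
local_history_bits = 1 * K * 10      # local history buffer: 1K x 10
local_counter_bits = (1 << 10) * 3   # assumed 2^10 entries of 3 bits (10-bit index)

total_bits = (selector_bits + global_bits
              + local_history_bits + local_counter_bits)

print(total_bits, "bits =", total_bits // K, "K bits")
```

Under that assumption the four structures add up to 8K + 8K + 10K + 3K = 29K bits.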

Alpha 21264 Predictor

Branch Target Buffers (BTB) or Caches (BTC)
Branch target calculation is costly and stalls instruction fetch. To reduce the branch penalty, we need to know the target address by the end of IF. But the instruction isn’t even decoded by then, so we would have to wait a cycle and only then perhaps discover a branch (penalty = 1 for MIPS). Instead, use the branch instruction’s address to predict the branch target. If the prediction works, the penalty goes to 0!

BTB - Idea
• BTB stores PCs the same way as caches
  • Only PCs of predicted taken branches are stored (no need to store untaken)
  • The match tag is the PC (associative memory OK if it’s small)
  • The data field is the predicted PC
• The PC of a (potential) branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the PC is not in the table, it is taken to mean
  • either not a branch
  • or not predicted taken
  • in either case, continue fetching from PC + k (k = 4 for MIPS)
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC
• BTB gets us the branch target address early
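The lookup described above can be sketched with a dictionary standing in for the associative memory. This is a minimal model under stated assumptions: the class name `BranchTargetBuffer` is hypothetical, the table has no capacity limit or replacement policy, and the update policy (insert on a taken branch, evict on a not-taken one) is one simple choice that keeps only predicted-taken branches in the table.

```python
from typing import Dict


class BranchTargetBuffer:
    """Toy BTB: tag = branch PC, data = predicted target PC."""

    def __init__(self, instr_bytes: int = 4):
        self.entries: Dict[int, int] = {}  # fully associative, unbounded (assumption)
        self.instr_bytes = instr_bytes     # k = 4 for MIPS

    def next_fetch_pc(self, pc: int) -> int:
        """Address to fetch after `pc`, decided during IF from the PC alone."""
        predicted = self.entries.get(pc)
        if predicted is None:
            # Miss: not a branch, or not predicted taken. Fall through.
            return pc + self.instr_bytes
        return predicted  # hit: predicted taken, fetch from the stored target

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Called once the branch resolves (assumed policy, see lead-in)."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)
```

For example, after `btb.update(0x100, True, 0x200)`, a later `btb.next_fetch_pc(0x100)` returns `0x200`, while any PC not in the table simply yields PC + 4.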

Branch Target Buffers

Changes in MIPS to incorporate BTB

Penalties Using BTB in MIPS
• Note: misprediction penalties on more complex machines are much higher
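A minimal sketch of how per-case penalties combine into an average branch penalty. Every number in the usage line is a hypothetical placeholder, not a MIPS value from this lecture, and the function name `expected_branch_penalty` and its parameters are made up for illustration. The model assumes that a BTB hit with a correct prediction and a BTB miss on a not-taken branch both cost nothing.

```python
def expected_branch_penalty(hit_rate: float, accuracy: float, taken_frac: float,
                            mispredict_penalty: float,
                            miss_taken_penalty: float) -> float:
    """Average penalty cycles per branch under the assumptions in the lead-in."""
    # Branch found in the BTB but the prediction turned out wrong.
    p_hit_wrong = hit_rate * (1.0 - accuracy)
    # Branch not in the BTB but actually taken.
    p_miss_taken = (1.0 - hit_rate) * taken_frac
    return p_hit_wrong * mispredict_penalty + p_miss_taken * miss_taken_penalty


# Hypothetical inputs: 90% BTB hit rate, 90% accuracy on hits,
# 60% of branches taken, 2-cycle penalty in both costly cases.
penalty = expected_branch_penalty(0.9, 0.9, 0.6, 2, 2)
print(round(penalty, 2))
```

The same function shows why the note above matters: raising the two penalty arguments scales the average cost linearly, so deeper pipelines pay proportionally more for the same hit rate and accuracy.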

Questions Concerning BTBs
• Can a BTB be combined with the branch prediction machinery introduced earlier in this lecture? How?
• What kind of branches can a BTB accelerate that are out of the reach of ordinary branch predictors?

BTB coupled with BHT

Improvements
• Store instructions rather than target address
  • increases entry size but removes instruction fetch time
  • permits BTB to run slower and therefore be larger
  • permits branch folding: branches effectively disappear
    • a branch’s job is to change the PC and get the real instruction
    • if you have the instruction then the branch isn’t there (folded out of the way)
    • result is 0-cycle jumps and effectively 0-cycle properly predicted branches
  • however, branches must be checked
    • in a parallel path the branch must be fetched and checked to see if the prediction is true
• Predicting indirect jumps
  • major source is procedure return
  • obvious model is to use a stack as the return predictor
  • note this can be combined with the above to get jump folding
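The stack-based return predictor mentioned above could look roughly like this. It is a sketch under assumptions: the class name `ReturnAddressStack`, the fixed depth of 8, and the drop-oldest overflow policy are all illustrative choices, and the return address is taken to be the call PC plus one instruction, ignoring branch delay slots.

```python
from typing import List, Optional


class ReturnAddressStack:
    """Toy return predictor: push on call, pop on return."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack: List[int] = []

    def on_call(self, call_pc: int, instr_bytes: int = 4) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: discard the oldest entry (assumed policy)
        # Assumed return address: the instruction after the call.
        self.stack.append(call_pc + instr_bytes)

    def predict_return(self) -> Optional[int]:
        """Predicted target of a return, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

A stack fits this case because a single return instruction jumps to a different caller each time, so one stored target per branch PC keeps going stale, whereas calls and returns nest in last-in, first-out order.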

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  • Either different branches (GA)
  • Or different executions of the same branch (PA)
• Tournament predictors take this insight to the next level by using multiple predictors
  • usually one based on global information and one based on local information, combined with a selector
  • In 2006, tournament predictors using about 30K bits are in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes branch address & prediction
• Next Class: Dynamic Scheduling
