CSE 420/598 Computer Architecture Lec 10 – Chapter 2 - DynPred-BTB. Sandeep K. S. Gupta School of Computing and Informatics Arizona State University. Based on Slides by David Patterson. Agenda. Dynamic Branch Prediction (Review) BTB. Applying the Prediction.
Agenda
• Dynamic Branch Prediction (Review)
• BTB
Applying the Prediction
• The earliest time we can begin using the prediction is when both
  • the prediction bits are available
  • the branch target is available
• The earliest time we can know whether we have predicted correctly is when
  • the branch condition is resolved
• The difference between these times is roughly what is saved by a correct prediction
• If the branch target is available late, the window of savings is reduced
Correlating Predictors
• The prediction is a function of the last k branch outcomes
• The branch history buffer is indexed by
  • m bits taken from the address of the branch
  • k bits of branch history
  • i.e., m + k bits all told
• Each entry in the branch history buffer has q bits (i.e., is a q-bit predictor)
• The branch history buffer has 2^(m+k) × q bits of storage
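The indexing scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the design of any particular processor: the class name, the choice of global history, dropping the two low-order PC bits (4-byte instructions), and the weakly-not-taken initial state are all assumptions made for the example.

```python
class CorrelatingPredictor:
    """m address bits plus k history bits index a table of 2**(m+k) q-bit counters."""

    def __init__(self, m: int, k: int, q: int = 2):
        self.m, self.k, self.q = m, k, q
        self.max_count = (1 << q) - 1
        self.threshold = 1 << (q - 1)      # counter >= threshold means "predict taken"
        self.history = 0                   # last k outcomes, newest in bit 0 (assumed global)
        # Assumption: every counter starts just below the taken threshold.
        self.table = [self.threshold - 1] * (1 << (m + k))

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so the two low PC bits carry no information.
        addr_bits = (pc >> 2) & ((1 << self.m) - 1)
        return (addr_bits << self.k) | self.history

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # Shift the new outcome into the k-bit history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)

    def storage_bits(self) -> int:
        return (1 << (self.m + self.k)) * self.q


p = CorrelatingPredictor(m=4, k=2, q=2)
print(p.storage_bits())  # 2^(4+2) x 2 = 128 bits
```

With k = 2, a branch that strictly alternates taken / not taken is predicted correctly after a short warm-up, because each history pattern selects its own counter.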
Correlating predictor with 2 history bits and 2 state bits (2,2)
Local versus Global
Hashing Correlation
For the same amount of table storage, we can get better associativity in the case of fewer branches but highly correlated behavior.
Tournament Predictor
• Move “toward” the other predictor when I am wrong and he is right
• Stay put when I am right and he is right, or I am wrong and he is wrong
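The selector rule can be written as a small saturating counter. This is a sketch under assumptions: a 2-bit counter per selector entry, values 0-1 meaning "use predictor 1" and 2-3 meaning "use predictor 2". The function names are hypothetical.

```python
def update_selector(counter: int, p1_correct: bool, p2_correct: bool) -> int:
    """Return the new 2-bit selector value after a branch resolves."""
    if p1_correct and not p2_correct:
        return max(counter - 1, 0)   # move toward predictor 1
    if p2_correct and not p1_correct:
        return min(counter + 1, 3)   # move toward predictor 2
    return counter                   # both right or both wrong: stay put


def choose(counter: int, pred1: bool, pred2: bool) -> bool:
    """Pick the prediction of whichever predictor the selector currently favours."""
    return pred2 if counter >= 2 else pred1
```

Because the counter saturates, a single disagreement does not flip a strongly held preference; it takes two in a row.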
Tournament predictor local vs global
Alpha 21264 Branch Predictor
• Tournament predictor (4K x 2) chooses between global and local
• Global has 4K 2-bit entries indexed by last 12 branch outcomes XORed with address
• Local is also a two-level predictor
  • 1K x 10 branch history buffer (last 10 outcomes for indexed branch) indexed by address
  • The selected 10-bit history is XORed with address to index a table of 3-bit entries
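As a rough check on the table sizes listed above, the storage can be totalled as follows. The one assumption is that the table of 3-bit entries has 1K entries, since it is indexed with 10 bits; the dictionary labels are just for illustration.

```python
K = 1024

tables = {
    "selector (4K x 2)": 4 * K * 2,
    "global predictor (4K x 2)": 4 * K * 2,
    "local history buffer (1K x 10)": 1 * K * 10,
    "local 3-bit entries (assumed 1K x 3)": 1 * K * 3,
}

total_bits = sum(tables.values())
print(total_bits, total_bits // K)  # 29696 bits, i.e. 29K
```

Under that assumption the predictor needs about 29K bits in total.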
Alpha 21264 Predictor
Branch Target Buffers (BTB) or Caches (BTC)
Branch target calculation is costly and stalls the instruction fetch. To reduce the branch penalty, we need to know the target address by the end of IF, but at that point the instruction has not even been decoded. Without help we have to wait a cycle to find out whether it is a branch (penalty = 1 for MIPS). The idea is to use the branch instruction address to predict the branch target: if the prediction works, the penalty goes to 0.
BTB - Idea
• BTB stores PCs the same way as caches
• Only PCs of predicted-taken branches are stored (no need to store untaken)
• The match tag is the PC (associative memory OK if it’s small)
• The data field is the predicted PC
• The PC of a (potential) branch is sent to the BTB
• When a match is found, the corresponding predicted PC is returned
• If the PC is not in the table, it is taken to mean
  • either not a branch
  • or not predicted taken
  • in either case, continue fetching from PC + k (k = 4 for MIPS)
• If the branch was predicted taken, instruction fetch continues at the returned predicted PC
• BTB gets us the branch target address early
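The lookup described above can be sketched as follows. This is a minimal model, not the MIPS hardware: the direct-mapped organisation, the table size, and the policy of dropping an entry when its branch resolves not taken are assumptions chosen to keep the example short.

```python
class BranchTargetBuffer:
    """Tag = branch PC, data = predicted target PC; only taken branches are kept."""

    def __init__(self, entries: int = 16, instr_bytes: int = 4):
        self.entries = entries
        self.instr_bytes = instr_bytes      # k = 4 for MIPS
        self.table = {}                     # slot -> (tag_pc, predicted_pc)

    def _slot(self, pc: int) -> int:
        # Assumption: direct-mapped, indexed by the low bits of the word address.
        return (pc // self.instr_bytes) % self.entries

    def next_fetch_pc(self, pc: int) -> int:
        """Called during IF: return the PC to fetch next."""
        entry = self.table.get(self._slot(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]                 # hit: predicted taken, fetch from target
        return pc + self.instr_bytes        # miss: not a branch, or not predicted taken

    def resolve(self, pc: int, taken: bool, target: int) -> None:
        """Called once the branch outcome is known."""
        slot = self._slot(pc)
        if taken:
            self.table[slot] = (pc, target)
        elif self.table.get(slot, (None,))[0] == pc:
            del self.table[slot]            # assumed policy: untaken branches are not stored


btb = BranchTargetBuffer()
btb.resolve(0x100, taken=True, target=0x200)
print(hex(btb.next_fetch_pc(0x100)))  # 0x200
```

The tag comparison matters: a different PC that maps to the same slot must fall through to PC + 4 rather than reuse someone else's target.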
Changes in MIPS to incorporate BTB
Penalties Using BTB in MIPS
• Note: penalties for misprediction on more complex machines are much higher
Questions Concerning BTBs
• Can the BTB be combined with the branch prediction machinery introduced earlier in this lecture? How?
• What kind of branches can a BTB accelerate that are out of the reach of ordinary branch predictors?
BTB coupled with BHT
Improvements
• Store instructions rather than target address
  • increases entry size but removes Ifetch time
  • permits BTB to run slower and therefore be larger
  • permits branch folding - branches effectively disappear
    • the branch's job is to change the PC and get the real instruction
    • if you have the instruction then the branch isn’t there (folded out of the way)
    • result is 0-cycle jumps and effectively 0-cycle properly predicted branches
  • however - branches must be checked
    • in a parallel path the branch must be fetched and checked to see if the prediction is true
• Predicting indirect jumps
  • major source is procedure return
  • obvious model is to use a stack as the return predictor
  • note this can be combined with the above to get jump folding
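The stack-based return predictor mentioned above might look like the sketch below. The class name, the fixed depth, the drop-oldest overflow policy, and the return address of call PC + 4 (ignoring any branch delay slot) are assumptions for illustration.

```python
class ReturnAddressStack:
    """Small hardware-style stack that predicts procedure return targets."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc: int, instr_bytes: int = 4) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)                        # overflow: discard the oldest entry
        self.stack.append(call_pc + instr_bytes)     # assumed return address

    def predict_return(self):
        """Return the predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x100)      # outer call
ras.on_call(0x200)      # nested call
print(hex(ras.predict_return()))  # 0x204, the innermost return comes first
```

A stack works here because calls and returns nest: the most recent call is the first to return, which a PC-indexed BTB entry cannot capture when one procedure is called from many sites.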
Dynamic Branch Prediction Summary
• Prediction is becoming an important part of execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  • Either different branches (GA)
  • Or different executions of the same branch (PA)
• Tournament predictors take this insight to the next level by using multiple predictors
  • usually one based on global information and one based on local information, combined with a selector
  • In 2006, tournament predictors using 30K bits are in processors like the Power5 and Pentium 4
• Branch Target Buffer: includes branch address & prediction
• Next Class: Dynamic Scheduling