Learn how to speed up branch prediction and reduce branch penalty in a 5-stage pipeline by predicting branch outcomes and handling penalties effectively.
CSC 4250 Computer Architectures • October 31, 2006 • Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation
Simple 5-Stage Pipeline • Branch prediction may not help 5-stage pipeline: IF|ID|EX|ME|WB • We decode branch instruction, test branch condition, and compute branch address during ID • No gain in predicting branch outcome in ID • How to speed up branch prediction?
How to Reduce Branch Penalty • 5-stage pipeline: IF|ID|EX|ME|WB • “Predict” fetched instruction as a branch instr. ─ Decide during IF that the instr. just fetched is a branch • “Predict” target instruction and fetch it next ─ No need to compute address for next instr. • Branch penalty becomes zero cycles if prediction is correct
Figure 3.20. Steps to handle an instruction with a branch-target buffer
Figure 3.21. Penalties, assuming that we store only taken branches in the buffer: • If the branch is not correctly predicted, the penalty is one clock cycle to update the buffer with the correct information (during which an instruction cannot be fetched) plus one clock cycle to restart fetching the next correct instruction for the branch • If the branch is not found in the buffer and is taken, a two-cycle penalty is encountered, during which time the buffer is updated
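The lookup steps of Figure 3.20 and the penalties of Figure 3.21 can be sketched as a toy model. This is a minimal sketch, not the textbook's implementation: the class name `BranchTargetBuffer`, the methods `fetch` and `resolve`, the 4-byte instruction size, and the treatment of a hit with a stale target as a two-cycle misprediction are all assumptions made for illustration.

```python
class BranchTargetBuffer:
    """Toy BTB that stores only taken branches (branch PC -> predicted target).

    Hypothetical model: names and the 4-byte instruction size are assumptions.
    """

    def __init__(self):
        self.entries = {}

    def fetch(self, pc):
        """IF stage: predicted next PC (stored target on a hit, else pc + 4)."""
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Branch outcome known: update the buffer, return penalty in cycles."""
        hit = pc in self.entries
        if hit and taken and self.entries[pc] == target:
            return 0  # in buffer, predicted taken, actually taken
        if hit:
            # In buffer but mispredicted: fix or delete the entry.
            if taken:
                self.entries[pc] = target  # assumed: stale target costs 2 cycles
            else:
                del self.entries[pc]
            return 2
        if taken:
            self.entries[pc] = target  # not in buffer, taken: enter it
            return 2
        return 0  # not in buffer, not taken


btb = BranchTargetBuffer()
print(btb.resolve(0x100, True, 0x200))  # first encounter of a taken branch
print(hex(btb.fetch(0x100)))            # later fetches predict the target
```

The four outcomes returned by `resolve` correspond to the four rows of the penalty table: only a buffer hit that turns out not taken and a buffer miss that turns out taken cost cycles.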
Example (p. 211) • Determine the total branch penalty for a branch-target buffer assuming the penalty cycles from Figure 3.21 • The following assumptions are made: • Prediction accuracy is 90% (for instructions in the buffer) • Hit rate in the buffer is 90% (for branches predicted taken) • Assume that 60% of the branches are taken
Answer (p. 211) Compute the penalty by looking at two events: the branch is predicted taken but ends up being not taken, and the branch is taken but is not found in the buffer. Both carry a penalty of two cycles. • Probability (branch in buffer, but actually not taken) = Percent buffer hit rate × Percent incorrect predictions = 90% × 10% = 0.09 • Probability (branch not in buffer, but actually taken) = 10% • Branch penalty = (0.09 + 0.10) × 2 = 0.38 clock cycles per branch
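The arithmetic in the answer can be checked with a few lines of code. The function name `branch_penalty` and its parameter names are hypothetical; the numbers are the ones given in the example, and the 10% "not in buffer, but actually taken" probability is taken directly from the answer rather than derived.

```python
def branch_penalty(hit_rate, accuracy, p_miss_and_taken, penalty_cycles=2):
    """Expected penalty per branch, in clock cycles.

    hit_rate         -- fraction of branches found in the buffer
    accuracy         -- prediction accuracy for branches in the buffer
    p_miss_and_taken -- probability a branch is not in the buffer but is taken
    """
    p_hit_but_wrong = hit_rate * (1 - accuracy)  # 0.90 * 0.10 = 0.09
    return (p_hit_but_wrong + p_miss_and_taken) * penalty_cycles


print(round(branch_penalty(0.90, 0.90, 0.10), 2))  # 0.38
```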
Comparison Branch-Target Buffer (BTB) versus Branch-Prediction Buffer (BPB): • Shape, size, and contents • Which stage in pipeline? • How to find an entry? • Placement of an entry • Replacement of an entry • With BTB, why need BPB? • Does BPB save any clock cycles? • If predicted NT, should branch instr. be kept in BTB?
Variation of Branch-Target Buffer (p. 211) • Store one or more target instructions instead of, or in addition to, the predicted target address • Two potential advantages: • Allow the branch-target buffer access to take longer than the time between successive instruction fetches, possibly allowing a larger branch-target buffer • Allow us to perform an optimization called branch folding
Branch Folding (p. 213) • Use branch folding to obtain zero-cycle unconditional branches • Consider a branch-target buffer that buffers instructions from the predicted path and is being accessed with the address of an unconditional branch. The only function of the unconditional branch is to change the PC. Thus, when the branch-target buffer signals a hit and indicates that the branch is unconditional, the pipeline can simply substitute the instruction from the branch-target buffer in place of the instruction that is returned from the cache (which is the unconditional branch).
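Branch folding can be illustrated with a small fetch model. This is only a sketch under stated assumptions: the names `FoldingEntry` and `fetch`, the dictionary-based instruction cache and buffer, the string instructions, and the 4-byte instruction size are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FoldingEntry:
    """Hypothetical BTB entry that also buffers the target instruction."""
    target_pc: int
    target_instr: str
    unconditional: bool


def fetch(pc, icache, btb):
    """Return (instruction to issue, next PC to fetch)."""
    entry = btb.get(pc)
    if entry is not None and entry.unconditional:
        # Fold: issue the buffered target instruction in place of the
        # unconditional branch that the cache would have returned.
        return entry.target_instr, entry.target_pc + 4
    if entry is not None:
        # Conditional branch predicted taken: issue it, fetch the target next.
        return icache[pc], entry.target_pc
    return icache[pc], pc + 4  # no hit: sequential fetch


icache = {0x00: "add", 0x04: "j L", 0x08: "sub", 0x40: "lw"}
btb = {0x04: FoldingEntry(target_pc=0x40, target_instr="lw", unconditional=True)}
print(fetch(0x04, icache, btb))  # the jump itself never enters the pipeline
```

In this model the unconditional jump at 0x04 occupies no pipeline slot, which is what makes it a zero-cycle branch.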
Integrated Instruction Fetch Unit An instruction fetch unit that integrates several functions: • Integrated branch prediction ─ the branch predictor becomes a part of the integrated unit and is constantly predicting branches, so as to drive the fetch pipeline • Instruction prefetch ─ the unit autonomously manages prefetching, integrating it with branch prediction • Instruction memory access and buffering ─ the unit uses prefetching to hide the cost of crossing cache blocks; it also provides buffering, to provide instructions to the issue stage as needed and in the quantity needed
Return Address Predictor • Want to predict indirect jumps, i.e., jumps whose destination address varies at run time • Vast majority of indirect jumps come from procedure returns; 85% for SPEC89 • May predict procedure returns with a branch-target buffer. But accuracy will be low if procedure is called from multiple sites and the calls from one site are not clustered in time • What can we do?
Figure 3.22. Prediction accuracy for a return address buffer operated as a stack The accuracy is the fraction of return addresses predicted correctly. Since call depths are typically not large, a modest buffer works well.
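A return address buffer operated as a stack can be sketched as follows. The class name `ReturnAddressStack`, the default size of 8 entries, and the policy of discarding the oldest entry on overflow are assumptions for illustration, not details taken from the figure.

```python
class ReturnAddressStack:
    """Hypothetical fixed-size return address predictor operated as a stack."""

    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def on_call(self, return_addr):
        """Push the return address when a call is fetched."""
        if len(self.stack) == self.size:
            self.stack.pop(0)  # assumed overflow policy: drop the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        """Pop the predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(size=2)
ras.on_call(0x104)  # main calls f
ras.on_call(0x208)  # f calls g
print(hex(ras.predict_return()))  # return from g
print(hex(ras.predict_return()))  # return from f
```

Because each return pops the address pushed by its matching call, a procedure called from many sites is still predicted correctly, as long as the call depth does not exceed the buffer size.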