EECS 470 Pipeline Control Hazards Lecture 5 Coverage: Chapter 3 & Appendix A
Pipeline function for BEQ • Fetch: read instruction from memory • Decode: read source operands from reg • Execute: calculate target address and test for equality • Memory: Send target to PC if test is equal • Writeback: Nothing left to do
Control Hazards
    beq 1 1 10
    sub 3 4 5
time
    beq   fetch   decode   execute   memory    writeback
    sub           fetch    decode    execute
Approaches to handling control hazards • Avoidance • Make sure there are no hazards in the code • Detect and Stall • Delay fetch until the branch is resolved • Speculate and Squash if wrong • Go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn’t have been executed
Handling branch hazards: avoid all hazards • Don’t have branch instructions! • Maybe a little impractical • Predication can eliminate some branches • If-conversion • Hyperblocks
if-conversion
    if (a == b) { x++; y = n / d; }
Branch version:
    sub t1 a, b
    jnz t1, PC+2
    add x x, #1
    div y n, d
Conditional-move version:
    sub t1 a, b
    add t2 x, #1
    div t3 n, d
    cmov(t1) x t2
    cmov(t1) y t3
Predicated version:
    sub t1 a, b
    add(t1) x x, #1
    div(t1) y n, d
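As a rough illustration of why the branch and conditional-move versions are interchangeable, here is a minimal Python sketch. The function names are hypothetical, `//` stands in for `div`, and the sketch assumes `cmov(t1)` moves when `t1` is zero (that is, when `a == b`); the slide does not spell out the polarity. Note that the converted form computes the quotient even when the condition is false, so it assumes `d != 0`.

```python
def branchy(a, b, x, y, n, d):
    # Branch version: the body is skipped entirely when a != b.
    if a == b:
        x += 1
        y = n // d
    return x, y


def if_converted(a, b, x, y, n, d):
    # Conditional-move version: compute both results, then select.
    t1 = a - b                   # sub t1 a, b   (0 means a == b)
    t2 = x + 1                   # add t2 x, #1
    t3 = n // d                  # div t3 n, d   (always executed; assumes d != 0)
    x = t2 if t1 == 0 else x     # cmov(t1) x t2 (assumed: move when t1 == 0)
    y = t3 if t1 == 0 else y     # cmov(t1) y t3
    return x, y
```

Both functions return the same pair for any inputs with `d != 0`; the second contains no control flow that depends on `a == b`, which is the point of the transformation.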
Removing hazards by redefining a branch instruction • Redefine branch instructions: ptbeq regA regB offset (prepare to branch if equal) • If (R[regA] == R[regB]), execute the instructions at PC+1, PC+2, PC+3, then continue at PC+1+offset
ptbnz example
    g = c + 2
    bnz g, PC + 4
    t = 5
    n = 7
    noop
    m = 5
    a = 3

    t = 5
    n = 7
    g = c + 2
    bnz g, PC + 1
    m = 5
    a = 3
Problems with this solution • Old programs (legacy code) may not run correctly on new implementations • Longer pipelines tend to need more noops • Programs get larger as noops are included • Especially a problem for machines that try to execute more than one instruction every cycle • Harder to find useful instructions • Program execution is slower • CPI is one, but some I’s are noops
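The last bullet can be made concrete with a small calculation. This is a sketch only: `useful_cpi` is a hypothetical helper, and it assumes the pipeline issues one instruction (useful or noop) per cycle.

```python
def useful_cpi(issued, noops, pipeline_cpi=1.0):
    """Cycles per useful instruction when some issued slots are noops.

    issued: total instructions issued, noops included
    noops:  how many of those were noops
    """
    useful = issued - noops
    if useful <= 0:
        raise ValueError("need at least one useful instruction")
    return pipeline_cpi * issued / useful
```

For a seven-instruction sequence containing one noop, `useful_cpi(7, 1)` is 7/6: the hardware still reports a CPI of one, but each useful instruction costs about 1.17 cycles.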
Handling control hazards: detect and stall • Detection: • Must wait until decode • Compare opcode to beq or jalr • Alternatively, this is just another control signal • Stall: • Keep current instructions in fetch • Pass noop to decode stage (not execute!)
[Figure: five-stage pipelined datapath (PC, instruction memory, register file, sign extend, ALU, data memory, control) with IF/ID, ID/EX, EX/Mem and Mem/WB pipeline registers; labeled with bnz r1.]
[Figure: the same pipelined datapath with IF/ID, ID/EX, EX/Mem and Mem/WB pipeline registers; labeled with a noop.]
Control Hazards
    beq 1 1 10
    sub 3 4 5
time
    beq   fetch   decode   execute   memory   writeback
    sub           fetch    fetch     fetch    fetch, or fetch target
Problems with detect and stall • CPI increases every time a branch is detected! • Is that necessary? Not always! • Only about ½ of the time is the branch taken • Let’s assume that it is NOT taken… • In this case, we can ignore the beq (treat it like a noop) • Keep fetching PC + 1 • What if we are wrong? • OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)
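The cost of stalling on every branch can be estimated with a one-line model. This is a sketch under stated assumptions: `stall_cpi` is a hypothetical helper, the three stall cycles correspond to holding the next instruction in fetch while the branch passes through decode, execute and memory, and the branch fraction is an input you would have to measure for a real workload.

```python
def stall_cpi(branch_fraction, stall_cycles=3, base_cpi=1.0):
    """CPI when fetch stalls on every branch until it resolves.

    branch_fraction: fraction of instructions that are branches (assumed input)
    stall_cycles:    cycles the next instruction waits in fetch (3 for this pipeline)
    """
    return base_cpi + branch_fraction * stall_cycles
```

With a hypothetical 20% branch fraction, `stall_cpi(0.2)` comes to about 1.6, and the penalty is paid whether or not the branch is taken.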
Handling control hazards: speculate and squash • Speculate: assume not equal • Keep fetching from PC+1 until we know that the branch is really taken • Squash: stop bad instructions if taken • Send a noop to: • Decode, Execute and Memory • Send target address to PC
[Figure: pipelined datapath during a squash, with IF/ID, ID/EX, EX/Mem and Mem/WB pipeline registers; labels: noop (three times), equal, and the instructions beq, sub, add, nand.]
Problems with fetching PC+1 • CPI increases every time a branch is taken! • About ½ of the time • Is that necessary? No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?
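A back-of-the-envelope model shows what fetching PC+1 costs. It is only a sketch: `not_taken_cpi` is a hypothetical helper, the penalty of three cycles corresponds to the three noops sent to decode, execute and memory on a squash, the default taken fraction of 0.5 is the "about ½" from the slide, and the branch fraction is an assumed input.

```python
def not_taken_cpi(branch_fraction, taken_fraction=0.5,
                  squash_penalty=3, base_cpi=1.0):
    """CPI when fetch always continues at PC+1 and squashes on a taken branch.

    Only taken branches pay the penalty; not-taken branches are free.
    """
    return base_cpi + branch_fraction * taken_fraction * squash_penalty
```

With a hypothetical 20% branch fraction, `not_taken_cpi(0.2)` is about 1.3. Setting `taken_fraction=1.0` gives about 1.6, the same cost as charging the full penalty on every branch.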
[Figure: pipelined datapath with IF/ID, ID/EX, EX/Mem and Mem/WB pipeline registers; labels: bpc, target, eq?, beq.]
Branch Target Buffer • Send the fetch PC to the BTB • Found? Yes: use target; No: use PC+1 • The result is the predicted target PC
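The lookup above can be sketched in a few lines of Python. The class and method names are hypothetical, and the unbounded dictionary is a simplification: a hardware BTB is a finite table indexed by PC bits.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch's PC to the target it last jumped to."""

    def __init__(self):
        self.targets = {}  # branch PC -> recorded target (unbounded here)

    def record_taken(self, branch_pc, target):
        # Remember where this branch went so later fetches can follow it.
        self.targets[branch_pc] = target

    def predict_next_pc(self, fetch_pc):
        # Found? Yes: use the stored target. No: fall through to PC+1.
        if fetch_pc in self.targets:
            return self.targets[fetch_pc]
        return fetch_pc + 1
```

The key property is that `predict_next_pc` needs only the fetch PC, so the prediction is available before the instruction has been decoded.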
Branch prediction • Predict not taken: ~50% accurate • No BTB needed; always use PC+1 • Predict backward taken: ~65% accurate • BTB holds targets for backward branches (loops) • Predict same as last time: ~80% accurate • Update BTB for any taken branch
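To see how the three schemes differ, they can be scored against a branch trace. This is an illustrative sketch only: the function names and the trace format `(pc, target, taken)` are assumptions, it models direction prediction without a real BTB, and a synthetic trace will not reproduce the accuracy figures quoted on the slide.

```python
def predict_not_taken(pc, target, last_outcome):
    return False


def predict_backward_taken(pc, target, last_outcome):
    # Backward branches (target below the branch) are usually loop branches.
    return target < pc


def predict_same_as_last(pc, target, last_outcome):
    # Unseen branches default to not taken.
    return last_outcome.get(pc, False)


def accuracy(trace, predict):
    """Fraction of branches in trace whose direction predict() gets right."""
    last_outcome = {}  # branch PC -> most recent direction
    correct = 0
    for pc, target, taken in trace:
        if predict(pc, target, last_outcome) == taken:
            correct += 1
        last_outcome[pc] = taken
    return correct / len(trace)


# Hypothetical loop branch at PC 10 jumping back to PC 4:
# taken nine times, then falls through once.
loop_trace = [(10, 4, True)] * 9 + [(10, 4, False)]
```

On `loop_trace`, predict-not-taken is right once in ten, backward-taken nine times in ten, and same-as-last eight times in ten (it misses the first iteration and the loop exit).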
What about indirect branches? • Could use same approach • PC+1 unlikely indirect target • Indirect jumps often have multiple targets (for same instruction) • Switch statements • Virtual function calls • Shared library (DLL) calls
Indirect jump: Special Case • Return address stack • Function returns have deterministic behavior (usually) • Return to different locations (BTB doesn’t work well) • Return location known ahead of time • In some register at the time of the call • Build a specialized structure for return addresses • Call instructions write return address to R31 AND RAS • Return instructions pop predicted target off stack • Issues: finite size (save or forget on overflow?); • Issues: long jumps (clear when wrong?)
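A return address stack can be sketched as a small bounded stack. The names here are hypothetical, the capacity of 8 is arbitrary, and "forget the oldest entry on overflow" is just one of the policies the slide leaves open; PCs are word addressed, so the return address is the call's PC+1.

```python
from collections import deque


class ReturnAddressStack:
    """Toy RAS: calls push their return address, returns pop a prediction."""

    def __init__(self, capacity=8):
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which models the "forget on overflow" choice.
        self.entries = deque(maxlen=capacity)

    def on_call(self, call_pc):
        self.entries.append(call_pc + 1)

    def predict_return(self):
        # None means "no prediction available" (empty stack).
        return self.entries.pop() if self.entries else None
```

With a capacity of 2, three nested calls followed by three returns predict the two innermost returns correctly and have nothing to offer for the outermost one, which illustrates the finite-size issue.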
Branch prediction • Pentium: ~85% accurate • Pentium Pro: ~92% accurate • Best paper designs: ~96% accurate
Costs of branch prediction/speculation • Performance costs? • Minimal: no difference between waiting and squashing; and it is a huge gain when prediction is correct! • Power? • Large: in very long/wide pipelines many instructions can be squashed • Squashed = # mispredictions × pipeline length/width before target resolved • Area? • Can be large: predictors can get very big as we will see next time • Complexity? • Designs are more complex • Testing becomes more difficult
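The squash count under "Power?" can be turned into a rough estimate. This sketch reads "pipeline length/width before target resolved" as stages before resolution multiplied by issue width, which is an interpretation rather than something the slide states; the helper name and parameters are hypothetical.

```python
def squashed_instructions(mispredictions, stages_before_resolve, width=1):
    """Upper-bound estimate of wrong-path instructions thrown away.

    stages_before_resolve: pipeline stages fetched past the branch before
                           its outcome is known
    width:                 instructions fetched per cycle
    """
    return mispredictions * stages_before_resolve * width
```

For the five-stage pipeline in these slides (three stages, width one), 1000 mispredictions waste about 3000 instruction slots; a hypothetical ten-stage, four-wide front end would waste about 40000 for the same count.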
What else can be speculated? • Dependencies • I think this data is coming from that store instruction • Values • I think I will load a 0 value • Accuracy? • Branch prediction (direction) is Boolean (T, NT) • Branch targets are stable or predictable (RAS) • Dependencies are limited • Values cover a huge space (0 – 4B)