EECS 470

EECS 470 Pipeline Control Hazards Lecture 5 Coverage: Chapter 3 & Appendix A

Pipeline function for BEQ • Fetch: read instruction from memory • Decode: read source operands from reg • Execute: calculate target address and test for equality • Memory: Send target to PC if test is equal • Writeback: Nothing left to do

Control Hazards
time →
beq 1 1 10:  fetch  decode  execute  memory  writeback
sub 3 4 5:          fetch   decode   execute

Approaches to handling control hazards • Avoidance • Make sure there are no hazards in the code • Detect and Stall • Delay fetch until branch resolved • Speculate and Squash if wrong • Go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn’t have been executed

Handling branch hazards: avoid all hazards • Don’t have branch instructions! • Maybe a little impractical • Predication can eliminate some branches • If-conversion • Hyperblocks

if-conversion
Source:
  if (a == b) { x++; y = n / d; }
With a branch:
  sub t1 ← a, b
  jnz t1, PC+2
  add x ← x, #1
  div y ← n, d
With conditional moves:
  sub t1 ← a, b
  add t2 ← x, #1
  div t3 ← n, d
  cmov(t1) x ← t2
  cmov(t1) y ← t3
With predicated instructions:
  sub t1 ← a, b
  add(t1) x ← x, #1
  div(t1) y ← n, d
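The conditional-move form above amounts to "compute both results, then select". The following is a minimal Python sketch of that idea, not the slide's own code. It assumes integer division, assumes d != 0 (the speculated divide always runs), and assumes cmov(t1) commits when t1 is zero, matching the jnz in the branch version. The function names are illustrative.

```python
def branchy(a, b, x, y, n, d):
    """Original control flow: the body runs only when a == b."""
    if a == b:
        x = x + 1
        y = n // d
    return x, y


def if_converted(a, b, x, y, n, d):
    """cmov-style form: compute unconditionally, commit conditionally."""
    t1 = a - b                      # sub t1 <- a, b  (zero means a == b)
    t2 = x + 1                      # add t2 <- x, #1 (always executed)
    t3 = n // d                     # div t3 <- n, d  (always executed; assumes d != 0)
    x = t2 if t1 == 0 else x        # cmov(t1) x <- t2
    y = t3 if t1 == 0 else y        # cmov(t1) y <- t3
    return x, y
```

Both functions return the same (x, y) for any inputs with d != 0; the if-converted one simply has no branch to mispredict.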

Removing hazards by redefining a branch instruction • Redefine branch instructions: ptbeq regA regB offset (prepare to branch if equal) • If (R[regA] == R[regB]), execute instructions at PC+1, PC+2, PC+3, then PC+1+offset
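The ptbeq definition can be read as a rule for which PCs are fetched next. A small sketch under two assumptions that the slide does not state: PCs are word-addressed, and a not-taken ptbeq simply falls through to PC+4. The function name is hypothetical.

```python
def ptbeq_next_pcs(pc, offset, val_a, val_b):
    """Return the next four PCs fetched after a ptbeq located at `pc`."""
    delay_slots = [pc + 1, pc + 2, pc + 3]   # always executed, taken or not
    if val_a == val_b:
        after_slots = pc + 1 + offset        # branch takes effect after the slots
    else:
        after_slots = pc + 4                 # assumed fall-through
    return delay_slots + [after_slots]
```

For a ptbeq at PC 10 with offset 20 and equal operands, the sequence is 11, 12, 13, 31.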

ptbnz example
  g = c + 2; bnz g, PC + 4; t = 5; n = 7; noop; m = 5; a = 3
  t = 5; n = 7; g = c + 2; bnz g, PC + 1; m = 5; a = 3

Problems with this solution • Old programs (legacy code) may not run correctly on new implementations • Longer pipelines tend to need more noops • Programs get larger as noops are included • Especially a problem for machines that try to execute more than one instruction every cycle • Harder to find useful instructions • Program execution is slower • CPI is one, but some I’s are noops
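The last point can be made concrete: if the pipeline retires one instruction per cycle but some of those are noops, the cost per useful instruction is higher than one cycle. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def cycles_per_useful_instruction(total_instructions, noops, cycles):
    """Cycles divided by the instructions that do real work."""
    useful = total_instructions - noops
    return cycles / useful


# Hypothetical run: 100 instructions in 100 cycles (CPI = 1), 20 of them noops.
effective_cpi = cycles_per_useful_instruction(100, 20, 100)
```

With these made-up numbers the nominal CPI is 1, but each useful instruction costs 1.25 cycles.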

Handling control hazards: detect and stall • Detection: • Must wait until decode • Compare opcode to beq or jalr • Alternately, this is just another control signal • Stall: • Keep current instructions in fetch • Pass noop to decode stage (not execute!)

[Figure: five-stage pipeline datapath (PC, instruction memory, register file, sign extend, ALU, data memory, control; pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB) with a bnz r1 instruction in flight]

[Figure: the same pipeline datapath with a noop being passed down the pipeline in place of the stalled instruction]

Control Hazards
time →
beq 1 1 10:  fetch  decode  execute  memory  writeback
sub 3 4 5:          fetch   fetch    fetch   fetch or fetch target

Problems with detect and stall • CPI increases every time a branch is detected! • Is that necessary? Not always! • Only about ½ of the time is the branch taken • Let’s assume that it is NOT taken… • In this case, we can ignore the beq (treat it like a noop) • Keep fetching PC + 1 • What if we are wrong? • OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)

Handling control hazards: speculate and squash • Speculate: assume not equal • Keep fetching from PC+1 until we know that the branch is really taken • Squash: stop bad instructions if taken • Send a noop to: • Decode, Execute and Memory • Send target address to PC

[Figure: pipeline datapath during a squash; beq, sub, add and nand in flight, the "equal" signal asserted, and noops inserted into the later pipeline registers (IF/ID, ID/EX, EX/Mem, Mem/WB)]

Problems with fetching PC+1 • CPI increases every time a branch is taken! • About ½ of the time • Is that necessary? No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?
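The two CPI penalties discussed so far can be compared with a back-of-the-envelope model. This is a sketch, not a measurement: it assumes a base CPI of 1, a three-cycle penalty (the branch resolves in the Memory stage), and a hypothetical 20% branch frequency. Only the one-half taken rate comes from the slides.

```python
def cpi_detect_and_stall(branch_fraction, penalty_cycles):
    """Every branch pays the penalty, taken or not."""
    return 1.0 + branch_fraction * penalty_cycles


def cpi_predict_not_taken(branch_fraction, taken_fraction, penalty_cycles):
    """Only taken branches pay the penalty (their successors are squashed)."""
    return 1.0 + branch_fraction * taken_fraction * penalty_cycles


BRANCH_FRACTION = 0.20   # hypothetical: one instruction in five is a branch
TAKEN_FRACTION = 0.50    # from the slides: about half of branches are taken
PENALTY = 3              # assumed: three lost fetch slots per resolved branch

stall_cpi = cpi_detect_and_stall(BRANCH_FRACTION, PENALTY)
speculate_cpi = cpi_predict_not_taken(BRANCH_FRACTION, TAKEN_FRACTION, PENALTY)
```

Under these assumptions stalling gives a CPI of about 1.6 and predict-not-taken about 1.3; the remaining 0.3 is what fetching from the correct target early would recover.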

[Figure: pipeline datapath with branch-target lookup added; labels: bpc, target, eq?, beq; pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB]

Branch Target Buffer • Fetch PC: send the PC to the BTB • Found? • Yes: use target • No: use PC+1 • Result: predicted target PC
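The lookup flow can be sketched in a few lines. This is a toy model only: a real BTB is a fixed-size tagged hardware table, whereas the sketch assumes an unbounded dictionary and word-addressed PCs. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: maps a branch's PC to its predicted target PC."""

    def __init__(self):
        self.targets = {}                       # branch PC -> predicted target PC

    def record(self, branch_pc, target_pc):
        """Remember the target of a branch once it has resolved."""
        self.targets[branch_pc] = target_pc

    def next_fetch_pc(self, pc):
        """Found in the BTB: use the target. Not found: use PC+1."""
        if pc in self.targets:
            return self.targets[pc]
        return pc + 1
```

The key property is that next_fetch_pc needs only the fetch PC, so it can run before the instruction has been decoded.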

Branch prediction • Predict not taken: ~50% accurate • No BTB needed; always use PC+1 • Predict backward taken: ~65% accurate • BTB holds targets for backward branches (loops) • Predict same as last time: ~80% accurate • Update BTB for any taken branch
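The three policies can be compared on a toy trace. The trace below is hypothetical (a single loop-closing backward branch that is taken nine times and then falls through), so the resulting accuracies illustrate the mechanism and are not the percentages quoted on the slide. All names are illustrative.

```python
def prediction_accuracy(trace, policy):
    """trace: list of (branch_pc, target_pc, taken). Returns fraction predicted correctly."""
    last_outcome = {}                           # branch PC -> most recent direction
    correct = 0
    for pc, target, taken in trace:
        if policy == "not_taken":
            guess = False
        elif policy == "backward_taken":
            guess = target < pc                 # backward branches are usually loops
        elif policy == "same_as_last":
            guess = last_outcome.get(pc, False) # first encounter: assume not taken
        else:
            raise ValueError(f"unknown policy: {policy}")
        if guess == taken:
            correct += 1
        last_outcome[pc] = taken
    return correct / len(trace)


# Hypothetical loop branch at PC 8 jumping back to PC 4.
loop_trace = [(8, 4, True)] * 9 + [(8, 4, False)]
```

On this trace, "same as last time" misses twice: once on the first iteration, before it has any history, and once on loop exit.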

What about indirect branches? • Could use same approach • PC+1 unlikely indirect target • Indirect jumps often have multiple targets (for same instruction) • Switch statements • Virtual function calls • Shared library (DLL) calls

Indirect jump: Special Case • Return address stack • Function returns have deterministic behavior (usually) • Return to different locations (BTB doesn’t work well) • Return location known ahead of time • In some register at the time of the call • Build a specialized structure for return addresses • Call instructions write return address to R31 AND RAS • Return instructions pop predicted target off stack • Issues: finite size (save or forget on overflow?) • Issues: long jumps (clear when wrong?)
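A return address stack can be sketched as a small bounded stack. The sketch picks one answer to the overflow question, forgetting the oldest entry, purely as an assumption; the slide leaves that choice open. The class name, method names, and default capacity are hypothetical.

```python
class ReturnAddressStack:
    """Toy RAS: calls push a return address, returns pop a prediction."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = []

    def on_call(self, return_address):
        if len(self.entries) == self.capacity:
            self.entries.pop(0)                 # overflow: forget the oldest entry
        self.entries.append(return_address)

    def predict_return(self):
        """Return the predicted target, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None
```

With capacity 2 and three nested calls, the two innermost returns are predicted correctly and the outermost one has no prediction, which is the finite-size issue noted above.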

Branch prediction • Pentium: ~85% accurate • Pentium Pro: ~92% accurate • Best paper designs: ~96% accurate

Costs of branch prediction/speculation • Performance costs? • Minimal: no difference between waiting and squashing; and it is a huge gain when prediction is correct! • Power? • Large: in very long/wide pipelines many instructions can be squashed • Squashed = # mispredictions × pipeline length/width before target resolved • Area? • Can be large: predictors can get very big as we will see next time • Complexity? • Designs are more complex • Testing becomes more difficult

What else can be speculated? • Dependencies • I think this data is coming from that store instruction • Values • I think I will load a 0 value • Accuracy? • Branch prediction (direction) is Boolean (T, NT) • Branch targets are stable or predictable (RAS) • Dependencies are limited • Values cover a huge space (0 – 4B)
