290 likes | 402 Views
CS152 – Computer Architecture and Engineering Lecture 12 – Pipeline Wrap up: Control Hazards, RAW/WAR/WAW. 2004-10-07 John Lazzaro (www.cs.berkeley.edu/~lazzaro) Dave Patterson (www.cs.berkeley.edu/~patterson) www-inst.eecs.berkeley.edu/~cs152/. Pipelining Review. What makes it easy
E N D
CS152 – Computer Architecture andEngineeringLecture 12 – Pipeline Wrap up: Control Hazards, RAW/WAR/WAW 2004-10-07 John Lazzaro(www.cs.berkeley.edu/~lazzaro) Dave Patterson (www.cs.berkeley.edu/~patterson) www-inst.eecs.berkeley.edu/~cs152/
Pipelining Review • What makes it easy • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • Hazards limit performance • Structural: need more HW resources • Data: need forwarding, compiler scheduling • Data hazards must be handled carefully • MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
Outline • Pipelined Control • Control Hazards • RAW, WAR, WAW • Brainstorm on pipeline bugs
MIPS Pipeline Data / Control Paths A (fast) 1 PCSrc ID/EX 0 EX/MEM EX Control MEM IF/ID Add MEM/WB Branch Add WB 4 RegWrite Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 0 Write Data 0 Write Data 1 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 0 1 RegDst
Control MIPS Pipeline Data / Control Paths (debug) 1 PCSrc ID/EX EX/MEM MEM/WB 0 EX MEM WB Instr Instr Instr IF/ID Control Control Add Branch Add 4 RegWrite Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 0 Write Data 0 Write Data 1 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 0 1 RegDst
MIPS Pipeline Control (pipelined debug) 1 PCSrc ID/EX EX/MEM MEM/WB 0 Instr Instr Instr EX MEM IF/ID WB Control Control Control Add Branch Add 4 RegWrite Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 0 Write Data 0 Write Data 1 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 0 1 RegDst
Control Hazards • When the flow of instruction addresses is not what the pipeline expects; incurred by change of flow instructions • Conditional branches (beq, bne) • Unconditional branches (j) • Possible solutions • Stall • Move decision point earlier in the pipeline • Predict • Delay decision (requires compiler support) • Control hazards occur less frequently than data hazards; there is nothing as effective against control hazards as forwarding is for data hazards
Jump 1 PCSrc 1 0 ID/EX Shift left 2 EX/MEM 0 IF/ID Control Add MEM/WB PC+4[31-28] Branch Add 4 Shift left 2 Read Addr 1 Instruction Memory Data Memory Register File Read Data 1 Read Addr 2 Read Address PC Read Data Address 1 Write Addr ALU Read Data 2 1 Write Data 0 Write Data 0 ALU cntrl 16 32 Sign Extend 0 1 Forward Unit Datapath Branch and Jump Hardware
Administrivia • Finish Lab 3; meet with TA Friday • Midterm Tue Oct 12 5:30 - 8:30 in 101 Morgan • Northwest corner of campus, near Arch and Hearst • Midterm review Sunday Oct 10, 7 PM, 306 Soda • Bring 1 page, handwritten notes, both sides • Nothing electronic: no calculators, cell phones, pagers, … • Meet at LaVal’s Northside afterwards for Pizza
DM DM DM Reg Reg Reg Reg Reg Reg IM IM IM ALU ALU ALU stall Jumps Incur One Stall • Jumps not decoded until ID, so one stall is needed • Fortunately, jumps are very infrequent – only 2% of the SPECint instruction mix j I n s t r. O r d e r lw and
DM DM Reg Reg Reg Reg IM IM IM ALU ALU ALU stall stall stall lw DM Reg and Review: Branches Incur Three Stalls beq Can fix branch hazard by waiting – stall – but affects throughput I n s t r. O r d e r
Moving Branch Decisions Earlier in Pipe • Move the branch decision hardware back to the EX stage • Reduces the number of stall cycles to two • Adds an and gate and a 2x1 mux to the EX timing path • Add hardware to compute the branch target address and evaluate the branch decision to the ID stage • Reduces the number of stall cycles to one (like with jumps) • Computing branch target address can be done in parallel with RegFile read (done for all instructions – only used when needed) • Comparing the registers can’t be done until after RegFile read, so comparing and updating the PC adds a comparator, an and gate, and a 3x1 mux to the ID timing path • Need forwarding hardware in ID stage • For longer pipelines, decision points are later in the pipeline, incurring more stalls, so we need a better solution
Early Branch Forwarding Issues • Bypass of source operands from the EX/MEM if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == IF/ID.RegisterRs)) ForwardC = 1 if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd == IF/ID.RegisterRt)) ForwardD = 1 Forwards the result from the second previous instr. to either input of the Compare • MEM/WB dependency also needs to be forwarded • If the instruction 2 before the branch is a load, then a stall will be required since the MEM stage memory access is occurring at the same time as the ID stage branch compare operation
Branch Prediction • Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome • Predict not taken – always predict branches will not be taken, continue to fetch from the sequential instruction stream, only when branch is taken does the pipeline stall • If taken, flush instructions in the pipeline after the branch • in IF, ID, and EX if branch logic in MEM – three stalls • in IF if branch logic in ID – one stall • ensure that those flushed instructions haven’t changed machine state– automatic in the MIPS pipeline since machine state changing operations are at the tail end of the pipeline (MemWrite or RegWrite) • restart the pipeline at the branch destination
I n s t r. O r d e r DM DM DM DM Reg Reg Reg Reg Reg Reg Reg Reg flush IM IM IM IM ALU ALU ALU ALU 16 and $6,$1,$7 20 or r8,$1,$9 Flushing with Misprediction (Not Taken) • To flush the IF stage instruction, add a IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop) 4 beq $1,$2,2 8 sub $4,$1,$5
Branch Prediction, con’t • Resolve branch hazards by statically assuming a given outcome and proceeding • Predict taken – always predict branches will be taken • Predict taken always incurs a stall (if branch destination hardware has been moved to the ID stage) • As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance • With more hardware, possible to try to predict branch behavior dynamically during program execution • Dynamic branch prediction – predict branches at run-time using run-time information
Dynamic Branch Prediction • A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains a bit that tells whether the branch was taken the last time it was execute • Bit may predict incorrectly (may be from a different branch with the same low order PC bits, or may be a wrong prediction for this branch) but the doesn’t affect correctness, just performance • If the prediction is wrong, flush the incorrect instructions in pipeline, restart the pipeline with the right instructions, and invert the prediction bit • The BHT predicts when a branch is taken, but does not tell where its taken to! • A branch target buffer (BTB) in the IF stage can cache the branch target address (or !even! the branch target instruction) so that a stall can be avoided
1-bit Prediction Accuracy • 1-bit predictor in loop is incorrect twice when not taken • Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code • First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1) • As long as branch is taken (looping), prediction is correct • Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken falling out of the loop; invert prediction bit (predict_bit = 0) Loop: 1st loop instr 2nd loop instr . . . last loop instr bne $1,$2,Loop fall out instr • For 10 times through the loop we have a 80% prediction accuracy for a branch that is taken 90% of the time
2-bit Predictors • A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed right 9 times Loop: 1st loop instr 2nd loop instr . . . last loop instr bne $1,$2,Loop fall out instr wrong on loop fall out Taken Not taken 1 Predict Taken Predict Taken 1 Taken right on 1st iteration Not taken Taken Not taken 0 Predict Not Taken Predict Not Taken 0 Taken Not taken
Delayed Decision • First, move the branch decision hardware and target address calculation to the ID pipeline stage • A delayed branch always executes the next sequential instruction – the branch takes effect after that next instruction • MIPS software moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction) thereby hiding the branch delay • As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot. • Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches • Growth in available transistors has made dynamic approaches relatively cheaper
becomes becomes becomes if $2=0 then add $1,$2,$3 if $1=0 then add $1,$2,$3 sub $4,$5,$6 add $1,$2,$3 if $1=0 then sub $4,$5,$6 Scheduling Branch Delay Slots A. From before branch B. From branch target C. From fall through • A is the best choice, fills delay slot & reduces instruction count (IC) • In B, the sub instruction may need to be copied, increasing IC • In B and C, must be okay to execute sub when branch fails add $1,$2,$3 if $1=0 then add $1,$2,$3 if $2=0 then sub $4,$5,$6 delay slot delay slot add $1,$2,$3 if $1=0 then sub $4,$5,$6 delay slot
3 Generic Data Hazards: RAW, WAR, WAW • Read After Write (RAW)InstrJ tries to read operand before InstrI writes it • Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. • Forwarding handles many, but not all, RAW dependencies in 5 stage MIPS pipeline I: add r1,r2,r3 J: sub r4,r1,r3
I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 3 Generic Data Hazards: RAW, WAR, WAW • Write After Read (WAR)InstrJ writes operand before InstrI reads it • Called an “anti-dependence” by compiler writers.This results from “reuse” of the name “r1”. • Can’t happen in MIPS 5 stage pipeline because: • All instructions take 5 stages, and • Reads are always in stage 2, and • Register Writes must be in stage 5
I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 3 Generic Data Hazards: RAW, WAR, WAW • Write After Write (WAW)InstrJ writes operand before InstrI writes it. • Called an “output dependence” by compiler writersThis also results from the “reuse” of name “r1”. • Can’t happen in MIPS 5 stage pipeline because: • All instructions take 5 stages, and • Register Writes must be in stage 5 • Can see WAR and WAW in more complicated pipes
IF.Flush 0 0 1 Supporting ID Stage Branches PCSrc Branch 1 Hazard Unit ID/EX 0 EX/MEM 0 1 0 Control IF/ID Add MEM/WB 4 Shift left 2 Add Compare Read Addr 1 Instruction Memory Data Memory RegFile Read Addr 2 Read Address Read Data 1 PC Read Data 1 Write Addr ALU Address 1 ReadData 2 Write Data 0 Write Data 0 ALU cntrl 16 Sign Extend 32 Forward Unit Forward Unit
Brain storm on pipeline bugs • Where are bugs likely to hide in a pipelined processor? • … • How can you write tests to uncover these likely bugs? • … • Once it passes a test, never need to run it again in the design process?
Brain storm on pipeline bugs • Depending on branch solution (move to ID, delayed, static prediction, dynamic prediction), where are bugs likely to hide? • … • How can you write tests to uncover these likely bugs? • … • Once it passes a test, don’t need to run it again?
Ifetch Ifetch Ifetch Reg/Dec Reg/Dec Reg/Dec Exec Exec Exec Mem Wr Peer Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock • Suppose we use with a 4 stage pipeline that combines memory access and write back stages for all instructions but load, stalling when there are structural hazards. Impact? 1. The branch delay slot is now 0 instructions 2. Most loads cause stall since often a structural hazard on reg. writes 3. Most stores cause stall since they have a structural hazard 4. Both 2 & 3: most loads&stores cause stall due to structural hazards 5. Most loads cause stall, but there is no load-use hazard anymore 6. Both 2 & 3, but there is no load-use hazard anymore 7. None of the above 1st add Mem/Wr 2nd lw 3rd add Mem/Wr
Summary: Designing a Pipelined Processor • Go back and examine your data path and control diagram • Associate resources with states • Be sure there are no structural hazards: one use / clock cycle • Add pipeline registers between stages to balance clock cycle • Amdahl’s Law suggests splitting longest stage • Resolve all data and control dependencies • If backwards in time in pipeline drawing to registers=> data hazard: forward or stall to resolve them • If backwards in time in pipeline drawing to PC=> control hazard: we’ll see next time • 5 stage pipeline with reads early in same stage, writes later in same stage, avoids WAR/WAW hazards • Assert control in appropriate stage • Develop test instruction sequences likely to uncover pipeline bugs (If you don’t test it, it won’t work )