CPE 335 Computer Organization Basic MIPS Pipelining – Part III. Dr. Iyad Jafar. Adapted from Dr. Gheith Abandah slides http://www.abandah.com/gheith/Courses/CPE335_S08/index.html
Branch Instructions Cause Control Hazards • The address of the instruction to be fetched after the beq instruction is known in the MEM stage • Dependencies backward in time cause hazards • [Figure: pipeline diagram showing beq, lw, Inst 3, and Inst 4 in instruction order, each passing through IM, Reg, ALU, DM, Reg]
Fixing Control Hazard by Stalls • Delay fetching the next instruction by 3 cycles • IF (ID/EX.Branch) then stall the pipeline for the next 3 cycles • Is it actually a stall? • [Figure: pipeline diagram showing beq followed by three stall cycles, then lw and Inst 3]
Control Hazards • Flush these instructions (set control values to 0)
Reducing the Cost of Control Hazard • Approach I: Modify the ID Stage • Compute the branch address in the ID stage • Compare the two registers in the ID stage using additional hardware • This reduces the stalls to one • Any complications? • [Figure: pipeline diagram showing beq followed by one stall cycle, then lw]
Reducing the Cost of Control Hazard • Approach I – continued • IF (ID/EX.Branch) then Flush IF/ID register
Reducing the Cost of Control Hazard • Approach II (Static Branch Prediction) • Always assume the branch is not taken and fetch the next sequential instruction • If the assumption is true, no additional cost is associated with the branch • If the assumption is false, we have to ignore the fetched instruction and fetch the instruction at the branch address • IF (ID/EX.Branch and ID/EX.ZERO) then Flush IF/ID register • Unlike Approach I, flushing here is conditional!
Reducing the Cost of Control Hazard • Approach III (Dynamic Branch Prediction) • Use a history table or branch prediction buffer to store the branch prediction based on the last branch result • The table is addressed by the lower bits of the branch instruction address • If the branch is predicted untaken: • Fetch the next sequential instruction • Later, if it turns out that the branch is taken, flush the pipeline, i.e. one cycle is lost • If the branch is predicted taken: • We still have to wait for the computation of the branch address, so we have to wait one cycle • Use a branch target buffer to store the branch address of this instruction
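The table lookup described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `BranchHistoryTable`, the 4-bit index width, and the example addresses are hypothetical and do not come from the slides. The sketch drops the two low address bits before indexing because MIPS instructions are word aligned.

```python
class BranchHistoryTable:
    """1-bit branch history table indexed by low bits of the branch address.

    Hypothetical sketch: the name, the table size, and the initial
    prediction are illustrative assumptions.
    """

    def __init__(self, index_bits=4, initial_taken=False):
        self.mask = (1 << index_bits) - 1
        self.entries = [initial_taken] * (1 << index_bits)

    def _index(self, pc):
        # Drop the two always-zero bits of a word-aligned address,
        # then keep only the low-order index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # True means "predict taken".
        return self.entries[self._index(pc)]

    def update(self, pc, taken):
        # A 1-bit entry simply remembers the last outcome.
        self.entries[self._index(pc)] = taken


bht = BranchHistoryTable(index_bits=4)
bht.update(0x00400010, True)          # branch at this address was taken
print(bht.predict(0x00400010))        # same branch: predicted taken
print(bht.predict(0x00400050))        # different branch, same index bits
print(bht.predict(0x00400014))        # different index: still the initial value
```

Because only the low bits select an entry, two different branches can share one entry (the second lookup above), so a prediction may reflect another branch's history.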
Reducing the Cost of Control Hazard • Approach III (Dynamic Branch Prediction) • 1-bit dynamic branch predictor • Use one bit to store the prediction • Update prediction • Performance shortcoming? • Consider branching in loops! We may mispredict twice.
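The loop shortcoming can be checked with a small simulation. The sketch below assumes a loop-closing branch that is taken nine times and then falls through, with the loop run three times; the function name and the loop shape are illustrative assumptions, not part of the slides.

```python
def one_bit_mispredictions(outcomes, initial_taken=True):
    """Count mispredictions of a 1-bit predictor over a list of outcomes.

    outcomes is a list of booleans (True = branch taken).
    """
    prediction = initial_taken
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # the single bit remembers only the last outcome
    return misses


# Hypothetical loop branch: taken 9 times, then not taken on loop exit.
loop = [True] * 9 + [False]

print(one_bit_mispredictions(loop))        # first run of the loop
print(one_bit_mispredictions(loop * 3))    # loop executed three times
```

Apart from the first run, where the initial prediction happens to match, every run of the loop costs two mispredictions: one on exit, and one on re-entry because the bit was flipped by the exit.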
Reducing the Cost of Control Hazard • Approach III (Dynamic Branch Prediction) • 2-bit dynamic branch predictor • The prediction should be wrong twice before it is changed • [Figure: state diagram with four states, a strong and a weak state for each prediction]
Example • Consider a program that has a conditional branch instruction whose outcomes, when the program is executed, are given below. • T-T-N-T-T-N-T • List the predictions for the following branch prediction schemes and find the prediction accuracy. • Predict always taken • Predict always untaken • 1-bit predictor, initialized to predict taken • 2-bit predictor, initialized to weakly predict taken
Example • Actual branch actions: T-T-N-T-T-N-T • Predict as always taken • Predictions: T-T-T-T-T-T-T • Accuracy = 5/7 = 71% • Predict as always untaken • Predictions: N-N-N-N-N-N-N • Accuracy = 2/7 = 29% • 1-bit predictor initialized to predict taken • Predictions: T-T-T-N-T-T-N • Accuracy = 3/7 = 43% • 2-bit predictor initialized to weakly predict taken • Predictions: T-T-T-T-T-T-T • Accuracy = 5/7 = 71%
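The four schemes in this example can be replayed mechanically. The sketch below is one possible way to do so; the function names are hypothetical, and the 2-bit predictor is assumed to be a saturating counter (states 0 to 3, with 2 meaning weakly taken), which may differ in detail from the state diagram in the lecture.

```python
def one_bit_predictions(outcomes, start="T"):
    """Each prediction is simply the previous actual outcome."""
    predictions, last = [], start
    for outcome in outcomes:
        predictions.append(last)
        last = outcome
    return "".join(predictions)


def two_bit_predictions(outcomes, state=2):
    """Assumed saturating counter: 0-1 predict N, 2-3 predict T."""
    predictions = []
    for outcome in outcomes:
        predictions.append("T" if state >= 2 else "N")
        state = min(state + 1, 3) if outcome == "T" else max(state - 1, 0)
    return "".join(predictions)


def hits(predictions, outcomes):
    """Number of positions where the prediction matched the outcome."""
    return sum(p == o for p, o in zip(predictions, outcomes))


actual = "TTNTTNT"
schemes = {
    "always taken": "T" * len(actual),
    "always untaken": "N" * len(actual),
    "1-bit, start taken": one_bit_predictions(actual),
    "2-bit, start weakly taken": two_bit_predictions(actual),
}
for name, predictions in schemes.items():
    correct = hits(predictions, actual)
    print(f"{name}: {'-'.join(predictions)} -> {correct}/{len(actual)}")
```

The 1-bit predictor does worse than always predicting taken here because every not-taken outcome causes two misses: the outcome itself and the following taken branch.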
Example • Let’s compare the performance of single-cycle, multi-cycle, and pipelined implementations of the MIPS processor given the operation times and instruction mix below. Assume that: • The branch decision is made in the MEM stage. Branch handling in the pipelined implementation is done by stalling the pipeline. • Half of the load instructions incur a load-use hazard. Forwarding is implemented.
Example • Clock cycle time • Single-cycle = 200 + 50 + 100 + 50 + 200 = 600 ps • Multi-cycle = 200 ps • Pipeline = 200 ps • CPI • Single-cycle = 1 • Multi-cycle = 5 x 0.25 + 4 x 0.52 + 4 x 0.10 + 3 x 0.11 + 3 x 0.02 = 4.12 • Pipeline = 0.125 x 2 + 0.125 x 1 + 0.52 x 1 + 0.10 x 1 + 0.11 x 4 + 0.02 x 2 = 1.475 • Execution time per instruction • Single-cycle = 600 ps • Multi-cycle = 4.12 x 200 ps = 824 ps • Pipeline = 1.475 x 200 ps = 295 ps
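The weighted sums on this slide are easy to re-evaluate in code. The sketch below uses the fractions and cycle counts exactly as they appear in the CPI expressions; the instruction class labels (load, ALU, store, branch, jump) are an assumption inferred from those cycle counts and the stated hazards, since the instruction mix table itself is not reproduced here.

```python
CYCLE_PS = 200  # clock cycle of the multi-cycle and pipelined designs

# class label (assumed): (fraction, multi-cycle cycles, pipelined cycles)
MIX = {
    "load with load-use stall": (0.125, 5, 2),
    "load without stall": (0.125, 5, 1),
    "ALU": (0.52, 4, 1),
    "store": (0.10, 4, 1),
    "branch (3 stall cycles)": (0.11, 3, 4),
    "jump (1 stall cycle)": (0.02, 3, 2),
}


def average_cpi(column):
    """Weighted average of the chosen cycle column (1 = multi, 2 = pipeline)."""
    return sum(row[0] * row[column] for row in MIX.values())


multi_cpi = average_cpi(1)
pipe_cpi = average_cpi(2)

print(f"multi-cycle: CPI {multi_cpi:.3f}, {multi_cpi * CYCLE_PS:.0f} ps per instruction")
print(f"pipelined:   CPI {pipe_cpi:.3f}, {pipe_cpi * CYCLE_PS:.0f} ps per instruction")
```

Evaluating the pipelined terms as written gives a CPI of 1.475, i.e. 295 ps per instruction at a 200 ps clock. Changing a row of `MIX` is a quick way to explore other assumptions, such as a different branch penalty.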
Exercise • Redo the computations in the previous example, assuming that branch prediction is used in the pipelined implementation and one-quarter of the branches are mispredicted!
Summary • All modern processors use pipelining • Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload • Potential speedup: a CPI of 1 and a fast clock cycle (CC) • Pipeline rate is limited by the slowest pipeline stage • Unbalanced pipe stages make for inefficiencies • The time to “fill” the pipeline and the time to “drain” it can impact speedup for deep pipelines and short code runs • Must detect and resolve hazards • Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
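The fill and drain effect can be made concrete with a small calculation. The sketch below is an idealized model under stated assumptions: perfectly balanced stages, no hazards or stalls, and an unpipelined machine that spends one stage time per stage on every instruction. The function names are hypothetical.

```python
def pipelined_cycles(stages, instructions):
    """Cycles to run n instructions on an ideal k-stage pipeline.

    The first instruction needs `stages` cycles to pass through the pipe;
    after that, one instruction completes every cycle.
    """
    return stages + instructions - 1


def speedup(stages, instructions):
    """Speedup over running each instruction through all stages in turn."""
    unpipelined_cycles = stages * instructions
    return unpipelined_cycles / pipelined_cycles(stages, instructions)


for n in (1, 5, 100, 10000):
    print(n, round(speedup(5, n), 2))
```

For a single instruction there is no speedup at all, and the speedup only approaches the number of stages when the instruction count is large compared with the pipeline depth, which is why short code runs on deep pipelines benefit less.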