Lecture 10: Pipelining

Lecture 10: Pipelining Computer Engineering 585 Fall 2001

DLX Stages: RTL activities

Pipelining is Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
• Structural hazards: the hardware cannot support this combination of instructions (same table used for reading the newspaper and for breakfast)
• Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (sharing the towel: hand over from shower to sink)
• Control hazards: pipelining of branches and other instructions that change the PC
• Common solution: stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline

Structural Hazard: Memory Port
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 8, showing Load followed by Instructions 1 to 4 moving through the Mem, Reg, ALU, Mem, Reg stages.]

Bubbles due to One Memory Port
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 8, showing Load, Instruction 1, Instruction 2, a Stall (a row of bubbles), and then Instruction 3.]

Bubbles: Instruction-Time Space

Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

With an ideal CPI of 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$
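The speedup equation above can be sketched as a small Python function to see how stall cycles erode the ideal speedup. This is a minimal illustration, not part of the original slides: the function name `pipeline_speedup`, its parameter names, and the depth of 5 used in the loop are assumptions made for the example.

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, clock_ratio=1.0):
    """Speedup of a pipelined machine over its unpipelined version.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle);
    it is 1.0 when both versions are compared at the same cycle time.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


if __name__ == "__main__":
    # Assumed 5-stage pipeline; each line shows the effect of more stalls.
    for stall_cpi in (0.0, 0.2, 0.4, 1.0):
        speedup = pipeline_speedup(5, stall_cpi)
        print(f"stall CPI {stall_cpi:.1f}: speedup {speedup:.2f}")
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction halves it.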

Example: Dual-port vs. Single-port
• Machine A: dual-ported memory
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads/stores are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline depth}}{0.75 \times \text{Pipeline depth}} = 1.33$$

• Machine A is 1.33 times faster
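The arithmetic in this example can be checked with a few lines of Python. The constant names are illustrative, and the pipeline depth of 5 is an assumption; it cancels out of the A/B ratio, so any positive depth gives the same result.

```python
DEPTH = 5                     # assumed depth; cancels in the final ratio
LOAD_STORE_FRACTION = 0.4     # 40% of instructions are loads/stores
STALL_CYCLES_PER_MEM_OP = 1   # one bubble per load/store on the single port
CLOCK_GAIN_B = 1.05           # Machine B's clock is 1.05 times faster

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = DEPTH / (1 + 0.0)

# Machine B: every load/store costs one stall cycle, offset by the faster clock.
stall_cpi_b = LOAD_STORE_FRACTION * STALL_CYCLES_PER_MEM_OP
speedup_b = DEPTH / (1 + stall_cpi_b) * CLOCK_GAIN_B

ratio = speedup_a / speedup_b

print(f"SpeedUpA = {speedup_a:.2f}")
print(f"SpeedUpB = {speedup_b:.2f}")
print(f"A is {ratio:.2f} times faster than B")
```

Under the same model, Machine B would only break even if loads and stores made up no more than 5% of instructions, since 1.05 / (1 + f) reaches 1 at f = 0.05.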

Data Hazard
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, program execution order in instructions, stages IM, Reg, ALU, DM, Reg, for the sequence:]
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11

Data Forwarding
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, program execution order in instructions, stages IM, Reg, ALU, DM, Reg, showing the result of ADD forwarded to the instructions that follow:]
    ADD R1, R2, R3
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11
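One way to read the forwarding diagram is by the distance between the instruction that produces R1 and each instruction that consumes it. The sketch below classifies each read under the usual assumptions for the classic five-stage pipeline: the producer is an ALU instruction, forwarding paths exist from the EX/MEM and MEM/WB registers, and the register file writes in the first half of a cycle and reads in the second half. The function name `resolve_raw` and the tuple encoding of instructions are hypothetical, introduced only for this illustration.

```python
# How a read-after-write dependence on an ALU result is satisfied,
# keyed by how many instructions separate producer and consumer.
HOW = {
    1: "forward from EX/MEM",
    2: "forward from MEM/WB",
    3: "register file (write first half, read second half)",
}


def resolve_raw(program):
    """Return (consumer, register, how) for every source with an earlier writer.

    Each instruction is (name, dest_register, source_registers).
    """
    last_writer = {}  # register -> index of the latest instruction writing it
    report = []
    for index, (name, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                distance = index - last_writer[reg]
                how = HOW.get(distance, "no hazard")
                report.append((name, reg, how))
        if dest is not None:
            last_writer[dest] = index
    return report


program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]

for consumer, reg, how in resolve_raw(program):
    print(f"{consumer} reads {reg}: {how}")
```

The model deliberately ignores loads as producers; a load's value is not available until the end of MEM, which is the case the load-stall slides cover.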

Hardware Support for Forwarding
[Figure: datapath showing the ID/EX, EX/MEM, and MEM/WB pipeline registers, the Zero? test, the ALU and its input multiplexers, and the data memory.]
FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.

Data Hazard (Add to Store)
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, program execution order in instructions, for the sequence:]
    ADD R1, R2, R3
    LW  R4, 0(R1)
    SW  12(R1), R4
FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here.

Not Forwardable Data Hazards
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 5, program execution order in instructions, for the sequence:]
    LW  R1, 0(R2)
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9

Load Stalls
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, program execution order in instructions, with a bubble inserted in SUB, AND, and OR after the load:]
    LW  R1, 0(R2)
    SUB R4, R1, R5
    AND R6, R1, R7
    OR  R8, R1, R9
FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.

Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
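To see why the second ordering is faster, the load interlocks in each sequence can be counted with a short Python sketch. It assumes full forwarding, so the only stall modeled is the one-cycle bubble when an instruction needs, in its EX stage, a register loaded by the instruction immediately before it. Store data is needed only in MEM, so it is not listed as an EX source here. The function name `count_load_stalls` and the tuple encoding are hypothetical.

```python
def count_load_stalls(program):
    """Count one-cycle load interlocks, assuming full forwarding.

    Each instruction is (opcode, dest_register, registers_needed_in_EX).
    A stall occurs when an instruction uses, in EX, the register written
    by a load that immediately precedes it.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, ex_sources = current
        if prev_op == "LW" and prev_dest in ex_sources:
            stalls += 1
    return stalls


slow_code = [
    ("LW", "Rb", ()),
    ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")),  # needs Rc right after its load
    ("SW", None, ()),
    ("LW", "Re", ()),
    ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")),  # needs Rf right after its load
    ("SW", None, ()),
]

fast_code = [
    ("LW", "Rb", ()),
    ("LW", "Rc", ()),
    ("LW", "Re", ()),             # fills the slot after LW Rc
    ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()),
    ("SW", None, ()),             # fills the slot after LW Rf
    ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ()),
]

print("slow code stalls:", count_load_stalls(slow_code))
print("fast code stalls:", count_load_stalls(fast_code))
```

Under these assumptions the slow ordering incurs two load stalls and the fast ordering none, because the scheduler moved an independent instruction into each load delay slot.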

Effectiveness of Load Scheduling
Int Avg: 25%, FP Avg: 13%, Overall: 24%
[Figure: bar chart of the fraction of loads that cause a stall (axis 0% to 45%) for the benchmarks li, ear, gcc, doduc, mdljdp, su2cor, eqntott, hydro2d, espresso, and compress.]
