Lecture 14: Intro to Pipelining

Lecture 14:Intro to Pipelining Michael B. Greenwald Computer Architecture CIS 501 Fall 1999

30 40 40 40 40 20 A B C D Pipelining Lessons 6 PM 7 8 9 • Pipelining doesn’t help latency of single task, it helps throughput of entire workload • Pipeline rate limited by slowest pipeline stage • Multiple tasks operating simultaneously • Potential speedup = Number pipe stages • Unbalanced lengths of pipe stages reduces speedup • Time to “fill” pipeline and time to “drain” it reduces speedup Time T a s k O r d e r

5 Steps of DLX DatapathFigure 3.1, Page 130 Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back At end of each cycle, any data needed next cycle is stored in mem, gen reg, special reg or temp

Pipelined DLX DatapathFigure 3.4, page 137 Need Result Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc. Write Back Memory Access Pipeline registers Latch for later • Each instr. active in only 1 stage • 1st 2 stages independent of instr. type.

30 40 40 40 40 20 A B C D Visualizing Pipelining 6 PM 7 8 9 • Pipelining doesn’t help latency of single task, it helps throughput of entire workload • Pipeline rate limited by slowest pipeline stage • Multiple tasks operating simultaneously • Potential speedup = Number pipe stages • Unbalanced lengths of pipe stages reduces speedup • Time to “fill” pipeline and time to “drain” it reduces speedup Time T a s k O r d e r

Visualizing PipeliningFigure 3.3, Page 133 Time (clock cycles) Register Bank Data Memory Instruction Memory ALU Register Bank I n s t r. O r d e r Instruction Level Parallelism (ILP) Which resources active during each clock cycle?

Its Not That Easy for Computers • Decreasing cycle time increases demands on components • Limits to pipelining:Hazards prevent next instruction from executing during its designated clock cycle • Structural hazards: HW cannot support this combination of instructions • Data hazards: Instruction depends on result of prior instruction still in the pipeline • Control hazards: Pipelining of branches & other instructions. Common solution is to stall the pipeline until the hazard “bubbles” in the pipeline

Visualizing Pipelining:Higher performance componentsFigure 3.3, Page 133 Time (clock cycles) Register Bank Data Memory Write Instruction Memory ALU Register Bank I n s t r. O r d e r 10X mem. BW? 2 Reads

Structural Hazards • Functional unit not fully pipelined; (e.g. FPU). Instr. using that unit cannot exec. at 1/clock_cycle. • Functional unit not duplicated enough for all combinations of instructions in the pipe to execute (e.g. 1 ALU needed for ALU and branch instructions).

One Memory Port/Structural HazardsFigure 3.6, Page 142 Time (clock cycles) Load Inactive I n s t r. O r d e r Instr 1 Instr 2 Instr 3

One Memory Port/Structural HazardsFigure 3.7, Page 143 Time (clock cycles) Load I n s t r. O r d e r Instr 1 Instr 2 stall Instr 3

Structural Hazard: Registers? Time (clock cycles) Write: 1st half Register Bank I n s t r. O r d e r Not idle, so can’t just insert bubbles Read: 2nd half Register Bank

Speed Up Equation for Pipelining Speedup from pipelining = Avg Instr Time unpipelined Avg Instr Time pipelined = CPIunpipelined x Clock Cycleunpipelined CPIpipelined x Clock Cyclepipelined = CPIunpipelined Clock Cycleunpipelined CPIpipelined Clock Cyclepipelined Ideal CPI = CPIunpipelined/Pipeline depth Speedup = Ideal CPI x Pipeline depth Clock Cycleunpipelined CPIpipelined Clock Cyclepipelined x x

Speed Up Equation for Pipelining CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instr Speedup = Ideal CPI x Pipeline depth Clock Cycleunpipelined Ideal CPI + Pipeline stall CPI Clock Cyclepipelined Speedup = Pipeline depth Clock Cycleunpipelined 1 + Pipeline stall CPI Clock Cyclepipelined x x

Example: Dual-port vs. Single-port • Machine A: Dual ported memory, to avoid stalls • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate • Ideal CPI = 1 for both • Loads are 40% of instructions executed SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33 • Machine A is 1.33 times faster Stalls can eat up performance gains

Data Hazards • Pipelining can change the relativeorder of instructions by overlapping their execution • A later instructions uses a register before an earlier instruction has finished with it. • add r1,r2,r3sub r4,r1,r3 • Not even deterministic: Will SUB see old or new value of r1?

Data Hazard on R1Figure 3.9, page 147 Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 I n s t r. O r d e r sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11

Data Hazard on R1Figure 3.9, page 147 Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 I n s t r. O r d e r sub r4,r1,r3 and r6,r1,r7 Fixed with structural hazard or r8,r1,r9 xor r10,r1,r11

Three Generic Data Hazards Classification: • Read After Write (RAW) • Write After Read (WAR) • Write After Write (WAW)

Three Generic Data Hazards InstrI followed be InstrJ • Read After Write (RAW)InstrJ tries to read operand before InstrI writes it

Three Generic Data Hazards InstrI followed be InstrJ • Write After Read (WAR)InstrJ tries to write operand before InstrI reads it • Can’t happen in DLX 5 stage pipeline because: • All instructions take 5 stages, • Reads are always in stage 2, and • Writes are always in stage 5

Three Generic Data Hazards InstrI followed be InstrJ • Write After Write (WAW)InstrJ tries to write operand before InstrI writes it • Leaves wrong result ( InstrI not InstrJ) • Can’t happen in DLX 5 stage pipeline because: • All instructions take 5 stages, and • Writes are always in stage 5 • May see WAR and WAW in later more complicated pipes

Forwarding • Some data hazards can be avoided if the result isn’t really needed until after it is actually produced. • Always feed back result to input latches • If hardware detects that previous result is source for current instruction, use value from latch, and not from register file.(If stalled, then forwarding not needed or used).

HW Change for ForwardingFigure 3.20, Page 161 ID/EX EX/MEM MEM/WB For each MUX:3 New Paths3 New Inputs

Forwarding to Avoid RAW Data HazardFigure 3.10, Page 149 Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 I n s t r. O r d e r sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 A.K.A Bypass, Short Circuit

Data Hazard Even with ForwardingFigure 3.12, Page 153 Time (clock cycles) lwr1, 0(r2) I n s t r. O r d e r sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9

Data Hazard Even with ForwardingFigure 3.13, Page 154 Time (clock cycles) I n s t r. O r d e r lwr1, 0(r2) sub r4,r1,r6 and r6,r1,r7 Pipeline interlock or r8,r1,r9

Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd

Control Hazards • When a branch is executed it may (or may not!) change the PC to something other than Current PC+4. • Simple solution: • After detecting branch instruction, stall pipeline until target address is computed. • This introduces 3 cycles of stalls. • DIfferent implementation than data stall, since IF cycle must be repeated.

Control Hazard on BranchesThree Stage Stall

Branch Stall Impact • If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! • Two part solution: • Determine branch taken or not sooner, AND • Compute taken branch address earlier • DLX branch tests if register = 0 or 0 • DLX Solution: • Move Zero test to ID/RF stage • Adder to calculate new PC in ID/RF stage • 1 clock cycle penalty for branch versus 3

Pipelined DLX DatapathFigure 3.4, page 137 Need Result Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc. Write Back Memory Access Pipeline registers Latch for later • Data stationary control • local decode for each instruction phase / pipeline stage

Pipelined DLX DatapathFigure 3.22, page 163 Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc. Memory Access Write Back

Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken • Execute successor instructions in sequence • “Flush” instructions in pipeline if branch actually taken • Advantage of late pipeline state update, since don’t need to “Undo” • 47% DLX branches not taken on average • PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken • 53% DLX branches taken on average • But haven’t calculated branch target address in DLX • DLX still incurs 1 cycle branch penalty

Four Branch Hazard Alternatives #4: Delayed Branch • Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 ........ sequential successorn branch target if taken • 1 slot delay allows proper decision and branch target address in 5 stage pipeline • DLX uses this Branch delay of length n

Delayed Branch • Where to get instructions to fill branch delay slot? • Before branch instruction • From the target address: only valuable when branch taken • From fall through: only valuable when branch not taken • Cancelling branches allow more slots to be filled because become no-op if prediction is wrong. • Compiler effectiveness for single branch delay slot: • Fills about 60% of branch delay slots • About 80% of instructions executed in branch delay slots useful in computation • About 50% (60% x 80%) of slots usefully filled

Evaluating Branch Alternatives Scheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stall Stall pipeline 3 1.42 3.5 1.0 Predict taken 1 1.14 4.4 1.26 Predict not taken 1 1.09 4.5 1.29 Delayed branch 0.5 1.07 4.6 1.31 Conditional & Unconditional = 14%, 65% change PC

Lecture 14: Intro to Pipelining