Computer Organization and Design

Computer Organization and Design Pipelining Jair Gonzalez Jan 2004

Pipelining everyday
• Laundry example
• Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folder takes 20 minutes

Sequentially
[Figure: timeline from 6 PM to midnight. Tasks A, B, C, D run one after another in task order; each takes 30 + 40 + 20 minutes.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Using Pipelining
[Figure: timeline starting at 6 PM. Tasks A, B, C, D overlap; the timeline is divided into segments of 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
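The 6-hour and 3.5-hour figures can be checked with a small sketch. It assumes each stage (washer, dryer, folder) handles one load at a time and that a finished load can wait in a basket until the next stage is free; the function names are made up for this illustration.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage handles one load at a time; a load starts a stage as soon
    as it has left the previous stage and the stage itself is free."""
    free_at = [0] * len(stage_minutes)  # time at which each stage is next free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for stage, minutes in enumerate(stage_minutes):
            start = max(ready, free_at[stage])
            ready = start + minutes
            free_at[stage] = ready
        finish = ready
    return finish


stages = [30, 40, 20]  # washer, dryer, folder (minutes, from the slide)
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40-minute dryer, the slowest stage, sets the rate.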

Pipelining concepts
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to fill the pipeline and time to drain it reduce speedup
• Stall for dependences

Pipelining of five stages
[Figure: a Load passes through Ifetch, Reg/Dec, Exec, Mem, and Wr in cycles 1 to 5]
• Ifetch: instruction fetch
  - Fetch the instruction from the instruction memory
• Reg/Dec: register fetch and instruction decode
• Exec: calculate the memory address
• Mem: read the data from the data memory
• Wr: write the data back to the register file

Increasing throughput

Basic idea • What do we need to add to split the datapath into stages?

Graphical Representation
• Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
• Use this representation to help understand datapaths

Conventional Representation
[Figure: six instructions, each passing through IFetch, Dcd, Exec, Mem, WB and each starting one cycle after the previous one. Time runs left to right, program flow top to bottom.]
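The conventional representation is easy to generate as text, which also answers the two questions asked of the graphical form. This is a minimal sketch assuming an ideal five-stage pipeline with no stalls; the helper names are invented for the example.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to run the code in an ideal pipeline with no stalls."""
    return num_stages + num_instructions - 1


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle`
    (1-based), or None if it is not in the pipeline during that cycle."""
    position = cycle - 1 - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


def using_stage(stage, cycle, num_instructions):
    """Index of the instruction occupying `stage` during `cycle`, or None."""
    for i in range(num_instructions):
        if stage_in_cycle(i, cycle) == stage:
            return i
    return None


def diagram(num_instructions):
    """One text row per instruction, one column per clock cycle."""
    rows = []
    for i in range(num_instructions):
        cells = [stage_in_cycle(i, c) or ""
                 for c in range(1, total_cycles(num_instructions) + 1)]
        rows.append(" ".join(f"{cell:<6}" for cell in cells).rstrip())
    return "\n".join(rows)


print(diagram(6))
print(total_cycles(6))            # 10
print(using_stage("Exec", 4, 6))  # 1: the second instruction is in the ALU
```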

Single Cycle, Multiple Cycle vs. Pipelining
[Figure: clock timelines for three implementations.
 Single-cycle implementation: Load in cycle 1 and Store in cycle 2, with part of the Store cycle wasted.
 Multiple-cycle implementation (cycles 1 to 10): Load takes Ifetch, Reg, Exec, Mem, Wr; Store takes Ifetch, Reg, Exec, Mem; the R-type instruction then begins with Ifetch.
 Pipeline implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in time.]

Pipeline speedup
• Suppose we execute 100 instructions
• Single-cycle machine
  - 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle machine
  - 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
  - 10 ns/cycle x (1 CPI x 100 inst + 4 cycle fill) = 1040 ns
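The three totals can be reproduced with a few lines of arithmetic. The cycle times and the 4.6 CPI come from the slide; the function names and default arguments are only for this sketch.

```python
def single_cycle_ns(instructions, cycle_ns=45):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * instructions


def multicycle_ns(instructions, cycle_ns=10, cpi=4.6):
    """Short cycles, but several of them per instruction."""
    return cycle_ns * cpi * instructions


def ideal_pipeline_ns(instructions, cycle_ns=10, stages=5):
    """One instruction completes per cycle after (stages - 1) fill cycles."""
    return cycle_ns * (instructions + stages - 1)


n = 100
print(single_cycle_ns(n))    # 4500
print(ideal_pipeline_ns(n))  # 1040
print(round(single_cycle_ns(n) / ideal_pipeline_ns(n), 2))  # 4.33
print(round(multicycle_ns(n) / ideal_pipeline_ns(n), 2))    # 4.42
```

With only 100 instructions the 4 fill cycles keep the speedup over the multicycle machine slightly below the CPI ratio of 4.6.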

Pipelining cost
• Structural hazards: attempt to use the same resource in two different ways at the same time
  - E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
• Control hazards: attempt to make a decision before the condition is evaluated
  - E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before starting the next load
  - Branch instructions
• Data hazards: attempt to use an item before it is ready
  - E.g., one sock of a pair is in the dryer and one is in the washer; can't fold until the sock in the washer gets through the dryer
  - An instruction depends on the result of a prior instruction still in the pipeline
• Can always resolve hazards by waiting
  - Pipeline control must detect the hazard
  - Take action (or delay action) to resolve hazards

Single memory: a structural hazard
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction uses Mem, Reg, ALU, Mem, Reg in turn. A highlighted right half means read, a highlighted left half means write.]
• Detection is easy in this case!
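Why detection is easy here can be shown in a few lines: with one shared memory, the only conflict is between an instruction fetch and the Mem stage of a load or store issued three instructions earlier. The sketch below assumes the five-stage pipeline with instruction i (0-based) fetched in cycle i + 1; the function name and the opcode set are illustrative.

```python
MEM_OPS = {"lw", "sw"}  # instructions that access memory in the Mem stage


def memory_conflicts(program):
    """Return (cycle, fetching_index, accessing_index) for each cycle in
    which a single shared memory is needed by both an instruction fetch
    and a data access. `program` is a list of opcode names."""
    conflicts = []
    for i, op in enumerate(program):
        if op not in MEM_OPS:
            continue
        mem_cycle = i + 4            # IF in cycle i + 1, Mem three stages later
        fetch_index = mem_cycle - 1  # the instruction being fetched in that cycle
        if fetch_index < len(program):
            conflicts.append((mem_cycle, fetch_index, i))
    return conflicts


# Load followed by Instr 1 to Instr 4, as in the figure
print(memory_conflicts(["lw", "add", "add", "add", "add"]))  # [(4, 3, 0)]
```

In cycle 4 the Load reads data memory while Instr 3 is being fetched, which matches the conflict shown in the diagram.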

Control Hazard
• Stall: wait until the decision is clear
• Impact: 3 clock cycles per branch instruction => slow
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Add, Beq, Load, with Load delayed until the branch decision is known]

Control Hazard
• Stall: wait until the decision is clear
• It's possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are being read
• Impact: 2 clock cycles per branch instruction
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Add, Beq, Load]

Control Hazard Solutions
• Predict: guess one direction, then back up if wrong
  - Predict not taken
• Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right about 50% of the time)
• More dynamic scheme: history of 1 branch (right about 90% of the time)
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Add, Beq, Load]
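The impact of prediction is an expected value over right and wrong guesses. The sketch below uses the slide's costs (1 cycle if right, 2 if wrong) and its two accuracy figures; the function name is made up for the example.

```python
def cycles_per_branch(accuracy, right_cost=1, wrong_cost=2):
    """Average clock cycles per branch for a given prediction accuracy."""
    return accuracy * right_cost + (1 - accuracy) * wrong_cost


print(cycles_per_branch(0.5))            # 1.5  (predict not taken)
print(round(cycles_per_branch(0.9), 2))  # 1.1  (1-branch history)
```

Both are an improvement over the fixed 2 cycles per branch of the stall scheme with the decision made in the 2nd stage.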

Control Hazard Solutions
• Redefine branch behavior (the branch takes place after the next instruction): delayed branch
• Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the slot (about 50% of the time)
[Figure: pipeline diagram, time (clock cycles) against instruction order, for Add, Beq, Misc, Load, with Misc in the branch delay slot]

Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

Data Hazard on r1
• Dependencies backwards in time are hazards
[Figure: pipeline diagram, time (clock cycles) against instruction order, with stages IF, ID/RF, EX, MEM, WB, for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11. Each instruction uses Im, Reg, ALU, Dm, Reg.]
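Which of the four readers of r1 actually see a dependence "backwards in time" can be worked out from the stage numbers. This sketch assumes no forwarding, register read in stage 2 (ID/RF), register write in stage 5 (WB), and a register file that writes in the first half of a cycle and reads in the second half; the tuple layout and function name are invented for the illustration.

```python
ID_STAGE = 2  # registers are read in ID/RF
WB_STAGE = 5  # registers are written in WB


def raw_hazards(program, split_cycle_regfile=True):
    """program: list of (text, dest, sources); instruction k is fetched in
    cycle k + 1. Returns (consumer, producer, register) for every read that
    happens before the producing instruction has written the register."""
    hazards = []
    for c, (c_text, _, c_srcs) in enumerate(program):
        for p in range(c):
            p_text, p_dest, _ = program[p]
            if p_dest not in c_srcs:
                continue
            write_cycle = p + WB_STAGE
            read_cycle = c + ID_STAGE
            if split_cycle_regfile:
                too_early = read_cycle < write_cycle
            else:
                too_early = read_cycle <= write_cycle
            if too_early:
                hazards.append((c_text, p_text, p_dest))
    return hazards


program = [
    ("add r1,r2,r3", "r1", ("r2", "r3")),
    ("sub r4,r1,r3", "r4", ("r1", "r3")),
    ("and r6,r1,r7", "r6", ("r1", "r7")),
    ("or r8,r1,r9", "r8", ("r1", "r9")),
    ("xor r10,r1,r11", "r10", ("r1", "r11")),
]
for consumer, producer, reg in raw_hazards(program):
    print(f"{consumer} reads {reg} before {producer} writes it")
```

Under these assumptions sub and and are hazards, or reads r1 in the same cycle add writes it, and xor is safe. Without the split-cycle register file (`split_cycle_regfile=False`) or would be a hazard as well.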

Data Hazard Solution
• Forward the result from one stage to another
[Figure: pipeline diagram, time (clock cycles) against instruction order, with stages IF, ID/RF, EX, MEM, WB, for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with the result of add forwarded to the later instructions]

Forwarding (or Bypassing): What about Loads?
• Dependencies backwards in time are hazards
• Can't solve with forwarding
• Must delay/stall the instruction dependent on the load
[Figure: pipeline diagram, time (clock cycles), with stages IF, ID/RF, EX, MEM, WB, for lw r1,0(r2) followed by sub r4,r1,r3]

What about Loads
[Figure: pipeline diagram, time (clock cycles), with stages IF, ID/RF, EX, MEM, WB, for lw r1,0(r2) followed by sub r4,r1,r3, with a stall inserted before sub]
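The load case can be sketched as a check between adjacent instructions: if an instruction reads the register that the immediately preceding load writes, one bubble goes between them. This assumes forwarding covers every other case, so a single stall cycle is enough; the tuple layout and the name `insert_load_stalls` are hypothetical.

```python
BUBBLE = ("bubble", None, ())


def insert_load_stalls(program):
    """program: list of (op, dest, sources). Returns the issue sequence with
    one bubble after each load whose result the very next instruction reads."""
    issued = []
    for i, instr in enumerate(program):
        _, _, sources = instr
        if i > 0:
            prev_op, prev_dest, _ = program[i - 1]
            if prev_op == "lw" and prev_dest in sources:
                issued.append(BUBBLE)  # wait for the load's Mem stage
        issued.append(instr)
    return issued


program = [
    ("lw", "r1", ("r2",)),        # lw  r1,0(r2)
    ("sub", "r4", ("r1", "r3")),  # sub r4,r1,r3
]
print([op for op, _, _ in insert_load_stalls(program)])  # ['lw', 'bubble', 'sub']
```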

When Designing a Pipelined Processor
• Go back and examine your datapath and control diagram
• Associate resources with states
• Ensure that flows do not conflict, or figure out how to resolve the conflicts
• Assert control in the appropriate stage

Like in the book

Pipelining everyday
[Figure: 1st lw, 2nd lw, and 3rd lw each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, starting one clock cycle apart, over cycles 1 to 7]
• The five independent functional units in the pipeline datapath are:
  - Instruction memory for the Ifetch stage
  - Register file's read ports (bus A and bus B) for the Reg/Dec stage
  - ALU for the Exec stage
  - Data memory for the Mem stage
  - Register file's write port (bus W) for the Wr stage

Observation
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions

Recall: Single-cycle control!
[Figure: single-cycle processor. Control receives the instruction and the conditions from the datapath and sends control signals back. The datapath contains the PC, next-address logic, an ideal instruction memory, 32 32-bit registers (ports Ra, Rb, Rw, selected by the 5-bit fields Rs, Rt, Rd), the ALU, and an ideal data memory, all clocked by Clk.]

Data Stationary Control
• The Main Control generates the control signals during Reg/Dec
  - Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  - Control signals for Mem (MemWr, Branch) are used 2 cycles later
  - Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control sits in Reg/Dec after the IF/ID register. ExtOp, ALUSrc, ALUOp, and RegDst are carried as far as the ID/Ex register; MemWr and Branch as far as the Ex/Mem register; MemtoReg and RegWr as far as the Mem/Wr register.]
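Data-stationary control can be modeled as a bundle of signals that moves with its instruction and sheds the signals each stage has consumed. The sketch below uses the signal names from the slide; the `clock` and `drop` helpers and the particular values chosen for lw are assumptions made for illustration.

```python
EXEC_SIGNALS = ("ExtOp", "ALUSrc", "ALUOp", "RegDst")  # consumed in Exec
MEM_SIGNALS = ("MemWr", "Branch")                      # consumed in Mem
# MemtoReg and RegWr are what remains for the Wr stage


def drop(signals, used):
    """Remove the signals a stage has consumed; None models an empty stage."""
    if signals is None:
        return None
    return {name: value for name, value in signals.items() if name not in used}


def clock(regs, decoded):
    """Advance one clock edge: every bundle moves to the next pipeline register."""
    return {
        "ID/Ex": decoded,
        "Ex/Mem": drop(regs["ID/Ex"], EXEC_SIGNALS),
        "Mem/Wr": drop(regs["Ex/Mem"], MEM_SIGNALS),
    }


# Assumed control values for lw, as decoded by the Main Control in Reg/Dec
LW = {"ExtOp": 1, "ALUSrc": 1, "ALUOp": "add", "RegDst": 0,
      "MemWr": 0, "Branch": 0, "MemtoReg": 1, "RegWr": 1}

regs = {"ID/Ex": None, "Ex/Mem": None, "Mem/Wr": None}
regs = clock(regs, LW)    # Exec signals are available 1 cycle after decode
regs = clock(regs, None)  # Mem signals are available 2 cycles after decode
regs = clock(regs, None)  # Wr signals are available 3 cycles after decode
print(regs["Mem/Wr"])     # {'MemtoReg': 1, 'RegWr': 1}
```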

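Datapath + Data Stationary Control
[Figure: pipelined datapath with PC and Next PC logic, Inst. Mem, IR, Decode, Reg. File, Exec, Mem Access, Data Mem, and latches A, B, S, M, D. Control fields (op, fun, rs, rt, im, rw, and the ex, me, wb groups with their valid bits v) travel alongside the data through Mem Ctrl and WB Ctrl.]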

A try
10  lw   r1, r2(35)
14  addI r2, r2, 3
20  sub  r3, r4, r5
24  beq  r6, r7, 100
30  ori  r8, r9, 17
34  add  r10, r11, r12
100 and  r13, r14, r15

Start: Fetch 10
[Figure: datapath snapshot. PC = 10; the four later stages are empty (n).]

Fetch 14, Decode 10
[Figure: datapath snapshot. PC = 14; IR holds lw r1, r2(35), which is being decoded (register 2 is read); the three later stages are empty (n).]

Fetch 20, Decode 14, Exec 10
[Figure: datapath snapshot. PC = 20; IR holds addI r2, r2, 3; lw r1 is in Exec with operands r2 and 35; the two later stages are empty (n).]

Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: datapath snapshot. PC = 24; IR holds sub r3, r4, r5 (registers 4 and 5 are read); addI r2, r2, 3 is in Exec with operand r2; lw r1 is in Mem with address r2+35; the last stage is empty (n).]

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: datapath snapshot. PC = 30; beq is being decoded (registers 6 and 7 are read); sub r3 is in Exec with operands r4 and r5; addI r2 is in Mem with result r2+3; lw r1 is in WB with M[r2+35].]
• Note the delayed branch: always execute ori after beq

Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: datapath snapshot. PC = 34; IR holds ori r8, r9, 17 (register 9 is read); beq is in Exec with operands r6 and r7 and target 100; sub r3 is in Mem with result r4-r5; addI r2 is in WB with r2+3; the register file now holds r1 = M[r2+35].]

Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Figure: datapath snapshot. PC = 100; IR holds add r10, r11, r12 (registers 11 and 12 are read); ori r8 is in Exec with operands r9 and 17; beq is in Mem; sub r3 is in WB with r4-r5; the register file holds r1 = M[r2+35] and r2 = r2+3.]
• Oops, we should have only one delayed instruction

Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Figure: datapath snapshot. PC = 104; IR holds and r13, r14, r15 (registers 14 and 15 are read); add r10 is in Exec with operands r11 and r12; ori r8 is in Mem with result r9 | 17; beq is in WB; the register file holds r1 = M[r2+35], r2 = r2+3, and r3 = r4-r5.]
• Squash the extra instruction in the branch shadow!

Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30
[Figure: datapath snapshot. PC = 110; and r13 is in Exec with operands r14 and r15; add r10 is in Mem with r11+r12; ori r8 is in WB with r9 | 17; the register file holds r1 = M[r2+35], r2 = r2+3, and r3 = r4-r5.]
• Squash the extra instruction in the branch shadow!

Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Figure: datapath snapshot. PC = 114; and r13 is in Mem with r14 & r15; add r10 is in WB but squashed: no write-back, no overflow; the register file holds r1 = M[r2+35], r2 = r2+3, r3 = r4-r5, and r8 = r9 | 17.]
• Squash the extra instruction in the branch shadow!

Pipelined processor
[Figure: pipelined datapath with a control block per stage (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl), an instruction register per stage (IR, IRex, IRmem, IRwb), an Equal comparison on the register outputs, and Valid, Stall, and Bubble signals between the stages]
• Separate control at each stage
• Stalls propagate backwards to freeze previous stages
• Bubbles are introduced into the pipeline by placing no-ops into the local stage and stalling the previous stages

Recap: Data Hazards
[Figure: pipeline diagrams (stages IF, DCD, EX, Mem, WB, and a variant with OF and RS stages) illustrating a RAW data hazard, a WAW data hazard, and a WAR data hazard]
• Avoid some by design
  - Eliminate WAR by always fetching operands early (DCD) in the pipe
  - Eliminate WAW by doing all WBs in order (last stage, static)
• Detect and resolve the remaining ones
  - Stall or forward (if possible)

Hazard Detection
[Figure: instruction movement through the pipeline, showing New Inst, Inst I, and Inst J. Window on execution: only pending instructions can cause exceptions.]
• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
• A RAW hazard exists on register r if r ∈ Rregs(i) ∩ Wregs(j)
  - Keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction.
  - When an instruction issues, reserve its result register.
  - When an operation completes, remove its write reservation.
• A WAW hazard exists on register r if r ∈ Wregs(i) ∩ Wregs(j)
• A WAR hazard exists on register r if r ∈ Wregs(i) ∩ Rregs(j)
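The three definitions are set intersections, so they translate directly into code. The sketch below follows the slide's notation (i is the instruction about to issue, j is a predecessor still in the pipe); the function `hazards` and the class `PendingWrites` are names invented for this example, not part of any real design.

```python
def hazards(i_reads, i_writes, j_reads, j_writes):
    """Registers on which instruction i has a hazard with predecessor j."""
    return {
        "RAW": set(i_reads) & set(j_writes),   # i reads what j has yet to write
        "WAW": set(i_writes) & set(j_writes),  # both write the same register
        "WAR": set(i_writes) & set(j_reads),   # i writes what j has yet to read
    }


class PendingWrites:
    """Record of result registers reserved by instructions still in the pipe."""

    def __init__(self):
        self.pending = set()

    def can_issue(self, operand_regs):
        """No RAW hazard if no operand has a pending write."""
        return not (set(operand_regs) & self.pending)

    def issue(self, result_reg):
        self.pending.add(result_reg)      # reserve the result register

    def complete(self, result_reg):
        self.pending.discard(result_reg)  # remove the write reservation


# j = add r1,r2,r3   i = sub r4,r1,r3
print(hazards(i_reads={"r1", "r3"}, i_writes={"r4"},
              j_reads={"r2", "r3"}, j_writes={"r1"})["RAW"])  # {'r1'}

record = PendingWrites()
record.issue("r1")                       # add issues and reserves r1
print(record.can_issue({"r1", "r3"}))    # False: sub must wait
record.complete("r1")                    # add completes
print(record.can_issue({"r1", "r3"}))    # True
```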

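Resolve RAW by forwarding (or bypassing)
• Detect the nearest valid write to the operand register and forward it into the op latches, bypassing the remainder of the pipe
• Increase the muxes to add paths from the pipeline registers
• Data forwarding = data bypassing
[Figure: datapath with PC, npc, I mem, and Regs; a forward mux in front of the A and B operand latches selects between the register file output and the results held in the later pipeline registers (alu output S, memory output m/D), using the op, rw, rs, rt fields carried in each stage.]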
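The forward mux selection can be sketched as a priority search: look at the later pipeline stages from nearest to farthest and take the first valid write to the operand register, falling back to the register file. The tuple layout `(valid, dest_reg, value)` and the function name are assumptions made for this illustration.

```python
def forward_operand(src_reg, regfile_value, later_stages):
    """Select the value for operand `src_reg`.

    later_stages: list of (valid, dest_reg, value) for the instructions
    ahead in the pipe, ordered nearest first (for example Exec result,
    then Mem result). The nearest valid write wins because it is the most
    recent definition of the register."""
    for valid, dest_reg, value in later_stages:
        if valid and dest_reg == src_reg:
            return value
    return regfile_value  # no pending write: use the register file


# r1 has two pending writes; the nearest one (value 7) is forwarded
print(forward_operand("r1", 0, [(True, "r1", 7), (True, "r1", 5)]))   # 7
# a squashed (invalid) instruction is ignored
print(forward_operand("r1", 0, [(False, "r1", 7), (True, "r1", 5)]))  # 5
# no match: read the register file
print(forward_operand("r2", 42, [(True, "r1", 7)]))                   # 42
```

A real MIPS-style design would also refuse to forward writes to the hardwired zero register; that detail is left out here.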


Speedup

$$\text{Pipeline speedup} = \frac{\text{Average instruction time without pipelining}}{\text{Average instruction time with pipelining}} = \frac{\text{CPI without pipelining} \times \text{Clock cycle without pipelining}}{\text{CPI with pipelining} \times \text{Clock cycle with pipelining}}$$

$$\text{Ideal CPI} = \frac{\text{CPI without pipelining}}{\text{Pipeline depth}}$$

$$\text{Pipeline speedup} = \frac{\text{Clock cycle without pipelining} \times \text{Ideal CPI} \times \text{Pipeline depth}}{\text{Clock cycle with pipelining} \times \text{CPI with pipelining}}$$

$$\text{CPI with pipelining} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}$$

Pipelining speedup

$$\text{Speedup} = \frac{\text{Clock cycle without pipelining} \times \text{Ideal CPI} \times \text{Pipeline depth}}{\text{Clock cycle with pipelining} \times (\text{Ideal CPI} + \text{Pipeline stall cycles})}$$

Ignoring the potential increase in clock rate:

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall cycles}}$$
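The final formula is easy to evaluate for a given depth and stall rate. In the sketch below the depth of 5 matches the five-stage pipeline used throughout, while the stall figures are made-up inputs chosen only to show the trend.

```python
def pipeline_speedup(depth, stall_cycles_per_instr, ideal_cpi=1.0):
    """Speedup = Ideal CPI x depth / (Ideal CPI + stall cycles per instruction),
    ignoring any change in clock cycle time."""
    return ideal_cpi * depth / (ideal_cpi + stall_cycles_per_instr)


print(pipeline_speedup(5, 0.0))   # 5.0: no stalls, speedup equals the depth
print(pipeline_speedup(5, 0.25))  # 4.0: a quarter-cycle of stalls per instruction
print(pipeline_speedup(5, 1.0))   # 2.5: one stall cycle per instruction
```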
