ECE232: Hardware Organization and Design

Part 12: Pipelining II
Chapter 4 (6 in 3rd edition)
http://www.ecs.umass.edu/ece/ece232/

Benefits of forwarding
Consider the following stretch of assembly code executed on a pipelined implementation of MIPS.
Handling branches:
• No branch prediction; conditional branches are resolved in the EX stage
• All fetches following a conditional branch are flushed until the branch is resolved (no speculative fetching)
How long does it take to execute?
        addi $t0, $t1, 40
Loop:   lw   $t2, 0($t1)
        addi $t2, $t2, 3
        sw   $t2, 0($t1)
        addi $t1, $t1, 4
        bne  $t0, $t1, Loop

No forwarding (loop executed 10 times)
        addi $t0, $t1, 40
Loop:   lw   $t2, 0($t1)
        addi $t2, $t2, 3
        sw   $t2, 0($t1)
        addi $t1, $t1, 4
        bne  $t0, $t1, Loop
13 * 10 + 1 = 131 cycles (13 cycles per loop iteration for 10 iterations, plus 1 cycle for the initial addi)

With forwarding
        addi $t0, $t1, 40
Loop:   lw   $t2, 0($t1)
        addi $t2, $t2, 3
        sw   $t2, 0($t1)
        addi $t1, $t1, 4
        bne  $t0, $t1, Loop
9 * 10 + 1 = 91 cycles
speedup = 131/91 = 1.44
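The arithmetic on the two slides above can be checked with a small sketch. The per-iteration counts (13 and 9) are taken from the slides as given; the function name `total_cycles` and its parameters are hypothetical, and pipeline fill/drain time is ignored just as on the slides.

```python
def total_cycles(cycles_per_iteration, iterations, setup_cycles=1):
    """Cycles for the loop body plus the one-off setup instruction (the first addi)."""
    return cycles_per_iteration * iterations + setup_cycles

# Per-iteration counts are the ones quoted on the slides, not derived here.
no_forwarding = total_cycles(13, 10)
with_forwarding = total_cycles(9, 10)
speedup = no_forwarding / with_forwarding

print(no_forwarding, with_forwarding, round(speedup, 2))
```

Changing the iteration count shows that the speedup stays close to 13/9 for long loops, since the single setup cycle matters less and less.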

Avoiding Hazard by Reordering Code
• How would you reorder the stretch of code after the first addi and before the bne instruction to make it run faster?
        addi $t0, $t1, 40
Loop:   lw   $t2, 0($t1)
        addi $t2, $t2, 3
        sw   $t2, 0($t1)
        addi $t1, $t1, 4
        bne  $t0, $t1, Loop
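One possible answer (not necessarily the one intended by the slide) is to move `addi $t1, $t1, 4` into the slot right after the `lw`, and change the store offset to `-4($t1)` to compensate. With forwarding, this should remove the load-use stall because the instruction after the load no longer needs `$t2`. The sketch below only checks that the reordered loop computes the same result. It is a toy interpreter for four instructions with no pipeline timing, and the tuple encoding, the `run` function and the base address are all assumptions made for illustration.

```python
def run(program, regs, mem):
    """Interpret a tiny MIPS subset (addi, lw, sw, bne); returns final (regs, mem)."""
    regs, mem = dict(regs), dict(mem)   # work on copies
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]
        pc += 1
        if op == "addi":                # addi rt, rs, imm
            regs[a] = regs[b] + c
        elif op == "lw":                # lw rt, imm(rs)
            regs[a] = mem[regs[c] + b]
        elif op == "sw":                # sw rt, imm(rs)
            mem[regs[c] + b] = regs[a]
        elif op == "bne":               # bne rs, rt, target index
            if regs[a] != regs[b]:
                pc = c
    return regs, mem

LOOP = 1  # index of the instruction labelled Loop

original = [
    ("addi", "$t0", "$t1", 40),
    ("lw",   "$t2", 0, "$t1"),
    ("addi", "$t2", "$t2", 3),
    ("sw",   "$t2", 0, "$t1"),
    ("addi", "$t1", "$t1", 4),
    ("bne",  "$t0", "$t1", LOOP),
]

reordered = [
    ("addi", "$t0", "$t1", 40),
    ("lw",   "$t2", 0, "$t1"),
    ("addi", "$t1", "$t1", 4),      # independent of $t2: fills the load-use slot
    ("addi", "$t2", "$t2", 3),
    ("sw",   "$t2", -4, "$t1"),     # offset adjusted because $t1 was already bumped
    ("bne",  "$t0", "$t1", LOOP),
]

base = 0x1000                                       # arbitrary array address
start_regs = {"$t0": 0, "$t1": base, "$t2": 0}
start_mem = {base + 4 * i: i for i in range(10)}    # ten words: 0..9

regs_a, mem_a = run(original, start_regs, start_mem)
regs_b, mem_b = run(reordered, start_regs, start_mem)
same_result = regs_a == regs_b and mem_a == mem_b
print(same_result)
```

The check says nothing about cycle counts; it only confirms that the reordering preserves the program's effect on registers and memory.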

MIPS Pipeline Datapath Modifications
• State registers between each pipeline stage to isolate them
[Figure: pipelined datapath with stages IF (IFetch), ID (Dec), EX (Execute), MEM (MemAccess) and WB (WriteBack), separated by the IFetch/Dec, Dec/Exec, Exec/Mem and Mem/WB state registers and driven by the system clock]

Corrected Datapath to Save RegWrite Addr
• Need to preserve the destination register address in the pipeline state registers
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB state registers]

MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode and held in the state registers between pipeline stages
[Figure: pipelined datapath with a Control unit and the IF/ID, ID/EX, EX/MEM and MEM/WB state registers]

Control Settings

Control Signals’ propagation

Pipeline Stages' Registers

Forwarding
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg for each instruction) for the following instructions, in program order]
        add $1,…
        sub $4,$1,$5
        and $6,$7,$1
        or  $8,$1,$1
        sw  $4,4($1)

Data Forwarding (aka Bypassing)
• Take the result from the point where it exists in any of the pipeline state registers and forward it to the functional unit (e.g., the ALU) that needs it that cycle
• For the ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX, by
    • adding multiplexors to the inputs of the ALU
    • connecting the result data in EX/MEM or MEM/WB to both the Rs and Rt ALU mux inputs of the EX stage
    • adding the proper control hardware to control the new muxes
• Other functional units may need similar forwarding logic (e.g., the DMem)
• With forwarding, a CPI of almost 1 can be achieved even in the presence of data dependencies
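The multiplexor added in front of each ALU input can be modelled as a three-way select. This is only a sketch: the function name `alu_operand` and its parameters are hypothetical, the select codes 10 and 01 follow the ForwardA/ForwardB values used on the forwarding-control slide, and treating 00 as "no forwarding" is the usual textbook convention rather than something stated in these slides.

```python
def alu_operand(forward_sel, id_ex_value, ex_mem_result, mem_wb_result):
    """Model of one ALU input mux.

    forward_sel: 0b10 -> take the ALU result held in EX/MEM
                 0b01 -> take the value held in MEM/WB
                 0b00 -> (assumed default) take the register value read into ID/EX
    """
    if forward_sel == 0b10:
        return ex_mem_result
    if forward_sel == 0b01:
        return mem_wb_result
    return id_ex_value

# The register file value (7) is stale; the fresh result (42) sits in EX/MEM.
operand = alu_operand(0b10, id_ex_value=7, ex_mem_result=42, mem_wb_result=13)
print(operand)
```

One such mux sits on the Rs input and one on the Rt input, each driven by its own select signal.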

Datapath with Forwarding Hardware
[Figure: pipelined datapath with Control, ALU cntrl, the Branch and PCSrc signals, and a Forward Unit]

Data Forwarding Control Conditions
• EX/MEM hazard:
    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
            ForwardA = 10
    if (EX/MEM.RegWrite
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
            ForwardB = 10
  Forwards the result from the previous instr. to either input of the ALU
• MEM/WB hazard:
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
            ForwardA = 01
    if (MEM/WB.RegWrite
        and (MEM/WB.RegisterRd != 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
            ForwardB = 01
  Forwards the result from the second previous instr. to either input of the ALU
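The conditions above can be sketched as a small function. The class and field names are hypothetical stand-ins for the pipeline register fields. One detail goes beyond the slide: when both the EX/MEM and MEM/WB conditions match the same source register, this sketch gives priority to EX/MEM, since it holds the more recent result. The code 00 for "no forwarding" is likewise an assumed default.

```python
from dataclasses import dataclass

@dataclass
class PipelineFields:
    """Hypothetical snapshot of the fields the forward unit looks at."""
    ex_mem_reg_write: bool
    ex_mem_rd: int
    mem_wb_reg_write: bool
    mem_wb_rd: int
    id_ex_rs: int
    id_ex_rt: int

def _select(src, p):
    """Forwarding select for one ALU source register number."""
    # EX/MEM hazard: result of the previous instruction (checked first: assumption).
    if p.ex_mem_reg_write and p.ex_mem_rd != 0 and p.ex_mem_rd == src:
        return 0b10
    # MEM/WB hazard: result of the second previous instruction.
    if p.mem_wb_reg_write and p.mem_wb_rd != 0 and p.mem_wb_rd == src:
        return 0b01
    return 0b00  # no forwarding (assumed default)

def forward_unit(p):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return _select(p.id_ex_rs, p), _select(p.id_ex_rt, p)

# sub $4,$1,$5 in EX while add $1,... is one stage ahead (in EX/MEM).
sub_in_ex = PipelineFields(True, 1, False, 0, id_ex_rs=1, id_ex_rt=5)
# and $6,$7,$1 in EX: sub ($4) is in EX/MEM, add ($1) is in MEM/WB.
and_in_ex = PipelineFields(True, 4, True, 1, id_ex_rs=7, id_ex_rt=1)

print(forward_unit(sub_in_ex), forward_unit(and_in_ex))
```

The two example snapshots correspond to the `sub` and `and` instructions of the earlier forwarding diagram; register `$0` is excluded because it always reads as zero.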

Datapath with Forwarding Hardware - 1
[Figure: datapath with forwarding hardware; the Forward Unit is labelled with EX/MEM.RegisterRd, MEM/WB.RegisterRd, ID/EX.RegisterRs and ID/EX.RegisterRt]

Datapath with Forwarding Hardware - 2
[Figure: datapath with forwarding hardware; the Forward Unit is labelled with EX/MEM.RegisterRd, MEM/WB.RegisterRd, ID/EX.RegisterRs and ID/EX.RegisterRt]

Summary
• All modern-day processors use pipelining
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Potential speedup: a CPI of 1
• Pipeline clock cycle is determined/limited by the slowest pipeline stage
• Unbalanced pipe stages cause inefficiencies
• The time to “fill” the pipeline and the time to “drain” it can impact speedup for deep pipelines and short code runs
• Must detect and resolve hazards
• Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
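The fill/drain point can be made concrete with the usual idealized model: with no stalls, a k-stage pipeline finishes n instructions in k + n - 1 cycles. The sketch below assumes balanced stages, no hazards, and an unpipelined machine that takes k cycles per instruction at the same clock. The function names are hypothetical.

```python
def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: n_stages cycles for the first instruction, then one per cycle."""
    return n_stages + n_instructions - 1

def speedup_vs_unpipelined(n_instructions, n_stages):
    """Speedup over a machine that spends n_stages cycles on every instruction."""
    return (n_instructions * n_stages) / pipelined_cycles(n_instructions, n_stages)

short_run = speedup_vs_unpipelined(5, 5)      # fill time dominates
long_run = speedup_vs_unpipelined(1000, 5)    # approaches the stage count

print(round(short_run, 2), round(long_run, 2))
```

For short code runs the speedup falls well below the number of stages, which is the effect the fill/drain bullet describes.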

Review: Pipeline Hazards
• Structural hazards
    • Design pipeline to eliminate structural hazards
• Data hazards: read after write (RAW)
    • Use data forwarding inside the pipeline
    • For those cases that forwarding won’t solve (e.g., load-use), include hazard hardware to insert stalls/bubbles
• Control hazards: beq, bne, j, jr, jal
    • Stall: hurts performance
    • Move decision point as early in the pipeline as possible: reduces number of stalls at the cost of additional hardware
    • Delay decision (requires compiler support): “Delayed Branch”
    • Predict outcome of branch
        • Static prediction, e.g., always not-taken
        • Dynamic prediction: prediction per branch in program
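For the load-use case, the hazard hardware has to spot a load in EX whose destination is read by the instruction in ID. The sketch below uses the common textbook condition, which is not spelled out in these slides. The function name and parameter names are hypothetical stand-ins for ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load still in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

T1, T2 = 9, 10  # MIPS register numbers for $t1 and $t2

# lw $t2, 0($t1) in EX, addi $t2, $t2, 3 in ID: one bubble is needed.
load_use = must_stall(True, T2, if_id_rs=T2, if_id_rt=T2)
# addi $t1, $t1, 4 in EX (not a load), bne $t0, $t1, Loop in ID: forwarding suffices.
alu_use = must_stall(False, T1, if_id_rs=8, if_id_rt=T1)

print(load_use, alu_use)
```

This simple check compares against both register fields of the instruction in ID, so it can stall conservatively when the Rt field is a destination rather than a source.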

Extracting Yet More Performance
• Two options:
    • Increase the depth of the pipeline to increase the clock rate: superpipelining
    • Fetch (and execute) more than one instruction at a time (expand every pipeline stage to accommodate multiple instructions): multiple-issue
• Launching multiple instructions per stage allows the instruction execution rate, CPI, to be less than 1
    • So instead we use IPC: instructions per clock cycle
    • E.g., a 3 GHz, four-way multiple-issue processor can execute at a peak rate of 12 billion instructions per second, with a best-case CPI of 0.25 or a best-case IPC of 4
    • If the datapath has a five-stage pipeline, how many instructions are active in the pipeline at any given time?
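The numbers in the example follow directly from the clock rate and issue width, as the sketch below shows. The variable names are hypothetical. The last line offers one reading of the closing question, assuming every issue slot in every stage is occupied, which is a best case rather than a typical one.

```python
clock_hz = 3_000_000_000   # 3 GHz
issue_width = 4            # four-way multiple-issue
pipeline_stages = 5

peak_instr_per_sec = clock_hz * issue_width    # peak rate, no stalls
best_case_ipc = issue_width                    # instructions per clock cycle
best_case_cpi = 1 / issue_width                # cycles per instruction

# Assumption: all issue slots in all stages are full at once.
max_in_flight = pipeline_stages * issue_width

print(peak_instr_per_sec, best_case_ipc, best_case_cpi, max_in_flight)
```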
