CSCE 212 Chapter 6 Enhancing Performance with Pipelining

Instructor: Jason D. Bakos

Pipelining

MIPS Pipeline
• Basic idea:
  • Execute multiple instructions in parallel
  • Split instruction execution into 5 stages
  • Instructions execute in “assembly-line” fashion
[Figure: five-stage MIPS datapath (fetch, decode, execute, memory, write back) with pipeline registers: instruction register; A, B registers carrying control for execute/memory/wb and rs/rt/rd; R register carrying control for memory/wb and rs/rt/rd; MDR register carrying control for wb and rs/rt/rd]

Pipelined MIPS

Pipelined Control

MIPS ISA
• MIPS pipeline stages
  • Fetch (F)
    • read next instruction from memory, increment address counter
    • assume 1 cycle to access memory
  • Decode (D)
    • read register operands, decode instruction into control signals, compute branch target
  • Execute (E)
    • execute arithmetic/resolve branches
  • Memory (M)
    • perform load/store accesses to memory, take branches
    • assume 1 cycle to access memory
  • Write back (W)
    • write arithmetic and load results to register file
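The stage list above can be sketched as a stall-free timing diagram. This is a minimal illustration, assuming one instruction enters Fetch per cycle and no hazards occur; the function names (`ideal_schedule`, `total_cycles`, `render`) are hypothetical and not part of the course material.

```python
STAGES = ["F", "D", "E", "M", "W"]

def ideal_schedule(num_instructions):
    """Map each instruction index to {cycle: stage} for a stall-free pipeline."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i enters Fetch in cycle i + 1 and advances one stage per cycle.
        schedule[i] = {i + 1 + s: STAGES[s] for s in range(len(STAGES))}
    return schedule

def total_cycles(num_instructions):
    # The first instruction needs 5 cycles; each later one finishes 1 cycle after it.
    return num_instructions + len(STAGES) - 1

def render(num_instructions):
    """Return one text row per instruction, one column per clock cycle."""
    schedule = ideal_schedule(num_instructions)
    last_cycle = total_cycles(num_instructions)
    rows = []
    for i in range(num_instructions):
        row = [schedule[i].get(cycle, ".") for cycle in range(1, last_cycle + 1)]
        rows.append(" ".join(row))
    return "\n".join(rows)

print(render(3))
```

With no stalls, n instructions finish in n + 4 cycles, so the CPI approaches 1 as n grows.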

Hazards
• Hazards are data flow problems that arise as a result of pipelining
  • They limit the amount of parallelism and sometimes induce “penalties” that prevent completing one instruction per clock cycle
• Structural hazards
  • Two operations require a single piece of hardware
  • Structural hazards can be overcome by adding hardware
• Control hazards
  • Conditional control instructions are not resolved until late in the pipeline, so subsequent instruction fetches must be predicted
  • Fetched instructions are flushed if the prediction does not hold (make sure they cause no state change)
  • Branch hazards can be handled with dynamic prediction/speculation or a branch delay slot
• Data hazards
  • An instruction in one pipeline stage is “dependent” on data computed in another pipeline stage

Hazards
• Data hazards
  • Register values are “read” in decode and written during write-back
  • A RAW hazard occurs when dependent instructions are separated by less than 2 slots
  • Examples (each column is one case; the dependent instruction is in D):

      Case 1               Case 2               Case 3
      ADD $2,$X,$X (E)     ADD $2,$X,$X (M)     ADD $2,$3,$4 (W)
      ADD $X,$2,$X (D)     …                    …
      …                    ADD $X,$2,$X (D)     …
      …                    …                    ADD $X,$2,$3 (D)

  • In most cases, data is generated in the same stage in which it is required (EX)
• Data forwarding (the same three cases one cycle later; the dependent instruction is in E):

      Case 1               Case 2               Case 3
      ADD $2,$X,$X (M)     ADD $2,$X,$X (W)     ADD $2,$3,$4 (out-of-pipe)
      ADD $X,$2,$X (E)     …                    …
      …                    ADD $X,$2,$X (E)     …
      …                    …                    ADD $X,$2,$3 (E)
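The three RAW cases above can be found mechanically. The sketch below is an illustration only: it assumes each instruction is a hypothetical `(dest, src_a, src_b)` tuple of register names, and it reports which stage the producer occupies when the consumer reaches D (distance 1 is E, 2 is M, 3 is W). It does not model forwarding or the register file's internal timing.

```python
# Stage the producer occupies when a consumer `distance` instructions later is in D.
PRODUCER_STAGE_AT_DECODE = {1: "E", 2: "M", 3: "W"}

def raw_hazards(program):
    """Return (producer_index, consumer_index, register, producer_stage) tuples.

    program: list of (dest, src_a, src_b) register-name tuples (assumed format).
    """
    hazards = []
    for j, (_, *sources) in enumerate(program):
        for reg in sorted({s for s in sources if s is not None}):
            for distance in (1, 2, 3):
                i = j - distance
                if i < 0:
                    break
                if program[i][0] == reg:
                    hazards.append((i, j, reg, PRODUCER_STAGE_AT_DECODE[distance]))
                    break  # the nearest earlier writer is the real producer
    return hazards

# Hypothetical program: instruction 0 writes $2, the next three all read it.
program = [
    ("$2", "$3", "$4"),
    ("$5", "$2", "$6"),
    ("$7", "$8", "$2"),
    ("$9", "$2", "$2"),
]
for hazard in raw_hazards(program):
    print(hazard)
```

Each reported pair is a candidate for forwarding: the value is taken from the producer's pipeline register instead of waiting for write-back.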

“Load” Hazards
• Stalls are required when data is not produced in the same stage in which a subsequent instruction needs it
• Example:
  • LW $2, 0($X) (M)
  • ADD $X, $2 (E)
• When this occurs, insert a “bubble” into the EX stage and stall F and D
  • LW $2, 0($X) (W)
  • NOOP (M)
  • ADD $X, $2 (E)
  • Forward from W to E
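The load-use rule above can be sketched as a simple check: a bubble is needed whenever an instruction reads the register loaded by the instruction immediately before it. The `(op, dest, sources)` tuple format and the function name are assumptions made for illustration; the instruction sequence is the one from the Example slide later in this chapter. The sketch assumes every such use needs the value in EX, so it does not model designs that forward a loaded value directly into a following store.

```python
def count_load_use_stalls(program):
    """Count bubbles needed for load-use hazards, assuming full forwarding otherwise.

    program: list of (op, dest, sources) tuples (assumed format).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, curr_sources = current
        # The loaded value is available after M, one cycle too late for the next E.
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls

program = [
    ("add",  "$6", ("$5", "$2")),
    ("lw",   "$7", ("$6",)),
    ("addi", "$7", ("$7",)),        # reads $7 right after the load: one bubble
    ("add",  "$6", ("$4", "$2")),
    ("sw",   None, ("$7", "$6")),
    ("addi", "$2", ("$2",)),
    ("blt",  None, ("$2", "$3")),
]
print(count_load_use_stalls(program))
```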

Data Hazards: Forwarding

Data Hazards: Stalling for Load Hazard

Control Hazards
• Need to make a branch decision based on data that has yet to be produced:
  • add $2,$3,$4
  • beqz $2,loop
• In which stage is the branch resolved?
• Approaches:
  • stall
    • insert bubbles after all branches
  • always predict untaken
    • if taken, instructions entering DEC and EX (and MEM?) transfer as NOOPs
  • branch delay slot
    • instruction following branch is always executed
  • dynamic branch predictors

Control Hazards
• Instructions are fetched every clock cycle
• Branch decisions happen in the EX stage
• Solutions:
  • Assume branch not taken (if the branch is taken, flush IF, ID, EX by inserting a nop into the pipeline registers on the clock edge)
  • Reduce the delay by moving the branch decision up
    • Requires additional hardware (comparators, etc.)
    • Might increase cycle time, since register read and branch resolution are now in series and must be performed in half a cycle to allow for parallel register writes!
    • Requires forwarding and stall hardware for new data hazards
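The cost of the predict-not-taken approach can be estimated with a short calculation: only taken branches pay the flush penalty, so the average CPI grows by (fraction of branches) × (fraction taken) × (penalty cycles). The function name and the workload numbers below are hypothetical, chosen only to show how moving the branch decision earlier shrinks the penalty.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average CPI under predict-not-taken, ignoring all other stalls."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles

# Assumed workload: 20% of instructions are branches, half of them taken.
for penalty in (3, 1):
    cpi = effective_cpi(branch_fraction=0.2, taken_fraction=0.5, penalty_cycles=penalty)
    print(f"penalty = {penalty} cycles -> CPI = {cpi:.2f}")
```

Under these assumed numbers, cutting the penalty from 3 cycles to 1 lowers the estimate from about 1.30 to about 1.10.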

Example
• Instruction sequence:
    add $6,$5,$2
    lw $7,0($6)
    addi $7,$7,10
    add $6,$4,$2
    sw $7,0($6)
    addi $2,$2,4
    blt $2,$3,loop
    add $6,$5,$2
• [Figure: pipeline timing diagram showing the F D E M W stages of each instruction across clock cycles 1 to 15]
• 8 instructions, 15 - 4 = 11 cycles, CPI = 11/8
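The arithmetic in this example can be checked with a few lines. The sketch assumes, as the slide's "15 - 4" suggests, that the 4 pipeline-fill cycles are excluded from the CPI; the function names are hypothetical.

```python
from fractions import Fraction

def pipeline_cpi(total_cycles, num_instructions, num_stages=5):
    """CPI after discounting the (num_stages - 1) cycles needed to fill the pipeline."""
    fill_cycles = num_stages - 1
    return Fraction(total_cycles - fill_cycles, num_instructions)

def stall_cycles(total_cycles, num_instructions, num_stages=5):
    """Cycles beyond the stall-free total of num_instructions + num_stages - 1."""
    return total_cycles - (num_instructions + num_stages - 1)

cpi = pipeline_cpi(total_cycles=15, num_instructions=8)
print(cpi, float(cpi))                                      # 11/8 1.375
print(stall_cycles(total_cycles=15, num_instructions=8))    # 3
```

A stall-free run of 8 instructions would take 12 cycles, so the 15-cycle total implies 3 cycles lost to stalls or flushes.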

Moving up Branch Resolution

Scheduling the Branch Delay Slot

Dynamic Branch Prediction
• Assume taken/not-taken (static)
  • Loops have branches that are usually taken
  • When wrong, we flush pipeline stages
  • Deeper pipelines have higher branch penalties (misprediction penalty)
• Solution:
  • Look up address of branch to check if branch was previously taken
  • One-bit schemes
  • Two-bit schemes (must be wrong twice to change prediction)
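The difference between the one-bit and two-bit schemes can be seen in a small simulation. This is a sketch for a single branch only (a real predictor keeps one entry per branch address), and it assumes the two-bit scheme is a saturating counter that predicts taken when the counter is 2 or 3. All names are hypothetical.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Predict whatever the branch did last time."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember only the most recent outcome
    return misses

def two_bit_mispredictions(outcomes, initial=3):
    """Saturating counter 0..3; predict taken when the counter is 2 or 3."""
    counter = initial
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

# Assumed pattern: a loop branch taken 3 times, then falling through, run 3 times.
outcomes = [True, True, True, False] * 3
print(one_bit_mispredictions(outcomes))
print(two_bit_mispredictions(outcomes))
```

On this pattern the one-bit scheme mispredicts both the loop exit and the next loop entry, while the two-bit counter, starting from its saturated "taken" state, mispredicts only the exits.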
