Chap.6: Enhancing Performance with Pipelining

Chap.6: Enhancing Performance with Pipelining Jen-Chang Liu, Spring 2006 Parts of the slides are duplicated from inst.eecs.berkeley.edu/~cs61c

Review Datapath (1/3) • Datapath is the hardware that performs operations necessary to execute programs. • Control instructs datapath on what to do next. • Datapath needs: • access to storage (general purpose registers and memory) • computational ability (ALU) • helper hardware (local registers and PC)

Review Datapath (2/3) • Five stages of datapath (executing an instruction): 1. Instruction Fetch (Increment PC) 2. Instruction Decode (Read Registers) 3. ALU (Computation) 4. Memory Access 5. Write to Registers • ALL instructions must go through ALL five stages.

ALU 2. Decode/ Register Read 5. WriteBack 1. Instruction Fetch 4. Memory 3. Execute Review Datapath (3/3) rd instruction memory PC rs registers Data memory rt +4 imm

Review • Single cycle datapath • Multi-cycle datapath Instruction fetch Data/register read Instruction execution Memory/register read/write Register write Instruction fetch Data/register read Instruction execution Memory/register read/write Register write

Outline • Overview • A pipeline datapath • Pipelined control • Data hazards • Forwarding • Stalls • Branch hazards • Superscalar and dynamic pipelining

What’s pipelining? 洗衣乾衣折疊收藏 non-pipelined Use different resources at the same time pipelining

Pipelining • Definition: an implementation technique in which multiple instructions are overlapped in execution • How to achieve pipelining? • An instruction is divided into steps (stages) • We have separate resources for each stage

Single-cycle implementation Pipelined implementation

Time between inst. non-pipelined Time between inst. pipelined = Number of pipe stages What does pipelining improve? • Pipelining improves performance by • Increasing the instruction throughput 單位時間完成的指令數目增加 • Decreasing the execution time of an individual instruction 單一指令執行時間不變 • Ideal conditions for saving

Design instruction set for pipelining • MIPS has been designed for pipelining 1. Instructions are of the same length • Easier to fetch and decode • Ex. 80x86 inst.s ranges from 1 to 17 bytes 2. A few instruction formats, with the source register fields located in the same place 可以在決定是甚麼指令前讀暫存器

0 1 2 3 Aligned Not Aligned Design instruction set for pipelining (cont.) 3. Memory operands only appear for loads and stores • Calculate memory address in execution stage 4. Operands are aligned in memory • One memory access for data transfer Instruction fetch Instruction Decoding, Data/register read Instruction execution Memory/register read/write Register write

Pipeline hazards 危險 • Hazards: the situations in pipelining when the next instruction cannot execute in the following clock cycle 使下一個指令無法在下一個cycle執行甚麼時候下個指令無法執行？

1. Structural hazards 硬體結構問題 • The hardware cannot support the combination of instructions that we want to execute in the same clock cycle 2 memory access at the same clock?

Structural Hazard #1: Single Memory • Solution: • Second memory • infeasible and inefficient to create • Two Level 1 Caches • Cache: a temporary smaller [of usually most recently used] copy of memory • have both an L1 Instruction Cache and an L1 Data Cache • need more complex hardware to control when both caches miss

1. Structural hazard #2: single register Can’t read and write to registers simultaneously?

Structural Hazard #2: Registers • Fact: Register access is VERY fast: takes less than half the time of ALU stage • Solution: introduce convention • always Write to Registers during first half of each clock cycle • always Read from Registers during second half of each clock cycle • Result: can perform Read and Write during same clock cycle

2. Control hazards • The need to make a decision based on the results of one instruction while others are executing • Ex. Branch instruction 下一個指令需等待前面指令的執行結果 add $4, $5, $6 beq $1, $2, 40 lw $3, 300($0) 無法判斷下一個指令是那個… ? … 40: or $7, $8, $9

Solution to control hazards #1: stall 拖延 • stall (bubble): 暫停下一個指令執行 nop: no operation nop: no operation completion of branch Filling two nops after branch is inefficient!

Solution to control hazards #1: stall • Stall (bubble): 暫停下一個指令執行假設 branch 的位址計算，比較都可用另外的 hardware 在此 stage 完成

Solution to control hazards #2: delayed branch add $4, $5, $6 beq $1, $2, 40 lw $3, 300($0) ? … 40: or $7, $8, $9

Solution to control hazards #2: delayed branch (cont.) • Redefine branches • Old definition: if we take the branch, none of the instructions after the branch get executed by accident • New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot) • The term “Delayed Branch” meanswe always execute inst after branch

Solution to control hazards #2: delayed branch (cont.) • Notes on Branch-Delay Slot • Worst-Case Scenario: can always put a no-op in the branch-delay slot • Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program • re-ordering instructions is a common method of speeding up programs • compiler must be very smart in order to find instructions to do this • usually can find such an instruction at least 50% of the time

Solution to control hazards #3: predict 預測 • 預測下一個可能執行的指令. • Ex. 猜 branch always not taken (not taken 沒有跳走) (taken)

Solution to control hazards: predict (cont.) • Example: easy to predict • Dynamic prediction hardware • Record the history of branch results • 90% accuracy Loop: … … beq $1, $2, Loop

3. Data hazard • An instruction depends on the results of a previous instruction still in the pipeline • Solution 1: complier (assembler) 下一個指令需要的資料需等待前一個指令執行結果 Example: add $s0, $t0, $t1 sub $t2, $s0, $t3

Standard notation for steps in MIPS 寫入讀出 Instruction fetch Instruction Decoding, Data/register read Instruction execution Memory/register read/write Register write

Solution to data hazards: forwarding • Forwarding: Getting the missing item early from the internal resource add $s0, $t0, $t1 sub $t2, $s0, $t3 等待指令執行完畢？ $t0+$t1

Another example of forwarding Mem[20($t1)] ?

lw $t2, 4($t1) sw $t2, 0($t1) Example: Reordering codes lw $t0, 0($t1) lw $t2, 4($t1) sw $t2, 0($t1) sw $t0, 4($t1) Swap 0($t1) and 4($t1) What’s wrong with pipelining?

Example: Reordering codes lw $t0, 0($t1) lw $t2, 4($t1) sw $t0, 4($t1) sw $t2, 0($t1) lw $t2, 4($t1) sw $t0, 4($t1) sw $t2, 0($t1)

Peer Instruction ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT • Thanks to pipelining, I have reduced the time it took me to wash my shirt. • Longer pipelines are always a win (since less work per stage & a faster clock). • We can rely on compilers to help us avoid data hazards by reordering instrs.

Outline • Overview • A pipeline datapath: How to build? • Pipelined control • Data hazards • Forwarding • Stalls • Branch hazards • Superscalar and dynamic pipelining

Multiple instructions execute using pipelining How to combine these multiple datapaths? • Shared units during one clock cycle • add buffer (registers) to hold data (as we do in multi-cycle)

Divide single-cycle datapath into stages 2 1 Data flow

64bits 128b 97b 64b Pipelined datapath with pipeline registers (Fig 6.11)

Example: (1) instruction fetch

Example: (2) instruction decode

Example: (3) execute

Example: (4) memory access

Example: (5) write back 寫回暫存器的號碼對嗎？ ?

add 5-bit buffer Preserve the Destination Register number

Trace another pipelining • Trace helps you understand how pipelining works !!! • Single-clock-cycle pipeline diagram example: lw $10, 20($1) sub $11, $2, $3

Multiple-clock-cycle pipeline diagram

Chap.6: Enhancing Performance with Pipelining