Computer Organization and Architecture Chapter 6 Enhancing Performance with Pipelining

Yu-Lun Kuo, Computer Sciences and Information Engineering, University of Tunghai, Taiwan, sscc6991@gmail.com

Review: Single Cycle vs. Multiple Cycle Timing • Single Cycle Implementation: lw takes Cycle 1 and sw takes Cycle 2 of one long clock (Clk); the part of the cycle that sw does not need is wasted • Multiple Cycle Implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and the R-type instruction starts with IFetch in Cycle 10 • The multicycle clock cycle is longer than 1/5th of the single cycle clock cycle, due to stage register overhead

How Can We Make It Even Faster? • Split the multiple instruction cycle into smaller and smaller steps • There is a point of diminishing returns where as much time is spent loading the state registers as doing the work • Pipelining • Multiple instructions are overlapped in execution • Key to making processors fast

Example: Laundry • Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • “Folder” takes 20 minutes

Sequential Laundry

Pipelined Laundry
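The laundry timings can be checked with a short calculation. The following is a minimal Python sketch, assuming four loads, the stage times from the example, and that a finished load may wait between machines until the next machine is free; the function names are illustrative only.

```python
# Stage times in minutes, taken from the example: washer, dryer, folder.
STAGE_MINUTES = [30, 40, 20]
LOADS = 4  # Ann, Brian, Cathy, Dave


def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """A load enters a stage once it has left the previous stage and the
    previous load has left this stage (assumption: loads may wait in between)."""
    finish = [0] * len(stage_minutes)  # finish time of the latest load, per stage
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


print(sequential_minutes(STAGE_MINUTES, LOADS))  # 360 minutes (6 hours)
print(pipelined_minutes(STAGE_MINUTES, LOADS))   # 210 minutes (3.5 hours)
```

The pipelined total is dominated by the 40-minute dryer, which is the slowest stage.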

Example: Laundry

MIPS Instructions • Classically take five steps: • Fetch instruction from (instruction) memory (IF) • Read registers while decoding the instruction (ID) • Execute the operation or calculate an address (EX) • Access an operand in data memory (MEM) • Write the result into a register (WB) • Five stages

The schematic view • IF: uses the memory • ID: uses the register file • EX: uses the ALU • Mem: uses the memory • WB: uses the register file • Very important to remember the content of this slide

A Pipelined MIPS Processor • Start the next instruction before the current one has completed • Improves throughput: the total amount of work done in a given time • [Diagram, Cycles 1-8: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous instruction] • Clock cycle (pipeline stage time) is limited by the slowest stage • For some instructions, some stages are wasted cycles

Single Cycle vs. Multiple Cycle vs. Pipeline • Single Cycle Implementation: lw in Cycle 1, sw in Cycle 2, with part of the sw cycle wasted • Multiple Cycle Implementation: lw in Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw in Cycles 6-9 (IFetch, Dec, Exec, Mem), R-type starting with IFetch in Cycle 10 • Pipeline Implementation: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, overlapped in time

Pipelined Execution Representation
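A pipelined execution diagram can be generated as text. The sketch below is a minimal Python illustration, assuming an ideal five-stage pipeline with no stalls where each instruction starts one cycle after the previous one; `pipeline_diagram` and `render` are hypothetical helper names.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_diagram(instructions):
    """Return (name, cells) rows; instruction i occupies cycles i+1 .. i+5."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # stage s of instruction i runs in cycle i+s+1
        rows.append((name, cells))
    return rows


def render(rows):
    """Format the rows as fixed-width text columns, one column per clock cycle."""
    lines = []
    for name, cells in rows:
        lines.append(f"{name:<7}" + "".join(f"{c:<7}" for c in cells).rstrip())
    return "\n".join(lines)


print(render(pipeline_diagram(["lw", "sw", "R-type"])))
```

Each row is shifted one column to the right of the row above it, so three instructions complete in seven cycles.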

Single-cycle vs. Pipelined Performance (p.372) • Single-cycle (non-pipelined) • The clock cycle must allow for the slowest instruction (lw) • So the time required for every instruction is 800 ps • The time between the first and fourth instructions in the non-pipelined design is 3 * 800 ps = 2400 ps

Figure 6.3

Single-cycle vs. Pipelined Performance (p.333) • Pipeline • All the pipeline stages take a single clock cycle • The clock cycle must be long enough to accommodate the slowest operation • So the pipelined clock cycle must be the worst-case stage time of 200 ps • The total time for the three instructions is 600 ps + 4 * 200 ps = 1400 ps (three cycles to start the instructions, then four more for the last one to finish)
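The two totals can be reproduced with a few lines of Python. This is a sketch under stated assumptions: an 800 ps single-cycle clock, a 200 ps pipelined clock, five stages, and no hazards; the function names are illustrative.

```python
SINGLE_CYCLE_PS = 800    # clock period set by the slowest instruction (lw)
PIPELINE_CYCLE_PS = 200  # clock period set by the slowest pipeline stage
NUM_STAGES = 5


def single_cycle_time_ps(n):
    """Total time for n instructions when each takes one long clock cycle."""
    return n * SINGLE_CYCLE_PS


def pipelined_time_ps(n):
    """Total time for n instructions in an ideal pipeline: the first needs
    NUM_STAGES cycles, and each later one finishes one cycle after the last."""
    if n == 0:
        return 0
    return (NUM_STAGES + n - 1) * PIPELINE_CYCLE_PS


print(single_cycle_time_ps(3))  # 2400
print(pipelined_time_ps(3))     # 1400
```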

Pipelining Speedup (p.334) • Under ideal conditions and with a large number of instructions • The speedup from pipelining is approximately equal to the number of pipeline stages • A five-stage pipeline is nearly five times faster • What about the above example? • Pipelined time: 1400 ps • Non-pipelined time: 2400 ps • The ideal speedup is not reflected in the total execution time for the three instructions

Pipelining Speedup (p.334) • Pipelining involves some overhead • The source of which will become clear shortly • Thus, the time per instruction in the pipelined processor will exceed the minimum possible • The speedup will be less than the number of pipeline stages • In the example, the number of instructions is not large • What if we increased the number of instructions? • Add 1,000,000 instructions
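The effect of the instruction count can be worked out directly. The derivation below is a sketch assuming n instructions, 800 ps per instruction in the non-pipelined design, a 200 ps pipelined clock cycle, five stages, and no hazards.

```latex
\begin{align*}
T_{\text{non-pipelined}}(n) &= 800\,n \ \text{ps} \\
T_{\text{pipelined}}(n) &= 200\,(n + 4) \ \text{ps} \\
\text{Speedup}(n) &= \frac{800\,n}{200\,(n + 4)} = \frac{4n}{n + 4} \\
\text{Speedup}(3) &= \frac{12}{7} \approx 1.71 \\
\text{Speedup}(1{,}000{,}003) &= \frac{800{,}002{,}400}{200{,}001{,}400} \approx 4.00
\end{align*}
```

Under these assumptions the speedup approaches 800/200 = 4 rather than 5, because the five stages are not perfectly balanced (5 * 200 ps is more than 800 ps).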

Pipeline Hazards • Pipeline hazards occur when the next instruction cannot execute in the following clock cycle • Three different types • Structural hazards: • What if we had only one memory? • Data hazards: • What if an instruction’s input operands depend on the output of a previous instruction? • Control hazards: • What about branches?

Structural Hazards (1/2) • The hardware cannot support the combination of instructions that we want to execute in the same clock cycle • There are not enough hardware resources! • Because hardware resources are insufficient, several instructions that should execute at the same time cannot • Ex. The laundry room • A combined washer-dryer vs. a separate washer and dryer

Structural Hazard (2/2) • Suppose we had a single memory instead of two memories • If the pipeline in Figure 6.3 had a fourth instruction, then in the same clock cycle • The first instruction is accessing data from memory • The fourth instruction is fetching an instruction from the same memory • Without two memories, the pipeline could have a structural hazard

Structural Hazard: Single Memory • [Diagram, time in clock cycles vs. instruction order: lw, Inst 1, Inst 2, Inst 3, Inst 4, each passing through Mem, Reg, ALU, Mem, Reg] • lw is reading data from memory • In the same clock cycle, Inst 3 is reading its instruction from the same memory
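The single-memory conflict can be located by checking, cycle by cycle, which instructions need the memory. The sketch below is a minimal Python illustration, assuming an ideal five-stage pipeline with one instruction issued per cycle; instruction indices are 0-based and `memory_conflicts` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle`
    (1-based), or None if it is not in the pipeline during that cycle."""
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


def memory_conflicts(num_instructions, data_memory_users):
    """Return (cycle, [instruction indices]) for every cycle in which more
    than one instruction needs the single memory. `data_memory_users` holds
    the indices of the instructions that access data in MEM (loads, stores)."""
    conflicts = []
    last_cycle = num_instructions + len(STAGES) - 1
    for cycle in range(1, last_cycle + 1):
        users = []
        for i in range(num_instructions):
            stage = stage_in_cycle(i, cycle)
            # Every instruction uses memory in IF; only loads/stores use it in MEM.
            if stage == "IF" or (stage == "MEM" and i in data_memory_users):
                users.append(i)
        if len(users) > 1:
            conflicts.append((cycle, users))
    return conflicts


# Four instructions, the first of which (index 0) is a load.
print(memory_conflicts(4, {0}))  # [(4, [0, 3])]
```

In cycle 4 the load is in MEM while the fourth instruction is in IF, which is the structural hazard described above.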

Data Hazard • The planned instruction cannot execute in the proper clock cycle • Because data that is needed to execute the instruction is not yet available • The pipeline must be stalled (Bubble) • Because one step must wait for another to complete • Ex. add $s0, $t0, $t1 • sub $t2, $s0, $t3 • Have to add three bubbles to the pipeline

How About Register File Access? • [Diagram, time in clock cycles vs. instruction order: add $1, ...; Inst 1; Inst 2; Inst 3; add $2,$1, ...; each passing through IM, Reg, ALU, DM, Reg]

How About Register File Access? • Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half • [Diagram, time in clock cycles vs. instruction order: add $1, ...; Inst 1; Inst 2; add $2,$1, ...; annotated with the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers]

Register Usage Can Cause Data Hazards • Dependencies backward in time cause hazards • [Diagram, instruction order: add $1, ...; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each passing through IM, Reg, ALU, DM, Reg] • Read before write data hazard

Loads Can Cause Data Hazards • Dependencies backward in time cause hazards • [Diagram, instruction order: lw $1,4($2); sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each passing through IM, Reg, ALU, DM, Reg] • Load-use data hazard

One Way to “Fix” a Data Hazard • Can fix data hazard by waiting (stall), but this impacts CPI • [Diagram, instruction order: add $1, ...; stall; stall; sub $4,$1,$5; and $6,$1,$7]

Forwarding • Also called bypassing • Resolving a data hazard by retrieving the missing data element from internal buffers • Ex. lw $s0, 20($t1) • sub $t2, $s0, $t3 • Still need one stall

Another Way to “Fix” a Data Hazard • Fix data hazards by forwarding results as soon as they are available to where they are needed • [Diagram, instruction order: add $1, ...; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each passing through IM, Reg, ALU, DM, Reg]

Forwarding with Load-use Data Hazards • Will still need one stall cycle even with forwarding • [Diagram, instruction order: lw $1,4($2); sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each passing through IM, Reg, ALU, DM, Reg]
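The stall counts with and without forwarding can be summarized in one small function. This is a sketch under stated assumptions: the five-stage pipeline, a register file written in the first half of a cycle and read in the second half, and a consumer that needs its operand at the start of EX; `stalls_needed` is a hypothetical helper, not part of any real tool.

```python
def stalls_needed(producer_is_load, distance, forwarding):
    """Stall cycles needed by an instruction that reads a register written by
    an instruction `distance` positions earlier (1 = immediately before)."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if forwarding:
        # An ALU result can be forwarded from the end of EX;
        # a loaded value is only available at the end of MEM.
        ready_after = 2 if producer_is_load else 1
    else:
        # The consumer's register read (ID) must fall in the producer's WB cycle
        # or later (assumption: write in first half, read in second half).
        ready_after = 3
    return max(0, ready_after - distance)


print(stalls_needed(False, 1, forwarding=False))  # add then sub, no forwarding: 2
print(stalls_needed(False, 1, forwarding=True))   # add then sub, forwarding: 0
print(stalls_needed(True, 1, forwarding=True))    # lw then sub, forwarding: 1
```

Without the split-cycle register file assumption, the no-forwarding case would need one more stall for an adjacent dependent pair.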

Control Hazard (1/2) • Also called branch hazard • Occurs when a decision must be made based on the result of one instruction while other instructions are executing • Solve 1: Stall (bubble) • If the instruction is a branch, stall first • Put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline • Slow and high cost

Branch Instructions Cause Control Hazards • Dependencies backward in time cause hazards • [Diagram, instruction order: beq; lw; Inst 3; Inst 4; each passing through IM, Reg, ALU, DM, Reg]

One Way to “Fix” a Control Hazard • Fix branch hazard by waiting (stall), but this affects CPI • [Diagram, instruction order: beq; stall; stall; stall; lw; Inst 3]

Control Hazard (2/2) • Solve 2: Predict • Always predict that branches will be untaken • When you’re right, the pipeline proceeds at full speed • It does not jump to the branch target address • Only when branches are taken does the pipeline stall • Need to add hardware for flushing instructions if we are wrong

Branch Prediction (1/2) • A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome • Dynamic prediction of branches • Keeping a history for each branch as taken or untaken • Using the recent past behavior to predict the future • Can correctly predict branches with over 90% accuracy

Branch Prediction (2/2) • If the prediction is wrong • Pipeline control must ensure that the instructions following the wrongly guessed branch have no effect • Restart the pipeline from the proper branch address • Keeping the history • Branch history table • Branch prediction buffer
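A branch history table can be sketched in a few lines. The Python below is a minimal illustration, assuming the simplest possible scheme: one bit per branch that records the last outcome, with unseen branches predicted untaken. The class name, the trace, and the address 0x40 are hypothetical, and real predictors are more elaborate.

```python
class OneBitPredictor:
    """Branch history table with one bit per branch: predict the last outcome."""

    def __init__(self):
        self.history = {}  # branch address -> True if the last outcome was taken

    def predict(self, branch_pc):
        # Assumption: a branch never seen before is predicted untaken.
        return self.history.get(branch_pc, False)

    def update(self, branch_pc, taken):
        self.history[branch_pc] = taken


def accuracy(predictor, trace):
    """`trace` is a list of (branch_pc, taken) pairs; returns the fraction of
    branches whose outcome was predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)  # record the actual outcome
    return correct / len(trace)


# Hypothetical trace: a loop branch taken 9 times, then not taken, run twice.
loop_trace = ([(0x40, True)] * 9 + [(0x40, False)]) * 2
print(accuracy(OneBitPredictor(), loop_trace))  # 0.8
```

On this trace the one-bit scheme mispredicts twice per loop execution, once on entry and once on exit.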

Pipeline Hazards Illustrated

Pipelined Datapath • IF: Instruction fetch • ID: Instruction decode and register file read • EX: Execution or address calculation • MEM: Data memory access • WB: Write back

Pipeline Execution • Assume • Register file is written in the first half of the clock cycle • Register file is read during the second half

Five stages of lw (1/3) • Instruction fetch • The instruction is read from memory using the address in the PC • It is placed in the IF/ID pipeline register • The PC is incremented to PC+4 (ready for the next clock cycle) • Instruction decode and register file read • The IF/ID pipeline register supplies the 16-bit immediate field • Which is sign-extended to 32 bits • And the register numbers used to read the two registers • All values are stored in the ID/EX pipeline register

Five stages of lw (2/3) • Execute and address calculation • Reads the contents of register 1 and the sign-extended immediate from the ID/EX pipeline register • Adds them using the ALU • The sum is placed in the EX/MEM pipeline register • Memory access • Reads the data memory using the address from the EX/MEM pipeline register • Loads the data into the MEM/WB pipeline register

Five stages of lw (3/3) • Write back • Final step • Reading the data from the MEM/WB pipeline register • Writing it into the register file
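The five stages of lw can be traced in software. The Python below is a minimal sketch, assuming memories modeled as dictionaries, a 32-entry register list, and the standard MIPS I-format field layout; carrying the destination register number through the pipeline registers is an assumption of this sketch, and `run_lw` is a hypothetical helper.

```python
def sign_extend_16(value):
    """Sign-extend the low 16 bits of `value` to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_lw(pc, instruction_memory, registers, data_memory):
    """Carry one lw through the five stages and return the pipeline registers."""
    # IF: read the instruction at PC; keep it and PC+4 in IF/ID.
    if_id = {"instruction": instruction_memory[pc], "pc_plus_4": pc + 4}

    # ID: decode the register numbers, read both registers, sign-extend the immediate.
    instr = if_id["instruction"]
    rs = (instr >> 21) & 0x1F  # base register number
    rt = (instr >> 16) & 0x1F  # destination register number for lw
    id_ex = {
        "read_data_1": registers[rs],
        "read_data_2": registers[rt],
        "immediate": sign_extend_16(instr),
        "rt": rt,
        "pc_plus_4": if_id["pc_plus_4"],
    }

    # EX: the ALU adds register 1 and the sign-extended immediate.
    ex_mem = {"alu_result": id_ex["read_data_1"] + id_ex["immediate"], "rt": id_ex["rt"]}

    # MEM: read data memory at the computed address.
    mem_wb = {"read_data": data_memory[ex_mem["alu_result"]], "rt": ex_mem["rt"]}

    # WB: write the loaded value into the register file.
    registers[mem_wb["rt"]] = mem_wb["read_data"]
    return if_id, id_ex, ex_mem, mem_wb


# lw $s0, 20($t1): opcode 35, rs = 9 ($t1), rt = 16 ($s0), immediate = 20.
lw_word = (35 << 26) | (9 << 21) | (16 << 16) | 20
regs = [0] * 32
regs[9] = 100  # hypothetical base address in $t1
stages = run_lw(0, {0: lw_word}, regs, {120: 42})
print(regs[16])  # 42
```

The sketch runs the stages back to back for a single instruction; it does not model several instructions overlapping in the pipeline.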

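The 5 stages of the lw instruction • Instruction fetch: • The instruction is read from memory using the address stored in the program counter (PC) and placed in the IF/ID pipeline register (because the computer does not yet know which type of instruction is being fetched) • Instruction decode and register read: • The register numbers, the register contents, the 16-bit immediate field, and the incremented program counter (PC) value are placed in the ID/EX register • Execute or effective address calculation: • The load instruction reads the sign-extended immediate and the contents of register 1 from the ID/EX pipeline register, adds the two values using the ALU, and places the result in the EX/MEM pipeline register • Memory access: • The load instruction reads the data memory using the address in the EX/MEM pipeline register • Write back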

Figure 6.12 IF/ID

Figure 6.12 EX

Figure 6.13 EX

Figure 6.14 MEM

Figure 6.14 WB

6.3 Pipeline Control (1/2)
