Learn about single cycle, multiple cycle, and pipeline implementations for faster processor performance, including pipeline stages and hazards.
Computer Organization and Architecture • Chapter 6: Enhancing Performance with Pipelining • Yu-Lun Kuo • Computer Sciences and Information Engineering, University of Tunghai, Taiwan • sscc6991@gmail.com
Review: Single Cycle vs. Multiple Cycle Timing • Single Cycle Implementation: lw takes Cycle 1 and sw takes Cycle 2; the part of the cycle that sw does not need is marked as waste • Multiple Cycle Implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and the R-type instruction starts with IFetch in Cycle 10 • The multicycle clock period is longer than 1/5 of the single-cycle clock period because of stage register overhead
How Can We Make It Even Faster? • Split the multiple instruction cycle into smaller and smaller steps • There is a point of diminishing returns where as much time is spent loading the state registers as doing the work • Pipelining • Multiple instructions are overlapped in execution • Key to making processors fast
Example: Laundry • Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • “Folder” takes 20 minutes
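The benefit of overlapping the loads can be checked with a short calculation. The sketch below is only an illustration: it assumes a finished load may wait between machines, and the function names are invented here, not taken from the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load goes through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Overlap the loads: a stage starts once the load has arrived
    from the previous stage and the machine is free."""
    machine_free_at = [0] * len(stage_minutes)
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, minutes in enumerate(stage_minutes):
            start = max(ready, machine_free_at[i])
            ready = start + minutes
            machine_free_at[i] = ready
        finish = ready
    return finish


STAGES = [30, 40, 20]  # washer, dryer, "folder" in minutes (from the slide)
print(sequential_minutes(STAGES, 4))  # 360
print(pipelined_minutes(STAGES, 4))   # 210
```

Under these assumptions the four loads take 6 hours one after another and 3.5 hours when overlapped; the 40-minute dryer, the slowest stage, sets the pace.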
MIPS Instructions • Classically take five steps: • Fetch instruction from (instruction) memory (IF) • Read register while decoding the instruction (ID) • Execute the operation or calculate an address (EX) • Access an operand in data memory (MEM) • Write the result into a register (WB) • Five stages
The schematic view • IF: uses the memory • ID: uses the register file • EX: uses the ALU • Mem: uses the memory • WB: uses the register file • Very important to remember the content of this slide
A Pipelined MIPS Processor • Start the next instruction before the current one has completed • Improves throughput: the total amount of work done in a given time • [Figure: lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous one; the cycle axis runs from Cycle 1 to Cycle 8] • Clock cycle (pipeline stage time) is limited by the slowest stage • For some instructions, some stages are wasted cycles
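The overlapped timing on this slide can be reproduced with a small table generator. This is a minimal sketch for an ideal pipeline with no hazards; `pipeline_chart` and `render` are names made up for the illustration, and each instruction label is assumed to be unique.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_chart(instructions):
    """Return {instruction: {cycle: stage}} for an ideal pipeline."""
    chart = {}
    for index, name in enumerate(instructions):
        # Instruction number `index` (0-based) enters IFetch in cycle
        # index + 1 and then advances one stage per cycle.
        chart[name] = {index + 1 + s: stage for s, stage in enumerate(STAGES)}
    return chart


def render(chart):
    """Format the chart as one text row per instruction."""
    last_cycle = max(max(cycles) for cycles in chart.values())
    rows = []
    for name, cycles in chart.items():
        cells = [cycles.get(c, "").ljust(6) for c in range(1, last_cycle + 1)]
        rows.append(name.ljust(7) + " ".join(cells).rstrip())
    return "\n".join(rows)


chart = pipeline_chart(["lw", "sw", "R-type"])
print(render(chart))
```

Every instruction occupies all five stages here, even when it does no useful work in one of them (sw in WB, for example), which is the "wasted cycles" point made on the slide.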
Single Cycle vs. Multiple Cycle vs. Pipeline • Single Cycle Implementation: lw takes Cycle 1 and sw takes Cycle 2, with part of the sw cycle wasted • Multiple Cycle Implementation: lw takes Cycles 1-5 (IFetch, Dec, Exec, Mem, WB), sw takes Cycles 6-9 (IFetch, Dec, Exec, Mem), and the R-type instruction starts with IFetch in Cycle 10 • Pipeline Implementation: lw, sw, and the R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous one
Single-cycle vs. Pipelined Performance (p.372) • Single-cycle (non-pipelined) • Must allow for the slowest instruction (lw) • The time required for every instruction is 800 ps • The time between the first and fourth instructions in the non-pipelined design • 3 * 800 ps = 2400 ps
Single-cycle vs. Pipelined Performance (p.333) • Pipeline • All the pipeline stages take a single clock cycle • The clock cycle must be long enough to accommodate the slowest operation • The pipelined clock cycle must therefore be the worst-case stage time of 200 ps • Total time for the three instructions: 600 ps + 4 * 200 ps = 1400 ps (one 200 ps cycle per instruction plus four cycles to fill the pipeline)
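How the two clock periods arise can be sketched as follows. The per-stage latencies are an assumption: the slides give only the 800 ps and 200 ps results, and the values below were chosen to be consistent with them.

```python
# Assumed stage latencies in picoseconds (not listed on the slides).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_clock_ps(stage_ps):
    """One long cycle must cover every stage of the slowest instruction (lw)."""
    return sum(stage_ps.values())


def pipelined_clock_ps(stage_ps):
    """Every stage gets one cycle, so the slowest stage sets the period."""
    return max(stage_ps.values())


print(single_cycle_clock_ps(STAGE_PS))  # 800
print(pipelined_clock_ps(STAGE_PS))     # 200
```

With these assumed latencies the 100 ps stages still occupy a full 200 ps cycle in the pipeline, which is one reason the speedup falls short of the stage count.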
Pipelining Speedup (p.334) • Under ideal conditions and with a large number of instructions • The speedup from pipelining is approximately equal to the number of pipeline stages • A five-stage pipeline is nearly five times faster • The above example? • Pipeline time: 1400 ps • Non-pipeline time: 2400 ps • The five-fold speedup is not reflected in the total execution time for the three instructions (2400 / 1400 ≈ 1.7)
Pipelining Speedup (p.334) • Pipelining involves some overhead • The source of which will become clearer shortly • Thus, the time per instruction in the pipelined processor will exceed the minimum possible • The speedup will be less than the number of pipeline stages • The number of instructions in the example is not large • What if we increased the number of instructions? • Add 1,000,000 instructions
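The effect of adding 1,000,000 instructions can be worked out directly. The sketch below uses the 800 ps and 200 ps figures from the slides and assumes an ideal five-stage pipeline with no stalls; the function names are invented for the illustration.

```python
def nonpipelined_ps(n, instr_ps=800):
    """Every instruction takes one full 800 ps cycle."""
    return n * instr_ps


def pipelined_ps(n, cycle_ps=200, stages=5):
    """One cycle per instruction plus (stages - 1) cycles to fill the pipeline."""
    return (n + stages - 1) * cycle_ps


def speedup(n):
    return nonpipelined_ps(n) / pipelined_ps(n)


for n in (3, 1_000_003):
    print(n, nonpipelined_ps(n), pipelined_ps(n), round(speedup(n), 2))
# 3 2400 1400 1.71
# 1000003 800002400 200001400 4.0
```

With many instructions the ratio approaches 800 / 200 = 4 rather than 5. The stages in this example are not perfectly balanced, so the pipelined clock is longer than one fifth of the single-cycle clock.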
Pipeline Hazards • Situations in which the next instruction cannot execute in the following clock cycle • Three different types • Structural hazards: • what if we had only one memory? • Data hazards: • what if an instruction’s input operands depend on the output of a previous instruction? • Control hazards: • what about branches?
Structural Hazards (1/2) • The hardware cannot support the combination of instructions that we want to execute in the same clock cycle • There are not enough hardware resources • Because of this, several instructions that should execute at the same time cannot all execute • Ex. The laundry room • Washer-dryer combination vs. separate washer and dryer
Structural Hazard (2/2) • Suppose we had a single memory instead of two memories • If the pipeline in Figure 6.3 had a fourth instruction • Then, in the same clock cycle • The first instruction is accessing data from memory • The fourth instruction is fetching an instruction from the same memory • Without two memories, the pipeline could have a structural hazard
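The collision between the first and fourth instructions follows from the stage timing. The sketch below finds such conflicts, assuming an ideal five-stage pipeline and that only lw and sw touch memory in their MEM stage. The function name and the tuple layout are made up for this example.

```python
def single_memory_conflicts(instructions):
    """Return (cycle, data_access_index, fetch_index) for each conflict.

    `instructions` lists mnemonics in program order. Instruction i
    (0-based) is fetched in cycle i + 1 and reaches MEM in cycle i + 4.
    """
    conflicts = []
    for i, op in enumerate(instructions):
        if op not in ("lw", "sw"):
            continue  # only loads and stores use memory in MEM
        mem_cycle = i + 4
        fetch_index = mem_cycle - 1  # instruction being fetched in that cycle
        if fetch_index < len(instructions):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts


print(single_memory_conflicts(["lw", "add", "sub", "and"]))  # [(4, 0, 3)]
print(single_memory_conflicts(["lw", "add", "sub"]))         # []
```

In cycle 4 the lw (index 0) reads data while the fourth instruction (index 3) is being fetched, so a single memory would have to serve both requests at once.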
Structural Hazard: Single Memory • [Figure: pipeline diagram with time (clock cycles) across and instruction order down: lw, Inst 1, Inst 2, Inst 3, Inst 4, each using Mem, Reg, ALU, Mem, Reg; in the same cycle lw is reading data from memory while Inst 3 is reading its instruction from memory]
Data Hazard • The planned instruction cannot execute in the proper clock cycle • Because data that is needed to execute the instruction is not yet available • The pipeline must be stalled (Bubble) • Because one step must wait for another to complete • Ex. add $s0, $t0, $t1 • sub $t2, $s0, $t3 • Have to add three bubbles to the pipeline
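The number of bubbles can be derived from the stage timing. The sketch below assumes no forwarding and takes one design choice as a parameter: whether the register file can be written and read in the same cycle (write in the first half, read in the second half). The function and parameter names are invented for the illustration.

```python
def stalls_without_forwarding(distance, split_cycle_regfile=True):
    """Bubbles needed by a consumer `distance` instructions after its producer.

    The producer occupies cycles 1..5 and writes its result in WB (cycle 5).
    The consumer would normally read its registers in ID, in cycle 2 + distance.
    """
    wb_cycle = 5
    # With a split-cycle register file, a read in the WB cycle already
    # sees the new value; otherwise the read must wait one more cycle.
    earliest_read = wb_cycle if split_cycle_regfile else wb_cycle + 1
    natural_read = 2 + distance
    return max(0, earliest_read - natural_read)


# add $s0, $t0, $t1 followed immediately by sub $t2, $s0, $t3 (distance 1)
print(stalls_without_forwarding(1, split_cycle_regfile=False))  # 3
print(stalls_without_forwarding(1, split_cycle_regfile=True))   # 2
```

Under this model, the three bubbles quoted for the add/sub pair correspond to a register file that cannot be written and read in the same cycle; with the split-cycle register file, two bubbles suffice.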
How About Register File Access? • [Figure: pipeline diagram with time (clock cycles) across and instruction order down: add $1, ; Inst 1; Inst 2; Inst 3; add $2,$1, ; each instruction uses IM, Reg, ALU, DM, Reg]
How About Register File Access? • Fix the register file access hazard by doing writes in the first half of the cycle and reads in the second half • [Figure: pipeline diagram with time (clock cycles) across and instruction order down: add $1, ; Inst 1; Inst 2; add $2,$1, ; the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers are marked]
Register Usage Can Cause Data Hazards • Dependencies backward in time cause hazards • [Figure: pipeline diagram for add $1, ; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each instruction uses IM, Reg, ALU, DM, Reg] • Read before write data hazard
Loads Can Cause Data Hazards • Dependencies backward in time cause hazards • [Figure: pipeline diagram in instruction order for lw $1,4($2); sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each instruction uses IM, Reg, ALU, DM, Reg] • Load-use data hazard
One Way to “Fix” a Data Hazard • Can fix a data hazard by waiting (stall), but this impacts CPI • [Figure: pipeline diagram in instruction order: add $1, ; two stall cycles; sub $4,$1,$5; and $6,$1,$7]
Forwarding • Also called bypassing • Resolving a data hazard by retrieving the missing data element from internal buffers • Ex. lw $s0, 20($t1) • sub $t2, $s0, $t3 • Still need one stall
Another Way to “Fix” a Data Hazard • Fix data hazards by forwarding results as soon as they are available to where they are needed • [Figure: pipeline diagram in instruction order for add $1, ; sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each instruction uses IM, Reg, ALU, DM, Reg]
Forwarding with Load-use Data Hazards • Will still need one stall cycle even with forwarding • [Figure: pipeline diagram in instruction order for lw $1,4($2); sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5; each instruction uses IM, Reg, ALU, DM, Reg]
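The remaining load-use stall can be detected by comparing neighboring instructions. The sketch below is a simplified model that assumes full forwarding, so the only stall left is a load followed immediately by an instruction that reads the loaded register. The `Instr` type and the function name are hypothetical, and special cases such as register $0 are ignored.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, if any
    srcs: Tuple[str, ...]   # registers read


def stalls_with_forwarding(program):
    """Count one stall for each load whose result is used by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest is not None and prev.dest in cur.srcs:
            stalls += 1
    return stalls


load_use = [
    Instr("lw", "$s0", ("$t1",)),        # lw  $s0, 20($t1)
    Instr("sub", "$t2", ("$s0", "$t3")), # sub $t2, $s0, $t3
]
alu_use = [
    Instr("add", "$s0", ("$t0", "$t1")), # add $s0, $t0, $t1
    Instr("sub", "$t2", ("$s0", "$t3")), # sub $t2, $s0, $t3
]
print(stalls_with_forwarding(load_use))  # 1
print(stalls_with_forwarding(alu_use))   # 0
```

The add result is available at the end of EX and can be forwarded straight to the next instruction's ALU input. The loaded value only exists after MEM, one cycle too late for the instruction right behind it.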
Control Hazard (1/2) • Also called branch hazard • Arises when a decision must be made based on the result of one instruction while other instructions are executing • Solution 1: Stall (bubble) • On a branch, stall until the outcome is known • Assume we put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline • Even so, stalling on every branch is slow and the extra hardware is costly
Branch Instructions Cause Control Hazards • Dependencies backward in time cause hazards • [Figure: pipeline diagram in instruction order: beq, lw, Inst 3, Inst 4; each instruction uses IM, Reg, ALU, DM, Reg]
One Way to “Fix” a Control Hazard • Fix a branch hazard by waiting (stall), but this affects CPI • [Figure: pipeline diagram in instruction order: beq; three stall cycles; lw; Inst 3]
Control Hazard (2/2) • Solution 2: Predict • Always predict that branches will be untaken • When the prediction is right, the pipeline proceeds at full speed • No jump to the branch target address • The pipeline stalls only when branches are taken • Need to add hardware for flushing instructions if the prediction is wrong
Branch Prediction (1/2) • A method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome • Dynamic prediction of branches • Keep a history for each branch as taken or untaken • Use the recent past behavior to predict the future • Can correctly predict branches with over 90% accuracy
Branch Prediction (2/2) • If the prediction is wrong • Pipeline control must ensure that the instructions following the wrongly guessed branch have no effect • Restart the pipeline from the proper branch address • Keeping the history • Branch history table • Branch prediction buffer
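A branch history table that remembers the last outcome of each branch can be sketched in a few lines. This is a toy 1-bit scheme used only to illustrate the idea; the branch address 0x40, the loop pattern, and the function name are assumptions made for the example.

```python
def simulate_one_bit_predictor(branches, initial_taken=False):
    """Return the fraction of branches predicted correctly.

    `branches` is a sequence of (branch_address, taken) pairs. Each branch
    is predicted to repeat its most recent outcome.
    """
    history = {}  # branch history table: address -> last outcome
    correct = 0
    total = 0
    for address, taken in branches:
        prediction = history.get(address, initial_taken)
        if prediction == taken:
            correct += 1
        total += 1
        history[address] = taken  # remember the actual outcome
    return correct / total if total else 0.0


# A loop branch taken nine times and then not taken, run ten times.
loop_branch = [(0x40, True)] * 9 + [(0x40, False)]
print(simulate_one_bit_predictor(loop_branch * 10))  # 0.8
```

In this pattern the toy predictor misses twice per pass through the loop, once on entry and once on exit. It is not meant to reproduce the over-90% accuracy quoted for dynamic prediction.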
Pipelined Datapath • IF: Instruction fetch • ID: Instruction decode and register file read • EX: Execution or address calculation • MEM: Data memory access • WB: Write back
Pipeline Execution • Assume • Register file is written in the first half of the clock cycle • Register file is read during the second half
Five stages of lw (1/3) • Instruction fetch • The instruction is read from memory using the address in the PC • It is placed in the IF/ID pipeline register • The PC is incremented to PC+4 (ready for the next clock cycle) • Instruction decode and register file read • The IF/ID pipeline register supplies the 16-bit immediate field • Which is sign-extended to 32 bits • And the register numbers used to read the two registers • All values are stored in the ID/EX pipeline register
Five stages of lw (2/3) • Execute and address calculation • Reads the contents of register 1 and the sign-extended immediate from the ID/EX pipeline register • Adds them using the ALU • The sum is placed in the EX/MEM pipeline register • Memory access • Reads the data memory using the address from the EX/MEM pipeline register • Loads the data into the MEM/WB pipeline register
Five stages of lw (3/3) • Write back • Final step • Reading the data from the MEM/WB pipeline register • Writing it into the register file
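The five steps can be traced in code by passing values from one pipeline register to the next. The sketch below makes several simplifying assumptions: the instruction is already decoded into a dictionary with hypothetical fields `rs`, `rt`, and `imm16`, memories are plain Python containers, and the destination register number is carried along with the data, a detail the slides do not show.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of `value` as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_lw(pc, instr_mem, regs, data_mem):
    """Step one lw through IF, ID, EX, MEM, WB and return the pipeline registers."""
    # IF: read the instruction at PC and compute PC + 4.
    if_id = {"instr": instr_mem[pc], "pc_plus_4": pc + 4}
    # ID: read the base register and sign-extend the immediate.
    instr = if_id["instr"]
    id_ex = {
        "read_data1": regs[instr["rs"]],
        "imm": sign_extend_16(instr["imm16"]),
        "rt": instr["rt"],
        "pc_plus_4": if_id["pc_plus_4"],
    }
    # EX: the ALU adds the base register and the immediate.
    ex_mem = {"address": id_ex["read_data1"] + id_ex["imm"], "rt": id_ex["rt"]}
    # MEM: read the data memory at the computed address.
    mem_wb = {"load_data": data_mem[ex_mem["address"]], "rt": ex_mem["rt"]}
    # WB: write the loaded value into the register file.
    regs[mem_wb["rt"]] = mem_wb["load_data"]
    return {"IF/ID": if_id, "ID/EX": id_ex, "EX/MEM": ex_mem, "MEM/WB": mem_wb}


# lw $s0, 20($t1): $t1 is register 9 and $s0 is register 16.
regs = [0] * 32
regs[9] = 100
instr_mem = {0: {"rs": 9, "rt": 16, "imm16": 20}}
data_mem = {120: 42}
trace = run_lw(0, instr_mem, regs, data_mem)
print(trace["EX/MEM"]["address"], regs[16])  # 120 42
```

In a real pipeline the five steps happen in five consecutive clock cycles while other instructions occupy the other stages; this sketch follows a single instruction only.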
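The five stages of the lw instruction • Instruction fetch: • The instruction is read from memory using the address stored in the program counter (PC) and placed in the IF/ID pipeline register (this is done for every instruction because the computer does not yet know which type of instruction is being fetched) • Instruction decode and register read: • The register numbers, the register contents, and the 16-bit immediate field; the incremented program counter (PC) value is also placed in the ID/EX register • Execute or effective address calculation: • The load instruction reads the sign-extended address and the contents of register 1 from the ID/EX pipeline register, adds the two values using the ALU, and places the sum in the EX/MEM pipeline register • Memory access • The load instruction uses the address in the EX/MEM pipeline register to read data from the data memory • Write back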