The Basics: Pipelining J. Nelson Amaral University of Alberta
The Pipeline Concept Bauer p. 32
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
Consider the pipeline with the stage delays indicated above. We want to know the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long it takes to execute a single instruction in the pipeline.
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
Pipeline throughput: how often is an instruction completed?
Pipeline latency: how long does it take to execute an instruction in the pipeline?
Is this right?
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
Simply adding the stage delays to compute the pipeline latency works only for an isolated instruction.
I1: IF ID EX MEM WB, L(I1) = 28 ns
I2: IF ID EX MEM WB, L(I2) = 33 ns
I3: IF ID EX MEM WB, L(I3) = 38 ns
I4: IF ID EX MEM WB, L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because the pipeline is unbalanced. The solution is to make every stage the same length as the longest one.
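The growing latencies on this slide can be reproduced with a small simulation. This is a minimal sketch under stated assumptions: each stage handles one instruction at a time, an instruction may wait between stages, and a new instruction is fetched as soon as IF is free. The function name `unbalanced_latencies` is illustrative and does not come from the slides.

```python
def unbalanced_latencies(delays, n_instructions):
    """Latency in ns of each instruction in an unclocked, unbalanced pipeline."""
    # prev_finish[s] is when the previous instruction left stage s.
    prev_finish = [0] * len(delays)
    latencies = []
    for _ in range(n_instructions):
        fetch_start = prev_finish[0]  # IF is free once the previous fetch ends
        t = fetch_start
        finish = []
        for stage, delay in enumerate(delays):
            # Wait until this instruction is ready AND the stage is free.
            start = max(t, prev_finish[stage])
            t = start + delay
            finish.append(t)
        prev_finish = finish
        latencies.append(t - fetch_start)
    return latencies


# IF, ID, EX, MEM, WB delays from the slide
print(unbalanced_latencies([5, 4, 5, 10, 4], 4))  # [28, 33, 38, 43]
```

Each instruction after the first queues up behind the 10 ns MEM stage, so its latency grows by 5 ns: the gap between the 5 ns fetch interval and the 10 ns completion interval.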
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
The slowest pipeline stage also limits the latency!
[Timing diagram: instructions I1 to I4 each pass through IF, ID, EX, MEM, WB; time axis from 0 to 60 ns]
L(I1) = L(I2) = L(I3) = L(I4) = 50 ns
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)
How long would it take using the same modules without pipelining?
What is the speedup due to pipelining?
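One way to work through these three questions is sketched below. It assumes the clock period equals the slowest stage delay, that the first instruction needs one cycle per stage and every later instruction completes one cycle after its predecessor, and that the unpipelined machine spends the sum of the stage delays on each instruction with no latch overhead. The function names are illustrative.

```python
STAGE_DELAYS_NS = [5, 4, 5, 10, 4]  # IF, ID, EX, MEM, WB


def pipelined_time_ns(delays, n_instructions):
    """Fill the pipeline once, then complete one instruction per clock."""
    clock = max(delays)  # assumption: clock set by the slowest stage
    return (len(delays) + n_instructions - 1) * clock


def unpipelined_time_ns(delays, n_instructions):
    """Each instruction runs through all modules before the next one starts."""
    return n_instructions * sum(delays)


n = 20000
t_pipe = pipelined_time_ns(STAGE_DELAYS_NS, n)   # 200040 ns
t_seq = unpipelined_time_ns(STAGE_DELAYS_NS, n)  # 560000 ns
print(t_pipe, t_seq, round(t_seq / t_pipe, 2))   # 200040 560000 2.8
```

Under these assumptions the speedup is about 2.8, well below the ideal of 5 for a five-stage pipeline, because four of the five stages sit idle for part of every 10 ns cycle.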
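Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM 10 ns, WB 4 ns
The speedup that we got from the pipeline is the execution time without pipelining divided by the execution time with pipelining.
How can we improve this pipeline design? We need to reduce the imbalance between the stages so that we can increase the clock speed.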
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM1 5 ns, MEM2 5 ns, WB 4 ns
Now we have one more pipeline stage. What is the throughput now? What is the new latency for a single instruction?
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM1 5 ns, MEM2 5 ns, WB 4 ns
[Timing diagram: instructions I1 to I7 each pass through IF, ID, EX, MEM1, MEM2, WB]
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM1 5 ns, MEM2 5 ns, WB 4 ns
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)
What is the speedup that we get from pipelining?
Pipeline Throughput and Latency
Stage delays: IF 5 ns, ID 4 ns, EX 5 ns, MEM1 5 ns, MEM2 5 ns, WB 4 ns
What have we learned from this example?
1. It is important to balance the delays in the stages of the pipeline.
2. The throughput of a pipeline is 1/max(delay).
3. The latency is N × max(delay), where N is the number of stages in the pipeline.
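The two formulas in this summary can be applied directly to both designs from the example. The sketch below is only an illustration of those formulas; the function names are made up for this example.

```python
def throughput_per_ns(delays):
    """Instructions completed per nanosecond: 1 / max(delay)."""
    return 1 / max(delays)


def latency_ns(delays):
    """Latency of one instruction: N * max(delay), N = number of stages."""
    return len(delays) * max(delays)


five_stage = [5, 4, 5, 10, 4]    # IF, ID, EX, MEM, WB
six_stage = [5, 4, 5, 5, 5, 4]   # IF, ID, EX, MEM1, MEM2, WB

print(throughput_per_ns(five_stage), latency_ns(five_stage))  # 0.1 50
print(throughput_per_ns(six_stage), latency_ns(six_stage))    # 0.2 30
```

Splitting the 10 ns MEM stage doubles the throughput and, even though the pipeline has one more stage, lowers the latency from 50 ns to 30 ns.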
Execution Snapshot Bauer p. 33
Pipeline with Control Unit Bauer p. 34
Data Hazards and Forwarding
Example 1:
i: R7 ← R12 + R15
i+1: R8 ← R7 – R12
i+2: R15 ← R8 + R7
Read-After-Write (RAW) dependencies (true dependencies)
Write-After-Read (WAR) dependencies (anti dependencies)
Bauer p. 35
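The dependencies in Example 1 can be listed mechanically. The sketch below encodes each instruction as a hypothetical `(destination, sources)` tuple and compares every earlier instruction with every later one. It is a simplified pairwise check that ignores the case where an instruction in between redefines the register.

```python
from itertools import combinations

# Example 1 from the slide, encoded as (destination, sources).
program = [
    ("R7", ("R12", "R15")),   # i:   R7  <- R12 + R15
    ("R8", ("R7", "R12")),    # i+1: R8  <- R7  - R12
    ("R15", ("R8", "R7")),    # i+2: R15 <- R8  + R7
]


def find_dependencies(program):
    """Return (kind, earlier index, later index, register) tuples."""
    deps = []
    for (a, (dest_a, srcs_a)), (b, (dest_b, srcs_b)) in combinations(
        enumerate(program), 2
    ):
        if dest_a in srcs_b:   # later instruction reads what earlier wrote
            deps.append(("RAW", a, b, dest_a))
        if dest_b in srcs_a:   # later instruction writes what earlier read
            deps.append(("WAR", a, b, dest_b))
    return deps


for dep in find_dependencies(program):
    print(dep)
```

For this program the check reports three RAW dependencies (i to i+1 on R7, i to i+2 on R7, i+1 to i+2 on R8) and one WAR dependency (i to i+2 on R15).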
Data Hazards and Forwarding Bauer p. 36
Forwarding Bauer p. 37
Load-ALU RAW Dependency
Example 2:
i: R6 ← Mem[R2]
i+1: R7 ← R6 + R4
The data from the load is not available until the Mem/WB register of instruction i, but it is needed at the ID/EX register of instruction i+1. Cannot forward back in time!
Bauer p. 36
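A hazard check for this case can be sketched as follows. It assumes the five-stage pipeline with forwarding, where a load immediately followed by an instruction that reads the loaded register costs one bubble. The `(opcode, destination, sources)` encoding and the function name are hypothetical, and the check does not model the load-to-store case, where the loaded value can be forwarded to the Mem stage.

```python
def load_use_bubbles(program):
    """Count bubbles caused by a load directly followed by a dependent use."""
    bubbles = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        # The loaded value arrives one cycle too late for the next EX stage.
        if prev_op == "load" and prev_dest in curr_srcs:
            bubbles += 1
    return bubbles


example_2 = [
    ("load", "R6", ("R2",)),      # i:   R6 <- Mem[R2]
    ("add", "R7", ("R6", "R4")),  # i+1: R7 <- R6 + R4
]
print(load_use_bubbles(example_2))  # 1
```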
Bubble because of load Bauer p. 38
Priority on Forwarding
The RAW from i+1 to i+2 must take priority over the RAW from i to i+2.
Example:
i: R10 ← R4 + R5
i+1: R10 ← R4 – R10
i+2: R8 ← R10 + R7
Bauer p. 38
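The priority rule amounts to checking the most recent producer first. The sketch below assumes that when i+2 is in the EX stage, the EX/Mem pipeline register holds the result of i+1 and the Mem/WB register holds the result of i. The function and its parameter names are illustrative.

```python
def select_forward(src, ex_mem_dest, mem_wb_dest):
    """Pick where the EX stage takes the value of source register `src` from."""
    if src == ex_mem_dest:   # newest result (instruction i+1) wins
        return "EX/Mem"
    if src == mem_wb_dest:   # older result (instruction i)
        return "Mem/WB"
    return "register file"   # no pending write to this register


# i+2: R8 <- R10 + R7, with both i and i+1 writing R10
print(select_forward("R10", ex_mem_dest="R10", mem_wb_dest="R10"))  # EX/Mem
print(select_forward("R7", ex_mem_dest="R10", mem_wb_dest="R10"))   # register file
```

Testing the EX/Mem register before the Mem/WB register is what guarantees that i+2 sees the value of R10 written by i+1 rather than the stale value written by i.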
Forwarding from Mem/WB to Mem Example: i: R5 ← Mem[R6] i+1: Mem[R8] ← R5 After the load, the contents of the Mem/WB register must be forwarded to be written to memory (not only to R5). Bauer p. 39
Pipelining with Forwarding and Stall Bauer p. 38
Control Hazards (branches) Bauer p. 40
Control Hazards: Exceptions and Interruptions
• Exceptions can occur in any stage (except WB)
  • IF: page faults
  • ID: illegal opcodes
  • EX: arithmetic exceptions
  • Mem: illegal address, page faults
• Interruptions:
  • I/O termination, time-outs
  • Power failures
Bauer p. 40
Handling Exceptions/Interruptions
[Flowchart: Save the Process State, Clear Exception Condition, ?, Abort Program, “Correct” Exception, Perform Unrelated Task, Schedule Process, Restart]
Bauer p. 41
Precise Exceptions in a Pipeline
• If an exception happens in instruction i:
  • Instructions i-1, i-2, … complete normally and contribute to the saved state of the process
  • Instructions i, i+1, i+2, … become no-ops
  • After the exception is handled, execution restarts at instruction i
  • The PC saved is the PC of instruction i.
[Diagram: …, i-2, i-1 complete normally; the exception happens at i, which is also where execution restarts; i, i+1, i+2, … become no-ops]
Bauer p. 41
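The rule above can be stated as a small function. This is a sketch only: it models the instructions in flight as a list of hypothetical PC values in program order and splits them around the faulting instruction.

```python
def precise_exception(in_flight_pcs, faulting_index):
    """Split in-flight instructions around the one that raised the exception."""
    completed = in_flight_pcs[:faulting_index]  # i-1, i-2, ... finish normally
    squashed = in_flight_pcs[faulting_index:]   # i, i+1, ... become no-ops
    saved_pc = in_flight_pcs[faulting_index]    # execution restarts at i
    return completed, squashed, saved_pc


# Hypothetical PCs of five instructions in the pipeline; the third one faults.
pcs = [0x100, 0x104, 0x108, 0x10C, 0x110]
completed, squashed, saved_pc = precise_exception(pcs, 2)
print([hex(pc) for pc in completed])  # ['0x100', '0x104']
print([hex(pc) for pc in squashed])   # ['0x108', '0x10c', '0x110']
print(hex(saved_pc))                  # 0x108
```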
Implementing Precise Exceptions in the Pipeline
• Flag the pipeline register at the right of the stage where the exception was detected
• This flag moves along the pipeline
• Set all control lines at a stage with the flag to transform the instruction into a no-op
• Stop instruction fetching
• When the flag reaches the Mem/WB stage, save the PC of that instruction as the exception PC
Bauer p. 41
Program Order vs. Temporal Order
[Diagram labels: divide-by-zero exception, page-fault exception]
Which exception occurs first in time? Which exception should be handled first?
Bauer p. 41
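The difference between the two orders can be made concrete with a small example. The scenario below is an assumption made for illustration, since the slide's diagram is not available: instruction i raises a divide-by-zero in its EX stage in cycle 3, while instruction i+1 raises a page fault in its IF stage in cycle 2. For exceptions to be precise, the one belonging to the earlier instruction in program order is handled first.

```python
from dataclasses import dataclass


@dataclass
class PendingException:
    instruction: int  # position in program order (0 = instruction i)
    stage: str        # stage where the exception was detected
    cycle: int        # clock cycle in which it was detected
    cause: str


# Hypothetical scenario: i is in EX during cycle 3, i+1 is in IF during cycle 2.
pending = [
    PendingException(instruction=0, stage="EX", cycle=3, cause="divide-by-zero"),
    PendingException(instruction=1, stage="IF", cycle=2, cause="page fault"),
]

first_in_time = min(pending, key=lambda e: e.cycle)
handled_first = min(pending, key=lambda e: e.instruction)

print(first_in_time.cause)  # page fault
print(handled_first.cause)  # divide-by-zero
```

Carrying the exception flag down the pipeline and acting on it only at the Mem/WB stage gives this ordering automatically, because instructions reach that stage in program order.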
Design Issues:
• Can’t avoid the load/ALU instruction bubble
• Branch resolution in EX stage → two-cycle branch penalty
• Mem stage unused for ALU instructions
Bauer p. 38
Alternative Pipelining Design: Avoiding the load latency penalty
Example:
i: R4 ← Mem[R8]
i+1: R7 ← R4 + R5
Bauer p. 43
Avoiding the load latency penalty Example: i: R4 ← Mem[R8] i+1: R7 ← R4 + R5 Bauer p. 43
Address Generation Latency Penalty Example: i: R5 ← R6 + R7 i+1: R9 ← Mem[R5] Can’t forward from future. Has to stall. Bauer p. 43
Other changes AG used for branch resolution AG unused for ALU operations Bauer p. 43
Tradeoffs:
• Avoids load/ALU bubble
• Additional ALU unit
• Move branch resolution to AG → same penalty
• AG stage unused for ALU operations
• Stalls for ALU/Store instruction dependency
Bauer p. 43
Which one is better? MIPS Intel 486 Bauer p. 44
Pipelining Functional Units: the EX stage
• Parameters of interest:
  • number of stages
  • minimum number of cycles before two independent (no RAW) instructions of the same type can enter the functional unit
Bauer p. 44
Single-Precision Floating Point Representation
Most standard floating point representations use:
• 1 bit for the sign (positive or negative)
• 8 bits for the range (exponent field)
• 23 bits for the precision (fraction field)
Layout: S (sign, 1 bit) | E (exponent, 8 bits) | F (fraction, 23 bits)
Bauer p. 45 P-H. p. 245 From: Patt and Patel, pp. 33
Special Floating Point Representations
In the 8-bit field of the exponent we can represent numbers from 0 to 255. We studied how to read numbers with exponents from 0 to 254. What is the value represented when the exponent is 255 (i.e. 11111111₂)?
An exponent equal to 255 = 11111111₂ in a floating point representation indicates a special value.
When the exponent is equal to 255 and the fraction is 0, the value represented is infinity.
When the exponent is equal to 255 and the fraction is non-zero, the value represented is Not a Number (NaN).
Bauer p. 45 P-H. p. 246 Hen/Patt, pp. 301
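The two special cases can be checked on raw 32-bit patterns. The sketch below extracts the three fields with shifts and masks and uses Python's standard `struct` module to confirm the result against the machine's own interpretation of the same bits. The function name `classify` is illustrative.

```python
import math
import struct


def classify(bits):
    """Classify a 32-bit single-precision pattern by its exponent and fraction."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 255:
        if fraction == 0:
            return "-infinity" if sign else "+infinity"
        return "NaN"
    return "ordinary value"


def as_float(bits):
    """Reinterpret the 32-bit pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(classify(0x7F800000), as_float(0x7F800000))  # +infinity inf
print(classify(0xFF800000), as_float(0xFF800000))  # -infinity -inf
print(classify(0x7FC00000), math.isnan(as_float(0x7FC00000)))  # NaN True
print(classify(0x3F800000), as_float(0x3F800000))  # ordinary value 1.0
```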
Floating Point Addition
Operands: (S1, E1, F1) and (S2, E2, F2)
• If E1 < E2: swap operands
• Stage 1
• Insert 1 to the left of F1 and to the left of F2
• If S1 ≠ S2: replace F2 by its 2's complement
• D = E1 – E2; F2 ← F2 >> D (shift the operand with the smaller exponent right to align it)
• Stage 2-3: add mantissas
• Stage 4: normalize and round off
Bauer p. 46
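The steps of the flowchart can be followed in software on the (S, E, F) fields. The sketch below is a simplification, not the hardware design: it handles only normal, non-zero single-precision operands, uses a signed Python integer in place of the 2's complement mantissa, truncates instead of rounding, and ignores exponent overflow and underflow. All function names are illustrative.

```python
import struct

FRACTION_BITS = 23
HIDDEN_ONE = 1 << FRACTION_BITS


def unpack_fields(x):
    """Split a float into single-precision (sign, exponent, fraction) fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> FRACTION_BITS) & 0xFF, bits & (HIDDEN_ONE - 1)


def pack_fields(sign, exponent, fraction):
    """Build a float from single-precision (sign, exponent, fraction) fields."""
    bits = (sign << 31) | (exponent << FRACTION_BITS) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp_add(a, b):
    (s1, e1, f1), (s2, e2, f2) = a, b
    # Compare exponents; swap so that operand 1 has the larger exponent.
    if e1 < e2:
        (s1, e1, f1), (s2, e2, f2) = (s2, e2, f2), (s1, e1, f1)
    # Insert the hidden 1 to the left of both fractions.
    m1 = HIDDEN_ONE | f1
    m2 = HIDDEN_ONE | f2
    # Different signs: negate the second mantissa (stands in for 2's complement).
    if s1 != s2:
        m2 = -m2
    # Align: shift the smaller operand right by the exponent difference.
    m2 >>= e1 - e2
    # Add mantissas.
    total = m1 + m2
    if total == 0:
        return (0, 0, 0)
    sign = s1
    if total < 0:
        sign, total = 1 - s1, -total
    # Normalize (truncating any bits shifted out).
    exponent = e1
    while total >= (HIDDEN_ONE << 1):
        total >>= 1
        exponent += 1
    while total < HIDDEN_ONE:
        total <<= 1
        exponent -= 1
    return (sign, exponent, total & (HIDDEN_ONE - 1))


def add_floats(x, y):
    return pack_fields(*fp_add(unpack_fields(x), unpack_fields(y)))


print(add_floats(1.5, 2.25))   # 3.75
print(add_floats(1.0, -0.5))   # 0.5
print(add_floats(3.0, 5.0))    # 8.0
```

The examples are chosen so that no bits are lost during alignment. A faithful implementation would keep guard bits and round the result, which this sketch deliberately leaves out.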