Appendix A Pipelining: Basic and Intermediate Concepts
Pipelining
• An implementation technique whereby multiple instructions are overlapped in execution.
• Each step in the pipeline (called a pipe stage) completes a part of an instruction.
• Because all stages proceed at the same time, the length of a processor (clock) cycle is determined by the time required for the slowest pipe stage.
CSCE 614 Fall 2009
Pipelining
• Designer’s goal: balancing the length of each pipeline stage.
• If the stages are perfectly balanced, the time per instruction on the pipelined processor is
  $$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
• Speedup from pipelining = number of pipe stages
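The relationship between stage balance and speedup can be sketched as follows. This is a minimal illustration, assuming no pipeline overhead and no hazards; the function names and the stage latencies are hypothetical values chosen for the example, not figures from the slides.

```python
def cycle_time(stage_latencies_ns):
    """The clock cycle is set by the slowest pipe stage."""
    return max(stage_latencies_ns)


def pipeline_speedup(stage_latencies_ns):
    """Ideal speedup over an unpipelined machine whose time per
    instruction is the sum of all stage latencies (assumption:
    no overhead, no stalls, pipeline always full)."""
    unpipelined_time = sum(stage_latencies_ns)
    return unpipelined_time / cycle_time(stage_latencies_ns)


# Hypothetical stage latencies in nanoseconds.
balanced = [1.0, 1.0, 1.0, 1.0, 1.0]
unbalanced = [1.0, 1.0, 2.0, 1.0, 1.0]

print(pipeline_speedup(balanced))    # 5.0, equal to the number of stages
print(pipeline_speedup(unbalanced))  # 3.0, limited by the 2 ns stage
```

With balanced stages the speedup equals the number of pipe stages; a single slow stage stretches every cycle and lowers the speedup.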
RISC Instruction Set (MIPS64)
• 64-bit version of the MIPS instruction set.
• 32 registers
• 3 classes of instructions
  • ALU instructions: DADD, DSUB, …
  • Load and store instructions: LD, SD, …
  • Branches and jumps
Implementation of a RISC (Unpipelined, Multicycle)
• Implementation of an integer subset of a RISC architecture that takes at most 5 clock cycles.
  • Instruction Fetch (IF)
  • Instruction Decode/Register Fetch (ID)
  • Execution/Effective Address Calculation (EX)
  • Memory Access (MEM)
  • Write-Back (WB)
Instruction Format (32-bit Version)
• All MIPS instructions are 32 bits long.
R-format (add, sub, …): OP | rs | rt | rd | sa | funct
I-format (lw, sw, …): OP | rs | rt | immediate
J-format (j): OP | jump target
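The three formats can be illustrated by pulling the fields out of a 32-bit instruction word. The sketch below assumes the standard MIPS32 field widths (6-bit OP, 5-bit register and shift fields, 6-bit funct, 16-bit immediate, 26-bit jump target); the function names are hypothetical.

```python
def decode_r(word):
    """Split a 32-bit word into R-format fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "sa": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def decode_i(word):
    """Split a 32-bit word into I-format fields (immediate left unsigned)."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }


def decode_j(word):
    """Split a 32-bit word into J-format fields."""
    return {"op": (word >> 26) & 0x3F, "target": word & 0x3FFFFFF}


# add $s0, $t0, $t1 encodes as 0x01098020 ($t0 = 8, $t1 = 9, $s0 = 16).
print(decode_r(0x01098020))
# lw $t0, 4($s1) encodes as 0x8E280004 ($s1 = 17, $t0 = 8).
print(decode_i(0x8E280004))
```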
Instruction Fetch Cycle (IF)
• Send the program counter (PC) to memory.
• Fetch the current instruction from memory.
• Update the PC to the next sequential PC by adding 4 to the PC.
Instruction Decode/Register Fetch Cycle (ID)
• Decode the instruction and read the registers from the register file.
• Do the equality test on the registers for a possible branch.
• Sign-extend the offset field of the instruction in case it is needed.
• Compute the possible branch target address by adding the sign-extended offset to the incremented PC.
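The sign extension and branch target computation in ID can be sketched as below. One assumption goes beyond the slide: following the usual MIPS convention, the 16-bit offset counts words, so it is shifted left by 2 before being added to the incremented PC. The function names are hypothetical.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset_field):
    """Possible branch target: incremented PC plus the sign-extended offset.
    Assumption: the offset is in words (MIPS convention), hence the << 2."""
    incremented_pc = pc + 4
    return incremented_pc + (sign_extend_16(offset_field) << 2)


print(hex(branch_target(0x1000, 3)))       # forward branch: 0x1010
print(hex(branch_target(0x1000, 0xFFFF)))  # offset -1: back to 0x1000
```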
Execution/Effective Address Calculation (EX)
• The ALU operates on the operands prepared in the prior cycle.
• Memory reference instructions: The ALU adds the base register and the offset to form the effective address.
• Register-Register: The ALU performs the operation specified by the ALU opcode on the values from the register file.
• Register-Immediate: The ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.
Memory Access (MEM)
• If the instruction is a load, memory does a read using the effective address computed in the previous cycle.
• If it is a store, then the memory writes the data from the second register read from the register file using the effective address.
Write-Back Cycle (WB)
• Register-Register ALU instruction or Load instruction: Write the result into the register file.
• In this implementation, branch instructions require 2 cycles, store instructions require 4 cycles, and all other instructions require 5 cycles.
• Assuming a branch frequency of 12% and a store frequency of 10%, what is the overall CPI?
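The question above is a weighted average of cycle counts over the instruction mix. A minimal sketch using the numbers given on the slide (the function name is hypothetical):

```python
def average_cpi(mix):
    """mix is a list of (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(frequency * cycles for frequency, cycles in mix)


branch_freq = 0.12
store_freq = 0.10
other_freq = 1.0 - branch_freq - store_freq  # everything else takes 5 cycles

cpi = average_cpi([(branch_freq, 2), (store_freq, 4), (other_freq, 5)])
print(round(cpi, 2))  # 0.12*2 + 0.10*4 + 0.78*5 = 4.54
```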
Classic 5 Stage Pipeline for a RISC Processor
Performance Issues in Pipelining
• Pipelining increases the CPU instruction throughput.
• Throughput: the number of instructions completed per unit of time.
• Pipelining does not decrease the execution time of an individual instruction.
• In fact, it usually increases the execution time of each instruction slightly, due to overhead (clock skew and pipeline register delay) in the control of the pipeline.
Example (p. A-10)
• Consider the unpipelined processor. Assume that it has a 1ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
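The example can be worked through numerically. The sketch below uses the values stated on the slide and assumes the pipelined machine completes one instruction per clock in the ideal case; the variable names are hypothetical.

```python
clock_ns = 1.0
overhead_ns = 0.2

# (frequency, cycles) for ALU operations, branches, and memory operations.
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

# Unpipelined: average instruction time = clock cycle * average CPI.
unpipelined_ns = clock_ns * sum(freq * cycles for freq, cycles in mix)

# Pipelined: one instruction per (slower) clock cycle.
pipelined_ns = clock_ns + overhead_ns

speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 2))  # 4.4
print(round(speedup, 2))         # 4.4 / 1.2 = 3.67
```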
Classic 5 Stage Pipeline for a RISC Processor
Classic 5-Stage Pipeline
• What happens in the pipeline?
• One resource cannot be used for two different operations on the same clock cycle. => Separate instruction and data memories.
• The register file is used in two stages: ID (two reads) and WB (one write). => Register write in the first half of the clock cycle and register read in the second half.
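To see which stages are active in a given clock cycle, the overlap can be tabulated. This is a rough sketch assuming one instruction enters the pipeline per cycle with no stalls; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle):
    """Stage occupied by instruction instr_index in the given cycle
    (both 0-based), or None if it is not in the pipeline."""
    position = cycle - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


def diagram(num_instructions):
    """One text row per instruction, one 4-character column per cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        cells = [(stage_at(i, c) or "").ljust(4) for c in range(total_cycles)]
        rows.append(f"i+{i}: " + "".join(cells).rstrip())
    return rows


for row in diagram(5):
    print(row)
```

In cycle 4 of this diagram, instruction i is in WB while instruction i+3 is in ID, which is the case the split register write/read handles.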
Pipeline Hazards
• Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
• Hazards reduce the performance from the ideal speedup gained by pipelining.
• Structural Hazards
• Data Hazards
• Control Hazards
• Hazards can make it necessary to stall the pipeline.
Pipeline Hazards
• When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.
• No new instructions are fetched during the stall.
Structural Hazards
• Hardware cannot support the combination of instructions that we want to execute in the same clock cycle.
• Suppose we have a single memory instead of two memories.
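The single-memory case can be sketched by checking when a data access in MEM coincides with an instruction fetch in IF. The sketch assumes the classic 5-stage timing with no other stalls, so instruction i is fetched in cycle i and reaches MEM in cycle i + 3; the function name and the instruction labels are hypothetical.

```python
def memory_conflicts(program):
    """Return (accessing_instr, fetched_instr) pairs that would need the
    single memory in the same cycle. program is a list of instruction
    kinds such as "load", "store", "alu", or "branch"."""
    conflicts = []
    for i, kind in enumerate(program):
        fetched = i + 3  # instruction being fetched while i is in MEM
        if kind in ("load", "store") and fetched < len(program):
            conflicts.append((i, fetched))
    return conflicts


program = ["load", "alu", "alu", "alu", "alu"]
print(memory_conflicts(program))  # [(0, 3)]
```

Here the load's data access collides with the fetch of the fourth instruction, so that fetch would have to be stalled for a cycle.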
Control Hazards
• Control hazards arise from the need to make a decision based on the results of one instruction while others are executing.
• Branch instruction
• Pipeline stall (or bubble)
• How can we overcome this problem?
Branch Hazards
• To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage.
Example
• Estimate the impact on the CPI of stalling on branches. Assume all other instructions have a CPI of 1.
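The slide does not state a branch frequency or stall length for this example, so the sketch below uses placeholder values: a 12% branch frequency (borrowed from the earlier CPI question) and a one-cycle stall per branch (the branch being resolved in the second stage). Both are assumptions for illustration only, and the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_frequency, stall_cycles, base_cpi=1.0):
    """Every instruction costs base_cpi; each branch adds stall_cycles."""
    return base_cpi + branch_frequency * stall_cycles


# Assumed values, not taken from this example's slide.
assumed_branch_frequency = 0.12
assumed_stall_cycles = 1

print(cpi_with_branch_stalls(assumed_branch_frequency, assumed_stall_cycles))
```

Under these assumptions the CPI rises from 1 to about 1.12, a slowdown of roughly 12% compared with the ideal pipeline.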
Branch Prediction
• Computers do indeed use prediction to handle branches.
• Simplest: Always predict that branches will not be taken.
• If you’re right, the pipeline proceeds at full speed.
• Dynamic hardware predictors make their guesses depending on the behavior of each branch.
• Popular: Keeping a history for each branch as taken or untaken, and then using the past to predict the future. => about 90% accuracy
Branch Prediction
• When the guess is wrong, the pipeline must make sure that the instructions following the wrongly guessed branch have no effect and must restart the pipeline from the proper branch address.
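The idea of keeping a taken/untaken history per branch can be sketched with the simplest possible scheme: remember only the last outcome of each branch and predict the same again. This is an illustrative model, not the predictor of any particular processor; the trace, the branch address, and the function name are hypothetical.

```python
def last_outcome_accuracy(trace, initial_prediction=False):
    """trace is a list of (branch_pc, taken) pairs. Each branch is
    predicted to repeat its previous outcome; branches not seen yet
    are predicted untaken by default."""
    history = {}
    correct = 0
    for branch_pc, taken in trace:
        prediction = history.get(branch_pc, initial_prediction)
        if prediction == taken:
            correct += 1
        history[branch_pc] = taken  # remember the most recent outcome
    return correct / len(trace)


# Hypothetical loop branch: taken 9 times, then falls through; run 10 times.
loop_trace = ([(0x400, True)] * 9 + [(0x400, False)]) * 10
print(last_outcome_accuracy(loop_trace))  # 0.8
```

In this trace the predictor misses twice per loop execution: once on the first iteration and once on the loop exit.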
Delayed Branch
• Delayed decision
• Used in MIPS
• The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay.
• MIPS software will place an instruction immediately after the delayed branch instruction that is not affected by the branch, and a taken branch changes the address of the instruction that follows this safe instruction.
• Compilers typically fill about 50% of the branch delay slots with useful instructions.
Data Hazards
• An instruction depends on the results of a previous instruction still in the pipeline.
• e.g.
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
  The add instruction doesn’t write the result until the 5th stage. => 3 bubbles
Solution
• Forwarding (or bypassing): getting the missing item early from the internal resources.
• e.g. as soon as the ALU creates the sum for the add, we can supply it as the input for the subtract.
Load-Use Data Hazard
• Even with forwarding, we still have to stall one stage for a load-use data hazard.
• Delayed loads: follow a load with an instruction that is independent of that load.
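The load-use stall and the benefit of reordering can be sketched by scanning adjacent instruction pairs. The sketch assumes full forwarding, so the only remaining stall is a load followed immediately by a consumer of its result; the tuple layout (opcode, destination, sources) and the function name are hypothetical.

```python
def load_use_stalls(program):
    """Count one stall for each load whose destination register is read
    by the very next instruction. Each instruction is a tuple
    (opcode, destination_register, source_registers)."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, destination, _ = previous
        if opcode == "lw" and destination in current[2]:
            stalls += 1
    return stalls


naive = [
    ("lw", "$t1", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),  # uses $t1 right after the load
    ("lw", "$t4", ("$t0",)),
]
rescheduled = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t4", ("$t0",)),         # independent instruction moved up
    ("add", "$t3", ("$t1", "$t2")),
]

print(load_use_stalls(naive))        # 1
print(load_use_stalls(rescheduled))  # 0
```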
Implementation of the MIPS Datapath
Events on Every Pipe Stage of the MIPS Pipeline
• See Figure A.19 on page A-32.
Revised Datapath
Revised Pipeline Structure
• See Figure A.25 on page A-39.
Floating-Point Operations
• The floating-point pipeline will allow for a longer latency for operations.
• The EX cycle may be repeated as many times as needed to complete the operation.
• The number of repetitions can vary for different operations.
• There may be multiple floating-point functional units.
Assumptions
• Main integer unit: handles loads and stores, integer ALU operations, and branches.
• FP and integer multiplier.
• FP adder: handles FP add, subtract, and conversion.
• FP and integer divider.
• The EX stages of these functional units are not pipelined.
MIPS with 3 FP Functional Units
• Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX.
• Instruction issue (p. A-33): the process of letting an instruction move from the ID stage into the EX stage of the pipeline.
• If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.
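The issue rule for unpipelined functional units can be sketched as a small in-order issue model. It tracks only structural conflicts on the functional units and ignores data hazards and write-back conflicts; the unit names, EX cycle counts, and function name are hypothetical.

```python
def issue_cycles(program):
    """Return the cycle in which each instruction enters EX.
    program is a list of (unit, ex_cycles) pairs. An instruction issues
    in order, at most one per cycle, and only once its unit is free."""
    unit_free_at = {}
    next_issue_slot = 0
    cycles = []
    for unit, ex_cycles in program:
        start = max(next_issue_slot, unit_free_at.get(unit, 0))
        cycles.append(start)
        unit_free_at[unit] = start + ex_cycles  # unit busy for the whole EX
        next_issue_slot = start + 1             # later instructions wait behind
    return cycles


# Hypothetical EX durations: divide 4 cycles, add 2 cycles.
program = [("fp_div", 4), ("fp_add", 2), ("fp_div", 4)]
print(issue_cycles(program))  # [0, 1, 4]
```

The second divide cannot issue until cycle 4, when the first divide leaves EX, and anything behind it in the pipeline waits as well.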
• Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
• Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.
Example (Figure A.30)
• Since most operations consume their operands at the beginning of the EX stage, the latency is usually the number of stages after EX in which an instruction produces a result.
• 0 for integer ALU operations.
• 1 for loads.
• Pipeline latency is essentially equal to 1 cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result.
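The "one less than the depth" rule can be checked against the two cases listed above. The sketch assumes full forwarding; the depths follow the 5-stage pipeline (an integer ALU result is produced in EX, a load result in MEM), while the function names and the stall formula are an illustrative reading of the latency definition rather than something stated on the slides.

```python
def pipeline_latency(execution_depth):
    """Latency is one cycle less than the number of stages from EX
    to the stage that produces the result."""
    return execution_depth - 1


def stalls_needed(latency, distance):
    """Stall cycles for a consumer issued `distance` instructions after
    its producer (distance 1 means immediately after)."""
    return max(0, latency - (distance - 1))


INTEGER_ALU_DEPTH = 1  # result produced in EX
LOAD_DEPTH = 2         # result produced in MEM (EX, MEM)

print(pipeline_latency(INTEGER_ALU_DEPTH))             # 0
print(pipeline_latency(LOAD_DEPTH))                    # 1
print(stalls_needed(pipeline_latency(LOAD_DEPTH), 1))  # 1 (load-use stall)
```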
• To achieve a higher clock rate, fewer logic levels are put in each pipe stage. => The number of pipe stages required for more complex operations is larger.
• The penalty for the faster clock rate is longer latency for operations.
Supporting Multiple FP Operations (figure label: unpipelined)