Appendix A Pipelining: Basic and Intermediate Concepts

Presentation Transcript


  1. Appendix A Pipelining: Basic and Intermediate Concepts

  2. Pipelining • An implementation technique whereby multiple instructions are overlapped in execution. • Each step in the pipeline (called a pipe stage) completes a part of an instruction. • Because all stages proceed at the same time, the length of a processor (clock) cycle is determined by the time required for the slowest pipe stage. CSCE 614 Fall 2009

  3. Pipelining • Designer’s goal: balance the length of each pipeline stage. • If the stages are perfectly balanced, the time per instruction on the pipelined processor is: Time per instruction on pipelined machine = (Time per instruction on unpipelined machine) / (Number of pipe stages) • Under these ideal conditions, speedup from pipelining = number of pipe stages.
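The two points above (the clock cycle is set by the slowest stage, and the ideal speedup equals the number of stages only when the stages are balanced) can be sketched in a few lines of Python. The stage latencies below are made-up illustrative values, not numbers from the slides, and pipeline overhead is ignored.

```python
def pipelined_cycle_time(stage_times):
    """Clock cycle of the pipeline: determined by the slowest stage."""
    return max(stage_times)


def pipeline_speedup(stage_times):
    """Speedup in time per instruction over an unpipelined machine that
    needs sum(stage_times) per instruction (overhead ignored)."""
    return sum(stage_times) / pipelined_cycle_time(stage_times)


# Hypothetical stage latencies in ns (assumptions for illustration only).
balanced = [2, 2, 2, 2, 2]
unbalanced = [1, 2, 3, 2, 2]

print(pipeline_speedup(balanced))    # equals the number of stages
print(pipeline_speedup(unbalanced))  # smaller: the 3 ns stage sets the clock
```

With unbalanced stages the unpipelined time per instruction is the same, but the clock must accommodate the slowest stage, so the speedup falls below the stage count.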

  4. RISC Instruction Set (MIPS64) • 64-bit version of the MIPS instruction set. • 32 registers • 3 classes of instructions • ALU instructions: DADD, DSUB, … • Load and store instructions: LD, SD, … • Branches and jumps

  5. Implementation of a RISC (Unpipelined, Multicycle) • Implementation of an integer subset of a RISC architecture in which every instruction takes at most 5 clock cycles. • Instruction Fetch (IF) • Instruction Decode/Register Fetch (ID) • Execution/Effective Address Calculation (EX) • Memory Access (MEM) • Write-Back (WB)

  6. Instruction Format (32-bit Version) • All MIPS instructions are 32 bits long. • R-format (add, sub, …): OP | rs | rt | rd | sa | funct • I-format (lw, sw, …): OP | rs | rt | immediate • J-format (j): OP | jump target

  7. Instruction Fetch Cycle (IF) • Send the program counter (PC) to memory. • Fetch the current instruction from memory. • Update the PC to the next sequential PC by adding 4 to the PC.

  8. Instruction Decode/Register Fetch Cycle (ID) • Decode the instruction and read the registers from the register file. • Do the equality test on the registers for a possible branch. • Sign-extend the offset field of the instruction in case it is needed. • Compute the possible branch target address by adding the sign-extended offset to the incremented PC.

  9. Execution/Effective Address Calculation (EX) • The ALU operates on the operands prepared in the prior cycle. • Memory reference instructions: The ALU adds the base register and the offset to form the effective address. • Register-Register: The ALU performs the operation specified by the ALU opcode on the values from the register file. • Register-Immediate: The ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.

  10. Memory Access (MEM) • If the instruction is a load, memory does a read using the effective address computed in the previous cycle. • If it is a store, memory writes the data from the second register read from the register file to the effective address.

  11. Write-Back Cycle (WB) • Register-Register ALU instruction or Load instruction: Write the result into the register file.

  12. In this implementation, branch instructions require 2 cycles, store instructions require 4 cycles, and all other instructions require 5 cycles. • Assuming a branch frequency of 12% and a store frequency of 10%, what is the overall CPI?
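The CPI question above is a weighted average over the instruction mix. A minimal sketch of the arithmetic, using only the frequencies and cycle counts stated on the slide (the function name `overall_cpi` is just an illustrative choice):

```python
def overall_cpi(branch_freq, store_freq,
                branch_cycles=2, store_cycles=4, other_cycles=5):
    """Weighted-average CPI for the unpipelined multicycle implementation."""
    other_freq = 1.0 - branch_freq - store_freq
    return (branch_freq * branch_cycles
            + store_freq * store_cycles
            + other_freq * other_cycles)


cpi = overall_cpi(branch_freq=0.12, store_freq=0.10)
print(round(cpi, 2))  # 0.12*2 + 0.10*4 + 0.78*5 = 4.54
```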

  13. Classic 5 Stage Pipeline for a RISC Processor

  14. Performance Issues in Pipelining • Pipelining increases the CPU instruction throughput. • Throughput: the number of instructions completed per unit of time. • Pipelining does not decrease the execution time of an individual instruction. • In fact, it slightly increases the execution time of each instruction because of overhead in the control of the pipeline (clock skew and pipeline register delay).

  15. Example (p. A-10) • Consider the unpipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
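The example can be worked numerically as follows. This is a sketch that assumes the pipelined machine ideally completes one instruction per clock, so its average time per instruction is just the stretched clock cycle (1 ns plus 0.2 ns of overhead); the variable names are illustrative.

```python
CLOCK_NS = 1.0      # unpipelined clock cycle
OVERHEAD_NS = 0.2   # added by clock skew and setup when pipelining

# (relative frequency, cycles) for each operation class on the slide
mix = {
    "alu":    (0.40, 4),
    "branch": (0.20, 4),
    "memory": (0.40, 5),
}

# Average instruction time on the unpipelined processor.
unpipelined_ns = CLOCK_NS * sum(freq * cycles for freq, cycles in mix.values())

# Assumption: ideal pipeline, one instruction completes every clock.
pipelined_ns = CLOCK_NS + OVERHEAD_NS

speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 2))  # 4.4
print(round(speedup, 2))         # 3.67
```

The 0.2 ns overhead is what keeps the speedup below the average unpipelined cycle count of 4.4.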

  16. Classic 5 Stage Pipeline for a RISC Processor

  17. Classic 5-Stage Pipeline • What happens in the pipeline? • One resource cannot be used for two different operations on the same clock cycle. => Separate instruction and data memories. • The register file is used in two stages: ID (two reads) and WB (one write). => Register write in the first half of the clock cycle and register read in the second half.
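To see why the resource conflicts above arise, it helps to tabulate which stage each instruction occupies in each clock cycle. The sketch below assumes an ideal pipeline with no stalls, in which a new instruction enters IF every cycle; `stage_at` and `diagram` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) during clock
    cycle `cycle` (1-based), or None if it is not in the pipeline."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def diagram(num_instrs):
    """One text row per instruction, one 4-character column per cycle."""
    total_cycles = num_instrs + len(STAGES) - 1
    rows = []
    for i in range(num_instrs):
        cells = [(stage_at(i, c) or "").ljust(4)
                 for c in range(1, total_cycles + 1)]
        rows.append(f"i+{i}: " + "".join(cells).rstrip())
    return rows


for row in diagram(5):
    print(row)
```

In cycle 4, instruction i is in MEM while instruction i+3 is in IF, which is why separate instruction and data memories are needed. In cycle 5, instruction i is in WB while i+3 is in ID, which is why the register file is written in the first half of the cycle and read in the second.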

  18. Pipeline Hazards

  19. Pipeline Hazards • Situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. • Hazards reduce the performance from the ideal speedup gained by pipelining. • Structural Hazards • Data Hazards • Control Hazards • Hazards can make it necessary to stall the pipeline.

  20. Pipeline Hazards • When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled. • No new instructions are fetched during the stall.

  21. Structural Hazards • Hardware cannot support the combination of instructions that we want to execute in the same clock cycle. • Example: suppose we have a single memory instead of separate instruction and data memories.
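A small sketch of the single-memory case: in an ideal 5-stage schedule, the MEM stage of instruction i falls in the same cycle as the IF stage of instruction i+3, so every load or store contends with a later fetch. The opcode list and the helper name `memory_conflicts` are assumptions for illustration, and the sketch only detects the conflicts; it does not model the resulting stalls.

```python
MEMORY_OPS = {"LD", "SD"}  # opcodes that access data memory in MEM


def memory_conflicts(opcodes):
    """Return (cycle, mem_instr, fetch_instr) triples where a single shared
    memory would be needed by both MEM and IF in the same clock cycle.
    Instruction i (0-based) is fetched in cycle i + 1 and is in MEM in
    cycle i + 4, assuming no stalls."""
    conflicts = []
    for i, op in enumerate(opcodes):
        fetch_instr = i + 3  # instruction whose IF coincides with this MEM
        if op in MEMORY_OPS and fetch_instr < len(opcodes):
            conflicts.append((i + 4, i, fetch_instr))
    return conflicts


program = ["LD", "DADD", "DSUB", "DADD", "SD"]  # hypothetical sequence
print(memory_conflicts(program))  # [(4, 0, 3)]
```

The trailing SD produces no conflict here only because no instruction is fetched three slots after it in this short sequence.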

  22. Control Hazards • These arise from the need to make a decision based on the results of one instruction while others are executing. • Example: a branch instruction • Pipeline stall (or bubble) • How can we overcome this problem?

  23. Branch Hazards • To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage.

  24. Example • Estimate the impact on the CPI of stalling on branches. Assume all other instructions have a CPI of 1.
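The slide does not give the branch frequency or the stall length, so the sketch below fills them in with labeled assumptions: the 12% branch frequency used earlier in the deck, and a one-cycle stall per branch (the branch being resolved in the second stage). With other values the same formula applies.

```python
def cpi_with_branch_stalls(branch_freq, stall_cycles, base_cpi=1.0):
    """Effective CPI when every branch adds `stall_cycles` stall cycles
    and every other instruction has a CPI of `base_cpi`."""
    return base_cpi + branch_freq * stall_cycles


# Assumed values (not stated on this slide): 12% branches, 1 stall cycle.
cpi = cpi_with_branch_stalls(branch_freq=0.12, stall_cycles=1)
print(round(cpi, 2))  # 1.12
```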

  25. Branch Prediction • Computers do indeed use prediction to handle branches. • Simplest: Always predict that branches will not be taken. • If you’re right, the pipeline proceeds at full speed. • Dynamic hardware predictors make their guesses depending on the behavior of each branch. • Popular: Keeping a history for each branch as taken or untaken, and then using the past to predict the future. => about 90% accuracy

  26. Branch Prediction • When the guess is wrong, the pipeline must make sure that the instructions following the wrongly guessed branch have no effect, and it must restart the pipeline from the proper branch address.
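As a minimal sketch of history-based prediction, the class below remembers only the last outcome of each branch and predicts that it will repeat. This one-bit scheme is an assumption chosen for brevity, not necessarily the predictor the slides have in mind, and the branch address and trace are made up.

```python
class LastOutcomePredictor:
    """Predict that each branch repeats its most recent outcome.
    Branches with no history are predicted not taken."""

    def __init__(self):
        self.history = {}  # branch address -> last outcome (True = taken)

    def predict(self, pc):
        return self.history.get(pc, False)

    def update(self, pc, taken):
        self.history[pc] = taken


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Hypothetical loop branch at address 0x40: taken 9 times, then falls through.
trace = [(0x40, True)] * 9 + [(0x40, False)]
print(accuracy(LastOutcomePredictor(), trace))  # 0.8
```

The two mispredictions are the first iteration (no history yet) and the loop exit. Each misprediction corresponds to the recovery described above: the wrongly fetched instructions are squashed and fetch restarts at the correct address.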

  27. Delayed Branch • Delayed decision • Used in MIPS • The delayed branch always executes the next sequential instruction, with the branch taking place after that one instruction delay.

  28. [No transcribed text]

  29. MIPS software places, immediately after the delayed branch instruction, an instruction that is not affected by the branch; a taken branch changes the address of the instruction that follows this safe instruction. • Compilers typically fill about 50% of the branch delay slots with useful instructions.

  30. Data Hazards • An instruction depends on the results of a previous instruction still in the pipeline. • e.g. add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 • The add instruction doesn’t write the result until the 5th stage. => 3 bubbles

  31. Solution • forwarding (or bypassing): getting the missing item early from the internal resources. • e.g. as soon as the ALU creates the sum for the add, we can supply it as the input for the subtract.
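The effect of forwarding on the add/sub pair can be estimated with a simple timing model. The sketch below is built on assumptions: the producer occupies IF, ID, EX, MEM, WB in cycles 1 to 5; without forwarding the consumer must read the register file in ID after the write (or in the same cycle if the register file is written in the first half and read in the second); with forwarding the consumer needs the value at the start of its EX. The function name and parameters are hypothetical.

```python
def stall_cycles(distance, producer="alu", forwarding=True,
                 split_cycle_regfile=False):
    """Bubbles needed before a consumer issued `distance` instructions
    after its producer (1 = immediately after).
    Producer timeline: IF=1, ID=2, EX=3, MEM=4, WB=5."""
    if forwarding:
        # An ALU result exists at the end of EX, a loaded value at the
        # end of MEM; the consumer's EX can start one cycle later.
        produced_in = 3 if producer == "alu" else 4
        first_usable = produced_in + 1
        ideal = 3 + distance          # consumer's EX cycle without stalls
    else:
        # The consumer reads registers in ID, after (or, with a
        # split-cycle register file, during) the producer's WB.
        first_usable = 5 if split_cycle_regfile else 6
        ideal = 2 + distance          # consumer's ID cycle without stalls
    return max(0, first_usable - ideal)


print(stall_cycles(1, forwarding=False))  # 3 bubbles for add followed by sub
print(stall_cycles(1, forwarding=True))   # 0 with ALU-to-ALU forwarding
```

Under the same model, a load followed immediately by a dependent instruction still needs one bubble even with forwarding, because the loaded value is not available until the end of MEM.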

  32. [No transcribed text]

  33. Load-Use Data Hazard

  34. Even with forwarding, we still have to stall one cycle for a load-use data hazard. • Delayed loads: the load is followed by an instruction that is independent of that load.
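A load-use hazard can be spotted by checking whether the instruction right after a load reads the load's destination register. The sketch below uses a made-up tuple encoding `(opcode, dest, sources)` and made-up register names, and assumes full forwarding so that only back-to-back load-use pairs cost a cycle.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by an instruction that uses the
    result of the load immediately before it (forwarding assumed)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LD" and prev_dest in curr_sources:
            stalls += 1
    return stalls


# Hypothetical sequences: (opcode, destination, source registers)
hazard = [
    ("LD",   "R1", ("R2",)),
    ("DADD", "R3", ("R1", "R4")),   # uses R1 right after the load
    ("DSUB", "R5", ("R6", "R7")),
]
scheduled = [
    ("LD",   "R1", ("R2",)),
    ("DSUB", "R5", ("R6", "R7")),   # independent instruction fills the slot
    ("DADD", "R3", ("R1", "R4")),
]

print(load_use_stalls(hazard))     # 1
print(load_use_stalls(scheduled))  # 0
```

Reordering the independent DSUB behind the load is the delayed-load idea: the slot after the load is filled with useful work instead of a bubble.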

  35. [No transcribed text]

  36. Implementation of the MIPS Datapath

  37. Events on Every Pipe Stage of the MIPS Pipeline • See Figure A.19 on page A-32.

  38. Revised Datapath

  39. Revised Pipeline Structure • See Figure A.25 on page A-39.

  40. Extending the MIPS to Handle Multicycle Operations

  41. Floating-Point Operations • The floating-point pipeline will allow for a longer latency for operations. • The EX cycle may be repeated as many times as needed to complete the operation. • The number of repetitions can vary for different operations. • There may be multiple floating-point functional units.

  42. Assumptions • Main integer unit: handles loads and stores, integer ALU operations, and branches. • FP and integer multiplier. • FP adder: handles FP add, subtract, and conversion. • FP and integer divider. • The EX stages of these functional units are not pipelined.

  43. MIPS with 3 FP Functional Units

  44. Because EX is not pipelined, no other instruction using that functional unit may issue until the previous instruction leaves EX. • Instruction issue (p. A-33): the process of letting an instruction move from the ID stage into the EX stage of the pipeline. • If an instruction cannot proceed to the EX stage, the entire pipeline behind that instruction will be stalled.

  45. Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result. • Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.
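For a functional unit whose EX stage is not pipelined, both quantities follow from how many cycles the operation spends in EX. The sketch below assumes that the result is produced at the end of the last EX cycle and that the next operation of the same type can issue only once the previous one has left EX; the 4-cycle unit is a made-up example, not a figure from the slides.

```python
def unpipelined_unit(ex_cycles):
    """Latency and initiation interval for a unit that repeats EX
    `ex_cycles` times and cannot overlap two operations (assumed model)."""
    return {"latency": ex_cycles - 1, "initiation_interval": ex_cycles}


def issue_cycles(num_ops, initiation_interval, first_issue=1):
    """Earliest issue cycle of each of `num_ops` back-to-back operations
    of the same type."""
    return [first_issue + i * initiation_interval for i in range(num_ops)]


unit = unpipelined_unit(ex_cycles=4)   # hypothetical 4-cycle operation
print(unit)
print(issue_cycles(3, unit["initiation_interval"]))  # [1, 5, 9]
```

A fully pipelined unit would instead have an initiation interval of 1, so its operations could issue in consecutive cycles regardless of latency.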

  46. Example (Figure A.30)

  47. Since most operations consume their operands at the beginning of the EX stage, the latency is usually the number of stages after EX in which an instruction produces its result. • 0 for integer ALU operations. • 1 for loads. • Pipeline latency is essentially 1 cycle less than the depth of the execution pipeline, which is the number of stages from the EX stage to the stage that produces the result.
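The depth rule above, and what it implies for a dependent instruction, can be written out directly. In this sketch the depths are the ones implied by the slide (an integer ALU result is produced in EX, a load result in MEM); the helper names are illustrative, and forwarding is assumed.

```python
def latency_from_depth(depth):
    """Latency = stages from EX through the result-producing stage, minus 1."""
    return depth - 1


def stalls_for_consumer(latency, intervening_instructions):
    """Stall cycles for a dependent instruction separated from its producer
    by `intervening_instructions` independent instructions."""
    return max(0, latency - intervening_instructions)


alu_latency = latency_from_depth(1)    # result produced in EX
load_latency = latency_from_depth(2)   # result produced in MEM (EX, MEM)

print(alu_latency, load_latency)             # 0 1
print(stalls_for_consumer(load_latency, 0))  # 1: the load-use stall
print(stalls_for_consumer(load_latency, 1))  # 0: one independent instruction hides it
```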

  48. To achieve a higher clock rate, fewer logic levels are put in each pipe stage. => The number of pipe stages required for more complex operations is larger. • The penalty for the faster clock rate is longer latency for operations.

  49. Supporting Multiple FP Operations unpipelined