
Maximizing Processor Performance Through Pipelining

Explore the efficiency benefits of processor pipelining, including hazards, performance comparisons, and code scheduling strategies. Learn about structure, data, and control hazards in pipeline design.

sgutierrez

Presentation Transcript


  1. Chapter 4 Pipelining and Instruction Level Parallelism

  2. Pipelining Analogy • Pipelined laundry: overlapping execution • Parallelism improves performance • Four loads: speedup = 8/3.5 = 16/7 ≈ 2.3 • Non-stop (for large n): speedup = (4 × 0.5 × n) / (1.5 + 0.5 × n) ≈ 4 = number of stages §4.5 An Overview of Pipelining Chapter 4 — The Processor — 2
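
The laundry arithmetic on this slide can be checked with a short sketch. It assumes four stages of 0.5 hours each, as the slide's own formula implies; the function name `laundry_speedup` is made up for illustration.

```python
def laundry_speedup(loads: int, stages: int = 4, stage_hours: float = 0.5) -> float:
    """Speedup of pipelined over sequential laundry for a given number of loads."""
    # Sequential: every load runs through all stages before the next one starts.
    sequential = loads * stages * stage_hours
    # Pipelined: the first load takes all stages, then one load finishes per stage time.
    pipelined = (stages + loads - 1) * stage_hours
    return sequential / pipelined


if __name__ == "__main__":
    print(laundry_speedup(4))          # four loads: 8 h / 3.5 h
    print(laundry_speedup(1_000_000))  # approaches the number of stages
```

The pipelined time `(stages + loads - 1) * stage_hours` is the same quantity as the slide's `1.5 + 0.5 × n` when there are four half-hour stages.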

  3. MIPS Pipeline • Five stages, one step per stage • IF: Instruction fetch from memory • ID: Instruction decode & register read • EX: Execute operation or calculate address • MEM: Access memory operand • WB: Write result back to register

  4. Pipeline Performance • Assume time for stages is • 100ps for register read or write • 200ps for other stages • Compare pipelined datapath with single-cycle datapath

  5. Pipeline Performance • Single-cycle (Tc = 800 ps) • Pipelined (Tc = 200 ps)

  6. Pipeline Speedup • If all stages are balanced • i.e., all take the same time • Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages • If not balanced, speedup is less • Speedup due to increased throughput • Latency (time for each instruction) does not decrease
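
A rough sketch of the comparison on slides 4 to 6, using the stage times given there. Assigning the 100 ps figure to ID (register read) and WB (register write) is an assumption based on the stage descriptions on slide 3; the function names are illustrative.

```python
# Stage latencies in picoseconds (assumed mapping of the slide's 100 ps / 200 ps figures).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_time(n: int) -> int:
    """Total time for n instructions when each one takes a full 800 ps cycle."""
    return n * sum(STAGE_PS.values())


def pipelined_time(n: int) -> int:
    """Total time for n instructions in an ideal pipeline with no hazards."""
    clock = max(STAGE_PS.values())            # the slowest stage sets the clock
    return (len(STAGE_PS) + n - 1) * clock    # fill the pipe, then one result per cycle


if __name__ == "__main__":
    for n in (3, 1_000_000):
        print(n, single_cycle_time(n), pipelined_time(n))
```

For three instructions this gives 2400 ps versus 1400 ps. For large n the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.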

  7. Pipelining and ISA Design • MIPS ISA designed for pipelining • All instructions are 32 bits • Easier to fetch and decode in one cycle • cf. x86: 1- to 17-byte instructions • Few and regular instruction formats • Can decode and read registers in one step • Load/store addressing • Can calculate address in 3rd stage, access memory in 4th stage • Alignment of memory operands • Memory access takes only one cycle

  8. Hazards • Situations that prevent starting the next instruction in the next cycle • Structure hazard • A required resource is busy • Data hazard • Need to wait for a previous instruction to complete its data read/write • Control hazard • Deciding on control action depends on previous instruction

  9. Structure Hazards • Conflict for use of a resource • In MIPS pipeline with a single memory • Load/store requires data access • Instruction fetch would have to stall for that cycle • Would cause a pipeline “bubble” • Hence, pipelined datapaths require separate instruction/data memories • Or separate instruction/data caches

  10. Data Hazards • An instruction depends on completion of data access by a previous instruction
        add $s0, $t0, $t1
        sub $t2, $s0, $t3

  11. Forwarding (aka Bypassing) • Use result when it is computed • Don’t wait for it to be stored in a register • Requires extra connections in the datapath

  12. Load-Use Data Hazard • Can’t always avoid stalls by forwarding • If value not computed when needed • Can’t forward backward in time!

  13. Code Scheduling to Avoid Stalls • Reorder code to avoid use of load result in the next instruction • C code for A = B + E; C = B + F; • Code order #1 (2 stalls, 13 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
        add $t3, $t1, $t2    # stall: $t2 is loaded by the previous instruction
        sw  $t3, 12($t0)
        lw  $t4, 8($t0)
        add $t5, $t1, $t4    # stall: $t4 is loaded by the previous instruction
        sw  $t5, 16($t0)

  14. 2 Stalls: 13 Cycles

  15. Code Scheduling to Avoid Stalls • Reorder code to avoid use of load result in the next instruction • C code for A = B + E; C = B + F; • Code order #2 (no stalls, 11 cycles):
        lw  $t1, 0($t0)
        lw  $t2, 4($t0)
        lw  $t4, 8($t0)
        add $t3, $t1, $t2
        sw  $t3, 12($t0)
        add $t5, $t1, $t4
        sw  $t5, 16($t0)

  16. Reordering of Code; No Hazards: 11 Cycles • Instructions labeled in the diagram: lw $t2, 4($t0); lw $t4, 8($t0); add $t3, $t1, $t2; add $t5, $t1, $t4
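
The 13-cycle and 11-cycle counts for the two code orders can be reproduced with a small sketch. It uses a deliberately simplified model: a five-stage pipeline with full forwarding, where the only stall is one bubble when an instruction uses a register loaded by the instruction immediately before it. The helper names `parse` and `total_cycles` are made up for illustration, and only `lw`, `sw`, and `add` are handled.

```python
import re


def parse(line):
    """Return (opcode, destination register or None, list of source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":                  # sw reads both of its registers and writes none
        return op, None, regs
    return op, regs[0], regs[1:]    # lw/add: the first register is the destination


def total_cycles(program, stages=5):
    """Cycle count under the simplified model: only load-use pairs stall."""
    stalls = 0
    prev_op, prev_dest = None, None
    for line in program:
        op, dest, sources = parse(line)
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1             # the loaded value is not ready until after MEM
        prev_op, prev_dest = op, dest
    return stages + len(program) - 1 + stalls


ORDER_1 = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2", "sw $t3, 12($t0)",
    "lw $t4, 8($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]
ORDER_2 = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "add $t5, $t1, $t4", "sw $t5, 16($t0)",
]

if __name__ == "__main__":
    print(total_cycles(ORDER_1), total_cycles(ORDER_2))
```

Moving the third `lw` up puts an independent instruction between each load and its first use, which is why the second order needs no bubbles.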

  17. Control Hazards • Branch determines flow of control • Which instruction to fetch next depends on the branch outcome • Pipeline can’t always fetch the correct instruction: it fetches the next instruction in sequence while the branch is still in the ID (instruction decode) stage • In MIPS pipeline: need to compare registers and compute address of target instruction early in the pipeline • Add hardware to do the next-instruction address calculation in the ID stage

  18. Option: Stall on Branch • Wait until branch outcome is determined before fetching next instruction

  19. Branch Prediction • Longer pipelines can’t readily determine branch outcome early • Stall penalty becomes unacceptable • Predict outcome of branch • Only stall if prediction is wrong • In MIPS pipeline • Can predict branches not taken • Fetch instruction after branch, with no delay

  20. MIPS with Predict Not Taken • Diagram panels: prediction correct; prediction incorrect

  21. More-Realistic Branch Prediction • Static branch prediction • Based on typical branch behavior • Example: loop and if-statement branches • Predict backward branches taken • Predict forward branches not taken • Dynamic branch prediction • Hardware measures actual branch behavior • e.g., record recent history of each branch • Assume future behavior will continue the trend • When wrong, stall while re-fetching, and update history
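
The two prediction styles on this slide might be sketched as follows. The static rule comes straight from the slide. The dynamic predictor uses a 2-bit saturating counter per branch, which is one common way to "record recent history"; the slide does not say which scheme is meant, so treat that choice and all names here as assumptions.

```python
def static_predict(branch_pc: int, target_pc: int) -> bool:
    """Static rule: predict backward branches taken, forward branches not taken."""
    return target_pc < branch_pc


class TwoBitPredictor:
    """Per-branch 2-bit saturating counters (assumed scheme); 2 and 3 predict taken."""

    def __init__(self):
        self.counters = {}                      # branch address -> counter in 0..3

    def predict(self, pc: int) -> bool:
        return self.counters.get(pc, 1) >= 2    # unseen branches start weakly not taken

    def update(self, pc: int, taken: bool) -> None:
        count = self.counters.get(pc, 1)
        self.counters[pc] = min(3, count + 1) if taken else max(0, count - 1)


def mispredictions(predictor, pc, outcomes):
    """Count wrong predictions for one branch over a sequence of actual outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1                          # this is where the pipeline would stall
        predictor.update(pc, taken)
    return wrong


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through; the loop runs twice.
    loop = ([True] * 9 + [False]) * 2
    print(mispredictions(TwoBitPredictor(), 0x40, loop))
```

With two bits of history, a single loop exit does not flip the prediction, so the second run of the loop starts with a correct "taken" guess.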

  22. Pipeline Summary (The BIG Picture) • Pipelining improves performance by increasing instruction throughput • Executes multiple instructions in parallel • Each instruction has the same latency • Subject to hazards • Structure, data, control • Instruction set design affects complexity of pipeline implementation

  23. MIPS Datapath with Pipeline Stages Identified • Example instructions: lw, store, reg-reg ALU §4.6 Pipelined Datapath and Control

  24. MIPS Datapath with Pipeline Stages Identified • Up to 5 instructions will be in execution during a clock cycle • Instructions and data move from left to right • 2 exceptions to left-to-right flow: WB, and selection of next value of PC • Right-to-left flow affects later instructions & can lead to data or control hazards

  25. Pipeline registers • For pipelining to work, need registers between stages • To hold information produced in previous cycle and needed in a later cycle (e.g. Instruction Register, IR, information) • Up to 5 instructions will be in execution during a clock cycle

  26. Pipeline Operation • Cycle-by-cycle flow of instructions through the pipelined datapath • 2 types of diagrams show instruction flow • “Single-clock-cycle” pipeline diagram • Shows pipeline usage in a single cycle • Highlights resources used by different instructions in different pipeline stages • cf. “multi-clock-cycle” diagram • Graph of pipeline operations over time • We’ll look at “single-clock-cycle” diagrams for load & store (active in 5 stages of the pipeline)

  27. IF for Load, Store, … • Instruction addressed by PC (i.e. IR) is written to (IF/ID).IR • PC ← PC + 4 (PC+4 is also written to the IF/ID register; it could be used later by a beq instruction)

  28. Guideline • In every clock cycle, transfer to the following pipeline register all information that might be needed by any instruction during a later clock cycle

  29. ID for Load, Store, … • (IF/ID).IR supplies 3 values: the 16-bit immediate field (sign-extended to 32 bits) & the register #s used to read 2 registers from the register file • The 32-bit immediate field, the 2 values read from the 2 registers, and PC+4 are stored in the ID/EX register

  30. EX for Load • The 32-bit immediate field & the value of register 1 are read from the ID/EX pipeline register and added • The sum is stored in the (EX/MEM).Address field of the EX/MEM pipeline register

  31. MEM for Load • Data in the memory location specified by the (EX/MEM).Address field of the EX/MEM pipeline register is loaded into the (MEM/WB).Data field of the MEM/WB pipeline register

  32. WB for Load • Data in the (MEM/WB).Data field of the MEM/WB pipeline register is written back into the register file (the register # used must be the one originally specified by the Load instruction) • Diagram annotation: wrong register number

  33. Corrected Datapath for Load • Both the data & the register # to be used for write back into the register file come from the MEM/WB pipeline register • The 5-bit write-back register number is passed along from the IF/ID register, which holds the Load instruction after it is fetched, until it reaches the MEM/WB register • These 5 bits are added to the last 3 pipeline registers
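
Why the write-back register number has to travel with the load can be illustrated with a toy model. This is only a sketch under simplifying assumptions: each pipeline register holds nothing but the destination register number of the instruction occupying it (or `None` for an empty slot), and the function name `written_registers` is made up.

```python
def written_registers(dests, carry_rd):
    """Return the register number used at WB for each load, in program order.

    dests: destination register numbers of a sequence of loads.
    carry_rd: True models the corrected datapath (number read from MEM/WB);
              False models the flawed one (number read from IF/ID at WB time).
    """
    # Pipeline registers in order: IF/ID, ID/EX, EX/MEM, MEM/WB.
    pipe = [None, None, None, None]
    written = []
    # Trailing None entries are empty slots that let the last loads drain.
    for fetched in list(dests) + [None] * len(pipe):
        if_id, mem_wb = pipe[0], pipe[3]
        if mem_wb is not None:                  # a load is in its WB stage this cycle
            written.append(mem_wb if carry_rd else if_id)
        pipe = [fetched] + pipe[:3]             # clock edge: everything moves one stage
    return written


if __name__ == "__main__":
    loads = [8, 9, 10, 11]                      # four loads targeting $8..$11
    print(written_registers(loads, carry_rd=True))
    print(written_registers(loads, carry_rd=False))
```

In the flawed version, by the time the first load reaches WB the IF/ID register already holds an instruction fetched three cycles later, so the data would land in that instruction's register instead.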

  34. EX for Store • The 32-bit immediate field & the value of register 1 are read from the ID/EX pipeline register and added • The sum is stored in the (EX/MEM).Address field of the EX/MEM pipeline register

  35. MEM for Store • The value of register 2, read from the register file in the ID stage, is stored in the ID/EX register and passed to the EX/MEM register so that it can be written to memory at the location specified by the (EX/MEM).Address field of the EX/MEM pipeline register

  36. WB for Store • For the Store instruction, nothing happens in the write back pipeline stage. Hence the Store instruction spends a clock cycle in the WB stage doing nothing (later instructions are progressing at maximum speed).

  37. Multi-Cycle Pipeline Diagram • Form showing resource usage

  38. Multi-Cycle Pipeline Diagram • Traditional form
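
Since the diagram itself did not survive in this transcript, a traditional multi-clock-cycle diagram can be regenerated as text with a sketch like the one below. It assumes an ideal pipeline with no stalls; the function names and the sample instructions (borrowed from slide 15) are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def diagram_rows(instructions):
    """One row per instruction; each row lists the stage occupied in every cycle."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for index, text in enumerate(instructions):
        cells = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            cells[index + offset] = stage       # instruction i enters IF in cycle i + 1
        rows.append((text, cells))
    return rows


def render(instructions):
    """Format the rows as fixed-width text with a cycle-number header."""
    rows = diagram_rows(instructions)
    width = max(len(text) for text, _ in rows)
    cycles = len(rows[0][1])
    lines = [" " * width + " " + " ".join(f"{c:<3}" for c in range(1, cycles + 1))]
    for text, cells in rows:
        lines.append(f"{text:<{width}} " + " ".join(f"{cell:<3}" for cell in cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(["lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2"]))
```

Reading a single column of the output gives the corresponding single-clock-cycle view: which instruction occupies which stage in that cycle.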

  39. Single-Cycle Pipeline Diagram • State of pipeline in a given cycle

  40. Pipelined Control • Review the information in the following figures in the book: 4.47 (copy of Figure 4.12), 4.48 (copy of Figure 4.16), 4.49 (same information as in Figure 4.18)


  42. Pipelined Control (Simplified)

  43. Pipelined Control • Control signals derived from instruction • As in single-cycle implementation


  45. Pipelined Control

  46. Data Hazards in ALU Instructions • Consider this sequence:
        sub $2, $1, $3
        and $12, $2, $5
        or  $13, $6, $2
        add $14, $2, $2
        sw  $15, 100($2)
      • We can resolve hazards with forwarding • How do we detect when to forward? §4.7 Data Hazards: Forwarding vs. Stalling

  47. Dependencies & Forwarding • Fwd from EX/MEM pipeline reg: 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs, 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt

  48. Detecting the Need to Forward • Pass register numbers along pipeline • e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register • ALU operand register numbers in EX stage are given by • ID/EX.RegisterRs, ID/EX.RegisterRt • Data hazards when • 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs • 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt (1a, 1b: fwd from EX/MEM pipeline reg) • 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs • 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt (2a, 2b: fwd from MEM/WB pipeline reg)

  49. Detecting the Need to Forward • But only if forwarding instruction will write to a register! • EX/MEM.RegWrite, MEM/WB.RegWrite • And only if Rd for that instruction is not $zero • EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
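
The detection conditions from slides 48 and 49 can be combined into one check per ALU operand, sketched below. The dictionary layout and the function name `forward_source` are illustrative. Checking EX/MEM before MEM/WB, so that the most recent result wins when both match, is an assumption that these slides do not state.

```python
def forward_source(operand_reg, ex_mem, mem_wb):
    """Decide where an EX-stage ALU operand should come from.

    operand_reg: ID/EX.RegisterRs or ID/EX.RegisterRt.
    ex_mem, mem_wb: dicts with 'RegWrite' (bool) and 'RegisterRd' (int).
    """
    def hazard(stage):
        return (
            stage["RegWrite"]                       # the earlier instruction writes a register
            and stage["RegisterRd"] != 0            # and that register is not $zero
            and stage["RegisterRd"] == operand_reg  # and it is the one we need (1a/1b, 2a/2b)
        )

    if hazard(ex_mem):
        return "EX/MEM"          # assumed priority: the newer result is forwarded first
    if hazard(mem_wb):
        return "MEM/WB"
    return "register file"


if __name__ == "__main__":
    # Slide 46: sub $2,$1,$3 followed by and $12,$2,$5 and or $13,$6,$2.
    sub = {"RegWrite": True, "RegisterRd": 2}
    and_ = {"RegWrite": True, "RegisterRd": 12}
    idle = {"RegWrite": False, "RegisterRd": 0}
    print(forward_source(2, ex_mem=sub, mem_wb=idle))   # and's Rs while sub is in MEM
    print(forward_source(2, ex_mem=and_, mem_wb=sub))   # or's Rt while sub is in WB
```

A store in EX/MEM or MEM/WB never triggers forwarding here because its `RegWrite` is false, which is the point of the first condition on slide 49.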

  50. Forwarding Paths
