
Pipelining (Chapter 8)





Presentation Transcript


  1. Pipelining (Chapter 8) http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_10.ppt TU-Delft TI1400/11-PDS

  2. Basic idea (1) [Figure: sequential execution. Instructions I1–I4 each pass through fetch (F) and then execute (E) in turn: F1 E1, F2 E2, F3 E3, F4 E4. A buffer B1 sits between the instruction fetch unit and the execution unit.]

  3. Basic idea (2): Overlap [Figure: pipelined execution over clock cycles 1–5. Fetch and execute overlap: while I1 executes (E1), I2 is fetched (F2), and so on for I3 and I4.]

  4. Instruction phases • F Fetch instruction • D Decode instruction and fetch operands • O Perform operation • W Write result

  5. Four-stage pipeline [Figure: pipelined execution. I1 occupies F1 D1 O1 W1 in cycles 1–4; each following instruction starts one cycle later, so I4 (F4 D4 O4 W4) completes in cycle 7.]
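To make the overlap concrete, here is a minimal Python sketch (mine, not from the slides) that prints which stage each instruction occupies in every clock cycle of an ideal, stall-free four-stage pipeline:

```python
# Ideal 4-stage pipeline (F, D, O, W) with no stalls.
STAGES = ["F", "D", "O", "W"]

def schedule(n_instructions):
    """Map each clock cycle to the (instruction, stage) pairs active in it."""
    table = {}
    for i in range(n_instructions):           # instruction i enters at cycle i+1
        for s, stage in enumerate(STAGES):
            table.setdefault(i + 1 + s, []).append((f"I{i + 1}", stage))
    return table

for cycle, work in sorted(schedule(4).items()):
    print(f"cycle {cycle}:", "  ".join(f"{ins}:{st}" for ins, st in work))
```

With 4 instructions and 4 stages, the last write-back lands in cycle 4 + (4 - 1) = 7, matching the diagram on this slide.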

  6. Hardware organization (1) [Figure: four-stage datapath. Fetch unit → B1 → decode-and-operand-fetch unit → B2 → operation unit → B3 → write unit.]

  7. Hardware organization (2) During cycle 4, the buffers contain: • B1: instruction I3 • B2: the source operands of I2, the specification of the operation, and the specification of the destination operand • B3: the result of the operation of I1 and the specification of the destination operand

  8. Hardware organization (3) [Figure: the datapath of slide 6 annotated with the cycle-4 contents: B1 holds I3, B2 holds the operands and operation of I2, B3 holds the result of I1.]

  9. Pipeline stall (1) • Pipeline stall: a delay in a stage of the pipeline caused by an instruction • Reasons for a pipeline stall: • cache miss • long operation (for example, division) • dependency between successive instructions • branching

  10. Pipeline stall (2): Cache miss [Figure: clock cycles 1–8. A cache miss stretches the fetch of I2 over cycles 2–5, so D2, O2, and W2 are pushed back and I3 behind it is delayed as well.]

  11. Pipeline stall (3): Cache miss [Figure: the same miss viewed per stage. F is busy with F2 during cycles 2–5; the D, O, and W stages each sit idle for three cycles before D2, O2, and W2 arrive. Effect of a cache miss in F2.]

  12. Pipeline stall (4): Long operation [Figure: clock cycles 1–8. A multi-cycle operation in I2 (for example, a division) occupies the O stage for several cycles, stalling the instructions behind it.]

  13. Pipeline stall (5): Dependencies • The instructions ADD R1, 3(R1) and ADD R4, 4(R1) cannot be done in parallel: the first writes R1 and the second needs it • The instructions ADD R2, 3(R1) and ADD R4, 4(R3) can be done in parallel
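An issue unit (or compiler) can detect this read-after-write situation by comparing register sets. A minimal Python sketch under assumed conventions (each instruction is reduced to its destination register and the set of registers it reads):

```python
def raw_hazard(first_dest, second_sources):
    """True if the second instruction reads a register the first one writes,
    i.e. a read-after-write dependency that forbids overlapped execution."""
    return first_dest in second_sources

# ADD R1, 3(R1) writes R1; ADD R4, 4(R1) reads R1 for its address:
print(raw_hazard("R1", {"R1"}))   # True: cannot be done in parallel
# ADD R2, 3(R1) writes R2; ADD R4, 4(R3) reads only R3:
print(raw_hazard("R2", {"R3"}))   # False: can be done in parallel
```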

  14. Pipeline stall (6): Branch [Figure: pipeline stall due to a branch. Fetching the branch target Ik (Fk) can only start after the branch Ii has been executed (Ei), so fetching is suspended in between.]

  15. Data dependency (1): example MUL R2,R3,R4 /* R4 destination */ ADD R5,R4,R6 /* R6 destination */ The new value of R4 must be available before the ADD instruction uses it

  16. Data dependency (2): example [Figure: the MUL (I1) runs F1 D1 O1 W1; the ADD (I2) stalls in decode because of the data dependence between W1 and D2, and I3 and I4 slip behind it accordingly.]

  17. Branching: Instruction queue [Figure: the fetch unit fills an instruction queue; a dispatch unit takes instructions from the queue and feeds the operation and write stages.]

  18. Idling at branch [Figure: after the branch Ij is executed (Ej), the already-fetched next instruction Ij+1 is discarded (an idle cycle), and fetching resumes at the branch target Ik.]

  19. Branch with instruction queue Branch folding: execute the branch instruction simultaneously with another instruction (i.e., compute its target while the queue keeps the execution unit busy) [Figure: I1–I3 proceed normally; I4, fetched past the branch, is discarded; thanks to the queue, the target instructions Ij, Ij+1, ... complete without a lost cycle.]

  20. Delayed branch (1): reordering
  Original (always loses a cycle):
    LOOP   Shift_left R1
           Decrement R2
           Branch_if>0 LOOP
    NEXT   Add R1,R3
  Reordered (the Shift_left now sits in the delay slot and is always executed):
    LOOP   Decrement R2
           Branch_if>0 LOOP
           Shift_left R1
    NEXT   Add R1,R3

  21. Delayed branch (2): execution timing [Figure: F/E timing of the reordered loop. Decrement, Branch, Shift, Decrement, Branch, Shift, Add follow each other without an idle fetch cycle: the Shift fills the slot after the branch.]

  22. Branch prediction (1) [Figure: effect of incorrect branch prediction. While the branch I2 (Branch-if>) is resolved, I3 and I4 are fetched and decoded speculatively; when the prediction turns out to be wrong they are cancelled (X) and fetching restarts at Ik.]

  23. Branch prediction (2) Possible implementation: • use a single bit • the bit records the previous outcome of the branch • the bit tells from which location to fetch the next instructions
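A sketch of that single-bit scheme in Python (the class and names are mine, not from the slides): the bit simply replays the branch's previous outcome.

```python
class OneBitPredictor:
    """Per-branch single bit: predict whatever the branch did last time."""
    def __init__(self):
        self.taken = {}                      # branch address -> last outcome

    def predict(self, pc):
        return self.taken.get(pc, False)     # assume not-taken on first sight

    def update(self, pc, outcome):
        self.taken[pc] = outcome             # record the actual outcome

p = OneBitPredictor()
for actual in [True, True, True, False, True]:   # a typical loop branch
    print(p.predict(0x40), actual)
    p.update(0x40, actual)
# Mispredicts on the final (not-taken) iteration and again on loop re-entry.
```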

  24. Data paths of CPU (1) [Figure: the register file supplies SRC1 and SRC2 to the ALU; the ALU result goes to RSLT and on to the destination register. Operand forwarding feeds RSLT straight back to the ALU inputs.]

  25. Data paths of CPU (2) [Figure: the forwarding data path connects RSLT directly back to SRC1/SRC2, bypassing the register file between the operation and write stages.]
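The forwarding decision itself is a simple comparison. A minimal Python sketch under assumed conventions (only the destination of the instruction one stage ahead is tracked; real hardware compares against several pipeline registers):

```python
def read_operand(reg, register_file, fwd_dest, fwd_value):
    """Read `reg`, taking the value from the forwarding path (RSLT) when the
    instruction ahead is about to write that same register."""
    if reg == fwd_dest:
        return fwd_value               # bypass: use RSLT directly
    return register_file[reg]          # normal path: read the register file

regs = {"R1": 5, "R2": 7, "R3": 0}
# Add R1,R2,R3 has computed R3 = 12 but has not written it back yet:
print(read_operand("R3", regs, fwd_dest="R3", fwd_value=12))   # 12, not stale 0
```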

  26. Pipelined operation I1: Add R1, R2, R3 (R1 + R2 → R3) I2: Shift_left R3 [Figure: the result of the Add has to be available before the Shift can operate on R3; without forwarding, I2 waits while I3 and I4 queue up behind it.]

  27. Short pipeline [Figure: with forwarding (fwd), the Shift receives R1 + R2 directly from the ALU output, one cycle after the Add computes it, instead of waiting for the write stage.]

  28. Long pipeline [Figure: with a three-cycle operate stage (O1 O2 O3), forwarding happens from the end of the last O stage, so a dependent instruction still waits correspondingly longer for its operand.]

  29. Compiler solution I1: Add R1, R2, R3 I2: Shift_left R3 The compiler inserts no-operations to wait for the result: I1: Add R1, R2, R3 NOP NOP I2: Shift_left R3
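A sketch of this compiler-side fix in Python, under simplifying assumptions (a fixed two-cycle result latency, a made-up tuple format for instructions, and only adjacent pairs checked, as on the slide):

```python
def insert_nops(program, latency=2):
    """Insert NOPs after an instruction whose result the next one needs.
    Instructions are (mnemonic, dest, sources) tuples; the format is assumed."""
    out = [program[0]]
    for prev, cur in zip(program, program[1:]):
        if prev[1] is not None and prev[1] in cur[2]:
            out.extend([("NOP", None, ())] * latency)   # pad until result ready
        out.append(cur)
    return out

prog = [("Add", "R3", ("R1", "R2")),
        ("Shift_left", "R3", ("R3",))]
for op, dest, _ in insert_nops(prog):
    print(op, dest or "")
# Add R3 / NOP / NOP / Shift_left R3, matching the slide's transformation.
```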

  30. Side effects Another form of (implicit) data dependency: instructions can have side effects that are used by the next instruction I2: ADD D1, D2 I3: ADDX D3, D4 (the carry produced by the ADD is consumed by the extended ADDX)

  31. Complex addressing mode Load (X(R1)), R2 causes a pipeline stall [Figure: the load spends extra decode/operate cycles computing X+[R1], fetching [X+[R1]], fetching [[X+[R1]]], and finally writing R2; the next instruction stalls behind it before its own O and W stages.]

  32. Simple addressing modes Add #X,R1,R2 Load (R2),R2 Load (R2),R2 [Figure: the same work built up from simple instructions: the Add computes X+[R1] into R2, the first Load fetches [X+[R1]], the second fetches [[X+[R1]]]. The sequence takes the same amount of time as the complex load, but each instruction fits the regular pipeline pattern.]

  33. Addressing modes • Requirements on addressing modes with pipelining: • operand access takes no more than one memory access • only load and store instructions access memory • addressing modes do not have side effects • Possible addressing modes: • register • register indirect • index

  34. Condition codes (1) • Problems in RISC with condition codes (CCs): • do instructions after reordering have access to the right CC values? • are CCs already available at the next instruction? • Solutions: • compiler detection • no automatic use of CCs, only when explicitly given in the instruction

  35. Explicit specification of CCs Double-precision addition: Increment R5 Add R2, R4 Add-with-increment R1, R3 As PowerPC instructions: ADDI R5, R5, 1 ADDC R4, R2, R4 ADDE R3, R1, R3 (C: change carry flag, E: use carry flag)

  36. Two execution units [Figure: the fetch unit fills an instruction queue; a dispatch unit issues instructions to two execution units, an FP unit and an integer unit, followed by a write stage.]

  37. Instruction flow (superscalar) [Figure: simultaneous execution of floating-point and integer operations. The Fadd I1 and Fsub I3 each spend three cycles in the O stage, while the integer Add I2 and Sub I4 finish their single O cycle and write back earlier.]

  38. Completion in program order [Figure: the same instruction mix, but each write stage waits until the previous instruction has completed, so W1–W4 occur in program order.]

  39. Consequences of the completion order When an exception occurs: • writes not necessarily in the order of the instructions: imprecise exceptions • writes in order: precise exceptions

  40. PowerPC pipeline [Figure: instruction fetch reads from the instruction cache, steered by the branch unit, and fills the instruction queue; the dispatcher issues to the LSU (with a store queue to the data cache), the FPU, and the IU; a completion queue retires instructions in order.]

  41. Performance Effects (1) • Execution time of a program: T • Dynamic instruction count: N • Number of cycles per instruction: S • Clock rate: R • Without pipelining: T = (N x S) / R • With an n-stage pipeline: T' = T / n ???
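As a quick worked check of the formula (the instruction count is my own illustrative number; the clock rate matches the next slide):

```python
N = 1_000_000_000   # dynamic instruction count (assumed for illustration)
S = 4               # cycles per instruction in the 4-stage design
R = 500_000_000     # clock rate in Hz (500 MHz, as on the next slide)

T = N * S / R       # execution time without pipelining
print(T, T / 4)     # 8.0 s sequential; 2.0 s is the ideal 4-stage bound
```

The question marks are the point of the slide: the slides that follow show why T / n is only an upper bound on the speedup.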

  42. Performance Effects (2) • Cycle time: 2 ns (R is 500 MHz) • Cache hit (miss) ratio for instructions: 0.95 (0.05) • Cache hit (miss) ratio for data: 0.90 (0.10) • Fraction of instructions that need data from memory: 0.30 • Cache miss penalty: 17 cycles • Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so a slowdown by a factor of more than 2!!

  43. Performance Effects (3) • On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles • On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles • For a total additional cost of 1.36 cycles
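The three numbers on slides 42 and 43 are easy to reproduce; a small check in Python (all values taken from the slides):

```python
miss_i, miss_d = 0.05, 0.10   # instruction / data cache miss ratios
f_mem = 0.30                  # fraction of instructions needing data from memory
penalty = 17                  # cache miss penalty in cycles

fetch_avg  = 1 + miss_i * penalty             # 1.85 cycles per fetch
decode_avg = 1 + f_mem * miss_d * penalty     # 1.51 cycles per decode
extra      = (miss_i + f_mem * miss_d) * penalty
print(fetch_avg, decode_avg, round(extra, 2))  # 1.85 1.51 1.36
```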

  44. Performance Effects (4) • If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction • In other words: here, the pipeline is as slow as its slowest stage [Figure: two overlapped F1 D1 O1 W1 sequences, one with a stretched stage, illustrating the point.]

  45. Performance Effects (5) • A delay of 1 cycle every 4 instructions in only one stage gives an average penalty of 0.25 cycles • Average inter-completion time: (3 x 1 + 1 x 2) / 4 = 1.25 [Figure: instructions I1–I5 in a four-stage pipeline; one of every four takes an extra cycle in one stage, so completions are 1, 1, 1, 2 cycles apart.]

  46. Performance Effects (6) • Delays in two stages: • k % of the instructions are delayed in one stage with a penalty of s cycles • l % of the instructions are delayed in another stage with a penalty of t cycles • Average inter-completion time: ((100 - k - l) x 1 + k(1 + s) + l(1 + t)) / 100 = (100 + ks + lt) / 100 • In the example (k = 5, l = 3, s = t = 17): 2.36
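The general formula is easy to wrap in a function; a short sketch that reproduces both worked examples (slide 45 corresponds to k = 25, s = 1, l = 0):

```python
def avg_inter_completion(k, s, l=0, t=0):
    """Average cycles between completions: k% of instructions pay an s-cycle
    penalty, l% pay t cycles, and the rest complete one per cycle."""
    return (100 + k * s + l * t) / 100

print(avg_inter_completion(25, 1))          # 1.25  (slide 45)
print(avg_inter_completion(5, 17, 3, 17))   # 2.36  (slide 46)
```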

  47. Performance Effects (7) • A large number of pipeline stages seems advantageous, but: • more instructions are being processed simultaneously, so there is more opportunity for conflicts • the branch penalty becomes larger • the ALU is usually the bottleneck, so there is no use in having smaller time steps
