Pipelining (Chapter 8)
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt
Course website: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm
TU-Delft, TI1400/12-PDS
Basic idea (1)

Sequential execution: each instruction is first fetched (F), then executed (E), one instruction at a time:

    time ->
    I1:  F1 E1
    I2:        F2 E2
    I3:              F3 E3
    I4:                    F4 E4

Hardware: an instruction fetch unit passes each instruction through buffer B1 to the execution unit.
Basic idea (2): Overlap

Pipelined execution: while one instruction executes, the next one is already being fetched:

    Clock cycle:  1   2   3   4   5
    I1            F1  E1
    I2                F2  E2
    I3                    F3  E3
    I4                        F4  E4
Instruction phases
• F  Fetch instruction
• D  Decode instruction and fetch operands
• O  Perform operation
• W  Write result
Four-stage pipeline

    Clock cycle:  1   2   3   4   5   6   7
    I1            F1  D1  O1  W1
    I2                F2  D2  O2  W2
    I3                    F3  D3  O3  W3
    I4                        F4  D4  O4  W4
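The overlap above can be reproduced in a few lines of code. The following is a minimal sketch (not part of the original slides): in an ideal, stall-free pipeline, instruction i enters stage s in cycle i + s, so n instructions need n + 3 cycles instead of 4n.

    # Minimal sketch of an ideal (stall-free) four-stage pipeline.
    # Instruction i occupies stage s in cycle i + s (0-based).
    STAGES = ["F", "D", "O", "W"]

    def schedule(n_instructions):
        """Return {(instruction, stage): cycle} for an ideal pipeline."""
        return {(i, s): i + s
                for i in range(n_instructions)
                for s in range(len(STAGES))}

    sched = schedule(4)
    for i in range(4):
        row = ["  "] * 7                      # 4 instructions -> 7 cycles
        for s, name in enumerate(STAGES):
            row[sched[(i, s)]] = name + str(i + 1)
        print("I" + str(i + 1) + ": " + " ".join(row))

Running this prints the same staircase as the diagram above.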
Hardware organization (1)

    Fetch unit -> B1 -> Decode and fetch oper. -> B2 -> Oper unit -> B3 -> Write unit

Each stage is separated from the next by an inter-stage buffer (B1, B2, B3).
Hardware organization (2)
During cycle 4, the buffers contain:
• B1: instruction I3
• B2: the source operands of I2, the specification of the operation, and the specification of the destination operand
• B3: the result of the operation of I1 and the specification of the destination operand
Hardware organization (3)

The same pipeline, with the cycle-4 buffer contents drawn in:

    Fetch unit -> B1 [I3] -> Decode and fetch oper. -> B2 [operands and operation of I2]
               -> Oper unit -> B3 [result of I1] -> Write unit
Pipeline stall (1)
• Pipeline stall: delay in a stage of the pipeline due to an instruction
• Reasons for pipeline stall:
  • cache miss
  • long operation (for example, division)
  • dependency between successive instructions
  • branching
Pipeline stall (2): Cache miss

A cache miss in the fetch of I2 stretches F2 over four cycles:

    Clock cycle:  1   2   3   4   5   6   7   8   9
    I1            F1  D1  O1  W1
    I2                F2  F2  F2  F2  D2  O2  W2
    I3                                F3  D3  O3  W3
Pipeline stall (3): Cache miss

The same stall seen per stage (effect of the cache miss in F2):

    Clock cycle:  1    2    3    4    5    6    7    8    9
    F             F1   F2   F2   F2   F2   F3
    D                  D1   idle idle idle D2   D3
    O                       O1   idle idle idle O2   O3
    W                            W1   idle idle idle W2   W3
Pipeline stall (4): Long operation

A long operation (here, O2 takes two cycles) delays all later instructions:

    Clock cycle:  1   2   3   4   5   6   7   8
    I1            F1  D1  O1  W1
    I2                F2  D2  O2  O2  W2
    I3                    F3  D3  --  O3  W3
    I4                        F4  --  D4  O4  W4
Pipeline stall (5): Dependencies
• The instructions
      ADD R1, 3(R1)
      ADD R4, 4(R1)
  cannot be done in parallel (both involve R1).
• The instructions
      ADD R2, 3(R1)
      ADD R4, 4(R3)
  can be done in parallel.
A mechanical check for this is sketched below.
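Such a dependence can be detected by comparing the registers one instruction writes against those the next one reads. A minimal sketch (the tuple encoding is made up for illustration; following the slide's notation, the destination register is listed first):

    # Hedged sketch: read-after-write (RAW) dependence between two
    # instructions, each encoded as (destination, set_of_source_registers).
    def raw_hazard(first, second):
        dest, _ = first
        _, sources = second
        return dest in sources

    # First pair above: ADD R1,3(R1) writes R1; ADD R4,4(R1) reads R1.
    print(raw_hazard(("R1", {"R1"}), ("R4", {"R1"})))        # True: cannot overlap
    # Second pair: ADD R2,3(R1) writes R2; ADD R4,4(R3) reads R1 and R3 only.
    print(raw_hazard(("R2", {"R1"}), ("R4", {"R1", "R3"})))  # False: can overlap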
Pipeline stall (6): Branch

Only start fetching instructions after the branch has been executed:

    Clock cycle:  1   2   3   4
    Ii (branch)   Fi  Ei
    Ik                    Fk  Ek

Pipeline stall due to a branch.
Data dependency (1): example

    MUL R2,R3,R4   /* R4 destination */
    ADD R5,R4,R6   /* R6 destination */

The new value of R4 must be available before the ADD instruction uses it.
Data dependency (2): example

D2 must wait until W1 has written the new value of R4:

    Clock cycle:  1   2   3   4   5   6   7   8   9
    I1 (MUL)      F1  D1  O1  W1
    I2 (ADD)          F2  --  --  D2  O2  W2
    I3                    F3  --  --  D3  O3  W3
    I4                        F4  --  --  D4  O4  W4

Pipeline stall due to the data dependence between W1 and D2.
Branching: Instruction queue

    Fetch -> instruction queue -> Dispatch -> Operation -> Write

The fetch unit fills an instruction queue, from which the dispatch unit feeds the rest of the pipeline.
Idling at branch

    Ij (branch)   Fj   Ej
    Ij+1               Fj+1  idle              (fetched, then discarded)
    Ik                       Fk    Ek
    Ik+1                           Fk+1  Ek+1
Branch with instruction queue

Branch folding: execute a later branch instruction simultaneously with another instruction (i.e., compute its target while the queue keeps the execution unit busy):

    I1            F1  E1
    I2                F2  E2
    I3 (branch)           F3  E3
    I4                        F4  (discarded)
    Ij                            Fj  Ej
    Ij+1                              Fj+1  Ej+1
    Ij+2                                    Fj+2  Ej+2
    Ij+3                                          Fj+3  Ej+3
Delayed branch (1): reordering

Original (we always lose a cycle on the branch):

    LOOP  Shift_left  R1
          Decrement   R2
          Branch_if>0 LOOP
    NEXT  Add         R1,R3

Reordered:

    LOOP  Decrement   R2
          Branch_if>0 LOOP
          Shift_left  R1      (branch delay slot: always executed)
    NEXT  Add         R1,R3
Delayed branch (2): execution timing

    Decrement   F  E
    Branch         F  E
    Shift             F  E     (delay slot, executed while the target is fetched)
    Decrement            F  E
    Branch                  F  E
    Shift                      F  E
    Add                           F  E
Branch prediction (1)

Effect of an incorrect branch prediction: the speculatively fetched instructions I3 and I4 are cancelled (X) once the branch resolves, and fetching continues at Ik:

    I1 (Compare)     F1  D1  E1  W1
    I2 (Branch-if>)      F2  E2
    I3                       F3  D3  E3/X
    I4                           F4  D4/X
    Ik                               Fk   Dk  ...
Branch prediction (2)
Possible implementation:
• use a single bit
• the bit records the previous choice of the branch
• the bit tells from which location to fetch the next instructions
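A minimal sketch of this single-bit scheme (the class interface is invented for illustration): the bit simply replays the branch's previous outcome as the next prediction.

    # Hedged sketch of a one-bit branch predictor: one bit per branch
    # records the previous choice and is used as the next prediction.
    class OneBitPredictor:
        def __init__(self):
            self.last_taken = {}      # branch address -> previous outcome

        def predict(self, branch_addr):
            # Predict taken iff taken last time (default: not taken).
            return self.last_taken.get(branch_addr, False)

        def update(self, branch_addr, taken):
            self.last_taken[branch_addr] = taken

    p = OneBitPredictor()
    print(p.predict(0x100))    # False: no history yet
    p.update(0x100, True)
    print(p.predict(0x100))    # True: replays the previous outcome

A single bit mispredicts twice around every loop (on the first and last iteration); two-bit counters reduce this, but the slide describes the one-bit variant.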
Data paths of CPU (1)

    Source 1, Source 2 -> SRC1, SRC2 -> ALU -> RSLT -> Destination (register file)

Operand forwarding: the RSLT output can be routed directly back to SRC1/SRC2, bypassing the register file.
Data paths of CPU (2)

SRC1, SRC2 and RSLT act as the inter-stage buffers of the Operation (ALU) and Write (register file) stages; the forwarding data path connects RSLT back to the ALU inputs.
Pipelined operation

    I1: Add R1, R2, R3
    I2: Shift_left R3

    I1 (Add)     F  D(R1,R2)  O(R1+R2)  W(R3)
    I2 (Shift)      F         D(R3)     O(shift)  W(R3)
    I3                        F         D         O   W
    I4                                  F         D   O   W

The result of the Add has to be available when the Shift uses R3.
Short pipeline

In a short pipeline, forwarding (fwd) removes the stall: the sum R1+R2 is forwarded directly into the shift operation instead of going through R3:

    I1 (Add)     F  D(R1,R2)  O(R1+R2)       W(R3)
    I2 (Shift)      F         D(R3)          O(fwd, shift)  W(R3)
    I3                        F              D              O   W
Long pipeline

With a three-cycle operation stage (O1 O2 O3), even forwarding from the end of the previous instruction's operation leaves a dependent instruction waiting:

    I1  F  D  O1  O2  O3  W
    I2     F  D   --  --  O1  O2  O3  W      (fwd from I1's O3)
    I3        F   D   --  --  O1  O2  O3  W
Compiler solution

    I1: Add R1, R2, R3
    I2: Shift_left R3

The compiler inserts no-operations to wait for the result:

    I1: Add R1, R2, R3
        NOP
        NOP
    I2: Shift_left R3

A sketch of such a pass follows below.
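A minimal sketch of such a compiler pass (the instruction encoding and the two-NOP result latency are assumptions matching this example):

    # Hedged sketch: insert NOPs so that an instruction never reads a
    # register before the previous instruction has written it.
    # Instructions are encoded as (opcode, sources, destination).
    def insert_nops(program, gap=2):
        out = []
        for i, (op, srcs, dest) in enumerate(program):
            out.append((op, srcs, dest))
            nxt = program[i + 1] if i + 1 < len(program) else None
            if nxt is not None and dest is not None and dest in nxt[1]:
                out.extend([("NOP", (), None)] * gap)  # wait for the result
        return out

    prog = [("Add", ("R1", "R2"), "R3"),
            ("Shift_left", ("R3",), "R3")]
    for ins in insert_nops(prog):
        print(ins[0], *ins[1])

Running this prints the four-instruction sequence shown above. A real compiler would instead try to fill the gap with independent instructions, falling back to NOPs only when none are available.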
Side effects

Another form of (implicit) data dependency: instructions can have side effects that are used by the next instruction. Here the ADD sets the carry flag, which ADDX then consumes:

    I2: ADD  D1, D2
    I3: ADDX D3, D4     (uses the carry copied from the ADD)
Complex addressing mode

    Load (X(R1)), R2

The operand address requires several steps: compute X+[R1], fetch [X+[R1]], then fetch [[X+[R1]]] before the result can be written to R2:

    Load          F  D  X+[R1]  [X+[R1]]  [[X+[R1]]]  W(R2)
    Next instr.      F  D       --        --          fwd,O  W

This causes a pipeline stall.
Simple addressing modes

The same access built up from simple instructions:

    Add  #X,R1,R2     (computes X+[R1] into R2)
    Load (R2),R2      (fetches [X+[R1]])
    Load (R2),R2      (fetches [[X+[R1]]])

This takes the same total amount of time, but each instruction fits the pipeline.
Addressing modes
• Requirements for addressing modes with pipelining:
  • operand access takes no more than one memory access
  • only load and store instructions access memory
  • addressing modes have no side effects
• Possible addressing modes satisfying these: register, register indirect, index
Condition codes (1)
• Problems in RISC with condition codes (CCs):
  • do instructions after reordering have access to the right CC values?
  • are CCs already available at the next instruction?
• Solutions:
  • compiler detection
  • no automatic use of CCs; they are used only when explicitly specified in the instruction
Explicit specification of CCs

Double-precision addition, with the carry passed explicitly:

    Increment R5                      ADDI R5, R5, 1
    Add R2, R4                        ADDC R4, R2, R4
    Add-with-increment R1, R3         ADDE R3, R1, R3

The right-hand column shows PowerPC instructions (C: change carry flag, E: use carry flag).
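What the ADDC/ADDE pair computes can be mimicked in a few lines. This is a hedged sketch of a 64-bit addition built from two 32-bit adds chained through an explicit carry, the same pattern as above:

    # Two 32-bit adds chained through an explicit carry flag,
    # mirroring the ADDC (set carry) / ADDE (use carry) pair.
    MASK32 = 0xFFFFFFFF

    def addc(a, b):
        s = a + b
        return s & MASK32, s >> 32          # (low result, carry out)

    def adde(a, b, carry):
        s = a + b + carry
        return s & MASK32, s >> 32

    def add64(x, y):
        lo, carry = addc(x & MASK32, y & MASK32)
        hi, _ = adde(x >> 32, y >> 32, carry)
        return (hi << 32) | lo

    assert add64(0xFFFFFFFF, 1) == 0x100000000   # the carry propagates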
Two execution units

    Fetch -> instruction queue -> Dispatch unit -> { FP Unit, Integer Unit } -> Write

The dispatch unit sends floating-point and integer instructions to their own execution units.
Instruction flow (superscalar)

Simultaneous execution of floating-point and integer operations (the FP operations take three O cycles):

    Clock cycle:  1   2   3   4   5   6   7   8
    I1 (Fadd)     F1  D1  O1  O1  O1  W1
    I2 (Add)          F2  D2  O2  W2
    I3 (Fsub)             F3  D3  O3  O3  O3  W3
    I4 (Sub)                  F4  D4  O4  W4

Note that I2 and I4 complete out of program order.
Completion in program order

Each instruction waits with its write until the previous instruction has completed:

    Clock cycle:  1   2   3   4   5   6   7   8   9
    I1 (Fadd)     F1  D1  O1  O1  O1  W1
    I2 (Add)          F2  D2  O2  --  --  W2
    I3 (Fsub)             F3  D3  O3  O3  O3  W3
    I4 (Sub)                  F4  D4  O4  --  --  W4
Consequences of completion order
When an exception occurs:
• if writes are not necessarily in the order of the instructions, the exceptions are imprecise
• if writes are in program order, the exceptions are precise
PowerPC pipeline

Block diagram: the instruction fetch unit reads from the instruction cache into an instruction queue; a branch unit and a dispatcher take instructions from this queue; the dispatcher feeds the LSU (load/store unit, with a store queue to the data cache), the FPU, and the IU (integer unit); a completion queue records dispatched instructions until they complete.
Performance Effects (1)
• Execution time of a program: T
• Dynamic instruction count: N
• Number of cycles per instruction: S
• Clock rate: R
• Without pipelining: T = (N x S) / R
• With an n-stage pipeline: T' = T / n ???
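As a worked example of these formulas (N is a made-up instruction count; S = 4 assumes the four-stage design, and R = 500 MHz matches the next slide's 2 ns cycle time):

    # T = (N * S) / R without pipelining; the question marks on the slide
    # ask whether an n-stage pipeline really achieves T' = T / n.
    N = 100_000_000        # hypothetical dynamic instruction count
    S = 4                  # cycles per instruction (four stages)
    R = 500e6              # clock rate: 500 MHz

    T = (N * S) / R
    print(T, T / S)        # 0.8 s sequential, 0.2 s for an ideal pipeline

The following slides show why the ideal T / n is not reached: stalls stretch individual stages.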
Performance Effects (2)
• Cycle time: 2 ns (R = 500 MHz)
• Cache hit (miss) ratio for instructions: 0.95 (0.05)
• Cache hit (miss) ratio for data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so a slowdown by a factor of more than 2!
Performance Effects (3)
• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 0.85 + 0.51 = 1.36 cycles
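The same arithmetic in executable form (a direct transcription of the numbers above):

    # Average stage times under cache misses, as on this slide.
    miss_instr = 0.05      # instruction-cache miss ratio
    miss_data  = 0.10      # data-cache miss ratio
    mem_frac   = 0.30      # fraction of instructions accessing data memory
    penalty    = 17        # miss penalty in cycles

    fetch  = 1 + miss_instr * penalty                # 1.85 cycles on average
    decode = 1 + mem_frac * miss_data * penalty      # 1.51 cycles on average
    extra  = (fetch - 1) + (decode - 1)              # 1.36 extra cycles
    print(fetch, decode, extra)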
Performance Effects (4)
• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction.
• In other words: the pipeline is as slow as its slowest stage.
Performance Effects (5)
• A delay of 1 cycle for every 4 instructions in only one stage gives an average penalty of 0.25 cycles per instruction.
• Average inter-completion time: (3 x 1 + 1 x 2) / 4 = 1.25 cycles
Performance Effects (6)
• Delays in two stages:
  • k% of the instructions are delayed in one stage, with a penalty of s cycles
  • l% of the instructions are delayed in another stage, with a penalty of t cycles
• Average inter-completion time: ((100 - k - l) x 1 + k(1 + s) + l(1 + t)) / 100 = (100 + ks + lt) / 100
• In the example (k = 5, l = 3, s = t = 17): 2.36 cycles
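The formula in executable form (a direct transcription of the slide):

    # Average inter-completion time with delays in two stages:
    # k% of instructions delayed s cycles, l% delayed t cycles.
    def avg_intercompletion(k, s, l, t):
        return (100 + k * s + l * t) / 100

    print(avg_intercompletion(5, 17, 3, 17))   # 2.36: the example above
    print(avg_intercompletion(25, 1, 0, 0))    # 1.25: the previous slide's case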
Performance Effects (7)
• A large number of pipeline stages seems advantageous, but:
  • more instructions are being processed simultaneously, so there is more opportunity for conflicts
  • the branch penalty becomes larger
  • the ALU is usually the bottleneck, so there is no use in having smaller time steps