Pipelining (Chapter 8)
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt
Course website: http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm
TU-Delft, TI1400/12-PDS
Basic idea (1)

Sequential execution: each instruction is first fetched (F), then executed (E), one instruction at a time:

    time ->
    I1:  F1 E1
    I2:        F2 E2
    I3:              F3 E3
    I4:                    F4 E4

Hardware: an instruction fetch unit passes each instruction through buffer B1 to the execution unit.
Basic idea (2): Overlap

Pipelined execution: while one instruction executes, the next one is already being fetched:

    Clock cycle:  1   2   3   4   5
    I1            F1  E1
    I2                F2  E2
    I3                    F3  E3
    I4                        F4  E4
Instruction phases
• F  Fetch instruction
• D  Decode instruction and fetch operands
• O  Perform operation
• W  Write result
Four-stage pipeline

    Clock cycle:  1   2   3   4   5   6   7
    I1            F1  D1  O1  W1
    I2                F2  D2  O2  W2
    I3                    F3  D3  O3  W3
    I4                        F4  D4  O4  W4
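The overlap above can be reproduced in a few lines of code. The following is a minimal sketch (not part of the original slides): in an ideal, stall-free pipeline, instruction i enters stage s in cycle i + s, so n instructions need n + 3 cycles instead of 4n.

    # Minimal sketch of an ideal (stall-free) four-stage pipeline.
    # Instruction i occupies stage s in cycle i + s (0-based).
    STAGES = ["F", "D", "O", "W"]

    def schedule(n_instructions):
        """Return {(instruction, stage): cycle} for an ideal pipeline."""
        return {(i, s): i + s
                for i in range(n_instructions)
                for s in range(len(STAGES))}

    sched = schedule(4)
    for i in range(4):
        row = ["  "] * 7                      # 4 instructions -> 7 cycles
        for s, name in enumerate(STAGES):
            row[sched[(i, s)]] = name + str(i + 1)
        print("I" + str(i + 1) + ": " + " ".join(row))

Running this prints the same staircase as the diagram above.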
Hardware organization (1)

    Fetch unit -> B1 -> Decode and fetch oper. -> B2 -> Oper unit -> B3 -> Write unit

Each stage is separated from the next by an inter-stage buffer (B1, B2, B3).
Hardware organization (2)
During cycle 4, the buffers contain:
• B1: instruction I3
• B2: the source operands of I2, the specification of the operation, and the specification of the destination operand
• B3: the result of the operation of I1 and the specification of the destination operand
Hardware organization (3)

The same pipeline, with the cycle-4 buffer contents drawn in:

    Fetch unit -> B1 [I3] -> Decode and fetch oper. -> B2 [operands and operation of I2]
               -> Oper unit -> B3 [result of I1] -> Write unit
Pipeline stall (1)
• Pipeline stall: delay in a stage of the pipeline due to an instruction
• Reasons for pipeline stall:
  • cache miss
  • long operation (for example, division)
  • dependency between successive instructions
  • branching
Pipeline stall (2): Cache miss

A cache miss in the fetch of I2 stretches F2 over four cycles:

    Clock cycle:  1   2   3   4   5   6   7   8   9
    I1            F1  D1  O1  W1
    I2                F2  F2  F2  F2  D2  O2  W2
    I3                                F3  D3  O3  W3
Pipeline stall (3): Cache miss

The same stall seen per stage (effect of the cache miss in F2):

    Clock cycle:  1    2    3    4    5    6    7    8    9
    F             F1   F2   F2   F2   F2   F3
    D                  D1   idle idle idle D2   D3
    O                       O1   idle idle idle O2   O3
    W                            W1   idle idle idle W2   W3
Pipeline stall (4): Long operation

A long operation (here, O2 takes two cycles) delays all later instructions:

    Clock cycle:  1   2   3   4   5   6   7   8
    I1            F1  D1  O1  W1
    I2                F2  D2  O2  O2  W2
    I3                    F3  D3  --  O3  W3
    I4                        F4  --  D4  O4  W4
Pipeline stall (5): Dependencies
• The instructions
      ADD R1, 3(R1)
      ADD R4, 4(R1)
  cannot be done in parallel (both involve R1).
• The instructions
      ADD R2, 3(R1)
      ADD R4, 4(R3)
  can be done in parallel.
A mechanical check for this is sketched below.
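Such a dependence can be detected by comparing the registers one instruction writes against those the next one reads. A minimal sketch (the tuple encoding is made up for illustration; following the slide's notation, the destination register is listed first):

    # Hedged sketch: read-after-write (RAW) dependence between two
    # instructions, each encoded as (destination, set_of_source_registers).
    def raw_hazard(first, second):
        dest, _ = first
        _, sources = second
        return dest in sources

    # First pair above: ADD R1,3(R1) writes R1; ADD R4,4(R1) reads R1.
    print(raw_hazard(("R1", {"R1"}), ("R4", {"R1"})))        # True: cannot overlap
    # Second pair: ADD R2,3(R1) writes R2; ADD R4,4(R3) reads R1 and R3 only.
    print(raw_hazard(("R2", {"R1"}), ("R4", {"R1", "R3"})))  # False: can overlap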
Pipeline stall (6): Branch

Only start fetching instructions after the branch has been executed:

    Clock cycle:  1   2   3   4
    Ii (branch)   Fi  Ei
    Ik                    Fk  Ek

Pipeline stall due to a branch.
Data dependency (1): example

    MUL R2,R3,R4   /* R4 destination */
    ADD R5,R4,R6   /* R6 destination */

The new value of R4 must be available before the ADD instruction uses it.
Data dependency (2): example

D2 must wait until W1 has written the new value of R4:

    Clock cycle:  1   2   3   4   5   6   7   8   9
    I1 (MUL)      F1  D1  O1  W1
    I2 (ADD)          F2  --  --  D2  O2  W2
    I3                    F3  --  --  D3  O3  W3
    I4                        F4  --  --  D4  O4  W4

Pipeline stall due to the data dependence between W1 and D2.
Branching: Instruction queue

    Fetch -> instruction queue -> Dispatch -> Operation -> Write

The fetch unit fills an instruction queue, from which the dispatch unit feeds the rest of the pipeline.
Idling at branch

    Ij (branch)   Fj   Ej
    Ij+1               Fj+1  idle              (fetched, then discarded)
    Ik                       Fk    Ek
    Ik+1                           Fk+1  Ek+1
Branch with instruction queue

Branch folding: execute a later branch instruction simultaneously with another instruction (i.e., compute its target while the queue keeps the execution unit busy):

    I1            F1  E1
    I2                F2  E2
    I3 (branch)           F3  E3
    I4                        F4  (discarded)
    Ij                            Fj  Ej
    Ij+1                              Fj+1  Ej+1
    Ij+2                                    Fj+2  Ej+2
    Ij+3                                          Fj+3  Ej+3
Delayed branch (1): reordering

Original (we always lose a cycle on the branch):

    LOOP  Shift_left  R1
          Decrement   R2
          Branch_if>0 LOOP
    NEXT  Add         R1,R3

Reordered:

    LOOP  Decrement   R2
          Branch_if>0 LOOP
          Shift_left  R1      (branch delay slot: always executed)
    NEXT  Add         R1,R3
Delayed branch (2): execution timing

    Decrement   F  E
    Branch         F  E
    Shift             F  E     (delay slot, executed while the target is fetched)
    Decrement            F  E
    Branch                  F  E
    Shift                      F  E
    Add                           F  E
Branch prediction (1)

Effect of an incorrect branch prediction: the speculatively fetched instructions I3 and I4 are cancelled (X) once the branch resolves, and fetching continues at Ik:

    I1 (Compare)     F1  D1  E1  W1
    I2 (Branch-if>)      F2  E2
    I3                       F3  D3  E3/X
    I4                           F4  D4/X
    Ik                               Fk   Dk  ...
Branch prediction (2)
Possible implementation:
• use a single bit
• the bit records the previous choice of the branch
• the bit tells from which location to fetch the next instructions
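A minimal sketch of this single-bit scheme (the class interface is invented for illustration): the bit simply replays the branch's previous outcome as the next prediction.

    # Hedged sketch of a one-bit branch predictor: one bit per branch
    # records the previous choice and is used as the next prediction.
    class OneBitPredictor:
        def __init__(self):
            self.last_taken = {}      # branch address -> previous outcome

        def predict(self, branch_addr):
            # Predict taken iff taken last time (default: not taken).
            return self.last_taken.get(branch_addr, False)

        def update(self, branch_addr, taken):
            self.last_taken[branch_addr] = taken

    p = OneBitPredictor()
    print(p.predict(0x100))    # False: no history yet
    p.update(0x100, True)
    print(p.predict(0x100))    # True: replays the previous outcome

A single bit mispredicts twice around every loop (on the first and last iteration); two-bit counters reduce this, but the slide describes the one-bit variant.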
Data paths of CPU (1)

    Source 1, Source 2 -> SRC1, SRC2 -> ALU -> RSLT -> Destination (register file)

Operand forwarding: the RSLT output can be routed directly back to SRC1/SRC2, bypassing the register file.
Data paths of CPU (2)

SRC1, SRC2 and RSLT act as the inter-stage buffers of the Operation (ALU) and Write (register file) stages; the forwarding data path connects RSLT back to the ALU inputs.
Pipelined operation

    I1: Add R1, R2, R3
    I2: Shift_left R3

    I1 (Add)     F  D(R1,R2)  O(R1+R2)  W(R3)
    I2 (Shift)      F         D(R3)     O(shift)  W(R3)
    I3                        F         D         O   W
    I4                                  F         D   O   W

The result of the Add has to be available when the Shift uses R3.
Short pipeline

In a short pipeline, forwarding (fwd) removes the stall: the sum R1+R2 is forwarded directly into the shift operation instead of going through R3:

    I1 (Add)     F  D(R1,R2)  O(R1+R2)       W(R3)
    I2 (Shift)      F         D(R3)          O(fwd, shift)  W(R3)
    I3                        F              D              O   W
Long pipeline

With a three-cycle operation stage (O1 O2 O3), even forwarding from the end of the previous instruction's operation leaves a dependent instruction waiting:

    I1  F  D  O1  O2  O3  W
    I2     F  D   --  --  O1  O2  O3  W      (fwd from I1's O3)
    I3        F   D   --  --  O1  O2  O3  W
Compiler solution

    I1: Add R1, R2, R3
    I2: Shift_left R3

The compiler inserts no-operations to wait for the result:

    I1: Add R1, R2, R3
        NOP
        NOP
    I2: Shift_left R3

A sketch of such a pass follows below.
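A minimal sketch of such a compiler pass (the instruction encoding and the two-NOP result latency are assumptions matching this example):

    # Hedged sketch: insert NOPs so that an instruction never reads a
    # register before the previous instruction has written it.
    # Instructions are encoded as (opcode, sources, destination).
    def insert_nops(program, gap=2):
        out = []
        for i, (op, srcs, dest) in enumerate(program):
            out.append((op, srcs, dest))
            nxt = program[i + 1] if i + 1 < len(program) else None
            if nxt is not None and dest is not None and dest in nxt[1]:
                out.extend([("NOP", (), None)] * gap)  # wait for the result
        return out

    prog = [("Add", ("R1", "R2"), "R3"),
            ("Shift_left", ("R3",), "R3")]
    for ins in insert_nops(prog):
        print(ins[0], *ins[1])

Running this prints the four-instruction sequence shown above. A real compiler would instead try to fill the gap with independent instructions, falling back to NOPs only when none are available.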
Side effects

Another form of (implicit) data dependency: instructions can have side effects that are used by the next instruction. Here the ADD sets the carry flag, which ADDX then consumes:

    I2: ADD  D1, D2
    I3: ADDX D3, D4     (uses the carry copied from the ADD)
Complex addressing mode

    Load (X(R1)), R2

The operand address requires several steps: compute X+[R1], fetch [X+[R1]], then fetch [[X+[R1]]] before the result can be written to R2:

    Load          F  D  X+[R1]  [X+[R1]]  [[X+[R1]]]  W(R2)
    Next instr.      F  D       --        --          fwd,O  W

This causes a pipeline stall.
Simple addressing modes

The same access built up from simple instructions:

    Add  #X,R1,R2     (computes X+[R1] into R2)
    Load (R2),R2      (fetches [X+[R1]])
    Load (R2),R2      (fetches [[X+[R1]]])

This takes the same total amount of time, but each instruction fits the pipeline.
Addressing modes
• Requirements for addressing modes with pipelining:
  • operand access takes no more than one memory access
  • only load and store instructions access memory
  • addressing modes have no side effects
• Possible addressing modes satisfying these: register, register indirect, index
Condition codes (1)
• Problems in RISC with condition codes (CCs):
  • do instructions after reordering have access to the right CC values?
  • are CCs already available at the next instruction?
• Solutions:
  • compiler detection
  • no automatic use of CCs; they are used only when explicitly specified in the instruction
Explicit specification of CCs

Double-precision addition, with the carry passed explicitly:

    Increment R5                      ADDI R5, R5, 1
    Add R2, R4                        ADDC R4, R2, R4
    Add-with-increment R1, R3         ADDE R3, R1, R3

The right-hand column shows PowerPC instructions (C: change carry flag, E: use carry flag).
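What the ADDC/ADDE pair computes can be mimicked in a few lines. This is a hedged sketch of a 64-bit addition built from two 32-bit adds chained through an explicit carry, the same pattern as above:

    # Two 32-bit adds chained through an explicit carry flag,
    # mirroring the ADDC (set carry) / ADDE (use carry) pair.
    MASK32 = 0xFFFFFFFF

    def addc(a, b):
        s = a + b
        return s & MASK32, s >> 32          # (low result, carry out)

    def adde(a, b, carry):
        s = a + b + carry
        return s & MASK32, s >> 32

    def add64(x, y):
        lo, carry = addc(x & MASK32, y & MASK32)
        hi, _ = adde(x >> 32, y >> 32, carry)
        return (hi << 32) | lo

    assert add64(0xFFFFFFFF, 1) == 0x100000000   # the carry propagates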
Two execution units

    Fetch -> instruction queue -> Dispatch unit -> { FP Unit, Integer Unit } -> Write

The dispatch unit sends floating-point and integer instructions to their own execution units.
Instruction flow (superscalar)

Simultaneous execution of floating-point and integer operations (the FP operations take three O cycles):

    Clock cycle:  1   2   3   4   5   6   7   8
    I1 (Fadd)     F1  D1  O1  O1  O1  W1
    I2 (Add)          F2  D2  O2  W2
    I3 (Fsub)             F3  D3  O3  O3  O3  W3
    I4 (Sub)                  F4  D4  O4  W4

Note that I2 and I4 complete out of program order.
Completion in program order

Each instruction waits with its write until the previous instruction has completed:

    Clock cycle:  1   2   3   4   5   6   7   8   9
    I1 (Fadd)     F1  D1  O1  O1  O1  W1
    I2 (Add)          F2  D2  O2  --  --  W2
    I3 (Fsub)             F3  D3  O3  O3  O3  W3
    I4 (Sub)                  F4  D4  O4  --  --  W4
Consequences of completion order
When an exception occurs:
• if writes are not necessarily in the order of the instructions, the exceptions are imprecise
• if writes are in program order, the exceptions are precise
PowerPC pipeline

Block diagram: the instruction fetch unit reads from the instruction cache into an instruction queue; a branch unit and a dispatcher take instructions from this queue; the dispatcher feeds the LSU (load/store unit, with a store queue to the data cache), the FPU, and the IU (integer unit); a completion queue records dispatched instructions until they complete.
Performance Effects (1)
• Execution time of a program: T
• Dynamic instruction count: N
• Number of cycles per instruction: S
• Clock rate: R
• Without pipelining: T = (N x S) / R
• With an n-stage pipeline: T' = T / n ???
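As a worked example of these formulas (N is a made-up instruction count; S = 4 assumes the four-stage design, and R = 500 MHz matches the next slide's 2 ns cycle time):

    # T = (N * S) / R without pipelining; the question marks on the slide
    # ask whether an n-stage pipeline really achieves T' = T / n.
    N = 100_000_000        # hypothetical dynamic instruction count
    S = 4                  # cycles per instruction (four stages)
    R = 500e6              # clock rate: 500 MHz

    T = (N * S) / R
    print(T, T / S)        # 0.8 s sequential, 0.2 s for an ideal pipeline

The following slides show why the ideal T / n is not reached: stalls stretch individual stages.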
Performance Effects (2)
• Cycle time: 2 ns (R = 500 MHz)
• Cache hit (miss) ratio for instructions: 0.95 (0.05)
• Cache hit (miss) ratio for data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so a slowdown by a factor of more than 2!
Performance Effects (3)
• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 0.85 + 0.51 = 1.36 cycles
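The same arithmetic in executable form (a direct transcription of the numbers above):

    # Average stage times under cache misses, as on this slide.
    miss_instr = 0.05      # instruction-cache miss ratio
    miss_data  = 0.10      # data-cache miss ratio
    mem_frac   = 0.30      # fraction of instructions accessing data memory
    penalty    = 17        # miss penalty in cycles

    fetch  = 1 + miss_instr * penalty                # 1.85 cycles on average
    decode = 1 + mem_frac * miss_data * penalty      # 1.51 cycles on average
    extra  = (fetch - 1) + (decode - 1)              # 1.36 extra cycles
    print(fetch, decode, extra)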
Performance Effects (4)
• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction.
• In other words: the pipeline is as slow as its slowest stage.
Performance Effects (5)
• A delay of 1 cycle for every 4 instructions in only one stage gives an average penalty of 0.25 cycles per instruction.
• Average inter-completion time: (3 x 1 + 1 x 2) / 4 = 1.25 cycles
Performance Effects (6)
• Delays in two stages:
  • k% of the instructions are delayed in one stage, with a penalty of s cycles
  • l% of the instructions are delayed in another stage, with a penalty of t cycles
• Average inter-completion time: ((100 - k - l) x 1 + k(1 + s) + l(1 + t)) / 100 = (100 + ks + lt) / 100
• In the example (k = 5, l = 3, s = t = 17): 2.36 cycles
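The formula in executable form (a direct transcription of the slide):

    # Average inter-completion time with delays in two stages:
    # k% of instructions delayed s cycles, l% delayed t cycles.
    def avg_intercompletion(k, s, l, t):
        return (100 + k * s + l * t) / 100

    print(avg_intercompletion(5, 17, 3, 17))   # 2.36: the example above
    print(avg_intercompletion(25, 1, 0, 0))    # 1.25: the previous slide's case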
Performance Effects (7)
• A large number of pipeline stages seems advantageous, but:
  • more instructions are being processed simultaneously, so there is more opportunity for conflicts
  • the branch penalty becomes larger
  • the ALU is usually the bottleneck, so there is no use in having smaller time steps