Lecture 16: Pipelining & Instruction Level Parallelism Computer Engineering 585 Fall 2001
Stall at ID Stage • Make a reservation for the FP RF write port in ID stage. • Assume 22-cycle Div, 7-cycle Mult, 4-cycle Add. [Figure: "For FP Adder" pipeline (ID, A1, A2, A3, A4, M, WB) with write-port reservation tables for Mult, Add, and Div over cycles 1 to 22.]
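The write-port reservation described above can be sketched as a shift register with one bit per future cycle. This is only an illustration, not the slide's actual hardware: the class name, the `horizon` parameter, and the `WB_DISTANCE` table are assumptions, and the distances assume every FP operation passes through M and then WB after its last execute stage, as in the MULTF and ADDF diagrams.

```python
# Hypothetical sketch: reserving the FP register-file write port in ID.
# Distance from ID to WB = execute cycles + M + WB (assumed for all three units).
WB_DISTANCE = {"add": 4 + 2, "mult": 7 + 2, "div": 22 + 2}


class WritePortReservations:
    """slots[k] is True if the write port is already booked k cycles from now."""

    def __init__(self, horizon=32):
        self.slots = [False] * horizon

    def try_reserve(self, cycles_until_wb):
        """Called while an instruction sits in ID.

        Returns True and books the port, or False, meaning stall in ID.
        """
        if self.slots[cycles_until_wb]:
            return False
        self.slots[cycles_until_wb] = True
        return True

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.slots = self.slots[1:] + [False]


port = WritePortReservations()
mult_ok = port.try_reserve(WB_DISTANCE["mult"])      # Mult in ID at cycle 0
for _ in range(3):
    port.tick()                                      # now cycle 3
add_first_try = port.try_reserve(WB_DISTANCE["add"]) # would hit WB with the Mult
port.tick()                                          # Add waits one cycle in ID
add_second_try = port.try_reserve(WB_DISTANCE["add"])
```

In this run the Add that reaches ID three cycles after the Mult would write back in the same cycle, so its first attempt fails and it issues one cycle later.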
WAW Hazard Resolution • Reservation Table for FP Reg File: one entry per register (F0, F1, F2, …, F30, F31). • Mult writes into the table at entry into M1. • Mult erases a reservation on entry into MEM. [Figure: pipeline timing. MULTF F0, F2, F4: IF ID M1 M2 M3 M4 M5 M6 M7 M WB. ADDF F0, F2, F4 without stalls: IF ID A1 A2 A3 A4 M WB. ADDF F0, F2, F4 with stalls: IF, then ID repeated while the instruction waits, then A1 A2 A3 A4 M WB.]
FP – Int Hazards • Data Move instructions: MOVFP2I • FP Load/Store instructions: LDD F0, 10(R1) • Similar forwarding mechanism: A4/MEM forwarded to ID/A1, or ID/M1 etc.
Issue Conditions • Structural hazards: issue only if the needed functional unit is free and the register write port will be available when needed. • RAW data hazards: issue only if the source registers are not pending destinations in the ID/A1, A1/A2, … or A4/MEM pipeline registers. • WAW hazards: stall in ID if an instruction in A1, A2, …, M7 has the same destination register as this instruction.
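The three issue conditions can be sketched as a single check made in ID. This is a simplified model under stated assumptions: `Instr`, `can_issue`, and the argument names are hypothetical, `in_flight` stands for the instructions currently in the execute stages (A1 to A4, M1 to M7), and write-port availability is passed in as a boolean rather than modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    unit: str      # functional unit needed, e.g. "add", "mult", "div"
    dest: str      # destination register, e.g. "F0"
    srcs: tuple    # source registers


def can_issue(instr, in_flight, busy_units, write_port_free):
    """Return (ok, reason) for the instruction currently in ID."""
    # Structural: unit must be free and the write port available when needed.
    if instr.unit in busy_units or not write_port_free:
        return False, "structural"
    pending = {i.dest for i in in_flight}
    # RAW: a source register is still a pending destination.
    if any(s in pending for s in instr.srcs):
        return False, "RAW"
    # WAW: an in-flight instruction has the same destination register.
    if instr.dest in pending:
        return False, "WAW"
    return True, "issue"


mult = Instr("mult", "F0", ("F2", "F4"))          # already executing
waw = can_issue(Instr("add", "F0", ("F2", "F4")), [mult], set(), True)
raw = can_issue(Instr("add", "F6", ("F0", "F8")), [mult], set(), True)
free = can_issue(Instr("add", "F6", ("F2", "F4")), [mult], set(), True)
busy = can_issue(Instr("add", "F6", ("F2", "F4")), [mult], {"add"}, True)
```

The order of the checks only affects which reason is reported; any failing condition keeps the instruction in ID.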
FP Stalls per FP Op [Figure: bar chart of the number of stalls (axis 0.0 to 25.0) per FP operation for the FP SPEC benchmarks doduc, ear, hydro2d, mdljdp, and su2cor; series: Add/subtract/convert, Compares, Multiply, Divide, Divide structural.] Statistics • Avg stalls/FP Add = 1.7 cycles (56% of 3 cycles) • Avg stalls/Mult = 2.8 (46%); Div = 14.2 (59%)
FP Instruction Types • Arithmetic: ADDF, MULTD • Conversion: CVTF2D, CVTI2F • Comparison: LTD, GEF • Branch: BFPT, BFPF • Load/Store: LD, SF
FP & Int Stalls for FP benchmarks [Figure: bar chart of the number of stalls (axis 0.00 to 1.00) for the FP SPEC benchmarks doduc, ear, hydro2d, mdljdp, and su2cor; series: FP result stalls, FP compare stalls, Branch/load stalls, FP structural.] • Range: 0.65 (su2cor) to 1.21 (doduc) • FP result stalls: 0.71/inst (82%); compares: 0.1
Precise Exceptions • Out-of-order completion. [Figure: pipeline timing. MULTF F0, F2, F4: IF ID M1 M2 M3 M4 M5 M6 M7 M WB. ADDF F2, F2, F4: IF ID A1 A2 A3 A4 M WB. The ADDF reaches WB before the earlier MULTF.] (1) Forget about precise exceptions! (2) Do not commit the results of later instructions until earlier instructions are guaranteed to raise no exceptions.
Precise Exceptions contd. • History file for each register – restore the history if needed. (3) Keep track of sufficiently many PCs. On restart, do not restart already completed instructions. (4) Before issue, guarantee that none of the preceding instructions will raise an exception. Determine exception situations early in a functional unit pipeline.
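The history-file idea can be illustrated with a small software model. This is a sketch only: the `HistoryFile` class and the sequence numbers are hypothetical, and it assumes writes to any one register happen in program order (which the WAW stall is meant to guarantee), so undoing the logged writes in reverse order restores a precise state.

```python
class HistoryFile:
    """Logs the old value of each register write so later writes can be undone."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []  # (seq, reg, old_value), in the order the writes happened

    def write(self, seq, reg, value):
        """Register write by the instruction with program-order number seq."""
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self, faulting_seq):
        """Undo every write made by the faulting instruction or anything after it."""
        kept = []
        for seq, reg, old in reversed(self.log):
            if seq >= faulting_seq:
                self.regs[reg] = old
            else:
                kept.append((seq, reg, old))
        self.log = kept[::-1]


rf = HistoryFile({"F0": 0.0, "F2": 2.0, "F4": 3.0})
# ADDF F2, F2, F4 (seq 2) completes before MULTF F0, F2, F4 (seq 1).
rf.write(2, "F2", rf.regs["F2"] + rf.regs["F4"])
f2_after_addf = rf.regs["F2"]
# MULTF then raises an exception: restore the state from before the ADDF.
rf.rollback(1)
```

After the rollback F2 holds its original value again, so the MULTF can be restarted with the operands it was meant to read.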
Case Study: MIPS R4000 (200 MHz) • 8 Stage Pipeline: • IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. • IS–second half of access to instruction cache. • RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection. • EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. • DF–data fetch, first half of access to data cache. • DS–second half of access to data cache. • TC–tag check, determine whether the data cache access hit. • WB–write back for loads and register-register operations. • 8 Stages: What is impact on Load delay? Branch delay? Why?
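Physically Addressed Caches [Figure: a virtual address (Virtual Page#, Page offset) is translated by the TLB into a physical address (Physical Page#, Page offset); the D-Cache Tag and Data access is labeled DF, DS; the Comparator is labeled TC.]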
Case Study: MIPS R4000 [Figure: two pipeline timing diagrams, each showing successive instructions entering IF one cycle apart and passing through IF IS RF EX DF DS TC WB.] • TWO Cycle Load Latency • THREE Cycle Branch Latency (conditions evaluated during EX phase) • Delay slot plus two stalls • Branch likely cancels delay slot if not taken
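The two latencies follow from the stage positions, which a few lines of arithmetic can show. This is a sketch under stated assumptions: the helper `delay_cycles` is hypothetical, load data is assumed to be forwarded at the end of DS, and the branch outcome is taken to be known at the end of EX, as the slide states.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def delay_cycles(produced_at_end_of, needed_at_start_of):
    """Extra cycles between a producer and a dependent instruction.

    A value ready at the end of stage p can feed stage c of a later
    instruction without stalling only if that instruction trails the
    producer by more than index(p) - index(c) cycles.
    """
    p = STAGES.index(produced_at_end_of)
    c = STAGES.index(needed_at_start_of)
    return max(0, p - c)


load_delay = delay_cycles("DS", "EX")    # load data is needed by a following EX
branch_delay = delay_cycles("EX", "IF")  # branch outcome redirects the next IF
```

With these assumptions the load delay comes out as 2 cycles and the branch delay as 3, matching the two captions; the deeper the pipeline between the stage that needs a value and the stage that produces it, the larger both delays become.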
MIPS R4000 Floating Point • FP Adder, FP Multiplier, FP Divider • Last step of FP Multiplier/Divider uses FP Adder HW • 8 kinds of stages in FP units:
Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers
MIPS FP Pipe Stages
FP Instr         1   2     3     4     5     6     7     8     …
Add, Subtract    U   S+A   A+R   R+S
Multiply         U   E+M   M     M     M     N     N+A   R
Divide           U   A     R     D^28  …     D+A   D+R, D+R, D+A, D+R, A, R
Square root      U   E     (A+R)^108   …     A     R
Negate           U   S
Absolute value   U   S
FP compare       U   A     R
(A superscript gives the number of times a stage is repeated.)
Annotations on the slide: "4, 3" and "8, 4" (likely latency and initiation interval in cycles for Add/Subtract and Multiply).
Stages: A Mantissa ADD stage • D Divide pipeline stage • E Exception test stage • M First stage of multiplier • N Second stage of multiplier • R Rounding stage • S Operand shift stage • U Unpack FP numbers
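The stage table can be used to find structural conflicts between two FP operations issued a few cycles apart. The sketch below is a simplification: the names `STAGE_USE`, `resources`, and `conflicts` are hypothetical, each stage kind is treated as a single shared resource, and only the Add and Multiply rows are modeled because their stage sequences are fully listed.

```python
# Stage usage per cycle, copied from the Add/Subtract and Multiply rows.
STAGE_USE = {
    "add":      ["U", "S+A", "A+R", "R+S"],
    "multiply": ["U", "E+M", "M", "M", "M", "N", "N+A", "R"],
}


def resources(op):
    """Set of stage kinds the operation occupies in each of its cycles."""
    return [set(cell.split("+")) for cell in STAGE_USE[op]]


def conflicts(first, second, gap):
    """Stage kinds both operations would need in the same cycle
    if `second` issues `gap` cycles after `first`."""
    a, b = resources(first), resources(second)
    clash = set()
    for cycle, used in enumerate(a):
        j = cycle - gap          # which cycle of `second` overlaps this one
        if 0 <= j < len(b):
            clash |= used & b[j]
    return clash


# Issue distances (1 to 7 cycles) at which an add collides with an earlier multiply.
stalling_gaps = [g for g in range(1, 8) if conflicts("multiply", "add", g)]
```

Under this model an add issued 4 or 5 cycles after a multiply collides with it in the A and R stages, because the last steps of the multiply borrow the FP adder hardware; at the other distances the two operations do not overlap in any stage.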
R4000 Performance • Not ideal CPI of 1: • Load stalls (1 or 2 clock cycles) • Branch stalls (2 cycles + unfilled slots) • FP result stalls: RAW data hazard (latency) • FP structural stalls: Not enough FP hardware (parallelism)
Advanced Pipelining and Instruction Level Parallelism (ILP) • ILP: Overlap execution of unrelated instructions • Pipelining supports a limited kind of parallelism • Pipeline CPI = Ideal CPI + structural stalls + RAW stalls + WAR stalls + WAW stalls + control stalls + memory stalls. • Reduce the RHS terms through HW and SW solutions.
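The CPI equation is a plain sum, which the following sketch evaluates. The function name `pipeline_cpi` and every stall value in `example_stalls` are hypothetical numbers chosen for illustration, not measurements from the lecture.

```python
def pipeline_cpi(ideal_cpi, stalls_per_instr):
    """Pipeline CPI = ideal CPI + the average stall cycles per instruction
    contributed by each hazard category."""
    return ideal_cpi + sum(stalls_per_instr.values())


# Hypothetical stall cycles per instruction for each term of the equation.
example_stalls = {
    "structural": 0.05,
    "raw": 0.70,
    "war": 0.00,
    "waw": 0.00,
    "control": 0.10,
    "memory": 0.05,
}

cpi = pipeline_cpi(1.0, example_stalls)
# Removing the RAW stalls (e.g. by better scheduling) shows the payoff of one term.
cpi_without_raw = pipeline_cpi(1.0, {**example_stalls, "raw": 0.0})
```

Setting one term to zero at a time is a quick way to see which hazard category is worth attacking first.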