1 / 22

CSC 4250 Computer Architectures

CSC 4250 Computer Architectures. September 26, 2006 Appendix A. Pipelining. Checks before Instruction Issue. Check for structural hazards ─ FP divide, register write port

phuc
Download Presentation

CSC 4250 Computer Architectures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSC 4250Computer Architectures September 26, 2006Appendix A. Pipelining

  2. Checks before Instruction Issue • Check for structural hazards ─ FP divide, register write port • Check for RAW hazard ─ check if source registers of instruction in ID is listed as a destination in ID/A1, A1/A2, A2/A3, ID/M1, M1/M2, M2/M3, …, M5/M6, D • Check for WAW hazard ─ check if instruction in ID has the same register destination as any instruction in A1, …, A4, M1, …, M7, D • Have we forgotten about WAR hazard?

  3. Example on why no WAR hazards • Modify Figure A.33: • L.D F4,O(R2) • MUL.D F0,F4,F6 • ADD.D F4,F0,F8 • S.D F4,O(R2) Clock cycle Number In. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1. IF ID EX ME WB 2. IF 3. IF 4. IF Fill in the blanks

  4. How Imprecise Exception May Arise • Consider code: DIV.D F0,F2,F4 ADD.D F10,F10,F8 SUB.D F12,F12,F14 • Will get out-of-order completion • Suppose SUB.D causes a FP exception after ADD.D completes (but before DIV.D finishes) • Next, DIV.D causes a FP exception • Cannot restore the state to before DIV.D, as ADD.D has destroyed one of its operands

  5. Performance of MIPS FP Pipeline • The MIPS FP pipeline generates both structural stalls for the divide unit and stalls for RAW hazards (it can also have WAW hazards, but this rarely occurs in practice). • Figure A.35 shows the number of stall cycles for each type of FP operation. The stall cycles per operation track the latency of the FP operations, varying from 46% to 59% of the latency of the FU.

  6. Fig. A.35. Stalls per FP operation for FP SPEC89

  7. Stalls for SPEC89 FP Benchmarks • Consider data in Figure A.35 • Except for the divide structural hazards, these data do not depend on the frequency of an operation, only on its latency and the # of cycles before the result is used • The number of stalls from RAW hazards roughly tracks the latency of the FP unit. For example, the average number of stalls per FP add, subtract, or convert is 1.7 cycles, or 56% of the latency (3 cycles). Likewise, the average number of stalls for multiplies and divides are 2.8 and 14.2, resp., or 46% and 59% of the corr. latency • Structural hazards for divides are rare, since the divide frequency is low

  8. Fig. A.36. Stalls per instr. for FP SPEC89

  9. MIPS R4000 Pipeline • Implements MIPS64 but uses a deeper pipeline • Achieve higher clock rates by decomposing five-stage integer pipeline into eight stages • Extra pipeline stages from decomposing memory access • Sometimes called superpipelining

  10. Eight-stage Pipeline Structure of R4000 • It uses pipelined instruction and data caches

  11. Functions of the Eight Stages (1) • IF ─ • First half of instruction fetch • PC selection • Initiation of instruction cache access • IS ─ • Second half of instruction fetch • Complete instruction cache access • RF ─ • Instruction decode and register fetch • Hazard check • Instruction cache hit detection • EX ─ • Execution (including effective address calculation, ALU operation, branch-target calc. and condition evaluation.)

  12. Functions of the Eight Stages (2) • DF ─ Data fetch, 1st half of data cache access • DS ─ 2nd half of data fetch, Complete data cache access • TC ─ Tag check, Determine if data cache access is hit • WB ─ Write back for loads and RR operations

  13. Two-cycle Load Delay for R4000

  14. Example of Load 2-cycle Stall Clock number Instruction no. 1 2 3 4 5 6 7 8 9 LD R1,… IF IS RF EX DF DS TC WB DADD R2,R1,… IF IS RF st st EX DF DS DSUB R3,R1,… IF IS st st RF EX DF OR R4,R1 IF st st IS RF EX

  15. Three-cycle Branch Delay for R4000 • Evaluate branch condition during EX

  16. Example of Taken Branch Clock number Instruction no. 1 2 3 4 5 6 7 8 9 Branch instruction IF IS RF EX DF DS TC WB Delay slot IF IS RF EX DF DS TC WB Stall st st st st st st st Stall st st st st st st Branch instruction IF IS RF EX DF

  17. Example of Untaken Branch Clock number Instruction no. 1 2 3 4 5 6 7 8 9 Branch instruction IF IS RF EX DF DS TC WB Delay slot IF IS RF EX DF DS TC WB Branch instruction+2 IF IS RF EX DF DS TC Branch instruction+3 IF IS RF EX DF DS • R4000 uses a predicted-not-taken strategy for the remaining two cycles of the branch delay. • Advantage over predicted-taken strategy?

  18. R4000 FP Pipeline

  19. Major Causes of R4000 Pipeline Stalls • Load stalls ─ use of a load result 1 or 2 cycles after the load • Branch stalls ─ 2-cycle stall on taken branch plus unfilled or cancelled branch delay slots • FP result stalls ─ RAW hazards for operand • FP structural stalls ─ conflicts for func. units • WAW stalls are not common

  20. Pipeline CPI for 10 SPEC92 Benchmarks • The pipeline CPI varies from 1.2 to 2.8. The left five programs are integer programs, and branch delays are the major CPI contributor. The right five programs are FP, and FP result stalls are the major contributors.

  21. Pipeline CPI and Major Sources of Stalls

  22. MIPS R4300 Pipeline • Manufactured by NEC • 64 bit processor implements MIPS64 IS • Embedded applications • Nintendo-64 game processor • High-end color laser printer • Multiple EX stages for FP operations • Instructions complete out of order • FP instruction generates exception after a following integer instruction has completed, leading to an imprecise exception

More Related