Lecture 8 Advanced Pipeline

Pipeline Complications. CS510 Computer Architectures. Lecture 8 - 2. Interrupts: 5 instructions executing in a 5-stage pipeline. How to stop the pipeline? How to restart the pipeline? Who caused the interrupt? Exceptional conditions by stage: IF: page fault on instruction fetch;

braden

Presentation Transcript


    1. Pipeline Complications CS510 Computer Architectures Lecture 8 - 1 Lecture 8 Advanced Pipeline

    2. Interrupts
       Interrupts: 5 instructions executing in a 5-stage pipeline
       - How to stop the pipeline?
       - How to restart the pipeline?
       - Who caused the interrupt?

    3. Simultaneous Exceptions in More Than One Pipe Stage
       Simultaneous exceptions can occur in more than one pipeline stage, e.g. LD followed by ADD:
       - LD with a data page (DM) fault in the MEM stage
       - ADD with an instruction page (IM) fault in the IF stage
       - The ADD fault will happen BEFORE the LD fault
       Solution #1: interrupt status vector per instruction
       - Defer the check until the last stage, and kill the machine state update if there is an exception

    4. Simultaneous Exceptions: Complex Addressing Modes and Instructions
       - Address modes: auto-increment causes a register change during instruction execution
         - Register write in the EX stage instead of the WB stage
         - Interrupts? Need to restore register state
         - Adds WAR and WAW hazards, since registers are written in EX and no longer only in WB
       - Memory-memory move instructions
         - Must be able to handle multiple page faults
         - Long-lived instructions: partial state save on interrupt
       - Condition codes

    5. Extending the DLX to Handle Multi-cycle Operations

    6. Multicycle Operations

    7. Latency and Initiation Interval
       MULTD   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM  WB
       ADDD    IF  ID  A1  A2  A3  A4  MEM  WB
       LD*     IF  ID  EX  MEM  WB
       SD*     IF  ID  EX  MEM  WB

    8. Floating Point Operations (MIPS R4000)
       FP Instruction     Latency   Initiation Interval
       Add, Subtract          4             3
       Multiply               8             4
       Divide                36            35
       Square root          112           111
       Negate                 2             1
       Absolute value         2             1
       FP compare             3             2

    9. Complications Due to FP Operations
       - Divide and Square Root take about 10X to about 30X longer than Add
       - Exceptions?
       - Adds WAR and WAW hazards, since the pipelines are no longer the same length

    10. Summary of Pipelining Basics
        - Hazards limit performance
          - Structural: need more HW resources
          - Data: need forwarding, compiler scheduling
          - Control: early evaluation of PC, delayed branch, prediction
        - Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
        - Interrupts and the FP instruction set make pipelining harder
        - Compilers reduce the cost of data and control hazards
          - Load delay slots
          - Branch delay slots
          - Branch prediction

    11. Case Study: MIPS R4000 and Introduction to Advanced Pipelining

    12. Case Study: MIPS R4000 (100 MHz to 200 MHz)
        8-stage pipeline:
        - IF: first half of instruction fetch; PC selection, initiation of instruction cache access
        - IS: second half of instruction fetch; access to instruction cache
        - RF: instruction decode, register fetch, hazard checking; also instruction cache hit detection (tag check)
        - EX: execution; effective address calculation, ALU operation, branch target computation and condition evaluation
        - DF: first half of access to data cache
        - DS: second half of access to data cache
        - TC: tag check for data cache hit
        - WB: write back for loads and register-register operations
        Answer is 3 stages between a branch and the new instruction fetch, and 2 stages between a load and its use (even though, looking at the red insertions, it would seem to be 3 for the load and 2 for the branch). Reasons:
        1) Load: TC only does the tag check, and the data is available after DS; so the data is supplied and forwarded, and the pipeline is restarted on a data cache miss.
        2) The EX phase does the address calculation even though only one phase was added; the presumed reason is that, to keep the clock cycle fast, the designers did not want to stick the RF phase with both reading registers AND testing for zero, so the test was moved back one phase.

    13. The Pipeline Structure of the R4000

    14. Case Study: MIPS R4000 LOAD Latency

    15. Case Study: MIPS R4000 LOAD Followed by ALU Instructions

    16. Case Study: MIPS R4000 LOAD Followed by ALU Instructions

    17. Case Study: MIPS R4000 Branch Latency

    18. Pipeline Complications CS510 Computer Architectures Lecture 8 - 18

    19. MIPS R4000 FP Unit
        - FP Adder, FP Multiplier, FP Divider
        - Last step of FP Multiplier/Divider uses FP Adder HW
        - 8 kinds of stages in FP units:

    20. MIPS R4000 FP Pipe Stages
        FP Instr          Stage sequence (cycles 1, 2, 3, ...)
        Add, Subtract     U  S+A  A+R  R+S
        Multiply          U  E+M  M  M  M  N  N+A  R
        Divide            U  A  D^28  D+A  D+R  D+A  D+R  A  R
        Square root       U  E  (A+R)^108  A  R
        Negate            U  S
        Absolute value    U  S
        FP compare        U  A  R
        (A superscript such as D^28 means the stage is repeated that many times.)
        Stages:
        M  First stage of multiplier
        N  Second stage of multiplier
        R  Rounding stage
        A  Mantissa ADD stage
        S  Operand shift stage
        D  Divide pipeline stage
        U  Unpack FP numbers
        E  Exception test stage

    21. MIPS R4000 FP Pipe Stages

    22. R4000 Performance
        Does not achieve the ideal pipeline CPI of 1, because of:
        - Load stalls: 1 or 2 clock cycles
        - Branch stalls: 2 cycles for a taken branch, plus unfilled branch slots
        - FP result stalls: RAW data hazard (latency)
        - FP structural stalls: not enough FP hardware (parallelism)
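As a rough way to read this list, each stall source adds to the ideal CPI of 1. In the sketch below the symbols are assumed notation, not taken from the slides; each s term stands for the average number of stall cycles per instruction from that source.

```latex
\[
\mathrm{CPI} \;=\; 1 \;+\; s_{\text{load}} \;+\; s_{\text{branch}}
             \;+\; s_{\text{FP result}} \;+\; s_{\text{FP structural}}
\]
```

The slides give no measured values for these terms, so none are filled in here.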

    23. Pipeline Complications CS510 Computer Architectures Lecture 8 - 23

    24. Advanced Pipeline And Instruction Level Parallelism

    25. Advanced Pipelining and Instruction Level Parallelism
        - gcc: 17% control transfer, i.e. about 5 instructions + 1 branch
        - Must look beyond a single block to get more instruction level parallelism
        - Loop level parallelism is one opportunity, in SW and HW

    26. Advanced Pipelining and Instruction Level Parallelism

    27. Basic Pipeline Scheduling and Loop Unrolling
        FP unit latencies
        Instruction producing result    Instruction using result    Latency in clock cycles
        FP ALU op                       Another FP ALU op           3
        FP ALU op                       Store double                2
        Load double*                    FP ALU op                   1
        Load double*                    Store double                0

    28. FP Loop Hazards
        Where are the stalls?

    29. FP Loop Showing Stalls

    30. Reducing Stalls

    31. Revised FP Loop to Minimize Stalls

    32. Unroll Loop 4 Times

    33. Unrolled Loop to Minimize Stalls

    34. Compiler Perspectives on Code Movement
        Definitions: the compiler is concerned about dependencies in the program; whether a dependence causes a HW hazard depends on the given pipeline
        Data dependencies (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
        - instruction i produces a result used by instruction j, or
        - instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
        Easy to determine for registers (fixed names)
        Hard for memory:
        - Does 100(R4) = 20(R6)?
        - From different loop iterations, does 20(R6) = 20(R6)?

    35. Compiler Perspectives on Code Movement
        Name dependence: two instructions use the same name (register or memory location) but do not exchange data
        Two kinds of name dependence, where instruction i precedes instruction j:
        - Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
        - Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved

    36. Compiler Perspectives on Code Movement
        Again hard for memory accesses:
        - Does 100(R4) = 20(R6)?
        - From different loop iterations, does 20(R6) = 20(R6)?
        Our example required the compiler to know that if R1 doesn't change, then
            0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
        There were no dependencies between some loads and stores, so they could be moved past each other.

    37. Compiler Perspectives on Code Movement
        Control dependence
        Example:
            if p1 {S1;};
            if p2 {S2;}
        S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

    38. Compiler Perspectives on Code Movement
        Two (obvious) constraints on control dependencies:
        - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
        - An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
        Control dependencies may be relaxed in some systems to get parallelism; the same effect is obtained if the order of exceptions and the data flow are preserved.

    39. When Safe to Unroll Loop?
        Example: when a loop is unrolled, where are the data dependencies? (A, B, C distinct, non-overlapping)
            for (i=1; i<=100; i=i+1) {
                A[i+1] = A[i] + C[i];    /* S1 */
                B[i+1] = B[i] + A[i+1];  /* S2 */
            }
        1. S2 uses the value A[i+1], computed by S1 in the same iteration.
        2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
        This is a loop-carried dependence between iterations
        - It implies that the iterations are dependent and can't be executed in parallel
        - This was not the case for our earlier example, where each iteration was distinct

    40. When Safe to Unroll Loop?
        Example: where are the data dependencies? (A, B, C, D distinct and non-overlapping)
        The following looks like it has a loop-carried dependence:
            for (i=1; i<=100; i=i+1) {
                A[i] = A[i] + B[i];      /* S1 */
                B[i+1] = C[i] + D[i];    /* S2 */
            }
        However, we can rewrite it as follows so that no dependence is carried between iterations:
            A[1] = A[1] + B[1];
            for (i=1; i<=99; i=i+1) {
                B[i+1] = C[i] + D[i];
                A[i+1] = A[i+1] + B[i+1];
            }
            B[101] = C[100] + D[100];

    41. Summary
        - Instruction Level Parallelism in SW or HW
        - Loop level parallelism is easiest to see
        - SW parallelism dependencies are defined for a program; they become hazards if HW cannot resolve them
        - SW dependencies and compiler sophistication determine whether the compiler can unroll loops
