Lecture 8 Advanced Pipeline

1. Pipeline Complications CS510 Computer Architectures Lecture 8 - 1 Lecture 8: Advanced Pipeline

2. Interrupts
Interrupts: 5 instructions executing in a 5-stage pipeline
  How to stop the pipeline?
  How to restart the pipeline?
  Which instruction caused the interrupt?

3. Simultaneous Exceptions in More Than One Pipe Stage
Simultaneous exceptions in more than one pipeline stage, e.g. LD followed by ADD:
  LD with data page (DM) fault in MEM stage
  ADD with instruction page (IM) fault in IF stage
  The ADD fault will happen BEFORE the LD fault
Solution #1: interrupt status vector per instruction
  Defer the check until the last stage, and kill the machine state update if there is an exception
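The LD/ADD case above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Instr` class, the `run` function and the fault names are hypothetical, one instruction issues per cycle, and a faulting instruction simply carries its status vector forward instead of being turned into a no-op as real hardware would do.

```python
from dataclasses import dataclass, field

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

@dataclass
class Instr:
    name: str
    faults: dict = field(default_factory=dict)   # stage -> fault (hypothetical input)
    status: list = field(default_factory=list)   # per-instruction exception status vector

def run(program):
    """Issue one instruction per cycle; return (detection order, handled faults)."""
    detected, handled = [], []
    for cycle in range(len(program) + len(STAGES) - 1):
        # Within a cycle, visit older instructions first.
        for idx, ins in enumerate(program):
            stage_no = cycle - idx
            if not 0 <= stage_no < len(STAGES):
                continue
            stage = STAGES[stage_no]
            if stage in ins.faults:
                # Record the fault in the status vector; do not act on it yet.
                ins.status.append(ins.faults[stage])
                detected.append((cycle, ins.name, ins.faults[stage]))
            if stage == "WB" and ins.status:
                # Deferred check in the last stage: handle the fault of the oldest
                # instruction and suppress all state updates by younger ones.
                handled.append((ins.name, ins.status[0]))
                return detected, handled
    return detected, handled

ld = Instr("LD", {"MEM": "data page fault"})
add = Instr("ADD", {"IF": "instruction page fault"})
detected, handled = run([ld, add])
```

Here the ADD fault is detected first (cycle 1, versus cycle 3 for LD), yet the LD fault is the one handled, because the check is deferred until each instruction reaches WB in program order.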

4. Simultaneous Exceptions
Complex Addressing Modes and Instructions
Address modes: auto-increment causes a register change during instruction execution
  - Register write in EX stage instead of WB stage
  Interrupts? Need to restore register state
  Adds WAR and WAW hazards, since registers are written in EX and no longer only in WB
Memory-Memory Move Instructions
  Must be able to handle multiple page faults
  Long-lived instructions: partial state save on interrupt
Condition Codes

5. Extending the DLX to Handle Multi-cycle Operations

6. Multicycle Operations

7. Latency and Initiation Interval
MULTD  IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADDD   IF ID A1 A2 A3 A4 MEM WB
LD*    IF ID EX MEM WB
SD*    IF ID EX MEM WB

8. Floating Point Operations (MIPS R4000)
FP Instruction     Latency   Initiation Interval
Add, Subtract          4          3
Multiply               8          4
Divide                36         35
Square root          112        111
Negate                 2          1
Absolute value         2          1
FP compare             3          2
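To make the two columns concrete, the sketch below schedules back-to-back operations of the same type on one unit. It assumes that the initiation interval is the number of cycles between starting two operations of the same type, and that latency is counted from the start of an operation until its result is available; the names `R4000_FP`, `schedule` and `cycles_for` are hypothetical.

```python
# (latency, initiation interval) per FP operation, copied from the table above.
R4000_FP = {
    "add": (4, 3), "multiply": (8, 4), "divide": (36, 35),
    "sqrt": (112, 111), "negate": (2, 1), "abs": (2, 1), "compare": (3, 2),
}

def schedule(op, count):
    """Return (start cycle, result-ready cycle) for `count` back-to-back ops."""
    latency, interval = R4000_FP[op]
    return [(k * interval, k * interval + latency) for k in range(count)]

def cycles_for(op, count):
    """Cycle at which the last of `count` independent ops has its result."""
    return schedule(op, count)[-1][1]

multiplies = schedule("multiply", 3)   # [(0, 8), (4, 12), (8, 16)]
two_divides = cycles_for("divide", 2)  # 71
```

The multiplier is partly pipelined, so a new multiply can start every 4 cycles, while the divider is almost fully occupied by one operation: two divides take 71 cycles, nearly twice the single-divide latency.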

9. Complications Due to FP Operations
Divide and Square Root take about 10X to 30X longer than Add
  Exceptions?
  Adds WAR and WAW hazards since pipelines are no longer the same length
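A minimal sketch of why unequal pipeline lengths introduce WAW hazards, using the stage counts from the slide 7 diagram (7 multiply stages, 4 add stages, 1 EX stage for a load). It assumes one instruction issues per cycle with no stalls and that every instruction passes through MEM before WB; `wb_cycle` and `waw_hazards` are hypothetical names.

```python
EXEC_STAGES = {"MULTD": 7, "ADDD": 4, "LD": 1}   # execute stages per the slide 7 diagram

def wb_cycle(op, issue_cycle):
    """Cycle of WB: IF, ID, the execute stages, MEM, then WB."""
    return issue_cycle + 2 + EXEC_STAGES[op] + 1

def waw_hazards(program):
    """program: list of (op, dest register), issued in order, one per cycle.

    Report (i, j, register) when a later instruction j writes the same
    register no later than the earlier instruction i does.
    """
    hazards = []
    for i, (op_i, dest_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            op_j, dest_j = program[j]
            if dest_i == dest_j and wb_cycle(op_j, j) <= wb_cycle(op_i, i):
                hazards.append((i, j, dest_i))
    return hazards

program = [("MULTD", "F0"), ("ADDD", "F2"), ("LD", "F0")]
hazards = waw_hazards(program)   # [(0, 2, "F0")]
```

The LD reaches WB in cycle 6 while the earlier MULTD reaches it in cycle 10, so without a stall or squashing of the older write, F0 would end up holding the stale MULTD result.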

10. Summary of Pipelining Basics
Hazards limit performance
  Structural: need more HW resources
  Data: need forwarding, compiler scheduling
  Control: early evaluation of PC, delayed branch, prediction
Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
Interrupts and the FP instruction set make pipelining harder
Compilers reduce the cost of data and control hazards
  Load delay slots
  Branch delay slots
  Branch prediction

11. Case Study: MIPS R4000 and Introduction to Advanced Pipelining

12. Case Study: MIPS R4000 (100 MHz to 200 MHz)
8-stage pipeline:
  IF - First half of instruction fetch: PC selection, initiation of instruction cache access
  IS - Second half of instruction fetch: access to instruction cache
  RF - Instruction decode, register fetch, hazard checking; also instruction cache hit detection (tag check)
  EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
  DF - First half of access to data cache
  DS - Second half of access to data cache
  TC - Tag check for data cache hit
  WB - Write back for loads and register-register operations
Answer: 3 stages between a branch and the new instruction fetch, and 2 stages between a load and its use (even though, looking at the red insertions, it would appear to be 3 for load and 2 for branch).
Reasons:
  1) Load: TC just does the tag check; the data is available after DS, so the data is supplied and forwarded, and the pipeline is restarted on a data cache miss.
  2) Branch: the EX phase does the address calculation even though just one phase was added; the presumed reason is that, wanting a fast clock cycle, the designers did not want to burden the RF phase with both reading registers AND testing for zero, so the test was moved back one phase.
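The two delays quoted above follow from the stage positions. The sketch below assumes full forwarding, with load data usable at the end of DS and the branch outcome known at the end of EX; `stage_distance` is a hypothetical helper.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

def stage_distance(producer_stage, consumer_stage):
    """Stages between where a value is produced and where it is consumed."""
    return R4000_STAGES.index(producer_stage) - R4000_STAGES.index(consumer_stage)

# Load data is available at the end of DS and is needed at the start of EX.
load_delay = stage_distance("DS", "EX")     # 2
# The branch condition is evaluated in EX and redirects the fetch in IF.
branch_delay = stage_distance("EX", "IF")   # 3
```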

13. The Pipeline Structure of the R4000

14. Case Study: MIPS R4000 LOAD Latency

15. Case Study: MIPS R4000 LOAD Followed by ALU Instructions

16. Case Study: MIPS R4000 LOAD Followed by ALU Instructions

17. Case Study: MIPS R4000 Branch Latency

19. MIPS R4000 FP Unit
FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW
8 kinds of stages in FP units:

20. MIPS R4000 FP Pipe Stages
FP Instr         1   2     3     4     5   6   7     8   ...
Add, Subtract    U   S+A   A+R   R+S
Multiply         U   E+M   M     M     M   N   N+A   R
Divide           U   A     D^28  ...   D+A, D+R, D+A, D+R, A, R
Square root      U   E     (A+R)^108   ...   A   R
Negate           U   S
Absolute value   U   S
FP compare       U   A     R
Stages:
  M  First stage of multiplier
  N  Second stage of multiplier
  R  Rounding stage
  A  Mantissa ADD stage
  S  Operand shift stage
  D  Divide pipeline stage
  U  Unpack FP numbers
  E  Exception test stage

21. MIPS R4000 FP Pipe Stages

22. R4000 Performance
Not an ideal pipeline CPI of 1:
  Load stalls: 1 or 2 clock cycles
  Branch stalls: 2 cycles for a taken branch + unfilled branch slots
  FP result stalls: RAW data hazard (latency)
  FP structural stalls: not enough FP hardware (parallelism)
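The four stall sources add to the ideal CPI of 1. The following sketch shows the bookkeeping only; all numeric inputs are made-up placeholders rather than measured R4000 data, and the function names are hypothetical.

```python
def branch_stalls_per_instr(branch_freq, taken_frac, unfilled_slots, taken_penalty=2):
    """Average branch stall cycles per instruction.

    taken_penalty=2 follows the slide (2 cycles for a taken branch);
    unfilled_slots is the average number of unfilled delay slots per branch.
    """
    return branch_freq * (taken_frac * taken_penalty + unfilled_slots)

def pipeline_cpi(load, branch, fp_result, fp_structural, ideal_cpi=1.0):
    """CPI = ideal CPI + average stall cycles per instruction from each source."""
    return ideal_cpi + load + branch + fp_result + fp_structural

# Placeholder inputs, chosen only to show the arithmetic.
branch = branch_stalls_per_instr(branch_freq=0.2, taken_frac=0.5, unfilled_slots=0.5)
cpi = pipeline_cpi(load=0.1, branch=branch, fp_result=0.2, fp_structural=0.05)
```

With these placeholder values the branch term is 0.3 cycles per instruction and the total CPI is 1.65.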

24. Advanced Pipeline and Instruction Level Parallelism

25. Advanced Pipelining and Instruction Level Parallelism
gcc: 17% of instructions are control transfers
  i.e. about 5 instructions + 1 branch per basic block
  Must go beyond a single basic block to get more instruction level parallelism
Loop level parallelism is one opportunity, in SW and HW
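The "5 instructions + 1 branch" figure is the reciprocal of the control-transfer frequency. A quick check, assuming every control transfer ends a basic block (`avg_basic_block` is a hypothetical name):

```python
def avg_basic_block(control_transfer_fraction):
    """Return (instructions per block, non-branch instructions per block)."""
    total = 1.0 / control_transfer_fraction
    return total, total - 1.0

total, non_branch = avg_basic_block(0.17)   # roughly 5.9 and 4.9
```

One control transfer in every 5.9 instructions leaves about 5 ordinary instructions between branches, which is very little work to overlap within a single block.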

26. Advanced Pipelining and Instruction Level Parallelism

27. Basic Pipeline Scheduling and Loop Unrolling
FP unit latencies
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double*                   FP ALU op                  1
Load double*                   Store double               0
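The latency table can be turned into a stall count for a short dependent sequence. This is a sketch assuming in-order issue of one instruction per cycle, where a latency of n means n intervening cycles are needed between producer and consumer; the three-instruction program (load, FP add, store) is an illustrative example rather than a listing from the slides, and `schedule` is a hypothetical name.

```python
LATENCY = {   # (producer kind, consumer kind) -> latency, from the table above
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}

def schedule(program):
    """program: list of (kind, dest, sources). Return the issue cycle of each."""
    producers = {}   # register -> (issue cycle, kind) of its latest writer
    issue = []
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle
        for reg in sources:
            if reg in producers:
                p_cycle, p_kind = producers[reg]
                # Pairs not in the table are assumed to need no extra cycles.
                earliest = max(earliest, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        issue.append(earliest)
        if dest is not None:
            producers[dest] = (earliest, kind)
        cycle = earliest + 1
    return issue

program = [
    ("load", "F0", ["R1"]),           # load double into F0
    ("fp_alu", "F4", ["F0", "F2"]),   # FP add using F0
    ("store", None, ["F4", "R1"]),    # store double from F4
]
issue = schedule(program)                   # [0, 2, 5]
stalls = issue[-1] - (len(program) - 1)     # 3
```

One stall separates the load from the FP add and two separate the FP add from the store, which is the kind of gap that scheduling and unrolling try to fill with independent work.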

28. FP Loop Hazards
Where are the stalls?

29. FP Loop Showing Stalls

30. Reducing Stalls

31. Revised FP Loop to Minimize Stalls

32. Unroll Loop 4 Times

33. Unrolled Loop to Minimize Stalls

34. Compiler Perspectives on Code Movement
Definitions: the compiler is concerned about dependencies in the program; whether a dependency causes a HW hazard depends on the given pipeline
Data dependencies (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
  Instruction i produces a result used by instruction j, or
  Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
Easy to determine for registers (fixed names)
Hard for memory:
  Does 100(R4) = 20(R6)?
  From different loop iterations, does 20(R6) = 20(R6)?
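The recursive definition of data dependence amounts to a transitive closure over direct producer-consumer pairs. A minimal sketch for register names only, with a hypothetical instruction encoding of (destination, sources):

```python
def data_dependences(program):
    """program: list of (dest, sources). Return (direct, closure) sets of (i, j),
    meaning instruction j is data dependent on instruction i."""
    direct = set()
    last_writer = {}
    for j, (dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                direct.add((last_writer[reg], j))   # i produces a result used by j
        if dest is not None:
            last_writer[dest] = j
    # Second clause of the definition: chain dependences through k.
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for (i, k) in list(closure):
            for (k2, j) in list(closure):
                if k == k2 and (i, j) not in closure:
                    closure.add((i, j))
                    changed = True
    return direct, closure

program = [
    ("F0", ["R1"]),         # 0: load into F0
    ("F4", ["F0", "F2"]),   # 1: FP add reading F0
    (None, ["F4", "R1"]),   # 2: store reading F4
]
direct, closure = data_dependences(program)
```

The store never reads F0, yet it is data dependent on the load through the add, so the pair (0, 2) appears only in the closure.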

35. Compiler Perspectives on Code Movement
Name dependence: two instructions use the same name (register or memory location) but do not exchange data
Two kinds of name dependence, where instruction i precedes instruction j:
  Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved
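The definitions above reduce to set intersections on what each instruction reads and writes. The sketch below classifies a single pair and ignores any instructions in between; the `classify` function and the (writes, reads) encoding are hypothetical.

```python
def classify(instr_i, instr_j):
    """Each instruction is (set of names written, set of names read); i precedes j."""
    writes_i, reads_i = instr_i
    writes_j, reads_j = instr_j
    kinds = set()
    if writes_i & reads_j:
        kinds.add("data (RAW)")      # j reads what i wrote
    if reads_i & writes_j:
        kinds.add("anti (WAR)")      # j overwrites what i still has to read
    if writes_i & writes_j:
        kinds.add("output (WAW)")    # both write the same name
    return kinds

fp_add = ({"F4"}, {"F0", "F2"})      # writes F4, reads F0 and F2
load_f0 = ({"F0"}, {"R1"})           # writes F0, reads R1
load_f4 = ({"F4"}, {"R1"})           # writes F4, reads R1

anti = classify(fp_add, load_f0)     # {"anti (WAR)"}
output = classify(fp_add, load_f4)   # {"output (WAW)"}
```

Neither pair exchanges data, so renaming the destination register of the second instruction would remove the dependence.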

36. Compiler Perspectives on Code Movement
Again Hard for Memory Accesses
  Does 100(R4) = 20(R6)?
  From different loop iterations, does 20(R6) = 20(R6)?
Our example required the compiler to know that if R1 doesn't change then:
  0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
There were no dependencies between some loads and stores, so they could be moved past each other.
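A compiler can decide the same-base case with simple offset arithmetic, while the different-base case stays unknown. This sketch assumes 8-byte (double) accesses and that the base register is not modified between the two accesses; `may_alias` is a hypothetical helper.

```python
def may_alias(a, b, width=8):
    """a, b: (offset, base register).

    Return True or False when decidable, or None when the answer depends on
    run-time register values.
    """
    (off_a, base_a), (off_b, base_b) = a, b
    if base_a != base_b:
        return None                      # e.g. 100(R4) vs 20(R6)
    return abs(off_a - off_b) < width    # same base: compare offsets only

unrolled = [(0, "R1"), (-8, "R1"), (-16, "R1"), (-24, "R1")]
independent = all(
    may_alias(x, y) is False
    for k, x in enumerate(unrolled)
    for y in unrolled[k + 1:]
)
unknown = may_alias((100, "R4"), (20, "R6"))
```

All four accesses of the unrolled loop are provably distinct, so their loads and stores can be reordered, whereas nothing can be concluded about 100(R4) and 20(R6).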

37. Compiler Perspectives on Code Movement
Control dependence
Example:
  if p1 {S1;};
  if p2 {S2;}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

38. Compiler Perspectives on Code Movement
Two (obvious) constraints on control dependencies:
  An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
  An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
Control dependencies may be relaxed in some systems to get parallelism; the same effect is obtained as long as the order of exceptions and the data flow are preserved.

39. When Safe to Unroll Loop?
Example: when a loop is unrolled, where are the data dependencies? (A, B, C distinct, non-overlapping)
  for (i=1; i<=100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }
1. S2 uses the value A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a loop-carried dependence between iterations.
  It implies that iterations are dependent and can't be executed in parallel.
  This was not the case for our earlier example, where each iteration was distinct.
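One way to see the loop-carried dependence is to execute the iterations in a different order and compare results. The sketch below does this for the loop above; the initial array contents are arbitrary choices made for the demonstration, and `run_loop` is a hypothetical name.

```python
def run_loop(order, n=100):
    """Execute the loop body for each i in `order` and return the final A and B."""
    A = [1] * (n + 2)
    B = [1] * (n + 2)
    C = list(range(n + 2))           # C[i] = i, arbitrary test data
    for i in order:
        A[i + 1] = A[i] + C[i]       # S1
        B[i + 1] = B[i] + A[i + 1]   # S2
    return A, B

forward = run_loop(range(1, 101))        # program order
backward = run_loop(range(100, 0, -1))   # same iterations, reversed
order_matters = forward != backward      # True
```

Because each iteration reads A[i] and B[i] written by the previous one, reversing the order changes the result, so the iterations cannot simply be run in parallel.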

40. When Safe to Unroll Loop?
Example: where are the data dependencies? (A, B, C, D distinct and non-overlapping)
The following looks like it has a loop-carried dependence:
  for (i=1; i<=100; i=i+1) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
  }
However, it can be rewritten as follows so that no dependence is carried between iterations:
  A[1] = A[1] + B[1];
  for (i=1; i<=99; i=i+1) {
      B[i+1] = C[i] + D[i];
      A[i+1] = A[i+1] + B[i+1];
  }
  B[101] = C[100] + D[100];
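The equivalence of the two versions can be checked by running both on the same data. This is a direct transcription of the two loops into Python with 102-element lists, so that indices 1 to 101 are valid; the random test data and the function names are assumptions made for the check.

```python
import random

def original(A, B, C, D):
    A, B = A[:], B[:]                    # work on copies
    for i in range(1, 101):
        A[i] = A[i] + B[i]               # S1
        B[i + 1] = C[i] + D[i]           # S2
    return A, B

def rewritten(A, B, C, D):
    A, B = A[:], B[:]
    A[1] = A[1] + B[1]                   # peeled first S1
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]           # S2 of iteration i
        A[i + 1] = A[i + 1] + B[i + 1]   # S1 of iteration i+1
    B[101] = C[100] + D[100]             # peeled last S2
    return A, B

rng = random.Random(0)
arrays = [[rng.randint(-9, 9) for _ in range(102)] for _ in range(4)]
same = original(*arrays) == rewritten(*arrays)   # True
```

In the rewritten loop the value of B[i+1] is produced and consumed inside the same iteration, so the only remaining dependence is within an iteration and the iterations themselves are independent.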

41. Summary
Instruction level parallelism in SW or HW
Loop level parallelism is easiest to see
SW parallelism: dependencies are defined for a program; they become hazards if HW cannot resolve them
SW dependencies and compiler sophistication determine whether the compiler can unroll loops
