290 likes | 685 Views
Pipeline Complications. CS510 Computer Architectures. Lecture 8 - 2. Interrupts. Interrupts: 5 instructions executing in 5-stage pipelineHow to stop the pipeline?How to restart the pipeline?Who caused the interrupt? . StageExceptional ConditionsIFPage fault on instruction fetch;
E N D
1. Pipeline Complications CS510 Computer Architectures Lecture 8 - 1 Lecture 8Advanced Pipeline
2. Pipeline Complications CS510 Computer Architectures Lecture 8 - 2 Interrupts Interrupts: 5 instructions executing in 5-stage pipeline
How to stop the pipeline?
How to restart the pipeline?
Who caused the interrupt?
3. Pipeline Complications CS510 Computer Architectures Lecture 8 - 3 Simultaneous Exceptions in More Than One Pipe Stages Simultaneous exceptions in more than one pipeline stage, e.g. LD followed by ADD
LD with data page(DM) fault in MEM stage
ADD with instruction page(IM) fault in IF stage
ADD fault will happen BEFORE Load fault
Solution #1
Interrupt status vector per instruction
Defer check till the last stage, and kill machine state update if exception
4. Pipeline Complications CS510 Computer Architectures Lecture 8 - 4 Simultaneous Exceptions Complex Addressing Modes and Instructions
Address modes: Auto-increment causes register change during the instruction execution - Register write in EX stage instead of WB stage
Interrupts? Need to restore register state
Adds WAR and WAW hazards since writes in a register in EX, no longer in WB stage
Memory-Memory Move Instructions
Must be able to handle multiple page faults
Long-lived instructions: partial state save on interrupt
Condition Codes
5. Pipeline Complications CS510 Computer Architectures Lecture 8 - 5 Extending the DLX to Handle Multi-cycle Operations
6. Pipeline Complications CS510 Computer Architectures Lecture 8 - 6 Multicycle Operations
7. Pipeline Complications CS510 Computer Architectures Lecture 8 - 7 Latency and Initiation Interval MULTD IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADDD IF ID AI A2 A3 A4 MEM WB
LD* IF ID EX MEM WB
SD* IF ID EX MEM WB
8. Pipeline Complications CS510 Computer Architectures Lecture 8 - 8 Floating Point Operations FP Instruction Latency Initiation Interval (MIPS R4000)
Add, Subtract 4 3
Multiply 8 4
Divide 36 35
Square root 112 111
Negate 2 1
Absolute value 2 1
FP compare 3 2
9. Pipeline Complications CS510 Computer Architectures Lecture 8 - 9 Complications Due to FP Operations Divide, Square Root take @ 10X to @ 30X longer than Add
exceptions?
Adds WAR and WAW hazards since pipelines are no longer same length
10. Pipeline Complications CS510 Computer Architectures Lecture 8 - 10 Summary of Pipelining Basics Hazards limit performance
Structural: need more HW resources
Data: need forwarding, compiler scheduling
Control: early evaluation of PC, delayed branch, prediction
Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
Interrupts, FP Instruction Set makes pipelining harder
Compilers reduce cost of data and control hazards
Load delay slots
Branch delay slots
Branch prediction
11. Pipeline Complications CS510 Computer Architectures Lecture 8 - 11 Case Study:MIPS R4000 and Introduction to Advanced Pipelining
12. Pipeline Complications CS510 Computer Architectures Lecture 8 - 12 Case Study: MIPS R4000 (100 MHz to 200 MHz) 8 Stage Pipeline:
IF First half of fetching of instruction
PC selection
Initiation of instruction cache access
IS - Second half of fetching of instruction
Access to instruction cache
RF Instruction decode, register fetch, hazard checking also instruction cache hit detection(tag check)
EX Execution
Effective address calculation
ALU operation
Branch target computation and condition evaluation
DF - First half of access to data cache
DS - Second half of access to data cache
TC - Tag check for data cache hit
WB -Write back for loads and register-register operations
Answer is 3 stages between branch and new instruction fetch and 2 stages between load and use (even though if looked at red insertions that it would be 3 for load and 2 for branch)
Reasons:
1) Load: TC just does tag check, data available after DS; thus supply the data & forward it, restarting the pipeline on a data cache miss
2) EX phase does the address calculation even though just added one phase; presumed reason is that since want fast clockc cycle don? want to sitck RF phase with reading regisers AND testing for zero, so just moved it back on phaseAnswer is 3 stages between branch and new instruction fetch and 2 stages between load and use (even though if looked at red insertions that it would be 3 for load and 2 for branch)
Reasons:
1) Load: TC just does tag check, data available after DS; thus supply the data & forward it, restarting the pipeline on a data cache miss
2) EX phase does the address calculation even though just added one phase; presumed reason is that since want fast clockc cycle don? want to sitck RF phase with reading regisers AND testing for zero, so just moved it back on phase
13. Pipeline Complications CS510 Computer Architectures Lecture 8 - 13 The Pipeline Structure of the R4000
14. Pipeline Complications CS510 Computer Architectures Lecture 8 - 14 Case Study: MIPS R4000LOAD Latency
15. Pipeline Complications CS510 Computer Architectures Lecture 8 - 15 Case Study: MIPS R4000LOAD Followed by ALU Instructions
16. Pipeline Complications CS510 Computer Architectures Lecture 8 - 16 Case Study: MIPS R4000LOAD Followed by ALU Instructions
17. Pipeline Complications CS510 Computer Architectures Lecture 8 - 17 Case Study: MIPS R4000Branch Latency
18. Pipeline Complications CS510 Computer Architectures Lecture 8 - 18
19. Pipeline Complications CS510 Computer Architectures Lecture 8 - 19 MIPS R4000 FP Unit FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW
8 kinds of stages in FP units:
20. Pipeline Complications CS510 Computer Architectures Lecture 8 - 20 MIPS R4000 FP Pipe Stages FP Instr 1 2 3 4 5 6 7 8
Add, Subtract U S+A A+R R+S
Multiply U E+M M M M N N+A R
Divide U A D28 D+A D+R, D+A, D+R, A, R
Square root U E (A+R)108 A R
Negate U S
Absolute value U S
FP compare U A R
Stages:
M First stage of multiplier N Second stage of multiplier
R Rounding stage A Mantissa ADD stage
S Operand shift stage D Divide pipeline stage
U Unpack FP numbers E Exception test stage
21. Pipeline Complications CS510 Computer Architectures Lecture 8 - 21 MIPS R4000 FP Pipe Stages
22. Pipeline Complications CS510 Computer Architectures Lecture 8 - 22 R4000 Performance Not an ideal pipeline CPI of 1:
Load stalls: (1 or 2 clock cycles)
Branch stalls: (2 cycles for taken br. + unfilled branch slots)
FP result stalls: RAW data hazard (latency)
FP structural stalls: Not enough FP hardware (parallelism)
23. Pipeline Complications CS510 Computer Architectures Lecture 8 - 23
24. Pipeline Complications CS510 Computer Architectures Lecture 8 - 24 Advanced PipelineAndInstruction Level Parallelism
25. Pipeline Complications CS510 Computer Architectures Lecture 8 - 25 Advanced Pipelining and Instruction Level Parallelism gcc 17% control transfer
5 instructions + 1 branch
Beyond single block to get more instruction level parallelism
Loop level parallelism is one opportunity, SW and HW
26. Pipeline Complications CS510 Computer Architectures Lecture 8 - 26 Advanced Pipelining and Instruction Level Parallelism
27. Pipeline Complications CS510 Computer Architectures Lecture 8 - 27 Basic Pipeline Scheduling and Loop Unrolling FP unit latencies
Instruction producing Instruction using Latency in
result result clock cycles
FP ALU op Another FP ALU op 3
FP ALU op Store double 2
Load double* FP ALU op 1
Load double* Store double 0
28. Pipeline Complications CS510 Computer Architectures Lecture 8 - 28 FP Loop Hazards Where are the stalls?
29. Pipeline Complications CS510 Computer Architectures Lecture 8 - 29 FP Loop Showing Stalls
30. Pipeline Complications CS510 Computer Architectures Lecture 8 - 30 Reducing Stalls
31. Pipeline Complications CS510 Computer Architectures Lecture 8 - 31 Revised FP Loop to Minimize Stalls
32. Pipeline Complications CS510 Computer Architectures Lecture 8 - 32 Unroll Loop 4 Times
33. Pipeline Complications CS510 Computer Architectures Lecture 8 - 33 Unrolled Loop to Minimize Stalls
34. Pipeline Complications CS510 Computer Architectures Lecture 8 - 34 Compiler Perspectives on Code Movement Definitions: Compiler is concerned about dependencies in the program, whether this causes a HW hazard or not depends on a given pipeline
Data dependencies (RAW if a hazard for HW): Instruction j is data dependent on instruction i if either
Instruction i produces a result used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
Easy to determine for registers (fixed names)
Hard for memory:
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
35. Pipeline Complications CS510 Computer Architectures Lecture 8 - 35 Compiler Perspectives on Code Movement Name Dependence: Two instructions use the same name(register or memory location) but they do not exchange data
Two kinds of Name Dependence
Instruction i precedes instruction j
Antidependence (WAR if a hazard for HW)
Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first
Output dependence (WAW if a hazard for HW)
Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.
36. Pipeline Complications CS510 Computer Architectures Lecture 8 - 36 Again Hard for Memory Accesses
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) = 20(R6)?
Our example required compiler to know that if R1 doesnt change then:0(R1) -8(R1) -16(R1) -24(R1) 1
There were no dependencies between some loads and stores, so they could be moved by each other. Compiler Perspectives on Code Movement
37. Pipeline Complications CS510 Computer Architectures Lecture 8 - 37 Compiler Perspectives on Code Movement Control Dependence
Example
if p1 {S1;};
if p2 {S2;}
S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.
38. Pipeline Complications CS510 Computer Architectures Lecture 8 - 38 Compiler Perspectives on Code Movement Two (obvious) constraints on control dependencies:
An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.
Control dependencies may be relaxed in some systems to get parallelism; get the same effect if preserve the order of exceptions and data flow
39. Pipeline Complications CS510 Computer Architectures Lecture 8 - 39 When Safe to Unroll Loop? Example: When a loop is unrolled, where are data dependencies? (A,B,C distinct, non-overlapping)
for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1];} /* S2 */
1. S2 uses the value A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations
Implies that iterations are dependent, and cant be executed in parallel
Not the case for our example; each iteration was distinct
40. Pipeline Complications CS510 Computer Architectures Lecture 8 - 40 When Safe to Unroll Loop? Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping)Following looks like there is a loop carried dependence
for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i];} /* S2 */
However, we can rewrite it as follows for loop carried dependence-free
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1];}
B[101] = C[100]+D[100];
41. Pipeline Complications CS510 Computer Architectures Lecture 8 - 41 Summary Instruction Level Parallelism in SW or HW
Loop level parallelism is easiest to see
SW parallelism dependencies defined for a program, hazards if HW cannot resolve
SW dependencies/compiler sophistication determine if compiler can unroll loops