370 likes | 505 Views
Lecture 8 Advanced Pipeline. EX int unit. IF ID. IF ID. EX FP/int Multiply. EX. MEM WB. MEM WB. EX FP adder. EX FP/Int divider. Extending the DLX to Handle Multi-cycle Operations. DLX pipeline with 3 additional unpipelined,
E N D
Lecture 8Advanced Pipeline CS510 Computer Architectures
EX intunit IF ID IF ID EX FP/int Multiply EX MEM WB MEM WB EX FP adder EX FP/Int divider Extending the DLX to Handle Multi-cycle Operations DLX pipeline with 3 additional unpipelined, floadting-point functional units CS510 Computer Architectures
integer unit EX FP/integer multiply M1 M2 M3 M4 M6 M7 M5 IF ID MEM WB FP adder A1 A2 A3 A4 FP/integer divider DIV Multicycle Operations 24 clock cycles CS510 Computer Architectures
Data neededResult available Example latency initiation interval Latency and Initiation Interval * FP LD and ST are same as integer by having 64-bit path to memory. MULTD IF IDM1M2 M3 M4 M5 M6M7 MEM WB ADDD IF IDAIA2 A3A4 MEM WB LD* IF IDEXMEMWB SD* IF IDEXMEMWB Latency: Number of intervening cycles between an instruction that produces a result and an instruction that uses the result Initiation Interval:number of cycles that must elapse between issuing of two operations of a given type Integer ALU 0 1 Load 1 1 FP add 3 1 FP mul 6 1 FP div 24 25 CS510 Computer Architectures
Reality: MIPS R4000 Floating Point Operations Floating Point: long execution time Also, pipeline FP execution unit may initiate new instructions without waiting full latency FP Instruction Latency Initiation Interval (MIPS R4000) Add, Subtract 4 3 Multiply 8 4 Divide 36 35 Square root 112 111 Negate 2 1 Absolute value 2 1 FP compare 3 2 Cycles before using result Cycles before issuing instr of the same type CS510 Computer Architectures
Complications Due to FP Operations (in DLX) • Because the divide unit is not fully pipelined, structural hazards can ocur • WAW hazards are possible, since instructions no longer reach WB in order. (WAR hazards are not possible, since register reads always occur in ID) • Instructions can complete in a different order than they were issued, causing problems with exceptions • Because of longer latency of operations, stalls for RAW hazards will be more frequent CS510 Computer Architectures
Summary of Pipelining Basics • Hazards limit performance • Structural: need more HW resources • Data: need forwarding, compiler scheduling • Control: early evaluation of PC, delayed branch, prediction • Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency • Interrupts, FP Instruction Set makes pipelining harder • Compilers reduce cost of data and control hazards • Load delay slots • Branch delay slots • Branch prediction CS510 Computer Architectures
Case Study:MIPS R4000 and Introduction to Advanced Pipelining CS510 Computer Architectures
Case Study:MIPS R4000 Pipeline 8 Stage Pipeline: IFFirst half of fetching of instruction • PC selection • Initiation of instruction cache access IS- Second half of fetching of instruction • Access to instruction cache RFInstruction decode, register fetch, hazard checking, and also instruction cache hit detection(tag check) EX Execution • Effective address calculation • ALU operation • Branch target computation and condition evaluation DF - First half of access to data cache DS-Second half of access to data cache TC-Tag checkfor data cache hit WB -Write backfor loads and register-register operations CS510 Computer Architectures
IF IS RF EX DF DS TC WB Instruction Memory load data available Instruction is available Tag check The Pipeline Structure of the R4000 ALU REG REG Data Memory CS510 Computer Architectures
Load data available with forwarding EX IF IS RF EX DF DS TC . . . EX Load data needed 2 Stall Cycles Case Study: MIPS R4000LOAD Latency LD R1, X IF IS RF EX DF DS TC WB ADD R3, R1, R2 IF IS RF EX DF DS TC WB IF IS RF EX DF DS IF IS RF EX DF DS . . . IF IS RF EX DF . . . 2 Cycle Load Latency CS510 Computer Architectures
IF IS IF RF IS IF EX RF IS IF DF stall stall stall DS stall stall stall TC EX RF IS WB DF ... EX ... RF ... LWR1 ADD R2,R1 SUB R3,R1 OR R4,R1 Forwarding Case Study: MIPS R4000LOAD Followed by ALU Instructions 2 cycle Load Latency with Forwarding Circuit CS510 Computer Architectures
R4000 uses PredictNOT TAKEN TAKENBr Delay Slot Stall Stall Br Target instr IF IS IF RF IS RF DF EX DS DF IS TC DS RF WB TC ... EX ... EX IF NOT TAKENBr Delay Slot Br instr +2 Br instr +3 Br instr +4 IF IS IF RF IS IF RF IS IF DF EX RF IS DS DF EX RF IS WB TC ... DS ... DF ... EX ... TC DS DF EX RF EX IF Case Study: MIPS R4000Branch Latency Branch target address available after EX stage Delay Slot plus 2 stall cycles Predict NOT TAKEN strategy NOT TAKEN:one-cycle delayed slot TAKEN:one-cycle delayed slot followedby two stalls - 3 cycle latency CS510 Computer Architectures
Integer Unit(EX) FP Multiplier FP/integer multiply IF ID MEM WB FP Adder FP Divider Extending DLX to Handle Floating Point Operations CS510 Computer Architectures
Stage Functional unit Description MIPS R4000 FP Unit • FP Adder, FP Multiplier, FP Divider • Last step of FP Multiplier/Divider uses FP Adder HW • 8 kinds of stages in FP units: (single copy of each) • A FP adder Mantissa ADD stage • D FP divider Divide pipeline stage • E FP multiplier Exception test stage • M FP multiplier First stage of multiplier • N FP multiplier Second stage of multiplier • R FP adder Rounding stage • S FP adder Operand shift stage • U Unpack FP numbers CS510 Computer Architectures
MIPS R4000 FP Pipe Stages FP Instr 1 2 3 4 5 6 7 8 ¼ latency Add, SubtractU S+A A+R R+S 4 MultiplyU E+M M M M N N+A R 8 DivideU A R D27 … D+A D+R, D+A, D+R, A, R 36 Square rootU E (A+R)108¼ A R 112 NegateU S 2 Absolute value U S 2 FP compareU A R 3 Stages: MFirst stage of multiplierN Second stage of multiplier R Rounding stage A Mantissa ADD stage S Operand shift stage D Divide pipeline stage U Unpack FP numbers E Exception test stage CS510 Computer Architectures
Latency and Initiation Intervals FP Instruction Latency Initiation Interval Add, Subtract 4 3 Multiply 8 4 Divide 36 35 Square root 112 111 Negate 2 1 Absolute value 2 1 FP compare 3 2 CS510 Computer Architectures
Stall clock cycle Operation Issue/stall 0 1 2 3 4 5 6 7 8 9 10 11 12 A A A A ADD issued at4 cycles after Multiply will stall2 cycles. ADD issued at5 cyclesafter Multiply will stall1 cycle. MIPS R4000 FP Pipe Stages Multiply Issue U M M M M N N+ A R Add Issue U S+A A+R R+S Add Issue U S+A A+R R+S Add Issue U S+A A+R R+S Add Stall U S+A A + R R +S Stall Add Stall U S + A A +R R +S Add Issue U S+A A+R R+S Add Issue U S+A A+R R+S CS510 Computer Architectures
Pipeline CPI 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 li gcc eqntott doduc su2cor tomcatv nasa7 ora espresso spice2g6 Integer programs Floating Point programs Base Load stalls Branch stalls FP result stalls FP structural stalls R4000 Performance Not an ideal pipelineCPI of 1: • Load stalls • Branch stalls: (2 cycles for taken br. + unfilled branch slots or cancelled branch delay slots) • FP result stalls: RAW data hazard (latency) • FP structural stalls: Not enough FP hardware (parallelism) CS510 Computer Architectures
Advanced PipelineAndInstruction Level Parallelism CS510 Computer Architectures
Block of Code . . . Branch Target . . . Branch instruction . . . . . . Any instruction . . . Branch instruction . . . Advanced Pipelining and Instruction Level Parallelism • gcc 17% control transfer • 5 instructions + 1 branch • Beyond single block to get more instruction level parallelism • Loop level parallelism is one opportunity, SW and HW CS510 Computer Architectures
Technique Reduces Advanced Pipelining and Instruction Level Parallelism Loop unrolling Control stalls Basic pipeline scheduling RAW stalls Dynamic scheduling with scoreboarding RAW stalls Dynamic scheduling with register renaming WAR and WAW stalls Dynamic branch prediction Control stalls Issuing multiple instructions per cycle Ideal CPI Compiler dependence analysis IdealCPI and data stalls Software pipelining and trace scheduling Ideal CPI and data stalls Speculation All data and control stalls Dynamic memory disambiguation RAW stalls involving memory CS510 Computer Architectures
Basic Pipeline Scheduling and Loop Unrolling FP unit latencies Instruction producing Instruction using Latency in result result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double* FP ALU op 1 Load double* Store double 0 * Same as integer Load since there is a 64-bit data path from/to memory. Fully pipelined or replicated --- no structural hazards, issue on every clock cycle for ( i =1; i <= 1000; i++) x[i] = x[i] + s; CS510 Computer Architectures
Instruction Instruction Latency inproducing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double2 Load double FP ALU op1 Load double Store double 0 Integer op Integer op 0 FP Loop Hazards Loop: LD F0,0(R1) ;R1 is the pointer to a vector ADDD F4,F0,F2 ;F2 contains a scalar value SD 0(R1),F4 ;store back result SUBI R1,R1,8 ;decrement pointer 8B (DW) BNEZ R1,Loop ;branch R1!=zero NOP ;delayed branch slot Where are the stalls? CS510 Computer Architectures
FP Loop Showing Stalls 1 Loop: LD F0,0(R1) ;F0=vector element 2 stall 3 ADDD F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 SD 0(R1),F4 ;store result 7 SUBI R1,R1,8 ;decrement pointer 8B (DW) 8stall 9 BNEZ R1,Loop ;branch R1!=zero 10 stall ;delayed branch slot Rewrite code to minimize stalls? CS510 Computer Architectures
For Load-ALU latency Consider moving SUBI into this Load Delay Slot. Reading R1 by LD is done before Writing R1 by SUBI. Yes we can. For ALU-ALU latency 8 When we do this, we need to change the immediate value 0 to 8 in SD Reducing Stalls 1 Loop: LD F0,0(R1) 2 stall 3 ADDD F4,F0,F2 4 stall 5 stall 6 SD 0(R1),F4 7 SUBI R1,R1,#8 8 stall 9 BNEZ R1,Loop 10 stall There is only one instruction left, i.e., BNEZ. When we do that, SD instruction fills the delayed branch slot. CS510 Computer Architectures
Instruction Instruction Latency inproducing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 Revised FP Loop to Minimize Stalls 1 Loop: LD F0,0(R1) 2 SUBI R1,R1,#8 3 ADDD F4,F0,F2 4 stall 5 BNEZ R1,Loop ;delayed branch 6 SD 8(R1),F4;altered when move past SUBI Unroll loop 4 times to make the code faster CS510 Computer Architectures
Unroll Loop 4 Times 1Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 ;drop SUBI & BNEZ 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 ;drop SUBI & BNEZ 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12;drop SUBI & BNEZ 10 LD F14,-24(R1) 11 ADDD F16,F14,F2 12 SD -24(R1),F16 13 SUBI R1,R1,#32;alter to 4*8 14 BNEZ R1,Loop 15NOP 15 + 4 x(1*+2+)+1^= 28 clock cycles, or 7 per iteration. 1*: LD to ADDD stall 1 cycle 2+: ADDD to SD stall 2 cycles 1^: Data dependency on R1 Rewrite loop to minimize the stalls CS510 Computer Architectures
Assumptions - OK to move SD past SUBI even though SUBI changes R1 SUBI IF RF EX MEM WB SD IF ID EX MEM WB BNEZ IF ID EX MEM WB - OK to move loads before stores(Get right data) - When is it safe for compiler to do such changes? 14 clock cycles, or 3.5 per iteration Unrolled Loop to Minimize Stalls 1 Loop: LD F0,0(R1) 2LD F6,-8(R1) 3LD F10,-16(R1) 4LD F14,-24(R1) 5ADDD F4,F0,F2 6ADDD F8,F6,F2 7ADDD F12,F10,F2 8ADDD F16,F14,F2 9SD 0(R1),F4 10SD -8(R1),F8 11SUBI R1,R1,#32 12SD16(R1),F12; 16-32= -16 13BNEZ R1,LOOP 14SD8(R1),F16; 8-32 = -24 CS510 Computer Architectures
Compiler Perspectives on Code Movement • Definitions: Compiler is concerned about dependencies in the program, whether this causes a HW hazard or not depends on a given pipeline • Data dependencies (RAW if a hazard for HW): Instruction j is data dependent on instruction i if either • Instruction i produces a result used by instruction j, or • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. • Easy to determine for registers (fixed names) • Hard for memory: • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)? CS510 Computer Architectures
Compiler Perspectives on Code Movement • Name Dependence: Two instructions use the same name(register or memory location) but they do not exchange data • Two kinds of Name Dependence Instruction i precedes instruction j • Antidependence (WAR if a hazard for HW) • Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first • Output dependence (WAW if a hazard for HW) • Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved. CS510 Computer Architectures
Compiler Perspectives on Code Movement • Again Hard for Memory Accesses • Does 100(R4) = 20(R6)? • From different loop iterations, does 20(R6) = 20(R6)? • Our example required compiler to know that if R1 doesn’t change then:0(R1) ¹ -8(R1) ¹ -16(R1) ¹ -24(R1) 1 There were no dependencies between some loads and stores, so they could be moved by each other. CS510 Computer Architectures
Compiler Perspectives on Code Movement • Control Dependence • Example if p1 {S1;}; if p2 {S2;} S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1. CS510 Computer Architectures
Compiler Perspectives on Code Movement • Two (obvious) constraints on control dependencies: • An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. • An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. • Control dependencies may be relaxed in some systems to get parallelism; get the same effect if preserve the order of exceptions and data flow CS510 Computer Architectures
When Safe to Unroll Loop? • Example: When a loop is unrolled, where are data dependencies? (A,B,C distinct, non-overlapping) for (i=1; i<=100; i=i+1) {A[i+1]= A[i] + C[i]; /* S1 */ B[i+1] = B[i] +A[i+1];}/* S2 */ 1. S2 uses the valueA[i+1],computed by S1 in the same iteration. 2. S1 uses a value computed byS1in an earlier iteration, since iteration i computesA[i+1]which is read in iteration i+1. The same is true ofS2 for B[i] and B[i+1]. This is aloop-carried dependencebetween iterations • Implies thatiterations are dependent, and can’t be executed in parallel • Not the case for our example; each iteration was distinct CS510 Computer Architectures
When Safe to Unroll Loop? • Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping)Following looks like there is a loop carried dependence for (i=1; i<=100; i=i+1) {A[i] = A[i] +B[i];/* S1 */B[i+1]= C[i] + D[i];} /* S2 */ However, we can rewrite it as follows for loop carried dependence-free A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1];} B[101] = C[100]+D[100]; CS510 Computer Architectures
Summary • Instruction Level Parallelism in SW or HW • Loop level parallelism is easiest to see • SW parallelism dependencies defined for a program, hazards if HW cannot resolve • SW dependencies/compiler sophistication determine if compiler can unroll loops CS510 Computer Architectures