430 likes | 597 Views
CPSC 614:Graduate Computer Architecture Hazards (Chap. 3). Prof. E. J. Kim. Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation. Instruction-Level Parallelism (ILP). Instructions are evaluated in parallel. Pipelining Two approaches to exploiting ILP
E N D
CPSC 614:Graduate Computer Architecture Hazards (Chap. 3) Prof. E. J. Kim
Chapter 3Instruction-Level Parallelism andIts Dynamic Exploitation
Instruction-Level Parallelism (ILP) • Instructions are evaluated in parallel. • Pipelining • Two approaches to exploiting ILP • Hardware-dependent (chapter 3) • Intel Pentium 3 & 4, Athlon, MIPS R10000/12000, Sun UltraSPARC III, PowerPC, … • Software-dependent (chapter 4) • IA-64, Intel Itanium, embedded processors
Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
Techniques to Decrease Pipeline CPI • Forwarding and Bypassing • Delayed Branches and Simple Branch Scheduling • Basic Dynamic Scheduling (Scoreboarding) • Dynamic Scheduling with Renaming • Dynamic Branch Prediction • Issuing Multiple Instructions per Cycle • Speculation • Dynamic Memory Disambiguation
Techniques to Decrease Pipeline CPI • Loop Unrolling • Basic Compiler Pipeline Scheduling • Compiler Dependence Analysis • Software Pipelining, Trace Scheduling • Compiler Speculation
Data Dependences • If two instructions are parallel, they can execute simultaneously. • If two instructions are dependent, they must be executed in order. • How to determine an instruction is dependent on anther instruction?
Data Dependences • Data dependences (True data dependences) • Name dependences • Control dependences • An instruction j is dependent on instruction i if either • i produces a result that may be used by j, or • j is data dependent on instruction k, and k is data dependent on i.
Floating-point data dependences Integer data dependence Loop: L.D F0, 0(R1) ;F0=array element ADD.D F4, F0, F2 ;add scalar in F2 S.D F4, 0(R1) ;store the result DADDUI R1, R1, #-8 ;decrement pointer 8 bytes BNE R1, R2, LOOP ;branch R1 != R2
A dependence • indicates the possibility of a hazard, • determines the order in which results must be calculated, and • sets an upper bound on how much parallelism can be possibly be exploited.
How to Overcome a Dependence • Maintaining the dependence but avoiding a hazard • Eliminating a dependence by transforming the code
Name Dependence • Occurs then two instructions use the same register or memory location (name), but there is no flow of data between the instructions associated with that name. • When i precedes j in program order: • Antidependence: Instruction j writes a register or memory location that instruction i reads. • Output dependence: Instructions i and j write the same register or memory location.
Register Renaming • Instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. (Especially for register operands)
Data Hazards • Goal: Preserve the program order only where it affects the outcome of the program to maximize ILP. • When instruction i occurs before instruction j in program order, • RAW (Read after Write): j tries to read a source before i writes it. • WAW (Write after Write): j tries to write an operand before it is written by i. • WAR (Write after Read): j tries to write a destination before it is read by i.
Control Dependence • Caused by branch instructions • An instruction that is control dependent on a branch cannot be moved before the branch. • An instruction that is not control dependent on a branch cannot be moved after the branch.
Control dependence is not the critical property that must be preserved. • We can violate the control dependences, if we can do so without affecting the correctness of the program. (e.g. branch prediction)
MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU 5 Steps of MIPS Datapath Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next PC MUX Next SEQ PC Next SEQ PC Zero? RS1 Reg File MUX Memory RS2 Memory MUX MUX Sign Extend WB Data Imm Datapath RD RD RD Control Path
MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU Inst 2 Inst 3 Inst 1 Inst 2 Inst 1 5 Steps of MIPS Datapath Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next PC MUX Next SEQ PC Next SEQ PC Zero? RS1 Reg File MUX Memory RS2 Memory MUX MUX Sign Extend Inst 1 WB Data Imm Datapath RD RD RD Control Path
Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem ALU ALU ALU ALU Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Review: Visualizing PipeliningFigure 3.3, Page 133 , CA:AQA 2e Time (clock cycles) I n s t r. O r d e r
Limits to pipelining • Hazards: circumstances that would cause incorrect execution if next instruction were launched • Structural hazards: Attempting to use the same hardware to do two different things at the same time • Data hazards: Instruction depends on result of prior instruction still in the pipeline • Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch DMem DMem DMem ALU ALU ALU ALU DMem Ifetch Structural Hazard Example: One Memory Port/Structural HazardFigure 3.6, Page 142 , CA:AQA 2e Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load DMem Instr 1 Instr 2 Instr 3 Instr 4
Resolving structural hazards • Defn: attempt to use same hardware for two different things at the same time • Solution 1: Wait • must detect the hazard • must have mechanism to stall • Solution 2: Throw more hardware at the problem
Reg Reg Reg Reg Reg Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem DMem DMem ALU ALU ALU ALU Bubble Bubble Bubble Bubble Bubble Detecting and Resolving Structural Hazard Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I n s t r. O r d e r Load DMem Instr 1 Instr 2 Stall Instr 3
MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU Eliminating Structural Hazards at Design Time Next PC MUX Next SEQ PC Next SEQ PC Zero? RS1 Reg File MUX Instr Cache RS2 Data Cache MUX MUX Sign Extend WB Data Imm Datapath RD RD RD Control Path
Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem EX WB MEM IF ID/RF I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Data HazardsFigure 3.9, page 147 , CA:AQA 2e Time (clock cycles)
Three Generic Data Hazards • Read After Write (RAW)InstrJ tries to read operand before InstrI writes it • Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add r1,r2,r3 J: sub r4,r1,r3
I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Three Generic Data Hazards • Write After Read (WAR)InstrJ writes operand before InstrI reads it • Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”. • Can’t happen in MIPS 5 stage pipeline because: • All instructions take 5 stages, and • Reads are always in stage 2, and • Writes are always in stage 5
I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Three Generic Data Hazards • Write After Write (WAW)InstrJ writes operand before InstrI writes it. • Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”. • Can’t happen in MIPS 5 stage pipeline because: • All instructions take 5 stages, and • Writes are always in stage 5 • Will see WAR and WAW in later more complicated pipes
Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Forwarding to Avoid Data HazardFigure 3.10, Page 149 , CA:AQA 2e Time (clock cycles)
ALU HW Change for ForwardingFigure 3.20, Page 161, CA:AQA 2e ID/EX EX/MEM MEM/WR NextPC mux Registers Data Memory mux mux Immediate
Reg Reg Reg Reg Reg Reg Reg Reg ALU Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem ALU ALU ALU lwr1, 0(r2) I n s t r. O r d e r sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 Data Hazard Even with ForwardingFigure 3.12, Page 153 , CA:AQA 2e Time (clock cycles)
Resolving this load hazard • Adding hardware? ... not • Detection? • Compilation techniques? • What is the cost of load delays?
Reg Reg Reg Ifetch Ifetch Ifetch Ifetch DMem ALU Bubble ALU ALU Reg Reg DMem DMem Bubble Reg Reg Resolving the Load Data Hazard Time (clock cycles) I n s t r. O r d e r lwr1, 0(r2) sub r4,r1,r6 and r6,r1,r7 Bubble ALU DMem or r8,r1,r9
Software Scheduling to Avoid Load Hazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd
Instruction Set Connection • What is exposed about this organizational hazard in the instruction set? • k cycle delay? • bad, CPI is not part of ISA • k instruction slot delay • load should not be followed by use of the value in the next k instructions • Nothing, but code can reduce run-time delays • MIPS did the transformation in the assembler
User program plus Data this can change! Main Memory ADD SUB AND . . . one of these is mapped into one of these DATA execution unit CPU control memory Historical Perspective: Microprogramming Supported complex instructions a sequence of simple micro-inst (RTs) Pipelined micro-instruction processing, but very limited view. Could not reorganize macroinstructions to enable pipelining
Reg Reg Reg Reg Reg Reg Reg Reg Reg Reg ALU ALU ALU ALU ALU Ifetch Ifetch Ifetch Ifetch Ifetch DMem DMem DMem DMem DMem 10: beq r1,r3,36 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 36: xor r10,r1,r11 Control Hazard on Branches=> Three Stage Stall
Example: Branch Stall Impact • If 30% branch, Stall 3 cycles significant • Two part solution: • Determine branch taken or not sooner, AND • Compute taken branch address earlier • MIPS branch tests if register = 0 or 0 • MIPS Solution: • Move Zero test to ID/RF stage • Adder to calculate new PC in ID/RF stage • 1 clock cycle penalty for branch versus 3
MEM/WB ID/EX EX/MEM IF/ID Adder 4 Address ALU Pipelined MIPS DatapathFigure 3.22, page 163, CA:AQA 2/e Instruction Fetch Execute Addr. Calc Memory Access Instr. Decode Reg. Fetch Write Back Next SEQ PC Next PC MUX Adder Zero? RS1 Reg File Memory RS2 Data Memory MUX MUX Sign Extend WB Data Imm RD RD RD • Data stationary control • local decode for each instruction phase / pipeline stage
Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken • Execute successor instructions in sequence • “Squash” instructions in pipeline if branch actually taken • Advantage of late pipeline state update • 47% MIPS branches not taken on average • PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken • 53% MIPS branches taken on average • But haven’t calculated branch target address in MIPS • MIPS still incurs 1 cycle branch penalty • Other machines: branch target known before outcome
Four Branch Hazard Alternatives #4: Delayed Branch • Define branch to take place AFTER a following instruction branch instructionsequential successor1 sequential successor2 ........ sequential successorn ........ branch target if taken • 1 slot delay allows proper decision and branch target address in 5 stage pipeline • MIPS uses this Branch delay of length n
Delayed Branch • Where to get instructions to fill branch delay slot? • Before branch instruction • From the target address: only valuable when branch taken • From fall through: only valuable when branch not taken • Canceling branches allow more slots to be filled • Compiler effectiveness for single branch delay slot: • Fills about 60% of branch delay slots • About 80% of instructions executed in branch delay slots useful in computation • About 50% (60% x 80%) of slots usefully filled • Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Example: Evaluating Branch Alternatives Assume: Conditional & Unconditional = 14%, 65% change PC Scheduling Branch CPI speedup v. scheme penalty stall Stall pipeline 3 1.42 1.0 Predict taken 1 1.14 1.26 Predict not taken 1 1.09 1.29 Delayed branch 0.5 1.07 1.31