CENG 450 Computer Systems and Architecture Lecture 6. Amirali Baniasadi amirali@ece.uvic.ca
Overview of Today’s Lecture • MIPS • Pipelining
CPU Pipelining Example:
• Theoretically:
  • Speedup should equal the number of stages (n tasks, k stages, p = latency of one unpipelined task)
  • $$\text{Speedup} = \frac{n \cdot p}{\frac{p}{k}(n-1) + p} \approx k \quad (\text{for large } n)$$
• Practically:
  • Stages are imperfectly balanced
  • Pipelining adds overhead
  • Speedup is less than the number of stages
• If we have 3 consecutive instructions:
  • Non-pipelined needs 8 x 3 = 24 ns
  • Pipelined needs 14 ns => Speedup = 24 / 14 = 1.7
• If we have 1003 consecutive instructions (1000 more instructions added to the previous example):
  • Non-pipelined total time = 1000 x 8 + 24 = 8024 ns
  • Pipelined total time = 1000 x 2 + 14 = 2014 ns => Speedup = 8024 / 2014 ~ 3.98, close to the ideal 8 ns / 2 ns = 4
  • => The throughput gain grows with the number of instructions
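The arithmetic on this slide can be checked with a short sketch. The function names are hypothetical, and the 5-stage depth with a 2 ns clock is an assumption inferred from the slide's "14 ns for 3 instructions" figure (5 x 2 ns to fill the pipeline, then 2 ns per additional instruction).

```python
def ideal_speedup(n, k, p):
    """Speedup of an ideal k-stage pipeline for n tasks of latency p each."""
    unpipelined = n * p
    pipelined = p + (n - 1) * (p / k)  # first task takes p, the rest finish p/k apart
    return unpipelined / pipelined


def example_speedup(n, unpipelined_ns=8, stage_ns=2, stages=5):
    """Speedup using the slide's numbers (stage count and clock are assumed)."""
    unpipelined = n * unpipelined_ns
    pipelined = stages * stage_ns + (n - 1) * stage_ns
    return unpipelined / pipelined


if __name__ == "__main__":
    for n in (3, 1003):
        print(n, "instructions: speedup =", round(example_speedup(n), 2))
    print("ideal, n = 10**6, k = 5:", round(ideal_speedup(10**6, 5, 10.0), 3))
```

The practical speedup approaches 8 ns / 2 ns = 4 rather than the stage count 5, because the 2 ns clock is longer than a perfectly balanced 8 / 5 = 1.6 ns stage would be.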
MIPS: Software conventions for Registers
Reg     Name     Use
0       zero     constant 0
1       at       reserved for assembler
2-3     v0-v1    expression evaluation & function results
4-7     a0-a3    arguments
8-15    t0-t7    temporary: caller saves (callee can clobber)
16-23   s0-s7    callee saves (callee must preserve; caller can rely on them)
24-25   t8-t9    temporary (cont'd)
26-27   k0-k1    reserved for OS kernel
28      gp       pointer to global area
29      sp       stack pointer
30      fp       frame pointer
31      ra       return address (HW)
Plus a 3-deep stack of mode bits.
Example in C: swap
    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }
• Assume swap is called as a procedure
• Assume temp is in register $15, the arguments v and k are in $a1 and $a2, and $16 is a scratch register
• Write the MIPS code
swap: MIPS
    swap: addiu $sp, $sp, -4    # create space on stack
          sw    $16, 0($sp)     # callee-saved register put onto stack
          sll   $t2, $a2, 2     # multiply k by 4
          addu  $t2, $a1, $t2   # address of v[k]
          lw    $15, 0($t2)     # load v[k]
          lw    $16, 4($t2)     # load v[k+1]
          sw    $16, 0($t2)     # store v[k+1] into v[k]
          sw    $15, 4($t2)     # store old value of v[k] into v[k+1]
          lw    $16, 0($sp)     # callee-saved register restored from stack
          addiu $sp, $sp, 4     # restore top of stack
          jr    $31             # return to place that called swap
5 Steps of MIPS Datapath [Figure: five-stage datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access and Write Back, separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers; datapath and control path shown.]
5 Steps of MIPS Datapath [Figure: the same datapath with Inst 1, Inst 2 and Inst 3 occupying successive stages as they advance through the pipeline.]
Review: Visualizing Pipelining [Figure: four instructions in program order, each passing through Ifetch, Reg, ALU, DMem, Reg, offset by one stage per clock cycle over cycles 1-7.]
Limits to pipelining • Hazards: circumstances that would cause incorrect execution if next instruction were launched • Structural hazards: Attempting to use the same hardware to do two different things at the same time • Data hazards: Instruction depends on result of prior instruction still in the pipeline • Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Example: One Memory Port/Structural Hazard [Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order over cycles 1-7; with a single memory port, the Load's DMem access in cycle 4 collides with the Ifetch of Instr 3.]
Resolving structural hazards
• Definition: an attempt to use the same hardware for two different things at the same time
• Solution 1: Wait
  • must detect the hazard
  • must have a mechanism to stall
• Solution 2: Throw more hardware at the problem
Detecting and Resolving Structural Hazard [Figure: Load, Instr 1, Instr 2, Stall, Instr 3 over cycles 1-7; a bubble is inserted in cycle 4, so Instr 3 is fetched one cycle later.]
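The stall shown in the diagram can be reproduced with a small model. This is a sketch under stated assumptions, not the hardware's actual control logic: a 5-stage pipeline where a load or store uses the single memory port three cycles after it is fetched, and instruction fetch simply waits whenever the port is busy. The function name and the "load"/"store"/"alu" labels are hypothetical.

```python
def fetch_cycles(program):
    """Return the cycle in which each instruction is fetched.

    program: list of "load", "store" or "alu" labels in program order.
    One memory port is shared by instruction fetch and data access.
    """
    port_busy = set()  # cycles in which a load/store occupies the memory port
    cycle = 1
    fetched = []
    for op in program:
        while cycle in port_busy:
            cycle += 1  # bubble: fetch must wait for the data access
        fetched.append(cycle)
        if op in ("load", "store"):
            port_busy.add(cycle + 3)  # MEM stage is three cycles after IF
        cycle += 1
    return fetched


if __name__ == "__main__":
    # Load, Instr 1, Instr 2, Instr 3 as on the slide
    print(fetch_cycles(["load", "alu", "alu", "alu"]))
```

With separate instruction and data memories (Solution 2), `port_busy` would never block a fetch and every instruction would be fetched one cycle after its predecessor.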
Eliminating Structural Hazards at Design Time [Figure: the five-stage datapath with a separate Instr Cache and Data Cache, so instruction fetch and data access no longer share one memory port.]
Role of Instruction Set Design in Structural Hazard Resolution
• Simple to determine the sequence of resources used by an instruction
  • opcode tells it all
• Uniformity in the resource usage
• Compare MIPS to IA32?
• MIPS approach => all instructions flow through the same 5-stage pipeline
Data Hazards [Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) over time in clock cycles for the sequence below; every instruction after the add reads r1.]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
• Caused by a “Data Dependence”. This hazard results from an actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes
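The three definitions reduce to comparing the destination and source registers of two instructions in program order. A minimal sketch follows; the tuple layout `(dest, src1, src2)` and the function name are illustrative assumptions. It reports the dependences that could become hazards; whether they actually do depends on the pipeline, as the slides note for WAR and WAW.

```python
def classify(first, second):
    """Return the dependences between two instructions in program order.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    dest1, *srcs1 = first
    dest2, *srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("RAW")  # second reads what first writes
    if dest2 in srcs1:
        kinds.add("WAR")  # second writes what first reads
    if dest1 == dest2:
        kinds.add("WAW")  # both write the same register
    return kinds


if __name__ == "__main__":
    add = ("r1", "r2", "r3")
    print(classify(add, ("r4", "r1", "r3")))   # add then sub r4,r1,r3
    print(classify(("r4", "r1", "r3"), add))   # sub r4,r1,r3 then add
    print(classify(("r1", "r4", "r3"), add))   # sub r1,r4,r3 then add
```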
Forwarding to Avoid Data Hazard [Figure: the same pipeline diagram, with the ALU result of the add forwarded to the later instructions that read r1.]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
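HW Change for Forwarding [Figure: datapath fragment showing the Registers, the ID/EX, EX/MEM and MEM/WR pipeline registers, the ALU, Data Memory, NextPC and Immediate inputs, and the muxes added in front of the ALU for forwarding.]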
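One way to read the forwarding hardware is as a selection rule: where an instruction takes its operand from depends on how far behind the producing ALU instruction it is. The sketch below is a simplification under two assumptions that the slides do not state explicitly: the producer is an ALU instruction (not a load), and the register file is written in the first half of a cycle and read in the second half. The function name is hypothetical.

```python
def operand_source(distance):
    """Where a consumer gets a value produced `distance` instructions earlier.

    distance = 1 means the consumer immediately follows the producer.
    """
    if distance < 1:
        raise ValueError("consumer must come after the producer")
    if distance == 1:
        return "EX/MEM"       # result was computed in the previous cycle
    if distance == 2:
        return "MEM/WB"       # result is one pipeline register further on
    return "register file"    # already written back (split-cycle assumption)


if __name__ == "__main__":
    # Consumers of r1 after "add r1,r2,r3", as in the forwarding example
    consumers = ["sub r4,r1,r3", "and r6,r1,r7", "or r8,r1,r9", "xor r10,r1,r11"]
    for distance, instr in enumerate(consumers, start=1):
        print(instr, "<-", operand_source(distance))
```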
Data Hazard Even with Forwarding [Figure: pipeline diagram for the sequence below; the loaded value leaves DMem only at the end of the load's memory stage, too late for the ALU stage of the sub that follows immediately.]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
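Resolving this load hazard
• Adding hardware? ... no: forwarding alone cannot help, because the loaded value does not exist until the end of the memory stage
• Detection?
• Compilation techniques?
• What is the cost of load delays?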
Resolving the Load Data Hazard [Figure: the same sequence with a one-cycle bubble inserted after the lw; sub, and, and or are each delayed by one cycle.]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
How is this different from the instruction issue stall?
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
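The benefit of the reordering can be counted with a simple rule: with forwarding, a stall occurs only when an instruction uses a register loaded by the instruction immediately before it. The sketch below applies that rule to both sequences. The tuple encoding `(opcode, dest, sources)` and the function name are assumptions made for illustration, and the one-cycle load delay is the one from the preceding slides.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded value immediately."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

if __name__ == "__main__":
    print("slow code stalls:", load_use_stalls(SLOW))
    print("fast code stalls:", load_use_stalls(FAST))
```

Both versions execute the same eight instructions; the fast version simply places an independent instruction after each load whose result is needed next.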
Instruction Set Connection • What is exposed about this organizational hazard in the instruction set? • k cycle delay? • bad, CPI is not part of ISA • k instruction slot delay • load should not be followed by use of the value in the next k instructions • Nothing, but code can reduce run-time delays • MIPS did the transformation in the assembler
Eliminating Control Hazards at Design Time [Figure: the five-stage datapath with Instr Cache and Data Cache, showing the Next PC mux, the Next SEQ PC path and the Zero? test that decide the branch outcome.]
Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall per branch is significant (with a base CPI of 1, CPI becomes 1 + 0.3 x 3 = 1.9)
• Two part solution:
  • Determine branch taken or not sooner, AND
  • Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  • Move Zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch versus 3
Pipelined MIPS Datapath [Figure: the five-stage datapath with EXTRA HARDWARE in the Instr. Decode / Reg. Fetch stage: a second Adder for the branch target and the Zero? test moved forward.]
• Data stationary control
  • local decode for each instruction phase / pipeline stage
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
• Execute successor instructions in sequence
• “Squash” instructions in pipeline if branch actually taken
• Advantage of late pipeline state update
• 47% MIPS branches not taken on average
• PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
• 53% MIPS branches taken on average
• But haven’t calculated branch target address in MIPS
  • MIPS still incurs 1 cycle branch penalty
  • Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
• Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n      (successors 1 to n: branch delay of length n)
    ........
    branch target if taken
• 1 slot delay allows proper decision and branch target address in 5 stage pipeline
• MIPS uses this
Delayed Branch
• Where to get instructions to fill branch delay slot?
  • Before branch instruction
  • From the target address: only valuable when branch taken
  • From fall through: only valuable when branch not taken
  • Canceling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch delay slots are useful in computation
  • About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), the branch delay grows beyond what a single delay slot can hide
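Recall: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$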
Example: Evaluating Branch Alternatives
Assume: conditional & unconditional branches = 14% of instructions, 65% of them change the PC
Scheduling scheme     Branch penalty   CPI    Speedup v. stall
Stall pipeline        3                1.42   1.0
Predict taken         1                1.14   1.26
Predict not taken     1                1.09   1.29
Delayed branch        0.5              1.07   1.31
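The CPI column can be reproduced from the stated assumptions as CPI = 1 + branch frequency x fraction of branches that pay the penalty x penalty. The sketch below does this; the function and variable names are hypothetical. For predict-not-taken, only the 65% of branches that change the PC pay the penalty. The 0.5-cycle penalty for the delayed branch is taken from the table and is consistent with roughly half of the delay slots being usefully filled.

```python
def branch_cpi(branch_freq, penalty, fraction_paying=1.0, base_cpi=1.0):
    """CPI including branch stalls, for a pipeline with the given base CPI."""
    return base_cpi + branch_freq * fraction_paying * penalty


BRANCH_FREQ = 0.14  # conditional + unconditional branches
TAKEN = 0.65        # fraction of branches that change the PC

SCHEMES = {
    "Stall pipeline": branch_cpi(BRANCH_FREQ, 3),
    "Predict taken": branch_cpi(BRANCH_FREQ, 1),
    "Predict not taken": branch_cpi(BRANCH_FREQ, 1, fraction_paying=TAKEN),
    "Delayed branch": branch_cpi(BRANCH_FREQ, 0.5),
}

if __name__ == "__main__":
    stall_cpi = SCHEMES["Stall pipeline"]
    for name, cpi in SCHEMES.items():
        print(f"{name:18s} CPI = {cpi:.2f}  speedup v. stall = {stall_cpi / cpi:.2f}")
```

Ratios computed directly from these unrounded CPIs come out slightly different from the last column of the table (by a few hundredths), most likely because the table's figures were derived from rounded intermediate values.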
Summary
• Hazards
  • Data Hazards & Control Hazards
• How to remove hazards?
  • Data Hazards: forwarding; change program order
  • Control Hazards: speculate on branch outcome; delay slots; use extra hardware