Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab.

Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology http://csg.csail.mit.edu/SNU

MemWrite WBSrc 0x4 clk we clk rs1 rs2 ALU Control we rd1 addr PC ws addr inst wd rd2 ALU GPRs z Inst. Memory rdata clk Data Memory Imm Ext wdata zero? OpCode RegDst ExtSel OpSel BSrc Harvard-Style Datapath for MIPS PCSrc RegWrite br rind jabs pc+4 Add Add 31

What problem arises if instructions and data reside in the same memory? At least the instruction fetch and a Load (or Store) cannot be executed in the same cycle Structural hazard

PCSrc RegWrite WBSrc MemWrite PCen 0x4 Add Add clk we clk clk rs1 rs2 rd1 PC 31 we ws addr IR rd2 ALU wd GPRs clk rdata z Data Memory Imm Ext wdata ALU Control IRen OpCode RegDst ExtSel OpSel BSrc zero? AddrSrc Princeton MicroarchitectureDatapath & Control off off off on Fetch phase = PC

AddrSrc=PC IRen=on PCen=off Wen=off AddrSrc=ALU IRen=off PCen=on Wen=on Two-State Controller:Princeton Architecture fetch phase execute phase A flipflop can be used to remember the phase

Hardwired Controller:Princeton Architecture . . . ExtSel, BSrc, OpSel, WBSrc, RegDest, PCsrc1, PCsrc2 old combinational logic (Harvard) op code IR zero? MemWrite RegWrite Wen new combinational logic S PCen IRen AddrSrc 1-bit Toggle FF I-fetch / Execute

Two-Cycle SMIPS Register File stage PC Execute Decode ir +4 Data Memory Inst Memory http://csg.csail.mit.edu/SNU 7

Two-Cycle SMIPS modulemkProc(Proc); Reg#(Addr) pc <- mkRegU; RFilerf<- mkRFile; Memory mem <- mkMemory; PipeReg#(FBundle) ir<- mkPipeReg; Reg#(Bit#(1)) stage <- mkReg(0); letpcir = ir.first(); let pc = pcir.pc; let inst = pcir.inst; ruledoProc; if(stage==0 && ir.notFull) begin //fetch letinstResp <- mem(MemReq{op:Ld, addr:pc, data:?}); ir.enq(FBundle{pc:pc, inst:instResp}); stage <= 1; end http://csg.csail.mit.edu/SNU

Two-Cycle SMIPS cont-1 if(stage==1 && ir.notEmpty) begin //decode letdecInst = decode(inst); Data rVal1 = rf.rd1(decInst.rSrc1); Data rVal2 = rf.rd2(decInst.rSrc2); //execute letexecInst = exec(decInst, pc, rVal1, rVal2); if(execInst.instType==Ld || execInst.instType==St) execInst.data <- mem(MemReq{op:execInst.instType, addr:execInst.addr, data:execInst.data}); pc <= execInst.brTaken ? execInst.addr : pc+4; //writeback http://csg.csail.mit.edu/SNU

Two-Cycle SMIPS cont-2 //writeback if(execInst.instType==Alu|| execInst.instType==Ld) rf.wr(execInst.rDst, execInst.data); ir.deq; stage <= 0; end endrule endmodule; http://csg.csail.mit.edu/SNU

Time = Instructions Cycles Time Program Program * Instruction * Cycle Processor Performance • Instructions per program depends on source code, compiler technology and ISA • Cycles per instructions (CPI) depends upon the ISA and the microarchitecture • Time per cycle depends upon the microarchitecture and the base technology

Single-Cycle Hardwired Control: Harvard architecture We will assume clock period is sufficiently long for all of the following steps to be “completed”: 1. instruction fetch 2. decode and register fetch 3. ALU operation 4. data fetch if required 5. register write-back setup time tC > tIFetch + tRFetch+ tALU+ tDMem+ tRWB At the rising edge of the following clock, the PC, the register file and the memory are updated

Clock Period tC-Princeton > max {tM , tRF+ tALU+ tM + tWB} tC-Princeton > tRF+ tALU+ tM + tWB while in the hardwired Harvard architecture tC-Harvard > tM + tRF + tALU+ tM+ tWB which will execute instructions faster?

Clock Rate vs CPI Suppose tM >> tRF+ tALU+ tWB tC-Princeton = 0.5 * tC-Harvard CPIPrinceton= 2 CPIHarvard= 1 No difference in performance! Is it possible to design a controller for the Princeton architecture with CPI < 2 ? CPI = Clock cycles Per Instruction

0x4 Add we rs1 rs2 we rd1 we addr PC ws IR addr rdata ALU wd rd2 GPRs rdata Memory Memory wdata Imm Ext wdata fetch phase execute phase Princeton microarchitecture(redrawn) The same (mux not shown) Only one of the phases is active in any cycle  a lot of datapath is not in use at any given time

0x4 Add we rs1 rs2 we rd1 we addr PC ws IR addr rdata ALU wd rd2 GPRs rdata Memory Memory wdata Imm Ext wdata fetch phase execute phase Princeton MicroarchitectureOverlapped execution Can we overlap instruction fetch and execute? Yes, unless IR contains a Load or Store Which action should be prioritized? Execute Stall it How? What do we do with Fetch?

stall? we rs1 rs2 rd1 we ws addr ALU wd rd2 GPRs rdata Memory Imm Ext wdata Stalling the instruction fetch Princeton Microarchitecture 0x4 Add nop we addr PC IR rdata Memory wdata fetch phase execute phase When stall condition is indicated don’t fetch a new instruction and don’t change the PC insert a nop in the IR set the Memory Address mux to ALU (not shown) What if IR contains a jump or branch instruction?

Jump? we rs1 rs2 rd1 we ws addr ALU wd rd2 GPRs rdata Memory Imm Ext wdata Need to stall on branchesPrinceton Microarchitecture 0x4 Add nop we addr PC IR rdata Memory wdata When IR contains a jump or branch-taken no “structural conflict” for the memory but we do not have the correct PC value in the PC memory cannot be used – Address Mux setting is irrelevant insert a nop in the IR insert the nextPC (branch-target) address in the PC

0x4 Pipelined Princeton Microarchitecture PCSrc PCSrc2 RegWrite WBSrc MemWrite PCen Add Add clk we nop clk clk rs1 rs2 rd1 PC 31 we ws addr IR rd2 ALU wd GPRs clk rdata z Data Memory Imm Ext wdata ALU Control IRSrc stall? OpCode RegDst ExtSel OpSel BSrc zero? MAddrSrc stall

Pipelined Princeton Architecture Clock:tC-Princeton > tRF+ tALU+ tM CPI: (1- f) + 2f cycles per instruction where f is the fraction of instructions that cause a stall What is a likely value of f?

Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab.