Computer Architecture: A Constructive Approach Multi-cycle SMIPS Implementations Joel Emer

Computer Architecture: A Constructive Approach Multi-cycle SMIPS Implementations Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology http://csg.csail.mit.edu/6.S078

MemWrite WBSrc 0x4 clk we clk rs1 rs2 ALU Control we rd1 addr PC ws addr inst wd rd2 ALU GPRs z Inst. Memory rdata clk Data Memory Imm Ext wdata zero? OpCode RegDst ExtSel OpSel BSrc old way Harvard-Style Datapath for MIPS PCSrc RegWrite br rind jabs pc+4 Add Add 31 http://csg.csail.mit.edu/6.S078

no yes ALU pc+4 Imm Op no yes ALU rt pc+4 sExt16 Imm + yes no * * pc+4 no no * * * * no no * * * * * no * * * * no no * * sExt16 * 0? no no * * yes PC rind R31 * * * no old way Hardwired Control Table * Reg Func no yes ALU rd pc+4 sExt16 Imm Op rt uExt16 sExt16 Imm + no yes Mem rt pc+4 sExt16 * 0? br pc+4 jabs yes PC R31 jabs rind BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rt / rd / R31 PCSrc = pc+4 / br / rind / jabs http://csg.csail.mit.edu/6.S078

Single-Cycle SMIPS new way 2 read & 1 write ports Register File PC Execute Decode +4 separate Instruction & Data memories Data Memory Inst Memory Datapath and control were derived automatically from a high-level rule-based description http://csg.csail.mit.edu/6.S078

Single-Cycle SMIPS code structure module mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; Memory mem <- mkTwoPortedMemory; let iMem = mem.iport ; let dMem = mem.dport; rule doProc; let inst = iMem(MemReq{op:Ld, addr:pc, data:?}); let dInst = decode(inst); Data rVal1 = rf.rd1(dInst.rSrc1); Data rVal2 = rf.rd2(dInst.rSrc2); let eInst = exec(dInst, rVal1, rVal2, pc); update rf, pc and dMem http://csg.csail.mit.edu/6.S078

Decoding Instructions: input-output types decode 31:26, 5:0 iType IType 31:26 aluFunc 5:0 AluFunc instruction 31:26 brComp Bit#(32) BrType Type DecodedInst 20:16 rDst 15:11 Rindex Mux control logic not shown 25:21 rSrc1 Rindex 20:16 rSrc2 Rindex 15:0 imm Bit#(32) ext 25:0 Bool immValid http://csg.csail.mit.edu/6.S078

Reading Registers Read registers RF RVal1 RSrc1 RSrc2 RVal2 Pure combinational logic http://csg.csail.mit.edu/6.S078

Executing Instructions execute iType rDst dInst either for rf write or St rVal2 data ALU either for memory reference or branch target rVal1 ALUBr addr Pure combinational logic Branch Address brTaken pc http://csg.csail.mit.edu/6.S078

Branch Address Calculation function Addr brAddrCalc(Address pc, Data val, IType iType, Data imm); let targetAddr = case (iType) J, Jal : {pc[31:28], imm[27:0]}; Jr, Jalr : val; default : pc + imm; endcase; return targetAddr; endfunction http://csg.csail.mit.edu/6.S078

Some Useful Functions function Bool memType (IType i) return (i==Ld || i == St); endfunction function Bool regWriteType (IType i) return (i==Alu || i==Ld || i==Jal || i==Jalr); endfunction function Bool controlType (IType i) return (i==J || i==Jr || i==Jal || i==Jalr || i==Br); endfunction http://csg.csail.mit.edu/6.S078

Execute Function function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc); ExecInst einst = ?; Data aluVal2 = (dInst.immValid)? dInst.imm : rVal2 let aluRes = alu(rVal1, aluVal2, dInst.aluFunc); let brAddr = brAddrCal(pc, rVal1, dInst.iType, dInst.imm); einst.itype = dInst.iType; einst.addr = (memType(dInst.iType)? aluRes : brAddr; einst.data = dInst.iType==St ? rVal2 : aluRes; einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp); einst.rDst = dInst.rDst; return einst; endfunction http://csg.csail.mit.edu/6.S078

Single-Cycle SMIPS atomic state updates if(memType(eInst.iType)) eInst.data <- dMem(MemReq{ op: eInst.iType==Ld ? Ld : St, addr: eInst.addr, data: eInst.data}); if(regWriteType(eInst.iType)) rf.wr(eInst.rDst, eInst.data); pc <= eInst.brTaken ? eInst.addr : pc + 4; endrule endmodule http://csg.csail.mit.edu/6.S078

Single-Cycle SMIPS: Clock Speed Register File PC Execute Decode +4 tClock > tM + tDEC + tRF + tALU+ tM+ tWB Data Memory Inst Memory We can improve the clock speed if we execute each instruction in two clock cycles tClock > max {tM , (tDEC +tRF + tALU+ tM+ tWB)} http://csg.csail.mit.edu/6.S078

Two-Cycle SMIPS Register File stage PC Execute Decode ir +4 Data Memory Inst Memory Introduce register “ir” to hold a fetched instruction and register “stage” to remember which stage (fetch/execute) we are in http://csg.csail.mit.edu/6.S078

ir: The instruction register • You may recall from our earlier discussion of pipelining that when we take multiple cycles to perform some operation (e.g., IFFT), there is a possibility that intermediate registers do not contain any meaningful data in some cycles • It is straight forward to convert ir into a pipeline register • We can associate (Valid/Invalid) bit with ir • Equivalently, we can think of ir as a single-element FIFO http://csg.csail.mit.edu/6.S078

Additional Types typedef struct { Addr pc; Bit#(32) inst; } TypeFetch2Decode deriving (Bits, Eq); typedef enum {Fetch, Execute} TypeStage deriving (Bits, Eq); March 5, 2012 http://csg.csail.mit.edu/6.S078 L8-16

Two-Cycle SMIPS module mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; Memory mem <- mkTwoPortedMemory; let iMem = mem.iport; let dMem = mem.dport; Reg#(TypeFetch2Decode) ir <- mkRegU; Reg#(TypeStage) stage <- mkReg(Fetch); rule doFetch (state==Fetch); let inst = iMem(MemReq{op:Ld, addr:pc, data:?}); ir <= TypeFetch2Decode{pc: pc, inst: inst}; stage <= Execute; endrule http://csg.csail.mit.edu/6.S078

Two-Cycle SMIPS rule doExecute(stage==Execute); let irpc = ir.pc; let inst = ir.inst; let dInst = decode(inst); Data rVal1 = rf.rd1(dInst.rSrc1); Data rVal2 = rf.rd2(dInst.rSrc2); let eInst = exec(dInst, rVal1, rVal2, irpc); if(memType(eInst.iType)) eInst.data <- dMem(MemReq{ op: eInst.iType==Ld ? Ld : St, addr: eInst.addr, data: eInst.data}); if(regWriteType(eInst.iType)) rf.wr(eInst.rDst, eInst.data); pc <= eInst.brTaken ? eInst.addr : pc + 4; stage <= Fetch; endrule endmodule no change from single-cycle http://csg.csail.mit.edu/6.S078

Princeton versus Harvard Architecture • Harvard architecture uses different memories for instructions and data • needed for a single-cycle implementation • Princeton architecture uses the same memory for instruction and data and thus, requires at least two cycles to execute Load/Store instructions The two-cycle implementations of Princeton and Harvard architectures are almost the same http://csg.csail.mit.edu/6.S078

SMIPS Princeton Architecture Register File stage PC Execute Decode ir +4 Since both the Fetch and Execute stages want to use the memory, there is a structural hazard in accessing memory Memory http://csg.csail.mit.edu/6.S078

Two-Cycle SMIPS Princeton module mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; Memory mem <- mkOnePortedMemory; let uMem = mem.port; Reg#(TypeFetch2Decode) ir <- mkRegU; Reg#(TypeStage) stage <- mkReg(Fetch); rule doFetch (stage==Fetch); let inst <- uMem(MemReq{op:Ld, addr:pc, data:?}); ir <= TypeFetch2Decode{pc: pc, inst: inst}; stage <= Execute; endrule http://csg.csail.mit.edu/6.S078

Two-Cycle SMIPS Princeton rule doExecute(stage==Execute); let irpc = ir.pc; let inst = ir.inst; let dInst = decode(inst); Data rVal1 = rf.rd1(dInst.rSrc1); Data rVal2 = rf.rd2(dInst.rSrc2); let eInst = exec(dInst, rVal1, rVal2, irpc); if(memType(eInst.iType)) eInst.data <- uMem(MemReq{ op: eInst.iType==Ld ? Ld : St, addr: eInst.addr, data: eInst.data}); if(regWriteType(eInst.iType)) rf.wr(eInst.rDst, eInst.data); pc <= eInst.brTaken ? eInst.addr : pc + 4; stage <= Fetch; endrule endmodule http://csg.csail.mit.edu/6.S078

Two-Cycle SMIPS: Analysis Register File Fetch Execute stage PC Execute Decode ir +4 Data Memory Inst Memory In any given clock cycle, lots of unused hardware! next lecture: Pipelining to increase the throughput http://csg.csail.mit.edu/6.S078

Computer Architecture: A Constructive Approach Multi-cycle SMIPS Implementations Joel Emer