Revisiting the Processor Arvind Computer Science & Artificial Intelligence Lab

Revisiting the Processor Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology

An unpipelined multicycle architecture Only one instruction at a time RFile pc PCGen Exec WB dstReg DCache ICache

The Multi-stage Design • 3 Rules (one for each stage) • 1 instruction active at a time • 2-3 stages / instruction • Little inter-stage buffering • Multi-cycle memory interface

Rules for multistage/multicycle design rule pcgen(stage == PC); imemReq.enq(Rd{a:pc}); stage <= EX; endrule rule exec(stage == EX); let inst = imemResp.first(); imemResp.deq(); match {.nextpc, .rf_cmd, .mem_cmd} = doInst(pc,inst,rf); pc <= nextpc; case (rf_cmd) matches tagged RF {.dst,.val}: rf.upd(dst,val); endcase case (mem_cmd) matches tagged Ld {.dst,.addr}: begin dmemReq.enq(Rd{a:addr}}; dstReg <= dst; end tagged St {.addr,.val}: dmemReq.enq(Wr{a:addr,v:val}); endcase stage <= (mem_cmd matches tagged Ld .*)? WB: PC; endrule rule writeback (stage == WB); let dresp = dmemResp.first(); dmemResp.deq(); case (dresp) matches tagged RdResp {.val}: rf.upd(dstReg, val); endcase stage <= PC; endrule case (inst) …

Discussion Point s • Can these 3 rules be combined into one rule? • No, memory takes multiple cycles • What will happen if we forgot to dequeue (imemResp)? • System will get stuck • Eventually, PCgen can not longer fire

Pipelining the Design RFile Step 1: Insert buffers pc PCGen Exec WB dstReg DCache ICache

Pipelining the Design 2 It is problematic to write RFile from two stages: structural hazards & possible out-of-order writes(and reads) to RFile RFile pc PCGen Exec WB Step 2: delay all RFile writes to WB stage (Requires passing dst-val pairs to WB stage. Subsumes dstReg) DCache ICache

Pipelining the Design 3 PC is read by PCGen and written by Exec. No parallelism without PC speculation RFile pc PCGen Exec WB epoch If speculation fails, we must reset the PC and discard false path instructions (may take many cycles) DCache ICache Epochs to identify to which speculative path an instruction belongs

Pipelining the Design 4 Final concern: Data hazards RFile pc Exec PCGen WB epoch DCache ICache

Isolating RFile Port Usage rule pcgen(stage == PC); imemReq.enq(Rd{a:pc}); stage <= EX; endrule rule exec(stage == EX); let inst = imemResp.first(); imemResp.deq(); match {.nextpc, .rf_cmd, .mem_cmd} = doInst(pc,inst,rf); pc <= nextpc; case (rf_cmd) matches tagged RF {.dst,.val}: rf.upd(dst,val); endcase case (mem_cmd) matches tagged Ld {.dst,.addr}: begin dmemReq.enq(Rd{a:addr}}; dstReg <= dst; end tagged St {.addr,.val}: dmemReq.enq(Wr{a:addr,v:val}); endcase stage <= (mem_cmd matches tagged Ld .*)? WB: PC; endrule rule writeback (stage == WB); let dresp = dmemResp.first(); dmemResp.deq(); case (dresp) matches tagged RdResp {.val}: rf.upd(dstReg, val); endcase stage <= PC; endrule case (inst) … … Reg2RegOp: …wbData <= RF{dst,val};… LoadOp: …wbData <= Ld{dst};… Suppose we use a wbData register to pass information to the WB stage about updating the RFile rule writeback (stage == WB); case (wbData) matches tagged RF {.dst,.val}: rf.upd(dst,val); tagged Ld {.dst}: begin let dresp = dmemResp.first(); dmemResp.deq(); rf.upd(dst,memVal(dresp)); end endcase stage <= PC; endrule

Rules for the Pipelined machine rule pcgen; imemReq.enq(Rd{a:pc}); pc <= predPC; pcQ.enq(tuple2(pc,epoch)); endrule rule discard (epoch != eEpoch); pcQ.deq(); imemResp.deq(); endrule rule exec ((epoch != eEpoch)&& !(stall(inst,wbQ))); case based on the fetched instruction rule writeback (True); wbQ.deq(); case (wbQ.first()) matches tagged RF {.dst,.val}: rf.upd(dst,val); tagged Ld {.dst}: begin let dresp = dmemResp.first(); dmemResp.deq(); rf.upd(dst, memVal(dresp)); end endcase endrule

The Execute Rule let inst = imemResp.first(); match {.predPC,.eEpoch} = pcQ.first(); rule exec(epoch == eEpoch && !stall(inst, wbQ)); pcQ.deq(); imemResp.deq(); match {.nextPC, .rf_cmd, .mem_cmd} = doInst(pc,inst,rf); if(predPC!=nextPC) begin pc <= nextPC; epoch<= epoch+1; end case (tuple2(rf_cmd, mem_cmd)) matches {tagged RF {.dst,.val}, .*}: wbQ.enq(RF{dst,val}); {.*, tagged Ld {.dst,.addr}}: begin wbQ.enq(Ld{dst,addr}); dmemReq.enq(Rd{a:dst}}; end {.*, tagged St {.addr,.val}}: begin wbQ.enq(St); dmemReq.enq(Wr{a:addr,v:val}); end endcase endrule

Design Flow • Are these rules correct? I.e. do they produce the correct results regardless of the order in which they are executed • test – fix – test – fix …. • Run the design and look at the traces to understand concurrency • traces tell you what is happening at each cycle but not why something is not happening • Does your design permit “concurrent firings”, i.e., multiple instructions in the pipeline • Compiler output can tell you • Can multiple guards be true simultaneously? • Structural conflicts? • Permitted rule orderings within a cycle • You may want to split a rule into multiple rules

Top down concurrency analysis • Determine the concurrent rule firings and rule ordering you want • To hand analysis to determine the required concurrent behavior of methods of submodules. • If this behavior is prohibited by a submodule, create a submodule with the desired behavior • this may require Rwires and ConfigRegs

Branch Prediction • Pipeline has simple speculation rule pcGen (True); pc <= pc + 4; otherActions; endrule Simplest prediction: Always not-taken

Branch Predictors RFile pred pc Exec PCGen WB epoch DCache ICache

Branch Prediction Make prediction interface BranchPredictor; method Addr getNextPC(Addr pc); method Action update (Addr pc, Addr correct_next_pc); endinterface rule pcGen (True); pc <= pred.getNextPC(pc); otherActions; endrule rule execute … if (nextPC != correctPC) pred.update(curPc, nextPC); case (instr) matches … BzTaken: if (mispredicted) … endrule Update predictions

Revisiting the Processor Arvind Computer Science & Artificial Intelligence Lab