Computer Architecture: A Constructive Approach
Branch Prediction - 1
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
http://csg.csail.mit.edu/6.S078

Control Flow Penalty
[Figure: pipeline stages PC, Fetch (I-cache), Fetch Buffer, Decode, Issue Buffer, Execute (Func. Units), Result Buffer, Commit (Arch. State). The next fetch is started at the PC stage; the branch is executed at the Execute stage.]
Modern processors may have more than 10 pipeline stages between next-PC calculation and branch resolution!
How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width
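The estimate on this slide can be tried with concrete numbers. This is a minimal sketch: the helper name `lost_slots` and the example values (a 10-stage loop, 4-wide pipeline) are illustrative assumptions, not figures from the lecture.

```python
def lost_slots(loop_length: int, pipeline_width: int) -> int:
    """Rough upper bound on instruction slots discarded per misprediction.

    loop_length: stages between next-PC calculation and branch resolution.
    pipeline_width: instructions the pipeline can process per stage per cycle.
    """
    return loop_length * pipeline_width


# Hypothetical machine: 10 stages in the loop, 4 instructions wide.
print(lost_slots(10, 4))  # 40
```

The product grows with both depth and width, which is why deeper and wider pipelines depend more heavily on accurate prediction.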

Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:
            SPECint92   SPECfp92
ALU           39 %        13 %
FPU Add        -          20 %
FPU Mult       -          13 %
load          26 %        23 %
store          9 %         9 %
branch        16 %         8 %
other         10 %        12 %
SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor
What is the average run-length between branches?
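One common way to answer the question is to take the reciprocal of the branch fraction: if branches make up a fraction f of the dynamic instructions, a branch occurs roughly once every 1/f instructions. The sketch below applies that to the branch percentages in the table; the function name `average_run_length` is a hypothetical helper.

```python
# Branch percentages taken from the SPEC92 instruction mix above.
BRANCH_PERCENT = {"SPECint92": 16, "SPECfp92": 8}


def average_run_length(branch_percent: float) -> float:
    """Approximate number of instructions per branch (1 / branch fraction)."""
    return 100 / branch_percent


for suite, percent in BRANCH_PERCENT.items():
    print(f"{suite}: one branch every {average_run_length(percent):.2f} instructions")
```

Under this approximation, integer code sees a branch about every 6 instructions and floating-point code about every 12.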

MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?
Instruction   Taken known?         Target known?
J             After Inst. Decode   After Inst. Decode
JR            After Inst. Decode   After Reg. Fetch
BEQZ/BNEZ     After Exec           After Inst. Decode

Currently our simple pipelined architecture does very simple branch prediction
What is it?
Branch is predicted not taken: pc, pc+4, pc+8, …
Can we do better?

Branch Prediction Bits
• Assume 2 BP bits per instruction
• Use saturating counter
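A 2-bit saturating counter can be modelled in a few lines. This is a behavioural sketch, not the lecture's hardware: the class name, the state encoding (0 and 1 predict not taken, 2 and 3 predict taken) and the initial state are assumptions made for illustration.

```python
class SaturatingCounter:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""

    def __init__(self, state: int = 1):  # assumed start: weakly not taken
        self.state = state

    def predict_taken(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter) -> int:
    """Predict each outcome, then train the counter on what really happened."""
    misses = 0
    for taken in outcomes:
        if counter.predict_taken() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch that is taken three times and then falls through, run 3 times.
pattern = [True, True, True, False] * 3
print(count_mispredictions(pattern, SaturatingCounter()))  # 4
```

The four misses are one warm-up miss plus one per loop exit. Because a single not-taken outcome only moves the counter from 3 to 2, the branch is still predicted taken when the loop is entered again.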

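Branch History Table (BHT)
[Figure: k bits of the fetch PC (above the two low-order zero bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, accessed in parallel with the I-Cache. The fetched instruction supplies the opcode (Branch?) and the offset, which is added to the PC to form the target PC; the BHT entry supplies Taken/¬Taken?]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions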
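The table lookup in the figure can be sketched as follows. The class name `BHT`, the initial counter value and the example addresses are assumptions; the indexing (drop the two low-order PC bits, keep the next k bits) follows the figure, and k = 12 gives the 4K entries quoted on the slide.

```python
class BHT:
    """Untagged table of 2-bit saturating counters indexed by PC bits."""

    def __init__(self, index_bits: int = 12):  # 2^12 = 4K entries
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # assumed start: weakly not taken

    def _index(self, pc: int) -> int:
        # Instructions are word aligned, so the two low bits carry no information.
        return (pc >> 2) & self.mask

    def predict_taken(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bht = BHT()
bht.update(0x400, True)
bht.update(0x400, True)
print(bht.predict_taken(0x400))           # True: trained as taken
print(bht.predict_taken(0x404))           # False: a different entry
print(bht.predict_taken(0x400 + 0x4000))  # True: aliases with 0x400
```

Because the table has no tags, two branches whose PCs share the same index bits share one counter, as the last line shows.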

Where does BHT fit in the processor pipeline?
• BHT can only be used after instruction decode
  • What should we do at the fetch stage?
• Need a mechanism to update the BHT
  • Where does the update information come from?

Overview of branch prediction
[Figure: pipeline stages PC (Next Addr Pred), Decode, RegRead, Execute, with a "BP, JMP, Ret" predictor block and feedback loops back to the PC]
Stage                 Information available                           Feedback loop
PC (Next Addr Pred)   Need next PC immediately                        Tight loop
Decode                Instr type, PC relative targets available       Loose loop
RegRead               Simple conditions, register targets available   Loose loop
Execute               Complex conditions available                    Loose loop
Best predictors reflect program behavior

Next Address Predictor (NAP): first attempt
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) in parallel with iMem; each entry holds a predicted target and BP bits (BPb)]
BP bits are stored with the predicted target address.
IF stage: nPC = if (BP = taken) then target else pc+4
Later: check the prediction; if wrong, kill the instruction and update BTB & BPb; else update BPb

Address Collisions
Assume a 128-entry NAP
Instruction memory:
  132:  Jump 100
  1028: Add .....
NAP entry: target = 236, BPb = take
What will be fetched after the instruction at 1028?
  NAP prediction = 236
  Correct target = 1032
  => kill PC=236 and fetch PC=1032
Is this a common occurrence?
Can we avoid these bubbles?
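The collision can be reproduced with a small untagged table. One assumption is needed to match the slide's numbers: the index is taken from the low 7 bits of the byte address, so that 132 and 1028 both map to entry 4 (132 mod 128 = 1028 mod 128 = 4). The class name `UntaggedNAP` and its method names are hypothetical.

```python
class UntaggedNAP:
    """Next address predictor without tags: any PC with the same index hits."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.target = [0] * entries
        self.taken = [False] * entries  # stands in for the BP bits

    def index(self, pc: int) -> int:
        # Assumption: index with the low bits of the byte address.
        return pc % self.entries

    def predict(self, pc: int) -> int:
        i = self.index(pc)
        return self.target[i] if self.taken[i] else pc + 4

    def update(self, pc: int, target: int, taken: bool) -> None:
        i = self.index(pc)
        self.target[i] = target
        self.taken[i] = taken


nap = UntaggedNAP()
nap.update(132, 236, True)         # the jump at 132 trains entry 4
print(nap.index(132), nap.index(1028))  # 4 4
print(nap.predict(1028))           # 236, although the Add at 1028 falls through to 1032
```

The Add instruction inherits the jump's prediction, so the fetch from 236 has to be killed once the mistake is discovered.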

Use NAP for Control Instructions only
NAP contains useful information for branch and jump instructions only
=> Do not update it for other instructions
For all other instructions the next PC is (PC)+4!
How to achieve this effect without decoding the instruction?

Branch Target Buffer (BTB): a special form of NAP
[Figure: k bits of the PC index a 2^k-entry direct-mapped BTB in parallel with the I-Cache; each entry holds a valid bit, the entry PC and the predicted target PC. The entry PC is compared (=) with the fetch PC to produce match; the outputs are valid and target.]
• Keep the (pc, predicted pc) in the BTB
• pc+4 is predicted if no pc match is found
• BTB is updated only for branches and jumps
Permits nextPC to be determined before the instruction is decoded

Consulting BTB Before Decoding
Instruction memory:
  132:  Jump 100
  1028: Add .....
BTB entry: entry PC = 132, target = 236, BPb = take
• The match for pc = 1028 fails and 1028+4 is fetched
  => eliminates false predictions after ALU instructions
• BTB contains entries only for control transfer instructions
  => more room to store branch targets
Even very small BTBs are very effective
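The tag match described on this slide can be sketched as a direct-mapped table with a valid bit and the full PC stored as the tag. The class name, the 7-bit index and the use of the whole PC as the tag are assumptions made for illustration.

```python
class BTB:
    """Direct-mapped branch target buffer with valid bits and PC tags."""

    def __init__(self, index_bits: int = 7):
        size = 1 << index_bits
        self.mask = size - 1
        self.valid = [False] * size
        self.tag = [0] * size
        self.target = [0] * size

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> int:
        i = self._index(pc)
        if self.valid[i] and self.tag[i] == pc:
            return self.target[i]   # hit: this PC is a known branch or jump
        return pc + 4               # miss: fall through

    def update(self, pc: int, target: int) -> None:
        # Called only for control transfer instructions.
        i = self._index(pc)
        self.valid[i], self.tag[i], self.target[i] = True, pc, target


btb = BTB()
btb.update(132, 236)
print(btb.predict(132))   # 236: the jump hits its own entry
print(btb.predict(1028))  # 1032: no match, so pc+4 is predicted
```

Since the comparison uses the stored PC, an ALU instruction can never pick up a jump's target, whichever entry its address maps to.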

Observations
• There is a plethora of branch prediction schemes; their importance grows with the depth of the processor pipeline
• Processors often use more than one prediction scheme
• It is usually easy to understand the data structures required to implement a particular scheme
• It takes considerably more effort to understand how a particular scheme, with its lookups and updates, is integrated in the pipeline and how various schemes interact with each other

Plan
• We will begin with a very simple 2-stage pipeline and integrate a simple BTB scheme in it (first revisiting the simple two-stage pipeline without branch prediction)
• We will extend the design to a multistage pipeline and integrate at least one more predictor, say BHT, in the pipeline (next lecture)

Decoupled Fetch and Execute
[Figure: Fetch sends <instructions, pc, epoch> to Execute through ir; Execute sends <updated pc> back to Fetch through nextPC]
Fetch sends instructions to Execute along with pc and other control information
Execute sends information about the target pc to Fetch, which updates pc and other control registers whenever it looks at the nextPC fifo

A solution using epoch
• Add fEpoch and eEpoch registers to the processor state; initialize them to the same value
• The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via the nextPC FIFO
• Associate the fEpoch with every instruction when it is fetched
• In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch

Two-Stage pipeline: A robust two-rule solution
[Figure: PC, +4, Inst Memory, fEpoch, ir (Pipeline FIFO), Decode, Register File, Execute, eEpoch, Data Memory, nextPC (Bypass FIFO)]
Either fifo can be a normal (>1 element) fifo

Two-stage pipeline Decoupled
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;
  rule doFetch (ir.notFull);   // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode {pc:pc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      pc <= nextPC.first; fEpoch <= !fEpoch; nextPC.deq;
    end
    else pc <= pc + 4;         // simple branch prediction
  endrule

Two-stage pipeline Decoupled cont
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;
    let inst = ir.first.inst;
    if (ir.first.epoch == eEpoch) begin
      let eInst = decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
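The interaction of the two rules and the epoch registers can be traced with a small behavioural model. This is a sketch under several assumptions, not a translation of the Bluespec: both rules fire every cycle, Execute is evaluated before Fetch within a cycle (matching a bypass nextPC FIFO and a pipeline ir FIFO), the toy program has only `add` and always-taken `jump` instructions, and addresses missing from the program are treated as `add`. All Python names are hypothetical.

```python
def run(program, cycles):
    """Return (executed_pcs, killed_pcs) after the given number of cycles."""
    pc, f_epoch, e_epoch = 0, False, False
    ir = None                      # one-element pipeline register
    executed, killed = [], []
    for _ in range(cycles):
        # doExecute: consume ir, kill it if its epoch is stale.
        redirect = None
        if ir is not None:
            ipc, iepoch, inst = ir
            if iepoch == e_epoch:
                executed.append(ipc)
                if inst[0] == "jump":        # taken branch: pc+4 was wrong
                    redirect = inst[1]
                    e_epoch = not e_epoch
            else:
                killed.append(ipc)
        # doFetch: tag the instruction with fEpoch, then choose the next pc.
        ir = (pc, f_epoch, program.get(pc, ("add",)))
        if redirect is not None:             # nextPC is not empty
            pc = redirect
            f_epoch = not f_epoch
        else:
            pc = pc + 4                      # simple branch prediction
    return executed, killed


program = {0: ("add",), 4: ("jump", 16), 8: ("add",), 12: ("add",),
           16: ("add",), 20: ("add",)}
executed, killed = run(program, cycles=6)
print(executed)  # [0, 4, 16, 20]
print(killed)    # [8]
```

The instruction at 8 is fetched in the same cycle in which the jump executes, so it still carries the old epoch and is rejected one cycle later, while fetch has already moved on to 16 with the new epoch.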

Two-Stage pipeline with a Branch Predictor
[Figure: PC, Branch Predictor producing ppc, Inst Memory, fEpoch, ir, Decode, Register File, Execute, eEpoch, Data Memory, nextPC]

Branch Predictor Interface
interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface

Null Branch Prediction
• Replaces PC+4 with …
• Already implemented in the pipeline
• Right most of the time
  • Why?
module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod
  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule
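For readers who want to experiment outside Bluespec, the interface and the null predictor can be mirrored in a behavioural model. This is a sketch only: the Python class names are assumptions chosen to echo the interface above, and addresses are plain integers.

```python
from abc import ABC, abstractmethod


class NextAddressPredictor(ABC):
    """Behavioural counterpart of the predictor interface."""

    @abstractmethod
    def prediction(self, pc: int) -> int:
        """Return the predicted next pc for the instruction at pc."""

    @abstractmethod
    def update(self, pc: int, target: int) -> None:
        """Record that the instruction at pc was actually followed by target."""


class NeverTaken(NextAddressPredictor):
    """Null predictor: always predicts fall-through and learns nothing."""

    def prediction(self, pc: int) -> int:
        return pc + 4

    def update(self, pc: int, target: int) -> None:
        pass  # corresponds to noAction


bp = NeverTaken()
bp.update(132, 236)        # ignored
print(bp.prediction(132))  # 136
```

One reason such a predictor is right most of the time is that most instructions are not control transfers at all, so pc+4 is the correct successor for them.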

Branch Target Prediction (BTB)
module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;
  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);
    let tag = tagArr.sub(index);
    let target = targetArr.sub(index);
    if (tag==pc) return target;
    else return (pc+4);
  endmethod
  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);
    targetArr.upd(index, target);
  endmethod
endmodule

Two-stage pipeline + BP
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken;   // some target predictor
The definition of TypeFetch2Decode is changed to include the predicted pc:
typedef struct {
  Addr pc; Addr ppc; Bool epoch; Data inst;
} TypeFetch2Decode deriving (Bits, Eq);

Two-stage pipeline + BP Fetch rule
  rule doFetch (ir.notFull);
    let ppc = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode {pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc; fEpoch <= !fEpoch; nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else pc <= ppc;
  endrule

Two-stage pipeline + BP Execute rule
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc;
    let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch == eEpoch) begin
      let eInst = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule
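The effect of plugging a predictor into the two rules can be observed with a behavioural model. This is a sketch under stated assumptions rather than a translation of the Bluespec: both rules fire every cycle with Execute evaluated before Fetch, the toy program has only `add` and always-taken `jump` instructions, missing addresses are treated as `add`, and the BTB model includes a valid bit. All Python names are hypothetical.

```python
class NeverTaken:
    def prediction(self, pc): return pc + 4
    def update(self, pc, target): pass


class BTB:
    def __init__(self, index_bits=4):
        size = 1 << index_bits
        self.mask = size - 1
        self.valid, self.tag, self.target = [False] * size, [0] * size, [0] * size

    def prediction(self, pc):
        i = (pc >> 2) & self.mask
        return self.target[i] if self.valid[i] and self.tag[i] == pc else pc + 4

    def update(self, pc, target):
        i = (pc >> 2) & self.mask
        self.valid[i], self.tag[i], self.target[i] = True, pc, target


def run(program, predictor, cycles):
    """Return (instructions executed, mispredictions) after `cycles` cycles."""
    pc, f_epoch, e_epoch, ir = 0, False, False, None
    executed = mispredictions = 0
    for _ in range(cycles):
        # doExecute: compare the real next pc with the predicted one (ppc).
        redirect = None
        if ir is not None:
            ipc, ippc, iepoch, inst = ir
            if iepoch == e_epoch:
                executed += 1
                next_pc = inst[1] if inst[0] == "jump" else ipc + 4
                if next_pc != ippc:
                    mispredictions += 1
                    redirect = (ipc, next_pc)
                    e_epoch = not e_epoch
        # doFetch: predict, enqueue, then redirect and train on a misprediction.
        ppc = predictor.prediction(pc)
        ir = (pc, ppc, f_epoch, program.get(pc, ("add",)))
        if redirect is not None:
            ipc, next_pc = redirect
            pc, f_epoch = next_pc, not f_epoch
            predictor.update(ipc, next_pc)
        else:
            pc = ppc
    return executed, mispredictions


LOOP = {0: ("add",), 4: ("add",), 8: ("jump", 0)}   # three-instruction loop
print(run(LOOP, NeverTaken(), 12))  # (9, 3)
print(run(LOOP, BTB(), 12))         # (10, 1)
```

With the null predictor the back-edge is mispredicted on every iteration; with the BTB it is mispredicted once, after which the jump's entry supplies the correct target at fetch time.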

Execute Function
function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  let aluVal2 = (dInst.immValid) ? dInst.imm : rVal2;
  let aluRes = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
  einst.itype = dInst.iType;
  einst.addr = memType(dInst.iType) ? aluRes : brAddr;
  einst.data = dInst.iType==St ? rVal2 : aluRes;
  einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
  // mispredicted if the actual next pc differs from the predicted pc
  einst.missPrediction = einst.brTaken ? brAddr != ppc : (pc+4) != ppc;
  einst.rDst = dInst.rDst;
  return einst;
endfunction
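The misprediction test in the execute function reduces to comparing the actual next pc with the predicted one. A minimal sketch of that check, with a hypothetical function name and integer addresses:

```python
def miss_prediction(pc: int, ppc: int, br_taken: bool, br_addr: int) -> bool:
    """True when the predicted pc (ppc) differs from the actual next pc."""
    actual_next = br_addr if br_taken else pc + 4
    return actual_next != ppc


# Taken jump at 132 to 236, predicted fall-through: mispredicted.
print(miss_prediction(pc=132, ppc=136, br_taken=True, br_addr=236))    # True
# Same jump, predicted by a BTB hit: correct.
print(miss_prediction(pc=132, ppc=236, br_taken=True, br_addr=236))    # False
# Not-taken instruction at 1028 predicted to go to 236: mispredicted.
print(miss_prediction(pc=1028, ppc=236, br_taken=False, br_addr=0))    # True
```

The third case covers a predictor that wrongly redirects a fall-through instruction, which is why the check also applies when no branch is taken.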
