Pipelined Implementation Part II

Pipelined Implementation Part II

Overview Make the pipelined processor work! • Data Hazards • Instruction having register R as source follows shortly after instruction having register R as destination • Common condition, don’t want to slow down pipeline • Control Hazards • Mispredict conditional branch • Our design predicts all branches as being taken • Naïve pipeline executes two extra instructions • Getting return address for ret instruction • PIPE- executes three extra instructions • Making Sure It Really Works • What if multiple special cases happen simultaneously?

Suggested Reading • - Chap 4.5

Branch Misprediction Example • Should only execute first 7 instructions demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t # Not taken 0x007: irmovl $1, %eax # Fall through 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx # Target (Should not execute) 0x017: irmovl $4, %ecx # Should not execute 0x01d: irmovl $5, %edx # Should not execute

Branch Misprediction Trace • Incorrectly execute two instructions at branch target

Return Example demo-ret.ys 0x000: irmovl Stack,%esp # Intialize stack pointer 0x006: nop # Avoid hazard on %esp 0x007: nop 0x008: nop 0x009: call p # Procedure call 0x00e: irmovl $5,%esi # Return point 0x014: halt 0x020: .pos 0x20 0x020: p: nop # procedure 0x021: nop 0x022: nop 0x023: ret 0x024: irmovl $1,%eax # Should not be executed 0x02a: irmovl $2,%ecx # Should not be executed 0x030: irmovl $3,%edx # Should not be executed 0x036: irmovl $4,%ebx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer • Require lots of nops to avoid data hazards

Incorrect Return Example • Incorrectly execute 3 instructions following ret

Handling Misprediction Predict branch as taken • Fetch 2 instructions at target Cancel when mispredicted • Detect branch not-taken in execute stage • On following cycle, replace instructions in execute and decode by bubbles • No side effects have occurred yet Figure 4.63 P346

Detecting Mispredicted Branch Figure 4.64 P347

F F D D E E M M W W Control for Misprediction 1 2 3 4 5 6 7 8 9 10 # demo - j. ys F F D D E E M M W W 0x000: xorl % eax ,% eax F F D D E E M M W W 0x002: jne t # Not taken F D 0x011: t: irmovl $2,% edx # Target E M W bubble F 0x017: irmovl $3,% ebx # Target+1 D E M W bubble F F D D E E M M W W 0x007: irmovl $1,% eax # Fall through 0x00d: nop Figure 4.63 P346 Figure 4.66 P348

Return Example demo-retb.ys • Previously executed three additional instructions 0x000: irmovl Stack,%esp # Initialize stack pointer 0x006: call p # Procedure call 0x00b: irmovl $5,%esi # Return point 0x011: halt 0x020: .pos 0x20 0x020: p: irmovl $-1,%edi # procedure 0x026: ret 0x027: irmovl $1,%eax # Should not be executed 0x02d: irmovl $2,%ecx # Should not be executed 0x033: irmovl $3,%edx # Should not be executed 0x039: irmovl $4,%ebx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer

W W valM valM = = 0x0b 0x0b • • • Correct Return Example # demo - retb F D E M W 0x026: ret F D E M W bubble F D E M W bubble F D E M W bubble F F D D E E M M W W 0x00b: irmovl $5,% esi # Return Figure 4.61 P344 • As ret passes through pipeline, stall at fetch stage • While in decode, execute, and memory stage • fetch the same instruction after ret 3 times. • Inject bubble into decode stage • Release stall when reach write-back stage F F f f valC valC 5 5 f f rB rB % % esi esi

Detecting Return Figure 4.64 P347

Control for Return # demo - retb F D E M W 0x026: ret F D E M W bubble F D E M W bubble F D E M W bubble F F D D E E M M W W 0x00b: irmovl $5,% esi # Return Figure 4.66 P348

Special Control Cases Figure 4.64 P347 • Detection • Action Figure 4.66 P348

Implementing Pipeline Control Figure 4.68 P351 • Combinational logic generates pipeline control signals • Action occurs at start of following cycle

Initial Version of Pipeline Control bool F_stall = # Conditions for a load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool D_stall = # Conditions for a load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }; bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool E_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB};

Control Combinations • Special cases that can arise on same clock cycle • Combination A • Not-taken branch • ret instruction at branch target • Combination B • Instruction that reads from memory to %esp • Followed by ret instruction Figure 4.67 P349

1 1 1 Mispredict Mispredict ret ret ret M M M M M JXX JXX E E E E E ret ret ret D D D D D Combination A Control Combination A • Should handle as mispredicted branch • Stalls F pipeline register • But PC selection logic will be using M_valA anyhow

Control Combination B 1 1 1 Load/use ret ret ret • Would attempt to bubble and stall pipeline register D • Signaled by processor as pipeline error M M M M Load E E E E Use ret ret ret D D D D Combination B

Handling Control Combination B 1 1 1 Load/use ret ret ret • Load/use hazard should get priority • ret instruction should be held in decode stage for additional cycle M M M M Load E E E E Use ret ret ret D D D D Combination B

Corrected Pipeline Control Logic bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode } # but not condition for a load/use hazard && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }); • Load/use hazard should get priority • ret instruction should be held in decode stage for additional cycle

Pipeline Summary • Data Hazards • Most handled by forwarding • No performance penalty • Load/use hazard requires one cycle stall • Control Hazards • Cancel instructions when detect mispredicted branch • Two clock cycles wasted • Stall fetch stage while ret passes through pipeline • Three clock cycles wasted • Control Combinations • Must analyze carefully • First version had subtle bug • Only arises with unusual instruction combination

4.5.10 Performance Metrics • Clock rate • Measured in Megahertz or Gigahertz • Function of stage partitioning and circuit design • Keep amount of work per stage small • Rate at which instructions executed • CPI: cycles per instruction • On average, how many clock cycles does each instruction require? • Function of pipeline design and benchmark programs • E.g., how frequently are branches mispredicted?

CPI for PIPE • CPI  1.0 • Fetch instruction each clock cycle • Effectively process new instruction almost every cycle • Although each individual instruction has latency of 5 cycles • CPI > 1.0 • Sometimes must stall or cancel branches • Computing CPI • C clock cycles • I instructions executed to completion • B bubbles injected (C = I + B) CPI = C/I = (I+B)/I = 1.0 + B/I • Factor B/I represents average penalty due to bubbles

CPI for PIPE (Cont.) Typical Values B/I = LP + MP + RP • LP: Penalty due to load/use hazard stalling • Fraction of instructions that are loads 0.25 • Fraction of load instructions requiring stall 0.20 • Number of bubbles injected each time 1  LP = 0.25 * 0.20 * 1 = 0.05 • MP: Penalty due to mispredicted branches • Fraction of instructions that are cond. jumps 0.20 • Fraction of cond. jumps mispredicted 0.40 • Number of bubbles injected each time 2  MP = 0.20 * 0.40 * 2 = 0.16 • RP: Penalty due to ret instructions • Fraction of instructions that are returns 0.02 • Number of bubbles injected each time 3  RP = 0.02 * 3 = 0.06 • Net effect of penalties 0.05 + 0.16 + 0.06 = 0.27  CPI = 1.27 (Not bad!)

Fetch Logic Revisited • During Fetch Cycle • Select PC • Read bytes from instruction memory • Examine icode to determine instruction length • Increment PC • Timing • Steps 2 & 4 require significant amount of time

Standard Fetch Timing • Must Perform Everything in Sequence • Can’t compute incremented PC until know how much to increment it by need_regids, need_valC Select PC Mem. Read Increment 1 clock cycle

Fast Slow A Fast PC Increment Circuit incrPC High-order 29 bits Low-order 3 bits carry MUX 0 1 29-bit incrementer 3-bit adder need_regids 0 High-order 29 bits need_ValC Low-order 3 bits PC

Modified Fetch Timing • 29-Bit Incrementer • Acts as soon as PC selected • Output not needed until final MUX • Works in parallel with memory read need_regids, need_valC 3-bit add Select PC Mem. Read MUX Incrementer Standard cycle 1 clock cycle

More Realistic Fetch Logic • Fetch Box • Integrated into instruction cache • Fetches entire cache block (16 or 32 bytes) • Selects current instruction from current block • Works ahead to fetch next block • As reaches end of current block • At branch target

Exception • Condition: Instruction encounter an error condition • Deal flow: • Break the program flow • Invoke the exception hander provided by OS. • (Maybe) continue the program flow • E.g. page fault exception

Exceptions • Conditions under which pipeline cannot continue normal operation • Causes • Halt instruction (Current) • Bad address for instruction or data (Previous) • Invalid instruction (Previous) • Pipeline control error (Previous) • Desired Action • Complete some instructions • Either current or previous (depends on exception type) • Discard others • Call exception handler • Like an unexpected procedure call

Exception Examples • Detect in Fetch Stage jmp $-1 # Invalid jump target .byte 0xFF # Invalid instruction code halt # Halt instruction Detect in Memory Stage irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address

5 W Exception detected M E D Exception detected Exceptions in Pipeline Processor #1 # demo-exc1.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # Invalid address nop .byte 0xFF # Invalid instruction code • Desired Behavior • rmmovl should cause exception 1 2 3 4 0x000: irmovl $100,%eax F D E M 0x006: rmmovl %eax,0x10000(%eax) F D E 0x00c: nop F D 0x00d: .byte 0xFF F

4 5 6 7 8 9 M W E M D E M W F D E M W F D E M W Exception detected Exceptions in Pipeline Processor #2 # demo-exc2.ys 0x000: xorl %eax,%eax # Set condition codes 0x002: jne t # Not taken 0x007: irmovl $1,%eax 0x00d: irmovl $2,%edx 0x013: halt 0x014: t: .byte 0xFF # Target • Desired Behavior • No exception should occur 1 2 3 0x000: xorl %eax,%eax F D E 0x002: jne t F D 0x014: t: .byte 0xFF F 0x???: (I’m lost!) 0x007: irmovl $1,%eax

W exc icode valE valM dstE dstM M exc icode Bch valE valA dstE dstM E exc icode ifun valC valA valB dstE dstM srcA srcB D exc icode ifun rA rB valC valP F predPC Maintaining Exception Ordering • Add exception status field to pipeline registers • Fetch stage sets to either “AOK,” “ADR” (when bad fetch address), or “INS” (illegal instruction) • Decode & execute pass values through • Memory either passes through or sets to “ADR” • Exception triggered only when instruction hits write back

5 W Exception detected M E Condition code set Side Effects in Pipeline Processor • Desired Behavior • rmmovl should cause exception • No following instruction should have any effect # demo-exc3.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address addl %eax,%eax # Sets condition codes 1 2 3 4 0x000: irmovl $100,%eax F D E M 0x006: rmmovl %eax,0x10000(%eax) F D E 0x00c: addl %eax,%eax F D

Avoiding Side Effects • Presence of Exception Should Disable State Update • When detect exception in memory stage • Disable condition code setting in execute • Must happen in same clock cycle • When exception passes to write-back stage • Disable memory write in memory stage • Disable condition code setting in execute stage • Implementation • Hardwired into the design of the PIPE simulator • You have no control over this

Rest of Exception Handling • Calling Exception Handler • Push PC onto stack • Either PC of faulting instruction or of next instruction • Usually pass through pipeline along with exception status • Jump to handler address • Usually fixed address • Defined as part of ISA • Implementation • Haven’t tried it yet!

Modern CPU Design

Processor Summary • Design Technique • Create uniform framework for all instructions • Want to share hardware among instructions • Connect standard logic blocks with bits of control logic • Operation • State held in memories and clocked registers • Computation done by combinational logic • Clocking of registers/memories sufficient to control overall behavior • Enhancing Performance • Pipelining increases throughput and improves resource utilization • Must make sure maintains ISA behavior

Pipelined Implementation Part II