Processor Design 5Z032

Processor Design 5Z032 Processor Pipelining Chapter 6 Henk Corporaal Eindhoven University of Technology 2009

Topics • Pipelining • Pipelined datapath • Pipelined control • Hazards: • Structural • Data • Control • Exceptions • Performance improvements • Scheduling • Branch prediction • Superscalar processors

Pipelining Improve performance by increasing instruction throughput [Figure: program execution order against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. Non-pipelined: a new instruction starts every 8 ns. Pipelined: a new instruction starts every 2 ns.]

Pipelining • Ideal speedup = number of stages • Do we achieve this?
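The question above can be explored with a small calculation. The sketch below is a minimal model without hazards. The per-stage latencies are an assumption, chosen so that they add up to the 8 ns per instruction and the 2 ns clock shown in the timing figure; the function names are illustrative only.

```python
def total_time_ns(n_instructions, stage_times_ns, pipelined):
    """Time to run n instructions on an idealised machine without hazards."""
    if not pipelined:
        # Each instruction takes the sum of all stage times before the next starts.
        return n_instructions * sum(stage_times_ns)
    # Pipelined: the clock period is set by the slowest stage. The first
    # instruction needs one cycle per stage, each later one finishes a cycle after.
    cycle = max(stage_times_ns)
    return (len(stage_times_ns) + n_instructions - 1) * cycle


# Assumed latencies in ns: instruction fetch, Reg read, ALU, data access, Reg write
STAGES = [2, 1, 2, 2, 1]


def speedup(n_instructions):
    return (total_time_ns(n_instructions, STAGES, pipelined=False)
            / total_time_ns(n_instructions, STAGES, pipelined=True))


print(total_time_ns(3, STAGES, pipelined=False))  # three loads, non-pipelined
print(total_time_ns(3, STAGES, pipelined=True))   # three loads, pipelined
print(speedup(1_000_000))                         # long instruction stream
```

Under these assumptions the speedup for a long instruction stream approaches 4 rather than 5: the stages are not balanced, so the clock is limited by the slowest stage, and the pipeline also needs a few cycles to fill.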

Pipelining • What makes it easy? • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction • We’ll build a simple pipeline and look at these issues • We’ll talk about modern processors and what really makes it hard: • exception handling • trying to improve performance with out-of-order execution, etc.

Basic Idea What do we need to add to actually split the datapath into stages? [Fig. 6.10: single-cycle datapath divided into five stages: IF: Instruction fetch; ID: Instruction decode / register file read; EX: Execute / address calculation; MEM: Memory access; WB: Write back]

Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem? [Fig. 6.12: datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages]

Corrected Datapath [Fig. 6.18: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]

Graphically Representing Pipelines • Can help with answering questions like: • how many cycles does it take to execute this code? • what is the ALU doing during cycle 4? • use this representation to help understand datapaths [Figure: lw $10, 20($1) followed by sub $11, $2, $3 over clock cycles CC1 to CC6, each passing through IM, Reg, ALU, DM, Reg]

Pipeline Control [Fig. 6.25: pipelined datapath with the control signals PCSrc, Branch, RegWrite, MemWrite, MemRead, ALUSrc, ALUOp, RegDst and MemtoReg]

Pipeline control • We have 5 stages. What needs to be controlled in each stage? • Instruction Fetch and PC Increment • Instruction Decode / Register Fetch • Execution • Memory Stage • Write Back • How would control be handled in an automobile plant? • a fancy control center telling everyone what to do? • should we use a finite state machine?

Pipeline Control Pass control signals along just like the data Fig. 6.29

Datapath with Control [Fig. 6.30: pipelined datapath in which the Control unit writes the EX, M and WB control groups into ID/EX; the M and WB groups continue into EX/MEM, and the WB group into MEM/WB]
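Passing control along with the data can be sketched as follows. This is a minimal model, not the control unit itself: the signal values for R-format and lw follow the usual textbook settings for this datapath and are stated here as an assumption, and the function and table names are illustrative.

```python
# Control values per instruction class, grouped by the stage that consumes them.
# Assumed settings for R-format and lw; other instructions are omitted.
CONTROL_TABLE = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
}


def control_in_pipeline_registers(instruction_class):
    """Show which control groups each pipeline register still carries."""
    control = CONTROL_TABLE[instruction_class]
    return {
        # All control is generated during ID and stored in ID/EX.
        "ID/EX": {group: control[group] for group in ("EX", "M", "WB")},
        # The EX group has been used; only M and WB move on.
        "EX/MEM": {group: control[group] for group in ("M", "WB")},
        # The M group has been used; only WB is left.
        "MEM/WB": {"WB": control["WB"]},
    }


for register, groups in control_in_pipeline_registers("lw").items():
    print(register, sorted(groups))
```

Each stage uses its own group and drops it, so no central controller has to track which instruction is in which stage.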

Hazards

Hazards Hazards: problems due to pipelining Hazard types: • Structural • same resource is needed multiple times in the same cycle • Data • data dependencies limit pipelining • Control • next executed instruction is not the next specified instruction

Structural hazards Examples: • Two accesses to a single ported memory • Two operations need the same function unit at the same time • Two operations need the same function unit in successive cycles, but the unit is not pipelined Solutions: • stalling • add more hardware

Structural hazards Simple pipelining diagram (not MIPS!): • IF: instruction fetch • ID: instruction decode • OF: operand fetch • EX: execute stage(s) • WB: write back [Figure: instruction stream against time; five instructions each pass through IF ID OF EX WB, each starting one cycle after the previous one] Pipeline stalls due to lack of resources: • Shared memory port • One FU [Figure: instruction stream against time containing a load and an instruction that occupies EX for three cycles (IF ID OF EX EX EX WB)]

Structural hazards Non-pipelined units • Same non-pipelined FU [Figure: instruction stream against time; two consecutive instructions each occupy EX for two cycles (IF ID OF EX EX WB), which causes a stall cycle]

Structural hazards on MIPS Q: Do we have structural hazards on our simple MIPS pipeline? [Figure: instruction stream against time; five instructions each pass through IF ID EX MEM WB, each starting one cycle after the previous one]

Data hazards • Data dependencies: • RaW (read-after-write) • WaW (write-after-write) • WaR (write-after-read) • Hardware solution: • Forwarding / Bypassing • Detection logic • Stalling • Software solution: Scheduling

Data dependences Three types: RaW, WaR and WaW
    add r1, r2, 5    ; r1 := r2+5
    sub r4, r1, r3   ; RaW of r1
    add r1, r2, 5
    sub r2, r4, 1    ; WaR of r2
    add r1, r2, 5
    sub r1, r1, 1    ; WaW of r1
    st r1, 5(r2)     ; M[r2+5] := r1
    ld r5, 0(r4)     ; RaW if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom! Problems for your compiler and Pentium! Use register renaming to solve this!

RaW dependence
    add r1, r2, 5    ; r1 := r2+5
    sub r4, r1, r3   ; RaW of r1
Without bypass circuitry: [Figure: add and sub against time in the IF ID OF EX WB pipeline; sub waits for the result of add to be written back] With bypass circuitry: [Figure: the same two instructions against time without the wait] Saves two cycles
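The two-cycle saving can be reproduced with a small model of the IF ID OF EX WB pipeline. This is a sketch under one stated assumption: without bypassing, the consumer can fetch an operand only in the cycle after the producer's WB. The function name is illustrative.

```python
STAGES = ["IF", "ID", "OF", "EX", "WB"]


def raw_stall_cycles(distance, bypass):
    """Stall cycles for a consumer issued `distance` instructions after its producer."""
    # Cycle in which the producer occupies each stage (issued in cycle 1).
    producer = {stage: cycle for cycle, stage in enumerate(STAGES, start=1)}
    if bypass:
        # The ALU result is forwarded straight into the consumer's EX stage.
        earliest = producer["EX"] + 1
        nominal = producer["EX"] + distance
    else:
        # Assumption: the operand is readable only in the cycle after WB.
        earliest = producer["WB"] + 1
        nominal = producer["OF"] + distance
    return max(0, earliest - nominal)


print(raw_stall_cycles(1, bypass=False))  # sub directly after add, no bypass
print(raw_stall_cycles(1, bypass=True))   # sub directly after add, with bypass
```

For the add/sub pair the model gives two stall cycles without bypass circuitry and none with it.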

RaW on MIPS pipeline [Fig. 6.36: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over clock cycles CC1 to CC9, each passing through IM, Reg, ALU, DM, Reg. Value of register $2: 10 in CC1 to CC4, 10/–20 in CC5, –20 in CC6 to CC9]

Forwarding Use temporary results, don’t wait for them to be written • register file forwarding to handle read/write to same register • ALU forwarding [Fig. 6.37: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over clock cycles CC1 to CC9. Value of register $2: 10 in CC1 to CC4, 10/–20 in CC5, –20 in CC6 to CC9. Value of EX/MEM: –20 in CC4, X otherwise. Value of MEM/WB: –20 in CC5, X otherwise. Annotation on the figure: what if this $2 was $13?]

ALU Forwarding hardware ALU forwarding circuitry principle: [Figure: ALU with operand inputs from the register file and from result buffers (buf); the ALU result passes through the buffers on its way to the register file]

Forwarding [Fig. 6.38: pipelined datapath with a forwarding unit that drives the ForwardA and ForwardB multiplexers at the ALU inputs. Inputs to the forwarding unit: Rs and Rt of the instruction in EX, EX/MEM.RegisterRd and MEM/WB.RegisterRd]

Forwarding check • Check for matching register-ids: • For each source-id of operation in the EX-stage check if there is a matching pending dest-id Example: if (EX/MEM.RegWrite) and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs) then ForwardA = 10 Q. How many comparators do we need?
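The forwarding check can be written out as a small function. This is a sketch, not the slide's circuit: only the EX/MEM condition with ForwardA = 10 comes from the example above; the MEM/WB case with select value 01 and the default 00 are assumptions following the usual textbook convention, and the dictionary layout is illustrative.

```python
def forward_select(source_reg, ex_mem, mem_wb):
    """Return the 2-bit mux select for one ALU operand of the instruction in EX."""
    # EX hazard: the result of the previous instruction sits in EX/MEM.
    if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
            and ex_mem["RegisterRd"] == source_reg):
        return 0b10
    # MEM hazard (assumed encoding 01): the result sits in MEM/WB.
    if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
            and mem_wb["RegisterRd"] == source_reg):
        return 0b01
    # No pending write to this register: take the register file value.
    return 0b00


def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    return (forward_select(id_ex["RegisterRs"], ex_mem, mem_wb),
            forward_select(id_ex["RegisterRt"], ex_mem, mem_wb))


# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
id_ex = {"RegisterRs": 2, "RegisterRt": 5}
ex_mem = {"RegWrite": True, "RegisterRd": 2}
mem_wb = {"RegWrite": False, "RegisterRd": 0}
print(forwarding_unit(id_ex, ex_mem, mem_wb))
```

Checking EX/MEM before MEM/WB makes the most recent result win when both pending instructions write the same register. The test against register 0 is there because $0 must always read as zero.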

Can't always forward • Load word can still cause a hazard: • an instruction tries to read register r following a load to the same r • Need a hazard detection unit to “stall” the instruction that uses the loaded value [Fig. 6.44: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 over clock cycles CC1 to CC9, each passing through IM, Reg, ALU, DM, Reg]

Stalling We can stall the pipeline by keeping an instruction in the same stage [Fig. 6.45: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 over clock cycles CC1 to CC10, with a bubble inserted after the lw. Annotation on the figure: In CC4 the ALU is not used, Reg and IM are redone]

Hazard Detection Unit [Fig. 6.46: pipelined datapath with forwarding unit and hazard detection unit. Inputs to the hazard detection unit: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt. It controls PCWrite, IF/IDWrite and a multiplexer that can replace the control signals with 0]
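The load-use check performed by the hazard detection unit can be sketched from the signals named in the figure. The comparison itself follows the usual textbook condition and is stated here as an assumption; `InsertBubble` is a hypothetical name for the multiplexer select that zeroes the control signals.

```python
def load_use_hazard(id_ex, if_id):
    """True if the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex["MemRead"]
                and id_ex["RegisterRt"] in (if_id["RegisterRs"], if_id["RegisterRt"]))


def hazard_detection_unit(id_ex, if_id):
    stall = load_use_hazard(id_ex, if_id)
    return {
        "PCWrite": not stall,      # hold the PC so the same instruction is fetched again
        "IF/IDWrite": not stall,   # hold IF/ID so the same instruction is decoded again
        "InsertBubble": stall,     # hypothetical name: zero the control signals in ID/EX
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID
print(hazard_detection_unit({"MemRead": True, "RegisterRt": 2},
                            {"RegisterRs": 2, "RegisterRt": 5}))
```

Holding the PC and IF/ID for one cycle while a bubble moves down the pipeline gives the load time to reach MEM, after which the value can be forwarded.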

Software only solution • Have compiler guarantee that no hazards occur • Example: where do we insert the “NOPs”?
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
• Problem: this really slows us down!
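A compiler pass for this could look like the sketch below. It assumes a five-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so a consumer must be at least three instruction slots behind its producer. The tuple layout and function name are illustrative.

```python
def insert_nops(program, min_distance=3):
    """Insert nops so every consumer is at least min_distance slots after its producer.

    program: list of (text, destination register or None, list of source registers).
    """
    scheduled = []  # list of (text, destination)
    for text, dest, sources in program:
        needed = 0
        # Look back over the most recently emitted instructions.
        for back, (_, prev_dest) in enumerate(reversed(scheduled), start=1):
            if back >= min_distance:
                break
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([("nop", None)] * needed)
        scheduled.append((text, dest))
    return [text for text, _ in scheduled]


PROGRAM = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]

for line in insert_nops(PROGRAM):
    print(line)
```

Under these assumptions two nops go between the sub and the and; the later readers of $2 are then far enough away. Two of seven issue slots do no work, which illustrates why this approach is slow.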

Control hazards • Control operations may change the sequential flow of instructions • branch • jump • call (jump and link) • return • (exception)

Branch Branch actions: • Compute new address • Determine condition • Perform the actual branch (if taken): PC := new address Squash pipeline: • When we decide to branch, other instructions are in the pipeline! • We are predicting “branch not taken” • need to add hardware for flushing instructions if we are wrong

Branch with predict not taken [Figure: clock-cycle diagram of Branch L, the following instructions fetched under predict not taken, and the branch target L:, each passing through IF ID EX MEM WB]

Branch example [Fig. 6.50: instruction addresses and instructions over clock cycles CC1 to CC9: 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7)]
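The addresses in the branch example can be checked with a short calculation. The sketch assumes the usual MIPS rule that the branch offset counts words relative to the instruction after the branch, and that three instructions enter the pipeline before the branch takes effect, as the addresses 44, 48 and 52 in the figure suggest. The function names are illustrative.

```python
def branch_target(branch_pc, offset_words):
    """Target address of a MIPS branch: PC + 4 plus the word offset shifted left by 2."""
    return branch_pc + 4 + (offset_words << 2)


def wrong_path_addresses(branch_pc, n_fetched):
    """Addresses fetched under predict not taken before the branch is resolved."""
    return [branch_pc + 4 * (i + 1) for i in range(n_fetched)]


print(branch_target(40, 7))         # address of the lw at the branch target
print(wrong_path_addresses(40, 3))  # and, or, add: flushed if the branch is taken
```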

Branch speedup • Earlier address computation • Earlier condition calculation • Put both in the ID pipeline stage • adder • comparator [Figure: clock-cycle diagram of Branch L, one instruction fetched under predict not taken, and the branch target L:, each passing through IF ID EX MEM WB]

Improved branching / flushing IF/ID [Fig. 6.51: pipelined datapath with the branch adder (shift left 2) and the register comparator (=) moved into the ID stage, an IF.Flush signal into IF/ID, the hazard detection unit and the forwarding unit]

Exception support Types of exceptions: • Overflow • I/O device request • Operating system call • Undefined instruction • Hardware malfunction • Page fault • Precise exception: • finish previous instructions (which are still in the pipeline) • flush excepting and following instructions, redo them after handling the exception(s)

Exceptions Changes needed for handling an overflow exception of an operation in the EX stage (see Fig. 6.55): • Extend PC input mux with extra entry with fixed address • Add EPC register recording the ID/EX stage PC • this is the address of the next instruction! • Cause register recording exception type • In case of overflow exception insert 3 bubbles: flush • IF/ID stage • ID/EX stage • EX/MEM stage

Performance improvements

Performance improvements • Scheduling • avoiding data hazards • avoiding control hazards • Branches • delay slot • branch prediction • Superscalar

Scheduling, why? Let’s look at the execution time: Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle Scheduling may reduce Texecution • Reduce CPI (cycles per instruction) • early scheduling of long latency operations • avoid pipeline stalls due to structural, data and control hazards • allow Nissue > 1 and therefore CPI < 1 • Reduce Ninstructions • compact many operations into each instruction (VLIW)

Scheduling data hazards: example 1 Try and avoid RaW stalls (in this case load interlocks)! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)
    sw $t2, 0($t1)
?
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)

Scheduling data hazards: example 2 Avoiding RaW stalls: Reordering instructions for following program (by you or the compiler) Code:
    a = b + c
    d = e - f
Unscheduled code:
    Lw R1,b
    Lw R2,c
    Add R3,R1,R2    ; interlock
    Sw a,R3
    Lw R1,e
    Lw R2,f
    Sub R4,R1,R2    ; interlock
    Sw d,R4
Scheduled code:
    Lw R1,b
    Lw R2,c
    Lw R5,e         ; extra reg. needed!
    Add R3,R1,R2
    Lw R2,f
    Sw a,R3
    Sub R4,R5,R2
    Sw d,R4
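The effect of the reordering can be checked by counting load interlocks. The sketch assumes a pipeline with forwarding in which an instruction stalls for one cycle when it reads a register loaded by the instruction directly before it. The tuple encoding of the two code sequences is illustrative.

```python
def count_load_interlocks(program):
    """Count instructions that directly follow a load of one of their source registers.

    program: list of (opcode, destination register or None, list of source registers).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "lw" and dest in current[2]:
            stalls += 1
    return stalls


UNSCHEDULED = [
    ("lw", "R1", []), ("lw", "R2", []), ("add", "R3", ["R1", "R2"]), ("sw", None, ["R3"]),
    ("lw", "R1", []), ("lw", "R2", []), ("sub", "R4", ["R1", "R2"]), ("sw", None, ["R4"]),
]
SCHEDULED = [
    ("lw", "R1", []), ("lw", "R2", []), ("lw", "R5", []), ("add", "R3", ["R1", "R2"]),
    ("lw", "R2", []), ("sw", None, ["R3"]), ("sub", "R4", ["R5", "R2"]), ("sw", None, ["R4"]),
]

print(count_load_interlocks(UNSCHEDULED))
print(count_load_interlocks(SCHEDULED))
```

Both versions contain eight instructions; the scheduled one removes the two interlocks at the cost of one extra register (R5).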

Scheduling control hazards Texecution = Ninstructions x CPI x Tcycle CPI = CPIideal + fbranch x Pbranch Pbranch = Ndelayslots x miss_rate • Modern processors tend to have large branch penalty, Pbranch, due to many pipeline stages • Note that penalties have larger effect when CPIideal is low
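The formulas above can be evaluated directly. The numbers used below (branch frequency 0.2, three delay slots, miss rate 0.1, and the two ideal CPI values) are made-up illustration values, not measurements.

```python
def branch_penalty(n_delay_slots, miss_rate):
    """Pbranch = Ndelayslots x miss_rate"""
    return n_delay_slots * miss_rate


def cpi(cpi_ideal, f_branch, n_delay_slots, miss_rate):
    """CPI = CPIideal + fbranch x Pbranch"""
    return cpi_ideal + f_branch * branch_penalty(n_delay_slots, miss_rate)


def slowdown(cpi_ideal, f_branch, n_delay_slots, miss_rate):
    """Factor by which branches increase the execution time over the ideal case."""
    return cpi(cpi_ideal, f_branch, n_delay_slots, miss_rate) / cpi_ideal


# Hypothetical values: 20% branches, 3 delay slots, 10% mispredicted
print(slowdown(1.0, 0.2, 3, 0.1))   # single-issue pipeline, CPIideal = 1
print(slowdown(0.25, 0.2, 3, 0.1))  # 4-issue machine, CPIideal = 0.25
```

With these values the same branch penalty costs about 6% when CPIideal is 1 and about 24% when CPIideal is 0.25, which illustrates the note that penalties weigh more heavily when the ideal CPI is low.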

Scheduling control hazards What can we do about control hazards and CPI penalty? • Keep penalty Pbranch low: • Early computation of new PC • Early determination of condition • Visible branch delay slots filled by compiler (MIPS) • Branch prediction • Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro’95] • Remove branches: if-conversion • Conditional instructions: CMOVE, cond skip next • Guarding all instructions: TriMedia

Branch delay slot • Add a branch delay slot: • the next instruction after a branch is always executed • rely on compiler to “fill” the slot with something useful

Branch delay slot scheduling Q. What to put in the delay slot?
    op 1
    beq r1,r2, L
    .............
    op 2             ; 'fall-through'
    .............
    L: op 3          ; branch target
    .............

Branch prediction • Predict (not)taken schemes use fixed prediction • Can we remember (dynamically) branch directions? • 1-bit scheme • 2-bit schemes • multi-level branch predictors • hybrid schemes

1-bit prediction, using prediction buffer [Figure: the lower K bits of the branch address index a buffer of 2^K entries, each holding one prediction bit] • Problems • Aliasing: lower K bits of different branch instructions could be the same • Solution: Use tags; however very expensive • Loops are predicted wrong twice • Solution: Use n-bit saturating counter prediction • taken if counter ≥ 2^(n-1) • not-taken if counter < 2^(n-1) • A 2 bit saturating counter predicts a loop wrong only once
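The difference between the 1-bit and 2-bit schemes shows up in a small simulation of a single loop branch. The sketch below ignores the prediction buffer and aliasing and models one counter only; the initial counter value (fully saturated towards taken) and the loop shape (nine taken iterations followed by one exit, executed three times) are assumptions.

```python
def mispredictions(outcomes, n_bits):
    """Count wrong predictions of one n-bit saturating counter on a branch history."""
    max_count = (1 << n_bits) - 1
    threshold = 1 << (n_bits - 1)   # predict taken if counter >= 2^(n-1)
    counter = max_count             # assumed start state: strongly taken
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            wrong += 1
        # Saturating update: count up on taken, down on not taken.
        counter = min(counter + 1, max_count) if taken else max(counter - 1, 0)
    return wrong


# Loop branch: taken 9 times, then falls through; the loop runs 3 times.
LOOP_HISTORY = ([True] * 9 + [False]) * 3

print(mispredictions(LOOP_HISTORY, n_bits=1))
print(mispredictions(LOOP_HISTORY, n_bits=2))
```

After the first pass, the 1-bit scheme is wrong twice per loop execution (at the exit and again at the first iteration of the next execution), while the 2-bit counter is wrong only at the exit.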
