Chapter Six Pipelining

Chapter SixPipelining

P r o g r a m 2 4 6 8 1 0 1 2 1 4 1 6 1 8 e x e c u t i o n T i m e o r d e r ( i n i n s t r u c t i o n s ) I n s t r u c t i o n D a t a l w $ 1 , 1 0 0 ( $ 0 ) R e g A L U R e g f e t c h a c c e s s I n s t r u c t i o n D a t a l w $ 2 , 2 0 0 ( $ 0 ) R e g A L U R e g 8 n s f e t c h a c c e s s I n s t r u c t i o n l w $ 3 , 3 0 0 ( $ 0 ) 8 n s f e t c h . . . 8 n s P r o g r a m 1 4 2 4 6 8 1 0 1 2 e x e c u t i o n T i m e o r d e r ( i n i n s t r u c t i o n s ) I n s t r u c t i o n D a t a l w $ 1 , 1 0 0 ( $ 0 ) R e g A L U R e g f e t c h a c c e s s I n s t r u c t i o n D a t a l w $ 2 , 2 0 0 ( $ 0 ) R e g A L U R e g 2 n s f e t c h a c c e s s I n s t r u c t i o n D a t a l w $ 3 , 3 0 0 ( $ 0 ) 2 n s R e g A L U R e g f e t c h a c c e s s 2 n s 2 n s 2 n s 2 n s 2 n s Pipelining • Improve performance by increasing instruction throughput Ideal speedup is number of stages in the pipeline. Do we achieve this?

Pipelining • What makes it easy • all instructions are the same length • just a few instruction formats • memory operands appear only in loads and stores • What makes it hard? • structural hazards: suppose we had only one memory • control hazards: need to worry about branch instructions • data hazards: an instruction depends on a previous instruction • We’ll build a simple pipeline and look at these issues • We’ll talk about modern processors and what really makes it hard: • exception handling • trying to improve performance with out-of-order execution, etc.

I F : I n s t r u c t i o n f e t c h I D : I n s t r u c t i o n d e c o d e / E X : E x e c u t e / M E M : M e m o r y a c c e s s W B : W r i t e b a c k r e g i s t e r f i l e r e a d a d d r e s s c a l c u l a t i o n 0 M u x 1 A d d A d d 4 A d d r e s u l t S h i f t l e f t 2 R e a d r e g i s t e r 1 A d d r e s s P C R e a d d a t a 1 R e a d Z e r o r e g i s t e r 2 I n s t r u c t i o n R e g i s t e r s A L U R e a d A L U 0 R e a d W r i t e d a t a 2 r e s u l t A d d r e s s 1 d a t a r e g i s t e r M I n s t r u c t i o n M u D a t a u m e m o r y W r i t e x m e m o r y x d a t a 1 0 W r i t e d a t a 1 6 3 2 S i g n e x t e n d Basic Idea • What do we need to add to actually split the datapath into stages?

0 M u x 1 I F / I D I D / E X E X / M E M M E M / W B A d d A d d 4 A d d r e s u l t S h i f t l e f t 2 n R e a d o i r e g i s t e r 1 t A d d r e s s P C c R e a d u r d a t a 1 t R e a d s n Z e r o I r e g i s t e r 2 I n s t r u c t i o n R e g i s t e r s A L U R e a d A L U m e m o r y 0 R e a d W r i t e A d d r e s s d a t a 2 1 r e s u l t d a t a r e g i s t e r M M u D a t a u W r i t e x m e m o r y x d a t a 1 0 W r i t e d a t a 1 6 3 2 S i g n e x t e n d Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

0 M u x 1 I F / I D I D / E X E X / M E M M E M / W B A d d A d d 4 A d d r e s u l t S h i f t l e f t 2 n R e a d o i t r e g i s t e r 1 A d d r e s s P C c R e a d u r d a t a 1 t s R e a d n Z e r o I r e g i s t e r 2 I n s t r u c t i o n R e g i s t e r s A L U R e a d A L U m e m o r y 0 R e a d W r i t e A d d r e s s d a t a 2 r e s u l t 1 d a t a r e g i s t e r M M D a t a u u W r i t e x m e m o r y x d a t a 1 0 W r i t e d a t a 1 6 3 2 S i g n e x t e n d Corrected Datapath

T i m e ( i n c l o c k c y c l e s ) P r o g r a m C C 1 C C 2 C C 3 C C 4 C C 5 C C 6 e x e c u t i o n o r d e r ( i n i n s t r u c t i o n s ) l w $ 1 0 , 2 0 ( $ 1 ) I M R e g A L U D M R e g s u b $ 1 1 , $ 2 , $ 3 I M R e g D M R e g A L U Graphically Representing Pipelines • Can help with answering questions like: • how many cycles does it take to execute this code? • what is the ALU doing during cycle 4? • use this representation to help understand datapaths

P C S r c 0 M u x 1 I F / I D I D / E X E X / M E M M E M / W B A d d A d d A d d 4 r e s u l t B r a n c h S h i f t R e g W r i t e l e f t 2 n R e a d M e m W r i t e o i t r e g i s t e r 1 c P C A d d r e s s R e a d u r t d a t a 1 s R e a d A L U S r c n M e m t o R e g I Z e r o Z e r o r e g i s t e r 2 I n s t r u c t i o n R e g i s t e r s A L U R e a d A L U m e m o r y 0 R e a d W r i t e d a t a 2 A d d r e s s r e s u l t 1 r e g i s t e r M d a t a M u D a t a u W r i t e x m e m o r y x d a t a 1 0 W r i t e d a t a I n s t r u c t i o n 6 [ 1 5 – 0 ] 1 6 3 2 S i g n A L U e x t e n d M e m R e a d c o n t r o l I n s t r u c t i o n [ 2 0 – 1 6 ] 0 M A L U O p I n s t r u c t i o n u [ 1 5 – 1 1 ] x 1 R e g D s t Pipeline Control

Pipeline control • We have 5 stages. What needs to be controlled in each stage? • Instruction Fetch and PC Increment • Instruction Decode / Register Fetch • Execution • Memory Stage • Write Back • How would control be handled in an automobile plant? • a fancy control center telling everyone what to do? • should we use a finite state machine?

W B I n s t r u c t i o n M W B C o n t r o l E X M W B I F / I D I D / E X E X / M E M M E M / W B Pipeline Control • Pass control signals along just like the data

P C S r c I D / E X 0 M W B u E X / M E M x 1 C o n t r o l M W B M E M / W B E X M W B I F / I D A d d A d d 4 A d d e r e s u l t t i r B r a n c h W S h i f t g e e t l e f t 2 i R r W A L U S r c m g n e R e a d e o i M R t r e g i s t e r 1 A d d r e s s P C c o R e a d t u r m d a t a 1 t s R e a d e n M Z e r o I r e g i s t e r 2 I n s t r u c t i o n R e g i s t e r s A L U R e a d A L U m e m o r y 0 R e a d W r i t e d a t a 2 A d d r e s s r e s u l t 1 d a t a r e g i s t e r M M D a t a u u m e m o r y W r i t e x x d a t a 1 0 W r i t e d a t a I n s t r u c t i o n 1 6 3 2 6 [ 1 5 – 0 ] S i g n A L U M e m R e a d e x t e n d c o n t r o l I n s t r u c t i o n [ 2 0 – 1 6 ] 0 A L U O p M u I n s t r u c t i o n x [ 1 5 – 1 1 ] 1 R e g D s t Datapath with Control

T i m e ( i n c l o c k c y c l e s ) C C 1 C C 2 C C 3 C C 4 C C 5 C C 6 C C 7 C C 8 C C 9 V a l u e o f r e g i s t e r $ 2 : 1 0 1 0 1 0 1 0 1 0 / – 2 0 – 2 0 – 2 0 – 2 0 – 2 0 P r o g r a m e x e c u t i o n o r d e r ( i n i n s t r u c t i o n s ) R e g s u b $ 2 , $ 1 , $ 3 I M R e g D M a n d $ 1 2 , $ 2 , $ 5 I M D M R e g R e g I M D M R e g o r $ 1 3 , $ 6 , $ 2 R e g a d d $ 1 4 , $ 2 , $ 2 I M D M R e g R e g s w $ 1 5 , 1 0 0 ( $ 2 ) I M D M R e g R e g Dependencies • Problem with starting next instruction before first is finished • dependencies that “go backward in time” are data hazards

Software Solution • Have compiler guarantee no hazards • Where do we insert the “nops” ? sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) • Problem: this really slows us down!

T i m e ( i n c l o c k c y c l e s ) C C 1 C C 2 C C 3 C C 4 C C 5 C C 6 C C 7 C C 8 C C 9 V a l u e o f r e g i s t e r $ 2 : 1 0 1 0 1 0 1 0 1 0 / – 2 0 – 2 0 – 2 0 – 2 0 – 2 0 V a l u e o f E X / M E M : X X X – 2 0 X X X X X V a l u e o f M E M / W B : X X X X – 2 0 X X X X P r o g r a m e x e c u t i o n o r d e r ( i n i n s t r u c t i o n s ) s u b $ 2 , $ 1 , $ 3 I M R e g D M R e g a n d $ 1 2 , $ 2 , $ 5 I M R e g D M R e g o r $ 1 3 , $ 6 , $ 2 I M R e g D M R e g a d d $ 1 4 , $ 2 , $ 2 I M R e g D M R e g s w $ 1 5 , 1 0 0 ( $ 2 ) I M R e g D M R e g what if this $2 was $13? Forwarding • Use temporary results, don’t wait for them to be written • register file forwarding to handle read/write to same register • ALU forwarding

I D / E X W B E X / M E M M W B C o n t r o l M E M / W B E X M W B I F / I D M n o u i t c x u r t s R e g i s t e r s n D a t a I I n s t r u c t i o n A L U P C m e m o r y M m e m o r y u x M u x I F / I D . R e g i s t e r R s R s I F / I D . R e g i s t e r R t R t I F / I D . R e g i s t e r R t R t M E X / M E M . R e g i s t e r R d u I F / I D . R e g i s t e r R d R d x F o r w a r d i n g M E M / W B . R e g i s t e r R d u n i t Forwarding

T i m e ( i n c l o c k c y c l e s ) P r o g r a m C C 1 C C 2 C C 3 C C 4 C C 5 C C 6 C C 7 C C 8 C C 9 e x e c u t i o n o r d e r ( i n i n s t r u c t i o n s ) R e g l w $ 2 , 2 0 ( $ 1 ) I M D M R e g a n d $ 4 , $ 2 , $ 5 I M R e g D M R e g o r $ 8 , $ 2 , $ 6 I M R e g D M R e g a d d $ 9 , $ 4 , $ 2 I M R e g D M R e g s l t $ 1 , $ 6 , $ 7 I M D M R e g R e g Can't always forward • Load word can still cause a hazard: • an instruction tries to read a register following a load instruction that writes to the same register. • Thus, we need a hazard detection unit to “stall” the load instruction

P r o g r a m T i m e ( i n c l o c k c y c l e s ) e x e c u t i o n C C 1 C C 2 C C 3 C C 4 C C 5 C C 6 C C 7 C C 8 C C 9 C C 1 0 o r d e r ( i n i n s t r u c t i o n s ) R e g D M R e g I M l w $ 2 , 2 0 ( $ 1 ) R e g D M I M R e g R e g a n d $ 4 , $ 2 , $ 5 R e g o r $ 8 , $ 2 , $ 6 D M R e g I M I M b u b b l e a d d $ 9 , $ 4 , $ 2 R e g I M D M R e g s l t $ 1 , $ 6 , $ 7 R e g D M I M R e g Stalling • We can stall the pipeline by keeping an instruction in the same stage

I D / E X . M e m R e a d H a z a r d d e t e c t i o n u n i t I D / E X e W B t i E X / M E M r W D M I / F C o n t r o l u M W B I M E M / W B x 0 E X M W B I F / I D e t i r W M n C o P u i t c x u r t s R e g i s t e r s n I D a t a I n s t r u c t i o n A L U P C m e m o r y M m e m o r y u x M u x I F / I D . R e g i s t e r R s I F / I D . R e g i s t e r R t R t I F / I D . R e g i s t e r R t M E X / M E M . R e g i s t e r R d u I F / I D . R e g i s t e r R d R d x I D / E X . R e g i s t e r R t R s F o r w a r d i n g M E M / W B . R e g i s t e r R d u n i t R t Hazard Detection Unit • Stall by letting an instruction that won’t write anything go forward

T i m e ( i n c l o c k c y c l e s ) P r o g r a m e x e c u t i o n C C 1 C C 2 C C 3 C C 4 C C 5 C C 6 C C 7 C C 8 C C 9 o r d e r ( i n i n s t r u c t i o n s ) 4 0 b e q $ 1 , $ 3 , 7 I M R e g D M R e g 4 4 a n d $ 1 2 , $ 2 , $ 5 I M R e g D M R e g 4 8 o r $ 1 3 , $ 6 , $ 2 I M R e g D M R e g 5 2 a d d $ 1 4 , $ 2 , $ 2 I M R e g D M R e g 7 2 l w $ 4 , 5 0 ( $ 7 ) R e g D M R e g I M Branch Hazards • When we decide to branch, other instructions are in the pipeline! • We are predicting “branch not taken” • need to add hardware for flushing instructions if we are wrong

I F . F l u s h H a z a r d d e t e c t i o n u n i t I D / E X M u x W B E X / M E M M u M W B C o n t r o l M E M / W B x 0 E X M W B I F / I D 4 S h i f t l e f t 2 M u x = R e g i s t e r s D a t a I n s t r u c t i o n A L U P C m e m o r y M m e m o r y u x M u x S i g n e x t e n d M u x F o r w a r d i n g u n i t Flushing Instructions

Improving Performance • Try and avoid stalls! E.g., reorder these instructions: lw $t0, 0($t1) lw $t2, 4($t1) sw $t2, 0($t1) sw $t0, 4($t1) • Add a “branch delay slot” • the next instruction after a branch is always executed • rely on compiler to “fill” the slot with something useful • Superscalar: start more than one instruction in the same cycle

Dynamic Scheduling • The hardware performs the “scheduling” • hardware tries to find instructions to execute • out of order execution is possible • speculative execution and dynamic branch prediction • All modern processors are very complicated • DEC Alpha 21264: 9 stage pipeline, 6 instruction issue • PowerPC and Pentium: branch history table • Compiler technology important • This class has given you the background you need to learn more

Chapter Six Pipelining

Chapter Six Pipelining

Presentation Transcript

CHAPTER SIX

CHAPTER SIX

Chapter 3 Pipelining

Chapter 8. Pipelining

Chapter Six

CHAPTER SIX

Chapter Six

Chapter 8. Pipelining

Pipelining (Chapter 8)

Chapter Six

Pipelining (Chapter 8)

Chapter Six

Chapter 6: Pipelining

Pipelining Chapter 6

Chapter Six

Chapter 8. Pipelining

Chapter 3 Pipelining

Chapter Six Enhancing Performance with Pipelining

Chapter Six