Multicycle Approach in Processor Datapath and Control

Chapter FiveThe Processor: Datapath and Control(Parte B: multiciclo)

I n s t r u c t i o n r e g i s t e r D a t a P C A d d r e s s A R e g i s t e r # I n s t r u c t i o n A L U A L U O u t M e m o r y R e g i s t e r s o r d a t a R e g i s t e r # M e m o r y d a t a B D a t a r e g i s t e r R e g i s t e r # Multicycle Approach • Break up the instructions into steps, each step takes a cycle • balance the amount of work to be done • restrict each cycle to use only one major functional unit • 1 ALU, 1 Memória, 1 Banco de Registradores • At the end of a cycle • store values for use in later cycles (easiest thing to do) • introduce additional “internal” registers

P C 0 0 I n s t r u c t i o n R e a d M M A d d r e s s [ 2 5 – 2 1 ] r e g i s t e r 1 u u x x R e a d A I n s t r u c t i o n R e a d Z e r o M e m o r y 1 d a t a 1 1 [ 2 0 – 1 6 ] r e g i s t e r 2 A L U A L U A L U O u t 0 M e m D a t a R e g i s t e r s r e s u l t I n s t r u c t i o n W r i t e M R e a d [ 1 5 – 0 ] r e g i s t e r B u 0 d a t a 2 I n s t r u c t i o n W r i t e x [ 1 5 – 1 1 ] M I n s t r u c t i o n 4 1 W r i t e d a t a 1 u r e g i s t e r d a t a 2 x 0 I n s t r u c t i o n 3 [ 1 5 – 0 ] M u x M e m o r y 1 1 6 3 2 d a t a S h i f t S i g n r e g i s t e r l e f t 2 e x t e n d Multicycle Approach • IR (Instruction Register) e MDR (Memory Data Register) salvam saída da memória • Registradores A e B salvam saída do banco de registradores • ALUout salva saída da ALU • Todos, exceto IR, guardam dados por um clock Þ controle de escrita é desnecessário • novos MUXes: endereço da memória, operandos da ALU

I o r D M e m R e a d M e m W r i t e I R W r i t e R e g D s t R e g W r i t e A L U S r c A P C 0 0 I n s t r u c t i o n R e a d M M [ 2 5 – 2 1 ] r e g i s t e r 1 A d d r e s s u u x R e a d x A I n s t r u c t i o n R e a d Z e r o d a t a 1 M e m o r y 1 1 [ 2 0 – 1 6 ] r e g i s t e r 2 A L U A L U A L U O u t 0 M e m D a t a R e g i s t e r s r e s u l t I n s t r u c t i o n W r i t e R e a d M B [ 1 5 – 0 ] r e g i s t e r d a t a 2 0 u I n s t r u c t i o n W r i t e x M [ 1 5 – 1 1 ] I n s t r u c t i o n 4 1 W r i t e d a t a 1 u r e g i s t e r d a t a 2 x 0 I n s t r u c t i o n 3 [ 1 5 – 0 ] M u x M e m o r y 1 1 6 3 2 d a t a A L U S h i f t S i g n r e g i s t e r c o n t r o l l e f t 2 e x t e n d I n s t r u c t i o n [ 5 – 0 ] M e m t o R e g A L U S r c B A L U O p Via de dados multiciclo e sinais de controle Falta implementar 3 possíveis fontes de carga para PC: PC+4, beq e j

P C W r i t e C o n d P C S o u r c e P C W r i t e A L U O p O u t p u t s I o r D A L U S r c B M e m R e a d A L U S r c A C o n t r o l M e m W r i t e R e g W r i t e M e m t o R e g O p R e g D s t I R W r i t e [ 5 – 0 ] 0 M 1 u J u m p 2 6 2 8 x I n s t r u c t i o n [ 2 5 – 0 ] a d d r e s s [ 3 1 - 0 ] S h i f t 2 l e f t 2 I n s t r u c t i o n [ 3 1 - 2 6 ] P C [ 3 1 - 2 8 ] P C 0 0 I n s t r u c t i o n R e a d M M [ 2 5 – 2 1 ] r e g i s t e r 1 u A d d r e s s u x x R e a d A I n s t r u c t i o n R e a d Z e r o M e m o r y 1 d a t a 1 1 [ 2 0 – 1 6 ] r e g i s t e r 2 A L U A L U A L U O u t 0 M e m D a t a R e g i s t e r s r e s u l t I n s t r u c t i o n W r i t e M R e a d B [ 1 5 – 0 ] r e g i s t e r u 0 I n s t r u c t i o n d a t a 2 W r i t e x [ 1 5 – 1 1 ] M I n s t r u c t i o n 4 1 W r i t e d a t a 1 u r e g i s t e r d a t a 2 x 0 I n s t r u c t i o n 3 [ 1 5 – 0 ] M u x M e m o r y 1 1 6 3 2 d a t a A L U S h i f t S i g n r e g i s t e r c o n t r o l l e f t 2 x t e n d e I n s t r u c t i o n [ 5 – 0 ] Via de dados completa

Sinais de controle (1 bit)

Sinais de controle (2 bits)

Five Execution Steps • Instruction Fetch • Instruction Decode and Register Fetch • Execution, Memory Address Computation, or Branch Completion • Memory Access or R-type instruction completion • Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Step 1: Instruction Fetch • Use PC to get instruction and put it in the Instruction Register. • Increment the PC by 4 and put the result back in the PC. • Can be described succinctly using RTL "Register-Transfer Language" IR = Memory[PC]; PC = PC + 4;Can we figure out the values of the control signals?What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch • Read registers rs and rt in case we need them • Compute the branch address in case the instruction is a branch • RTL: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2); • We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Step 3 (instruction dependent) • ALU is performing one of three functions, based on instruction type • Memory Reference: ALUOut = A + sign-extend(IR[15-0]); • R-type: ALUOut = A op B; • Branch: if (A==B) PC = ALUOut;

Step 4 (R-type or memory-access) • Loads and stores access memory MDR = Memory[ALUOut]; or Memory[ALUOut] = B; • R-type instructions finish Reg[IR[15-11]] = ALUOut;The write actually takes place at the end of the cycle on the edge

Write-back step • Reg[IR[20-16]]= MDR; What about all the other instructions?

Summary:

0 IR=Mem[PC] PC=PC+4 1 A=Reg[IR[25:21]] B=Reg[IR[20:16]] ALUout=PC+(SignExt(IR(15:0))<<2) 6 2 8 9 ALUout=A op B ALUout=A+SignExt(IR(15:0)) If A==B PC=ALUout PC=PC[31:28]& IR[25:0]<<2 7 3 5 Reg[IR[15:11]]=ALUout MDR=Mem(ALUout) Mem(ALUout)=B 4 Reg[IR[15:11]]=MDR Outra visão, fluxograma

Simple Questions • How many cycles will it take to execute this code? lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, Label #assume not add $t5, $t2, $t3 sw $t5, 8($t3)Label: ... • What is going on during the 8th cycle of execution? • In what cycle does the actual addition of $t2 and $t3 takes place?

Implementing the Control • Value of control signals is dependent upon: • what instruction is being executed • which step is being performed • Use the information we’ve acculumated to specify a finite state machine • specify the finite state machine graphically, or • use microprogramming • Implementation can be derived from specification

Tabela dos sinais de controle

I n s t r u c t i o n d e c o d e / I n s t r u c t i o n f e t c h r e g i s t e r f e t c h 0 M e m R e a d 1 A L U S r c A = 0 I o r D = 0 A L U S r c A = 0 I R W r i t e A L U S r c B = 1 1 S t a r t A L U S r c B = 0 1 A L U O p = 0 0 A L U O p = 0 0 P C W r i t e P C S o u r c e = 0 0 ) ) ' e Q ) p y ' t E - J R B ' = ' = p = O ( ) p p ' W M e m o r y a d d r e s s S O ' O B r a n c h = J u m p ( p ( O c o m p u t a t i o n ( r o E x e c u t i o n c o m p l e t i o n c o m p l e t i o n ) ' W L ' = p O 2 6 8 9 ( A L U S r c A = 1 A L U S r c A = 1 A L U S r c B = 0 0 A L U S r c A = 1 P C W r i t e A L U S r c B = 1 0 A L U O p = 0 1 A L U S r c B = 0 0 P C S o u r c e = 1 0 A L U O p = 0 0 P C W r i t e C o n d A L U O p = 1 0 P C S o u r c e = 0 1 ( O ) ' p W = ' L S ' W = ' ) p M e m o r y M e m o r y O ( a c c e s s a c c e s s R - t y p e c o m p l e t i o n 3 5 7 R e g D s t = 1 M e m R e a d M e m W r i t e R e g W r i t e I o r D = 1 I o r D = 1 M e m t o R e g = 0 W r i t e - b a c k s t e p 4 R e g D s t = 0 R e g W r i t e M e m t o R e g = 1 Graphical Specification of FSM • How many state bits will we need?

P C W r i t e P C W r i t e C o n d I o r D M e m R e a d M e m W r i t e I R W r i t e C o n t r o l l o g i c M e m t o R e g P C S o u r c e A L U O p O u t p u t s A L U S r c B A L U S r c A R e g W r i t e R e g D s t N S 3 N S 2 N S 1 I n p u t s N S 0 5 4 3 2 1 0 p p p p p p 3 2 1 0 O O O O O O S S S S I n s t r u c t i o n r e g i s t e r S t a t e r e g i s t e r o p c o d e f i e l d Finite State Machine for Control

Desempenho desta máquina para o gcc • CPI = 0,23*5 + 0,13*4 + 0,43*4 + 0,19*3 + 0,02*3 = 4,02 • Melhor do que se todas as instruções tomassem 5 ciclos • Melhoria em MIPS (supor ck = 100 MHz) • CPI = 4 -> 25 MIPS • CPI = 5 -> 20 MIPS • CPI = 1 (fazer tudo em um ciclo) -> 100 MIPS

PLA Implementation • If I picked a horizontal or vertical line could you explain it?

m n ROM Implementation • ROM = "Read Only Memory" • values of memory locations are fixed ahead of time • A ROM can be used to implement a truth table • if the address is m-bits, we can address 2m entries in the ROM. • our outputs are the bits of data that the address points to.m is the "heigth", and n is the "width" 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1

ROM Implementation • How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 210 = 1024 different addresses) • How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs • ROM is 210 x 20 = 1K x 20 bits (and a rather unusual size) • Rather wasteful, since for lots of the entries, the outputs are the same — i.e., opcode is often ignored

ROM vs PLA • Break up the table into two parts — 4 state bits tell you the 16 outputs, 24 x 16 bits of ROM — 10 bits tell you the 4 next state bits, 210 x 4 bits of ROM — Total: 4.3K bits of ROM • PLA is much smaller — can share product terms — only need entries that produce an active output — can take into account don't cares • Size is (#inputs  #product-terms) + (#outputs  #product-terms) For this example = (10x17)+(20x17) = 460 PLA cells • PLA cells usually about the size of a ROM cell (slightly bigger)

Multicycle Approach in Processor Datapath and Control