EE204 Computer Architecture

EE204 Computer Architecture: Single-Cycle Datapath Performance, Hina Anwar Khan, 2011

Performance of Single-Cycle Machines
• Assume the following operation times: memory 2 nanoseconds (ns), ALU and adders 2 ns, register file 1 ns. MUXes, control, sign-extension, PC accesses, and wires are assumed to have no delay.
• Which implementation is faster?
  1. Every instruction executes in one clock cycle of fixed length.
  2. Every instruction executes in one clock cycle of varying length.
• Let's look at the time needed by each instruction class (all values in ns):

Inst.    Fetch  Reg. Rd  ALU op  Memory  Reg. Wr  Total
R-type   2      1        2       0       1        6
Load     2      1        2       2       1        8
Store    2      1        2       2       0        7
Branch   2      1        2       0       0        5
Jump     2      0        0       0       0        2

Hina Anwar Khan, Spring 2011
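The totals in the table are just the sum of the delays of the units each instruction class passes through. A minimal sketch of that arithmetic follows; the names `UNIT_DELAY_NS`, `UNITS_USED`, and `latency_ns` are illustrative and not from the slides.

```python
# Unit delays from the slide, in ns. Muxes, control, sign-extension,
# PC access and wires are assumed to have zero delay.
UNIT_DELAY_NS = {"mem": 2, "alu": 2, "reg": 1}

# Functional units used by each instruction class, in order of use.
UNITS_USED = {
    "R-type": ["mem", "reg", "alu", "reg"],
    "load":   ["mem", "reg", "alu", "mem", "reg"],
    "store":  ["mem", "reg", "alu", "mem"],
    "branch": ["mem", "reg", "alu"],
    "jump":   ["mem"],
}

def latency_ns(kind):
    """Total latency of one instruction class: sum of its unit delays."""
    return sum(UNIT_DELAY_NS[unit] for unit in UNITS_USED[kind])

latencies = {kind: latency_ns(kind) for kind in UNITS_USED}
print(latencies)
```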

Fixed vs. Variable Cycle Length
• Let's assume a program has the following instruction mix: 24% loads, 12% stores, 44% R-type, 18% branches, 2% jumps.
• For the fixed cycle length, the cycle time is 8 ns, long enough for the longest instruction (load). Thus each instruction takes 8 ns to execute.
• For the variable cycle length, the average CPU clock cycle is: 8*24% + 7*12% + 6*44% + 5*18% + 2*2% = 6.34 ns ≈ 6.3 ns
• The variable-clock implementation is 8/6.3 ≈ 1.27 times faster, but it is extremely hard to implement.
• Once instructions such as multiply and divide are added, which take many times longer than a load, the fixed-length single-cycle scheme becomes too slow, because every instruction must pay for the longest one.
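The weighted average and the speedup can be checked with a few lines of arithmetic. This is a sketch using the mix and latencies from the slides; the variable names are illustrative. Without intermediate rounding the average is 6.34 ns and the ratio is about 1.26; rounding the average to 6.3 ns first gives the 1.27 quoted on the slide.

```python
# Instruction mix and per-class latencies (ns) from the slides.
MIX = {"load": 0.24, "store": 0.12, "R-type": 0.44, "branch": 0.18, "jump": 0.02}
LATENCY_NS = {"load": 8, "store": 7, "R-type": 6, "branch": 5, "jump": 2}

# Fixed clock: every instruction takes as long as the slowest one.
fixed_cycle_ns = max(LATENCY_NS.values())

# Variable clock: average cycle is the mix-weighted mean latency.
variable_cycle_ns = sum(MIX[kind] * LATENCY_NS[kind] for kind in MIX)

speedup = fixed_cycle_ns / variable_cycle_ns
print(f"fixed = {fixed_cycle_ns} ns, "
      f"variable = {variable_cycle_ns:.2f} ns, "
      f"speedup = {speedup:.2f}")
```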

Observations on the Single-Cycle Design
• The single-cycle datapath is straightforward, but...
  • It has to use three separate arithmetic units (one ALU and two adders)
  • It has separate instruction and data memories
  • Cycle time is determined by the worst-case path
• A multi-cycle datapath might be better
  • We can reuse some of the hardware
  • We can combine the memories
  • Cycle time is still constant, but instructions may take differing numbers of cycles

Multi-Cycle Implementation
• Each step in execution takes one clock cycle
• Different instructions take different numbers of clock cycles
• A functional unit can be used more than once per instruction, as long as it is used on different clock cycles
• This reduces the hardware by sharing functional units

Multicycle Datapath
[Figure: multicycle datapath; labels: Single Instruction & Data Memory, Single ALU, Registers]

Multicycle Execution
• Instruction Register (IR)
  • Holds the instruction until the end of execution
• Memory Data Register (MDR)
• A Register
• B Register
• ALUOut Register

Multicycle Datapath
[Figure: multicycle datapath; labels: Branch target address, Address, Register Block, Inst/Data Memory, Instruction, PC = PC + 4, ALU, Data, Arithmetic/branch instruction, lw/sw instruction]

Multicycle Datapath

Multicycle Datapath & Control Signals

One Single ALU
• One single ALU is used to perform all of the necessary functions:
  • Perform an arithmetic operation on two register operands
  • Add a register to a sign-extended constant, to compute memory addresses in lw/sw instructions
  • Compute PC + 4 to increment the PC
  • Add a sign-extended, shifted offset to (PC + 4) for branches

Implications of Shared Functional Units
• Need to add multiplexers or expand existing multiplexers
  • e.g. the memory unit now contains both instructions (address in PC) and data (address in ALUOut)
  • e.g. the ALU must now accommodate all inputs of the previous ALU and adders

Two Extra Multiplexers
• To enable all the actions listed for the ALU, two extra multiplexers are needed
• A 2-to-1 mux, ALUSrcA, selects whether the first ALU input is the PC or a register
• A 4-to-1 mux, ALUSrcB, selects the second ALU input from among
  • the register file
  • a constant 4
  • a sign-extended constant, and
  • a sign-extended and shifted constant
• These two select lines are labeled ALUSelA and ALUSelB in the control-signal figures
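The two operand multiplexers can be modeled as a small selection function. This is a sketch, not the slides' own notation: the function names are illustrative, and the select encodings (0 = PC / 1 = register for the first input; 0..3 in the order of the list above for the second) are an assumption that matches the ALUSelA=0, ALUSelB=1 setting shown for PC + 4 on the instruction-fetch slide.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def alu_operands(alu_src_a, alu_src_b, pc, a, b, imm16):
    """Return the two ALU inputs chosen by the mux select lines.

    Assumed encoding:
      alu_src_a: 0 -> PC, 1 -> register A
      alu_src_b: 0 -> register B, 1 -> constant 4,
                 2 -> sign-extended immediate,
                 3 -> sign-extended immediate shifted left by 2
    """
    first = pc if alu_src_a == 0 else a
    extended = sign_extend16(imm16)
    second = (b, 4, extended, extended << 2)[alu_src_b]
    return first, second

# PC + 4 during instruction fetch: first input is the PC, second is 4.
print(alu_operands(0, 1, pc=0x100, a=7, b=9, imm16=0xFFFC))
```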

One Single Memory
• One single memory is used in both the instruction fetch and data access stages.
• The address for this memory may come from:
  • the PC register, when fetching an instruction
  • the ALU output, when executing a lw/sw instruction that needs the effective memory address
• => Add a 2-to-1 mux, IorD, to select whether the memory is being accessed for instructions or for data.
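A minimal sketch of the IorD address selection, assuming IorD = 0 means "instruction" (as in the instruction-fetch control settings) and IorD = 1 means "data". The dictionary-based memory and its contents are hypothetical, chosen only to show both uses of the same memory.

```python
def memory_address(i_or_d, pc, alu_out):
    """IorD = 0 selects the PC (instruction fetch);
    IorD = 1 selects ALUOut (lw/sw data access)."""
    return pc if i_or_d == 0 else alu_out

# Hypothetical unified memory: byte address -> 32-bit word.
memory = {0x00: 0x8C090004, 0x24: 1234}

# Same memory, two different address sources.
instruction = memory[memory_address(0, pc=0x00, alu_out=0x24)]
data = memory[memory_address(1, pc=0x00, alu_out=0x24)]
print(hex(instruction), data)
```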

Breaking Instructions into Clock Cycles
• Goal: balance the amount of work done in each cycle so that we can minimize the clock period.
• Restrict each step to contain at most one of the following:
  • ALU operation
  • Register file access
  • Memory access
• The clock cycle time will be the longest of the above operations.
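As a quick check of the last point, the clock period is the maximum of the three step delays. The sketch below reuses the unit delays assumed on the single-cycle performance slide (memory 2 ns, ALU 2 ns, register file 1 ns); the names are illustrative.

```python
# Delay of the one operation allowed in each step, in ns
# (values taken from the single-cycle performance slide).
STEP_DELAY_NS = {
    "ALU operation": 2,
    "register file access": 1,
    "memory access": 2,
}

# The clock must be long enough for the slowest single step.
clock_period_ns = max(STEP_DELAY_NS.values())
print(f"multicycle clock period = {clock_period_ns} ns")
```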

Complete Multicycle Datapath

Arithmetic Instruction Steps
• Instruction Fetch
  • IR = Mem[PC]
  • PC = PC + 4
• Instruction Decode
  • A = Reg[IR[25-21]]
  • B = Reg[IR[20-16]]
• Instruction Execution
  • ALUOut = A op B
• Store Result
  • Reg[IR[15-11]] = ALUOut
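The four steps can be traced in software. This is a sketch under stated assumptions: `run_r_type` is a hypothetical helper, memory is modeled as a dictionary of words, and the ALU operation is passed in as a Python function rather than decoded from the funct field.

```python
def run_r_type(pc, regs, mem, op):
    """Execute one R-type instruction step by step; returns the new PC."""
    # Step 1: instruction fetch
    ir = mem[pc]
    pc = pc + 4
    # Step 2: instruction decode / register fetch
    a = regs[(ir >> 21) & 0x1F]          # IR[25-21]
    b = regs[(ir >> 16) & 0x1F]          # IR[20-16]
    # Step 3: execution
    alu_out = op(a, b) & 0xFFFFFFFF
    # Step 4: store result
    regs[(ir >> 11) & 0x1F] = alu_out    # IR[15-11]
    return pc

# add $3, $1, $2 (opcode 0, rs=1, rt=2, rd=3, funct 0x20)
add_instr = (1 << 21) | (2 << 16) | (3 << 11) | 0x20
regs = [0] * 32
regs[1], regs[2] = 5, 7
new_pc = run_r_type(0, regs, {0: add_instr}, lambda a, b: a + b)
print(new_pc, regs[3])
```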

lw Instruction Steps
• Instruction Fetch
  • IR = Mem[PC]
  • PC = PC + 4
• Instruction Decode
  • A = Reg[IR[25-21]]
• Address Calculation
  • ALUOut = A + sign-extend(IR[15-0])
• Memory Access
  • MDR = Mem[ALUOut]
• Memory Read Completion
  • Reg[IR[20-16]] = MDR
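The five lw steps can be traced the same way. In this sketch `run_lw` and `sign_extend16` are hypothetical helpers, and the single dictionary `mem` stands in for the shared instruction/data memory.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def run_lw(pc, regs, mem):
    """Execute one lw instruction step by step; returns the new PC."""
    ir = mem[pc]                                   # 1. instruction fetch
    pc = pc + 4
    a = regs[(ir >> 21) & 0x1F]                    # 2. decode: A = Reg[IR[25-21]]
    alu_out = (a + sign_extend16(ir)) & 0xFFFFFFFF # 3. address calculation
    mdr = mem[alu_out]                             # 4. memory access
    regs[(ir >> 16) & 0x1F] = mdr                  # 5. memory read completion
    return pc

# lw $2, 8($1) (opcode 0x23, rs=1, rt=2, immediate 8)
lw_instr = (0x23 << 26) | (1 << 21) | (2 << 16) | 8
regs = [0] * 32
regs[1] = 0x100
mem = {0: lw_instr, 0x108: 42}
new_pc = run_lw(0, regs, mem)
print(new_pc, regs[2])
```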

sw Instruction Steps
• Instruction Fetch
  • IR = Mem[PC]
  • PC = PC + 4
• Instruction Decode
  • A = Reg[IR[25-21]]
  • B = Reg[IR[20-16]]
• Address Calculation
  • ALUOut = A + sign-extend(IR[15-0])
• Memory Write Completion
  • Mem[ALUOut] = B
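A sketch of the four sw steps, this time with a negative offset to exercise the sign extension. `run_sw` and `sign_extend16` are hypothetical helpers, and memory is again modeled as a dictionary.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def run_sw(pc, regs, mem):
    """Execute one sw instruction step by step; returns the new PC."""
    ir = mem[pc]                                    # 1. instruction fetch
    pc = pc + 4
    a = regs[(ir >> 21) & 0x1F]                     # 2. decode: base register
    b = regs[(ir >> 16) & 0x1F]                     #    and value to store
    alu_out = (a + sign_extend16(ir)) & 0xFFFFFFFF  # 3. address calculation
    mem[alu_out] = b                                # 4. memory write completion
    return pc

# sw $2, -4($1) (opcode 0x2B, rs=1, rt=2, immediate 0xFFFC = -4)
sw_instr = (0x2B << 26) | (1 << 21) | (2 << 16) | 0xFFFC
regs = [0] * 32
regs[1], regs[2] = 0x100, 99
mem = {0: sw_instr}
new_pc = run_sw(0, regs, mem)
print(new_pc, mem[0xFC])
```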

Branch Instruction Steps
• Instruction Fetch
  • IR = Mem[PC]
  • PC = PC + 4
• Instruction Decode
  • A = Reg[IR[25-21]]
  • B = Reg[IR[20-16]]
  • ALUOut = PC + (sign-extend(IR[15-0]) << 2)
• Branch Execution
  • If (A == B) PC = ALUOut
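Note that the branch target is computed from the already-incremented PC. The sketch below makes that explicit; `run_beq` and `sign_extend16` are hypothetical helpers and the example addresses are made up.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def run_beq(pc, regs, mem):
    """Execute one beq instruction step by step; returns the new PC."""
    ir = mem[pc]                                   # 1. instruction fetch
    pc = pc + 4
    a = regs[(ir >> 21) & 0x1F]                    # 2. decode
    b = regs[(ir >> 16) & 0x1F]
    alu_out = (pc + (sign_extend16(ir) << 2)) & 0xFFFFFFFF  # target from PC + 4
    if a == b:                                     # 3. branch execution
        pc = alu_out
    return pc

# beq $1, $2, 3 (opcode 4, rs=1, rt=2, word offset 3) placed at address 0x40
beq_instr = (4 << 26) | (1 << 21) | (2 << 16) | 3
mem = {0x40: beq_instr}
equal = [0] * 32
different = [0] * 32
different[1] = 1
taken_pc = run_beq(0x40, equal, mem)
fallthrough_pc = run_beq(0x40, different, mem)
print(hex(taken_pc), hex(fallthrough_pc))
```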

Jump Instruction Steps
• Instruction Fetch
  • IR = Mem[PC]
  • PC = PC + 4
• Jump Execution
  • PC = PC[31-28] || (IR[25-0] << 2)
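The concatenation in the jump step can be written with masks and shifts. A minimal sketch follows; `jump_target` is a hypothetical helper and the example values are made up.

```python
def jump_target(pc_plus_4, ir):
    """PC[31-28] || (IR[25-0] << 2): keep the top 4 bits of the
    incremented PC and fill the low 28 bits from the instruction."""
    upper = pc_plus_4 & 0xF0000000
    lower = (ir & 0x03FFFFFF) << 2
    return upper | lower

# j with a 26-bit target field of 0x100 (opcode 2), fetched from 0x40000000
j_instr = (2 << 26) | 0x100
target = jump_target(0x40000004, j_instr)
print(hex(target))
```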

Breaking Instructions into Steps
• Instruction Fetch
  • IR = Mem[PC] (all instructions)
  • PC = PC + 4 (all instructions)
• Instruction Decode
  • A = Reg[IR[25-21]] (all inst. except jump)
  • B = Reg[IR[20-16]] (arith., sw & branch)
  • ALUOut = PC + (sign-extend(IR[15-0]) << 2) (branch inst. only)

Breaking Instructions into Steps (continued)
• Execution, Memory Address Calculation, or Branch
  • ALUOut = A + sign-extend(IR[15-0]) (lw/sw inst.)
  • ALUOut = A op B (arith. inst.)
  • If (A == B) PC = ALUOut (branch inst.)
  • PC = PC[31-28] || (IR[25-0] << 2) (jump inst.)
• Memory Access or R-type Instruction Completion
  • MDR = Mem[ALUOut] (lw inst.)
  • Mem[ALUOut] = B (sw inst.)
  • Reg[IR[15-11]] = ALUOut (arith. inst.)

Breaking Instructions into Steps (continued)
• Memory Read Completion
  • Reg[IR[20-16]] = MDR (lw inst.)
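Counting the steps each instruction class goes through gives its cycle count, and weighting by an instruction mix gives an average CPI. The sketch below is an illustration under assumptions: it reuses the mix from the fixed vs. variable cycle slide, and it counts jump as 3 cycles because the combined breakdown places the jump in the execution step.

```python
# Clock cycles per instruction class, read off the step breakdown.
# Assumption: jump finishes in the execution step (3 cycles).
CYCLES = {"R-type": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}

# Instruction mix from the fixed vs. variable cycle slide.
MIX = {"load": 0.24, "store": 0.12, "R-type": 0.44, "branch": 0.18, "jump": 0.02}

# Average cycles per instruction is the mix-weighted mean cycle count.
cpi = sum(MIX[kind] * CYCLES[kind] for kind in MIX)
print(f"average CPI = {cpi:.2f}")
```

Multiplying the CPI by the clock period gives the average time per instruction for this mix.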

Finite State Machine Control

Cycle 1 (all instructions): Instruction Fetch
[Figure: multicycle datapath annotated with the control signal values for the instruction-fetch cycle]
• IorD=0, MemRead=1, MemWrite=0, IRWrite=1
• ALUSelA=0, ALUSelB=1, ALUOp=0
• PCWrite=1, PCSource=0, RegWrite=0

Cycle 2 (all instructions): Instruction Decode / Register Fetch
[Figure: multicycle datapath annotated with the control signal values for the decode cycle]
• MemRead=0, MemWrite=0, IRWrite=0
• ALUSelA=0, ALUSelB=3, ALUOp=0
• PCWrite=0, PCWriteCond=0, RegWrite=0

Fetch & Decode Instruction State Diagram
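The first two states of the control FSM can be written down as a table of signal values. This sketch uses the settings listed for cycles 1 and 2; the state names, the `CONTROL` and `NEXT_STATE` tables, and the `asserted` helper are illustrative, and the states that follow decode are not modeled because they depend on the opcode.

```python
# Control outputs per state, as listed for cycle 1 and cycle 2.
CONTROL = {
    "fetch": {
        "IorD": 0, "MemRead": 1, "MemWrite": 0, "IRWrite": 1,
        "ALUSelA": 0, "ALUSelB": 1, "ALUOp": 0,
        "PCWrite": 1, "PCSource": 0, "RegWrite": 0,
    },
    "decode": {
        "MemRead": 0, "MemWrite": 0, "IRWrite": 0,
        "ALUSelA": 0, "ALUSelB": 3, "ALUOp": 0,
        "PCWrite": 0, "PCWriteCond": 0, "RegWrite": 0,
    },
}

# Fetch is always followed by decode; the successor of decode
# depends on the opcode and is left out of this sketch.
NEXT_STATE = {"fetch": "decode"}

def asserted(state):
    """Names of the signals with a non-zero value in the given state."""
    return sorted(name for name, value in CONTROL[state].items() if value != 0)

state = "fetch"
print(state, asserted(state))
state = NEXT_STATE[state]
print(state, asserted(state))
```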
