340 likes | 362 Views
CS-447– Computer Architecture M,W 10-11:20am Lecture 12 Multiple Cycle Datapath. October 8th, 2007 Majd F. Sakr msakr@qatar.cmu.edu www.qatar.cmu.edu/~msakr/15447-f07/. Implementation vs. Performance. Performance of a processor is determined by Instruction count of a program CPI
E N D
CS-447– Computer Architecture M,W 10-11:20amLecture 12Multiple Cycle Datapath October 8th, 2007 Majd F. Sakrmsakr@qatar.cmu.edu www.qatar.cmu.edu/~msakr/15447-f07/
Implementation vs. Performance Performance of a processor is determined by • Instruction count of a program • CPI • Clock cycle time (clock rate) The compiler & the ISA determine the instruction count. Theimplementationof the processor determines the CPI and the clock cycle time.
Possible Execution Steps of Any Instructions • Instruction Fetch • Instruction Decode and Register Fetch • Execution of the Memory Reference Instruction • Execution of Arithmetic-Logical operations • Branch Instruction • Jump Instruction
Instruction Processing • Five steps: • Instruction fetch (IF) • Instruction decode and operand fetch (ID) • ALU/execute (EX) • Memory (not required) (MEM) • Write-back (WB) WB IF EX ID MEM
What’s Wrong with Single Cycle? • All instructions run at the speed of the slowest instruction. • Adding a long instruction can hurt performance • What if you wanted to include multiply? • You cannot reuse any parts of the processor • We have 3 different adders to calculate PC+1, PC+1+offset and the ALU • No profit in making the common case fast • Since every instruction runs at the slowest instruction speed • This is particularly important for loads as we will see later
What’s Wrong with Single Cycle? 1 ns – Register read/write time 2 ns – ALU/adder 2 ns – memory access 0 ns – MUX, PC access, sign extend, ROM add: 2ns + 1ns + 2ns + 1ns = 6 ns beq: 2ns + 1ns + 2ns = 5 ns sw: 2ns + 1ns + 2ns + 2ns = 7 ns lw: 2ns + 1ns + 2ns + 2ns + 1ns = 8 ns Get read ALU mem write Instr reg operation reg
Computing Execution Time Assume: 100 instructions executed 25% of instructions are loads, 10% of instructions are stores, 45% of instructions are adds, and 20% of instructions are branches. Single-cycle execution: 100 * 8ns = 800ns Optimal execution: 25*8ns + 10*7ns + 45*6ns + 20*5ns = 640ns
Single Cycle Problems • A sequence of instructions: • LW (IF, ID, EX, MEM, WB) • SW (IF, ID, EX, MEM) • etc Single Cycle Implementation: Cycle 1 Cycle 2 Clk Load Store Waste • what if we had a more complicated instruction like floating point? • wasteful of area
Multiple Cycle Solution • use a “smaller” cycle time • have different instructions take different numbers of cycles • a “multicycle” datapath:
Multicycle Approach • We will be reusing functional units • ALU used to compute address and to increment PC • Memory used for instruction and data • We’ll use a finite state machine for control
IF ID Ex Mem WB The Five Stages of an Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 • IF: Instruction Fetch and Update PC • ID: Instruction Decode and Registers Fetch • Ex: Execute R-type; calculate memory address • Mem: Read/write the data from/to the Data Memory • WB: Write the result data into the register file
Multicycle Implementation • Break up the instructions into steps, each step takes a cycle • balance the amount of work to be done • restrict each cycle to use only one major functional unit • At the end of a cycle • store values for use in later cycles (easiest thing to do) • introduce additional “internal” registers
IF ID Ex Mem WB The Five Stages of Load Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 • IF: Instruction Fetch and Update PC • ID: Instruction Decode and Registers Fetch • Ex: Execute R-type; calculate memory address • Mem: Read/write the data from/to the Data Memory • WB: Write the result data into the register file lw
IFetch IFetch Dec Dec Exec Exec Mem Mem WB WB Multiple Cycle Implementation • Break the instruction execution into Clock Cycles • Different instructions require a different number of clock cycles • Clock cycle is limited by the slowest stage • Instruction latency is not reduced (time from the start of an instruction to its completion) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 lw sw
IFetch Dec Exec Mem WB IFetch Dec Exec Mem IFetch Single Cycle vs. Multiple Cycle Single Cycle Implementation: Cycle 1 Cycle 2 Clk Load Store Waste Multiple Cycle Implementation: Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk lw sw R-type
Multicycle Implementation • Break up the instructions into steps, each step takes a cycle • balance the amount of work to be done • restrict each cycle to use only one major functional unit • At the end of a cycle • store values for use in later cycles (easiest thing to do) • introduce additional “internal” registers
Instructions from ISA perspective • Consider each instruction from perspective of ISA. • Example: • The add instruction changes a register. • Register specified by bits 15:11 of instruction. • Instruction specified by the PC. • New value is the sum (“op”) of two registers. • Registers specified by bits 25:21 and 20:16 of the instruction Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] • In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming)
Breaking down an instruction • ISA definition of arithmetic:Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] • Could break down to: • IR <= Memory[PC] • A <= Reg[IR[25:21]] • B <= Reg[IR[20:16]] • ALUOut <= A op B • Reg[IR[20:16]] <= ALUOut • We forgot an important part of the definition of arithmetic! • PC <= PC + 4
Idea behind multicycle approach • We define each instruction from the ISA perspective (do this!) • Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps) • Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) • Finally try and pack as much work into each step (avoid unnecessary cycles)while also trying to share steps where possible (minimizes control, helps to simplify solution)
Five Execution Steps • Instruction Fetch • Instruction Decode and Register Fetch • Execution, Memory Address Computation, or Branch Completion • Memory Access or R-type instruction completion • Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Step 1: Instruction Fetch • Use PC to get instruction and put it in the Instruction Register. • Increment the PC by 4 and put the result back in the PC. • Can be described succinctly using RTL "Register-Transfer Language" IR <= Memory[PC]; PC <= PC + 4;Can we figure out the values of the control signals?What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch • Read registers rs and rt in case we need them • Compute the branch address in case the instruction is a branch • RTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(IR[15:0]) << 2); • We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Step 3 (instruction dependent) • ALU is performing one of three functions, based on instruction type • Memory Reference: ALUOut <= A + sign-extend(IR[15:0]); • R-type: ALUOut <= A op B; • Branch: if (A==B) PC <= ALUOut;
Step 4 (R-type or memory-access) • Loads and stores access memory MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B; • R-type instructions finish Reg[IR[15:11]] <= ALUOut;
Write-back step • Reg[IR[20:16]] <= MDR; Which instruction needs this?
Review: finite state machines • Finite state machines: • a set of states and • next state function (determined by current state and the input) • output function (determined by current state and possibly input) • We’ll use a Moore machine (output based only on current state)
Implementing the Control • Value of control signals is dependent upon: • what instruction is being executed • which step is being performed • Use the information we’ve accumulated to specify a finite state machine • specify the finite state machine graphically, or • use microprogramming • Implementation can be derived from specification
Finite State Machine for Control • Implementation: