CprE 381 Computer Organization and Assembly Level Programming, Fall 2013. Chapter 4. The Processor. Zhao Zhang Iowa State University Revised from original slides provided by MKP. Week 9 Overview. Mini Project B CPU Pipelining: Pipelined Data Path and Control
CprE 381 Computer Organization and Assembly Level Programming, Fall 2013 Chapter 4 The Processor Zhao Zhang Iowa State University Revised from original slides provided by MKP
Week 9 Overview
• Mini Project B
• CPU Pipelining: Pipelined Data Path and Control
• ALU Data Hazards and Forwarding

Mini-Project B Overview
Implement a single-cycle processor (SCP). There will be three parts:
• Part 1, SCPv1: Implement the nine-instruction ISA
• Part 2, SCPv2a: Support all the instructions needed to run bubble sort, with coarse-level modeling of datapath elements
• Part 3, SCPv2b: Detailed modeling of datapath elements
There is also a bonus project.

Project A Late Submission
• Start working on Project B ASAP
• You may submit Mini-Project A up to three weeks late (with a 20% late penalty)
  • Demo those parts that are working
  • The late penalty applies only to those parts that are actually late
• If you demo Project B successfully, you don't have to demo any late part of Project A

Part 1: SCPv1
Implementing the nine-instruction MIPS ISA
• Memory reference: LW and SW
• Arithmetic/logic: ADD, SUB, AND, OR, SLT
• Branch: BEQ, J
The textbook provides almost all implementation details
• Datapath and control
• The main control unit (9-bit signals w/o Jump)
• The ALU control unit

Part 1: SCPv1
Use this diagram as the blueprint for Part 1

SCPv1: Control Signals
• Control signal settings for SCPv1
• It is a truth table
Note: "R-" means R-format

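The truth table itself is not reproduced in this transcript. As a rough sketch, assuming the settings follow the textbook's main control table for R-format, lw, sw and beq (nine signals, jump excluded), it can be written as a lookup. The names `reg_dst`, `alu_src` and `mem_to_reg` come from the Cpu.vhd sample; the remaining signal names and the `main_control` helper are illustrative assumptions.

```python
# Main control as a lookup table (a sketch, assuming the textbook's settings).
# None marks a don't-care entry. Signal names beyond reg_dst, alu_src and
# mem_to_reg are assumed for illustration.
SIGNALS = ("reg_dst", "alu_src", "mem_to_reg", "reg_write",
           "mem_read", "mem_write", "branch", "alu_op1", "alu_op0")

X = None  # don't-care

CONTROL_TABLE = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 1, 0),  # R-format
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0, 0),  # lw
    0b101011: (X, 1, X, 0, 0, 1, 0, 0, 0),  # sw
    0b000100: (X, 0, X, 0, 0, 0, 1, 0, 1),  # beq
}


def main_control(opcode):
    """Return the control signal settings for a 6-bit opcode as a dict."""
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))
```

For example, `main_control(0b100011)` gives the lw row, where `mem_read`, `mem_to_reg` and `reg_write` are all 1.
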
SCPv1: ALU Control
• Truth table for ALU Control

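The ALU control table is likewise an image that did not survive. A minimal sketch, assuming the textbook's encoding (ALUOp 00 for lw/sw, 01 for beq, 10 for R-format, with the funct field selecting the operation), could look like the following. The function name `alu_control` is hypothetical; the output codes 0000 (AND) and 0010 (add) match the cases visible in alu.vhd.

```python
# ALU control as a function of ALUOp and funct (sketch under the textbook's
# encoding; not necessarily the exact table used in the project handout).
FUNCT_TO_ALU_CODE = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and 6-bit funct field to a 4-bit ALU code."""
    if alu_op == 0b00:   # lw / sw: add base and offset
        return 0b0010
    if alu_op == 0b01:   # beq: subtract and test for zero
        return 0b0110
    if alu_op == 0b10:   # R-format: decided by funct
        return FUNCT_TO_ALU_CODE[funct]
    raise ValueError("unsupported ALUOp")
```
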
SCPv1 Fast Prototyping
You are provided with the following files:
• mips32.vhd: A VHDL package
• regfile.vhd: For the register file
• register.vhd: For the PC
• alu.vhd: For the ALU
• adder.vhd: For the PC-related adders
• mem.vhd: The memory, for both instruction memory and data memory

SCPv1 Fast Prototyping
Rationale behind Part 1: Focus on the structure/organization of the CPU
• The provided components are modeled at a coarse level
• We know that efficient circuit designs exist for those components: memory, register file, ALU, adder, mux and so on
• Work out the details at a later time

Strongly Structural Modeling
• Your CPU composition must be strongly structural
  • No behavioral modeling can be used. No process statements.
  • Limited dataflow modeling (see next)
• Additional requirement: Declare all components in the architecture body of the CPU
  • Only component instantiation, no entity instantiation

Strongly Structural Modeling
• Acceptable forms of dataflow modeling
  • Signal copying/splitting
        opcode <= inst(31 downto 26);
  • Signal merging
        j_target <= PC(31 downto 28) & j_offset & "00";
  • One level of basic logic gates
        taken_branch <= branch AND zero;

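To check the wiring implied by these three dataflow forms before simulating, the same bit manipulation can be mirrored in software. This is only a sketch: the Python function names are made up for illustration, and `j_offset` is assumed to be the 26-bit jump field of the instruction.

```python
# Software mirror of the three allowed dataflow forms (illustrative only).

def opcode_of(inst):
    """Signal splitting: inst(31 downto 26)."""
    return (inst >> 26) & 0x3F


def jump_target(pc, j_offset):
    """Signal merging: PC(31 downto 28) & j_offset & "00"."""
    return (pc & 0xF0000000) | ((j_offset & 0x03FFFFFF) << 2)


def taken_branch(branch, zero):
    """One level of basic logic: branch AND zero (1-bit values)."""
    return branch & zero
```

For instance, `opcode_of(0x8C080000)` yields `0b100011`, the opcode of lw.
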
Cpu.vhd
• This is a partial sample

    -- Control Unit
    CONTROL1 : control
      port map (opcode, reg_dst, alu_src, mem_to_reg, …);

    -- ALU Control unit
    ALU_CTRL1 : alu_ctrl
      port map (alu_op, funct, alu_code);

    -- The mux connected to the dst port of regfile
    DST_MUX : mux2to1
      generic map (M => 5)
      port map (rt, rd, reg_dst, dst);
    …

Datapath and Control Modeling
• For datapath elements and control units, you may use any modeling style (in Part 1)
• The provided components all use behavioral modeling for simplicity

mips32.vhd
Pre-defined constants and types to make coding simpler and consistent

    package MIPS32 is
      -- Half Cycle Time of the clock signal
      constant HCT : time := 50 ns;

      -- Clock Cycle Time of the clock signal
      constant CCT : time := 2 * HCT;

      -- MIPS32 logic type
      subtype m32_logic is std_logic;

      -- MIPS32 logic vector type
      subtype m32_vector is std_logic_vector;

mips32.vhd
Pre-defined types shorten the names

      -- Word type, for …
      subtype m32_word is m32_vector(31 downto 0);

      -- Halfword, byte, and bit fields of varying size
      subtype m32_halfword is m32_vector(15 downto 0);
      subtype m32_byte     is m32_vector(7 downto 0);
      subtype m32_1bit     is m32_logic;
      subtype m32_2bits    is m32_vector(1 downto 0);
      subtype m32_3bits    is m32_vector(2 downto 0);
      …
    end MIPS32;

Alu.vhd
• Why provide the ALU and the other VHDL programs?
  • Your implementation might have bugs
  • We don't want to fight bugs on two fronts
• You should still test those modules
  • Always test any module that you will use
  • The provided modules have been tested
  • Some test-bench programs are provided
  • Write your own test bench or extend the provided one

Alu.vhd

    entity ALU is
      port (rdata1   : in  m32_word;
            rdata2   : in  m32_word;
            alu_code : in  m32_4bits;
            result   : out m32_word;
            zero     : out m32_1bit);
    end entity;

Alu.vhd

    architecture behavior of ALU is
      signal r : m32_word;
    begin
      P_ALU : process (alu_code, rdata1, rdata2)
        variable code, a, b, sum, diff, slt : integer;
      begin
        -- Pre-calculate arithmetic results
        a := to_integer(signed(rdata1));
        b := to_integer(signed(rdata2));
        sum := a + b;
        diff := a - b;
        if (a < b) then
          slt := 1;
        else
          slt := 0;
        end if;

Alu.vhd
Coarse-level modeling is easy and reliable, but may not be synthesized efficiently

        -- Select the result, convert to signal if necessary
        case (alu_code) is
          when "0000" =>    -- AND
            r <= rdata1 AND rdata2;
          when "0010" =>    -- add
            r <= std_logic_vector(to_signed(sum, 32));
          …
        end case;
      end process;

      -- Drive the alu result output
      result <= r;

      -- Drive the zero output
      with r select
        zero <= '1' when x"00000000",
                '0' when others;
    end behavior;

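When writing a test bench for the ALU, a software reference model makes it easy to compute expected outputs. The sketch below mirrors the behavior shown above (signed arithmetic, 32-bit result, zero flag). Only codes 0000 (AND) and 0010 (add) appear in the slides; the codes used here for OR, subtract and SLT are the textbook's encoding and are an assumption about the elided cases. The names `alu` and `to_signed32` are hypothetical.

```python
# Reference model of the coarse-level ALU (sketch; OR/sub/slt codes assumed).
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(alu_code, rdata1, rdata2):
    """Return (result, zero) for 32-bit unsigned-encoded operands."""
    a, b = to_signed32(rdata1), to_signed32(rdata2)
    if alu_code == 0b0000:      # AND
        r = rdata1 & rdata2
    elif alu_code == 0b0001:    # OR (assumed code)
        r = rdata1 | rdata2
    elif alu_code == 0b0010:    # add
        r = a + b
    elif alu_code == 0b0110:    # subtract (assumed code)
        r = a - b
    elif alu_code == 0b0111:    # set on less than (assumed code)
        r = 1 if a < b else 0
    else:
        raise ValueError("unsupported alu_code")
    r &= MASK32                 # wrap to 32 bits, as the hardware would
    return r, int(r == 0)
```

Unlike the VHDL integer variables, Python integers do not overflow, so this model wraps results explicitly with a 32-bit mask.
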
Regfile.vhd
Caveat: The clock signal is needed in the single-cycle implementation

    entity regfile is
      port (src1   : in  m32_5bits;
            src2   : in  m32_5bits;
            dst    : in  m32_5bits;
            wdata  : in  m32_word;
            rdata1 : out m32_word;
            rdata2 : out m32_word;
            WE     : in  m32_1bit;
            reset  : in  m32_1bit;
            clock  : in  m32_1bit);
    end regfile;

Regfile.vhd

    architecture behavior of regfile is
      signal reg_array : m32_regval_array;
    begin
      -- Register reset logic
      P_WRITE : process (clock)
        variable r : integer;
      begin
        -- Write/reset logic
        if (rising_edge(clock)) then
          if (reset = '1') then
            for i in 0 to 31 loop
              reg_array(i) <= X"00000000";
            end loop;

Regfile.vhd

          elsif (WE = '1') then
            r := to_integer(unsigned(dst));
            -- Register 0 is hardwired to zero, so writes to it are ignored
            if not (r = 0) then
              reg_array(r) <= wdata;
            end if;
          end if;
        end if;
      end process;

Regfile.vhd

      P_READ : process (clock, src1, src2)
        variable r1, r2 : integer;
      begin
        -- Read logic
        r1 := to_integer(unsigned(src1));
        r2 := to_integer(unsigned(src2));
        rdata1 <= reg_array(r1);
        rdata2 <= reg_array(r2);
      end process;
    end behavior;

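The register file's behavior (synchronous reset and write, writes to register 0 ignored, two read ports) can be summarized in a small reference model, which may help when predicting register contents for a test case. This is a sketch only; the class and method names are hypothetical, and timing details of the read process are not modeled.

```python
# Behavioral sketch of the register file (names are illustrative).
class RegFile:
    def __init__(self):
        self.regs = [0] * 32

    def clock_edge(self, dst, wdata, we, reset):
        """Model one rising clock edge: reset has priority over write."""
        if reset:
            self.regs = [0] * 32
        elif we and dst != 0:          # register 0 stays zero
            self.regs[dst] = wdata & 0xFFFFFFFF

    def read(self, src1, src2):
        """Return the two read-port values."""
        return self.regs[src1], self.regs[src2]
```
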
Demonstration
For each of multiple test cases
• Trace the program execution
• Inspect the register and memory contents at the end of execution
A test case consists of
• MIPS binary code, e.g. in imem.txt
• Data memory content, e.g. in dmem.txt

Test Bench
Inside the test bench:

    CPU1 : cpu
      port map (imem_addr, inst, dmem_addr, dmem_read, dmem_write,
                dmem_wmask, dmem_rdata, dmem_wdata, reset, clock);

    INST_MEM : mem
      generic map (mif_filename => "imem.txt")
      port map (imem_addr(9 downto 2), "0000", clock, x"00000000", '0', inst);

    DATA_MEM : mem
      generic map (mif_filename => "dmem.txt")
      port map (dmem_addr(9 downto 2), dmem_wmask, clock,
                dmem_wdata, dmem_write, dmem_rdata);

Note: Treat memories as external datapath elements

Instruction Memory
• imem.txt contents (MIF)

    DEPTH=1024;
    WIDTH = 32;
    -- lw  $t0, 0($zero)
    -- lw  $t1, 4($zero)
    -- beq $t0, $t1, +2
    -- add $t0, $t0, $t1
    -- sw  $t0, 8($zero)
    -- noop
    CONTENT
    BEGIN
    -- Instruction formats
    --R ======-----=====-----=====------
    --I ======-----=====----------------
    --J ======--------------------------
    0 : 10001100000010000000000000000000;
    1 : 10001100000010010000000000000100;
    2 : 00010001000010010000000000000010;
    3 : 00000001000010010100000000100000;
    4 : 10101100000010000000000000001000;
    [5..63] : 00000000;
    END;

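When preparing a test case, it helps to double-check that each binary word in imem.txt encodes the intended instruction. A small sketch that splits a 32-bit word into the standard MIPS fields is shown below; the `decode` helper is hypothetical and not part of the provided files.

```python
# Split a 32-bit MIPS instruction word into its fields (illustrative helper).
def decode(word_bits):
    """word_bits is a 32-character string of '0'/'1' as used in imem.txt."""
    inst = int(word_bits, 2)
    imm = inst & 0xFFFF
    if imm & 0x8000:                 # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (inst >> 26) & 0x3F,
        "rs": (inst >> 21) & 0x1F,
        "rt": (inst >> 16) & 0x1F,
        "rd": (inst >> 11) & 0x1F,
        "shamt": (inst >> 6) & 0x1F,
        "funct": inst & 0x3F,
        "imm": imm,
    }
```

Decoding word 0 gives opcode `0b100011` (lw), rs 0, rt 8 and immediate 0, which agrees with the comment `lw $t0, 0($zero)` since `$t0` is register 8.
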
Part 2. SCPv2 Prototyping (SCPv2a)
• Support all MIPS instructions used by the bubble sort example
• We have studied how to extend the nine-instruction design to support ADDI, SLL, BNE, and JAL
• For each new instruction, think about
  • Datapath: Any new/revised datapath elements, any new signal connections
  • The main control: Any new control signals, any extension to the truth table
  • The ALU control: Any extension to the truth table

Part 3. SCPv2b
• SCPv2 Detailed Implementation
• Provide detailed modeling for
  • Register file
  • ALU
  • Adder
• Use your code from Labs 1-4 and Mini-Project A
  • You may revise your code
• Your final code should be strongly structural
  • Consult your lab TA if you are not sure

Bonus Project
• Bonus Project Part 1: Green MIPS SCP (SCP-G)
  • Extend SCPv2 to support all integer instructions listed on the green sheet
• Bonus Project Part 2 is a pipelined implementation
• The lab bonus can overflow into your overall grade
  • As said, the quiz bonus does not overflow
• Partial credit will be given
• The grading details will be finalized

Pipelined CPU
A natural idea to improve performance
The devil is in the details
• Pipelined data path and control
• Data hazards from ALU instructions
• Data hazards from load instructions
• Control hazards from branches
• Exception handling in a pipelined processor

SCP With Jumps Added

Performance Issues
• Longest delay determines clock period
  • Critical path: load instruction
  • Instruction memory → register file → ALU → data memory → register file
• Now we will improve performance by pipelining

Pipelining Analogy (§4.5 An Overview of Pipelining)
• Pipelined laundry: overlapping execution
  • Parallelism improves performance
• Four loads: $\text{Speedup} = 8 / 3.5 \approx 2.3$
• Non-stop: $\text{Speedup} = \dfrac{2n}{0.5n + 1.5} \approx 4$ = number of stages

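The two speedup figures can be reproduced numerically. The sketch below assumes, consistent with the formulas on the slide, four laundry stages of 0.5 hours each, so a sequential load takes 2 hours and n pipelined loads take 0.5n + 1.5 hours. The function and constant names are illustrative.

```python
# Laundry analogy: sequential vs. pipelined completion time (sketch).
STAGE_HOURS = 0.5   # assumed duration of each stage
NUM_STAGES = 4      # assumed number of stages per load


def sequential_hours(n):
    """Each load runs all stages before the next one starts: 2n hours."""
    return n * NUM_STAGES * STAGE_HOURS


def pipelined_hours(n):
    """Loads overlap, one stage apart: 0.5n + 1.5 hours."""
    return (n + NUM_STAGES - 1) * STAGE_HOURS


def laundry_speedup(n):
    return sequential_hours(n) / pipelined_hours(n)
```

With four loads the ratio is 8 / 3.5, and as n grows it approaches the number of stages.
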
Pipeline Performance
Look at this example
• In the single-cycle implementation, the critical path is 800ps (one cycle @ 1.25 GHz)
• The longest component latency is 200ps (one cycle @ 5 GHz)
Note: Latency of muxes, extenders and so on is ignored

MIPS Pipeline Idea
• If we divide the execution into stages, the clock frequency can be much higher
• Five stages, one step per stage
  • IF: Instruction fetch from memory
  • ID: Instruction decode & register read
  • EX: Execute operation or calculate address
  • MEM: Access memory operand
  • WB: Write result back to register

MIPS Pipeline Idea
General idea: Split the datapath into stages, with critical path delay <= 1 clock cycle

Pipeline Performance
First look at performance gain
[Figure: timing of single-cycle (Tc = 800ps) versus pipelined (Tc = 200ps) execution]

Pipeline Speedup
• If all stages are balanced
  • i.e., all take the same time
  • $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
  • Ideal speedup = N for an N-stage pipeline
• If not balanced, speedup is less
  • In the example, speedup is up to 4.0
• Speedup is due to increased throughput
  • Latency (time for each instruction) does not decrease, and may even increase

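A quick calculation shows why the speedup in the running example tops out at 4.0 rather than 5. The sketch below uses the clock periods from the example (800ps single-cycle, 200ps pipelined) and assumes a five-stage pipeline with no stalls; the function names are illustrative.

```python
# Execution time for n instructions: single-cycle vs. 5-stage pipeline (sketch).
SINGLE_CYCLE_TC_PS = 800
PIPELINED_TC_PS = 200
NUM_STAGES = 5


def single_cycle_time_ps(n):
    """One 800ps cycle per instruction."""
    return n * SINGLE_CYCLE_TC_PS


def pipelined_time_ps(n):
    """Fill the pipeline once, then finish one instruction per 200ps cycle."""
    return (n + NUM_STAGES - 1) * PIPELINED_TC_PS


def speedup(n):
    return single_cycle_time_ps(n) / pipelined_time_ps(n)


def pipelined_latency_ps():
    """Time for one instruction to pass through all five stages."""
    return NUM_STAGES * PIPELINED_TC_PS
```

For three instructions this gives 2400ps versus 1400ps. For long instruction streams the ratio approaches 800/200 = 4, because the stages are not balanced. The per-instruction latency rises from 800ps to 1000ps.
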
Pipelining and ISA Design
• MIPS ISA designed for pipelining
  • All instructions are 32 bits
    • Easier to fetch and decode in one cycle
    • cf. x86: 1- to 17-byte instructions
  • Few and regular instruction formats
    • Can decode and read registers in one step

Pipelining and ISA Design
• How would you design a pipeline for this instruction format?
  • Prefixes (1-4 bytes)
  • Opcode (1-3 bytes), required
  • ModR/M (1 byte): addressing-form specifier, mixing of register numbers, addressing modes, additional opcode bits
  • SIB (1 byte): second addressing byte for base-plus-index and scale-plus-index addressing modes
  • Addr. Displacement (0, 1, 2, or 4 bytes)
  • Immediate (0, 1, 2, or 4 bytes)
Source: Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

Pipelining and ISA Design
• MIPS ISA designed for pipelining
  • Load/store addressing
    • Can calculate address in 3rd stage, access memory in 4th stage
  • Alignment of memory operands
    • Memory access takes only one cycle

Pipelining and ISA Design
• How would you design a pipeline that works well for the following instructions?

    ADD eax, ebx                 ; add with two registers
    SUB ebx, 100                 ; sub with reg and const
    ADD eax, [0x1000]            ; add reg and memory
    ADD BYTE PTR [0x1000], 100   ; add with mem and const
    SUB [esi+4*ebx], eax         ; sub with reg and mem (array)

MIPS Pipelined Datapath (§4.6 Pipelined Datapath and Control)
[Figure: pipelined datapath, with the right-to-left flows labeled MEM and WB]
Right-to-left flow leads to hazards

Pipeline registers
• Need registers between stages
  • To hold information produced in the previous cycle

Hazards
• Situations that prevent starting the next instruction in the next cycle
• Structural hazard
  • A required resource is busy
• Data hazard
  • Need to wait for a previous instruction to complete its data read/write
• Control hazard
  • Deciding on a control action depends on a previous instruction

Hazards
There are ways to handle those hazards. Let's ignore them for now.
Assume, for now, that the program has no data dependences and no control dependences:

    lw  $10, 20($1)
    sub $11, $2, $3
    add $12, $3, $4
    lw  $13, 24($1)
    sub $14, $5, $6

Can you design a pipeline to run the above instructions correctly?

Hazards
Program with data dependence

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Program with control dependence

    beq  $1, $3, +4
    addi $2, $2, 1
    addi $4, $4, 1

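The data dependences in the first program can be listed mechanically: every later instruction reads `$2`, which `sub` writes. The sketch below encodes each instruction as a hypothetical `(name, destination, sources)` tuple and reports read-after-write pairs. It only identifies dependences; whether each one actually causes a hazard depends on the pipeline timing.

```python
# Find read-after-write dependences in a straight-line program (sketch).
# Each entry is (mnemonic, destination register or None, source registers).
PROGRAM = [
    ("sub", 2, (1, 3)),
    ("and", 12, (2, 5)),
    ("or", 13, (6, 2)),
    ("add", 14, (2, 2)),
    ("sw", None, (15, 2)),   # a store writes no register
]


def raw_dependences(program):
    """Return (producer index, consumer index, register) triples."""
    deps = []
    last_writer = {}                       # register -> index of latest writer
    for j, (_, dest, srcs) in enumerate(program):
        for reg in sorted(set(srcs)):
            if reg != 0 and reg in last_writer:
                deps.append((last_writer[reg], j, reg))
        if dest is not None and dest != 0:  # $0 is never really written
            last_writer[dest] = j
    return deps
```

Running `raw_dependences(PROGRAM)` yields four triples, one for each instruction after `sub`, all on register 2.
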
Pipeline Operation
• Cycle-by-cycle flow of instructions through the pipelined datapath
  • "Single-clock-cycle" pipeline diagram
    • Shows pipeline usage in a single cycle
    • Highlights resources used
  • cf. "multi-clock-cycle" diagram
    • Graph of operation over time
• We'll look at "single-clock-cycle" diagrams for load & store

IF for Load, Store, …