Outline • Introduction • Version 1 EMY CPU : Pipelined EMY CPU • It executes only integer instructions • How a memory hierarchy can be attached to the pipelined EMY CPU is also studied • Version 0, the Unpipelined EMY CPU is described in another presentation • Handout to use • Pipelined EMY CPU CS 2214
Introduction • On the microarchitecture layer, a computer is a collection of at least three interconnected digital systems • A central processing unit (CPU) • A (main) memory • An I/O controller to control an I/O device, such as the disk • There can be several I/O controllers to control several different I/O devices • [Figure: the CPU, the memory and a disk I/O controller connected by an interconnection system]
Digital Systems • A digital system performs microoperations • It consists of a datapath (data unit) and a control unit • The datapath actually performs the microoperations • The control unit determines which microoperation happens when • [Figure: the datapath (registers, ALUs, buses) sends status signals to the control unit (sequencer), which sends control signals back to the datapath]
Digital Systems • The datapath (data unit) has registers, ALUs and buses to perform the microoperations • Registers keep information temporarily • ALUs perform arithmetic/logic operations • Buses interconnect the registers and ALUs • Other components used include • Multiplexers (MUXes), decoders, encoders, comparators, counters, etc.
Digital Systems • The control unit has a sequencer that determines the sequence of microoperations • The sequencer needs status signals from the data unit to know what is happening there • Then, based also on the current state, it determines which microoperations are to be performed and indicates them to the datapath by means of control signals
Designing Digital Systems • Datapath design is simpler than control unit design since the datapath has highly regular (duplicated) circuits • A 64-bit ADDer is composed of 4 identical 16-bit ADDers • A 64-bit comparator consists of 8 identical 8-bit comparators, etc. • Control unit design is more difficult due to • Large amounts of random logic • The substantial effort needed to make sure there are no timing problems • Microoperations must start at the right time and end at the right time !
Designing digital systems • We will use the finite-state machine (FSM) technique to design the EMY CPU where the FSM state diagram will have states with microoperations • The state diagram shows which state follows which state precisely • Each state indicates which microoperations to perform • The state diagram shows which states are needed when for which machine language instruction
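The idea of a state diagram whose states carry microoperations can be sketched in a few lines of Python. This is only an illustration: the state names, the microoperation strings and the helper `run_one_instruction` are assumptions made up for this example, not the actual EMY state diagram.

```python
# Hypothetical high-level state diagram for a load-like instruction.
# Each state maps to (microoperations performed in that state, next state).
# State names and microoperations are illustrative, not the EMY design.
STATE_DIAGRAM = {
    "FETCH":     (["IR <-- M[PC]", "PC <-- PC + 4"], "DECODE"),
    "DECODE":    (["A <-- Rs"], "EXECUTE"),
    "EXECUTE":   (["ALUout <-- A + SignExt(Imm)"], "MEMORY"),
    "MEMORY":    (["MDR <-- M[ALUout]"], "WRITEBACK"),
    "WRITEBACK": (["Rt <-- MDR"], "FETCH"),
}


def run_one_instruction(start="FETCH"):
    """Walk the diagram from `start` until control returns to FETCH.

    Returns a list of (state, microoperations), one entry per clock period.
    """
    trace = []
    state = start
    while True:
        microops, next_state = STATE_DIAGRAM[state]
        trace.append((state, microops))
        state = next_state
        if state == "FETCH":
            return trace


trace = run_one_instruction()
for clock, (state, microops) in enumerate(trace, start=1):
    print(clock, state, "; ".join(microops))
```

The table plays the role of the state diagram: it says precisely which state follows which, and each state lists the microoperations to perform. The number of entries in the trace is the number of clock periods the instruction takes.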
Designing digital systems • We will design the EMY CPU by using the finite-state machine (FSM) technique • More specifically, we will obtain the following for the complete EMY CPU design • A high-level state diagram to show which microoperation happens when • The datapath from the high-level state diagram • The low-level state diagram from the high-level state diagram and the datapath • The control unit from the low-level state diagram • It can be implemented by hardwiring and/or microprogramming
Designing the microarchitecture level of a computer • There are two tasks in this design • Develop the CPU and memory digital systems so that instructions can be run • Develop the memory and I/O controller digital systems so that I/O can happen • We will concentrate on the CPU and memory digital systems
Designing the CPU and memory digital systems • First we focus on the CPU digital system while we make a few design decisions on the memory quickly • We have designed the CPU as a slow CPU running only integer instructions : No pipelining • This is Version 0 • We assumed the memory was fast, which is not realistic today • We will see how a memory hierarchy with cache memories, etc. can be incorporated • This CPU coverage is given in another PowerPoint presentation • Now, we improve the CPU speed by using pipelining, but still running integer instructions • This is Version 1 • We will assume the memory is fast, which is again not realistic today • Then, we will see how a memory hierarchy with cache memories, etc. can be incorporated • For both versions the memory will be a black box with a few details
Designing the CPU as a Digital System • The unpipelined EMY CPU digital system has been designed for nine integer instructions • We obtained its • High-level state diagram • Datapath • Low-level state diagram • Control unit • We will design the pipelined EMY CPU digital system for eight integer instructions • We will obtain its • High-level state diagram • Datapath
Designing the Unpipelined CPU Digital System • To design the unpipelined EMY CPU, we started with the EMY architecture • What is the connection between the architecture and the CPU? • A computer processes digital information by running machine language instructions • A machine language program is a list of instructions, each of which specifies operations on data (arguments) • An instruction specifies architectural operations • Each architectural operation is implemented by microoperations
Designing the Unpipelined CPU Digital System • In order to perform an architectural operation, the CPU performs a series of microoperations in a number of clock periods • That is, an architectural operation is broken down into smaller operations called microoperations • That is, to run a machine language instruction, the CPU performs microoperations • The CPU performs some microoperations by itself and some in cooperation with the memory and the I/O controllers
Designing the Unpipelined CPU Digital System • Architectural operations • An architectural operation is what we describe as the semantics of the instruction, such as • The architectural operation specified by the ADD instruction • Rd <-- Rs + Rt • The architectural operation specified by the SUB instruction • Rd <-- Rs - Rt • The architectural operation specified by the SLT instruction • If Rs < Rt then Rd <-- 1 else Rd <-- 0 • The architectural operation specified by the J instruction • PC[27-0] <-- (Address * 4) • It is the CPU that contributes the most to the execution of an instruction since it performs most of the microoperations needed for an architectural operation
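As a rough illustration, the three register operations above (ADD, SUB, SLT) can be written as small Python functions. This is a sketch assuming 32-bit registers held as unsigned integers and a signed (two's-complement) comparison for SLT; neither assumption is stated on the slide, and the function names are made up for this example.

```python
MASK32 = 0xFFFFFFFF  # assumed register width: 32 bits


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add(rs, rt):
    """Rd <-- Rs + Rt, wrapping around at 32 bits."""
    return (rs + rt) & MASK32


def sub(rs, rt):
    """Rd <-- Rs - Rt, wrapping around at 32 bits."""
    return (rs - rt) & MASK32


def slt(rs, rt):
    """If Rs < Rt then Rd <-- 1 else Rd <-- 0 (signed compare assumed)."""
    return 1 if to_signed(rs) < to_signed(rt) else 0


print(add(7, 5), sub(7, 5), slt(5, 7), slt(7, 5))
```

Each function describes only what the instruction means at the architecture level. How many microoperations and clock periods the CPU needs to produce that result is a separate, microarchitectural question.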
Designing the Unpipelined CPU Digital System • Typical CPU digital system microoperations • Add, subtract, multiply • In the past, a 32-bit addition was completed in 1 clock period • Today, a 32-bit addition is completed in several clock periods • AND, OR, XOR • Shift right, Shift left • Read data from memory, write data to memory • In the past, a memory access was completed in 1 clock period • Today, it is completed in several clock periods • Read instructions from memory (fetch) • Increment the program counter • Transfer a register to another register • …
Designing the Unpipelined CPU as a Digital System • Other machines, especially CISC machines, require other microoperations such as • Reading indirect address(es) from the memory • Effective address calculation for • Indexing • Autoincrement • Autodecrement • Alignment for • Instructions • Data • Addresses
Designing the Unpipelined CPU Digital System • Architecture’s effect on microoperations • The decisions made on architecture determine the microoperations needed for the execution of the instructions • General microoperations found on most CPUs • The ones mentioned on previous slides • Specific microoperations for certain CPUs • Specific microoperations for Memory Management Units (MMUs), caches, I/O controllers • The architecture also determines the characteristics of each microoperation • If the 26-bit PC-direct addressing mode is used, the rightmost 26 bits of IR are catenated to the leftmost 4 bits of PC and the resulting 30 bits are shifted to the left by 2 • Thus, each machine language instruction requires a certain number of microoperations taking a certain time : the CPIi
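The 26-bit PC-direct address formation described above can be traced step by step. The sketch below assumes a 32-bit PC and IR; the function name `jump_target` and the sample IR contents (including the value placed in the opcode bits) are hypothetical and serve only to exercise the bit manipulation.

```python
def jump_target(pc, ir):
    """Form the 32-bit target of a 26-bit PC-direct jump."""
    pc_top4 = (pc >> 28) & 0xF                 # leftmost 4 bits of PC
    ir_low26 = ir & 0x03FFFFFF                 # rightmost 26 bits of IR
    thirty_bits = (pc_top4 << 26) | ir_low26   # catenate: 4 + 26 = 30 bits
    return (thirty_bits << 2) & 0xFFFFFFFF     # shift left by 2: 32-bit address


# Hypothetical IR contents: some opcode in the top 6 bits,
# address field 0x10008D in the low 26 bits.
ir = (0x02 << 26) | 0x10008D
print(hex(jump_target(0x00400200, ir)))
```

Shifting left by 2 is the same as multiplying the address field by 4, which matches the J semantics PC[27-0] <-- (Address * 4) with the leftmost 4 bits of PC left unchanged.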
Designing the Unpipelined CPU Digital System • Microoperations • The CPU can perform one or more microoperations per clock period, depending on the complexity of the microoperation and the availability of the hardware resources • Most often a microoperation can be completed in one clock period unless it is a complex microoperation • If a complex microoperation is to be run in one clock period, the clock period needs to be longer • The more numerous and complex the microoperations are, the longer it takes to run the machine language instruction • CISC instructions take a longer time to execute (larger CPIi)
Designing the Unpipelined CPU Digital System • Calculating CPIi • The time it takes to run an instruction, CPIi, is then determined by • The number of microoperations needed for it • The complexity of the microoperations • The number of clock periods for an instruction, CPIi, becomes a matter of figuring out the microoperations and how to distribute them to individual clock periods • One can come up with 5-10 simple microoperations to be performed one after another, resulting in a CPIi of 5-10 • But, since the microoperations are simple, the clock period is short • Alternatively, one can come up with 2-4 complex microoperations, resulting in a CPIi of 2-4 • But, the clock period is longer
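The trade-off can be made concrete with a little arithmetic: the time one instruction takes is its CPIi multiplied by the clock period. The numbers below (8 periods of 1.0 ns versus 3 periods of 2.5 ns) are invented for illustration and do not come from the EMY design.

```python
def instruction_time_ns(cpi, clock_period_ns):
    """Time for one instruction = number of clock periods x clock period."""
    return cpi * clock_period_ns


# Assumed design A: many simple microoperations, short clock period.
many_short = instruction_time_ns(cpi=8, clock_period_ns=1.0)

# Assumed design B: few complex microoperations, long clock period.
few_long = instruction_time_ns(cpi=3, clock_period_ns=2.5)

print(many_short, few_long)
```

With these made-up figures the two designs finish an instruction in nearly the same time (8.0 ns versus 7.5 ns), so for an unpipelined CPU the choice is not obvious from CPIi alone; the clock period has to be taken into account as well.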
Designing the Unpipelined CPU Digital System • Calculating CPIi • What can we do ? • Few long clock periods vs. many but shorter clock periods ? • Since increasing the clock frequency is important for marketing purposes the second option would weigh in substantially • It turns out that if pipelining is implemented, having many shorter clock periods would be beneficial as we will see • CPIi figures will be large but CPIave will be close to 1 (one) ! • Today’s microprocessors have instruction CPIi values in the range of 10-30, but CPIave figures for their targeted applications are even less than 1 (one) ! • Because they employ advanced pipelining techniques, such as superscalar execution, hyperthreading, etc.
Designing the Unpipelined CPU Digital System • Determining microoperations for a machine language instruction • Some microoperations are performed for all the instructions • Usually at the same point in time during the execution of every instruction • Fetching the instruction is always the first microoperation to perform for all CPUs • Updating PC (PC <-- PC + 4) so that it points at the next instruction is also universal • The other microoperations depend on the instruction, the addressing mode, where the arguments are, the length of the arguments, etc.
Designing the Unpipelined CPU Digital System • Determining microoperations for a machine language instruction • We would list all the microoperations for each instruction, by making sure that we are consistent in terms of • Bus usage • We often decide an approximate number of buses we need for our datapath • Today’s CPUs have at least three internal buses to complete an integer arithmetic microoperation in one clock period • Two buses carry the numbers from two registers and the third bus carries the result to a register • ALU usage • An ALU is expensive and so we try to limit the number of ALUs
Designing the Unpipelined CPU Digital System • Determining microoperations for a machine language instruction • We would list all the microoperations for each instruction, by making sure that we are consistent in terms of • Register usage • Additional registers not visible to the architecture level are used to keep temporary values : microarchitectural registers • Typically, the more registers are used, the more clock periods we spend for an instruction since temporary values will be passed from one register in one clock period to another register to be used the following clock period • But, sometimes we have to use microarchitectural registers, such as the instruction register that keeps the current instruction • Control unit usage
Designing the Unpipelined CPU Digital System • Determine how each EMY architectural operation is implemented by microoperations • Most microoperations must be simple enough to be completed within one clock period • A few microoperations may not be completed in a clock period • For example, a memory read may take several clock periods since the memory is slower • These long microoperations should be accommodated in the high-level state diagram, the datapath, the low-level state diagram and the control unit • We will assume in the beginning that every microoperation is completed in one clock period
Designing the Unpipelined CPU Digital System • The EMY microoperations implied by the EMY machine language instructions include • Instruction fetch, performed always • Update PC for next instruction, performed always • Effective address calculation for displacement and relative addressing modes • Sign extension or catenation of 0s for data/addresses • Reading data from the memory • Writing data to the memory • Performing an arithmetic/logic operation • Register transfer • Testing a condition
What is Pipelining ? • The unpipelined MIPS CPU can be thought of as having five stages that correspond to the five major cycles • For the unpipelined MIPS CPU, at any time only one stage is busy and the remaining ones are idle • [Figure: instructions flow through the datapath stages IF, ID, EX, MEM and WB under the control unit]
What is Pipelining ? • The unpipelined CPU works like this : • Only one instruction is in the pipeline ! • [Figure: LW R8, 0(R9) occupies IF, ID, EX, MEM and WB in clock periods 1 through 5, one stage per clock period; ADD R10, R8, R11 enters IF in clock period 6, and execution continues this way]
What is Pipelining ? • Pipelining is the simultaneous execution of multiple instructions in an assembly line fashion in a single CPU
Clock period | IF | ID | EX | MEM | WB
1 | LW R8, 0(R9) | | | |
2 | ADD R10, R8, R11 | LW R8, 0(R9) | | |
3 | ADD R12, R13, R14 | ADD R10, R8, R11 | LW R8, 0(R9) | |
4 | SW R12, 0(R15) | ADD R12, R13, R14 | ADD R10, R8, R11 | LW R8, 0(R9) |
5 | BEQ R12, R0, 3 | SW R12, 0(R15) | ADD R12, R13, R14 | ADD R10, R8, R11 | LW R8, 0(R9)
What is Pipelining ? • Pipelining is a microarchitectural technique where consecutive instructions are executed in an overlapped fashion • Each instruction is in a different pipeline stage • All stages are busy
What is a Stage ? • Each stage is specialized hardware corresponding to a specific major cycle • IF, ID, EX, MEM, WB • The hardware for each major cycle can then be easily identified and is often called a stage
What is Pipelining ? • Pipelined execution of instructions is similar to the assembly line manufacturing of cars
What is Pipelining ? • There are two differences • On a car assembly line there is only one type of car assembled • For the CPU the instructions executed are different • Loads, Stores, A/L, Branch instructions • All the cars on an assembly line have the same requirements : the same pieces are placed on the cars • For the CPU, even if two back-to-back instructions are of the same type (for example two back-to-back Loads), they have different requirements (different effective addresses hence different memory locations are accessed)
What is Pipelining ? • Because of these two differences, each stage has to pass information related to the instruction it just worked on to the next stage • Temporary registers (latches, buffers) are used between two stages to pass the information about the instruction just leaving one stage and entering the next one • [Figure: latches placed between the IF, ID, EX, MEM and WB stages]
What is Pipelining ? • Latches are then necessary to pass information about an instruction from one stage to the next • Latches are also needed so that partial work done by one stage is passed to the next stage so the work continues
What is the Pipe ? • We give the name “pipe” to the set of stages since the stages are cascaded in a single dimension forming a pipe where instructions • Enter from one end • Stay in a stage for one clock period • Proceed to the next stage • Finally exit from the other end • By which time the instruction execution is completed
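The movement of instructions through the pipe can be imitated with a short simulation. This is a simplified sketch, assuming every instruction spends exactly one clock period in each of the five stages and that nothing ever stalls; the function `simulate` and the variable names are invented for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def simulate(instructions, clock_periods):
    """Return one dict per clock period mapping stage -> instruction (or None)."""
    pipe = [None] * len(STAGES)   # pipe[0] is IF, pipe[-1] is WB
    pending = list(instructions)
    history = []
    for _ in range(clock_periods):
        # Every instruction advances one stage; the one in WB exits;
        # the next pending instruction (if any) enters IF.
        entering = pending.pop(0) if pending else None
        pipe = [entering] + pipe[:-1]
        history.append(dict(zip(STAGES, pipe)))
    return history


program = [
    "LW R8, 0(R9)",
    "ADD R10, R8, R11",
    "ADD R12, R13, R14",
    "SW R12, 0(R15)",
    "BEQ R12, R0, 3",
]
history = simulate(program, 5)
for clock, snapshot in enumerate(history, start=1):
    print(clock, snapshot)
```

After five clock periods every stage holds an instruction, which is the "pipeline is full" condition. The sketch deliberately ignores dependences between instructions, which a real pipeline must handle.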
What is Pipelining ? • Consider a sequence of instructions and a 5-stage pipeline • Assume that all the instructions use the five stages • That is, they all take five clock periods to complete their execution • This is not possible in real life but let’s assume this for the time being to understand pipelining quickly • [Figure: instructions I1, I2, ..., I9, ... enter the IF, ID, EX, MEM, WB pipeline in order]
What is Pipelining ? • The execution can be shown as follows
Clock period   1   2   3   4   5   6   7   8
IF             I1  I2  I3  I4  I5  I6  I7  I8
ID                 I1  I2  I3  I4  I5  I6  I7
EX                     I1  I2  I3  I4  I5  I6
MEM                        I1  I2  I3  I4  I5
WB                             I1  I2  I3  I4
• Pipeline is full ≡ all stages are busy ≡ start-up time = 5 clock periods
What is Pipelining ? • Compared with the unpipelined CPU, the five stages are more complex to allow overlapped execution • All stages take the same amount of time, one clock period • The length of the clock period is determined by the slowest stage • Because it is difficult to obtain stages with an equal amount of work, hence time
What is Pipelining ? • If the CPU is unpipelined, the instructions would take 5 clock periods each • CPIi = 5 • Since each instruction is taking 5 clock periods • CPIave = 5 • Since the number of clock periods divided by the number of instructions run is 5 • [Figure: timeline with I1 through I7 completing at times 5, 10, 15, 20, 25, 30 and 35]
What is Pipelining ? • If the CPU is pipelined, after the pipeline becomes full (the start-up time), an instruction is completed every clock period, as opposed to one every 5 clock periods • CPIi = 5 • Since each instruction is taking 5 clock periods • CPIave ≈ 1 • Since after the start-up time, we complete one instruction each clock period • [Figure: timeline with I1 through I7 completing at times 5, 6, 7, 8, 9, 10 and 11]
What is Pipelining ? • Once the pipeline is filled, each clock period an instruction exits the pipeline • Each clock period an instruction is completed • It seems each instruction takes one clock period to execute • CPIave ≈ 1 !!!
What is Pipelining ? • Assume for next few slides that the unpipelined EMY CPU is converted to a pipelined CPU • CPILW = 5 • CPISW = 4 • CPIA/L R Format = 4 • CPIBEQ = 3
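Why CPIave only approaches 1 can be seen by counting clock periods. The sketch below assumes an ideal pipeline in which every instruction uses all the stages and there are no stalls; under that assumption n instructions need the start-up time plus one clock period for each remaining instruction. The function names are illustrative.

```python
def pipelined_clock_periods(n, stages=5):
    """Clock periods to finish n instructions in an ideal pipeline.

    The first instruction needs `stages` periods (the start-up time);
    each later instruction completes one period after the previous one.
    """
    return stages + (n - 1)


def cpi_ave(n, stages=5):
    """Average clock periods per instruction over the whole run."""
    return pipelined_clock_periods(n, stages) / n


for n in (7, 100, 1000):
    print(n, pipelined_clock_periods(n), round(cpi_ave(n), 3))
```

For the seven instructions I1 through I7 this gives 11 clock periods, against 35 for the unpipelined case. The start-up time is paid only once, so the longer the run, the closer CPIave gets to 1.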
What is Pipelining ? • Consider the following piece of EMY code
---
400200 LW R8, 0(R9) ; R8 <-- M[R9 + 0+]
400204 ADD R10, R11, R12 ; R10 <-- R11 + R12
400208 SUB R13, R14, R15 ; R13 <-- R14 - R15
40020C XOR R16, R17, R18 ; R16 <-- R17 XOR R18
400210 SW R19, 0(R20) ; M[R20 + 0+] <-- R19
400214 OR R21, R22, R23 ; R21 <-- R22 | R23
400218 SLT R24, R25, R26 ; If R25 < R26, R24 <-- 1, else R24 <-- 0
40021C BEQ R27, R28, 5 ; If R27 is equal to R28, branch to 400234
---
• This code is not realistic since the instructions are all independent of each other ! • But, for the sake of understanding pipelining, we will use this piece of code !
What is Pipelining ? • Let’s see its pipelined execution by using the textbook’s notation, assuming that the memory takes one clock period
Clock period              1   2   3   4   5   6   7   8   9   10
400200 LW R8, 0(R9)       IF  ID  EX  MEM WB
400204 ADD R10, R11, R12      IF  ID  EX  MEM
400208 SUB R13, R14, R15          IF  ID  EX  MEM
40020C XOR R16, R17, R18              IF  ID  EX  MEM
400210 SW R19, 0(R20)                     IF  ID  EX  MEM
400214 OR R21, R22, R23                       IF  ID  EX  MEM
400218 SLT R24, R25, R26                          IF  ID  EX  MEM
40021C BEQ R27, R28, 5                                IF  ID  EX
What is Pipelining ? • The textbook’s notation is hard to follow if there are more than a few instructions • Also, the notation requires a lot of space even for a few instructions • From now on, we will use our notation, which lists the clock period in which each instruction is in each stage • The execution, assuming that the cache memories take one clock period and there is no miss
                          IF  ID  EX  MEM WB
400200 LW R8, 0(R9)       1   2   3   4   5
400204 ADD R10, R11, R12  2   3   4   5
400208 SUB R13, R14, R15  3   4   5   6
40020C XOR R16, R17, R18  4   5   6   7
400210 SW R19, 0(R20)     5   6   7   8
400214 OR R21, R22, R23   6   7   8   9
400218 SLT R24, R25, R26  7   8   9   10
40021C BEQ R27, R28, 5    8   9   10
What is Pipelining ? • What if the EMY CPU was not pipelined ? • The execution timing would be as follows, assuming that the cache memories take one clock period and there is no miss
                          IF  ID  EX  MEM WB
400200 LW R8, 0(R9)       1   2   3   4   5
400204 ADD R10, R11, R12  6   7   8   9
400208 SUB R13, R14, R15  10  11  12  13
40020C XOR R16, R17, R18  14  15  16  17
400210 SW R19, 0(R20)     18  19  20  21
400214 OR R21, R22, R23   22  23  24  25
400218 SLT R24, R25, R26  26  27  28  29
40021C BEQ R27, R28, 5    30  31  32
• The execution completes in 32 clock periods ! • Pipelined execution takes 10 clock periods !
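Both totals can be checked with a few lines of Python. The CPIi values are the ones assumed on the earlier slide (LW 5, SW 4, A/L 4, BEQ 3); the model of the pipelined case, in which one instruction enters IF every clock period and nothing stalls, is a simplification that holds for this code only because its instructions are independent.

```python
# CPIi values assumed earlier in the slides for this example.
CPI = {"LW": 5, "SW": 4, "ADD": 4, "SUB": 4, "XOR": 4, "OR": 4, "SLT": 4, "BEQ": 3}

program = ["LW", "ADD", "SUB", "XOR", "SW", "OR", "SLT", "BEQ"]

# Unpipelined: instructions run one after another, so the clock periods add up.
unpipelined = sum(CPI[op] for op in program)

# Pipelined (ideal, no stalls): instruction i (0-based) enters IF in clock
# period i + 1 and finishes CPI - 1 periods later, i.e. in period i + CPI.
pipelined = max(i + CPI[op] for i, op in enumerate(program))

print(unpipelined, pipelined)
print(unpipelined / len(program), pipelined / len(program))  # CPIave of each
```

The second line of output gives the CPIave of each version for this eight-instruction run: 4.0 unpipelined against 1.25 pipelined. The pipelined figure is above 1 because the run is short and the start-up time is not yet amortized.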
What is Pipelining ? • Pipelining decreases the execution time of the program, CPUtime • The number of instructions run, NI, stays the same • We execute the same number of instructions for a program • The CPIi stays the same • Often the unpipelined CPIi and pipelined CPIi differ slightly for efficient pipelining • The Branch CPIi will decrease from 4 to 3 • The A/L Format CPIi will go up from 4 to 5 • Instructions go through stages similar to those of the unpipelined case • But, we execute several instructions at the same time • All the stages are busy now • The CPU does more per clock period • CPIave decreases
What is Pipelining ? • We execute more instructions per unit time (a second) • The throughput is increased • The MIPSave figure is increased • The number of instructions executed per second is increased • The MFLOPSave figure is increased • The number of FP operations performed per second is increased • That is why companies like to mention the MIPSave and MFLOPSave figures for their new generations of microprocessors since each new generation improves the pipeline which directly improves MIPSave and MFLOPSave.
What is Pipelining ? • Pipelining does not decrease the CPIi of each individual instruction but increases the clock period slightly • The execution time of each instruction in terms of seconds is increased slightly ! • This is due to the slightly longer clock period • This is due to overhead of handling several instructions per clock period
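The link between CPIave and the MIPS figure is a one-line formula: instructions per second equal the clock frequency divided by CPIave, and MIPS is that number divided by one million. The sketch below uses an assumed 1 GHz clock, which is not a figure from the slides, together with the CPIave values of the eight-instruction example (32 and 10 clock periods over 8 instructions).

```python
def mips_rating(clock_hz, cpi_ave):
    """Millions of instructions per second = f / (CPIave x 10^6)."""
    return clock_hz / (cpi_ave * 1e6)


CLOCK_HZ = 1e9  # assumed 1 GHz clock, for illustration only

unpipelined_mips = mips_rating(CLOCK_HZ, 32 / 8)  # CPIave = 4
pipelined_mips = mips_rating(CLOCK_HZ, 10 / 8)    # CPIave = 1.25

print(unpipelined_mips, pipelined_mips)
```

At the same clock frequency, lowering CPIave from 4 to 1.25 raises the rating from 250 to 800 MIPS, which is why an improved pipeline shows up directly in the MIPSave figure.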
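A small calculation shows how a single instruction can get slower while the program gets faster. The clock periods below (1000 ps unpipelined, 1100 ps pipelined, that is, 10% longer) are assumptions chosen for illustration; the clock period counts (5 for LW, 32 and 10 for the eight-instruction program) come from the earlier slides.

```python
def time_ps(clock_periods, clock_period_ps):
    """Elapsed time = number of clock periods x length of one clock period."""
    return clock_periods * clock_period_ps


UNPIPELINED_CLOCK_PS = 1000  # assumed
PIPELINED_CLOCK_PS = 1100    # assumed: 10% longer due to pipelining overhead

# One LW instruction: 5 clock periods in both versions.
lw_unpipelined = time_ps(5, UNPIPELINED_CLOCK_PS)
lw_pipelined = time_ps(5, PIPELINED_CLOCK_PS)

# The eight-instruction program: 32 vs. 10 clock periods.
program_unpipelined = time_ps(32, UNPIPELINED_CLOCK_PS)
program_pipelined = time_ps(10, PIPELINED_CLOCK_PS)

print(lw_unpipelined, lw_pipelined)
print(program_unpipelined, program_pipelined)
```

Under these assumptions the LW takes 5500 ps instead of 5000 ps, yet the whole program takes 11000 ps instead of 32000 ps. The slightly longer clock period costs each instruction a little, and overlapping the instructions wins back far more.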