Computer Organization and Architecture + Networks Lecture 8 CPU Structure and Function
CPU Structure • CPU must: • Fetch instructions • Interpret instructions • Fetch data • Process data • Write data
Overview • This section investigates how a typical CPU is organized • major components (revisited) • Register organization • Instruction pipelining
CPU Organization • Recall the functions performed by the CPU • Fetch instructions • Fetch data • Process data • Write data • Organizational requirements that are derived from these functions • ALU • Control logic • Temporary storage • Means to move data and instructions in and around the CPU
CPU With System Bus • External view of the CPU
Register Organization • Registers form the highest level of the memory hierarchy • Small set of high-speed storage locations • CPU must have some working space (temporary storage for data and control information) • Two types of registers • User-visible • May be referenced by assembly-level instructions and are thus “visible” to the user • Control and status registers • Used to control the operation of the CPU • Most are not visible to the user
User Visible Registers • General categories based on function • General Purpose • Can be assigned a variety of functions • Ideally, they are defined orthogonally to the operations within the instructions • Data • These registers only hold data • Address • These registers only hold address information • Examples: general purpose address registers, segment pointers, stack pointers, index registers • Condition Codes • Visible to user but values set by the CPU as the result of performing operations • Example code bits: zero, positive, overflow • Bit values are used as the basis for conditional jump instructions
User Visible Registers • Design trade-off between general purpose and specialized registers • General purpose registers maximize flexibility in instruction design • Special purpose registers permit implicit register specification in instructions → reduces the register field size in an instruction • No clear “best” design approach • How many registers are enough? • More registers permit more operands to be held within the CPU → reduces memory bandwidth requirements to some extent • More registers cause an increase in the field sizes needed to specify registers in an instruction word • Locality of reference may not support too many registers • Most machines use 8–32 registers (does not include RISC machines with register windowing)
User Visible Registers • How big (wide)? • Address registers should be wide enough to hold the largest address • Data registers should be wide enough to hold most data types • Would not want to use 64-bit registers if the vast majority of data operations used 16- and 32-bit operands • Related to the width of the memory data bus • Concatenate registers together to store longer formats • B-C registers in the 8085 • AccA-AccB registers in the 68HC11
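As a small illustration of the last point, the sketch below (Python; register names purely illustrative) shows how two 8-bit registers can be concatenated into one 16-bit value, as with the 8085's B-C pair:

```python
# Sketch: concatenating two 8-bit registers into a 16-bit value,
# as with the 8085's B-C pair (names here are illustrative).

def pair(high: int, low: int) -> int:
    """Concatenate two 8-bit register values into a 16-bit value."""
    return ((high & 0xFF) << 8) | (low & 0xFF)

def unpair(value: int) -> tuple[int, int]:
    """Split a 16-bit value back into (high, low) 8-bit halves."""
    return (value >> 8) & 0xFF, value & 0xFF

b, c = 0x12, 0x34
bc = pair(b, c)               # 0x1234, usable as a 16-bit pointer
assert unpair(bc) == (b, c)
```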
Control and Status Registers • These registers are used during the fetching, decoding and execution of instructions • Many are not visible to the user/programmer • Some are visible but cannot be (easily) modified • Typical registers • Program counter • Points to the next instruction to be executed • Instruction register • Contains the instruction being executed • Memory address register • Memory data/buffer register • Program status word(s) • Superset of the condition code register • Interrupt masks, supervisory modes, etc. • Status information
Foreground Reading • Stallings Chapter 12 • Manufacturer web sites & specs
Instruction Cycle • Recall the instruction cycle from Stallings Chapter 3 • Fetch the instruction; decode it; fetch operands; perform the operation; store results; recognize pending interrupts • Based on the addressing techniques (Chapter 9), we can modify the state diagram for the cycle to explicitly show indirection in addressing • The flow of data and information between registers during the instruction cycle varies from processor to processor
Instruction Cycle State Diagram • The cycle includes an indirect cycle to determine if any indirect addressing is involved • This illustrates more correctly the nature of the instruction cycle: once an instruction is fetched, its operand specifiers must be identified • Each input operand in memory is then fetched, and this process may require indirect addressing • Register-based operands need not be fetched • Once the opcode is executed, a similar process may be needed to store the result in main memory
Instruction Pipelining - Introduction • Organizational enhancements to the processor can improve its performance. Among these enhancements are: • Use of multiple registers rather than a single accumulator • Use of cache memory • Instruction pipelining • The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction • A “good” design goal of any system is to have all of its components performing useful work all of the time → high efficiency • Following the instruction cycle in a sequential fashion does not permit this level of efficiency
Instruction Pipelining • Compare the instruction cycle to an assembly line in a manufacturing plant • Perform all tasks concurrently, but on different (sequential) instructions • The result is temporal parallelism • The result is the instruction pipeline • Products at various stages can be worked on in parallel (simultaneously)
Instruction Pipelining • In summary, pipelining overlaps the execution of several instructions by starting the next instruction before the previous one has completed
Pipelining a CPU • The hardware required to do much of the Fetch-Execute Cycle is independent: • Fetching requires the PC, the bus and main memory • Executing requires the ALU and registers • Overlap the execution of each instruction so that the hardware is all being used • Can fetch the next instruction during execution of the current instruction → called instruction prefetch or fetch overlap • Each stage of the Fetch-Execute Cycle executes in parallel but on a different instruction • Fetch first instruction • Decode first, fetch second instruction • Execute first, decode second, fetch third
Pipelining a CPU • The Fetch-Execute Cycle can be decomposed into 6 stages: • Fetch instruction (FI) • Decode instruction (DI) • Calculate operand addresses (CO) • Fetch operands (FO) • Execute instruction (EI) • Write operand (WO)
Pipeline Timing Diagram • First instruction requires 6 units of time to complete • Since the second instruction started at time 2, it finishes at time 7 (even though it also needs 6 units of time to complete) • Instruction i completes at time i + 5, e.g. instruction 7 finishes at 7 + 5 = time 12
Pipeline Timing Diagram • 9 instructions: (i) 54 time units (9 × 6 stages) without pipelining; (ii) 14 time units with pipelining
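These timing figures follow directly from the stage count. A minimal sketch (Python), assuming an ideal pipeline with equal-duration stages and no branches or stalls:

```python
# Sketch: ideal 6-stage pipeline timing (FI DI CO FO EI WO),
# assuming equal-duration stages and no branches or stalls.

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def total_time(n: int, k: int = len(STAGES)) -> tuple[int, int]:
    """Return (time without pipelining, time with pipelining)."""
    return n * k, k + (n - 1)

def finish_time(i: int, k: int = len(STAGES)) -> int:
    """Instruction i (1-based) completes at time i + k - 1."""
    return i + k - 1

print(total_time(9))    # (54, 14), matching the diagram above
print(finish_time(7))   # 12: instruction 7 finishes at 7 + 5
```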
Pipeline Timing Diagram • There are factors that limit the performance enhancement of a pipeline: • The stages do not have equal duration • Conditional branch instructions
The Effects of Conditional Branch on Pipelining • Conditional Branch Example: Instruction 3 is a conditional branch to instruction 15 (figure below) • The pipeline simply loads the next instruction • No instruction completes during times 9–12 (branch penalty)
The Effects of Conditional Branch on Pipelining • Another example: a 3-stage pipeline (FI, EI, WO) with a branch at Instruction 3 to Instruction 15, and 6 instructions in total (including Instruction 15). Draw the timing diagram (a simulation sketch follows below)
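One way to check the exercise is to simulate it. The sketch below assumes one time unit per stage, that the branch resolves at the end of its EI stage, and that the single instruction fetched behind the branch is squashed; the function and its parameters are made up for illustration:

```python
# Sketch of the exercise: a 3-stage (FI, EI, WO) pipeline in which
# instruction 3 is a taken branch to instruction 15. Assumptions:
# one time unit per stage, the branch resolves at the end of its
# EI stage, and the instruction fetched behind it is squashed.

def branch_timing(branch_at=3, target=15, last=16):
    rows = {}                                # instr -> stage times
    t, i = 1, 1
    while i <= last:
        rows[i] = {"FI": t, "EI": t + 1, "WO": t + 2}
        if i == branch_at:
            rows[i + 1] = {"FI": t + 1}      # fetched, then squashed
            t += 2                           # wait for the branch's EI
            i = target                       # refetch from the target
        else:
            t += 1
            i += 1
    return rows

for i, stages in branch_timing().items():
    print(i, stages)    # instruction 15 starts FI only at time 5
```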
Pipeline Performance • Speed-up factor for the instruction pipeline compared to execution without the pipeline: • S = nk / [k + (n − 1)] where, k = number of stages in the instruction pipeline n = number of instructions • For example: a pipelined processor has 5 stages and 96 instructions • Speed-up, S = (5 × 96) / [5 + (96 − 1)] = 480/100 = 4.8 • For large n, the speed-up approaches the number of stages
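The same formula in runnable form (a direct transcription of the calculation above):

```python
# Speed-up of a k-stage pipeline over sequential execution:
# S = nk / (k + (n - 1)).

def speedup(n: int, k: int) -> float:
    """n = number of instructions, k = number of pipeline stages."""
    return (n * k) / (k + (n - 1))

print(speedup(96, 5))      # 4.8, as in the example above
print(speedup(10_000, 5))  # ~4.998: S approaches k as n grows
```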
Dealing with Branches • Conditional branch instructions pose a problem for pipelining because it is impossible to determine whether the branch will be taken until the instruction is executed • Approaches to dealing with conditional branches: • Multiple Streams • Prefetch Branch Target • Loop Buffer • Branch Prediction • Delayed Branch
Dealing with Branches • Multiple Streams • A normal pipeline must choose one of two instructions to fetch after a branch instruction and may make the wrong choice • To overcome this problem: use two streams (the pipeline must support multiple streams) and fetch both instruction paths • Limitations of this approach: • Contention delays for access to memory and the registers • Additional branch instructions entering the pipeline need further streams • Used in the IBM 370/168 and IBM 3033
Dealing with Branches • Prefetch Branch Target • When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer • If the branch is not taken, the sequential instructions are already in the pipe → no performance loss • If the branch is taken, the next instruction has already been prefetched → minimal branch penalty (no memory read operation is needed at the end of the branch to fetch the instruction) • Loop Buffer (look-ahead, look-behind buffer) • Many conditional branch operations are used for loop control • Expand the prefetch buffer so as to buffer the last few instructions executed, in addition to the ones that are waiting to be executed • If the buffer is big enough, the entire loop can be held in it → reducing the branch penalty • Similar principle to a cache (see the sketch below)
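A loop buffer can be sketched as a small cache of the most recently fetched instructions: a branch whose target is still in the buffer is served without a memory access. The class, its capacity, and the addresses below are illustrative assumptions, not a real design:

```python
# Sketch of a loop buffer: holds the n most recently fetched
# instructions; a backward branch whose target is still buffered
# avoids a memory access. Sizes and addresses are made up.

from collections import OrderedDict

class LoopBuffer:
    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self.lines = OrderedDict()           # address -> instruction

    def on_fetch(self, addr: int, instr: str) -> None:
        self.lines[addr] = instr
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # evict the oldest entry

    def lookup(self, target: int):
        return self.lines.get(target)        # hit if the loop fits

buf = LoopBuffer()
for a in range(0x100, 0x120, 4):             # fetch a small loop body
    buf.on_fetch(a, f"instr@{a:#x}")
print(buf.lookup(0x100) is not None)         # True: loop-back branch hits
```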
Dealing with Branches • Branch Prediction • Make a good guess as to which instruction will be executed next and start that one down the pipeline • If the guess turns out to be right, no loss of performance in the pipeline • If the guess was wrong, empty the pipeline and restart with the correct instruction suffering the full branch penalty • Static guesses: make the guess without considering the runtime history of the program • Predict never taken • Predict always taken • Predict by opcode • Dynamic guesses: track the history of conditional branches in the program • Taken/not taken switch • Branch history table
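As a sketch of the dynamic approach, the code below implements a branch history table of 2-bit saturating counters, one common realization of the taken/not-taken switch; the table size and the PC-modulo indexing are illustrative assumptions:

```python
# Sketch: dynamic branch prediction with a table of 2-bit
# saturating counters. Table size and indexing are illustrative.

class BranchPredictor:
    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.table = [2] * entries    # 0-1: predict not taken, 2-3: taken

    def _index(self, pc: int) -> int:
        return pc % self.entries      # simple hash of the branch address

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = BranchPredictor()
for outcome in [True, True, False, True]:    # a loop-like pattern
    print(bp.predict(0x400), outcome)        # prediction vs. actual
    bp.update(0x400, outcome)
```

Note the hysteresis: a single not-taken outcome inside a loop does not flip the prediction, which is the advantage of 2 bits over 1.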
Dealing with Branches • Delayed Branch • Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved • The compiler is tasked with rearranging the instruction sequence to find enough independent instructions (with respect to the conditional branch) to feed into the pipeline after the branch, so that the branch penalty is reduced to zero • Implemented on many RISC architectures
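A much-simplified sketch of what the compiler does: instructions carry read/write register sets, and an earlier instruction that the branch condition does not depend on is hoisted into the delay slot. A real scheduler would check far more (e.g. dependencies against every instruction hoisted over); all instruction names here are made up:

```python
# Sketch: filling a single branch delay slot. Each instruction is
# (text, registers read, registers written). Simplified: only the
# dependency between the moved instruction and the branch condition
# is checked, not every instruction it is hoisted over.

def fill_delay_slot(instrs, branch_idx):
    """Move an instruction the branch doesn't depend on into its slot."""
    br_reads = instrs[branch_idx][1]
    for i in range(branch_idx - 1, -1, -1):
        text, reads, writes = instrs[i]
        if not (writes & br_reads):            # branch ignores its result
            return instrs[:i] + instrs[i + 1:branch_idx + 1] + [instrs[i]]
    return instrs + [("nop", set(), set())]    # nothing safe: pad with nop

prog = [("add r1, r2, r3", {"r2", "r3"}, {"r1"}),
        ("sub r4, r4, 1",  {"r4"},       {"r4"}),
        ("bnez r4, loop",  {"r4"},       set())]
for text, _, _ in fill_delay_slot(prog, 2):
    print(text)    # "add" ends up in the delay slot after "bnez"
```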
Dealing With Branches • Operation of the branch history table: • The instruction fetch stage always fetches the next sequential address • If the branch is taken, logic in the processor detects this and instructs that the next instruction be fetched from the target address • Each prefetch triggers a lookup in the branch history table: if a match is found, a prediction is made (fetch the next instruction or the branch target); if there is no match, the next sequential instruction is fetched • The state of the entry is then updated to record whether the prediction was correct; on an incorrect prediction, the select logic redirects the next fetch to the correct address • The branch history table is a small cache associated with the instruction fetch stage of the pipeline
Superscalar and Superpipelined Processor • Logical evolution of pipeline designs resulted in 2 high-performance execution techniques • Superpipeline Designs • Observation: Many pipeline stages need less than half a clock cycle • Double internal clock speed gets two tasks per external clock cycle • Time to complete individual instructions does not change • Degree of parallelism goes up • Perceived speed-up goes up • Superscalar • Implement the CPU such that more than one instruction can be performed (completed) at a time • Involves replication of some or all parts of the CPU/ALU • Examples on the next slide
Superscalar and Superpipelined Processor • Superscalar • Examples: • Fetch multiple instructions at the same time • Decode multiple instructions at the same time • Perform add and multiply at the same time • Perform load/stores while performing an ALU operation • Degree of parallelism, and hence the speed-up of the machine, goes up as more instructions are executed in parallel
Comparison of superscalar and superpipeline operation to a regular pipeline
Superscalar Design Limitations • Superscalar operation depends on the ability to execute multiple instructions in parallel → instruction-level parallelism: the degree to which, on average, the instructions of a program can be executed in parallel • Fundamental limitations • Data dependency • Resource dependency • Procedural dependency • Output dependency • Antidependency
Superscalar Design Limitations • Data dependency: must ensure computed results are the same as would be computed on a strictly sequential machine • Two instructions cannot be executed in parallel if the output of one is the input of the other, or if they both write to the same output location • Consider: S1: A = B + C S2: D = A + 1 S3: B = E + F S4: A = E + 3 • Resource dependency • In the above sequence of instructions, the adder unit gets a real workout • Parallelism is limited by the number of adders in the ALU
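These dependency classes can be computed mechanically from the read/write sets of each instruction. A minimal sketch, reusing the S1-S4 sequence above:

```python
# Sketch: classifying the dependency of a later instruction on an
# earlier one from their read/write register sets (S1-S4 above).

def classify(writes1, reads1, writes2, reads2):
    """Dependencies of instruction 2 on an earlier instruction 1."""
    deps = []
    if writes1 & reads2:
        deps.append("data (read-after-write)")
    if writes1 & writes2:
        deps.append("output (write-after-write)")
    if reads1 & writes2:
        deps.append("anti (write-after-read)")
    return deps

# S1: A = B + C   S2: D = A + 1   S3: B = E + F   S4: A = E + 3
print(classify({"A"}, {"B", "C"}, {"D"}, {"A"}))       # S2 on S1: data
print(classify({"A"}, {"B", "C"}, {"B"}, {"E", "F"}))  # S3 on S1: anti
print(classify({"A"}, {"B", "C"}, {"A"}, {"E"}))       # S4 on S1: output
```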
Superscalar Design Limitations • Procedural dependency • The presence of branches in an instruction sequence complicates pipeline operation: the instructions following a branch have a procedural dependency on it and cannot be executed until the branch itself is executed
Comparison of the fundamental limitations
Superscalar Design Issues • Instruction Issue Policy: In what order are instructions issued to the execution unit and in what order do they finish? • In-order issue with in-order completion • Simplest method, but severely limits performance • Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions • “Slow” execution of some instructions delays all subsequent instructions Instructions are fetched 2 at a time. The next pair must wait until the execution phase is cleared (in-order completion). Instruction issuing stalls when there is a conflict of functional units. Elapsed time is 8 cycles
Superscalar Design Issues • In-order issue with out-of-order completion • Any number of instructions may be in the execution stage at one time • Instruction issue is still limited by resource conflicts or data and procedural dependencies • Output dependencies resulting from out-of-order completion must be resolved • Interrupt handling can be tricky (interrupts may be imprecise) • Look at the same I1-I6 example with out-of-order completion • I2 can finish earlier than I1, which allows I3 to finish earlier. This saves 1 cycle
Superscalar Design Issues • Out-of-order issue with out-of-order completion • Decode and execute stages are decoupled via an instruction buffer “window” • Decoded instructions are “stored” in the window awaiting execution • Functional units take instructions from the window in an attempt to stay busy • The “antidependency” class of data dependency must be dealt with • Look at the previous example with out-of-order completion and a queue for out-of-order issue • I6 can finish earlier than I5. This saves 1 cycle compared to the previous policy
Superscalar Design Issues • Register Renaming • Output dependency and antidependency occur because register contents may not reflect the correct ordering from the program → may result in a pipeline stall • Output dependency and antidependency are eliminated by the use of a register “pool” as follows • For each instruction that writes to a register X, a “new” register X is instantiated • Multiple “register Xs” can co-exist • Consider S1: R3 = R3 + R5 S2: R4 = R3 + 1 S3: R3 = R5 + 1 S4: R7 = R3 + R4 S3 cannot complete before S2 starts, as S2 needs the value in R3 and S3 changes R3
Superscalar Design Issues • Register Renaming • Consider S1: R3 = R3 + R5 S2: R4 = R3 + 1 S3: R3 = R5 + 1 S4: R7 = R3 + R4 Becomes… S1: R3b = R3a + R5a S2: R4b = R3b + 1 S3: R3c = R5a + 1 S4: R7b = R3c + R4b After renaming, S3 writes R3c while S2 reads R3b, so the antidependency is removed and S3 no longer has to wait for S2
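The renaming above can be produced mechanically: every write to a register creates a new version, and every read refers to the latest version. The instruction format and the a/b/c suffix scheme below are illustrative:

```python
# Sketch: register renaming with per-register version counters.
# Each write creates a new version (suffix b, c, ...); each read
# refers to the current version. The format is illustrative.

import string

def rename(instrs):
    """instrs: list of (dest, src1, src2) register-name triples."""
    version = {}                          # register -> version index

    def cur(r):
        if not r.startswith("R"):         # literal operand, not a register
            return r
        return r + string.ascii_lowercase[version.get(r, 0)]

    out = []
    for dest, s1, s2 in instrs:
        a, b = cur(s1), cur(s2)           # read the current versions first
        version[dest] = version.get(dest, 0) + 1
        out.append((cur(dest), a, b))     # then create the new version
    return out

prog = [("R3", "R3", "R5"),   # S1: R3 = R3 + R5
        ("R4", "R3", "1"),    # S2: R4 = R3 + 1
        ("R3", "R5", "1"),    # S3: R3 = R5 + 1
        ("R7", "R3", "R4")]   # S4: R7 = R3 + R4
for d, a, b in rename(prog):
    print(f"{d} = {a} + {b}")  # reproduces R3b = R3a + R5a, etc.
```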
Superscalar Design Issues • Impact on machine parallelism • Adding (ALU) functional units without register renaming support may not be cost-effective • Performance is limited by data dependency • Out-of-order issue benefits from large instruction buffer windows • Easier for a functional unit to find a pending instruction • Branch Prediction • Delayed branch is not good for superscalar • Multiple instructions would need to execute in the delay slot, causing instruction dependence problems • For superscalar processors, branch prediction is used instead of delayed branch