Enhancement: PART 2. CPU and Memory: Design, Implementation, and Enhancement. Adapted from: The Architecture of Computer Hardware and Systems Software: An Information Technology Approach, 3rd Edition, Irv Englander, John Wiley and Sons, 2003. Wilson Wong, Bentley College; Linda Senne, Bentley College
Enhancement: PART 2 Topics: • CISC vs RISC (part 1) • Address Modes (part 1) • Cache (part 1) • Pipelining (part 2) • Scalar and Super Scalar (part 2)
Background Much of this material is based on the “Data Path” (see figure next slide)
The “DATA PATH” • As you study the DATA PATH figure, here are some things to note: • The external bus, also referred to as the CPU local bus, connects MAIN MEMORY and the Bus Interface Unit (BIU) • Data and instructions have separate caches within the CPU • The Prefetch Unit looks for the next instruction in memory (or cache) and loads it into the prefetch queues (recall the serial nature of programming)
The “DATA PATH” • The Branch Prediction Unit looks for the next instruction in memory based on a branch-type instruction • Example: if the LMC code is executing at address 40, and the instruction there is 768 (branch to address 68 if the Accumulator = 0), the Branch Prediction Unit fetches the instructions at addresses 68, 69, 70, etc. • The Pentium uses prefetch queues that are 64 bytes deep
The “DATA PATH” The sequence of events: • CPU initiates a fetch request, sent over the BIU • Memory subsystem gets the needed data/instructions, which are received by the BIU • BIU forwards instructions to the instruction cache and data to the data cache • The prefetcher searches the code cache for the next instruction and loads it into the instruction queues (D1) • From the 2 prefetch queues, instructions are moved to the control unit, which determines whether both can be executed at the same time or just one (D2) • Concurrently (and if the instruction is a branch type), the Branch Prediction Unit tries to determine which branch will be taken and fills the instruction queues (D1)
Pipelining • Fetching an instruction from memory is a major bottleneck. • So, the first step in pipelining is to get as many instructions as possible into instruction cache • The actions of fetching and decoding are broken down into “stages”… • Many texts use the assembly line concept as an analogy for pipelining • See next page for a five-stage pipeline
Pipelining • Notes for the previous slide… • During clock cycle 1, stage S1 is working on instruction 1, fetching it from memory • During clock cycle 2, stage S2 decodes instruction 1, while S1 fetches instruction 2 • During clock cycle 3, stage S3 fetches the operands for instruction 1; stage S2 decodes instruction 2; stage S1 fetches instruction 3. • During clock cycle 4, stage S4 executes instruction ___, S3 fetches operands for instruction ___, S2 decodes instruction ___, and S1 fetches instruction ___.
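The cycle-by-cycle notes above can be sketched as a small table generator. This is a minimal illustration, assuming an ideal pipeline with no stalls; the function name `stage_occupancy` is hypothetical, and the label for stage S5 is an assumption, since the notes only describe S1 through S4.

```python
# Stage labels: S1-S4 follow the notes; "write back" for S5 is an assumption.
STAGES = ["S1 fetch", "S2 decode", "S3 operands", "S4 execute", "S5 write back"]


def stage_occupancy(num_instructions, num_stages=5):
    """Map each clock cycle to the instruction number held by every stage.

    Entry i of a row is the instruction in stage S(i+1); None means idle.
    Assumes one stage per cycle and no stalls.
    """
    total_cycles = num_stages + num_instructions - 1
    table = {}
    for cycle in range(1, total_cycles + 1):
        row = []
        for stage in range(1, num_stages + 1):
            # Instruction n enters S1 in cycle n, so it is in stage s at cycle n + s - 1.
            instr = cycle - stage + 1
            row.append(instr if 1 <= instr <= num_instructions else None)
        table[cycle] = row
    return table


if __name__ == "__main__":
    for cycle, row in stage_occupancy(4).items():
        cells = ", ".join(
            f"{name}: {'idle' if instr is None else instr}"
            for name, instr in zip(STAGES, row)
        )
        print(f"cycle {cycle}: {cells}")
```

Running it for four instructions prints one row per clock cycle, which can be compared against the notes for cycles 1 to 3.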
Pipelining [Figure: the two Pentium pipelines, U-pipeline (top) and V-pipeline (bottom)]
Pipelining • Notes for the previous slide… • Only one instruction is being completed at a time (scalar) • Two instructions issued together must not conflict over each other's resources • Either the compiler checks or • Conflicts are detected during execution • The u-pipeline (top) is the main pipeline • Can execute any Pentium instruction • The other, v-pipeline (bottom) only executes simple integer instructions
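The conflict check mentioned above can be sketched as follows. This is a deliberately simplified model, not the Pentium's actual pairing rules: it assumes the only conditions are that the V-pipeline instruction is "simple" and that the two instructions do not conflict over registers. The `Instr` class, the `can_pair` function, and the register names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    name: str
    reads: frozenset = frozenset()   # registers the instruction reads
    writes: frozenset = frozenset()  # registers the instruction writes
    simple: bool = True              # True for a simple integer instruction


def can_pair(u, v):
    """Return True if u (U-pipeline) and v (V-pipeline) may run together.

    Simplified rules assumed here:
      1. v must be a simple integer instruction.
      2. v must not read a register that u writes.
      3. u and v must not write the same register.
    """
    if not v.simple:
        return False
    if u.writes & v.reads:
        return False
    if u.writes & v.writes:
        return False
    return True


if __name__ == "__main__":
    add = Instr("ADD R1, R2", frozenset({"R1", "R2"}), frozenset({"R1"}))
    sub = Instr("SUB R3, R1", frozenset({"R3", "R1"}), frozenset({"R3"}))
    inc = Instr("INC R4", frozenset({"R4"}), frozenset({"R4"}))
    print(can_pair(add, sub))  # sub reads R1, which add writes
    print(can_pair(add, inc))  # no shared registers
```

The same check could run in a compiler before execution or in hardware during execution; the sketch does not distinguish between the two.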
Pipelining • The numbers… • Suppose the cycle time is 2 nsec. Then ONE instruction takes 2 nsec × 5 stages = 10 nsec to complete (called latency) • But every clock cycle (2 nsec) an instruction completes! • Look: 1 instruction per 2 nsec • $= 1\ \text{inst} / (2 \times 10^{-9}\ \text{sec})$ • $= 1{,}000{,}000{,}000\ \text{inst} / 2\ \text{sec}$ • $= 500{,}000{,}000\ \text{inst/sec}$ • Or 500 MIPS • This is like 24 inches = 2 feet, so 24 in / 2 feet = 12 in/foot
Pipelining • Test Question: Suppose the cycle time is 7 nsec and there is an 8-stage pipeline. A) Calculate the latency B) Calculate the MIPS • Solution: a) 7 nsec × 8 = 56 nsec latency b) 1 instruction per 7 nsec = $(1/7) \times 10^{9}$ inst/sec = $(1/7) \times 10^{3} \times 10^{6}$ inst/sec, or 143 MIPS (rounded) • Note: $10^{6}$ = M
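The latency and MIPS arithmetic from the two slides above can be checked with a short sketch. The function names are hypothetical, and the model assumes an ideal pipeline in which one instruction completes every clock cycle.

```python
def latency_ns(cycle_time_ns, stages):
    """Time for ONE instruction to pass through every stage, in nanoseconds."""
    return cycle_time_ns * stages


def throughput_mips(cycle_time_ns):
    """Millions of instructions per second, assuming one completes per cycle."""
    instructions_per_second = 1e9 / cycle_time_ns  # 1 sec = 10^9 nsec
    return instructions_per_second / 1e6           # 10^6 = M


if __name__ == "__main__":
    # 2 nsec cycle, 5 stages
    print(latency_ns(2, 5), throughput_mips(2))
    # 7 nsec cycle, 8 stages (the test question)
    print(latency_ns(7, 8), round(throughput_mips(7)))
```

Note that the number of stages affects only the latency; the throughput depends on the cycle time alone.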
Pipelining Summary • Assembly-line technique to allow overlapping between fetch-execute cycles of sequences of instructions • Only one instruction is being executed to completion at a time • Pipelining is also known as Scalar processing • Average instruction execution rate is approximately equal to the clock rate of the CPU (about one instruction completes per clock cycle) • Problems from stalling • Instructions have different numbers of steps • Problems from branching
Pipelining Questions • Q: A program has 500 instructions. Each instruction averages 6 steps to complete. How many CPU cycles will it take to complete if it is implemented on a CPU that has pipelining capability? • Assume there are no branches or dependencies between instructions (unlikely, but just for academic purposes…) • Solution: Pipelining assumes each INSTRUCTION STEP completes in one CPU cycle • Without overlap: 500 inst x 6 steps/inst = 3,000 CPU cycles • With the steps overlapped in a 6-stage pipeline, the first instruction completes after 6 cycles and one more completes every cycle after that: Total = 6 + 499 = 505 CPU cycles
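A minimal sketch comparing two cycle counts for the question above: steps executed strictly one after another, and steps overlapped in an ideal pipeline that never stalls (the model of the earlier slides, where one instruction completes every clock cycle once the pipeline is full). Both function names are hypothetical.

```python
def sequential_cycles(num_instructions, steps_per_instruction):
    """Cycles when every step of every instruction runs one after another."""
    return num_instructions * steps_per_instruction


def pipelined_cycles(num_instructions, steps_per_instruction):
    """Cycles in an ideal pipeline with one stage per step and no stalls.

    The first instruction needs all the steps; each later instruction
    completes one cycle after the one before it.
    """
    if num_instructions == 0:
        return 0
    return steps_per_instruction + (num_instructions - 1)


if __name__ == "__main__":
    print(sequential_cycles(500, 6))
    print(pipelined_cycles(500, 6))
```

For long programs the pipelined count approaches one cycle per instruction, which is why the fill time of the pipeline is often ignored in rough estimates.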
Super Scalar • Process more than one instruction per clock cycle • Separate fetch and execute cycles as much as possible • Buffers for fetch and decode phases • Parallel execution units • The DATA PATH of the Pentium CPU is SUPER SCALAR
Branch Problem Solutions • Separate pipelines for both possibilities • Probabilistic approach • Requiring the following instruction to not be dependent on the branch • Instruction Reordering (superscalar processing)
Superscalar Issues • Out-of-order processing – dependencies (hazards) • Data dependencies • Branch (flow) dependencies and speculative execution • Parallel speculative execution or branch prediction • Branch History Table • Register access conflicts • Logical registers
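One common textbook way to build a Branch History Table is with 2-bit saturating counters, sketched below. The slides do not say which scheme is meant, so this choice is an assumption, as are the class name, the table size, and indexing by the low bits of the branch address.

```python
class BranchHistoryTable:
    """Branch predictor sketch using 2-bit saturating counters.

    Counter values 0-1 predict "not taken"; values 2-3 predict "taken".
    """

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, address):
        # Assumption: the table is indexed by the address modulo its size.
        return address % self.entries

    def predict(self, address):
        """Return True if the branch at this address is predicted taken."""
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        """Record the real outcome, saturating the counter at 0 and 3."""
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


if __name__ == "__main__":
    bht = BranchHistoryTable()
    for outcome in [True, True, False, True]:
        print("predicted taken:", bht.predict(40), "actual:", outcome)
        bht.update(40, outcome)
```

Because the counter needs two wrong guesses in a row to flip its prediction, a single unusual outcome (such as a loop exit) does not disturb the prediction for a branch that is usually taken.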
Other Enhancements • Timing Issues • Microprogrammed Implementation • Hardware Implementation
Hardware Implementation • Hardware – operations are implemented by logic gates • Advantages • Speed • RISC designs are simple and typically implemented in hardware
Pipelining Questions • Q: A program has 500 instructions. Each instruction averages 6 steps to complete. How many CPU cycles will it take to complete if it is implemented on a CPU that has SUPER SCALAR capability? • Assume there are no branches or dependencies between instructions (unlikely, but just for academic purposes…) • Solution: Under the simplifying assumption that each INSTRUCTION completes in one CPU cycle, Total = 500 inst x 1 cycle/inst = 500 CPU cycles • A CPU that completes more than one instruction per clock cycle would need fewer cycles
Microprogrammed Implementation • Microcode consists of tiny programs stored in ROM that implement CPU instructions • Advantages • More flexible • Easier to implement complex instructions • Can emulate other CPUs • Disadvantage • Requires more clock cycles
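The idea of microcode as tiny programs in ROM can be sketched as a lookup table from each instruction to its list of micro-operations. Everything in this table is hypothetical: the instruction names, the register names (MAR, MDR, ACC), and the three-step sequences are illustrative and do not describe any real CPU.

```python
# Hypothetical microcode ROM: each instruction maps to its micro-operations.
MICROCODE_ROM = {
    "LOAD":  ["MAR <- address", "MDR <- MEM[MAR]", "ACC <- MDR"],
    "ADD":   ["MAR <- address", "MDR <- MEM[MAR]", "ACC <- ACC + MDR"],
    "STORE": ["MAR <- address", "MDR <- ACC", "MEM[MAR] <- MDR"],
}


def cycles_for(program):
    """Total clock cycles, assuming one micro-operation per clock cycle."""
    return sum(len(MICROCODE_ROM[op]) for op in program)


if __name__ == "__main__":
    program = ["LOAD", "ADD", "STORE"]
    for op in program:
        print(op, "->", "; ".join(MICROCODE_ROM[op]))
    print("total cycles:", cycles_for(program))
```

Adding or changing an instruction only means editing the table, which illustrates the flexibility advantage; the cycle count per instruction illustrates the cost.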