CPU and Memory: Design, Implementation, and Enhancement

PART 2: CPU and Memory: Design, Implementation, and Enhancement • Adapted from: The Architecture of Computer Hardware and Systems Software: An Information Technology Approach, 3rd Edition, Irv Englander, John Wiley and Sons, 2003 • Wilson Wong, Bentley College • Linda Senne, Bentley College

PART 2 Topics: • CISC vs RISC (Part 1) • Address Modes (Part 1) • Cache (Part 1) • Pipelining (Part 2) • Scalar and Super Scalar (Part 2)

Background • Much of this material is based on the “DATA PATH” (see the figure on the next slide)

The “DATA PATH”

The “DATA PATH” • As you study the DATA PATH figure, here are some things to note: • The external bus, also referred to as the CPU local bus, connects the MAIN MEMORY and the Bus Interface Unit (BIU) • Data and instructions have separate caches within the CPU • The Prefetch Unit looks for the next instruction in memory (or cache) and loads it into the prefetch queues (recall that a program normally executes its instructions in sequence)

The “DATA PATH” • The Branch Prediction Unit looks for the next instruction in memory BASED ON a branch-type instruction • Example: if the LMC code is executing at address 40 and the instruction there is 768 (branch to address 68 if the Accumulator = 0), the Branch Prediction Unit goes and gets the instructions at addresses 68, 69, 70, etc. • The Pentium uses prefetch queues that are 64 bytes deep
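
The LMC example above can be sketched in a few lines of Python. This is only an illustration of the idea: the function names `decode_lmc` and `predicted_fetch_addresses` are hypothetical, and the choice to prefetch three addresses simply mirrors the "68, 69, 70" in the example rather than any real queue depth.

```python
def decode_lmc(instruction: int) -> tuple:
    """Split a three-digit LMC instruction into (opcode, address)."""
    return divmod(instruction, 100)


def predicted_fetch_addresses(instruction: int, count: int = 3) -> list:
    """Addresses to prefetch if the branch in `instruction` is taken.

    `count` is an assumed value for illustration only.
    """
    opcode, target = decode_lmc(instruction)
    # In the LMC, opcode 7 means "branch if the accumulator is zero".
    if opcode != 7:
        raise ValueError("this sketch only handles the branch-on-zero opcode")
    return [target + offset for offset in range(count)]


print(decode_lmc(768))                 # (7, 68)
print(predicted_fetch_addresses(768))  # [68, 69, 70]
```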

The “DATA PATH” The sequence of events: • The CPU initiates a fetch request, which is sent through the BIU • The memory subsystem returns the needed data or instruction, which is received by the BIU • The BIU forwards instructions to the instruction cache and data to the data cache • The prefetcher searches the code cache for the next instruction and loads it into the instruction queues (D1) • From the 2 prefetch queues, instructions are moved to the control unit, which determines whether both can be executed at the same time or just one (D2) • Concurrently (and if the instruction is a branch type), the Branch Prediction Unit tries to determine which branch will be taken and fills the instruction queues (D1)

Pipelining • Fetching an instruction from memory is a major bottleneck • So, the first step in pipelining is to get as many instructions as possible into the instruction cache • The actions of fetching and decoding are broken down into “stages” • Many texts use the assembly line as an analogy for pipelining • See the next slide for a five-stage pipeline

Pipelining

Pipelining • Notes for the previous slide… • During clock cycle 1, stage S1 is working on instruction 1, fetching it from memory • During clock cycle 2, stage S2 decodes instruction 1, while S1 fetches instruction 2 • During clock cycle 3, stage S3 fetches the operands for instruction 1; stage S2 decodes instruction 2; stage S1 fetches instruction 3. • During clock cycle 4, stage S4 executes instruction ___, S3 fetches operands for instruction ___, S2 decodes instruction ___, and S1 fetches instruction ___.
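
The pattern in these notes (each instruction moves one stage to the right on every clock cycle) can be sketched as a small timetable generator. This is a minimal sketch: the helper `instruction_in_stage` is a hypothetical name, and the label for stage S5 is an assumption, since the notes above only describe S1 through S4.

```python
from typing import Optional

# Stage labels; S1-S4 follow the notes, "write back" for S5 is assumed.
STAGES = ["fetch", "decode", "fetch operands", "execute", "write back"]


def instruction_in_stage(cycle: int, stage: int) -> Optional[int]:
    """Return the instruction number occupying `stage` during `cycle`.

    Both arguments are 1-based. Returns None if the stage is still idle.
    """
    instruction = cycle - stage + 1
    return instruction if instruction >= 1 else None


# Print which instruction each stage is working on for the first 5 cycles.
for cycle in range(1, 6):
    row = []
    for stage in range(1, len(STAGES) + 1):
        instr = instruction_in_stage(cycle, stage)
        row.append(f"S{stage}:{instr if instr is not None else '-'}")
    print(f"cycle {cycle}  " + "  ".join(row))
```

Reading the row for a given cycle gives the instruction number handled by each stage in that cycle.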

Pipelining (figure: the U-pipeline and the V-pipeline)

Pipelining • Notes for the previous slide… • Only one instruction is completed at a time (scalar) • Two instructions must not conflict over each other's resources • Either the compiler checks for conflicts, or • Conflicts are detected during execution • The u-pipeline (top) is the main pipeline • It can execute any Pentium instruction • The v-pipeline (bottom) executes only simple integer instructions
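
A rough sketch of the conflict check described above is shown below. It is a deliberately simplified model, not the actual Pentium pairing rules: the `Instr` class, the `can_pair` function, the register names, and the definition of "conflict" (the second instruction reads or writes a register that the first one writes) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    name: str
    simple_integer: bool
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def can_pair(first: Instr, second: Instr) -> bool:
    """Can `second` run in the v-pipeline while `first` runs in the u-pipeline?"""
    # The v-pipeline only executes simple integer instructions.
    if not second.simple_integer:
        return False
    # Assumed conflict rule: second must not touch a register first writes.
    if first.writes & (second.reads | second.writes):
        return False
    return True


add = Instr("ADD r1, r2", True, frozenset({"r1", "r2"}), frozenset({"r1"}))
sub = Instr("SUB r3, r4", True, frozenset({"r3", "r4"}), frozenset({"r3"}))
inc = Instr("INC r1", True, frozenset({"r1"}), frozenset({"r1"}))
fmul = Instr("FMUL", False)

print(can_pair(add, sub))   # True: independent simple integer instructions
print(can_pair(add, inc))   # False: INC needs the r1 that ADD writes
print(can_pair(add, fmul))  # False: not a simple integer instruction
```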

Pipelining • The numbers… • Suppose the cycle time is 2 nsec. Then ONE instruction takes 2 nsec x 5 stages = 10 nsec to complete (called latency) • But once the pipeline is full, an instruction completes every clock cycle (2 nsec)! • Look: 1 instruction per 2 nsec • = 1 inst / (2 x 10^-9 sec) • = 1,000,000,000 inst / 2 sec • = 500,000,000 inst/sec • Or 500 MIPS • This is a unit-rate calculation, like 24 inches = 2 feet, so 24 in / 2 feet = 12 in/foot

Pipelining • Test Question: Suppose the cycle time is 7 nsec and there is an 8-stage pipeline. A) Calculate the latency B) Calculate the MIPS • Solution: a) 7 nsec x 8 = 56 nsec latency b) 1 instruction per 7 nsec = (1/7) x 10^9 inst/sec = (1/7) x 10^3 x 10^6 inst/sec, or 143 MIPS (rounded) • Note: 10^6 = M
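
The two calculations can be checked with a short script. This is a minimal sketch: the function names are hypothetical, and the throughput formula assumes the ideal case used in these slides, where one instruction completes every clock cycle.

```python
def latency_ns(cycle_time_ns: float, stages: int) -> float:
    """Time for ONE instruction to pass through every pipeline stage."""
    return cycle_time_ns * stages


def throughput_mips(cycle_time_ns: float) -> float:
    """Millions of instructions per second, assuming one completes per cycle.

    1 / (cycle_time_ns * 1e-9) inst/sec, divided by 1e6, is 1000 / cycle_time_ns.
    """
    return 1e3 / cycle_time_ns


# Worked example: 2 nsec cycle, 5 stages.
print(latency_ns(2, 5), throughput_mips(2))         # 10 500.0

# Test question: 7 nsec cycle, 8 stages.
print(latency_ns(7, 8), round(throughput_mips(7)))  # 56 143
```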

Pipelining Summary • Assembly-line technique that overlaps the fetch-execute cycles of a sequence of instructions • Only one instruction is executed to completion at a time • This form of pipelining is also known as Scalar processing • On average, about one instruction completes per clock cycle, so the instruction rate is approximately equal to the clock speed of the CPU • Problems from stalling • Instructions have different numbers of steps • Problems from branching

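Pipelining Questions • Q: A program has 500 instructions. Each instruction averages 6 steps to complete. How many CPU cycles will it take to complete if it is implemented on a CPU that has pipelining capability? • Assume there are no branches or dependencies between instructions (unlikely, but just for academic purposes…) • Solution: Pipelining assumes each INSTRUCTION STEP completes in one CPU cycle and that the steps of successive instructions overlap; so • The first instruction completes after 6 cycles, and each of the remaining 499 instructions completes one cycle after the one before it • Total = 6 + (500 - 1) = 505 CPU cycles • Without pipelining the total would be 500 inst x 6 steps/inst = 3,000 CPU cycles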
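
For comparison, here is a sketch of the cycle count with and without overlap between instructions. It assumes the ideal conditions stated in the question (no branches, no dependencies, one step per cycle); the function names are hypothetical.

```python
def cycles_without_overlap(instructions: int, steps: int) -> int:
    """Each instruction runs all of its steps before the next one starts."""
    return instructions * steps


def cycles_pipelined(instructions: int, steps: int) -> int:
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if instructions == 0:
        return 0
    return steps + (instructions - 1)


print(cycles_without_overlap(500, 6))  # 3000
print(cycles_pipelined(500, 6))        # 505
```

The gap between the two numbers is the benefit of overlapping the steps: the per-instruction latency is unchanged, but the steady-state rate becomes one instruction per cycle.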

Super Scalar • Process more than one instruction per clock cycle • Separate fetch and execute cycles as much as possible • Buffers for fetch and decode phases • Parallel execution units • The DATA PATH of the Pentium CPU is SUPER SCALAR

Super Scalar

Scalar vs. Superscalar Processing

Branch Problem Solutions • Separate pipelines for both possibilities • Probabilistic approach • Requiring the following instruction to not be dependent on the branch • Instruction Reordering (superscalar processing)

Superscalar Issues • Out-of-order processing – dependencies (hazards) • Data dependencies • Branch (flow) dependencies and speculative execution • Parallel speculative execution or branch prediction • Branch History Table • Register access conflicts • Logical registers
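
One common way to build a Branch History Table is with a 2-bit saturating counter per entry. The sketch below illustrates that general scheme; it is not claimed to be the mechanism of the Pentium or of any other specific CPU. The class name, table size, indexing by address modulo table size, and initial counter value are all assumptions.

```python
class BranchHistoryTable:
    """2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        # Assumed starting state: 1 = "weakly not taken".
        self.counters = [1] * entries

    def _index(self, address: int) -> int:
        return address % self.entries

    def predict(self, address: int) -> bool:
        return self.counters[self._index(address)] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(address: int, outcomes: list) -> int:
    """Replay actual branch outcomes and count correct predictions."""
    table = BranchHistoryTable()
    correct = 0
    for taken in outcomes:
        if table.predict(address) == taken:
            correct += 1
        table.update(address, taken)
    return correct


# A loop branch at address 40: taken 9 times, then falls through once.
loop_outcomes = [True] * 9 + [False]
print(count_correct(40, loop_outcomes), "of", len(loop_outcomes))  # 8 of 10
```

In this run the predictor misses only the first iteration (before it has any history) and the final loop exit.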

Other Enhancements • Timing Issues • Microprogrammed Implementation • Hardware Implementation

Hardware Implementation • Hardware – operations are implemented by logic gates • Advantages • Speed • RISC designs are simple and typically implemented in hardware

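Super Scalar Questions • Q: A program has 500 instructions. Each instruction averages 6 steps to complete. How many CPU cycles will it take to complete if it is implemented on a CPU that has SUPER SCALAR capability? • Assume there are no branches or dependencies between instructions (unlikely, but just for academic purposes…) • Solution: A SUPER SCALAR CPU can complete more than one instruction per CPU cycle, so the answer depends on the number of parallel pipelines • If one instruction completes per cycle, Total = 500 inst x 1 cycle/inst = 500 CPU cycles • With two pipelines, as in the Pentium, up to two instructions complete per cycle, so the total approaches 500 / 2 = 250 CPU cycles (plus the few cycles needed to fill the pipelines)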
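
How the cycle count depends on the number of parallel pipelines can be sketched as follows. The question does not state an issue width, so `width` is an assumption, as is the ideal behaviour (no branches, no dependencies, every pipeline kept busy); the function name is hypothetical.

```python
import math


def cycles_superscalar(instructions: int, steps: int, width: int) -> int:
    """Ideal cycle count when `width` instructions can complete per cycle.

    The pipelines take `steps` cycles to fill; after that, one group of
    `width` instructions completes on every cycle.
    """
    if instructions == 0:
        return 0
    groups = math.ceil(instructions / width)
    return steps + (groups - 1)


print(cycles_superscalar(500, 6, 1))  # 505 (a single pipeline)
print(cycles_superscalar(500, 6, 2))  # 255 (two pipelines)
```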

Microprogrammed Implementation • Microcode consists of tiny programs stored in ROM that implement the CPU instructions in place of hardwired logic • Advantages • More flexible • Easier to implement complex instructions • Can emulate other CPUs • Disadvantage • Requires more clock cycles
