EC6009 ADVANCED COMPUTER ARCHITECTURE

UNIT I FUNDAMENTALS OF COMPUTER DESIGN
Review of fundamentals of CPU, memory and I/O - Trends in technology, power, energy and cost - Dependability - Performance evaluation.

INTRODUCTION
• The first general-purpose electronic computer was created about 65 to 70 years ago.
• Today, less than $500 buys a mobile computer with more performance, more main memory and more disk storage than a computer that cost $1 million in 1985.
• This rapid improvement has come both from advances in the technology used to build computers and from innovations in computer design.
• RISC-based machines focused on two critical performance techniques:
  1. Exploitation of instruction-level parallelism (ILP), initially through pipelining and later through multiple instruction issue.
  2. Use of caches.
• For many applications, the highest-performance microprocessors of today outperform the supercomputers of less than 10 years ago.
• Dramatic improvement in cost-performance leads to new classes of computers.

INTRODUCTION
• The last decade saw the rise of smart cell phones and tablet computers, which many people are using as their primary computing platform instead of PCs.
• These mobile client devices increasingly use the internet to access warehouses containing tens of thousands of servers (warehouse-scale computers, WSCs).
• Mainframe computers and high-performance supercomputers are all collections of microprocessors.
• The nature of applications is also changing. Speech, sound, images and video are becoming increasingly important, along with the predictable response time that is so critical to the user experience.
• An inspiring example is Google Goggles. This application lets you hold up your cell phone and point its camera at an object; the image is sent wirelessly over the internet to a WSC that recognizes the object and tells you interesting information about it. For example, it can read the bar code on a book cover to tell you whether the book is available online and at what price.
• Since 2003, single-processor performance improvement has dropped to less than 22% per year due to the twin hurdles of maximum power dissipation and the lack of more ILP to exploit.
• In 2004, Intel canceled its high-performance uniprocessor projects and joined others in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors.

REVIEW OF FUNDAMENTALS OF CPU
The functional blocks in a computer are:
1. ALU
2. Control Unit
3. Memory
4. Input Unit
5. Output Unit
• The ALU contains the electronic circuits needed to perform arithmetic and logical operations.
• The control unit analyses each instruction in the program and sends the relevant control signals to all other units: ALU, memory, input unit and output unit.
• The program is fed into the computer through the input unit and stored in memory. To execute the program, the instructions have to be fetched from memory one by one. This fetching is done by the control unit.
• After an instruction is fetched, the control unit decodes it and, according to the instruction, issues control signals to the other units.
• After an instruction is executed, its result is stored in memory or held temporarily in the control unit or ALU so that it can be used by the next instruction.
• The results of a program are taken out of the computer through the output unit.
• The control unit and ALU are collectively known as the Central Processing Unit (CPU).

REVIEW OF FUNDAMENTALS OF CPU
• The physical units in a computer, such as the CPU, memory, input and output units, form the hardware (H/W).
• Compilers as well as user programs (in a high-level language or machine language) form the software (S/W).
• Hardware works as dictated by the software. The operating system is special software that manages the H/W and S/W.
Arithmetic and Logic Unit: The ALU has hardware circuits that perform primitive arithmetic and logical operations. The H/W sections in the ALU are:
• Adder
• Accumulator
• General Purpose Registers
• Counters
• Shifters
• Complementer
Adder: Adds two numbers and gives the result.
Accumulator: A register that temporarily holds the result of a previous operation in the ALU.

REVIEW OF FUNDAMENTALS OF CPU
General Purpose Registers (GPRs): When an operand is stored in main memory, it takes time to retrieve it. If it is stored within the CPU, it is immediately available. The GPRs store different types of information:
1. Operand
2. Operand address
3. Constant
Since they are used for multiple purposes, these registers are known as general purpose registers.
Scratch Pad Memory or Registers: During complex operations such as multiplication and division, intermediate results must be stored temporarily. For this purpose there are usually one or more scratch pad registers. These are purely internal H/W resources and are not addressable by the program.
Shifter and Complementer: The shifter provides the left and right shifts required for various operations. The complementer provides the 2’s complement of binary numbers.

REVIEW OF FUNDAMENTALS OF CPU
CONTROL UNIT: The control unit is the most complex unit in a computer. Its main functions are:
1. Fetching instructions
2. Analyzing the opcode
3. Generating control signals for performing various operations
H/W resources of a control unit:
Program Counter or Instruction Address Counter (IAC): The IAC contains the memory address of the next instruction to be fetched. When an instruction is fetched, the IAC is incremented so that it points to the address of the next instruction.
Every instruction contains an opcode. In addition, it may contain one or more of the following:
• Operand
• Operand address
• Register address
PSW (Program Status Word) Register: It contains various status bits, known as flags, describing the current condition of the CPU. Two such flags are:
1. Interrupt Enable: When this bit is 1, the CPU recognizes interrupt requests. When this bit is 0, interrupt requests are ignored by the CPU and remain pending. The non-maskable interrupt (NMI) is an exception to this.
2. Overflow: When this bit is 1, it indicates that an overflow occurred in the ALU during the previous arithmetic operation.
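The fetch, decode and execute steps, together with the role of the IAC and the overflow flag, can be sketched with a toy simulator. This is only an illustration: the three-instruction set (LOAD, ADD, HALT), the 8-bit signed accumulator and the function name `run` are assumptions made for the example and do not describe any real processor.

```python
# Toy model of the fetch-decode-execute cycle (illustrative only).
# The opcodes LOAD/ADD/HALT and the 8-bit accumulator are assumptions.

def run(program):
    pc = 0            # program counter (IAC): address of the next instruction
    acc = 0           # accumulator
    overflow = False  # PSW overflow flag
    while True:
        opcode, operand = program[pc]   # fetch the instruction at the IAC
        pc += 1                         # IAC now points to the next instruction
        if opcode == "LOAD":            # decode, then execute
            acc = operand
        elif opcode == "ADD":
            total = acc + operand
            # Flag records overflow of the previous arithmetic operation.
            overflow = total > 127 or total < -128
            acc = ((total + 128) % 256) - 128   # wrap to signed 8 bits
        elif opcode == "HALT":
            return acc, overflow, pc
        else:
            raise ValueError(f"unknown opcode: {opcode}")


if __name__ == "__main__":
    print(run([("LOAD", 100), ("ADD", 100), ("HALT", None)]))
```

In this sketch the flag is updated only by ADD, which matches the description of the overflow bit as reflecting the previous arithmetic operation.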

MEMORY AND IO
The memory is organized into locations. Each memory location is known as one memory word.
Memory Types: Older computers used magnetic core memory, while present-day computers use semiconductor memory. Core memory is non-volatile, whereas semiconductor RAM is volatile. Semiconductor memory is of two types: SRAM and DRAM. SRAM preserves the contents of all locations as long as the power supply is present. DRAM can retain the contents of a location only for a few milliseconds, so it must be refreshed periodically.
Random Access and Sequential Access Memories: In a random access memory (RAM), the access time is the same for all locations (core and semiconductor memories are RAM). In a sequential access memory, such as magnetic tape, read or write access is sequential: the time taken to access the first location is the shortest and the time taken for the last location is the longest.

MEMORY AND IO
Memory Organization: The memory unit consists of the following sections:
• Memory Address Register (MAR)
• Memory Data Register (MDR)
• Memory control logic
• Memory cells
For a read operation, the CPU performs the following sequence:
(i) Sends the address to the MAR.
(ii) Sends the READ signal to the memory control unit. The memory control unit decodes the address bits and identifies the location to be accessed. It then initiates a read operation. The memory takes some time to present the contents of the location in the MDR.
(iii) After a sufficient time interval, the CPU transfers the information from the MDR.

MEMORY AND IO
For a write operation, the CPU performs the following sequence:
(i) Sends the address to the MAR.
(ii) Sends the data to the MDR.
(iii) Sends the WRITE signal to the memory control unit. The memory control unit decodes the address bits and identifies the location where the write operation has to be performed. It then routes the MDR contents to memory and initiates the write operation.
Memory Access Time: The time taken by the memory to supply the contents of a location, measured from the time it receives the READ signal, is called the memory access time. Example values: 800 ns for core memory and 100 ns for semiconductor memory.
Memory Cycle Time: The memory access time plus the additional recovery time (during which the memory is busy with internal operations) is known as the memory cycle time.
Auxiliary Memory:
• Floppy disk drive
• Hard disk drive
• Magnetic tape drive
• CD-ROM
Input/Output Units: Common input units are keyboard, floppy disk, hard disk, magnetic tape, mouse, light pen, scanner and optical disk. Common output units are display terminal, printer, plotter, floppy disk drive, hard disk drive, magnetic tape drive and optical disk drive.
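The read and write sequences described for the memory unit can be modelled in a few lines. The sketch below is a simplified assumption: the class `MemoryUnit` and the helpers `cpu_read` and `cpu_write` are hypothetical names, and access time and cycle time are not modelled at all.

```python
# Minimal model of a memory unit with MAR and MDR (timing is ignored).
# All class and function names here are illustrative assumptions.

class MemoryUnit:
    def __init__(self, size):
        self.cells = [0] * size   # memory cells, one word per location
        self.mar = 0              # Memory Address Register
        self.mdr = 0              # Memory Data Register

    def read(self):
        # Control logic decodes the MAR and presents the contents in the MDR.
        self.mdr = self.cells[self.mar]

    def write(self):
        # Control logic decodes the MAR and routes the MDR contents to memory.
        self.cells[self.mar] = self.mdr


def cpu_write(mem, address, data):
    mem.mar = address   # send the address to the MAR
    mem.mdr = data      # send the data to the MDR
    mem.write()         # send the WRITE signal


def cpu_read(mem, address):
    mem.mar = address   # send the address to the MAR
    mem.read()          # send the READ signal
    return mem.mdr      # transfer the information from the MDR


if __name__ == "__main__":
    mem = MemoryUnit(16)
    cpu_write(mem, 5, 42)
    print(cpu_read(mem, 5))
```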

TRENDS IN TECHNOLOGY
INTEGRATED CIRCUIT LOGIC TECHNOLOGY:
• Transistor density increases by about 35% per year.
• Die size increases by 10% to 20% per year.
• The combined effect is a growth in transistor count on a chip of about 40% to 55% per year, or doubling every 18 to 24 months.
• This trend is popularly known as Moore’s law.
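As a quick sanity check on the link between an annual growth rate and a doubling time, compound growth gives a doubling time of ln 2 / ln(1 + rate) years. The helper name `doubling_time_years` is an assumption introduced for this sketch.

```python
import math

def doubling_time_years(annual_growth):
    """Years needed to double at a compound annual growth rate (0.40 = 40%)."""
    return math.log(2) / math.log(1 + annual_growth)

if __name__ == "__main__":
    for rate in (0.40, 0.55):
        months = 12 * doubling_time_years(rate)
        print(f"{rate:.0%} per year -> doubles in about {months:.1f} months")
```

Growth of 40% to 55% per year works out to a doubling time of roughly 19 to 25 months, which is in line with the 18 to 24 months quoted on the slide.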

TRENDS IN TECHNOLOGY
SEMICONDUCTOR DRAM:
• Capacity per DRAM chip has increased by about 25% to 40% per year recently, doubling roughly every two to three years.

TRENDS IN TECHNOLOGY
SEMICONDUCTOR FLASH (Electrically Erasable Programmable Read-Only Memory):
• Non-volatile semiconductor memory; the standard storage device in personal mobile devices (PMDs).
• Capacity per Flash chip has increased by about 50% to 60% per year recently, doubling roughly every two years.
• Flash memory is 15 to 20 times cheaper per bit than DRAM.
MAGNETIC DISK TECHNOLOGY:
• Prior to 1990, density increased by about 30% per year, doubling in three years. The rate rose to 100% per year in 1996. Since 2004, it has dropped back to about 40% per year.
• Disks are 15 to 25 times cheaper per bit than Flash.
• Disks are 300 to 500 times cheaper per bit than DRAM.
• This technology is central to server and warehouse-scale storage.
NETWORK TECHNOLOGY:
• Network performance depends both on the performance of switches and on the performance of the transmission system.

PERFORMANCE TRENDS
BANDWIDTH OR THROUGHPUT:
• The total amount of work done in a given time, such as megabytes per second for a disk transfer.
LATENCY OR RESPONSE TIME:
• The time between the start and completion of an event, such as milliseconds for a disk access.
TRENDS IN POWER AND ENERGY IN IC:
• For CMOS chips, the primary energy consumption has been in switching transistors, also called dynamic energy.
• The energy required per transistor is proportional to the product of the capacitive load driven by the transistor and the square of the voltage:
$\text{Energy}_{\text{dynamic}} \propto \text{Capacitive load} \times \text{Voltage}^2$

TRENDS IN POWER AND ENERGY IN IC:
• The product of capacitive load and voltage squared is the energy of a pulse of the logic transition 0 → 1 → 0 or 1 → 0 → 1. The energy of a single transition (0 → 1 or 1 → 0) is then:
$\text{Energy}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2$
• The power required per transistor is the product of the energy of a transition and the frequency of transitions:
$\text{Power}_{\text{dynamic}} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$
• Dynamic power and energy are greatly reduced by lowering the voltage. Voltages have dropped from 5 V to just under 1 V in 20 years.
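Because these relations are proportionalities, they are most useful for comparing two operating points of the same chip, where the capacitive load cancels out. The sketch below assumes an unchanged capacitive load; the function names and the 15% reduction used in the example are illustrative choices, not values from the slides.

```python
# Ratios of dynamic energy and power between a new and an old operating point,
# assuming the capacitive load does not change.

def dynamic_energy_ratio(voltage_scale):
    """New energy / old energy when voltage is multiplied by voltage_scale."""
    return voltage_scale ** 2

def dynamic_power_ratio(voltage_scale, frequency_scale):
    """New power / old power when voltage and frequency are both scaled."""
    return voltage_scale ** 2 * frequency_scale

if __name__ == "__main__":
    # Hypothetical case: voltage and frequency are both reduced by 15%.
    print(f"energy ratio: {dynamic_energy_ratio(0.85):.2f}")
    print(f"power ratio:  {dynamic_power_ratio(0.85, 0.85):.2f}")
```

The quadratic dependence on voltage is why lowering the supply voltage is so much more effective than lowering frequency alone.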

TRENDS IN POWER AND ENERGY IN IC:
Do nothing well:
• Most microprocessors (µPs) today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled. If some cores are idle, their clocks are stopped.
Dynamic Voltage-Frequency Scaling (DVFS):
• PMDs, laptops and servers have periods of low activity in which there is no need to operate at the highest clock frequency and voltage.
• Modern µPs offer a few clock frequencies and voltages at which they can operate with lower power and energy.
• Power savings via DVFS: for example, a server may be operated at three different clock rates: 2.4 GHz, 1.8 GHz and 1 GHz.
Design for typical case:
• PMDs and laptops are often idle, so memory and storage offer low-power modes to save energy and extend battery life.
• On-chip temperature sensors detect when activity should be reduced automatically to avoid overheating.
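To get a feel for the three clock rates mentioned under DVFS, the following sketch estimates dynamic power relative to full speed. It assumes by default that the supply voltage stays constant, which understates the real savings because DVFS also lowers the voltage; the function name and the `voltage_scale` parameter are assumptions for illustration.

```python
# Relative dynamic power at a reduced clock rate.
# Assumption: power scales with voltage^2 * frequency; voltage_scale=1.0
# means the voltage is left unchanged (a conservative simplification).

def relative_dynamic_power(clock_ghz, base_clock_ghz=2.4, voltage_scale=1.0):
    return voltage_scale ** 2 * (clock_ghz / base_clock_ghz)

if __name__ == "__main__":
    for clock in (2.4, 1.8, 1.0):
        share = relative_dynamic_power(clock)
        print(f"{clock} GHz -> {share:.2f} of full-speed dynamic power")
```

Passing a `voltage_scale` below 1.0 shows how much more is saved when the lower clock rate also permits a lower voltage.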

TRENDS IN POWER AND ENERGY IN IC:
Overclocking:
• Intel started offering Turbo mode in 2008: the chip decides that it is safe to run at a higher clock rate for a short time, possibly on just a few cores, until the temperature starts to rise.
• For single-threaded code, these microprocessors can turn off all cores but one and run it at an even higher clock rate.
• The operating system can turn off Turbo mode, but there is no notification once it is enabled, so programs may vary in performance due to room temperature.
• Static power is due to leakage current, which flows even when a transistor is off: $\text{Power}_{\text{static}} \propto \text{Current}_{\text{static}} \times \text{Voltage}$
• Static power is proportional to the number of devices. Increasing the number of transistors increases power even if they are idle.
• SRAM caches need power to maintain the stored values.
• The processor is only a portion of the whole energy cost of a system, so it can make sense to use a faster, less energy-efficient processor to allow the rest of the system to go into a sleep mode sooner. This strategy is known as race-to-halt.

TRENDS IN COST:
Cost of an IC:
• $\text{Cost of IC} = \dfrac{\text{Cost of die} + \text{Cost of testing die} + \text{Cost of packaging and final test}}{\text{Final test yield}}$
• $\text{Cost of die} = \dfrac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$
• $\text{Dies per wafer} = \dfrac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}} - \dfrac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}}$
• Problem 1: Find the number of dies per 300 mm (30 cm) wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side.
• 1.5 cm die (die area = 1.5 × 1.5 = 2.25 cm²): Dies per wafer = 706.9/2.25 − 94.2/2.12 ≈ 270
• 1.0 cm die (die area = 1 × 1 = 1 cm²): Dies per wafer = 706.9/1 − 94.2/1.41 ≈ 640
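The dies-per-wafer formula and Problem 1 can be checked numerically. In this sketch the function names are assumptions, and the wafer cost and die yield passed to `cost_of_die` in the example are hypothetical values chosen only to show the call; the slides do not give them.

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Wafer area divided by die area, minus the dies lost along the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

def cost_of_die(wafer_cost, wafer_diameter_cm, die_area_cm2, die_yield):
    """Cost of wafer / (dies per wafer x die yield)."""
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield)

if __name__ == "__main__":
    for side_cm in (1.5, 1.0):
        count = dies_per_wafer(30, side_cm * side_cm)
        print(f"{side_cm} cm die: about {round(count)} dies per wafer")
    # Hypothetical wafer cost (5000) and die yield (0.5), for illustration only.
    print(f"cost per good die: {cost_of_die(5000, 30, 1.0, 0.5):.2f}")
```

The second term of the formula matters more for large dies, since a larger share of them fall on the rounded edge of the wafer.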

DEPENDABILITY
• Dependability is a measure of a system’s availability, reliability and maintainability.
• Infrastructure providers started offering Service Level Agreements (SLAs) to guarantee that their networking or power service would be dependable.
• For example, they would pay the customer a penalty if they failed to meet the agreement for more than some number of hours per month.
Two main measures of dependability:
Module Reliability:
• Mean Time To Failure (MTTF) is a reliability measure. The reciprocal of MTTF is a rate of failures.
• Service interruption is measured as Mean Time To Repair (MTTR).
• Mean Time Between Failures (MTBF) = MTTF + MTTR.
Module Availability:
• $\text{Module availability} = \dfrac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$
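The reliability and availability measures translate directly into code. The function names below are assumptions, and the MTTF and MTTR figures in the example are hypothetical inputs rather than data from the slides.

```python
def failure_rate(mttf_hours):
    """Failures per hour: the reciprocal of MTTF."""
    return 1 / mttf_hours

def mtbf(mttf_hours, mttr_hours):
    """Mean Time Between Failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours

def availability(mttf_hours, mttr_hours):
    """Fraction of time the module delivers service."""
    return mttf_hours / (mttf_hours + mttr_hours)

if __name__ == "__main__":
    # Hypothetical module: MTTF of 1,000,000 hours and MTTR of 24 hours.
    print(f"availability: {availability(1_000_000, 24):.6f}")
```

Shortening the repair time improves availability without making failures any less frequent, which is why MTTR is tracked separately from MTTF.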

MEASURING, REPORTING AND SUMMARIZING PERFORMANCE
• An Amazon.com administrator may say a computer is faster when it completes more transactions per hour.
• The computer user is interested in reducing response time, the time between the start and the completion of an event, also referred to as execution time.
• The operator of a warehouse-scale computer may be interested in increasing throughput, the total amount of work done in a given time.
• We often want to relate the performance of two different computers, say X and Y. The phrase “X is faster than Y” means that the response time or execution time is lower on X than on Y for the given task. In particular, “X is n times faster than Y” means:
$\dfrac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
• Since execution time is the reciprocal of performance:
$n = \dfrac{\text{Execution time}_Y}{\text{Execution time}_X} = \dfrac{\text{Performance}_X}{\text{Performance}_Y}$
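A small sketch makes the direction of the ratio explicit, since it is easy to invert by mistake. The function names and the execution times in the example are assumptions for illustration.

```python
def performance(execution_time):
    """Performance is the reciprocal of execution time."""
    return 1 / execution_time

def times_faster(exec_time_x, exec_time_y):
    """n such that X is n times faster than Y: time on Y divided by time on X."""
    return exec_time_y / exec_time_x

if __name__ == "__main__":
    # Hypothetical measurements: X takes 10 s and Y takes 15 s for the same task.
    print(times_faster(10.0, 15.0))
```

Computing the same ratio from performance values, Performance X over Performance Y, gives the same n.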

QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN
PRINCIPLE OF LOCALITY:
• Programs tend to reuse data and instructions they have used recently. The principle of locality also applies to data accesses, though not as strongly as to code accesses.
• Two different types of locality have been observed:
• Temporal Locality: Recently accessed items are likely to be accessed again in the near future.
• Spatial Locality: Items whose addresses are near one another tend to be referenced close together in time.
AMDAHL’S LAW:
• It states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
• $\text{Speedup} = \dfrac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$
• Alternatively: $\text{Speedup} = \dfrac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$

QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN
$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)$
$\text{Speedup}_{\text{overall}} = \dfrac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \dfrac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$
PROBLEM 1:
• Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster on computation in the web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
• $\text{Fraction}_{\text{enhanced}} = 0.4$, $\text{Speedup}_{\text{enhanced}} = 10$
• $\text{Speedup}_{\text{overall}} = \dfrac{1}{0.6 + 0.4/10} = \dfrac{1}{0.64} \approx 1.56$
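Amdahl's law is short enough to evaluate directly, which makes it easy to explore how the result changes with the enhanced fraction. The function name `overall_speedup` is an assumption for this sketch; the inputs in the example are those of Problem 1.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law: overall speedup when only part of the task is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

if __name__ == "__main__":
    # Problem 1: computation is 40% of the time and becomes 10 times faster.
    print(f"{overall_speedup(0.4, 10):.2f}")
    # Upper bound if the computation took no time at all: 1 / (1 - 0.4).
    print(f"{1 / (1 - 0.4):.2f}")
```

Even an infinitely fast processor could not push the overall speedup past 1 / 0.6, because the 60% of time spent waiting for I/O is untouched by the enhancement.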

THE PROCESSOR PERFORMANCE EQUATION
• Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles or clock cycles.
• $\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$
• Equivalently: $\text{CPU time} = \dfrac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$
• Clock cycles per instruction (CPI): $\text{CPI} = \dfrac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$
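Combining the CPI definition with the CPU time formula gives CPU time = Instruction count × CPI / Clock rate, which the sketch below evaluates. The function names and the numbers in the example are hypothetical and serve only to show the units working out.

```python
def cpi(cpu_clock_cycles, instruction_count):
    """Average clock cycles per instruction."""
    return cpu_clock_cycles / instruction_count

def cpu_time(instruction_count, cycles_per_instruction, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cycles_per_instruction / clock_rate_hz

if __name__ == "__main__":
    # Hypothetical program: 1e9 instructions, CPI of 2.0, 2 GHz clock.
    print(cpu_time(1e9, 2.0, 2e9), "seconds")
```

Written this way, the three levers on CPU time are visible separately: fewer instructions, fewer cycles per instruction, or a faster clock.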
