Princess Sumaya University for Technology. Computer Architecture. Dr. Esam Al_Qaralleh. Performance & cost. Performance Evolution. 1970s Mainframes dominated – performance improved 25—30%/yr Mostly due to improved architecture + some technology aids 1980s
Princess Sumaya University for Technology Computer Architecture Dr. Esam Al_Qaralleh
Performance & cost
Performance Evolution • 1970s • Mainframes dominated – performance improved 25—30%/yr • Mostly due to improved architecture + some technology aids • 1980s • VLSI + microprocessor became the foundation • Technology improves at 35%/yr
Performance Evolution (Cont.) • 1980s (Cont.) • Compiler focus brought on the great CISC vs. RISC debate • With the exception of Intel – RISC won the argument • RISC performance improved by 50%/year initially • Of course RISC is not as simple anymore and the compiler is a key part of the game • Does not matter how fast your computer is, if the compiler wastes most of it due to the inability to generate efficient code • With the exploitation of instruction-level parallelism (pipeline + super-scalar) and the use of caches, performance is further enhanced CISC: Complex Instruction Set Computing RISC: Relegate Important Stuff to the Compiler (Reduced Instruction Set Computing)
Growth in Performance (Figure 1.1) [Figure: performance growth over time, with regions annotated "Technology driven" and "Mainly due to advanced architecture ideas"]
Optimizing the Design • Usually the functional requirements are set by the company/marketplace • Which design is optimal depends on the choice of metric • Cost minimized → simple design • Performance maximized → complex design or better technology • Time to market minimized → also favors simplicity • Oh – and you only get one shot • Requires heaps of simulation and must quantify everything • Inherent requirements for deep infrastructure and support • Plus you must predict the trends…
Cost • Clearly a marketplace issue -- profit as a function of volume • Let’s focus on hardware costs • Factors impacting cost • Learning curve – manufacturing costs decrease over time • Yield – the percentage of manufactured devices that survive the testing procedure • Volume is also a key factor in determining cost • Commodities are products that are sold by multiple vendors in large volumes and are essentially identical (e.g., laptops)
Integrated Circuits Costs Die Cost goes roughly with die area
Cost of an Integrated Circuit • Die yield is the fraction or percentage of good dies on a wafer • α is a parameter that corresponds roughly to the number of masking levels, a measure of manufacturing complexity, critical to die yield (α = 4.0 is a good estimate).
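For reference, the cost model behind this slide is usually written as below. This is the standard textbook form, given here as an assumption because the slide text itself only defines die yield and α. The names "Cost of wafer" and "Wafer yield" do not appear in the slide text.

```latex
\begin{align*}
\text{Cost of die} &= \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}} \\[4pt]
\text{Dies per wafer} &= \frac{\pi \times (\text{Wafer diameter}/2)^2}{\text{Die area}}
  - \frac{\pi \times \text{Wafer diameter}}{\sqrt{2 \times \text{Die area}}} \\[4pt]
\text{Die yield} &= \text{Wafer yield} \times
  \left(1 + \frac{\text{Defects per unit area} \times \text{Die area}}{\alpha}\right)^{-\alpha}
\end{align*}
```

The second term of the dies-per-wafer formula approximates the partial dies lost along the round edge of the wafer.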
Example: Finding the number of dies • Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side. • Ans: The die area is 0.49 cm². Thus
$$\text{Dies per wafer} = \frac{\pi \times (30/2)^2}{0.49} - \frac{\pi \times 30}{\sqrt{2 \times 0.49}} = 1347$$
Example: Finding the die yield • Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a defect density of 0.6 per cm². • Ans: The total die areas are 1 cm² and 0.49 cm². For the larger die, the yield is
$$\text{Die yield} = \left(1 + \frac{0.6 \times 1}{4}\right)^{-4} = 0.57$$
For the smaller die, it is
$$\text{Die yield} = \left(1 + \frac{0.6 \times 0.49}{4}\right)^{-4} = 0.75$$
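The two worked examples can be checked numerically. The sketch below assumes the standard dies-per-wafer and die-yield formulas, with a wafer yield of 100% and α = 4. The function names are illustrative and do not come from the slides.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Usable dies: wafer area / die area, minus dies lost along the edge."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    """Fraction of good dies; wafer_yield=1.0 is an assumption of this sketch."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


dies = dies_per_wafer(30, 0.7 ** 2)   # 30-cm wafer, 0.7 cm x 0.7 cm die
yield_large = die_yield(0.6, 1.0)     # 1 cm x 1 cm die
yield_small = die_yield(0.6, 0.7 ** 2)

print(int(dies), round(yield_large, 2), round(yield_small, 2))  # 1347 0.57 0.75
```

The smaller die gains twice: more dies fit on the wafer and a larger fraction of them are good, which is why die cost grows faster than die area.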
Computer Designers and Chip Costs • The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins
Definitions of Time • Time can be defined in different ways, depending on what we are measuring: • Response time: Total time to complete a task, including time spent executing on the CPU, accessing disk and memory, waiting for I/O and other processes, and operating system overhead. • CPU execution time: Total time a CPU spends computing on a given task (excludes time for I/O or running other programs). This is also referred to as simply CPU time. • User CPU time: Total time the CPU spends in the program itself. • System CPU time: Total time the operating system spends executing tasks for the program. • For example, a program may have a system CPU time of 22 sec, a user CPU time of 90 sec, a CPU execution time of 112 sec, and a response time of 162 sec.
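The example figures fit together as follows. This is a minimal sketch using only the numbers on the slide; the variable names are illustrative.

```python
# Times from the slide's example, in seconds.
user_cpu_time = 90.0
system_cpu_time = 22.0
response_time = 162.0

# CPU execution time is the sum of user and system CPU time.
cpu_time = user_cpu_time + system_cpu_time

# Whatever is left of the response time was spent off the CPU:
# waiting for I/O, for other processes, and so on.
non_cpu_time = response_time - cpu_time

# Fraction of the response time during which the CPU worked on this task.
cpu_share = cpu_time / response_time

print(cpu_time, non_cpu_time, round(cpu_share, 2))  # 112.0 50.0 0.69
```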
Performance • Time to do the task (execution time): execution time, response time, latency • Tasks per day, hour, week, sec, ns, ... (performance): performance, throughput, bandwidth • Response time – the time between the start and the completion of a task • Thus, to maximize performance, we need to minimize execution time • If X is n times faster than Y, then
$$n = \frac{\text{ExTime}_Y}{\text{ExTime}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$
• Throughput – the total amount of work done in a given time • Important to data center managers • Decreasing response time almost always improves throughput
Calculating CPU Performance • Want to distinguish elapsed time and the time spent on our task • CPU execution time (CPU time) – time the CPU spends working on a task • Does not include time waiting for I/O or running other programs • Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
Calculating CPU Performance (Cont.) • We tend to count instructions executed = IC • Note that looking at the object code is just a start • What we care about is the dynamic count - e.g. don’t forget loops, recursion, branches, etc. • CPI (Clock cycles Per Instruction) is a figure of merit
Calculating CPU Performance (Cont.) • 3 Focus Factors -- Cycle Time, CPI, IC • Sadly - they are interdependent and making one better often makes another worse (but small or predictable impacts) • Cycle time depends on HW technology and organization • CPI depends on organization (pipeline, caching...) and ISA • IC depends on ISA and compiler technology • Often CPI’s are easier to deal with on a per instruction basis
Clock Cycles per Instruction • Not all instructions take the same amount of time to execute • One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction
$$\text{CPU clock cycles for a program} = \text{Instructions for a program} \times \text{Average clock cycles per instruction}$$
• Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute • A way to compare two different implementations of the same ISA
Effective CPI • Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging
$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$
• Where $\text{IC}_i$ is the count (percentage) of the number of instructions of class i executed • $\text{CPI}_i$ is the (average) number of clock cycles per instruction for that instruction class • n is the number of instruction classes • The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
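The weighted sum can be sketched in a few lines. The instruction mix below is hypothetical and chosen only for illustration; it does not come from the slides. The function name `effective_cpi` is likewise an assumption.

```python
def effective_cpi(mix):
    """Sum of CPI_i x IC_i, where IC_i is the fraction of class i.

    `mix` maps an instruction class name to (fraction, cycles per instruction).
    """
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix.values())


# Hypothetical dynamic instruction mix (not measured data).
hypothetical_mix = {
    "alu":    (0.50, 1),
    "load":   (0.20, 5),
    "store":  (0.10, 3),
    "branch": (0.20, 2),
}

overall_cpi = effective_cpi(hypothetical_mix)
print(round(overall_cpi, 2))  # 2.2
```

In this mix, loads are only 20% of the instructions but contribute 1.0 of the 2.2 cycles, so they would be the first place to look for an improvement.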
Example of Computing CPU time • If a computer has a clock rate of 50 MHz, how long does it take to execute a program with 1,000 instructions, if the CPI for the program is 3.5? • Using the equation CPU time = Instruction count x CPI / clock rate gives
$$\text{CPU time} = \frac{1000 \times 3.5}{50 \times 10^{6}} = 7 \times 10^{-5}\ \text{s} = 70\ \mu\text{s}$$
• If a computer’s clock rate increases from 200 MHz to 250 MHz and the other factors remain the same, how many times faster will the computer be?
$$\frac{\text{CPU time}_{old}}{\text{CPU time}_{new}} = \frac{\text{clock rate}_{new}}{\text{clock rate}_{old}} = \frac{250\ \text{MHz}}{200\ \text{MHz}} = 1.25$$
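Both parts of this example follow from the same equation, CPU time = Instruction count x CPI / clock rate. The sketch below uses the slide's numbers; the function name `cpu_time` is illustrative.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Part 1: 1,000 instructions, CPI 3.5, 50 MHz clock.
t = cpu_time(1000, 3.5, 50e6)
print(f"{t * 1e6:.0f} microseconds")  # 70 microseconds

# Part 2: the same program at 200 MHz versus 250 MHz. IC and CPI cancel,
# so the speedup equals the ratio of the clock rates.
speedup = cpu_time(1000, 3.5, 200e6) / cpu_time(1000, 3.5, 250e6)
print(round(speedup, 2))  # 1.25
```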
Comparing and Summarizing Performance • How do we summarize the performance of a benchmark set with a single number? • The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$
• Where $\text{Time}_i$ is the execution time for the ith program of a total of n programs in the workload • A smaller mean indicates a smaller average execution time and thus improved performance • Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment: version of the operating system, compiler settings, input set used, and specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)
Choosing Programs to Evaluate Performance • Real applications – clearly the right choice • Porting and eliminating system-dependent activities • User burden -- to know which of your programs you really care about • Modified (or scripted) applications • Enhance portability or focus on particular aspects of system performance • Kernels – small, key pieces of real programs • Best used to isolate performance of individual features to explain the reasons for differences in performance of real programs • e.g. testing memory/ALU/branch instructions • Not real programs however -- no user really uses them
Choosing Programs to Evaluate Performance (Cont.) • Toy benchmarks – quicksort, puzzle • Beginning programming assignment • Synthetic benchmarks • Try to match the average frequency of operations and operands of a large set of programs • No user really runs them -- not even pieces of real programs • They typically reside in cache & don’t test memory performance • At the very least you must understand what the benchmark code is in order to understand what it might be measuring • Companies thrive or bust on benchmark performance • Hence they optimize for the benchmark • BEWARE ALWAYS!!
Benchmark Suites • SPEC (Standard Performance Evaluation Corporation) • http://www.spec.org • Desktop benchmarks • CPU-intensive: SPEC CPU2000 • Graphic-intensive: SPECviewperf • Server benchmarks • CPU throughput-oriented: SPECrate • I/O activity: SPECsfs (NFS), SPECweb • Transaction processing: TPC (Transaction Processing Performance Council) • Embedded benchmarks • EEMBC (EDN Embedded Microprocessor Benchmark Consortium)
Other Performance Metrics • Power consumption – especially in the embedded market where battery life is important (and passive cooling) • For power-limited applications, the most important metric is energy efficiency
Evaluating ISAs • Design-time metrics: • Can it be implemented, in how long, at what cost? • Can it be programmed? Ease of compilation? • Static Metrics: • How many bytes does the program occupy in memory? • Dynamic Metrics: • How many instructions are executed? How many bytes does the processor fetch to execute the program? • How many clocks are required per instruction? • Best Metric: Time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques, through Inst. Count, CPI, and Cycle Time.
Other Problems • Let’s assume we can get the test jig specified properly • See the following example • Which is better? • By how much? • Are the programs equally important?
Some Aggregate Job Mix Options • Arithmetic Mean - provides a simple average • Does not account for weight - all programs treated equally • Weighted arithmetic mean • Weight is the frequency % of use • Better but beware the dominant program time • Depends on the reference machine
Normalized Time Metrics • Geometric Mean • Has the nice property that: • Ratio of the means = Mean of the ratios • Consistent no matter which machine is the reference • Better than arithmetic means but • Does not form an accurate prediction model – it does not predict execution time • Still have to remain cautious
Normalized Time Metrics Arithmetic mean should not be used to average normalized execution time
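A small experiment shows why. The execution times below are hypothetical and chosen only to make the effect visible; the machine names A and B and the helper names are illustrative. The geometric mean is computed as the n-th root of the product of the normalized times.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # n-th root of the product, computed through logarithms.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times in seconds for two programs on two machines.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def normalized(machine, reference):
    """Execution times of `machine` divided by those of `reference`."""
    return [t / r for t, r in zip(times[machine], times[reference])]


b_vs_a = normalized("B", "A")  # [10.0, 0.1]
a_vs_b = normalized("A", "B")  # [0.1, 10.0]

# Arithmetic mean of normalized times: each machine looks about 5x slower
# than the other, which cannot both be true.
print(round(arithmetic_mean(b_vs_a), 2), round(arithmetic_mean(a_vs_b), 2))  # 5.05 5.05

# Geometric mean: the same answer whichever machine is the reference.
print(round(geometric_mean(b_vs_a), 2), round(geometric_mean(a_vs_b), 2))  # 1.0 1.0
```

The geometric mean is consistent here, but it still says nothing about total running time: on these numbers machine A needs 1001 seconds for both programs and machine B needs 110.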
Make the Common Case Fast • Need to validate that it is common or uncommon • Often • Common cases are simpler than uncommon cases • e.g. exceptions like overflow, interrupts, ... • Truly simple is usually both cheap and fast - best of both worlds • Trick is to quantify the advantage of a proposed enhancement
Amdahl’s Law • Defines speedup gained from a particular feature • Depends on 2 factors • Fraction of original computation time that can take advantage of the enhancement - e.g. the commonality of the feature • Level of improvement gained by the feature • Amdahl’s law is a quantification of the principle of diminishing returns
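Amdahl's Law (Cont.) Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[(1 - F) + \frac{F}{S}\right]$$
$$\text{Speedup} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - F) + \dfrac{F}{S}}$$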
Simple Example • Important Application: • FPSQRT accounts for 20% • FP instructions account for 50% • Other accounts for 30% • Designers say it costs the same to speed up: • FPSQRT by 40x • FP by 2x • Other by 8x • Which one should you invest in? • Straightforward: plug in the numbers & compare, BUT what’s your guess?? • Amdahl’s Law says nothing about cost
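Plugging in the numbers settles it. The sketch below assumes the three percentages are fractions of the original execution time and that the three categories do not overlap. The function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the time is sped up by a factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# (fraction of execution time, speedup of that fraction) per option.
options = {
    "FPSQRT x40": (0.20, 40),
    "FP x2":      (0.50, 2),
    "Other x8":   (0.30, 8),
}

results = {name: amdahl_speedup(f, s) for name, (f, s) in options.items()}
for name, speedup in results.items():
    print(f"{name}: {speedup:.2f}")
# FPSQRT x40: 1.24
# FP x2: 1.33
# Other x8: 1.36

best = max(results, key=results.get)
print(best)  # Other x8
```

The 40x improvement comes last because it applies to only 20% of the time; the fraction affected matters as much as the size of the improvement.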
Example of Amdahl’s Law • Floating point instructions are improved to run twice as fast, but only 10% of the time was spent on these instructions originally. How much faster is the new machine?
$$\text{Speedup} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
$$\text{Speedup} = \frac{1}{(1 - 0.1) + 0.1/2} = 1.053$$
• The new machine is 1.053 times as fast, or 5.3% faster. • How much faster would the new machine be if floating point instructions become 100 times faster?
$$\text{Speedup} = \frac{1}{(1 - 0.1) + 0.1/100} = 1.110$$
Estimating Performance Improvements • Assume a processor currently requires 10 seconds to execute a program and processor performance improves by 50 percent per year. • By what factor does processor performance improve in 5 years? (1 + 0.5)^5 = 7.59 • How long will it take a processor to execute the program after 5 years? ExTimenew = 10/7.59 = 1.32 seconds
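The compound-growth arithmetic can be sketched as follows, assuming the 50% improvement compounds once per year. The function name is illustrative.

```python
def improvement_factor(annual_rate, years):
    """Compound performance improvement after a number of years."""
    return (1 + annual_rate) ** years


factor = improvement_factor(0.5, 5)
new_time = 10 / factor  # the program originally takes 10 seconds

print(round(factor, 2), round(new_time, 2))  # 7.59 1.32
```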
Performance Example • Computers M1 and M2 are two implementations of the same instruction set. • M1 has a clock rate of 50 MHz and M2 has a clock rate of 75 MHz. • M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given program. • How many times faster is M2 than M1 for this program? • What would the clock rate of M1 have to be for them to have the same execution time?
$$\frac{\text{ExTime}_{M1}}{\text{ExTime}_{M2}} = \frac{\text{IC}_{M1} \times \text{CPI}_{M1} / \text{Clock Rate}_{M1}}{\text{IC}_{M2} \times \text{CPI}_{M2} / \text{Clock Rate}_{M2}} = \frac{2.8/50}{3.2/75} = 1.31$$
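Both questions can be worked through numerically. Because M1 and M2 implement the same instruction set and run the same program, the sketch assumes the instruction counts are equal and cancel. The answer to the second question is derived here and is not stated on the slide.

```python
cpi_m1, clock_m1_mhz = 2.8, 50.0
cpi_m2, clock_m2_mhz = 3.2, 75.0


def time_per_instruction_us(cpi, clock_mhz):
    """Average time per instruction in microseconds (cycles / MHz)."""
    return cpi / clock_mhz


# Question 1: how many times faster is M2 than M1?
ratio = (time_per_instruction_us(cpi_m1, clock_m1_mhz)
         / time_per_instruction_us(cpi_m2, clock_m2_mhz))
print(round(ratio, 2))  # 1.31

# Question 2: the M1 clock rate at which both execution times are equal.
# Set CPI_M1 / f = CPI_M2 / Clock_M2 and solve for f.
required_clock_m1_mhz = cpi_m1 * clock_m2_mhz / cpi_m2
print(round(required_clock_m1_mhz, 3))  # 65.625
```

M1 needs a lower clock rate than M2's 75 MHz to match it because M1 takes fewer cycles per instruction.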
Simple Example • Suppose we have made the following measurements: • Frequency of FP operations (other than FPSQR) =25% • Average CPI of FP operations=4.0 • Average CPI of other instructions=1.33 • Frequency of FPSQR=2% • CPI of FPSQR=20 • Two design alternatives • Reduce the CPI of FPSQR to 2 • Reduce the average CPI of all FP operations to 2
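The two alternatives can be compared through their effect on the overall CPI. The sketch rests on assumptions the slide does not state. First, the baseline CPI is computed from a 25% FP / 75% other split. Second, the FPSQR alternative is modeled by removing the cycles it saves, 2% x (20 - 2), from that baseline. Third, instruction count and clock rate are unchanged, so speedup is the ratio of CPIs.

```python
fp_fraction, fp_cpi = 0.25, 4.0
other_fraction, other_cpi = 0.75, 1.33
fpsqr_fraction, fpsqr_cpi = 0.02, 20.0

# Baseline CPI from the 25% / 75% split (assumption of this sketch).
cpi_base = fp_fraction * fp_cpi + other_fraction * other_cpi

# Alternative 1: reduce the CPI of FPSQR from 20 to 2.
cpi_new_fpsqr = cpi_base - fpsqr_fraction * (fpsqr_cpi - 2.0)

# Alternative 2: reduce the average CPI of all FP operations to 2.
cpi_new_fp = fp_fraction * 2.0 + other_fraction * other_cpi

# Same instruction count and clock rate, so speedup = CPI_old / CPI_new.
speedup_fpsqr = cpi_base / cpi_new_fpsqr
speedup_fp = cpi_base / cpi_new_fp

print(round(cpi_base, 2), round(cpi_new_fpsqr, 2), round(cpi_new_fp, 2))  # 2.0 1.64 1.5
print(round(speedup_fpsqr, 2), round(speedup_fp, 2))  # 1.22 1.33
```

Under these assumptions, improving all FP operations is the better choice even though the FPSQR change is far larger per instruction, because FPSQR is only 2% of the instructions.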