780 likes | 912 Views
Computer Performance and Cost. Pradondet Nilagupta Fall 2005 (original notes from Randy Katz, & Prof. Jan M. Rabaey , UC Berkeley Prof. Shaaban, RIT). IR. Regs. Review: What is Computer Architecture?. Interfaces. API. ISA. Link. I/O Chan. Technology. Machine Organization.
E N D
Computer Performance and Cost Pradondet Nilagupta Fall 2005 (original notes from Randy Katz, & Prof. Jan M. Rabaey ,UC Berkeley Prof. Shaaban, RIT) 204521 Digital System Architecture
IR Regs Review: What is Computer Architecture? Interfaces API ISA Link I/O Chan Technology Machine Organization Applications Measurement & Evaluation Computer Architect 204521 Digital System Architecture
Memory Controller NICs Memory Review: Computer System Components CPU Core 1 GHz - 3.8 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware speculation All Non-blocking caches L1 16-128K 1-2 way set associative (on chip), separate or unified L2 256K- 2M 4-32 way set associative (on chip) unified L3 2-16M 8-32 way set associative (off or on chip) unified L1 L2 L3 CPU Examples: Alpha, AMD K7: EV6, 200-400 MHz Intel PII, PIII: GTL+ 133 MHz Intel P4 800 MHz Caches SDRAM PC100/PC133 100-133MHZ 64-128 bits wide 2-way inteleaved ~ 900 MBYTES/SEC )64bit) Double Date Rate (DDR) SDRAM PC3200 200 MHZ DDR 64-128 bits wide 4-way interleaved ~3.2 GBYTES/SEC (one 64bit channel) ~6.4 GBYTES/SEC (two 64bit channels) RAMbus DRAM (RDRAM) 400MHZ DDR 16 bits wide (32 banks) ~ 1.6 GBYTES/SEC Front Side Bus (FSB) Off or On-chip adapters I/O Buses Current Standard Example: PCI, 33-66MHz 32-64 bits wide 133-528 MBYTES/SEC PCI-X 133MHz 64 bit 1024 MBYTES/SEC Memory Bus Controllers Disks Displays Keyboards Networks I/O Devices: I/O Subsystem North Bridge South Bridge Chipset 204521 Digital System Architecture
The Architecture Process Estimate Cost & Performance Sort New concepts created Good ideas Mediocre ideas Bad ideas 204521 Digital System Architecture
P C M Performance Measurement and Evaluation • Many dimensions to computer performance • CPU execution time • by instruction or sequence • floating point • integer • branch performance • Cache bandwidth • Main memory bandwidth • I/O performance • bandwidth • seeks • pixels or polygons per second • Relative importance depends on applications 204521 Digital System Architecture
Evaluation Tools • Benchmarks, traces, & mixes • macrobenchmarks & suites • application execution time • microbenchmarks • measure one aspect of performance • traces • replay recorded accesses • cache, branch, register • Simulation at many levels • ISA, cycle accurate, RTL, gate, circuit • trade fidelity for simulation rate • Area and delay estimation • Analysis • e.g., queuing theory MOVE 39%BR 20%LOAD 20%STORE 10%ALU 11% LD 5EA3ST 31FF….LD 1EA2…. 204521 Digital System Architecture
Microbenchmarks measure one performance dimension cache bandwidth main memory bandwidth procedure call overhead FP performance weighted combination of microbenchmark performance is a good predictor of application performance gives insight into the cause of performance bottlenecks Macrobenchmarks application execution time measures overall performance, but on just one application Perf. Dimensions Micro Macro Applications Benchmarks 204521 Digital System Architecture
Benchmarks measure the whole system application compiler operating system architecture implementation Popular benchmarks typically reflect yesterday’s programs computers need to be designed for tomorrow’s programs Benchmark timings often very sensitive to alignment in cache location of data on disk values of data Benchmarks can lead to inbreedingor positive feedback if you make an operation fast (slow) it will be used more (less) often so you make it faster (slower) and it gets used even more (less) and so on… Some Warnings about Benchmarks 204521 Digital System Architecture
Choosing Programs To Evaluate Performance Levels of programs or benchmarks that could be used to evaluate performance: • Actual Target Workload: Full applications that run on the target machine. • Real Full Program-based Benchmarks: • Select a specific mix or suite of programs that are typical of targeted applications or workload (e.g SPEC95, SPEC CPU2000). • Small “Kernel” Benchmarks: • Key computationally-intensive pieces extracted from real programs. • Examples: Matrix factorization, FFT, tree search, etc. • Best used to test specific aspects of the machine. • Microbenchmarks: • Small, specially written programs to isolate a specific aspect of performance characteristics: Processing: integer, floating point, local memory, input/output, etc. 204521 Digital System Architecture
Types of Benchmarks Cons Pros • Very specific. • Non-portable. • Complex: Difficult • to run, or measure. • Representative Actual Target Workload • Portable. • Widely used. • Measurements • useful in reality. • Less representative • than actual workload. Full Application Benchmarks • Easy to “fool” by designing hardware to run them well. Small “Kernel” Benchmarks • Easy to run, early in the design cycle. • Peak performance results may be a long way from real application performance • Identify peak performance and potential bottlenecks. Microbenchmarks 204521 Digital System Architecture
SPEC: System Performance Evaluation Cooperative The most popular and industry-standard set of CPU benchmarks. • SPECmarks, 1989: • 10 programs yielding a single number (“SPECmarks”). • SPEC92, 1992: • SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs). • SPEC95, 1995: • SPECint95 (8 integer programs): • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex • SPECfp95 (10 floating-point intensive programs): • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5 • Performance relative to a Sun SuperSpark I (50 MHz) which is given a score of SPECint95 = SPECfp95 = 1 • SPEC CPU2000, 1999: • CINT2000 (11 integer programs). CFP2000 (14 floating-point intensive programs) • Performance relative to a Sun Ultra5_10 (300 MHz) which is given a score of SPECint2000 = SPECfp2000 = 100 204521 Digital System Architecture
SPEC First Round • One program: 99% of time in single line of code • New front-end compiler could improve dramatically 204521 Digital System Architecture
Impact of Means on SPECmark89 for IBM 550(without and with special compiler option) Ratio to VAX: Time: Weighted Time: Program Before After Before After Before After gcc 30 29 49 51 8.91 9.22 espresso 35 34 65 67 7.64 7.86 spice 47 47 510 510 5.69 5.69 doduc 46 49 41 38 5.81 5.45 nasa7 78 144 258 140 3.43 1.86 li 34 34 183 183 7.86 7.86 eqntott 40 40 28 28 6.68 6.68 matrix300 78 730 58 6 3.43 0.37 fpppp 90 87 34 35 2.97 3.07 tomcatv 33 138 20 19 2.01 1.94 Mean 54 72 124 108 54.42 49.99 Geometric Arithmetic Weighted Arith. Ratio 1.33 Ratio 1.16 Ratio 1.09 204521 Digital System Architecture
SPEC CPU2000 Programs Benchmark Language Descriptions 164.gzip C Compression 175.vpr C FPGA Circuit Placement and Routing 176.gcc C C Programming Language Compiler 181.mcf C Combinatorial Optimization 186.crafty C Game Playing: Chess 197.parser C Word Processing 252.eon C++ Computer Visualization 253.perlbmk C PERL Programming Language 254.gap C Group Theory, Interpreter 255.vortex C Object-oriented Database 256.bzip2 C Compression 300.twolf C Place and Route Simulator 168.wupwise Fortran 77 Physics / Quantum Chromodynamics 171.swim Fortran 77 Shallow Water Modeling 172.mgrid Fortran 77 Multi-grid Solver: 3D Potential Field 173.applu Fortran 77 Parabolic / Elliptic Partial Differential Equations 177.mesa C 3-D Graphics Library 178.galgel Fortran 90 Computational Fluid Dynamics 179.art C Image Recognition / Neural Networks 183.equake C Seismic Wave Propagation Simulation 187.facerec Fortran 90 Image Processing: Face Recognition 188.ammp C Computational Chemistry 189.lucas Fortran 90 Number Theory / Primality Testing 191.fma3d Fortran 90 Finite-element Crash Simulation 200.sixtrack Fortran 77 High Energy Nuclear Physics Accelerator Design 301.apsi Fortran 77 Meteorology: Pollutant Distribution CINT2000 (Integer) CFP2000 (Floating Point) Source: http://www.spec.org/osg/cpu2000/ Programs application domain: Engineering and scientific computation 204521 Digital System Architecture
Top 20 SPEC CPU2000 Results (As of March 2002) Top 20 SPECint2000 Top 20 SPECfp2000 # MHz Processor int peak int base MHz Processor fp peak fp base 1 1300 POWER4 814 790 1300 POWER4 1169 1098 2 2200 Pentium 4 811 790 1000 Alpha 21264C 960 776 3 2200 Pentium 4 Xeon 810 788 1050 UltraSPARC-III Cu 827 701 4 1667 Athlon XP 724 697 2200 Pentium 4 Xeon 802 779 5 1000 Alpha 21264C 679 621 2200 Pentium 4 801 779 6 1400 Pentium III 664 648 833 Alpha 21264B 784 643 7 1050 UltraSPARC-III Cu 610 537 800 Itanium 701 701 8 1533 Athlon MP 609 587 833 Alpha 21264A 644 571 9 750 PA-RISC 8700 604 568 1667 Athlon XP 642 596 10 833 Alpha 21264B 571 497 750 PA-RISC 8700 581 526 11 1400 Athlon 554 495 1533 Athlon MP 547 504 12 833 Alpha 21264A 533 511 600 MIPS R14000 529 499 13 600 MIPS R14000 500 483 675 SPARC64 GP 509 371 14 675 SPARC64 GP 478 449 900 UltraSPARC-III 482 427 15 900 UltraSPARC-III 467 438 1400 Athlon 458 426 16 552 PA-RISC 8600 441 417 1400 Pentium III 456 437 17 750 POWER RS64-IV 439 409 500 PA-RISC 8600 440 397 18 700 Pentium III Xeon 438 431 450 POWER3-II 433 426 19 800 Itanium 365 358 500 Alpha 21264 422 383 20 400 MIPS R12000 353 328 400 MIPS R12000 407 382 Source: http://www.aceshardware.com/SPECmine/top.jsp 204521 Digital System Architecture
Common Benchmarking Mistakes (1/2) • Only average behavior represented in test workload • Skewness of device demands ignored • Loading level controlled inappropriately • Caching effects ignored • Buffer sizes not appropriate • Inaccuracies due to sampling ignored 204521 Digital System Architecture
Common Benchmarking Mistakes (2/2) • Ignoring monitoring overhead • Not validating measurements • Not ensuring same initial conditions • Not measuring transient (cold start) performance • Using device utilizations for performance comparisons • Collecting too much data but doing too little analysis 204521 Digital System Architecture
Architectural Performance Laws and Rules of Thumb 204521 Digital System Architecture
Measurement and Evaluation • Architecture is an iterative process: • Searching the space of possible designs • Make selections • Evaluate the selections made • Good measurement tools are required to accurately evaluate the selection. 204521 Digital System Architecture
Measurement Tools • Benchmarks, Traces, Mixes • Cost, delay, area, power estimation • Simulation (many levels) • ISA, RTL, Gate, Circuit • Queuing Theory • Rules of Thumb • Fundamental Laws 204521 Digital System Architecture
Measuring and Reporting Performance • What do we mean by one Computer is faster than another? • program runs less time • Response time or execution time • time that users see the output • Elapsed time • A latency to complete a task including disk accesses, memory accesses, I/O activities, operating system overhead • Throughput • total amount of work done in a given time 204521 Digital System Architecture
Performance “Increasing and decreasing” ????? We use the term “improve performance” or “ improve execution time” When we mean increase performance and decrease execution time . improve performance = increase performance improve execution time = decrease execution time 204521 Digital System Architecture
Metrics of Performance Application Answers per month Operations per second Programming Language Compiler (millions) of Instructions per second: MIPS (millions) of (FP) operations per second: MFLOP/s ISA Datapath Megabytes per second Control Function Units Cycles per second (clock rate) Transistors Wires Pins 204521 Digital System Architecture
Does Anybody Really Know What Time it is? • User CPU Time (Time spent in program) • System CPU Time (Time spent in OS) • Elapsed Time (Response Time = 159 Sec.) • (90.7+12.9)/159 * 100 = 65%, % of lapsed time that is CPU time. 45% of the time spent in I/O or running other programs UNIX Time Command: 90.7u 12.9s 2:39 65% 204521 Digital System Architecture
Example UNIX time command 90.7u 12.95 2:39 65% user CPU time is 90.7 sec system CPU time is 12.9 sec elapsed time is 2 min 39 sec. (159 sec) % of elapsed time that is CPU time is 90.7 + 12.9 = 65% 159 204521 Digital System Architecture
Time CPU time • time the CPU is computing • not including the time waiting for I/O or running other program User CPU time • CPU time spent in the program System CPU time • CPU time spent in the operating system performing task requested by the program decrease execution time CPU time = User CPU time + System CPU time 204521 Digital System Architecture
Performance System Performance • elapsed time on unloaded system CPU performance • user CPU time on an unloaded system 204521 Digital System Architecture
Plane DC to Paris Speed Passengers Throughput Boeing 747 6.5 hours 610 mph 470 286,700 BAD/Sud Concorde 3 hours 1350 mph 132 178,200 Two notions of “performance” Which has higher performance? ° Time to do the task (Execution Time) – execution time, response time, latency ° Tasks per day, hour, week, sec, ns. .. – throughput, bandwidth Response time and throughput often are in opposition 204521 Digital System Architecture
Example • Time of Concorde vs. Boeing 747? • Concord is 1350 mph / 610 mph = 2.2 times faster • = 6.5 hours / 3 hours • Throughput of Concorde vs. Boeing 747 ? • Concord is 178,200 pmph / 286,700 pmph = 0.62 “times faster” • Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster” • Boeing is 1.6 times (“60%”)faster in terms of throughput • Concord is 2.2 times (“120%”) faster in terms of flying time • We will focus primarily on execution time for a single job 204521 Digital System Architecture
Computer Performance Measures: Program Execution Time(1/2) • For a specific program compiled to run on a specific machine (CPU) “A”, the following parameters are provided: • The total instruction count of the program. • The average number of cycles per instruction (average CPI). • Clock cycle of machine “A” 204521 Digital System Architecture
Computer Performance Measures: Program Execution Time(2/2) • How can one measure the performance of this machine running this program? • Intuitively the machine is said to be faster or has better performance running this program if the total execution time is shorter. • Thus the inverse of the total measured program execution time is a possible performance measure or metric: PerformanceA = 1 / Execution TimeA How to compare performance of different machines? What factors affect performance? How to improve performance?!!!! 204521 Digital System Architecture
PerformanceA Execution TimeB Speedup = n = = PerformanceB Execution TimeA Comparing Computer Performance Using Execution Time • To compare the performance of two machines (or CPUs) “A”, “B” running a given specific program: PerformanceA = 1 / Execution TimeA PerformanceB = 1 / Execution TimeB • Machine A is n times faster than machine B means (or slower? if n < 1) : 204521 Digital System Architecture
Example For a given program: Execution time on machine A: ExecutionA = 1 second Execution time on machine B: ExecutionB = 10 seconds The performance of machine A is 10 times the performance of machine B when running this program, or: Machine A is said to be 10 times faster than machine B when running this program. The two CPUs may target different ISAs provided the program is written in a high level language (HLL) 204521 Digital System Architecture
CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle CPU Execution Time: The CPU Equation • A program is comprised of a number of instructions executed , I • Measured in: instructions/program • The average instruction executed takes a number of cycles per instruction (CPI) to be completed. • Measured in: cycles/instruction, CPI • CPU has a fixed clock cycle time C = 1/clock rate • Measured in: seconds/cycle • CPU execution time is the product of the above three parameters as follows: T = I x CPI x C execution Time per program in seconds Average CPI for program CPU Clock Cycle Number of instructions executed 204521 Digital System Architecture
CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle CPU Execution Time: Example • A Program is running on a specific machine with the following parameters: • Total executed instruction count: 10,000,000 instructions Average CPI for the program: 2.5 cycles/instruction. • CPU clock rate: 200 MHz. (clock cycle = 5x10-9 seconds) • What is the execution time for this program: CPU time = Instruction count x CPIx Clock cycle = 10,000,000 x 2.5 x 1 / clock rate = 10,000,000 x 2.5 x 5x10-9 = .125 seconds 204521 Digital System Architecture
CPU Time = Instruction count x CPIx Clock cycle Depends on: Program Used Compiler ISA Instruction Count I Depends on: Program Used Compiler ISA CPU Organization Depends on: CPU Organization Technology (VLSI) CPI Clock Cycle C Aspects of CPU Execution Time T = I x CPI x C (executed) (Average CPI) 204521 Digital System Architecture
CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Factors Affecting CPU Performance Instruction Count I CPI Clock Cycle C Program X X X Compiler X Instruction Set Architecture (ISA) X X X X Organization (CPU Design) Technology (VLSI) X 204521 Digital System Architecture
Speedup = Old Execution Time = Iold x CPIold x Clock cycleold New Execution Time Inew x CPInew x Clock Cyclenew Performance Comparison: Example • From the previous example: A Program is running on a specific machine with the following parameters: • Total executed instruction count, I: 10,000,000 instructions • Average CPI for the program: 2.5 cycles/instruction. • CPU clock rate: 200 MHz. • Using the same program with these changes: • A new compiler used: New instruction count 9,500,000 New CPI: 3.0 • Faster CPU implementation: New clock rate = 300 MHZ • What is the speedup with the changes? Speedup = (10,000,000 x 2.5 x 5x10-9) / (9,500,000 x 3 x 3.33x10-9 ) = .125 / .095 = 1.32 or 32 % faster after changes. 204521 Digital System Architecture
Instruction Types & CPI • Given a program with n types or classes of instructions executed on a given CPU with the following characteristics: Ci = Count of instructions of typei CPIi = Cycles per instruction for typei Then: CPI = CPU Clock Cycles / Instruction Count I Where: Instruction Count I = S Ci i = 1, 2, …. n 204521 Digital System Architecture
Instruction class CPI A 1 B 2 C 3 Instruction counts for instruction class Code Sequence A B C 1 2 1 2 2 4 1 1 Instruction Types & CPI: An Example • An instruction set has three instruction classes: • Two code sequences have the following instruction counts: • CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles CPI for sequence 1 = clock cycles / instruction count = 10 /5 = 2 • CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles CPI for sequence 2 = 9 / 6 = 1.5 For a specific CPU design 204521 Digital System Architecture
CPIi x Fi CPI Instruction Frequency & CPI • Given a program with n types or classes of instructions with the following characteristics: Ci = Count of instructions of typei CPIi = Average cycles per instruction of typei Fi = Frequency or fraction of instruction typei executed = Ci/ total executed instruction count = Ci/ I Then: Fraction of total execution time for instructions of type i = 204521 Digital System Architecture
CPIi x Fi Base Machine (Reg / Reg) Op Freq, Fi CPIi CPIi x Fi % Time ALU 50% 1 .5 23% = .5/2.2 Load 20% 5 1.0 45% = 1/2.2 Store 10% 3 .3 14% = .3/2.2 Branch 20% 2 .4 18% = .4/2.2 CPI Typical Mix Instruction Type Frequency & CPI: A RISC Example Program Profile or Executed Instructions Mix Sum = 2.2 204521 Digital System Architecture
Performance Terminology “X is n% faster than Y” means: ExTime(Y) Performance(X) n --------- = -------------- = 1 + ----- ExTime(X) Performance(Y) 100 n= 100(Performance(X) - Performance(Y)) Performance(Y) n = 100(ExTime(Y) - ExTime(X)) ExTime(X) 204521 Digital System Architecture
n = 100(ExTime(Y) - ExTime(X)) ExTime(X) Example Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X? n = 100(15 - 10) 10 n = 50% 204521 Digital System Architecture
Speedup Speedup due to enhancement E: ExTime w/o E Performance w/ E Speedup(E) = ------------- = ------------------- ExTime w/ E Performance w/o E Suppose that enhancement E accelerates a fractionenhanced of the task by a factor Speedupenhanced , and the remainder of the task is unaffected, then what is ExTime(E) = ? Speedup(E) = ? 204521 Digital System Architecture
Amdahl’s Law • States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time faster mode can be used Speedup = Performance for entire task using the enhancement Performance for the entire task without using the enhancement or Speedup = Execution time without the enhancement Execution time for entire task using the enhancement 204521 Digital System Architecture
Amdahl’s Law ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced Speedupenhanced 1 ExTimeold ExTimenew Speedupoverall = = (1 - Fractionenhanced) + Fractionenhanced Speedupenhanced 204521 Digital System Architecture
Example of Amdahl’s Law • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew= Speedupoverall = 204521 Digital System Architecture
Example of Amdahl’s Law • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew= ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold 1 Speedupoverall = 1.053 = 0.95 204521 Digital System Architecture
Performance Enhancement Calculations: Amdahl's Law • The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used • Amdahl’s Law: Performance improvement or speedup due to enhancement E: Execution Time without E Performance with E Speedup(E) = -------------------------------------- = --------------------------------- Execution Time with E Performance without E • Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected then: Execution Time with E = ((1-F) + F/S) X Execution Time without E Hence speedup is given by: Execution Time without E 1 Speedup(E) = --------------------------------------------------------- = -------------------- ((1 - F) + F/S) X Execution Time without E (1 - F) + F/S 204521 Digital System Architecture