Lecture 3: Measuring and Evaluating Performance
Michael B. Greenwald
Computer Architecture, CIS 501, Fall 1999
General Information
• Class: TR 1:30-3, in LRSM Auditorium
• Recitation: T 10:30-12 in Moore 225
• Instructor: Professor Michael Greenwald
  Office: Moore (GRW), room 260
  email: cis501@cis.upenn.edu
  Office hours: R 10:30-12 noon or by appt.
• TA: Sotiris Ioannidis
  Office: Moore, room 102e
  email: sotiris@dsl.cis.upenn.edu
  Office hours: TR 5-6 PM or by appt.
• Secretary: Christine Metz
  Office: Moore, room 556
For your edification • Newsgroup: comp.arch
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: cost, delay, area, power estimation
• Simulation (many levels)
  • ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
[Diagram: the Design / Measure / Experiment / Analyze cycle]
All produce “measures”: what do measures mean? How do they compare?
The Bottom Line: Performance (and Cost)

Plane               DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747          6.5 hours     610 mph    470          286,700
BAD/Sud Concorde    3 hours       1350 mph   132          178,200

• Which is better? It depends: are you trying to win a race from DC to Paris, or to move the most people?
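A quick check of the throughput column, which is simply passengers × speed:
  Boeing 747: 470 passengers × 610 mph = 286,700 pmph
  Concorde:   132 passengers × 1350 mph = 178,200 pmph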
Costs • Performance metrics are mostly useless without understanding costs.
Integrated Circuits Costs

  Die cost = Wafer cost / (Dies per wafer × Die yield)

  Dies per wafer = π × (Wafer_diam / 2)² / Die_Area
                   − π × Wafer_diam / sqrt(2 × Die_Area)
                   − Test dies

  Die yield = Wafer yield × {1 + (Defects_per_unit_area × Die_Area) / α}^(−α)

(α is a process-complexity parameter, typically around 3. Substituting the last two formulas into the first gives die cost as a function of die area.)

Die cost goes roughly with (die area)^4.
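A minimal sketch in C of the cost model above. All parameter values (wafer cost, diameter, defect density, α = 3, test-die count) are invented for illustration, not real 1999 process data:

```c
/* Die cost from the three formulas above; illustrative inputs only. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double PI  = 3.14159265358979;
    double wafer_cost  = 2000.0;  /* dollars per wafer (assumed)     */
    double wafer_diam  = 20.0;    /* cm; an 8-inch wafer             */
    double die_area    = 1.0;     /* cm^2                            */
    double defect_dens = 1.0;     /* defects per cm^2 (assumed)      */
    double wafer_yield = 0.95;    /* fraction of good wafers         */
    double alpha       = 3.0;     /* process-complexity parameter    */
    double test_dies   = 4.0;     /* dies sacrificed to test sites   */

    /* Usable wafer area over die area, minus the dies lost along
       the circumference and to test structures. */
    double dies_per_wafer =
        PI * (wafer_diam / 2.0) * (wafer_diam / 2.0) / die_area
        - PI * wafer_diam / sqrt(2.0 * die_area)
        - test_dies;

    /* Yield model from the slide. */
    double die_yield =
        wafer_yield * pow(1.0 + defect_dens * die_area / alpha, -alpha);

    double die_cost = wafer_cost / (dies_per_wafer * die_yield);

    printf("dies/wafer = %.0f, die yield = %.2f, die cost = $%.2f\n",
           dies_per_wafer, die_yield, die_cost);
    return 0;
}
```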
Meaning of “Execution Time” (a.k.a. Response Time)
• Wall-clock time, response time, elapsed time: latency (including idle time)
• vs. CPU time: non-idle time only
• System vs. user time: a split that applies to both elapsed and CPU time
• System performance: elapsed time on an unloaded system (includes OS + idle time)
• CPU performance: user CPU time on an unloaded system
The Bottom Line: Performance (and Cost)
• "X is n times faster than Y" means:
  n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
• Speed of Boeing 747 vs. Concorde: 3 / 6.5 = .461 (by flight time); 610 / 1350 = .451 (by mph)
• Throughput of Boeing 747 vs. Concorde: 286,700 / 178,200 = 1.61
Amdahl's Law
Speedup due to enhancement E:
  Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and that the remainder of the task is unaffected.
Amdahl’s Law
  ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
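A minimal sketch in C of the formula above. The inputs (90% of the task sped up 10×, then by an arbitrarily large factor) are illustrative values, not from the lecture:

```c
/* Amdahl's Law: overall speedup when a fraction f of execution
   time is accelerated by a factor s and the rest is unaffected. */
#include <stdio.h>

double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    printf("F=0.90, S=10  -> overall speedup %.2f\n", amdahl(0.90, 10.0)); /* 5.26 */
    printf("F=0.90, S=1e9 -> overall speedup %.2f\n", amdahl(0.90, 1e9)); /* ~10  */
    return 0;
}
```

Note that even an effectively infinite S caps the overall speedup at 1/(1 − F): the unenhanced 10% of the task dominates.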
Metrics of Performance

Level                       Metric
Application                 Answers per month; operations per second
Programming Language
Compiler                    (millions of) instructions per second: MIPS
                            (millions of) FP operations per second: MFLOP/s
ISA
Datapath, Control           Megabytes per second
Function Units              Cycles per second (clock rate)
Transistors, Wires, Pins

• Key metric is “time”, but many measures are naturally expressed as “rates”.
• “Natural” rates can be converted to time (months/answer, secs/op).
• The most meaningful measure is at the top; lower-level measures may not be independent.
Aspects of CPU Performance

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

(Clock Rate = 1 / CycleTime)

                Inst Count    CPI    Clock Rate
Program             X
Compiler            X         (X)
Inst. Set           X          X
Organization                   X         X
Technology                               X
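A minimal sketch in C of the equation above; the instruction count, CPI, and 500 MHz clock are invented example values:

```c
/* CPU time = Instructions/Program * Cycles/Instruction * Seconds/Cycle */
#include <stdio.h>

int main(void) {
    double inst_count = 1e9;    /* instructions executed (assumed)   */
    double cpi        = 1.5;    /* average cycles per instruction    */
    double clock_hz   = 500e6;  /* 500 MHz clock rate (assumed)      */

    double cpu_time = inst_count * cpi * (1.0 / clock_hz);
    printf("CPU time = %.2f s\n", cpu_time);  /* 3.00 s */
    return 0;
}
```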
Cycles Per Instruction

“Average cycles per instruction”:
  CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count

  CPU time = CycleTime × Σ (i=1..n) CPI_i × I_i

“Instruction frequency”:
  CPI = Σ (i=1..n) CPI_i × F_i,   where F_i = I_i / Instruction Count

Invest resources where time is spent!
Example: Calculating CPI

Base Machine (Reg / Reg), typical mix:

  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        .5       (33%)
  Load     20%    2        .4       (27%)
  Store    10%    2        .2       (13%)
  Branch   20%    2        .4       (27%)
                  CPI =    1.5

Same machine, a different (guessed) mix:

  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      20%    1        .2       (11%)
  Load     10%    2        .2       (11%)
  Store    20%    2        .4       (22%)
  Branch   50%    2        1.0      (56%)
                  CPI =    1.8

GIGO: the CPI estimate is only as good as the assumed instruction mix.
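A minimal sketch in C of the weighted-CPI sum, using the two mixes tabulated above (the helper name mix_cpi is my own, not from the lecture):

```c
/* CPI = sum over ops of F_i * CPI_i, for the two mixes above. */
#include <stdio.h>

struct op { const char *name; double freq; double cycles; };

/* Weighted-average CPI over an instruction mix. */
double mix_cpi(const struct op *mix, int n) {
    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += mix[i].freq * mix[i].cycles;
    return cpi;
}

int main(void) {
    struct op typical[] = { {"ALU", 0.50, 1}, {"Load", 0.20, 2},
                            {"Store", 0.10, 2}, {"Branch", 0.20, 2} };
    struct op guessed[] = { {"ALU", 0.20, 1}, {"Load", 0.10, 2},
                            {"Store", 0.20, 2}, {"Branch", 0.50, 2} };
    printf("typical mix: CPI = %.1f\n", mix_cpi(typical, 4)); /* 1.5 */
    printf("guessed mix: CPI = %.1f\n", mix_cpi(guessed, 4)); /* 1.8 */
    return 0;
}
```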
Evaluating Performance of Computer Systems
• How do we approximate the client's workload (the real programs, with weights, that the client actually runs)?
• Toy benchmarks: small programs with known results
• Synthetic benchmarks: random ops that match the average (guessed) frequency of operations
• Kernels: small key pieces of real programs
• Real programs: without knowing the weighting, or whether the set is exhaustive, choose your own sample
• Benchmark suites: a collection of all of the above, as a basis of comparison
For reporting measurements (regardless of how representative), the key feature is reproducibility.
SPEC: System Performance Evaluation Cooperative
• http://www.spec.org/ (http://www.spec.org/osg)
• First round, 1989
  • 10 programs yielding a single number (“SPECmarks”)
• Second round, 1992
  • SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  • Compiler flags unlimited; e.g., the March ’93 flags for the DEC 4000 Model 610:
    spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)= memcpy(b,a,c)”
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
SPEC: System Performance Evaluation Cooperative
• Third round, 1995
  • New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  • “Benchmarks useful for 3 years”
  • Single flag setting for all programs: SPECint_base95, SPECfp_base95
• CPU98 (now called CPU2000?)
  • Under development
  • Adds SMT (System MultiTasking) benchmarks
Why summarize performance? Accuracy vs. Complexity
• Marketing departments want a simple way of saying “Ours is better!”
• Consumers want a simple way of knowing whether they’ve gotten their money’s worth.
How to Summarize Performance
• Track total execution time of all (weighted) benchmarks (convert each result to execution time, sum, convert back to the measure, and average).
• Arithmetic mean (weighted arithmetic mean): Σ(T_i)/n, or Σ(W_i × T_i)
• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS): n/Σ(1/R_i), or 1/Σ(W_i/R_i)
• Compare to a reference architecture (SPEC92 vs. VAX-11/780): normalized execution time
  • Handy for scaling performance (e.g., “X times faster than a SPARCstation 10”)
  • But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π norm_time_i)^(1/n)
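A minimal sketch in C of the three means. The benchmark times and rates are invented example values; the rates are deliberately 100/time, so the harmonic mean of rates and the arithmetic mean of times tell a consistent story:

```c
/* Arithmetic mean of times, harmonic mean of rates, geometric
   mean of times, for three made-up benchmarks. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double time[] = { 2.0, 4.0, 8.0 };     /* seconds (assumed)       */
    double rate[] = { 50.0, 25.0, 12.5 };  /* MFLOPS = 100/time here  */
    int n = 3;
    double arith = 0.0, inv_sum = 0.0, prod = 1.0;

    for (int i = 0; i < n; i++) {
        arith   += time[i] / n;
        inv_sum += 1.0 / rate[i];
        prod    *= time[i];
    }
    printf("arithmetic mean of times = %.2f s\n", arith);            /* 4.67  */
    printf("harmonic mean of rates   = %.2f MFLOPS\n", n / inv_sum); /* 21.43 */
    printf("geometric mean of times  = %.2f\n", pow(prod, 1.0 / n)); /* 4.00  */
    return 0;
}
```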
Normalized Execution Times and Geometric Means
• The geometric mean is independent of the normalization.
• The numerators are the same regardless of reference machine, and the denominator cancels out in any machine-to-machine comparison.
Normalized Execution Times and Geometric Means
• The geometric mean does not track total execution time.
• Halving a 1 µsec program has the same effect as halving a 10-hour program.
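A minimal sketch in C of that pitfall. The two programs (1 µsec and 10 hours) come from the bullet above; machines B and C are invented for the comparison:

```c
/* Geometric mean of speedups: halving the short program (machine B)
   and halving the long one (machine C) look identical, even though
   C saves five hours of real time and B saves half a microsecond. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double A[] = { 1e-6, 36000.0 };  /* reference: 1 usec and 10 hr */
    double B[] = { 5e-7, 36000.0 };  /* halves the 1 usec program   */
    double C[] = { 1e-6, 18000.0 };  /* halves the 10 hour program  */

    double gm_B = sqrt((A[0] / B[0]) * (A[1] / B[1]));
    double gm_C = sqrt((A[0] / C[0]) * (A[1] / C[1]));

    printf("geometric-mean speedup of B over A: %.3f\n", gm_B); /* 1.414 */
    printf("geometric-mean speedup of C over A: %.3f\n", gm_C); /* 1.414 */
    return 0;
}
```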
SPEC First Round
• One program spent 99% of its time in a single line of code.
• A new front-end for the compiler could improve that program's result dramatically.