CS465 Performance Revisited (Chapter 1)

CS465 Performance Revisited (Chapter 1): Be able to compare the performance of simple system configurations and understand the performance implications of architectural choices.

Performance and Cost: Purchasing vs Design Views • Our goal is to understand cost & performance implications of architectural choices. Consider 2 views: • Purchasing perspective: given 4 machines, which measure yields the best decision? • best performance • least cost • best performance / cost • Design perspective: select the design that yields • best performance • least cost • best performance / cost • Both require • basis for comparison • metric for evaluation

Performance • Measure, Report, and Summarize • Make intelligent choices • See through the marketing hype • Key to understanding underlying organizational motivation • Why is some hardware better than others for different programs? • What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) • How does the machine's instruction set affect performance?

Performance • Which of these determines performance? • # of cycles to execute program? • # of instructions in program? • # of cycles per second? • average # of cycles per instruction? • average # of instructions per second? • Common pitfall: thinking one of the variables is indicative of performance when it really isn’t. • Performance is determined by execution time. (© 1998 Morgan Kaufmann Publishers)

Two notions of “performance”

Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 777          375          4630         610
Boeing 747          470          4150         610
BAC/Sud Concorde    132          4000         1350
Douglas DC-8-50     146          8720         544

Which is the best measure of performance? Passenger capacity, range, speed, throughput, or travel time?
Which has higher performance? Speed: the Concorde is fastest. Range: the Douglas DC-8-50 flies farthest.

Two notions of “Performance” - 2

Plane               DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747          6.5 hours     610 mph    470          286,700
BAC/Sud Concorde    3 hours       1350 mph   132          178,200

• Throughput (passenger-miles per hour): Boeing 747 is highest • Cost of operation vs cost of operation per passenger-mile per hour?
CPU: • Time to do the task: execution time, response time, latency • Tasks per day, hour, week, sec, ns (capacity): throughput, bandwidth • Response time and throughput are often in opposition
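The throughput column above is simply passengers multiplied by speed. A minimal sketch that recomputes it from the table values (the dictionary layout and function name are illustrative choices, not part of the slides):

```python
# Throughput in passenger-miles per hour (pmph) = passengers * speed in mph.
# Values are taken from the table above.
planes = {
    "Boeing 747": {"passengers": 470, "speed_mph": 610},
    "BAC/Sud Concorde": {"passengers": 132, "speed_mph": 1350},
}


def throughput_pmph(passengers, speed_mph):
    """Passenger-miles carried per hour of flight."""
    return passengers * speed_mph


throughputs = {name: throughput_pmph(**p) for name, p in planes.items()}
highest = max(throughputs, key=throughputs.get)

for name, value in throughputs.items():
    print(f"{name}: {value:,} pmph")
print("Highest throughput:", highest)
```

The Concorde wins on travel time (latency) while the 747 wins on throughput, which is the opposition between the two measures noted above.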

How to Measure Performance • Two approaches • User perspective: Response time / Execution time • Computer Center Manager perspective: Throughput based on number of jobs completed • How quickly is each job completed vs Total Amount of work done • We focus on execution time for a single job

Measures • Performance is inversely proportional to execution time. • Performance_x = 1 / Execution time_x • If execution time decreases by a factor of 4, performance increases by a factor of 4. • “x is n times faster than y” means that • Performance_x / Performance_y = Execution time_y / Execution time_x = n
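The definitions above can be sketched in a few lines. The function names and the 10 s / 15 s timings are hypothetical, chosen only to illustrate the ratio:

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s, time_y_s):
    """Return n such that 'x is n times faster than y'."""
    return performance(time_x_s) / performance(time_y_s)


# Hypothetical timings: x runs the program in 10 s, y in 15 s.
n = times_faster(10.0, 15.0)
print(f"x is about {n:.2f} times faster than y")
```

Because performance is a reciprocal, the ratio reduces to time_y / time_x, so the machine with the shorter execution time always has n greater than 1.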

Measuring Execution Time • Elapsed Time • includes everything (disk and memory accesses, I/O, etc.) • a useful number, but often not good for CPU assessment • CPU time • doesn't include I/O or time spent running other programs • CPU time = system time + user time • Our focus: user CPU time • time spent executing the lines of code that are "in" our program
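One way to see the difference between elapsed time and CPU time is to time a program that both computes and waits. This is a sketch using Python's standard `time` module; `busy_work` and the 0.2 s sleep are made-up stand-ins for "our program" and for I/O waiting:

```python
import time


def busy_work(n):
    """CPU-bound loop standing in for the lines of code 'in' our program."""
    total = 0
    for i in range(n):
        total += i * i
    return total


wall_start = time.perf_counter()   # elapsed (wall-clock) time
cpu_start = time.process_time()    # user + system CPU time of this process

busy_work(200_000)
time.sleep(0.2)                    # waiting adds elapsed time but no CPU time

cpu_used = time.process_time() - cpu_start
wall_used = time.perf_counter() - wall_start
print(f"elapsed: {wall_used:.3f} s, CPU: {cpu_used:.3f} s")
```

`time.process_time()` reports system plus user CPU time together; where the two need to be separated, `os.times()` exposes `user` and `system` fields.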

Clock Cycles • Instead of reporting execution time in seconds, we often use cycles • Clock “ticks” indicate when to start activities (one abstraction): • cycle time = time between ticks = seconds per cycle • clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec) • A 200 MHz clock has a cycle time of 1 / (200 × 10^6 cycles/sec) = 5 ns
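Since cycle time and clock rate are reciprocals, converting between cycles and seconds takes one division. A small sketch (function names are illustrative):

```python
def cycle_time_s(clock_rate_hz):
    """Seconds per cycle is the inverse of cycles per second."""
    return 1.0 / clock_rate_hz


def execution_time_s(clock_cycles, clock_rate_hz):
    """Execution time = number of cycles * seconds per cycle."""
    return clock_cycles * cycle_time_s(clock_rate_hz)


cycle_ns = cycle_time_s(200e6) * 1e9     # 200 MHz clock, in nanoseconds
run_s = execution_time_s(4e9, 400e6)     # 4 * 10^9 cycles at 400 MHz

print(f"200 MHz cycle time: {cycle_ns:.1f} ns")
print(f"4e9 cycles at 400 MHz: {run_s:.1f} s")
```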

How to Improve Performance? Performance can be enhanced by either: • ________ the # of required cycles for a program, or • ________ the clock cycle time, or • ________ the clock rate (inverse of clock cycle time).

How many cycles are required for a program? [Figure: the 1st through 6th instructions laid out along a time axis] • Is it safe to assume that # of cycles = # of instructions? This assumption is incorrect. Different instructions take different amounts of time. For example: add and lw in MIPS. • Why do some instructions take more time than others? • Remember: these are not lines of C code; these are machine instructions.

Different numbers of cycles for different instructions • Multiplication/division takes more time than addition • Floating point operations take longer than integer ones • Accessing memory takes more time than accessing registers • The number of cycles to execute an instruction can differ between machines, even within the same family of computers • Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

Example • Our favorite program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target? • Don't panic: we can easily work this out from basic principles.

Example

$$\text{Clock Cycles}_A = \text{CPU time}_A \times \text{Clock Rate}_A = 10\ \text{s} \times 400 \times 10^{6}\ \text{cycles/s} = 4 \times 10^{9}\ \text{cycles}$$

$$\text{Clock Rate}_B = \frac{1.2 \times \text{Clock Cycles}_A}{\text{CPU time}_B} = \frac{1.2 \times 4 \times 10^{9}\ \text{cycles}}{6\ \text{s}} = 800 \times 10^{6}\ \text{cycles/s} = 800\ \text{MHz}$$

Cutting execution time from 10 s to 6 s requires raising the clock rate from 400 MHz to 800 MHz.
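The same calculation as a short script, using only the numbers given in the problem statement (variable names are illustrative):

```python
# Machine A: measured values from the problem statement.
cpu_time_a_s = 10.0
clock_rate_a_hz = 400e6

# Machine B: target time, and the stated 1.2x cycle-count penalty.
cpu_time_b_s = 6.0
cycle_penalty = 1.2

cycles_a = cpu_time_a_s * clock_rate_a_hz   # cycles = seconds * cycles/second
cycles_b = cycle_penalty * cycles_a
clock_rate_b_hz = cycles_b / cpu_time_b_s   # cycles/second needed to hit 6 s

print(f"Cycles on A: {cycles_a:.2e}")
print(f"Target clock rate for B: {clock_rate_b_hz / 1e6:.0f} MHz")
```

Note that the clock rate doubles while execution time falls by only a factor of about 1.67, because machine B needs 20% more cycles.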

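Factors Affecting Computer Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$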

“Average cycles per instruction” (CPI)

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

• CPI = Clock Cycles to execute program / Instruction Count = (CPU Time × Clock Rate) / Instruction Count

$$\text{CPU time} = \text{ClockPeriod} \times \sum_{i=1}^{n} \text{CPI}_i \times C_i$$

where ClockPeriod = ClockCycleTime, CPI_i = CPI for instruction class i, and C_i = count of class-i instructions executed.

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{C_i}{\text{Instruction Count}} \text{ is the “instruction frequency”}$$

Invest resources where time is spent!
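A minimal sketch of the two summations above. The instruction classes, their counts, and the 2 ns clock period are hypothetical numbers invented for illustration; they do not come from the slides:

```python
def total_cycles(classes):
    """Sum of CPI_i * C_i over all instruction classes."""
    return sum(cpi_i * count_i for cpi_i, count_i in classes)


def average_cpi(classes):
    """Sum of CPI_i * F_i, where F_i = C_i / instruction count."""
    instruction_count = sum(count_i for _, count_i in classes)
    return sum(cpi_i * (count_i / instruction_count) for cpi_i, count_i in classes)


def cpu_time_s(classes, clock_period_s):
    """CPU time = ClockPeriod * sum of CPI_i * C_i."""
    return clock_period_s * total_cycles(classes)


# Hypothetical mix as (CPI_i, C_i) pairs: assumed values, for illustration only.
mix = [(1, 500), (2, 300), (3, 200)]

print("Total cycles:", total_cycles(mix))
print(f"Average CPI: {average_cpi(mix):.2f}")
print(f"CPU time at a 2 ns clock period: {cpu_time_s(mix, 2e-9):.2e} s")
```

Both forms agree: average CPI times instruction count gives the same cycle total as summing per class.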

CPI Example • For a program, Machine A has a clock cycle time of 10 ns and a CPI of 2.0; Machine B has a clock cycle time of 20 ns and a CPI of 1.2 • For machine A: CPU time = IC × CPI × Clock cycle time = IC × 2.0 × 10 ns = 20 × IC ns • For machine B: CPU time = IC × 1.2 × 20 ns = 24 × IC ns • With the same instruction count on both, machine A is 24 / 20 = 1.2 times faster
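Because the instruction count IC is the same on both machines, it cancels out and only CPI × cycle time needs to be compared. A small sketch of that comparison (names are illustrative):

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction = CPI * clock cycle time."""
    return cpi * cycle_time_ns


machine_a_ns = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10.0)
machine_b_ns = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20.0)

# IC is identical for both machines, so the CPU time ratio equals this ratio.
ratio = machine_b_ns / machine_a_ns

print(f"A: {machine_a_ns:.0f} ns per instruction")
print(f"B: {machine_b_ns:.0f} ns per instruction")
print(f"CPU time on B / CPU time on A = {ratio:.2f}")
```

Machine B has the lower CPI but still loses, which is the pitfall of judging performance by a single variable.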

# of Instructions Example • A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). • The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. • The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. • Which sequence will be faster? By how much? • What is the CPI for each sequence?
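One way to work the exercise is to count cycles per sequence and divide by the instruction count. The sketch below uses only the numbers stated in the problem; the dictionary layout and function names are illustrative:

```python
# Cycles required by each instruction class, from the problem statement.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def cycles(sequence):
    """Total cycles = sum over classes of (cycles per class * count)."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in sequence.items())


def cpi(sequence):
    """Average cycles per instruction for the sequence."""
    return cycles(sequence) / sum(sequence.values())


seq1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

for name, seq in (("Sequence 1", seq1), ("Sequence 2", seq2)):
    print(f"{name}: {cycles(seq)} cycles, CPI = {cpi(seq):.2f}")

speedup = cycles(seq1) / cycles(seq2)
print(f"Sequence 2 is about {speedup:.2f} times faster")
```

Sequence 2 executes more instructions yet needs fewer cycles (9 versus 10), so on the same clock it is the faster one: instruction count alone does not determine performance.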

Example: Instruction Mix / CPI • What consumes the most CPU time? • High-speed cache reduces load to 2 cycles • Branch prediction reduces branch to 1 cycle • Two ALU instructions per cycle

Amdahl's Law

Compute Task = Component that can be parallelized + Component that is serial

Speedup due to enhancement E (impacts the parallelizable part):

$$\text{Speedup}(E) = \frac{\text{ExTime without } E}{\text{ExTime with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. What is the speedup?

$$\text{ExTime}(\text{without } E) = \big((1 - F) + F\big) \times \text{ExTime}(\text{without } E)$$

$$\text{ExTime}(\text{with } E) = \left((1 - F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)$$

$$\text{Speedup}(E) = \frac{1}{(1 - F) + F/S}$$
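The speedup formula is easy to explore numerically. In this sketch the function name and the sample values of F and S are assumptions made for illustration:

```python
def amdahl_speedup(fraction_enhanced, enhancement_factor):
    """Speedup = 1 / ((1 - F) + F / S) for enhanced fraction F and factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("F must be between 0 and 1")
    if enhancement_factor <= 0.0:
        raise ValueError("S must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_factor)


# Hypothetical case: half of the task is made twice as fast.
modest = amdahl_speedup(0.5, 2.0)

# Same fraction with a very large S: the serial half now dominates.
nearly_ideal = amdahl_speedup(0.5, 1e9)

print(f"F = 0.5, S = 2:   speedup = {modest:.3f}")
print(f"F = 0.5, S = 1e9: speedup = {nearly_ideal:.3f}")
```

However large S becomes, the speedup cannot exceed 1 / (1 - F), so the unaffected part of the task sets the ceiling.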

Summary: Instruction set design (MIPS)
• Use general purpose registers with a load-store architecture: YES
• Provide at least 16 general purpose registers plus separate floating-point registers: 31 GPR & 32 FPR
• Support basic addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register deferred: YES, 16 bits for immediate and displacement (disp = 0 => register deferred)
• All addressing modes apply to all data transfer instructions: YES
• Use fixed instruction encoding if interested in performance and use variable instruction encoding if interested in code size: Fixed
• Support these data sizes and types: 8-bit, 16-bit, 32-bit integers and 32-bit and 64-bit IEEE 754 floating point numbers: YES
• Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch (with a PC-relative address at least 8 bits long), jump, call, and return: YES
• Aim for a minimalist instruction set: YES

How to Evaluate Instruction Sets? • Metric we use: time to execute the program, determined by Instruction Count, CPI, and Cycle Time • NOTE: this depends on the instruction set, processor organization, and compilation techniques.

Some Popular Performance Measures • MIPS = Millions of Instructions Per Second • Easy to understand; faster machines have higher MIPS • Problems: • Different computers have different instruction sets. How does one compare MIPS across platforms? • MIPS ratings are based on specific programs • MIPS depends on compilers • MIPS is not a function of the CPU alone • BENCHMARKS: BASIS FOR COMPARISON
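For reference, a MIPS rating follows directly from instruction count and execution time, or equivalently from clock rate and CPI. This sketch reuses Machine A and Machine B from the CPI example (10 ns and 20 ns cycle times, CPI of 2.0 and 1.2); the function names and the one-million-instruction program are illustrative assumptions:

```python
def mips(instruction_count, execution_time_s):
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_clock(clock_rate_hz, cpi):
    """Equivalent form: MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Machine A: 10 ns cycle (100 MHz), CPI 2.0. Machine B: 20 ns cycle (50 MHz), CPI 1.2.
mips_a = mips_from_clock(1 / 10e-9, 2.0)
mips_b = mips_from_clock(1 / 20e-9, 1.2)

# Cross-check with a hypothetical 1,000,000-instruction program on machine A.
ic = 1_000_000
time_a_s = ic * 2.0 * 10e-9
mips_a_check = mips(ic, time_a_s)

print(f"Machine A: {mips_a:.1f} MIPS, Machine B: {mips_b:.1f} MIPS")
```

The rating changes whenever CPI changes, and CPI depends on the program and the compiler, which is why a MIPS figure is only comparable across machines running the same instruction set on the same workload.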

MFLOPS • Millions of floating point operations per second • Problems • Different machines have different sets of floating point operations • Programs require a varying mix of floating point operations.

How to evaluate? • Target workload • Depends on user environment • Depends on “current” usage pattern • Standard workloads (benchmarks) • Should not be narrow: manufacturers design to the benchmark and tailor instruction sets to it! • SPEC 2000 • A mix of tasks • CINT2000: integer • CFP2000: floating point • SPECweb99 • Focus on web server throughput

SPEC ratings for Pentium 3 and Pentium 4 Fig 4.6 in 3rd Edition

Relative Performance of 3 Intel processors Fig 4.8 in 3rd Edition

Relative Energy Efficiency Fig 4.9 in 3rd Edition
