COE 308

COE 308, Term 051
Dr. Abdelhafid Bouhraoua
Performance

Need for Performance
• Goal: to have some predictability over computer usage
• Consequence: to be able to choose the right computer for a given application

Examples where Performance is needed
• High Accessibility
  – Database Server
  – Web Server
  – Banking System
• High Speed
  – Astronomy
  – Genetic Research
  – Weather Prediction
• Low Cost
  – POS Terminal
  – Portable Device
  – Cell Phone
  – Embedded Apps (Appliances, Toys, …)

Defining Performance
• Speed?
• Accessibility?
• Cost?
Only speed is considered in this context.

What Speed? Which Plane Has Higher Performance?
• Time to do the task (Execution Time)
  – execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
  – throughput, bandwidth
Response time and throughput are often in opposition.

Definitions
• Performance is in units of things-per-second
  – bigger is better
• If we are primarily concerned with response time:

$$\text{Performance}(X) = \frac{1}{\text{Execution\_time}(X)}$$

• "X is n times faster than Y" means:

$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Throughput and Response Time
• Time of Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput of Concorde vs. Boeing 747? (pmph = passenger-miles per hour)
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  – Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster"
• Boeing is 1.6 times ("60%") faster in terms of throughput
• Concorde is 2.2 times ("120%") faster in terms of flying time
We will focus primarily on execution time for a single job.
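The comparison above can be checked with a short Python sketch. It uses only the speed and throughput figures quoted on the slide; the helper name `times_faster` is introduced here for illustration and is not part of the original deck.

```python
def times_faster(perf_x: float, perf_y: float) -> float:
    """Return n such that X is n times faster than Y: n = Performance(X) / Performance(Y)."""
    return perf_x / perf_y


# Figures quoted on the slide.
concorde_speed_mph, boeing_speed_mph = 1350, 610
concorde_pmph, boeing_pmph = 178_200, 286_700  # passenger-miles per hour

# Response-time view: the faster plane finishes one trip sooner.
speed_ratio = times_faster(concorde_speed_mph, boeing_speed_mph)

# Throughput view: the bigger plane moves more passenger-miles per hour.
throughput_ratio = times_faster(boeing_pmph, concorde_pmph)

print(f"Concorde vs. 747, flying speed: {speed_ratio:.1f} times faster")
print(f"747 vs. Concorde, throughput:   {throughput_ratio:.1f} times faster")
```

The same ratio function serves both views; only the quantity chosen as "performance" changes, which is why the two planes can each come out ahead.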

Relative Performance
Computer A is n times faster than Computer B if:

$$\frac{\text{Performance}_A}{\text{Performance}_B} = n$$

or, equivalently:

$$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = n$$

Metrics and their Relation
Most basic metrics: Clock Cycles, Clock Cycle Time, CPU Time, number of Instructions per program

$$\text{CPU Time} = \frac{\text{CPU Clock Cycles}}{\text{Program}} \times \text{Clock Cycle Time}$$

$$\text{CPU Time} = \frac{\text{CPU Clock Cycles} / \text{Program}}{\text{Clock Rate (Frequency)}}$$

$$\frac{\text{CPU Clock Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Average Cycles}}{\text{Instruction}}$$
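As a minimal sketch of how these relations combine, the Python snippet below computes CPU time from an instruction count, an average CPI, and a clock rate. The function name and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def cpu_time_seconds(instruction_count: int, average_cpi: float, clock_rate_hz: float) -> float:
    """CPU Time = (Instructions x Average Cycles/Instruction) / Clock Rate."""
    clock_cycles = instruction_count * average_cpi
    return clock_cycles / clock_rate_hz


# Hypothetical program and machine (not from the slides).
program_instructions = 1_000_000_000   # 10^9 instructions
program_cpi = 2.2                      # average cycles per instruction
machine_clock_rate_hz = 500e6          # 500 MHz, i.e. a 2 ns clock cycle

t_from_rate = cpu_time_seconds(program_instructions, program_cpi, machine_clock_rate_hz)

# Same result using the clock cycle time instead of the clock rate.
clock_cycle_time_s = 1.0 / machine_clock_rate_hz
t_from_cycle_time = program_instructions * program_cpi * clock_cycle_time_s

print(f"CPU time via clock rate:       {t_from_rate:.2f} s")
print(f"CPU time via clock cycle time: {t_from_cycle_time:.2f} s")
```

Both forms give the same CPU time because the clock cycle time is the reciprocal of the clock rate.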

CPI (Cycles Per Instruction)
CPI is the average number of cycles per instruction. Divide CPU time by Clock Cycle Time and Instruction Count to get the CPI:

$$\text{CPI} = \frac{\text{CPU Time} / \text{Clock Cycle Time}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$

• n: number of instructions in the Instruction Set
• CPI_i: number of clock cycles Instruction i takes to execute
• I_i: count of instructions of type i in the program
• F_i: frequency of instructions, F_i = I_i / Instruction Count

$$\text{CPU Time} = \text{Clock Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$$

Invest resources where time is spent.
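The two ways of obtaining the average CPI (total clock cycles divided by instruction count, and the frequency-weighted sum of per-class CPIs) can be compared in a few lines of Python. The per-class cycle counts match the RISC example used later in this deck; the instruction counts are hypothetical and only serve as an illustration.

```python
# Cycles per instruction class (from the deck's RISC example).
cycles_per_class = {"ALU": 1, "Load": 5, "Store": 3, "Branch": 2}

# Hypothetical instruction counts I_i for one program run.
instruction_counts = {"ALU": 500, "Load": 200, "Store": 100, "Branch": 200}

instruction_count = sum(instruction_counts.values())

# Route 1: CPI = Clock Cycles / Instruction Count.
total_cycles = sum(cycles_per_class[op] * instruction_counts[op] for op in instruction_counts)
cpi_from_cycles = total_cycles / instruction_count

# Route 2: CPI = sum of CPI_i * F_i, with F_i = I_i / Instruction Count.
frequencies = {op: count / instruction_count for op, count in instruction_counts.items()}
cpi_from_frequencies = sum(cycles_per_class[op] * frequencies[op] for op in frequencies)

print(f"CPI from total cycles: {cpi_from_cycles:.2f}")
print(f"CPI from frequencies:  {cpi_from_frequencies:.2f}")
```

Both routes agree, since dividing each I_i by the instruction count before or after summing does not change the result.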

Metrics and their Relation (Revisited)

$$\text{CPU Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

• Instructions / Program: implementation and compiler-optimization dependent
• Cycles / Instruction (CPI): variable
• Seconds / Cycle (clock cycle time): fixed

Example (RISC processor)
Typical mix, base machine (Reg / Reg):

Op       Freq   CPI(i)   CPI(i) x Freq
ALU      50%    1        0.5
Load     20%    5        1.0
Store    10%    3        0.3
Branch   20%    2        0.4

1. How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
2. How does this compare with using branch prediction to shave a cycle off the branch time?
3. What if two ALU instructions could be executed at once?

Answering 1.
Computing the CPI before improvement:

Op       Freq   CPI(i)   CPI(i) x Freq
ALU      50%    1        0.5
Load     20%    5        1.0
Store    10%    3        0.3
Branch   20%    2        0.4

CPI1 = 0.5x1 + 0.2x5 + 0.1x3 + 0.2x2 = 2.2

Computing the CPI after improvement:

Op       Freq   CPI(i)   CPI(i) x Freq
ALU      50%    1        0.5
Load     20%    2        0.4
Store    10%    3        0.3
Branch   20%    2        0.4

CPI2 = 0.5x1 + 0.2x2 + 0.1x3 + 0.2x2 = 1.6

Answering 1. (cont.)
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
Answer: It is n times faster, with:

n = CPU Time Before Imp. / CPU Time After Imp.
  = (Clock Cycle Time x CPI1 x Inst. Count) / (Clock Cycle Time x CPI2 x Inst. Count)
  = CPI1 / CPI2 = 2.2 / 1.6 = 1.375

We say:
• CPU is 1.375 times faster, or
• CPU is 37.5% faster

Answering 2.
How does this compare with using branch prediction to shave a cycle off the branch time?
Answer: "Shaving" a cycle off the branch time means the CPI of branches is reduced by one cycle (from 2 to 1).
Computing the CPI after improvement:

Op       Freq   CPI(i)   CPI(i) x Freq
ALU      50%    1        0.5
Load     20%    5        1.0
Store    10%    3        0.3
Branch   20%    1        0.2

CPI2 = 0.5x1 + 0.2x5 + 0.1x3 + 0.2x1 = 2.0
n = CPI1 / CPI2 = 2.2 / 2.0 = 1.1

Reducing the load time (n = 1.375) produces better performance than reducing the branch time (n = 1.1).

Answering 3.
What if two ALU instructions could be executed at once?
Answer: Call the base machine A and the machine that executes two ALU instructions at once B. On machine B an ALU instruction takes virtually half the time to execute, so for ALU instructions CPI(i)B = CPI(i)A / 2.
Computing the CPI of machine B:

Op       Freq   CPI(i)   CPI(i) x Freq
ALU      50%    0.5      0.25
Load     20%    5        1.0
Store    10%    3        0.3
Branch   20%    2        0.4

CPI_B = 0.5x0.5 + 0.2x5 + 0.1x3 + 0.2x2 = 1.95
n = CPI_A / CPI_B = 2.2 / 1.95 ≈ 1.13
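The three questions can be worked side by side with a small Python sketch. It assumes, as the answers do, that the instruction count and the clock cycle time are the same before and after each change, so the speedup reduces to a ratio of average CPIs. The names `BASE_MIX`, `average_cpi`, and `with_cpi` are illustrative helpers, not part of the original material.

```python
# Instruction mix of the base machine: op -> (frequency, cycles per instruction).
BASE_MIX = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}


def average_cpi(mix):
    """Average CPI = sum over classes of frequency x CPI(i)."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix in which one class has a new CPI(i)."""
    updated = dict(mix)
    freq, _ = updated[op]
    updated[op] = (freq, new_cpi)
    return updated


base_cpi = average_cpi(BASE_MIX)

options = {
    "better data cache (Load 5 -> 2 cycles)": with_cpi(BASE_MIX, "Load", 2.0),
    "branch prediction (Branch 2 -> 1 cycle)": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU ops at once (ALU 1 -> 0.5 cycle)": with_cpi(BASE_MIX, "ALU", 0.5),
}

# With equal instruction count and clock cycle time, speedup = CPI_before / CPI_after.
speedups = {name: base_cpi / average_cpi(mix) for name, mix in options.items()}

for name, n in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: CPI = {base_cpi / n:.2f}, speedup = {n:.3f}")
```

Sorting by speedup makes the ranking explicit: the load improvement helps most, because loads account for the largest share of the cycles.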

Time % Evaluation
How to determine which class of instructions takes the most time?
• Evaluate the time percentages of instructions
• They cannot be measured directly (a program has mixed instructions)
• They need to be computed using CPI and frequency

Time % Evaluation
Given:
• Ic: Instruction Count
• Ii: Instruction Count for Instruction Class i
• Fi: Frequency of Instructions of Class i
• Tc: Clock Cycle Time
• CPIi: Clock Cycles / Instruction for Class i
• CPI: Average Clock Cycles / Instruction for the whole program
• Pi: Percentage of time for instructions of Class i

$$\text{CPUtime} = \text{CPI} \times I_c \times T_c$$

$$\text{CPUtime}_i = \text{CPI}_i \times I_i \times T_c, \qquad I_i = I_c \times F_i$$

$$\text{CPUtime}_i = \text{CPI}_i \times I_c \times F_i \times T_c$$

$$P_i = \frac{\text{CPUtime}_i}{\text{CPUtime}} = \frac{\text{CPI}_i \times I_c \times F_i \times T_c}{\text{CPI} \times I_c \times T_c} = \frac{\text{CPI}_i \times F_i}{\text{CPI}}$$
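Applied to the instruction mix of the RISC example, the formula for Pi can be evaluated with the Python sketch below. The function name `time_fractions` is hypothetical; the frequencies and per-class CPIs are the ones from the example table.

```python
# Instruction mix from the RISC example: op -> (F_i, CPI_i).
mix = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}


def time_fractions(instruction_mix):
    """Return P_i = CPI_i * F_i / CPI for every instruction class."""
    cpi = sum(freq * cycles for freq, cycles in instruction_mix.values())
    return {op: freq * cycles / cpi for op, (freq, cycles) in instruction_mix.items()}


fractions = time_fractions(mix)

for op, p in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{op:<7s} {100 * p:5.1f}% of CPU time")
```

Loads make up only 20% of the instructions but take the largest share of the time, which is the class where an improvement pays off most.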

Amdahl's Law
Speed-up due to Enhancement E:

$$\text{Speedup} = \frac{\text{Execution Time w/o E}}{\text{Execution Time w/ E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$$

Suppose that Enhancement E accelerates only a portion F of the execution time, by a factor S.
[Figure: execution-time diagram with segments labeled TA, TFA, TFE, TE]

Amdahl's Law
The new enhancement touches only a fraction F of the whole execution time TA and reduces this fraction by a factor S, while keeping the remainder of TA unchanged.

$$T_E = T_A - T_{FA} + T_{FE} \qquad (T_A - T_{FA} \text{ is unchanged})$$

$$T_{FA} = T_A \times F \qquad (F \text{ is a fraction of } T_A)$$

$$T_{FE} = \frac{T_{FA}}{S} = T_A \times \frac{F}{S} \qquad (\text{time is reduced by a factor } S)$$

$$T_E = T_A - T_A \times F + T_A \times \frac{F}{S} = T_A \times \left(1 - F + \frac{F}{S}\right)$$

This means:

$$\text{Speedup} = \frac{T_A}{T_E} = \frac{1}{1 - F + \frac{F}{S}}$$
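As a quick check of the formula, the Python sketch below applies it to the load improvement from the earlier RISC example: loads take CPI_i x F_i / CPI = 1.0 / 2.2 of the execution time, and going from 5 cycles to 2 speeds that part up by a factor of 2.5. The function name `amdahl_speedup` and its argument checks are assumptions made for this illustration.

```python
def amdahl_speedup(fraction_enhanced: float, enhancement_factor: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("F must be a fraction of the execution time, between 0 and 1")
    if enhancement_factor <= 0.0:
        raise ValueError("S must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / enhancement_factor)


# Load example: F = share of time spent in loads, S = 5 cycles / 2 cycles.
load_time_fraction = (0.2 * 5) / 2.2
load_enhancement = 5 / 2

overall = amdahl_speedup(load_time_fraction, load_enhancement)

# Even an infinitely fast load path cannot beat 1 / (1 - F).
upper_bound = 1.0 / (1.0 - load_time_fraction)

print(f"Speedup with loads at 2 cycles: {overall:.3f}")
print(f"Upper bound as S grows:         {upper_bound:.3f}")
```

The result agrees with the CPI-ratio answer of 1.375, and the upper bound shows the limit that the unchanged part of the execution time imposes.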

Benchmarks
• Few users run the same program over and over
• Programs specially developed to compare performance are needed
• Best reference: a real application
• A real application is NOT common to all users
Benchmarks are programs developed for the sole purpose of performance evaluation.

Typical Workload

Full Application Benchmark

Small Benchmarks

SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer
  – go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive
  – tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  – eliminates special undocumented incantations that may not even generate working code for real programs

Fallacies and Pitfalls
• Amdahl's Law sets a limit: the achievable speedup is NOT unlimited
  – Improving one aspect cannot improve the overall performance by a factor proportional to the size of the improvement
• Hardware-independent metrics DO NOT predict performance
  – Code size, implementation of software systems
• Using MIPS (Millions of Instructions Per Second) as a performance metric
  – Instructions have different CPIs
  – The MIPS metric varies from one program to another on the SAME CPU
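The MIPS pitfall can be illustrated with a short Python sketch. It uses the standard relation MIPS = Clock Rate / (CPI x 10^6); the 500 MHz clock rate and the second, ALU-heavy instruction mix are hypothetical values chosen for this illustration, while the first mix is the one from the RISC example.

```python
CLOCK_RATE_HZ = 500e6  # hypothetical 500 MHz CPU, the same for both programs

# Cycles per instruction class (from the RISC example).
CYCLES = {"ALU": 1, "Load": 5, "Store": 3, "Branch": 2}

# Instruction frequencies of two programs running on the same CPU.
program_a_mix = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}  # example mix
program_b_mix = {"ALU": 0.80, "Load": 0.10, "Store": 0.05, "Branch": 0.05}  # hypothetical


def mips(mix, clock_rate_hz):
    """MIPS = Clock Rate / (CPI x 10^6), with CPI = sum of F_i x CPI_i."""
    cpi = sum(freq * CYCLES[op] for op, freq in mix.items())
    return clock_rate_hz / (cpi * 1e6)


mips_a = mips(program_a_mix, CLOCK_RATE_HZ)
mips_b = mips(program_b_mix, CLOCK_RATE_HZ)

print(f"Program A: {mips_a:.1f} MIPS")
print(f"Program B: {mips_b:.1f} MIPS")
```

The hardware is identical in both cases, yet the MIPS rating differs because it depends on each program's instruction mix.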
