COE 308. Term - 051. Dr Abdelhafid Bouhraoua. Performance. Need for Performance. Goal: To Have Some Predictability Over Computer Usage. Consequence: To Be Able To Adequately Choose The Right Computer For A Given Application
COE 308 Term - 051 Dr Abdelhafid Bouhraoua Performance
Need for Performance
Goal: To Have Some Predictability Over Computer Usage
Consequence: To Be Able To Adequately Choose The Right Computer For A Given Application
Examples where Performance is needed
• High Accessibility
  – Data-Base Server
  – Web Server
  – Banking System
• High Speed
  – Astronomy
  – Genetic Research
  – Weather Prediction
• Low Cost
  – POS Terminal
  – Portable Device
  – Cell Phone
  – Embedded Apps (Appliances, Toys, …)
Defining Performance
• Speed ?
• Accessibility ?
• Cost ?
Only Speed Is Considered in This Context
What Speed ?
Which Plane Has Higher Performance ?
• Time to do the task (Execution Time)
  – execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
  – throughput, bandwidth
Response time and throughput are often in opposition
Definitions
• Performance is in units of things-per-second
  – bigger is better
• If we are primarily concerned with response time:
$$\text{Performance}(x) = \frac{1}{\text{Execution\_time}(x)}$$
• "X is n times faster than Y" means:
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
Throughput and Response Time
• Time of Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput of Concorde vs. Boeing 747 (in pmph, passenger-miles per hour)?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  – Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”
• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job
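The two ratios on this slide can be checked with a few lines of Python. This is only a sketch: the speeds and passenger-mile figures are the slide's own numbers, while the helper name `times_faster` is an illustrative choice.

```python
def times_faster(perf_x, perf_y):
    """Return n such that X is n times faster than Y (bigger performance is better)."""
    return perf_x / perf_y

# Figures taken from the slide.
concorde_mph, boeing_mph = 1350, 610
concorde_pmph, boeing_pmph = 178_200, 286_700  # passenger-miles per hour

# Response-time view: the faster plane wins.
speed_ratio = times_faster(concorde_mph, boeing_mph)
# Throughput view: the plane that moves more passenger-miles per hour wins.
throughput_ratio = times_faster(boeing_pmph, concorde_pmph)

print(f"Concorde vs. 747, flying speed: {speed_ratio:.2f}x")
print(f"747 vs. Concorde, throughput:   {throughput_ratio:.2f}x")
```

The same function answers both questions; only the definition of "performance" passed in changes, which is why the two planes swap places.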
Relative Performance
Computer A is n times faster than Computer B if:
$$\frac{\text{Performance}_A}{\text{Performance}_B} = n$$
or, equivalently:
$$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = n$$
Metrics and their Relation
Most basic metrics: Clock Cycles, Clock Cycle Time, CPU Time, number of instructions per program
$$\text{CPU Time} = \frac{\text{CPU Clock Cycles}}{\text{Program}} \times \text{Clock Cycle Time}$$
$$\text{CPU Time} = \frac{\text{CPU Clock Cycles} / \text{Program}}{\text{Clock Rate (Frequency)}}$$
$$\frac{\text{CPU Clock Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \text{Average Cycles per Instruction}$$
CPI (Cycles Per Instruction)
CPI = Average Cycles Per Instruction
$$\text{CPI} = \frac{\text{CPU Time} / \text{Clock Cycle Time}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$
• n: number of instructions in the instruction set
• CPIi: number of clock cycles instruction i takes to execute
• Ii: count of instructions of type i in the program
• Fi: frequency of instructions of type i, Fi = Ii / Instruction Count
$$\text{CPU Time} = \text{Clock Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$
Divide CPU time by Clock Cycle Time and Instruction Count to get the CPI:
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$$
Invest Resource Where Time Is Spent
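As a rough illustration of the definition, the sketch below computes the average CPI two ways: from per-class instruction counts Ii, and from frequencies Fi. The instruction counts and per-class cycle costs are hypothetical values chosen for the example, not figures from the course.

```python
def average_cpi(counts, cpi_per_class):
    """Average CPI = total clock cycles / instruction count."""
    instruction_count = sum(counts.values())
    total_cycles = sum(cpi_per_class[op] * n for op, n in counts.items())
    return total_cycles / instruction_count

# Hypothetical program: instruction count I_i per class (not from the slides).
counts = {"alu": 600, "load": 300, "store": 100}
# Hypothetical cycles per instruction CPI_i for each class.
cpi_per_class = {"alu": 1, "load": 4, "store": 2}

# Frequencies F_i = I_i / instruction count.
total = sum(counts.values())
frequencies = {op: n / total for op, n in counts.items()}

cpi_from_counts = average_cpi(counts, cpi_per_class)
cpi_from_freqs = sum(cpi_per_class[op] * f for op, f in frequencies.items())

print(f"CPI from counts:      {cpi_from_counts:.2f}")
print(f"CPI from frequencies: {cpi_from_freqs:.2f}")
```

Both routes give the same value, since dividing each Ii by the instruction count is exactly what turns the cycle total into a weighted average.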
Metrics and their Relation - Revisited
$$\text{CPU Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
• Instructions / Program: implementation and compiler optimization dependent
• Cycles / Instruction (CPI): variable
• Seconds / Cycle (Clock Cycle): fixed
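A minimal sketch of the three-factor CPU time equation follows. The instruction count, CPI, and clock rate used here are hypothetical, and the function name `cpu_time` is just an illustrative choice.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time (s) = instructions/program x cycles/instruction x seconds/cycle."""
    clock_cycle_time = 1.0 / clock_rate_hz  # seconds per cycle
    return instruction_count * cpi * clock_cycle_time

# Hypothetical program and machine (values assumed for illustration only).
INSTRUCTION_COUNT = 2_000_000
CPI = 2.2
CLOCK_RATE_HZ = 500e6  # 500 MHz

t = cpu_time(INSTRUCTION_COUNT, CPI, CLOCK_RATE_HZ)

# Equivalent form: CPU time = clock cycles / clock rate.
clock_cycles = INSTRUCTION_COUNT * CPI
t_from_cycles = clock_cycles / CLOCK_RATE_HZ

print(f"CPU time: {t * 1e3:.2f} ms")
```

Because the three factors multiply, halving any one of them (fewer instructions, lower CPI, or a shorter clock cycle) halves the CPU time, provided the other two stay the same.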
Example
• Example (RISC processor): typical mix, base machine (Reg / Reg)

Op      Freq   CPI(i)   CPI(i) x Freq
ALU     50%    1        .5
Load    20%    5        1.0
Store   10%    3        .3
Branch  20%    2        .4

1. How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
2. How does this compare with using branch prediction to shave a cycle off the branch time?
3. What if two ALU instructions could be executed at once?
Answering 1.
• Computing the CPI before improvement:

Op      Freq   CPI(i)   CPI(i) x Freq
ALU     50%    1        .5
Load    20%    5        1.0
Store   10%    3        .3
Branch  20%    2        .4

CPI1 = .5x1 + .2x5 + .1x3 + .2x2 = 2.2

• Computing the CPI after improvement:

Op      Freq   CPI(i)   CPI(i) x Freq
ALU     50%    1        .5
Load    20%    2        .4
Store   10%    3        .3
Branch  20%    2        .4

CPI2 = .5x1 + .2x2 + .1x3 + .2x2 = 1.6
Answering 1. (cont.)
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
Answer: It is n times faster with:
$$n = \frac{\text{CPU Time Before Imp.}}{\text{CPU Time After Imp.}} = \frac{\text{Clock Cycle Time} \times \text{CPI}_1 \times \text{Inst. Count}}{\text{Clock Cycle Time} \times \text{CPI}_2 \times \text{Inst. Count}} = \frac{\text{CPI}_1}{\text{CPI}_2} = \frac{2.2}{1.6} = 1.375$$
We say:
• CPU is 1.375 times faster, or
• CPU is 37.5% faster
Answering 2.
How does this compare with using branch prediction to shave a cycle off the branch time?
Answer: “Shaving” a cycle off the branch time means the CPI of a branch is reduced by one cycle.
• Computing the CPI after improvement:

Op      Freq   CPI(i)   CPI(i) x Freq
ALU     50%    1        .5
Load    20%    5        1.0
Store   10%    3        .3
Branch  20%    1        .2

CPI2 = .5x1 + .2x5 + .1x3 + .2x1 = 2.0

The speedup is CPI1 / CPI2 = 2.2 / 2.0 = 1.1, compared with 1.375 for the better data cache.
Reducing the load time produces better performance than reducing the branch time.
Answering 3.
What if two ALU instructions could be executed at once?
Answer: Two instructions executed at once means that, for one ALU instruction, it takes virtually half the time to execute on the improved machine (machine B) compared with the base machine (machine A). So, for ALU instructions, CPI(i)B = CPI(i)A / 2.
• Computing the CPI of machine B:

Op      Freq   CPI(i)   CPI(i) x Freq
ALU     50%    .5       .25
Load    20%    5        1.0
Store   10%    3        .3
Branch  20%    2        .4

CPIB = .5x.5 + .2x5 + .1x3 + .2x2 = 1.95

The speedup is CPIA / CPIB = 2.2 / 1.95 ≈ 1.13.
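The three answers can be compared side by side with a short sketch. The instruction mix and cycle counts come from the example slides; the helper names `mix_cpi` and `with_cpi` are illustrative. As in the worked answers, the clock cycle time and instruction count are assumed unchanged, so the speedup reduces to a ratio of CPIs.

```python
# Instruction mix from the example: op -> (frequency, cycles per instruction).
BASE_MIX = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}

def mix_cpi(mix):
    """Average CPI = sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())

def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix where one class has a new cycle count."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed

base_cpi = mix_cpi(BASE_MIX)
options = {
    "better data cache (load = 2 cycles)": with_cpi(BASE_MIX, "Load", 2.0),
    "branch prediction (branch = 1 cycle)": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU ops at once (ALU = 0.5 cycle)": with_cpi(BASE_MIX, "ALU", 0.5),
}

# Same clock and instruction count, so speedup = CPI before / CPI after.
speedups = {name: base_cpi / mix_cpi(mix) for name, mix in options.items()}

for name, speedup in speedups.items():
    print(f"{name:<40} CPI = {mix_cpi(options[name]):.2f}  speedup = {speedup:.3f}")
```

The better data cache comes out ahead because loads account for the largest share of cycles in this mix, even though ALU instructions are the most frequent.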
Time % Evaluation
How to determine which class of instructions takes the most time?
• Evaluate the time percentages of instructions
• They cannot be measured directly (a program has mixed instructions)
• They need to be computed using CPI and frequency
Time % Evaluation
• Given:
  – Ic: Instruction Count
  – Ii: Instruction Count for Instruction Class i
  – Fi: Frequency of Instructions of Class i
  – Tc: Clock Cycle Time
  – CPIi: Clock Cycles / Instruction for Class i
  – CPI: Average Clock Cycles / Instruction for the whole program
  – Pi: Percentage of time for instructions of Class i
• CPUtime = CPI x Ic x Tc
• CPUtimei = CPIi x Ii x Tc
• Ii = Ic x Fi
• CPUtimei = CPIi x Ic x Fi x Tc
• Pi = CPUtimei / CPUtime
• Pi = CPIi x Ic x Fi x Tc / (CPI x Ic x Tc)
$$P_i = \frac{\text{CPI}_i \times F_i}{\text{CPI}}$$
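Applying Pi = CPIi x Fi / CPI to the RISC example mix from the earlier slides gives the share of execution time per instruction class. The sketch below is one way to compute it; the function name `time_shares` is an illustrative choice.

```python
# Instruction mix from the RISC example: op -> (frequency F_i, CPI_i).
MIX = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}

def time_shares(mix):
    """Return P_i = CPI_i x F_i / CPI for every instruction class."""
    cpi = sum(freq * cycles for freq, cycles in mix.values())
    return {op: (freq * cycles) / cpi for op, (freq, cycles) in mix.items()}

shares = time_shares(MIX)

# List the classes from most to least time-consuming.
for op, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<7} {share:6.1%}")
```

Loads are only 20% of the instructions but take the largest share of the time, which is what makes them the most rewarding class to optimize in this mix.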
Amdahl’s Law
Speed-up due to Enhancement E:
$$\text{Speedup} = \frac{\text{Execution Time w/o E}}{\text{Execution Time w/ E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$$
Suppose that Enhancement E accelerates only a portion F of the execution time, by a factor S.
[Figure: diagram labeled TA, TFA, TFE and TE]
Amdahl’s Law
The new enhancement touches only a fraction F of the whole execution time TA and reduces this fraction by a factor S, while keeping the remainder of TA unchanged.
• TE = TA – TFA + TFE (TA – TFA is unchanged)
• TFA = TA * F (F is a fraction of TA)
• TFE = TFA / S = TA * F / S (time is reduced by a factor S)
• TE = TA – TA * F + TA * F / S
This means:
$$T_E = T_A \times \left(1 - F + \frac{F}{S}\right)$$
$$\text{Speedup} = \frac{T_A}{T_E} = \frac{1}{1 - F + \frac{F}{S}}$$
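A small sketch of the speedup formula, under the same symbols F and S. As a check, it is applied to the data-cache question from the RISC example: there, loads take F = 1.0 / 2.2 of the execution time and are accelerated by S = 5 / 2. The function name and input validation are illustrative additions.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is accelerated by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)

# Data-cache example: loads cost 0.2 x 5 = 1.0 of the 2.2 average cycles,
# and the load time drops from 5 cycles to 2.
load_fraction = 1.0 / 2.2
load_factor = 5.0 / 2.0
cache_speedup = amdahl_speedup(load_fraction, load_factor)

# Upper bound: even an infinitely fast load path cannot beat 1 / (1 - F).
upper_bound = 1.0 / (1.0 - load_fraction)

print(f"Speedup with the better cache: {cache_speedup:.3f}")
print(f"Limit as S grows without bound: {upper_bound:.3f}")
```

The result matches the 1.375 obtained from the CPI ratio, and the bound shows why the untouched fraction (1 - F) caps the benefit of any single enhancement.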
Benchmarks
• Few users run the same program over and over
• Need programs specially developed to compare performance
• Best reference: a real application
• A real application is NOT common to all users
Benchmarks are programs developed for the sole purpose of performance evaluation
SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer
  – go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive
  – tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  – eliminates special undocumented incantations that may not even generate working code for real programs
Fallacies and Pitfalls
• Amdahl’s law sets a limit: the achievable speedup is NOT unlimited
  – Improving one aspect cannot improve the overall performance by a factor proportional to the size of the improvement
• Hardware-independent metrics DO NOT predict performance
  – Code size, implementation of software systems
• Using MIPS (Millions of Instructions Per Second) as a performance metric
  – Instructions have different CPIs
  – The MIPS metric varies from one program to another on the SAME CPU
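The MIPS pitfall can be illustrated with a short sketch. It relies on the standard relation MIPS = Instruction Count / (Execution Time x 10^6) = Clock Rate / (CPI x 10^6); the clock rate and the two program CPIs below are hypothetical values chosen for the example.

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6), for a given program's average CPI."""
    return clock_rate_hz / (cpi * 1e6)

CLOCK_RATE_HZ = 500e6  # hypothetical 500 MHz CPU

# Two hypothetical programs with different instruction mixes,
# hence different average CPIs on the same CPU.
program_cpis = {"program A": 1.6, "program B": 2.2}

ratings = {name: mips(CLOCK_RATE_HZ, cpi) for name, cpi in program_cpis.items()}

for name, rating in ratings.items():
    print(f"{name}: {rating:.1f} MIPS")
```

The hardware is identical in both cases, yet the rating changes with the program, so a single MIPS number says little without knowing which instructions were executed.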