COE 308

  1. COE 308: Performance
     Term 051
     Dr Abdelhafid Bouhraoua

  3. Need for Performance
     Goal: To Have Some Predictability Over Computer Usage
     Consequence: To Be Able To Adequately Choose The Right Computer For A Given Application

  4. Examples where Performance is needed
     • High Accessibility
        • Database Server
        • Web Server
        • Banking System
     • High Speed
        • Astronomy
        • Genetic Research
        • Weather Prediction
     • Low Cost
        • POS Terminal
        • Portable Device
        • Cell Phone
        • Embedded Apps (Appliances, Toys, …)

  6. Defining Performance
     • Speed?
     • Accessibility?
     • Cost?
     Only Speed Is Considered in This Context

  8. What Speed? Which Plane Has Higher Performance?
     • Time to do the task (Execution Time)
        - execution time, response time, latency
     • Tasks per day, hour, week, sec, ns, ... (Performance)
        - throughput, bandwidth
     Response time and throughput are often in opposition

  9. Definitions
     • Performance is in units of things-per-second
        • bigger is better
     • If we are primarily concerned with response time:
       $\text{Performance}(x) = \frac{1}{\text{Execution\_time}(x)}$
     • "X is n times faster than Y" means:
       $n = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$

  10. Throughput and Response Time
     • Time of Concorde vs. Boeing 747?
        • Concorde is 1350 mph / 610 mph = 2.2 times faster (flying time: 6.5 hours / 3 hours)
     • Throughput of Concorde vs. Boeing 747? (pmph = passenger-miles per hour)
        • Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
        • Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster"
     • Boeing is 1.6 times ("60%") faster in terms of throughput
     • Concorde is 2.2 times ("120%") faster in terms of flying time
     We will focus primarily on execution time for a single job
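
The two comparisons on this slide can be reproduced with a few lines of arithmetic. The sketch below uses only the speed and throughput figures quoted above; the dictionary layout and variable names are illustrative choices, not part of the original slides.

```python
# Figures quoted on the slide (pmph = passenger-miles per hour).
concorde = {"speed_mph": 1350, "throughput_pmph": 178_200}
boeing_747 = {"speed_mph": 610, "throughput_pmph": 286_700}

# Response-time view: a higher cruising speed means a shorter flying time.
speed_ratio = concorde["speed_mph"] / boeing_747["speed_mph"]

# Throughput view: passenger-miles delivered per hour.
throughput_ratio = boeing_747["throughput_pmph"] / concorde["throughput_pmph"]

print(f"Concorde vs 747, flying time: {speed_ratio:.1f}x faster")
print(f"747 vs Concorde, throughput:  {throughput_ratio:.1f}x higher")
```

Which plane "wins" depends on the metric chosen, which is why the slides fix execution time as the metric from here on.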

  13. Relative Performance
     Computer A is n Times Faster Than Computer B if:
     $\frac{\text{Performance}_A}{\text{Performance}_B} = n$
     Or
     $\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = n$
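
As a small illustration of the definition, the sketch below computes n from two execution times. The function names and the 10 s / 15 s timings are hypothetical values chosen for the example.

```python
def performance(execution_time_s):
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(exec_time_a_s, exec_time_b_s):
    """Return n such that computer A is n times faster than computer B."""
    return performance(exec_time_a_s) / performance(exec_time_b_s)


# Hypothetical timings: the same program takes 10 s on A and 15 s on B.
n = times_faster(10.0, 15.0)
print(f"A is {n:.2f} times faster than B")
```

Because performance is the reciprocal of execution time, the same n is obtained by dividing the execution time of B by that of A.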

  14. Metrics and their Relation
     Most Basic Metrics: Clock Cycles, Clock Cycle Time, CPU Time, # of Instructions per program
     $\text{CPU Time} = \frac{\text{CPU Clock Cycles}}{\text{Program}} \times \text{Clock Cycle Time}$
     $\text{CPU Time} = \frac{\text{CPU Clock Cycles} / \text{Program}}{\text{Clock Rate (Frequency)}}$
     $\frac{\text{CPU Clock Cycles}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \text{Average Cycles per Instruction}$
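
A minimal sketch of how these metrics combine. The instruction count, average cycles per instruction, and 500 MHz clock rate are hypothetical numbers picked only to make the arithmetic easy to follow.

```python
def cpu_time_s(instruction_count, avg_cycles_per_instr, clock_rate_hz):
    """CPU time = (instructions x average cycles per instruction) / clock rate."""
    clock_cycles = instruction_count * avg_cycles_per_instr
    return clock_cycles / clock_rate_hz


# Hypothetical program: 1e9 instructions, 2.2 cycles each on average, 500 MHz clock.
t = cpu_time_s(1_000_000_000, 2.2, 500e6)
print(f"CPU time: {t:.2f} s")

# Equivalent form using the clock cycle time (the reciprocal of the clock rate).
clock_cycle_time_s = 1.0 / 500e6
t_alt = 1_000_000_000 * 2.2 * clock_cycle_time_s
```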

  16. CPI (Cycles Per Instruction)
     Average Cycles Per Instruction:
     $\text{CPI} = \frac{\text{CPU Time} / \text{Clock Cycle Time}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$
     (Divide CPU time by Clock Cycle Time and Instruction Count to get the CPI)
     • n: number of instructions in the Instruction Set
     • CPIi: number of clock cycles Instruction i takes to execute
     • Ii: count of instructions of type i in the program
     $\text{CPU Time} = \text{Clock Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$
     • Fi: frequency of instructions of type i, Fi = Ii / Instruction Count
     $\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$
     Invest Resource Where Time Is Spent
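
The two ways of obtaining the average CPI (total cycles divided by instruction count, or a frequency-weighted sum of per-class CPIs) give the same number. The sketch below checks this on a hypothetical program of 1000 instructions; the counts and per-class CPIs are assumptions made for the example.

```python
# Hypothetical program: instruction class -> (count I_i, cycles CPI_i).
program = {
    "ALU": (500, 1),
    "Load": (200, 5),
    "Store": (100, 3),
    "Branch": (200, 2),
}

instruction_count = sum(count for count, _ in program.values())

# Route 1: total clock cycles divided by instruction count.
clock_cycles = sum(count * cpi for count, cpi in program.values())
cpi_from_cycles = clock_cycles / instruction_count

# Route 2: sum of CPI_i x F_i, with F_i = I_i / instruction count.
cpi_from_freq = sum(cpi * (count / instruction_count)
                    for count, cpi in program.values())

print(cpi_from_cycles, round(cpi_from_freq, 6))
```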

  17. Metrics and their Relation - Revisited
     $\text{CPU Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
     • Instructions / Program: implementation and compiler-optimization dependent
     • Cycles / Instruction (CPI): variable
     • Seconds / Cycle (clock cycle time): fixed

  18. Example
     • Example (RISC processor)
     • Typical Mix, Base Machine (Reg / Reg)

       Op       Freq   CPI(i)   CPI(i) x Freq
       ALU      50%    1        0.5
       Load     20%    5        1.0
       Store    10%    3        0.3
       Branch   20%    2        0.4

     • How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
     • How does this compare with using branch prediction to shave a cycle off the branch time?
     • What if two ALU instructions could be executed at once?

  19. Answering 1.
     • Computing the CPI Before Improvement:

       Op       Freq   CPI(i)   CPI(i) x Freq
       ALU      50%    1        0.5
       Load     20%    5        1.0
       Store    10%    3        0.3
       Branch   20%    2        0.4

       CPI1 = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 2.2

     • Computing the CPI After Improvement:

       Op       Freq   CPI(i)   CPI(i) x Freq
       ALU      50%    1        0.5
       Load     20%    2        0.4
       Store    10%    3        0.3
       Branch   20%    2        0.4

       CPI2 = 0.5 x 1 + 0.2 x 2 + 0.1 x 3 + 0.2 x 2 = 1.6

  22. Answering 1. (cont.)
     • How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
     • Answer: It is n times faster with:
       n = CPU Time Before Imp. / CPU Time After Imp.
         = (Clock Cycle Time x CPI1 x Inst. Count) / (Clock Cycle Time x CPI2 x Inst. Count)
         = CPI1 / CPI2 = 2.2 / 1.6 = 1.375
     • We Say:
        • CPU is 1.375 times faster, or
        • CPU is 37.5% faster

  23. Answering 2.
     How does this compare with using branch prediction to shave a cycle off the branch time?
     Answer: "Shaving" a cycle off the branch time means the CPI of a branch is reduced by one cycle
     • Computing the CPI After Improvement:

       Op       Freq   CPI(i)   CPI(i) x Freq
       ALU      50%    1        0.5
       Load     20%    5        1.0
       Store    10%    3        0.3
       Branch   20%    1        0.2

       CPI2 = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 1 = 2.0
       n = CPI1 / CPI2 = 2.2 / 2.0 = 1.1

     Reducing the load time (n = 1.375) produces better performance than reducing the branch time (n = 1.1)

  24. Answering 3.
     What if two ALU instructions could be executed at once?
     Answer: Two ALU instructions executed at once means that each ALU instruction takes virtually half the time to execute on the improved machine B.
     So, for ALU instructions, CPI(i)B = CPI(i)A / 2
     • Computing the CPI of Machine B:

       Op       Freq   CPI(i)   CPI(i) x Freq
       ALU      50%    0.5      0.25
       Load     20%    5        1.0
       Store    10%    3        0.3
       Branch   20%    2        0.4

       CPIB = 0.5 x 0.5 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 1.95
       n = CPIA / CPIB = 2.2 / 1.95 ≈ 1.13
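
The three answers can be checked side by side. The sketch below recomputes the average CPI for the base instruction mix and for each proposed improvement, assuming (as in the derivation for question 1) that instruction count and clock cycle time stay the same, so the speedup reduces to a ratio of CPIs. The helper names are illustrative.

```python
# Instruction mix from the example: class -> (frequency, CPI).
BASE_MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def average_cpi(mix):
    """Frequency-weighted average cycles per instruction."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, new_cpi)
    return changed


scenarios = {
    "better data cache (Load CPI 5 -> 2)": with_cpi(BASE_MIX, "Load", 2.0),
    "branch prediction (Branch CPI 2 -> 1)": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU ops at once (ALU CPI 1 -> 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
}

base_cpi = average_cpi(BASE_MIX)
speedups = {name: base_cpi / average_cpi(mix) for name, mix in scenarios.items()}

for name, n in speedups.items():
    print(f"{name}: {n:.3f} times faster")
```

The data cache improvement gives the largest gain because loads account for the largest share of cycles in this mix.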

  25. Time % Evaluation
     How to determine which class of instructions takes the most time?
     • Evaluate the time percentage of each instruction class
     • It cannot be measured directly (a program has mixed instructions)
     • It needs to be computed using CPI and frequency

  26. Time % Evaluation
     • Given:
        • Ic: Instruction Count
        • Ii: Instruction Count for Instruction Class i
        • Fi: Frequency of Instructions of Class i
        • Tc: Clock Cycle Time
        • CPIi: Clock Cycles / Instruction for Class i
        • CPI: Average Clock Cycles / Instruction for the whole program
        • Pi: Percentage of time for instructions of Class i
     $\text{CPUtime} = \text{CPI} \times I_c \times T_c$
     $\text{CPUtime}_i = \text{CPI}_i \times I_i \times T_c$
     $I_i = I_c \times F_i$
     $\text{CPUtime}_i = \text{CPI}_i \times I_c \times F_i \times T_c$
     $P_i = \frac{\text{CPUtime}_i}{\text{CPUtime}} = \frac{\text{CPI}_i \times I_c \times F_i \times T_c}{\text{CPI} \times I_c \times T_c}$
     $P_i = \frac{\text{CPI}_i \times F_i}{\text{CPI}}$
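
Applied to the instruction mix of the earlier RISC example, the final formula gives the share of execution time spent in each instruction class. This is a sketch; the dictionary layout and function name are illustrative.

```python
# Instruction mix from the RISC example: class -> (frequency F_i, CPI_i).
MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def time_fractions(mix):
    """P_i = CPI_i x F_i / CPI for every instruction class."""
    avg_cpi = sum(freq * cpi for freq, cpi in mix.values())
    return {op: freq * cpi / avg_cpi for op, (freq, cpi) in mix.items()}


fractions = time_fractions(MIX)
for op, p in fractions.items():
    print(f"{op:<7}{p:6.1%}")
```

Loads make up only 20% of the instructions but take the largest share of the time, which is consistent with the data cache being the most effective improvement in the example.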

  29. Amdahl's Law
     Speed-up due to Enhancement E:
     $\text{Speedup} = \frac{\text{Execution Time w/o E}}{\text{Execution Time w/ E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$
     Suppose that Enhancement E accelerates only a portion F, by a factor S
     [Figure: execution-time diagram with labels TA, TFA, TE, TFE]

  31. Amdahl's Law
     The new enhancement touches only a fraction F of the whole execution time TA and reduces this fraction by a factor S, while keeping the remainder of TA unchanged.
     • $T_E = T_A - T_{FA} + T_{FE}$ (TA - TFA is unchanged)
     • $T_{FA} = T_A \times F$ (F is a fraction of TA)
     • $T_{FE} = T_{FA} / S = T_A \times F / S$ (time is reduced by a factor S)
     • $T_E = T_A - T_A \times F + T_A \times F / S$
     Means:
     $T_E = T_A \times (1 - F + F/S)$
     $\text{Speedup} = \frac{T_A}{T_E} = \frac{1}{1 - F + F/S}$
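
A short sketch of the speedup formula. As a cross-check it is applied to the data cache question from the RISC example: loads take 1.0 of the 2.2 average cycles per instruction, so F = 1.0 / 2.2, and cutting the load CPI from 5 to 2 gives S = 2.5. The function name is an illustrative choice.

```python
def amdahl_speedup(fraction_enhanced, factor):
    """Overall speedup when a fraction F of the time is accelerated by a factor S."""
    return 1.0 / (1.0 - fraction_enhanced + fraction_enhanced / factor)


# Loads: F = (0.2 x 5) / 2.2 of the time, accelerated by S = 5 / 2.
f_load = 1.0 / 2.2
print(f"Load CPI 5 -> 2: {amdahl_speedup(f_load, 2.5):.3f}")

# Even an arbitrarily fast load path is bounded by 1 / (1 - F).
print(f"Upper bound:     {1.0 / (1.0 - f_load):.3f}")
```

The first result matches the 1.375 obtained from the CPI ratio, and the bound shows that no load improvement alone can make this machine even twice as fast.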

  32. Benchmarks
     • Few users run the same program over and over
     • Need programs specially developed to compare performance
     • Best Reference: Real Application
     • Real Application is NOT common to all users
     Benchmarks are programs developed for the sole purpose of Performance Evaluation

  33. Typical Workload

  34. Full Application Benchmark

  35. Small Benchmarks

  36. SPEC95
     • Eighteen application benchmarks (with inputs) reflecting a technical computing workload
     • Eight integer
        • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
     • Ten floating-point intensive
        • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
     • Must run with standard compiler flags
        • eliminate special undocumented incantations that may not even generate working code for real programs

  37. Fallacies and Pitfalls
     • Amdahl's law sets a limit on speedup: the gain is NOT unlimited
        • Improving one aspect cannot improve the overall performance by a factor proportional to the size of the improvement
     • Hardware-independent metrics DO NOT predict performance
        • Code size, implementation of software systems
     • Using MIPS (Millions of Instructions Per Second) as a performance metric
        • Instructions have different CPIs
        • The MIPS metric varies from one program to another on the SAME CPU
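
The MIPS pitfall can be made concrete with the standard relation MIPS = clock rate / (CPI x 10^6). The sketch below assumes a hypothetical 500 MHz CPU running two hypothetical programs whose instruction mixes give average CPIs of 2.2 and 1.6; none of these numbers describe a real machine.

```python
def mips(clock_rate_hz, avg_cpi):
    """Millions of instructions per second for a given clock rate and average CPI."""
    return clock_rate_hz / (avg_cpi * 1e6)


CLOCK_RATE_HZ = 500e6  # hypothetical 500 MHz CPU

# Same CPU, two hypothetical programs with different instruction mixes.
mips_program_1 = mips(CLOCK_RATE_HZ, 2.2)
mips_program_2 = mips(CLOCK_RATE_HZ, 1.6)

print(f"Program 1: {mips_program_1:.1f} MIPS")
print(f"Program 2: {mips_program_2:.1f} MIPS")
```

The hardware is identical in both cases, yet the MIPS rating differs, so a single MIPS figure says little about how long a particular program will take.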
