1 / 58

Enhancing Scientific Applications with Performance Grid Enablement

Explore performance, benchmarking, and parallel profiling for optimal results in grid computing. Learn how to improve execution time and rate of scientific applications.

fayt
Download Presentation

Enhancing Scientific Applications with Performance Grid Enablement

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Special Topics in Grid Enablement of Scientific Applications (CIS 6612) Performance/Profiling Presented by Javier Figueroa Instructor: S. Masoud Sadjadi http://www.cs.fiu.edu/~sadjadi/Teaching/ sadjadi At cs Dot fiu Dot edu

  2. Presentation Outline • Performance • Benchmarking • Parallel Performance • Profiling • Dimemas • Paraver • Acknowledgements

  3. Why Performance? • To do a time-consuming operation in less time • I am an aircraft engineer • I need to run a simulation to test the stability of the wings at high speed • I’d rather have the result in 5 minutes than in 5 hours so that I can complete the aircraft final design sooner. • To do an operation before a tighter deadline • I am a weather prediction agency • I am getting input from weather stations/sensors • I’d like to make the forecast for tomorrow before tomorrow

  4. Why Performance? • To do a high number of operations per seconds • I am the CTO of Amazon.com • My Web server gets 1,000 hits per seconds • I’d like my Web server and my databases to handle 1,000 transactions per seconds so that customers do not experience delays • Also called scalability • Amazon does “process” several GBytes of data per seconds

  5. Performance - Software • “High Performance Code” • Performance concerns • Correctness • You will see that when trying to make code go fast one often breaks it • Readability • Fast code typically requires more lines! • Modularity can hurt performance • e.g., Too many classes • Portability • Code that is fast on machine A can be slow on machine B • At the extreme, highly optimized code is not portable at all, and in fact is done in hardware!

  6. Performance - Time • Time between the start and the end of an operation • Also called running time, elapsed time, wall-clock time, response time, latency, execution time, ... • Most straightforward measure: “my program takes 12.5s on a Pentium 3.5GHz” • Can be normalized to some reference time • Must be measured on a “dedicated” machine

  7. Performance - Rate • Used often so that performance can be independent on the “size” of the application • e.g., compressing a 1MB file takes 1 minute. compressing a 2MB file takes 2 minutes. The performance is the same. • Millions of instructions / sec (MIPS) • MIPS = instruction count / (execution time * 106) = clock rate / (CPI * 106) • But Instructions Set Architectures are not equivalent • 1 CISC instruction = many RISC instructions • Programs use different instruction mixes • May be ok for same program on same architectures

  8. Performance – Rate cont. • Millions of floating point operations /sec (MFlops) • Very popular, but often misleading • e.g., A high MFlops rate in a badly designed algorithm could have poor application performance • Application-specific: • Millions of frames rendered per second • Millions of amino-acid compared per second • Millions of HTTP requests served per seconds • Application-specific metrics are often preferable and others may be misleading • MFlops can be application-specific thought • For instance: • I want to add to n-element vectors • This requires 2*n Floating Point Operations • Therefore MFlops is a good measure

  9. What is “Peak” Performance? • Resource vendors always talk about peak performance rate • computed based on specifications of the machine • For instance: • I build a machine with 2 floating point units • Each unit can do an operation in 2 cycles • My CPU is at 1GHz • Therefore I have a 1*2/2 =1GFlops Machine • Problem: • In real code you will never be able to use the two floating point units constantly • Data needs to come from memory and cause the floating point units to be idle • Typically, real code achieves only a (often small) fraction of the peak performance

  10. Performance - Benchmarks • Since many performance metrics turn out to be misleading, people have designed benchmarks • Example: SPEC Benchmark • Integer benchmark • Floating point benchmark • These benchmarks are typically a collection of several codes that come from “real-world software” • The question “what is a good benchmark?” is difficult • If the benchmarks do not correspond to what you’ll do with your computer, then the benchmark results are not relevant to you

  11. Performance – A Program • Concerned with improving a program’s performance • For a given platform, take a given program • Run it and measure its wall-clock time • Enhance it, run it again an quantify the performance improvement • i.e., the reduction in wall-clock time • For each version compute its performance • preferably as a relevant performance rate • so that you can say: the best implementation we have so far goes “this fast” (perhaps a % of the peak performance)

  12. Performance – Program Speedup • We need a metric to quantify the impact of your performance enhancement • Speedup: ratio of “old” time to “new” time • old time = 2h • new time = 1h • speedup = 2h / 1h = 2h • Sometimes one talks about a “slowdown” in case the “enhancement” is not beneficial • Happens more often than one thinks

  13. Parallel Performance • The notion of speedup is completely generic • By using a rice cooker I’ve achieved a 1.20 speedup for rice cooking • For parallel programs on defines the Parallel Speedup (we’ll just say “speedup”): • Parallel program takes time T1 on 1 processor • Parallel program takes time Tp on p processors • Parallel Speedup(p) = T1 / Tp • In the ideal case, if my sequential program takes 2 hours on 1 processor, it will take 1 hour on 2 processors: called linear speedup

  14. Speedup linear speedup speedup superlinear speedup!! sub-linear speedup number of processors

  15. Superlinear Speedup? • There are several possible causes • Algorithm • e.g., with optimization problems, throwing many processors at it increases the chances that one will “get lucky” and find the optimum fast • Hardware • e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. Disk)

  16. Amdahl’s Law • Consider a program whose execution consists of two phases • One sequential phase • One phase that can be perfectly parallelized (linear speedup) T1 = time spent in phase that cannot be parallelized. Sequential program T1 T2 Old time: T = T1 + T2 T2 = time spent in phase that can be parallelized. Parallel program T1’ = T1 T2’ <= T2 T2’ = time spent in parallelized phase New time: T’ = T1’ + T2’

  17. Amdahl’s Law cont. • P1 = 11%, P2 = 18%, P3 = 23%, P4 = 48%, which add up to 100% • S1 = 1 or 100%, P2 is sped up 5x, so S2 = 500%, P3 is sped up 20x, so S3 = 2000%, and P4 is sped up 1.6x, so S4 = 160% • P1/S1 + P2/S2 + P3/S3 + P4/S4, we find the running time is 0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575 or a little less than ½ the original running time which we know is 1 • the overall speed boost is 1/0.4575 = 2.186 or a little more than double the original speed using the formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)−1

  18. Amdahl’s Law - Parallelization • P is the proportion of a program that can be made parallel (i.e. benefit from parallelization), and (1 − P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is

  19. Parallel Efficiency • Definition: Eff(p) = S(p) / p • Typically < 1, unless linear or superlinear speedup • Used to measure how well the processors are utilized • If increasing the number of processors by a factor 10 increases the speedup by a factor 2, perhaps it’s not worth it: efficiency drops by a factor 5 • Important when purchasing a parallel machine for instance: if due to the application’s behavior efficiency is low, forget buying a large cluster

  20. Scalability • Measure of the “effort” needed to maintain efficiency while adding processors • For a given problem size, plot Efd(p) for increasing values of p • It should stay close to a flat line • Isoefficiency: At which rate does the problem size need to be increase to maintain efficiency • By making a problem ridiculously large, one can typically achieve good efficiency • Problem: is it how the machine/code will be used?

  21. Performance Measurement • how does one measure the performance of a program in practice? • Two issues: • Measuring wall-clock times • We’ll see how it can be done shortly • Measuring performance rates • Measure wall clock time (see above) • “Count” number of “operations” (frames, flops, amino-acids: whatever makes sense for the application) • Either by actively counting (count++) • Or by looking at the code and figure out how many operations are performed • Divide the count by the wall-clock time

  22. Why is Performance Poor? • Performance is poor because the code suffers from a performance bottleneck • Definition • An application runs on a platform that has many components • CPU, Memory, Operating System, Network, Hard Drive, Video Card, etc. • Pick a component and make it faster • If the application performance increases, that component was the bottleneck!

  23. Removing a Bottleneck • Hardware Upgrade • Is sometimes necessary • But can only get you so far and may be very costly • e.g., memory technology • Code Modification • The bottleneck is there because the code uses a “resource” heavily or in non-intelligent manner • We will learn techniques to alleviate bottlenecks at the software level

  24. Identifying a Bottleneck • It can be difficult • You’re not going to change the memory bus just to see what happens to the application • But you can run the code on a different machine and see what happens • One Approach • Know/discover the characteristics of the machine • Observe the application execution on the machine • Tinker with the code • Run the application again • Repeat • Reason about what the bottleneck is

  25. Memory Latency and Bandwidth • The performance of memory is typically defined by Latency and Bandwidth (or Rate) • Latency: time to read one byte from memory • measured in nanoseconds these days • Bandwidth: how many bytes can be read per seconds • measured in GB/sec • Note that you don’t have bandwidth = 1 / latency! • Reading 2 bytes in sequence may be cheaper than reading one byte only • Let’s see why...

  26. Latency and Bandwidth • Latency = time for one “data item” to go from memory to the CPU • In-flight = number of data items that can be “in flight” on the memory bus • Bandwidth = Capacity / Latency • Maximum number of items I get in a latency period • Pipelining: • initial delay for getting the first data item = latency • if I ask for enough data items (> in-flight), after latency seconds, I receive data items at rate bandwidth • Just like networks, hardware pipelines, etc. “memory bus” Memory CPU

  27. Latency and Bandwidth • Why is memory bandwidth important? • Because it gives us an upper bound on how fast one could feed data to the CPU • Why is memory latency important? • Because typically one cannot feed the CPU data constantly and one gets impacted by the latency • Latency numbers must be put in perspective with the clock rate, i.e., measured in cycles

  28. Memory Bottleneck: Example • Fragment of code: a[i] = b[j] + c[k] • Three memory references: 2 reads, 1 write • One addition: can be done in one cycle • If the memory bandwidth is 12.8GB/sec, then the rate at which the processor can access integers (4 bytes) is: 12.8*1024*1024*1024 / 4 = 3.4GHz • The above code needs to access 3 integers • Therefore, the rate at which the code gets its data is ~ 1.1GHz • But the CPU could perform additions at 4GHz! • Therefore: The memory is the bottleneck • And we assumed memory worked at the peak!!! • We ignored other possible overheads on the bus • In practice the gap can be around a factor 15 or higher

  29. A realistic approach: Profiling • A profiler is a tool that monitors the execution of a program and reports the computational characteristics used in different parts of the code i.e. dynamic program analysis • Useful to identify “expensive” functions • Profiling cycle • Compile the code with the profiler • Run the code • Identify the most expensive function • Optimize that function • call it less often if possible, make it faster, etc. • Repeat

  30. Profiler Types based on Output • Flat profiler • Flat profiler's compute the average call times, from the calls, and do not breakdown the call times based on the callee or the context. • Call-Graph profiler • Call Graph profilers show the call times, and frequencies of the functions, and also the call-chains involved based on the callee. However context is not preserved.

  31. Methods of data gathering • Event based profilers • In Programming languages listed, all of them have event-based profilers • Java: JVM-Profiler Interface JVM API provides hooks to profiler, for trapping events like calls, class-load, unload, thread enter leave. • Python: Python profilers are profile module, hotspot which are call-graph based, and use the 'sys.set_profile()' module to trap events like c_{call,return,exception}, python_{call,return,exception}.

  32. Methods of data gathering • Statistical profilers • Some profilers operate by sampling. • Some profilers instrument the target program with additional instructions to collect the required information • The resulting data are not exact, but a statistical approximation. • Some of the most commonly used statistical profilers are GNU's gprof, Oprofile and SGI's Pixie.

  33. Methods of data gathering • Instrumentation • Manual: Done by the programmer, e.g. by adding instructions to explicitly calculate runtimes. • Compiler assisted: Example: "gcc -pg ..." for gprof, "quantify g++ ..." for Quantify • Binary translation: The tool adds instrumentation to a compiled binary. Example: ATOM • Runtime instrumentation: Directly before execution the code is instrumented. The program run is fully supervised and controlled by the tool. Examples: PIN, Valgrind • Runtime injection: More lightweight than runtime instrumentation. Code is modified at runtime to have jumps to helper functions. Example: DynInst • Hypervisor: Data are collected by running the (usually) unmodified program under a hypervisor. Example: SIMMON • Simulator: Data are collected by running under an Instruction Set Simulator. Example: SIMMON

  34. UPC – BSC Profiling Tools • Dimemas • Simulation tool for the parametric analysis of the behavior of message-passing applications on a configurable parallel platform. • Paraver • A very powerful performance visualization and analysis tool based on traces that can be used to analyze any information that is expressed on its input trace format.

  35. Dimemas • Development • Tuning • Benchmarking • Training Utilization - Features • Predict behavior of target architecture • Forecast effects of code optimization • What if analysis

  36. Dimemas cont. • Integrated with visualization tool • Extensive application characterization • Squeezing the information that a single instrumented run can provide Usage

  37. Dimemas cont. • enables the user to develop and tune parallel applications on a workstation • providing an accurate prediction of their performance on the parallel target machine • The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters

  38. Dimemas – DiP Environment

  39. Dimemas - Benefits • the possibility of not using a parallel machine to run the application to get the traces • the possibility to analyze different application choices without changing neither rerunning the application itself

  40. Dimemas – Benefits cont. • The loop involving Dimemas and Paraver, the simulator and the visualization tool. • The user has the chance to analyze the application behavior when some parameters are changed, for example, what will happen to application execution time is the execution time of a given function is reduced in 50%?

  41. Dimemas - Prediction • A tracefile can be either obtained in a dedicated parallel machine or in a shared sequential machine. • Using the instrumentation records, Dimemas will rebuild the application behavior, using the tracefile and the architectural parameters defined using a GUI. Per process CPU time is used as opposed to elapsed time.

  42. Dimemas – Prediction Advantages and Disadvantages • Cost reduction • Analysis of non-existent machines • Quality of performance prediction • Performance

  43. Dimemas – Output Information • Basic results • execution time and speedup • For each task • computation time • blocking time • information related to messages: number of sends, number of received (divided in number of receives that produces a block because a message is not local to the task, and number of messages that promotes no block).

  44. Dimemas – Output Information cont. • Basic results

  45. Dimemas – Output Information cont. • Application Analysis • Is my application well balanced? Are there any dependencies? • Can we improve the application grouping messages? Does latency influence application time? • Are we planning to run our application in a low bandwidth network? Does bandwidth have influence in application time? • Do resources limit application time?

  46. Dimemas – Output Information cont. • Additional Analysis • What is analysis • Critical path • Synthetic perturbation analysis

  47. Paraver • Paraver is a flexible performance visualization and analysis tool that can be used to analyze • MPI • OpenMP • MPI+OpenMP • Java • Hardware counters profile • Operating system activity... and many other things you may think of

  48. Paraver – DiP Environment

  49. Paraver – Features • Motif based GUI • Detailed quantitative analysis of program performance • Concurrent comparative analysis of several traces • Fast analysis of very large traces

  50. Paraver – Features cont. • Support for mixed message passing and shared memory (networks of SMPs) • Customizable semantics of the visualized information Cooperative work, sharing views of the tracefile • Building of derived metrics

More Related