
Performance analysis workflow

Performance analysis workflow. Final Review | Felix Wolf (GRS). Tuning objectives for HPC systems. Success metrics Scientific output (success stories) Performance / scalability (e.g., record number of particles, Top500) Focus different for different stakeholders.


Presentation Transcript


  1. Performance analysis workflow Final Review | Felix Wolf (GRS)

  2. Tuning objectives for HPC systems • Success metrics • Scientific output (success stories) • Performance / scalability (e.g., record number of particles, Top500) • Focus different for different stakeholders

  3. Holistic performance system analysis System-wide performance screening of individual jobs EU Russia

  4. Approach

  5. HOPSA tools

  6. Overall HOPSA workflow Performance Screening Performance Diagnosis In-depth analysis Mandatory job screening with LWM2 and ClustrX User Intra-node performance Basic application + system metrics Paraver ThreadSpotter Job digest Application-level tuning LAPTA System performance database Vampir Inter-node performance Pro-active performance consulting Scalasca (Cube) Traces with application + system metrics Job Info of User Administrator Global workload data + job digests Full access System-level tuning LAPTA system-level analysis

  7. Successive refinement of performance data Profile CUBE-4 Minimize intrusion 1 Identify pivotal iterations Visual exploration Time-series Profile CUBE-4 2 Instrument with Score-P Scalasca wait-state analysis System level 0 3 Dimemas what-if analysis Problem? Digest Event trace OTF-2 Node level Paraver Visual exploration ThreadSpotter Vampir

  8. Tool integration ThreadSpotter Memory profile Application measured with ThreadSpotter Link memory + threading analysis Explore memory behavior Link Dimemas Profile CUBE-4 Visual exploration Worst-instance visualization Cube Application linked to Score-P What-if scenarios Scalasca wait-state analysis OTF2 to PRV conversion Trace PRV Trace OTF-2 Visual exploration Paraver Application measured with Extrae System metrics done to do LAPTA Vampir

  9. How to identify system bottlenecks

  10. Correlation of performance data across jobs App A System nodes App B • Time slices allow correlations between simultaneously running jobs to be identified App C Metric intensity App D Time-slices Correlation
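
The time-slice correlation described on slide 10 can be illustrated with a minimal sketch. It assumes each job reports one metric value per time slice and uses the Pearson coefficient as the correlation measure; the slides do not say which measure HOPSA uses, and the names pearson and correlate_jobs are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long metric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlate_jobs(series_by_job):
    """Correlate every pair of jobs over the time slices they share.

    series_by_job maps a job name to {slice_index: metric_value}.
    Pairs with fewer than two common slices are skipped.
    """
    result = {}
    for a, b in combinations(sorted(series_by_job), 2):
        common = sorted(set(series_by_job[a]) & set(series_by_job[b]))
        if len(common) < 2:
            continue
        xs = [series_by_job[a][s] for s in common]
        ys = [series_by_job[b][s] for s in common]
        result[(a, b)] = pearson(xs, ys)
    return result
```

Because the slices are aligned in time across jobs, a strongly negative coefficient between one job's throughput and another job's I/O volume is a hint (not proof) that the two jobs interfere.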

  11. Conclusion • Users will learn of performance problems early on • Users have guidance in applying the tools to tune their codes • Selection of tools • Order of application • Seamless integration • Way of distinguishing system from application issues • Also helped us to better structure our trainings

  12. Light-Weight Measurement Module (LWM2) Final Review | Felix Wolf (GRS)

  13. Outline • Introduction • Profiling methodology • Low overhead • Inter-application interference • Results

  14. Light-Weight Measurement Module (LWM2) • Light-weight application profiler designed for system-wide performance screening of all jobs on a system • Key feature – allows identification of cross-application interferences • Designed for ease of use • Requires no application modification, compilation or relinking • Generates simple and useful performance output • Supports MPI, OpenMP and CUDA programming models

  15. LWM2 job digest

  16. LWM2 profiling methodology • Light-weight combination of • Direct instrumentation via interposition wrappers (only communication and I/O metrics, no time measurements) • Low-overhead sampling • Aggregation of data for • Every time slice (all metrics, except sampling-based) • Whole execution (execution digest) Periodic samples Application Execution Time-slices MPI calls, HWCs, etc. MPI calls, HWCs, etc. MPI calls, HWCs, etc. MPI calls, HWCs, etc.
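
The wrapper idea on slide 16 can be sketched as follows. This is only an analogy: LWM2 wraps native library calls, whereas the sketch uses a Python decorator. The names metrics, interpose and send are hypothetical. In keeping with the slide, the wrapper counts calls and bytes but takes no timestamps.

```python
from collections import Counter
from functools import wraps

# Hypothetical process-wide metric store (not an LWM2 data structure).
metrics = Counter()


def interpose(name, size_of=None):
    """Wrap a function so that every call is counted before it runs.

    size_of, if given, receives the call's arguments and returns the
    number of bytes the call transfers. No time is measured.
    """
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            metrics[name + ".calls"] += 1
            if size_of is not None:
                metrics[name + ".bytes"] += size_of(*args, **kwargs)
            return func(*args, **kwargs)  # forward to the real call
        return wrapper
    return decorate


@interpose("send", size_of=lambda buf, dest: len(buf))
def send(buf, dest):
    """Stand-in for a communication call; returns the bytes 'sent'."""
    return len(buf)


send(b"abcd", 1)
send(b"xy", 2)
# metrics now holds the whole-execution totals for this toy run.
```

Summing such counters over the whole run gives the kind of totals an execution digest would report; splitting them per time slice requires closing the counters periodically.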

  17. LWM2 profiling methodology (signals) • Signals counted separately per thread • Combined only at the end Thread storage Thread 1 Process storage Thread 0 Thread storage Thread 2
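
A minimal sketch of the per-thread counting on slide 17, assuming each sample is attributed to a named region. The class SampleCounters and its methods are hypothetical. on_sample is called directly here in place of a real signal handler, since Python delivers signals only to the main thread.

```python
import threading
from collections import Counter


class SampleCounters:
    """Per-thread sample counters that are merged only at the end."""

    def __init__(self):
        self._local = threading.local()   # thread storage
        self._all = []                    # every thread's counter
        self._lock = threading.Lock()     # taken only when a thread registers

    def _mine(self):
        counter = getattr(self._local, "counter", None)
        if counter is None:
            counter = Counter()
            self._local.counter = counter
            with self._lock:
                self._all.append(counter)
        return counter

    def on_sample(self, region):
        """Count one sample in the calling thread's private storage."""
        self._mine()[region] += 1

    def combine(self):
        """Merge all thread counters; call after the threads have finished."""
        total = Counter()
        for counter in self._all:
            total.update(counter)
        return total


def run_demo(num_threads=3):
    counters = SampleCounters()

    def work():
        for _ in range(100):
            counters.on_sample("compute")
        for _ in range(10):
            counters.on_sample("mpi")

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters.combine()
```

Keeping the counters private to each thread means the hot path needs no synchronization; the only shared step is the final merge.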

  18. LWM2 profiling methodology (wrappers) • Data from wrappers is stored in separate storage • Data is aggregated in situ per time slice Wrapper calls Thread storage Thread 1 Combines data at the end of time-slice Wakes up at the end of time-slice Process storage Thread 0 Heart-beat thread Thread storage Aggregates data into time-slices Thread 2
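
The heart-beat scheme on slide 18 might look like the sketch below. SliceAggregator and its methods are hypothetical names. For simplicity the sketch guards all stores with one lock, which a low-overhead tool would presumably avoid; that is a simplification of this example, not a statement about LWM2.

```python
import threading
from collections import Counter


class SliceAggregator:
    """Folds per-thread wrapper data into one record per time slice."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stores = {}   # thread id -> Counter for the current slice
        self.slices = []    # process storage: one Counter per closed slice

    def record(self, metric, value=1):
        """Called from a wrapper; adds to the calling thread's store."""
        tid = threading.get_ident()
        with self._lock:
            self._stores.setdefault(tid, Counter())[metric] += value

    def close_slice(self):
        """Combine all thread stores, reset them, and archive the slice."""
        combined = Counter()
        with self._lock:
            for store in self._stores.values():
                combined.update(store)
                store.clear()
        self.slices.append(combined)
        return combined

    def run_heartbeat(self, interval, stop_event):
        """Heart-beat loop: wake up every `interval` seconds, close a slice.

        Event.wait returns False on timeout and True once stop_event is set.
        """
        while not stop_event.wait(interval):
            self.close_slice()
```

A heart-beat thread would be started with threading.Thread(target=agg.run_heartbeat, args=(1.0, stop)), where the one-second interval is an arbitrary choice for illustration, and stopped by calling stop.set().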

  19. LWM2 overhead • SPEC MPI 2007 benchmark suite

  20. Inter-application interference • Applications on an HPC system usually have • Exclusive access to processors, local memory • Shared access to communication network and I/O subsystem • Results in applications affecting each other's performance • Not captured with classic performance analysis tools • Requires capturing performance of all applications running on a system • Possible with LWM2

  21. File I/O interference • Setup: Simultaneous execution of applications to detect I/O interference • Initial run: Constant file I/O alone, without any induced noise • Second run: Constant file I/O, concurrently with periodic file I/O as induced noise • Results: Clear drop in I/O performance (50%), longer execution time (35%)

  22. LWM2 status • Development completed • Release published on HOPSA Web site • Opt-in deployments at • GraphIT system, Moscow State University • JUDGE, Jülich Supercomputing Center • Being evaluated at • Todi and Pilatus, CSCS, Switzerland • Indy, EPCC, UK (APOS) • Research publication with MSU under preparation

  23. Summary
