HPCToolkit Evaluation Report

  1. HPCToolkit Evaluation Report • Hans Sherburne, Adam Leko • UPC Group, HCS Research Laboratory, University of Florida • Color encoding key: Blue: Information; Red: Negative note; Green: Positive note

  2. Basic Information • Name: HPCToolkit • Developer: Rice University • Current versions: HPCView: • Website: http://www.hipersoft.rice.edu/hpctoolkit/ • Contact: John Mellor-Crummey (johnmc@cs.rice.edu), Rob Fowler (rjf@cs.rice.edu)

  3. Introduction • HPCToolkit: a suite of tools that aid the programmer in collecting, organizing, and displaying profile data • hpcviewer • Sorts by any collected metric, from any of the processes displayed • Displays samples at various levels in the call hierarchy through “flattening” • Allows the user to focus on interesting sections of the program through “zooming” • hpcquick • Simplifies the process by integrating hpcprof and hpcview • hpcview • Creates “browsable” performance databases in HTML, or for use in hpcviewer • bloop • Relates samples to loops, even if the code has been significantly altered by optimization • hpcprof • Relates samples to source lines • hpcrun • Collects profiles by sampling hardware performance counters

  4. Available Metrics in HPCToolkit • Metrics obtained by sampling/profiling • PAPI hardware counters • Any other source of profile data that can output data in the “profile-like input format” (not tested) • Wallclock time (WALLCLK) • Can’t get PAPI metrics and wallclock time in a single run • Derived metrics • Combinations of existing metrics, created by specifying a mathematical formula in an XML configuration file • Source code correlation • Metrics reflect exclusive time spent in a function, based on counter overflow events • Metrics are correlated at the source-line level and the loop level • Metrics are related back to source code loops even if the code has been significantly altered by optimization (“bloop”) • A minimal sketch of the counter-overflow sampling mechanism follows below
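To make the sampling model concrete, here is a minimal sketch of counter-overflow profiling using the PAPI API on which hpcrun builds. This illustrates the mechanism only, not hpcrun's actual implementation; the 1,000,000-cycle threshold and the printf in the handler are arbitrary choices for the example. A derived metric is then just arithmetic over such counters, e.g. cycles per instruction = PAPI_TOT_CYC / PAPI_TOT_INS.

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

/* Called each time the counter crosses the threshold; "address" is the
 * program counter at overflow, which a profiler maps back to a source
 * line.  (A real profiler would buffer samples rather than call printf
 * from a handler.) */
static void handler(int event_set, void *address,
                    long long overflow_vector, void *context)
{
    printf("sample at pc %p\n", address);
}

int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
    if (PAPI_create_eventset(&event_set) != PAPI_OK) exit(1);
    if (PAPI_add_event(event_set, PAPI_TOT_CYC) != PAPI_OK) exit(1);

    /* Take a sample every 1,000,000 cycles (illustrative threshold). */
    if (PAPI_overflow(event_set, PAPI_TOT_CYC, 1000000, 0, handler) != PAPI_OK)
        exit(1);

    PAPI_start(event_set);
    /* ... the code being profiled runs here ... */
    PAPI_stop(event_set, counts);
    return 0;
}
```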

  5. Main Window in hpcviewer • Figure 1: Main window in hpcviewer (screenshot not reproduced in this transcript)

  6. HPCToolkit (hpcrun) – Overhead • All programs executed correctly when instrumented • < 20 % overhead on all benchmarks when recording just PAPI_TOT_CYC (default option)
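The slide does not define “overhead”; we assume the usual definition, the relative slowdown of the instrumented run, \( \text{overhead} = \frac{T_{\text{instrumented}} - T_{\text{baseline}}}{T_{\text{baseline}}} \times 100\% \), so a benchmark with a 100 s baseline would take at most roughly 120 s under hpcrun's default PAPI_TOT_CYC sampling.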

  7. Notes on testing • Used LAM instead of MPICH for testing • When MPICH’s mpirun is used with hpcrun, hpcrun complains about a “-p” option, even though none was given • Needed to reduce the message size in big-message.c because of LAM • Unable to get NPB LU to run using LAM • Major stumbling blocks for HPCToolkit bottleneck identification: • Since profile data is related back not to the call site in the user’s code but to the function itself, it is difficult to determine where in the user’s code the problem lies • Profiling with wallclock time was unreliable; some profiles contained very little useful information

  8. Bottleneck Identification: Performance Tool Test Suite: CAMEL, LU • Testing metric: what did the profile data tell us? • CAMEL: TOSS-UP • Profile showed work equally distributed across the processes • Unable to determine communication costs from PAPI hardware counters • NAS LU: NOT TESTED • Unable to get the LU benchmark to run successfully using LAM • Needed to use LAM because MPICH could not be made to work with hpcrun

  9. Bottleneck Identification: Performance Tool Test Suite: PPerfMark • Big message: FAILED • Profiling wallclock time didn’t produce a profile with information in it • Cycle count is misleading and doesn’t reveal time spent in communication • Diffuse procedure: PASSED • Profile showed a large amount of time spent in the bottleneck procedure • Time is diffused across processes • Hot procedure: PASSED • Profile showed a large amount of time spent in the bottleneck procedure • Intensive server: TOSS-UP • Profile showed a large amount of time spent in waste_time() on one process • The other processes show time spent in functions outside of user code, which is difficult to use for bottleneck identification • Ping pong: TOSS-UP • From the profile it’s clear that, within user code, the time is spent in two different loops • Profile shows time spent in functions outside of user code, which is difficult to use for bottleneck identification • Random barrier: TOSS-UP • Profile shows lots of time spent in waste_time() • Profile does not show the communication pattern amongst processes • Small messages: TOSS-UP • Profile reveals only one process spends time in Grecv_messages • Profile shows time spent in functions outside of user code, which is difficult to use for bottleneck identification • System time: TOSS-UP • Profile shows lots of time spent in kill and execlp • It’s difficult to relate this information back to the call site in waste_time() • Wrong way: FAILED • Profile does not show the communication pattern amongst processes • Profile shows time spent in functions outside of user code, which is difficult to use for bottleneck identification

  10. Evaluation (1) • Available metrics: 3/5 • Uses PAPI hardware counters (or other profile sources on supported platforms) • New metrics can be derived from existing ones • No statistics regarding communication are provided • In theory could use a profile from any source, if formatted properly • Cost: 5/5 • HPCToolkit is freely available • Documentation quality: 2.5/5 • Documentation is in the form of a PowerPoint presentation and man pages • One comprehensive user manual would be helpful • Extendibility: 3.5/5 • HPCToolkit source code is freely available • No tracing support • Requires the use of PAPI for hpcrun (profile creation) • Filtering and aggregation: 2.5/5 • Only hardware counter values are recorded

  11. Evaluation (2) • Hardware support: 4/5 • IA32, Opteron, Itanium + Linux w/ PAPI; MIPS + IRIX; Alpha + Tru64 • Heterogeneity support: ?/5 (not tested) • Installation: 4/5 • Installation on a Linux platform is not bad • Requires PAPI to be installed • Interoperability: 3.5/5 • Profile data stored in XML format • Works with SGI’s ssrun and Compaq’s uprofile, on MIPS and Alpha respectively • Learning curve: 3.5/5 • The interface is fairly intuitive, but takes some use to get comfortable with the notion of “flattening” • The separation of the tools for platform support increases user overhead • Manual overhead: 3.5/5 • It is fairly straightforward to measure at the source-line and loop level • It is not possible to turn sampling on and off for selected parts of the source code • Specifying derived metrics in XML is awkward • Measurement accuracy: 3.5/5 • Overhead is less than 20% when recording a single PAPI hardware counter

  12. Evaluation (3) • Multiple analyses: 1/5 • Comparison and ordering of hardware counter values is the only form of analysis • Multiple executions: 2.5/5 • Comparison of metrics from multiple runs is possible • There is no built-in scalability or optimization comparison • Multiple views: 1.5/5 • A single view of profile data correlated with source code is provided • Only profile data (not trace data) is viewable • Performance bottleneck identification: 2/5 • All metrics can be sorted in increasing or decreasing order • The “flattening” approach somewhat eases comparison • Bottleneck identification requires significant user insight in selecting which hardware counters to use and in locating points for improvement • Profiling/tracing support: 1.5/5 • Only profiling is supported • Hardware counters must be used • Profiling is done at the source-line and loop level • Communication profiling is not available

  13. Evaluation (4) • Response time: 2/5 • Data is not available in HPCToolkit until after execution completes and the performance data is processed • Software support: 4/5 • Supports sequential and parallel programs • There is difficulty in running with MPICH, even though MPICH is mentioned in the tutorial presentation [2] • Source code correlation: 4/5 • Source code correlation of profile data is the main view offered • System stability: 3.5/5 • hpcviewer works well • Unable to obtain useful performance data for some of the PPerfMark benchmarks • Technical support: ?/5 • Technical support was not requested

  14. Conclusions • The components of HPCToolkit work well for sequential code • Provides access to the available (native-event) PAPI counters on the system • Can derive new metrics from sampled metrics using hpcview • Data is correlated with source code • Only a simple display of profiled metrics and source code correlation is provided • Whether a metric should be created, hidden, or shown in hpcviewer must be specified before it is run • Collection of multiple metrics may require multiple runs • Parallel code may be difficult to analyze • Different methods for launching parallel programs achieve varying levels of ease and usefulness with hpcrun • Requires that line-mapping information be present in all executables/libraries to be analyzed (the “-g” option in many compilers) • The ability to display inclusive time spent at call sites in user code, rather than exclusive time spent in all functions, would increase the usefulness of the tool tremendously (see the sketch below)
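To illustrate that last point: in the hypothetical program below (names and structure invented for illustration), nearly all cycles are spent inside memcpy. An exclusive, per-function profile of the kind hpcrun produces attributes that time to memcpy itself, a libc routine outside the user's code; an inclusive, call-site view would charge it to the calls in copy_a() and copy_b(), immediately identifying copy_a() as the path worth fixing.

```c
#include <string.h>

#define N (1 << 20)
static char src[N], dst[N];

/* Dominant caller: 1000 copies, roughly 99% of the cycles. */
static void copy_a(void) { for (int i = 0; i < 1000; i++) memcpy(dst, src, N); }
/* Minor caller: 10 copies. */
static void copy_b(void) { for (int i = 0; i < 10; i++) memcpy(dst, src, N); }

int main(void)
{
    copy_a();       /* the real bottleneck ...                         */
    copy_b();       /* ... but an exclusive profile lumps both call
                       sites together under memcpy                     */
    return dst[0];  /* read dst so the copies are not optimized away   */
}
```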

  15. References • [1] HPCToolkit website: http://www.hipersoft.rice.edu/hpctoolkit/ • [2] HPCToolkit SC’04 Tutorial Presentation: http://www.hipersoft.rice.edu/hpctoolkit/sc04/index.html
