TAU Evaluation Report

Adam Leko, Hung-Hsun Su • UPC Group, HCS Research Laboratory, University of Florida • Color encoding key: Blue: Information • Red: Negative note • Green: Positive note

Basic Information • Name: Tuning and Analysis Utilities (TAU) • Developer: University of Oregon • Current version: • TAU 2.14.4 • Program database toolkit 3.3.1 • Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/ • Contact: • Sameer Shende: sameer@cs.uoregon.edu

TAU Overview • Performance tool suite that offers profiling and tracing of programs • Available instrumentation methods: source (manual), source (automatic), binary (DynInst) • Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm • Hardware counter support • Relies on existing toolkits and libraries for some functionality • PDToolkit and Opari for automatic source instrumentation • DynInst for runtime binary instrumentation • PCL and PAPI for hardware counter information • libvtf3, slog2sdk, and EPILOG for exporting trace files

TAU Architecture

Configuring & Installing TAU • TAU relies on several existing toolkits for efficient usage, but some of these toolkits are time-consuming to install • PDToolkit, PAPI, etc • Users must choose between modes at compile time using ./configure script • Profiling via -PROFILE, tracing via -TRACE • TAU must also be notified about the location of supported languages and compilers • -mpilib=/path/to/mpi/lib • -dyninst=/path/to/dyninst • -pdt=/path/to/pdt • Other supported languages/libraries handled in a similar manner • This results in a very flexible installation process • Users can easily install different configurations of TAU in their home directory • However, several configuration options are mutually exclusive, such as • Profiling and tracing • Using PAPI counters vs. gettimeofday or TSC counters • Profiling w/callpaths vs. profiling with extra statistics • Unfortunately, mutually exclusive nature of things proves to be annoying • Would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice! • Luckily, software compiles quickly on modern machines, so this is not fatal • However, TAU relies on several environment variables, which makes switching between installations cumbersome
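Because profiling and tracing are mutually exclusive at configure time, getting both means planning one installation per mode. The sketch below only assembles the command lines; it does not run them. The flag spellings are the ones quoted on this slide, and `plan_installs` and `MODE_FLAGS` are hypothetical helper names, not part of TAU.

```python
# Hypothetical helper: builds one ./configure command line per measurement
# mode. Flag spellings come from the slide above, not from TAU's own docs.
MODE_FLAGS = {"profile": "-PROFILE", "trace": "-TRACE"}


def plan_installs(modes, paths):
    """Return one configure command string per requested mode.

    modes: iterable of "profile" and/or "trace".
    paths: mapping of location options to directories,
           e.g. {"-pdt": "/path/to/pdt"}.
    """
    commands = []
    for mode in modes:
        if mode not in MODE_FLAGS:
            raise ValueError(f"unknown mode: {mode}")
        args = ["./configure", MODE_FLAGS[mode]]
        # Sort so that the generated command is deterministic.
        for option, location in sorted(paths.items()):
            args.append(f"{option}={location}")
        commands.append(" ".join(args))
    return commands


if __name__ == "__main__":
    for command in plan_installs(
        ["profile", "trace"],
        {"-pdt": "/path/to/pdt", "-mpilib": "/path/to/mpi/lib"},
    ):
        print(command)
```

Each generated command corresponds to a separate build and install, which is the "compiling & installing twice" cost noted above.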

The Many Faces of TAU • Two main methods of operation: profiling and tracing • Profile mode • Reports aggregate time spent in each function on each node/thread • Several profile recording options • Report min/max/std. dev of times using the -PROFILESTATS configure option • Attempt to compensate for profiling overhead (-COMPENSATE) • Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM) • Stop profiling after a certain function depth (-DEPTHLIMIT) • Record call trees in profile (-PROFILECALLPATH) • Record phase of program in profiles (-PROFILEPHASE, requires manual instrumentation of phases) • If instrumented code uses the TAU_INIT macros, can also pass arguments to compiled, instrumented program to restrict what is recorded at runtime • --profile main+func2 • Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL) • Data visualized using pprof (text-based) or paraprof (Java-based GUI) • Profile data can be exported to KOJAK’s cube viewer • Profile data can be imported from Vampir VTF traces

The Many Faces of TAU (2) • Trace mode • Records timestamps for function entry/exit points • Or arbitrary code section points via manual instrumentation • Also records messages sent/received for MPI programs • No trace visualizer, but can export to • ALOG: Upshot/nupshot • Paraver’s trace format • SLOG-2: Jumpshot • VTF: Vampir/Intel Trace Analyzer 5 • SDDF: Format used by Pablo/SvPablo • EPILOG: KOJAK’s trace format
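To make the difference between the two modes concrete, the sketch below folds a stream of timestamped entry/exit events (what a trace records) into per-function call counts and inclusive/exclusive times (what a profile reports). The event tuple layout and the name `profile_from_trace` are assumptions made for illustration; this is not TAU's trace format or implementation.

```python
from collections import defaultdict


def profile_from_trace(events):
    """Fold one thread's entry/exit events into profile-style totals.

    events: time-ordered list of (timestamp, kind, name) tuples, where
            kind is "enter" or "exit" (an assumed, simplified layout).
    Returns {name: {"calls": n, "inclusive": t, "exclusive": t}}.
    Recursive calls would be counted once per nesting level in "inclusive".
    """
    stats = defaultdict(lambda: {"calls": 0, "inclusive": 0, "exclusive": 0})
    stack = []  # each entry: [name, enter_time, time_spent_in_children]
    for timestamp, kind, name in events:
        if kind == "enter":
            stack.append([name, timestamp, 0])
        elif kind == "exit":
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unmatched exit for {name}")
            _, start, child_time = stack.pop()
            elapsed = timestamp - start
            entry = stats[name]
            entry["calls"] += 1
            entry["inclusive"] += elapsed
            entry["exclusive"] += elapsed - child_time
            if stack:
                # Charge this call's time to the caller's child total.
                stack[-1][2] += elapsed
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if stack:
        raise ValueError("trace ended with functions still open")
    return dict(stats)


if __name__ == "__main__":
    trace = [
        (0, "enter", "main"),
        (10, "enter", "MPI_Recv"),
        (40, "exit", "MPI_Recv"),
        (100, "exit", "main"),
    ]
    print(profile_from_trace(trace))
```

The profile keeps only the totals, so the ordering of events (needed to see communication patterns) is lost; that is why several of the bottleneck tests later in this report call for a trace view.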

TAU Instrumentation: Profile Mode • Source-level instrumentation • tau_instrument (which requires PDToolkit) is used to produce instrumented source code for C, C++, and Fortran files • For OpenMP code, TAU can use OPARI (from KOJAK) • Users may insert instrumentation using TAU’s simple API (TAU_PROFILE_START, TAU_PROFILE_STOP) • When compiling, must use stub Makefiles which define compilation macros like CFLAGS, LDFLAGS, etc. • This can complicate the compile & link cycle greatly, especially if fully automatic source instrumentation is desired • Selective instrumentation is supported through a flag to tau_instrument • Give a file containing which functions to include or exclude from instrumentation • Can use tau_reduce in conjunction with existing profiles to exclude functions matching certain criteria, like • numcalls > 10000 & usecs/call < 2 • Binary-level instrumentation • Based on DynInst, considered “experimental” according to documentation • Use tau_run wrapper script with instrumentation file in same format as selective instrumentation file
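The example rule `numcalls > 10000 & usecs/call < 2` can be read as "frequently called and very short". The sketch below applies that rule to a small in-memory profile to list candidate functions for exclusion. It is an illustration of the rule only, not tau_reduce itself; the profile layout, the function names in the sample data, and the choice to divide total microseconds by call count are all assumptions.

```python
def functions_to_exclude(profile, min_calls=10000, max_usecs_per_call=2.0):
    """Return names of functions matching the slide's example rule.

    profile: {name: (numcalls, total_usecs)} (a hypothetical layout).
    A function matches when numcalls > min_calls and its mean time per
    call is below max_usecs_per_call.
    """
    excluded = []
    for name, (numcalls, total_usecs) in profile.items():
        if numcalls == 0:
            continue  # never called, so usecs/call is undefined
        if numcalls > min_calls and total_usecs / numcalls < max_usecs_per_call:
            excluded.append(name)
    return sorted(excluded)


if __name__ == "__main__":
    sample = {
        "tiny_helper": (250000, 300000.0),  # 1.2 usecs/call, called often
        "hot_loop": (20000, 1000000.0),     # 50 usecs/call
        "main": (1, 5000000.0),
    }
    print(functions_to_exclude(sample))
```

The resulting names would then go into the exclude section of a selective instrumentation file for the next instrumentation pass.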

TAU Instrumentation: Trace Mode • Source-level instrumentation • Same procedure as in profile mode • Binary instrumentation • Can link against MPI wrapper library (only re-linking necessary) • Runtime instrumentation for trace mode is not supported using DynInst

Source Instrumentation Process

Instrumentation Test Suite: Problems • Problem with using selective instrumentation + MPI wrapper library + PAPI metrics • Only instrumenting main in CAMEL caused several floating point instructions to be attributed to MPI_Send and MPI_Recv instead of main • For timing measurements and overhead measurements, used wallclock time with the low-overhead -LINUXTIMERS option • Some code had to be modified before feeding it through PDToolkit’s cparse • cparse uses the Edison Design Group’s parser, which is stricter about some things than other compilers • ANSI C/standard Fortran code poses no problems, though • NAS NPB LU benchmark (NPBv3.1-MPI) would not run with TAU libraries • Segfaults, “signal 11s” when using either LAM or MPICH with only MPI wrapper libraries (profiling & tracing) • Modified, updated version (3.2) of LU comes with TAU • Had problems compiling and running this • Gave TAU the benefit of the doubt for the rest of the evaluations • Guessed at what TAU profile would tell us had it been working with LU for bottleneck tests • LU timing overheads omitted from overhead measurements

Instrumentation Overhead: Notes • Performed automatic instrumentation of CAMEL using tau_instrument • Like KOJAK, program execution time was several orders of magnitude slower • This is likely due to the use of very small functions which normally get inlined by the compiler • For profile measurements on the following slides, only main was instrumented • Under this scenario, profiling and tracing overhead was almost nonexistent (<1%) • Instrumentation points chosen for overhead measurements • Profiling • CAMEL: all MPI calls, main enter + exit • PPerfMark suite: all MPI calls, all function calls • Used -PROFILECALLPATH configuration option • Using other profile flavors (without call paths, with extra stats) made a negligible difference on overall profile overhead • Tracing • CAMEL: all MPI calls • PPerfMark suite: all MPI calls • Similar to what we have done for other tools • Benchmarks marked with * had high variability in runtimes

Instrumentation Overhead: Notes (2) • Used LAM for all measurements • Some benchmarks with high overhead (small-messages, wrong-way, ping-pong) had slightly smaller overhead using MPICH • Small messages: 54.2% vs. 483.316% • Wrong way: 24.5% vs. 28.573% • Ping-pong: 51.5% vs. 56.259% • Probably due to LAM running faster (especially on small-messages) and execution time being limited by I/O time for writing trace file • Same I/O time, smaller execution time -> higher % overhead • In general, overhead for profiling and tracing extremely low except for a few cases • High profile overhead for programs with small functions that get called a lot • small-messages, wrong-way, ping-pong, CAMEL with everything instrumented • High trace overhead for programs with large traces generated very quickly • small-messages, wrong-way, ping-pong • tau_reduce provides a nice way to help reduce instrumentation overhead, although an initial profile must be first gathered
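The "same I/O time, smaller execution time -> higher % overhead" argument is plain arithmetic, sketched below. The runtimes are made-up round numbers chosen to show the effect, not measurements from this evaluation, and `overhead_percent` is a hypothetical helper name.

```python
def overhead_percent(base_seconds, instrumented_seconds):
    """Relative slowdown of the instrumented run, as a percentage."""
    return 100.0 * (instrumented_seconds - base_seconds) / base_seconds


if __name__ == "__main__":
    trace_io_seconds = 5.0  # assumed fixed cost of writing the trace file
    # Illustrative baseline runtimes for a faster and a slower MPI library.
    for label, base in (("faster MPI", 10.0), ("slower MPI", 20.0)):
        percent = overhead_percent(base, base + trace_io_seconds)
        print(f"{label}: {percent:.1f}% overhead")
```

With a fixed 5-second trace-writing cost, the 10-second baseline shows 50% overhead while the 20-second baseline shows 25%, even though the absolute cost is identical.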

Instrumentation Overhead: Profiles

Instrumentation Overhead: Traces

Visualizations: pprof • Gives text-based dump of profile files, similar to gprof/prof output • Example (partial) output:
    …
    USER EVENTS Profile :NODE 7, CONTEXT 0, THREAD 0
    ---------------------------------------------------------------------------------------
    NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
    ---------------------------------------------------------------------------------------
             4       2016          4        516      866.2  Message size received from all nodes
            86         28         28         28          0  Message size sent to all nodes
    ---------------------------------------------------------------------------------------
    FUNCTION SUMMARY (total):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                  msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0     5:44.295     5:45.407           8        1440   43175935 main()
      0.2          541          553           8         232      69161 MPI_Init()
      0.2          541          553           8         232      69161 main() => MPI_Init()
      0.2          543          543         704           0        772 MPI_Recv()
      0.2          543          543         704           0        772 main() => MPI_Recv()
      0.0            9            9           8           0       1136 MPI_Barrier()
    …
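Since pprof output is plain text, rows of the function summary are easy to post-process. The sketch below parses a single summary row into named fields. It handles only the two time spellings visible in the example above (plain milliseconds and `minutes:seconds.millis`); other formats pprof may emit are not covered, and the function names are hypothetical.

```python
def parse_time_msec(field):
    """Convert '541' or '5:44.295' (minutes:seconds.millis) to milliseconds."""
    if ":" in field:
        minutes, seconds = field.split(":")
        return int(minutes) * 60_000 + round(float(seconds) * 1000)
    return round(float(field))


def parse_summary_row(line):
    """Parse one FUNCTION SUMMARY row as laid out in the example above.

    The name is taken as everything after the sixth column, so call path
    names such as 'main() => MPI_Init()' stay intact.
    """
    pct, exclusive, inclusive, calls, subrs, per_call, name = line.split(None, 6)
    return {
        "percent_time": float(pct),
        "exclusive_msec": parse_time_msec(exclusive),
        "inclusive_msec": parse_time_msec(inclusive),
        "calls": int(calls),
        "subroutines": int(subrs),
        "inclusive_usec_per_call": int(per_call),
        "name": name.strip(),
    }


if __name__ == "__main__":
    row = parse_summary_row("100.0 5:44.295 5:45.407 8 1440 43175935 main()")
    print(row["name"], row["inclusive_msec"], row["calls"])
```

Parsed rows like these could feed simple scripted checks, for example flagging functions whose call count is very high relative to their time per call.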

Visualizations: paraprof • paraprof provides visual representations of the same data given by pprof • Used to be a Tcl/Tk application known as “racy” • Racy has been deprecated, but is still included with TAU for historical reasons • Java application with three main views • Main profile view • Histograms (next slides) • Three-dimensional visualization (next slides) • Main profile view (right top) • “Function ledger” maps colors to function names (right, bottom left) • Overall time for each function displayed as a stacked bar chart • Can click on each function to get detailed information (right, bottom right) • No line-level source code correlation • Can infer this information indirectly if call paths are used • [Figures: main profile view, function ledger, function details view]

Visualizations: paraprof (2) • paraprof can also show histogram views for each function of the main profile view (right) • Simply shows a histogram of aggregate time for a function across all threads • Histogram to right shows that most threads spent around 75.8 seconds (midpoint between min and max) in MPI_Barrier

Visualizations: paraprof (3) • paraprof can also show three-dimensional displays of profile data • Bar and triangle mesh axes • Time spent in each function (height) • Which function (width) • Which node (depth) • Scatter plot lets you pick axes • Plots support transparency, rotation, and highlighting a particular function or node • Surprisingly responsive for a Java application!

Bottleneck Identification Test Suite • Testing metric: what did pprof/paraprof tell us from wallclock time profiles? • Since no built-in trace visualizer, we ignored what could be done with other trace tools • Program correctness not affected by instrumentation • Except for our version of LU • CAMEL: PASSED • Showed work evenly distributed among nodes • When full tracing used, can easily show which functions take the most wall clock time • LU: FAILED • Could not run, got segfaults using MPICH or LAM • Even if it worked, it would be very difficult/impossible to garner communication patterns from profile views • Big messages: PASSED • Profile showed most of application time dominated by MPI calls to send and receive • Diffuse procedure: TOSS-UP • Profile showed most time taken by MPI_Barrier calls • However, profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time • Really need a trace view to see diffuse behavior of program • Hot procedure: PASSED • Profile clearly shows that one function is responsible for most execution time

Bottleneck Identification Test Suite (2) • Intensive server: PASSED • Profile showed most time spent in MPI_Recv for all nodes except first node • Profile also illustrated most time for first node spent in waste_time • Ping-pong: PASSED • Easy to see from profile that most time is being spent in MPI_Send and MPI_Recv • pprof and paraprof also showed a large number of MPI calls • Random barrier: TOSS-UP • Profile showed most time being spent in MPI_Barrier • However, random nature of barrier not shown by profile • Trace view is necessary to see random barrier behavior • Small messages: PASSED • Profile showed one process spending most time in MPI_Send and the other process in MPI_Recv • System time: FAILED • No built-in way to separate wall clock time into system time vs. user time • PAPI metrics can’t record system time vs. user time either • Wrong way: FAILED • Impossible to see communication behavior without a trace

TAU General Comments • Good things • Supports profiling & tracing • Very portable • Wide range of software support • Several programming models & libraries supported • Visualization tools seem very stable • Good support for exporting data to other tools • Things that could use improvement • Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult • Source code correlation could be better • Only at the function or function call level (with call paths) • Export is nice, but lots of things are easier to do directly in other tools • For example, mpicc -mpilog to get a trace for Jumpshot instead of cparse, tau_instrument, wrapper Makefiles, … • TAU does add automatic instrumentation for profiling functions, which is an added benefit • Three-dimensional visualizations are nice but “Cube” viewer from KOJAK is easier to use and displays data in a very concise manner • Text is also hard to read on three-dimensional views for function names • Some interoperability features (export to SLOG-2 and ALOG) do not work well in version we tested • TAU could potentially serve as a base for our UPC and SHMEM performance tool

TAU: Adding UPC & SHMEM • SHMEM • Not much extra work needed • Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions • UPC • If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places • If we do binary instrumentation, we’ll probably have to make major modifications to DynInst • In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard • However, how to instrument UPC programs while retaining low overhead? • Also, how to extend TAU to support more advanced analyses? • Support for profiles and traces a nice bonus

Evaluation (1) • Available metrics: 4/5 • Supports recording execution time (broken down into call trees) • Supports several methods of gathering profile data • Supports all PAPI metrics for profiles • Cost: 5/5 • Free! • Documentation quality: 3.5/5 • User’s manual very good, but out of date • For example, three-dimensional visualizations not covered in manual • Extensibility: 4/5 • Open source, uses documented APIs • Can add support for new languages using source instrumentation • Filtering and aggregation: 2.5/5 • Filtering & aggregation available through profile view • No advanced filter or custom aggregation methods built in for traces

Evaluation (2) • Hardware support: 5/5 • Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, Origin 200, 2000, 3000 series; NEC SX-5; Cray X1, T3E; Apple OS X; HP RISC systems (HP-UX) • Heterogeneity support: 0/5 (not supported) • Installation: 2.5/5 • As simple as ./configure with options, then make install • However, dependence on other software for source or binary instrumentation makes installation time-consuming • Interoperability: 5/5 • Profile files use simple ASCII format; trace files use documented binary format • Can export to VAMPIR, Jumpshot/upshot (ALOG & SLOG-2), CUBE, SDDF, Paraver • Learning curve: 2.5/5 • Learning how to use the different Makefile wrappers and command-line programs takes a while • After a short period, instrumentation & tool usage relatively easy

Evaluation (3) • Manual overhead: 4/5 • Automatic instrumentation of MPI calls on all platforms • Automatic instrumentation of all functions or a selected group of functions • Call path support gives almost the same information as instrumenting call sites • MPI and OpenMP instrumentation support • Measurement accuracy: 5/5 • CAMEL overhead < 1% for profiling and tracing when a few functions were instrumented • Overall, accuracy pretty good except for a few cases • Multiple executions: 3/5 • Can relate profile metrics between runs in paraprof • Can store performance data in DBMS (PerfDB) • Seems like PerfDB is in a preliminary state, though • Multiple analyses & views: 4/5 • Both profiling and tracing are supported (although no built-in trace viewer) • Profile view has stacked bar charts, “regular” views, three-dimensional views, and histograms

Evaluation (4) • Performance bottleneck identification: 3.5/5 • No automatic bottleneck identification • Profile viewer helpful for identifying methods that take most time • Lack of built-in trace viewer makes identification of some bottlenecks impossible, but trace export means could combine with several other viewers to cover just about anything • Profiling/tracing support: 4/5 • Tracing & profiling supported • Default trace file format size reasonable but not most compact • Response time: 3/5 • Loading profiles after run almost instantaneous using paraprof viewer • Exporting traces to other tools time consuming (have to run tau_merge, tau_convert, etc; a few extra disk I/Os) • Software support: 5/5 • Supports OpenMP, MPI, and several other programming models • A wide range of compilers are supported • Can support linking against any library, but does not instrument library functions • Source code correlation: 2/5 • Supported down to the function and function call site level (when collecting call paths is enabled) • Searching: 0/5 (not supported)

Evaluation (5) • System stability: 3/5 • Software is generally stable • Bugs encountered: • Segfaults on instrumented version of our LU code • SLOG-2 export seems to give Jumpshot-4 some trouble (several “unsupported event” messages on a few exported traces) • Exporting to ALOG format puts stray “: %d” lines in ALOG file • Technical support: 5/5 • Good response from our contact (Sameer), most emails answered within 48 hours with useful information
