
TAU Evaluation Report


Presentation Transcript


  1. TAU Evaluation Report
Adam Leko, Hung-Hsun Su
UPC Group, HCS Research Laboratory, University of Florida
Color encoding key: Blue: Information; Red: Negative note; Green: Positive note

  2. Basic Information
• Name: Tuning and Analysis Utilities (TAU)
• Developer: University of Oregon
• Current version:
  • TAU 2.14.4
  • Program Database Toolkit 3.3.1
• Website: http://www.cs.uoregon.edu/research/paracomp/tau/tautools/
• Contact: Sameer Shende: sameer@cs.uoregon.edu

  3. TAU Overview
• Performance tool suite that offers profiling and tracing of programs
• Available instrumentation methods: source (manual), source (automatic), binary (DynInst)
• Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm
• Hardware counter support
• Relies on existing toolkits and libraries for some functionality:
  • PDToolkit and OPARI for automatic source instrumentation
  • DynInst for runtime binary instrumentation
  • PCL and PAPI for hardware counter information
  • libvtf3, slog2sdk, and EPILOG for exporting trace files

  4. TAU Architecture

  5. Configuring & Installing TAU • TAU relies on several existing toolkits for efficient usage, but some of these toolkits are time-consuming to install • PDToolkit, PAPI, etc • Users must choose between modes at compile time using ./configure script • Profiling via -PROFILE, tracing via -TRACE • TAU must also be notified about the location of supported languages and compilers • -mpilib=/path/to/mpi/lib • -dyninst=/path/to/dyninst • -pdt=/path/to/pdt • Other supported languages/libraries handled in a similar manner • This results in a very flexible installation process • Users can easily install different configurations of TAU in their home directory • However, several configuration options are mutually exclusive, such as • Profiling and tracing • Using PAPI counters vs. gettimeofday or TSC counters • Profiling w/callpaths vs. profiling with extra statistics • Unfortunately, mutually exclusive nature of things proves to be annoying • Would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice! • Luckily, software compiles quickly on modern machines, so this is not fatal • However, TAU relies on several environment variables, which makes switching between installations cumbersome

  6. The Many Faces of TAU
• Two main methods of operation: profiling and tracing
• Profile mode
  • Reports the aggregate time spent in each function per node/thread
  • Several profile recording options:
    • Report min/max/std. dev. of times using the -TRACESTATS configure option
    • Attempt to compensate for profiling overhead (-COMPENSATE)
    • Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM)
    • Stop profiling after a certain function depth (-DEPTHLIMIT)
    • Record call trees in the profile (-PROFILECALLPATH)
    • Record the phase of the program in profiles (-PROFILEPHASE; requires manual instrumentation of phases)
  • If instrumented code uses the TAU_INIT macros, arguments can also be passed to the compiled, instrumented program to restrict what is recorded at runtime, e.g. --profile main+func2 (see the sketch after this slide)
  • Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL)
  • Data visualized using pprof (text-based) or paraprof (Java-based GUI)
  • Profile data can be exported to KOJAK's Cube viewer
  • Profile data can be imported from Vampir VTF traces
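To make the runtime restriction concrete, here is a minimal sketch of an instrumented main, assuming the TAU_INIT and TAU_PROFILE_SET_NODE macros behave as described in the TAU manual (the exact macro set may vary by TAU version; the program and function names are hypothetical):

    #include <TAU.h>

    int main(int argc, char **argv)
    {
        /* TAU_INIT consumes TAU runtime options such as
           "--profile main+func2", restricting which events are recorded */
        TAU_INIT(&argc, &argv);

        /* Set the node id; TAU's MPI wrappers do this automatically
           in MPI programs */
        TAU_PROFILE_SET_NODE(0);

        /* ... application work, instrumented manually or via
           tau_instrument ... */
        return 0;
    }

Running the instrumented binary as "./app --profile main+func2" would then record only those two events.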

  7. The Many Faces of TAU (2)
• Trace mode
  • Records timestamps for function entry/exit points
    • Or for arbitrary code section points via manual instrumentation
  • Also records messages sent/received for MPI programs
  • No trace visualizer of its own, but can export to:
    • ALOG: Upshot/nupshot
    • Paraver's trace format
    • SLOG-2: Jumpshot
    • VTF: Vampir/Intel Trace Analyzer 5
    • SDDF: format used by Pablo/SvPablo
    • EPILOG: KOJAK's trace format

  8. TAU Instrumentation: Profile Mode
• Source-level instrumentation
  • tau_instrument (which requires PDToolkit) is used to produce instrumented source code for C, C++, and Fortran files
  • For OpenMP code, TAU can use OPARI (from KOJAK)
  • Users may insert instrumentation using TAU's simple API (TAU_PROFILE_START, TAU_PROFILE_STOP); see the sketch after this slide
  • When compiling, users must use stub Makefiles that define compilation macros like CFLAGS, LDFLAGS, etc.
    • This can complicate the compile & link cycle greatly, especially if fully automatic source instrumentation is desired
  • Selective instrumentation is supported through a flag to tau_instrument
    • Give it a file listing which functions to include in or exclude from instrumentation
    • tau_reduce can be used in conjunction with existing profiles to exclude functions matching certain criteria, such as numcalls > 10000 & usecs/call < 2
• Binary-level instrumentation
  • Based on DynInst; considered "experimental" according to the documentation
  • Use the tau_run wrapper script with an instrumentation file in the same format as the selective instrumentation file
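As a concrete illustration of the manual API, below is a minimal sketch using the macro names from the TAU manual (TAU_PROFILE_TIMER, TAU_PROFILE_START, TAU_PROFILE_STOP); the routine solve and its body are hypothetical:

    #include <TAU.h>

    /* Hypothetical application routine timed with TAU's manual API */
    void solve(int n)
    {
        int i;
        /* Declare a timer: variable, display name, type signature, group */
        TAU_PROFILE_TIMER(t, "solve", "void (int)", TAU_USER);
        TAU_PROFILE_START(t);

        for (i = 0; i < n; i++) {
            /* ... work being measured ... */
        }

        TAU_PROFILE_STOP(t);
    }

The file is then compiled and linked through one of TAU's stub Makefiles so the macros expand against the chosen TAU configuration.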

  9. TAU Instrumentation: Trace Mode
• Source-level instrumentation
  • Same procedure as in profile mode
• Binary instrumentation
  • Can link against the MPI wrapper library (only re-linking is necessary); the sketch after this slide illustrates the idea
  • Runtime instrumentation for trace mode is not supported using DynInst
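The wrapper library relies on MPI's standard PMPI profiling interface: every MPI_* function has a PMPI_* twin, so a library linked ahead of the MPI library can intercept calls without recompilation. The sketch below shows the general shape of such a wrapper, not TAU's actual implementation (which also records message sizes and endpoints for the trace):

    #include <mpi.h>
    #include <TAU.h>

    /* Simplified PMPI-style wrapper: time the call, then forward it
       to the real implementation */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int rc;
        TAU_PROFILE_TIMER(t, "MPI_Send()", "", TAU_DEFAULT);
        TAU_PROFILE_START(t);
        rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        TAU_PROFILE_STOP(t);
        return rc;
    }

Because the interposition happens at link time, re-linking against the wrapper library is all that is needed.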

  10. Source Instrumentation Process

  11. Instrumentation Test Suite: Problems
• Problem with using selective instrumentation + MPI wrapper library + PAPI metrics
  • Only instrumenting main in CAMEL caused several floating-point instructions to be attributed to MPI_Send and MPI_Recv instead of main
• For timing and overhead measurements, used wall clock time with the low-overhead -LINUXTIMERS option
• Some code had to be modified before feeding it through PDToolkit's cparse
  • cparse uses the Edison Design Group's parser, which is stricter about some things than other compilers
  • ANSI C/standard Fortran code poses no problems, though
• NAS NPB LU benchmark (NPB v3.1-MPI) would not run with the TAU libraries
  • Segfaults ("signal 11"s) when using either LAM or MPICH with only the MPI wrapper libraries (profiling & tracing)
  • A modified, updated version (3.2) of LU comes with TAU
    • Had problems compiling and running this as well
  • Gave TAU the benefit of the doubt for the rest of the evaluations
    • Guessed at what a TAU profile would have told us had it been working with LU for the bottleneck tests
    • LU timing overheads omitted from the overhead measurements

  12. Instrumentation Overhead: Notes
• Performed automatic instrumentation of CAMEL using tau_instrument
  • As with KOJAK, program execution time was several orders of magnitude slower
  • This is likely due to the use of very small functions that would normally be inlined by the compiler
  • For the profile measurements on the following slides, only main was instrumented
  • Under this scenario, profiling and tracing overhead was almost nonexistent (<1%)
• Instrumentation points chosen for overhead measurements
  • Profiling
    • CAMEL: all MPI calls, main enter + exit
    • PPerfMark suite: all MPI calls, all function calls
    • Used the -PROFILECALLPATH configuration option
    • Using other profile flavors (without call paths, with extra stats) made a negligible difference in overall profile overhead
  • Tracing
    • CAMEL: all MPI calls
    • PPerfMark suite: all MPI calls
  • Similar to what we have done for other tools
• Benchmarks marked with * had high variability in runtimes

  13. Instrumentation Overhead: Notes (2)
• Used LAM for all measurements
  • Some benchmarks with high overhead had smaller overhead using MPICH (dramatically so for small-messages):
    • Small messages: 54.2% vs. 483.316% (MPICH vs. LAM)
    • Wrong way: 24.5% vs. 28.573%
    • Ping-pong: 51.5% vs. 56.259%
  • Probably due to LAM running faster (especially on small-messages) while execution time is limited by the I/O time for writing the trace file
    • The same I/O time with a smaller execution time yields a higher percentage overhead (for example, a fixed 10 s of trace I/O is 50% of a 20 s run but 250% of a 4 s run)
• In general, overhead for profiling and tracing is extremely low except for a few cases
  • High profile overhead for programs with small functions that get called a lot: small-messages, wrong-way, ping-pong, CAMEL with everything instrumented
  • High trace overhead for programs whose large traces are generated very quickly: small-messages, wrong-way, ping-pong
• tau_reduce provides a nice way to help reduce instrumentation overhead, although an initial profile must first be gathered

  14. Instrumentation Overhead: Profiles

  15. Instrumentation Overhead: Traces

  16. Visualizations: pprof
• Gives a text-based dump of profile files, similar to gprof/prof output
• Example (partial) output:

    ...
    USER EVENTS Profile :NODE 7, CONTEXT 0, THREAD 0
    ---------------------------------------------------------------------------------------
    NumSamples   MaxValue   MinValue  MeanValue   Std. Dev.  Event Name
    ---------------------------------------------------------------------------------------
             4       2016          4        516       866.2  Message size received from all nodes
            86         28         28         28           0  Message size sent to all nodes
    ---------------------------------------------------------------------------------------

    FUNCTION SUMMARY (total):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs   Inclusive  Name
                  msec   total msec                            usec/call
    ---------------------------------------------------------------------------------------
    100.0     5:44.295     5:45.407           8        1440    43175935  main()
      0.2          541          553           8         232       69161  MPI_Init()
      0.2          541          553           8         232       69161  main() => MPI_Init()
      0.2          543          543         704           0         772  MPI_Recv()
      0.2          543          543         704           0         772  main() => MPI_Recv()
      0.0            9            9           8           0        1136  MPI_Barrier()
    ...

  17. Visualizations: paraprof
• paraprof provides visual representations of the same data given by pprof
  • Used to be a Tcl/Tk application known as "racy"
  • Racy has been deprecated, but is still included with TAU for historical reasons
• Java application with three main views:
  • Main profile view
  • Histograms (next slides)
  • Three-dimensional visualization (next slides)
• Main profile view
  • "Function ledger" maps colors to function names
  • Overall time for each function displayed as a stacked bar chart
  • Can click on each function to get detailed information
• No line-level source code correlation
  • Can infer this information indirectly if call paths are used

[Slide screenshots: main profile view, function ledger, function details view]

  18. Visualizations: paraprof (2)
• paraprof can also show histogram views for each function in the main profile view
  • These simply show a histogram of the aggregate time for a function across all threads
  • The histogram shown on the slide indicates that most threads spent around 75.8 seconds (the midpoint between min and max) in MPI_Barrier

  19. Visualizations: paraprof (3)
• paraprof can also display three-dimensional views of profile data
• Bar and triangle-mesh plots map three axes:
  • Time spent in each function (height)
  • Which function (width)
  • Which node (depth)
• A scatter plot lets you pick the axes
• Plots support transparency, rotation, and highlighting a particular function or node
• Surprisingly responsive for a Java application!

  20. Bottleneck Identification Test Suite
• Testing metric: what did pprof/paraprof tell us from wall clock time profiles?
  • Since there is no built-in trace visualizer, we ignored what could be done with other trace tools
• Program correctness not affected by instrumentation
  • Except for our version of LU
• CAMEL: PASSED
  • Showed work evenly distributed among nodes
  • When full tracing was used, could easily show which functions take the most wall clock time
• LU: FAILED
  • Could not run; got segfaults using MPICH or LAM
  • Even if it had worked, it would be very difficult or impossible to garner communication patterns from profile views
• Big messages: PASSED
  • Profile showed most of the application time dominated by MPI calls to send and receive
• Diffuse procedure: TOSS-UP
  • Profile showed most time taken by MPI_Barrier calls
  • However, the profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time
  • Really need a trace view to see the diffuse behavior of the program
• Hot procedure: PASSED
  • Profile clearly shows that one function is responsible for most of the execution time

  21. Bottleneck Identification Test Suite (2)
• Intensive server: PASSED
  • Profile showed most time spent in MPI_Recv for all nodes except the first node
  • Profile also illustrated that most time on the first node was spent in waste_time
• Ping-pong: PASSED
  • Easy to see from the profile that most time is spent in MPI_Send and MPI_Recv
  • pprof and paraprof also showed a large number of MPI calls
• Random barrier: TOSS-UP
  • Profile showed most time being spent in MPI_Barrier
  • However, the random nature of the barrier is not shown by the profile
  • A trace view is necessary to see the random-barrier behavior
• Small messages: PASSED
  • Profile showed one process spending most of its time in MPI_Send and the other process in MPI_Recv
• System time: FAILED
  • No built-in way to separate wall clock time into system time vs. user time
  • PAPI metrics can't record system time vs. user time either
• Wrong order: FAILED
  • Impossible to see communication behavior without a trace

  22. TAU General Comments
• Good things
  • Supports profiling & tracing
  • Very portable
  • Wide range of software support
    • Several programming models & libraries supported
  • Visualization tools seem very stable
  • Good support for exporting data to other tools
• Things that could use improvement
  • Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult
  • Source code correlation could be better
    • Only at the function or function call level (with call paths)
  • Export is nice, but lots of things are easier to do directly in other tools
    • For example, mpicc -mpilog to get a trace for Jumpshot instead of cparse, tau_instrument, wrapper Makefiles, …
    • TAU does add automatic instrumentation for profiling functions, which is an added benefit
  • Three-dimensional visualizations are nice, but the Cube viewer from KOJAK is easier to use and displays data in a very concise manner
    • Text for function names is also hard to read in the three-dimensional views
  • Some interoperability features (export to SLOG-2 and ALOG) do not work well in the version we tested
• TAU could potentially serve as a base for our UPC and SHMEM performance tool

  23. TAU: Adding UPC & SHMEM • SHMEM • Not much extra work needed • Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions • UPC • If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places • If we do binary instrumentation, we’ll probably have to make major modifications to DynInst • In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard • However, how to instrument UPC programs while retaining low overhead? • Also, how to extend TAU to support more advanced analyses? • Support for profiles and traces a nice bonus

  24. Evaluation (1)
• Available metrics: 4/5
  • Supports recording execution time (broken down into call trees)
  • Supports several methods of gathering profile data
  • Supports all PAPI metrics for profiles
• Cost: 5/5
  • Free!
• Documentation quality: 3.5/5
  • User's manual very good, but out of date
    • For example, three-dimensional visualizations are not covered in the manual
• Extensibility: 4/5
  • Open source, uses documented APIs
  • Can add support for new languages using source instrumentation
• Filtering and aggregation: 2.5/5
  • Filtering & aggregation available through the profile view
  • No advanced filter or custom aggregation methods built in for traces

  25. Evaluation (2)
• Hardware support: 5/5
  • Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, Origin 200, 2000, 3000 series; NEC SX-5; Cray X1, T3E; Apple OS X; HP RISC systems (HP-UX)
• Heterogeneity support: 0/5 (not supported)
• Installation: 2.5/5
  • As simple as ./configure with options, then make install
  • However, dependence on other software for source or binary instrumentation makes installation time-consuming
• Interoperability: 5/5
  • Profile files use a simple ASCII format; trace files use a documented binary format
  • Can export to Vampir, Jumpshot/upshot (ALOG & SLOG-2), CUBE, SDDF, Paraver
• Learning curve: 2.5/5
  • Learning how to use the different Makefile wrappers and command-line programs takes a while
  • After a short period, instrumentation & tool usage is relatively easy

  26. Evaluation (3)
• Manual overhead: 4/5
  • Automatic instrumentation of MPI calls on all platforms
  • Automatic instrumentation of all functions or a selected group of functions
  • Call path support gives almost the same information as instrumenting call sites
  • MPI and OpenMP instrumentation support
• Measurement accuracy: 5/5
  • CAMEL overhead < 1% for profiling and tracing when only a few functions were instrumented
  • Overall, accuracy is pretty good except for a few cases
• Multiple executions: 3/5
  • Can relate profile metrics between runs in paraprof
  • Can store performance data in a DBMS (PerfDB)
    • PerfDB seems to be in a preliminary state, though
• Multiple analyses & views: 4/5
  • Both profiling and tracing are supported (although there is no built-in trace viewer)
  • Profile view has stacked bar charts, "regular" views, three-dimensional views, and histograms

  27. Evaluation (4)
• Performance bottleneck identification: 3.5/5
  • No automatic bottleneck identification
  • Profile viewer helpful for identifying the methods that take the most time
  • Lack of a built-in trace viewer makes identification of some bottlenecks impossible, but trace export means TAU can be combined with several other viewers to cover just about anything
• Profiling/tracing support: 4/5
  • Tracing & profiling supported
  • Default trace file format size is reasonable but not the most compact
• Response time: 3/5
  • Loading profiles after a run is almost instantaneous using the paraprof viewer
  • Exporting traces to other tools is time-consuming (have to run tau_merge, tau_convert, etc.; a few extra disk I/Os)
• Software support: 5/5
  • Supports OpenMP, MPI, and several other programming models
  • A wide range of compilers is supported
  • Can support linking against any library, but does not instrument library functions
• Source code correlation: 2/5
  • Supported down to the function and function call site level (when collecting call paths is enabled)
• Searching: 0/5 (not supported)

  28. Evaluation (5)
• System stability: 3/5
  • Software is generally stable
  • Bugs encountered:
    • Segfaults in the instrumented version of our LU code
    • SLOG-2 export seems to give Jumpshot-4 some trouble (several "unsupported event" messages on a few exported traces)
    • Exporting to ALOG format puts stray ": %d" lines in the ALOG file
• Technical support: 5/5
  • Good response from our contact (Sameer); most emails answered within 48 hours with useful information
