TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair

Allen D. Malony, Aroon Nataraj • {malony,anataraj}@cs.uoregon.edu • http://www.cs.uoregon.edu/research/tau • Department of Computer and Information Science • Performance Research Laboratory • University of Oregon

Performance Research Lab • Dr. Sameer Shende, Senior scientist • Alan Morris, Senior software engineer • Wyatt Spear, Software engineer • Scott Biersdorff, Software engineer • Li Li, Ph.D. student • “Model-based Automatic Performance Diagnosis” • Ph.D. thesis, January 2007 • Kevin Huck, Ph.D. student • Aroon Nataraj, Ph.D. student • Integrated kernel / application performance analysis • Scalable performance monitoring

TAU Performance System • Tuning and Analysis Utilities (14+ year project effort) • Performance system framework for HPC systems • Integrated, scalable, flexible, and parallel • Multiple parallel programming paradigms • Parallel performance mapping methodology • Portable (open source) parallel performance system • Instrumentation, measurement, analysis, and visualization • Portable performance profiling and tracing facility • Performance data management and data mining • Scalable (very large) parallel performance analysis • Partners • Research Center Jülich, LLNL, ANL, LANL, UTK

TAU Performance Observation Methodology • Advocate event-based, direct performance observation • Observe execution events • Types: control flow, state-based, user-defined • Modes: atomic, interval (enter/exit) • Instrument program code directly (defines events) • Modify program code at points of event occurrence • Different code forms (source, library, object, binary, VM) • Measurement code inserted (instantiates events) • Make events visible • Measures performance related to event occurrence • Contrast with event-based sampling
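The two event modes on this slide can be sketched as follows. This is a minimal illustration and not TAU's API: the class `Profile` and its methods `enter`, `exit`, and `atomic` are hypothetical names, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import defaultdict


class Profile:
    """Toy event-based profile: interval events (enter/exit) and atomic events."""

    def __init__(self):
        self.calls = defaultdict(int)           # interval event -> number of calls
        self.inclusive = defaultdict(float)     # time including nested events
        self.exclusive = defaultdict(float)     # time excluding nested events
        self.atomic_values = defaultdict(list)  # atomic event -> recorded values
        self._stack = []                        # entries: [name, start, child_time]

    def enter(self, name, now):
        self._stack.append([name, now, 0.0])

    def exit(self, name, now):
        top, start, child_time = self._stack.pop()
        assert top == name, "enter/exit events must nest"
        elapsed = now - start
        self.calls[name] += 1
        self.inclusive[name] += elapsed
        self.exclusive[name] += elapsed - child_time
        if self._stack:                         # charge elapsed time to the parent
            self._stack[-1][2] += elapsed

    def atomic(self, name, value):
        self.atomic_values[name].append(value)


p = Profile()
p.enter("main", 0.0)
p.enter("solve", 1.0)
p.atomic("bytes allocated", 4096)
p.exit("solve", 4.0)
p.exit("main", 5.0)
print(p.inclusive["main"], p.exclusive["main"])  # 5.0 2.0
```

An interval event yields a duration per enter/exit pair, while an atomic event records a single value at the point where it occurs.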

TAU Performance System Architecture

Multi-Level Instrumentation and Mapping • Multiple interfaces • Information sharing • Between interfaces • Event selection • Within levels • Between levels • Mapping • Performance data is associated with high-level semantic abstractions • [Figure: instrumentation levels, from user-level abstractions and the problem domain through source code, preprocessor, compiler, object code and libraries, linker, executable, runtime image, VM, and OS, producing performance data at run time]

TAU Instrumentation Approach • Support for standard program events • Routines, classes and templates • Statement-level blocks and loops • Support for user-defined events • Begin/End events (“user-defined timers”) • Atomic events (e.g., size of memory allocated/freed) • Selection of event statistics • Support definition of “semantic” entities for mapping • Support for event groups (aggregation, selection) • Instrumentation selection and optimization • Instrumentation enabling/disabling and runtime throttling

TAU Instrumentation Mechanisms • Source code • Manual (TAU API, TAU component API) • Automatic (robust) • C, C++, F77/90/95 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari), POMP2 spec) • Object code • Pre-instrumented libraries (e.g., MPI using PMPI) • Statically-linked and dynamically-linked • Executable code • Dynamic instrumentation (pre-execution) (DyninstAPI) • Virtual machine instrumentation (e.g., Java using JVMPI) • TAU_COMPILER to automate instrumentation process

TAU Measurement Approach • Portable and scalable parallel profiling solution • Multiple profiling types and options • Event selection and control (enabling/disabling, throttling) • Online profile access and sampling • Online performance profile overhead compensation • Portable and scalable parallel tracing solution • Trace translation to EPILOG, VTF3, and OTF • Trace streams (OTF) and hierarchical trace merging • Robust timing and hardware performance support • Multiple counters (hardware, user-defined, system) • Measurement specification separate from instrumentation

TAU Measurement Mechanisms • Parallel profiling • Function-level, block-level, statement-level • Supports user-defined events and mapping events • TAU parallel profile stored (dumped) during execution • Support for flat, callgraph/callpath, phase profiling • Support for memory profiling (headroom, leaks) • Tracing • All profile-level events • Inter-process communication events • Inclusion of multiple counter data in traced events • Compile-time and runtime measurement selection

Performance Analysis and Visualization • Analysis of parallel profile and trace measurement • Parallel profile analysis • ParaProf: parallel profile analysis and presentation • ParaVis: parallel performance visualization package • Profile generation from trace data (tau2pprof) • Performance data management framework (PerfDMF) • Parallel trace analysis • Translation to VTF (V3.0), EPILOG, OTF formats • Integration with VNG (Technical University of Dresden) • Online parallel analysis and visualization • Integration with CUBE browser (KOJAK, UTK, FZJ)

TAU and DyninstAPI • TAU has had a long-term affair with Dyninst technology • Dyninst offered a binary-level instrumentation tool • Could help in cases when the source code is unavailable • Could allow instrumentation without recompilation • TAU requirements • Instrument HPC applications with TAU measurements • Multiple paradigms, languages, compilers, platforms • Portability • Tested Dyninst features as they were released • Issues • MPI, threading, availability, binary rewriting • It has been an on/off, open relationship

Using DyninstAPI • TAU uses DyninstAPI for binary code patching • Pre-execution • versus at any point during execution • Methods • runtime before the application begins • binary rewriting • tau_run (mutator) • Loads TAU measurement library • Uses DyninstAPI to instrument mutatee • Can apply instrumentation selection

Using DyninstAPI with TAU
Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run:
% configure -dyninst=/usr/local/dyninstAPI-5.0.1
% make clean; make install
tau_run command:
% tau_run [<-o outfile>] [-Xrun<libname>] [-f <select_inst_file>] [-v] <infile>
Instrument all events with the TAU measurement library and execute:
% tau_run klargest
Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:
% tau_run -XrunTAUsh-papi a.out
Instrument only the events specified in the select.tau instrumentation specification file and execute:
% tau_run -f select.tau a.out
Binary rewriting:
% tau_run -o a.inst.out a.out

Runtime Instrumentation with DyninstAPI • tau_run loads TAU’s shared object in the address space • Selects routines to be instrumented • Calls DyninstAPI OneTimeCode • Register a startup routine • Pass a string of routine (event) names • “main | foo | bar” • IDs assigned to events • TAU’s hooks for entry/exit used for instrumentation • Invoked during execution
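The registration step above can be illustrated with a small sketch. It is not DyninstAPI or TAU code: `register_events`, `on_entry`, and `on_exit` are hypothetical stand-ins, and the numbering scheme (IDs assigned in list order, starting at 0) is an assumption made for the example.

```python
def register_events(names):
    """Split a '|'-separated routine list and assign each name an integer ID."""
    routines = [n.strip() for n in names.split("|") if n.strip()]
    return {name: event_id for event_id, name in enumerate(routines)}


event_ids = register_events("main | foo | bar")
call_counts = {event_id: 0 for event_id in event_ids.values()}
active = []  # IDs of routines currently executing


def on_entry(event_id):
    # Stand-in for the entry hook that inserted instrumentation would call.
    call_counts[event_id] += 1
    active.append(event_id)


def on_exit(event_id):
    # Stand-in for the exit hook; entries and exits must pair up.
    assert active.pop() == event_id


on_entry(event_ids["main"])
on_entry(event_ids["foo"])
on_exit(event_ids["foo"])
on_exit(event_ids["main"])
print(event_ids)  # {'main': 0, 'foo': 1, 'bar': 2}
```

Passing IDs rather than names to the hooks keeps the per-call work small, which is one plausible reason for assigning IDs once at startup.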

Using DyninstAPI with MPI • One mutator per mutatee • Each mutator instruments mutatee prior to execution • No central control • Each mutatee writes its own performance data to disk
% mpirun -np 4 ./run.sh
% cat run.sh
#!/bin/sh
/usr/local/tau-2.x/x86_64/bin/tau_run <path>/a.out

Binary Rewriting with TAU • Rewrite binary (save the world) before executing • No central control • No need to re-instrument the code on all backend nodes • Each mutatee writes its own performance data to disk
% tau_run -o a.inst.out a.out
% cd _dyninstsaved0
% mpirun -np 4 ./a.inst.out

Example • EK-SIMPLE benchmark • CFD benchmark • Andy Shaw, Kofi Fynn • Adapted by Brad Chamberlain • Experimentation • Run on 4 CPUs • Runtime instrumentation using DyninstAPI and tau_run • Wallclock time and CPU time measurement experiments • Profiling and tracing modes of measurement • Look at performance data with ParaProf and Vampir

ParaProf - Main Window (4 cpus)

ParaProf - Individual Profile (n,c,t 0,0,0)

ParaProf - Statistics Table (Mean)

ParaProf - net_recv (MPI rank 1)

Integrated Instrumentation (Source + Dyninst) • Use source instrumentation for some events • Use Dyninst for other events • Access same TAU measurement infrastructure • Demonstrate on matrix multiplication example • Compare regular versus strip-mining versions • [Figures: source instrumented; source + binary instrumented]

TAU-over-MRNET (ToM) Project MRNET as a Transport Substrate in TAU (Reporting early work done in the last week.)

TAU Transport Substrate - Motivations • Transport Substrate • Enables movement of measurement-related data • TAU, in the past, has relied on shared file-system • Some Modes of Performance Observation • Offline / Post-mortem observation and analysis • least requirements for a specialized transport • Online observation • long running applications, especially at scale • dumping to file-system can be suboptimal • Online observation with feedback into application • in addition, requires that the transport is bi-directional • Performance observation problems and requirements are a function of the mode

Requirements • Improve performance of transport • NFS can be slow and variable • Specialization and remoting of FS-operations to front-end • Data Reduction • At scale, cost of moving data too high • Sample in different domain (node-wise, event-wise) • Control • Selection of events, measurement technique, target nodes • What data to output, how often and in what form? • Feedback into the measurement system, feedback into application • Online, distributed processing of generated performance data • Use compute resource of transport nodes • Global performance analyses within the topology • Distribute statistical analyses • easy (mean, variance, histogram), challenging (clustering)

Approach and First Prototype • Measurement and measured data transport are separate • No such distinction in TAU • Created abstraction to separate and hide transport • TauOutput • Did not create a custom transport for TAU • Use existing monitoring/transport capabilities • Supermon (Sottile and Minnich, LANL) • Piggy-backed TAU performance data on Supermon channels • Correlate system-level metrics from Supermon with TAU application performance data

Rationale • Moved away from NFS • Separation of concerns • Scalability, portability, robustness • Addressed independent of TAU • Re-use existing technologies where appropriate • Multiple bindings • Use different solutions best suited to particular platform • Implementation speed • Easy, fast to create adapter that binds to existing transport • MRNET support was added in about a week • Says a lot about usability of MRNET

ToM Architecture • TAU Components* • Front-End (FE) • Filters • Back-End (BE) • * Over MRNet API • No-Backend-Instantiation mode • Push-Pull model of data retrieval • No daemon • Instrumented application contains TAU and Back-End • Two channels (streams) • Data (BE to FE) • Control (FE to BE)
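The push-pull model with separate data and control streams can be sketched in a single process. This does not use the MRNet API: the two `queue.Queue` objects merely stand in for the streams, and the names `BackEnd`, `poll_control`, `front_end_pull`, and the `"disable"`/`"enable"` messages are hypothetical.

```python
import queue

data_stream = queue.Queue()     # data: back-end to front-end
control_stream = queue.Queue()  # control: front-end to back-end


class BackEnd:
    def __init__(self, rank):
        self.rank = rank
        self.enabled = True

    def poll_control(self):
        """Apply pending control messages without ever waiting for one."""
        while True:
            try:
                msg = control_stream.get_nowait()
            except queue.Empty:
                return
            if msg == "disable":
                self.enabled = False
            elif msg == "enable":
                self.enabled = True

    def output(self, profile):
        """Push one profile upstream unless the front-end disabled output."""
        self.poll_control()
        if self.enabled:
            data_stream.put((self.rank, profile))


def front_end_pull():
    """Sink side: drain whatever has arrived so far."""
    received = []
    while not data_stream.empty():
        received.append(data_stream.get())
    return received


be = BackEnd(0)
be.output({"ssor": 1.5})
control_stream.put("disable")
be.output({"ssor": 3.0})        # dropped: control message is seen first
result = front_end_pull()
print(result)                   # [(0, {'ssor': 1.5})]
```

Because the back-end only checks for control messages when it is about to push, the application never blocks waiting on the front-end.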

ToM Architecture • Application calls into TAU • Per-iteration explicit call to output routine • Periodic calls using alarm • TauOutput object invoked • Configuration specific: compile time or runtime • One per thread • TauOutput mimics subset of FS-style operations • Avoids changes to TAU code • If required, the rest of TAU can be made aware of output type • Non-blocking recv for control • Back-end pushes • Sink pulls
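The idea of hiding the transport behind FS-style operations can be sketched as two interchangeable classes. This is an illustration under assumptions, not the TauOutput implementation: `FileOutput`, `TransportOutput`, `dump_profile`, and the `send` callback are hypothetical, and the in-memory `io.StringIO` stands in for a real file system.

```python
import io


class FileOutput:
    """File-style back end; an in-memory buffer stands in for the file system."""

    def __init__(self):
        self.files = {}
        self._name = None
        self._file = None

    def open(self, name):
        self._name = name
        self._file = io.StringIO()

    def write(self, text):
        self._file.write(text)

    def close(self):
        self.files[self._name] = self._file.getvalue()
        self._file = None


class TransportOutput:
    """Same open/write/close interface, but ships one packet on close."""

    def __init__(self, send):
        self._send = send       # callable that moves data upstream
        self._name = None
        self._chunks = []

    def open(self, name):
        self._name = name
        self._chunks = []

    def write(self, text):
        self._chunks.append(text)

    def close(self):
        self._send(self._name, "".join(self._chunks))


def dump_profile(out, rank, timings):
    """Measurement-side code: identical regardless of the output back end."""
    out.open(f"profile.{rank}.0.0")
    for event, seconds in timings.items():
        out.write(f"{event} {seconds}\n")
    out.close()


sent = []
dump_profile(TransportOutput(lambda name, payload: sent.append((name, payload))),
             0, {"main": 2.0})
fs = FileOutput()
dump_profile(fs, 0, {"main": 2.0})
print(sent[0] == ("profile.0.0.0", fs.files["profile.0.0.0"]))  # True
```

Since `dump_profile` only sees `open`, `write`, and `close`, switching from the file system to a network transport does not require changes to the measurement code.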

Simple Example (NPB LU - A, Per-5 iterations) Exclusive time

Simple Example (NPB LU - A, Per-5 iterations) Number of calls

Comparing ToM with NFS • TAUoverNFS versus TAUoverMRNET • 250 ssor iterations • 251 TAU_DB_DUMP operations • Significant advantages with specialized transport substrate • Similar when using Supermon as the substrate • Remoting of expensive FS meta-data operations to Front-End

Playing with Filters • Downstream (FE to BE) multicast path • Even without filters, is very useful for control • Data reduction filters are integral to the upstream path (BE to FE) • Without filters, lossless data is reproduced D-1 times • Unnecessarily large cost to network • Filter 1: Random Sampling Filter • Very simplistic data reduction by node-wise sampling • Accepts or rejects packets probabilistically • TAU Front-End can control probability P(accept) • P(accept)=K/N (N = # leaves, K is user constant) • Bounds number of packets per round to K
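The random sampling filter can be sketched in a few lines. This is a plain-Python illustration rather than an MRNet filter: the function name `sampling_filter` and the packet representation are hypothetical. It assumes each packet is accepted independently, in which case the number forwarded per round is K on average rather than a hard cap.

```python
import random


def sampling_filter(packets, k, n, rng):
    """Forward each packet independently with probability P(accept) = k / n."""
    p_accept = min(1.0, k / n)
    return [pkt for pkt in packets if rng.random() < p_accept]


rng = random.Random(42)          # fixed seed so runs are repeatable
n_leaves, k = 64, 4
rounds = 2000
forwarded = 0
for _ in range(rounds):
    packets = [("rank", r) for r in range(n_leaves)]  # one packet per leaf
    forwarded += len(sampling_filter(packets, k, n_leaves, rng))

print(forwarded / rounds)        # close to k on average
```

Raising K trades more network traffic for a denser sample; K = N disables the reduction entirely.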

Filter 1 in Action (Ring application) • Compare different P(accept) values • 1, 1/4, 1/16 • Front-End unable to keep up • Queuing delay propagated back

Other Filters • Statistics filter • Reduce raw performance data to smaller set of statistics • Distribute these statistical analyses from Front-End to the filters • Simple measures - mean, std.dev, histograms • More sophisticated measures - distributed clustering • Controlling filters • No direct way to control Upstream-filters • not on control path • Recommended solution • place upstream filters that work in concert with downstream filters to share control information • requires synchronization of state between upstream and downstream filters • Our Echo hack • Back-Ends transparently echo Filter-Control packets back upstream • this is then interpreted by the filters • easier to implement • control response time may be greater
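A statistics filter for the simple measures can be sketched with additive summaries. This is an assumption about one workable design, not the project's filter code: `summarize`, `merge`, and `mean_and_std` are hypothetical names, and the sample values are made up for the example.

```python
import math


def summarize(values):
    """Leaf-side summary: count, sum, and sum of squares are all additive."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(summaries):
    """Filter step at an internal tree node: add the children's summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    return (n, total, total_sq)


def mean_and_std(summary):
    """Front-end step: recover mean and population standard deviation."""
    n, total, total_sq = summary
    mean = total / n
    variance = max(0.0, total_sq / n - mean * mean)
    return mean, math.sqrt(variance)


# Two internal nodes, each merging two leaves that report exclusive times.
left = merge([summarize([2.0, 4.0]), summarize([4.0, 4.0])])
right = merge([summarize([5.0, 5.0]), summarize([7.0, 9.0])])
mean, std = mean_and_std(merge([left, right]))
print(mean, std)  # 5.0 2.0
```

Each internal node forwards a fixed-size triple no matter how many leaves sit below it, which is what makes the reduction scale. The sum-of-squares form can lose precision for large values; a pairwise mean/variance update is a more robust alternative.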

Feedback / Suggestions • Easy to integrate with MRNET • Good examples, documentation, and readable source code • Setup phase • Make MRNET intermediate nodes listen on pre-specified port • Allow arbitrary MRNET ranks to connect and then set the IDs in the topology • Relaxing strict a priori ranks can make setup easier • Setup in job-queue environments is difficult • Packetization API can be more flexible • Current API is usable and simple (var-arg printf style) • Composing a packet over a series of staggered stages is difficult • Allow control over how buffering is performed • Important in a push-pull model, as data injection points (rates) are independent of data retrieval • Is not a problem in a purely pull model

TAUoverMRNET - Contrast with TAUoverSupermon • Supermon (cluster-monitor) vs. MRNet (reduction-network) • Both light-weight transport substrates • Data format • Supermon: ASCII s-expressions • MRNET: packets with packed (binary?) data • Supermon Setup • Loose topology • No support/help in setting up intermediate nodes • Assume Supermon is part of the environment • MRNET Setup • Strict topology • Better support for starting intermediate nodes • With/Without Back-End instantiation (TAU uses latter) • Multiple Front-Ends (or sinks) possible with Supermon • With MRNET, the front-end needs to program this functionality • No existing pluggable “filter” support in Supermon • Performing aggregation is more difficult with Supermon • Supermon allows buffer-policy specification, MRNET does not

Future Work • Dyninst • Tighter integration of source and binary instrumentation • Conveying of source information to binary level • Enabling use of TAU’s advanced measurement features • Leveraging TAU’s performance mapping support • Want robust and portable binary rewriting tool • MRNet • Development of more performance filters • Evaluation of MRNet performance for different scenarios • Testing at large scale • Use in applications
