Optimization of Instrumentation in Parallel Performance Evaluation Tools
Sameer Shende, Allen D. Malony, Alan Morris
University of Oregon
{sameer, malony, amorris}@cs.uoregon.edu
PARA’06: MS8: Tools for Parallel Performance Analysis, 2:40pm – 3pm, Mon 6/19/06
Outline • Overview of features • Instrumentation • Measurement (Profiling, Tracing) • Analysis tools • Tools and techniques for optimizing instrumentation • Conclusions
TAU Performance System • Tuning and Analysis Utilities (14+ year project effort) • Performance system framework for HPC systems • Integrated, scalable, portable, flexible, and parallel • Integrated toolkit for performance problem solving • Automatic instrumentation • Highly configurable measurement system with support for many flavors of profiling and tracing • Portable analysis and visualization tools • Performance data management and data mining • http://www.cs.uoregon.edu/research/tau
TAU Performance System Architecture [Figure: architecture diagram; visible label: event selection]
Program Database Toolkit (PDT) [Figure: PDT diagram. Application / Library source passes through the C / C++ parser and the Fortran parser (F77/90/95) to an intermediate language (IL); the C / C++ IL analyzer and the Fortran IL analyzer produce Program Database Files, accessed through DUCTAPE. Tools shown: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), TAU_instr (automatic source instrumentation).]
ParaProf – Manager Window [Figure labels: performance database; derived performance metrics]
ParaProf – Full Profile (Miranda) 8K processors!
ParaProf – 3D Full Profile (Miranda) 16k processors
ParaProf – 3D Scatterplot (Miranda) • Each point is a “thread” of execution • Relation between four routines shown at once
TAU Instrumentation Approach • Support for standard program events • Routines • Classes and templates • Statement-level blocks • Support for user-defined events • Begin/End events (“user-defined timers”) • Atomic events (e.g., size of memory allocated/freed) • Support definition of “semantic” entities for mapping • Support for event groups • Instrumentation optimization (eliminate instrumentation in lightweight routines)
Sampling vs Measured Profiling
• Sampling
  • At each sample, the PC or callstack is examined
  • Performance of the program is estimated from the samples taken in code regions
  • Fixed overhead, depends on inter-sample interval
  • Typically used in gprof, prof and other system profilers
• Measured Profiling
  • Instrumentation calls inserted at code regions
  • Entry/exit from routine, outer loops, “events”
  • Accurate measurements, compensation for timer overheads possible
  • Accuracy decreases as instrumentation becomes finer-grained; coarse-grained instrumentation is more accurate
  • Overhead of instrumentation depends on event frequency
  • Optimize instrumentation to capture necessary detail, eliminate instrumentation in frequently executing lightweight routines
  • Used in TAU
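To make the measured-profiling idea concrete, here is a minimal Python sketch of entry/exit instrumentation with a rough compensation for timer overhead. It is an illustration only and not TAU's implementation: the names `profile`, `instrumented`, `estimate_timer_overhead` and `compensated_time` are hypothetical, and the overhead estimate is deliberately crude.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory profile: routine name -> [number of calls, inclusive seconds]
profile = defaultdict(lambda: [0, 0.0])


def estimate_timer_overhead(samples=10000):
    """Roughly estimate the cost of one pair of timer reads (seconds)."""
    start = time.perf_counter()
    for _ in range(samples):
        time.perf_counter()
        time.perf_counter()
    return (time.perf_counter() - start) / samples


def instrumented(func):
    """Insert measurement calls at routine entry and exit."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()              # entry event
        try:
            return func(*args, **kwargs)
        finally:
            entry = profile[func.__name__]    # exit event
            entry[0] += 1
            entry[1] += time.perf_counter() - t0
    return wrapper


def compensated_time(name, overhead_per_call):
    """Inclusive time minus the estimated timer cost of every call."""
    calls, inclusive = profile[name]
    return max(0.0, inclusive - calls * overhead_per_call)


if __name__ == "__main__":
    @instrumented
    def lightweight(n):
        return n + 1

    for i in range(100000):
        lightweight(i)
    overhead = estimate_timer_overhead()
    print(profile["lightweight"], compensated_time("lightweight", overhead))
```

For a routine as small as `lightweight`, most of the recorded time is measurement cost, which is the case the slide recommends removing from instrumentation altogether.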
TAU Instrumentation • Flexible instrumentation mechanisms at multiple levels • Source code • manual (TAU API, TAU Component API) • automatic • C, C++, F77/90/95 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari), POMP spec) • Object code • pre-instrumented libraries (e.g., MPI using PMPI) • statically-linked and dynamically-linked • Executable code • dynamic instrumentation (pre-execution) (DynInstAPI) • virtual machine instrumentation (e.g., Java using JVMPI) • Runtime Linking (LD_PRELOAD)
PAPI [UTK] • Performance Application Programming Interface • The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors. • Parallel Tools Consortium project • University of Tennessee, Knoxville • http://icl.cs.utk.edu/papi
KOJAK • KOJAK Toolkit [ICL, UTK and FZJ, Germany] • Epilog tracing library • Opari OpenMP re-writing tool • Expert automatic bottleneck detection trace analyzer • CUBE performance data browser • http://icl.cs.utk.edu/kojak
Automatic Instrumentation
• We now provide compiler wrapper scripts
• Simply replace mpxlf90 with tau_f90.sh
• Automatically instruments Fortran source code, links with TAU MPI Wrapper libraries
• Use tau_cc.sh and tau_cxx.sh for C/C++

Before:
    CXX = mpCC
    F90 = mpxlf90_r
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o
    app: $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(CXX) $(CFLAGS) -c $<

After:
    CXX = tau_cxx.sh
    F90 = tau_f90.sh
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o
    app: $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(CXX) $(CFLAGS) -c $<
AutoInstrumentation using TAU_COMPILER • $(TAU_COMPILER) stub Makefile variable in 2.14+ release • Invokes PDT parser, TAU instrumentor, compiler through the tau_compiler.sh shell script • Requires minimal changes to application Makefile • Compilation rules are not changed • User sets TAU_MAKEFILE and TAU_OPTIONS environment variables • User renames the compilers: F90=xlf90 becomes F90=tau_f90.sh • Passes options from TAU stub Makefile to the four compilation stages • Uses original compilation command if an error occurs
TAU_COMPILER Options
• Optional parameters for $(TAU_COMPILER): [tau_compiler.sh -help]
    -optVerbose            Turn on verbose debugging messages
    -optPdtDir=""          PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)
    -optPdtF95Opts=""      Options for Fortran parser in PDT (f95parse)
    -optPdtCOpts=""        Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optPdtCxxOpts=""      Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optPdtF90Parser=""    Specify a different Fortran parser, e.g., f90parse instead of f95parse
    -optPdtUser=""         Optional arguments for parsing source code
    -optPDBFile=""         Specify [merged] PDB file. Skips parsing phase.
    -optTauInstr=""        Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
    -optTauSelectFile=""   Specify selective instrumentation file for tau_instrumentor
    -optTau=""             Specify options for tau_instrumentor
    -optCompile=""         Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
    -optLinking=""         Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
    -optNoMpi              Removes -l*mpi* libraries during linking (default)
    -optKeepFiles          Does not remove intermediate .pdb and .inst.* files
• e.g.,
    % setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose -optPdtCOpts="-I/home -DFOO"'
    % setenv TAU_MAKEFILE /usr/local/tau-2.15.4/ia64/lib/Makefile.tau-icpc-mpi-pdt
    % tau_cxx.sh matrix.cpp -o matrix -lm
    % tau_f90.sh foo.o bar.o -o app -lm
Optimization of Instrumentation Overhead • Group routines into profile groups, runtime selection of profiling groups • Instrument sections of code selectively • Exclude or include list of routines fed to the instrumentor – controlled manually or automatically • Rule based control of instrumentation • Generate selective instrumentation file by examining performance data from a previous run
tau_reduce: Rule-Based Overhead Analysis
• Analyze the performance data to find events whose measurements carry high (relative) overhead
• Create a select list for excluding those events
• Rule grammar (used in the tau_reduce tool)
    [GroupName:]Field Operator Number
  • GroupName indicates the rule applies to events in that group
  • Field is an event metric attribute (from profile statistics)
    • numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  • Operator is one of >, <, or =
  • Number is any number
  • Compound rules possible using & between simple rules
Optimizing Instrumentation Overhead: Examples
• # Exclude all events that are members of TAU_USER and use less than 1000 microseconds
    TAU_USER:usec < 1000
• # Exclude all events that use less than 1000 microseconds and are called only once
    usec < 1000 & numcalls = 1
• # Exclude all events that have less than 1000 usecs per call OR have a (total inclusive) percent less than 5
    usecs/call < 1000
    percent < 5
• Scientific notation can be used
    usec>1000 & numcalls>400000 & usecs/call<30 & percent>25
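The rule grammar can be illustrated with a small evaluator. The following Python sketch is not the tau_reduce source; it assumes, based on the examples, that clauses joined by & must all hold, that rules on separate lines act as alternatives (OR), and that text after # is a comment. The event representation (a dict with a `groups` set and one key per field) and the function names are hypothetical.

```python
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

# [GroupName:]Field Operator Number
CLAUSE = re.compile(r"^\s*(?:(\w+)\s*:)?\s*([\w/]+)\s*([<>=])\s*(\S+)\s*$")


def parse_rules(text):
    """Return a list of rules; each rule is a list of (group, field, op, number)."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        clauses = []
        for part in line.split("&"):
            match = CLAUSE.match(part)
            if match is None:
                raise ValueError(f"cannot parse rule clause: {part!r}")
            group, field, op, number = match.groups()
            # float() also accepts scientific notation such as 4e5
            clauses.append((group, field, OPS[op], float(number)))
        rules.append(clauses)
    return rules


def is_excluded(event, rules):
    """True if any rule matches; a rule matches when all of its clauses hold."""
    for clauses in rules:
        if all(
            (group is None or group in event["groups"]) and op(event[field], number)
            for group, field, op, number in clauses
        ):
            return True
    return False


if __name__ == "__main__":
    rules = parse_rules("TAU_USER:usec < 1000\nusec < 1000 & numcalls = 1\n")
    event = {"groups": {"TAU_USER"}, "usec": 250.0, "numcalls": 40}
    print(is_excluded(event, rules))
```

The names of excluded events would then be written between BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST in a selective instrumentation file.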
TAU_REDUCE • Reads profile files and rules • Creates selective instrumentation file • Specifies which routines should be excluded from instrumentation [Figure: rules + profile → tau_reduce → selective instrumentation file]
Instrumentation Specification
    % tau_instrumentor
    Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <instr_req_file> ]
For selective instrumentation, use the -f option
    % tau_instrumentor foo.pdb foo.cpp -o foo.inst.cpp -f selective.dat
    % cat selective.dat
    # Selective instrumentation: Specify an exclude/include list of routines/files.
    BEGIN_EXCLUDE_LIST
    void quicksort(int *, int, int)
    void sort_5elements(int *)
    void interchange(int *, int *)
    END_EXCLUDE_LIST
    BEGIN_FILE_INCLUDE_LIST
    Main.cpp
    Foo?.c
    *.C
    END_FILE_INCLUDE_LIST
    # Instruments routines in Main.cpp, Foo?.c and *.C files only
    # Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST
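How such a file drives the instrumentation decision can be sketched in a few lines of Python. This is a simplified model and not tau_instrumentor's logic: it handles only the two list types shown in selective.dat, assumes routine names are compared as exact strings, and assumes the file patterns behave like shell-style wildcards. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_selective(text):
    """Collect the routine exclude list and the file include list."""
    spec = {"exclude": set(), "file_include": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line == "BEGIN_EXCLUDE_LIST":
            section = "exclude"
        elif line == "BEGIN_FILE_INCLUDE_LIST":
            section = "file_include"
        elif line in ("END_EXCLUDE_LIST", "END_FILE_INCLUDE_LIST"):
            section = None
        elif section == "exclude":
            spec["exclude"].add(line)
        elif section == "file_include":
            spec["file_include"].append(line)
    return spec


def should_instrument(routine, filename, spec):
    """Instrument unless the routine is excluded or its file is not included."""
    if routine in spec["exclude"]:
        return False
    patterns = spec["file_include"]
    # An empty file include list is taken to mean "all files".
    return not patterns or any(fnmatchcase(filename, p) for p in patterns)


if __name__ == "__main__":
    spec = parse_selective(
        "BEGIN_EXCLUDE_LIST\n"
        "void interchange(int *, int *)\n"
        "END_EXCLUDE_LIST\n"
        "BEGIN_FILE_INCLUDE_LIST\n"
        "Main.cpp\n"
        "END_FILE_INCLUDE_LIST\n"
    )
    print(should_instrument("int main(int, char **)", "Main.cpp", spec))
```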
Optimization of Instrumentation Overhead (contd.)
• Runtime throttling of events based on rule
  • numcalls > ThresholdA and time per call < ThresholdB
    setenv TAU_THROTTLE 1
    setenv TAU_THROTTLE_NUMCALLS <no>
    setenv TAU_THROTTLE_PERCALL <value>
• Default values:
  • <no> = 100000 calls
  • <value> = 10 microseconds per call
• Once a routine meets these conditions, its instrumentation is disabled at runtime and it is put in the TAU_DISABLE group
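The throttling rule can be sketched as follows. This Python model only illustrates the rule stated on the slide (number of calls above one threshold and time per call below another); the `Throttle` class and `throttle_from_env` helper are hypothetical, and treating the value "1" as the switch that enables throttling is an assumption taken from the setenv example.

```python
class Throttle:
    """Disable measurement of routines that are called often and run briefly."""

    def __init__(self, numcalls=100000, percall_usec=10.0):
        self.numcalls = numcalls          # ThresholdA: number of calls
        self.percall_usec = percall_usec  # ThresholdB: microseconds per call
        self.stats = {}                   # routine -> [calls, inclusive usec]
        self.disabled = set()             # stands in for the TAU_DISABLE group

    def is_enabled(self, name):
        return name not in self.disabled

    def record(self, name, elapsed_usec):
        """Account for one completed call and apply the throttling rule."""
        if name in self.disabled:
            return
        entry = self.stats.setdefault(name, [0, 0.0])
        entry[0] += 1
        entry[1] += elapsed_usec
        calls, total = entry
        if calls > self.numcalls and total / calls < self.percall_usec:
            self.disabled.add(name)


def throttle_from_env(env):
    """Build a Throttle from environment-style settings, or None if disabled."""
    if env.get("TAU_THROTTLE") != "1":
        return None
    return Throttle(
        numcalls=int(env.get("TAU_THROTTLE_NUMCALLS", 100000)),
        percall_usec=float(env.get("TAU_THROTTLE_PERCALL", 10)),
    )


if __name__ == "__main__":
    throttle = throttle_from_env({"TAU_THROTTLE": "1", "TAU_THROTTLE_NUMCALLS": "3"})
    for _ in range(5):
        throttle.record("tiny_routine", 2.0)
    print(throttle.is_enabled("tiny_routine"))
```

Because the decision is made from running totals, a routine keeps its profile data up to the point of disabling; only later calls go unmeasured.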
EPILOG Tracing Optimization • TAU and Epilog Tracing Package • TAU can generate Epilog trace files • configure -epilog=<dir> -TRACE … • Epilog uses its own MPI wrapper library • Events are analyzed by Expert to detect performance bottlenecks automatically • Output is a CUBE profile file with callpath information • CUBE output read by CUBE GUI and TAU’s ParaProf profile browser • Expert discards all events that do not invoke an MPI call directly or indirectly • Optimization opportunity for instrumentation
Runtime Instrumentation Control • When TAU is configured with the -MPITRACE configuration option (without EPILOG support) • TAU stores events and wallclock time in a buffer • Defers writing the buffer to disk until an MPI call takes place • Events directly in the callstack of the MPI call are enabled and written to disk • Other events are discarded • TAU traces are converted to Epilog traces (tau2elg) • Expert has a minimal set of events
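The deferred-write idea can be modeled in a few lines. The Python sketch below is a simplified illustration, not TAU's trace buffer: it keeps pending entry records on the callstack itself, writes them only when an MPI routine is entered, and silently drops routines that return without ever having had an MPI call beneath them. Recognizing MPI routines by the `MPI_` name prefix and the `DeferredTrace` class are assumptions made for the example.

```python
class DeferredTrace:
    """Keep only events that lie on the callstack of some MPI call."""

    def __init__(self):
        self.written = []   # records that would be written to disk
        self.stack = []     # frames: [name, enter_time, already_written]

    def enter(self, name, timestamp):
        self.stack.append([name, timestamp, False])
        if name.startswith("MPI_"):       # assumption: prefix identifies MPI
            self._flush_callstack()

    def _flush_callstack(self):
        # Write the pending entry record of every routine on the callstack.
        for frame in self.stack:
            if not frame[2]:
                self.written.append(("enter", frame[0], frame[1]))
                frame[2] = True

    def exit(self, name, timestamp):
        frame = self.stack.pop()
        if frame[0] != name:
            raise ValueError(f"unbalanced exit: expected {frame[0]!r}, got {name!r}")
        # Only routines whose entry was written get a matching exit record;
        # all others are discarded without ever reaching the trace.
        if frame[2]:
            self.written.append(("exit", name, timestamp))


if __name__ == "__main__":
    trace = DeferredTrace()
    trace.enter("main", 0)
    trace.enter("local_work", 1)
    trace.exit("local_work", 2)      # never led to MPI: discarded
    trace.enter("exchange", 3)
    trace.enter("MPI_Send", 4)       # triggers the write of main and exchange
    trace.exit("MPI_Send", 5)
    trace.exit("exchange", 6)
    trace.exit("main", 7)
    for record in trace.written:
        print(record)
```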
Callpath Profiling Based Selective Instrumentation • TAU is configured with -PROFILECALLPATH • Env. variable TAU_CALLPATH_DEPTH set to a large value • Callpaths rooted at “main” • TAU profiles analyzed to produce an “include list” • list of routines that should be instrumented (tauinc.sh) [F. Wolf] • Events that call an MPI routine directly/indirectly • TAU generates EPILOG traces • Expert analyzes EPILOG traces to produce CUBE profiles • ParaProf and CUBE browsers read CUBE files • PerfDMF performance database stores bottleneck results
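Deriving an include list from callpath data can be sketched as below. This Python example is not the tauinc.sh script; it assumes callpath names are strings whose routines are joined by " => " and that MPI routines can be recognized by the `MPI_` prefix. Both conventions, and the function names, are assumptions for illustration.

```python
def is_mpi(name):
    """Assumption: MPI routines are identified by their name prefix."""
    return name.startswith("MPI_")


def include_list(callpaths, separator=" => "):
    """Return routines that call an MPI routine directly or indirectly."""
    keep = set()
    for path in callpaths:
        routines = [r.strip() for r in path.split(separator)]
        for index, routine in enumerate(routines):
            if is_mpi(routine):
                # Every non-MPI ancestor on this path leads to an MPI call.
                keep.update(r for r in routines[:index] if not is_mpi(r))
    return sorted(keep)


if __name__ == "__main__":
    paths = [
        "main",
        "main => solve",
        "main => solve => exchange => MPI_Send",
        "main => io => write_block",
    ]
    for routine in include_list(paths):
        print(routine)
```

The MPI routines themselves are left out of the result on the assumption that they are captured by the MPI wrapper library rather than by source instrumentation.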
Conclusions • Optimization of instrumentation is critical for balancing the volume of performance data generated • Several techniques for reducing the amount of instrumentation
Support Acknowledgements • Department of Energy (DOE) • Office of Science contracts • University of Utah ASC Level 1 sub-contract • LLNL ASC/NNSA Level 3 contract • LLNL ParaTools/GWT contract • NSF • High-End Computing Grant • T.U. Dresden, GWT • Dr. Wolfgang Nagel and Holger Brunst • Research Centre Juelich • Dr. Bernd Mohr, Dr. Felix Wolf • Los Alamos National Laboratory contracts