Outline

Using TAU on SiCortex
Alan Morris, Aroon Nataraj, Sameer Shende, Allen D. Malony
University of Oregon
{amorris, anataraj, sameer, malony}@cs.uoregon.edu

Outline
• What is TAU?
• Instrumentation
• Measurement
• Invoking on SiCortex - tauex
• Demo
  • sweep3D
• Visualizing performance results

TAU Performance System
• Tuning and Analysis Utilities (13+ year project effort)
• Performance measurement framework for HPC systems
  • Portable, scalable, flexible, and parallel
• Integrated toolkit for performance problem solving
  • Automatic instrumentation
  • Highly configurable measurement system with support for many flavors of profiling and tracing
  • Portable analysis and visualization tools
  • Performance data management and data mining
• http://tau.uoregon.edu

TAU Instrumentation Approach
• Support for standard program events
  • Routines
  • Classes and templates
  • Finer-grain (loop-level)
• Support for user-defined events
  • Begin/End events (“user-defined timers”)
  • Atomic events (e.g., size of memory allocated/freed)
• Support for event groups
  • Selective examination of performance data
  • Runtime disabling of groups
• Instrumentation optimization
  • Selective instrumentation of events (instrument only what is needed)
  • tau_reduce: generates a selective instrumentation file

TAU Instrumentation
• Flexible instrumentation mechanisms at multiple levels
• Source code
  • manual (TAU API, CCA TAU Component API)
  • automatic
    • C, C++, F77/90/95 (Program Database Toolkit (PDT))
    • OpenMP (directive rewriting (Opari), POMP spec)
• Library level
  • pre-instrumented libraries (e.g., MPI using PMPI)
  • statically-linked and dynamically-linked
• Executable code
  • dynamic instrumentation (pre-execution) (DynInstAPI)
  • virtual machine instrumentation (e.g., Java using JVMPI)
• Runtime Linking (LD_PRELOAD)

Automatic Instrumentation
• We provide compiler wrapper scripts
• Simply replace mpif90 with tauf90
  • Automatically instruments Fortran source code and links with the TAU MPI wrapper libraries
• Use taucc and taucxx for C/C++

Before:

    CXX = mpicxx
    F90 = mpif90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

After (only the compiler variables change):

    CXX = taucxx
    F90 = tauf90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

Measurement Options
• Flat profiles
  • Time (or counts) spent in each routine (nodes in callgraph)
  • Exclusive/inclusive time, number of calls, child calls
  • Support for hardware counters (PAPI), multiple counters
• Callpath profiles
  • Flat profiles, plus
  • Time spent along a calling path (edges in callgraph)
  • E.g., “main => f1 => f2 => MPI_Send” shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when f1 is called by main
  • Configurable callpath depth limit (TAU_CALLPATH_DEPTH environment variable)
• Tracing
  • VAMPIRTRACE
  • TAU trace format; converters: tau2slog2, tau2vtf, tau2otf
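To make the depth limit concrete, the sketch below shortens a callpath name to a given depth. It assumes the limit keeps the innermost routines of the path (so a depth of 2 leaves parent and child); that reading, and the helper name truncate_callpath, are illustrative assumptions and not part of TAU itself.

```shell
# Hypothetical helper (not a TAU command): keep only the innermost DEPTH
# routines of a callpath written as "a => b => c".
truncate_callpath() {
    path=$1
    depth=$2
    printf '%s\n' "$path" | awk -v depth="$depth" '
        {
            # Split on "=>" with optional surrounding spaces.
            n = split($0, parts, / *=> */)
            start = n - depth + 1
            if (start < 1) start = 1
            out = parts[start]
            for (i = start + 1; i <= n; i++) out = out " => " parts[i]
            print out
        }'
}

# Example with the callpath from the slide, limited to depth 2.
truncate_callpath "main => f1 => f2 => MPI_Send" 2
```

With a depth larger than the path length, the helper returns the path unchanged.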

Running Applications with TAU on SiCortex
• New tool, tauex

    Usage: tauex [options] [--] <exe> <exe options>
    Options:
      -d: Enable debugging output, use repeatedly for more output.
      -h: Print this message.
      -i: Print information about the host machine.
      -s: Dump the shell environment variables and exit.
      -U: User mode counts
      -K: Kernel mode counts
      -S: Supervisor mode counts
      -I: Interrupt mode counts
      -l: List events
      -L <event> : Describe event
      -a: Count all native events (implies -m)
      -m: Multiple runs (enough runs of exe to gather all events)
      -e <event> : Specify PAPI preset or native event
      -T <OPENMP,PROFILE,CALLPATH,TRACE,VAMPIRTRACE,EPILOG,DISABLE> : specify TAU option
      -v: Debug/Verbose mode
      -XrunTAU-<options> : specify TAU library directly

Demo
• Application: Sweep3D
• Build standard sweep3d application (un-instrumented)
• Run standard (un-instrumented) sweep3d using tauex
  • Provides MPI-only profiles
• Build sweep3d with automatic TAU instrumentor
• Run instrumented sweep3d with tauex
  • Provides application-level and MPI events

Using tauex
• Use with uninstrumented executable for MPI profiling and tracing

    $ mpif90 ring.f90 -o ring
    $ srun -pscx -n4 tauex ./ring
    $ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME
    $ pprof
    …
    FUNCTION SUMMARY (mean):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                  msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0            3          338           1           8     338086 .TAU application
     59.5          201          201           1           0     201180 MPI_Init()
     37.2          125          125           1           0     125889 MPI_Finalize()
      2.1            6            6           1           0       6971 MPI_Barrier()
      0.2        0.626        0.626           1           0        626 MPI_Recv()
      0.1        0.247        0.247           1           0        247 MPI_Bcast()
      0.0        0.102        0.102           1           0        102 MPI_Send()
      0.0      0.00875      0.00875           1           0          9 MPI_Comm_size()
      0.0      0.00525      0.00525           1           0          5 MPI_Comm_rank()
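A summary like the one above can be post-processed with ordinary text tools. The sketch below adds up the %Time column for the MPI routines. It assumes the seven-column row layout shown on this slide and that only the rows of a single summary are fed in; the helper name mpi_time_share is hypothetical and not part of TAU.

```shell
# Hypothetical helper (not a TAU command): sum the %Time column (field 1)
# of every row whose routine name (field 7) starts with "MPI_".
mpi_time_share() {
    awk '$7 ~ /^MPI_/ { total += $1 } END { printf "%.1f\n", total }'
}

# Rows copied from the pprof summary on this slide.
mpi_time_share <<'EOF'
100.0            3          338           1           8     338086 .TAU application
 59.5          201          201           1           0     201180 MPI_Init()
 37.2          125          125           1           0     125889 MPI_Finalize()
  2.1            6            6           1           0       6971 MPI_Barrier()
  0.2        0.626        0.626           1           0        626 MPI_Recv()
  0.1        0.247        0.247           1           0        247 MPI_Bcast()
  0.0        0.102        0.102           1           0        102 MPI_Send()
  0.0      0.00875      0.00875           1           0          9 MPI_Comm_size()
  0.0      0.00525      0.00525           1           0          5 MPI_Comm_rank()
EOF
```

For this run the MPI rows add up to 99.1, which matches the picture that an uninstrumented ring spends almost all of its time in MPI_Init and MPI_Finalize. Summing %Time is only meaningful here because the MPI routines are leaves, so their inclusive and exclusive times coincide.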

Using TAU compiler wrappers

    $ tauf90 ring.f90 -o ring     # verbose output shows each step

    Debug: Parsing with PDT Parser
    Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include

    Debug: Instrumenting with TAU
    Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90

    Debug: Compiling (Individually) with Instrumented Code
    Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o

    Debug: Linking (Together) object files
    Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread -L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring

    Debug: cleaning inst file
    Executing> /bin/rm -f ring.inst.f90

    Debug: cleaning PDB file
    Executing> /bin/rm -f ring.pdb

Using tauex
• Use with instrumented executables for Application+MPI profiling and tracing

    $ tau_f90.sh ring.f90 -o ring
    $ srun -pscx -n4 tauex -e CPU_CYCLES ./ring
    $ cd ring.tau.1674/MULTI__CPU_CYCLES
    $ pprof
    FUNCTION SUMMARY (mean):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Count/Call Name
                counts total counts
    ---------------------------------------------------------------------------------------
    100.0    4.552E+05    2.557E+07           1           5   25568675 MAIN
     81.1    2.074E+07    2.074E+07           1           0   20742874 MPI_Init()
     16.3    1.419E+05    4.176E+06           1           4    4176389 FUNC
     14.2    3.639E+06    3.639E+06           1           0    3638847 MPI_Barrier()
      0.8    2.091E+05    2.091E+05           1           0     209063 MPI_Recv()
      0.7    1.875E+05    1.875E+05           1           0     187542 MPI_Finalize()
      0.6    1.539E+05    1.539E+05           1           0     153880 MPI_Bcast()
      0.1     3.27E+04     3.27E+04           1           0      32702 MPI_Send()
      0.0         4337         4337           1           0       4337 MPI_Comm_size()
      0.0         2308         2308           1           0       2308 MPI_Comm_rank()

Using tauex
• Other typical usage scenarios

    # Floating point instruction counts and time
    # (compute derived FLOPS in ParaProf)
    $ tauex -e P_WALL_CLOCK_TIME -e PAPI_FP_INS <app>

    # Generate callpath profiles
    $ tauex -e PAPI_FP_INS -T callpath <app>

    # Generate OTF traces for Vampir
    $ tauex -e PAPI_FP_INS -T vampirtrace <app>

    # Generate Epilog traces for Kojak
    $ tauex -T epilog <app>
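The derived FLOPS metric mentioned in the first scenario is a ratio of the two measured metrics. The sketch below shows that arithmetic outside ParaProf. It assumes the wall-clock metric is reported in microseconds, so instructions per microsecond equal millions of instructions per second; it also treats each counted floating point instruction as one operation, which is an approximation. The helper name derived_mflops and the input values are hypothetical.

```shell
# Hypothetical helper (not a TAU command): floating point instructions
# divided by elapsed microseconds gives millions of instructions per second.
derived_mflops() {
    awk -v ins="$1" -v usec="$2" 'BEGIN {
        if (usec <= 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", ins / usec
    }'
}

# Illustrative values only: 2,500,000 instructions in 500,000 microseconds.
derived_mflops 2500000 500000
```

In practice the two inputs would be the per-routine PAPI_FP_INS and P_WALL_CLOCK_TIME values taken from the same profile.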

Using TAU with FLASH on SiCortex
• To use TAU with FLASH on SiCortex platforms, simply specify -tau=<path/to/stub/makefile> in setup

    # On Full Disclosure:
    $ ./setup Sedov -2d -auto -site=sicortex -objdir=tau -tau=/home/amorris/usr/share/TAU/64/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt

    # Build
    $ cd tau ; make

    # Run (using tauex)
    # This will generate callpath profiles with time and floating point instruction metrics
    $ srun -pscx -n16 tauex -T callpath -e P_WALL_CLOCK_TIME -e PAPI_FP_INS ./flash3

Using ParaProf
• Not yet available on SiCortex mips nodes (requires Java)
• For now, tar up the <app>.tau.<jobid> directory and copy it to another machine
• ParaProf can be run on Linux, Windows, or Mac
• If you don’t have TAU/paraprof, but have Java Web Start, visit http://tau.uoregon.edu/paraprof to run it

    scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578
    scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else:
    scx-m32-n6> ssh somewhere-else
    se> tar -xzf flash3.tau.1578.tar.gz
    se> paraprof flash3.tau.1578

ParaProf – Full Profile (FLASH)
[Figure: full-profile display; callouts mark MPI_Barrier and IO routines]

ParaProf - Statistics Table (Uintah)

ParaProf – Callgraph View (MFIX)

ParaProf – Histogram View (Miranda)
• Scalable 2D displays
[Figure: histogram panels for 16k processors and 8k processors]

ParaProf – 3D Full Profile (Miranda)
[Figure: 3D full-profile display, 16k processors]

ParaProf – 3D Scatterplot (Miranda)
• Each point is a “thread” of execution
• Relation between four routines shown at once

Tracing (Vampir) - Uintah
• Trace analysis provides in-depth understanding of temporal event and message passing relationships
• Traces can even store hardware counters

VNG Timeline Display (Miranda on BGL)

Thank You
• TAU should soon be part of the SiCortex standard install
• Check out: http://tau.uoregon.edu
