
Outline

Using TAU on SiCortex. Alan Morris, Aroon Nataraj, Sameer Shende, Allen D. Malony, University of Oregon, {amorris, anataraj, sameer, malony}@cs.uoregon.edu. Outline: What is TAU?; Instrumentation; Measurement; Invoking on SiCortex (tauex); Demo (sweep3D); Visualizing performance results.



  1. Using TAU on SiCortex • Alan Morris, Aroon Nataraj, Sameer Shende, Allen D. Malony • University of Oregon • {amorris, anataraj, sameer, malony}@cs.uoregon.edu

  2. Outline • What is TAU? • Instrumentation • Measurement • Invoking on SiCortex - tauex • Demo • sweep3D • Visualizing performance results

  3. TAU Performance System • Tuning and Analysis Utilities (13+ year project effort) • Performance measurement framework for HPC systems • Portable, scalable, flexible, and parallel • Integrated toolkit for performance problem solving • Automatic instrumentation • Highly configurable measurement system with support for many flavors of profiling and tracing • Portable analysis and visualization tools • Performance data management and data mining • http://tau.uoregon.edu

  4. TAU Instrumentation Approach • Support for standard program events • Routines • Classes and templates • Finer-grain -- loop-level • Support for user-defined events • Begin/End events (“user-defined timers”) • Atomic events (e.g., size of memory allocated/freed) • Support for event groups • Selective examination of performance data • Runtime disabling of groups • Instrumentation optimization • Selective instrumentation of events (only instrument needed) • tau_reduce - generate selective instrumentation file
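The difference between begin/end events and atomic events on the slide above can be illustrated with a toy profiler. This is only a sketch: `ToyProfiler` and all of its methods are hypothetical names invented for illustration and are not the TAU API. A timer accumulates elapsed time between a start and a stop, an atomic event records a single value (such as an allocation size) at one point, and a group can be disabled at runtime.

```python
import time
from collections import defaultdict


class ToyProfiler:
    """Toy stand-in for a measurement library (hypothetical, not the TAU API)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._open = {}                  # timer name -> start timestamp
        self.timers = defaultdict(lambda: {"calls": 0, "seconds": 0.0})
        self.atomic = defaultdict(list)  # event name -> recorded values
        self.disabled_groups = set()

    def start(self, name, group="default"):
        """Begin event of a user-defined timer."""
        if group in self.disabled_groups:    # runtime disabling of a group
            return
        self._open[name] = self._clock()

    def stop(self, name):
        """End event of a user-defined timer."""
        started = self._open.pop(name, None)
        if started is None:                  # never started, or group disabled
            return
        entry = self.timers[name]
        entry["calls"] += 1
        entry["seconds"] += self._clock() - started

    def trigger(self, name, value):
        """Atomic event: record one value at one point in the run."""
        self.atomic[name].append(value)

    def atomic_summary(self, name):
        values = self.atomic[name]
        return {"count": len(values), "min": min(values),
                "max": max(values), "mean": sum(values) / len(values)}
```

A caller would wrap a region in `start("solve")` / `stop("solve")` and call `trigger("bytes allocated", n)` wherever memory is allocated; adding a group name to `disabled_groups` makes later timers in that group no-ops.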

  5. TAU Instrumentation • Flexible instrumentation mechanisms at multiple levels • Source code • manual (TAU API, CCA TAU Component API) • automatic • C, C++, F77/90/95 (Program Database Toolkit (PDT)) • OpenMP (directive rewriting (Opari), POMP spec) • Library level • pre-instrumented libraries (e.g., MPI using PMPI) • statically-linked and dynamically-linked • Executable code • dynamic instrumentation (pre-execution) (DynInstAPI) • virtual machine instrumentation (e.g., Java using JVMPI) • Runtime Linking (LD_PRELOAD)

  6. Automatic Instrumentation • We provide compiler wrapper scripts • Simply replace mpif90 with tauf90 • Automatically instruments Fortran source code and links with the TAU MPI wrapper libraries • Use taucc and taucxx for C/C++

     Before:
         CXX = mpicxx
         F90 = mpif90
         CFLAGS =
         LIBS = -lm
         OBJS = f1.o f2.o f3.o … fn.o
         app: $(OBJS)
                 $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
         .cpp.o:
                 $(CXX) $(CFLAGS) -c $<

     After:
         CXX = taucxx
         F90 = tauf90
         CFLAGS =
         LIBS = -lm
         OBJS = f1.o f2.o f3.o … fn.o
         app: $(OBJS)
                 $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
         .cpp.o:
                 $(CXX) $(CFLAGS) -c $<

  7. Measurement Options • Flat profiles • Time (or counts) spent in each routine (nodes in callgraph). • Exclusive/inclusive time, no. of calls, child calls • Support for hardware counters (PAPI), multiple counters. • Callpath Profiles • Flat profiles, plus • Time spent along a calling path (edges in callgraph) • E.g., “main=> f1 => f2 => MPI_Send” shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when it is called by main. • Configurable callpath depth limit (TAU_CALLPATH_DEPTH environment variable) • Tracing • VAMPIRTRACE • TAU Trace format; Converters: tau2slog2, tau2vtf, tau2otf
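How a callpath profile relates to a flat profile, and what a depth limit does, can be sketched from a stream of enter/exit events. The function below is a hypothetical illustration, not TAU's implementation; in particular, it assumes that a depth limit keeps the innermost frames of each path, and the event tuples and timestamps are made up.

```python
from collections import defaultdict


def callpath_profile(events, depth=2):
    """Accumulate inclusive time per calling path, truncated to `depth` frames.

    `events` is a list of (kind, routine, timestamp) tuples from one thread,
    where kind is "enter" or "exit" and the events are properly nested.
    ASSUMPTION: truncation keeps the innermost `depth` frames of the path.
    """
    stack = []                      # (routine, enter timestamp)
    profile = defaultdict(float)    # "a => b" -> inclusive time
    for kind, routine, stamp in events:
        if kind == "enter":
            stack.append((routine, stamp))
        else:
            name, started = stack.pop()
            assert name == routine, "events must be properly nested"
            frames = [r for r, _ in stack] + [name]
            key = " => ".join(frames[-depth:])
            profile[key] += stamp - started
    return dict(profile)


# Made-up event stream for the path main => f1 => f2 => MPI_Send.
events = [
    ("enter", "main", 0), ("enter", "f1", 1), ("enter", "f2", 2),
    ("enter", "MPI_Send", 3), ("exit", "MPI_Send", 5),
    ("exit", "f2", 6), ("exit", "f1", 8), ("exit", "main", 10),
]
```

With `depth=1` every key is a single routine, which is the flat profile; with `depth=4` the time in `MPI_Send` is attributed to the full path `main => f1 => f2 => MPI_Send`.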

  8. Running Applications with TAU on SiCortex • New tool, tauex

     Usage: tauex [options] [--] <exe> <exe options>
     Options:
       -d: Enable debugging output, use repeatedly for more output.
       -h: Print this message.
       -i: Print information about the host machine.
       -s: Dump the shell environment variables and exit.
       -U: User mode counts
       -K: Kernel mode counts
       -S: Supervisor mode counts
       -I: Interrupt mode counts
       -l: List events
       -L <event> : Describe event
       -a: Count all native events (implies -m)
       -m: Multiple runs (enough runs of exe to gather all events)
       -e <event> : Specify PAPI preset or native event
       -T <OPENMP,PROFILE,CALLPATH,TRACE,VAMPIRTRACE,EPILOG,DISABLE> : specify TAU option
       -v: Debug/Verbose mode
       -XrunTAU-<options> : specify TAU library directly

  9. Demo • Application Sweep3D • Build standard sweep3d application (un-instrumented) • Run standard (un-instrumented) sweep3d using tauex • Provides MPI-only profiles • Build sweep3d with automatic TAU instrumentor • Run instrumented sweep3d with tauex • Provides application-level and MPI events

  10. Using tauex • Use with uninstrumented executable for MPI profiling and tracing

      $ mpif90 ring.f90 -o ring
      $ srun -pscx -n4 tauex ./ring
      $ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME
      $ pprof
      …
      FUNCTION SUMMARY (mean):
      ---------------------------------------------------------------------------------------
      %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                    msec   total msec                          usec/call
      ---------------------------------------------------------------------------------------
      100.0            3          338           1           8     338086 .TAU application
       59.5          201          201           1           0     201180 MPI_Init()
       37.2          125          125           1           0     125889 MPI_Finalize()
        2.1            6            6           1           0       6971 MPI_Barrier()
        0.2        0.626        0.626           1           0        626 MPI_Recv()
        0.1        0.247        0.247           1           0        247 MPI_Bcast()
        0.0        0.102        0.102           1           0        102 MPI_Send()
        0.0      0.00875      0.00875           1           0          9 MPI_Comm_size()
        0.0      0.00525      0.00525           1           0          5 MPI_Comm_rank()
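As a reading aid for the listing above, the %Time column appears to be each routine's inclusive time divided by the inclusive time of the top-level `.TAU application` entry. The sketch below checks that interpretation against the numbers in the listing; the relationship is inferred from the data, not taken from pprof documentation, and `percent_of_total` is a hypothetical helper name.

```python
# (name, inclusive usec/call) copied from the pprof listing; one call each.
rows = [
    (".TAU application", 338086),
    ("MPI_Init()", 201180),
    ("MPI_Finalize()", 125889),
    ("MPI_Barrier()", 6971),
    ("MPI_Recv()", 626),
    ("MPI_Bcast()", 247),
    ("MPI_Send()", 102),
    ("MPI_Comm_size()", 9),
    ("MPI_Comm_rank()", 5),
]


def percent_of_total(rows, root=".TAU application"):
    """Inclusive time of each routine as a percentage of the root's inclusive time."""
    total = dict(rows)[root]
    return {name: round(100.0 * usec / total, 1) for name, usec in rows}


for name, pct in percent_of_total(rows).items():
    print(f"{pct:5.1f}  {name}")
```

The computed values match the %Time column row for row (59.5 for `MPI_Init()`, 37.2 for `MPI_Finalize()`, 2.1 for `MPI_Barrier()`), which shows that this run is dominated by MPI startup and shutdown.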

  11. Using TAU compiler wrappers

      $ tauf90 ring.f90 -o ring    # verbose output shows each step
      Debug: Parsing with PDT Parser
      Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include
      Debug: Instrumenting with TAU
      Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90
      Debug: Compiling (Individually) with Instrumented Code
      Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o
      Debug: Linking (Together) object files
      Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread -L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring
      Debug: cleaning inst file
      Executing> /bin/rm -f ring.inst.f90
      Debug: cleaning PDB file
      Executing> /bin/rm -f ring.pdb

  12. Using tauex • Use with instrumented executables for Application+MPI profiling and tracing

      $ tau_f90.sh ring.f90 -o ring
      $ srun -pscx -n4 tauex -e CPU_CYCLES ./ring
      $ cd ring.tau.1674/MULTI__CPU_CYCLES
      $ pprof
      FUNCTION SUMMARY (mean):
      ---------------------------------------------------------------------------------------
      %Time    Exclusive    Inclusive       #Call      #Subrs  Count/Call Name
                  counts total counts
      ---------------------------------------------------------------------------------------
      100.0    4.552E+05    2.557E+07           1           5    25568675 MAIN
       81.1    2.074E+07    2.074E+07           1           0    20742874 MPI_Init()
       16.3    1.419E+05    4.176E+06           1           4     4176389 FUNC
       14.2    3.639E+06    3.639E+06           1           0     3638847 MPI_Barrier()
        0.8    2.091E+05    2.091E+05           1           0      209063 MPI_Recv()
        0.7    1.875E+05    1.875E+05           1           0      187542 MPI_Finalize()
        0.6    1.539E+05    1.539E+05           1           0      153880 MPI_Bcast()
        0.1     3.27E+04     3.27E+04           1           0       32702 MPI_Send()
        0.0         4337         4337           1           0        4337 MPI_Comm_size()
        0.0         2308         2308           1           0        2308 MPI_Comm_rank()
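The exclusive and inclusive columns in this listing are related by a simple rule: a routine's exclusive count is its inclusive count minus the inclusive counts of the routines it calls. The worked sketch below applies that rule to the CPU_CYCLES numbers. The call tree is an assumption inferred from the #Subrs column (5 children for MAIN, 4 for FUNC); the flat listing itself does not say which routine calls which.

```python
# Inclusive CPU_CYCLES per call, copied from the Count/Call column.
inclusive = {
    "MAIN": 25568675,
    "MPI_Init()": 20742874,
    "FUNC": 4176389,
    "MPI_Barrier()": 3638847,
    "MPI_Recv()": 209063,
    "MPI_Finalize()": 187542,
    "MPI_Bcast()": 153880,
    "MPI_Send()": 32702,
    "MPI_Comm_size()": 4337,
    "MPI_Comm_rank()": 2308,
}

# ASSUMPTION: this call tree is inferred, not stated in the flat profile.
children = {
    "MAIN": ["MPI_Init()", "FUNC", "MPI_Finalize()",
             "MPI_Comm_size()", "MPI_Comm_rank()"],
    "FUNC": ["MPI_Barrier()", "MPI_Recv()", "MPI_Bcast()", "MPI_Send()"],
}


def exclusive(name):
    """Inclusive count minus the inclusive counts of the direct children."""
    return inclusive[name] - sum(inclusive[c] for c in children.get(name, []))


print(exclusive("MAIN"))   # 455225, listed as 4.552E+05
print(exclusive("FUNC"))   # 141897, listed as 1.419E+05
```

Under the assumed tree, both results agree with the Exclusive column, which suggests that the application code in MAIN and FUNC accounts for only a small share of the cycles compared with the MPI calls.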

  13. Using tauex • Other typical usage scenarios

      # Floating point instruction counts and time (compute derived FLOPS in ParaProf)
      $ tauex -e P_WALL_CLOCK_TIME -e PAPI_FP_INS <app>

      # Generate callpath profiles
      $ tauex -e PAPI_FP_INS -T callpath <app>

      # Generate OTF traces for Vampir
      $ tauex -e PAPI_FP_INS -T vampirtrace <app>

      # Generate Epilog traces for Kojak
      $ tauex -T epilog <app>
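The derived FLOPS metric mentioned in the first scenario is the ratio of the two measured metrics. A minimal sketch of the arithmetic follows; it assumes that P_WALL_CLOCK_TIME is reported in microseconds, `derived_flops` is a hypothetical helper rather than a ParaProf function, and the sample values are invented.

```python
def derived_flops(fp_ins, wall_clock_usec):
    """Floating point instructions per second for one routine.

    ASSUMPTION: wall clock time is given in microseconds.
    """
    if wall_clock_usec <= 0:
        raise ValueError("wall clock time must be positive")
    return fp_ins / (wall_clock_usec / 1e6)


# Invented sample: 3,000,000 FP instructions in 1.5 seconds of wall clock time.
rate = derived_flops(3_000_000, 1_500_000)
print(f"{rate / 1e6:.1f} MFLOPS")
```

Because both metrics are recorded per routine in the same run, the same division can be applied to every row of the profile to find routines with unusually low floating point throughput.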

  14. Using TAU with FLASH on SiCortex • To use TAU with FLASH on SiCortex platforms, simply specify -tau=<path/to/stub/makefile> in setup

      # On Full Disclosure:
      $ ./setup Sedov -2d -auto -site=sicortex -objdir=tau -tau=/home/amorris/usr/share/TAU/64/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt

      # Build
      $ cd tau ; make

      # Run (using tauex)
      # This will generate callpath profiles with time and floating point instruction metrics
      $ srun -pscx -n16 tauex -T callpath -e P_WALL_CLOCK_TIME -e PAPI_FP_INS ./flash3

  15. Using ParaProf • Not yet available on SiCortex mips nodes (requires Java) • For now, tar up the <app>.tau.<jobid> directory and copy it to another machine • ParaProf can be run on Linux, Windows, or Mac • If you don’t have TAU/paraprof, but have Java Web Start, visit http://tau.uoregon.edu/paraprof to run it

      scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578
      scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else:
      scx-m32-n6> ssh somewhere-else
      se> tar -xzf flash3.tau.1578.tar.gz
      se> paraprof flash3.tau.1578

  16. ParaProf – Full Profile (FLASH) [figure; annotated regions: MPI_Barrier, IO routines]

  17. ParaProf - Statistics Table (Uintah)

  18. ParaProf – Callgraph View (MFIX)

  19. ParaProf – Histogram View (Miranda) • Scalable 2D displays [figure panels: 16k processors, 8k processors]

  20. ParaProf – 3D Full Profile (Miranda) [figure: 16k processors]

  21. ParaProf – 3D Scatterplot (Miranda) • Each point is a “thread” of execution • Relation between four routines shown at once

  22. Tracing (Vampir) - Uintah • Trace analysis provides in-depth understanding of temporal event and message passing relationships • Traces can even store hardware counters

  23. VNG Timeline Display (Miranda on BGL)

  24. Thank You • TAU should soon be part of the SiCortex standard install • Check out: http://tau.uoregon.edu
