Explore how Dynaprof helps optimize applications, identify bottlenecks, and generate performance signatures using hardware counters and resource usage traces.
Application Signatures and Performance Analysis of Scientific Applications • Philip J. Mucci • Innovative Computing Laboratory, UTK • Performance Evaluation Research Center, LBL • mucci@cs.utk.edu • http://icl.cs.utk.edu/~mucci/dynaprof/snapshots/sc2002.ppt • PAPI and Dynaprof
Goals • Understanding the behavior of the application • Identification of bottlenecks. • Usage of the hardware resources. • Effects of that usage on performance. • Using Dynaprof to achieve that goal • Command line usage • 3 Dynaprof probes • Wallclock Time • Hardware performance counters • Resource usage traces
Motivation • Optimize the application's performance. • Evaluate the algorithm's efficiency. • Generate an application signature. • A collection of data that represents the major terms in the performance model. • Develop a performance model.
Overview of Hardware Counters • Data is NOT PORTABLE, but PAPI is... • Small number of registers dedicated for performance monitoring functions. • AMD Athlon, 4 counters • Pentium <= III, 2 counters • Pentium IV, 18 counters • IA64, 4 counters • Alpha 21x64, 2 counters • Power 3, 8 counters • Power 4, 8 counters to a group • UltraSparc II, 2 counters • MIPS R14K, 2 counters
Applications used in this Tutorial • Serial: • FSPX: A binary alloy solidification benchmark. • SWIM: The SPEC shallow water benchmark. • Parallel (MPI): • Ex19 from the PETSc distribution. • Solves the nonlinear driven cavity problem with multigrid. A 2D driven cavity problem solved in a velocity-vorticity formulation.
FPSX Execution Environment • Intel PIII, 1.2 GHz • FP Results/Clock: 1 (1.2 Gflips) • 4 SP/clk with SSE, 2 DP/clk with SSE2 • Caches: 16K/16K, 256K • G77 version 2.96 -g -O -malign-double -mpentiumpro -funroll-loops -fexpensive-optimizations • Execution time: > /bin/time fspx 115.370u 0.030s 1:58.17 97.6% 0+0k 0+0io 162pf+0w
swim Execution Environment • IBM Nighthawk, 16-way Power 3, 375MHz • FP Results/Clock: 4 (1.5 Gflips) • Caches: 32K/64K, 8MB • MPI over TCP/IP via switch • Xlc 5.0.2.1 built with -g -O3 -qstrict -qarch=pwr3 -qtune=pwr3 • Execution time: > /bin/time poe swim -procs 2 0.4u 0.0s 0:15 3% 217+3933k 0+0io 1pf+0w
ex19 Execution Environment • IBM Nighthawk, 16-way Power 3, 375MHz • FP Results/Clock: 4 (1.5 Gflips) • Caches: 32K/64K, 8MB • Xlc 5.0.2.1 built with -g • Execution time: > /bin/time poe ex19 -procs 2 -da_grid_x 56 -da_grid_y 56 0.520u 0.200s 0:44.18 1.6% 297+3580k 0+0io 0pf+0w
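The peak rates quoted on the execution-environment slides follow from clock rate times floating-point results per clock. A minimal sketch of that arithmetic, using only the figures from the slides (the function name `peak_gflips` is an illustrative choice, not part of any tool):

```python
def peak_gflips(clock_mhz: float, fp_results_per_clock: float) -> float:
    """Peak floating-point rate in Gflips: clock (MHz) * results per clock / 1000."""
    return clock_mhz * fp_results_per_clock / 1000.0


# Clock rates and FP results per clock as stated on the slides.
machines = {
    "Intel PIII, 1.2 GHz": (1200.0, 1),
    "IBM Power 3, 375 MHz": (375.0, 4),
}

for name, (mhz, per_clock) in machines.items():
    print(f"{name}: {peak_gflips(mhz, per_clock):.1f} Gflips peak")
```

Comparing a measured floating-point rate against this peak gives a first rough indication of how well an application uses the hardware.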
Gprof • Gathers timer interrupts vs. text address. • Recompile with -p option. • Gprof profile is useful for a high level overview • Does it tell us why?
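To illustrate what "timer interrupts vs. text address" means, here is a small sketch of a sampling histogram: each timer tick records the program counter, and each sample is attributed to the function whose address range contains it. This is not Gprof's implementation. The addresses, the symbol table, and the sample list are all hypothetical, with only the function names borrowed from the tutorial's benchmark.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [(0x1000, "flux"), (0x1400, "proflux"), (0x1900, "pde"), (0x2000, "phase")]
TEXT_END = 0x2600  # hypothetical end of the text segment
STARTS = [start for start, _ in SYMBOLS]


def function_at(pc):
    """Map a sampled program counter to the function whose range contains it."""
    if pc < STARTS[0] or pc >= TEXT_END:
        return None
    idx = bisect.bisect_right(STARTS, pc) - 1
    return SYMBOLS[idx][1]


def histogram(samples):
    """Count samples per function; the share of samples approximates the share of time."""
    counts = Counter()
    for pc in samples:
        name = function_at(pc)
        if name is not None:
            counts[name] += 1
    return counts


# Hypothetical program-counter samples, one per timer interrupt.
samples = [0x1004, 0x1010, 0x1450, 0x1A00, 0x2100, 0x2104, 0x2108]
for name, n in histogram(samples).most_common():
    print(f"{name}: {n / len(samples):.0%} of samples")
```

A histogram like this shows where time is spent, but not why, which is the gap the hardware counter probes are meant to fill.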
FPSX: Top 4 functions • Top 4 functions make up 50% of execution time • In module update.F • flux • proflux • pde • In module phase.F • phase • Use the list command to explore modules and functions
Dynaprof Environment Variables • LD_LIBRARY_PATH: Colon-separated list of directories in which to look for shared libraries. We need to find: • DynInst library • PAPI library • Any dependencies of the above (libperfctr.so, libcpc.so). • DYNINSTAPI_RT_LIB: Full pathname of the DynInst runtime library. • No settings necessary for the AIX/DPCL port
Running Dynaprof • Usage: dynaprof [-d] [serial_application] • -d enables debugging output • Specifying an application automatically loads it into the tool immediately after initialization.
Command Line Interface • Uses the GNU Readline library for input • Full-featured command line editing • File and command completion: <Tab> • History: <Up>/<Down> • Settings, macros and aliases in ~/.inputrc • Allows Emacs or VI style bindings • set editing-mode emacs • set editing-mode vi • See the man page, Texinfo file or home page.
Load command • Starts the application and stops it at the first instruction. • Usage: load <application> [args] > dynaprof (dynaprof) load tests/fpsx
Poeload command • For use with MPI applications on AIX and DPCL. • DPCL < 3.2.5 requires full path • Usage: poeload <application> [args] (dynaprof) poeload tests/swim -procs 2
Mpiload command • For use with MPI applications. • Stops the application after it calls PMPI_Init(). • Mostly useful for script driven execution of MPI jobs • Usage: mpiload <application> [args] (dynaprof) mpiload tests/mpicount
Attach command • Attaches to a running application (or poe process) and stops it. • Usage: attach <application> <pid> (dynaprof) ^Z > tests/fspx & [2] 17500 > fg (dynaprof) attach tests/fspx 17500
Poeattach Command • For use with MPI applications on AIX and DPCL. • DPCL < 3.2.5 requires full path • Usage: poeattach <application> <pid_of_poe> (dynaprof) ^Z > poe ex19 -da_grid_x 56 -da_grid_y 56 -procs 2 & [2] 17500 > fg (dynaprof) poeattach ex19 17500
List command • list • List all modules in process • list <pattern> • List all matching modules • list <module> • List all functions in module • list <module> <pattern> • List all matching functions in module • list <module> <function> • List instrumentable points in function
Exploring FSPX • G77's Fortran runtime support • Code compiled with g77 without -g ends up in the DEFAULT_MODULE • Application code • Shared libraries
Exploring FSPX 2 • G77's Fortran runtime support • Code compiled with g77 without -g ends up in the DEFAULT_MODULE
Exploring FSPX 3 Function Calls
Use command • Loads a probe shared library into the address space (dynaprof) use [probe [args]] • use by itself displays the current probe. • To change options, respecify the probe. • 4 probes in this release • Wallclock: Real time clock • PAPI: Hardware metrics • Perfometer: Real-time visualization of streaming hardware metrics
Instr command • instr • list all instrumented functions • instr module <pattern> [arg] • Instrument all functions in modules matching pattern • instr function <module> <pattern> [arg] • Instrument all functions matching pattern in module
Threads and Dynaprof Probes • For threaded code, use the same probe! • Dynaprof detects threads and loads a special version of the probe library. • Each probe specifies what to do when a new thread is discovered. • Each thread gets the same instrumentation.
Probe Warning • Instrumentation is not free. • Consider granularity of region being measured. • Overhead for PAPI 2.3 is O(100) cycles. • Between 500 and 2000 cycles for a 2 counter read. • Overhead for Wallclock is O(100) cycles.
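The granularity warning can be made concrete with a rough estimate of how much of a measured region is probe overhead. This is a minimal sketch. It assumes one probe invocation per region execution and that the probe cost simply adds to the region's cycles. The region sizes are hypothetical, and the 500 and 2000 cycle costs are the bounds quoted above.

```python
def overhead_fraction(region_cycles: float, probe_cycles: float) -> float:
    """Fraction of the total measured cycles that is spent in the probe itself."""
    return probe_cycles / (region_cycles + probe_cycles)


# Hypothetical region sizes, from a tiny leaf function to a long-running loop nest.
for region in (1_000, 100_000, 10_000_000):
    low = overhead_fraction(region, 500)
    high = overhead_fraction(region, 2000)
    print(f"region of {region} cycles: {low:.2%} to {high:.2%} overhead")
```

Under these assumptions a region of only a few thousand cycles is dominated by the probe, while a region of millions of cycles is barely perturbed.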
Wallclock Probe • High resolution, low latency timer • Usage: use wallclockprobe • Reports time in microseconds (1.0×10⁻⁶ s).
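As an analogy for what an entry/exit timing probe records, the sketch below wraps a function with a timer and accumulates elapsed microseconds per function. It is not how Dynaprof instruments binaries, since Dynaprof inserts probes into the running executable. The decorator name `wallclock` and the `work` function are purely illustrative.

```python
import functools
import time
from collections import defaultdict

elapsed_us = defaultdict(float)  # accumulated microseconds per function name


def wallclock(func):
    """Record wallclock time between entry and exit of func, in microseconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return func(*args, **kwargs)
        finally:
            # Nanoseconds to microseconds; runs even if func raises.
            elapsed_us[func.__name__] += (time.perf_counter_ns() - start) / 1000.0
    return wrapper


@wallclock
def work(n):
    """Hypothetical workload standing in for an instrumented function."""
    return sum(i * i for i in range(n))


work(100_000)
print(f"work: {elapsed_us['work']:.1f} us")
```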
PAPI Probe • Counts PAPI Presets or Native Events • Usage: use papiprobe [event,event,...] • Default argument is PAPI_FP_INS, or PAPI_TOT_INS if the architecture doesn't support PAPI_FP_INS. • Available events can be obtained by using: papi_avail -a
PAPI Probe and Multiplexing • Requesting more metrics than there are physical counters automatically enables multiplexing. • A minimum runtime for instrumented regions must be observed, so that every virtual counter gets a chance to run at least once: runtime_min = num_events × 0.01 s • Automatic warning functionality is being rolled into PAPI.
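The multiplexing rule is simple arithmetic, sketched below. The 0.01 s slice comes from the formula above and the counter counts come from the hardware counter overview. The function names are illustrative and not part of PAPI or Dynaprof.

```python
def needs_multiplexing(num_events: int, physical_counters: int) -> bool:
    """Multiplexing is needed once more events are requested than counters exist."""
    return num_events > physical_counters


def min_region_runtime_s(num_events: int, slice_s: float = 0.01) -> float:
    """Shortest region runtime that lets every multiplexed event run at least once."""
    return num_events * slice_s


# Physical counter counts as listed in the hardware counter overview.
counters = {"Pentium III": 2, "AMD Athlon": 4, "Power 3": 8}
requested = 6  # hypothetical number of events passed to the probe

for cpu, n in counters.items():
    if needs_multiplexing(requested, n):
        print(f"{cpu}: multiplexed, regions should run >= {min_region_runtime_s(requested):.2f} s")
    else:
        print(f"{cpu}: {requested} events fit in {n} counters")
```

Regions shorter than this bound may report counts for events that never had a turn on the hardware, so short functions are better measured with no more events than physical counters.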
PAPI Native Events • Look in the PAPI distribution • See the README file for your architecture in the src directory • See the example program tests/native.c in the src/tests directory
Intel Pentium IV Events (Arguments to perfex -e from PerfCtr distribution)
Perfometer Probe • Sends a stream of performance data every N seconds to the Perfometer GUI. • Functions can be colored at instrumentation time. • Default color is white, 0xFFFFFF • Usage: use perfometerprobe [0xRRGGBB] instr <args> <0xRRGGBB>
Perfometer Probe 2 • The Perfometer GUI is NOT launched automatically. • showrgb in X11 lists colors and names. • Run the Java GUI • java -jar Perfometer.jar • Connect to the specified hostname and port.
Reporting Probe Data • The wallclock and PAPI probes produce very similar data. • Both use a parsing script written in Perl. • wallclockrpt <file> • papiproberpt <file> • Produce 3 profiles • Inclusive: T_function = T_self + T_children • Exclusive: T_function = T_self • 1-Level Call Tree: T_child = Inclusive T_function
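The three profiles can be sketched from a call tree of self times. This is a minimal illustration and not the logic of the Perl report scripts. The call tree, the timings, and the `step` function are hypothetical, and the sketch assumes a strict tree with no recursion or shared callees.

```python
# Hypothetical call tree: function -> (self time in microseconds, list of callees).
CALL_TREE = {
    "main": (10.0, ["step", "phase"]),
    "step": (5.0, ["flux", "pde"]),
    "phase": (40.0, []),
    "flux": (30.0, []),
    "pde": (15.0, []),
}


def exclusive(name):
    """Exclusive profile: time spent in the function itself."""
    return CALL_TREE[name][0]


def inclusive(name):
    """Inclusive profile: self time plus the inclusive time of every callee."""
    self_time, callees = CALL_TREE[name]
    return self_time + sum(inclusive(c) for c in callees)


def one_level(name):
    """1-level call tree: each direct callee reported with its inclusive time."""
    return {c: inclusive(c) for c in CALL_TREE[name][1]}


for fn in CALL_TREE:
    print(f"{fn}: exclusive={exclusive(fn)} inclusive={inclusive(fn)} children={one_level(fn)}")
```

A function with a large inclusive but small exclusive time is mostly a caller, so the 1-level call tree shows which child to examine next.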
fspx IPC • proflux: 0.61 • phase: 0.63 • flux: 0.49 • pde: 0.46
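IPC is instructions completed divided by cycles elapsed. The sketch below assumes it is derived from the PAPI presets PAPI_TOT_INS and PAPI_TOT_CYC, which the slide does not state. The raw counts in the example call are hypothetical, while the per-function IPC values are the ones reported above.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle from two counter readings (assumed PAPI_TOT_INS / PAPI_TOT_CYC)."""
    return instructions / cycles


# Hypothetical raw counts, chosen only to show the calculation.
print(f"example IPC: {ipc(610_000, 1_000_000):.2f}")

# IPC values reported for the top four fspx functions.
reported = {"proflux": 0.61, "phase": 0.63, "flux": 0.49, "pde": 0.46}

# Rank from lowest to highest IPC; low values suggest stalls worth investigating.
for name, value in sorted(reported.items(), key=lambda kv: kv[1]):
    print(f"{name}: IPC {value:.2f}")
```

All four functions complete well under one instruction per cycle, which points toward counting cache and memory events next to explain the stalls.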