
PAPI and Dynaprof

Explore how Dynaprof helps optimize applications, identify bottlenecks, and generate performance signatures using hardware counters and resource usage traces.



Presentation Transcript


  1. Application Signatures and Performance Analysis of Scientific Applications Philip J. Mucci Innovative Computing Laboratory, UTK Performance Evaluation Research Center, LBL mucci@cs.utk.edu http://icl.cs.utk.edu/~mucci/dynaprof/snapshots/sc2002.ppt PAPI and Dynaprof

  2. Goals • Understanding the behavior of the application • Identification of bottlenecks. • Usage of the hardware resources. • Effects of that usage on performance. • Using Dynaprof to achieve that goal • Command line usage • 3 Dynaprof probes • Wallclock Time • Hardware performance counters • Resource usage traces

  3. Motivation • Optimize the application's performance. • Evaluate the algorithm's efficiency. • Generate an application signature. • A collection of data that represents the major terms in the performance model. • Develop a performance model.

  4. Overview of Hardware Counters • Data is NOT PORTABLE, but PAPI is... • Small number of registers dedicated for performance monitoring functions. • AMD Athlon, 4 counters • Pentium <= III, 2 counters • Pentium IV, 18 counters • IA64, 4 counters • Alpha 21x64, 2 counters • Power 3, 8 counters • Power 4, 8 counters to a group • UltraSparc II, 2 counters • MIPS R14K, 2 counters

  5. Applications used in this Tutorial • Serial: • FSPX: A binary alloy solidification benchmark. • SWIM: The SPEC shallow water benchmark. • Parallel (MPI): • Ex19 from the PETSc distribution. • Solves a nonlinear driven cavity problem with multigrid. A 2D driven cavity problem solved in a velocity-vorticity formulation.

  6. FSPX Execution Environment • Intel PIII, 1.2 GHz • FP Results/Clock: 1 (1.2 Gflips) • 4 SP/clk with SSE, 2 DP/clk with SSE2 • Caches: 16K/16K, 256K • G77 version 2.96 -g -O -malign-double -mpentiumpro -funroll-loops -fexpensive-optimizations • Execution time: > /bin/time fspx 115.370u 0.030s 1:58.17 97.6% 0+0k 0+0io 162pf+0w

  7. swim Execution Environment • IBM Nighthawk, 16-way Power 3, 375MHz • FP Results/Clock: 4 (1.5 Gflips) • Caches: 32K/64K, 8MB • MPI over TCP/IP via switch • Xlc 5.0.2.1 built with -g -O3 -qstrict -qarch=pwr3 -qtune=pwr3 • Execution time: > /bin/time poe swim -procs 2 0.4u 0.0s 0:15 3% 217+3933k 0+0io 1pf+0w
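The peak figures quoted on the execution environment slides follow from clock rate times floating-point results per clock. A minimal sketch of that arithmetic, assuming the clock is given in MHz; the function name `peak_gflips` is illustrative and not part of PAPI or Dynaprof:

```python
def peak_gflips(clock_mhz, fp_results_per_clock):
    """Peak floating-point rate in Gflips: clock rate times FP results per clock."""
    if clock_mhz <= 0 or fp_results_per_clock <= 0:
        raise ValueError("clock rate and results per clock must be positive")
    return clock_mhz * fp_results_per_clock / 1000.0


# Values taken from the slides: Pentium III at 1.2 GHz with 1 result per clock,
# Power 3 at 375 MHz with 4 results per clock.
for name, mhz, per_clock in [("Pentium III", 1200, 1), ("Power 3", 375, 4)]:
    print(f"{name}: {peak_gflips(mhz, per_clock):.1f} Gflips peak")
```

Comparing a measured rate against this peak gives a first indication of how well the hardware is being used.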

  8. ex19 Execution Environment • IBM Nighthawk, 16-way Power 3, 375MHz • FP Results/Clock: 4 (1.5 Gflips) • Caches: 32K/64K, 8MB • Xlc 5.0.2.1 built with -g • Execution time: > /bin/time poe ex19 -procs 2 -da_grid_x 56 -da_grid_y 56 0.520u 0.200s 0:44.18 1.6% 297+3580k 0+0io 0pf+0w
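The `/bin/time` lines on the three environment slides report user time, system time, elapsed time and CPU utilization. The sketch below parses the first three fields and recomputes the utilization, assuming the csh-style layout shown on the slides (`<user>u <system>s <elapsed> <percent>% ...`); the helper names are illustrative.

```python
def parse_csh_time(line):
    """Return (user, system, elapsed) in seconds from a csh-style time line."""
    user_field, system_field, elapsed_field = line.split()[:3]
    user = float(user_field.rstrip("u"))
    system = float(system_field.rstrip("s"))
    # Elapsed time is written as [h:]m:s, so fold the fields from left to right.
    elapsed = 0.0
    for part in elapsed_field.split(":"):
        elapsed = elapsed * 60 + float(part)
    return user, system, elapsed


def cpu_utilization(user, system, elapsed):
    """CPU utilization in percent: CPU time divided by elapsed time."""
    return 100.0 * (user + system) / elapsed


for label, line in [
    ("fspx", "115.370u 0.030s 1:58.17 97.6% 0+0k 0+0io 162pf+0w"),
    ("swim", "0.4u 0.0s 0:15 3% 217+3933k 0+0io 1pf+0w"),
    ("ex19", "0.520u 0.200s 0:44.18 1.6% 297+3580k 0+0io 0pf+0w"),
]:
    user, system, elapsed = parse_csh_time(line)
    print(f"{label}: {cpu_utilization(user, system, elapsed):.1f}% of {elapsed:.2f} s")
```

The very low percentages for the two `poe` runs are consistent with `/bin/time` measuring the `poe` launcher rather than the MPI tasks themselves, although the slides do not state this.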

  9. Gprof • Gathers timer interrupts vs. text address. • Recompile with -p option. • Gprof profile is useful for a high level overview • Does it tell us why?

  10. Gprof Profile of FSPX

  11. FSPX: Top 4 functions • Top 4 functions make up 50% of execution time • In module update.F • flux • proflux • pde • In module phase.F • phase • Use the list command to explore modules and functions

  12. Gprof Profile of SWIM

  13. Gprof Profile of ex19

  14. Dynaprof Environment Variables • LD_LIBRARY_PATH: Colon-separated list of directories to search for shared libraries. We need to find: • DynInst library • PAPI library • Any dependencies of the above (libperfctr.so, libcpc.so). • DYNINSTAPI_RT_LIB: Full pathname of the DynInst runtime library. • No settings necessary for the AIX/DPCL port
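Before starting Dynaprof it can help to confirm that the required libraries are reachable through these variables. The following is a minimal sketch, assuming the library file names `libdyninstAPI.so` and `libpapi.so` (the slide names the libraries but not their file names, so adjust them for your installation). It only inspects `LD_LIBRARY_PATH`, whereas the dynamic linker also searches system default directories.

```python
import os


def find_library(name, search_path):
    """Return the first match for `name` in a colon-separated directory list."""
    for directory in search_path.split(":"):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def missing_libraries(names, env):
    """List the libraries from `names` that LD_LIBRARY_PATH does not provide."""
    search_path = env.get("LD_LIBRARY_PATH", "")
    return [name for name in names if find_library(name, search_path) is None]


if __name__ == "__main__":
    # Assumed file names; libperfctr.so and libcpc.so are the dependencies
    # named on the slide and are only needed on some platforms.
    required = ["libdyninstAPI.so", "libpapi.so"]
    print("missing from LD_LIBRARY_PATH:", missing_libraries(required, os.environ))
    runtime = os.environ.get("DYNINSTAPI_RT_LIB")
    if not runtime or not os.path.isfile(runtime):
        print("DYNINSTAPI_RT_LIB is unset or does not point to a file")
```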

  15. Running Dynaprof • Usage: dynaprof [-d] [serial_application] • -d enables debugging output • Specifying an application automatically loads it into the tool immediately after initialization.

  16. Command Line Interface • Uses GNU Readline library for input • Full-featured command line editing • File and command completion: <Tab> • History: <Up>/<Down> • Settings, macros and aliases in ~/.inputrc • Allows Emacs or VI style bindings • set editing-mode emacs • set editing-mode vi • See man page, Texinfo file or home page.

  17. Load command • Starts the application and stops it at the first instruction. • Usage: load <application> [args] > dynaprof (dynaprof) load tests/fspx

  18. Poeload command • For use with MPI applications on AIX and DPCL. • DPCL < 3.2.5 requires full path • Usage: poeload <application> [args] (dynaprof) poeload tests/swim -procs 2

  19. Mpiload command • For use with MPI applications. • Stops the application after it calls PMPI_Init(). • Mostly useful for script driven execution of MPI jobs • Usage: mpiload <application> [args] (dynaprof) mpiload tests/mpicount

  20. Attach command • Attaches to a running application (or poe process) and stops it. • Usage: attach <application> <pid> (dynaprof) ^Z > tests/fspx & [2] 17500 > fg (dynaprof) attach tests/fspx 17500

  21. Poeattach Command • For use with MPI applications on AIX and DPCL. • DPCL < 3.2.5 requires full path • Usage: poeattach <application> <pid_of_poe> (dynaprof) ^Z > poe ex19 -da_grid_x 56 -da_grid_y 56 -procs 2 & [2] 17500 > fg (dynaprof) poeattach ex19 17500

  22. List command • list • List all modules in process • list <pattern> • List all matching modules • list <module> • List all functions in module • list <module> <pattern> • List all matching functions in module • list <module> <function> • List instrumentable points in function

  23. Exploring FSPX • G77's Fortran runtime support • Code compiled with g77 without -g ends up in the DEFAULT_MODULE • Application code • Shared libraries

  24. Exploring FSPX 2 • G77's Fortran runtime support • Code compiled with g77 without -g ends up in the DEFAULT_MODULE

  25. Exploring FSPX 3 Function Calls

  26. Use command • Loads a probe shared library into the address space (dynaprof) use [probe [args]] • Use by itself displays the current probe. • To change options, respecify the probe. • 4 probes in this release • Wallclock: Real time clock • PAPI: Hardware metrics • Perfometer: Real-time visualization of streaming hardware metrics

  27. Instr command • instr • list all instrumented functions • instr module <pattern> [arg] • Instrument all functions in modules matching pattern • instr function <module> <pattern> [arg] • Instrument all functions matching pattern in module

  28. Threads and Dynaprof Probes • For threaded code, use the same probe! • Dynaprof detects threads and loads a special version of the probe library. • Each probe specifies what to do when a new thread is discovered. • Each thread gets the same instrumentation.

  29. Probe Warning • Instrumentation is not free. • Consider granularity of region being measured. • Overhead for PAPI 2.3 is O(100) cycles. • Between 500 and 2000 cycles for a 2 counter read. • Overhead for Wallclock is O(100) cycles.
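The granularity warning can be made concrete by comparing the probe cost with the length of the instrumented region. A minimal sketch under stated assumptions: the probe cost is passed in as a parameter (the slide quotes figures from O(100) cycles up to 2000 cycles), and the 1% perturbation threshold is an arbitrary example rather than a Dynaprof recommendation.

```python
def overhead_ratio(region_cycles, probe_cycles):
    """Probe cost as a fraction of the uninstrumented region's cycle count."""
    if region_cycles <= 0:
        raise ValueError("region_cycles must be positive")
    return probe_cycles / region_cycles


def min_region_cycles(probe_cycles, max_ratio):
    """Smallest region length that keeps the probe cost at or below max_ratio."""
    if max_ratio <= 0:
        raise ValueError("max_ratio must be positive")
    return probe_cycles / max_ratio


# A 2000-cycle counter read inside a 10,000-cycle function costs 20% extra.
print(f"{overhead_ratio(10_000, 2000):.0%}")

# Region length needed to keep the same read under an assumed 1% threshold.
print(f"{min_region_cycles(2000, 0.01):,.0f} cycles")
```

Functions much shorter than this bound are poor candidates for instrumentation because the measurement would dominate the quantity being measured.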

  30. Wallclock Probe • High resolution, low latency timer • Usage: use wallclockprobe • Reports time in microseconds (1.0 × 10^-6 s).

  31. PAPI Probe • Count PAPI Presets or Native Events • Usage: use papiprobe [event,event,...] • Default argument is PAPI_FP_INS, or PAPI_TOT_INS if the architecture doesn't support PAPI_FP_INS. • Available events can be obtained by using: papi_avail -a

  32. PAPI Probe and Multiplexing • Requesting more metrics than there are physical counters automatically enables multiplexing. • A minimum runtime of instrumented regions must be observed, such that all virtual counters get a chance to run at least once: runtime_min = num_events × 0.01 s • Automatic warning functionality is being rolled into PAPI.
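The multiplexing rule on this slide reduces to two small checks: whether the requested events exceed the physical counters, and how long a region must run for every event to be scheduled once. A minimal sketch using the 0.01 s figure from the slide; the function names are illustrative and do not belong to the PAPI API.

```python
def needs_multiplexing(num_events, num_counters):
    """True when more events are requested than the CPU has physical counters."""
    return num_events > num_counters


def min_region_runtime(num_events, slice_seconds=0.01):
    """Minimum region runtime in seconds so each event gets at least one slice."""
    return num_events * slice_seconds


# Example: 6 events on a Pentium III, which has 2 counters per the overview slide.
events, counters = 6, 2
if needs_multiplexing(events, counters):
    print(f"multiplexing needed; regions should run >= {min_region_runtime(events):.2f} s")
```

Regions shorter than this bound may report counts for events that were never actually scheduled.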

  33. PAPI Native Events • Look in the PAPI distribution • See the README file for your architecture in the src directory • See the example program tests/native.c in the src/tests directory

  34. Power 3 Events

  35. Power 3 Events 2

  36. Power 4 Events

  37. Pentium III Events

  38. Intel Pentium IV Events (Arguments to perfex -e from PerfCtr distribution)

  39. Sun UltraSparc II Events

  40. Sun UltraSparc III Events

  41. MIPS R12K Events

  42. Alpha/DADD 21264 Events

  43. Perfometer Probe • Sends a stream of performance data every N seconds to the Perfometer GUI. • Functions can be colored at instrumentation time. • Default color is white, 0xFFFFFF • Usage: use perfometerprobe [0xRRGGBB] instr <args> <0xRRGGBB>
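The color argument is a 24-bit value written as 0xRRGGBB. A small sketch for converting between that form and separate red, green and blue components, for example when starting from the decimal triples that `showrgb` prints; the helper names are illustrative.

```python
def format_color(red, green, blue):
    """Build a 0xRRGGBB string from three components in the range 0..255."""
    for component in (red, green, blue):
        if not 0 <= component <= 255:
            raise ValueError("color components must be in 0..255")
    return "0x{:02X}{:02X}{:02X}".format(red, green, blue)


def parse_color(text):
    """Split a 0xRRGGBB string into a (red, green, blue) tuple."""
    value = int(text, 16)
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("expected a 24-bit color value")
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


print(format_color(255, 255, 255))  # the default color, white
print(parse_color("0xFF8000"))
```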

  44. Perfometer Probe 2 • Perfometer GUI is NOT launched automatically. • showrgb in X11 lists colors and names. • Run the Java GUI • java -jar Perfometer.jar • Connect to the specified hostname and port.

  45. Instrumenting SWIM with perfometerprobe

  46. Instrumenting FSPX for Instructions Per Cycle

  47. Instrumenting SWIM for Instructions Per Cycle

  48. Reporting Probe Data • The wallclock and PAPI probes produce very similar data. • Both use a parsing script written in Perl. • wallclockrpt <file> • papiproberpt <file> • Produce 3 profiles • Inclusive: Tfunction = Tself + Tchildren • Exclusive: Tfunction = Tself • 1-Level Call Tree: Tchild = Inclusive Tfunction
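The three profiles can be illustrated on a small call tree. The sketch below is not the Perl report scripts; it is a Python illustration of the definitions on the slide, using a hypothetical call tree with made-up function names and times, and assuming no recursion (a recursive function would have its inclusive time counted more than once).

```python
def build_profiles(root):
    """Return (inclusive, exclusive, call_tree) dictionaries for a call tree.

    Each node is a dict with keys "name", "self" (time spent in the function
    body itself) and "children" (a list of callee nodes).
    """
    inclusive, exclusive, call_tree = {}, {}, {}

    def visit(node):
        # Inclusive: Tfunction = Tself + Tchildren
        child_times = {child["name"]: visit(child) for child in node["children"]}
        total = node["self"] + sum(child_times.values())
        name = node["name"]
        inclusive[name] = inclusive.get(name, 0) + total
        # Exclusive: Tfunction = Tself
        exclusive[name] = exclusive.get(name, 0) + node["self"]
        # 1-level call tree: each child is charged its inclusive time
        if child_times:
            call_tree[name] = child_times
        return total

    visit(root)
    return inclusive, exclusive, call_tree


# Hypothetical tree, times in microseconds.
tree = {"name": "main", "self": 10, "children": [
    {"name": "solve", "self": 40, "children": [
        {"name": "kernel", "self": 30, "children": []}]},
    {"name": "update", "self": 20, "children": []},
]}

inclusive, exclusive, call_tree = build_profiles(tree)
print(inclusive)
print(exclusive)
print(call_tree)
```

The exclusive profile shows where time is actually spent, while the inclusive profile and call tree show which callers are responsible for it.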

  49. FSPX Cycles & Instrs.

  50. FSPX IPC • proflux: 0.61 • phase: 0.63 • flux: 0.49 • pde: 0.46
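Instructions per cycle (IPC) is the ratio of completed instructions to elapsed cycles over the same region. A minimal sketch of the calculation, followed by a ranking of the four IPC values reported on this slide; the raw counts in the first example are hypothetical, since the transcript does not include the cycle and instruction totals.

```python
def ipc(instructions, cycles):
    """Instructions per cycle for one instrumented region."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical counter readings for one function.
print(f"IPC = {ipc(6_100_000, 10_000_000):.2f}")

# IPC values reported on the slide, ranked from lowest to highest.
reported_ipc = {"proflux": 0.61, "phase": 0.63, "flux": 0.49, "pde": 0.46}
lowest_first = sorted(reported_ipc, key=reported_ipc.get)
print(lowest_first)
```

Functions at the low end of this ranking that also dominate the time profile are the natural places to look for stalls first.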
