PAPI: The Performance Application Programming Interface
Kevin London london@cs.utk.edu
Nathan Garner garner@cs.utk.edu
Purpose The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Motivation • To leverage existing and future performance tool development • To increase application and system performance • To characterize application and system workload • To stimulate run-time optimization research
Goals • Provide a solid foundation for cross-platform performance analysis tools. • Loose standardization among vendors, academics, and users. • Provide a number of implementations for HPC architectures. • Well documented and easy to use.
Why PAPI is needed • No common performance tools other than prof and gprof. • Most commercial tools are based only on time. • HPC workloads are memory- and floating-point-intensive and depend on good instruction scheduling (pipelining).
Implementation • Supports native events and 103 “preset” events, which are commonly available metrics; some presets are derived from more than one counter. • Presets can be queried to see whether they exist on the current platform. • Fully programmable, thread-safe, low-level interface aimed at tool developers and sophisticated users. • The EventSet is the underlying abstraction. • Hardware events are used in conjunction with one another to provide meaningful information.
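As an illustration of the EventSet abstraction, here is a minimal C sketch using the low-level interface; it assumes the current PAPI C API (event-set handles passed by value, counts returned in a long long array), which differs slightly from the original 1.0 prototypes:

    #include <stdio.h>
    #include "papi.h"

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long counts[2];

        /* Initialize the library and create an empty event set */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        if (PAPI_create_eventset(&eventset) != PAPI_OK)
            return 1;

        /* Combine two hardware events in one event set */
        if (PAPI_add_event(eventset, PAPI_TOT_INS) != PAPI_OK)
            return 1;
        if (PAPI_add_event(eventset, PAPI_TOT_CYC) != PAPI_OK)
            return 1;

        PAPI_start(eventset);
        /* ... code to be measured ... */
        PAPI_stop(eventset, counts);

        printf("Instructions: %lld  Cycles: %lld\n", counts[0], counts[1]);
        return 0;
    }

Grouping the two events in one EventSet means the counts are started, read, and stopped together, so derived metrics such as instructions per cycle stay consistent.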
PAPI Presets
Test case 8: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : GenuineIntel (-1)
Model string and code    : Celeron (Mendocino) (6)
CPU revision             : 10.000000
CPU Megahertz            : 366.504944
-------------------------------------------------------------------------
Name          Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM   0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  No     No     Level 2 data cache misses
PAPI_L2_ICM   0x80000003  No     No     Level 2 instruction cache misses
PAPI_L3_DCM   0x80000004  No     No     Level 3 data cache misses
PAPI_L3_ICM   0x80000005  No     No     Level 3 instruction cache misses
PAPI_L1_TCM   0x80000006  Yes    Yes    Level 1 cache misses
PAPI_L2_TCM   0x80000007  Yes    No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  No     No     Level 3 cache misses
PAPI_CA_SNP   0x80000009  No     No     Requests for a snoop
PAPI_CA_SHR   0x8000000a  No     No     Requests for shared cache line
PAPI_CA_CLN   0x8000000b  No     No     Requests for clean cache line
PAPI_CA_INV   0x8000000c  No     No     Requests for cache line inv.
. . .
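The availability check shown in the listing above can also be done programmatically; a small sketch, assuming the PAPI_query_event call from the C API:

    #include <stdio.h>
    #include "papi.h"

    int main(void)
    {
        int presets[]      = { PAPI_L1_DCM, PAPI_L2_TCM, PAPI_FP_INS };
        const char *names[] = { "PAPI_L1_DCM", "PAPI_L2_TCM", "PAPI_FP_INS" };

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        /* PAPI_query_event returns PAPI_OK when the preset maps onto this CPU's counters */
        for (int i = 0; i < 3; i++)
            printf("%-12s %s\n", names[i],
                   PAPI_query_event(presets[i]) == PAPI_OK ? "available" : "not available");
        return 0;
    }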
PAPI High Level API • The high-level API is meant for application programmers who want coarse-grained measurements. • Not tuned for efficiency. • Calls the low-level API. • Not thread safe. (may change) • Only allows PAPI presets. (may change)
PAPI High Level Functions PAPI_num_counters() PAPI_start_counters() PAPI_stop_counters() PAPI_read_counters()
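A minimal C sketch of the classic high-level interface (the signatures follow later PAPI documentation and may differ slightly from the 1.0 release):

    #include <stdio.h>
    #include "papi.h"

    int main(void)
    {
        int events[2] = { PAPI_TOT_INS, PAPI_TOT_CYC };
        long long values[2];

        printf("Counters available: %d\n", PAPI_num_counters());

        /* Start counting, run the region of interest, then stop and collect */
        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... region of interest ... */

        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        printf("Instructions: %lld  Cycles: %lld\n", values[0], values[1]);
        return 0;
    }

The program is typically linked against the PAPI library (e.g. with -lpapi); as noted above, the high-level calls accept only preset events.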
Implementation PAPI contains functions to: • Obtain accurate time. • Obtain information about the executable and the hardware. • Register callbacks on counter overflow of a user-defined threshold. • Provide an SVR4-compatible profil() call that uses hardware counters.
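A short C sketch of the timing and hardware-information calls; the PAPI_hw_info_t field names are assumed from later PAPI releases and mirror the arguments of the PAPIf_get_hardware_info call in the Fortran example below:

    #include <stdio.h>
    #include "papi.h"

    int main(void)
    {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        /* Wall-clock time in microseconds */
        long long t0 = PAPI_get_real_usec();
        /* ... work to be timed ... */
        long long t1 = PAPI_get_real_usec();
        printf("Elapsed: %lld usec\n", t1 - t0);

        /* Static description of the machine running the program */
        const PAPI_hw_info_t *hw = PAPI_get_hardware_info();
        if (hw != NULL)
            printf("%d CPUs, %s %s at %.2f MHz\n",
                   hw->totalcpus, hw->vendor_string, hw->model_string, hw->mhz);
        return 0;
    }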
49 PAPI Functions
PAPI_accum, PAPI_add_event, PAPI_add_events, PAPI_add_pevent, PAPI_cleanup_eventset, PAPI_create_eventset, PAPI_create_eventset_r, PAPI_destroy_eventset,
PAPI_get_executable_info, PAPI_get_hardware_info, PAPI_get_opt, PAPI_get_overflow_address, PAPI_get_real_cyc, PAPI_get_real_usec, PAPI_get_virt_cyc, PAPI_get_virt_usec,
PAPI_library_init, PAPI_thread_init, PAPI_list_events, PAPI_lock, PAPI_num_counters, PAPI_overflow, PAPI_perror, PAPI_profil,
PAPI_query_all_events_verbose, PAPI_query_event, PAPI_query_event_verbose, PAPI_read, PAPI_read_counters, PAPI_rem_event, PAPI_rem_events, PAPI_reset,
PAPI_restore, PAPI_save, PAPI_set_debug, PAPI_set_domain, PAPI_set_granularity, PAPI_set_opt, PAPI_shutdown, PAPI_start, PAPI_start_counters, PAPI_state,
PAPI_stop, PAPI_stop_counters, PAPI_unlock, PAPI_write
#include "fpapi.h" program fmatrixlowpapi ** USER DECLERATIONS ** call PAPIf_library_init( check ) call PAPIf_thread_init( handle, handle, check ) call PAPIf_num_counters( numevents ) print *, 'number of hardware counters supported: ', numevents call PAPIf_add_event(EventSet,PAPI_FLOPS,check) call PAPIf_add_event(EventSet,PAPI_L1_TCM,check) call PAPIf_add_event(EventSet,PAPI_L2_TCM,check) call PAPIf_get_hardware_info( ncpu, nnodes, totalcpus, vendor, . vstring, model, mstring, revision, mhz ) print *, 'A', totalcpus, ' CPU ', mstring, ' at', mhz, 'Mhz.' print *, ncpu, nnodes, totalcpus, vendor, vstring, model, . mstring, revision, mhz call PAPIf_get_real_usec( starttime ) call PAPIf_start( EventSet, check ) ** USER CODE **
      call PAPIf_stop(EventSet, values, check)
      call PAPIf_get_real_usec( stoptime )
      finaltime = (stoptime/1000000.0) - (starttime/1000000.0)
      print *, 'Time: ', finaltime
      print *, 'FLOPS: ', values(1)
      print *, 'Total Level 1 Data cache misses: ', values(2)
      print *, 'Total Level 2 Data cache misses: ', values(3)
      return
      end
number of hardware counters supported: 32
A 2 CPU R12000 at 270.0000 Mhz.
MIPS 30 R12000 2.300000 270.0000
Time: 1.547424316406250
FLOPS: 4258753
Total Level 1 Data cache misses: 1539918
Total Level 2 Data cache misses: 6936
Threads and PAPI • PAPI must be able to support both explicit (library) and implicit (compiler) threading models. • However, this can only happen if the threads are ‘bound’. • A ‘bound’ thread is one that has a scheduling entity known and handled by the OS kernel.
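A sketch of per-thread counting with Pthreads; the registration calls follow later PAPI documentation (the 1.0 PAPIf_thread_init wrapper shown earlier takes different arguments), so treat the exact signatures as assumptions:

    #include <pthread.h>
    #include <stdio.h>
    #include "papi.h"

    /* Each bound thread creates its own EventSet and gets its own counts */
    static void *worker(void *arg)
    {
        int eventset = PAPI_NULL;
        long long count;
        (void) arg;

        PAPI_register_thread();              /* added in later PAPI releases */
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_INS);
        PAPI_start(eventset);
        /* ... per-thread work ... */
        PAPI_stop(eventset, &count);
        printf("thread instructions: %lld\n", count);
        PAPI_unregister_thread();
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        PAPI_library_init(PAPI_VER_CURRENT);
        /* Tell PAPI how to identify the calling (bound) thread */
        PAPI_thread_init((unsigned long (*)(void)) pthread_self);

        pthread_create(&tid, NULL, worker, NULL);
        pthread_join(tid, NULL);
        return 0;
    }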
The 1.0 Release • Platforms: Linux/x86, Solaris/Ultra, AIX/Power, Tru64/Alpha, IRIX/MIPS • Fortran wrappers • Thread support • Remote CVS access • Updated web site • Documentation • Tool integration
UTK Tools • Perfometer • Real time trace based visualization of metrics at the subroutine level. (Java/Swing) • Profometer (planned) • Real time sample based visualization at the line level. (Java/Swing) • Hwprof (planned) • Back end to generate performance data to be fed into the above tools. Possible integration with DynInst.
Perfometer Features • Platform independent visualization of PAPI metrics • Graphical display may run remotely, freeing the compute node of the drawing overhead • Flexible interface (internal drawing classes are reused for other tools) • Quick interpretation of complex results • Color coding to highlight selected procedures
Perfometer Usage • Application is instrumented with a single call to perfometer() • Sections of code that are of interest can be distinguished in the graph with specific colors using a call to mark_perfometer(COLOR) • #include "papicolorcodes.h" • call perfometer • call mark_perfometer(RED)
Perfometer Future Development • Allow runtime selection of multiple PAPI metrics for simultaneous display • Integration with Dyninst to eliminate need for recompiling user codes • Dump trace data to file for post-mortem study • Additional graph display types
Profometer Features • Visual representation of the quantity of a given metric spent in a particular code segment • Color coding of user selected code segments • Zoom in and out to emphasize sections of interest • Reuse of the Perfometer engine
Profometer Screenshot • Profometer – Histogram of a given metric per code segment
Profometer Future Development • Run time modification of metric being monitored • Hooks into debugging interface to allow GDB style interaction with source code
UTKhwprof Screenshot
rusage child                           rusage child
=============                   =====  =============                  =====
user time sec                   1.000  num of swap operations             0
sys time sec                    0.010  block input operations             0
real time sec                   1.010  block output operations            0
maximum resident set size           0  messages sent                      0
(ru_ixrss) currently null           0  messages received                  0
integral resident set size          0  signals received                   0
(ru_ixrss) currently null           0  voluntary context switches         0
page faults without I/O            29  involuntary context switches       0
page faults with I/O               78

local platform
==============
num hw counters: 3
clock tick: 100 Hz
PAPI clock rate: 199.00 MHz
PAPI cycle time: 0.00502513 usec/cycle
CPU name for this node: redwood.cs.utk.edu

PAPI counts
===========
PAPI_TOT_CYC: 4419
PAPI_INT_INS: 4451
PAPI_TOT_INS: 102034
U. Illinois: SvPablo • Source code instrumentation based profiling of F77, F90, C and C++. • Color coded key next to source code indicating severity of metric. • MPI aware. • Statistics at the function, loop and line level.
U. Oregon: TAU • Source code based instrumentation of C, C++, F77, F90, HPF and pC++. • Maintains a program database in which to store and localize performance data. • Multiple lightweight tools and a launcher • Including call graph/control flow browser, a class browser, a remote debugger, MPI trace analysis and a profiler. • Integrated with PAPI.
Visual Profiler: vprof • Developed by Curtis Janssen at Sandia Livermore • Creates and visualizes line level execution profiles obtained with PC-sampling. • Data usually generated with the profil()/monitor() library/system call or done by hand with interval timers and signal information. • Ported to use PAPI_profil() in a day.
Pacific Sierra Research DEEP/MPI • Source code instrumentation based profiling at the basic block level (regions of code with one entry and one exit, on the order of tens of instructions). • Comprehensive visualization and analysis. • Integrated source code browser with highlighting. • Works now with MPI, soon with OpenMP. • Integrated with PAPI.
Web Resources • Mailing list • Send “subscribe ptools-perfapi” to majordomo@ptools.org • ptools-perfapi@ptools.org is the reflector • Web page • http://icl.cs.utk.edu/projects/papi • Post-RISC paper by Richard Enbody et al. • http://www.cps.msu.edu/~crs/cps920/
Web Resources 2 • PCL http://www.fz-juelich.de/zam/PCL/ • Vprof http://aros.ca.sandia.gov/~cljanss/perf/vprof/ • Paradyn http://www.cs.wisc.edu/paradyn/libhrtime/ • DynInst http://www.cs.umd.edu/projects/dyninstAPI/ • Libhrtime http://www.cs.wisc.edu/paradyn/libhrtime/ • TAU http://www.cs.uoregon.edu/research/paracomp/tau/ • SvPablo http://www-pablo.cs.uiuc.edu/Project/SVPablo/SvPabloOverview.htm
The Future • x86/Alpha Linux kernel • Implementation under /proc • merge with libhrtime patch from U. Wisc • Support for signal dispatch on hardware counter overflow • Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC
Source Code Access • Every 24 hours, a snapshot of the source tree is posted at http://icl.cs.utk.edu/projects/papi/snapshot.cgi • Remote read-only access to the CVS source tree:
  (csh) setenv CVSROOT anonymous@hera.cs.utk.edu:/cvs/homes/papi
  (sh)  export CVSROOT=anonymous@hera.cs.utk.edu:/cvs/homes/papi
  cvs login                 (password: <cr>)
  cvs checkout papi         (or later: cvs update)
  cd papi/src
  make -f Makefile.<arch>
  cvs logout
The Future • Dynamic instrumentation of running applications via Dyninst • Support for gathering performance data from applications using MPI • Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC