
PAPI The Performance Application Programming Interface


Presentation Transcript


  1. PAPI The Performance Application Programming Interface Kevin London london@cs.utk.edu Nathan Garner garner@cs.utk.edu

  2. Purpose The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.

  3. Motivation • To leverage existing and future performance tool development • To increase application and system performance • To characterize application and system workload • To stimulate run-time optimization research

  4. Goals • Provide a solid foundation for cross platform performance analysis tools. • Loose standardization between vendors, academics and users. • Provide a number of implementations for HPC architectures. • Well documented, easy to use.

  5. Why PAPI is needed • No common performance tools except prof and gprof. • Most commercial tools are based on time. • HPC has memory and floating point intensive workloads which require good scheduling. (pipelining)

  6. Implementation • Supports native events and 103 “preset” events: commonly available metrics, some of which are derived. • Query to see whether a preset is available • Fully programmable, thread-safe, low-level interface aimed at the tool developer and the sophisticated user • The EventSet is the underlying abstraction • Hardware events are used in conjunction with one another to provide meaningful information.

  7. PAPI Presets
     Test case 8: Available events and hardware information.
     -------------------------------------------------------------------------
     Vendor string and code : GenuineIntel (-1)
     Model string and code  : Celeron (Mendocino) (6)
     CPU revision           : 10.000000
     CPU Megahertz          : 366.504944
     -------------------------------------------------------------------------
     Name         Code        Avail  Deriv  Description (Note)
     PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
     PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
     PAPI_L2_DCM  0x80000002  No     No     Level 2 data cache misses
     PAPI_L2_ICM  0x80000003  No     No     Level 2 instruction cache misses
     PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses
     PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses
     PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
     PAPI_L2_TCM  0x80000007  Yes    No     Level 2 cache misses
     PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses
     PAPI_CA_SNP  0x80000009  No     No     Requests for a snoop
     PAPI_CA_SHR  0x8000000a  No     No     Requests for shared cache line
     PAPI_CA_CLN  0x8000000b  No     No     Requests for clean cache line
     PAPI_CA_INV  0x8000000c  No     No     Requests for cache line inv.
     . . .

  8. PAPI High Level API • PAPI high level is meant for application programmers wanting coarse-grained measurements. • Not tuned for efficiency • Calls the lower level API. • Not thread safe. (may change) • Only allows PAPI Presets. (may change)

  9. PAPI High Level Functions PAPI_num_counters() PAPI_start_counters() PAPI_stop_counters() PAPI_read_counters()

  10. Implementation PAPI contains functions to: • Obtain accurate time. • Obtain information about the executable and the hardware. • Register callbacks that fire when a counter overflows a user-defined threshold. • Provide an SVR4-compatible profil() call that uses hardware counters.

  11. Implementation

  12. 49 PAPI Functions PAPI_accum PAPI_add_event PAPI_add_events PAPI_add_pevent PAPI_cleanup_eventset PAPI_create_eventset PAPI_create_eventset_r PAPI_destroy_eventset PAPI_get_executable_info PAPI_get_hardware_info PAPI_get_opt PAPI_get_overflow_address PAPI_get_real_cyc PAPI_get_real_usec PAPI_get_virt_cyc PAPI_get_virt_usec PAPI_library_init PAPI_thread_init PAPI_list_events PAPI_lock PAPI_num_counters PAPI_overflow PAPI_perror PAPI_profil PAPI_query_all_events_verbose PAPI_query_event PAPI_query_event_verbose PAPI_read PAPI_read_counters PAPI_rem_event PAPI_rem_events PAPI_reset PAPI_restore PAPI_save PAPI_set_debug PAPI_set_domain PAPI_set_granularity PAPI_set_opt PAPI_shutdown PAPI_start PAPI_start_counters PAPI_state PAPI_stop PAPI_stop_counters PAPI_unlock PAPI_write

  13. Fortran example (low-level interface)
     #include "fpapi.h"
           program fmatrixlowpapi
     ** USER DECLARATIONS **
           call PAPIf_library_init( check )
           call PAPIf_thread_init( handle, handle, check )
           call PAPIf_num_counters( numevents )
           print *, 'number of hardware counters supported: ', numevents
           call PAPIf_add_event(EventSet,PAPI_FLOPS,check)
           call PAPIf_add_event(EventSet,PAPI_L1_TCM,check)
           call PAPIf_add_event(EventSet,PAPI_L2_TCM,check)
           call PAPIf_get_hardware_info( ncpu, nnodes, totalcpus, vendor,
          .     vstring, model, mstring, revision, mhz )
           print *, 'A', totalcpus, ' CPU ', mstring, ' at', mhz, 'Mhz.'
           print *, ncpu, nnodes, totalcpus, vendor, vstring, model,
          .     mstring, revision, mhz
           call PAPIf_get_real_usec( starttime )
           call PAPIf_start( EventSet, check )
     ** USER CODE **

  14. Fortran example (continued)
           call PAPIf_stop(EventSet,values,check)
           call PAPIf_get_real_usec( stoptime )
           finaltime = (stoptime/1000000.0) - (starttime/1000000.0)
           print *, 'Time: ', finaltime
           print *, 'FLOPS: ', values(1)
           print *, 'Total Level 1 Data cache misses: ', values(2)
           print *, 'Total Level 2 Data cache misses: ', values(3)
           return
           end

  15. Example output
     number of hardware counters supported: 32
     A 2 CPU R12000 at 270.0000 Mhz.
     MIPS 30 R12000 2.300000 270.0000
     Time: 1.547424316406250
     FLOPS: 4258753
     Total Level 1 Data cache misses: 1539918
     Total Level 2 Data cache misses: 6936

  16. Threads and PAPI • PAPI must be able to support both explicit (library) and implicit (compiler) threading models. • However, this can only happen if the threads are ‘bound’. • A ‘bound’ thread is one that has a scheduling entity known and handled by the OS kernel.

  17. The 1.0 Release • Platforms: Linux/x86, Solaris/Ultra, AIX/Power, Tru64/Alpha, IRIX/MIPS • Fortran wrappers • Thread support • Remote CVS access • Updated Web Site • Documentation • Tool integration

  18. UTK Tools • Perfometer • Real time trace based visualization of metrics at the subroutine level. (Java/Swing) • Profometer (planned) • Real time sample based visualization at the line level. (Java/Swing) • Hwprof (planned) • Back end to generate performance data to be fed into the above tools. Possible integration with DynInst.

  19. Perfometer Features • Platform independent visualization of PAPI metrics • Graphical display may run remotely, freeing the compute node of the drawing overhead • Flexible interface (internal drawing classes are reused for other tools) • Quick interpretation of complex results • Color coding to highlight selected procedures

  20. Perfometer Screenshot

  21. Perfometer Usage • Application is instrumented with a single call to perfometer() • Sections of code that are of interest can be distinguished in the graph with specific colors using a call to mark_perfometer(COLOR) • #include "papicolorcodes.h" • call perfometer • call mark_perfometer(RED)

  22. Perfometer Future Development • Allow runtime selection of multiple PAPI metrics for simultaneous display • Integration with Dyninst to eliminate need for recompiling user codes • Dump trace data to file for post-mortem study • Additional graph display types

  23. Profometer Features • Visual representation of the quantity of a given metric spent in a particular code segment • Color coding of user selected code segments • Zoom in and out to emphasize sections of interest • Reuse of the Perfometer engine

  24. Profometer Screenshot • Profometer – Histogram of a given metric per code segment

  25. Profometer Future Development • Run time modification of metric being monitored • Hooks into debugging interface to allow GDB style interaction with source code

  26. UTKhwprof Screenshot
     rusage                         child    rusage                         child
     =============                  =====    =============                  =====
     user time sec                  1.000    num of swap operations             0
     sys time sec                   0.010    block input operations             0
     real time sec                  1.010    block output operations            0
     maximum resident set size          0    messages sent                      0
     (ru_ixrss) currently null          0    messages received                  0
     integral resident set size         0    signals received                   0
     (ru_ixrss) currently null          0    voluntary context switches         0
     page faults without I/O           29    involuntary context switches       0
     page faults with I/O              78

     local platform
     ==============
     num hw counters: 3
     clock tick: 100 Hz
     PAPI clock rate: 199.00 MHz
     PAPI cycle time: 0.00502513 usec/cycle
     CPU name for this node: redwood.cs.utk.edu

     PAPI counts
     ===========
     PAPI_TOT_CYC: 4419
     PAPI_INT_INS: 4451
     PAPI_TOT_INS: 102034

  27. Other Tools using PAPI

  28. U. Illinois: SvPablo • Source code instrumentation based profiling of F77, F90, C and C++. • Color coded key next to source code indicating severity of metric. • MPI aware. • Statistics at the function, loop and line level.

  29. U. Illinois: SvPablo

  30. U. Oregon: TAU • Source code based instrumentation of C, C++, F77, F90, HPF and pC++. • Maintains a program database in which to store and localize performance data. • Multiple lightweight tools and a launcher • Including call graph/control flow browser, a class browser, a remote debugger, MPI trace analysis and a profiler. • Integrated with PAPI.

  31. TAU: Racy/PAPI

  32. TAU: Racy

  33. Visual Profiler: vprof • Developed by Curtis Janssen at Sandia Livermore • Creates and visualizes line level execution profiles obtained with PC-sampling. • Data usually generated with the profil()/monitor() library/system call or done by hand with interval timers and signal information. • Ported to use PAPI_profil() in a day.

  34. Sandia Livermore: vprof

  35. Pacific Sierra Research DEEP/MPI • Source code instrumentation based profiling at the basic block level. (regions of code with 1 entry and 1 exit, order 10’s of instructions) • Comprehensive visualization and analysis. • Integrated source code browser with highlighting. • Works now with MPI, soon with OpenMP. • Integrated with PAPI.

  36. Pacific Sierra Research DEEP/MPI

  37. Web Resources • Mailing list • send “subscribe ptools-perfapi” to majordomo@ptools.org • ptools-perfapi@ptools.org is the reflector • Web page • http://icl.cs.utk.edu/projects/papi • Post-RISC paper by Richard Enbody et al. • http://www.cps.msu.edu/~crs/cps920/

  38. Web Resources 2 • PCL http://www.fz-juelich.de/zam/PCL/ • Vprof http://aros.ca.sandia.gov/~cljanss/perf/vprof/ • Paradyn http://www.cs.wisc.edu/paradyn/libhrtime/ • DynInst http://www.cs.umd.edu/projects/dyninstAPI/ • Libhrtime http://www.cs.wisc.edu/paradyn/libhrtime/ • TAU http://www.cs.uoregon.edu/research/paracomp/tau/ • SvPablo http://www-pablo.cs.uiuc.edu/Project/SVPablo/SvPabloOverview.htm

  39. The Future • x86/Alpha Linux kernel • Implementation under /proc • merge with libhrtime patch from U. Wisc • Support for signal dispatch on hardware counter overflow • Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC

  40. Source Code Access • Every 24 hours, a snapshot of the source tree is posted at: http://icl.cs.utk.edu/projects/papi/snapshot.cgi • Remote read-only access to the CVS source tree:
     > (csh) setenv CVSROOT anonymous@hera.cs.utk.edu:/cvs/homes/papi
     or
     % (sh) export CVSROOT=anonymous@hera.cs.utk.edu:/cvs/homes/papi
     cvs login (password: <cr>)
     cvs checkout papi (or cvs update)
     cd papi/src
     make -f Makefile.<arch>
     cvs logout

  41. The Future • Dynamic Instrumentation of Running Applications via Dyninst • Support of gathering performance data of Applications using MPI • Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC
