400 likes | 600 Views
Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3. Philip Mucci m ucci@cs.utk.edu Shirley Moore shirley@cs.utk.edu Daniel Terpstra terpstra@cs.utk.edu. Hardware Counters.
E N D
Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3 Philip Mucci mucci@cs.utk.edu Shirley Moore shirley@cs.utk.edu Daniel Terpstra terpstra@cs.utk.edu
Hardware Counters • Small set of registers that count events, which are occurrences of specific signals related to the processor’s function • Monitoring these events facilitates correlation between the structure of the source/object code and the efficiency of the mapping of that code to the underlying architecture. ScicomP4 Knoxville, TN
Goals of PAPI • Solid foundation for cross platform performance analysis tools • Free tool developers from re-implementing counter access • Standardization between vendors, academics and users • Encourage vendors to provide hardware and OS support for counter access • Reference implementations for a number of HPC architectures • Well documented and easy to use ScicomP4 Knoxville, TN
Overviewof PAPI • Performance Application Programming Interface • The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors. • Parallel Tools Consortium project http://www.ptools.org/ ScicomP4 Knoxville, TN
PAPI Counter Interfaces • PAPI provides three interfaces to the underlying counter hardware: • The low level interface manages hardware events in user defined groups called EventSets. • The high level interface simply provides the ability to start, stop and read the counters for a specified list of events. • Graphical tools to visualize information. ScicomP4 Knoxville, TN
PAPI Implementation Tools!!! PAPI Low Level PAPI High Level Portable Layer PAPI Machine Dependent Substrate Machine Specific Layer Kernel Extension Operating System Hardware Performance Counter ScicomP4 Knoxville, TN
PAPI Preset Events • Proposed standard set of events deemed most relevant for application performance tuning • Defined in papiStdEventDefs.h • Mapped to native events on a given platform • Run tests/avail to see list of PAPI preset events available on a platform ScicomP4 Knoxville, TN
High-level Interface • Meant for application programmers wanting coarse-grained measurements • Not thread safe • Calls the lower level API • Allows only PAPI preset events • Easier to use and less setup (additional code) than low-level ScicomP4 Knoxville, TN
Fortran interfacePAPIF_start_countersPAPIF_read_countersPAPIF_stop_countersPAPIF_accum_countersPAPIF_num_countersPAPIF_flops High-level API • C interfacePAPI_start_countersPAPI_read_countersPAPI_stop_countersPAPI_accum_countersPAPI_num_countersPAPI_flops ScicomP4 Knoxville, TN
Using the High-level API • Int PAPI_num_counters(void) • Initializes PAPI (if needed) • Returns number of hardware counters • int PAPI_start_counters(int *events, int len) • Initializes PAPI (if needed) • Sets up an event set with the given counters • Starts counting in the event set • int PAPI_library_init(int version) • Low-level routine implicitly called by above ScicomP4 Knoxville, TN
Controlling the Counters • PAPI_stop_counters(long_long *vals, int alen) • Stop counters and put counter values in array • PAPI_accum_counters(long_long *vals, int alen) • Accumulate counters into array and reset • PAPI_read_counters(long_long *vals, int alen) • Copy counter values into array and reset counters • PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops) • Wallclock time, process time, FP ins since start, • Mflop/s since last call ScicomP4 Knoxville, TN
PAPI_flops • int PAPI_flops(float *real_time, float *proc_time, long_long *flpins, float *mflops) • Only two calls needed, PAPI_flops before and after the code you want to monitor • real_time is the wall-clocktime between the two calls • proc_time is the “virtual” time or time the process was actually executing between the two calls (not as fine grained as real_time but better for longer measurements) • flpins is the total floating point instructions executed between the two calls • mflops is the Mflop/s rating between the two calls ScicomP4 Knoxville, TN
PAPI High-level Example long long values[NUM_EVENTS]; unsigned int Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC}; /* Start the counters */ PAPI_start_counters((int*)Events,NUM_EVENTS); /* What we are monitoring? */ do_work(); /* Stop the counters and store the results in values */ retval = PAPI_stop_counters(values,NUM_EVENTS); ScicomP4 Knoxville, TN
Return Codes ScicomP4 Knoxville, TN
Low-level Interface • Increased efficiency and functionality over the high level PAPI interface • About 40 functions • Obtain information about the executable and the hardware • Thread-safe • Fully programmable • Callbacks on counter overflow ScicomP4 Knoxville, TN
Low-level Functionality • Library initialization PAPI_library_init, PAPI_thread_init, PAPI_shutdown • Timing functions PAPI_get_real_usec, PAPI_get_virt_usecPAPI_get_real_cyc, PAPI_get_virt_cyc • Inquiry functions • Management functions • Simple lock PAPI_lock/PAPI_unlock ScicomP4 Knoxville, TN
Event sets • The event set contains key information • What low-level hardware counters to use • Most recently read counter values • The state of the event set (running/not running) • Option settings (e.g., domain, granularity, overflow, profiling) • Event sets can overlap if they map to the same hardware counter set-up. • Allows inclusive/exclusive measurements ScicomP4 Knoxville, TN
Event set operations • Event set managementPAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset • Event set controlPAPI_start, PAPI_stop, PAPI_read, PAPI_accum • Event set inquiryPAPI_query_event, PAPI_list_events,... ScicomP4 Knoxville, TN
Simple example #include "papi.h“ #define NUM_EVENTS 2 int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;long_long values[NUM_EVENTS]; /* Initialize the Library */ retval = PAPI_library_init(PAPI_VER_CURRENT); /* Allocate space for the new eventset and do setup */ retval = PAPI_create_eventset(&EventSet); /* Add Flops and total cycles to the eventset */ retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS); /* Start the counters */ retval = PAPI_start(EventSet); do_work(); /* What we want to monitor*/ /*Stop counters and store results in values */ retval = PAPI_stop(EventSet,values); ScicomP4 Knoxville, TN
Overlapping counters retval = PAPI_start(InclEventSet); retval = PAPI_start(OthersEventSet); ... retval = PAPI_reset(OthersEventSet); do_flops(NUM_FLOPS); /* Function call */ retval = PAPI_accum(OthersEventSet,Othersvalues); ... retval = PAPI_stop(InclEventSet,Inclvalues); printf("Counts: %12lld %12lld\n", Inclvalues[0], Inclvalues[0]-Othersvalues[0]); ScicomP4 Knoxville, TN
Counter Domains • int PAPI_set_domain(int domain); • PAPI_DOM_USER User context counted • PAPI_DOM_KERNEL Kernel/OS context counted • PAPI_DOM_OTHER Exception/transient mode • PAPI_DOM_ALL All contexts counted • PAPI_DOM_MIN The smallest available context • PAPI_DOM_MAX The largest available context • Requires OS support, all domains not available on all platforms • Default is PAPI_DOM_USER ScicomP4 Knoxville, TN
Using PAPI with Threads • After PAPI_library_init need to register unique thread identifier function • For Pthreads retval=PAPI_thread_init(pthread_self, 0); • OpenMP retval=PAPI_thread_init(omp_get_thread_num, 0); • Each thread responsible for creation, start, stop and read of its own counters ScicomP4 Knoxville, TN
Using PAPI with Multiplexing • Multiplexing allows simultaneous use of more counters than are supported by the hardware. • Some platforms support multiplexing in the OS (e.g., SGI IRIX); on those that don’t PAPI implements multiplexing in software. • The more events you multiplex and the shorter the amount of time, the more likely the representation is not correct. • See John May’s excellent whitepaper ScicomP4 Knoxville, TN
Issues with Multiplexing • PAPI_multiplex_init() • should be called after PAPI_library_init() to initialize multiplexing • PAPI_set_multiplex( int *EventSet ); • Used after the eventset is created to turn on multiplexing for that eventset • Then use PAPI like normal ScicomP4 Knoxville, TN
Native Events • An event countable by the CPU can be counted even if there is no matching preset PAPI event • Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition • /usr/lpp/pmtoolkit/lib/POWER3-II.evs ScicomP4 Knoxville, TN
Power3 Native Event Example See for list of native events. The encoding for native events is: Lower 8 bits indicate which counter number: 0 – 7 Bits 8-16 indicate which event number: 0-50 native = (35 << 8) | 1; /* FPU1CMPL */ PAPI_add_event(&EventSet,native) ScicomP4 Knoxville, TN
Power 3 General Events • INST_CMPL (1) • FPU1_CMPL (35) • LD_MISS_L1 (5) • LD_CMPL (5) • FPU0_CMPL (5) • CYC (12) • FMA (9) • TLB (0) ScicomP4 Knoxville, TN
Power 3 False Sharing Events • L1_M_TO_E_OR_S (25) • E_TO_S (23) • L2_E_OR_S_TO_I (21) • L2_M_TO_E_OR_S (27) • L2_M_TO_I (10) • PUSH_INT (8) • 0INST_DISP (6) • TLB (0) ScicomP4 Knoxville, TN
Callbacks on Counter Overflow • PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold. • For systems that do not support counter overflow at the OS level, PAPI sets up a high resolution interval timer and installs a timer interrupt handler. ScicomP4 Knoxville, TN
PAPI_overflow • int PAPI_overflow(int EventSet, int EventCode, int threshold, int flags, PAPI_overflow_handler_t handler) • Sets up an EventSet such that when it is PAPI_start()’d, it begins to register overflows • The EventSet may contain multiple events, but only one may be an overflow trigger. ScicomP4 Knoxville, TN
Statistical Profiling • PAPI provides support for execution profiling based on any counter event. • PAPI_profil() creates a histogram by text address of overflow counts for a specified region of the application code. • Used in vprof tool from Sandia Livermore ScicomP4 Knoxville, TN
Linux/x86, Windows 2000 Requires patch to Linux kernel, driver for Windows Linux/IA-64 Sun Solaris 2.8/Ultra I/II IBM AIX 4.3+/Power Contact IBM for pmtoolkit SGI IRIX/MIPS Compaq Tru64/Alpha Ev6 & Ev67 Requires OS device driver from Compaq Cray T3E/Unicos PAPI Release Platforms ScicomP4 Knoxville, TN
For More Information • http://icl.cs.utk.edu/projects/papi/ • Software and documentation • Reference materials • Papers and presentations • Third-party tools • Mailing lists ScicomP4 Knoxville, TN
PERC: Performance Evaluation Research Center • Developing a science for understanding performance of scientific applications on high-end computer systems. • Developing engineering strategies for improving performance on these systems. • DOE Labs: ANL, LBNL, LLNL, ORNL • Universities: UCSD, UI-UC, UMD, UTK • Funded by SciDAC: Scientific Discovery through Advanced Computing ScicomP4 Knoxville, TN
PERC: Real-World Applications • High Energy and Nuclear Physics • Shedding New Light on Exploding Stars: Terascale Simulations of Neutrino-Driven SuperNovae and Their NucleoSynthesis • Advanced Computing for 21st Century Accelerator Science and Technology • Biology and Environmental Research • Collaborative Design and Development of the Community Climate System Model for Terascale Computers • Fusion Energy Sciences • Numerical Computation of Wave-Plasma Interactions in Multi-dimensional Systems • Advanced Scientific Computing • Terascale Optimal PDE Solvers (TOPS) • Applied Partial Differential Equations Center (APDEC) • Scientific Data Management (SDM) • Chemical Sciences • Accurate Properties for Open-Shell States of Large Molecules • …and more… ScicomP4 Knoxville, TN
Parallel Climate Transition Model • Components for Ocean, Atmosphere, Sea Ice, Land Surface and River Transport • Developed by Warren Washington’s group at NCAR • POP: Parallel Ocean Program from LANL • CCM3: Community Climate Model 3.2 from NCAR including LSM: Land Surface Model • ICE: CICE from LANL and CCSM from NCAR • RTM: River Transport Module from UT Austin ScicomP4 Knoxville, TN
PCTM: The Code • 132K lines of code • Mostly Fortran 90 with some Fortran 77 and C • Each component runs sequentially • Each component is parallel • The overall model must run as pure MPI • Code is already instrumented with timing library in flux coupler • Add additional PAPI code to measure 8 hardware metrics for each timer ScicomP4 Knoxville, TN
PCTM: Parallel Climate Transition Model RiverModel OceanModel AtmosphereModel Flux Coupler LandSurfaceModel Sea Ice Model Sequential Executionof Parallelized Modules ScicomP4 Knoxville, TN
PCTM: Performance Questions • Are we cache, functional unit or bound? • How does shared memory MPI affect performance? • How many tasks on a 4-way node? ScicomP4 Knoxville, TN
PCTM: Answers • No answers yet… Wait for SC2001 • Measure coarse regions of code for • Instructions per Cycle • Stall Cycles • 0 Instruction Dispatch Cycles • Measure those regions with multiplexing and inspect code • Statistically profile candidates for improvement, line by line ScicomP4 Knoxville, TN