
IBM Hardware Performance Monitor (hpm)



  1. IBM Hardware Performance Monitor (hpm) NPACI Parallel Computing Institute August, 2002

  2. What is Performance? - Where is time spent, and how is it spent? • MIPS – Millions of Instructions Per Second • MFLOPS – Millions of Floating-Point Operations Per Second • Run time / CPU time

  3. What is a Performance Monitor? • Provides detailed processor/system data • Processor monitors • Typically a group of registers • Special-purpose registers keep track of programmable events • Non-intrusive counts result in “accurate” measurement of processor events • Typical events counted: instructions, floating-point instructions, cache misses, etc. • System-level monitors • Can be hardware or software • Intended to measure system activity • Examples: • Bus monitor: measures memory traffic, can analyze cache coherency issues in a multiprocessor system • Network monitor: measures network traffic, can analyze web traffic internally and externally

  4. Hardware Counter Motivations • To understand the execution behavior of application code • Why not use software? • Strength: simple, GUI interface • Weakness: large overhead, intrusive, operates at a higher level of abstraction • How about using a simulator? • Strength: control, low-level, accurate • Weakness: limits on code size, difficult to implement, time-consuming to run • When should we use hardware counters directly? • When software tools and simulators are unavailable or insufficient • Strength: non-intrusive, instruction-level analysis, moderate control, very accurate, low overhead • Weakness: not typically reusable, requires OS kernel support

  5. Ptools Project • PMAPI Project • Common standard API for the industry • Supported by IBM, SUN, SGI, COMPAQ, etc. • PAPI Project • Standard application programming interface • Portable, available through a module • Can access hardware counter info • HPM Toolkit • Easy to use • Doesn’t affect code performance • Uses hardware counters • Designed specifically for IBM SPs and POWER processors
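
  Since PAPI is the portable way to reach the same counters from application code, a minimal C sketch of its use may help. This is an illustrative example added here (not from the original slides); it uses standard PAPI preset events, and the events actually available depend on the processor.

     #include <stdio.h>
     #include <stdlib.h>
     #include <papi.h>

     int main(void) {
         int eventset = PAPI_NULL;
         long long values[2];                      /* one slot per counted event */
         double a = 1.0;
         int i;

         /* Initialize PAPI and build an event set with two preset events. */
         if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
         PAPI_create_eventset(&eventset);
         PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles        */
         PAPI_add_event(eventset, PAPI_FP_OPS);    /* floating point ops  */

         PAPI_start(eventset);
         for (i = 0; i < 1000000; i++)             /* code being measured */
             a = a * 1.0000001 + 0.5;
         PAPI_stop(eventset, values);

         printf("cycles = %lld  fp ops = %lld  (a = %f)\n",
                values[0], values[1], a);
         return 0;
     }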

  6. Problem Set • Should we collect all events all the time? • Not necessary, and wasteful • Which counts should be used? • Gather only what you need: • Cycles • Committed instructions • Loads • Stores • L1/L2 misses • L1/L2 stores • Committed floating-point instructions • Branches • Branch misses • TLB misses • Cache misses

  7. POWER3 Architecture

  8. IBM HPM Toolkit • High Performance Monitor • Developed for performance measurement of applications running on IBM POWER3 systems. It consists of: • A utility (hpmcount) • An instrumentation library (libhpm) • A graphical user interface (hpmviz) • Requires the PMAPI kernel extensions to be loaded • Works on IBM 630 and 604e processors • Based on IBM’s PMAPI, a low-level interface

  9. HPM Count • Utility for performance measurement of applications • Extra logic inserted into the processor counts specific events • Counters are updated at every cycle • Provides a summary output at the end of the execution: • Wall clock time • Resource usage statistics • Hardware performance counter information • Derived hardware metrics • Works for serial and parallel codes, giving performance numbers for each task

  10. HPM Usage – HW Event Categories
     • EVENT SET = 1 (floating point performance and usage of the floating point units): Cycles, Instructions completed, TLB misses, Stores completed, Loads completed, FPU0 ops, FPU1 ops, FMAs executed
     • EVENT SET = 2 (data locality and usage of the level 1 data cache): Cycles, Instructions completed, TLB misses, Stores dispatched, L1 store misses, Loads dispatched, L1 load misses, LSU idle
     • EVENT SET = 3 (performance and usage of the level 1 instruction cache): Cycles, Instructions dispatched, Instructions completed, Cycles with 0 instructions completed, I-cache misses, FXU0 ops, FXU1 ops, FXU2 ops
     • EVENT SET = 4 (usage of the level 2 data cache and branch prediction): Cycles, Loads dispatched, L1 load misses, L2 load misses, Stores dispatched, L2 store misses, Computation unit waiting on load, LSU idle

  11. HPM for Whole Program – Using HPMCOUNT
     • Installed in /usr/local/apps/hpm and /usr/local/apps/HPM_V2.3
     • Environment settings:
         setenv LIBHPM_EVENT_SET 1 (or 2, 3, 4)
         setenv MP_LABELIO YES        -> correlates each line of output with its task
         setenv MP_STDOUTMODE taskID  -> e.g. 0, to discard output from other tasks
     • Usage:
         poe hpmcount ./a.out -nodes 1 -tasks_per_node 1 -rmpool 1 [-s <set>] [-e ev[,ev]*] [-h]
         -h             displays a help message
         -e ev0,ev1,…   list of event numbers, separated by commas; ev<i> is the event selected for counter <i>
         -s             predefined set of events

  12. Derived Hardware Metrics • Hardware counters provide only raw counts • 8 counters on POWER3 • Enough information to generate derived metrics for each execution • Derived metrics: • Floating point rate • Computational intensity • Instructions per load/store • Loads/stores per data cache miss • Cache hit rate • Loads per load miss • Stores per store miss • Loads per TLB miss • FMA % • Branches mispredicted %

  13. HPMCOUNT Output (Event=1) – Resource Usage Statistics
     Total execution time of instrumented code (wall time) : 6.218496 seconds
     Total amount of time in user mode                      : 5.860000 seconds
     Total amount of time in system mode                    : 3.120000 seconds
     Maximum resident set size                              : 23408 Kbytes
     Average shared memory use in text segment              : 97372 Kbytes*sec
     Average unshared memory use in data segment            : 13396800 Kbytes*sec
     Number of page faults without I/O activity             : 5924
     Number of page faults with I/O activity                : 12
     Number of times process was swapped out                : 0
     Number of times file system performed INPUT            : 0
     Number of times file system performed OUTPUT           : 0
     Number of IPC messages sent                            : 0
     Number of IPC messages received                        : 0
     Number of signals delivered                            : 0
     Number of voluntary context switches                   : 2840
     Number of involuntary context switches                 : 27740

  14. HPMCOUNT Output (Event=1, continued) – Resource Statistics
     Instrumented section: 1 - Label: ALL - process: 1
     file: swim_omp.f, lines: 89 <--> 189
     Count: 1
     Wall Clock Time                        : 6.216718 seconds
     Total time in user mode                : 5.35645462067771 seconds
     Exclusive duration                     : 0.012166 seconds
     PM_CYC (Cycles)                        : 2008608171
     PM_INST_CMPL (Instructions completed)  : 1891769436
     PM_TLB_MISS (TLB misses)               : 2374441
     PM_ST_CMPL (Stores completed)          : 274169278
     PM_LD_CMPL (Loads completed)           : 672275023
     PM_FPU0_CMPL (FPU 0 instructions)      : 528010431
     PM_FPU1_CMPL (FPU 1 instructions)      : 245779486
     PM_EXEC_FMA (FMAs executed)            : 270299532

  15. Timers – time usually reports three metrics: • User time • The time your code spends executing on the CPU; also called CPU time • Total time in user mode = Cycles / Processor frequency • System time • The time your code spends running kernel code (doing I/O, writing to disk, printing to the screen, etc.) • It is worth minimizing system time by speeding up disk I/O, doing I/O in parallel, or doing I/O in the background while the CPU computes in the foreground • Wall clock time • Total execution time: the sum of user and system time plus the time spent idle (waiting for resources) • In parallel performance tuning, only wall clock time counts • Interprocessor communication can consume a significant share of execution time, and user/system time usually does not account for it, so rely on wall clock time for the total time consumed by the job
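
  To make the three metrics concrete, here is a minimal C sketch added for illustration (not part of the HPM toolkit) that reports wall clock, user, and system time around a code section using the standard UNIX gettimeofday and getrusage calls:

     #include <stdio.h>
     #include <sys/time.h>
     #include <sys/resource.h>

     /* Convert a struct timeval into seconds. */
     static double tv_sec(struct timeval tv) {
         return tv.tv_sec + tv.tv_usec * 1.0e-6;
     }

     int main(void) {
         struct timeval wall0, wall1;
         struct rusage ru;
         double x = 0.0;
         int i;

         gettimeofday(&wall0, NULL);
         for (i = 0; i < 10000000; i++)       /* section being timed */
             x += 1.0 / (i + 1.0);
         gettimeofday(&wall1, NULL);
         getrusage(RUSAGE_SELF, &ru);         /* user/system time for the whole process so far */

         printf("wall clock time : %f s\n", tv_sec(wall1) - tv_sec(wall0));
         printf("user time       : %f s\n", tv_sec(ru.ru_utime));
         printf("system time     : %f s\n", tv_sec(ru.ru_stime));
         printf("(result %f keeps the loop from being optimized away)\n", x);
         return 0;
     }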

  16. Floating Point Measures • PM_FPU0_CMPL (FPU 0 instructions) • The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel. Each FPU can start a new instruction every cycle. This counter shows the number of floating point instructions that have been executed by the first FPU. • PM_FPU1_CMPL (FPU 1 instructions) • This counter shows the number of floating point instructions (add, multiply, subtract, divide, multiply & add) that have been processed by the second FPU. • PM_EXEC_FMA (FMAs executed) • This is the number of Floating point Multiply & Add (FMA) instructions. An FMA performs a computation of the form x = s * a + b, so two floating point operations are done within one instruction. The compiler generates this instruction as often as possible to speed up the program, but sometimes additional manual optimization is needed to replace separate multiply and add instructions with one FMA.
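
  As an illustration (a sketch added here, not from the original slides), a loop of the following form is a natural candidate for FMA generation, because each iteration is a multiply feeding directly into an add:

     /* y[i] = s * a[i] + b[i]: the multiply and the dependent add can be
        fused by the compiler into a single multiply-add (FMA) instruction,
        giving two floating point operations per instruction. */
     void axpb(int n, double s, const double *a, const double *b, double *y) {
         int i;
         for (i = 0; i < n; i++)
             y[i] = s * a[i] + b[i];
     }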

  17. HPMCOUNT Output (Event=1, continued)
     Utilization rate                          : 86.162 %
     % TLB misses per cycle                    : 0.118 %
     Estimated latency from TLB misses         : 4.432 sec
     Avg number of loads per TLB miss          : 283.130
     Load and store operations                 : 946.444 M
     Instructions per load/store               : 1.999
     MIPS                                      : 304.304
     Instructions per cycle                    : 0.942
     HW floating point instructions per cycle  : 0.385
     Floating point instructions + FMAs        : 1044.089 M
     Float point instructions + FMA rate       : 167.949 Mflip/s
     FMA percentage                            : 51.777 %
     Computation intensity                     : 1.103

  18. Total Flop Rate • Float point instructions + FMA rate • This is the most often quoted performance index, the MFlops rate. • The peak performance of the POWER3-II processor is 1500 MFlops (375 MHz clock x 2 FPUs x 2 Flops per FMA instruction). • Many applications do not reach more than 10 percent of this peak performance. • Average number of loads per TLB miss • This value is the ratio PM_LD_CMPL / PM_TLB_MISS. Each time a TLB miss has been processed, fast access to a new page of data is possible. Small values for this metric indicate that the program has poor data locality; a redesign of the data structures in the program may yield significant performance improvements. • Computation intensity • Computational intensity is the ratio of floating point operations (floating point instructions + FMAs) to load and store operations.
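
  The derived metrics follow directly from the raw counters and the wall clock time. Below is a small illustrative C sketch (the variable names are made up for the example, and the formulas are inferred so that they reproduce the Event Set 1 numbers shown on the previous slides):

     #include <stdio.h>

     int main(void) {
         /* Raw counts and wall time from the Event Set 1 run shown earlier. */
         double wall   = 6.216718;       /* seconds        */
         double fpu0   = 528010431.0;    /* PM_FPU0_CMPL   */
         double fpu1   = 245779486.0;    /* PM_FPU1_CMPL   */
         double fma    = 270299532.0;    /* PM_EXEC_FMA    */
         double loads  = 672275023.0;    /* PM_LD_CMPL     */
         double stores = 274169278.0;    /* PM_ST_CMPL     */
         double tlb    = 2374441.0;      /* PM_TLB_MISS    */

         double flips = fpu0 + fpu1 + fma;   /* each FMA contributes one extra flop */
         double ldst  = loads + stores;

         printf("Float point instructions + FMA rate : %.3f Mflip/s\n",
                flips / wall / 1.0e6);                  /* ~167.9  */
         printf("FMA percentage                      : %.3f %%\n",
                100.0 * 2.0 * fma / flips);             /* ~51.8   */
         printf("Computation intensity               : %.3f\n",
                flips / ldst);                          /* ~1.103  */
         printf("Avg number of loads per TLB miss    : %.3f\n",
                loads / tlb);                           /* ~283.1  */
         return 0;
     }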

  19. HPM for Part of Program – Using LIBHPM • An instrumentation library for performance measurement of Fortran, C and C++ applications • Collects information and performs summarization during run time; generates a performance file for each task • Uses the same sets of hardware counter events as hpmcount • The user can specify an event set with the file libHPMevents • For each instrumented point in a program, libhpm provides: • Total count • Total duration (wall clock time) • Hardware performance counter information • Derived hardware metrics • Supports: • Multiple instrumentation points and nested instrumentation • OpenMP and threaded applications • Multiple calls to an instrumented point

  20. LIBHPM Functions • Fortran • f_hpminit(taskID) • f_hpmterminate(taskID) • f_hpmstart(instID) • f_hpmstop(instID) • f_hpmtstart(instID) • f_hpmtstop(instID) • C & C++ • hpmInit(taskID) • hpmTerminate(taskID) • hpmStart(instID) • hpmStop(instID) • hpmTstart(instID) • hpmTstop(instID)

  21. Using LIBHPM - C
     • Declaration: #include "libhpm.h"
     • C usage:
         MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
         hpmInit(taskID, "hpm_test");
         hpmStart(1, "outer call");
         /* code segment to be timed */
         hpmStop(1);
         hpmTerminate(taskID);
     • Compilation:
         mpcc_r -I/usr/local/apps/HPM_V2.3/include -O3 -lhpm_r -lpmapi -lm -qarch=pwr3 -qstrict -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib hpm_test.c -o hpm_test.x

  22. Using LIBHPM - Fortran
     • Declaration: #include "f_hpm.h"
     • Fortran usage:
         CALL MPI_COMM_RANK(MPI_COMM_WORLD, taskID, ierr)
         call f_hpminit(taskID)
         call f_hpmstart(instID)
         * code segment to be timed *
         call f_hpmstop(instID)
         call f_hpmterminate(taskID)
         CALL MPI_FINALIZE(ierr)
     • Compilation:
         mpxlf_r -I/usr/local/apps/HPM_V2.3/include -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp -L/usr/local/apps/HPM_V2.3/lib -lhpm_r -lpmapi -lm hpm_test.f -o hpm_test.x

  23. Using LIBHPM - Threads
     call f_hpminit(taskID)
     ! threaded (parallel) do loop
        call f_hpmtstart(10)
        ... do_work ...
        call f_hpmtstop(10)
     ! end parallel do
     ! threaded (parallel) do loop
        call f_hpmtstart(20 + my_thread_ID)
        ... do_work ...
        call f_hpmtstop(20 + my_thread_ID)
     ! end parallel do
     call f_hpmterminate(taskID)
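
  A hypothetical C/OpenMP analogue of the sketch above (an illustration added here; it assumes hpmTstart/hpmTstop take the same (instID, label) arguments as hpmStart/hpmStop and that the thread ID comes from omp_get_thread_num()):

     #include <omp.h>
     #include "libhpm.h"

     void threaded_sections(int taskID) {
         hpmInit(taskID, "hpm_threads");

         /* One instrumentation ID shared by all threads. */
         #pragma omp parallel
         {
             hpmTstart(10, "shared id");
             /* ... do_work ... */
             hpmTstop(10);
         }

         /* A distinct instrumentation ID for each thread. */
         #pragma omp parallel
         {
             int id = 20 + omp_get_thread_num();
             hpmTstart(id, "per-thread id");
             /* ... do_work ... */
             hpmTstop(id);
         }

         hpmTerminate(taskID);
     }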

  24. HPM Example Code in C
     #include <mpi.h>
     #include <stdio.h>
     #include "libhpm.h"
     #define n 10000

     int main(int argc, char *argv[]) {
         int taskID, i, numprocs;
         double a[n + 1], b[n + 1], c[n + 1];   /* n+1 elements so indices 1..n are valid */

         MPI_Init(&argc, &argv);
         MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
         MPI_Comm_rank(MPI_COMM_WORLD, &taskID);
         hpmInit(taskID, "hpm_test");

         hpmStart(1, "section 1");              /* first instrumented section  */
         for (i = 1; i < n + 1; i++) {
             a[i] = i;
             b[i] = n - 1;
         }
         hpmStop(1);

         hpmStart(2, "section 2");              /* second instrumented section */
         for (i = 2; i < n + 1; i++) {
             c[i] = a[i] * b[i] + a[i] / b[i];
         }
         hpmStop(2);

         hpmTerminate(taskID);
         MPI_Finalize();
         return 0;
     }

  25. HPM Example Code in Fortran
           program hpm_test
           parameter (n=10000)
           integer taskID,ierr,numtasks
           dimension a(n),b(n),c(n)
           include "mpif.h"
     #include "f_hpm.h"
           call MPI_INIT(ierr)
           call MPI_COMM_RANK(MPI_COMM_WORLD,taskID,ierr)
           call MPI_COMM_SIZE(MPI_COMM_WORLD,numtasks,ierr)
           call f_hpminit(taskID,"hpm_test")
           call f_hpmstart(5,"section1")
           do i=1,n
              a(i)=real(i)
              b(i)=real(n-i)
           enddo
           call f_hpmstop(5)
           call f_hpmterminate(taskID)
           call MPI_FINALIZE(ierr)
           end

  26. Compiling and Linking
     FF      = mpxlf_r
     HPM_DIR = /usr/local/apps/HPM_V2.3
     HPM_INC = -I$(HPM_DIR)/include
     HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
     FFLAGS  = -qsuffix=cpp=f -O3 -qarch=pwr3 -qstrict -qsmp=omp
     # Note: -qsuffix=cpp=f is only required for Fortran code with a ".f" suffix
     hpm_test.x: hpm_test.f
     	$(FF) $(HPM_INC) $(FFLAGS) hpm_test.f $(HPM_LIB) -o hpm_test.x

  27. HPMVIZ • Takes as input the performance files generated by libhpm • Usage: > hpmviz [<performance files (.viz)>] • The user can define a range of values considered satisfactory: • Red: below the predefined minimum recommended value • Green: above the threshold value • Left pane of the HPMVIZ window: • displays, for each instrumented point identified by its label, the inclusive duration, exclusive duration, and count • Right pane of the HPMVIZ window: • shows the corresponding source code, which can be edited and saved • The “metrics” windows: • display the task ID, thread ID, count, exclusive duration, inclusive duration, and the derived hardware metrics

  28. HPMVIZ

  29. IBM SP HPM Toolkit Summary • Addresses a complete problem set: • Derived metrics • Analysis of error messages • Analysis of derived metrics • HPMCOUNT: very accurate with low overhead, non-intrusive, gives a general view of the whole program • LIBHPM: uses the same event sets as hpmcount, instruments parts of a program • HPMVIZ: makes it easier to view the hardware counter information and derived metrics

  30. HPM References • HPM “README” file in /usr/local/apps/HPM_V2.3 • Online Documentation http://www.sdsc.edu/SciApps/IBM_tools/hpm.html

  31. Lab Session for HPM – Environment Setup
     Setup for running X-Windows applications on PCs:
     1. Log in to b80login.sdsc.edu using CRT (located in Applications (Common)).
     2. Launch Exceed (located in "Applications (Common)" or via the shortcut on your desktop called "Humming Bird").
     3. Set your environment; for csh: setenv DISPLAY t-wolf.sdsc.edu:0.0 (where "t-wolf", for example, is the name of the PC you are using)
     4. Copy files from the /work/Training/HPM_Training directory into your own working space:
        * create a directory to work with HPM: mkdir HPM
        * change into the new directory: cd HPM
        * copy the files into the new directory: cp /work/Training/HPM_Training/* .
     5. Go to /work/Training/HPM_Training/simple/

  32. Lab Session for HPM – Running HPM
     1. Compile either the Fortran or the C example: make -f makefile_f (or makefile_c)
     2. Run the executable either interactively or through batch; interactive command: poe hpm_test.x -nodes 1 -tasks_per_node 2 -euilib ip -euidevice en0
     3. Explore the hpmcount summary output, looking at both usage and resource statistics
