
IBM Hardware Performance Monitor (hpm)


Presentation Transcript


  1. IBM Hardware Performance Monitor (hpm) SDSC Parallel Computing Institute

  2. Using HPM on IBM SP
  • What is Performance? FLOPS? OPS?
  • Monitor selected hardware events on/off-chip: cache misses (L1, L2), TLB misses, FP ops, etc.
  • HW-based monitors
  • SW-based monitors
  • Ptools project - PMAPI

  3. Using HPM on IBM SP
  What is Performance? Many different measures:
  • Total operations/sec
  • Floating-point operations/sec
  • Time-to-solution (wall-clock time)
  Want to:
  • Monitor selected hardware events on/off-chip: cache misses (L1, L2), TLB misses, FP ops, etc.
  • HW-based monitors are least intrusive
  • SW-based monitors often have an API
  • Ptools project - PMAPI

  4. Using HPM on IBM SP
  Want to optimize single-CPU performance on RISC architectures. RISC CPU features:
  • Multiple functional units
  • Instruction pipeline(s)
  • Memory hierarchy: L1 cache, L2 cache, main memory

  5. Using HPM on IBM SP
  Peak CPU “speed” (simple formula):
  Peak rate = clock speed (cycles/sec) × max. # FP ops/cycle
  Example, IBM SP: 375 M cycles/sec × 4 = 1.5 Gflop/sec
  In practice you rarely see anything close to this speed. Where are the missing cycles?
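The peak-rate formula above is a one-line multiplication; a minimal sketch in C, using the slide's numbers (375 MHz, 4 FP ops/cycle). The function name `peak_mflops` is ours, not part of any IBM library:

```c
#include <assert.h>

/* Peak FP rate in Mflop/s = clock (MHz) times max FP results per cycle.
   375 MHz and 4 flops/cycle are the slide's IBM SP (POWER3) figures. */
static double peak_mflops(double clock_mhz, int max_fp_ops_per_cycle) {
    return clock_mhz * max_fp_ops_per_cycle;
}
```

For the slide's machine, `peak_mflops(375.0, 4)` gives 1500 Mflop/s, i.e. the 1.5 Gflop/sec quoted above.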

  6. Using HPM on IBM SP
  Hardware Performance Monitor:
  • HPC RISC chips often have built-in hardware event monitors
  • Monitors gather data on selected hardware events on/off-chip: cache misses (L1, L2), TLB misses, FP ops, memory ops (loads/stores), etc.
  • HW timer
  • HW-based monitors are generally least intrusive
  • SW-based monitors often have a “nice” API
  • PTools project – PMAPI: a common “standard” API for industry, supported by IBM, SUN, SGI, COMPAQ

  7. Using HPM on IBM SP
  IBM SP – access to hardware counters via libhpm:
  • Software library
  • Reads monitor data for selected sets of hardware events on/off-chip: cache misses (L1, L2), TLB misses, FP ops, memory ops (loads/stores), etc.
  • API supports C/F90
  • libwct – timer library
  • Supported by IBM

  8. IBM SP HPM Usage: HW Event Categories

  LIBHPM_EVENT_SET = 1       LIBHPM_EVENT_SET = 2       LIBHPM_EVENT_SET = 3
  Cycles                     Cycles                     Cycles
  TLB misses                 Instructions dispatched    Stores completed
  FPU0 ops                   Store misses in L1         Loads completed
  FPU1 ops                   Load misses in L1          FXU0 ops
  Stores completed           Stores dispatched          FXU1 ops
  Loads completed            Loads dispatched           FXU2 ops
  FMAs executed              Load queue full            FPU0 ops
  Float Adds or Multiplies   TLB misses                 FPU1 ops

  9. Using HPM for Whole Program
  • Installed in /usr/local/apps/hpm
  • Usage:
  setenv LIBHPM_EVENT_SET 1   (or 2, 3)
  poe /usr/local/apps/hpm/hpmcount ./a.out -nodes 1 -tasks_per_node 1 -rmpool 1
  Sample output:
  hpmcount (V 0.9) summary
  Total execution time: 0.016292 seconds
  PM_CYC (Cycles) : 994597
  PM_TLB_MISS (TLB misses) : 329
  PM_FPU0_CMPL (FPU 0 instructions) : 59978
  PM_FPU1_CMPL (FPU 1 instructions) : 20023

  10. Using HPM for Part of the Program
  • Fortran usage:
  CALL MPI_COMM_RANK( MPI_COMM_WORLD, taskid, ierr )
  call f_hpm_init(taskid)
  call f_hpm_start( instid )
  * code segment to be timed *
  call f_hpm_stop( instid )
  call f_hpm_terminate(taskid)
  CALL MPI_FINALIZE(ierr)
  • Compilation:
  mpxlf -O3 -qarch=pwr3 code2.f /usr/local/apps/hpm/libhpm.a \
    -L/usr/lpp/pmtoolkit/lib -lpmapi -bI:/usr/lpp/pmtoolkit/lib/pmsvcs.exp -lm

  11. Using HPM - Sample code in Fortran
  call f_hpm_start( 2 )
  do i=1,10000
    c(i)=a(i)*b(i)+a(i)/b(i)
  enddo
  call f_hpm_stop( 2 )

  12. Using HPM - Result and Analysis (Event=1)
  Instrumented point: 2  process: 0  Count: 1
  Duration: 0.000938 seconds
  PM_CYC (Cycles) : 145162 (≈14.5 cycles per loop iteration)
  PM_TLB_MISS (TLB misses) : 25 (translation lookaside buffer)
  PM_FPU0_CMPL (FPU 0 instructions) : 10052
  PM_FPU1_CMPL (FPU 1 instructions) : 10029
  PM_ST_CMPL (Stores completed) : 10078
  PM_LD_CMPL (Loads completed) : 20121
  PM_EXEC_FMA (FMAs executed) : 20079
  PM_FPU_FADD_FMUL (Float Adds or Multiplies) : 0

  13. Using HPM - Result and Analysis (Event=1), continued
  Average number of loads per TLB miss : 804.840000 (=20121/25)
  Total loads and stores : 3.019900E+04 (=20K+10K)
  Total hardware floating point operations : 2.008100E+04 (=10K+10K)
  Hardware float point rate : 21.408316 Mflop/sec (=20K/0.000938)
  Total number of multiplies and adds : 4.015800E+04
  Float multiply add rate : 42.812367 Mflop/sec (=40K/0.000938)
  Computation intensity : 1.329779 (=40K/30K)
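The parenthesized formulas above can be checked mechanically from the raw Event=1 counters on slide 12. A minimal sketch in C (the helper names are ours, not part of libhpm):

```c
#include <assert.h>

/* Hardware flop rate in Mflop/s: completed FPU0+FPU1 results over time. */
static double flop_rate_mflops(double fpu0, double fpu1, double seconds) {
    return (fpu0 + fpu1) / seconds / 1.0e6;
}

/* Computation intensity: floating-point operations (an FMA counts as a
   multiply plus an add, i.e. two) per load/store. */
static double computation_intensity(double fmas, double fadd_fmul,
                                    double loads, double stores) {
    return (2.0 * fmas + fadd_fmul) / (loads + stores);
}
```

Plugging in slide 12's counters, `flop_rate_mflops(10052, 10029, 0.000938)` reproduces the 21.408 Mflop/sec figure, and `computation_intensity(20079, 0, 20121, 10078)` reproduces 1.3298.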

  14. Using HPM - Result and Analysis (Event=2)
  PM_CYC (Cycles) : 175445
  PM_INST_CMPL (Instructions completed) : 64903
  PM_LD_DISP (Loads dispatched) : 20190
  PM_LD_MISS_L1 (Load misses in L1) : 646
  PM_ST_DISP (Stores dispatched) : 10113
  PM_ST_MISS (Store misses in L1) : 30
  PM_LQ_FULL (Load queue full) : 2349
  PM_TLB_MISS (TLB misses) : 25
  Cycles per Instruction : 2.703188
  Average number of loads per TLB miss : 807.600000
  Total loads and stores : 3.030300E+04
  Instructions per load/store : 2.141801
  Average number of loads per load miss : 31.253870
  Average number of stores per store miss : 337.100000
  Average number of load/stores per D1 miss : 44.826923
  L1 cache hit rate : 97.77 %
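The cycles-per-instruction and L1 hit-rate figures follow directly from the Event=2 counters. A sketch of the two formulas, with helper names of our own choosing:

```c
#include <assert.h>

/* Average cycles per completed instruction. */
static double cycles_per_instruction(double cycles, double instructions) {
    return cycles / instructions;
}

/* Percentage of dispatched loads/stores that hit in the L1 data cache. */
static double l1_hit_rate_pct(double ld_disp, double st_disp,
                              double ld_miss, double st_miss) {
    return 100.0 * (1.0 - (ld_miss + st_miss) / (ld_disp + st_disp));
}
```

With the counters above, `cycles_per_instruction(175445, 64903)` gives the 2.703 CPI shown, and `l1_hit_rate_pct(20190, 10113, 646, 30)` gives the 97.77 % hit rate.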

  15. Using HPM - Result and Analysis (Event=3)
  PM_CYC (Cycles) : 145260
  PM_INST_CMPL (Instructions completed) : 54869
  PM_IC_MISS (Instruction cache misses) : 22
  PM_FXU0_PROD_RESULT (FXU 0 instructions) : 1152
  PM_FXU1_PROD_RESULT (FXU 1 instructions) : 897
  PM_FXU2_PROD_RESULT (FXU 2 instructions) : 85
  PM_FPU1_CMPL (FPU 1 instructions) : 10029
  PM_FPU0_CMPL (FPU 0 instructions) : 10052
  Cycles per Instruction : 2.647397
  Instructions per I Cache Miss : 2494.045455
  Total number of fixed point operations : 2.134000E+03
  Total hardware floating point operations : 2.008100E+04
  Hardware float point rate : 21.093487 Mflop/sec

  16. Using HPM - Result and Analysis (Event=1), Case 1
  do i=1,n
    c(i)=a(i)*(b(i)+1/b(i))
  enddo
  PM_CYC (Cycles) : 175149
  PM_TLB_MISS (TLB misses) : 28
  PM_FPU0_CMPL (FPU 0 instructions) : 20088
  PM_FPU1_CMPL (FPU 1 instructions) : 10033
  PM_ST_CMPL (Stores completed) : 10078
  PM_LD_CMPL (Loads completed) : 20121
  PM_EXEC_FMA (FMAs executed) : 10039
  PM_FPU_FADD_FMUL (Float Adds or Multiplies) : 20080
  Hardware float point rate : 27.786900 Mflop/sec
  Total number of multiplies and adds : 4.015800E+04
  Float multiply add rate : 37.046125 Mflop/sec

  17. Using HPM - Calling from C
  #include <stdio.h>
  #include <mpi.h>
  #define n 10000

  void hpm_init( int my_ID );
  void hpm_terminate( int my_ID );
  void hpm_start( int inst_ID );
  void hpm_stop( int inst_ID );

  int main(int argc, char **argv){
    int taskid,i;
    double a[n],b[n],c[n];
    MPI_Init(&argc,&argv);
    MPI_Comm_rank( MPI_COMM_WORLD, &taskid);
    hpm_init(taskid);
    hpm_start( 1 );
    /* indices run 0..n-1 so a[n] is never touched and b[i] is never 0 */
    for(i=0;i<n;i++){
      a[i]=i+1;
      b[i]=n-i;
    }

  18. Using HPM - Calling from C (continued)
    hpm_stop( 1 );
    hpm_start( 2 );
    for(i=0;i<n;i++){
      c[i]=a[i]*b[i]+a[i]/b[i];
    }
    hpm_stop( 2 );
    hpm_terminate(taskid);
    MPI_Finalize();
    return 0;
  }
  Compilation:
  mpcc -O3 -qarch=pwr3 code2.c /usr/local/apps/hpm/libhpm.a \
    -L/usr/lpp/pmtoolkit/lib -lpmapi -bI:/usr/lpp/pmtoolkit/lib/pmsvcs.exp -lm

  19. HPM Code in Fortran
  parameter (n=10000)
  integer taskid
  dimension a(n),b(n),c(n)
  include "mpif.h"
  call mpi_init(ierr)
  CALL MPI_COMM_RANK( MPI_COMM_WORLD, taskid, ierr )
  call f_hpm_init(taskid)
  call f_hpm_start( 1 )
  do i=1,n
    a(i)=real(i)
    b(i)=real(n-i+1)
  enddo
  (b(i) is kept nonzero so the a(i)/b(i) in the timed loop is safe)

  20. HPM Code in Fortran (continued)
  call f_hpm_stop( 1 )
  call f_hpm_start( 2 )
  do i=1,n
    c(i)=a(i)*b(i)+a(i)/b(i)
  enddo
  call f_hpm_stop( 2 )
  call f_hpm_terminate(taskid)
  CALL MPI_FINALIZE(ierr)
  end

  21. Using IBM SP HPM
  The IBM HPM library includes libwct – an accurate real-time clock
  • API supports C/F90

  22. Using HPM on IBM SP References: • HPM “README” file in /usr/local/apps/hpm
