Performance Monitoring Tools on TCS
Roberto Gomez and Raghu Reddy, Pittsburgh Supercomputing Center
David O'Neal, National Center for Supercomputing Applications
Objective
• Measure single PE performance
  • Operation counts, wall time, MFLOP rates
  • Cache utilization ratio
• Study scalability
  • Time spent in MPI calls vs. computation
  • Time spent in OpenMP parallel sections
Atom Tools
• atom(1) provides a collection of binary instrumentation tools
• Low overhead
• No recompiling, and in some cases no re-linking, required
Useful Tools
• Flop2: Floating point operation count
• Timer5: Wall time (inclusive & exclusive) per routine
• Calltrace: Detailed statistics on calls and their arguments
  • Developed by Dick Foster @ Compaq
Instrumentation
• Load atom module
  module load atom
• Create routines file
  nm -g a.out | awk '{if($5=="T") print $1}' > routines
• Edit routines file
  • Place main routine first; remove unwanted entries
• Instrument executable
  cat routines | atom -tool flop2 a.out
  cat routines | atom -tool timer5 a.out
• Execute a.out.flop2 and a.out.timer5 to create the fprof.* and tprof.* output files
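Taken together, the steps above amount to the following sequence (a sketch; it assumes an interactive csh session with the modules environment initialized and an executable named a.out):

  # Instrument a.out with the flop2 and timer5 atom tools
  module load atom
  nm -g a.out | awk '{if($5=="T") print $1}' > routines
  # edit routines: main routine first, unwanted entries removed
  cat routines | atom -tool flop2  a.out    # produces a.out.flop2
  cat routines | atom -tool timer5 a.out    # produces a.out.timer5
  ./a.out.flop2                             # writes fprof.* files
  ./a.out.timer5                            # writes tprof.* files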
Single PE Performance Analysis
Sample Timer5 output file:

  Procedure                  Calls     Self Time    Total Time
  =========                  =====     =========    ==========
  $null_evol$null_j_          3072      60596709      79880903
  $null_eth$null_d1_         72458      45499161      45499161
  $null_hyper_u$null_u_       3328      39889655      44500045
  $null_hyper_w$null_w_       3328      19195271      33769541
  ...                          ...           ...           ...
  =========               =========   ===========   ===========
  Total                    1961226     248258934     248258934
Single PE Performance Analysis
Sample Flop2 output file:

  Procedure                  Calls            Fops
  =========                  =====            ====
  $null_evol$null_j_          3072     20406036288
  $null_eth$null_d1_         72458     20220926518
  $null_hyper_u$null_u_       3328     14062774258
  $null_hyper_w$null_w_       3328      3823795456
  ...                          ...             ...
  =========               =========   =============
  Total                    1936818     70876179927

Obtain MFLOPS = Fops / (Self Time)
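For example, combining the two profiles for the top routine, and assuming Self Time is reported in microseconds, $null_evol$null_j_ runs at roughly 20406036288 / 60596709 ≈ 337 MFLOPS.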
MPI calltrace
• module load atom
• cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
• Execute a.out.calltrace to generate one trace file per PE
• Gather timings for desired MPI routines
• Repeat for increasing number of processors
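A typical run sequence might look like this (a sketch; prun is assumed to be the MPI launcher on the TCS, and the processor counts match the sample table below):

  # Instrument only the MPI entry points listed in $ATOMPATH/mpicalls
  module load atom
  cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
  # Run the instrumented binary at increasing PE counts;
  # each run writes one trace file per PE
  prun -n 8   ./a.out.calltrace
  prun -n 128 ./a.out.calltrace
  prun -n 256 ./a.out.calltrace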
Sample calltrace statistics:

  Number of processors        8 PEs     128 PEs     256 PEs
  Processor grid              2x2x2       8x4x4       8x8x4
  Total run time            277.028     314.857     422.170
  MPI_ISEND Statistics        1.250       1.498       2.265
  MPI_RECV Statistics         4.349      19.779      26.537
  MPI_WAIT Statistics         9.172      16.311      20.150
  MPI_ALLTOALL Statistics     5.072       9.433      12.894
  MPI_REDUCE Statistics       0.013       0.162       0.002
  MPI_ALLREDUCE Statistics    0.391       2.073      10.313
  MPI_BCAST Statistics        0.061       1.135       1.382
  MPI_BARRIER Statistics     14.959      28.694      62.028
  __________________________________________________________
  Total MPI Time             35.267      79.085     135.571
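These totals already show the scaling trend: MPI accounts for roughly 35.267/277.028 ≈ 13% of the run time on 8 PEs, 79.085/314.857 ≈ 25% on 128 PEs, and 135.571/422.170 ≈ 32% on 256 PEs, with MPI_BARRIER and MPI_RECV contributing the largest absolute increases.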
DCPI
• Digital Continuous Profiling Infrastructure
  • Daemon and profiling utilities
• Very low overhead (1-2%)
• Aggregate or per-process data and analysis
• No code modifications
• Requires interactive access to compute nodes
DCPI Example
• Driver script
  • Creates map file and host list
  • Calls daemon and profiling scripts
• Daemon startup script
  • Starts daemon with selected options
• Daemon shutdown script
  • Halts daemon
• Profiling script
  • Executes post-processing utility with selected options
DCPI Driver Script
• PBS job file: dcpi.pbs
• Creates map file and host list
  • Image map generated by dcpiscan(1)
  • Host list used by dsh(1) commands
• Executes daemon and profiling scripts
  • Start daemon, run test executable, stop daemon, post-process
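A stripped-down, single-node version of the driver might look like this (a sketch: the script names and the MAP/WORK/EXE argument order follow the slides, the WORK and EXE paths are assumptions, and the real dcpi.pbs additionally builds the image map with dcpiscan(1) and fans these commands out to all compute nodes with dsh(1)):

  #!/bin/csh -f
  #PBS -l walltime=0:30:00
  # Simplified single-node driver (dcpi.pbs sketch)
  set WORK = $HOME/dcpi_example       # working directory (assumed)
  set EXE  = $WORK/a.out              # test executable (assumed)
  set MAP  = $WORK/dcpi.map           # image map from dcpiscan(1)
  cd $WORK
  ./dcpi_start.csh $MAP $WORK $EXE    # start the daemon
  $EXE                                # run the test executable
  ./dcpi_stop.csh                     # halt the daemon
  ./dcpi_post.csh $MAP $WORK $EXE     # post-process the database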
DCPI Startup Script
• C shell script: dcpi_start.csh
• Three arguments defined by driver job: MAP, WORK, EXE
• Creates database directory (DCPIDB)
  • Derived from WORK + hostname
• Starts dcpid(1) process
  • Events of interest are specified here
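In outline, dcpi_start.csh might read as follows (a sketch; it assumes dcpid(1) accepts the database directory as its final argument and omits the site-specific event selection flags described in dcpid(1)):

  #!/bin/csh -f
  # dcpi_start.csh MAP WORK EXE
  set MAP  = $1
  set WORK = $2
  set EXE  = $3
  # Per-node database directory derived from WORK + hostname
  setenv DCPIDB ${WORK}/dcpidb.`hostname`
  mkdir -p $DCPIDB
  # Start the daemon in the background; event options go here
  dcpid $DCPIDB &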
DCPI Stop Script
• C shell script: dcpi_stop.csh
• No arguments
• dcpiquit(1) flushes buffers and halts the daemon process
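The stop script is correspondingly short (a sketch; if the daemon was started with a non-default database directory, DCPIDB may need to be set as in dcpi_start.csh):

  #!/bin/csh -f
  # dcpi_stop.csh: flush buffers and halt the DCPI daemon
  dcpiquit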
DCPI Profiling Script
• C shell script: dcpi_post.csh
• Three arguments defined by driver job: MAP, WORK, EXE
• Determines database location (as before)
• Uses dcpiprof(1) to post-process database files
• Profile selection(s) must be consistent with daemon startup options
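A minimal dcpi_post.csh might look like this (a sketch; it assumes dcpiprof(1) accepts the executable image as an argument and reads the database named by DCPIDB, and any reporting options must match the events the daemon actually collected):

  #!/bin/csh -f
  # dcpi_post.csh MAP WORK EXE
  set MAP  = $1
  set WORK = $2
  set EXE  = $3
  # Same per-node database directory as dcpi_start.csh
  setenv DCPIDB ${WORK}/dcpidb.`hostname`
  # Per-procedure profile of the test executable; add event/report
  # options here (see dcpiprof(1))
  dcpiprof $EXE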
DCPI Example Output
• Profiler writes to stdout by default: dcpi.output
• Single node output in four sections
  • Start daemon, run test, halt daemon
  • Basic dcpiprof output
  • Memory operations (MOPS)
  • Floating point operations (FOPS)
• Reference profiling script for details
Other DCPI Options
• Per-process output files
  • See dcpid(1) -bypid option
• Trim output
  • See dcpiprof(1) -keep option
  • Host list can also be cropped
• ProfileMe events for EV67 and later
  • Focus on -pm events
  • See dcpiprofileme(1) options
Common DCPI Problems
• Login denied (dsh)
  • Requires permission to log in on compute nodes
• Daemon not started in the background
• NFS is flaky for larger node counts (100+)
• Filemode of DCPIDB directory not set correctly
• Mismatch between startup configuration and profiling specifications
  • See dcpid(1), dcpiprof(1), and dcpiprofileme(1)
Summary
• Low-level interfaces provide access to hardware counters
• Very effective, but requires experience
• Minimal overhead costs
• Report timings, flop counts, and MFLOP rates for user code and library calls, e.g. MPI
• More information available, e.g. message sizes, time variability, etc.