NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Profiling Tools on the NERSC Crays and IBM/SP
NERSC User Services

Outline
• Profiling tools on NERSC platforms
  • Cray PVP (killeen, seymour)
  • Cray T3E (mcurie)
  • IBM/SP (gseaborg)
• UNIX profiling/performance analysis tools
• References

Why Profile?
• Characterise the application:
  • Is the code CPU bound?
  • Is the code I/O bound?
  • Is the code memory bound?
• Analyse communication patterns (distributed-memory codes)
• Focus optimisation effort
... and ultimately:
• Improve performance and resource utilisation

Cray PVP/T3E - Application Characterization
• Job accounting (ja):
  • ja
  • ./a.out
  • ja -st -n a.out (see next slide for sample output)
• Look out for:
  • Maximum memory used > available memory
  • Total I/O wait time (locked + unlocked) > 50% of user CPU time
  • Multitasking breakdown for parallel codes

Job accounting: summary report

    Elapsed Time                   :        8 Seconds
    User CPU Time                  :  35.5939 Seconds
    Multitasking/Multistreaming Breakdown
    (Concurrent CPUs * Connect seconds = CPU seconds)
             1 * 0.0100 =  0.0100
             2 * 0.0100 =  0.0200
             3 * 0.0600 =  0.1800
             4 * 8.8500 = 35.4000
        (Avg.)  (total)    (total)
          3.99 * 8.9300 = 35.6100
    System CPU Time                :   0.1226 Seconds
    I/O Wait Time (Locked)         :   0.0000
    I/O Wait Time (Unlocked)       :   0.0000
    CPU Time Memory Integral       :   5.3854 Mword-seconds
    Data Transferred               :   0.0001 MWords
    Maximum memory used            :   0.4746 MWords

HPM - Hardware Performance Monitor
• Helps locate CPU-related code bottlenecks
• Reports use of vector registers, instruction buffers and memory ports
• hpm {options} ./a.out {prog_arguments}
  • options = -g2 -> memory access information
  • options = -g3 -> vector register information
• Look for:
  • The ratio of floating ops per CPU second to CPU memory references per second should reflect the mix of floating-point operations and memory references in the code

Sample hpm output (hpm -g0 ./a.out)

    Million inst/sec (MIPS) :    7.67    Instructions      :  274017290
    Avg. clock periods/inst :   26.06
    % CP holding issue      :   94.02    CP holding issue  : 6714667737
    Inst.buffer fetches/sec :   0.04M    Inst.buf. fetches :    1420802
    Floating adds/sec       :  15.40M    F.P. adds         :  550002417
    Floating multiplies/sec :  24.36M    F.P. multiplies   :  870004996
    Floating reciprocal/sec :   0.28M    F.P. reciprocals  :   10000042
    Cache hits/sec          :   0.00M    Cache hits        :      45893
    CPU mem. references/sec :  34.64M    CPU references    : 1236978495
    Floating ops/CPU second :   40.5M

Cray PVP: CPU-Bound Codes: prof/profview
• Instruments code to report the percentage of CPU time spent in each function
• f90 -lprof prog.f90
• ./a.out -> generates prof.data
• prof -st ./a.out > prof.report
• For a chart of the relative distribution of CPU execution time by function (see next slide):
  • prof -x a.out > pgm.prof
  • profview pgm.prof

Profview - Sample Output
[Figure: profview chart of relative CPU time by function]

I/O- and Memory-Bound Codes: procstat/procview
• procstat -m -i -R a.raw a.out
• procview a.raw
• I/O analysis:
  • Reports -> Files -> All User Files (Long Report)
  • Look at Bytes Processed or I/O Wait Time
• Memory analysis:
  • Reports -> Processes -> Maximum Memory Used (Long Format)

I/O-Bound Codes: procview
• procview shows which files consume the most real time in I/O processing

Memory-Bound Codes: procview
• A "high" time to complete memory requests (> 10% of elapsed time) may indicate a memory-bound code
• Use the Graphs option to plot memory use over the elapsed time of the application

ATExpert - Autotasking Prediction
• Analyses source code to predict autotasking performance on a dedicated Cray PVP
• f90 -eX -O3 -r4 -o {prog_name} prog.f90
• ./{prog_name}
• atexpert -> shows the predicted speed-up

ATExpert Sample Output
• Predicts a speed-up of 4.3 on a dedicated 8-processor PVP when the source code is autotasked

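One way to read a predicted speed-up is through Amdahl's law, S(N) = 1 / ((1 - p) + p/N), where p is the fraction of the work that runs in parallel. This is our illustration, not necessarily the model ATExpert uses internally. Inverting the formula gives p = (1 - 1/S) / (1 - 1/N). The function names below are hypothetical.

```c
#include <assert.h>

/* Amdahl's law: speed-up on n processors when a fraction p of the
   work is parallel and the rest is serial. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Inverse of Amdahl's law: the parallel fraction implied by a
   speed-up s on n > 1 processors. */
double amdahl_parallel_fraction(double s, int n)
{
    assert(s > 0.0 && n > 1);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

For S = 4.3 on N = 8 this gives p ≈ 0.877, and under the same model that p caps the speed-up at 1/(1 - p) ≈ 8.1 however many processors are added.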
Also available on Cray PVP
• flowtrace/flowview
  • Times subroutines and functions during program execution, using operating system timers
• jumptrace/jumpview
  • Provides exact timing of functions/subroutines by analysing the machine instructions in the program
• perftrace/perfview
  • Times subroutines/functions based on statistics gathered by the HPM tool

Cray T3E - Apprentice
• Locates performance problems and inefficiencies
  • MPI and shared-memory performance, load balance and communication, memory use
• Provides hardware performance information and tuning recommendations (Displays -> Observations)
• Compile/link:
  • f90 -o {prog} -eA {prog_name.f90} -lapp
  • cc -o {prog} -happrentice {prog_name.c} -lapp
• Run the code to generate app.rif

Output from: apprentice app.rif
[Figure: Apprentice display for app.rif]

Cray T3E - PAT
• Generates a profile of CPU time in functions, load balance across PEs, and hardware counter information
• Compile and link with the PAT library:
  • f90 -o exe -lpat {source.f} pat.cld
• Run the program as normal:
  • mpprun -n {procs} {exe} -> generates exe.pif
• pat {exe} exe.pif

[Figure: PAT profile based on relative CPU time in function calls]
[Figure: PAT load balance histogram for routine “COLL”]

Cray T3E - ACTS/TAU
• Performance analysis of distributed/shared-memory applications (C++ in particular)
• module load tau
• Instrument programs with TAU macros
• Add $(TAU_DEFS) and $(TAULIBS) to the compile/link lines
• Run the application; view the profile with pprof or the trace file with VAMPIR
• References:
  • http://acts.nersc.gov/tau
  • http://hpcf.nersc.gov/training/classes/Teleconf/1999july/Wu

Cray T3E - Vampir
• Analysis of message-passing characteristics: displays MPI activity over the instrumented time period (e.g. sender, receiver, message size, elapsed time)
• module load VAMPIR; module load vampirtrace
• Code can be instrumented with VAMPIRtrace calls
• Generate the trace file using TAU or VAMPIRtrace
• Reference:
  • http://hpcf.nersc.gov/software/tools/vampir.html

IBM/SP - Xprofiler
• Graphical interface for gprof profiles of parallel applications
• Compile and link the code with “-g -pg”
• poe ./a.out -procs {n}
  • Generates a gmon.out.{n} file for each process
  • May introduce significant overhead (up to a factor of 2)
• (In $TMPDIR) xprofiler ./a.out gmon.out.*
• The Report menu provides the (gprof) text profile
• Source-statement profiling is shown

Statement-level profile
• Available by clicking on the relevant function in the graphical output and using the Show Source Code option

IBM/SP - Visualization Tool (VT)
• Message-passing trace visualization
• Real-time system activity monitor (limited)
• MPI load balance overview:
  • poe ./a.out -procs {n} -tlevel=3
  • Copy a.out.trc to $TMPDIR
  • (In $TMPDIR) invoke vt
  • In trace visualization mode, “Play” a.out.trc
• Playback displays interprocessor communication during program execution

IBM/SP: system_stats
• IBM internal tool
• module load sptools
• Instrument code with a system_stats() call
• Link with $(SPTOOLS); run the code as normal
• Sample output:

    Summary of the utilization of system resources:
    node  hostname  wall(s)  user(s)  sys(s)  size(KB)  pswitches
       0  gs01015     16.80    13.18    0.04      2748       2138
       1  gs01015     16.80    16.07    0.04      2744       1868
       2  gs01003     16.80    16.62    0.04      2740       1870
       3  gs01003     16.80    16.56    0.03      2732       1841

IBM/SP - trace-mpi
• IBM internal tool: quantitative information on MPI calls
• module load USG; module load trace-mpi
• Fortran: add $(TRACE_MPIF) to the build
• C: add $(TRACE_MPI) to the build
• poe ./a.out -procs {n} generates an mpi.trace_file.{n} for each process (the executable must call MPI_Finalize)
• summary mpi.trace_file.{n} (see next slide)
• Useful check for load balance:
  • grep "Total Communication" mpi.trace_file.*

MPI message-passing summary for mpi.trace_file.3

    MPI Function      #calls   Avg Bytes   Time (sec)
    --------------------------------------------------
    MPI_Allreduce:      9355         8.0        3.596
    MPI_Barrier:           3         0.0        0.017
    MPI_Bcast:            66         5.8        0.013
    MPI_Scatter:          31      1008.0        0.088
    MPI_Comm_rank:         1         0.0        0.000
    MPI_Comm_size:         1         0.0        0.000
    MPI_Isend:         43023      2003.7        0.893
    MPI_Recv:          43023      2003.7        7.481
    MPI_Wait:          43023      2003.7        3.739

    Total Communication Information: WALL = 15.8277, CPU = 15.53, MBYTES = 258.72
    The total amount of wall time = 26.229613

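The "Total Communication" load-balance check from the trace-mpi slide can be automated by parsing that line from each mpi.trace_file.{n}. This C sketch assumes the exact line layout shown in the sample summary, and `comm_totals`, `parse_comm_totals` and `comm_fraction` are hypothetical names. It also computes the fraction of total wall time spent in MPI, given the "total amount of wall time" figure.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the "Total Communication Information" values. */
typedef struct {
    double wall_s; /* WALL */
    double cpu_s;  /* CPU */
    double mbytes; /* MBYTES */
} comm_totals;

/* Parse one "Total Communication Information" line in the layout of the
   sample output. Returns 1 on success, 0 if the line does not match. */
int parse_comm_totals(const char *line, comm_totals *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line,
                  "Total Communication Information: WALL = %lf, CPU = %lf, MBYTES = %lf",
                  &out->wall_s, &out->cpu_s, &out->mbytes) == 3;
}

/* Fraction of the program's total wall time spent inside MPI calls. */
double comm_fraction(const comm_totals *c, double total_wall_s)
{
    assert(c != NULL && total_wall_s > 0.0);
    return c->wall_s / total_wall_s;
}
```

For task 3 this gives 15.8277 / 26.229613 ≈ 0.60, i.e. roughly 60% of the run spent in MPI calls, with MPI_Recv alone accounting for 7.481 s. Comparing WALL across all trace files shows whether some tasks spend much longer communicating (often waiting) than others.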
Upcoming on the SP
• ACTS/TAU (C/C++)
  • Currently being ported to the IBM/SP
• VAMPIR
  • Has been ordered; awaiting delivery
• Performance Monitor Toolkit (HPM)
  • Should be available with the Phase II system (requires AIX 4.3.4)
• Also see the Performance API (PAPI) project:
  • http://icl.cs.utk.edu/projects/papi

General/UNIX Profiling Tools
• Command-line profilers and system analysis
  • prof/gprof (enabled for MPI on IBM/SP)
  • csh time command: time ./a.out
  • vmstat -> look for high paging over an extended time period (application may require more memory)
• Fortran/C function timers
  • getrusage
  • rtc, irtc
  • etime, dtime, mclock
  • MPI_Wtime

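A minimal C sketch of timing a code region with getrusage (user and system CPU time of the calling process) alongside a wall clock. We use POSIX clock_gettime(CLOCK_MONOTONIC) as the wall-clock source, standing in for MPI_Wtime in a non-MPI program; rtc, irtc, etime, dtime and mclock are platform- or compiler-specific and are not shown. `region_times`, `time_region` and `example_workload` are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical result type: seconds spent in one timed region. */
typedef struct {
    double wall_s; /* elapsed (wall-clock) time */
    double user_s; /* user CPU time of this process */
    double sys_s;  /* system CPU time of this process */
} region_times;

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Wall clock from POSIX clock_gettime (our stand-in for MPI_Wtime). */
static double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run fn(arg) once and report wall, user and system time for that call. */
region_times time_region(void (*fn)(void *), void *arg)
{
    struct rusage r0, r1;
    double w0 = wall_seconds();
    int rc0 = getrusage(RUSAGE_SELF, &r0);
    fn(arg);
    int rc1 = getrusage(RUSAGE_SELF, &r1);
    double w1 = wall_seconds();
    assert(rc0 == 0 && rc1 == 0);
    (void)rc0;
    (void)rc1;

    region_times t;
    t.wall_s = w1 - w0;
    t.user_s = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    t.sys_s = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);
    return t;
}

/* Hypothetical example workload: sums n terms of the harmonic series. */
void example_workload(void *arg)
{
    long n = *(const long *)arg;
    volatile double acc = 0.0; /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < n; i++)
        acc += 1.0 / (double)(i + 1);
}
```

For a single-threaded region, user + system time much smaller than wall time means the process spent that time waiting (for example on I/O, paging or communication) rather than computing, which points back to the I/O-bound and memory-bound checks earlier in these slides.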
Reference Material
• NERSC web pages
  • http://hpcf.nersc.gov/software/tools
• Cray PVP/Cray T3E
  • http://www.cray.com/swpubs
  • Optimizing Code on Cray PVP Systems
  • Cray T3E C, Fortran Optimization Guides
• IBM/SP
  • LLNL Workshop on Performance Tools