
Kent Milfeld TACC May 28, 2009




  1. Profiling Kent Milfeld TACC May 28, 2009

  2. Outline • Profiling MPI performance with mpiP • Tau • gprof • Example: matmult

  3. mpiP • Scalable profiling library for MPI apps • Lightweight • Collects statistics of MPI functions • Communicates only during report generation • Less data than tracing tools http://mpip.sourceforge.net

  4. Usage, Instrumentation, Analysis • How to use • No recompiling required! • Profiling gathered in the MPI profiling layer • Link the static library before the default MPI libraries: -g -L${TACC_MPIP_LIB} -lmpiP -lbfd -liberty -lintl -lm • What to analyze • Overview of time spent in MPI communication during the application run • Aggregate time for individual MPI calls
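The link line above can be captured in a build rule. Below is a minimal sketch of a hypothetical Makefile rule; the $(TACC_MPIP_LIB) variable and library list come from the slide, while the target name and the mpicc wrapper are assumptions:

```makefile
# Hypothetical rule: link the static mpiP library BEFORE the default MPI
# libraries so mpiP's profiling-layer wrappers intercept the MPI calls.
matmultc: matmultc.c
	mpicc -g -o matmultc matmultc.c -L$(TACC_MPIP_LIB) -lmpiP -lbfd -liberty -lintl -lm
```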

  5. Control • External control: set the MPIP environment variable (threshold, callsite depth) • setenv MPIP '-t 10 -k 2' • export MPIP='-t 10 -k 2' • Use MPI_Pcontrol(#) to limit profiling to specific code blocks C: MPI_Pcontrol(2); MPI_Pcontrol(1); MPI_Abc(…); MPI_Pcontrol(0); F90: call MPI_Pcontrol(2) call MPI_Pcontrol(1) call MPI_Abc(…) call MPI_Pcontrol(0)

  6. Example • Untar the mpip tar file: • % tar -xvf ~train00/mpip.tar; cd mpip • Read the "Instructions" file; set up the environment with the sourceme files: • % source sourceme.csh (C-type shells) • % source sourceme.sh (Bash-type shells) module purge (resets to default modules) module load TACC module unload mvapich (change compiler/MPI to intel/mvapich) module swap pgi intel module load mvapich module load mpiP (load up mpiP) set path=($path /share/home/00692/train00/mvapich1_intel/Mpipview_2.02/bin) (sets PATH with the mpipview dir; separate software from mpiP)

  7. Example (cont.) • Compile either the matmultc.c or matmultf.f90 code. • % make matmultf • Or • % make matmultc • Uncomment “ibrun matmultc” or “ibrun matmultf” in the job file. • Submit job. • % qsub job • Results in: • <exec_name><unique number>.mpiP

  8. Example *.mpiP Output • MPI-Time: wall-clock time for all MPI calls • MPI callsites

  9. Output (cont.) • Message size • Aggregate time

  10. mpipview • e.g. % mpipview matmultf.32.6675.1.mpiP • Note: list selection.

  11. Indexed

  12. Call Sites

  13. Statistics

  14. Message Sizes

  15. Tau Outline • General • Measurements • Instrumentation & Control • Example: matmult • Profiling and Tracing • Event Tracing • Steps for Performance Evaluation • Tau Architecture • A look at a task-parallel MxM Implementation • Paraprof Interface

  16. General • Tuning and Analysis Utilities (11+ year project effort) www.cs.uoregon.edu/research/paracomp/tau/ • Performance system framework for parallel, shared & distributed memory systems • Targets a general complex system computation model • Nodes / Contexts / Threads • Integrated toolkit for performance instrumentation, measurement, analysis, and visualization • TAU = Profiler and Tracer + Hardware Counters + GUI + Database

  17. Tau: Measurements • Parallel profiling • Function-level, block (loop)-level, statement-level • Supports user-defined events • TAU parallel profile data stored during execution • Hardware counter values • Support for multiple counters • Support for callgraph and callpath profiling • Tracing • All profile-level events • Inter-process communication events • Trace merging and format conversion

  18. Tau: Instrumentation • PDT is used to instrument your code. • Replace mpicc and mpif90 in makefiles with tau_cc.sh and tau_f90.sh. • It is necessary to specify all the components that will be used in the instrumentation (MPI, OpenMP, profiling, counters [PAPI], etc.; however, these come in a limited number of combinations). • Combinations: first determine what you want to do (profiling, PAPI counters, tracing, etc.), the programming paradigm (MPI, OpenMP), and the compiler. PDT is a required component. [Slide diagram: combinations of parallel paradigm (MPI, OMP, …), collectors (PAPI, Callpath, …), compiler (intel, pgi, gnu), and instrumentation (PDT, hand-coded).]

  19. Tau: Instrumentation • You can view the available combinations (alias tauTypes 'ls -C1 $TAU | grep Makefile'). • Your selected combination is made known to the compiler wrapper through the TAU_MAKEFILE environment variable. • E.g., the PDT instrumentation (pdt) for the Intel compiler (icpc) for MPI (mpi) is set with this command: • setenv TAU_MAKEFILE /…/Makefile.tau-icpc-mpi-pdt • Other run-time and instrumentation options are set through TAU_OPTIONS. For verbose output: • setenv TAU_OPTIONS '-optVerbose'

  20. Tau Example • % tar -xvf ~train00/tau.tar • % cd tau (READ the Instructions file) • % source sourceme.csh or % source sourceme.sh (set up env.: modules and TAU_MAKEFILE) • % make matmultf or % make matmultc (create executables) • % qsub job (submit job; edit and uncomment the ibrun line) • % paraprof (GUI to analyze the performance data)

  21. Definitions – Profiling • Profiling • Recording of summary information during execution • inclusive, exclusive time, # calls, hardware statistics, … • Reflects performance behavior of program entities • functions, loops, basic blocks • user-defined “semantic” entities • Very good for low-cost performance assessment • Helps to expose performance bottlenecks and hotspots • Implemented through • sampling: periodic OS interrupts or hardware counter traps • instrumentation: direct insertion of measurement code

  22. Definitions – Tracing • Tracing • Recording of information about significant points (events) during program execution • entering/exiting code region (function, loop, block, …) • thread/process interactions (e.g., send/receive message) • Save information in event record • timestamp • CPU identifier, thread identifier • Event type and event-specific information • Event trace is a time-sequenced stream of event records • Can be used to reconstruct dynamic program behavior • Typically requires code instrumentation

  23. Event Tracing: Instrumentation, Monitor, Trace CPU A: void master() { trace(ENTER, 1); ... trace(SEND, B); send(B, tag, buf); ... trace(EXIT, 1); } CPU B: void worker() { trace(ENTER, 2); ... recv(A, tag, buf); trace(RECV, A); ... trace(EXIT, 2); } [Slide diagram: the monitor collects the trace() calls from each CPU into time-stamped event records (ENTER, SEND, RECV, EXIT), forming the trace stream.]

  24. Event Tracing: "Timeline" Visualization [Slide diagram: the merged master/worker event records replayed along a time axis, one line per process, showing the region enters/exits and the message sent from A to B.]

  25. Steps of Performance Evaluation • Collect basic routine-level timing profile to determine where most time is being spent • Collect routine-level hardware counter data to determine types of performance problems • Collect callpath profiles to determine sequence of events causing performance problems • Conduct finer-grained profiling and/or tracing to pinpoint performance bottlenecks • Loop-level profiling with hardware counters • Tracing of communication operations

  26. TAU Performance System Architecture

  27. Overview of Matmult: C = A x B (order N, P tasks) [Slide diagram: the master creates A and B and sends B to the workers; each worker receives a row of A, multiplies that row by B to form a row of C, and sends the row of C back to the master.]

  28. Preparation of Matmult: C = A x B (order N, P tasks) [Slide diagram: PE 0 generates A and B, then broadcasts B to all tasks column by column: loop over i (i=1..n), MPI_Bcast( b(1,i)…]

  29. Master Ops of Matmult: C = A x B (order N, P tasks) [Slide diagram: the master (PE 0) sends rows 1 through p-1 of A to workers 1..p-1 (loop over i=1..p-1, MPI_Send(arow … i,i with destination and tag), then receives all n rows of C back from the workers (loop over i=1..n, MPI_Recv(crow … ANY,k with source and tag), sending the next row of A to each idle worker (MPI_Send(arow … idle,j with destination and tag).]

  30. Worker Ops of Matmult: C = A x B (order N, P tasks) [Slide diagram: each worker picks up the broadcast of the B columns from PE 0, then receives any row j of A from PE 0 (MPI_Recv( arow …ANY,j), multiplies all columns of B into that row (matrix * vector) to form row j of C, and sends row j of C back to the master, PE 0 (MPI_Send( crow …).]

  31. Paraprof and Pprof • Execute application and analyze performance data: • % qsub job • Look for files: profile.<task_no> • With multiple counters, look for directories for each counter. • % pprof (for text-based profile display) • % paraprof (for GUI) • pprof and paraprof will discover the files/directories. • paraprof runs on PCs; files/directories can be downloaded to a laptop and analyzed there.

  32. Tau Paraprof Overview [Slide diagram: paraprof reads raw profile files (TAU, HPMToolkit, MpiP) or PerfDMF-managed database data plus metadata, organized as Application / Experiment / Trial.]

  33. Tau Paraprof Manager Window • Provides machine details • Organizes runs as: applications, experiments, and trials

  34. Routine Time Experiment Profile Information is in “GET_TIME_OF_DAY” metric Mean and Standard Deviation Statistics given.

  35. Multiply_Matrices Routine Results • The Function Data Window gives a closer look at a single function (note: not from the same run)

  36. Float Point OPS trial Hardware Counters provide Floating Point Operations (Function Data view).

  37. L1 Data Cache Miss trial Hardware Counters provide L1 Cache Miss Operations.

  38. Call Path Call Graph Paths (Must select through “thread” menu.)

  39. Call Path TAU_MAKEFILE = …Makefile.tau-callpath-icpc-mpi-pdt

  40. Derived Metrics Select Argument 1 (green ball); Select Argument 2 (green ball); Select Operation; then Apply. Derived Metric will appear as a new trial.

  41. Derived Metrics • Since the FP/miss ratios are constant, this must be a memory access problem. • Be careful: even though the ratios are constant, cores may do different amounts of work/operations per call.

  42. GPROF GPROF is the GNU Project PROFiler. gnu.org/software/binutils/ • Requires recompilation of the code. • Compiler options and libraries provide wrappers for each routine call, and periodic sampling of the program. • A default gmon.out file is produced with the function call information. • GPROF links the symbol list in the executable with the data in gmon.out.

  43. Types of Profiles • Flat Profile • CPU time spent in each function (self and cumulative) • Number of times a function is called • Useful to identify the most expensive routines • Call Graph • Number of times a function was called by other functions • Number of times a function called other functions • Useful to identify function relations • Suggests places where function calls could be eliminated • Annotated Source • Indicates number of times a line was executed

  44. Profiling with gprof • Use the -pg flag during compilation: % gcc -g -pg srcFile.c % icc -g -pg srcFile.c % pgcc -g -pg srcFile.c • Run the executable. An output file gmon.out will be generated with the profiling information. • Execute gprof and redirect the output to a file: % gprof exeFile gmon.out > profile.txt % gprof -l exeFile gmon.out > profile_line.txt % gprof -A exeFile gmon.out > profile_annotated.txt

  45. gprof example • Move into the profiling examples directory: % cd profile/ • Compile matvecop with the profiling flag: % gcc -g -pg matvecop.c -lm (source at ~train00/matvecop.c) • Run the executable to generate the gmon.out file: % ./a.out • Run the profiler and redirect output to a file: % gprof a.out gmon.out > profile.txt • Open the profile file and study the instructions.

  46. Visual Call Graph [Slide diagram: call graph of main and the routines matSqrt, vecSqrt, sysSqrt, matCube, vecCube, and sysCube.]

  47. Flat Profile In the flat profile we can identify the most expensive parts of the code (in this case, the calls to matSqrt, matCube, and sysCube).

  %   cumulative   self              self    total
 time   seconds   seconds   calls   s/call  s/call  name
 50.00     2.47      2.47       2     1.24    1.24  matSqrt
 24.70     3.69      1.22       1     1.22    1.22  matCube
 24.70     4.91      1.22       1     1.22    1.22  sysCube
  0.61     4.94      0.03       1     0.03    4.94  main
  0.00     4.94      0.00       2     0.00    0.00  vecSqrt
  0.00     4.94      0.00       1     0.00    1.24  sysSqrt
  0.00     4.94      0.00       1     0.00    0.00  vecCube

  48. Call Graph Profile

index  % time    self  children  called   name
                 0.00      0.00     1/1       <hicore> (8)
[1]     100.0    0.03      4.91     1     main [1]
                 0.00      1.24     1/1       sysSqrt [3]
                 1.24      0.00     1/2       matSqrt [2]
                 1.22      0.00     1/1       sysCube [5]
                 1.22      0.00     1/1       matCube [4]
                 0.00      0.00     1/2       vecSqrt [6]
                 0.00      0.00     1/1       vecCube [7]
-----------------------------------------------
                 1.24      0.00     1/2       main [1]
                 1.24      0.00     1/2       sysSqrt [3]
[2]      50.0    2.47      0.00     2     matSqrt [2]
-----------------------------------------------
                 0.00      1.24     1/1       main [1]
[3]      25.0    0.00      1.24     1     sysSqrt [3]
                 1.24      0.00     1/2       matSqrt [2]
                 0.00      0.00     1/2       vecSqrt [6]
-----------------------------------------------

  49. Profiling dos and don'ts DO • Test every change you make • Profile typical cases • Compile with optimization flags • Test for scalability DO NOT • Assume a change will be an improvement • Profile atypical cases • Profile ad infinitum: set yourself a goal or a time limit

  50. PAPI Implementation [Slide diagram: layered PAPI architecture. Portable layer: PAPI high-level and low-level APIs over a machine-dependent substrate. Machine-specific layer: kernel extension, operating system, and the hardware performance counters.]
