Intel® Cluster Tools: Introduction and Hands-on Sessions. MSU Summer School "Intel Cluster Software and Technologies". Software & Services Group. July 8, 2010, MSU, Moscow
Agenda • Intel Cluster Tools settings and configuration • Intel MPI fabrics • Message Checker • ITAC Introduction • ITAC practice
Setup configuration • source /opt/intel/cc/11.0.74/bin/iccvars.sh intel64 • source /opt/intel/fc/11.0.74/bin/ifortvars.sh intel64 • source /opt/intel/impi/4.0.0.25/bin64/mpivars.sh • source /opt/intel/itac/8.0.0.011/bin/itacvars.sh impi4
Check configuration • which icc • which ifort • which mpiexec • which traceanalyzer • echo $LD_LIBRARY_PATH • set | grep I_MPI • set | grep VT_
Compile your first MPI application • Using Intel compilers • mpiicc, mpiicpc, mpiifort, ... • Using GNU compilers • mpicc, mpicxx, mpif77, ... • mpiicc -o hello_c test.c • mpiifort -o hello_f test.f
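A minimal sketch of what test.c might contain (the actual lab source is not shown on the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                   /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process' rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of ranks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}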
Create mpd.hosts file • Create mpd.hosts file in the working directory with list of available nodes Create mpd ring • mpdboot -r ssh -n #nodes Check mpd ring • mpdtrace
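For the first step above, an example mpd.hosts with hypothetical node names (one host per line), followed by the matching mpdboot call:

node01
node02
node03
node04

$ mpdboot -r ssh -n 4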
Start your first application • mpiexec -n 16 ./hello_c • mpiexec -n 16 ./hello_f Kill mpd ring • mpdallexit • mpdcleanup -a Start your first application • mpirun -r ssh -n 16 ./hello_c • mpirun -r ssh -n 16 ./hello_f
Alternative Process Manager • Use mpiexec.hydra for better scalability • All options are the same
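For example, the earlier run could be started with Hydra directly (no mpd ring is required):

$ mpiexec.hydra -n 16 ./hello_c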
OFED & DAPL • OFED - OpenFabrics Enterprise Distribution http://openfabrics.org/ • DAPL - Direct Access Programming Library http://www.openfabrics.org/downloads/dapl/ • Check /etc/dat.conf • Set I_MPI_DAPL_PROVIDER=OpenIB-mlx4_0-2
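A quick way to see which DAPL providers the system offers (provider names are system-specific; the one below simply repeats the example above):

$ grep -v '^#' /etc/dat.conf
$ export I_MPI_DAPL_PROVIDER=OpenIB-mlx4_0-2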
Fabrics selection (cont.) • Use I_MPI_FABRICS to set the desired fabric • export I_MPI_FABRICS=shm:tcp • mpirun -r ssh -n <N> -env I_MPI_FABRICS shm:tcp ./a.out • DAPL varieties: • export I_MPI_FABRICS=shm:dapl • export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1 • export I_MPI_DAPL_UD=enable • Connectionless communication • Better scalability • Less memory is required
Fabrics selection (cont.) • OFA fabric • export I_MPI_FABRICS=shm:ofa • Multi-rail feature • export I_MPI_OFA_NUM_ADAPTERS=<n> • export I_MPI_OFA_NUM_PORTS=<n> • For OFA devices the Intel® MPI Library recognizes certain hardware events: it can stop using a failed link and restore the connection once the link is healthy again
How to get information from the Intel MPI Library • Use the I_MPI_DEBUG environment variable • Use a value from 2 to 1001 for increasing levels of detail • Level 2 shows the data transfer mode • Level 4 shows pinning information
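For example, a quick way to see the pinning map and the selected transfer mode for the hello_c binary built earlier (level 4 is an illustrative choice; any level from the range above works):

$ mpirun -r ssh -n 4 -env I_MPI_DEBUG 4 ./hello_c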
cpuinfo utility • Use this utility to get information about the processors in your system

Intel(R) Xeon(R) Processor (Intel64 Harpertown)
===== Processor composition =====
Processors(CPUs)  : 8
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 1
===== Processor identification =====
Processor  Thread Id.  Core Id.  Package Id.
0          0           0         0
1          0           0         1
2          0           1         0
3          0           1         1
4          0           2         0
5          0           2         1
6          0           3         0
7          0           3         1
===== Placement on packages =====
Package Id.  Core Id.  Processors
0            0,1,2,3   0,2,4,6
1            0,1,2,3   1,3,5,7
===== Cache sharing =====
Cache  Size   Processors
L1     32 KB  no sharing
L2     6 MB   (0,2)(1,3)(4,6)(5,7)
Pinning • One can change the default pinning settings • export I_MPI_PIN=on|off • export I_MPI_PIN_DOMAIN=cache2 (for hybrid codes) • export I_MPI_PIN_PROCESSOR_LIST=allcores • export I_MPI_PIN_PROCESSOR_LIST=shift=socket
OpenMP and Hybrid applications • Check the command line used for application building • Use the thread-safe version of the Intel® MPI Library (-mt_mpi option) • Use libraries with SMP parallelization (e.g. the parallel MKL) • Use the -openmp compiler option to enable OpenMP* directives • Set the application execution environment for hybrid applications • Set OMP_NUM_THREADS to the number of threads per process • Use the -perhost option to control process pinning
$ mpiicc -openmp your_app.c -o ./your_app
$ export OMP_NUM_THREADS=4
$ export I_MPI_FABRICS=shm:dapl
$ export KMP_AFFINITY=compact
$ mpirun -perhost 4 -n <N> ./your_app
Intel® MPI Library and MKL • MKL creates its own threads (OpenMP, TBB, ...) • Starting with version 10.2, MKL recognizes the Intel® MPI Library settings and does not create more threads than there are available cores • Use OMP_NUM_THREADS and MKL_NUM_THREADS carefully
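A hedged example of making the thread counts explicit (illustrative numbers; your_mkl_app is a hypothetical binary):

$ export OMP_NUM_THREADS=4
$ export MKL_NUM_THREADS=4
$ mpirun -r ssh -perhost 2 -n <N> ./your_mkl_app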
How to run a debugger • TotalView • mpirun -r ssh -tv -n # ./a.out • GDB • mpirun -r ssh -gdb -n # ./a.out • Allinea DDT (from GUI) • IDB • mpirun -r ssh -idb -n # ./a.out • You need idb available in your $PATH • Some settings are required
Message Checker • Local checks: isolated to single process • Unexpected process termination • Buffer handling • Request and data type management • Parameter errors found by MPI • Global checks: all processes • Global checks for collectives and p2p ops • Data type mismatches • Corrupted data transmission • Pending messages • Deadlocks (hard & potential) • Global checks for collectives – one report per operation • Operation, size, reduction operation, root mismatch • Parameter error • Mismatched MPI_Comm_free()
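For illustration only (not one of the lab sources): a minimal two-rank program with the kind of hard deadlock the DEADLOCK check reports, since both ranks block in MPI_Recv before either posts its send.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;                 /* assumes exactly 2 ranks */
    /* both ranks receive first, so neither reaches MPI_Send: hard deadlock */
    MPI_Recv(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &st);
    MPI_Send(&rank, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}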
Message Checker (cont.) • Levels of severity: • Warnings: application can continue • Error: application can continue but almost certainly not as intended • Fatal error: application must be aborted • Some checks may find both warnings and errors • Example: CALL_FAILED check due to invalid parameter • Invalid parameter in MPI_Send() => msg cannot be sent => error • Invalid parameter in MPI_Request_free() => resource leak => warning
Message Checker (cont.) • Usage model: • Recommended: • -check option when running an MPI job $ mpiexec -check -n 4 ./a.out • Use the fail-safe version in case of a crash $ mpiexec -check libVTfs.so -n 4 ./a.out • Alternatively: • -check_mpi option during the link stage $ mpiicc -check_mpi -g test.c -o a.out • Configuration • Each check can be enabled/disabled individually • set in the VT_CONFIG file, e.g. to enable local checks only: CHECK ** OFF CHECK LOCAL:** ON • Change the number of warnings and errors printed and/or tolerated before abort • See lab/poisson_ITAC_dbglibs
Trace Collector • Link with the trace library: • mpiicc -trace test.c -o a.out • Run with the -trace option • mpiexec -trace -n # ./a.out • Using the itcpin utility • mpirun -r ssh -n # itcpin --run -- ./a.out • Binary instrumentation • Use the -tcollect link option • mpiicc -tcollect test.c -o a.out
Using the Trace Collector for OpenMP applications • ITA can show only those threads which call MPI functions. There is a very simple trick: e.g. before "#pragma omp barrier" add an MPI call: • { int size; MPI_Comm_size(MPI_COMM_WORLD, &size); } • After such a modification ITA will show information about the OpenMP threads. • Remember that to support threads you need to use the thread-safe MPI library. Don't forget to set the VT_MPI_DLL environment variable. • $ set VT_MPI_DLL=impimt.dll (for Windows) • $ export VT_MPI_DLL=libmpi_mt.so (for Linux)
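In context the trick looks roughly like this (a sketch; the surrounding hybrid code is hypothetical, and it must be built against the thread-safe library, e.g. mpiicc -mt_mpi -openmp):

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    #pragma omp parallel
    {
        /* ... per-thread computation ... */

        /* dummy MPI call so ITA records this thread in the trace */
        { int size; MPI_Comm_size(MPI_COMM_WORLD, &size); }

        #pragma omp barrier

        /* ... computation that follows the barrier ... */
    }

    MPI_Finalize();
    return 0;
}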
Lightweight statistics • Use the I_MPI_STATS environment variable • export I_MPI_STATS=<#> (up to 10) • export I_MPI_STATS_SCOPE=p2p:csend

Sample output:
~~~~ Process 0 of 256 on node C-21-23
Data Transfers
Src --> Dst     Amount(MB)      Transfers
-----------------------------------------
000 --> 000     0.000000e+00    0
000 --> 001     1.548767e-03    60
000 --> 002     1.625061e-03    60
000 --> 003     0.000000e+00    0
000 --> 004     1.777649e-03    60
...
=========================================
Totals          3.918986e+03    1209

Communication Activity
Operation       Volume(MB)      Calls
-----------------------------------------
P2P
Csend           9.147644e-02    1160
Send            3.918895e+03    49
Collectives
Barrier         0.000000e+00    12
Bcast           3.051758e-05    6
Reduce          3.433228e-05    6
Allgather       2.288818e-04    30
Allreduce       4.108429e-03    97
Intel® Trace Analyzer • Generate a trace file for Game of Life • Investigate blocking Send using ITA • Change code • Look at difference
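The lab's Game of Life source is not reproduced on the slides; as a hedged sketch of the usual code change, the blocking boundary exchange can be replaced with non-blocking calls (all names here, such as exchange_halo, ncols, up and down, are hypothetical):

#include <mpi.h>

/* Non-blocking halo exchange: post receives first, then sends, then wait.
   This removes the serialization that blocking MPI_Send can cause. */
void exchange_halo(char *top_row, char *bottom_row,
                   char *top_halo, char *bottom_halo,
                   int ncols, int up, int down)
{
    MPI_Request req[4];
    MPI_Irecv(top_halo,    ncols, MPI_CHAR, up,   1, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(bottom_halo, ncols, MPI_CHAR, down, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(top_row,     ncols, MPI_CHAR, up,   0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(bottom_row,  ncols, MPI_CHAR, down, 1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}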
Ideal Interconnect Simulator (IIS) • Helps to identify an application's imbalance by simulating its behavior in an "ideal communication environment" • [Figure: real trace vs. ideal trace]
Imbalance diagram • [Diagram: a real ITAC trace (Calculation + MPI_Allreduce phases) is passed through the trace idealizer into an idealized model; the MPI_Allreduce time that remains in the ideal trace is load imbalance, while the difference from the real trace is interconnect overhead]
mpitune utility • Cluster-specific tuning • Run it once after installation and each time the cluster configuration changes • The best configuration is recorded for each combination of communication device, number of nodes, MPI ranks and process distribution model
# Collect configuration values:
$ mpitune
# Reuse recorded values:
$ mpiexec -tune -n 32 ./your_app
• Application-specific tuning • Tune any kind of MPI application by specifying its command line • By default, performance is measured as inverse execution time • To reduce overall tuning time, use the shortest representative application workload (if applicable)
# Collect configuration settings:
$ mpitune --application \"mpiexec -n 32 ./my_app\" -of ./my_app.conf
Note: the backslash-escaped quotes are mandatory
# Reuse recorded values:
$ mpiexec -tune ./my_app.conf -n 32 ./my_app
Stay tuned! • Learn more online • Intel® MPI self-help pages: http://www.intel.com/go/mpi • Ask questions and share your knowledge • Intel® MPI Library support page: http://software.intel.com/en-us/articles/intel-cluster-toolkit-support-resources/ • Intel® Software Network Forum: http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/