Getting Started with HPC On Iceberg

Michael Griffiths and Deniz Savas
Corporate Information and Computing Services
The University of Sheffield
www.sheffield.ac.uk/wrgrid

Outline
• Review of hardware and software
• Accessing
• Managing Jobs
• Building Applications
• Resources
• Getting Help

Job dependencies with arrays (-hold_jid)
Job dependencies allow one to specify that one job should not run until another job completes.
• One can use job dependencies as follows:
  • In a two-step process, where the second step depends on the results of the first.
  • Splitting one long job into two smaller jobs helps the queue scheduler be more efficient.
  • One can allocate resources to each job separately. Often, one step requires more or less memory than the other.
• To avoid clogging the queue with a large number of jobs:
  • Job dependencies can effectively limit the number of running jobs, independent of the number of jobs submitted.

Job dependencies with arrays (-hold_jid)
• Suppose one has two scripts: step1.sh and step2.sh.
• One can make step2.sh dependent on step1.sh as follows:
    $ qsub step1.sh
    Your job 12357 ('step1.sh') has been submitted
    $ qsub -hold_jid 12357 step2.sh
    Your job 12358 ('step2.sh') has been submitted
• Or, by explicitly using the job name:
    $ qsub -N myjob step1.sh
    $ qsub -hold_jid myjob step2.sh
• One could also capture the job ID of step 1 for use in the step 2 submission, as follows:
    $ step1id=`qsub -terse step1.sh`; qsub -hold_jid $step1id step2.sh
  -terse causes qsub to display only the job ID of the job being submitted rather than the regular "Your job ..." string.
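The capture-and-hold pattern above can be wrapped in a small helper. The following is a minimal sketch rather than an iceberg-provided tool: the function name submit_chain and the QSUB override variable are hypothetical, and it assumes that qsub -terse prints only the job ID, as described above.

```shell
#!/bin/bash
# Sketch only: submit_chain and QSUB are illustrative names, not SGE features.
# QSUB can be pointed at a stand-in command to rehearse the logic without a scheduler.
QSUB=${QSUB:-qsub}

# Submit the first script, then submit the second so that it is held
# until the first job has completed.
submit_chain() {
    local first=$1 second=$2 jid
    # -terse makes qsub print only the job ID, so it can be captured directly.
    jid=$("$QSUB" -terse "$first") || return 1
    "$QSUB" -hold_jid "$jid" "$second"
}
```

On a login node this might be used as: submit_chain step1.sh step2.sh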

Building Applications: Overview
The operating system on iceberg provides full facilities for scientific code development, compilation and execution of programs. The development environment includes debugging tools provided by the Portland test suite and the Eclipse IDE.

Compilers
• PGI, GNU and Intel C and Fortran compilers are installed on iceberg.
• PGI (Portland Group) compilers are available as soon as you log into a worker node.
• A suitable module command is necessary to access the Intel or GNU compilers.
• The following compiler modules are available to use with the module add (or module load) command:
    compilers/pgi
    compilers/intel
    compilers/gcc
• The Java compiler and the Python development environment can also be made available by using the following modules respectively:
    apps/java
    apps/python

Building Applications: Compilers
C and Fortran programs may be compiled using the GNU or Portland Group compilers. The commands that invoke these compilers are summarized in the following table:

Building Applications: Compilers
All of these commands take the name of the file containing the source to be compiled as one argument, followed by a list of optional parameters. Example:
    pgcc myhelloworld.c -o hello
The file type (suffix) usually determines how the syntax of the source file will be treated. For example, myprogram.f will be treated as fixed format (FORTRAN 77 style) source, whereas myprogram.f90 will be assumed to be free format (Fortran 90 style) by the compiler. Most compilers have a --help or -help switch that lists the available compiler options. The -V parameter lists the version number of the compiler you are using.

Help and documentation on compilers
• As well as the -help or --help parameters of the compiler commands, there are man (manual) pages available for these compilers on iceberg. For example: man pgcc, man icc, man gcc
• Full documentation provided with the PGI and Intel compilers is accessible via your browser from any platform via the page: http://www.shef.ac.uk/wrgrid/software/compilers

Building Applications: Sequential Fortran
Assuming that the Fortran program source code is contained in the file mycode.f90, to compile using the Portland Group compiler type:
    pgf90 mycode.f90
In this case the executable will be output into the file a.out. To run this code, issue the following at the UNIX prompt:
    ./a.out
To add some optimization when using the Portland Group compiler, the -fast flag may be used. Also, -o may be used to specify the name of the compiled executable, i.e.:
    pgf90 -o mycode -fast mycode.f90
The resultant executable will have the name mycode and will have been optimized by the compiler.

Building Applications: Sequential C
Assuming that the program source code is contained in the file mycode.c, to compile using the Portland C compiler, type:
    pgcc -o mycode mycode.c
In this case, the executable will be output into the file mycode, which can be run by typing its name at the command prompt:
    ./mycode

Memory Issues
Programs requiring more than 2 gigabytes of memory for their data (e.g. using very large arrays) may get into difficulties due to addressing issues when pointers cannot hold the values of these large addresses. It is also advisable that variables that store and use array indices have a sufficient number of bytes allocated to them. For example, it is not wise to use short int (C) or integer*2 (Fortran) for variables holding array indices. Such variables must be re-declared as long int or integer*4. To avoid such problems:
• when using the PGI compilers use the option -mcmodel=medium
• when using the Intel compilers use the options -mcmodel=medium -shared-intel

Setting other resource limits: ulimit
ulimit provides control over the resources available to processes.
    ulimit -a           report all available resource limits
    ulimit -s XXXXX     set the maximum stack size
It is sometimes necessary to set the hard limit, e.g.
    ulimit -sH XXXXXX
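As an illustration of the soft and hard limits mentioned above, the following sketch raises the soft stack limit of the current bash shell to whatever the hard limit already allows. The function name raise_stack_to_hard is hypothetical, and the example assumes bash's ulimit builtin with its -S (soft) and -H (hard) flags.

```shell
#!/bin/bash
# Sketch only: raise_stack_to_hard is an illustrative name.

# Raise the soft stack limit of the current shell to its hard limit.
# Processes started afterwards from this shell inherit the new limit.
raise_stack_to_hard() {
    local hard
    hard=$(ulimit -Hs)      # hard limit: a number of kilobytes, or "unlimited"
    ulimit -Ss "$hard"      # the soft limit may be raised up to the hard limit
    echo "stack limit: soft=$(ulimit -Ss) hard=$hard"
}

raise_stack_to_hard
```

Placing such a call in a job script before the executable is started is one way to apply the limit to the program itself.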

Useful Links for Memory Issues
• 64 bit programming memory issues: http://www.ualberta.ca/CNS/RESEARCH/LinuxClusters/64-bit.html
• Understanding Memory: http://www.ualberta.ca/CNS/RESEARCH/LinuxClusters/mem.html

Checkpointing Jobs
• Simplest method for checkpointing: ensure that applications save configurations at regular intervals so that jobs may be restarted (if necessary) using these configuration files.
• Using the BLCR checkpointing environment
  • BLCR commands
  • Using BLCR checkpoint with an SGE job
• Help on checkpointing: https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html
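The "simplest method" can be sketched for a job whose work is a sequence of steps: the script records the last completed step in a state file and, when started again, resumes from there. This is an illustration under assumed names (STATE_FILE, TOTAL_STEPS and run_with_checkpoints are all hypothetical); a real application would save whatever configuration it needs in order to restart.

```shell
#!/bin/bash
# Sketch only: all names below are illustrative.
STATE_FILE=${STATE_FILE:-progress.state}
TOTAL_STEPS=${TOTAL_STEPS:-10}

run_with_checkpoints() {
    local step=0
    # Resume from the last saved step if a state file already exists.
    if [ -s "$STATE_FILE" ]; then
        step=$(cat "$STATE_FILE")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        # ... one unit of real work would be done here ...
        # Write to a temporary file and rename it, so that an interruption
        # during the write cannot leave a half-written state file behind.
        echo "$step" > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
    done
    echo "finished at step $step"
}
```

Calling run_with_checkpoints from a job script, and resubmitting the same script if the job is killed, continues from the recorded step instead of starting again from the first.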

Checkpointing jobs: Using BLCR
BLCR commands relating to checkpointing: cr_run, cr_checkpoint, cr_restart
• Set an environment variable to avoid an error:
    export LIBCR_DISABLE_NSCD=1
• Start running the code under the control of the checkpoint system:
    cr_run myexecutable [parameters]
• Find out its process ID (PID):
    ps | grep myexecutable
• Checkpoint it and write the state into a file:
    cr_checkpoint -f checkpoint.file PID
• If and when myexecutable fails, crashes, runs out of time etc., it can be restarted from the checkpoint file you specified:
    cr_restart checkpoint.file

Using BLCR checkpoint with an SGE Job
A checkpoint environment called BLCR has been set up. It is accessible using the test queue cstest.q. An example of a checkpointing job would look something like:
    #$ -l h_rt=168:00:00
    #$ -c sx
    #$ -ckpt blcr
    cr_run ./executable >> output.file
The -c hh:mm:ss option tells SGE to checkpoint at the specified time interval. The -c sx option tells SGE to checkpoint if the queue is suspended, or if the execution daemon is killed.

Restart a checkpointed job
• Create a new job script with the same options, but use the cr_restart command to resume the job:
    #!/bin/bash
    #$ -l h_rt=165:50:00
    [..... any other normal options...]
    #$ -ckpt blcr
    #$ -c sx
    cr_restart checkpoint.[jobId].[Pid]
  replacing [jobId] and [Pid] with the values for the checkpoint file.
• Each time the job ends a new checkpoint file will be generated, and you can then use the new checkpoint file to resubmit the job.
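Because a new checkpoint file appears each time the job ends, a restart script could pick up the newest one instead of having [jobId] and [Pid] edited in by hand. The sketch below assumes the checkpoint files are written to the job's working directory with the checkpoint.[jobId].[Pid] naming shown above; latest_checkpoint is a hypothetical helper, not part of BLCR or SGE.

```shell
#!/bin/bash
#$ -l h_rt=165:50:00
#$ -ckpt blcr
#$ -c sx
# Sketch only: latest_checkpoint is an illustrative helper.

# Print the most recently modified checkpoint.[jobId].[Pid] file in a
# directory (default: the current directory), or nothing if there is none.
latest_checkpoint() {
    local dir=${1:-.}
    ls -t "$dir"/checkpoint.*.* 2>/dev/null | head -n 1
}

ckpt=$(latest_checkpoint .)
if [ -n "$ckpt" ] && command -v cr_restart >/dev/null 2>&1; then
    cr_restart "$ckpt"
else
    echo "No checkpoint file found, or cr_restart is not on the PATH."
fi
```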

Getting Help
• The wrgrid website: https://www.sheffield.ac.uk/wrgrid
• How to use: https://www.sheffield.ac.uk/wrgrid/using
• Software: https://www.sheffield.ac.uk/wrgrid/software
• Data Management: https://www.sheffield.ac.uk/wrgrid/data
• FAQs: https://www.sheffield.ac.uk/wrgrid/questionanswer
• News and Events: https://www.sheffield.ac.uk/wrgrid/events
• Training: https://www.sheffield.ac.uk/wrgrid/training
• Contacts: https://www.sheffield.ac.uk/wrgrid/contacts
• CICS Helpdesk
• Iceberg admins

Building Applications 8: Debugging
The Portland Group debugger is a symbolic debugger for Fortran, C and C++ programs. It allows the control of program execution using breakpoints and single stepping, and enables the state of a program to be checked by examination of variables and memory locations.

Building Applications 9: Debugging
The PGDBG debugger is invoked using the pgdbg command as follows:
    pgdbg arguments program arg1 arg2 ... argn
arguments may be any of the pgdbg command line arguments. program is the name of the target program being debugged, and arg1, arg2, ... argn are the arguments to the program. To get help from pgdbg use:
    pgdbg -help

Building Applications 10: Debugging
The PGDBG GUI is invoked by default using the command pgdbg. Note that in order to use the debugging tools, applications must be compiled with the -g switch, which enables the generation of symbolic debugger information.

Building Applications 11: Profiling
The PGPROF profiler enables the profiling of single process, multi-process MPI or SMP OpenMP programs, or programs compiled with the -Mconcur option. The generated profiling information enables the identification of the portions of the application that will benefit most from performance tuning. Profiling generally involves three stages:
• compilation
• execution
• analysis (using the profiler)

Building Applications 12: Profiling
To use profiling it is necessary to compile your program with the options indicated in the table below:

Building Applications 13: Profiling
The PG profiler is executed using the command
    pgprof [options] [datafile]
where datafile is a pgprof.out file generated from the program execution.

Shared Memory applications using OpenMP
Fortran and C programs containing OpenMP compiler directives can be compiled to take advantage of parallel processing on iceberg. The OpenMP model of programming uses a thread model whereby a number of instances ("threads") of a program run simultaneously, when necessary communicating with each other via the memory that is shared by all threads. Although any given processor can run multiple threads of the same program via the operating system's multi-tasking ability, it is more efficient to allocate one thread per processor in a shared memory machine. On iceberg we have the following types of compute nodes:
• 2 dual-core (= 2*2 = 4 processors) AMD nodes
• 2 quad-core (= 2*4 = 8 processors) AMD nodes
• 2 six-core (= 2*6 = 12 processors) Intel nodes
Therefore it is usually advisable to restrict OpenMP jobs to about 12 threads when using iceberg.

Shared Memory Applications: Compiling OpenMP applications
Source code that contains OpenMP directives (#pragma omp in C/C++, !$OMP in Fortran) for parallel programming can be compiled using the following flags:
• PGI C, C++, Fortran77 or Fortran90:
    pgf77, pgf90, pgcc or pgCC -mp [other options] filename
• Intel C/C++, Fortran:
    ifort, icc or icpc -openmp [other options] filename
• GNU C/C++, Fortran:
    gcc or gfortran -fopenmp [other options] filename
Note that source code compilation does not require working within a job using the openmp environment. Only the execution of an OpenMP parallel executable requires such an environment, which is requested by using the -pe openmp flag with the qsub or qsh commands.

Shared Memory Applications: Specifying Required Number of Threads
The number of parallel execution threads at execution time is controlled by setting the environment variable OMP_NUM_THREADS to the appropriate value.
• For the bash or sh shell (which is the default shell on iceberg) use:
    export OMP_NUM_THREADS=6
• If you are using the csh or tcsh shell, use:
    setenv OMP_NUM_THREADS 6

Shared Memory Applications: Starting an OpenMP interactive job
Short interactive jobs that use OpenMP parallel programming are allowed. Although up to 48-way parallel jobs can theoretically be run this way, due to the high utilisation of the cluster we recommend that you do not exceed 12-way jobs. Here is an example of starting a 12-way interactive job:
    qsh -pe openmp 12
or
    qrsh -pe openmp 12
And in the new shell that starts, type:
    export OMP_NUM_THREADS=12
Alternatively, the effect of these two commands can be achieved via the -v parameter, e.g.
    qsh -pe openmp 12 -v OMP_NUM_THREADS=12
The number of threads to use can later be redefined in the same job, to experiment with hyper-threading for example.
Important note: although the number of processors required is specified with the -pe option, it is still necessary to ensure that the OMP_NUM_THREADS environment variable is set to the correct value.

Shared Memory Applications: Submitting an OpenMP Job to Sun Grid Engine
The job is submitted to a special parallel environment that ensures the job occupies the required number of slots. Using the SGE command qsub, the openmp parallel environment is requested using the -pe option as follows:
    qsub -pe openmp 12 -v OMP_NUM_THREADS=12 myjobfile.sh
The following job script, job.sh, is submitted using
    qsub job.sh
where job.sh is
    #!/bin/bash
    #$ -cwd
    #$ -pe openmp 12
    #$ -v OMP_NUM_THREADS=12
    ./executable
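One way to keep the slot count and the thread count from drifting apart is to derive OMP_NUM_THREADS from NSLOTS, the variable SGE sets to the number of slots granted by -pe. The following variant of the job script is a sketch under that assumption; set_omp_threads is a hypothetical helper name, and ./executable is the placeholder used above.

```shell
#!/bin/bash
#$ -cwd
#$ -pe openmp 12
# Sketch only: set_omp_threads is an illustrative helper.

# Inside an SGE job NSLOTS holds the number of slots granted by -pe;
# outside a job it is unset, so fall back to a single thread.
set_omp_threads() {
    export OMP_NUM_THREADS="${NSLOTS:-1}"
}

set_omp_threads
echo "Running with $OMP_NUM_THREADS OpenMP threads"

# Run the program only if it is present in the working directory.
if [ -x ./executable ]; then
    ./executable
fi
```

With this form only the -pe line needs changing when a different number of threads is wanted.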

Parallel Programming with MPI: Introduction
Iceberg is designed with the aim of running MPI (Message Passing Interface) parallel jobs, and the Sun Grid Engine is able to handle MPI jobs. In a message passing parallel program each process executes the same binary code but follows a different path through the code; this is SPMD (single program multiple data) execution. Iceberg uses the openmpi-ib and mvapich2-ib implementations provided by infiniband (quadrics/connectX), using the IB fast interconnect at 32 gigabits/second.

MPI Tutorials
From an iceberg worker, execute the following command:
    tar -zxvf /usr/local/courses/intrompi.tgz
The directory which has been created contains some sample MPI applications which you may compile and run.

Set The Correct Environment for MPI
• For batch jobs the environment is normally set in the Makefile or job script.
• See the script file mpienv.sh (in the intrompi directory):
  • Set the correct environment by pasting it into your .bashrc file.
  • Set the environment by typing: source mpienv.sh
• Use modules:
  • Show available compilers: module avail
  • Load a module, e.g. module add mpi/intel/openmpi/1.6.4

Environments for MPI on Iceberg
• Openmpi with gigabit ethernet
• Openmpi with infiniband
• Mvapich2 with infiniband
    mpirun_rsh -rsh -np $NSLOTS -hostfile $TMPDIR/machines ./executable
NOTE: There are environments for the GNU, PGI and Intel compilers.

Parallel Programming with MPI 2: Hello MPI World!
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank; /* my rank in MPI_COMM_WORLD */
        int size; /* size of MPI_COMM_WORLD */

        /* Always initialise MPI by this call before using any MPI functions. */
        MPI_Init(&argc, &argv);

        /* Find out how many processes are taking part in the computations. */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Get the rank of the current process. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("Hello MPI world from C!\n");

        printf("There are %d processes in my world, and I have rank %d\n", size, rank);

        MPI_Finalize();
        return 0;
    }

Parallel Programming with MPI: Output from Hello MPI World!
When run on 4 processors the MPI Hello World program produces the following output:
    Hello MPI world from C!
    There are 4 processes in my world, and I have rank 2
    There are 4 processes in my world, and I have rank 0
    There are 4 processes in my world, and I have rank 3
    There are 4 processes in my world, and I have rank 1

Parallel Programming with MPI: Compiling MPI Applications Using Infiniband
To compile C, C++, Fortran77 or Fortran90 MPI code using the Portland compiler, type:
    mpif77 [compiler options] filename
    mpif90 [compiler options] filename
    mpicc [compiler options] filename
    mpiCC [compiler options] filename

Parallel Programming with MPI: Compiling MPI Applications Using Gigabit ethernet on X2200's
To compile C, C++, Fortran77 or Fortran90 MPI code using the Portland compiler with OpenMPI, type:
    export MPI_HOME="/usr/local/packages5/openmpi-pgi/bin"
    $MPI_HOME/mpif77 [compiler options] filename
    $MPI_HOME/mpif90 [compiler options] filename
    $MPI_HOME/mpicc [compiler options] filename
    $MPI_HOME/mpiCC [compiler options] filename

Parallel Programming with MPI: Submitting an MPI Job to Sun Grid Engine
To submit an MPI job to Sun Grid Engine, use the openmpi-ib parallel environment, which ensures that the job occupies the required number of slots. Using the SGE command qsub, the openmpi-ib parallel environment is requested using the -pe option as follows:
    qsub -pe openmpi-ib 4 myjobfile.sh

Parallel Programming with MPI: Sun Grid Engine MPI Job Script
The following job script, job.sh, is submitted using
    qsub job.sh
where job.sh is
    #!/bin/sh
    #$ -cwd
    #$ -pe openmpi-ib 4
    # SGE_HOME to locate sge mpi execution script
    #$ -v SGE_HOME=/usr/local/sge6_2
    /usr/mpi/pgi/openmpi-1.2.8/bin/mpirun ./mpiexecutable

Parallel Programming with MPI: Sun Grid Engine MPI Job Script
Using the mpirun_rsh executable directly, the job is submitted using qsub in the same way, but the script file job.sh is:
    #!/bin/sh
    #$ -cwd
    #$ -pe mvapich2-ib 4
    # MPIR_HOME from submitting environment
    #$ -v MPIR_HOME=/usr/mpi/pgi/mvapich2-1.2p1
    $MPIR_HOME/bin/mpirun_rsh -rsh -np 4 -hostfile $TMPDIR/machines ./mpiexecutable

Parallel Programming with MPI: Sun Grid Engine OpenMPI Job Script
Using the OpenMPI mpirun executable directly, the job is submitted using qsub in the same way, but the script file job.sh is:
    #!/bin/sh
    #$ -cwd
    #$ -pe ompigige 4
    # MPIR_HOME from submitting environment
    #$ -v MPIR_HOME=/usr/local/packages5/openmpi-pgi
    $MPIR_HOME/bin/mpirun -np 4 -machinefile mpiexecutable

Parallel Programming with MPI 10: Extra Notes
• The number of slots required and the parallel environment must be specified using -pe openmpi-ib NSLOTS.
• The job must be executed using the correct PGI/Intel/GNU implementation of mpirun.
Note also:
• The number of processors is specified using -np NSLOTS.
• Specify the location of the machinefile used for your parallel job; this will be located in a temporary area on the node that SGE submits the job to.
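These notes can be combined into a job script that takes the process count and the machinefile location from SGE's own variables instead of hard-coding them. The sketch below assumes the mvapich2-ib environment and the mpirun_rsh command line shown earlier in these slides; build_launch and the LAUNCH array are hypothetical names, and the MPIR_HOME default simply repeats the path used in the earlier script.

```shell
#!/bin/bash
#$ -cwd
#$ -pe mvapich2-ib 4
# Sketch only: build_launch and LAUNCH are illustrative names.
MPIR_HOME=${MPIR_HOME:-/usr/mpi/pgi/mvapich2-1.2p1}

# Assemble the launch command in the global array LAUNCH.
# NSLOTS and TMPDIR are set by SGE inside a job; the fallbacks only
# keep the script printable outside one.
build_launch() {
    LAUNCH=( "$MPIR_HOME/bin/mpirun_rsh" -rsh
             -np "${NSLOTS:-1}"
             -hostfile "${TMPDIR:-/tmp}/machines"
             "$@" )
}

build_launch ./mpiexecutable
if [ -x "${LAUNCH[0]}" ]; then
    "${LAUNCH[@]}"
else
    echo "dry run: ${LAUNCH[*]}"
fi
```

Changing the slot count on the -pe line is then enough to change the number of MPI processes.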

Parallel Programming with MPI 10: Pros and Cons
The downside to message passing codes is that they are harder to write than scalar or shared memory codes. The system bus on a modern CPU can pass in excess of 4 Gbits/sec between the memory and the CPU. A fast ethernet between PCs may only pass up to 200 Mbits/sec between machines over a single ethernet cable, and this can be a potential bottleneck when passing data between compute nodes. The solution to this problem for a high performance cluster such as iceberg is to use a high performance network, such as the 16 Gbit/sec interconnect provided by infiniband. The availability of such high performance networking makes a scalable parallel machine possible.
