TTU High Performance Computing User Training: Part 2
Srirangam Addepalli and David Chaffin, Ph.D.
Advanced Session: Outline
• Cluster Architecture
• File System and Storage
Lectures with Labs:
• Advanced Batch Jobs
• Compilers/Libraries/Optimization
• Compiling/Running Parallel Jobs
• Grid Computing
HPCC Clusters
• hrothgar: 128 dual-processor 64-bit Xeons, 3.2 GHz, 4 GB memory, InfiniBand and Gigabit Ethernet, CentOS 4.3 (Red Hat)
• community cluster: 64 nodes, part of hrothgar, same configuration except no InfiniBand; owned by faculty members and controlled by batch queues
• minigar: 20 nodes, 3.6 GHz, InfiniBand; for development, opening soon
• Physics grid machine on order: some nodes will be available
• poseidon: Opteron, 3 nodes, PathScale compilers
• Several retired, test, and grid systems
Cluster Performance
Main factors:
1. Individual node performance, of course. SpecFP2000Rate (www.spec.org) matches our applications well. The newest dual-core chips have 2x the cores and roughly 1.5x the performance per core, for about 3x the performance per node versus hrothgar.
2. Fabric latency (delay time of one message, in microseconds: IB = 6, GigE = 40)
3. Fabric bandwidth (in MB/s: IB = 600, GigE = 60)
Intel has the better CPU right now; AMD has better shared-memory performance. Overall they are about equal.
Cluster Architecture
An application example where the system is limited by interconnect performance (gromacs, measured as simulation time completed per unit of real time):
• hrothgar, 8 nodes, GigE: ~1200 ns/day
• hrothgar, 8 nodes, IB: ~2800 ns/day
Current dual-core systems have 3x the serial throughput of hrothgar, and quad-core systems are coming next year. They need more bandwidth: in the future, GigE will be suitable only for serial jobs.
Cluster Usage
• ssh to hrothgar
• scp files to hrothgar
• compile on hrothgar
• run on the compute nodes (only) using the LSF batch system (only)
• example files: /home/shared/examples/
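As a concrete illustration of that workflow (the user name elvis, the file names, and the exact hostname form are placeholders, not taken from the slides):
  ssh elvis@hrothgar                      # log in to the head node
  scp mycode.c elvis@hrothgar:~/work/     # copy source files over
  icc -O -o mycode mycode.c               # compile on the head node
  bsub < myjob.sh                         # submit through LSF; never run jobs on the head node itself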
Some Useful LSF Commands
• bjobs -w (-w for wide shows the full node name)
• bjobs -l [job#] (-l for long shows everything)
• bqueues [-l] shows the queues [everything]
• bhist [job#] job history
• bpeek [job#] stdout/stderr stored by LSF
• bkill job# kill the job
-bash-3.00$ /home/shared/bin/check-hosts-batch.sh
hrothgar, 2 free=0 nodes, 0 cpus
hrothgar, 1 free=3 nodes, 3 cpus
hrothgar, 0 free=125 nodes
hrothgar, offline=0 nodes
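A typical monitoring sequence with these commands might look like the following (the job number 1234 is just a placeholder for whatever number bsub reports):
  bsub < mpi-basic.sh     # submit; LSF prints the assigned job number
  bjobs -w                # check the job state and which nodes it landed on
  bpeek 1234              # look at the stdout/stderr captured so far
  bhist 1234              # review the job's scheduling history
  bkill 1234              # kill the job if it misbehaves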
Batch Queues on hrothgar
bqueues
QUEUE_NAME     PRIO  STATUS  MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN
short           35   Open     56    56     -     -      0     0    0
parallel        35   Open    224    40     -     -    108     0  108
serial          30   Open    156    60     -     -    204   140   64
parallel_long   25   Open    256    64     -     -     16     0   16
idle            20   Open    256   256     -     -    100     0   55
Every 30 seconds the scheduler cycles through the queued jobs. A job starts if:
(1) nodes are available (free, or running only idle-queue jobs)
(2) the user's CPUs are below the per-user queue limit (bqueues JL/U)
(3) the queue's CPUs are below the total queue limit (bqueues MAX)
(4) it is in the highest-priority eligible queue (short, parallel, serial, parallel_long, idle)
(5) fair share: the user with the smallest current usage goes first
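For example, to send a job to a particular queue from the table above and inspect that queue's limits (the script name myjob.sh is a placeholder):
  bsub -q short < myjob.sh     # submit to the short queue instead of the default
  bqueues -l short             # show the full limits and policies for that queue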
Unix/Linux Compiling: Common Features
[compiler] [options] [source files] [linker options]
(PathScale is only on poseidon)
• C compilers: gcc, icc, pathcc
• C++: g++, icpc, pathCC
• Fortran: g77, ifort, pathf90
• Options: -O [optimize], -o outputfilename
• Source files: new.f or *.f or *.c
• Linker options: to link with libx.a or libx.so in /home/elvis/lib: -L/home/elvis/lib -lx
• Many programs also need: -lm, -pthread
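Putting those pieces together, a complete compile line might look like this (the output name myprog is a placeholder; the source file and library path are the illustrative names from the slide):
  ifort -O -o myprog new.f -L/home/elvis/lib -lx
  icc -O -o myprog *.c -lm -pthread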
MPI Compile: Path
. /home/shared/examples/new-bashrc          [using bash]
source /home/shared/examples/new-cshrc      [using tcsh]
hrothgar:dchaffin:dchaffin $ echo $PATH
/sbin:/bin:/usr/bin:/usr/sbin:/usr/X11R6/bin:\
/usr/share/bin:/opt/rocks/bin:/opt/rocks/sbin:\
/opt/lsfhpc/6.2/linux2.6-glibc2.3-x86_64/bin:\
/opt/intel/fce/9.0/bin:/opt/intel/cce/9.0/bin:\
/share/apps/mpich/IB-icc-ifort-64/bin:\
/opt/lsfhpc/6.2/linux2.6-glibc2.3-x86_64/bin
MPICH comes in several builds: IB or GE fabric, built with icc or gcc or pathcc, and ifort or g77 or pathf90.
mpicc/mpif77/mpif90/mpiCC must match mpirun!
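To confirm which MPICH build you will actually pick up after sourcing one of those files, and that the compiler wrapper and mpirun come from the same build, you can check (ordinary shell commands, not specific to these slides):
  which mpicc
  which mpirun
  # both should point into the same MPICH directory, e.g. /share/apps/mpich/IB-icc-ifort-64/bin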
MPI Compile/Run
• cp /home/shared/examples/mpi-basic.sh .
• cp /home/shared/examples/cpi.c .
• /opt/mpich/gnu/bin/mpicc cpi.c    [or]
• /share/apps/mpich/IB-icc-ifort-64/bin/mpicc cpi.c
• vi mpi-basic.sh: set the ptile, and comment out the mpirun line you are not using (either the IB build or the default); you can also change the executable name
• bsub < mpi-basic.sh (a minimal sketch of such a script is given below)
• The run produces:
  job#.out      LSF output
  job#.pgm.out  mpirun output
  job#.err      LSF stderr
  job#.pgm.err  mpirun stderr
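Use the actual mpi-basic.sh from /home/shared/examples/ for real jobs; the following is only a minimal sketch of what such an LSF MPI script typically contains, with the job name, queue, process count, and ptile chosen as assumptions for illustration:
  #!/bin/bash
  #BSUB -J cpi-test                 # job name (assumption)
  #BSUB -q parallel                 # queue name from the bqueues table above
  #BSUB -n 8                        # number of MPI processes
  #BSUB -R "span[ptile=2]"          # processes per node (the "ptile" mentioned on the slide)
  #BSUB -o %J.out                   # LSF stdout  -> job#.out
  #BSUB -e %J.err                   # LSF stderr  -> job#.err

  # Build a machinefile from the hosts LSF assigned to this job
  echo $LSB_HOSTS | tr ' ' '\n' > hosts.$LSB_JOBID

  # Use the mpirun that matches the mpicc used to build a.out (IB build shown here)
  /share/apps/mpich/IB-icc-ifort-64/bin/mpirun -np 8 \
      -machinefile hosts.$LSB_JOBID ./a.out \
      > $LSB_JOBID.pgm.out 2> $LSB_JOBID.pgm.err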
Exercise/Homework
Run the MPI benchmark on InfiniBand, Ethernet, and shared memory. Compare latency and bandwidth. Research and briefly discuss the reasons for the performance:
• Hardware bandwidth (look it up)
• Software layers (OS, interrupts, MPI, one-sided copy, two-sided copy)
Hardware:
• Topspin InfiniBand SDR, PCI-X
• Xeon Nocona shared memory
• Intel Gigabit, on board
Program: /home/shared/examples/mpilc.c or equivalent
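One possible way to set up the three cases (the Gigabit Ethernet MPICH directory name below is an assumption based on the install layout shown earlier; list /share/apps/mpich/ to confirm the exact names):
  # InfiniBand build
  /share/apps/mpich/IB-icc-ifort-64/bin/mpicc -O -o mpilc-ib mpilc.c

  # Gigabit Ethernet build (directory name is an assumption)
  /share/apps/mpich/GE-icc-ifort-64/bin/mpicc -O -o mpilc-ge mpilc.c

  # Shared memory: run both ranks on one node (e.g. -np 2 with ptile=2 in the
  # batch script) so the messages never leave the node, then compare the
  # latency/bandwidth numbers against the two-node IB and GigE runs.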