High Performance Computing Workshop: HPC 101
Dr. Charles J Antonelli, LSAIT ARS
February, 2014
Credits • Contributors: • Brock Palen (CAEN HPC) • Jeremy Hallum (MSIS) • Tony Markel (MSIS) • Bennet Fauber (CAEN HPC) • Mark Montague (LSAIT ARS) • Nancy Herlocher (LSAIT ARS) • LSAIT ARS • CAEN HPC
Roadmap • High Performance Computing • Flux Architecture • Flux Mechanics • Flux Batch Operations • Introduction to Scheduling
High Performance Computing
Cluster HPC • A computing cluster: a number of computing nodes connected via special hardware and software that together can solve large problems. • A cluster is much less expensive than a single supercomputer (e.g., a mainframe) • Using clusters effectively requires support in scientific software applications (e.g., Matlab's Parallel Toolbox, or R's snow library), or custom code
Programming Models • Two basic parallel programming models:
• Message-passing: The application consists of several processes running on different nodes and communicating with each other over the network • Used when the data are too large to fit on a single node, and simple synchronization is adequate • “Coarse-grained parallelism” • Implemented using MPI (Message Passing Interface) libraries
• Multi-threaded: The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives • Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable • “Fine-grained parallelism” or “shared-memory parallelism” • Implemented using OpenMP (Open Multi-Processing) compilers and libraries
• Both models can be combined in a single application (hybrid)
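As a concrete illustration (a minimal sketch; the binary names mpi_prog and omp_prog are hypothetical), the same 12 cores can be driven either as 12 communicating MPI processes or as one process running 12 OpenMP threads:

  # Message-passing: launch 12 separate processes that talk via MPI
  mpirun -np 12 ./mpi_prog
  # Shared-memory: one process, 12 OpenMP threads, all on one node
  export OMP_NUM_THREADS=12
  ./omp_prog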
Amdahl’s Law
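The slide's content is graphical; in equation form, Amdahl's Law says that if a fraction p of a program's work can be parallelized across N cores, the achievable speedup is

  S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

so even as N grows without bound the speedup is capped at 1/(1 - p): the serial fraction, not the core count, dominates at scale.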
Flux Architecture
Flux • Flux is a university-wide shared computational discovery / high-performance computing service. • Provided by Advanced Research Computing at U-M • Operated by CAEN HPC • Procurement, licensing, billing by U-M ITS • Interdisciplinary since 2010 • http://arc.research.umich.edu/resources-services/flux/
The Flux cluster • Login nodes • Compute nodes • Data transfer node • Storage • …
A Flux node • 12-16 Intel cores • 48-64 GB RAM • Local disk • Ethernet • InfiniBand
A Large Memory Flux node • 32-40 Intel cores • 1 TB RAM • Local disk • Ethernet • InfiniBand
Coming soon: A Flux GPU node • 16 Intel cores • 64 GB RAM • 8 GPUs (each GPU contains 2,688 GPU cores) • Local disk
Flux software • Licensed and open software: • Abaqus, BLAST, BWA, bowtie, ANSYS, Java, Mason, Mathematica, Matlab, R, RSEM, Stata SE, … • See http://cac.engin.umich.edu/resources • C, C++, Fortran compilers: • Intel (default), PGI, GNU toolchains • You can choose software using the module command
Flux network • All Flux nodes are interconnected via InfiniBand and a campus-wide private Ethernet network • The Flux login nodes are also connected to the campus backbone network • The Flux data transfer node is connected to the campus backbone network over a 10 Gbps link • This means: • The Flux login nodes can access the Internet • The Flux compute nodes cannot • If InfiniBand is not available on a compute node, code on that node will fall back to Ethernet communications
Flux data • Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes • 640 TB of short-term storage for batch jobs • Large, fast, short-term • NFS filesystems mounted on /home and /home2 on all nodes • 80 GB of storage per user for development & testing • Small, slow, long-term
Flux data • Flux does not provide large, long-term storage • Alternatives: • Value Storage (NFS) • $20.84 / TB / month (replicated, no backups) • $10.42 / TB / month (non-replicated, no backups) • LSA Large Scale Research Storage • 2 TB free to researchers (replicated, no backups) • Faculty members, lecturers, postdocs, GSI/GSRA • Additional storage $30 / TB / year (replicated, no backups) • Departmental server • CAEN can mount your storage on the login nodes
Copying data • Three ways to copy data to/from Flux:
• From Linux or Mac OS X, use scp:
scp localfile login@flux-xfer.engin.umich.edu:remotefile
scp login@flux-login.engin.umich.edu:remotefile localfile
scp -r localdir login@flux-xfer.engin.umich.edu:remotedir
• From Windows, use WinSCP (available from U-M Blue Disc: http://www.itcs.umich.edu/bluedisc/)
• Use Globus Connect
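For example (uniqname and the file names are placeholders), pushing input data to Flux and later pulling results back from a Linux or Mac workstation:

  scp mydata.csv uniqname@flux-xfer.engin.umich.edu:mydata.csv
  scp uniqname@flux-xfer.engin.umich.edu:results.out results.out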
Globus Connect • Features • High-speed data transfer, much faster than SCP or SFTP • Reliable & persistent • Minimal client software: Mac OS X, Linux, Windows • GridFTP Endpoints • Gateways through which data flow • Exist for XSEDE, OSG, … • UMich: umich#flux, umich#nyx • Add your own client endpoint! • Add your own server endpoint: contact flux-support@umich.edu • More information • http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp
Flux Mechanics
Using Flux • Three basic requirements to use Flux: • A Flux account • A Flux allocation • An MToken (or a Software Token)
Using Flux • A Flux account • Allows login to the Flux login nodes • Develop, compile, and test code • Available to members of the U-M community, free • Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication
Using Flux • A Flux allocation • Allows you to run jobs on the compute nodes • Some units cost-share Flux rates:
• Regular Flux: $11.72/core/month; LSA, Engineering, Medical School: $6.60/month
• Large Memory Flux: $23.82/core/month; LSA, Engineering, Medical School: $13.30/month
• GPU Flux: $107.10 per 2 CPU cores and 1 GPU per month; LSA, Engineering, Medical School: $60/month
• Flux Operating Environment: $113.25/node/month; LSA, Engineering, Medical School: $63.50/month
• Flux pricing at http://arc.research.umich.edu/flux/hardware-services/
• Rackham grants are available for graduate students • Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/
• To inquire about Flux allocations please email flux-support@umich.edu
Using Flux • An MToken (or a Software Token) • Required for access to the login nodes • Improves cluster security by requiring a second means of proving your identity • You can use either an MToken or an application for your mobile device (called a Software Token) for this • Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/login-nodes/tfa
Logging in to Flux • ssh flux-login.engin.umich.edu • MToken (or Software Token) required • You will be randomly connected to a Flux login node • Currently flux-login1 or flux-login2 • Firewalls restrict access to flux-login. To connect successfully, either: • Physically connect your ssh client platform to the U-M campus wired or MWireless network, or • Use VPN software on your client platform, or • Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh to flux-login from there
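For example, to reach flux-login from off campus without VPN, hop through the ITS login node first (uniqname is a placeholder):

  ssh uniqname@login.itd.umich.edu
  # then, from the ITS login node:
  ssh flux-login.engin.umich.edu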
Modules • The module command allows you to specify what versions of software you want to use:
module list -- Show loaded modules
module load name -- Load module name for use
module avail -- Show all available modules
module avail name -- Show versions of module name*
module unload name -- Unload module name
module -- List all options
• Enter these commands at any time during your session • A configuration file allows default module commands to be executed at login • Put module commands in the file ~/privatemodules/default • Don’t put module commands in your .bashrc / .bash_profile
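For example, to find and load a particular version of R (the version number shown is hypothetical; use module avail to see what is actually installed):

  module avail R
  module load R/3.0.1
  module list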
Flux environment • The Flux login nodes have the standard GNU/Linux toolkit: • make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, … • Watch out for source code or data files written on non-Linux systems • Use these tools to analyze and convert source files to Linux format: • file • dos2unix
Lab 1 • Task: Invoke R interactively on the login node
module load R
module list
R
q()
• Please run only very small computations on the Flux login nodes, e.g., for testing
Lab 2 • Task: Run R in batch mode
• module load R
• Copy the sample code to your login directory:
cd
cp ~cja/hpc-sample-code.tar.gz .
tar -zxvf hpc-sample-code.tar.gz
cd ./hpc-sample-code
• Examine Rbatch.pbs and Rbatch.R
• Edit Rbatch.pbs with your favorite Linux editor • Change the #PBS -M email address to your own
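Before editing, it may help to know roughly what a serial R batch script looks like. This is a hedged sketch in the style of the batch scripts shown later in this workshop, not necessarily the exact contents of Rbatch.pbs; the allocation and email address are placeholders:

  #PBS -N Rbatch
  #PBS -V
  #PBS -A FluxTraining_flux
  #PBS -l qos=flux
  #PBS -q flux
  #PBS -l procs=1,walltime=00:15:00
  #PBS -M youremailaddress
  #PBS -m abe
  #PBS -j oe
  # Run the R script non-interactively, writing output to Rbatch.out
  cd $PBS_O_WORKDIR
  R CMD BATCH --vanilla Rbatch.R Rbatch.out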
Lab 2 • Task: Run R in batch mode
• Submit your job to Flux: qsub Rbatch.pbs
• Watch the progress of your job: qstat -u uniqname, where uniqname is your own uniqname
• When complete, look at the job’s output: less Rbatch.out
• Copy your results to your local workstation (change uniqname to your own uniqname):
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rbatch.out Rbatch.out
Lab 3 • Task: Use the multicore package • The multicore package allows you to use multiple cores on the same node
• module load R
• cd ~/hpc-sample-code
• Examine Rmulti.pbs and Rmulti.R
• Edit Rmulti.pbs with your favorite Linux editor • Change the #PBS -M email address to your own
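Because multicore parallelism is confined to a single node, the job must request all of its cores on one machine. A hedged sketch of the resource request Rmulti.pbs might use (core count illustrative; the other directives follow the Lab 2 sketch, and the actual sample file may differ):

  #PBS -N Rmulti
  # One node, four cores on that node, for the multicore package
  #PBS -l nodes=1:ppn=4,walltime=00:15:00
  cd $PBS_O_WORKDIR
  R CMD BATCH --vanilla Rmulti.R Rmulti.out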
Lab 3 • Task: Use the multicore package
• Submit your job to Flux: qsub Rmulti.pbs
• Watch the progress of your job: qstat -u uniqname, where uniqname is your own uniqname
• When complete, look at the job’s output: less Rmulti.out
• Copy your results to your local workstation (change uniqname to your own uniqname):
scp uniqname@flux-xfer.engin.umich.edu:hpc-sample-code/Rmulti.out Rmulti.out
Compiling Code • Assuming default module settings: • Use mpicc/mpiCC/mpif90 for MPI code • Use icc/icpc/ifort with -openmp for OpenMP code
• Serial code, Fortran 90:
ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90
• Serial code, C:
icc -O3 -ipo -no-prec-div -xHost -o prog prog.c
• MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
mpirun -np 2 ./prog
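The slide gives no OpenMP example; here is a minimal sketch under the same compiler defaults (omp_prog.c is a hypothetical source file containing OpenMP pragmas):

  icc -O3 -openmp -o omp_prog omp_prog.c
  export OMP_NUM_THREADS=12   # run with 12 threads on one node
  ./omp_prog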
Lab 4 • Task: Compile and execute simple programs on the Flux login node
• Copy the sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code
• Examine, compile & execute helloworld.f90:
ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
./f90hello
• Examine, compile & execute helloworld.c:
icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
./chello
• Examine, compile & execute MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
mpirun -np 2 ./c_ex01
Makefiles • The make command automates your code compilation process • It uses a makefile to specify dependencies between source and object files • The sample directory contains a sample makefile
• To compile c_ex01: make c_ex01
• To compile all programs in the directory: make
• To remove all compiled programs: make clean
• To build all the programs using 8 parallel compiles: make -j8
Flux Batch Operations
Portable Batch System • All production runs are run on the compute nodes using the Portable Batch System (PBS) • PBS manages all aspects of cluster job execution except job scheduling • Flux uses the Torque implementation of PBS • Flux uses the Moab scheduler for job scheduling • Torque and Moab work together to control access to the compute nodes • PBS puts jobs into queues • Flux has a single queue, named flux
Cluster workflow • You create a batch script and submit it to PBS • PBS schedules your job, and it enters the flux queue • When its turn arrives, your job executes the batch script • Your script has access to any applications or data stored on the Flux cluster • When your job completes, anything it sent to standard output and error is saved and returned to you • You can check on the status of your job at any time, or delete it if it’s not doing what you want • A short time after your job completes, its record disappears from the queue
Basic batch commands • Once you have a script, submit it: qsub scriptfile
$ qsub singlenode.pbs
6023521.nyx.engin.umich.edu
• You can check on the job status: qstat jobid or qstat -u user
$ qstat -u cja
nyx.engin.umich.edu:
                                                                 Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID  NDS TSK Memory  Time  S Time
-------------------- -------- -------- ---------------- ------ ---- --- ------ ----- - -----
6023521.nyx.engi     cja      flux     hpc101i              --    1   1     -- 00:05 Q    --
• To delete your job: qdel jobid
$ qdel 6023521
$
Loosely-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=12,pmem=1gb,walltime=01:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
# Your code goes below:
cd $PBS_O_WORKDIR
mpirun ./c_ex01
Tightly-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:ppn=12,mem=47gb,walltime=02:00:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
# Your code goes below:
cd $PBS_O_WORKDIR
matlab -nodisplay -r script
Lab 5 • Task: Run an MPI job on 8 cores
• Compile c_ex05:
cd ~/cac-intro-code
make c_ex05
• Edit the file run with your favorite Linux editor:
• Change the #PBS -M address to your own • I don’t want Brock to get your email!
• Change the #PBS -A allocation to FluxTraining_flux, or to your own allocation, if desired
• Change the #PBS -l qos to flux
• Submit your job: qsub run
PBS attributes • As always, man qsub is your friend
-N : sets the job name; can’t start with a number
-V : copy shell environment to the compute node
-A youralloc_flux : sets the allocation you are using
-l qos=flux : sets the quality-of-service parameter
-q flux : sets the queue you are submitting to
-l : requests resources, like number of cores or nodes
-M : whom to email; can be multiple addresses
-m : when to email: a = job abort, b = job begin, e = job end
-j oe : join STDOUT and STDERR into a common file
-I : allow interactive use
-X : allow X GUI use
PBS resources (1) • A resource (-l) can specify:
• Request wallclock (that is, running) time: -l walltime=HH:MM:SS
• Request C MB of memory per core: -l pmem=Cmb
• Request T MB of memory for the entire job: -l mem=Tmb
• Request M cores on arbitrary node(s): -l procs=M
• Request a token to use licensed software:
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
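Several resources can be combined in a single -l option, as the batch scripts above do; for example (the values and script name are illustrative):

  qsub -l procs=4,pmem=2gb,walltime=02:00:00 -l gres=matlab myscript.pbs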
PBS resources (2) • A resource (-l) can specify, for multithreaded code:
• Request M nodes with at least N cores per node: -l nodes=M:ppn=N
• Request M cores with exactly N cores per node (note the difference vis-à-vis ppn syntax and semantics!): -l nodes=M,tpn=N (you’ll only use this for specific algorithms)
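To make the distinction concrete (numbers illustrative), following the semantics stated above:

  # ppn: 2 nodes, each with at least 8 cores reserved for this job
  #PBS -l nodes=2:ppn=8
  # tpn: 16 cores total, exactly 4 per node, i.e., spread across 4 nodes
  #PBS -l nodes=16,tpn=4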
Interactive jobs • You can submit jobs interactively:
qsub -I -X -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
• This queues a job as usual • Your terminal session will be blocked until the job runs • When your job runs, you'll get an interactive shell on one of your nodes • Invoked commands will have access to all of your nodes • When you exit the shell, your job is deleted
• Interactive jobs allow you to: • Develop and test on cluster node(s) • Execute GUI tools on a cluster node • Use a parallel debugger interactively
Lab 6 • Task: Run an interactive job
• Enter this command (all on one line):
qsub -I -V -l procs=1 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
• When your job starts, you’ll get an interactive shell • Copy and paste the batch commands from the “run” file, one at a time, into this shell • Experiment with other commands • After thirty minutes, your interactive shell will be killed
Lab 7 • Task: Run Matlab interactively
• module load matlab
• Start an interactive PBS session:
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
• Run Matlab in the interactive PBS session:
matlab -nodisplay
Introduction to Scheduling
The Scheduler (1/3) • Flux scheduling policies: • The job’s queue determines the set of nodes you run on • The job’s account and qos determine the allocation to be charged • If you specify an inactive allocation, your job will never run • The job’s resource requirements help determine when the job becomes eligible to run • If you ask for unavailable resources, your job will wait until they become free • There is no pre-emption