Introduction to Flux: Hands-on Session Dr. Charles J Antonelli, LSAIT Research Systems Group, The University of Michigan, November 30, 2011
Roadmap • High Performance Computing • HPC Resources • U-M Flux Architecture • Using Flux • Gaining Insight
Advantages of HPC • Cheaper than the mainframe • More scalable than your laptop • Buy or rent only what you need • COTS hardware • COTS software • COTS expertise
Disadvantages of HPC • Serial applications • Tightly-coupled applications • Truly massive I/O or memory requirements • Difficulty/impossibility of porting software • No COTS expertise
Programming Models • Two basic parallel programming models • Message-passing: The application consists of several processes running on different nodes and communicating with each other over the network • Used when the data are too large to fit on a single node, and simple synchronization is adequate • “Coarse parallelism” • Implemented using MPI (Message Passing Interface) libraries • Multi-threaded: The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives • Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable • “Fine-grained parallelism” or “shared-memory parallelism” • Implemented using OpenMP (Open Multi-Processing) compilers and libraries • Both models can be combined in a hybrid application, with MPI between nodes and OpenMP threads within each node (see the sketch below)
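As a concrete illustration of the multi-threaded model, here is a minimal OpenMP sketch in C; the file name omp_hello.c is hypothetical and is not one of the cac-intro-code samples used later in the labs:

/* omp_hello.c - minimal OpenMP example (illustrative only) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread in the team executes this block once */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* this thread's id        */
        int nthreads = omp_get_num_threads();  /* size of the thread team */
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}

Compiled with OpenMP support (see the compiler flags later in this deck), the number of threads is controlled at run time with the OMP_NUM_THREADS environment variable.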
Good parallel • Embarrassingly parallel • Folding@home, RSA Challenges, password cracking, … • Regular structures • Divide & conquer, e.g. Quicksort • Pipelined: N-body problems, matrix multiplication • O(n²) -> O(n)
Less good parallel • Serial algorithms • Those that don’t parallelize easily • Irregular data & communications structures • E.g., surface/subsurface water hydrology modeling • Tightly-coupled algorithms • Unbalanced algorithms • Master/worker algorithms, where the worker load is unbalanced
Amdahl’s Law • If you enhance a fraction f of a computation by a speedup S, the overall speedup is: 1 / ((1 - f) + f/S)
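A worked example with illustrative numbers (not taken from the slides): if f = 0.95 of a program is sped up by S = 12 (say, perfect scaling on a 12-core Flux node), the overall speedup is 1 / ((1 - 0.95) + 0.95/12) = 1 / (0.05 + 0.079) ≈ 7.7, not 12; and since the serial 5% never shrinks, the speedup can never exceed 1 / 0.05 = 20 no matter how many cores are used.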
Some Resources • U-M • Cirrus Project http://orci.research.umich.edu/about-orci/cirrus-project/ • Flux: shared leased computing • ORCI • Meta • XSEDE (was TeraGrid) • Amazon EC2 • Google App Engine • Microsoft Azure • …
The Flux cluster (diagram): login nodes, compute nodes, a data transfer node, and storage.
The truth (diagram): the flux compute nodes sit alongside the nyx compute nodes, with the login nodes, data transfer node, and storage shared between the two clusters.
The Flux node (diagram): 12 Intel cores, 48 GB RAM, local disk, and InfiniBand and Ethernet interfaces.
Flux hardware • 2,000 Intel cores (4,000 in January 2012) • 172 Flux nodes (340 in January 2012) • 48 GB RAM/node, 4 GB RAM/core (average) • 4X InfiniBand network (interconnects all nodes) • 40 Gbps, <2 us latency • Latency an order of magnitude less than Ethernet • Lustre filesystem • Scalable, high-performance, open • Supports MPI-IO for MPI jobs • Mounted on all login and compute nodes
Flux software • Default software: • Intel Compilers with OpenMPI for Fortran and C • Optional software: • PGI Compilers • Unix/GNU tools • gcc/g++/gfortran • Licensed software: • Abaqus, ANSYS, Mathematica, Matlab, STATA SE, … • See http://cac.engin.umich.edu/resources/software/index.html • You can choose software using the module command
Flux network • All Flux nodes are interconnected via Infiniband and a campus-wide private Ethernet network • The Flux login nodes are also connected to the campus backbone network • The Flux data transfer node will soon be connected over a 10 Gbps connection to the campus backbone network • This means • The Flux login nodes can access the Internet • The Flux compute nodes cannot • If Infiniband is not available for a compute node, code on that node will fall back to Ethernet communications
Flux data • Lustre filesystem mounted on /nobackup on all login, compute, and transfer nodes • 143 TB of short-term storage for batch jobs (375 TB in January 2012) • Large, fast, short-term • NFS filesystems mounted on /home and /home2 on all nodes • 40 GB of storage per user for development & testing • Small, slow, long-term
Flux data • Flux does not provide large, long-term storage • Alternatives: • ITS Value Storage • Departmental server • CAEN can mount your storage on the login nodes • Issue the df -kh command on a login node to see what other groups have mounted
Globus Online GridFTP • Features • High-speed data transfer, much faster than SCP or SFTP • Reliable & persistent • Minimal client software: Mac OS X, Linux, Windows • GridFTP Endpoints • Gateways through which data flow • Exist for XSEDE, OSG, … • UMich: umich#flux, umich#nyx • Add your own server endpoint: contact flux-support • Add your own client endpoint! • More information • http://cac.engin.umich.edu/resources/loginnodes/globus.html
Using Flux • Two requirements: • A Flux account • Allows login to the Flux login nodes • Develop, compile, and test code • Available to members of the U-M community, free • Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication • A Flux allocation • Allows you to run jobs on the compute nodes • Current rate is $18 per core-month • Discounted to $11.20 per core-month until July 1, 2012 • To inquire about Flux allocations, please email flux-support@umich.edu
Flux On-Demand • Alternative to a static allocation • Pay only for the core time you use • Pros • Accommodates “bursty” usage patterns • Cons • Limit of 50 cores total • Limit of 25 cores for any user
Logging in to Flux • ssh flux-login.engin.umich.edu • You will be randomly connected to a Flux login node • Currently flux-login1 or flux-login2 • Firewalls restrict access to flux-login. To connect successfully, either • Physically connect your ssh client platform to the U-M campus wired network, or • Use VPN software on your client platform, or • Use ssh to login to an ITS login node, and ssh to flux-login from there (see the example below)
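A minimal example of the direct route, assuming you are on the campus network or VPN; uniqname is a placeholder for your own U-M login name, and -X enables X11 forwarding for the GUI and interactive examples later in this session:

ssh -X uniqname@flux-login.engin.umich.edu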
Modules • The module command allows you to specify what versions of software you want to use:
module list -- Show loaded modules
module load name -- Load module name for use
module avail -- Show all available modules
module avail name -- Show versions of module name*
module unload name -- Unload module name
module -- List all options
• Enter these commands at any time during your session • A configuration file allows default module commands to be executed at login • Put module commands in file ~/privatemodules/default (see the example below)
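A short sketch of a typical session; the module name matlab is illustrative, so run module avail on Flux to see what is actually installed:

module avail
module load matlab
module list

Per the slide above, the same module load line can also be placed in ~/privatemodules/default so it takes effect at every login.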
Flux environment • The Flux login nodes have the standard GNU/Linux toolkit: • make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, … • Source code written on non-Linux systems • On Flux, use these tools to convert source files to Linux format • dos2unix • mac2unix
Compiling Code • Assuming default module settings • Use mpicc/mpiCC/mpif90 for MPI code • Use icc/icpc/ifort with -openmp for OpenMP code • Serial code, Fortran 90:
ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90
• Serial code, C:
icc -O3 -ipo -no-prec-div -xHost -o prog prog.c
• MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
mpirun -np 2 ./prog
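For OpenMP code the pattern is similar; this sketch compiles the hypothetical omp_hello.c shown in the programming-models section and sets the thread count at run time (the file name and thread count are illustrative):

icc -O3 -openmp -o omp_hello omp_hello.c
OMP_NUM_THREADS=4 ./omp_hello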
Lab 1 • Task: compile and execute simple programs on the Flux login node • Copy sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code
• Examine, compile & execute helloworld.f90:
ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
./f90hello
• Examine, compile & execute helloworld.c:
icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
./chello
• Examine, compile & execute MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
… ignore the “feupdateenv is not implemented and will always fail” warning
mpirun -np 2 ./c_ex01
… ignore runtime complaints about missing NICs
Makefiles • The make command automates your code compilation process • Uses a makefile to specify dependencies between source and object files • The sample directory contains a sample makefile • To compile c_ex01: make c_ex01 • To compile all programs in the directory: make • To remove all compiled programs: make clean • To make all the programs using 8 compiles in parallel: make -j8
Portable Batch System • All production runs are run on the compute nodes using the Portable Batch System (PBS) • PBS manages all aspects of cluster job execution • Flux uses the Torque implementation of PBS • Flux uses the Moab scheduler for job scheduling • Torque and Moab work together to control access to the compute nodes • PBS puts jobs into queues • Flux has a single queue, named flux
Using the cluster • You create a batch script and submit it to PBS • PBS schedules your job, and it enters the flux queue • When its turn arrives, your job will execute the batch script • Your script has access to any applications or data stored on the Flux cluster • When your job completes, its standard output and error are saved and returned to you • You can check on the status of your job at any time, or delete it if it’s not doing what you want • A short time after your job completes, it disappears
Sample serial script
#PBS -N yourjobname
#PBS -V
#PBS -q flux
#PBS -A youralloc_flux
#PBS -l qos=youralloc_flux
#PBS -l procs=1,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
#Your Code Goes Below:
cd $PBS_O_WORKDIR
./f90hello
Sample batch script
#PBS -N yourjobname
#PBS -V
#PBS -q flux
#PBS -A youralloc_flux
#PBS -l qos=youralloc_flux
#PBS -l procs=16,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
#Your Code Goes Below:
cat $PBS_NODEFILE   # lists the node(s) your job ran on
cd $PBS_O_WORKDIR   # change to submission directory
mpirun ./c_ex01     # no need to specify -np
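The deck shows serial and MPI scripts; as a sketch only, an OpenMP (single-node, multi-threaded) job could follow the same pattern, requesting all 12 cores of one node and running the hypothetical omp_hello program from the earlier examples:

#PBS -N yourjobname
#PBS -V
#PBS -q flux
#PBS -A youralloc_flux
#PBS -l qos=youralloc_flux
#PBS -l nodes=1:ppn=12,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe
#Your Code Goes Below:
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=12   # match the ppn request above
./omp_hello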
Batch job mechanics • Once you have a script, submit it:
qsub scriptfile
$ qsub singlenode.pbs
6023521.nyx.engin.umich.edu
• You can check on the job status:
qstat -u username
$ qstat -u cja
nyx.engin.umich.edu:
                                                                Req'd  Req'd   Elap
Job ID               Username Queue    Jobname   SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- --------- ------ --- --- ------ ----- - -----
6023521.nyx.engi     cja      flux     hpc101i       --   1   1     -- 00:05 Q    --
• To delete your job:
qdel jobid
$ qdel 6023521
$
Lab 2 • Task: Run an MPI job on 8 cores • Compile c_ex05:
cd ~/cac-intro-code
make c_ex05
• Edit the file run with your favorite Linux editor • Change the #PBS -M address to your own • I don’t want Brock to get your email! • Change both the #PBS -A and #PBS -l qos= allocation names to FluxTraining_flux, or to your own allocation name, if desired • Submit your job:
qsub run
PBS attributes • As always, man qsub is your friend
-N : sets the job name, can’t start with a number
-V : copy shell environment to compute node
-q flux : sets the queue you are submitting to
-A youralloc_flux : sets the allocation you are using
-l qos=youralloc_flux : must match the allocation
-l : requests resources, like number of cores or nodes
-M : whom to email, can be multiple addresses
-m : when to email: a=job abort, b=job begin, e=job end
-j oe : join STDOUT and STDERR to a common file
-I : allow interactive use
-X : allow X GUI use
PBS resources • A resource (-l) can specify: • Request wallclock (that is, running) time:
-l walltime=HH:MM:SS
• Request C MB of memory per core:
-l pmem=Cmb
• Request T MB of memory for the entire job:
-l mem=Tmb
• Request M cores on arbitrary node(s):
-l procs=M
• Request M nodes with N cores per node (only if necessary):
-l nodes=M:ppn=N
• Request a token to use licensed software:
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
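Putting a few of these together: a hedged, illustrative resource request for a 24-core job with 2 GB of memory per core and a one-hour wallclock limit (the numbers are examples, not recommendations) could be written as:

#PBS -l procs=24,pmem=2000mb,walltime=01:00:00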
Interactive jobs • You can submit jobs interactively:
qsub -I -X -V -q flux -l procs=2 -l walltime=15:00 -l qos=youralloc_flux -A youralloc_flux
• This queues a job as usual • Your terminal session will be blocked until the job runs • When it runs, you will be connected to one of your nodes • Invoked serial commands will run on that node • Invoked parallel commands (e.g., via mpirun) will run on all of your nodes • When you exit the terminal session your job is deleted • Interactive jobs allow you to • Test your code on cluster node(s) • Execute GUI tools on a cluster node with output on your local platform’s X server • Utilize a parallel debugger interactively
Lab 3 • Task: Run an interactive job • Enter this command (all on one line):
qsub -I -X -V -q flux -l procs=2 -l walltime=15:00 -l qos=FluxTraining_flux -A FluxTraining_flux
• When your job starts, you’ll get an interactive shell • Copy and paste the batch commands from the “run” file, one at a time, into this shell • Experiment with other commands • After fifteen minutes, your interactive shell will be killed
The Scheduler (1/3) • Flux scheduling policies: • The job’s queue determines the set of nodes you run on • The job’s account and qos determine the allocation to be charged • If you specify an inactive allocation, your job will never run • The job’s resource requirements help determine when the job becomes eligible to run • If you ask for unavailable resources, your job will wait until they become free • There is no pre-emption
The Scheduler (2/3) • Flux scheduling policies: • If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run: • How long you have waited for the resource • How much of the resource you have used so far • This is called “fairshare” • The scheduler will reserve nodes for a job with sufficient priority • This is intended to prevent starving jobs with large resource requirements
The Scheduler (3/3) • Flux scheduling policies: • If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs into those gaps • This is called “backfill” (diagram: jobs packed onto a cores-versus-time chart)
Gaining insight • There are several commands you can run to get some insight into the scheduler’s actions:
freenodes : shows the number of free nodes and cores currently available
showq : shows the state of the queue (like qstat -a), except it shows running jobs in order of finishing
diagnose -p : shows the factors used in computing job priority
checkjob jobid : can show why your job might not be starting
showstart -e all : gives a coarse estimate of job start time; use the smallest value returned
Flux Resources • http://cac.engin.umich.edu/started • Cluster news, RSS feed and outages listed here • http://cac.engin.umich.edu/ • Getting an account, training, basic tutorials • http://www.engin.umich.edu/caen/hpc • Getting an allocation, Flux On-Demand, Flux info • For assistance: flux-support@umich.edu • Read by a team of people • Cannot help with programming questions, but can help with operational Flux and basic usage questions
Summary • The Flux cluster is just a collection of similar Linux machines connected together to run your code, much faster than your laptop can • Unlike laptops, there is limited GUI access; the command line is encouraged • Some important commands are:
qsub
qstat -u username
qdel jobid
• Develop and test, then submit your jobs in bulk and let the scheduler do the dirty work
Any Questions? • Charles J. Antonelli, LSAIT Research Systems Group • cja@umich.edu • http://www.umich.edu/~cja • 734 926 8421