Flux for PBS Users
HPC 105
Dr. Charles J Antonelli
LSAIT ARS
August, 2013

Flux
• Flux is a university-wide shared computational discovery / high-performance computing service.
• Interdisciplinary
  • Provided by Advanced Research Computing at U-M (ARC)
  • Operated by CAEN HPC
  • Hardware procurement, software licensing, billing support by U-M ITS
  • Used across campus
• Collaborative since 2010
  • Advanced Research Computing at U-M (ARC)
  • College of Engineering’s IT Group (CAEN)
  • Information and Technology Services
  • Medical School
  • College of Literature, Science, and the Arts
  • School of Information
http://arc.research.umich.edu/resources-services/flux/
cja 2013

The Flux cluster
[Figure: cluster diagram showing login nodes, compute nodes, a data transfer node, and storage]

Flux node
• 48 GB RAM
• 12 Intel cores
• Local disk
• Ethernet
• InfiniBand

Flux Large Memory node
• 1 TB RAM
• 40 Intel cores
• Local disk
• Ethernet
• InfiniBand

Flux hardware
• Standard Flux
  • 8,016 Intel cores
  • 632 Flux nodes
  • 48/64 GB RAM/node
  • 4 GB RAM/core (allocated)
• Large Memory Flux
  • 200 Intel Large Memory cores
  • 5 Flux Large Memory nodes
  • 1 TB RAM/Large Memory node
  • 25 GB RAM/Large Memory core
• 4X InfiniBand network (interconnects all nodes)
  • 40 Gbps, <2 µs latency
  • Latency an order of magnitude less than Ethernet
• Lustre filesystem
  • Scalable, high-performance, open
  • Supports MPI-IO for MPI jobs
  • Mounted on all login and compute nodes

Flux software
• Licensed software
  • http://cac.engin.umich.edu/resources/software/flux-software et al.
• Compilers & Libraries:
  • Intel, PGI, GNU
  • OpenMP
  • OpenMPI

Using Flux
• Three basic requirements to use Flux:
  • A Flux account
  • An MToken (or a Software Token)
  • A Flux allocation

Using Flux
• A Flux account
  • Allows login to the Flux login nodes
    • Develop, compile, and test code
  • Available to members of U-M community, free
  • Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication

Flux Account Policies
To qualify for a Flux account:
• You must have an active institutional role
  • On the Ann Arbor campus
  • Not a Retiree or Alumni role
• Your uniqname must have a strong identity type
  • Not a friend account
• You must be able to receive email sent to uniqname@umich.edu
• You must have run a job in the last 13 months
http://cac.engin.umich.edu/resources/systems/user-accounts

Using Flux
• An MToken (or a Software Token)
  • Required for access to the login nodes
  • Improves cluster security by requiring a second means of proving your identity
  • You can use either an MToken or an application for your mobile device (called a Software Token) for this
  • Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/login-nodes/tfa

Using Flux
• A Flux allocation
  • Allows you to run jobs on the compute nodes
• Current rates (through June 30, 2016):
  • $18 per core-month for Standard Flux
  • $24.35 per core-month for Large Memory Flux
  • $8 cost-share per core-month for LSA, Engineering, and Medical School
• Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/
• To inquire about Flux allocations please email flux-support@umich.edu

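As a rough illustration of the rates above, the cost of an allocation is cores × months × rate. The sketch below assumes simple multiplication with no proration or other billing rules; the helper name flux_cost is hypothetical, and rates are passed in cents to stay in integer shell arithmetic.

```shell
# Hypothetical helper (not a Flux tool): estimate an allocation's cost.
# Arguments: cores, months, rate in cents per core-month.
flux_cost() {
    cores=$1
    months=$2
    rate_cents=$3
    total_cents=$((cores * months * rate_cents))
    # Print dollars and cents
    printf '%d.%02d\n' $((total_cents / 100)) $((total_cents % 100))
}

flux_cost 16 3 1800   # 16 Standard Flux cores for 3 months: prints 864.00
flux_cost 40 1 2435   # 40 Large Memory cores for 1 month: prints 974.00
```

The actual charge is determined by the pricing page linked above, not by this estimate.
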
Flux Allocations
• To request an allocation send email to flux-support@umich.edu with
  • the type of allocation desired
    • Regular or Large-Memory
  • the number of cores needed
  • the start date and number of months for the allocation
  • the shortcode for the funding source
  • the list of people who should have access to the allocation
  • the list of people who can change the user list and augment or end the allocations
http://arc.research.umich.edu/resources-services/flux/managing-a-flux-project/

Flux Allocations
• An allocation specifies resources that are consumed by running jobs
  • Explicit core count
  • Implicit memory usage (4 or 25 GB per core)
  • When any resource is fully in use, new jobs are blocked
• An allocation may be ended early
  • On the monthly anniversary
• You may have multiple active allocations
  • Jobs draw resources from all active allocations

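To make the implicit memory limit concrete: a 10-core standard allocation carries 10 × 4 GB of memory, so a job can be blocked on memory even when it asks for fewer cores than the allocation holds. The sketch below is a simplified check under stated assumptions: it ignores other running jobs, treats 1 GB as 1024 MB, and the helper name fits_allocation is hypothetical.

```shell
# Hypothetical helper (not a Flux tool): would a job fit an idle allocation?
# Arguments: allocation cores, GB per core (4 standard, 25 large memory),
#            job cores (procs), job memory per core in MB (pmem).
fits_allocation() {
    alloc_cores=$1
    gb_per_core=$2
    job_procs=$3
    job_pmem_mb=$4
    alloc_mem_mb=$((alloc_cores * gb_per_core * 1024))
    job_mem_mb=$((job_procs * job_pmem_mb))
    # Both the core count and the implied memory must fit
    [ "$job_procs" -le "$alloc_cores" ] && [ "$job_mem_mb" -le "$alloc_mem_mb" ]
}

# 8 cores at 6000 MB each need 48000 MB, more than the 40960 MB implied by 10 cores
if fits_allocation 10 4 8 6000; then
    echo "fits"
else
    echo "would be blocked"
fi
```
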
lsa_flux Allocation
• LSA funds a shared allocation named lsa_flux
  • Usable by anyone in the College
  • 60 cores
  • For testing, experimentation, exploration
    • Not for production runs
  • Each user limited to 30 concurrent jobs
https://sites.google.com/a/umich.edu/flux-support/support-for-users/lsa_flux

Monitoring Allocations
• Visit https://mreports.umich.edu/mreports/pages/Flux.aspx
• Select your allocation from the list at upper left
  • You’ll see all allocations you can submit jobs against
• Four sets of outputs
  • Allocation details (start & end date, cores, shortcode)
  • Financial overview (cores allocated vs. used, by month)
  • Usage summary table (core-months by user and month)
    • Drill down for individual job run data
  • Usage charts (by user)
• Details & screenshots: http://arc.research.umich.edu/resources-services/flux/check-my-flux-allocation/

Storing data on Flux
• Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes
  • 640 TB of short-term storage for batch jobs
  • Pathname depends on your allocation and uniqname
    • e.g., /scratch/lsa_flux/cja
  • Can share through UNIX groups
  • Large, fast, short-term
  • Data deleted 60 days after allocation expires
  • http://cac.engin.umich.edu/resources/storage/flux-high-performance-storage-scratch
• NFS filesystems mounted on /home and /home2 on all nodes
  • 80 GB of storage per user for development & testing
  • Small, slow, long-term

Storing data on Flux
• Flux does not provide large, long-term storage
• Alternatives:
  • LSA Research Storage
  • ITS Value Storage
  • Departmental server
• CAEN HPC can mount your storage on the login nodes
  • Issue the df -kh command on a login node to see what other groups have mounted

Storing data on Flux
LSA Research Storage
• 2 TB of secure, replicated data storage
  • Available to each LSA faculty member at no cost
  • Additional storage available at $30/TB/yr
• Turn in existing storage hardware for additional storage
• Request by visiting https://sharepoint.lsait.lsa.umich.edu/Lists/Research%20Storage%20Space/NewForm.aspx?RootFolder=
  • Authenticate with Kerberos login and password
  • Select NFS as the method for connecting to your storage

Copying data to Flux
• Using the transfer host:
    rsync -avz /your/cluster1/directory flux-xfer.engin.umich.edu:newdirname
    rsync -avz /your/cluster1/directory flux-xfer.engin.umich.edu:/scratch/youralloc/youruniqname
• Or use scp, sftp, WinSCP, Cyberduck, FileZilla
http://cac.engin.umich.edu/resources/login-nodes/transfer-hosts

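Since the scratch pathname depends on the allocation and uniqname, the rsync destination can be assembled from those two values. This is a minimal sketch: flux_scratch_dest is a hypothetical helper, and lsa_flux and cja are the example names used earlier in these slides. It only prints the command and transfers nothing.

```shell
# Hypothetical helper: build the rsync destination for a scratch directory.
# Arguments: allocation name, uniqname.
flux_scratch_dest() {
    alloc=$1
    uniqname=$2
    printf '%s:/scratch/%s/%s\n' flux-xfer.engin.umich.edu "$alloc" "$uniqname"
}

dest=$(flux_scratch_dest lsa_flux cja)
# Print the command for review rather than running it
echo "rsync -avz /your/cluster1/directory $dest"
```

Adding rsync's -n (--dry-run) option to the printed command lists what would be copied before any data moves.
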
Globus Online
• Features
  • High-speed data transfer, much faster than SCP or SFTP
  • Reliable & persistent
  • Minimal client software: Mac OS X, Linux, Windows
• GridFTP Endpoints
  • Gateways through which data flow
  • Exist for XSEDE, OSG, …
  • UMich: umich#flux, umich#nyx
  • Add your own server endpoint: contact flux-support@umich.edu
  • Add your own client endpoint!
• More information
  • http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp

Connecting to Flux
• ssh flux-login.engin.umich.edu
• Log in with token code, uniqname, and Kerberos password
• You will be randomly connected to a Flux login node
  • Currently flux-login1 or flux-login2
• Do not run compute- or I/O-intensive jobs here
  • Processes killed automatically after 30 minutes
• Firewalls restrict access to flux-login. To connect successfully, either
  • Physically connect your ssh client platform to the U-M campus wired or MWireless network, or
  • Use VPN software on your client platform, or
  • Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh to flux-login from there

Lab 1
Task: Use the multicore package
The multicore package allows you to use multiple cores on the same node
• module load R
• Copy sample code to your login directory
    cd
    cp ~cja/hpc-sample-code.tar.gz .
    tar -zxvf hpc-sample-code.tar.gz
    cd ./hpc-sample-code
• Examine Rmulti.pbs and Rmulti.R
• Edit Rmulti.pbs with your favorite Linux editor
  • Change #PBS -M email address to your own

Lab 1
Task: Use the multicore package
• Submit your job to Flux
    qsub Rmulti.pbs
• Watch the progress of your job
    qstat -u uniqname
  where uniqname is your own uniqname
• When complete, look at the job’s output
    less Rmulti.out

Lab 2
• Task: Run an MPI job on 8 cores
• Compile c_ex05
    cd ~/cac-intro-code
    make c_ex05
• Edit file run with your favorite Linux editor
  • Change #PBS -M address to your own
    • I don’t want Brock to get your email!
  • Change #PBS -A allocation to FluxTraining_flux, or to your own allocation, if desired
  • Change #PBS -l qos to flux
• Submit your job
    qsub run

PBS resources (1)
• A resource (-l) can specify:
  • Request wallclock (that is, running) time
      -l walltime=HH:MM:SS
  • Request C MB of memory per core
      -l pmem=Cmb
  • Request T MB of memory for entire job
      -l mem=Tmb
  • Request M cores on arbitrary node(s)
      -l procs=M
  • Request a token to use licensed software
      -l gres=stata:1
      -l gres=matlab
      -l gres=matlab%Communication_toolbox

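Two small calculations help when filling in these requests: converting a run time to the HH:MM:SS form, and working out the total memory implied by a per-core pmem request. The helpers below are hypothetical illustrations, not part of PBS.

```shell
# Hypothetical helper: format a duration in minutes as HH:MM:SS for -l walltime.
walltime_from_minutes() {
    minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

# Hypothetical helper: total MB implied by procs=M with pmem=Cmb.
total_mem_mb() {
    procs=$1
    pmem_mb=$2
    echo $((procs * pmem_mb))
}

walltime_from_minutes 90   # prints 01:30:00
total_mem_mb 8 2000        # prints 16000
echo "-l procs=8,pmem=2000mb,walltime=$(walltime_from_minutes 90)"
```
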
PBS resources (2)
• A resource (-l) can specify:
  For multithreaded code:
  • Request M nodes with at least N cores per node
      -l nodes=M:ppn=N
  • Request M cores with exactly N cores per node (note the difference vis-à-vis ppn syntax and semantics!)
      -l nodes=M,tpn=N
    (you’ll only use this for specific algorithms)

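Putting the resource requests together, a batch script for an 8-core MPI job might look like the sketch below. It is an illustration, not the course's run file: the job name, email address, allocation name, memory and walltime values are placeholders. The snippet writes the script to sample_job.pbs so it can be inspected before submission.

```shell
# Write a sample PBS batch script; every value below is a placeholder.
cat > sample_job.pbs <<'EOF'
#!/bin/sh
#PBS -N sample_job
#PBS -M uniqname@umich.edu
#PBS -m abe
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=8,pmem=1000mb,walltime=00:15:00
#PBS -j oe
#PBS -V

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
mpirun -np 8 ./c_ex05
EOF

# On a Flux login node, submit with: qsub sample_job.pbs
```

Here -m abe asks for mail when the job aborts, begins, and ends, and -j oe merges standard error into standard output.
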
Interactive jobs
• You can submit jobs interactively:
    qsub -I -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
• This queues a job as usual
  • Your terminal session will be blocked until the job runs
  • When it runs, you will be connected to one of your nodes
  • Invoked serial commands will run on that node
  • Invoked parallel commands (e.g., via mpirun) will run on all of your nodes
  • When you exit the terminal session your job is deleted
• Interactive jobs allow you to
  • Test your code on cluster node(s)
  • Execute GUI tools on a cluster node with output on your local platform’s X server
  • Utilize a parallel debugger interactively

Lab 3
Task: compile and execute an MPI program on a compute node
• Copy sample code to your login directory:
    cd
    cp ~brockp/cac-intro-code.tar.gz .
    tar -xvzf cac-intro-code.tar.gz
    cd ./cac-intro-code
• Start an interactive PBS session
    qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
• On the compute node, compile & execute MPI parallel code:
    cd $PBS_O_WORKDIR
    mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
    mpirun -np 2 ./c_ex01

Lab 4
Task: Run Matlab interactively
• module load matlab
• Start an interactive PBS session
    qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux
• Run Matlab in the interactive PBS session
    matlab -nodisplay

The Scheduler (1/3)
• Flux scheduling policies:
  • The job’s queue determines the set of nodes you run on
    • flux, fluxm
  • The job’s account determines the allocation to be charged
    • If you specify an inactive allocation, your job will never run
  • The job’s resource requirements help determine when the job becomes eligible to run
    • If you ask for unavailable resources, your job will wait until they become free
  • There is no pre-emption

The Scheduler (2/3)
• Flux scheduling policies:
  • If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run:
    • How long you have waited for the resource
    • How much of the resource you have used so far
  • This is called “fairshare”
  • The scheduler will reserve nodes for a job with sufficient priority
    • This is intended to prevent starving jobs with large resource requirements

The Scheduler (3/3)
• Flux scheduling policies:
  • If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs in those gaps
  • This is called “backfill”
[Figure: schedule diagram, cores vs. time]

Job monitoring
• There are several commands you can run to get some insight into your jobs’ execution:
  • freenodes: shows the number of free nodes and cores currently available
  • mdiag -a youralloc_name: shows resources defined for your allocation and who can run against it
  • showq -w acct=yourallocname: shows jobs using your allocation (running/idle/blocked)
  • checkjob jobid: can show why your job might not be starting
  • showstart -e all jobid: gives you a coarse estimate of job start time; use the smallest value returned

Job Arrays
• Submit copies of identical jobs
• Invoked via qsub -t:
    qsub -t array-spec pbsbatch.txt
  Where array-spec can be
    m-n
    a,b,c
    m-n%slotlimit
  e.g.
    qsub -t 1-50%10
  Fifty jobs, numbered 1 through 50, only ten can run simultaneously
• $PBS_ARRAYID records array identifier

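Because every array element runs the same script, $PBS_ARRAYID is typically what distinguishes their work. The sketch below shows one possible pattern: mapping the identifier to an input file. The input_N.dat naming scheme and the array_input_file helper are hypothetical, and the default of 1 is only there so the snippet also runs outside PBS.

```shell
# Hypothetical helper: choose an input file from the array identifier.
array_input_file() {
    # PBS sets PBS_ARRAYID for each array element; fall back to 1 elsewhere
    printf 'input_%s.dat\n' "${PBS_ARRAYID:-1}"
}

infile=$(array_input_file)
echo "Array element ${PBS_ARRAYID:-1} would process $infile"
```

Submitted with qsub -t 1-50%10, each of the fifty elements would select a different file under this scheme.
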
Dependent scheduling
• Submit jobs whose execution scheduling depends on other jobs
• Invoked via qsub -W:
    qsub -W depend=type:jobid[:jobid]…
  Where type can be
    after        Schedule after jobids have started
    afterok      Schedule after jobids have finished, only if no errors
    afternotok   Schedule after jobids have finished, only if errors
    afterany     Schedule after jobids have finished, regardless of status
  Inverted semantics for before, beforeok, beforenotok, beforeany

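The depend=type:jobid[:jobid]… argument can be built from the job identifiers that qsub prints at submission time. This is a minimal sketch: depend_arg is a hypothetical helper, the job numbers and script names are made up, and the qsub lines are left as comments because they only work on a login node.

```shell
# Hypothetical helper: build the argument for qsub -W.
# Arguments: dependency type, then one or more job ids.
depend_arg() {
    dep_type=$1
    shift
    arg="depend=$dep_type"
    for jobid in "$@"; do
        arg="$arg:$jobid"
    done
    printf '%s\n' "$arg"
}

depend_arg afterok 12345 12346   # prints depend=afterok:12345:12346

# On a login node, a two-stage chain might be submitted as:
#   first=$(qsub stage1.pbs)
#   qsub -W "$(depend_arg afterok "$first")" stage2.pbs
```

With afterok, the second stage is scheduled only if the first finishes without errors.
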
Some Flux Resources
• http://arc.research.umich.edu/resources-services/flux/
  • U-M Advanced Research Computing Flux pages
• http://cac.engin.umich.edu/
  • CAEN HPC Flux pages
• http://www.youtube.com/user/UMCoECAC
  • CAEN HPC YouTube channel
• For assistance: flux-support@umich.edu
  • Read by a team of people including unit support staff
  • Cannot help with programming questions, but can help with operational Flux and basic usage questions

Any Questions?
• Charles J. Antonelli
  LSAIT Advocacy and Research Support
  cja@umich.edu
  http://www.umich.edu/~cja
  734 763 0607