Using the BYU Supercomputers
Basic Usage
• After your account is activated:
  • ssh marylou5.byu.edu
• You will be logged in to an interactive node
• Jobs that run on the supercomputer are submitted to the batch queuing system
• You can develop code on the interactive nodes
Running Jobs
• The process
  • User creates a shell script that will:
    • tell the scheduler what is needed
    • run the user’s job
  • User submits the shell script to the batch scheduler queue
  • Machines register with the scheduler, offering to run jobs
  • Scheduler allocates jobs to machines and tracks the jobs
  • The shell script is run on the first node of the group of nodes assigned to a job
  • When finished, all stdout and stderr are collected and returned to the user in files
Scheduling Jobs
• Basic commands
  • qsub scheduling_shell_script
  • qsub -q anynode scheduling_shell_script
  • qsub -q test scheduling_shell_script
  • showq [-u username]
  • qdel jobnumber
  • checkjob [-v] jobnumber
Job Submission Scripts

    #!/bin/bash
    #PBS -l nodes=4:ppn=1,walltime=00:05:00
    #PBS -M your_id@byu.edu
    #PBS -m ae
    #PBS -N Hello
    cd hello
    echo "The root node is `hostname`"
    echo "Here are all the nodes being used"
    cat $PBS_NODEFILE
    echo "From here on is the output from the program"
    mpirun hello

• -M email address
• -m send email on (a=abort, b=begin, e=end)
• -l define resources
• -N jobname
• -l procs=4 (any 4 processors)
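Because the #PBS directives are ordinary comment lines, a submission script like the one above can be generated from a few parameters. The following is only a sketch: the helper name make_pbs and its argument order are hypothetical, and it emits a reduced version of the script (no email options, no cd).

```shell
# make_pbs: hypothetical helper that prints a minimal submission script.
# Arguments: $1 job name, $2 node count, $3 walltime (HH:MM:SS), $4 program.
# \$PBS_NODEFILE is escaped so it is expanded when the job runs, not now.
make_pbs() {
    cat <<EOF
#!/bin/bash
#PBS -l nodes=$2:ppn=1,walltime=$3
#PBS -N $1
cat \$PBS_NODEFILE
mpirun $4
EOF
}
```

For example, `make_pbs Hello 4 00:05:00 hello > hello.pbs` would write a script that could then be submitted with `qsub hello.pbs`.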
Viewing Your Jobs

    bash-2.05a$ showq
    ACTIVE JOBS--------------------
    JOBNAME         USERNAME  STATE    PROC   REMAINING  STARTTIME
    m1015i.1581.0   taskman   Running     1    18:39:00  Wed Aug 14 08:06:24
    m1015i.1582.0   taskman   Running     1    18:39:00  Wed Aug 14 08:06:24
    m1015i.1580.0   taskman   Running     1    18:39:00  Wed Aug 14 08:06:24
    …
    m1015i.1615.0   taskman   Running     1    21:33:42  Wed Aug 14 11:01:06
    m1015i.1613.0   taskman   Running     1    23:43:05  Wed Aug 14 13:10:29
    m1015i.1575.0   dvd       Running     4  2:15:10:38  Wed Aug 14 04:38:02
    m1015i.1127.0   mdt36     Running     8  2:23:14:21  Wed Aug 7 12:41:45
    …
    m1015i.1567.0   jar65     Running     4  9:04:07:44  Tue Aug 13 17:35:08
    m1015i.1569.0   jar65     Running     4  9:08:28:16  Tue Aug 13 21:55:40
    m1015i.1547.0   to5       Running     8  9:21:11:49  Wed Aug 14 10:39:13
    m1015i.1546.0   to5       Running     8  9:21:11:49  Wed Aug 14 10:39:13

    35 Active Jobs   150 of 184 Processors Active (81.52%)
                     26 of 34 Nodes Active (76.47%)

    IDLE JOBS----------------------
    JOBNAME         USERNAME  STATE    PROC     WCLIMIT  QUEUETIME
    m1015i.1513.0   jl447     Idle        2  5:00:00:00  Tue Aug 13 07:08:09
    m1015i.1572.0   dvd       Idle        8  3:00:00:00  Tue Aug 13 10:45:18
    …

    23 Idle Jobs

    NON-QUEUED JOBS----------------
    JOBNAME         USERNAME  STATE    PROC     WCLIMIT  QUEUETIME

    Total Jobs: 58   Active Jobs: 35   Idle Jobs: 23   Non-Queued Jobs: 0
The process

    -bash-3.2$ qsub hello.pbs
    1844186.fslsched.fsl.byu.edu
    -bash-3.2$ showq -u qos
    active jobs------------------------
    JOBID     USERNAME  STATE  PROCS  REMAINING  STARTTIME

    0 active jobs    0 of 2968 processors in use by local jobs (0.00%)
                     343 of 371 nodes active (92.45%)

    eligible jobs----------------------
    JOBID     USERNAME  STATE  PROCS   WCLIMIT  QUEUETIME
    1844186   qos       Idle       4  00:05:00  Wed Jan 6 10:27:52

    1 eligible job

    blocked jobs-----------------------
    JOBID     USERNAME  STATE  PROCS   WCLIMIT  QUEUETIME

    0 blocked jobs

    Total job: 1
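In this session qsub prints the full identifier (1844186.fslsched.fsl.byu.edu), while showq, qdel, and checkjob work with the numeric part. A small sketch of extracting that number follows; the helper name job_number is hypothetical, and it assumes the identifier always has the form number.server as shown here.

```shell
# job_number: hypothetical helper that keeps everything before the first
# dot of a qsub identifier, e.g. "1844186.fslsched.fsl.byu.edu".
job_number() {
    echo "${1%%.*}"
}

# Typical use on an interactive node (commented out; needs the scheduler):
#   id=$(qsub hello.pbs)
#   num=$(job_number "$id")
#   showq -u "$USER"
#   checkjob -v "$num"
#   qdel "$num"          # only if the job should be cancelled
```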
The process

    -bash-3.2$ checkjob -v 1844186
    job 1844186 (RM job '1844186.fslsched.fsl.byu.edu')

    AName: Hello
    State: Idle
    Creds: user:qos group:qos account:qos class:batch
    WallTime: 00:00:00 of 00:05:00
    SubmitTime: Wed Jan 6 10:27:52
    (Time Queued Total: 00:03:08 Eligible: 00:02:47)

    NodeMatchPolicy: EXACTNODE
    Total Requested Tasks: 4
    Total Requested Nodes: 4

    Req[0] TaskCount: 4 Partition: ALL
    TasksPerNode: 1 NodeCount: 4

    UMask: 0000
    OutputFile: m5int02:/fslhome/qos/hello/Hello.o1844186
    ErrorFile: m5int02:/fslhome/qos/hello/Hello.e1844186
    Partition List: ALL,base,SHARED
    SrcRM: base DstRM: base DstRMJID: 1844186.fslsched.fsl.byu.edu
    Submit Args: hello.pbs
    Flags: RESTARTABLE,FSVIOLATION
    Attr: FSVIOLATION,checkpoint
    StartPriority: 1644
    PE: 4.00

    Node Availability for Partition base --------
    available for 2 tasks - m5-8-[5]:m5-18-[15]
    rejected for Class - m5-20-[5-16]:m5f-1-[1-2]:mgpu-1-[1]:m5-21-[1-16]:mgpu-1-[2]
    rejected for State - m5-1-[1-16]:m5-2-[1-16]:m5-3-[1-16]:m5-4-[1-16]:m5-5-[1-16]:m5-6-[1-16]:m5-7-[1-16]:m5-8-[1-16]:m5-9-[1-16]:
      m5-10-[1-16]:m5-11-[1-16]:m5-12-[1-16]:m5-13-[1-16]:m5-14-[1-16]:m5-15-[1-16]:m5-16-[1-16]:m5-17-[1-16]:m5-18-[1-16]:m5-19-[1-16]:
      m5-20-[1-12]:m5q-2-[1-16]:m5q-1-[1-16]

    NOTE: job cannot run in partition base (insufficient idle nodes available: 2 < 4)
Developing Code
• Normal Linux code development tools
  • gcc, g++, gdb, etc.
• Intel compiler
  • icc, ifort
• Editing
  • vi
  • emacs
  • edit on your own machine and transfer
• Parallel code development
  • icc -openmp
  • gcc -fopenmp
  • mpicc
• You will need to run
  • mpi-selector --list
  • mpi-selector --set fsl_openmpi_intel-1.3.3 (check the name)
Output

    -bash-3.2$ cat Hello.o1843568
    The root node is m5-17-7.local
    Here are all the nodes being used
    m5-17-7
    m5-17-14
    m5-5-1
    m5-5-5
    From here on is the output from the program
    I am running on m5-17-7.local
    I am running on m5-5-1.local
    I am running on m5-17-14.local
    I am running on m5-5-5.local
    I am proc 0 of 4 running on m5-17-7.local
    I am proc 2 of 4 running on m5-5-1.local
    I am proc 1 of 4 running on m5-17-14.local
    I am proc 3 of 4 running on m5-5-5.local
    14:01:33 up 78 days, 4:54, 0 users, load average: 6.00, 5.97, 5.91
    14:01:33 up 78 days, 4:54, 0 users, load average: 6.06, 5.91, 4.98
    USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
    USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
    Sending messages
    Sending messages
    Receiving messages
    Receiving messages
    14:01:33 up 78 days, 4:55, 0 users, load average: 7.11, 6.45, 6.41
    USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
    14:01:33 up 78 days, 4:55, 0 users, load average: 4.07, 6.21, 7.51
    USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
    Sending messages
    Sending messages
    Receiving messages
    Receiving messages
    2: 0: Hello
    2: 1: Hello
    2: 3: Hello
    1: 0: Hello
    1: 2: Hello
    1: 3: Hello
    0: 1: Hello
    0: 2: Hello
    3: 0: Hello
    3: 1: Hello
    3: 2: Hello
    0: 3: Hello

• stderr and stdout from each node are collected into files
  • Jobname.oJOBNUM
  • Jobname.eJOBNUM
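The Jobname.oJOBNUM and Jobname.eJOBNUM naming convention makes it easy to locate a job's files once the job name (from #PBS -N) and job number are known. A minimal sketch, with a hypothetical helper name output_files:

```shell
# output_files: hypothetical helper that prints the stdout and stderr file
# names for a job. Arguments: $1 job name (from #PBS -N), $2 job number.
output_files() {
    echo "$1.o$2 $1.e$2"
}

# Example (commented out; the files exist only after the job finishes):
#   for f in $(output_files Hello 1843568); do cat "$f"; done
```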
Policies

Per User Policies
• Max Jobs Running: soft limit of 400, hard limit of 525
• Max Processors Running: soft limit of 440, hard limit of 550
• Max Jobs Eligible: hard limit of 768
• Max Processors Eligible: hard limit of 1600

Per Research Group Policies
• Max Processors Running: soft limit of 512, hard limit of 630

Per Job Policies
In addition to the other policies, each job is subject to the following limitations:
• Max Total Running Time Requested: No job will be allowed to request more than 16 days of total running time. NOTE: Most high-performance computing facilities limit this to between 24 and 72 hours.
• Max CPU Time Requested: CPU time is the product of CPU count and total running time requested. Currently, the limit is the equivalent of 128 processors for 14 days, or 1792 processor-days. For example, a job could use 256 processors for 7 days, or 384 processors for 112 hours.
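The two per-job limits can be checked with simple arithmetic before submitting. The sketch below works in hours to stay with integers: 16 days is 384 hours, and 1792 processor-days is 43008 processor-hours. The helper name within_policy is hypothetical, and the check covers only these two limits, not the per-user or per-group ones.

```shell
# Limits taken from the policy text above.
MAX_HOURS=$((16 * 24))           # 16 days of running time = 384 hours
MAX_PROC_HOURS=$((1792 * 24))    # 1792 processor-days = 43008 processor-hours

# within_policy: hypothetical helper; exits 0 when a request for
# $1 processors and $2 hours of walltime satisfies both per-job limits.
within_policy() {
    [ "$2" -le "$MAX_HOURS" ] && [ $(($1 * $2)) -le "$MAX_PROC_HOURS" ]
}
```

With these limits, `within_policy 256 168` (256 processors for 7 days) and `within_policy 384 112` succeed, while `within_policy 384 113` fails the CPU-time limit and `within_policy 1 385` fails the running-time limit.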
Backfill Scheduling
[Figure: jobs A, B, C, and D placed over time on a 10-node system]
Backfill Scheduling
• Requires a real (wall-clock) time limit to be set for each job
• A more accurate (shorter) estimate gives a job a better chance of starting earlier
• Short jobs can move through the system more quickly
• Uses the system better by avoiding wasted cycles while jobs wait
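The reason the walltime estimate matters can be sketched with a simplified fit test: a job can be backfilled into a gap only if it needs no more nodes than are idle and its requested walltime ends before the gap closes. This is an illustration under that simplified single-gap model, not the scheduler's actual algorithm; the helper names to_seconds and can_backfill are hypothetical, and walltimes are assumed to be in HH:MM:SS form.

```shell
# to_seconds: convert a walltime of the form HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# can_backfill: exits 0 when a job fits into a gap of idle nodes.
# Arguments: $1 idle nodes in the gap, $2 gap length (HH:MM:SS),
#            $3 nodes the job needs,   $4 job walltime (HH:MM:SS).
can_backfill() {
    [ "$3" -le "$1" ] && [ "$(to_seconds "$4")" -le "$(to_seconds "$2")" ]
}
```

Under this model, a 2-node job requesting 00:05:00 fits a one-hour gap of 2 idle nodes, but the same job requesting 02:00:00 does not, which is why a shorter, accurate estimate can let a job start earlier.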