LoadLeveler vs. NQE/NQS: Clash of The Titans
NERSC User Services
Oak Ridge National Lab
6/6/00
NERSC Batch Systems
• LoadLeveler - IBM SP
• NQS/NQE - Cray T3E/J90's
• This talk will focus on the MPP systems
• Using the batch system on the J90's is similar to using it on the T3E
• The IBM batch system: http://hpcf.nersc.gov/running_jobs/ibm/batch.html
• The Cray batch system: http://hpcf.nersc.gov/running_jobs/cray/batch.html
• Batch differences between IBM and Cray: http://hpcf.nersc.gov/running_jobs/ibm/lldiff.html
About the T3E
• 644 application processors (PEs)
• 33 command PEs
• Additional PEs for the OS
• NQE/NQS jobs run on the application PEs
• Interactive jobs ("mpprun" jobs) run on the command PEs
• Single system image
• A single parallel job must run on a contiguous set of PEs
• A job will not be scheduled if there are enough idle PEs but they are fragmented throughout the torus
About the SP
• 256 compute nodes
• 8 login nodes
• Additional nodes for the file system, network, etc.
• Each node has 2 processors that share memory
• Each node can have either 1 or 2 MPI tasks
• Each node runs a full copy of the AIX OS
• LoadLeveler jobs can run only on the compute nodes
• Interactive jobs ("poe" jobs) can run on either compute or login nodes
How To Use a Batch System
• Write a batch script
  • must use keywords specific to the scheduler
  • default values will be different for each site
• Submit your job
  • commands are specific to the scheduler
• Monitor your job
  • commands are specific to the scheduler
  • run limits are specific to each site
• Check results when complete
• Call the NERSC consultants when your job disappears :o)
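As a rough side-by-side sketch of those four steps on the two NERSC systems (job.csh and job.ll are hypothetical script names; the task and job IDs are just the example identifiers used later in these slides), the commands on each machine are:

# On the T3E (NQS/NQE):
% cqsub -la regular job.csh     # 1. submit the batch script
% cqstatl -a | grep jimbob      # 2. monitor the NQE task ...
% qstat -a | grep jimbob        #    ... and the NQS request once it has been submitted
% cqdel t7225                   #    delete the job if something went wrong

# On the SP (LoadLeveler):
% llsubmit job.ll               # 1. submit the batch script
% llqs                          # 2. monitor the job
% llcancel gs01005.84.0         #    delete the job if something went wrong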
T3E Batch Terminology
PE - processor element (a single CPU)
Torus - the high-speed connection between PEs. All communication between PEs must go through the torus.
Swapping - when a job is stopped by the system to allow a higher-priority job to run on those PEs. The job may stay in memory. Also called "gang-scheduling".
Migrating - when a job is moved to a different set of PEs to better pack the torus.
Checkpoint - when a job is stopped by the system and an image is saved so that it can be restarted at a later time.
More T3E Batch Terminology
Pipe Queue - a queue in the NQE portion of the scheduler. It determines which batch queues the job may be submitted to. The user must specify the pipe queue on the cqsub command line if it is anything other than "regular" (example below).
Batch Queue - a queue in the NQS portion of the scheduler. The batch queues are served in a first-fit manner. The user should not specify a batch queue on the command line or in the script.
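For example, to route a job through the debug or long pipe queue instead of the default regular queue, the pipe queue is named with -la, following the cqsub usage shown on the NQS/NQE commands slide (a sketch; script.csh is a hypothetical script name):

% cqsub -la debug script.csh    # route through the debug pipe queue
% cqsub -la long script.csh     # route through the long pipe queue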
NQS/NQE
• Developed by Cray
• Very complex set of scheduling parameters
• Complicated to understand
• Fragile
• Powerful and flexible
• Allows checkpoint/restart
What NQE Does
• Users submit jobs to NQE
• NQE assigns the job a unique identifier called the taskid and stores it in a database
• The status of the job is "NPend"
• NQE examines various parameters and decides when to pass the job to the LWS
• The LWS then submits the job to an NQS batch queue (see the next slide for NQS details)
• After the job completes, NQE stores the job information for about 4 hours
What NQS Does
• NQS receives a job from the LWS
• The job is placed in a batch queue determined by the number of requested PEs and the requested time
• The status of the job is now "NSubm"
• NQS batch queues are served in a first-fit manner
• When the job is ready to be scheduled, it is sent to the GRM (global resource manager)
• At this point the status of the job is "R03"
• The job may be stopped for checkpointing or swapping but still show a "running" status in NQS
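Taken together, the two slides above describe the path NPend -> NSubm -> R03, which can be followed with the monitoring commands shown later. A sketch, where script.csh, the task id t1234, and the request id 1300.mcurie are all hypothetical:

% cqsub -la regular script.csh
Task id t1234 inserted into database nqedb.
% cqstatl -a | grep t1234        # NPend while the job sits in the NQE database,
                                 # NSubm once the LWS has handed it to an NQS batch queue
% qstat -a | grep 1300.mcurie    # R03 once the GRM has started the job on the application PEs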
NQS/NQE Commands
• cqsub - submit your job
% cqsub -la regular script_file
Task id t7225 inserted into database nqedb.
• cqstatl - monitor your NQE job
• qstat - monitor your NQS job
• cqdel - delete your queued or running job
% cqdel t7225
Sample T3E Batch Script #QSUB -s /bin/csh #Specify C Shell for 'set echo' #QSUB -A abc #charge account abc for this job #QSUB -r sample #Job name #QSUB -eo -o batch_log.out #Write error and output to single file. #QSUB -l mpp_t=00:30:00 #Wallclock time #QSUB -l mpp_p=8 #PEs to be used (Required). ja #Turn on Job Accounting mpprun -n 8 ./a.out #Execute on 8 PEs reading data.in ja -s
Monitoring Your Job on the T3E
% cqstatl -a | grep jimbob
t4417                l441h4  scheduler.main  jimbob  NQE Database  NPend
t4605 (1259.mcurie)  l513v8  lws.mcurie      jimbob  nqs@mcurie    NSubm
t4777 (1082.mcurie)  l541l2  monitor.main    jimbob  NQE Database  NComp
t4884 (1092.mcurie)  l543l1  lws.mcurie      jimbob  nqs@mcurie    NSubm
t4885 (1093.mcurie)  l545l1  lws.mcurie      jimbob  nqs@mcurie    NSubm
t4960                l546    scheduler.main  jimbob  NQE Database  NPend
% qstat -a | grep jimbob
1259.mcurie  l513v8  jimbob  pe32@mcurie  2771    26  255  1800  R03
1092.mcurie  l543l1  jimbob  pe32@mcurie  3416    26  252  1800  R03
1093.mcurie  l545l1  jimbob  pe32@mcurie   921  28672       1800  Qge
Monitoring Your Job on the T3E (cont.)
• Use the commands pslist (see next slide) and tstat to check running jobs
• Using ps on a command PE will list all instances of a parallel job because the T3E has a single system image
% mpprun -n 4 ./a.out
% ps -u jimbob
  PID  TTY   TIME  CMD
 7523   ?    0:01  csh
 7568   ?   12:13  a.out
16991   ?   12:13  a.out
16992   ?   12:13  a.out
16993   ?   12:13  a.out
Monitoring Your Job on the T3E (cont.)
% pslist
S USER      RK APID   JID    PE_RANG NPE TTY TIME     CMD       STATUS
- --------  -- ------ ------ ------- --- --- -------- --------- ---------------
a user1      0 29451  29786  000-015  16  ?  02:50:32 sander
b buffysum   0 29567  29787  016-031  16  ?  02:57:45 osiris.e
<snip>
ACTIVE PEs = 631
q buffysum   1 18268  29715  146-161  16  ?  00:42:28 osiris.e  Swapped 1 of 16
r miyoung    1 77041  28668  172-235  64  ?  03:52:11 vasp
s buffysum   1 53202  30069  236-275  40  ?  00:18:16 osiris.e  Swapped 1 of 40
t willow     1 51069  27914  276-325  50  ?  00:53:03 MicroMag.
u hal        1 77007  30569  326-357  32  ?  00:26:09 alknemd
ACTIVE PEs = 266   BATCH = 770   INTERACTIVE = 12
WAIT QUEUE:
user    uid    gid   acid  Label  Size  ApId   Command   Reason     Flags
giles   13668  2607  2607  -        64  55171  xlatqcdp  Ap. limit  a----
bobg    14721  2751  2751  -        54  68936  Cmdft     Ap. limit  a----
jimbo   15761  3009  3009  -        32  77407  pop.8x4   Ap. limit  af---
Possible Job States on the T3E
ST     Job State    Description
R03    Running      The job is currently running.
NSubm  Submitted    The job has been submitted to the NQS scheduler and is being considered to run.
NPend  Pending      The job is still residing in the NQE database and is not being considered to run. This is probably because you already have 3 jobs in the queue.
NComp  Completed    The job has completed.
NTerm  Terminated   The job was terminated, probably due to an error in the batch script.
Current Queue Limits on the T3E
Pipe Q      Batch Q       MAX PE  Time
debug       debug_small       32  33 min
            debug_medium     128  10 min
production  pe16              16  4 hr
            pe32              32  4 hr
            pe64              64  4 hr
            pe128            128  4 hr
            pe256            256  4 hr
            pe512            512  4 hr
long        long128          128  12 hr
            long256          256  12 hr
Queue Configuration on the T3E
Time (PDT)  Action
7:00 am     long256 stopped; pe256 stopped
10:00 pm    pe512 started; long128, pe128 stopped and checkpointed; pe64, pe32, pe16 run as backfill
1:00 am     pe512 stopped and checkpointed; long256, pe256, long128, pe128 started
LoadLeveler
• Product of IBM
• Conceptually very simple
• Few commands and options available
• Packs the system well with its backfilling algorithm
• Allows MIMD jobs
• Does not have checkpoint/restart, so running jobs cannot be stopped to favor certain other jobs
SP/LoadLeveler Terminology
Keyword - used to specify your job parameters (e.g. number of nodes and wallclock time) to the LoadLeveler scheduler.
Node - a set of 2 processors that share memory and a switch adapter. NERSC users are charged for exclusive use of a node.
Job ID - the identifier for a LoadLeveler job, e.g. gs01013.1234.0.
Switch - a high-speed connection between the nodes. All communication between nodes goes through the switch.
Class - a user submits a batch job to a particular class. Each class has a different priority and different limits.
What LoadLeveler Does
• Jobs are submitted directly to LoadLeveler
• The following keywords are set by default:
  • node_usage = not_shared
  • tasks_per_node = 2
• The user can override tasks_per_node but not node_usage (see the sketch after this slide)
• Incorrect keywords and parameters are passed silently to the scheduler!
  • NERSC only checks for valid repo and class names
• A prolog script creates the $SCRATCH and $TMPDIR directories and environment variables
  • $SCRATCH is a global (GPFS) filesystem and $TMPDIR is local
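As a rough sketch (not an official NERSC template) of what overriding the default looks like, a memory-hungry code that wants a whole node per MPI task would change only these lines relative to the full sample SP script shown a couple of slides later, and could use the prolog-created directories like this:

#@ tasks_per_node = 1      # one MPI task per 2-processor node; node_usage stays not_shared
#@ node           = 16     # still charged for exclusive use of all 16 nodes
#@ queue
cd $SCRATCH                # global GPFS scratch created by the prolog script
./a.out < input            # $TMPDIR could be used instead for node-local temporary files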
LoadLeveler Commands
• llsubmit - submit your job
% llsubmit script_file
llsubmit: The job "gs01007.nersc.gov.101" has been submitted.
• llqs - monitor your job
• llq - get details about one of your queued or running jobs (example after this slide)
• llcancel - delete your queued or running job
% llcancel gs01005.84.0
llcancel: Cancel command has been sent to the central manager.
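For llq, a sketch using the example step ID from the llqs listing on the "Monitoring Your Job on the SP" slide; the -l flag is the standard LoadLeveler long-listing option, although these slides do not show it:

% llq gs01007.1087.0       # one-line summary for that job step
% llq -l gs01007.1087.0    # long listing: class, node count, limits, current status, etc.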
Sample SP Batch Script #!/usr/bin/csh #@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = css0,not_shared,us #@ node_usage = not_shared #@ class = regular #@ tasks_per_node = 2 #@ node = 32 #@ wall_clock_limit= 01:00:00 #@ queue ./a.out < input
Monitoring Your Job on the SP
gseaborg% llqs
Step Id          JobName          UserName  Class    ST  NDS  WallClck  Submit Time
---------------- ---------------  --------  -------  --  ---  --------  -----------
gs01007.1087.0   a240             buffy     regular  R    32  00:31:44  3/13 04:30
gs01001.529.0    s1.x             willow    regular  R    64  00:28:17  3/12 21:45
gs01001.578.0    xdnull           xander    debug    R     5  00:05:19  3/14 12:44
gs01009.929.0    gs01009.nersc.g  spike     regular  R   128  03:57:27  3/13 05:17
gs01001.530.0    s2.x             willow    regular  I    64  04:00:00  3/12 21:48
gs01001.532.0    s3.x             willow    regular  I    64  04:00:00  3/12 21:50
gs01001.533.0    y1.x             willow    regular  I    64  04:00:00  3/12 22:17
gs01001.534.0    y2.x             willow    regular  I    64  04:00:00  3/12 22:17
gs01001.535.0    y3.x             willow    regular  I    64  04:00:00  3/12 22:17
gs01001.537.0    gs01001.nersc.g  spike     regular  I   128  02:30:00  3/13 06:10
gs01009.930.0    gs01009.nersc.g  spike     regular  I   128  02:30:00  3/13 07:17
Monitoring Your Job on the SP (cont.)
• Issuing a ps command will show only what is running on that login node, not any instances of your parallel job
• If you could issue a ps command on a compute node running 2 MPI tasks of your parallel job, you would see:
gseaborg% ps -u jimbob
  UID    PID  TTY   TIME  CMD
14397   9444   -   58:37  a.out
14397  10878   -    0:00  pmdv2
14397  11452        0:00  <defunct>
14397  15634   -    0:00  LoadL_starter
14397  16828   -   58:28  a.out
14397  19696   -    0:00  pmdv2
14397  19772   -    0:02  poe
14397  20878        0:00  <defunct>
Possible Job States on the SP
ST  Job State    Description
R   Running      The job is currently running.
I   Idle         The job is being considered to run.
NQ  Not Queued   The job is not being considered to run. This is probably because you have submitted more than 10 jobs.
ST  Starting     The job is starting to run.
HU  User Hold    The user put the job on hold. You must issue the llhold -r command in order for it to be considered for scheduling.
HS  System Hold  The job was put on hold by the system. This is probably because you are over disk quota in $HOME.
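For example, releasing a user hold so the job becomes eligible for scheduling again (a sketch using the s2.x step ID from the earlier llqs listing; the -u option to llq is standard LoadLeveler but not shown in these slides):

% llq -u jimbob            # list your jobs and their states; a held job shows HU
% llhold -r gs01001.530.0  # release the user hold; the job returns to the Idle (I) state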
Current Class Limits on the SP
CLASS        NODES  TIME    PRIORITY
debug           16  30 min     20000
premium        256  4 hr       10000
regular        256  4 hr        5000
low            256  4 hr           1
interactive      8  20 min     15000
The same configuration runs all the time.
More Information
• Please see the NERSC Web documentation
• The IBM batch system: http://hpcf.nersc.gov/running_jobs/ibm/batch.html
• The Cray batch system: http://hpcf.nersc.gov/running_jobs/cray/batch.html
• Batch differences between IBM and Cray: http://hpcf.nersc.gov/running_jobs/ibm/lldiff.html