An Overview of the Portable Batch System Gabriel Mateescu National Research Council Canada I M S B gabriel.mateescu@nrc.ca www.sao.nrc.ca/~gabriel/presentations/sgi_pbs
Outline • PBS highlights • PBS components • Resources managed by PBS • Choosing a PBS scheduler • Installation and configuration of PBS • PBS scripts and commands • Adding preemptive job scheduling to PBS
PBS Highlights • Developed by Veridian / MRJ • Robust, portable, effective, extensible batch job queuing and resource management system • Supports different schedulers • Supports heterogeneous clusters • Open PBS - open source version • PBS Pro - commercial version
Recent Versions of PBS • PBS 2.2, November 1999: • both the FIFO and SGI schedulers have bugs in enforcing resource limits • poor support for stopping and resuming jobs • OpenPBS 2.3, September 2000: • better FIFO scheduler: resource limits enforced, backfilling added • PBS Pro 5.0, September 2000: • claims support for job stopping/resuming, better scheduling, IRIX cpusets
Resources managed by PBS • PBS manages jobs, CPUs, memory, hosts and queues • PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter • Resources - describe attributes of jobs, queues, and hosts • Scheduler - chooses the jobs that fit within queue and cluster resources
Main Components of PBS • Three daemons: • pbs_server server, • pbs_sched scheduler, • pbs_mom job executor & resource monitor • The server accepts commands and communicates with the daemons • qsub - submit a job • qstat - view queue and job status • qalter - change job’s attributes • qdel - delete a job
Batch Queuing [Diagram: jobs flow through Queue A and Queue B onto an SGI Origin system with job-exclusive scheduling; each node comprises CPUs plus memory]
Resource Examples • ncpus number of CPUs per job • mem resident memory per job • pmem per-process memory • vmem virtual memory per job • cput CPU time per job • walltime real time per job • file file size per job
Resource limits • resources_max - per job limit for a resource; determines whether a job fits in a queue • resources_default - default amount of a resource assigned to a job • resources_available - advice to the scheduler on how much of a resource can be used by all running jobs
Choosing a Scheduler (1) • FIFO scheduler: • First-fit placement: dispatches a job to the first queue whose limits admit it, even if the job cannot run there right now and another queue could run it immediately • Supports per-job and (in version 2.3) per-queue resource limits: ncpus, mem • Supports per-server limits on the number of CPUs and on memory (based on the server attribute resources_available)
Choosing a Scheduler (2) • Algorithms in the FIFO scheduler • FIFO - sort jobs by queuing time, running the earliest job first • Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order • Fair share - sort and schedule jobs based on the owners' past usage of the machine • Round-robin - pick a job from each queue in turn • By key - sort jobs by a list of keys: shortest_job_first, smallest_memory_first
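The "by key" sort above can be sketched as follows. This is a hypothetical illustration: the job records and the mapping from key names to job fields are modeled on the slide, not on the FIFO scheduler's internal data structures.

```python
# Illustrative "by key" job ordering: later keys break ties left by
# earlier ones, so we sort on a tuple of the keyed fields.
jobs = [
    {"name": "a", "cput": 3600, "mem": 512},
    {"name": "b", "cput": 600,  "mem": 1024},
    {"name": "c", "cput": 600,  "mem": 256},
]

def sort_by_keys(jobs, keys):
    # Hypothetical mapping from scheduler key names to job attributes.
    fields = {"shortest_job_first": "cput", "smallest_memory_first": "mem"}
    return sorted(jobs, key=lambda j: tuple(j[fields[k]] for k in keys))

order = [j["name"] for j in
         sort_by_keys(jobs, ["shortest_job_first", "smallest_memory_first"])]
print(order)  # c and b tie on cput; c wins on smaller memory
```

With these sample jobs the order is c, b, a: the two 600-second jobs run before the hour-long one, and the memory key breaks their tie.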
Choosing a Scheduler (3) • The FIFO scheduler supports round-robin load balancing as of version 2.3 • FIFO scheduler • decouples the job's requirement on the number of CPUs from its requirement on the amount of memory • simple first-fit placement may force the user to specify an execution queue explicitly when the job would fit in more than one queue
Choosing a Scheduler (4) • SGI scheduler • supports FIFO, fair share, and backfilling, and attempts to avoid job starvation • supports both per-job and per-queue limits on the number of CPUs and on memory • the per-server limit is the number of node cards • makes a best effort in choosing a queue in which to run a job; a job that lacks the resources to run is kept in the submit queue • ties the number of CPUs allocated per job to the amount of memory allocated
Resource allocation • The SGI scheduler allocates nodes - node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ] • The number of nodes N for a job is the smallest N such that [ ncpus, mem ] <= [ N*PE_PER_NODE, N*MB_PER_NODE ] where ncpus and mem are the job's CPU and memory limits, specified, e.g., with #PBS -l mem • The job attributes Resource_List.{ncpus, mem} are set to Resource_List.ncpus = N * PE_PER_NODE and Resource_List.mem = N * MB_PER_NODE
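The node count N above is the smaller of no nodes than needed to cover both the CPU and the memory request, i.e., the maximum of two ceilings. A minimal sketch, assuming memory is expressed in MB (toolkit.h uses bytes) and using the slide's PE_PER_NODE = 2, MB_PER_NODE = 512:

```python
import math

PE_PER_NODE = 2      # CPUs per Origin node (from toolkit.h)
MB_PER_NODE = 512    # MB of memory per node (assumed unit: MB)

def nodes_needed(ncpus, mem_mb):
    """Smallest N with [ncpus, mem] <= [N*PE_PER_NODE, N*MB_PER_NODE]."""
    return max(math.ceil(ncpus / PE_PER_NODE),
               math.ceil(mem_mb / MB_PER_NODE))

# A job asking for 3 CPUs and 1536 MB needs max(ceil(3/2), ceil(1536/512)) = 3
# nodes, so its Resource_List is rounded up to 6 CPUs and 1536 MB.
N = nodes_needed(3, 1536)
print(N, N * PE_PER_NODE, N * MB_PER_NODE)
```

Note how a memory-heavy job is charged extra CPUs (and vice versa): this is the coupling of CPU and memory allocation mentioned on the previous slide.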
Queue and Server Limits • FIFO scheduler: • per job limits (ncpus, mem) are defined by resources_max queue attributes • as of version 2.3, resources_max also defines per queue limits • per server resource limits enforced with resources_available attributes
Queue and Server Limits • SGI scheduler: • per job limits (ncpus, mem) are defined by resources_max queue attributes • resources_max also defines per queue limits • the per server limit is given by the number of Origin node cards. Unlike the FIFO scheduler, resources_available limits are not enforced
Job Enqueuing (1) • The scheduler places each job in some queue • This involves several tests for resources • Which queue a job is enqueued into depends on • which limits are tested • first-fit versus best-fit placement • A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue. For example, for the resource ncpus: Resource_List.ncpus <= resources_max.ncpus
Job Enqueuing (2) • A job fits in a queue if the amount of resources assigned to the queue plus the requested resources do not exceed the queue's resource limits. For example, for ncpus: resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus • A job fits in the system if the sum of all assigned resources plus the requested resources does not exceed the available resources. For example, for the ncpus resource: Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
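The three fit tests can be sketched as predicates. The dictionaries below are an illustrative stand-in for PBS's queue and server attributes, not its actual data structures:

```python
def fits_job_limits(job, queue):
    # Per-job test: each requested resource within the queue's resources_max.
    return all(job[r] <= queue["resources_max"][r] for r in job)

def fits_queue_now(job, queue):
    # Per-queue test: resources_assigned plus the request within resources_max.
    return all(queue["resources_assigned"][r] + job[r]
               <= queue["resources_max"][r] for r in job)

def fits_system(job, queues, resources_available):
    # Per-server test: sum of assigned resources over all queues, plus the
    # request, within resources_available.
    return all(sum(q["resources_assigned"][r] for q in queues) + job[r]
               <= resources_available[r] for r in job)

hpc = {"resources_max": {"ncpus": 2}, "resources_assigned": {"ncpus": 1}}
job = {"ncpus": 2}
print(fits_job_limits(job, hpc))              # the job can fit: 2 <= 2
print(fits_queue_now(job, hpc))               # but not right now: 1 + 2 > 2
print(fits_system(job, [hpc], {"ncpus": 4}))  # the server has room: 1 + 2 <= 4
```

The example shows why "can fit" and "fits now" differ, which is exactly the distinction behind first-fit versus best-fit placement on the next slide.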
First fit versus best fit • The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue • if the job does not actually fit, it waits for the requested resources in the execution queue • The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue • If queues are defined to have monotonically increasing resource limits (e.g., CPU time), then first fit carries no penalty • However, if a job can fit in several queues, then the SGI scheduler will find a better schedule
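A toy contrast of the two policies (the queue records are illustrative; PBS does not expose queues this way):

```python
def first_fit(job, queues):
    # FIFO scheduler style: dispatch to the first queue whose limit admits
    # the job, even if it must then wait there for CPUs to free up.
    for q in queues:
        if job["ncpus"] <= q["max_ncpus"]:
            return q["name"]
    return None

def best_fit(job, queues):
    # SGI scheduler style: keep the job in the submit queue (None) until
    # some execution queue can actually run it now.
    runnable = [q for q in queues
                if job["ncpus"] <= q["max_ncpus"] - q["used_ncpus"]]
    return runnable[0]["name"] if runnable else None

queues = [{"name": "hpc",  "max_ncpus": 2, "used_ncpus": 2},
          {"name": "back", "max_ncpus": 4, "used_ncpus": 0}]
job = {"ncpus": 2}
print(first_fit(job, queues))  # hpc: the job waits behind running jobs
print(best_fit(job, queues))   # back: the job can start immediately
```

Here first fit parks the job in a full queue while an idle queue could have run it at once, which is the penalty the slide describes.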
Limits on the number of running jobs • Per queue and per server limits on the number of running jobs: • max_running • max_user_run, max_group_run - maximum number of running jobs per user or group • Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per queue basis • instead, it enforces MAX_JOBS from the scheduler config file as a substitute for a per-server max_running
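A hedged sketch of how such limits gate job starts; the record layout and function are hypothetical, but the two checks mirror max_running and max_user_run:

```python
def may_start(job, running, max_running, max_user_run):
    # Overall cap: no more than max_running jobs at once.
    if len(running) >= max_running:
        return False
    # Per-user cap: no more than max_user_run jobs for this owner.
    same_user = sum(1 for r in running if r["owner"] == job["owner"])
    return same_user < max_user_run

running = [{"owner": "bob"}, {"owner": "bob"}]
print(may_start({"owner": "bob"},   running, 8, 2))  # bob is at his limit
print(may_start({"owner": "alice"}, running, 8, 2))  # alice may start
print(may_start({"owner": "alice"}, running, 2, 2))  # queue itself is full
```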
SGI Origin Install (1) • Source files under OpenPBS_v2_3/src • Consider the SGI scheduler • Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware #define MB_PER_NODE ((size_t) 512*1024*1024) #define PE_PER_NODE 2 • PE_PER_NODE may be set to 1 to allocate half-nodes, if MB_PER_NODE is adjusted accordingly
SGI Origin Install (2)
• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c
• Operator precedence bug (line 198): == binds more tightly than &, so the unparenthesized test masks the flags with (QFLAGS_FULL == 0), i.e., with 0, and schd_evaluate_system() is never reached
for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
    /* buggy: if (qptr->queue->flags & QFLAGS_FULL == 0) -- precedence bypasses the test */
    if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
        if (!schd_evaluate_system(...)) {
            /* DONT_START_JOB (0), so don't change allfull */
            continue;
        }
        /* ... */
    }
}
SGI Origin Install (3)
• Fix of a logical bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job
for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
    if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
        if (!schd_evaluate_system(...)) {
            /* DONT_START_JOB (0), so don't change allfull */
            continue;
        }
        /* ... */
    }
}
for (qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next) {
    /* if allfull is set, do not attempt to schedule */
}
SGI Origin Install (4)
• Fix of a logical bug in user_limits.c, function user_running()
• This function counts the number of running jobs, so it must test the job state for equality with 'R'
user_running(...) {
    for (job = queue->jobs; job != NULL; job = job->next) {
        if ((job->state == 'R') && (!strcmp(job->owner, user)))
            jobs_running++;
        /* ... */
    }
}
SGI Origin Install (5)
• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array
#define SGI_ZOMBIE_WRONG 1
int mom_over_limit(...) {
    /* ... */
#if !defined(SGI_ZOMBIE_WRONG)
    return (TRUE);
#endif
    /* ... */
}
SGI Origin Install (6)
Script to run the configure command
___________________________________________________
#!/bin/csh -f
set PBS_HOME=/usr/local/pbs
set PBS_SERVER_HOME=/usr/spool/pbs
# Select the SGI or the FIFO scheduler
set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
#set SCHED="--set-sched-code=fifo --enable-nodemask"
$HOME/PBS/OpenPBS_v2_3/configure \
  --prefix=$PBS_HOME \
  --set-server-home=$PBS_SERVER_HOME \
  --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
  --set-sched=cc $SCHED --enable-array --enable-debug
SGI Origin Install (7)
___________________________________________________
# cd /usr/local/pbs
# makePBS          (the configure script from the previous slide)
# make
# make install
# cd /usr/spool/pbs
sched_priv contains config and decay_usage
Configuring for the SGI scheduler • Queue types • one submit queue • one or several execution queues • Per-server limit on the number of running jobs • Load control • Fair-share scheduling • Past usage of the machine is used in ranking the jobs • Decayed past usage per user is kept in sched_priv/decay_usage • Scheduler restart action • PBS manager tool: qmgr
Queue definition
• File sched_priv/config
SUBMIT_QUEUE submit
BATCH_QUEUES hpc,back
MAX_JOBS 256
ENFORCE_PRIME_TIME False
ENFORCE_DEDICATED_TIME False
SORT_BY_PAST_USAGE True
DECAY_FACTOR 0.75
SCHED_ACCT_DIR /usr/spool/pbs/server_priv/accounting
SCHED_RESTART_ACTION RESUBMIT
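The SORT_BY_PAST_USAGE / DECAY_FACTOR pair above drives fair-share ranking. A minimal sketch of how decayed usage could evolve, assuming usage is multiplied by DECAY_FACTOR each decay period before the new period's consumption is added (the exact bookkeeping in decay_usage may differ):

```python
# Hypothetical decayed-usage update: old consumption counts less and less,
# so recent usage dominates a user's fair-share ranking.
DECAY_FACTOR = 0.75   # value from the sched_priv/config slide

def update_usage(past, new):
    return past * DECAY_FACTOR + new

usage = 0.0
for period_cpu_seconds in [100, 0, 0]:   # one busy period, then two idle ones
    usage = update_usage(usage, period_cpu_seconds)
print(usage)  # 100 * 0.75 * 0.75 = 56.25
```

Users with lower decayed usage are ranked ahead, so a user who was busy last month but idle since gradually regains priority.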
Load Control
• Load control for the SGI scheduler: sched_priv/config
TARGET_LOAD_PCT 90%
TARGET_LOAD_VARIANCE -15%,+10%
• Load control for the FIFO scheduler: mom_priv/config
$max_load 2.0
$ideal_load 1.0
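One plausible reading of the SGI scheduler's target-load window (the scheduler's exact comparison may differ): the load per CPU should stay within TARGET_LOAD_PCT plus the variance band.

```python
# Hypothetical target-load window check: with the values from the slide,
# acceptable load per CPU lies in [90% - 15%, 90% + 10%] = [75%, 100%].
TARGET_LOAD_PCT = 0.90
TARGET_LOAD_VARIANCE = (-0.15, 0.10)

def load_in_window(load, ncpus):
    pct = load / ncpus
    return (TARGET_LOAD_PCT + TARGET_LOAD_VARIANCE[0]
            <= pct <=
            TARGET_LOAD_PCT + TARGET_LOAD_VARIANCE[1])

print(load_in_window(3.6, 4))  # 90% load on a 4-CPU host: on target
print(load_in_window(1.0, 4))  # 25% load: below the window, start more jobs
```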
PBS for SGI scheduler
• Qmgr tool
s server managers=bob@n0.bar.com
create queue submit
s q submit queue_type = Execution
s q submit resources_max.ncpus = 4
s q submit resources_max.mem = 1gb
s q submit resources_default.mem = 256mb
s q submit resources_default.ncpus = 1
s q submit resources_default.nice = 15
s q submit enabled = True
s q submit started = True
PBS for SGI scheduler
create queue hpc
s q hpc queue_type = Execution
s q hpc resources_max.ncpus = 2
s q hpc resources_max.mem = 512mb
s q hpc resources_default.mem = 256mb
s q hpc resources_default.ncpus = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
s q hpc enabled = True
s q hpc started = True
PBS for SGI scheduler
• Server attributes
set server default_queue = submit
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server scheduling = True
s server query_other_jobs = True
PBS for FIFO scheduler
• The configuration file is sched_config instead of config, and queues are not defined there
• The submit queue is a Route queue
s q submit queue_type = Route
s q submit route_destinations = hpc
s q submit route_destinations += back
• Server attributes
s server resources_available.mem = 1gb
s server resources_available.ncpus = 4
PBS Job Scripts
• Job scripts contain PBS directives and shell commands
#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30
cd ${PBS_O_WORKDIR}
mpirun -np 2 foo.x
Basic PBS commands
• Jobs are submitted with qsub
% qsub [-q hpc] foo.pbs
13.node0.bar.com
• Job status is queried with qstat [-f|-a], which shows the job owner, name, queue, status, session ID, number of CPUs, and walltime
% qstat -a 13
• Alter job attributes
% qalter -l walltime=20:00:00 13
Job Submission and Tracking
• Find jobs in state R (running) or submitted by user bob
% qselect -s R
% qselect -u bob
• Query queue status to find whether the queue is enabled/started, and the number of jobs in the queue
% qstat [-f | -a] -Q
• Delete a job:
% qdel 13
Job Environment and I/O • The job's current directory is the submitter's $HOME, which is also the default location for files created by the job; change it with cd in the script • The standard output and error of the job are spooled to JobName.{o|e}JobID in the submitter's current directory. Override this with #PBS -o | -e pathname
Tips
• Trace the history of a job
% tracejob - gives a time-stamped sequence of the events affecting a job
• Cron jobs for cleaning up daemon work files under mom_logs, sched_logs, server_logs
# crontab -e
9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;
Sample PBS Front-End [Diagram: two hosts, node0 and node1; the submission server provides the user commands qsub, qdel, ...; the execution server runs the pbs_server, pbs_sched, and pbs_mom daemons]
PBS for clusters
• File staging - copy files (other than stdout/stderr) between a submission-only host and the server
#PBS -W stagein=/tmp/bar@n1:/home/bar/job1
#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1
PBS uses the directory /tmp/bar/job1 as a scratch directory
• File staging may precede job start - this helps hide latency
Setting up a PBS Cluster
• Assume n1 runs the pbs_mom daemon
• $PBS_SERVER_HOME/server_priv/nodes
n0 np=2 gaussian
n1 np=2 irix
• n0:$PBS_SERVER_HOME/mom_priv/config
$clienthost n1
$ideal_load 1.5
$max_load 2.0
• n1:$PBS_SERVER_HOME/mom_priv/config
$ideal_load 1.5
$max_load 2.0
Setting up a PBS Cluster
• Qmgr tool
s server managers=bob@n0.bar.com
create queue hpc
s q hpc queue_type = Execution
s q hpc Priority = 100
s q hpc resources_max.ncpus = 2
s q hpc resources_max.nodect = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
Setting up a PBS Cluster
• Server attributes
set server default_node = n0
set server default_queue = hpc
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s s resources_default.nodect = 1
s s resources_default.nodes = 1
s s resources_default.neednodes = 1
set server max_user_run = 2
PBS features
• The job submitter can request a number of nodes with certain properties
• For example:
• request a node with the property gaussian:
#PBS -l nodes=gaussian
• request two nodes with the property irix:
#PBS -l nodes=2:irix
PBS Security Features • All files used by PBS are owned by root and can be written only by root • Configuration files: sched_priv/config, mom_priv/config are readable only by root • $PBS_HOME/pbs_environment defines $PATH; it is writable only by root • pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config • The server accepts commands from selected hosts and users
Why preemptive scheduling? • Resource reservation (CPU, memory) is needed to achieve high job throughput • Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around • An approach is needed to achieve both high job throughput and rapid job turn-around
Static Reservation Pitfall (1) [Diagram: a parallel computer or cluster divided by a partition boundary into a Physics Group partition and a Biotech Group partition; each node holds a CPU plus memory; job requests arrive at each partition]
Static Reservation Pitfall (2) • Physics Group’s Job 1 is assigned 3 nodes and dispatched • Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group • However, there are enough resources for Job 3
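The pitfall can be made concrete with a toy model. The partition size of 4 nodes per group is an assumption chosen to match the slide's scenario (Job 1 uses 3 nodes, Job 2 leaves only 1 node free in the Biotech partition):

```python
# Toy model of static reservation: Biotech's 2-node Job 3 is blocked by
# its partition even though the machine as a whole has idle nodes.
PARTITION = 4            # assumed nodes reserved per group

def free(total, used):
    return total - used

physics_used = 3         # Job 1 (Physics)
biotech_used = 3         # Job 2 (Biotech)
job3_nodes = 2           # Job 3 (Biotech)

print(free(PARTITION, biotech_used) >= job3_nodes)                     # blocked in partition
print(free(2 * PARTITION, physics_used + biotech_used) >= job3_nodes)  # fits machine-wide
```

The first check fails (1 free node in the partition) while the second succeeds (2 free nodes overall), which is exactly the utilization loss that motivates preemptive scheduling below.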
Proposed Approach (1) • Leverage the features of the Portable Batch System (PBS) • Extend PBS with preemptive job scheduling • All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues • Define a queue for jobs that may be preempted: the background queue