Job Submission Andrew Pangborn & Myles Maxfield Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu
The Grid
• <Insert some structural picture of grid>?

The Problem
• At one end are computing resources managed by batch queuing systems and other middleware
• At the other end are end-users and their jobs/applications
• Need software and protocols for submitting jobs to the computing resources

Job Submission
• More motivation stuff?

Batch Queuing Systems
• Submitting a job directly to the batch queuing system
• One or more queues
• Priorities
• Two common architectures
  • Client/server
  • Dynamic offloading
• User credential (delegation)
• Jobs have states (e.g. Pending, Running)

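To make the queue, priority, and state ideas concrete, here is a minimal sketch. It is not how any particular batch system schedules. The job table format (id, priority, state) and the `next_job` helper are hypothetical, and it assumes a larger number means a higher priority.

```shell
#!/bin/sh
# Hypothetical job table: one job per line as "id priority state".
jobs_table='101 5 Pending
102 9 Running
103 7 Pending
104 7 Pending'

# next_job: print the id of the pending job that should start next.
# Assumption: larger priority wins; ties go to the lower (older) job id.
next_job() {
    printf '%s\n' "$1" |
        awk '$3 == "Pending" { print $2, $1 }' |
        sort -k1,1nr -k2,2n |
        head -n 1 |
        awk '{ print $2 }'
}

next_job "$jobs_table"
```

Job 102 has the highest priority but is already Running, so only the Pending jobs compete for the next slot.
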
Batch Queuing Systems
• Important examples:
  • Portable Batch System
  • TORQUE
  • Xgrid
  • Sun Grid Engine
  • Load Sharing Facility
  • Condor

Portable Batch System (PBS)
• Originally developed for NASA
• Client/server architecture
  • Server: pbs_server
  • Execution daemon on each compute node: pbs_mom
• Works with MPI through environment variables set for the job script (e.g. PBS_NODEFILE)

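The variables PBS sets for a job can be used roughly as follows. This is a sketch under assumptions: the resource request, the `count_slots` helper, and the program name `./my_mpi_app` are made up for illustration, and the `-machinefile` option is assumed to be accepted by the installed mpirun. The launch command is only printed, so the script can be dry-run outside PBS.

```shell
#!/bin/sh
#PBS -N mpi_sketch
#PBS -l nodes=2:ppn=2
# The "#PBS" lines are directives read by qsub; a plain shell sees comments.

# PBS sets these variables inside a running job. The fallbacks allow a
# dry run outside PBS.
workdir=${PBS_O_WORKDIR:-$PWD}
nodefile=${PBS_NODEFILE:-/dev/null}

# count_slots: the node file has one line per allocated processor slot.
count_slots() {
    awk 'END { print NR }' "$1"
}

nprocs=$(count_slots "$nodefile")

cd "$workdir" || exit 1
# ./my_mpi_app is a placeholder; the command is printed, not executed.
echo "mpirun -np $nprocs -machinefile $nodefile ./my_mpi_app"
```
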
PBS Example
    litherum@gras:~$ cat test.sh
    #!/bin/sh
    #testpbs
    echo This is a test
    echo today is `date`
    echo This is `hostname`
    echo The current working directory is `pwd`
    ls -alF /home
    uptime

PBS Example
    litherum@gras:~$ qsub test.sh
    6.gras.carrion.rit.edu
    litherum@gras:~$ qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    6.gras                    test.sh          litherum        00:00:00 C batch
    litherum@gras:~$ cat test.sh.o6
    This is a test
    today is Sat Jan 17 18:20:20 EST 2009
    This is carrion02
    The current working directory is /home/litherum
    total 20
    drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
     18:20:20 up 131 days, 21:20,  0 users,  load average: 0.00, 0.00, 0.00

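Two details of the transcript above can be captured in small helpers: the output file name is the job name plus `.o` and the job's sequence number, and the `S` column of qstat is a one-letter state code. The helper names `stdout_file` and `state_name` are hypothetical, and the letter meanings are the commonly documented ones for PBS/TORQUE; only a subset is listed.

```shell
#!/bin/sh
# stdout_file: name of the file holding a job's standard output,
# following the <job name>.o<sequence number> pattern in the transcript.
stdout_file() {
    job_name=$1
    job_id=$2                # e.g. 6.gras.carrion.rit.edu
    seq=${job_id%%.*}        # keep the part before the first dot
    printf '%s.o%s\n' "$job_name" "$seq"
}

# state_name: expand the one-letter code from qstat's "S" column.
state_name() {
    case $1 in
        Q) echo "Queued" ;;
        R) echo "Running" ;;
        E) echo "Exiting" ;;
        H) echo "Held" ;;
        C) echo "Completed" ;;
        *) echo "Unknown ($1)" ;;
    esac
}

stdout_file test.sh 6.gras.carrion.rit.edu
state_name C
```
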
TORQUE
• Fork of OpenPBS, so the PBS commands carry over
• Typically paired with a scheduler such as Maui or Moab, which adds:
  • Reservations, where you can reserve specific resources for specific times
  • Partitions, where you can partition a cluster into smaller sub-clusters

TORQUE
    litherum@gras:~$ showq
    ACTIVE JOBS--------------------
    JOBNAME  USERNAME  STATE  PROC  REMAINING  STARTTIME

    0 Active Jobs    0 of 4 Processors Active (0.00%)
                     0 of 2 Nodes Active (0.00%)

    IDLE JOBS----------------------
    JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME

    0 Idle Jobs

    BLOCKED JOBS----------------
    JOBNAME  USERNAME  STATE  PROC  WCLIMIT  QUEUETIME

    Total Jobs: 0  Active Jobs: 0  Idle Jobs: 0  Blocked Jobs: 0

Xgrid
• Developed by Apple
• Similar in concept to Condor
• GUI! =)
• Client/server model
Image: http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg

Sun Grid Engine
• Open source, like everything new Sun puts out
• Supports
  • Reservations
  • Job dependencies
  • Checkpointing
  • Multiple scheduling algorithms
• Web interface
• Professional!

Load Sharing Facility
• One of the local schedulers that GRAM can submit to; we’ll talk about GRAM later

Condor
• More about this later, but it implements its own scheduler

Challenging!
• These queuing systems are hard to use
• There may be many systems employed in a given grid
• Wouldn’t it be nice if all this were unified in a single implementation?

Condor
• <Multiple slide on condor> - Andrew

GRAM
• Pluggable!
• Can’t make up their mind how to describe jobs
• Will submit jobs to:
  • Condor
  • LSF
  • PBS/TORQUE
  • ???
• Unified interface, with an identifier that selects which cluster/service to use

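The "one interface, many back ends" idea can be sketched as a single command shape in which only the back-end identifier changes. The `gram_command` helper is hypothetical and only prints the command; the URL is a placeholder, and the factory type values LSF and Condor are assumed to be spelled like the PBS one.

```shell
#!/bin/sh
# gram_command: print a globusrun-ws command line for one local scheduler.
#   $1  factory URL of the GRAM service (site-specific)
#   $2  factory type naming the back end (PBS, LSF, Condor are assumed values)
#   $3  command to run on the resource
gram_command() {
    printf 'globusrun-ws -submit -factory %s -factory-type %s -streaming -job-command %s\n' \
        "$1" "$2" "$3"
}

# Same call shape, different back ends. The URL is a placeholder.
url=https://gram.example.org/wsrf/services/ManagedJobFactoryService
for kind in PBS LSF Condor; do
    gram_command "$url" "$kind" /bin/hostname
done
```
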
GRAM Example
    maxfield@tg-login1:~> globusrun-ws -submit -factory https://tg-login.ornl.teragrid.org:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
    Delegating user credentials...Done.
    Submitting job...Done.
    Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
    Termination time: 01/18/2009 23:57 GMT
    Current job state: Pending
    Current job state: Active
    tg-c15
    Current job state: CleanUp-Hold
    Current job state: CleanUp
    Current job state: Done
    Destroying job...Done.
    Cleaning up any delegated credentials...Done.

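With streaming enabled, the job's own output is interleaved with the state reports, so a script that wants the outcome has to pick out the state lines. A minimal sketch, assuming the "Current job state:" prefix shown in the transcript; the `final_state` helper is hypothetical.

```shell
#!/bin/sh
# final_state: read streamed globusrun-ws output on stdin and print the
# last reported job state.
final_state() {
    sed -n 's/^Current job state: //p' | tail -n 1
}

# Sample lines copied from the transcript; "tg-c15" is the job's output.
final_state <<'EOF'
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
EOF
```
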
UNICORE
• <couple slides on UNICORE>

Condor-G
• Something about condor-G?
• Transition into upperware?

Upperware
• Talk about motivation for upperware applications

GridShell