OGF 19 Condor Software Forum: Condor-G
What Is It? • Condor-G is a specialization of Condor. It is also known as the “grid universe”. • Condor-G speaks many different job management protocols. • Condor-G benefits from all the wonderful Condor features, like a real job queue.
Grid Fault-Tolerance • Condor-G does whatever it takes to run your jobs, even if … • Your local machine crashes • The grid service is temporarily unavailable • The network goes down
Remote Resource Access: Globus [Diagram: “globusrun myjob …” in Organization A reaches the Globus JobManager in Organization B over the Globus GRAM protocol; the JobManager side is labeled fork().]
Globus + Condor [Diagram: “globusrun myjob …” in Organization A reaches the Globus JobManager in Organization B over the Globus GRAM protocol; the JobManager side is labeled “Submit to Condor” and feeds a Condor pool.]
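Condor-G + Globus + Condor [Diagram: Condor-G in Organization A holds myjob1, myjob2, myjob3, myjob4, myjob5, … and talks to the Globus JobManager in Organization B over the Globus GRAM protocol; the JobManager side is labeled “Submit to Condor” and feeds a Condor pool.]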
Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager • Can we contact gatekeeper? Yes: jobmanager crashed. No: retry until we can talk to gatekeeper again… • Can we reconnect to jobmanager? Yes: network was down. No: machine crashed or job completed. • Restart jobmanager • Has job completed? Yes: update queue. No: is job still running?
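The recovery questions on this slide can be read as a small decision procedure. The sketch below is one plausible reading of the flattened flowchart, not Condor-G's actual code: the function name, the callback parameters, the `max_retries` cap, and the ordering of the checks are all assumptions made for illustration.

```python
from typing import Callable, List


def recover_lost_jobmanager(
    can_contact_gatekeeper: Callable[[], bool],
    can_reconnect_jobmanager: Callable[[], bool],
    job_completed: Callable[[], bool],
    max_retries: int = 5,  # hypothetical cap; the slide just says "retry until"
) -> List[str]:
    """Return the recovery steps taken after losing contact with a jobmanager.

    Assumed reading of the slide:
      * gatekeeper reachable right away -> the jobmanager crashed
      * gatekeeper unreachable -> keep retrying; once it answers, try to
        reconnect to the old jobmanager (success means the network was down)
      * otherwise restart the jobmanager and ask whether the job completed
    """
    steps: List[str] = []
    had_to_retry = False
    retries = 0

    while not can_contact_gatekeeper():
        had_to_retry = True
        retries += 1
        steps.append("retry gatekeeper")
        if retries >= max_retries:
            steps.append("gatekeeper still unreachable")
            return steps

    if had_to_retry and can_reconnect_jobmanager():
        steps.append("network was down: reconnected to jobmanager")
        return steps

    # Jobmanager crashed, remote machine crashed, or the job completed.
    steps.append("restart jobmanager")
    if job_completed():
        steps.append("update queue")
    else:
        steps.append("check that job is still running")
    return steps
```

Passing the probes as callables keeps the sketch independent of any real GRAM client; in a test they can be simple lambdas that replay a scripted sequence of answers.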
Just to be fair… • The gatekeeper doesn’t have to submit to a Condor pool. • It could be PBS, LSF, Sun Grid Engine… • Condor-G will work fine whatever the remote batch system is.
Other Condor-G Features • Other Grid Protocols: works with WS-GRAM, NorduGrid, Unicore • Credential Management: pull refreshed credentials from MyProxy; push refreshed credentials to remote systems • Job Scheduling: use matchmaking to select resources for jobs • GlideIn: allows late binding of resources and job checkpoint/migration
Condor-G [Diagram components: Job Description (Job ClassAd), Condor-G, and the back ends GT2 [.1|2|4], GT4, Condor, PBS/LSF, NorduGrid, Unicore; transport labels HTTPS and WSRF.]
Pre-WS GRAM
• Submit file
    grid_resource = gt2 \
        foo.edu/jobmanager-pbs
    globus_rsl = (queue=long)\
        (condor_submit=(universe java))
OGSA GRAM
• Submit file
    grid_resource = gt3 http://foo.edu/ogsa/services/base/gram/PBSManagedJobFactoryService
    globus_rsl = (queue=long)\
        (condor_submit=(universe java))
• Museum mode
WS GRAM
• Submit file
    grid_resource = gt4 foo.edu PBS
    globus_xml = <queue>long</queue>
NorduGrid
• Submit file
    grid_resource = nordugrid foo.edu
    nordugrid_rsl = (queue=long)
Unicore
• Submit file
    grid_resource = unicore usite.org vsite
    keystore_file = keystore
    keystore_passphrase_file = keystore.pw
    keystore_alias = my cert
Condor
• Submit file
    grid_resource = condor schedd.foo.edu \
        cm.foo.edu
    remote_universe = java
PBS
• Submit file
    grid_resource = pbs
LSF
• Submit file
    grid_resource = lsf
Grid Universe Fault-Tolerance: Credential Management • Authentication in many grid protocols is done with limited-lifetime X509 proxies • Proxy may expire before jobs finish executing • Condor can put jobs on hold and email user to refresh proxy • Condor can automatically retrieve new proxies from MyProxy • When the proxy is refreshed, Condor forwards it to the jobs
MyProxy
• Submit file
    MyProxyHost = foo.edu:12345
    MyProxyServerDN = /DC=org/DC=doegrids…
    MyProxyCredentialName = proxy_file
    MyProxyRefreshThreshold = 240 #mins
    MyProxyNewProxyLifetime = 12 #hrs
    MyProxyPassword = password
• Or give password on command line
    condor_submit -p password submit.desc
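The refresh threshold can be pictured as a simple comparison against the proxy's remaining lifetime. The sketch below illustrates that idea with the units given on the slide (threshold in minutes); the function names and the fallback policy of holding the job when no MyProxy server is configured are assumptions drawn from the credential-management bullets, not Condor's actual implementation.

```python
def needs_refresh(seconds_left: float, threshold_minutes: float = 240) -> bool:
    """True once the proxy's remaining lifetime drops to the threshold."""
    return seconds_left <= threshold_minutes * 60


def choose_action(seconds_left: float, myproxy_configured: bool,
                  threshold_minutes: float = 240) -> str:
    """Pick a credential action (hypothetical policy for illustration)."""
    if not needs_refresh(seconds_left, threshold_minutes):
        return "nothing to do"
    if myproxy_configured:
        # Assumed flow: fetch a new proxy, then forward it to the jobs.
        return "retrieve new proxy from MyProxy and forward it"
    return "hold job and email user"
```

With the slide's threshold of 240 minutes, a proxy with three hours left would be refreshed, while one with five hours left would be left alone.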
Condor-G Matchmaking • Use Condor-G matchmaking with grid universe jobs • Allows Condor-G to dynamically assign computing jobs to grid sites • An example of lazy planning
Condor-G Matchmaking, cont.
• Normally a grid universe job must specify the site in the submit description file via the “grid_resource” attribute like so:
    Executable = foo
    Universe = grid
    Grid_Resource = gt2 \
        beak.cs.wisc.edu/jobmanager-pbs
    Queue
Condor-G Matchmaking, cont.
• With matchmaking, grid universe jobs can use requirements and rank:
    Executable = foo
    Universe = grid
    Grid_Resource = $$(ResourceName)
    Requirements = Arch == "LINUX"
    Rank = NumberOfNodes * random()
    Queue
• The $$(x) syntax inserts information from the target ClassAd when a match is made.
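To make the requirements, rank, and $$(x) steps concrete, here is a minimal sketch of the job side of a match. It is a toy model using plain Python dictionaries for ads; the function names are hypothetical, and the real Condor matchmaker evaluates ClassAd expressions and also checks the resource's own Requirements against the job.

```python
import re
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]


def pick_resource(
    requirements: Callable[[Ad], bool],
    rank: Callable[[Ad], float],
    resource_ads: List[Ad],
) -> Optional[Ad]:
    """Keep ads that satisfy the job's requirements, then take the best rank."""
    candidates = [ad for ad in resource_ads if requirements(ad)]
    if not candidates:
        return None
    return max(candidates, key=rank)


def expand_dollar_dollar(template: str, target: Ad) -> str:
    """Replace each $$(Name) with the value of Name in the matched target ad."""
    return re.sub(
        r"\$\$\((\w+)\)",
        lambda m: str(target[m.group(1)]),
        template,
    )
```

A rank such as `NumberOfNodes * random()` could be modeled by passing `lambda ad: ad["NumberOfNodes"] * rng.random()` with a `random.Random` instance, which favors larger sites while still spreading jobs across them.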
Condor-G Matchmaking, cont. • Where do these target ClassAds representing Globus gatekeepers come from? Several options: • A simple script on the gatekeeper publishes an ad via the condor_advertise command-line utility (method used by D0 JIM, USCMS) • A program queries Globus MDS and converts the information into a ClassAd (method used by EDG) • Run HawkEye with appropriate plugins on the gatekeeper • For an explanation of the Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html
Condor-G Matchmaking: Creating the Resource Ad
• Machine Ad
    MyType = "Machine"
    TargetType = "Job"
    Name = "foo.edu"
    Machine = "foo.edu"
    ResourceName = "gt4 foo.edu PBS"
    UpdateSequenceNumber = 4
    Requirements = TARGET.JobUniverse == 9 && \
        CurMatches < 10
    CurMatches = 0
    NumberOfNodes = 300
    Rank = 0.0
    CurrentRank = 0.0
    WantAdRevaluate = True
Condor-G Matchmaking: Creating the Resource Ad
• Advertising a resource
    condor_advertise UPDATE_STARTD_AD \
        ad-file
• Call periodically
• Use unix time for UpdateSequenceNumber
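A periodic publisher needs to regenerate the ad file with a fresh UpdateSequenceNumber before each condor_advertise call. The sketch below shows one way to render such a file from a dictionary; the `Expr` marker class and both function names are hypothetical helpers, and the rendering assumes string values contain no embedded double quotes.

```python
import time
from typing import Dict, Optional


class Expr(str):
    """Marks a value as a raw ClassAd expression (written without quotes)."""


def format_value(value: object) -> str:
    # Check Expr before str and bool before numbers: Expr subclasses str,
    # and bool subclasses int.
    if isinstance(value, Expr):
        return str(value)
    if isinstance(value, bool):
        return "True" if value else "False"
    if isinstance(value, str):
        return '"' + value + '"'  # assumes no embedded double quotes
    return str(value)


def render_resource_ad(attrs: Dict[str, object], now: Optional[int] = None) -> str:
    """Render 'Name = value' lines, stamping UpdateSequenceNumber with unix time."""
    ad = dict(attrs)
    ad["UpdateSequenceNumber"] = int(time.time()) if now is None else now
    return "\n".join(f"{name} = {format_value(value)}" for name, value in ad.items()) + "\n"
```

The rendered text would be written to the ad file and then published with `condor_advertise UPDATE_STARTD_AD ad-file` on whatever schedule the site chooses, for example from cron.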
But Wait, There’s More… • What if you want to run standard universe jobs on grid resources? • For matchmaking and dynamic scheduling of jobs • For job checkpointing and migration • For remote system calls • What if you don’t want to send a job to a site until the moment the job will start running (late binding)?
One Solution: Condor-G GlideIn • You can use the Grid Universe to run Condor daemons on grid resources • When the resources run these GlideIn jobs, they will temporarily join your Condor Pool • You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources
[Diagram components: your workstation running a personal Condor with 600 Condor jobs; glide-in jobs sent through the Globus Grid to LSF, PBS, and Condor sites; a Condor pool and a friendly Condor pool.]
GlideIn Concerns • What if a grid resource kills my GlideIn job? • That resource will disappear from your pool and your jobs will be rescheduled on other machines • Standard universe jobs will resume from their last checkpoint as usual • What if all my jobs are completed before a GlideIn job runs? • If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
Condor [Diagram components: condor_submit, schedd (job caretaker), matchmaker, startd (runs job).]
Condor-G [Diagram components: condor_submit, schedd (job caretaker), gridmanager, gahp, Globus gatekeeper, PBS or LSF.]
Condor-C [Diagram components: condor_submit, schedd (job caretaker), gridmanager, condor-gahp, a second schedd, matchmaker, startd.]
Condor-C to non-Condor [Diagram components: condor_submit, schedd (job caretaker), gridmanager, condor-gahp, a second schedd with its own gridmanager, pbs/lsf-gahp, PBS or LSF.]
Gliding in Condor-C [Diagram components: condor_submit, schedd (job caretaker), gridmanager, gahp, condor-gahp, Globus gatekeeper, a second schedd with its own gridmanager, pbs/lsf-gahp, PBS or LSF. Steps: 1. Glide-in 2. Submit jobs]
Matchmaking with Condor-C • In all of these examples, Condor-C went to a specific remote schedd • This is not required: you can do matchmaking
Matchmaking with Condor-C [Diagram components: condor_submit, schedd (job caretaker), gridmanager, condor-gahp, matchmaker, further schedds (…); arrow label: submit job.]