What’s New in Condor-G
Outline • What is Condor-G • Released New Features • In Development
What Is Condor-G • Use Condor to run jobs on the Grid • Uses Globus Toolkit • GRAM (submit a remote job) • GASS (transfer job’s files) • Two components • Globus Universe • GlideIn
Globus Universe • Run a job on a Grid resource • Features • Job management • Fault tolerance • Credential management • Roughly equivalent to the vanilla universe
How It Works • [Diagram, built up over five slides: on the Condor-G side, a Schedd holding 600 Globus jobs and a GridManager; on the Grid resource, a JobManager, LSF, and the user job]
GlideIn • Run the Condor daemons on Grid resources as user jobs • Create your own personal Condor pool from temporarily-acquired Grid resources • Brings the full power of Condor to the Grid
[Diagram, built up over seven slides: Condor-G with 600 Condor jobs; a Globus Grid containing LSF, PBS, and Condor resources; glide-ins running on those resources]
Released New Features • Stuff we’ve added in the past year • Released and ready for use in Condor 6.6
Globus ASCII Helper Protocol (GAHP) • Encapsulates Globus libraries in separate process • Simple ASCII protocol • Easy for legacy applications to use Globus when they can’t link directly with the libraries
How It Works - GAHP • [Diagram. Components shown under Condor-G: Schedd, GridManager with a GAHP client, GAHP server. Components shown under Grid Resources: three JobManagers]
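The idea of a simple line-oriented helper protocol can be sketched in a few lines of Python. This is an illustrative toy, not the real GAHP command set: the `PING` and `SUBMIT` commands, the `S`/`E` reply prefixes, and both function names are assumptions made for the example. In a real deployment the handler would live in a separate helper process and the client would talk to it over pipes.

```python
import shlex


def handle_request(line, jobs):
    """Toy helper-side handler: parse one ASCII request line, return one reply line.

    `jobs` is a plain dict standing in for the state a real helper
    would keep inside the Globus libraries.
    """
    try:
        parts = shlex.split(line.strip())
    except ValueError:
        return "E malformed-request"
    if not parts:
        return "E empty-request"
    command, args = parts[0].upper(), parts[1:]
    if command == "PING":
        return "S"
    if command == "SUBMIT" and len(args) == 2:
        request_id, executable = args
        jobs[request_id] = executable  # hypothetical: remember the request
        return f"S {request_id}"
    return "E unknown-command"


def format_request(command, *args):
    """Client-side helper: build one request line, quoting arguments as needed."""
    return " ".join([command] + [shlex.quote(a) for a in args])


if __name__ == "__main__":
    jobs = {}
    for request in (format_request("PING"),
                    format_request("SUBMIT", "1", "/bin/sleep"),
                    format_request("CANCEL", "1")):
        print(request, "->", handle_request(request, jobs))
```

Because every request and reply is a single line of text, a legacy application only needs to read and write strings; it never links against the wrapped libraries.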
File Staging • Arbitrary input and output files can be staged to and from execution site • Same syntax as other universes • Limitation • Output files must be explicitly named
File Staging (cont) • Input, Output, and Error can be URLs • Files will be transferred directly to and from execution site • Output and Error can be staged or streamed
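A minimal sketch of what a staged Globus-universe submission could look like, written as a Python helper that renders the submit description text. The keyword names (`globusscheduler`, `transfer_input_files`, `transfer_output_files`) follow the usual Condor submit syntax and should be checked against the manual for your version; the function name, gatekeeper host, and file names are hypothetical.

```python
def render_submit_description(executable, gatekeeper, inputs, outputs):
    """Return the text of a Globus-universe submit description.

    Output files are passed in explicitly because, per the limitation
    noted above, they must be named rather than discovered.
    """
    lines = [
        "universe = globus",
        f"globusscheduler = {gatekeeper}",
        f"executable = {executable}",
    ]
    if inputs:
        lines.append(f"transfer_input_files = {', '.join(inputs)}")
    if outputs:
        lines.append(f"transfer_output_files = {', '.join(outputs)}")
    lines += ["output = job.out", "error = job.err", "queue"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical gatekeeper contact string and file names.
    print(render_submit_description(
        "analyze",
        "gatekeeper.example.edu/jobmanager-lsf",
        ["input.dat"],
        ["result.dat"],
    ))
```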
Credential Refresh • Renewed credentials are used by Condor-G and forwarded to the execution site automatically • No processes need to be restarted
Better Credential Management • One GridManager process can handle multiple credential files with same subject • More efficient when you want to have different credential lifetimes for different jobs
Grid Match-Making • Globus jobs matched with Globus resources by the Condor match-maker using ClassAds • Current limitation • User/admin must create resource ads
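The matching step can be illustrated with a toy matchmaker. This is only a sketch of the idea, not the Condor match-maker: resource ads are plain dicts written by hand (as the limitation above requires), and the job's Requirements and Rank are Python callables rather than ClassAd expressions. The attribute names mirror those in the job ad example later in this talk; the hosts and values are hypothetical.

```python
def match(job, resources):
    """Return the resource ad that satisfies the job's Requirements
    and has the highest Rank, or None if nothing qualifies."""
    candidates = [ad for ad in resources if job["Requirements"](ad)]
    if not candidates:
        return None
    return max(candidates, key=job["Rank"])


# Hand-written resource ads (hypothetical sites).
resources = [
    {"gatekeeper_url": "site-a.example.org/jobmanager-lsf", "OpSys": "LINUX", "Mflops": 400},
    {"gatekeeper_url": "site-b.example.org/jobmanager-pbs", "OpSys": "LINUX", "Mflops": 900},
    {"gatekeeper_url": "site-c.example.org/jobmanager-condor", "OpSys": "SOLARIS", "Mflops": 1200},
]

job = {
    "Requirements": lambda ad: ad["OpSys"] == "LINUX",
    "Rank": lambda ad: ad["Mflops"],
}

if __name__ == "__main__":
    best = match(job, resources)
    print(best["gatekeeper_url"] if best else "no match")
```

The fastest site overall is skipped here because it fails Requirements; Rank only orders the resources that already qualify.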
Fault Tolerance • Condor-G does its best to automatically recover from failures • User can guide decisions with job policy expressions • PeriodicRelease • GlobusResubmit • Rematch
PeriodicRelease Expression • Condor-G puts problematic jobs on hold • This expression tells Condor-G when to release and retry such jobs
GlobusResubmit Expression • Tells Condor-G when a problematic job submission should be abandoned • When this expression becomes true • Best effort is made to clean up current job submission • New job submission is attempted
Rematch Expression • Tells Condor-G when a problematic resource should be abandoned • Evaluated when GlobusResubmit evaluates to true • When this expression becomes true • Best effort is made to clean up current job submission • Job is rematched
Job Ad Example
    GlobusContactString = TARGET.gatekeeper_url
    Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
    Rank = TARGET.Mflops
    PeriodicRelease = ((NumMatches < 10) && ((CurrentTime-EnteredCurrentStatus) > 600))
    GlobusResubmit = NumSystemHolds >= NumMatches
    Rematch = True
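To make the three policy expressions in this example concrete, here is a small Python sketch that evaluates them against a job's attributes and picks an action. The expressions are transcribed from the job ad above; the order in which they are consulted (held jobs first, then GlobusResubmit, then Rematch) and the action names are assumptions for illustration, not a description of Condor-G internals.

```python
def evaluate_policy(job, now):
    """Evaluate the example's three expressions for one job.

    `job` is a dict of job ad attributes; `now` stands in for CurrentTime.
    """
    periodic_release = (job["NumMatches"] < 10
                        and (now - job["EnteredCurrentStatus"]) > 600)
    globus_resubmit = job["NumSystemHolds"] >= job["NumMatches"]
    rematch = True  # the example sets Rematch = True unconditionally
    return periodic_release, globus_resubmit, rematch


def next_action(job, now, held):
    """Pick an action label (hypothetical names) from the policy results."""
    release, resubmit, rematch = evaluate_policy(job, now)
    if held:
        return "release" if release else "stay-held"
    if resubmit:
        # Rematch is only consulted once GlobusResubmit is true.
        return "rematch" if rematch else "resubmit"
    return "continue"


if __name__ == "__main__":
    job = {"NumMatches": 1, "NumSystemHolds": 1, "EnteredCurrentStatus": 1000}
    print(next_action(job, now=1700, held=True))
    print(next_action(job, now=1700, held=False))
```

With these expressions a held job is retried after ten minutes, up to ten matches, and a job that has been put on hold as many times as it has been matched is sent back to the match-maker.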
Hardening • Regular testing on the CMS testbed with real applications • Many bugs and integration issues found and fixed • Hostile Environment
Hostile Environment • Full disks • Machine crashes • File server lock-ups • Network outages • Power outages
One CMS Dataset Run • 300 jobs • Last fall • ~50 (16%) of the jobs stalled and required human recovery • Multiple service restarts (20 daemon crashes over 6 hours) • Now • 0 jobs stalled • 0 service restarts
Integration Work • Dozens of Condor-G improvements and bug fixes • Over 40 Globus “bugzilla” incidents, many with patches • Globus 2.2.4 has 21 “Advisories” as of 4/11/04 • Use latest version of both
Scalability • Submitting several hundred jobs produced high load on server • Machine became unresponsive • We saw a load average of 1000 at one point • Caused by Globus JobManager processes
Grid Manager Monitor Agent • New tool Condor-G can use to reduce this load • Efficient job status polling program • Allows Condor-G to shut down JobManager processes when they’re not needed
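Why a single poller helps can be shown with a short sketch. The slides do not describe how the Grid Monitor is implemented, so everything here is an assumption: `query_all_statuses` is a hypothetical callable that returns the status of every job in one query, and only jobs whose status changed would need a JobManager started again.

```python
def poll_once(known, query_all_statuses):
    """One polling pass: fetch every job's status in a single query and
    return the ids whose status differs from what was last recorded."""
    current = query_all_statuses()
    changed = [job_id for job_id, status in current.items()
               if known.get(job_id) != status]
    known.update(current)  # remember the latest statuses for the next pass
    return changed


if __name__ == "__main__":
    known = {"1": "PENDING", "2": "RUNNING"}
    # Stand-in for a real batch-system query.
    fake_query = lambda: {"1": "RUNNING", "2": "RUNNING"}
    print(poll_once(known, fake_query))
```

One pass costs one query regardless of how many jobs are queued, instead of keeping one long-lived process per job on the server.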
Load Reduced • 400 jobs (/bin/sleep 900) • Without Grid Monitor • 42 hours to complete • Peak load average of 610 • With Grid Monitor • 40 minutes • Peak load average of 104
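For scale, the two runs above can be compared directly. The snippet only restates the figures from the slide as ratios.

```python
# Figures taken from the slide: 400 jobs of /bin/sleep 900.
without_minutes = 42 * 60   # 42 hours without the Grid Monitor
with_minutes = 40           # 40 minutes with the Grid Monitor

speedup = without_minutes / with_minutes
load_ratio = 610 / 104      # peak load average, without vs. with

print(f"{speedup:.0f}x faster, peak load reduced {load_ratio:.1f}x")
```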
Miscellaneous Stuff • Email notification on job completion • Port range restrictions • Problem jobs put on hold
In Development • Stuff we’re currently working on • Will be released sometime in the next year
Job Policy Expressions • PeriodicHold • PeriodicRemove • OnExitHold • OnExitRemove
Improved GlideIn • MDS use optional • User specifies necessary information • Automatic setup • GlideIn job transfers and installs binaries if needed • Binaries can come from submit machine
New Job Types • Submit jobs directly to other schedulers (not through Globus) • Why? • Richer interface semantics • Not supported by Globus
NorduGrid • Grid batch system designed by Nordic countries • Globus GRAM didn’t offer necessary semantics • Client control of file staging • Automatic cleanup of abandoned jobs
Oracle • Oracle DBMS supports a job queue • Run this query in 5 hours • Run this query every Monday • Condor can add more management features
Generic Job Interface • Re-arrange GridManager to allow easy addition of new job types • Define appropriate interface • Plug-ins for new job types?
Globus Toolkit 3.0 • OGSA (Open Grid Services Architecture) • Submit jobs to GT3 sites • Grid Service client interface to Condor-G
Miscellaneous • Condor-G for Windows • MyProxy credential management • URLs for executable, staged files
Thank You! • Questions? • Also… • Condor-G & Globus Q/A session • Wednesday, 9am-12pm, room TBA • E-mail condor-admin@cs.wisc.edu