New Ways to Fetch Work

The new hook infrastructure in Condor 7.1.*

What’s the problem?
• Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system.
  • Specialized scheduling needs
  • Jobs live in their own database or in storage other than a Condor job queue

Fetch vs. push
• Instead of trying to get these jobs into a condor_schedd, or trying to push them to the condor_startd, just get the condor_startd to fetch (pull) the work
• Lower latency than the overhead of matchmaking and the schedd
• Fetching only requires an outbound network connection, which makes life easier if you “glide-in” behind a firewall

What’s the dumb solution?
• Put code directly into the condor_startd that can talk directly to the other scheduling system(s)
  • We’d have to support other protocols
  • We’d have to link even more libraries and dependencies into our code
  • Very inflexible

Another dumb solution…
• “Make it a web service!”
• Mostly the same problems:
  • What protocol?
  • What format to describe the jobs?
  • Add a dependency on libcurl?
  • What if I don’t want a webserver to be handling my jobs?
  • Security? Authentication? Privacy?

Our solution (hopefully not dumb)
• Make a system of “hooks” that you can plug into:
  • A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program
  • The hook invocation points have to be hard-coded into Condor, but then anyone can implement their own hooks to do what they want

Why isn’t that dumb?
• All the logic, code, libraries, etc., to fetch jobs from any given system live completely outside of the Condor source and binaries
• New hooks can be installed without a new version of Condor
• No new library dependencies for us
• Hooks are written by people who know what they’re doing…

How does Condor communicate with hooks?
• Passing around ASCII ClassAds via standard input and standard output
• Some hooks get control data via a command-line argument (argv)
• Hooks can be written in any language (scripts, binaries, whatever you want) so long as you can read/write STDIN/STDOUT
• Decades of UNIX wisdom can’t be wrong!
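To make the STDIN/STDOUT exchange concrete, here is a minimal sketch of how a hook might read ASCII ClassAds. It assumes one `Attr = Value` pair per line and, when two ads arrive on the same stream, a line containing `-----` between them; both are assumptions to check against the Job Hooks manual section. The names `parse_classad` and `read_ads` are hypothetical helpers, and values are kept as raw strings rather than evaluated as ClassAd expressions.

```python
import io

# Assumed delimiter between two ClassAds on the same stream.
AD_SEPARATOR = "-----"


def parse_classad(text):
    """Parse 'Attr = Value' lines into a dict of raw, unevaluated strings."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, so expressions containing '==' survive.
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def read_ads(stream):
    """Split a stream on separator lines and parse each chunk as one ad."""
    chunks, current = [], []
    for line in stream:
        if line.strip() == AD_SEPARATOR:
            chunks.append("".join(current))
            current = []
        else:
            current.append(line)
    chunks.append("".join(current))
    return [parse_classad(chunk) for chunk in chunks if chunk.strip()]


# Demo on in-memory data; a real hook would pass sys.stdin instead.
sample = io.StringIO(
    'Cmd = "/bin/sleep"\nArgs = "60"\n-----\n'
    'Name = "slot1@example"\nState = "Unclaimed"\n'
)
first_ad, second_ad = read_ads(sample)
print(first_ad["Cmd"], second_ad["State"])
```

Keeping values as raw strings avoids reimplementing the ClassAd expression language inside a hook; a hook that needs real evaluation would be better served by a ClassAd library.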

What hooks are available?
• Hooks for fetching work (condor_startd):
  • FETCH_WORK
  • REPLY_FETCH
  • EVICT_CLAIM
• Hooks for running jobs (condor_starter):
  • PREPARE_JOB
  • UPDATE_JOB_INFO
  • JOB_EXIT

HOOK_FETCH_WORK
• Invoked by the startd whenever it wants to try to fetch new work
  • FetchWorkDelay expression controls how long to wait between attempts
• Hook gets a current copy of the slot ClassAd
• Hook prints the job ClassAd to STDOUT
• If STDOUT is empty, there’s no work
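A minimal sketch of the fetch hook's contract as described above: read the slot ClassAd, then either print one job ClassAd or print nothing. The in-memory `pending_jobs` list stands in for whatever external system holds the work, and the attribute set shown (`Cmd`, `Args`, `Owner`, `JobUniverse`) is illustrative only; the attributes the starter actually requires should be taken from the manual. `format_classad` and `fetch_hook` are hypothetical names.

```python
import io


def format_classad(attrs):
    """Render a dict as ASCII ClassAd lines; strings are quoted, numbers are not."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            lines.append(f"{name} = {'true' if value else 'false'}")
        elif isinstance(value, str):
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{name} = "{escaped}"')
        else:
            lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


def fetch_hook(stdin, stdout, pending_jobs):
    """Read the slot ad from stdin; write one job ad to stdout, or nothing."""
    slot_ad_text = stdin.read()  # available for matching; unused in this sketch
    if not pending_jobs:
        return  # empty output means "no work right now"
    stdout.write(format_classad(pending_jobs.pop(0)))


# Demo on in-memory data; a real hook would use sys.stdin and sys.stdout.
jobs = [{"Cmd": "/bin/sleep", "Args": "60", "Owner": "nobody", "JobUniverse": 5}]
out = io.StringIO()
fetch_hook(io.StringIO('Name = "slot1@example"\n'), out, jobs)
print(out.getvalue(), end="")
```

Passing the streams in as parameters keeps the hook logic testable without spawning a process; the script's entry point only needs to forward the real standard streams.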

HOOK_REPLY_FETCH
• Invoked by the startd once it decides what to do with the job ClassAd returned by HOOK_FETCH_WORK
• Gives your external system a chance to know what happened
• argv[1]: “accept” or “reject”
• Gets a copy of slot and job ClassAds
• Condor ignores all output
• Optional hook
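The reply hook only has to look at argv[1] and record the outcome somewhere the external system can see it. The sketch below assumes a hypothetical `ledger` list in place of that system and keeps the ClassAd text unparsed; `reply_hook` is an illustrative name, not part of Condor.

```python
import io


def reply_hook(argv, stdin, ledger):
    """Record the startd's decision; return True if the job was accepted."""
    decision = argv[1] if len(argv) > 1 else ""
    if decision not in ("accept", "reject"):
        raise ValueError(f"unexpected decision: {decision!r}")
    # Keep the raw ads so the external system can identify which job this was.
    ledger.append((decision, stdin.read()))
    return decision == "accept"


# Demo on in-memory data; a real hook would pass sys.argv and sys.stdin.
ledger = []
accepted = reply_hook(["reply_fetch", "reject"], io.StringIO('Cmd = "/bin/sleep"\n'), ledger)
print(accepted, len(ledger))
```

A rejected job is the interesting case: the external system probably wants to make it available for fetching again, since the startd will not run it.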

HOOK_EVICT_CLAIM
• Invoked if the startd has to evict a claim that’s running fetched work
• Informational only: you can’t stop or delay this train once it’s left the station
• STDIN: Both slot and job ClassAds
• STDOUT: > /dev/null

HOOK_PREPARE_JOB
• Invoked by the condor_starter when it first starts up (only if defined)
• Opportunity to prepare the job execution environment
  • Transfer input files, executables, etc.
• INPUT: both slot and job ClassAds
• OUTPUT: ignored, but starter won’t continue until this hook exits
• Not specific to fetched work

HOOK_UPDATE_JOB_INFO
• Periodically invoked by the starter to let you know what’s happening with the job
• INPUT: both ClassAds
• Job ClassAd is updated with additional attributes computed by the starter:
  • ImageSize, JobState, RemoteUserCpu, etc.
• OUTPUT: ignored

HOOK_JOB_EXIT
• Invoked by the starter whenever the job exits for any reason
• argv[1] indicates what happened:
  • “exit”: Died a natural death
  • “evict”: Booted off prematurely by the startd (PREEMPT == TRUE, condor_off, etc.)
  • “remove”: Removed by condor_rm
  • “hold”: Held by condor_hold

HOOK_JOB_EXIT …
• “HUH!?! condor_rm? What are you talking about?”
• The starter hooks can be defined even for regular Condor jobs*, local universe, etc.
• INPUT: copy of the job ClassAd with extra attributes about what happened:
  • ExitCode, JobDuration, etc.
• OUTPUT: Ignored
* Except for dumb exceptions… the schedd doesn’t distinguish rm vs. hold when telling the starter to go away (yet). Argh!
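One way to structure a job-exit hook is a small dispatch on argv[1] plus a scan of the job ClassAd for ExitCode. In this sketch the action names in `ACTIONS` are a hypothetical policy for an external system, not anything Condor defines, and the ad is assumed to arrive as one `Attr = Value` pair per line.

```python
import io

# Hypothetical mapping from the starter's exit reason to an action in the
# external scheduling system; the action names are illustrative only.
ACTIONS = {
    "exit": "mark_done",
    "evict": "requeue",
    "remove": "cancel",
    "hold": "park",
}


def job_exit_hook(argv, stdin):
    """Return (action, exit_code) for one job-exit invocation."""
    reason = argv[1] if len(argv) > 1 else ""
    if reason not in ACTIONS:
        raise ValueError(f"unknown exit reason: {reason!r}")
    exit_code = None
    for line in stdin:
        name, sep, value = line.partition("=")
        if sep and name.strip() == "ExitCode":
            try:
                exit_code = int(value.strip())
            except ValueError:
                exit_code = None  # leave unset if the value is not a plain integer
    return ACTIONS[reason], exit_code


# Demo on in-memory data; a real hook would pass sys.argv and sys.stdin.
ad = io.StringIO('Cmd = "/bin/sleep"\nExitCode = 0\nJobDuration = 60.2\n')
print(job_exit_hook(["job_exit", "exit"], ad))
```

Given the footnote above, a policy that treats “remove” and “hold” very differently may not behave as intended for regular Condor jobs, so it is safer to make those two actions recoverable.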

Defining hooks
• Each slot can have its own hook “keyword”
  • Prefix for config file parameters
  • Can use different sets of hooks to talk to different external systems on each slot
  • Global keyword used when the per-slot keyword is not defined
• Keyword is inserted by the startd into its copy of the job ClassAd and given to the starter

Defining hooks: example
# Most slots fetch work from the database system
STARTD_JOB_HOOK_KEYWORD = DB
# Slot4 fetches and runs work from a web service
SLOT4_JOB_HOOK_KEYWORD = WEB
# The database system needs to both provide work and
# know the reply for each attempted claim
DB_DIR = /usr/local/condor/fetch/db
DB_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
DB_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php
# The web system only needs to fetch work
WEB_DIR = /usr/local/condor/fetch/web
WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php
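The keyword lookup can be modeled in a few lines. This is a sketch of the rule as the slides describe it (per-slot keyword first, global keyword as the fallback, keyword used as the parameter prefix), not Condor's actual implementation. The `config` dict holds already-expanded values, and `hook_path` is a hypothetical helper.

```python
def hook_path(config, slot_id, hook_name):
    """Return the program configured for a hook on a slot, or None if undefined."""
    # A per-slot keyword wins; otherwise fall back to the global keyword.
    keyword = config.get(f"SLOT{slot_id}_JOB_HOOK_KEYWORD") or config.get(
        "STARTD_JOB_HOOK_KEYWORD"
    )
    if not keyword:
        return None
    # The keyword is the prefix of the hook's config parameter name.
    return config.get(f"{keyword}_HOOK_{hook_name}")


# Demo config with macros already expanded (paths follow the slide's example).
config = {
    "STARTD_JOB_HOOK_KEYWORD": "DB",
    "SLOT4_JOB_HOOK_KEYWORD": "WEB",
    "DB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/db/fetch_work.php",
    "DB_HOOK_REPLY_FETCH": "/usr/local/condor/fetch/db/reply_fetch.php",
    "WEB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/web/fetch_work.php",
}
print(hook_path(config, 1, "FETCH_WORK"))
print(hook_path(config, 4, "REPLY_FETCH"))
```

The second lookup returns `None` because the web system defines no reply hook, which matches the idea that undefined hooks are simply not invoked.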

Semantics of fetched jobs
• condor_startd treats them just like any other kind of job:
  • All the standard resource policy expressions apply (START, SUSPEND, PREEMPT, RANK, etc.)
  • Fetched jobs can coexist in the same pool with jobs pushed by Condor, COD, etc.
• Fetched work != Backfill

Semantics continued
• If the startd is unclaimed and fetches a job, a claim is created
• If that job completes, the claim is reused and the startd fetches again
• Keep fetching until either:
  • The claim is evicted by Condor
  • The fetch hook returns no more work
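The claim lifecycle above amounts to a loop with two exit conditions. The following is a simplified model, not the startd's code: `fetch`, `run`, and `is_evicted` are hypothetical callables, eviction is only checked between jobs (the real startd can also evict mid-job), and the wait between fetch attempts is omitted.

```python
def run_claim(fetch, run, is_evicted):
    """Reuse one claim for fetched jobs until eviction or no more work."""
    completed = []
    while not is_evicted():
        job = fetch()
        if not job:
            break  # the fetch hook returned no work: the claim ends
        run(job)
        completed.append(job)
    return completed


# Demo: two queued jobs, no eviction.
queue = ["job-a", "job-b"]
done = run_claim(
    fetch=lambda: queue.pop(0) if queue else None,
    run=lambda job: None,
    is_evicted=lambda: False,
)
print(done)
```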

Limitations for fetched jobs
• No schedd/shadow means no “standard universe” for checkpointing, migration, and remote system calls
  • Could use stand-alone checkpointing
  • Application-specific checkpointing
• Other features that are unavailable:
  • User policy expressions (e.g. periodic hold)
  • DAGMan (you’re on your own)
  • …

Limitations of the hooks
• If the starter can’t run your fetched job because your ClassAd is bogus, no hook is invoked to tell you about it
  • We need a HOOK_STARTER_FAILURE
• No hook when the starter is about to evict you (so you can checkpoint)
  • Can implement this yourself with a wrapper script and the SoftKillSig attribute
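The wrapper-script workaround could look roughly like this on a Unix system. The sketch assumes the job's soft-kill signal has been set to SIGTERM via the SoftKillSig attribute and that the wrapper, not the real program, is what the starter launches; `run_with_checkpoint` and the `checkpoint` callback are hypothetical names, and what "checkpoint" means is entirely application-specific.

```python
import signal
import subprocess
import sys


def run_with_checkpoint(cmd, checkpoint, soft_kill=signal.SIGTERM):
    """Run cmd as a child; on the soft-kill signal, checkpoint and stop it."""
    state = {"child": None}

    def on_soft_kill(signum, frame):
        checkpoint()  # hypothetical callback: save application state
        if state["child"] is not None:
            state["child"].terminate()  # then stop the real job

    # Install the handler before starting the child so no signal is missed.
    previous = signal.signal(soft_kill, on_soft_kill)
    try:
        state["child"] = subprocess.Popen(cmd)
        return state["child"].wait()
    finally:
        signal.signal(soft_kill, previous)  # restore the earlier handler
```

A wrapper script would call something like `run_with_checkpoint(sys.argv[1:], save_state)` and exit with the returned status, where `save_state` is whatever the application provides. The checkpoint has to finish before the starter escalates to a hard kill, so it should be quick.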

More information
• New section in the Condor 7.1 manual:
  • Chapter 4: Miscellaneous Concepts
  • 4.4: Job Hooks
  • http://www.cs.wisc.edu/condor/manual/v7.1/4_4Job_Hooks.html
• Any questions?
