New Ways to Fetch Work

Presentation Transcript


  1. The new hook infrastructure in Condor 7.1.* New Ways to Fetch Work

  2. What’s the problem? • Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system. • Specialized scheduling needs • Jobs live in their own database or other storage than a Condor job queue

  3. Fetch vs. push • Instead of trying to get these jobs into a condor_schedd, or trying to push them to the condor_startd, just get the condor_startd to fetch (pull) the work • Lower latency than the overhead of matchmaking and the schedd • Fetching only requires an outbound network connection, which makes life easier if you “glide-in” behind a firewall

  4. What’s the dumb solution? • Put code directly into the condor_startd that can talk directly to the other scheduling system(s) • We’d have to support other protocols • We’d have to link even more libraries and dependencies into our code • Very inflexible

  5. Another dumb solution… • “Make it a web service!” • Mostly the same problems: • What protocol? • What format to describe the jobs? • Add a dependency on libCurl? • What if I don’t want a webserver to be handling my jobs? • Security? Authentication? Privacy?

  6. Our solution (hopefully not dumb) • Make a system of “hooks” that you can plug into: • A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program • The hook invocation points have to be hard-coded into Condor, but then anyone can implement their own hooks to do what they want

  7. Why isn’t that dumb? • All the logic, code, libraries, etc., to fetch jobs from any given system live completely outside of the Condor source and binaries • New hooks can be installed without a new version of Condor • No new library dependencies for us • Hooks are written by people who know what they’re doing…

  8. How does Condor communicate with hooks? • Passing around ASCII ClassAds via standard input and standard output • Some hooks get control data via a command-line argument (argv) • Hooks can be written in any language (scripts, binaries, whatever you want) so long as you can read/write STDIN/OUT • Decades of UNIX wisdom can’t be wrong!
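
As a rough illustration of the STDIN/STDOUT exchange described on this slide, the sketch below reads and writes ASCII ClassAds as one "Name = Value" pair per line. The helper names parse_classad and format_classad are hypothetical, and values are kept as raw ClassAd expression text rather than evaluated; this is not a full ClassAd parser.

```python
def parse_classad(text):
    """Parse 'Name = Value' lines into a dict of raw value strings.

    Simplification: values are not evaluated, so a string attribute
    keeps its surrounding double quotes.
    """
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split at the first '=' only, so expressions such as
        # (Arch == "X86_64") stay intact in the value.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not an attribute line
        ad[name.strip()] = value.strip()
    return ad


def format_classad(ad):
    """Render a dict of raw value strings back into ASCII ClassAd text."""
    return "".join(f"{name} = {value}\n" for name, value in ad.items())


# Round-trip a small, made-up slot ad.
example = 'Name = "slot1@host"\nCpus = 1\n'
print(format_classad(parse_classad(example)), end="")
```

A hook written this way would read its input with sys.stdin.read() and write any reply with sys.stdout.write().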

  9. What hooks are available? • Hooks for fetching work (condor_startd): • FETCH_WORK • REPLY_FETCH • EVICT_CLAIM • Hooks for running jobs (condor_starter): • PREPARE_JOB • UPDATE_JOB_INFO • JOB_EXIT

  10. HOOK_FETCH_WORK • Invoked by the startd whenever it wants to try to fetch new work • FetchWorkDelay expression • Hook gets a current copy of the slot ClassAd • Hook prints the job ClassAd to STDOUT • If STDOUT is empty, there’s no work
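
A minimal sketch of a fetch hook along these lines is shown below. The in-memory PENDING list is a hypothetical stand-in for the external scheduling system, and the particular job attributes it carries (Cmd, Owner, JobUniverse, Iwd) are an assumption for illustration; the slides do not say which attributes a fetched job must define.

```python
import io

# Hypothetical stand-in for the external system's queue of waiting jobs.
# Values are raw ClassAd text, so strings carry their own quotes.
PENDING = [
    {"Cmd": '"/bin/sleep"', "Owner": '"alice"', "JobUniverse": "5", "Iwd": '"/tmp"'},
]


def fetch_work(stdin, stdout, pending):
    """Read the slot ClassAd and print at most one job ClassAd."""
    slot_ad = stdin.read()  # a real hook could inspect this to pick a suitable job
    if not pending:
        return  # print nothing: empty STDOUT means there is no work
    job = pending.pop(0)
    for name, value in job.items():
        stdout.write(f"{name} = {value}\n")


# Demonstration with in-memory streams instead of the real STDIN/STDOUT.
out = io.StringIO()
fetch_work(io.StringIO('Name = "slot1@host"\n'), out, PENDING)
print(out.getvalue(), end="")
```

Installed as a real hook, the script would call fetch_work(sys.stdin, sys.stdout, ...) and pull jobs from the actual database or service.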

  11. HOOK_REPLY_FETCH • Invoked by the startd once it decides what to do with the job ClassAd returned by HOOK_FETCH_WORK • Gives your external system a chance to know what happened • argv[1]: “accept” or “reject” • Gets a copy of slot and job ClassAds • Condor ignores all output • Optional hook
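
The reply hook can be sketched the same way. In this hedged example the function name reply_fetch and the list used as a log are hypothetical, and the ClassAd text from STDIN is recorded as-is without being split into its slot and job parts.

```python
import io


def reply_fetch(argv, stdin, log):
    """Record whether the startd accepted or rejected the fetched job."""
    decision = argv[1] if len(argv) > 1 else ""
    if decision not in ("accept", "reject"):
        raise ValueError(f"unexpected reply: {decision!r}")
    ads = stdin.read()  # slot and job ClassAds, kept here as opaque text
    log.append((decision, ads))
    # Nothing is printed: Condor ignores this hook's output.
    return decision


# Demonstration with an in-memory stream and a made-up attribute.
log = []
reply_fetch(["reply_fetch", "reject"], io.StringIO('Owner = "alice"\n'), log)
print(log[0][0])
```

An external system might use a "reject" reply to put the job back into its own queue.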

  12. HOOK_EVICT_CLAIM • Invoked if the startd has to evict a claim that’s running fetched work • Informational only: you can’t stop or delay this train once it’s left the station • STDIN: Both slot and job ClassAds • STDOUT: > /dev/null

  13. HOOK_PREPARE_JOB • Invoked by the condor_starter when it first starts up (only if defined) • Opportunity to prepare the job execution environment • Transfer input files, executables, etc. • INPUT: both slot and job ClassAds • OUTPUT: ignored, but starter won’t continue until this hook exits • Not specific to fetched work

  14. HOOK_UPDATE_JOB_INFO • Periodically invoked by the starter to let you know what’s happening with the job • INPUT: both ClassAds • Job ClassAd is updated with additional attributes computed by the starter: • ImageSize, JobState, RemoteUserCpu, etc. • OUTPUT: ignored

  15. HOOK_JOB_EXIT • Invoked by the starter whenever the job exits for any reason • argv[1] indicates what happened: • “exit”: Died a natural death • “evict”: Booted off prematurely by the startd (PREEMPT == TRUE, condor_off, etc.) • “remove”: Removed by condor_rm • “hold”: Held by condor_hold

  16. HOOK_JOB_EXIT … • “HUH!?! condor_rm? What are you talking about?” • The starter hooks can be defined even for regular Condor jobs*, local universe, etc. • INPUT: copy of the job ClassAd with extra attributes about what happened: • ExitCode, JobDuration, etc. • OUTPUT: ignored
      * Except for dumb exceptions… the schedd doesn’t distinguish rm vs. hold when telling the starter to go away (yet). Argh!
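
To make the exit hook's inputs concrete, here is a small sketch that combines the argv[1] reason with the ExitCode and JobDuration attributes from the job ClassAd. The function names and the shape of the returned summary are hypothetical, and the parser is a simplification that assumes one "Name = Value" pair per line.

```python
import io

# The four reasons listed on the slide, with short descriptions.
REASONS = {
    "exit": "job exited on its own",
    "evict": "job was evicted by the startd",
    "remove": "job was removed (condor_rm)",
    "hold": "job was put on hold (condor_hold)",
}


def parse_classad(text):
    """Parse 'Name = Value' lines into a dict of raw value strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def job_exit(argv, stdin):
    """Summarize why and how the job ended, for the external system."""
    reason = argv[1]
    if reason not in REASONS:
        raise ValueError(f"unknown exit reason: {reason!r}")
    ad = parse_classad(stdin.read())
    # ExitCode may be absent, for example when the job was evicted.
    exit_code = int(ad["ExitCode"]) if "ExitCode" in ad else None
    return {"reason": reason, "exit_code": exit_code, "duration": ad.get("JobDuration")}


# Demonstration with made-up attribute values.
summary = job_exit(["job_exit", "exit"], io.StringIO("ExitCode = 0\nJobDuration = 12.5\n"))
print(REASONS[summary["reason"]], summary)
```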

  17. Defining hooks • Each slot can have its own hook “keyword” • Prefix for config file parameters • Can use different sets of hooks to talk to different external systems on each slot • Global keyword used when the per-slot keyword is not defined • Keyword is inserted by the startd into its copy of the job ClassAd and given to the starter

  18. Defining hooks: example
      # Most slots fetch work from the database system
      STARTD_JOB_HOOK_KEYWORD = DATABASE
      # Slot4 fetches and runs work from a web service
      SLOT4_JOB_HOOK_KEYWORD = WEB
      # The database system needs to both provide work and
      # know the reply for each attempted claim
      DB_DIR = /usr/local/condor/fetch/db
      DATABASE_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
      DATABASE_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php
      # The web system only needs to fetch work
      WEB_DIR = /usr/local/condor/fetch/web
      WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php
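
The keyword lookup described on these two slides can be sketched as a small resolver. This is only an illustration of the naming scheme (per-slot keyword, then global keyword, then the keyword used as a parameter prefix); the function hook_path and the plain dict holding already-expanded config values are hypothetical, not how Condor itself stores its configuration.

```python
# Already-expanded config values, modeled on the example above.
CONFIG = {
    "STARTD_JOB_HOOK_KEYWORD": "DATABASE",
    "SLOT4_JOB_HOOK_KEYWORD": "WEB",
    "DATABASE_HOOK_FETCH_WORK": "/usr/local/condor/fetch/db/fetch_work.php",
    "DATABASE_HOOK_REPLY_FETCH": "/usr/local/condor/fetch/db/reply_fetch.php",
    "WEB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/web/fetch_work.php",
}


def hook_path(config, slot_id, hook_name):
    """Return the program configured for a hook on a slot, or None."""
    # The per-slot keyword wins; otherwise fall back to the global one.
    keyword = config.get(f"SLOT{slot_id}_JOB_HOOK_KEYWORD") or config.get(
        "STARTD_JOB_HOOK_KEYWORD"
    )
    if keyword is None:
        return None  # no keyword at all: no hooks for this slot
    # The keyword is the prefix of the hook's config parameter.
    return config.get(f"{keyword}_HOOK_{hook_name}")


for slot in (1, 4):
    print(slot, hook_path(CONFIG, slot, "FETCH_WORK"), hook_path(CONFIG, slot, "REPLY_FETCH"))
```

In this sketch slot 4 resolves to the WEB hooks and has no reply hook, while every other slot falls back to the DATABASE hooks.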

  19. Semantics of fetched jobs • The condor_startd treats them just like any other kind of job: • All the standard resource policy expressions apply (START, SUSPEND, PREEMPT, RANK, etc.) • Fetched jobs can coexist in the same pool with jobs pushed by Condor, COD, etc. • Fetched work != Backfill

  20. Semantics continued • If the startd is unclaimed and fetches a job, a claim is created • If that job completes, the claim is reused and the startd fetches again • Keep fetching until either: • The claim is evicted by Condor • The fetch hook returns no more work
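
The claim lifecycle on this slide amounts to a simple loop, simulated below. The callables fetch, run_job, and evicted are hypothetical placeholders for the fetch hook, job execution, and Condor's eviction decision; the sketch ignores FetchWorkDelay and the resource policy expressions, and it only checks for eviction between jobs.

```python
def run_claim(fetch, run_job, evicted):
    """Keep fetching and running jobs until eviction or no more work."""
    completed = []
    while not evicted():
        job = fetch()
        if not job:
            break  # the fetch hook returned no more work
        run_job(job)
        completed.append(job)  # the claim is reused for the next fetch
    return completed


# Demonstration: three queued jobs, no eviction.
queue = ["job-a", "job-b", "job-c"]
done = run_claim(
    fetch=lambda: queue.pop(0) if queue else "",
    run_job=lambda job: None,
    evicted=lambda: False,
)
print(done)
```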

  21. Limitations for fetched jobs • No schedd/shadow means no “standard universe” for checkpointing, migration, and remote system calls • Could use stand-alone checkpointing • Application-specific checkpointing • Other features that are unavailable: • User policy expressions (e.g. periodic hold) • No DAGMan (you’re on your own) • …

  22. Limitations of the hooks • If the starter can’t run your fetched job because your ClassAd is bogus, no hook is invoked to tell you about it • We need a HOOK_STARTER_FAILURE • No hook when the starter is about to evict you (so you can checkpoint) • Can implement this yourself with a wrapper script and the SoftKillSig attribute

  23. More information • New section in the Condor 7.1 manual: • Chapter 4: Miscellaneous Concepts • 4.4: Job Hooks • http://www.cs.wisc.edu/condor/manual/v7.1/4_4Job_Hooks.html • Any questions?
