Condor Administrator’s How-to

Learn how to manage and optimize the Condor system with this comprehensive online collection of tutorials, guides, and tips. Find answers to common questions and discover advanced techniques for job scheduling and resource allocation.

gmays

Presentation Transcript


  1. Condor Administrator’s How-to

  2. Where to Find the Online How-to Collection
     • Go to http://www.cs.wisc.edu/condor/
     • Click on “Condor Admin How-to Recipes”. Currently, that takes you here: http://nmi.cs.wisc.edu/node/1465
     Dan, Condor Week 2008

  3. Brief Overview of Selected Bits

  4. Question
     • How does Condor decide which job gets to run on an execute machine?

  5. The Life of a Condor Job
     [Diagram labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2; central manager 3 (collector + negotiator); job ClassAd; machine ClassAd; flocking; job runs]

  6. First Stop: Authorization
     • User must be authorized to submit to the schedd:
       ALLOW_WRITE = allow1, allow2, …
       DENY_WRITE = deny1, deny2, …
       Entry format: user@uid_domain/network
     • By default, all authenticated users may submit jobs from within the trusted network:
       ALLOW_WRITE = */network
       HOSTALLOW_WRITE = network (old style)

  7. Next Stop: The Job Queue
     • MAX_JOBS_RUNNING = 200
     • Job priority = integer
       • orders a user’s jobs
       • higher priority will run sooner

  8. Authorization of the Schedd to Join Pool
     • ALLOW_ADVERTISE_SCHEDD / DENY_ADVERTISE_SCHEDD
       • Default: ALLOW/DENY_DAEMON
       • Default: ALLOW/DENY_WRITE
     • COLLECTOR_REQUIREMENTS
       • Default: true

  9. Next Stop: Negotiator Fair Share
     • User priority is inversely proportional to fair share
     • Example: two users, 60 batch slots
       • priority 50 - gets 40 slots
       • priority 100 - gets 20 slots
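
The slot counts in the fair-share example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming each user's share is proportional to 1/priority; the function name and user names are made up for illustration and are not part of Condor.

```python
from fractions import Fraction

def fair_share_slots(priorities, total_slots):
    """Split total_slots so each user's share is proportional to 1/priority."""
    # Exact fractions avoid floating-point rounding in the shares.
    weights = {user: Fraction(1, prio) for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}

# Two users competing for 60 batch slots, as in the slide.
shares = fair_share_slots({"user_a": 50, "user_b": 100}, 60)
print({user: int(slots) for user, slots in shares.items()})
# {'user_a': 40, 'user_b': 20}
```

A lower priority number is better: the user at priority 50 ends up with twice the slots of the user at priority 100.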

  10. Fair Share Dynamics
      • User priority changes over time
        • wants to be equal to number of slots in use
      • Example:
        • User steadily running 100 jobs: priority 100
        • Stops running jobs:
          • 1 day later: priority 50
          • 2 days later: priority 25
      • Configure speed of adjustment:
        PRIORITY_HALFLIFE = 86400
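
The numbers in the example are consistent with an exponential approach toward the current slot usage, halving the gap every PRIORITY_HALFLIFE seconds. The sketch below assumes that simplified model; the function is hypothetical, and it ignores any lower bound or update interval the real accountant may apply.

```python
def priority_after(start_priority, seconds, slots_in_use=0, halflife=86400):
    """Simplified model: the gap to slots_in_use halves every `halflife` seconds."""
    gap = start_priority - slots_in_use
    return slots_in_use + gap * 0.5 ** (seconds / halflife)

# A user at priority 100 stops running jobs (slots_in_use = 0).
print(priority_after(100, 1 * 86400))  # 50.0 after one day
print(priority_after(100, 2 * 86400))  # 25.0 after two days
```

Under this model, a shorter PRIORITY_HALFLIFE makes past usage fade faster, and a longer one makes it count for longer.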

  11. Modified Fair Share
      • User Priority Factor
        • multiplies the “real user priority”
        • result is called “effective user priority”
      • Example:
        condor_userprio -setfactor atlas@hep.wisc.edu 4.0
        condor_userprio -setfactor cms@hep.wisc.edu 1.0
        • atlas steadily uses 10 slots - effective priority 40
        • cms steadily uses 20 slots - effective priority 20
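
The arithmetic behind the two effective priorities is a single multiplication. A minimal sketch, assuming the real priority has settled at the number of slots in steady use (as on the previous slide); the helper name is made up for illustration.

```python
def effective_priority(real_priority, factor):
    """Effective user priority = real user priority * priority factor."""
    return real_priority * factor

# Real priority is assumed equal to the steady slot usage.
users = {
    "atlas@hep.wisc.edu": {"slots": 10, "factor": 4.0},
    "cms@hep.wisc.edu": {"slots": 20, "factor": 1.0},
}
for name, u in users.items():
    print(name, effective_priority(u["slots"], u["factor"]))
# atlas@hep.wisc.edu 40.0
# cms@hep.wisc.edu 20.0
```

Even though atlas uses half as many slots as cms, its factor of 4.0 leaves it with the worse (higher) effective priority.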

  12. Reporting Condor Pool Usage
      % condor_userprio -usage -allusers
      Last Priority Update: 7/30 09:59
                                     Accumulated      Usage             Last
      User Name                      Usage (hrs)    Start Time       Usage Time
      ------------------------------ ----------- ---------------- ----------------
      …
      osg_usatlas1@hep.wisc.edu        599739.09  4/18/2006 14:37  7/30/2007 07:24
      jherschleb@lmcg.wisc.edu         799300.91  4/03/2006 12:56  7/30/2007 09:59
      szhou@lmcg.wisc.edu             1029384.68  4/03/2006 12:56  7/30/2007 09:59
      osg_cmsprod@hep.wisc.edu        2013058.70  4/03/2006 16:54  7/30/2007 09:59
      ------------------------------ ----------- ---------------- ----------------
      Number of users: 271            8517482.95  4/03/2006 12:56  7/29/2007 10:00
      • When upgrading Condor, preserve the central manager’s AccountantLog
        • Happens automatically if you follow the general rule: preserve Condor’s LOCAL_DIR

  13. Matchmaking
      • Job requirements and machine requirements must both be met
      • Machine requirements are configured via the START expression:
        START = Owner == "appinstaller"

  14. Adding to Job Requirements
      APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True

  15. Adding Attribute to Machine ClassAd
      IsAppInstallerMachine = True
      STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine

  16. Choosing Between Matching Machines
      • NEGOTIATOR_PRE_JOB_RANK
      • job rank expression
      • NEGOTIATOR_POST_JOB_RANK
      • PREEMPTION_RANK

  17. Example
      NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)
      • Most desirable to least:
        • 2: unclaimed and not a desktop
        • 1: unclaimed and desktop
        • 0: claimed
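
To see how the expression yields 2, 1, or 0, it can be mimicked in Python. This is a rough sketch, not ClassAd evaluation: machine ads are modeled as plain dictionaries, an undefined attribute is modeled as a missing key, and the machine names are invented.

```python
def pre_job_rank(machine_ad):
    """Mimic the NEGOTIATOR_PRE_JOB_RANK example for one machine ad (a dict)."""
    unclaimed = "RemoteOwner" not in machine_ad            # isUndefined(RemoteOwner)
    not_desktop = machine_ad.get("IsDesktop") is not True  # IsDesktop =!= True
    return int(not_desktop and unclaimed) + int(unclaimed)

machines = {
    "claimed_server": {"RemoteOwner": "someone@example.org"},
    "idle_desktop": {"IsDesktop": True},
    "idle_server": {},
}
# Highest rank first: the most desirable machine for the job.
ranked = sorted(machines, key=lambda name: pre_job_rank(machines[name]), reverse=True)
print(ranked)  # ['idle_server', 'idle_desktop', 'claimed_server']
```

A machine that does not define IsDesktop at all counts as "not a desktop" here, which matches the use of =!= rather than != in the slide's expression.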

  18. Authorizing Schedd to Claim Startd
      • ALLOW/DENY_WRITE
      • It is the schedd which is authorized by the startd, not the user.

  19. Preemption

  20. Machine Rank
      • Numerical expression:
        • higher number preempts lower number
        • user priority is secondary to rank, because a higher-rank job preempts the claim to the machine
      • Example:
        • CMS gets 1st priority, CDF gets 2nd, others 3rd
        RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")
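
The RANK example maps each job to a number, and the machine prefers the larger number. A small Python sketch of that mapping follows; the job ads are plain dictionaries, the `preempts` helper is hypothetical, and ClassAd details such as undefined values are ignored.

```python
def machine_rank(job_ad):
    """Mimic RANK = 2*(User == cms) + 1*(User == cdf) for a job ad (a dict)."""
    user = job_ad.get("User")
    return 2 * (user == "cms@hep.wisc.edu") + 1 * (user == "cdf@hep.wisc.edu")

def preempts(candidate_ad, running_ad):
    """A candidate job preempts the running job only if its rank is strictly higher."""
    return machine_rank(candidate_ad) > machine_rank(running_ad)

cms = {"User": "cms@hep.wisc.edu"}
cdf = {"User": "cdf@hep.wisc.edu"}
other = {"User": "someone@example.org"}

print(machine_rank(cms), machine_rank(cdf), machine_rank(other))  # 2 1 0
print(preempts(cms, cdf))    # True
print(preempts(other, cdf))  # False
```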

  21. Another Rank Example
      Rank = (Group =?= "LMCG") * (1000 + RushJob)

  22. Note on Scope of Condor Policies
      • pool-wide scope, for example the negotiator
        • user priorities, factors, etc.
        • preemption policy related to user priority
        • steering jobs via negotiator job rank
      • execute machine/slot scope: startd
        • machine rank, requirements
        • preemption/suspension policy
        • customized machine ClassAd values
      • submit machine scope
        • queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job
      • personal scope
        • environmental configurations: _CONDOR_<config val>=value

  23. Preemption Policy
      • Should Condor jobs yield to non-Condor activity on the machine?
      • Should some types of jobs never be interrupted? After 4 days?
      • Should some jobs immediately preempt others? After 30 minutes?
      • Is suspension more desirable than killing?
      • Can the need for preemption be decreased by steering jobs towards the right machines?

  24. Example Preemption Policy
      When a claim is preempted, do not allow killing of jobs that have been running for less than 4 days.
      MaxJobRetirementTime = 3600 * 24 * 4
      • Applies to all forms of preemption:
        • user priority, machine rank, machine activity, graceful shutdown

  25. Another Preemption Policy
      • Expression can refer to attributes of batch slot and job, so can be highly customized.
        MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
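
The conditional retirement expression works because a boolean multiplies as 1 or 0. The Python sketch below mirrors that arithmetic under two assumptions that are not stated on the slides: OSG_VO is treated as an attribute of the job ad (modeled as a dictionary), and retirement time is counted from when the job started running. Both helper functions are hypothetical.

```python
FOUR_DAYS = 3600 * 24 * 4  # 345600 seconds

def max_job_retirement_time(job_ad):
    """Mimic 3600 * 24 * 4 * (OSG_VO =?= "uscms"): four days for uscms, else 0."""
    return FOUR_DAYS * int(job_ad.get("OSG_VO") == "uscms")

def remaining_retirement(job_ad, seconds_running):
    """Seconds a preempted job may still run (assumes time counts from job start)."""
    return max(0, max_job_retirement_time(job_ad) - seconds_running)

print(max_job_retirement_time({"OSG_VO": "uscms"}))      # 345600
print(max_job_retirement_time({"OSG_VO": "other"}))      # 0
print(remaining_retirement({"OSG_VO": "uscms"}, 86400))  # 259200
```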

  26. More Preemption Controls
      • PREEMPTION_REQUIREMENTS
        • controls user-priority based preemption at the level of the negotiator
      • PREEMPT/SUSPEND
        • controls preemption by machine activity (e.g. keyboard or CPU activity)
      • RANK
        • allows preemption by more desirable jobs

  27. Preemption Policy Pitfall
      • If you disable all forms of preemption, you probably want to limit the lifespan of claims:
        PREEMPTION_REQUIREMENTS = False
        PREEMPT = False
        RANK = 0
        CLAIM_WORKLIFE = 3600
      • Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.

  28. What Happens to Preempted Jobs?
      • Back to idle in job queue
      • NumJobStarts >= 1
      • job policy: periodic_hold, periodic_remove
      • admin policy: SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_REMOVE

  29. Back to the Negotiator: Group Accounting

  30. Fair Sharing Between Groups
      • Useful when:
        • multiple user ids belong to same group
        • group’s share of pool is not tied to specific machines
      # Example group settings
      GROUP_NAMES = group_physics, group_chemistry
      GROUP_QUOTA_group_physics = 200
      GROUP_QUOTA_group_chemistry = 100
      GROUP_AUTOREGROUP = True
      GROUP_PRIO_FACTOR_group_physics = 10
      GROUP_PRIO_FACTOR_group_chemistry = 10
      DEFAULT_PRIO_FACTOR = 100

  31. Setting Group Identity
      • The job advertises its own group identity:
        +AccountingGroup = "group_physics.dan"
        (group name: group_physics; group user: dan)
      • Anyone can declare any identity.
      • This is not the Unix/Windows identity the job runs as.
      • It is solely for accounting and prioritization purposes.

  32. Monitoring Usage
      % condor_userprio -usage -allusers
      Last Priority Update: 7/30 09:59
                                       Accumulated      Usage             Last
      User Name                        Usage (hrs)    Start Time       Usage Time
      ------------------------------ ----------- ---------------- ----------------
      …
      group_physics.atlas@hep.wisc.edu   599739.09  4/18/2006 14:37  7/30/2007 07:24
      group_physics.cms@hep.wisc.edu     799300.91  4/03/2006 12:56  7/30/2007 09:59
      group_chemistry.han@che.wisc.edu  1029384.68  4/03/2006 12:56  7/30/2007 09:59
      group_chemistry.ben@che.wisc.edu  2013058.70  4/03/2006 16:54  7/30/2007 09:59
      ------------------------------ ----------- ---------------- ----------------
      Number of users: 271              8517482.95  4/03/2006 12:56  7/29/2007 10:00
      % condor_userprio -all -allusers

  33. How do groups compete?
      • Group using least share of its quota gets top priority in matchmaking.
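
One way to picture this rule is to sort groups by the fraction of their quota currently in use. The sketch below does exactly that; the quotas come from the earlier example settings, while the usage figures and the function name are invented for illustration.

```python
def group_order(quotas, slots_in_use):
    """Order groups for matchmaking: smallest fraction of quota in use first."""
    return sorted(quotas, key=lambda group: slots_in_use.get(group, 0) / quotas[group])

quotas = {"group_physics": 200, "group_chemistry": 100}
usage = {"group_physics": 150, "group_chemistry": 50}  # hypothetical current usage

# physics is at 75% of its quota, chemistry at 50%, so chemistry goes first.
print(group_order(quotas, usage))  # ['group_chemistry', 'group_physics']
```

Note that physics is using more slots in absolute terms but would still be served first if its usage dropped below half of its larger quota.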

  34. How do users within a group compete?
      • Each group user has its own user priority
      • Fair share between group members is determined by the usual user priority mechanism

  35. May Group Exceed its Quota?
      • Yes, but only if
        GROUP_AUTOREGROUP = True
        OR, if undefined,
        GROUP_AUTOREGROUP_<groupname> = True

  36. When Exceeding Quota, How do Users Compete?
      • All non-group users plus group users trying to exceed their quota compete for remaining machines.
      • The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.
      • Can set default priority factor for all members of group:
        GROUP_PRIO_FACTOR_<groupname> = 10

  37. The End of the Story

  38. The Life of a Condor Job
      [Diagram labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2; central manager 3 (collector + negotiator); job ClassAd; machine ClassAd; flocking; job runs]

  39. Extending the Reach
      • FLOCK_TO = <remote collector>
        • requires bi-directional connectivity
        • in Linux, can use GCB to connect private networks
      • Grid Universe: Globus, Condor-C
      • condor_glidein
      • JobRouter

  40. Trivia
      • What’s the difference?
        IsHighPrioUser = Owner == "dan"
        • RANK = IsHighPrioUser
        • RANK = $(IsHighPrioUser)
      • case 1 needs:
        STARTD_ATTRS = IsHighPrioUser
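
The difference comes down to when the name is resolved: $(IsHighPrioUser) is a configuration macro that is replaced by its text when the configuration is read, while a bare IsHighPrioUser stays in the expression as an attribute reference, which is why case 1 needs the attribute published via STARTD_ATTRS. The toy expander below illustrates only the textual substitution; it is a sketch written for this note, not Condor's configuration parser, and it raises KeyError for undefined macros as a simplification.

```python
import re

def expand_macros(value, config):
    """Replace each $(NAME) in value with the text of config[NAME]."""
    return re.sub(r"\$\((\w+)\)", lambda m: config[m.group(1)], value)

config = {"IsHighPrioUser": 'Owner == "dan"'}

# Case 1: no macro syntax, so the name survives as an attribute reference.
print(expand_macros("IsHighPrioUser", config))     # IsHighPrioUser
# Case 2: the macro is replaced by the expression text itself.
print(expand_macros("$(IsHighPrioUser)", config))  # Owner == "dan"
```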

  41. Where to Find the Online How-to Collection
      • Go to http://www.cs.wisc.edu/condor/
      • Click on “Condor Admin How-to Recipes”. Currently, that takes you here: http://nmi.cs.wisc.edu/node/1465
