Learn how to manage and optimize the Condor system with this comprehensive online collection of tutorials, guides, and tips. Find answers to common questions and discover advanced techniques for job scheduling and resource allocation.
Where to Find the Online How-to Collection
• Go to http://www.cs.wisc.edu/condor/
• Click on “Condor Admin How-to Recipes”
Currently, that takes you here: http://nmi.cs.wisc.edu/node/1465
Dan, Condor Week 2008
Brief Overview of Selected Bits

Question
• How does Condor decide which job gets to run on an execute machine?
The Life of a Condor Job
[Figure: components shown are condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2 and central manager 3 (collector + negotiator), reached by flocking. Arrow labels: job ClassAd, machine ClassAd, job runs.]
First Stop: Authorization
• User must be authorized to submit to schedd
  ALLOW_WRITE = allow1, allow2, …
  DENY_WRITE = deny1, deny2, …
  (each entry: user@uid_domain/network)
• By default, all authenticated users may submit jobs within the trusted network
  ALLOW_WRITE = */network
  HOSTALLOW_WRITE = network (old style)
Next Stop: The Job Queue
• MAX_JOBS_RUNNING = 200
• Job priority = integer
  • orders a user’s jobs
  • higher priority will run sooner
Authorization of the Schedd to Join Pool
• ALLOW_ADVERTISE_SCHEDD
  DENY_ADVERTISE_SCHEDD
  • Default: ALLOW/DENY_DAEMON
  • Default: ALLOW/DENY_WRITE
• COLLECTOR_REQUIREMENTS
  • Default: true
Next Stop: Negotiator
Fair Share
• User priority is inversely proportional to fair share
• Example: two users, 60 batch slots
  • priority 50 - gets 40 slots
  • priority 100 - gets 20 slots
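The arithmetic in the example above can be checked with a short sketch. It assumes only what the slide states, namely that each user's share is proportional to the inverse of their priority. The function name `fair_share_slots` and the user names are hypothetical and not part of Condor.

```python
def fair_share_slots(priorities, total_slots):
    """Split total_slots among users in proportion to 1/priority.

    Illustrative only: a real negotiator also has to consider job demand,
    requirements, and preemption policy.
    """
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}


# Two users sharing 60 batch slots, as in the slide.
shares = fair_share_slots({"user_a": 50, "user_b": 100}, total_slots=60)
for user, slots in shares.items():
    print(user, round(slots))
```

The user with priority 50 ends up with about twice as many slots as the user with priority 100, matching the 40/20 split.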
Fair Share Dynamics
• User priority changes over time
  • wants to be equal to number of slots in use
• Example:
  • User steadily running 100 jobs: priority 100
  • Stops running jobs:
    • 1 day later: priority 50
    • 2 days later: priority 25
• Configure speed of adjustment:
  PRIORITY_HALFLIFE = 86400
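The numbers in this example are consistent with a simple exponential model in which the gap between the current priority and the number of slots in use halves every PRIORITY_HALFLIFE seconds. The sketch below is that model only, under that assumption; the function name `decayed_priority` is hypothetical, and the real accountant may apply details (such as a minimum priority) that are ignored here.

```python
def decayed_priority(start_priority, slots_in_use, elapsed_seconds, halflife=86400):
    """Move the priority toward slots_in_use, halving the gap every halflife seconds."""
    gap = start_priority - slots_in_use
    return slots_in_use + gap * 0.5 ** (elapsed_seconds / halflife)


# A user at priority 100 stops running jobs (0 slots in use).
one_day = decayed_priority(100, 0, 86400)        # 50.0
two_days = decayed_priority(100, 0, 2 * 86400)   # 25.0
print(one_day, two_days)
```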
Modified Fair Share
• User Priority Factor
  • multiplies the “real user priority”
  • result is called “effective user priority”
• Example:
  condor_userprio -setfactor atlas@hep.wisc.edu 4.0
  condor_userprio -setfactor cms@hep.wisc.edu 1.0
  • atlas steadily uses 10 slots - effective priority 40
  • cms steadily uses 20 slots - effective priority 20
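As a quick check of the example above, the effective priority is just the real priority multiplied by the factor. The helper name `effective_priority` is hypothetical; the steady-state real priorities (10 and 20) are taken from the slots in use, as described in the fair share dynamics.

```python
def effective_priority(real_priority, factor):
    """Effective user priority = real user priority * priority factor."""
    return real_priority * factor


atlas = effective_priority(10, 4.0)  # 40.0
cms = effective_priority(20, 1.0)    # 20.0
print(atlas, cms)
```

Because a numerically lower priority earns a larger share, cms keeps the better standing here even though it is using twice as many slots as atlas.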
Reporting Condor Pool Usage
% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                                Accumulated       Usage              Last
User Name                       Usage (hrs)     Start Time        Usage Time
------------------------------  -----------  ----------------  ----------------
…
osg_usatlas1@hep.wisc.edu         599739.09   4/18/2006 14:37   7/30/2007 07:24
jherschleb@lmcg.wisc.edu          799300.91   4/03/2006 12:56   7/30/2007 09:59
szhou@lmcg.wisc.edu              1029384.68   4/03/2006 12:56   7/30/2007 09:59
osg_cmsprod@hep.wisc.edu         2013058.70   4/03/2006 16:54   7/30/2007 09:59
------------------------------  -----------  ----------------  ----------------
Number of users: 271             8517482.95   4/03/2006 12:56   7/29/2007 10:00
• When upgrading Condor, preserve the central manager’s AccountantLog
• Happens automatically if you follow the general rule: preserve Condor’s LOCAL_DIR
Matchmaking
• Job requirements and machine requirements must both be met
• Machine requirements are configured via the START expression
  START = Owner == "appinstaller"

Adding to Job Requirements
APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True

Adding Attribute to Machine ClassAd
IsAppInstallerMachine = True
STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
Choosing Between Matching Machines
• NEGOTIATOR_PRE_JOB_RANK
• job rank expression
• NEGOTIATOR_POST_JOB_RANK
• PREEMPTION_RANK
Example
NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)
• Most desirable to least:
  • 2 unclaimed and not a desktop
  • 1 unclaimed and desktop
  • 0 claimed
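To see why the expression above yields 2, 1, and 0, here is a small sketch that mimics it on plain dictionaries standing in for machine ClassAds. This is not ClassAd evaluation: a missing key plays the role of an undefined attribute, and the function name `pre_job_rank` and the sample ads are made up for illustration.

```python
def pre_job_rank(machine_ad):
    """Mimic the slide's NEGOTIATOR_PRE_JOB_RANK on a dict-based machine ad."""
    # isUndefined(RemoteOwner): the slot has no current claimant.
    unclaimed = "RemoteOwner" not in machine_ad
    # IsDesktop =!= True: true when IsDesktop is False or undefined.
    not_desktop = machine_ad.get("IsDesktop") is not True
    return int(not_desktop and unclaimed) + int(unclaimed)


dedicated_idle = {"Name": "node1"}
desktop_idle = {"Name": "desk1", "IsDesktop": True}
dedicated_claimed = {"Name": "node2", "RemoteOwner": "someone@example.org"}

for ad in (dedicated_idle, desktop_idle, dedicated_claimed):
    print(ad["Name"], pre_job_rank(ad))
```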
Authorizing Schedd to Claim Startd
• ALLOW/DENY_WRITE
• It is the schedd which is authorized by the startd, not the user.

Preemption
Machine Rank
• Numerical expression:
  • higher number preempts lower number
  • user priority is secondary to rank, because a higher-rank job preempts the claim to the machine
• Example:
  • CMS gets 1st priority, CDF gets 2nd, others 3rd
  RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")
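The rank expression above relies on comparisons evaluating to 1 or 0, so it can be traced with a short sketch. The function names `machine_rank` and `preempts` are hypothetical, and the preemption test only captures the "higher number preempts lower number" rule from the slide, not the full startd policy.

```python
def machine_rank(user):
    """Mimic RANK = 2*(User == cms) + 1*(User == cdf) using boolean arithmetic."""
    return 2 * (user == "cms@hep.wisc.edu") + 1 * (user == "cdf@hep.wisc.edu")


def preempts(candidate_user, current_user):
    """A candidate job preempts only if its rank is strictly higher."""
    return machine_rank(candidate_user) > machine_rank(current_user)


print(machine_rank("cms@hep.wisc.edu"))                 # 2
print(machine_rank("cdf@hep.wisc.edu"))                 # 1
print(machine_rank("other@hep.wisc.edu"))               # 0
print(preempts("cms@hep.wisc.edu", "cdf@hep.wisc.edu")) # True
```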
Another Rank Example
Rank = (Group =?= "LMCG") * (1000 + RushJob)
Note on Scope of Condor Policies
• pool-wide scope (example: negotiator)
  • user priorities, factors, etc.
  • preemption policy related to user priority
  • steering jobs via negotiator job rank
• execute machine/slot scope: startd
  • machine rank, requirements
  • preemption/suspension policy
  • customized machine ClassAd values
• submit machine scope
  • queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job
• personal scope
  • environmental configurations: _CONDOR_<config val>=value
Preemption Policy
• Should Condor jobs yield to non-Condor activity on the machine?
• Should some types of jobs never be interrupted? After 4 days?
• Should some jobs immediately preempt others? After 30 minutes?
• Is suspension more desirable than killing?
• Can need for preemption be decreased by steering jobs towards the right machines?
Example Preemption Policy
When a claim is preempted, do not allow killing of jobs that have been running for less than 4 days.
MaxJobRetirementTime = 3600 * 24 * 4
• Applies to all forms of preemption:
  • user priority, machine rank, machine activity, graceful shutdown
Another Preemption Policy
• Expression can refer to attributes of batch slot and job, so can be highly customized.
MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
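The expression above multiplies four days' worth of seconds by a comparison that is either 1 or 0. A rough sketch of the resulting values follows, with a dict standing in for the job attributes; the function name `max_job_retirement_time` and the sample value "osg" are hypothetical.

```python
FOUR_DAYS = 3600 * 24 * 4  # 345600 seconds


def max_job_retirement_time(job_ad):
    """Mimic 3600 * 24 * 4 * (OSG_VO =?= "uscms") on a dict-based job ad.

    A missing OSG_VO key stands in for an undefined attribute, for which
    the =?= comparison is false.
    """
    return FOUR_DAYS * int(job_ad.get("OSG_VO") == "uscms")


print(max_job_retirement_time({"OSG_VO": "uscms"}))  # 345600
print(max_job_retirement_time({"OSG_VO": "osg"}))    # 0
print(max_job_retirement_time({}))                   # 0
```

Only uscms jobs receive the four-day retirement window; every other job gets a retirement time of zero under this expression.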
More Preemption Controls
• PREEMPTION_REQUIREMENTS
  • controls user-priority based preemption at the level of the negotiator
• PREEMPT/SUSPEND
  • controls preemption by machine activity (e.g. keyboard or cpu activity)
• RANK
  • allows preemption by more desirable jobs
Preemption Policy Pitfall
• If you disable all forms of preemption, you probably want to limit the lifespan of claims:
  PREEMPTION_REQUIREMENTS = False
  PREEMPT = False
  RANK = 0
  CLAIM_WORKLIFE = 3600
• Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.
What Happens to Preempted Jobs?
• Back to idle in job queue
• NumJobStarts >= 1
• job policy: periodic_hold, periodic_remove
• admin policy: SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_REMOVE

Back to the Negotiator: Group Accounting
Fair Sharing Between Groups
• Useful when:
  • multiple user ids belong to same group
  • group’s share of pool is not tied to specific machines
# Example group settings
GROUP_NAMES = group_physics, group_chemistry
GROUP_QUOTA_group_physics = 200
GROUP_QUOTA_group_chemistry = 100
GROUP_AUTOREGROUP = True
GROUP_PRIO_FACTOR_group_physics = 10
GROUP_PRIO_FACTOR_group_chemistry = 10
DEFAULT_PRIO_FACTOR = 100
Setting Group Identity
• The job advertises its own group identity:
  +AccountingGroup = "group_physics.dan"
  (group name: group_physics; group user: dan)
• Anyone can declare any identity.
• This is not the unix/windows identity the job runs as.
• It is solely for accounting and prioritization purposes.
Monitoring Usage
% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                                  Accumulated       Usage              Last
User Name                         Usage (hrs)     Start Time        Usage Time
------------------------------    -----------  ----------------  ----------------
…
group_physics.atlas@hep.wisc.edu    599739.09   4/18/2006 14:37   7/30/2007 07:24
group_physics.cms@hep.wisc.edu      799300.91   4/03/2006 12:56   7/30/2007 09:59
group_chemistry.han@che.wisc.edu   1029384.68   4/03/2006 12:56   7/30/2007 09:59
group_chemistry.ben@che.wisc.edu   2013058.70   4/03/2006 16:54   7/30/2007 09:59
------------------------------    -----------  ----------------  ----------------
Number of users: 271               8517482.95   4/03/2006 12:56   7/29/2007 10:00
% condor_userprio -all -allusers
How do groups compete?
• Group using least share of its quota gets top priority in matchmaking.
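One way to picture the rule above is to sort groups by the fraction of their quota currently in use. This is a sketch of that reading only, not the negotiator's actual algorithm; the function name `negotiation_order` and the usage figures are invented for illustration, while the quotas match the example group settings.

```python
def negotiation_order(groups):
    """Order group names by fraction of quota in use, smallest fraction first.

    groups maps a group name to a (slots_in_use, quota) pair.
    """
    return sorted(groups, key=lambda name: groups[name][0] / groups[name][1])


# Hypothetical usage: physics is at 150/200 = 0.75, chemistry at 40/100 = 0.40.
usage = {
    "group_physics": (150, 200),
    "group_chemistry": (40, 100),
}
order = negotiation_order(usage)
print(order)  # ['group_chemistry', 'group_physics']
```

Note that group_physics holds more slots in absolute terms but is further along relative to its quota, so group_chemistry is considered first.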
How do users within a group compete?
• Each group user has its own user priority
• Fair share between group members is determined by the usual user priority mechanism
May a Group Exceed its Quota?
• Yes, but only if
  GROUP_AUTOREGROUP = True
  OR, if undefined,
  GROUP_AUTOREGROUP_<groupname> = True
When Exceeding Quota, How do Users Compete?
• All non-group users plus group users trying to exceed their quota compete for remaining machines.
• The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.
• Can set default priority factor for all members of group:
  GROUP_PRIO_FACTOR_<groupname> = 10

The End of the Story
The Life of a Condor Job
[Figure, repeated from earlier in the talk: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2 and central manager 3 (collector + negotiator), reached by flocking. Arrow labels: job ClassAd, machine ClassAd, job runs.]
Extending the Reach
• FLOCK_TO = <remote collector>
  • requires bi-directional connectivity
  • in Linux, can use GCB to connect private networks
• Grid Universe: Globus, Condor-C
• condor_glidein
• JobRouter
Trivia
• What’s the difference?
  IsHighPrioUser = Owner == "dan"
  • case 1: RANK = IsHighPrioUser
  • case 2: RANK = $(IsHighPrioUser)
• case 1 needs:
  STARTD_ATTRS = IsHighPrioUser