Condor Administration Paradyn-Condor Week, UW Campus, March 2002
Outline • Other sources of Information • User Priorities • Policy Expressions • Life-cycle of a job – submit to complete • Daemons – what they do and require • Startd states and activities • Useful admin commands • Authorization and Authentication • General Security Comments/Worries
Outline, cont. • Installation Layout • Contrib Modules • Walk-thru of UW-Madison’s condor_config files
Other Sources • Condor Manual • Condor Web Site • “Beowulf Cluster Computing with Linux”, edited by Thomas Sterling, MIT Press, published in 2001 • Email to condor-admin@cs.wisc.edu
User Priorities • Command condor_userprio • How it all works • About nice_user • Config file settings: • Priority_Halflife, Default_Prio_Factor, Nice_User_Prio_Factor, Remote_Prio_Factor, Accountant_Local_Domain
Introduction to Condor’s Configuration Files • Condor’s configuration is a concatenation of multiple files, processed in order: definitions in later files overwrite earlier definitions • Layout and purpose of the different files: • Global config file • Other shared files • Local config file • Root config file (optional)
Global Config File • All shared settings across your entire pool • Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the “condor” user • Most settings can be in this file • Only works as a “global” file if it is on a shared file system
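The lookup described on this slide can be sketched in Python. This is only an illustration under stated assumptions: the helper name `find_global_config` is hypothetical, the order of the checks simply follows the order in which the slide lists the locations, and the file name `condor_config` inside the "condor" user's home directory is assumed.

```python
import os


def find_global_config(environ=None, exists=os.path.exists, condor_home="~condor"):
    """Return the first global config file location that applies, or None.

    environ and exists are injectable so the lookup can be tried without
    touching the real environment or file system.
    """
    environ = os.environ if environ is None else environ

    # 1. An explicit pointer in the CONDOR_CONFIG environment variable.
    from_env = environ.get("CONDOR_CONFIG")
    if from_env:
        return from_env

    # 2. The system-wide location, then 3. the condor user's home directory
    #    (the file name in the home directory is an assumption).
    candidates = [
        "/etc/condor/condor_config",
        os.path.join(os.path.expanduser(condor_home), "condor_config"),
    ]
    for path in candidates:
        if exists(path):
            return path
    return None


# Example with a hypothetical path supplied through the environment.
chosen = find_global_config({"CONDOR_CONFIG": "/opt/condor/etc/condor_config"})
```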
Other shared files • You can configure a number of other shared config files: • files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later) • platform-specific config files
Local config file • Any machine-specific settings • local policy settings for a given owner • different daemons to run (for example, on the Central Manager!) • Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname
Root config file (optional) • You can specify a “root” config file, which is always processed after all other files • This allows root to specify certain settings which cannot be changed by another user (like the path to the Condor daemons) • Only useful if daemons are started as root but someone else has access to edit Condor’s config files
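The effect of processing the files in order (global, then local, then the optional root file) can be sketched as a simple merge in which later definitions win. This is a minimal illustration, not Condor's loader: `merge_config_layers` is a hypothetical helper, and the host name and paths are made-up sample values.

```python
def merge_config_layers(*layers):
    """Merge config layers in processing order; later layers overwrite earlier ones.

    Each layer is a dict of setting name -> value. Names are folded to
    upper case because setting names are case insensitive.
    """
    merged = {}
    for layer in layers:
        for name, value in layer.items():
            merged[name.upper()] = value
    return merged


# Hypothetical contents of each file, listed in the order they are processed.
global_cfg = {"CONDOR_HOST": "cm.example.edu", "START": "True",
              "SBIN": "/usr/local/condor/sbin"}
local_cfg = {"start": "KeyboardIdle > 300", "SBIN": "/home/someuser/sbin"}
root_cfg = {"SBIN": "/usr/local/condor/sbin"}  # processed last, so it wins

effective = merge_config_layers(global_cfg, local_cfg, root_cfg)
```

Here the local file changes START for one machine, while the root file restores the daemon path regardless of what the earlier files said.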
Basic syntax • # is a comment • A “\” at the end of a line is a line-continuation, so both lines are treated as one big entry • All names are case insensitive • “Macros” have the form: • Attribute_Name = value • You reference other macros with: • A = $(B)
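The syntax rules above are small enough to sketch as a toy parser. This is an illustration under stated assumptions, not Condor's own parser: `parse_config` and `expand` are hypothetical names, and edge cases (for example a comment in the middle of a continued line) are ignored.

```python
import re

_REFERENCE = re.compile(r"\$\(([^)]+)\)")


def parse_config(text):
    """Parse '#' comments, trailing-backslash continuations, and NAME = value macros."""
    macros = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment line
        if line.endswith("\\"):
            pending += line[:-1].strip() + " "  # join with the next line
            continue
        line = (pending + line).strip()
        pending = ""
        if "=" not in line:
            continue  # blank or unrecognized line
        name, _, value = line.partition("=")
        macros[name.strip().upper()] = value.strip()  # names are case insensitive
    return macros


def expand(macros, name, depth=0):
    """Return the named macro with every $(OTHER) reference substituted."""
    if depth > 20:
        raise ValueError("macro reference loop")
    value = macros[name.upper()]
    return _REFERENCE.sub(lambda m: expand(macros, m.group(1), depth + 1), value)


sample = """
# thresholds
HighLoad = 0.5
cpu_busy = (LoadAvg >= $(HIGHLOAD))
HOSTALLOW_ADMINISTRATOR = hostA, \\
    hostB
"""

macros = parse_config(sample)
print(expand(macros, "Cpu_Busy"))
```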
Policy Configuration I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes. (Boss Fat Cat)
The Machine (Startd) Policy Expressions • START - When is this machine willing to start a job • RANK - Job preferences • SUSPEND - When to suspend a job • CONTINUE - When to continue a suspended job • PREEMPT - When to nicely stop running a job • KILL - When to immediately kill a preempting job
Frieda’s Current Settings
START = True
RANK =
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
Frieda’s New Settings for the Chemistry nodes
START = True
RANK = Department == "Chemistry"
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
Submit file with Custom Attribute
Executable = charm-run
Universe = standard
+Department = "Chemistry"
queue
What if “Department” not specified?
START = True
RANK = Department =!= UNDEFINED && Department == "Chemistry"
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
Another example
START = True
RANK = Department =!= UNDEFINED && ((Department == "Chemistry")*2 + (Department == "Physics"))
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
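The intended ordering of this RANK example (Chemistry jobs first, Physics jobs second, everything else last) can be sketched in Python. This models the intent only and is not a ClassAd evaluator: the function `rank`, the use of `None` to stand in for UNDEFINED, and the sample job ads are all assumptions made for illustration.

```python
def rank(job_ad):
    """Score a job the way the RANK example intends: Chemistry 2, Physics 1, else 0.

    A job ad is a plain dict here; a missing Department attribute stands in
    for the UNDEFINED case that the =!= test guards against.
    """
    department = job_ad.get("Department")
    if department is None:
        return 0
    return int(department == "Chemistry") * 2 + int(department == "Physics")


# Hypothetical job ads competing for the same machine.
jobs = [
    {"Owner": "alice", "Department": "Physics"},
    {"Owner": "bob"},  # no Department attribute at all
    {"Owner": "carol", "Department": "Chemistry"},
]

# The machine prefers jobs with the higher rank value.
preferred = sorted(jobs, key=rank, reverse=True)
print([job["Owner"] for job in preferred])
```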
Policy Configuration, cont The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle. (Boss Fat Cat)
So Frieda decides she wants the desktops to: • START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low • SUSPEND jobs as soon as activity is detected • PREEMPT jobs if the activity continues for 5 minutes or more • KILL jobs if they take more than 5 minutes to preempt
Macros in the Config File
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad = 0.3
HighLoad = 0.5
KeyboardBusy = (KeyboardIdle < 10)
CPU_Idle = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
Desktop Machine Policy
START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
KILL = $(ActivityTimer) > 300
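To see how the desktop policy behaves, the macros and expressions can be mirrored as plain Python predicates. This is a sketch under stated assumptions: the `MachineState` class and the function names are hypothetical, the sample machine values are invented, and CPU_Idle is assumed to mean that the non-Condor load is at or below BackgroundLoad.

```python
from dataclasses import dataclass

BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5


@dataclass
class MachineState:
    load_avg: float
    condor_load_avg: float
    keyboard_idle: int   # seconds since the last keyboard/mouse activity
    activity: str        # for example "Busy" or "Suspended"
    activity_timer: int  # seconds spent in the current activity


def non_condor_load(m):
    return m.load_avg - m.condor_load_avg


def cpu_idle(m):
    return non_condor_load(m) <= BACKGROUND_LOAD  # assumed definition of CPU_Idle


def machine_busy(m):
    cpu_busy = non_condor_load(m) >= HIGH_LOAD
    keyboard_busy = m.keyboard_idle < 10
    return cpu_busy or keyboard_busy


def start(m):
    return cpu_idle(m) and m.keyboard_idle > 300


def suspend(m):
    return machine_busy(m)


def continue_(m):
    return cpu_idle(m) and m.keyboard_idle > 120


def preempt(m):
    return m.activity == "Suspended" and m.activity_timer > 300


def kill(m):
    return m.activity_timer > 300


# Three invented snapshots of a desktop machine.
idle_desktop = MachineState(0.1, 0.0, 600, "Idle", 0)
owner_returned = MachineState(1.2, 1.0, 2, "Busy", 900)
long_suspended = MachineState(1.6, 0.0, 0, "Suspended", 301)
```

With these snapshots, `start(idle_desktop)` holds, `suspend(owner_returned)` holds because the keyboard was touched two seconds ago, and `preempt(long_suspended)` holds because the job has been suspended for more than five minutes.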
Policy Review • Users submitting jobs can specify Requirements and Rank expressions • Administrators can specify Startd policy expressions individually for each machine (START, SUSPEND, etc.) • Expressions can use any job or machine ClassAd attribute • Custom attributes are easily added • Bottom Line: Enforce almost any policy!
Additional Policy Parameters • WANT_SUSPEND • WANT_VACATE
Road Map of the Policy Expressions [Figure: flowchart connecting START, WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, and KILL through True/False branches, leading to the Vacating and Killing activities; the legend distinguishes expressions from activities]
Negotiator Policy Expressions • PREEMPTION_REQUIREMENTS • PREEMPTION_RANK
Examples:
PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2
PREEMPTION_RANK = (RemoteUserPrio * 1000000) - ImageSize
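The two example expressions can be read as ordinary arithmetic. The sketch below restates them in Python with invented sample numbers; the function names are hypothetical, and HOUR is assumed to be 3600 seconds. In Condor a numerically larger user priority value is a worse priority, so the requirement only permits preemption when the running user's value exceeds the candidate's by more than 20%.

```python
HOUR = 3600  # seconds, assumed value of the $(HOUR) macro


def preemption_allowed(state_timer, remote_user_prio, submittor_prio):
    """Mirror of the PREEMPTION_REQUIREMENTS example."""
    return state_timer > 1 * HOUR and remote_user_prio > submittor_prio * 1.2


def preemption_rank(remote_user_prio, image_size):
    """Mirror of the PREEMPTION_RANK example: priority dominates, image size breaks ties."""
    return remote_user_prio * 1000000 - image_size


# Invented claims: (seconds in current state, running user's priority, image size).
claims = [
    {"name": "vm1", "state_timer": 7200, "prio": 12.5, "image_size": 5000},
    {"name": "vm2", "state_timer": 1800, "prio": 40.0, "image_size": 100},
    {"name": "vm3", "state_timer": 9000, "prio": 12.5, "image_size": 200},
]
candidate_prio = 10.0

eligible = [c for c in claims
            if preemption_allowed(c["state_timer"], c["prio"], candidate_prio)]
best = max(eligible, key=lambda c: preemption_rank(c["prio"], c["image_size"]))
```

In this example vm2 is excluded because it has been in its state for less than an hour, and vm3 is chosen over vm1 because the two have equal priority values and vm3 has the smaller image.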
The Condor Daemons • condor_master (controls everything else) • condor_startd (executing jobs) • condor_starter (helper for starting jobs) • condor_schedd (submitting jobs) • condor_shadow (submit-side helper) • condor_collector (only on Central Manager) • condor_negotiator (only on CM) • You only have to run the daemon(s) for the service(s) you want to provide
condor_master • Starts up all other Condor daemons • If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator • Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
condor_master (cont’d) • Provides access to many remote administration commands: • condor_reconfig • condor_restart, condor_off, condor_on • Default server for many other commands: • condor_config_val, etc. • Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)
condor_startd • Represents a machine to the Condor pool • Enforces the wishes of the machine owner (the owner’s “policy”) • Responsible for starting, suspending, and stopping jobs • Spawns the appropriate condor_starter, depending on the type of job • Provides other administrative commands: (for example, condor_vacate)
condor_starter • Spawned by the condor_startd to handle all the details of starting and managing the job (for example, transferring the job’s binary to the executing machine or sending back exit status) • On SMP machines, you get one condor_starter per CPU • For PVM jobs, the starter also spawns a PVM daemon (condor_pvmd)
condor_schedd • Represents users to the Condor pool • Maintains persistent queue of jobs • Queue is not strictly FIFO (priority based) • Responsible for contacting available machines and spawning waiting jobs • Services most user commands: • condor_submit • condor_rm • condor_q
condor_shadow • Represents the job on the submit machine • Services requests from “standard” jobs for “remote system calls”, including all file I/O • Is responsible for making decisions on behalf of the job (for example, where to store the checkpoint file) • There will be one condor_shadow process running on your submit machine for each currently running Condor job
condor_shadow (cont’d) • The shadow doesn’t put much load on your submit machine: • Almost always blocked waiting for requests from the job or doing I/O • Relatively small memory footprint • Still, you can limit the impact of the shadows on a given submit machine: • They can be started by Condor with a “nice-level” that you configure (renice) • Can put a limit on the total number of shadows running on a machine
condor_collector • Collects information from all other Condor daemons in the pool • Each daemon sends a periodic update called a “ClassAd” to the collector • Services queries for information: • Queries from other Condor daemons • Queries from users (condor_status)
condor_negotiator • Performs “matchmaking” in Condor • Gets information from the collector about all available machines and all idle jobs • Tries to match jobs with machines that will serve them • Both the job and the machine must satisfy each other’s requirements (this is called “2-way matching”) • Handles User Priorities
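The idea of 2-way matching can be sketched with two ads that each carry a requirements test for the other side. This is a simplified illustration, not the negotiator's algorithm: ads are plain dicts, requirements are Python callables, and all attribute values are invented.

```python
def two_way_match(job, machine):
    """A pair matches only if each side's requirements accept the other side's ad."""
    return bool(job["requirements"](machine) and machine["requirements"](job))


# Hypothetical machine ad: only willing to run Chemistry jobs.
machine = {
    "OpSys": "LINUX",
    "Memory": 512,
    "requirements": lambda job: job.get("Department") == "Chemistry",
}

# Hypothetical job ads: both need a Linux machine with at least 256 MB.
def needs_linux(m):
    return m["OpSys"] == "LINUX" and m["Memory"] >= 256

chem_job = {"Department": "Chemistry", "requirements": needs_linux}
physics_job = {"Department": "Physics", "requirements": needs_linux}

print(two_way_match(chem_job, machine), two_way_match(physics_job, machine))
```

The Physics job is satisfied with the machine, but the machine is not satisfied with the job, so no match is made.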
Layout of a General Condor Pool [Figure: a pool of six machines (Central Manager, two Regular Nodes, one Submit-Only node, two Execute-Only nodes), each running a master; the diagram shows the negotiator and collector together with the schedd and startd daemons, and its legend distinguishes spawned processes from ClassAd communication pathways]
Job Startup [Figure: diagram of job startup involving Submit, Schedd, Shadow, Startd, Starter, the Customer Job, and the Condor Syscall Lib]
Machine States [Figure: machine state diagram with a “begin” entry point and the states Owner, Unclaimed, Matched, Claimed, and Preempting]
Viewing things with condor_status • condor_status has lots of different options to display various kinds of info • Supports “-constraint” so you can only view ClassAds that match an expression you specify • Supports “-format” so you can get the data in whatever form you want (very useful for writing scripts) • View any kind of daemon ClassAd
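For scripting, the “-constraint” and “-format” options can be assembled into a command line programmatically. The sketch below only builds the argument list and prints it; it does not run anything. The helper name `condor_status_cmd` is hypothetical, and the constraint and attribute shown are just sample choices.

```python
import shlex


def condor_status_cmd(constraint=None, formats=()):
    """Build an argument list for condor_status.

    formats is a sequence of (printf-style format, attribute name) pairs,
    each of which becomes one "-format FORMAT ATTRIBUTE" group.
    """
    argv = ["condor_status"]
    if constraint:
        argv += ["-constraint", constraint]
    for fmt, attribute in formats:
        argv += ["-format", fmt, attribute]
    return argv


# List the names of machines whose State attribute is "Unclaimed".
cmd = condor_status_cmd('State == "Unclaimed"', [(r"%s\n", "Name")])
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a script
```

The resulting list could be handed to `subprocess.run` on a machine where Condor is installed.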
Viewing things with condor_q • View the job queue • The “-long” option is useful to see the entire ClassAd for a given job • Also supports the “-constraint” option • Can view job queues on remote machines with the “-name” option
Looking at condor_q -analyze • You specify a job or set of jobs you want to analyze • condor_q will try to figure out why the job isn’t running • The output is not as user-friendly as we’d like (though we’re working on it) • Good at finding errors in Requirements expressions set by users
Host/IP Security in Condor • You can configure each machine in your pool to allow or deny certain actions from different groups of machines: • “read” access - querying information • condor_status, condor_q, etc • “write” access - updating information • condor_submit, adding a node to the pool, etc • “administrator” access • condor_on, off, reconfig, restart... • “owner” access • Things a machine owner can do (vacate)
Setting up Host/IP-address Security in Condor (part 1) • To configure, you list what hosts are allowed or denied to perform each action • If you list hosts that are allowed, everything else is denied • If you list hosts that are denied, everything else is allowed • If you list both, only hosts that are listed in “allow” but not in “deny” are allowed
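The allow/deny rules above reduce to a short decision function. This sketch uses exact host-name comparison only and ignores wildcards; `is_authorized` is a hypothetical name, the host names are invented, and treating "neither list configured" as allowed is an assumption, since the slide does not cover that case.

```python
def is_authorized(host, allow=None, deny=None):
    """Apply the allow/deny rules; None means that list is not configured."""
    if deny is not None and host in deny:
        return False          # listed in deny: always refused
    if allow is not None:
        return host in allow  # an allow list exists: everything else is denied
    return True               # only a deny list (or, by assumption, no lists)


allow_only = is_authorized("b.example.edu", allow=["a.example.edu"])
deny_only = is_authorized("b.example.edu", deny=["a.example.edu"])
both = is_authorized("a.example.edu",
                     allow=["a.example.edu", "b.example.edu"],
                     deny=["a.example.edu"])
```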
Setting up Host/IP-address Security in Condor (part 2) • There are many possibilities for specifying which hosts are allowed or denied: • Host names, domain names • IP addresses, subnets • Wildcards • ‘*’ can be used anywhere (once) in a host name (for example, “infn-corsi*.corsi.infn.it”) • ‘*’ can be used at the end of any IP address (e.g. “128.105.101.*” or “128.105.*”)
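A single-wildcard pattern of the kind described above can be matched with a prefix and suffix check. This is a rough sketch, not Condor's matching code: `host_matches` is a hypothetical helper, it assumes at most one ‘*’ per pattern, and it compares names case-insensitively.

```python
def host_matches(pattern, host):
    """Match a host name or IP address against a pattern with at most one '*'."""
    pattern, host = pattern.lower(), host.lower()
    if "*" not in pattern:
        return pattern == host
    prefix, _, suffix = pattern.partition("*")
    return (len(host) >= len(prefix) + len(suffix)
            and host.startswith(prefix)
            and host.endswith(suffix))


checks = [
    host_matches("infn-corsi*.corsi.infn.it", "infn-corsi12.corsi.infn.it"),
    host_matches("128.105.*", "128.105.101.7"),
    host_matches("*.gov", "www.example.edu"),
]
print(checks)
```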
Setting up Host/IP-address Security in Condor (part 3) • Can define values that affect all daemons: • HOSTALLOW_WRITE, HOSTDENY_READ, HOSTALLOW_ADMINISTRATOR, etc. • Can define daemon-specific settings: • HOSTALLOW_READ_SCHEDD, HOSTDENY_WRITE_COLLECTOR, etc. • Write access doesn’t automatically provide read access: you must grant both!
Example Host/IP Security Settings
HOSTALLOW_WRITE = *.infn.it
HOSTALLOW_ADMINISTRATOR = infn-corsi1*, \
    $(CONDOR_HOST), axpb07.bo.infn.it, \
    $(FULL_HOSTNAME)
HOSTDENY_ADMINISTRATOR = infn-corsi15
HOSTDENY_READ = *.gov, *.mil
HOSTDENY_ADMINISTRATOR_NEGOTIATOR = *
New Security Features in v6.3 • AUTHENTICATION_METHODS • Kerberos, GSI (X.509 certs), FS, NTSSPI • Strong Encryption • Demo/BoF in 3397