Upgrading Condor Best Practices


The problem • More frequent releases of Condor • Every six to nine months? • We understand this is a problem for users • We’re willing to help out

Overview • Config file management • Condor testing strategies • Standard Universe issues

Config files • LOCAL_CONFIG_FILE • Used for #include-like behaviour:
    LOCAL_CONFIG_FILE = \
        $(HOSTS), $(GLOBAL), $(POLICY)…

Typical config file
    ## Try to save this much swap space by not starting new shadows.
    ## Specified in megabytes.
    #RESERVED_SWAP = 5
• A commented-out entry shows the default value

Config file editing • Never edit the base condor_config file • Except to specify the local file • Put all edits in a local file • One local file per config type • E.g. for schedds, central managers (CMs), types of execute machines • Can mix and match

Dealing with a new config • Diff base config with your config • Understand new items • Documented in manual version-history • Existing ones rarely change • Usually capacity changes • Almost always, overwriting base file works
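The diff step can be scripted. Below is a minimal sketch, not a Condor tool: it assumes a simplified reading of the config syntax (`NAME = value` lines, `#` comments, trailing `\` continuations, case-insensitive names), and the function names `parse_settings` and `diff_settings` are made up for illustration. It reports which active settings were added, removed, or changed between two base config files.

```python
import re

# Matches "NAME = value" under a simplified reading of the config syntax.
_SETTING = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*)$")


def parse_settings(text):
    """Return {NAME: value} for active (uncommented) settings in a config file."""
    settings = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if not pending and line.lstrip().startswith("#"):
            continue  # comment, including commented-out defaults
        if line.endswith("\\"):
            pending += line[:-1] + " "  # join continuation lines
            continue
        match = _SETTING.match(pending + line)
        pending = ""
        if match:
            name, value = match.groups()
            # Assumption: names compare case-insensitively, whitespace is not significant.
            settings[name.upper()] = " ".join(value.split())
    return settings


def diff_settings(old_text, new_text):
    """Compare two base configs and return (added, removed, changed) names."""
    old, new = parse_settings(old_text), parse_settings(new_text)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed
```

Feeding it the text of the old and new base condor_config gives a short list of names to look up in the manual's version history. Commented-out defaults are skipped on purpose, so a changed default will not show up here; a plain textual diff is still needed for those.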

Managing config files • Centralized management is key • Cfengine, rsync, NFS (!), etc.

Testing new versions

Compatibility Guarantees • No guarantees… • But we try very hard! • Both forward and backward • Especially within one machine • Federation techniques require this

Incremental testing! • Three basic components of Condor: • Central Manager • Submit points • Execute machines • Test each independently

Testing Central Manager • Take advantage of statelessness • Condor HAD can help out here • If it breaks, existing jobs keep running

Testing schedds • Adding a new test schedd is easy • Real test jobs are useful too, not just sleep jobs • Schedd can be a bottleneck • Probably the only place you need to check CPU performance

Testing startds • Easy to test a few at once • Be careful when running standard universe jobs • Glidein can be very helpful • But beware of root-specific issues • Admin slots are helpful

Now that we’ve tested… • Always be undo-able! (never overwrite files) • Rely on the condor_master restarting daemons when the stat info of their binaries changes

Big bang approach • What we do at CS • Just change a symlink to the binaries • Master does the rest… • Can be a big hit on shared filesystems
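The symlink switch can be done so that it is both undo-able and never leaves a missing link. A minimal sketch, assuming POSIX rename semantics and a layout where each release is unpacked into its own directory and a single symlink points at the active one; `switch_release` and the directory names in the usage note are hypothetical.

```python
import os


def switch_release(link_path, new_release_dir):
    """Point link_path at new_release_dir; return the previous target (or None)."""
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(new_release_dir, tmp_link)
    # Renaming over the old symlink replaces the link itself in one step on
    # POSIX systems, so readers never see the path missing.
    os.replace(tmp_link, link_path)
    return previous
```

For example, `old = switch_release("/opt/condor/current", "/opt/condor/new-release")` switches to the new binaries, and `switch_release("/opt/condor/current", old)` undoes it. The old release directory is never touched, which is what keeps the change reversible.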

Incremental restart • First, restart the CM • No jobs lost • Second, reboot the schedd • If the restart happens within 20 minutes, jobs keep running • What about the startds? • Might be OK for standard uni • Work on this coming soon…

Standard Universe • More sensitive to backward compatibility • CheckpointPlatform clarifications:
    condor_qedit -constraint 'LastCheckpointPlatform =?= "LINUX INTEL 2.6.x normal"' LastCheckpointPlatform '"LINUX INTEL 2.6.x normal 0xffffe000"'

Draining old standard universe jobs • Keep a few old startds around • To finish old standard universe jobs • Set START to “JobUniverse == 1” (standard universe jobs only) • Or maybe use RANK… • Only on the old platforms

When to upgrade? • Zeroth law of software engineering • Development series actually pretty stable • We’ll let you know about security issues • Probably don’t need every minor version • Don’t be more than one major stable version behind

In summary… • Keep config files under control • Test each component in isolation • Be aware of standard universe issues

Any questions? • Thank you!
