
Upgrading Condor Best Practices

Best practices for upgrading Condor as releases become more frequent: managing config files, testing new versions, and handling Standard Universe issues. Centralized config management and incremental testing of each component are key. The talk covers compatibility guarantees, compares the big bang approach with incremental restarts, and discusses when to upgrade: follow the zeroth law of software engineering and stay informed about security issues.

kunkel

Presentation Transcript


  1. Upgrading Condor Best Practices

  2. The problem • More frequent releases of Condor • Every six to nine months? • Understand this is a problem for users • We’re willing to help out

  3. Overview • Config file management • Condor testing strategies • Standard Universe issues

  4. Config files • LOCAL_CONFIG_FILE • Used for #include-like behaviour:
       LOCAL_CONFIG_FILE = \
           $(HOSTS), $(GLOBAL), $(POLICY)…

  5. Typical config file
       ## Try to save this much swap space by not starting new shadows.
       ## Specified in megabytes.
       #RESERVED_SWAP = 5
     • The commented-out setting shows the default value

  6. Config file editing • Never edit base condor_config file • Except to specify the local file • Put all edits in a local file • One local file per config type • E.g. for schedds, CMs, types of execute machines • Can mix and match
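
The "one local file per config type, mix and match" advice on slide 6 can be sketched as a small generator for the LOCAL_CONFIG_FILE line. The role names, file paths, and the `local_config_line` helper are hypothetical, and the ordering assumes that files listed later override earlier ones.

```python
# Hypothetical mapping from machine role to local config files; the
# role names and file paths are illustrative, not Condor defaults.
ROLE_FILES = {
    "global": "/etc/condor/local/global.config",
    "central_manager": "/etc/condor/local/cm.config",
    "schedd": "/etc/condor/local/schedd.config",
    "execute": "/etc/condor/local/execute.config",
}


def local_config_line(roles):
    """Return a LOCAL_CONFIG_FILE setting that lists one file per role.

    The global file always comes first so that role-specific files,
    listed later, can override it (assumed precedence).
    """
    ordered = ["global"] + [r for r in roles if r != "global"]
    files = [ROLE_FILES[r] for r in ordered]
    return "LOCAL_CONFIG_FILE = " + ", ".join(files)


if __name__ == "__main__":
    # A machine that both submits and executes jobs mixes two role files.
    print(local_config_line(["schedd", "execute"]))
```

The generated line is the only edit the base condor_config file would need; everything else stays in the per-role files.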

  7. Dealing with a new config • Diff base config with your config • Understand new items • Documented in manual version-history • Existing ones rarely change • Usually capacity changes • Almost always, overwriting base file works
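
Slide 7's "diff base config with your config" step could be narrowed down to just the setting names, so that new items stand out from changed comments. This is a minimal sketch, assuming settings appear as `NAME = value` lines and that commented-out defaults use a single `#` as on slide 5; `EXAMPLE_NEW_KNOB` and the host name are made up for the demo.

```python
import re

# Matches "NAME = value" lines, including commented-out defaults such as
# "#RESERVED_SWAP = 5". Lines starting with "##" are treated as prose.
KNOB_RE = re.compile(r"^\s*#?\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=")


def knob_names(config_text):
    """Collect the names of all settings mentioned in a config file."""
    names = set()
    for line in config_text.splitlines():
        if line.lstrip().startswith("##"):
            continue  # descriptive comment, not a setting
        match = KNOB_RE.match(line)
        if match:
            names.add(match.group(1).upper())
    return names


def compare_configs(old_text, new_text):
    """Return (added, removed) setting names between two base configs."""
    old, new = knob_names(old_text), knob_names(new_text)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    old = ("## Specified in megabytes.\n"
           "#RESERVED_SWAP = 5\n"
           "CONDOR_HOST = cm.example.org\n")
    new = old + "#EXAMPLE_NEW_KNOB = 10\n"  # hypothetical new setting
    added, removed = compare_configs(old, new)
    print("new settings:", added)
    print("dropped settings:", removed)
```

The names it reports as added are the ones worth looking up in the manual's version history before overwriting the base file.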

  8. Managing config files • Centralized management is key • Cfengine, rsync, NFS (!), etc.

  9. Testing new versions

  10. Compatibility Guarantees • No guarantees… • But we try very hard! • Both forward and backward • Especially within one machine • Federation techniques require this

  11. Incremental testing! • Three basic components of Condor: • Central Manager • Submit points • Execute machines • Test each independently

  12. Testing Central Manager • Take advantage of statelessness • Condor HAD can help out here • If it breaks, existing jobs keep running

  13. Testing schedds • Adding a new test schedd is easy • Test jobs are useful too, not just sleep • Schedd can be a bottleneck • Probably the only place you need to check CPU performance

  14. Testing startds • Easy to test a few at once • Be careful when running std uni • Glide in can be very helpful • But beware of root specific issues • Admin slots helpful

  15. Now that we’ve tested… • Always be undo-able! (never overwrite files) • Rely on master restart on stat change

  16. Big bang approach • What we do at CS • Just change a symlink to the binaries • Master does the rest… • Can be a big hit on shared filesystems
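
Slides 15 and 16 combine two ideas: keep every release in its own directory so nothing is overwritten, and switch versions by changing a symlink. One way this might look is sketched below, assuming a POSIX filesystem; the `switch_release` helper and the directory names are hypothetical.

```python
import os
import tempfile


def switch_release(link_path, new_release_dir):
    """Atomically repoint link_path at new_release_dir.

    Returns the previous target (or None) so the change can be undone.
    Nothing is overwritten: each release lives in its own directory.
    """
    previous = os.readlink(link_path) if os.path.islink(link_path) else None
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(new_release_dir, tmp_link)
    os.replace(tmp_link, link_path)  # rename over the old link in one step
    return previous


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for version in ("condor-old", "condor-new"):  # hypothetical names
            os.mkdir(os.path.join(root, version))
        link = os.path.join(root, "condor")
        switch_release(link, os.path.join(root, "condor-old"))
        old = switch_release(link, os.path.join(root, "condor-new"))
        print("now:", os.readlink(link))
        switch_release(link, old)  # roll back to the previous release
        print("rolled back to:", os.readlink(link))
```

Because the old release directory is left untouched, undoing the upgrade is the same operation with the previous target.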

  17. Incremental restart • First, restart CM • No jobs lost • Second, reboot schedd • If the restart happens within 20 minutes, jobs keep running • What about the startds? • Might be OK for standard uni • Work on this coming soon…

  18. Standard Universe • More sensitive to backward compatibility • CheckpointPlatform clarifications:
        condor_qedit -constraint 'LastCheckpointPlatform =?= "LINUX INTEL 2.6.x normal"' \
            LastCheckpointPlatform '"LINUX INTEL 2.6.x normal 0xffffe000"'
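
The nested quoting in the condor_qedit call on slide 18 is easy to get wrong by hand. The sketch below only builds the argument list and prints a shell-quoted form for review; it does not run anything. The `qedit_command` helper is hypothetical and assumes the platform strings contain no double quotes of their own.

```python
import shlex


def qedit_command(old_platform, new_platform):
    """Build the argv for the condor_qedit call shown on slide 18.

    ClassAd string literals need their own double quotes, so the values
    are wrapped in quotes before being passed as single arguments.
    """
    constraint = 'LastCheckpointPlatform =?= "{}"'.format(old_platform)
    new_value = '"{}"'.format(new_platform)
    return ["condor_qedit", "-constraint", constraint,
            "LastCheckpointPlatform", new_value]


if __name__ == "__main__":
    argv = qedit_command("LINUX INTEL 2.6.x normal",
                         "LINUX INTEL 2.6.x normal 0xffffe000")
    # Print a shell-quoted form for review; pass argv to subprocess.run
    # only on a machine where condor_qedit is installed.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the list directly to a process launcher, rather than a single shell string, avoids a second layer of quoting.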

  19. Draining old Std Uni • Keep a few old startds around • To finish old standard uni jobs • Set START to “JobUniverse == 1” • Or maybe RANK… • Only on the old platforms

  20. When to upgrade? • Zeroth law of software engineering • Development series actually pretty stable • We’ll let you know about security issues • Probably don’t need every minor version • Don’t be more than one major stable version behind

  21. In summary… • Keep config files under control • Test each component in isolation • Be aware of standard universe issues

  22. Any questions? • Thank you!
