
Configuration Life-Cycle Management on the TeraGrid

By Ti Leggett. Challenges of Managing Computational Resources: software, hardware, and user needs change rapidly; maintaining uniform resources; handling one-offs; staying current with patches and security updates.


Presentation Transcript


  1. Configuration Life-Cycle Management on the TeraGrid • Ti Leggett

  2. Challenges of Managing Computational Resources • Software, hardware, and user needs change rapidly • Maintaining uniform resources • Handling one-offs • Staying current with patches and security updates • Documenting how and what machines run

  3. Managing Configurations • Unattended OS deployment (Jumpstart, Kickstart, YaST) • Cluster distributions (OSCAR, ROCKS) • Configuration management systems (Cfengine, LCFG, Bcfg2)

  4. UC/ANL Cluster Configuration Management • A microcosm of machine classes • Cluster goals are to maximize availability, predictability, and reliability • Originally used SystemImager to duplicate machines of similar classes • Switched to Bcfg2 in early 2005

  5. Cluster Uniformity • Necessary for the user • Necessary for the administrator • UC/ANL has two compute classes and many management classes running two different OS versions
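One way to make the uniformity requirement on this slide checkable is to fingerprint each node's configuration and compare nodes within the same class. The sketch below is only an illustration under stated assumptions: the node names, class names, configuration format, and the `fingerprint` and `find_drift` helpers are all hypothetical and are not taken from the talk or from Bcfg2.

```python
import hashlib
import json
from collections import Counter, defaultdict


def fingerprint(config: dict) -> str:
    """Return a stable hash of a node's configuration (hypothetical format)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(nodes: dict) -> dict:
    """Map each class to the nodes whose config differs from the class majority.

    `nodes` maps node name -> (class name, config dict).
    """
    by_class = defaultdict(dict)
    for name, (cls, config) in nodes.items():
        by_class[cls][name] = fingerprint(config)

    drift = {}
    for cls, prints in by_class.items():
        # Treat the most common fingerprint in the class as the reference.
        majority, _ = Counter(prints.values()).most_common(1)[0]
        outliers = sorted(n for n, fp in prints.items() if fp != majority)
        if outliers:
            drift[cls] = outliers
    return drift


# Made-up example data: three compute nodes, one of which has drifted.
nodes = {
    "c001": ("compute", {"kernel": "2.6.9", "openssh": "3.9"}),
    "c002": ("compute", {"kernel": "2.6.9", "openssh": "3.9"}),
    "c003": ("compute", {"kernel": "2.6.9", "openssh": "3.8"}),
    "mgmt1": ("management", {"kernel": "2.6.9", "openssh": "3.9"}),
}
print(find_drift(nodes))
```

Using the class majority as the reference is a simplification; a configuration management system would compare each node against the declared specification for its class instead.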

  6. Security • Performing security patches • Auditing cluster status • Updating machines after extended downtime or maintenance • Aiding intrusion detection
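The auditing and patching bullets on this slide amount to comparing what a machine actually runs against what the specification requires. A minimal sketch of that comparison follows. The package names, version numbers, and the `parse_version` and `audit` helpers are hypothetical, and the version parsing handles only plain dotted numbers, which is simpler than real distribution package versions.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted numeric version such as '3.9.1' into (3, 9, 1)."""
    return tuple(int(part) for part in text.split("."))


def audit(required: dict, installed: dict) -> list:
    """List findings for packages that are missing or older than required."""
    findings = []
    for package, minimum in sorted(required.items()):
        current = installed.get(package)
        if current is None:
            findings.append(f"{package}: not installed (need >= {minimum})")
        elif parse_version(current) < parse_version(minimum):
            findings.append(f"{package}: {current} is older than {minimum}")
    return findings


# Made-up specification and node state, for illustration only.
required = {"openssh": "3.9.1", "openssl": "0.9.8"}
installed = {"openssh": "3.9", "openssl": "0.9.8"}

for finding in audit(required, installed):
    print(finding)
```

Running the same audit on a machine that returns from extended downtime would show which updates it missed while it was offline.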

  7. Reusability • Machine failures (disk failures, non-disk failures) • Machine replication • New machines

  8. Specification as Documentation • Dealing with administrator absences • Using version control • Teaching new administrators • Dealing with already running and working machines

  9. Future Work • Reduce dependency on tape backups • Integrate with tools such as Nagios, Nessus, and iptables • Integrate with LDAP

  10. Questions?
