
Large Farm 'Real Life Problems' and their Solutions

Large Farm 'Real Life Problems' and their Solutions. Thorsten Kleinwort, CERN IT/FIO. HEPiX II/2004, BNL. Outline: Farms at the CERN CC: The Tools Framework; The Working Teams; Real Life Use Cases; Collaborations; Summary; Useful Links.

Presentation Transcript


  1. Large Farm 'Real Life Problems' and their Solutions
     Thorsten Kleinwort, CERN IT/FIO
     HEPiX II/2004, BNL

  2. Outline
     Farms at the CERN CC:
     • The Tools Framework
     • The Working Teams
     • Real Life Use Cases
     • Collaborations
     • Summary
     • Useful Links
     Thorsten Kleinwort IT/FIO/FS

  3. The Tools Framework
     • ELFms = Quattor + Lemon + LEAF
     • Quattor:
         • Installation (Kickstart + SWREP)
         • Configuration (CDB + NCM)
         • Management (SPMA + NCM)
     • Lemon:
         • Monitoring
         • Batch system statistics
     • LEAF:
         • State management (SMS)
         • Hardware management (HMS)

  4. The Tools Framework (cont’d)
     • The evolution of the ELFms tools is described in various previous presentations:
         • HEPiX II/2003 (Vancouver):
             • ‘The new Fabric Management Tools in Production at CERN’
         • HEPiX I/2004 (Edinburgh):
             • ‘ELFms, status, deployment’ by German Cancio
             • ‘Lemon Web Monitoring’ by Miroslav Siket
         • CHEP 2004 (Interlaken):
             • ‘Current Status of Fabric Management at CERN’ by German Cancio
         • This HEPiX:
             • ‘Experience in the use of quattor tool suite outside CERN’
     => Progress has been made, improvements are ongoing, and Quattor is increasingly used outside CERN

  5. Tools (cont’d)
     • Other tools [interfacing CDB]:
         • Script: PrepareInstall.pl:
             • Does all necessary steps to prepare a machine install
             • Can run with a list of hosts (for mass installs)
             • Gets all the necessary information from CDB
             • Creates a kickstart file for each node
         • Local script: maintenance:
             • Script to run down a node:
                 • Drains batch nodes
                 • Warns users on interactive nodes
                 • Can execute a configurable script at the end, e.g. reboot
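     The idea behind PrepareInstall.pl (look up each host in the configuration database, then write one install description per node) can be illustrated with a minimal Python sketch. This is not the actual Perl script: the EXAMPLE_CDB dictionary, its field names, the host names, and the template text are all hypothetical placeholders, and the generated text is not real kickstart syntax.

```python
from string import Template

# Hypothetical stand-in for CDB: host name -> install parameters.
# Field names and values are illustrative, not CDB's actual schema.
EXAMPLE_CDB = {
    "node001": {"cluster": "batch", "os": "example-os-1"},
    "node002": {"cluster": "interactive", "os": "example-os-1"},
}

# Placeholder text only; this is NOT real kickstart syntax.
KICKSTART_TEMPLATE = Template(
    "# install description for $host\n"
    "# cluster: $cluster\n"
    "# os: $os\n"
)


def prepare_install(hosts, cdb):
    """Return (files, missing): one generated text per known host,
    plus the list of hosts that have no entry in the database."""
    files = {}
    missing = []
    for host in hosts:
        profile = cdb.get(host)
        if profile is None:
            missing.append(host)  # nothing to generate without a profile
            continue
        files[host] = KICKSTART_TEMPLATE.substitute(host=host, **profile)
    return files, missing


if __name__ == "__main__":
    files, missing = prepare_install(["node001", "node002", "node999"], EXAMPLE_CDB)
    for host, text in files.items():
        print(f"--- {host} ---\n{text}")
    print("not in database:", missing)
```

     Taking a list of hosts rather than a single host mirrors the mass-install use mentioned on the slide; reporting unknown hosts separately is a design choice of this sketch.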

  6. Tools (cont’d)
     • Automated Fabric [LEAF]:
         • State Management System (SMS):
             • Other CDB changes are done by SMS:
                 • Change OS/Cluster
             • Systems have state:
                 • ‘production’ or ‘standby’
         • Hardware Management System (HMS):
             • Workflow to track hardware changes [interfaces CDB]:
                 • New machine arrival
                 • Machine moves
                 • Machine interventions (Vendor calls), retirements
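     A node state of ‘production’ or ‘standby’, combined with cluster changes going through one tool, can be modelled roughly as below. This is a sketch, not the SMS implementation: the Node class, the function names, and in particular the rule that a node must be in standby before its cluster is changed are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PRODUCTION = "production"
    STANDBY = "standby"


@dataclass
class Node:
    name: str
    cluster: str
    state: State = State.STANDBY


def set_state(node, new_state):
    """Set the node state; return True if the state actually changed."""
    changed = node.state is not new_state
    node.state = new_state
    return changed


def move_to_cluster(node, new_cluster):
    """Reassign a node to another cluster.

    Assumption of this sketch: only standby nodes may be moved.
    """
    if node.state is State.PRODUCTION:
        raise ValueError(f"{node.name} is in production; set it to standby first")
    node.cluster = new_cluster


if __name__ == "__main__":
    node = Node("node001", "batch", State.PRODUCTION)
    set_state(node, State.STANDBY)
    move_to_cluster(node, "interactive")
    set_state(node, State.PRODUCTION)
    print(node)
```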

  7. The Working Teams
     “Customers”, Service Manager, SysAdmins, Operator
     • Other groups/teams in CERN-IT, like:
         • DB (ORACLE)
         • GD (LCG)
         • GM (EGEE)
         • Experiments (Data Challenges)
     • Changing requirements
     • New team
     • Now 7 staff, more to hire
     • Running more and more services in the CC
     • Doing most of the install and maintenance work on farm PCs
     • Following up h/w failures (‘Vendor calls’)
     • Farm/Cluster resource planning
     • Writing/improving the procedures/tools
     • Following up on new problems
     • 24/7
     • Alarm display
     • Following procedures:
         • Acting on alarms
         • Open Remedy tickets
         • Email/phone notification
         • Machine reboots

  8. Another Management Tool
     • Remedy:
         • The problem tracking tool in CERN IT
         • Used in different workflows, e.g. by:
             • The Operator to open tickets following up on alarms
             • The Service Managers to ask for machine interventions
             • The SysAdmins to follow up on problems/general issues
         • HMS is implemented as a Remedy workflow as well
         • Recently started to get statistics on hardware failures

  9. Real Life Use Cases
     • Kernel upgrade (on LXBATCH, ~1500 hosts):
         • Put the new software into the repository (SWREP, precaching)
         • Put the new kernel RPM on the nodes: SPMA, with multi-package option (old kernel is still running!)
         • Configure the new kernel version for the cluster in CDB, and run the GRUB NCM component for configuring the node
         • Drain the nodes by disabling new batch jobs (maintenance)

  10. Real Life Use Cases
      • Kernel upgrade (cont’d):
          • Each node reboots when it is drained (could be at any time)
          • The machine comes up with the new kernel and goes back into production immediately
          • Least downtime for each node. Capacity is always available:
              • First reboot instantaneous, last one can be several days later
          • Everything runs automatically; some cleanup has to be done for a few machines (did not shut down, or h/w failure on startup) => caught by the monitoring/alarm
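      Why a drain-then-reboot rollout keeps capacity available can be seen by counting how many nodes are down at the same moment. The sketch below is an illustration under stated assumptions, not the production tooling: drain times and the reboot duration are made-up inputs, and the function name is hypothetical.

```python
def max_nodes_down(drain_times, reboot_hours):
    """Given the time (in hours) at which each node finishes draining,
    and how long one reboot takes, return the largest number of nodes
    that are down simultaneously."""
    events = []
    for t in drain_times:
        events.append((t, +1))                 # node goes down to reboot
        events.append((t + reboot_hours, -1))  # node is back in production
    # At equal timestamps, "back up" (-1) sorts before "goes down" (+1).
    events.sort()
    down = 0
    worst = 0
    for _, delta in events:
        down += delta
        worst = max(worst, down)
    return worst


if __name__ == "__main__":
    # Hypothetical drain times: jobs on different nodes end at different times.
    drain_times = [0.0, 0.1, 5.0, 30.0, 70.0]
    print("first reboot at hour:", min(drain_times))
    print("last reboot at hour:", max(drain_times))
    print("most nodes down at once:", max_nodes_down(drain_times, reboot_hours=1.0))
```

      Because each node reboots at its own drain time, the nodes that are down at any moment are only those whose reboot windows happen to overlap, rather than the whole cluster.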

  11. Real Life Use Cases (cont’d)
      • Configure batch resources (LSF):
          • LSF resources are defined, depending on availability, power and cluster of machines
          • Resources are defined in CDB
          • Configured on the node using NCM
          • The master file is generated from CDB2SQL in a cron job every day (reconfig takes several minutes)
          • Consistency of client/master due to CDB
          • Resource assignments are done in CDB on (sub-)cluster level (template structure)
          • Reassignments of (sub-)clusters in CDB are done with SMS tools
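      Deriving both the node-side and the master-side view from one source is what keeps them consistent. The sketch below illustrates that idea only: the data layout, the resource names, and the rule that a sub-cluster inherits its parent cluster's resources are assumptions, and the output is a plain dictionary, not the LSF master file format.

```python
# Hypothetical single source of truth (stand-in for CDB templates).
CLUSTER_RESOURCES = {
    "batch": {"public"},
    "batch/sub_a": {"experiment_a"},
}
HOST_CLUSTER = {
    "node001": "batch/sub_a",
    "node002": "batch",
}


def resources_for(cluster_path, cluster_resources):
    """Collect resources along the path, so 'batch/sub_a' also gets
    everything defined for 'batch' (an assumption of this sketch)."""
    resources = set()
    parts = cluster_path.split("/")
    for i in range(1, len(parts) + 1):
        resources |= cluster_resources.get("/".join(parts[:i]), set())
    return resources


def master_view(host_cluster, cluster_resources):
    """Per-host resource lists, as a master-side job would generate them."""
    return {
        host: sorted(resources_for(cluster, cluster_resources))
        for host, cluster in sorted(host_cluster.items())
    }


if __name__ == "__main__":
    print(master_view(HOST_CLUSTER, CLUSTER_RESOURCES))
    # Reassigning a host to another (sub-)cluster changes its resources
    # without editing any per-host resource list.
    moved = dict(HOST_CLUSTER, node002="batch/sub_a")
    print(master_view(moved, CLUSTER_RESOURCES))
```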

  12. Real Life Use Cases (cont’d)
      • Emptying the Computer Centre
          • For the refurbishment of the CERN Computer Centre all machines had to be moved, either from one side to the other, or downstairs (vault)
          • ~2000 machines had to be moved
          • Taking the opportunity to add machines to CDB
              • As quattor and non-quattor nodes
          • Batch machines were moved in racks (1 rack = 44 nodes):
              • HMS was used to steer the moves
              • SMS/maintenance to shut down the machines
              • Rename/PrepareInstall to bring machines back
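      Moving machines rack by rack amounts to splitting the host list into groups of 44. The sketch below shows that grouping; the host names are invented, and using 2000 as the number of hosts is only for illustration, since the slide gives ~2000 as the total number of machines moved, not the number of batch machines.

```python
def plan_rack_moves(hosts, rack_size=44):
    """Split a list of hosts into consecutive groups of rack_size.
    The last group may be smaller."""
    if rack_size <= 0:
        raise ValueError("rack_size must be positive")
    return [hosts[i:i + rack_size] for i in range(0, len(hosts), rack_size)]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    hosts = [f"node{i:04d}" for i in range(1, 2001)]
    racks = plan_rack_moves(hosts)
    print("number of rack moves:", len(racks))
    print("size of last group:", len(racks[-1]))
```

      With 2000 hosts and 44 per rack this gives 45 full racks (1980 hosts) plus one group of 20, i.e. 46 moves.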


  14. Real Life Use Cases (cont’d)
      • New h/w arrival => mass installation
          • New machines (~400) arrive at CERN (in bunches of 50–100)
          • Racks have to be prepared:
              • Network equipment
              • Power supply
              • (Console service)
          • Plan machine membership (cluster)
          • Put machine into CDB:
              • h/w type
              • Cluster type/OS

  15. Real Life Use Cases
      • New h/w arrival (cont’d)
          • Physical machine installation (HMS):
              • New DNS entry
          • OS installation: PrepareInstall
          • Installation by the SysAdmin
          • Burn-in test (h/w test, several days to weeks)
          • Follow up on h/w problems with Vendor
          • Add the machines to the alarm display (SURE)
          • Put machines into production
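      The arrival procedure is an ordered sequence of steps per machine, so a simple tracker can refuse steps taken out of order. The sketch below is only an illustration: the step names paraphrase the two slides, the strict ordering is an assumption, and this is not how HMS (a Remedy workflow) is implemented.

```python
# Step names paraphrased from the slides; strict ordering is an assumption.
STEPS = [
    "prepare rack",
    "plan cluster membership",
    "add to CDB",
    "physical installation",
    "OS installation",
    "burn-in test",
    "add to alarm display",
    "production",
]


def advance(progress, host, step):
    """Record that `step` was completed for `host`.

    `progress` maps host name -> list of completed steps. Raises
    ValueError if the step is not the next one in STEPS.
    """
    done = progress.setdefault(host, [])
    expected = STEPS[len(done)] if len(done) < len(STEPS) else None
    if step != expected:
        raise ValueError(f"{host}: expected {expected!r}, got {step!r}")
    done.append(step)
    return done


if __name__ == "__main__":
    progress = {}
    for step in STEPS[:3]:
        advance(progress, "node001", step)
    print(progress["node001"])
```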


  17. Collaborations
      • External ‘Customers’:
          • EGEE, LCG, and other groups at CERN are now using Quattor-managed machines:
              • They benefit from standard, manageable, and reproducible machine setups
              • They are able to (and should learn to) make modifications themselves
      • External sites using Quattor:
          • IN2P3, NIKHEF, UAM Madrid,… are discussing using Quattor or already use it => see Rafael’s talk
      • This helps to enhance the tools:
          • Service nodes (for LCG-2)
          • Having a wider usage
          • Generalizing components

  18. Summary
      • ELFms is deployed in production at CERN
          • Established technology – from Prototype to Production
          • Though enhancements are ongoing
          • Fundamental part of our infrastructure
          • Merged with our existing environment
      • Quattor and Lemon are generic software
          • Used by others inside/outside CERN
          • Hopefully a fruitful collaboration in the future

  19. Useful Links
      • ELFms: http://cern.ch/elfms
      • Quattor: http://quattor.org/
      • Lemon: http://cern.ch/lemon
      • LEAF: http://cern.ch/leaf
      • Previous presentations:
          • HEPiX II/2003 (Vancouver): http://www.triumf.ca/hepix2003
              • ‘The new Fabric Management Tools in Production at CERN’
          • HEPiX I/2004 (Edinburgh): http://www.nesc.ac.uk/esi/events/291/
              • ‘ELFms, status, deployment’ by German Cancio
              • ‘Lemon Web Monitoring’ by Miroslav Siket
          • CHEP 2004 (Interlaken): http://chep2004.web.cern.ch/chep2004/
              • ‘Current Status of Fabric Management at CERN’ by German Cancio

  20. Questions?
