Large Farm 'Real Life Problems' and their Solutions
Thorsten Kleinwort, CERN IT/FIO
HEPiX II/2004, BNL
Outline Farms at the CERN CC: • The Tools Framework • The Working Teams • Real Life Use Cases • Collaborations • Summary • Useful Links Thorsten Kleinwort IT/FIO/FS
The Tools Framework • ELFms = Quattor + Lemon + LEAF • Quattor: • Installation (Kickstart + SWREP) • Configuration (CDB + NCM) • Management (SPMA + NCM) • Lemon: • Monitoring • Batch system statistics • LEAF: • State management (SMS) • Hardware management (HMS)
The Tools Framework (cont’d) • The evolution of the ELFms tools is described in various previous presentations: • HEPiX II/2003 (Vancouver): • ‘The new Fabric Management Tools in Production at CERN’ • HEPiX I/2004 (Edinburgh): • ‘ELFms, status, deployment’ by German Cancio • ‘Lemon Web Monitoring’ by Miroslav Siket • CHEP 2004 (Interlaken): • ‘Current Status of Fabric Management at CERN’ by German Cancio • This HEPiX: • ‘Experience in the use of the quattor tool suite outside CERN’ => Progress has been made, improvements are ongoing, and Quattor is increasingly used outside CERN
Tools (cont’d): • Other tools [interfacing CDB]: • Script: PrepareInstall.pl: • Performs all steps necessary to prepare a machine install • Can run with a list of hosts (for mass installs) • Gets all the necessary information from CDB • Creates a kickstart file for each node • Local script: maintenance: • Script to run down a node: • Drains batch nodes • Warns users on interactive nodes • Can execute a configurable script at the end, e.g. reboot
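The drain-then-act pattern of the `maintenance` script can be sketched as follows. This is a minimal illustration with invented data structures and host names; the real script interfaces with the batch system (LSF) and CDB:

```python
# Sketch of the 'maintenance' rundown logic (hypothetical structures,
# not the actual CERN script): stop accepting new batch jobs, let
# running jobs drain, then run a configurable final action.

def drain_node(node, final_action):
    """Close the node to new jobs, drain it, then act on it."""
    node["accepts_jobs"] = False        # no new batch jobs land here
    while node["jobs"]:                 # stand-in for waiting on the
        node["jobs"].pop()              # batch system to finish jobs
    return final_action(node)           # e.g. reboot or shutdown

node = {"name": "lxb0001", "accepts_jobs": True, "jobs": ["j1", "j2"]}
result = drain_node(node, lambda n: "reboot " + n["name"])
```

The key design point is that the final action is pluggable, which is what lets the same script serve both the kernel-upgrade and the machine-move use cases described later.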
Tools (cont’d) • Automated Fabric [LEAF]: • State Management System (SMS): • Systems have a state: ‘production’ or ‘standby’ • Other CDB changes, such as changing the OS/cluster, are also done by SMS • Hardware Management System (HMS): • Workflow to track hardware changes [interfaces CDB]: • New machine arrival • Machine moves • Machine interventions (vendor calls), retirements
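SMS-style state management can be sketched like this. The dictionary layout, host name, and function name are assumptions for illustration; the real SMS operates on CDB:

```python
# Illustrative SMS-style state tracking (names are assumptions):
# a node's state lives in the CDB view, and SMS changes it there
# alongside other attributes such as the OS/cluster assignment.

VALID_STATES = {"production", "standby"}

def set_state(cdb, host, state):
    """Record a state transition for 'host' in the CDB view."""
    if state not in VALID_STATES:
        raise ValueError("unknown state: " + state)
    cdb[host]["state"] = state
    return cdb[host]

cdb = {"lxb0001": {"cluster": "lxbatch", "os": "SLC3",
                   "state": "production"}}
entry = set_state(cdb, "lxb0001", "standby")
```

Keeping the state in the same database as the rest of the node configuration is what makes the later use cases (draining, moving, reassigning clusters) consistent.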
The Working Teams • “Customers”: • Other groups/teams in CERN-IT, like: DB (ORACLE), GD (LCG), GM (EGEE), Experiments (Data Challenges) • Changing requirements • Service Manager: • New team: now 7 staff, more to hire • Running more and more services in the CC • SysAdmins: • Doing most of the install and maintenance work on farm PCs • Following up h/w failures (‘vendor calls’) • Farm/Cluster resource planning • Writing/improving the procedures/tools • Following up on new problems • Operator: • 24/7 • Alarm display • Following procedures: • Acting on alarms • Opening Remedy tickets • Email/phone notification • Machine reboots
Another Management Tool • Remedy: • The problem-tracking tool in CERN IT • Used in different workflows, e.g. by: • The Operator, to open tickets following up on alarms • The Service Managers, to ask for machine interventions • The SysAdmins, to follow up on problems/general issues • HMS is implemented as a Remedy workflow as well • Recently started gathering statistics on hardware failures
Real Life Use Cases • Kernel upgrade (on LXBATCH, ~1500 hosts): • Put the new software into the repository (SWREP, precaching) • Put the new kernel RPM on the nodes: SPMA, with the multi-package option (the old kernel is still running!) • Configure the new kernel version for the cluster in CDB, and run the GRUB NCM component to configure the node • Drain the nodes by disabling new batch jobs (maintenance)
Real Life Use Cases • Kernel upgrade (cont’d): • The node reboots once it is drained (which could be at any time) • The machine comes up with the new kernel and goes back into production immediately • Least downtime for each node; capacity is always available: • The first reboot is almost instantaneous, the last one can be several days later • Everything runs automatically; some cleanup has to be done for a few machines (failure to shut down, or h/w failure on startup) => caught by the monitoring/alarms
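The ordering of the kernel-upgrade steps above can be sketched as a small pipeline. All names and fields here are invented stand-ins; the real work is done by SWREP, SPMA, the GRUB NCM component, and the `maintenance` drain script:

```python
# Ordered sketch of the kernel-upgrade steps (stubs only). The
# crucial property: the new kernel is installed alongside the old
# one (SPMA multi-package), so the node keeps running until drained.

def upgrade_kernel(node, new_kernel):
    node["repo"].add(new_kernel)             # 1. precache in SWREP
    node["installed"].append(new_kernel)     # 2. SPMA multi-package:
                                             #    old kernel still runs
    node["boot_kernel"] = new_kernel         # 3. CDB + GRUB NCM component
    node["accepts_jobs"] = False             # 4. drain via 'maintenance'
    node["running"] = node["boot_kernel"]    # reboot once drained
    node["accepts_jobs"] = True              # back into production
    return node

node = {"repo": set(), "installed": ["kernel-2.4.21-old"],
        "boot_kernel": "kernel-2.4.21-old",
        "running": "kernel-2.4.21-old", "accepts_jobs": True}
upgrade_kernel(node, "kernel-2.4.21-new")
```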
Real Life Use Cases (cont’d) • Configure batch resources (LSF): • LSF resources are defined depending on availability, power, and cluster membership of machines • Resources are defined in CDB • Configured on the node using NCM • The master file is generated via CDB2SQL in a daily cron job (reconfiguration takes several minutes) • Client/master consistency is guaranteed through CDB • Resource assignments are done in CDB at (sub-)cluster level (template structure) • Reassignments of (sub-)clusters in CDB are done with SMS tools
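The expansion from (sub-)cluster resource assignments to per-host entries can be sketched as follows. Cluster, host, and resource names are invented; the real job dumps CDB via CDB2SQL into the LSF master file:

```python
# Sketch: expand (sub-)cluster resource assignments from a CDB-like
# structure into per-host lines, in the spirit of the daily CDB2SQL
# cron job that regenerates the LSF master file. Names are invented.

cdb = {
    "lxbatch_std":  {"hosts": ["lxb0001", "lxb0002"],
                     "resources": ["atlas", "cms"]},
    "lxbatch_fast": {"hosts": ["lxb0100"],
                     "resources": ["alice"]},
}

def resource_map(cdb):
    """Per-host resource lines derived from the cluster templates,
    so client and master configuration stay consistent via CDB."""
    lines = {}
    for cluster in cdb.values():
        for host in cluster["hosts"]:
            lines[host] = sorted(cluster["resources"])
    return lines

lsf_lines = resource_map(cdb)
```

Because both the per-node NCM configuration and the master file derive from the same CDB templates, a reassignment done once at cluster level propagates consistently to both sides.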
Real Life Use Cases (cont’d) • Emptying the Computer Centre: • For the refurbishment of the CERN Computer Centre, all machines had to be moved, either from one side to the other or downstairs (vault) • ~2000 machines had to be moved • Took the opportunity to add machines to CDB • As Quattor and non-Quattor nodes • Batch machines were moved in racks (44 nodes each): • HMS was used to steer the moves • SMS/maintenance to shut down the machines • Rename/PrepareInstall to bring machines back
Real Life Use Cases (cont’d) • New h/w arrival => mass installation • New machines (~400) arrive at CERN (in batches of 50–100) • Racks have to be prepared: • Network equipment • Power supply • (Console service) • Plan machine membership (cluster) • Put the machine into CDB: • h/w type • Cluster type/OS
Real Life Use Cases • New h/w arrival (cont’d) • Physical machine installation (HMS): • New DNS entry • OS installation: PrepareInstall • Installation by the SysAdmins • Burn-in test (h/w test, several days to weeks) • Follow up on h/w problems with the vendor • Add the machines to the alarm display (SURE) • Put the machines into production
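The arrival-to-production workflow above can be sketched as an ordered per-machine checklist. Step names mirror the slides but the functions and host names are stubs invented for illustration:

```python
# Illustrative per-machine checklist for the mass-install workflow.
# Tracking completed steps means a machine that fails burn-in (and
# goes back to the vendor) can resume where it left off.

STEPS = ["register_in_cdb", "dns_entry", "prepare_install",
         "os_install", "burn_in_test", "add_to_alarms",
         "put_into_production"]

def advance(host, done):
    """Run the remaining steps for 'host' in order."""
    for step in STEPS:
        if step not in done:
            done.append(step)     # stand-in for the real action
    return done

# A machine already registered and in DNS picks up from step 3:
progress = advance("lxb0500", ["register_in_cdb", "dns_entry"])
```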
Collaborations • External ‘Customers’: • EGEE, LCG, and other groups at CERN are now using Quattor-managed machines: • They benefit from standard, manageable, and reproducible machine setups • They can (and should) learn to make modifications themselves • External sites using Quattor: • IN2P3, NIKHEF, UAM Madrid, … are already using Quattor or discussing its adoption => see Rafael’s talk • This helps to enhance the tools: • Service nodes (for LCG-2) • Wider usage • Generalized components
Summary • ELFms is deployed in production at CERN • Established technology: from prototype to production • Enhancements are ongoing • Fundamental part of our infrastructure • Merged with our existing environment • Quattor and Lemon are generic software • Used by others inside and outside CERN • Hopefully a fruitful collaboration in the future
Useful Links: • ELFms: http://cern.ch/elfms • Quattor: http://quattor.org/ • Lemon: http://cern.ch/lemon • LEAF: http://cern.ch/leaf • Previous presentations: • HEPiX II/2003 (Vancouver): http://www.triumf.ca/hepix2003 • ‘The new Fabric Management Tools in Production at CERN’ • HEPiX I/2004 (Edinburgh): http://www.nesc.ac.uk/esi/events/291/ • ‘ELFms, status, deployment’ by German Cancio • ‘Lemon Web Monitoring’ by Miroslav Siket • CHEP 2004 (Interlaken): http://chep2004.web.cern.ch/chep2004/ • ‘Current Status of Fabric Management at CERN’ by German Cancio
Questions?