Advanced Fabric Management
Bill Tomlin for CERN IT/FIO
GRIDPP 10th Collaboration Meeting, June 2004

Managing a large installation
• ~2800 nodes in the CERN CC
• Approaching 10,000 by 2008
• Frequent mass installs, moves, retirements
• Daily failures of hardware
• Heterogeneous H/W & S/W
• Multiple functionality (batch, disk, tape, DB, web etc.)
• Planning required
• Data challenges, test-beds, capacity
• Not easy to meet needs:
  • Find things
  • Know what’s happening
  • Maximize availability
  • Resource CC operations

Fabric Management in a nutshell
[Diagram: Extremely Large Fabric management system = quattor + LEAF + Lemon, with "Effectively Managed Fabric" at the centre. Labels: quattor (automatic configuration, automatic installation); LEAF (SMS: high-level control; HMS: workflow tools; visualization tools); Lemon (effective monitoring); managed hardware.]

Quattor: configuration, installation and management
[Architecture diagram. Labels: GUI, CLI, Scripts; CDB; pan; RDBMS; SQL; SOAP; XML; HTTP; Cache; CCM; Node Management Agents; Node.]

LEAF – LHC Era Automated Fabric
• SMS: State Management System
  • Issues high-level configuration commands
  • Nodes automatically take themselves into and out of production
  • Used during software interventions, e.g. a kernel upgrade for a cluster
  • Used during hardware interventions, e.g. moving a rack of machines
  • Validates state transitions
  • Keeps history: who, when, why
  • Handles concurrent requests

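The SMS behaviours listed above (validated state transitions, a who/when/why history, concurrent requests) can be illustrated with a minimal sketch. This is not the SMS implementation: the class name `StateManager`, the transition table, the `maintenance` state and the node name `lxb0001` are all assumptions made for illustration. Only the `production` and `standby` state names appear in these slides.

```python
import threading
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical transition table: "production" and "standby" are named in the
# slides, "maintenance" is invented here to show a rejected transition.
ALLOWED = {
    "production": {"standby"},
    "standby": {"production", "maintenance"},
    "maintenance": {"standby"},
}


@dataclass
class HistoryEntry:
    node: str
    old: str
    new: str
    who: str
    when: datetime
    why: str


class StateManager:
    """Toy state manager: validates transitions and records who/when/why."""

    def __init__(self):
        self._states = {}
        self._history = []
        # One lock serialises concurrent requests from several threads.
        self._lock = threading.Lock()

    def add_node(self, node, state="standby"):
        with self._lock:
            self._states[node] = state

    def set_state(self, node, new, who, why):
        with self._lock:
            old = self._states[node]
            if new not in ALLOWED.get(old, set()):
                raise ValueError(f"illegal transition {old} -> {new} for {node}")
            self._states[node] = new
            self._history.append(
                HistoryEntry(node, old, new, who, datetime.now(timezone.utc), why)
            )

    def state(self, node):
        with self._lock:
            return self._states[node]

    def history(self, node):
        with self._lock:
            return [h for h in self._history if h.node == node]


sms = StateManager()
sms.add_node("lxb0001", "production")
sms.set_state("lxb0001", "standby", who="operator", why="kernel upgrade")
sms.set_state("lxb0001", "production", who="operator", why="upgrade finished")
```

A single lock is the simplest way to keep concurrent requests consistent in a sketch like this. A production system would more plausibly rely on database transactions.
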
LEAF – LHC Era Automated Fabric
• HMS: Hardware Management System
  • Result of process reengineering
  • Provides consistent, traceable workflows
  • Manages:
    • Installs
    • Moves
    • Renames
    • Retirements
    • Repairs
  • Implemented using Remedy
  • Web interface available
  • Allows visualization & searching for objects

Use Case: Move rack of machines
[Diagram. Labels: HMS, SMS, CDB, LAN DB, Sysadmins, Operations, Node. Numbered steps:]
1. Import
2. Set to standby
3. Update
4. Refresh
5. Take out of production
6. Shutdown work order
7. Request move
8. Update SMS
9. Update LAN DB
10. Install work order
11. Set to production
12. Update CDB
13. Refresh
14. Put into production

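The fourteen numbered steps of this use case can be read as an ordered checklist. The sketch below encodes them in step-number order and refuses out-of-order completion. The step labels come from the slide; the `Workflow` class and its methods are hypothetical, and nothing here reflects how HMS (implemented in Remedy) actually stores workflows. Which component performs each step is left out because the slide text does not make it certain.

```python
# Step numbers and labels as they appear on the slide.
MOVE_RACK_STEPS = [
    (1, "Import"),
    (2, "Set to standby"),
    (3, "Update"),
    (4, "Refresh"),
    (5, "Take out of production"),
    (6, "Shutdown work order"),
    (7, "Request move"),
    (8, "Update SMS"),
    (9, "Update LAN DB"),
    (10, "Install work order"),
    (11, "Set to production"),
    (12, "Update CDB"),
    (13, "Refresh"),
    (14, "Put into production"),
]


class Workflow:
    """Hypothetical checklist that only accepts steps in numeric order."""

    def __init__(self, steps):
        self._steps = sorted(steps)
        self._done = 0  # count of completed steps

    def next_step(self):
        if self._done < len(self._steps):
            return self._steps[self._done]
        return None

    def complete(self, number):
        expected = self.next_step()
        if expected is None:
            raise RuntimeError("workflow already finished")
        if number != expected[0]:
            raise ValueError(f"step {number} is out of order; expected {expected[0]}")
        self._done += 1

    @property
    def finished(self):
        return self._done == len(self._steps)


move = Workflow(MOVE_RACK_STEPS)
move.complete(1)  # "Import"
```
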
LEAF Status
• HMS
  • In production since late 2002 (installs only)
  • Rapid evolution: 16 production releases last year
  • Used successfully to move and install hundreds of machines
  • Fully integrated (LAN DB, CDB, SMS, other workflow apps)
• SMS
  • First production release in January (stable CDB)
  • Now used for all quattor-managed nodes (>2000)
  • All batch and interactive nodes change state automatically

LEAF Next Steps
• Consolidate
• Evolve smoother processes
• Documentation
• Populating data (warranties etc.)
• Phase-out legacy components
• Extend HMS to other equipment types, individual components
• Extend SMS for more clusters, states (like shutdown)
• Visualization tool to:
  • Get/set properties and states
  • Initialize workflows