Fabric Management • CCDB2 RTAG • April 23rd 2002 • Tony.Cass@CERN.ch • with much help from German Cancio Melia
What is Fabric Management? • Maintaining • Large clusters of servers • In specific desired state(s)
What does this mean/involve? • Maintain • Large clusters • In desired state
What does this mean/involve? • Maintain • Install • Upgrade • Verify • Large clusters • In desired state
What does this mean/involve? • Maintain • Install: Two options • Image • Pro: All systems identical by construction • Con: Building & storing images • Con: Inflexible; reboot almost always required on change; this is disruptive: imagine impact of urgent security patch to application code or updating routing tables for tierX<->tierY transfers. • “Known Process” • Pro: Flexible; reboots only when essential • Con: guaranteeing reproducibility, especially over time. • Upgrade • Verify • Large clusters • In desired state
What does this mean/involve? • Maintain • Install: Two options • Image • Early approach: with no standard installation procedures, it was easy to build an image and then replicate it, but very hard to define a “known process” except on paper. • “Known Process” • With standardised s/w installation systems, e.g. RPM, known process fabric management comes to the fore: define which packages to install, then the installation tool handles the rest, including dependency issues. • Upgrade • Verify • Large clusters • In desired state
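A minimal sketch of what "the installation tool handles the rest" can mean for dependencies: given a requested package, collect everything it needs and order the installs so that dependencies come first. The package names and the DEPENDS table are hypothetical; this is not how RPM itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency metadata: package -> packages it requires.
DEPENDS = {
    "openssh-server": {"openssl", "glibc"},
    "openssl": {"glibc"},
    "glibc": set(),
    "ntp": {"glibc"},
}

def install_order(requested):
    """Return the requested packages plus their dependencies, dependencies first."""
    needed, stack = set(), list(requested)
    while stack:  # walk the dependency graph from the requested packages
        pkg = stack.pop()
        if pkg not in needed:
            needed.add(pkg)
            stack.extend(DEPENDS[pkg])
    graph = {pkg: DEPENDS[pkg] for pkg in needed}
    # static_order() yields each package only after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())
```

Asking for "openssh-server" alone pulls in "openssl" and "glibc" and leaves the unrelated "ntp" out.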
What does this mean/involve? • Maintain • Install • Upgrade • Clearly follows from choice of installation mechanism. • For image systems, upgrade is essentially installation of the new image • For known process systems, software package management and/or configuration systems adjust node to match change in desired state. • Verify • Large clusters • In desired state
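For a known process system, an upgrade can be pictured as a diff between what a node has and what the desired state says it should have. The sketch below assumes both are simple package-to-version maps; the function name and action labels are illustrative only.

```python
def plan_changes(installed, desired):
    """Compare {package: version} maps and list the actions that make them match."""
    actions = []
    for pkg, version in sorted(desired.items()):
        if pkg not in installed:
            actions.append(("install", pkg, version))
        elif installed[pkg] != version:
            # Could be an upgrade or a downgrade; only the target version matters.
            actions.append(("replace", pkg, version))
    for pkg in sorted(set(installed) - set(desired)):
        actions.append(("remove", pkg, installed[pkg]))
    return actions
```

An empty plan means the node already matches its desired state, so the same routine doubles as a cheap check after the upgrade.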
What does this mean/involve? • Maintain • Install • Upgrade • Verify • As we’ve seen, verification that software is as desired is essential in known process systems: did we get what we wanted? • But also, “do we still have what we want”? This is equally needed for image installs: has anything changed, especially with respect to security? • Software monitoring systems should be well integrated with the overall system monitoring • Raise alarms for exceptions and ensure they are followed up, just as for “file system full” errors. • Large clusters • In desired state
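One way to answer "do we still have what we want?" is to record file fingerprints at install time and compare them later. This is a sketch under the assumption that a list of watched files is available; the alarm strings are invented, and a real system would hand them to the site monitoring rather than return a list.

```python
import hashlib
from pathlib import Path

def fingerprint(path):
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def snapshot(paths):
    """Record the fingerprints of the watched files, e.g. right after installation."""
    return {str(p): fingerprint(p) for p in paths}

def verify(baseline):
    """Return one alarm string per watched file that is missing or has changed."""
    alarms = []
    for path, expected in sorted(baseline.items()):
        if not Path(path).exists():
            alarms.append(f"MISSING {path}")
        elif fingerprint(path) != expected:
            alarms.append(f"CHANGED {path}")
    return alarms
```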
What does this mean/involve? • Maintain • Large clusters • Many boxes, so need to worry about • System errors & failures (what if system out for repair during upgrade?) • Mundane box related issues: arrivals, departures, repairs • Workflow for system upgrades (drain, upgrade, restart, …) • … • Most site dependent part of fabric management • In desired state
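The upgrade workflow (drain, upgrade, restart) and the "what if a system is out for repair?" question can be sketched as a loop that defers unavailable nodes instead of failing. The step names follow the slide; the do_step callback and the node names used with it are assumptions.

```python
STEPS = ("drain", "upgrade", "restart")

def run_upgrade(nodes, in_repair, do_step):
    """Take each available node through STEPS in order; return (done, deferred).

    Nodes listed in in_repair are skipped and reported back so that they
    can be upgraded when they return to service.
    """
    done, deferred = [], []
    for node in nodes:
        if node in in_repair:
            deferred.append(node)
            continue
        for step in STEPS:
            do_step(node, step)  # site-specific action, supplied by the caller
        done.append(node)
    return done, deferred
```

Keeping the deferred list is the point: a node that comes back from repair must still reach the current desired state before it rejoins the cluster.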
What does this mean/involve? • Maintain • Large clusters • In desired state • Need a way to • specify • update • recover • the desired state for each system. • This is fairly easy (well, apart from recover…); you just need a database associating some key (host name, MAC address) with the software packages & required configuration.
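A minimal in-memory sketch of such a database: a key (here a host name) maps to packages and configuration, with specify, update and recover operations. Keeping every revision is just one assumed way to make recovery of the recorded state possible; the class and method names are hypothetical, and recovering the node itself remains the hard part.

```python
import copy

class DesiredStateStore:
    """Maps a host key to its desired packages and configuration."""

    def __init__(self):
        self._history = {}  # host key -> list of revisions, newest last

    def specify(self, host, packages, config):
        state = {"packages": dict(packages), "config": dict(config)}
        self._history.setdefault(host, []).append(state)

    def current(self, host):
        return copy.deepcopy(self._history[host][-1])

    def update(self, host, packages=None, config=None):
        """Store a new revision with the given changes applied on top of the current one."""
        state = self.current(host)
        state["packages"].update(packages or {})
        state["config"].update(config or {})
        self._history[host].append(state)

    def recover(self, host):
        """Drop the newest revision and return the previous one."""
        if len(self._history[host]) < 2:
            raise ValueError("no earlier revision to recover")
        self._history[host].pop()
        return self.current(host)
```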
What does this mean/involve? • Maintain • Large clusters • In desired state(s) • The ease of specification of multiple states is the harder and more important part • define characteristics for clusters, not systems • host configuration defined by cluster membership, but should be able to override any aspect • inheritance especially useful • many system configuration details (ntp, name servers, …) are independent of system function; define these once and propagate to all clusters • allow similar clusters to share the definition of their common configuration; this avoids the potential for drift if only one cluster definition is updated.
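Inheritance with per-host override can be sketched as a chain of dictionaries merged from the most general definition to the most specific. The cluster names, host names and settings below are invented for illustration and do not follow the syntax of any particular configuration language.

```python
# Hypothetical definitions: site-wide settings are written once and inherited.
CLUSTERS = {
    "site": {"parent": None,
             "settings": {"ntp": {"server": "ntp.example.org"}, "dns": ["192.0.2.1"]}},
    "batch": {"parent": "site",
              "settings": {"packages": {"batch-client": "1.0"}}},
    "batch-test": {"parent": "batch",
                   "settings": {"packages": {"batch-client": "1.1"}}},
}
HOSTS = {
    "node01": {"cluster": "batch"},
    "node02": {"cluster": "batch-test",
               "override": {"ntp": {"server": "ntp2.example.org"}}},
}

def merge(base, override):
    """Recursively merge two dicts; values in override win."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

def resolve(host, hosts=HOSTS, clusters=CLUSTERS):
    """Build a host's configuration from its cluster chain, then apply host overrides."""
    chain = []
    name = hosts[host]["cluster"]
    while name is not None:  # walk up from the host's cluster to the root
        chain.append(clusters[name]["settings"])
        name = clusters[name]["parent"]
    config = {}
    for settings in reversed(chain):  # apply the most general settings first
        config = merge(config, settings)
    return merge(config, hosts[host].get("override", {}))
```

Here "batch-test" differs from "batch" only in one package version, so a change to the shared "batch" or "site" definition reaches both clusters and they cannot drift apart.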
Standards Interlude • There are none. • Software installation tools exist for many platforms and distributions but all differ • Still, a good Fabric Management system should have a high level interface allowing free choice at this level • e.g. quattor: interfaced to both RH & Solaris installation tools • No widely acknowledged standards for defining system configuration. • Choices in this area generally define the different fabric management suites • “rules based” systems (cfengine) • “configuration language” systems (LCFG(ng), quattor) • There is work in this area, but obvious common standards are still far away. • CIM, HP/IBM work to define web services based standards, DCML
Some Systems • ELFms • Rocks • Cfengine • LCFG(ng) • OSCAR/SIS • Ganglia • MonALISA
Some Systems • ELFms • A complete package with • quattor (aii/spma/ncm) known process installation • Lemon monitoring integrated • Leaf for workflow management of software and hardware processes • Rocks • Cfengine • LCFG(ng) • OSCAR/SIS • Ganglia • MonALISA
Some Systems • ELFms • Rocks • RH specific system, kickstart based but reinstalls nodes for configuration changes. • Limited config capabilities • No support for multiple package versions (either in repository or on a node) • Cfengine • LCFG(ng) • OSCAR/SIS • Ganglia • MonALISA
Some Systems • ELFms • Rocks • Cfengine • A set of tools to administer and configure systems • Rules based approach • state maintained in set of rule files; cfengine tools read these, check the status and update systems accordingly • LCFG(ng) • OSCAR/SIS • Ganglia • MonALISA
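The rules based idea (read the rules, check the status, update accordingly) can be sketched generically as a list of check/repair pairs. This illustrates the approach only; it is not cfengine's rule syntax or API, and the example rules and system dictionary are made up.

```python
def converge(system, rules):
    """Apply each rule whose check fails; return the names of the rules applied."""
    applied = []
    for name, is_satisfied, repair in rules:
        if not is_satisfied(system):
            repair(system)
            applied.append(name)
    return applied

# Hypothetical rules: (name, check, repair), acting on a dict that stands in for a node.
RULES = [
    ("ntp running",
     lambda s: s["ntp_running"],
     lambda s: s.update(ntp_running=True)),
    ("motd set",
     lambda s: s["motd"] == "authorised use only",
     lambda s: s.update(motd="authorised use only")),
]
```

Running converge a second time on the same system applies nothing, which is the property that lets such tools run repeatedly without harm.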
Some Systems • ELFms • Rocks • Cfengine • LCFG(ng) • Known process installation and configuration • Key feature is introduction of “language” for description of required system configuration. • this approach adopted and enhanced by EDG/WP4 for quattor • OSCAR/SIS • Ganglia • MonALISA
Some Systems • ELFms • Rocks • Cfengine • LCFG(ng) • OSCAR/SIS • Image based installation (SIS) • Ganglia • MonALISA
Some Systems • ELFms • Rocks • Cfengine • LCFG(ng) • OSCAR/SIS • Ganglia • “a scalable distributed monitoring system for high-performance computing systems” • can monitor many standard parameters for systems • but not integrated with s/w installation systems for verification • MonALISA
Some Systems • ELFms • Rocks • Cfengine • LCFG(ng) • OSCAR/SIS • Ganglia • MonALISA • Distributed monitoring system • Aimed at performance issues, not integration with installation frameworks • Can collect input from other monitoring systems (e.g. Lemon) as well as directly from nodes.
Summary • Fabric Management is concerned with maintaining large clusters in defined states, handling evolution over time. • Installation/Upgrade can be via disk image or a more flexible “known process” • No standards (yet) for definition of system configuration • Installation toolkits mostly differ in approach in this area. • Many monitoring systems, but these are independent developments, mostly concentrating on performance related metrics. • ELFms integrates quattor installation and configuration toolkit with the Lemon monitoring system to provide tight control over node status • and adds a (CERN specific) package to manage software and hardware workflows.