The CERN Agile Infrastructure Project: Configuration and Operations Tools. Helge Meinhard / CERN-IT (replacing Manuel Guijarro ) HEPiX Spring 2012 24 April 2012, Praha. Configuration and Operations Tools. https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
The CERN Agile Infrastructure Project: Configuration and Operations Tools Helge Meinhard / CERN-IT (replacing Manuel Guijarro) HEPiX Spring 2012 24 April 2012, Praha
Configuration and Operations Tools https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure https://agileinf.cern.ch/jira/ Agile Infrastructure - Configuration and Operation Tools
Project Scope • The project is reviewing the entire CERN computer-centre management toolset • What happens from the bare metal up • Asset management, inventory • Sysadmin tools and maintenance workflows • Service management and configuration tools • Dynamic configuration for ‘virtual’ hosts • Operations monitoring • Workflow automation and continuous deployment • …
Why? • Current production system built around the Quattor toolset is successfully managing O(10k) servers • (CERN) Quattor + many CERN components • Why are we changing the toolset?
What are the Issues (1) • Uncompressible technical debt • The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources • Small community (less funding) and general support problem. At CERN, we’ve fallen into the “sticky hands” support model • We need better automation and integration between the sub-components • Lack of automated workflow: everything is a ticket • emailScript™ : your added value in the process is often your CERN password • The 15-min “CDB commit walk” – context switch cost
What are the Issues (2) • Transferable skills and training • Learning curve for our tools is steep and remains high • It’s easier to hire people who have skills in a widely-used tool than in your internal tools • Depending on where you look
Job Adverts – indeed.com • Index of millions of worldwide job posts across thousands of job sites • These are the sort of posts our departing staff will be applying for • [Figure: chart of job posts for Puppet vs. Quattor]
Integration is Hard • IPv6, virtualisation, Windows Server all need a solution • We could leverage lots of open source tools • But piecemeal integration of these requires high investment due to our complex system • Years of organic growth have made the system way too ‘hairy’ • It’s often easier to reinvent rather than integrate • Lack of ‘dynamic-ness’ in the infrastructure • We hack the config system for dynamic VMs • It’s critical to look at the system as a whole
Use Puppet for the Core • The tool space has exploded in the last few years • In configuration management and ops • Large, shared ‘tool forges’, and lots of experience • Puppet and Chef are the clear leaders for the ‘core’ tool • Other tools in our ‘scope’ try to integrate with those • Many large-scale enterprises use Puppet • Its declarative approach fits better with what we are used to • Large installations: friendly, wide-base community and commercial support and training • You can buy books on it
Scaling Challenges: Nodes • Currently we have O(10k) physical nodes • IaaS approach: • Moving to virtual machines • More (smaller, load-balanced) service nodes • VMs for raw compute (batch or pilot jobs) • Homogeneous: compute + storage on the same node • Add another computer centre, 24/48 SMT cores per node, you get 100k – 300k virtual nodes to be managed • A 99.6% (1) node update success rate means 1200 manual interventions to “fix it” (0.4% of 300k nodes) • (1) In a recent intervention on lxbatch
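The arithmetic behind the 1200 figure can be checked with a short sketch. It assumes that the 99.6% success rate seen on lxbatch would carry over unchanged to the upper estimate of 300k virtual nodes; the function name is illustrative only.

```python
def expected_manual_interventions(node_count: int, success_rate: float) -> int:
    """Expected number of nodes left needing a manual fix after one update."""
    failure_rate = 1.0 - success_rate
    return round(node_count * failure_rate)


# Today's O(10k) nodes versus the projected 100k and 300k virtual nodes.
for nodes in (10_000, 100_000, 300_000):
    print(nodes, expected_manual_interventions(nodes, 0.996))
```

The same failure rate that is tolerable at 10k nodes becomes a substantial manual workload at 300k, which is the point of the slide.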
Scaling Challenges: People • Many, diverse applications (“clusters”) managed by different teams • ..and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system
Agile Infrastructure 1st Try (1) • First started investigating tools in September 2011 using ‘part-time’ resources from several IT groups • Trying iterative “agile-sprint” style (Scrum): short sprints, feedback, sprint review, visible • Take first, best-guess at architecture and tool selection, iterate • Mixed success with this agile style • What works: Good visibility and reviews. Daily “scrum” meeting useful. Weekly review meeting open to management. • What doesn’t: The “time boxing” part of Scrum sprints is hard with part-time resources • Now more staff available, but still mostly part-time efforts
Agile Infrastructure 1st Try (2) • We’re currently running: • OpenStack as cloud software for virtual machines, image management, bulk storage • See later presentation • Puppet for the configuration management core • …with Foreman as a dashboard
Foreman Dashboard
Agile Infrastructure 1st Try (2), continued • None of the tools are “perfect” out-of-the-box • .. but we’d rather submit patches to a good open source tool than re-implement it • We’ve experienced very good community support: RFCs and patches are quickly accepted • Very active community: often problems are fixed and missing features implemented before you even report them
Agile Infrastructure 1st Try (3) • We’re currently running: • yum for software distribution (replacing spma) • git for template management: why git? • Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates • Many of the tools we can benefit from also assume git • We should not be different from the rest of the community
Puppet • Client/server architecture • “puppetmaster”: horizontally scalable Rails application • X509 cert authenticated nodes: integrate with CERN CA
Puppet • Puppet runs on the client, applying the configuration changes • It detects the current state and only runs if there’s something to do • It runs every few minutes • New configuration will be ~immediately applied (“fail-fast”) • This is a change from CDB where ‘latent’ changes can be stacked up • Normal mode is client-side compile (“assume success”) • No more CDB commit waits • Change from CDB: the compilation fails later • Good monitoring is a pre-req: puppet sends reports back to the puppetmaster • The Foreman tool can collect these for you
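The “detect the current state and only act if there is something to do” behaviour can be illustrated with a toy model. This is a sketch of the general idea, not Puppet's implementation: the `Node` class, the `converge` function and the setting names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a managed host: its state is a plain dict."""
    state: dict = field(default_factory=dict)


def converge(node: Node, desired: dict) -> dict:
    """Apply only the settings that differ; return a report of the changes."""
    changes = {}
    for key, want in desired.items():
        have = node.state.get(key)
        if have != want:
            node.state[key] = want
            changes[key] = (have, want)
    return changes  # an empty report means the node was already compliant


# Hypothetical settings, for illustration only.
node = Node(state={"ntp.server": "old.example.org"})
desired = {"ntp.server": "ntp.example.org", "sshd.enabled": True}

first_report = converge(node, desired)   # two settings changed
second_report = converge(node, desired)  # nothing left to do
print(first_report)
print(second_report)
```

Because a second run reports no changes, running the agent every few minutes is cheap, and the reports are what a dashboard such as Foreman would collect.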
Puppet Language • Puppet uses its own Ruby-like language for the templates to “assert” the desired state of the nodes • With Ruby fall-back for hard stuff (we’ve only needed this once) • Being declarative rather than procedural, there are quirks • Takes a bit of practice to ‘get it’ • There are books, online docs, online cook-books, and a large community to help • It dispenses with the need for ncm components • All the work is done by puppet on the node itself – you just provide the template part to assert what you want done • Less software -> easier to move to new OS versions
Externals • Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates • Node function + hardware • Moving a host between clusters is a DB update • Your configuration can use variables the node detects itself • e.g. reconfigure daemons based on where a newly live-migrated VM has found itself • Query the compiled configuration of other hosts • e.g. Open my firewall to the lxadm nodes
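The idea that “moving a host between clusters is a DB update” can be sketched as a lookup against an external node database. The output shape (a `classes` list plus `parameters`) is loosely modelled on what a Puppet external node classifier returns; the in-memory `NODE_DB`, the host name and the hardware label are assumptions made for illustration.

```python
import json

# Hypothetical in-memory stand-in for the external node database.
NODE_DB = {
    "node001.example.org": {"cluster": "lxbatch", "hardware": "batch-24core"},
}


def classify(hostname: str) -> dict:
    """Return the node's function and hardware in a classifier-like shape."""
    record = NODE_DB[hostname]
    return {
        "classes": [record["cluster"]],
        "parameters": {"hardware": record["hardware"]},
    }


def move_host(hostname: str, new_cluster: str) -> None:
    """Moving a host is a single database update; no template is edited."""
    NODE_DB[hostname]["cluster"] = new_cluster


before = classify("node001.example.org")
move_host("node001.example.org", "lxadm")
after = classify("node001.example.org")
print(json.dumps(after, indent=2))
```

On its next run the node would simply receive the configuration for its new cluster.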
Moving towards PaaS • Parametrisable recipes • Just fill in the blanks • The aim is to make it easy to use “pre-canned” recipes without even touching a Puppet template • e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box • …with these parameters • Moving us in the PaaS direction • Ultimately, it would be better if you never even needed to log into this node • (J2EE public service, IT web hosting service, MySQL service)
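A “fill in the blanks” recipe can be pictured as a set of defaults plus a small number of user-supplied parameters, with anything unknown or missing rejected up front. The sketch below is a generic illustration; the parameter names (`sso_enabled`, `wsgi_processes`, `app_module`) are hypothetical and are not taken from any CERN recipe.

```python
# Hypothetical defaults for a pre-canned web-server recipe.
# A value of None marks a blank that the user must fill in.
WEB_RECIPE_DEFAULTS = {
    "sso_enabled": True,
    "wsgi_processes": 4,
    "app_module": None,
}


def fill_in_recipe(defaults: dict, **overrides) -> dict:
    """Merge user parameters into the recipe, rejecting typos and gaps."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    config = {**defaults, **overrides}
    missing = [name for name, value in config.items() if value is None]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return config


config = fill_in_recipe(
    WEB_RECIPE_DEFAULTS, app_module="mysite.wsgi", wsgi_processes=8
)
print(config)
```

The user never touches the recipe itself, only the parameters, which is what moves the service in the PaaS direction.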
Standard Workflow • CDB on lxadm (iterate: n minutes): check out from CDB -> update templates -> CDB commit -> run and check on test node -> notify with nc-client -> check on node(s) • Puppet on lxadm (iterate: 1 minute): check out from git -> update templates -> git commit and push -> run and check on test node -> notify with mcollective -> check on foreman • Puppet-apply on test node (iterate): check out from git on the test node -> update templates -> run puppet-apply -> check on test node -> git commit and push -> notify with mcollective -> check on foreman
Modernising our Processes (1) • Our software processes for the computer centre are fairly limited • fire-and-forget broadcasts to project-elfms • …and rather manual • The manual test/ -> preprod/ -> prod/ template dance • Our toolset RPMs are ‘built on laptop’ and uploaded to ‘swrep’ by hand • Add standard continuous integration (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC • .. then automate the testing • e.g. suitably tagged RPMs are automatically deployed to /test nodes.
Modernising our Processes (2) • We’re working out which of the many puppet / git models suits us • code review, sign-off and automated notification for changes that will affect multiple clusters • How to automate the test/preprod/prod advancement • Pre-req is flexible monitoring and alarming • you need to trust that an automation failure will be signaled to you • Script-generated emails are banned • Need good monitoring to hang these notifications on • Integrate components rather than use emailScript™ • Script-generated tickets (where your value in the process is your password) are banned
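Automated test/preprod/prod advancement, with failures signalled through a monitoring hook rather than a script-generated email, could look roughly like the sketch below. The `advance` function and the `notify` callback are hypothetical; in practice the callback would feed whatever alarming system is in place.

```python
from typing import Callable, List

STAGES = ["test", "preprod", "prod"]


def advance(stage: str, checks_passed: bool,
            notify: Callable[[str], None]) -> str:
    """Return the stage a change should be in after one automated step."""
    if not checks_passed:
        # A failure must always be signalled; the change is held back.
        notify(f"checks failed in {stage}; change held back")
        return stage
    position = STAGES.index(stage)
    if position == len(STAGES) - 1:
        return stage  # already in prod, nothing further to promote to
    return STAGES[position + 1]


alarms: List[str] = []  # stand-in for a real monitoring/alarm sink
stage = advance("test", True, alarms.append)   # promoted to preprod
stage = advance(stage, False, alarms.append)   # held in preprod, alarm raised
stage = advance(stage, True, alarms.append)    # promoted to prod
print(stage, alarms)
```

Passing the notifier in as a parameter keeps the promotion logic independent of the monitoring tool, so the alarm channel can be swapped without touching the workflow.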
Current Tool Snapshot (Liable to Change) • Puppet • Foreman • mcollective, yum • Jenkins • AIMS/PXE • Foreman • JIRA • OpenStack Nova • git, SVN • Koji, Mock • Yum repo • Pulp • Lemon • Hardware database • Puppet stored config DB
Preliminary Timelines • Aggressive schedule if we are to make it for the new data centre
Initial Steps • Decided on tools • Integrating them to make a production setup • We can still change.. But we’re starting to commit… • Looking for early adopters • In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best? • e.g. PES/OIS services: batch/VMs, JIRA, Drupal • https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012 • Help with integration / coding • Help with ideas • Help with building the task list
Summary • IT has started a new project to move our infrastructure to a new toolset based around industry standard open source components • Puppet for the core configuration tool • Better integration between components • Use of more modern software processes to aid deployment • Better monitoring • Engage with the community rather than re-implement • Overall project scope is wider (see following presentations) • Improved monitoring • Cloud and virtualisation • Actively seeking wide involvement from CERN-IT and feedback from the community • https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
Acknowledgements • Many colleagues at CERN-IT, including • Tim Bell • Ian Bird • Bernd Panzer-Steindel • Gavin McCance • Manuel Guijarro