Update on Service Availability Monitoring Marian Babik, Paloma Fuente, et al. (CERN) Emir Imamagic (SRCE) Paschalis Korosoglou (AUTH)
Overview • Recent changes and releases • SAM Update-20 • SAM Update-22 • Update-22 details and impact • Operations and support
Update-20 Changes • Released: December 2012 • Last SAM release based on gLite • New features: • Operational Tools Monitoring - http://ops-monitor.cern.ch • Operational Tools Availability in MyEGI • Monthly reports in central MyEGI • Nagios configuration improvements • More information: http://cern.ch/go/bH6K
Update-22 Changes • Planned: April 2013 • Integration of EMI probes • based on EMI/UMD • Following EMI probes were integrated: • CREAM, WMS, BDII, ARC, LFC, FTS, SRM, ARGUS, GLEXEC, WN, UNICORE • Complete repackaging of SAM • Improved yaim configuration
Recent Activities • Required close collaboration with EMI and EGI JRA1 • Large-scale testing activity (with EMI) • https://twiki.cern.ch/twiki/bin/view/EMI/NagiosServerEMITestbed0022012 • SAM/Nagios probes WG (with EGI JRA1) • Meetings with EMI PTs • Evaluation of EMI probes (business logic) • Reported to EGI OMB
Next Release in Detail (1) • Update-22 will be non-backward-compatible in packaging • Installation from base SL5 is expected (no upgrade path, no SL6 support) • Probe packages imported to SAM • Middleware from UMD • Considerable simplification of repository setup (just SAM, UMD and EPEL)
Next Release in Detail (2) • Simplified yaim configuration: • new SAM_NAGIOS nodetype • SAM/Nagios configuration • Run-time optimizations • EMI NAGIOS nodetype provided by EMI • lightweight EMI-UI • environment setup for the probes • yaim -n NAGIOS -n SAM_NAGIOS
Next Release in Detail (3) • Changes to metric names are needed: • org.sam.CREAMCE-JobSubmit -> emi.cream.CREAMCE-JobSubmit • A metric translation mechanism was implemented to handle the transition period • NGIs sending both new and old metrics at the same time • Status and Availability history will be kept in both local and central databases
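The renaming and translation step above can be illustrated with a minimal Python sketch. This is not the SAM implementation: the function names, the tuple layout of a result, and the rule for resolving duplicates are assumptions made for illustration. Only the CREAMCE-JobSubmit name pair comes from the slide.

```python
# Hypothetical old-to-new metric name table.
# Only this one pair is taken from the slide; a real table would list every renamed metric.
LEGACY_TO_EMI = {
    "org.sam.CREAMCE-JobSubmit": "emi.cream.CREAMCE-JobSubmit",
}


def translate_metric(name, mapping=LEGACY_TO_EMI):
    """Return the new-style name for a legacy metric, or the name unchanged."""
    return mapping.get(name, name)


def normalize_results(results, mapping=LEGACY_TO_EMI):
    """Merge results reported under old and new names into one series.

    `results` is an iterable of (metric_name, timestamp, status) tuples
    (an assumed layout). If both names report for the same timestamp,
    the first result seen is kept (an assumed tie-breaking rule).
    """
    merged = {}
    for name, timestamp, status in results:
        key = (translate_metric(name, mapping), timestamp)
        merged.setdefault(key, status)
    return merged
```

With a table like this, an instance that sends both the old and the new metric during the transition period ends up with a single history per new metric name, which is one way the status and availability history could stay continuous across the rename.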
Impact on SAM • Probes are now part of the middleware (and developed by many different PTs) • Continuous coordination from JRA1 is crucial after the end of EMI • SAM release schedule now depends on PTs • Probes still shipped with SAM • But testing expected from PTs and middleware providers to ensure probes work with underlying middleware
Impact on EGI SR • EGI Staged Rollout (SR) assumes an already tested, production-ready release • SAM can no longer guarantee this since it: • Lacks control over probes and probe-to-middleware interfaces • Is no longer competent to test if probes work correctly with underlying middleware • Is unable to ensure probes will work against production infrastructure • More complex testing needed
Possible Options • SAM testing releases • Via dedicated testing repository • Process similar to EGI SR (lightweight) would be needed to evaluate a testing release • Once approved – SAM would release to SR • UMD adopts the probes and does the initial testing to ensure • Probes work with released middleware • Spots major issues early in the process and can block the release
Operations and Support • SAM central services (since Sept. 2012) • 206 operational tickets • upgrades, generating reports, interventions, profile changes • 62 re-computations • GGUS (since Sept. 2012): • 117 GGUS tickets in 3rd level • 36 GGUS tickets in 2nd level
Summary • SAM central services stable • Substantial improvements in adoption of EMI probes, operational tools monitoring and Nagios configuration features • Continuous support and bug-fixing • Near-term plans (MS710)
Near-term plans • Update-22 will conclude development work planned for EGI-InSPIRE • but SAM will continue to evolve • Until end of EGI-InSPIRE • Continuous support and bug-fixing • Maintenance and operations of the SAM central services • SAM central Oracle databases • SAM central services (MyEGI and API) • EGI monthly reports • Operational Monitoring and Availability
Web API statistics - March • ~ 2.5M hits/month • ~ 60k hits/day • Top hosts querying the Web API: • mon-it.cnaf.infn.it (167k hits) • rocnagios.grid.sinica.edu.tw (110k hits) • rocmon-fzk.gridka.de (85k hits) • ngi-de-nagios.gridka.de (85k hits) • Failures (0.2%)
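To put the figures above in absolute terms, the short Python sketch below converts the failure rate into a request count and computes the share of traffic from the listed hosts. It assumes the per-host counts are monthly totals and that the 0.2% failure rate applies to the monthly total; the slide does not state either, and all figures are the approximate values quoted there.

```python
MONTHLY_HITS = 2_500_000  # "~ 2.5M hits/month" from the slide
FAILURE_RATE = 0.002      # "Failures (0.2%)" from the slide

# Per-host hit counts as listed on the slide (assumed to be monthly totals).
TOP_HOSTS = {
    "mon-it.cnaf.infn.it": 167_000,
    "rocnagios.grid.sinica.edu.tw": 110_000,
    "rocmon-fzk.gridka.de": 85_000,
    "ngi-de-nagios.gridka.de": 85_000,
}


def failed_requests(total_hits, failure_rate):
    """Approximate number of failed requests for a given total and rate."""
    return round(total_hits * failure_rate)


def share_percent(hits, total_hits):
    """Percentage of the total traffic represented by `hits`."""
    return 100.0 * hits / total_hits


if __name__ == "__main__":
    print("failed requests/month:", failed_requests(MONTHLY_HITS, FAILURE_RATE))
    top_total = sum(TOP_HOSTS.values())
    print("top-host share: %.2f%%" % share_percent(top_total, MONTHLY_HITS))
```

Under these assumptions, a 0.2% failure rate corresponds to roughly 5,000 failed requests per month, and the four listed hosts together account for a little under a fifth of all hits.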
SAM Scope • SAM grid monitoring (SAM-Gridmon) • Central services (Web, API, availability) • SAM-Nagios • Monitoring platform supporting multiple configurations: • NGI-Nagios • VO-Nagios • Operations Tools-Nagios (ops-monitor)
SAM Overview SAM regional instances • 40 regional instances • Hosting over 230 metrics • Monitoring over 4000 services
Validation and deployment • SAM operates nightly validation platform • Runs basic validation tests for each component • 12 VMs running all known configurations • SAM-Gridmon • SAM-Nagios • NGI Nagioses (NGI_IT, CERN, NGI_UK) • VO Nagios • Operated continuously • Installed/upgraded every 2 days to latest SAM-Update (SVN)
Validation and deployment • Upgrade of the preproduction line • CERN ROC • SAM central service (grid-monitoring-preprod) – became part of EGI testbed • Upgrade of the production line • SAM central service (grid-monitoring) • EGI SR • Upgrade of the production services • Tested by EAs • EGI SR report