Update on Service Availability Monitoring • Marian Babik, Paloma Fuente, et al. (CERN) • Emir Imamagic (SRCE) • Paschalis Korosoglou (AUTH)
Overview • Recent changes and releases • SAM Update-20 • SAM Update-22 • Update-22 details and impact • Operations and support
Update-20 Changes • Released: December 2012 • Last SAM release based on gLite • New features: • Operational Tools Monitoring - http://ops-monitor.cern.ch • Operational Tools Availability in MyEGI • Monthly reports in central MyEGI • Nagios configuration improvements • More information: http://cern.ch/go/bH6K
Update-22 Changes • Planned: April 2013 • Integration of EMI probes (based on EMI/UMD) • The following EMI probes were integrated: • CREAM, WMS, BDII, ARC, LFC, FTS, SRM, ARGUS, GLEXEC, WN, UNICORE • Complete repackaging of SAM • Improved yaim configuration
Recent Activities • Required close collaboration with EMI and EGI JRA1 • Large-scale testing activity (with EMI) • https://twiki.cern.ch/twiki/bin/view/EMI/NagiosServerEMITestbed0022012 • SAM/Nagios probes WG (with EGI JRA1) • Meetings with EMI PTs • Evaluation of EMI probes (business logic) • Reported to EGI OMB
Next Release in Detail (1) • Update-22 will not be backward compatible in packaging • Installation from base SL5 is expected (no upgrade path, no SL6 support) • Probe packages imported to SAM • Middleware from UMD • Considerable simplification of repository setup (just SAM, UMD and EPEL)
Next Release in Detail (2) • Simplified yaim configuration: • new SAM_NAGIOS nodetype • SAM/Nagios configuration • Run-time optimizations • EMI NAGIOS nodetype provided by EMI • lightweight EMI-UI • environment setup for the probes • yaim -n NAGIOS -n SAM_NAGIOS
Next Release in Detail (3) • Changes to metric names are needed: • org.sam.CREAMCE-JobSubmit -> emi.cream.CREAMCE-JobSubmit • A metric translation mechanism was implemented to handle the transition period • It covers NGIs sending both new and old metrics at the same time • Status and Availability history will be kept in both local and central databases
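The renaming and translation described on this slide can be illustrated with a minimal sketch. It is not the SAM implementation: the table name `LEGACY_TO_EMI`, the functions `translate_metric` and `merge_results`, and the tuple layout of a result are all assumptions made for illustration. Only the single CREAMCE-JobSubmit mapping comes from the slide.

```python
# Hypothetical translation table. Only this one pair appears in the slides;
# a real table would list every renamed metric.
LEGACY_TO_EMI = {
    "org.sam.CREAMCE-JobSubmit": "emi.cream.CREAMCE-JobSubmit",
}


def translate_metric(name):
    """Return the new-style name for a legacy metric, or the name unchanged."""
    return LEGACY_TO_EMI.get(name, name)


def merge_results(results):
    """Collapse results reported under both old and new metric names.

    `results` is an iterable of (host, metric, timestamp, status) tuples
    (an assumed layout). Results are keyed by
    (host, translated metric, timestamp); on a clash the first one seen wins.
    """
    merged = {}
    for host, metric, timestamp, status in results:
        key = (host, translate_metric(metric), timestamp)
        merged.setdefault(key, status)
    return [(h, m, t, s) for (h, m, t), s in merged.items()]


if __name__ == "__main__":
    sample = [
        ("ce01.example.org", "org.sam.CREAMCE-JobSubmit", 1364800000, "OK"),
        ("ce01.example.org", "emi.cream.CREAMCE-JobSubmit", 1364800000, "OK"),
    ]
    for row in merge_results(sample):
        print(row)
```

Under these assumptions, a site that reports the same check under both names during the transition contributes a single result, stored under the new name, so history is not double-counted.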
Impact on SAM • Probes are now part of the middleware (and developed by many different PTs) • Continuous coordination from JRA1 is crucial after the end of EMI • SAM release schedule now depends on PTs • Probes still shipped with SAM • But testing expected from PTs and middleware providers to ensure probes work with underlying middleware
Impact on EGI SR • EGI Staged Rollout (SR) assumes an already tested, production-ready release • SAM can no longer guarantee this, since it: • Lacks control over probes and probe-to-middleware interfaces • Is no longer competent to test whether probes work correctly with the underlying middleware • Is unable to ensure probes will work against production infrastructure • More complex testing is needed
Possible Options • SAM testing releases • Via a dedicated testing repository • A lightweight process similar to EGI SR would be needed to evaluate a testing release • Once approved, SAM would release to SR • UMD adopts the probes and does the initial testing, to: • Ensure probes work with released middleware • Spot major issues early in the process, with the ability to block the release
Operations and Support • SAM central services (since Sept. 2012) • 206 operational tickets • upgrades, generating reports, interventions, profile changes • 62 re-computations • GGUS (since Sept. 2012): • 117 GGUS tickets in 3rd level • 36 GGUS tickets in 2nd level
Summary • SAM central services stable • Substantial improvements in adoption of EMI probes, operational tools monitoring and Nagios configuration features • Continuous support and bug-fixing • Near-term plans (MS710)
Near-term plans • Update-22 will conclude development work planned for EGI-InSPIRE • but SAM will continue to evolve • Until end of EGI-InSPIRE • Continuous support and bug-fixing • Maintenance and operations of the SAM central services • SAM central Oracle databases • SAM central services (MyEGI and API) • EGI monthly reports • Operational Monitoring and Availability
Web API statistics - March • ~ 2.5M hits/month • ~ 60k hits/day • Top hosts querying the Web API: • mon-it.cnaf.infn.it (167k hits) • rocnagios.grid.sinica.edu.tw (110k hits) • rocmon-fzk.gridka.de (85k hits) • ngi-de-nagios.gridka.de (85k hits) • Failures (0.2%)
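To put these figures in proportion, the short calculation below derives the combined share of the four listed hosts and the approximate number of failed requests. It is a sketch under two assumptions not stated on the slide: the per-host counts cover the same month as the ~2.5M total, and the 0.2% failure rate applies to all hits. The variable and function names are illustrative only.

```python
MONTHLY_HITS = 2_500_000  # "~ 2.5M hits/month" from the slide (approximate)
FAILURE_RATE = 0.002      # "Failures (0.2%)", assumed to be a share of all hits

# Per-host hit counts as listed on the slide, assumed to be monthly totals.
top_hosts = {
    "mon-it.cnaf.infn.it": 167_000,
    "rocnagios.grid.sinica.edu.tw": 110_000,
    "rocmon-fzk.gridka.de": 85_000,
    "ngi-de-nagios.gridka.de": 85_000,
}


def share(hits, total=MONTHLY_HITS):
    """Return `hits` as a percentage of `total`."""
    return 100.0 * hits / total


top_total = sum(top_hosts.values())
failed_hits = round(MONTHLY_HITS * FAILURE_RATE)

if __name__ == "__main__":
    for host, hits in top_hosts.items():
        print(f"{host}: {share(hits):.2f}% of monthly hits")
    print(f"top four hosts combined: {share(top_total):.2f}%")
    print(f"approximate failed hits per month: {failed_hits}")
```

Because the slide's totals are rounded, the results are order-of-magnitude indications rather than exact counts.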
SAM Scope • SAM grid monitoring (SAM-Gridmon) • Central services (Web, API, availability) • SAM-Nagios • Monitoring platform supporting multiple configurations: • NGI-Nagios • VO-Nagios • Operations Tools-Nagios (ops-monitor)
SAM Overview SAM regional instances • 40 regional instances • Hosting over 230 metrics • Monitoring over 4000 services
Validation and deployment • SAM operates a nightly validation platform • Runs basic validation tests for each component • 12 VMs running all known configurations • SAM-Gridmon • SAM-Nagios • NGI Nagios instances (NGI_IT, CERN, NGI_UK) • VO Nagios • Operated continuously • Installed/upgraded every 2 days to the latest SAM-Update (SVN)
Validation and deployment • Upgrade of the preproduction line • CERN ROC • SAM central service (grid-monitoring-preprod) – became part of EGI testbed • Upgrade of the production line • SAM central service (grid-monitoring) • EGI SR • Upgrade of the production services • Tested by EAs • EGI SR report