Service Availability Monitoring. Marian Babik, Wojciech Lapka, Paloma Fuente, Jacobo Tarragon, Robert Veznaver (CERN), Emir Imamagic (SRCE), Paschalis Korosoglou (AUTH), Anastasios Andronidis (AUTH). Agenda: Motivation, Usage, Capabilities, Architecture, Interfaces, Day to day.
Service Availability Monitoring Marian Babik, Wojciech Lapka, Paloma Fuente, Jacobo Tarragon, Robert Veznaver (CERN) Emir Imamagic (SRCE) Paschalis Korosoglou (AUTH) Anastasios Andronidis (AUTH)
Agenda • Motivation • Usage • Capabilities • Architecture • Interfaces • Day to day
Why SAM? • Understand and improve the quality of services delivered by tiers and sites • Provide feedback to management and funding agencies on whether sites and tiers comply with the previously agreed SLAs
SAM today SAM - distributed monitoring framework for computing availability and reliability of sites and services. 262 metrics 4200 services monitored 10 VOs 40 SAM instances 700 000 metric results/day
Use of SAM • WLCG • Experiments (ATLAS, CMS, LHCb, ALICE) • Management • EGI • VOs (Biomed, , , Gisela) • Management • Operations (COD, ROD) • Site managers
Use of SAM 729 546 metric results/day 550 metric results/s
SAM capabilities • Open source based • Nagios, ActiveMQ, Django • Framework for executing Nagios probes and aggregating metric results • Existing probes for almost every grid middleware (EMI, gLite, UNICORE, ARC, Desktop Grids, QCG) • Notification • Reporting • Web interface and Web API • Support for third-party monitoring systems • OSG
MyWLCG - Web API • Exposing Web API to a number of clients: • Experiments dashboards • EGI dashboards • SLS • 3rd parties • Supporting XML, JSON • On average 2.0M hits/month
SAM Day to Day • Coordination • Scope and effort management (roadmap, PoW), change management, communication plan • Development • Actively maintaining/improving 10 components • ~200k lines of code, 200+ packages • 612 development tickets closed (last 9 months) • Validation and staged rollout • Validation infrastructure deployed • Running all SAM services on 10 nodes • Continuous validation of the latest development release • EGI and WLCG staged rollout
SAM Day to Day • Support • Direct support to WLCG VOs and EGI/OSG • SNOW: Grid Infrastructure Monitoring SE (2 FEs) • GGUS: 3rd level SAM/Nagios SU • 191 tickets closed (last 9 months) • Operations • Production and PreProduction infrastructures • Responsible for the operation of: • 2 SAM-Gridmon: central monitoring services • 8 VO SAM-Nagios: monitoring WLCG VO services • 1 OPS SAM-Nagios: monitoring the monitoring services! • 396 tickets closed (last 9 months)
Summary • SAM is tracking availability and reliability of sites in order to understand and improve their QoS • SAM is an open-source based platform • SAM is used daily by WLCG and EGI to monitor sites, services and compute their availability • SAM will continue supporting WLCG and EGI in their day to day operations
Contacts and References • Technical • tom-developers@cern • Support • SNOW (Grid infrastructure monitoring SE – SAM/Nagios FE) • Links • SAM documentation (http://cern.ch/go/Qq8w) • SAM internal docs (http://cern.ch/go/lnH7) • SAM central service • http://grid-monitoring.cern.ch/mywlcg/ • SAM CHEP 2012 papers: • SAM operations (http://cern.ch/go/SPt8) • SAM architecture (http://cern.ch/go/Mst6)
Challenges • Many requirements from both WLCG and EGI • Possible improvements • Technology evolution • Testing of services not defined in GOCDB/OIM • Definition/testing of meta-services • Generic mechanism for loading results from other monitoring systems • Regional availability computations
Open Source Technologies • Many mature technologies improved by the community and successfully used at scale • Integrate them as pluggable tools in SAM • Nagios: monitoring platform • Push and pull model monitoring • Pluggable system for probes • Simple notification system • System well known by many system administrators • ActiveMQ: messaging infrastructure • Integration platform for Nagios instances • Standardized messaging protocol • High message throughput • Resiliency to network failures
WLCG today Global collaboration of more than 170 computing centres in 36 countries, linking up national and international grid infrastructures.
Configuration • Topological aggregation • one source for topology aggregating OSG, EGI and WLCG sources • Profile management • defines and manages metrics • Nagios Configuration Generator • bootstraps Nagios based on information about needed topology and defined metrics
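The bootstrapping step above can be illustrated with a minimal sketch. It is not the SAM Nagios Configuration Generator itself: the topology and profile layouts, the host names, the metric names and the check command names are all hypothetical, and the `generic-service` template is assumed to exist in the target Nagios installation. Only the `define service { ... }` object syntax and its directives come from Nagios.

```python
def render_service(host, metric, command, template="generic-service"):
    """Render one Nagios service object definition as text."""
    return (
        "define service {\n"
        f"    use                  {template}\n"
        f"    host_name            {host}\n"
        f"    service_description  {metric}\n"
        f"    check_command        {command}\n"
        "}\n"
    )


def generate_config(topology, profile):
    """Combine topology and metric profile into Nagios service definitions.

    topology: host name -> list of service flavours running on that host
    profile:  service flavour -> list of (metric name, check command) pairs
    Both layouts are assumptions made for this sketch.
    """
    blocks = []
    for host in sorted(topology):
        for flavour in topology[host]:
            # Flavours without any metric in the profile produce no checks.
            for metric, command in profile.get(flavour, []):
                blocks.append(render_service(host, metric, command))
    return "\n".join(blocks)


# Hypothetical inputs: two hosts and a profile with three metrics.
topology = {
    "ce01.example.org": ["CE"],
    "se01.example.org": ["SE"],
}
profile = {
    "CE": [("example.CE-Ping", "check_example_ping")],
    "SE": [
        ("example.SE-Ping", "check_example_ping"),
        ("example.SE-Put", "check_example_put"),
    ],
}

config_text = generate_config(topology, profile)
print(config_text)
```

Keeping topology and profile as separate inputs mirrors the split on the slide: the same profile can be applied to a different topology without touching the metric definitions.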
Collection/Notification • Nagios - open-source monitoring platform • Provides the following benefits for SAM: • Push and pull model monitoring • Pluggable system for probes • Nagios exchange - with many existing probes • System well known by many system administrators • Basic notification system
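As an illustration of the pluggable probe model, here is a minimal sketch of a probe that follows the standard Nagios plugin convention: one line of status text plus a return code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The measured quantity (a response time passed as an argument) and the thresholds are assumptions chosen for the example; it is not one of the SAM probes.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(response_seconds, warn=2.0, crit=5.0):
    """Map a measured response time to a (return code, status line) pair."""
    if response_seconds is None:
        return UNKNOWN, "UNKNOWN - no measurement available"
    if response_seconds >= crit:
        return CRITICAL, f"CRITICAL - responded in {response_seconds:.2f}s"
    if response_seconds >= warn:
        return WARNING, f"WARNING - responded in {response_seconds:.2f}s"
    return OK, f"OK - responded in {response_seconds:.2f}s"


def run(argv):
    """Parse the hypothetical command line, print the status, return the code."""
    try:
        measured = float(argv[0]) if argv else None
    except ValueError:
        measured = None  # unparsable input is reported as UNKNOWN
    code, message = evaluate(measured)
    print(message)
    return code
```

A wrapper script would typically end with `sys.exit(run(sys.argv[1:]))` so that Nagios receives the return code as the process exit status.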
Transport/Filtering • ActiveMQ - open-source messaging and integration patterns server • Provides the following benefits for SAM: • integration platform for Nagios instances • standardized messaging protocol • high performance in terms of message throughput • resiliency to network failures Credits: Lionel, Massimo
Storage/Aggregation • Relational storage of metric results • Oracle, MySQL • Aggregation • Status computation – state of services and sites at a given point in time calculated based on the received metric results from Nagios • Availability computation • Availability - fraction of time a service was up during the period the service was known • Reliability – fraction of time a service was up during the period the service was scheduled to be up
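The two definitions above can be made concrete with a small sketch. It assumes that the status history of a service is already available as a list of intervals with one of four hypothetical labels (UP, DOWN, SCHEDULED_DOWN, UNKNOWN), and that "scheduled to be up" means known time outside scheduled downtime. The actual SAM computation may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    hours: float
    status: str  # "UP", "DOWN", "SCHEDULED_DOWN" or "UNKNOWN" (assumed labels)


def _hours(intervals, statuses):
    """Total duration of all intervals whose status is in `statuses`."""
    return sum(i.hours for i in intervals if i.status in statuses)


def availability(intervals):
    """Fraction of the known period during which the service was up."""
    known = _hours(intervals, {"UP", "DOWN", "SCHEDULED_DOWN"})
    return _hours(intervals, {"UP"}) / known if known else None


def reliability(intervals):
    """Fraction of the scheduled-up period during which the service was up."""
    scheduled_up = _hours(intervals, {"UP", "DOWN"})
    return _hours(intervals, {"UP"}) / scheduled_up if scheduled_up else None


# One hypothetical day: 18 h up, 2 h unscheduled outage,
# 3 h scheduled downtime and 1 h with no metric results.
day = [
    Interval(18, "UP"),
    Interval(2, "DOWN"),
    Interval(3, "SCHEDULED_DOWN"),
    Interval(1, "UNKNOWN"),
]

print(f"availability = {availability(day):.3f}")  # 18 / 23
print(f"reliability  = {reliability(day):.3f}")   # 18 / 20
```

With these inputs the scheduled downtime lowers availability but not reliability, which is the practical difference between the two figures.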
Visualization/Reporting • http://youtu.be/oG-1B6KaKnk