WLCG infrastructure monitoring proposal

WLCG infrastructure monitoring proposal Pablo Saiz IT/SDC/MI 16thAugust 2013

Table of contents Summary of the progress Desired structure of applications Proposal for infrastructure monitoring Infrastructure monitoring P. Saiz

Summary Infrastructure monitoring P. Saiz

Motivation • Reduction on number of people • Redefining scope of applications • Combining expertise • Step out and evaluate other alternatives • Goal: • Offer (at least) same QoS with less resources Infrastructure monitoring P. Saiz

Status so far WLCG monitoring consolidation group created Applications supported by the section Applications used … so now we know what to provide Infrastructure monitoring P. Saiz

How to provide it • Input from our experience • Input from other groups • What is available out there • Split in different areas of work • Source of Information • Transport • Storage • Aggregation • Review of the areas • Visualization • Documentation • Deployment • Recurrent tasks Infrastructure monitoring P. Saiz

Structure of applications Infrastructure monitoring P. Saiz

Different layers of applications Transport Visualize Storage Collect information Aggregate Recurrent Tasks Documentation Deployment Infrastructure monitoring P. Saiz

Deployment Transport Visualize Storage Collect information Aggregate Recurrent Tasks Documentation Deployment Deployment • Using openstack, puppet, hiera, foreman • Quota of 100 nodes, 240 cores • Multiple templates already created • Development machine (7 nodes) • Web servers (SSB, SUM, WLCG transfers, Job: 16 nodes) • Elastic Search (6 nodes), Hadoop (4 nodes) • Currently working on nagiosinstallation • Migrating machines from quattor to AI • Koji and Bamboo for build system and continuous integration Infrastructure monitoring P. Saiz

Source of information Transport Visualize Collect information Storage Collect information Nagios GOCDB REBUS Aggregate OIM Savannah Recurrent Tasks Documentation Other app Deployment Gather info from external, internal sources. Publish it in the transport layer Infrastructure monitoring P. Saiz

Transport Transport Transport Visualize Storage Collect information Aggregate Recurrent Tasks Documentation Deployment Message Broker Local files HTTP PUT/GET UDP (table in DB)? Infrastructure monitoring P. Saiz

Storage Transport Visualize Storage Storage Collect information Accepts any data • #jobs, status of a service, downtime, pledges, channel status • Metric, Instance, Time Range, Value Archival • Long term data • (Same format as Metric Storage)? Current Metrics • Most common views Metadata • Profiles • Topology Archival Aggregate Current Metrics Recurrent Tasks Documentation Meta data Deployment Infrastructure monitoring P. Saiz

Aggregation Transport Visualize Storage Collect information Summary Site readiness Aggregate Aggregate Availability Recurrent Tasks Documentation Deployment Treated as another metric Might collect input from previous metrics Current schema of ‘CMS Site readiness’ Infrastructure monitoring P. Saiz

Visualization Transport Visualize Visualize Storage Collect information Server: • HTML skeleton • REST API with JSON data • Cache: memcache, varnish Client • Common library + plugin • jQuery • Common MVC • No obvious choice… • Plots (Interactive, Exportable, Embeddable) • Highcharts Aggregate Recurrent Tasks Documentation Deployment Infrastructure monitoring P. Saiz

Infrastructure monitoring Infrastructure monitoring P. Saiz

Current situation • Big system, difficult to maintain/evolve • Many internal dependencies • Multiple schemas, aggregations: • SSB, MRS, ACE • Scope much bigger than what we need • Limit to WLCG • Usage of probes • Does not test what the experiments are doing! • Non-trivial deployment of new tests • Based on technologies available at the time of the design • New requests from experiments: • Test whatever they want • Availability vs Usability • Combine Dashboard/SAM apps Infrastructure monitoring P. Saiz

Infrastructure monitoring Transport Visualize Storage Collect information Nagios Pledge Archival MyWLCG SUM SSB Down Pilot Metrics Report Trend HC VO feed POEM Aggregate ACE Recurrent Tasks Documentation Deployment Infrastructure monitoring P. Saiz

And for the prototype… Transport Visualize Storage Collect information Nagios Pledge Archival MyWLCG SUM SSB Simplified MRS • Accepts any data • No foreign keys! • No status calculation • 300K messages per day Down Direct Processed Data SSB format New Data Metrics Metrics Report Trend SSB Storage • Records status changes • Same procedure as any other metric HC VO feed POEM Aggregate ACE All the data in storage have the same format: • Instance, Metric, Time range, Value • Source could be nagios, pilot framework, VO-defined metrics, availabilities Recurrent Tasks Documentation Deployment consume2db Infrastructure monitoring P. Saiz

And now we can see metrics… 19 14 August 2013 Infrastructure monitoring P. Saiz Infrastructure monitoring P. Saiz

Aggregation • Combination of ACE +SSB Virtual Columns • Two types: • Horizontal: Ins1(M1…Mn) Ins1 (Mp) • Vertical: M1 (Ins1…Insn) Insp(M2) • Initial options for “and”, “or” of current status • Later on, might be extended to ‘sliding window’ • Full description Infrastructure monitoring P. Saiz

Examples of aggregation ATLAS_CRITICAL WN Site (expand this column) Infrastructure monitoring P. Saiz

Summary • Lots of progress towards unified schema • Data can be published from different sources • Nagios, VO-defined metrics, ACE, (HC, Job Pilots) • Single schema for storage • Components talk to each other through API • Getting close to a “proof of concept” • Aggregation needs some work • Visualization might need adjusting • Other tasks can go in parallel • NoSQL evaluation • Nagios configuration • Only active metrics Infrastructure monitoring P. Saiz

WLCG infrastructure monitoring proposal

WLCG infrastructure monitoring proposal

Presentation Transcript

A Strategy for WLCG Monitoring

WLCG

Monitoring infrastructure

Deployment of a WLCG network monitoring infrastructure based on the perfSONAR -PS technology

WLCG Monitoring Consolidation

Monitoring Infrastructure

WLCG Services Monitoring: SAM

RBX Temperature Monitoring Proposal

APEL CPU Accounting in the EGEE/WLCG infrastructure

WLCG Network Monitoring using perfSONAR -PS

Agile Infrastructure Monitoring

LHC - Technical Infrastructure Monitoring

WLCG Monitoring – some worked examples

WLCG Monitoring – An overview

Grid Infrastructure Monitoring

Infrastructure Monitoring Services

Grid Infrastructure Monitoring

New WLCG Grid Service Monitoring Displays

Infrastructure Monitoring Market