
The Architecture of the WLCG Monitoring System

This presentation outlines the architecture and principles developed by the WLCG Monitoring Working Group, emphasizing the role of site administrators, authority over data, per-community visualization, messaging-based integration and reusable components in improving grid reliability. It argues for a modular approach, a common shared infrastructure, and local problem detection to make the monitoring system more robust.


Presentation Transcript


  1. The Architecture of the WLCG Monitoring System • Michel Jouvin (GRIF/LAL), on behalf of James Casey (CERN) (all materials from J. Casey) • EGEE France, Lyon, April 10, 2008

  2. Outline • WLCG Monitoring Working Group • Mandate, background and key principles • Technology investigation • Messaging system • Reporting tools • Site Monitoring Prototype • Example • OSG RSV publication • Job Reliability Monitoring • WLCG/CCRC08 VO-oriented views • Summary

  3. WLCG Monitoring Working Group • The WLCG Monitoring working group was set up in Nov. 2006 • “…help improve the reliability of the grid infrastructure…” • “…provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service…” • “…stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management” • Now acting as a project rather than a WG • Provides and maintains deliverables • Part of normal operations

  4. Rely on Sites • “Site administrators are closest to the problems, and need to know about them first” • They are on the front line, reducing the time to respond • Initial focus has been on site monitoring • Implications • Improved understanding of how to monitor services • “Service Cards” developed by EGEE SA3 • Need to deploy components to sites • Sometimes an entire monitoring system • Needs active participation of site admins

  5. Tell others what you know • “If you’re monitoring a site remotely, it’s only polite to give the data to the site” (Chris Brew, RAL) • Remote systems should feed information back to sites • Implications • Common publication mechanisms • Integration into fabric monitoring • Discovery of data • Site trust of the data – is it a “backdoor” communications mechanism?

  6. Authority for data… • Currently repositories have direct DB connections to (all) other repositories • E.g. SAM, Gridview, Gstat, GOCDB, CIC • And they cache, merge and process the data • Implications • We have an “interlinked distributed schema” • Tools should take responsibility for the contents of their parts of it

  7. Visualization for each community • “User-targeted” visualization • All should use the same underlying data • Extract information processing out of the visualization tools • Provide the same processed info to all visualizations • Interface with community-specific information, e.g. names • Implications • Many “similar” dashboards • Everyone sees the same data • Common frameworks/widgets would help

  8. Process • Review existing monitoring systems • “Improving reliability is our goal!” • Identify gaps • Design an integrated architecture for monitoring • Prototype some solutions • Reduce to a minimum the specific components to develop and maintain • Must be usable by the whole WLCG • EGEE, OSG, NDG

  9. The pieces to work with… • The starting point was what we have now: • Availability testing framework – SAM/RSV • Job and data reliability monitoring – Gridview • Grid topology – GOCDB/Registration DB • Dynamic view of the grid – BDII/CeMon • Accounting – APEL/Gratia • Experiment views – Dashboards • Fabric monitoring – Nagios, LEMON, … • Grid operations tools – CIC Portal • They work together right now • To a certain extent!

  10. We’ve got an integration problem!

  11. No monolithic systems • Different systems should specialize in their areas of expertise • And not also have to invent all the common infrastructure • Implications • Less overlap and duplication of work • Someone needs to manage some common infrastructure • We need to agree on the common infrastructure

  12. Don’t have central bottlenecks • “Local problems detected locally shouldn’t require remote services to work out what the problem is” • There is still a role for central detection of problems • They are just reported locally too • Lots of central processing is done now in SAM/Gridview • Implications • Do as much processing as possible locally (or regionally) • Helps scaling – improves robustness • Enables automation – reduces manpower • Harder to deploy

  13. Re-use, don’t re-invent • What do we do? Collect some information, move it around, store it, view it, report on it • This is pretty common – we should look at existing systems • Already happening for site fabric… Nagios, LEMON, … • Implications • Less code to develop and maintain • Integration nightmare?

  14. Don’t impose systems on sites • We can’t dictate a monitoring system • Many (big?) sites already have a deployed system • We have to be pluggable into them • Implications • Modular approach • Specifications to define interfaces between existing systems and new components

  15. Broker at the centre • Reliability and persistence of messaging are built into the broker network • Mitigates the single points of failure we’ve had with previous solutions • Message delivery is guaranteed

  16. Plug’n’Play Components • Still can end up with spaghetti • A component must take care only of its own job and ignore details of others (e.g. data schema) • Tight specification of interaction of components is required • Message format specifications • Standard metadata schema • Message Queue naming schemas • Protocols • Standard “Patterns” can act as a basis • http://enterpriseintegrationpatterns.com/
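As a concrete (hypothetical) illustration of such a specification: the slides do not give the actual WLCG message schema, so the field names, topic-naming convention and "key: value" encoding below are assumptions, sketched in Python only to show what a tight message-format plus naming specification might look like.

    from datetime import datetime, timezone

    # Hypothetical topic-naming schema: <project>.<area>.<message type>
    TOPIC = "grid.probe.metricOutput"

    def encode_metric_message(site, service_uri, metric, status, detail):
        """Serialise one metric result as simple 'key: value' text lines (illustrative fields only)."""
        fields = {
            "siteName": site,
            "serviceURI": service_uri,
            "metricName": metric,
            "metricStatus": status,            # OK | WARNING | CRITICAL | UNKNOWN
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "summaryData": detail,
        }
        return "\n".join(f"{k}: {v}" for k, v in fields.items()) + "\nEOT\n"

    print(encode_metric_message("EXAMPLE-SITE", "srm://se.example.org:8443",
                                "org.example.SRM-Put", "OK", "file written and read back"))

With an agreed encoding and destination-naming scheme like this, a consumer never has to know which tool produced a message, only which topic it was published on.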

  17. Messaging Systems for Integration • We need: • Loose coupling of systems • Distributed components • Reliable delivery of messages • Standard methods of communication • Flexibility to add new producers and consumers of the information without having to reconfigure everything • Message Oriented Middleware provides this • And is widely used in similar scenarios

  18. Messaging Systems • Flexible architecture: • Deliver messages either point-to-point (queues)… • …or in multicast mode (topics) • Support synchronous or asynchronous communication • Reliable delivery of messages: • Provide reliability to the senders if required • Configurable persistency / master-slave • Highly scalable: network of brokers
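A minimal sketch of the queue/topic distinction: ActiveMQ speaks the text-based STOMP protocol (by default on port 61613), so a producer can be written with nothing but a TCP socket. The broker host, credentials (none) and destination names below are assumptions for illustration, not the actual WLCG configuration.

    import socket

    def stomp_send(host, destination, body, port=61613):
        """Send one message over STOMP 1.0: CONNECT, wait for CONNECTED, SEND, DISCONNECT."""
        with socket.create_connection((host, port)) as s:
            s.sendall(b"CONNECT\n\n\x00")
            s.recv(4096)  # wait for the CONNECTED frame before sending
            s.sendall(f"SEND\ndestination:{destination}\n\n{body}\x00".encode())
            s.sendall(b"DISCONNECT\n\n\x00")

    # Point-to-point: exactly one consumer of the queue receives the message.
    stomp_send("broker.example.org", "/queue/grid.probe.metricOutput", "site A result")

    # Multicast: every subscriber of the topic receives its own copy.
    stomp_send("broker.example.org", "/topic/grid.probe.metricOutput", "site A result")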

  19. ActiveMQ • Mature open-source implementation of these ideas • Top-level Apache project • Commercial support available from IONA • Easy to integrate • Multiple language + transport protocol support • Good performance characteristics • See later… • Work done to integrate into our environment • RPMs, Quattor components + templates, LEMON alarms

  20. ActiveMQ Architecture

  21. ActiveMQ Throughput • More consumers ⇒ more throughput?? No – consumer bottleneck! • With a larger number of producers, even more messages per second saturate the consumer.

  22. Reporting for WLCG • Currently done by post-processing results and graphs in Excel • Much manual work needed! • Try to implement it directly on the GridView DB • Using a mature open-source reporting toolkit – JasperReports • UI report builder – iReports • Web-based report server – OpenReports
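The slides do not spell out the report contents; as an illustration of the kind of calculation such a report automates (and of the MoU reporting mentioned in the summary), the sketch below computes site availability and reliability from status samples, using the commonly quoted WLCG-style definitions (availability = time OK / total time; reliability = time OK / (total time − scheduled downtime)). The data and function are hypothetical.

    def availability_reliability(samples):
        """samples: list of (hours, status) where status is 'OK', 'DOWN' or 'SCHED_DOWN'."""
        total = sum(h for h, _ in samples)
        up = sum(h for h, s in samples if s == "OK")
        sched = sum(h for h, s in samples if s == "SCHED_DOWN")
        return up / total, up / (total - sched)

    # One month of hypothetical test results for a site.
    month = [(680, "OK"), (24, "SCHED_DOWN"), (16, "DOWN")]
    a, r = availability_reliability(month)
    print(f"availability {a:.1%}, reliability {r:.1%}")   # 94.4%, 97.7%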

  23. JasperReports

  24. Site Monitoring & Nagios • More details in the next talk: • “Central Europe ROC Nagios Experience” • Nagios has shown itself to be a very useful component for building many parts of our monitoring solutions • Local site monitoring • Replacing the SAM execution framework • Too hard to maintain, too centralized to scale • gStat – BDII monitoring • Probes within Nagios • Publish site results upwards to be part of the availability/reliability computation
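Nagios probes follow a simple plugin contract: print one status line (optionally with performance data after a "|") and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is a generic probe in that style, not one of the actual SAM/WLCG probes; the endpoint and thresholds are made up.

    #!/usr/bin/env python
    import sys, time, urllib.request

    # Hypothetical endpoint to test; a real grid probe would test an SRM, CE, BDII, ...
    URL = "http://site-bdii.example.org:2170/"
    WARN, CRIT = 5.0, 15.0     # response-time thresholds in seconds

    def main():
        start = time.time()
        try:
            urllib.request.urlopen(URL, timeout=CRIT)
        except Exception as exc:
            print(f"SERVICE CRITICAL - {exc}")
            return 2
        elapsed = time.time() - start
        status, code = ("OK", 0) if elapsed < WARN else ("WARNING", 1)
        # Performance data after '|' can be graphed by Nagios add-ons.
        print(f"SERVICE {status} - responded in {elapsed:.1f}s | time={elapsed:.3f}s")
        return code

    if __name__ == "__main__":
        sys.exit(main())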

  25. Messaging-based archiving and reporting

  26. In Production – OSG RSV to SAM • RSV – Resource and Service Validation • Uses Gratia as the native transport within OSG • And the OSG GOC runs a bridge to SAM for WLCG

  27. Job Reliability Monitoring • Requires gathering job state transitions from all jobs submitted to WLCG resources • EGEE (RB/WMS + Condor_G) + OSG + NDG • Only gather this information once • Propagate it to interested parties • Use existing systems and expertise where possible • Don’t try to deploy components on every WMS/RB/L&B/CE/… • Get ‘cooked’ data from the systems • Hook up with pilot jobs • Linkage between pilot and experiment jobs as a ‘state change’

  28. Current situation • Currently L&B log files are mined and sent via R-GMA • Requires a specific component on every L&B • Loses many records • GridView hacks to ‘finish’ unfinished jobs after 24h • Inaccurate results • Jobs reported via experiment frameworks • Gathered from many sources – Imperial College XML files, job submission tools, MonAlisa reporting from jobs, R-GMA • But some information is missing for Condor_G jobs • Info between submission and the user job starting on the WN • Job aborted

  29. Proposal • Use the WLCG monitoring infrastructure (MSG) for collecting and transporting the data • Messaging system • Standard message formats • Work with expert groups to instrument the job submission systems • Visualization by GridView + Dashboards

  30. EGEE • L&B notifications mean we don’t have to run components mining L&B logfiles • The consumer of notifications can be remote • L&B is stated to scale to our needs • Tested at >1 million records/day • Testing of integration with notifications is underway by the GridView team • Message formats already defined • The old log-mining approach will be moved to the messaging system to free GridView from the R-GMA dependency
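To complement the producer sketch above, a remote consumer of such notifications could subscribe to the job-status destination over STOMP; again the broker, destination name and message handling below are assumptions for illustration only, not the actual GridView integration.

    import socket

    def stomp_consume(host, destination, port=61613, max_frames=10):
        """Subscribe to a destination and print incoming frames (STOMP 1.0, no acknowledgement logic)."""
        with socket.create_connection((host, port)) as s:
            s.sendall(b"CONNECT\n\n\x00")
            s.recv(4096)  # CONNECTED frame
            s.sendall(f"SUBSCRIBE\ndestination:{destination}\nack:auto\n\n\x00".encode())
            buf = b""
            while max_frames > 0:
                buf += s.recv(4096)
                while b"\x00" in buf and max_frames > 0:
                    frame, buf = buf.split(b"\x00", 1)
                    print(frame.decode(errors="replace").strip())
                    max_frames -= 1
            s.sendall(b"DISCONNECT\n\n\x00")

    # Hypothetical destination carrying job state-change messages.
    stomp_consume("broker.example.org", "/topic/grid.jobmonitoring.statusChange")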

  31. Condor_G • The Condor_G submitter is instrumented to create L&B messages • Done by a separate listener process that is started by Condor_G • A limited subset of Condor_G state changes will be sent • The listener/reporter can use different transports for reporting • Currently MonAlisa is used as the transport layer • Will migrate to the WLCG messaging system

  32. Pilot Jobs • The L&B client resides on every worker node • It can be used to submit additional messages to L&B for a job • Timestamps + environment for job-wrapper start/end • Timestamp of the handover to the user job • Linkage of the pilot job to the experiment job ID • … • Benefit: it’s all in one coherent data structure for a given job
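The extra records a pilot wrapper would emit are easy to picture; the sketch below builds hypothetical state-change events for the wrapper start, the handover to the experiment job (carrying the linkage to the experiment job ID) and the wrapper end. None of the field names come from the slides; they only illustrate how the linkage could travel as ordinary state changes attached to one grid job.

    from datetime import datetime, timezone

    def pilot_event(grid_job_id, event, **extra):
        """Build one hypothetical state-change record emitted from the worker node."""
        record = {
            "gridJobId": grid_job_id,
            "event": event,                     # wrapper_start | user_job_start | wrapper_end
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        record.update(extra)
        return record

    pilot = "https://lb.example.org:9000/abc123"           # hypothetical L&B job ID
    events = [
        pilot_event(pilot, "wrapper_start", workerNode="wn042.example.org"),
        pilot_event(pilot, "user_job_start", experimentJobId="PanDA-987654"),  # the linkage
        pilot_event(pilot, "wrapper_end", exitCode=0),
    ]
    for e in events:
        print(e)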

  33. EGEE Architecture

  34. CMS SAM Portal

  35. ServiceMap • What’s a ServiceMap? • It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure • Gridmap: “treemap”-based view of the grid • http://gridmap.cern.ch • What’s the CCRC’08 ServiceMap? • Service ‘readiness’ • Service availability • Experiment metrics • A single place to see both the VO and the infrastructure view of the grid

  36. CCRC’08 ServiceMap • …Demo… • http://gridmap.cern.ch/ccrc08/servicemap.html

  37. WLCG Experiment metrics • Show the VO view of the infrastructure • Two extra ‘maps’ planned • Reliability (e.g. successful data transfers, jobs, …) • Metrics (MB/s, events/s, …) • Need interaction with the experiments to create these two views

  38. Summary • CCRC’08 is a good opportunity to try some new operational tools • And evaluate them in a ‘real-world’ mode • The CCRC’08 ServiceMap seems to give a useful view of the grid • Need to iterate on what is useful to show • And fill in the white spaces… • Next steps • MoU calculation and reporting to sites • Feedback on all the tools is welcome!

  39. Links to CCRC08 tools • CCRC’08 ServiceMap • http://gridmap.cern.ch/ccrc08/servicemap.html • CCRC’08 Observations logbook • https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/ • RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/elog.rdf • Response tracking logbook • https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/ • RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/elog.rdf

  40. Strategy Summary • Converge to standards, but without a big bang • Leverage the underlying infrastructures rather than layer lots of systems on top • Reduce maintenance/development costs by using commodity components whenever possible • Modular and loosely coupled to adapt to changes in infrastructure and funding models

  41. Architecture Summary • Our design for a new architecture leverages commodity software components • Probe execution (Nagios), messaging (ActiveMQ), reporting (JasperReports) • It is essentially an integration exercise • Make existing tools work together better • In order to improve reliability • This is what we will verify over the next 12 months

  42. More Information… • GDB reports on monitoring by James • Almost every month at GDB or pre-GDB • http://indico.cern.ch/categoryDisplay.py?categId=3l181 • Improving Job Reliability • http://indico.cern.ch/conferenceDisplay.py?confId=20228 • Email James… • Look in the CERN directory…
