
The Architecture of the WLCG Monitoring System

This presentation outlines the architecture and principles developed by the WLCG Monitoring Working Group, emphasizing the role of site administrators, authority over data, per-community visualization, messaging-based integration and reusable components in improving grid reliability. It argues for a modular approach, a common shared infrastructure, and local problem detection to make the monitoring system more robust.


Presentation Transcript


  1. The Architecture of the WLCG Monitoring System • Michel Jouvin (GRIF/LAL), on behalf of James Casey (CERN) (all materials from J. Casey) • EGEE France, Lyon, April 10, 2008

  2. Outline • WLCG Monitoring Working Group • Mandate, background and key principles • Technology investigation • Messaging system • Reporting tools • Site Monitoring Prototype • Example • OSG RSV publication • Job Reliability Monitoring • WLCG/CCRC08 VO-oriented views • Summary

  3. WLCG Monitoring Working Group • The WLCG Monitoring working group was set up in Nov. 2006 • “…help improve the reliability of the grid infrastructure…” • “…provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service…” • “…stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management” • Now acting as a project rather than a WG • Provides and maintains deliverables • Part of normal operations

  4. Rely on Sites • “Site administrators are closest to the problems, and need to know about them first” • They are on the front line, reducing the time to respond • Initial focus has been on site monitoring • Implications • Improved understanding of how to monitor services • “Service Cards” developed by EGEE SA3 • Need to deploy components to sites • Sometimes an entire monitoring system • Needs active participation of site admins

  5. Tell others what you know • “If you’re monitoring a site remotely, it’s only polite to give the data to the site” (Chris Brew, RAL) • Remote systems should feed information back to sites • Implications • Common publication mechanisms • Integration into fabric monitoring • Discovery of data • Site trust of the data – is it a “backdoor” communications mechanism?

  6. Authority for data… • Currently repositories have direct DB connections to (all) other repositories • E.g. SAM, Gridview, Gstat, GOCDB, CIC • And they cache, merge and process the data • Implications • We have an “interlinked distributed schema” • Tools should take responsibility for the contents of their parts of it

  7. Visualization for each community • “User-targeted” visualization • All should use the same underlying data • Extract information processing out of the visualization tools • Provide the same processed info to all visualizations • Interface with community-specific information, e.g. names • Implications • Many “similar” dashboards • Everyone sees the same data • Common frameworks/widgets would help

  8. Process • Review existing monitoring systems • “Improving reliability is our goal!” • Identify gaps • Design an integrated architecture for monitoring • Prototype some solutions • Reduce to a minimum the specific components to develop and maintain • Must be usable by the whole WLCG • EGEE, OSG, NDG

  9. The pieces to work with… • The starting point was what we have now: • Availability testing framework – SAM/RSV • Job and data reliability monitoring – Gridview • Grid topology – GOCDB/Registration DB • Dynamic view of the grid – BDII/CeMon • Accounting – APEL/Gratia • Experiment views – Dashboards • Fabric monitoring – Nagios, LEMON, … • Grid operations tools – CIC Portal • They work together right now • To a certain extent!

  10. We’ve got an integration problem!

  11. No monolithic systems • Different systems should specialize in their areas of expertise • And not also have to invent all the common infrastructure • Implications • Less overlap and duplication of work • Someone needs to manage some common infrastructure • We need to agree on the common infrastructure

  12. Don’t have central bottlenecks • “Local problems detected locally shouldn’t require remote services to work out what the problem is” • There is still a role for central detection of problems • They are just reported locally too • Lots of central processing is done now in SAM/Gridview • Implications • Do as much processing as possible locally (or regionally) • Helps scaling – improves robustness • Enables automation – reduces manpower • Harder to deploy

  13. Re-use, don’t re-invent • What do we do? Collect some information, move it around, store it, view it, report on it • This is pretty common – we should look at existing systems • Already happening for site fabric… Nagios, LEMON, … • Implications • Less code to develop and maintain • Integration nightmare?

  14. Don’t impose systems on sites • We can’t dictate a monitoring system • Many (big?) sites already have a deployed system • We have to be pluggable into them • Implications • Modular approach • Specifications to define interfaces between existing systems and new components

  15. Broker at the centre • Reliability and persistence of messaging are built into the broker network • Mitigates the single points of failure we’ve had with previous solutions • Message delivery is guaranteed

  16. Plug’n’Play Components • Still can end up with spaghetti • A component must take care only of its own job and ignore details of others (e.g. data schema) • Tight specification of interaction of components is required • Message format specifications • Standard metadata schema • Message Queue naming schemas • Protocols • Standard “Patterns” can act as a basis • http://enterpriseintegrationpatterns.com/
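As a concrete (hypothetical) illustration of such a specification: the slides do not give the actual WLCG message schema, so the field names, topic-naming convention and "key: value" encoding below are assumptions, sketched in Python only to show what a tight message-format plus naming specification might look like.

    from datetime import datetime, timezone

    # Hypothetical topic-naming schema: <project>.<area>.<message type>
    TOPIC = "grid.probe.metricOutput"

    def encode_metric_message(site, service_uri, metric, status, detail):
        """Serialise one metric result as simple 'key: value' text lines (illustrative fields only)."""
        fields = {
            "siteName": site,
            "serviceURI": service_uri,
            "metricName": metric,
            "metricStatus": status,            # OK | WARNING | CRITICAL | UNKNOWN
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "summaryData": detail,
        }
        return "\n".join(f"{k}: {v}" for k, v in fields.items()) + "\nEOT\n"

    print(encode_metric_message("EXAMPLE-SITE", "srm://se.example.org:8443",
                                "org.example.SRM-Put", "OK", "file written and read back"))

With an agreed encoding and destination-naming scheme like this, a consumer never has to know which tool produced a message, only which topic it was published on.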

  17. Messaging Systems for Integration • We need: • Loose coupling of systems • Distributed components • Reliable delivery of messages • Standard methods of communication • Flexibility to add new producers and consumers of the information without having to reconfigure everything • Message Oriented Middleware provides this • And is widely used in similar scenarios

  18. Messaging Systems • Flexible architecture: • Deliver messages either point-to-point (queues)… • …or in multicast mode (topics) • Support synchronous or asynchronous communication • Reliable delivery of messages: • Provide reliability to the senders if required • Configurable persistency / master-slave • Highly scalable: network of brokers
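A minimal sketch of the queue/topic distinction: ActiveMQ speaks the text-based STOMP protocol (by default on port 61613), so a producer can be written with nothing but a TCP socket. The broker host, credentials (none) and destination names below are assumptions for illustration, not the actual WLCG configuration.

    import socket

    def stomp_send(host, destination, body, port=61613):
        """Send one message over STOMP 1.0: CONNECT, wait for CONNECTED, SEND, DISCONNECT."""
        with socket.create_connection((host, port)) as s:
            s.sendall(b"CONNECT\n\n\x00")
            s.recv(4096)  # wait for the CONNECTED frame before sending
            s.sendall(f"SEND\ndestination:{destination}\n\n{body}\x00".encode())
            s.sendall(b"DISCONNECT\n\n\x00")

    # Point-to-point: exactly one consumer of the queue receives the message.
    stomp_send("broker.example.org", "/queue/grid.probe.metricOutput", "site A result")

    # Multicast: every subscriber of the topic receives its own copy.
    stomp_send("broker.example.org", "/topic/grid.probe.metricOutput", "site A result")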

  19. ActiveMQ • Mature open-source implementation of these ideas • Top-level Apache project • Commercial support available from IONA • Easy to integrate • Multiple language + transport protocol support • Good performance characteristics • See later… • Work done to integrate into our environment • RPMs, Quattor components + templates, LEMON alarms

  20. ActiveMQ Architecture

  21. ActiveMQ Throughput • More consumers ⇒ more throughput?? No – consumer bottleneck! • With a larger number of producers, even more messages per second saturate the consumer.

  22. Reporting for WLCG • Currently done by post-processing results and graphs in Excel • Much manual work needed! • Try to implement it directly on the GridView DB • Using a mature open-source reporting toolkit – JasperReports • UI report builder – iReports • Web-based report server – OpenReports
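The slides do not spell out the report contents; as an illustration of the kind of calculation such a report automates (and of the MoU reporting mentioned in the summary), the sketch below computes site availability and reliability from status samples, using the commonly quoted WLCG-style definitions (availability = time OK / total time; reliability = time OK / (total time − scheduled downtime)). The data and function are hypothetical.

    def availability_reliability(samples):
        """samples: list of (hours, status) where status is 'OK', 'DOWN' or 'SCHED_DOWN'."""
        total = sum(h for h, _ in samples)
        up = sum(h for h, s in samples if s == "OK")
        sched = sum(h for h, s in samples if s == "SCHED_DOWN")
        return up / total, up / (total - sched)

    # One month of hypothetical test results for a site.
    month = [(680, "OK"), (24, "SCHED_DOWN"), (16, "DOWN")]
    a, r = availability_reliability(month)
    print(f"availability {a:.1%}, reliability {r:.1%}")   # 94.4%, 97.7%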

  23. JasperReports

  24. Site Monitoring & Nagios • More details in the next talk: • “Central Europe ROC Nagios Experience” • Nagios has shown itself to be a very useful component for building many parts of our monitoring solutions • Local site monitoring • Replacing the SAM execution framework • Too hard to maintain, too centralized to scale • gStat – BDII monitoring • Probes within Nagios • Publish site results upwards to be part of the availability/reliability computation
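Nagios probes follow a simple plugin contract: print one status line (optionally with performance data after a "|") and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is a generic probe in that style, not one of the actual SAM/WLCG probes; the endpoint and thresholds are made up.

    #!/usr/bin/env python
    import sys, time, urllib.request

    # Hypothetical endpoint to test; a real grid probe would test an SRM, CE, BDII, ...
    URL = "http://site-bdii.example.org:2170/"
    WARN, CRIT = 5.0, 15.0     # response-time thresholds in seconds

    def main():
        start = time.time()
        try:
            urllib.request.urlopen(URL, timeout=CRIT)
        except Exception as exc:
            print(f"SERVICE CRITICAL - {exc}")
            return 2
        elapsed = time.time() - start
        status, code = ("OK", 0) if elapsed < WARN else ("WARNING", 1)
        # Performance data after '|' can be graphed by Nagios add-ons.
        print(f"SERVICE {status} - responded in {elapsed:.1f}s | time={elapsed:.3f}s")
        return code

    if __name__ == "__main__":
        sys.exit(main())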

  25. Messaging-based archiving and reporting

  26. In Production – OSG RSV to SAM • RSV – Resource and Service Validation • Uses Gratia as the native transport within OSG • And the OSG GOC runs a bridge to SAM for WLCG

  27. Job Reliability Monitoring • Requires gathering job state transitions from all jobs submitted to WLCG resources • EGEE (RB/WMS + Condor_G) + OSG + NDG • Only gather this information once • Propagate it to interested parties • Use existing systems and expertise where possible • Don’t try to deploy components on every WMS/RB/L&B/CE/… • Get ‘cooked’ data from the systems • Hook up with pilot jobs • Linkage between pilot and experiment jobs as a ‘state change’

  28. Current situation • Currently L&B log files are mined and sent via R-GMA • Requires a specific component on every L&B • Loses many records • GridView hacks to ‘finish’ unfinished jobs after 24h • Inaccurate results • Jobs reported via experiment frameworks • Gathered from many sources – Imperial College XML files, job submission tools, MonAlisa reporting from jobs, R-GMA • But some information is missing for Condor_G jobs • Info between submission and the user job starting on the WN • Job aborted

  29. Proposal • Use the WLCG monitoring infrastructure (MSG) for collecting and transporting the data • Messaging system • Standard message formats • Work with expert groups to instrument the job submission systems • Visualization by GridView + Dashboards

  30. EGEE • L&B notifications mean we don’t have to run components mining L&B logfiles • The consumer of notifications can be remote • L&B is stated to scale to our needs • Tested at >1 million records/day • Testing of integration with notifications is underway by the GridView team • Message formats already defined • The old log-mining approach will be moved to the messaging system to free GridView from the R-GMA dependency
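To complement the producer sketch above, a remote consumer of such notifications could subscribe to the job-status destination over STOMP; again the broker, destination name and message handling below are assumptions for illustration only, not the actual GridView integration.

    import socket

    def stomp_consume(host, destination, port=61613, max_frames=10):
        """Subscribe to a destination and print incoming frames (STOMP 1.0, no acknowledgement logic)."""
        with socket.create_connection((host, port)) as s:
            s.sendall(b"CONNECT\n\n\x00")
            s.recv(4096)  # CONNECTED frame
            s.sendall(f"SUBSCRIBE\ndestination:{destination}\nack:auto\n\n\x00".encode())
            buf = b""
            while max_frames > 0:
                buf += s.recv(4096)
                while b"\x00" in buf and max_frames > 0:
                    frame, buf = buf.split(b"\x00", 1)
                    print(frame.decode(errors="replace").strip())
                    max_frames -= 1
            s.sendall(b"DISCONNECT\n\n\x00")

    # Hypothetical destination carrying job state-change messages.
    stomp_consume("broker.example.org", "/topic/grid.jobmonitoring.statusChange")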

  31. Condor_G • The Condor_G submitter is instrumented to create L&B messages • Done by a separate listener process that is started by Condor_G • A limited subset of Condor_G state changes will be sent • The listener/reporter can use different transports for reporting • Currently MonAlisa is used as the transport layer • Will migrate to the WLCG messaging system

  32. Pilot Jobs • The L&B client resides on every worker node • It can be used to submit additional messages to L&B for a job • Timestamps + environment for job-wrapper start/end • Timestamp of the handover to the user job • Linkage of the pilot job to the experiment job ID • … • Benefit: it’s all in one coherent data structure for a given job
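The extra records a pilot wrapper would emit are easy to picture; the sketch below builds hypothetical state-change events for the wrapper start, the handover to the experiment job (carrying the linkage to the experiment job ID) and the wrapper end. None of the field names come from the slides; they only illustrate how the linkage could travel as ordinary state changes attached to one grid job.

    from datetime import datetime, timezone

    def pilot_event(grid_job_id, event, **extra):
        """Build one hypothetical state-change record emitted from the worker node."""
        record = {
            "gridJobId": grid_job_id,
            "event": event,                     # wrapper_start | user_job_start | wrapper_end
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        record.update(extra)
        return record

    pilot = "https://lb.example.org:9000/abc123"           # hypothetical L&B job ID
    events = [
        pilot_event(pilot, "wrapper_start", workerNode="wn042.example.org"),
        pilot_event(pilot, "user_job_start", experimentJobId="PanDA-987654"),  # the linkage
        pilot_event(pilot, "wrapper_end", exitCode=0),
    ]
    for e in events:
        print(e)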

  33. EGEE Architecture

  34. CMS SAM Portal

  35. ServiceMap • What’s a ServiceMap? • It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure • Gridmap: “treemap”-based view of the grid • http://gridmap.cern.ch • What’s the CCRC’08 ServiceMap? • Service ‘readiness’ • Service availability • Experiment metrics • A single place to see both the VO and the infrastructure view of the grid

  36. CCRC’08 ServiceMap • …Demo… • http://gridmap.cern.ch/ccrc08/servicemap.html

  37. WLCG Experiment metrics • Show the VO view of the infrastructure • Two extra ‘maps’ planned • Reliability (e.g. successful data transfers, jobs, …) • Metrics (MB/s, events/s, …) • Need interaction with the experiments to create these two views

  38. Summary • CCRC’08 is a good opportunity to try some new operational tools • And evaluate them in a ‘real-world’ mode • The CCRC’08 ServiceMap seems to give a useful view of the grid • Need to iterate on what is useful to show • And fill in the white spaces… • Next steps • MoU calculation and reporting to sites • Feedback on all the tools is welcome!

  39. Links to CCRC08 tools • CCRC’08 ServiceMap • http://gridmap.cern.ch/ccrc08/servicemap.html • CCRC’08 Observations logbook • https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/ • RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/elog.rdf • Response tracking logbook • https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/ • RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/elog.rdf

  40. Strategy Summary • Converge to standards, but without a big bang • Leverage the underlying infrastructures rather than layer lots of systems on top • Reduce maintenance/development costs by using commodity components whenever possible • Modular and loosely coupled to adapt to changes in infrastructure and funding models

  41. Architecture Summary • Our design for a new architecture leverages commodity software components • Probe execution (Nagios), messaging (ActiveMQ), reporting (JasperReports) • It is essentially an integration exercise • Make existing tools work together better • In order to improve reliability • This is what we will verify over the next 12 months

  42. More Information… • GDB reports on monitoring by James • Almost every month at GDB or pre-GDB • http://indico.cern.ch/categoryDisplay.py?categId=3l181 • Improving Job Reliability • http://indico.cern.ch/conferenceDisplay.py?confId=20228 • Email James… • Look in the CERN directory…
