This presentation discusses the importance and complexity of monitoring the job processing activity within the ATLAS and CMS Virtual Organisations. It explores the existing solutions for monitoring job processing and introduces the Experiment Dashboard system as a user-centric monitoring solution. The presentation covers the architecture and features of the Experiment Dashboard system and its applications for task monitoring. It also discusses future plans and summarizes the benefits of user-centric monitoring.
User-centric monitoring of the analysis and production activities within the ATLAS and CMS Virtual Organisations using the Experiment Dashboard system
J. Andreeva, M. Cinquilli, I. Dzhunov, E. Karavakis (CERN & SA3), M. Kenyon, L. Kokoszkiewicz, P. Saiz, L. Sargsyan, D. Tuckett
CERN IT-ES
EGI Community Forum 2012 - Munich
Outline
• Importance and complexity of monitoring the LHC job processing activity
• Existing solutions for ATLAS & CMS VOs
• Experiment Dashboard Task Monitoring applications
• Common solutions for ATLAS & CMS
• Future plans
• Summary
User-centric monitoring using the Experiment Dashboard system
Importance of monitoring the job processing activity
• WLCG integrates more than 140 computing centres in 35 countries
• Job processing is the core part of the VO computing activities
• More than 200,000 jobs run concurrently for the LHC VOs using various middleware platforms, job submission methods and execution back-ends
• Scientists must be able to easily monitor the execution status and the application- and grid-level messages of their tasks, which may run at any site within the WLCG
• Only serious issues should be escalated to the support teams
Complexity of monitoring the job processing activity
• More than 600K ATLAS jobs & 400K CMS jobs are submitted daily on different middleware platforms!
• Job processing activity is divided into two categories:
  • User analysis
  • Data reconstruction & Monte-Carlo production
• Data reco & MC production are well-organised activities performed by a group of experts
• User analysis is a chaotic activity performed by diverse members of the physics community
  • Normally carried out by users who are not necessarily experienced in using the Grid, so it is particularly difficult to predict
Existing solutions
• Most of the monitoring applications are coupled to VO-specific solutions
• CMS:
  • CRAB Monitoring is coupled to jobs submitted by the CRAB submission system
  • WMAgent Monitoring is coupled to jobs submitted via WMAgent
  • Other submission tools (ProdAgent, Grid Control, farmout, … )
• ATLAS:
  • Panda Monitoring is coupled to jobs submitted via the PanDA workload management system
  • GangaMon / MiniDashboard is coupled to jobs submitted with Ganga
Experiment Dashboard
• Monitoring system developed for the LHC experiments
• Enables a transparent view of the experiment activities across different middleware implementations and combines Grid monitoring data with information that is specific to the VO
• Loose coupling to information sources; collects information from various sources:
  • Job submission systems
  • Jobs themselves
• Relies on instrumentation of the job submission frameworks and provides a common library for that purpose. Defines a common set of attributes and a common format for reporting
• Presents this information in a coherent way, as if all of it came from one single source!
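The idea of a common set of attributes and a common reporting format can be illustrated with a minimal sketch. The slides do not list the real attributes or the wire format, so the field names, the `JobStatusReport` class and the use of JSON below are all assumptions made for illustration; they are not the Dashboard's actual library or schema.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical attribute set: the field names are illustrative only,
# not the Dashboard's real reporting schema.
@dataclass
class JobStatusReport:
    task_id: str
    job_id: str
    submission_tool: str  # e.g. "CRAB" or "Ganga"
    site: str
    status: str           # e.g. "submitted", "running", "done", "failed"
    timestamp: float      # seconds since the epoch


def encode_report(report: JobStatusReport) -> str:
    """Serialise a report to a JSON string with a stable key order."""
    return json.dumps(asdict(report), sort_keys=True)


def decode_report(payload: str) -> JobStatusReport:
    """Rebuild a report from its JSON form; unexpected keys raise TypeError."""
    return JobStatusReport(**json.loads(payload))


# Two different submission tools emit messages with the same shape,
# so a single consumer can treat them uniformly.
crab = JobStatusReport("task_a", "1", "CRAB", "SITE_ONE", "running", 1332000000.0)
ganga = JobStatusReport("task_b", "7", "Ganga", "SITE_TWO", "failed", 1332000300.0)
messages = [encode_report(crab), encode_report(ganga)]
decoded = [decode_report(m) for m in messages]
```

Because every submission framework fills in the same fields, the consuming side needs no tool-specific parsing, which is what allows the information to be presented as if it came from one source.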
Dashboard Task Monitoring applications
• The Dashboard Task Monitoring applications collect and expose a user-centric set of information
• Provide a clean and precise view of the
  • task evolution
  • reason of failure
  • resubmission history
• Based on common solutions and DB schema
• Developed in close collaboration with the physicists who use the Grid infrastructure, and tailored to their needs
• Heavily used both within ATLAS & CMS for the production and analysis activities
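As a rough illustration of what a task-level view involves, the sketch below derives the three items named on this slide (task evolution, reason of failure, resubmission history) from per-job attempt records. The `JobAttempt` record, its fields and the `summarise_task` helper are hypothetical; the actual Dashboard DB schema is not described in the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JobAttempt:
    """Hypothetical record of one submission attempt of one job."""
    job_id: str
    attempt: int              # 1 = first submission, 2 = first resubmission, ...
    status: str               # e.g. "done" or "failed"
    failure_reason: str = ""


def summarise_task(attempts):
    """Aggregate the attempt records of one task into a user-facing summary."""
    # Keep only the most recent attempt of each job for the current status.
    latest = {}
    for a in attempts:
        if a.job_id not in latest or a.attempt > latest[a.job_id].attempt:
            latest[a.job_id] = a
    status_counts = Counter(a.status for a in latest.values())
    # Failure reasons are counted over all attempts, including retried ones.
    failure_reasons = Counter(
        a.failure_reason for a in attempts if a.status == "failed"
    )
    # Number of resubmissions per job (jobs never resubmitted are omitted).
    resubmissions = {
        job_id: a.attempt - 1 for job_id, a in latest.items() if a.attempt > 1
    }
    return {
        "status_counts": dict(status_counts),
        "failure_reasons": dict(failure_reasons),
        "resubmissions": resubmissions,
    }


attempts = [
    JobAttempt("1", 1, "done"),
    JobAttempt("2", 1, "failed", "stage-out error"),
    JobAttempt("2", 2, "done"),
    JobAttempt("3", 1, "failed", "application error"),
    JobAttempt("3", 2, "failed", "application error"),
]
summary = summarise_task(attempts)
```

In this toy data set, two jobs end as done and one as failed, and the summary keeps the failure reasons of earlier attempts so that a user can see why a resubmission was needed.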
Job monitoring architecture
[Diagram components: Job submission client or server; Jobs running at the WNs; Message server (MonALISA or MSG); Dashboard consumer; Dashboard Data Repository (ORACLE); Dashboard web server; User web interfaces; Data retrieval via APIs]
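The components named on this slide can be mimicked in a few lines to show how they might fit together. This is only a sketch under stated assumptions: an in-process `queue.Queue` stands in for the message server, an in-memory SQLite database stands in for the ORACLE repository, and the table layout and function names (`report`, `run_consumer`, `task_status_api`) are hypothetical. The data flow shown (reporters to message server to consumer to repository to API) is an assumed reading of the diagram.

```python
import json
import queue
import sqlite3

# Stand-in for the message server (MonALISA or MSG in the real system).
message_server = queue.Queue()

# Stand-in for the Dashboard Data Repository (ORACLE in the real system).
repository = sqlite3.connect(":memory:")
repository.execute(
    "CREATE TABLE job_status ("
    "task_id TEXT, job_id TEXT, status TEXT, "
    "PRIMARY KEY (task_id, job_id))"
)


def report(task_id, job_id, status):
    """Called by an instrumented submission tool or by the job itself."""
    message = {"task_id": task_id, "job_id": job_id, "status": status}
    message_server.put(json.dumps(message))


def run_consumer():
    """Drain the message server and store the latest status of each job."""
    while True:
        try:
            payload = message_server.get_nowait()
        except queue.Empty:
            break
        msg = json.loads(payload)
        repository.execute(
            "INSERT OR REPLACE INTO job_status VALUES (?, ?, ?)",
            (msg["task_id"], msg["job_id"], msg["status"]),
        )
    repository.commit()


def task_status_api(task_id):
    """Machine-readable view of one task: number of jobs per status, as JSON."""
    rows = repository.execute(
        "SELECT status, COUNT(*) FROM job_status "
        "WHERE task_id = ? GROUP BY status",
        (task_id,),
    ).fetchall()
    return json.dumps(dict(rows), sort_keys=True)


report("task_a", "1", "submitted")
report("task_a", "2", "submitted")
report("task_a", "1", "running")   # a later update for job 1
run_consumer()
result = task_status_api("task_a")
```

The reporters never talk to the repository directly; they only publish messages. That separation is one plausible reading of the "loose coupling" the presentation describes.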
Job monitoring architecture (cont.)
CMS information sources: CRAB jobs, clients and server, WMAgent jobs and server (and other submission tools) are instrumented for Dashboard reporting. Reporting is currently based on MonALISA
Job monitoring architecture (cont.)
[Diagram adds: Prod Sys DB; PanDA DB]
ATLAS information sources: Direct access to ATLAS Production DB and PanDA DB. Ganga jobs submitted through WMS and local batch systems, and Ganga clients, are instrumented for Dashboard reporting. Reporting is based on ActiveMQ (MSG), which can be used by any job submission framework
Job monitoring architecture (cont.)
The same data repository is used by multiple applications within a VO. Each of them is focused on a particular use case. Common solutions are shared by the two VOs even when using different job submission systems and execution back-ends. UIs are database agnostic
Job monitoring architecture (cont.)
Dashboard information is consumed by other applications in machine-readable format:
• Local fabric monitoring
• Site Status Board
• GridMapSiteView
• WLCG Google Earth Dashboard
• CMS Data Popularity
• Imperial College Real Time Monitoring
CMS Analysis Task Monitoring
• Focused on the user's perspective
• Offers a wide selection of graphical plots
• User-driven development
• Heavily used by CMS: up to 305 daily users
• Users from 52 countries (statistics collected over 5 months)
Analysis Task Monitoring
• User / User-support perspective with a wide selection of plots
• Uses web 2.0 technologies and exposes a modern user interface
• Empowers users so that only non-trivial issues are escalated to support teams
[Screenshot annotations: Task filtering by time period; Task filtering by pattern; Task name resolved according to output container dataset name; Panda states; Graphical representation of the status of jobs]
Based on hBrowse, a common jQuery framework used for generic job monitoring applications (for more information please see the poster)
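The two task filters mentioned on this slide (by time period and by pattern) could work roughly as sketched below. The slides do not say how patterns are interpreted, so shell-style wildcards via `fnmatchcase` are an assumption, and the task records, task names and the `filter_tasks` function are hypothetical.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatchcase


def filter_tasks(tasks, now, period, pattern="*"):
    """Keep tasks submitted within `period` before `now` whose name matches `pattern`.

    `pattern` uses shell-style wildcards (an assumption, not the real UI syntax).
    """
    cutoff = now - period
    return [
        t for t in tasks
        if t["submitted"] >= cutoff and fnmatchcase(t["name"], pattern)
    ]


now = datetime(2012, 3, 28, 12, 0)
tasks = [
    {"name": "user.alice.higgs_v1", "submitted": now - timedelta(hours=3)},
    {"name": "user.alice.top_v2", "submitted": now - timedelta(days=2)},
    {"name": "user.bob.higgs_v1", "submitted": now - timedelta(days=10)},
]

# Tasks of one (made-up) user from the last week.
recent_alice = filter_tasks(tasks, now, timedelta(days=7), "user.alice.*")
# All tasks from the last day, whatever their name.
last_day = filter_tasks(tasks, now, timedelta(days=1))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the filter deterministic and easy to test.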
Analysis Task Monitoring
[Screenshot annotations: Task meta information; Links to the Panda page for more detailed information; Advanced interactive plots that can be exported as an image or PDF document]
Analysis Task Monitoring on Android!
• Work performed by two Brunel University students: Parth Patel & Benjamin Taliadoros (under the supervision of Prof. Akram Khan)
• Download link: dashboard.cern.ch/cms
• Installation steps: 1) Download the application from the above link, 2) Open the downloaded file, 3) The user must enable the ‘Untrusted Sources’ option in Settings to install
Tasks view
• Sort by:
  • Task Name
  • Date (ascending or descending)
  • Total # Jobs (ascending or descending)
Analysis Task Monitoring on Android!
Jobs view
Error Reporting Tool
• When a client-submitted job fails, a user can upload a snapshot of the working directory for investigation by the Analysis Ops team
• Heavily used by the CMS Analysis Operations Service
[Screenshot annotations: Experts can download a snapshot of the user's working directory; Links to Task Monitoring]
Powered by hBrowse
Production Task Monitoring
• Allows users to follow the progress of production tasks
• Task-oriented view of production activity with a wide selection of statistics and plots
• Makes it easy to detect inefficiencies and/or delays in executing production tasks
• Takes into account feedback collected from ATLAS production managers
Powered by hBrowse
Future Plans
• Dashboard job monitoring applications will be extended according to the requests of the LHC VOs
• Analysis Task Monitoring will support the resubmission and cancellation of a given task or job
• Production Task Monitoring will be extended according to the requests being collected from ATLAS production managers
Summary
• The Experiment Dashboard framework could easily be adapted to the needs of new VOs, but the VOs must decide what they wish to monitor and implement or extend the monitoring system according to their needs
• Provides common solutions for job monitoring of the LHC experiments, based on the instrumentation of the job submission frameworks. Common libraries are provided for that purpose
• Works transparently across different middleware platforms, submission methods and execution back-ends
• Targets different categories of users
• Heavily used by the ATLAS and CMS analysis and production communities on a daily basis
• Responds well to the needs of the LHC experiments
http://dashboard.cern.ch
Backup Slide
• A guide to the tools, libraries and coding style commonly used by the developers of the Experiment Dashboard project is available at https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Libs