The Experiment Dashboard
ISGC 2008, 9-11th April 2008
Pablo Saiz, Julia Andreeva, Benjamin Gaidioz, Anastasia Ivanchecnko, Gerhild Maier, Ricardo Rocha, Irina Sidirova
IT-GS-MND
Overview
• Dashboard structure
• Dashboard in production
  • Job Monitoring
  • Grid reliability
  • Prodsys
  • Data Management
  • SAM
  • FTS monitoring
  • Site status board
• Future development
• Conclusions
ISGC 2008 -- Pablo.Saiz@cern.ch
Dashboard Framework
• Web / HTTP Interface
  • Multiple clients: cli, web
  • Multiple output formats: plain text, csv, xml, xhtml
• Agents
  • Collectors of information
  • Common configuration and management
• Data Access Layer (DAO)
  • DB reading and writing via DAO layer
  • Connection pooling
  • Easy to add interface for a different backend
• Oracle DB
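
The DAO idea on this slide can be illustrated with a minimal sketch. It is not the Dashboard's actual code: the `JobDAO` interface, the `jobs` table, and the site and job names are hypothetical, and SQLite stands in for the Oracle backend so that the example runs anywhere. Connection pooling is left out.

```python
import sqlite3
from abc import ABC, abstractmethod


class JobDAO(ABC):
    """Hypothetical backend-neutral interface: callers never see SQL."""

    @abstractmethod
    def add_job(self, job_id: str, site: str, status: str) -> None: ...

    @abstractmethod
    def jobs_at_site(self, site: str) -> list: ...


class SQLiteJobDAO(JobDAO):
    """Stand-in backend; an Oracle DAO would implement the same two methods."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(job_id TEXT PRIMARY KEY, site TEXT, status TEXT)"
        )

    def add_job(self, job_id, site, status):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                (job_id, site, status),
            )

    def jobs_at_site(self, site):
        rows = self._conn.execute(
            "SELECT job_id, status FROM jobs WHERE site = ? ORDER BY job_id",
            (site,),
        )
        return rows.fetchall()


def failed_jobs(dao: JobDAO, site: str) -> list:
    """Application code depends only on the JobDAO interface."""
    return [job_id for job_id, status in dao.jobs_at_site(site) if status == "failed"]


dao = SQLiteJobDAO()
dao.add_job("job-1", "SITE_A", "done")
dao.add_job("job-2", "SITE_A", "failed")
dao.add_job("job-3", "SITE_B", "failed")
print(failed_jobs(dao, "SITE_A"))
```

Because `failed_jobs` only uses the abstract interface, swapping the backend means writing one new subclass and leaving the collectors and web layer untouched.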
Dashboard activities
• Common applications (ALICE, ATLAS, CMS, LHCb, VleMed)
  • Job monitoring
  • Site reliability
• Experiment specific applications
  • Accounting information from APEL and Gratia for ATLAS (prototype)
  • Data management monitoring for ATLAS
  • Production monitoring for ATLAS and CMS (prototypes)
  • CMS integration and commissioning
  • Task monitoring for CMS analysis users (ATLAS on the way)
  • Transfer monitoring for ALICE
  • IO rate monitoring between WN and SE (prototype)
  • Site availability based on the results of SAM tests
  • Job Robot monitoring
Job Monitoring
• Display all the jobs submitted by a VO
• Follow the status of the jobs
• Collect information from different sources
  • RGMA, IC Real Time Monitor, BDII, MonALISA, …
• Very useful for VO managers, site admins and users
• Output available in different formats
• Deployed for ALICE, ATLAS, CMS, LHCb and VleMed
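
Serving the same job list in several output formats can be sketched as below. This is an illustration under assumptions, not the Dashboard's implementation: the field names, the `render` function and the sample jobs are invented for the example, and only the Python standard library is used.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ("job_id", "site", "status")  # hypothetical record layout


def to_text(jobs):
    """One job per line, fields separated by spaces."""
    return "\n".join(" ".join(job[f] for f in FIELDS) for job in jobs)


def to_csv(jobs):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(jobs)
    return buf.getvalue()


def to_xml(jobs):
    root = ET.Element("jobs")
    for job in jobs:
        ET.SubElement(root, "job", attrib={f: job[f] for f in FIELDS})
    return ET.tostring(root, encoding="unicode")


RENDERERS = {"txt": to_text, "csv": to_csv, "xml": to_xml}


def render(jobs, fmt):
    """Pick a renderer by format name; the query result itself is unchanged."""
    if fmt not in RENDERERS:
        raise ValueError(f"unsupported format: {fmt}")
    return RENDERERS[fmt](jobs)


jobs = [
    {"job_id": "job-1", "site": "SITE_A", "status": "running"},
    {"job_id": "job-2", "site": "SITE_B", "status": "done"},
]
print(render(jobs, "csv"))
```

Keeping the renderers in a lookup table means a new format is one more function and one more dictionary entry.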
Site Reliability
• Efficiency of the different sites
• Jobs and job attempts
• List of the most common errors
  • With recipes for solving them
• Generic application
• Automatic generation of monthly reports
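
A per-site efficiency report of this kind could be computed roughly as follows. The record layout, the sample data and the two efficiency definitions (successful attempts over all attempts, and jobs with at least one successful attempt over all jobs) are assumptions made for the sketch; the slide does not define them.

```python
from collections import Counter, defaultdict

# Hypothetical attempt records: (site, job_id, succeeded, error_message)
ATTEMPTS = [
    ("SITE_A", "job-1", False, "proxy expired"),
    ("SITE_A", "job-1", True, None),
    ("SITE_A", "job-2", False, "proxy expired"),
    ("SITE_B", "job-3", True, None),
    ("SITE_B", "job-4", False, "stage-out failed"),
    ("SITE_B", "job-4", False, "proxy expired"),
]


def site_report(attempts):
    """Return per-site efficiencies and the errors ranked by frequency."""
    per_site = defaultdict(lambda: {"attempts": 0, "ok_attempts": 0, "jobs": {}})
    errors = Counter()
    for site, job_id, ok, error in attempts:
        stats = per_site[site]
        stats["attempts"] += 1
        stats["ok_attempts"] += int(ok)
        # Assumption: a job is successful if any of its attempts succeeded.
        stats["jobs"][job_id] = stats["jobs"].get(job_id, False) or ok
        if error:
            errors[error] += 1
    report = {
        site: {
            "attempt_efficiency": s["ok_attempts"] / s["attempts"],
            "job_efficiency": sum(s["jobs"].values()) / len(s["jobs"]),
        }
        for site, s in per_site.items()
    }
    return report, errors.most_common()


report, common_errors = site_report(ATTEMPTS)
print(report)
print(common_errors)
```

Separating the two rates shows why the slide lists both jobs and job attempts: resubmission can hide a poor attempt success rate behind an acceptable job success rate.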
Production System
• ATLAS Prodsys
• Identify failing tasks and jobs
• Evaluate the performance of the sites
• Daily/weekly/monthly statistics
• User guide
Data Management
• Monitoring of T0 and the production system
• Report of transfers to the different sites
• Integrated with the ATLAS management system
• Information on the clouds, sites, SEs and datasets
• History of errors
FTS reliability
• Daily report on the success of transfers
• Drill-down list of errors
• Integrated in the ALICE environment
• Extremely useful during the different ALICE challenges: PDC06, PDC07, CRC08
• Work in progress to make it generic
SAM monitoring
• Service Availability Monitoring
• Clickable plots to drill down:
  • Site availability → Service availability → Service tests
• Links to the SAM results
• At the moment, only for CMS
• ATLAS requested a similar interface
• Ongoing work to make it generic
Site Status Board
• Table with the status of the different sites for CMS
• Easy definition of new ‘metrics’
• The ‘metrics’ can come from different sources
• Links to more detailed information
• At the moment, deployed for CMS
• It could be used by other VOs
• Work in progress on providing history
• And aggregation…
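
One way to picture "easy definition of new metrics" coming from different sources is a small registry in which each metric is a function returning a value per site. Everything in this sketch is hypothetical: the `metric` decorator, the metric names and the hard-coded values stand in for real sources and are not taken from the Site Status Board itself.

```python
from typing import Callable, Dict

MetricSource = Callable[[], Dict[str, str]]
METRICS: Dict[str, MetricSource] = {}  # metric name -> function producing site values


def metric(name: str):
    """Register a function as the source of one board column."""
    def register(func: MetricSource) -> MetricSource:
        METRICS[name] = func
        return func
    return register


@metric("sam_availability")
def sam_availability() -> Dict[str, str]:
    # Stand-in for a query against one source.
    return {"SITE_A": "ok", "SITE_B": "down"}


@metric("job_robot")
def job_robot() -> Dict[str, str]:
    # Stand-in for a second, independent source that knows fewer sites.
    return {"SITE_A": "ok"}


def status_board(metrics: Dict[str, MetricSource] = METRICS) -> Dict[str, Dict[str, str]]:
    """Build one row per site, filling gaps with 'n/a'."""
    columns = {name: source() for name, source in metrics.items()}
    sites = sorted({site for values in columns.values() for site in values})
    return {
        site: {name: columns[name].get(site, "n/a") for name in columns}
        for site in sites
    }


board = status_board()
for site, row in board.items():
    print(site, row)
```

Adding a metric is then a matter of writing one decorated function; the table builder does not change.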
Experiment Dashboard plans
• Include more data sources: condor_g, L&B
• Security: X509 authentication
• New applications:
  • Pilot jobs
  • Input collections
• Improve existing applications
  • Make the SAM interface generic
  • More in-depth failure analysis
  • User requests and suggestions
• Integration with the GridMap technology
Conclusions
• The Experiment Dashboard provides:
  • Several monitoring applications
  • Integration of information from different sources
  • Multiple output formats: html, xml, csv, txt, …
• Generic applications:
  • Job Monitoring, Grid reliability
• Experiment specific applications:
  • DDM, ProdSys, Site Status Board, SAM, …
• Used in production by multiple VOs
• User, installation and developer guides
http://dashboard.cern.ch