Update on SAM monitoring
Wojciech Lapka, David Collados
Outlook
• Current Situation
• Short overview of the migration of the LHC VOs from SAM to Nagios
  • as understood by GT, with some input from CMS and ATLAS
  • a detailed update can be given at the next MB
    • not enough time to work on a common report
    • the January availability report only arrived during the first week of February (delay due to manual data quality assurance)
• Status of ACE computation
• Replacement of FCR for CMS blacklisting
• Issues
Current Situation
• We still have to operate two monitoring infrastructures in parallel
• SAM legacy services
  • Original end of life was autumn 2010 (EGEE-III)
  • Old SAM portal, DBs, FCR (operated by CERN-IT)
  • SAM-DPM machines (operated by CERN-IT)
  • SAM-BDII (run by CERN-IT)
• Nagios-based system
  • CERN-ROC Nagios instance (CERN-IT)
  • Asia-Pacific Nagios instance (CERN-IT)
    • Last ROC Nagios not run by an NGI
    • Planned: ALL ROC Nagios instances moved by October 2010!
  • Experiment-specific Nagios production and pre-production services
    • 8 instances, fully Quattorized, ready to move
Current Situation
• New visualization and front-ends
  • MyEGI running on the central Nagios DBs
    • Still needs work (features, bug fixing)
    • Team lost its main developer → delay
  • GridView (service run by GT)
    • Development by the BARC collaboration
    • Service will be integrated into MyEGI
• 2nd-level support for SAM Nagios (GT)
  • Planned: move to EGI in autumn 2010
• Ops Nagios probe maintenance still with GT
  • Agreed to be moved to the EMI Product Teams
• Many services and tasks still with the team
  • plus reduced manpower (went from 7 to 4)
Experiments moving to Nagios
• Probes and debugging by IT-ES and the experiments
• Services and support by IT-GT
  • Follow-up on failures with the experiment contact
• New production and pre-production setup from scratch
• Validation of Nagios monitoring
  • Dec/Jan availability reports for SAM/Nagios to compare the results
  • We expect Nagios and SAM availability figures to be within 5%
    • Nagios should be a bit higher due to re-tries
  • Equivalent metrics for CE/SRM at T0/T1s
  • The standard GridView algorithm was used, which allows a direct comparison (see the sketch below)
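A minimal sketch of the kind of SAM vs. Nagios cross-check described above: flag any site whose monthly availability differs by more than 5 percentage points between the two reports. This is not the GridView code; the site names and figures in the example are purely illustrative.

```python
SAM_NAGIOS_TOLERANCE = 5.0  # expected agreement, in percentage points

def compare_availability(sam, nagios, tolerance=SAM_NAGIOS_TOLERANCE):
    """Return sites whose SAM and Nagios availabilities disagree beyond the tolerance."""
    discrepancies = []
    for site in sorted(set(sam) & set(nagios)):
        delta = nagios[site] - sam[site]
        if abs(delta) > tolerance:
            discrepancies.append((site, sam[site], nagios[site], delta))
    return discrepancies

if __name__ == "__main__":
    sam_report = {"SITE-A": 93.0, "SITE-B": 99.0}     # % from the SAM-based report
    nagios_report = {"SITE-A": 61.0, "SITE-B": 97.0}  # % from the Nagios-based report
    for site, s, n, d in compare_availability(sam_report, nagios_report):
        print(f"{site}: SAM {s:.1f}%  Nagios {n:.1f}%  (delta {d:+.1f} pp)")
```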
Experiments moving to Nagios
• GridView reports for the LHC VOs:
  • Official (SAM based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012/wlcg/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101/wlcg/
  • New (Nagios based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012-nagios/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101-nagios/
Status: ALICE on Nagios
• Followed up by Maria Dolores Saiz & Maarten Litmaath
• Random failures during job submission
  • Likely reason: 5h30 timeout in Nagios (SAM used 12h)
• December
  • RAL (availability: 61% in Nagios, 93% in SAM)
• January
  • RAL (availability: 70% in Nagios, 90% in SAM)
• Suggested next step: increase the timeout and re-evaluate in March (see the sketch below)
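An illustrative sketch of why the shorter Nagios timeout lowers the availability figure: a job-submission probe that only waits a configurable number of hours before reporting CRITICAL. The poll function is a stand-in for the real grid job-status query, not the actual SAM/Nagios probe; the exit codes follow the standard Nagios plugin convention.

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def wait_for_test_job(poll_job_status, timeout_hours, poll_interval_s=300):
    """Poll a submitted test job until it finishes or the timeout expires."""
    deadline = time.time() + timeout_hours * 3600
    while time.time() < deadline:
        status = poll_job_status()  # e.g. "DONE", "RUNNING", "ABORTED" (placeholder)
        if status == "DONE":
            return OK, "test job completed"
        if status == "ABORTED":
            return CRITICAL, "test job aborted"
        time.sleep(poll_interval_s)
    # With a 5h30 limit, jobs that merely queue longer than under the 12h SAM
    # limit are counted as failures, which lowers the Nagios availability.
    return CRITICAL, f"test job did not finish within {timeout_hours}h"
```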
Status: ATLAS on Nagios
• Followed up by Alessandro di Girolamo
• December 2010
  • Very similar results for Nagios and SAM
• January 2011
  • BNL (43% in Nagios, 86% in SAM)
  • Problem understood by ATLAS and fixed by the site
    • Nagios uses a new DN and the site's CRL was not sufficiently recent
• Nagios-based availabilities have been implemented in the ATLAS Dashboard
  • Data stored in the legacy SAM DB
  • http://tinyurl.com/dashb-sam-nagios-48h
Status: CMS on Nagios
• Followed up by Andrea Sciaba
• 'org.cms.SRM-VOGet' fails randomly
• December
  • RAL-LCG2 (81% in Nagios, 94% in SAM)
• January
  • Taiwan-LCG2 (94% in Nagios, 100% in SAM)
• Problem understood by CMS
  • Issue related to the probe, the space token and the site configuration
• Next steps (February):
  • Modify the CMS Nagios probe (a probe skeleton is sketched below)
  • Calculate and compare Dashboard availabilities
  • Run the test 'org.cms.WN-mc' with the production role
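A hedged sketch of what a Nagios-style probe skeleton looks like, to show where a modification of an 'org.cms.SRM-VOGet'-like check could make it more robust against site-configuration issues. The srm_get() call is a placeholder, not a real SRM client API, and the choice to report a missing space token as UNKNOWN is an assumption for illustration.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes

def srm_get(endpoint, space_token):
    """Placeholder for the real SRM 'get' operation used by the probe."""
    raise NotImplementedError("replace with the actual SRM client call")

def run_probe(endpoint, space_token=None):
    if not space_token:
        # A missing/unpublished space token is a configuration problem and is
        # reported as UNKNOWN here rather than as a random-looking CRITICAL.
        return UNKNOWN, "space token not configured for this site"
    try:
        srm_get(endpoint, space_token)
    except Exception as exc:  # the probe itself must never crash
        return CRITICAL, f"SRM get failed: {exc}"
    return OK, "SRM get succeeded"

if __name__ == "__main__":
    code, message = run_probe("srm://example-site.example:8443/srm/managerv2")
    print(message)
    sys.exit(code)
```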
Status: LHCb on Nagios
• Followed up by Roberto Santinelli
• Random failures during job submission
  • Most likely due to the 5h30 timeout in Nagios
• December
  • PIC (84% in Nagios, 99% in SAM)
  • INFN-CNAF (80% in Nagios, 99% in SAM)
  • RAL (79% in Nagios, 94% in SAM)
• January 2011
  • RAL (88% in Nagios, 97% in SAM)
• Suggested next step: increase the timeout and re-evaluate in March
HEP VOs – next steps
• Validate dashboard applications with Nagios tests (IT/ES)
  • Still requires the legacy SAM DB and portal
    • The portal provides a programmatic interface (PI)
  • A new interface by MyEGI has been available since mid January
    • Still a pre-production service, but it can be used for the migration (a usage sketch follows below)
• GT will stop the old SAM system as soon as we get the green light from the experiments
  • June/July 2011: last security patches for SLC4
    • The service cannot be migrated to SL5
  • We cannot afford to run two services in parallel
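A hedged sketch of how a dashboard application could switch from the legacy SAM portal PI to the new MyEGI programmatic interface. The base URL, path and query parameters below are hypothetical placeholders, not the documented API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYEGI_PI = "https://myegi.example.cern.ch/api"  # placeholder base URL

def fetch_metric_results(vo, profile, hours=48):
    """Fetch recent test results for a VO/profile from the PI as parsed JSON."""
    query = urlencode({"vo": vo, "profile": profile,
                       "last": f"{hours}h", "output": "json"})
    with urlopen(f"{MYEGI_PI}/metric-results?{query}") as response:
        return json.load(response)

# Example usage (only meaningful against a real endpoint):
# results = fetch_metric_results("cms", "CMS_CRITICAL")
# for entry in results.get("results", []):
#     print(entry["host"], entry["metric"], entry["status"])
```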
ACE Schedule
• December 2010
  • Validated the standard availability computation for OPS √
• January
  • Computation of standard availabilities for the LHC experiments (one profile per VO) √
• February
  • Multiple availabilities (different profiles, same algorithm) per VO √
• March
  • Multiple availabilities (different profiles and algorithms: CREAM CE use case) per VO (see the sketch below)
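An illustrative sketch of what "multiple availabilities per VO" means: the same raw test results combined under different profiles (metric subsets) and combination rules. The profile names, metric names and the AND/OR rules below are assumptions for the example, not the actual ACE configuration.

```python
def service_ok(results, metrics, combine=all):
    """A service is up if its metrics pass; combine=all is AND, combine=any is OR."""
    return combine(results.get(m) == "OK" for m in metrics)

def site_availability(samples, profile, combine=all):
    """Percentage of time bins in which the profile's metrics were passing."""
    good = sum(1 for results in samples if service_ok(results, profile, combine))
    return 100.0 * good / len(samples)

if __name__ == "__main__":
    # Hourly samples of metric results for one site (illustrative data).
    samples = [
        {"LCG-CE-job-submit": "OK", "CREAM-CE-job-submit": "OK"},
        {"LCG-CE-job-submit": "CRITICAL", "CREAM-CE-job-submit": "OK"},
        {"LCG-CE-job-submit": "OK", "CREAM-CE-job-submit": "CRITICAL"},
    ]
    lcg_only = ["LCG-CE-job-submit"]
    either_ce = ["LCG-CE-job-submit", "CREAM-CE-job-submit"]
    print("LCG-CE-only profile: ", site_availability(samples, lcg_only))
    print("CREAM-or-LCG profile:", site_availability(samples, either_ce, combine=any))
```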
ACE – next steps
• March: validate the ACE reports against GridView
  • OPS & LHC VOs
• March: generate two ACE reports for OPS
  • CREAM & LCG-CE, and compare the results
• April: ACE validation
  • Production readiness
• May: ACE in production mode
  • Provided that no major issues are found
CMS blacklisting
• In cooperation with Andrea Sciaba
• CREAM & LCG-CE status computed based on Nagios results
• A generic programmatic interface for data export is available in JSON/XML
• Ongoing work on a solution
  • Test BDII with blacklisting by the end of this week
  • Move to production after the CMS green light
• We would like CMS to consider using the generic PI (see the sketch below)
  • Increased flexibility
  • More uniform approach, no extra service
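A sketch of how CMS could derive a CE blacklist directly from the generic programmatic interface's JSON export instead of from FCR. The JSON field names ("host", "flavour", "status") are assumptions for illustration, not the actual export schema.

```python
import json

def build_blacklist(pi_json_text, flavours=("CREAM-CE", "LCG-CE")):
    """Return the set of CE hosts whose latest computed status is not OK."""
    entries = json.loads(pi_json_text)
    return {e["host"]
            for e in entries
            if e.get("flavour") in flavours and e.get("status") != "OK"}

if __name__ == "__main__":
    # Illustrative export; a real client would fetch this from the PI endpoint.
    sample = json.dumps([
        {"host": "ce01.example-site.org", "flavour": "CREAM-CE", "status": "OK"},
        {"host": "ce02.example-site.org", "flavour": "LCG-CE", "status": "CRITICAL"},
    ])
    print(sorted(build_blacklist(sample)))
```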
Issues
• Manpower went down by 60% (4 FTEs remain)
• The team is in a Catch-22 situation
  • All resources are absorbed by operations and support
  • Decommissioning the legacy services would free resources
    • But that requires effort that is not available
• We need to stop services or move them away, or development will freeze until more resources arrive
• Risk that the use of Nagios data via the old SAM DB continues for too long → move to the new PI
Questions?