Update on SAM monitoring
Wojciech Lapka, David Collados
Outlook
• Current Situation
• Short overview of the migration of the LHC VOs from SAM to Nagios
  • as understood by GT, with some input from CMS and ATLAS
  • a detailed update can be given at the next MB
    • not enough time to work on a common report
    • the January availability report only arrived during the first week of February (delay due to manual data quality assurance)
• Status of ACE computation
• Replacement of FCR for CMS blacklisting
• Issues
Current Situation
• We still have to operate two monitoring infrastructures in parallel
• SAM legacy services
  • Original end of life was autumn 2010 (EGEE-III)
  • Old SAM portal, DBs, FCR (operated by CERN-IT)
  • SAM-DPM machines (operated by CERN-IT)
  • SAM-BDII (run by CERN-IT)
• Nagios-based system
  • CERN-ROC Nagios instance (CERN-IT)
  • Asia-Pacific Nagios instance (CERN-IT)
    • Last ROC Nagios not run by an NGI
    • Planned: ALL ROC Nagios instances moved by October 2010!
  • Experiment-specific Nagios production and pre-production services
    • 8 instances, fully Quattorized, ready to move
Current Situation
• New visualization and front-ends
  • MyEGI running on the central Nagios DBs
    • Still needs work (features, bug fixing)
    • Team lost its main developer → delay
  • GridView (service run by GT)
    • Development by the BARC collaboration
    • Service will be integrated into MyEGI
• 2nd-level support for SAM Nagios (GT)
  • Planned: move to EGI in autumn 2010
• Ops Nagios probe maintenance still with GT
  • Agreed to be moved to the EMI Product Teams
• Many services and tasks still with the team
  • plus reduced manpower (went from 7 to 4)
Experiments moving to Nagios
• Probes and debugging by IT-ES and the experiments
• Services and support by IT-GT
  • Follow-up on failures with the experiment contact
• New production and pre-production setup from scratch
• Validation of Nagios monitoring
  • Dec/Jan availability reports for SAM/Nagios to compare the results
  • We expect Nagios and SAM availability figures to be within 5%
    • Nagios should be a bit higher due to re-tries
  • Equivalent metrics for CE/SRM at T0/T1s
  • The standard GridView algorithm was used, which allows a direct comparison (see the sketch below)
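A minimal sketch of the kind of SAM vs. Nagios cross-check described above: flag any site whose monthly availability differs by more than 5 percentage points between the two reports. This is not the GridView code; the site names and figures in the example are purely illustrative.

```python
SAM_NAGIOS_TOLERANCE = 5.0  # expected agreement, in percentage points

def compare_availability(sam, nagios, tolerance=SAM_NAGIOS_TOLERANCE):
    """Return sites whose SAM and Nagios availabilities disagree beyond the tolerance."""
    discrepancies = []
    for site in sorted(set(sam) & set(nagios)):
        delta = nagios[site] - sam[site]
        if abs(delta) > tolerance:
            discrepancies.append((site, sam[site], nagios[site], delta))
    return discrepancies

if __name__ == "__main__":
    sam_report = {"SITE-A": 93.0, "SITE-B": 99.0}     # % from the SAM-based report
    nagios_report = {"SITE-A": 61.0, "SITE-B": 97.0}  # % from the Nagios-based report
    for site, s, n, d in compare_availability(sam_report, nagios_report):
        print(f"{site}: SAM {s:.1f}%  Nagios {n:.1f}%  (delta {d:+.1f} pp)")
```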
Experiments moving to Nagios
• GridView reports for the LHC VOs:
  • Official (SAM based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012/wlcg/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101/wlcg/
  • New (Nagios based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012-nagios/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101-nagios/
Status: ALICE on Nagios
• Followed up by Maria Dolores Saiz & Maarten Litmaath
• Random failures during job submission
  • Likely reason: 5h30 timeout in Nagios (SAM used 12h)
• December
  • RAL (availability: 61% in Nagios, 93% in SAM)
• January
  • RAL (availability: 70% in Nagios, 90% in SAM)
• Suggested next step: increase the timeout and re-evaluate in March (see the sketch below)
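An illustrative sketch of why the shorter Nagios timeout lowers the availability figure: a job-submission probe that only waits a configurable number of hours before reporting CRITICAL. The poll function is a stand-in for the real grid job-status query, not the actual SAM/Nagios probe; the exit codes follow the standard Nagios plugin convention.

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def wait_for_test_job(poll_job_status, timeout_hours, poll_interval_s=300):
    """Poll a submitted test job until it finishes or the timeout expires."""
    deadline = time.time() + timeout_hours * 3600
    while time.time() < deadline:
        status = poll_job_status()  # e.g. "DONE", "RUNNING", "ABORTED" (placeholder)
        if status == "DONE":
            return OK, "test job completed"
        if status == "ABORTED":
            return CRITICAL, "test job aborted"
        time.sleep(poll_interval_s)
    # With a 5h30 limit, jobs that merely queue longer than under the 12h SAM
    # limit are counted as failures, which lowers the Nagios availability.
    return CRITICAL, f"test job did not finish within {timeout_hours}h"
```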
Status: ATLAS on Nagios
• Followed up by Alessandro di Girolamo
• December 2010
  • Very similar results for Nagios and SAM
• January 2011
  • BNL (43% in Nagios, 86% in SAM)
  • Problem understood by ATLAS and fixed by the site
    • Nagios uses a new DN and the site's CRL was not sufficiently recent
• Nagios-based availabilities have been implemented in the ATLAS Dashboard
  • Data stored in the legacy SAM DB
  • http://tinyurl.com/dashb-sam-nagios-48h
Status: CMS on Nagios
• Followed up by Andrea Sciaba
• 'org.cms.SRM-VOGet' fails randomly
• December
  • RAL-LCG2 (81% in Nagios, 94% in SAM)
• January
  • Taiwan-LCG2 (94% in Nagios, 100% in SAM)
• Problem understood by CMS
  • Issue related to the probe, the space token and the site configuration
• Next steps (February):
  • Modify the CMS Nagios probe (a probe skeleton is sketched below)
  • Calculate and compare Dashboard availabilities
  • Run the test 'org.cms.WN-mc' with the production role
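A hedged sketch of what a Nagios-style probe skeleton looks like, to show where a modification of an 'org.cms.SRM-VOGet'-like check could make it more robust against site-configuration issues. The srm_get() call is a placeholder, not a real SRM client API, and the choice to report a missing space token as UNKNOWN is an assumption for illustration.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes

def srm_get(endpoint, space_token):
    """Placeholder for the real SRM 'get' operation used by the probe."""
    raise NotImplementedError("replace with the actual SRM client call")

def run_probe(endpoint, space_token=None):
    if not space_token:
        # A missing/unpublished space token is a configuration problem and is
        # reported as UNKNOWN here rather than as a random-looking CRITICAL.
        return UNKNOWN, "space token not configured for this site"
    try:
        srm_get(endpoint, space_token)
    except Exception as exc:  # the probe itself must never crash
        return CRITICAL, f"SRM get failed: {exc}"
    return OK, "SRM get succeeded"

if __name__ == "__main__":
    code, message = run_probe("srm://example-site.example:8443/srm/managerv2")
    print(message)
    sys.exit(code)
```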
Status: LHCb on Nagios
• Followed up by Roberto Santinelli
• Random failures during job submission
  • Most likely due to the 5h30 timeout in Nagios
• December
  • PIC (84% in Nagios, 99% in SAM)
  • INFN-CNAF (80% in Nagios, 99% in SAM)
  • RAL (79% in Nagios, 94% in SAM)
• January 2011
  • RAL (88% in Nagios, 97% in SAM)
• Suggested next step: increase the timeout and re-evaluate in March
HEP VOs – next steps
• Validate dashboard applications with Nagios tests (IT/ES)
  • Still requires the legacy SAM DB and portal
    • The portal provides a programmatic interface (PI)
  • A new interface by MyEGI has been available since mid January
    • Still a pre-production service, but it can be used for the migration (a usage sketch follows below)
• GT will stop the old SAM system as soon as we get the green light from the experiments
  • June/July 2011: last security patches for SLC4
    • The service cannot be migrated to SL5
  • We cannot afford to run two services in parallel
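A hedged sketch of how a dashboard application could switch from the legacy SAM portal PI to the new MyEGI programmatic interface. The base URL, path and query parameters below are hypothetical placeholders, not the documented API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYEGI_PI = "https://myegi.example.cern.ch/api"  # placeholder base URL

def fetch_metric_results(vo, profile, hours=48):
    """Fetch recent test results for a VO/profile from the PI as parsed JSON."""
    query = urlencode({"vo": vo, "profile": profile,
                       "last": f"{hours}h", "output": "json"})
    with urlopen(f"{MYEGI_PI}/metric-results?{query}") as response:
        return json.load(response)

# Example usage (only meaningful against a real endpoint):
# results = fetch_metric_results("cms", "CMS_CRITICAL")
# for entry in results.get("results", []):
#     print(entry["host"], entry["metric"], entry["status"])
```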
ACE Schedule
• December 2010
  • Validated the standard availability computation for OPS √
• January
  • Computation of standard availabilities for the LHC experiments (one profile per VO) √
• February
  • Multiple availabilities (different profiles, same algorithm) per VO √
• March
  • Multiple availabilities (different profiles and algorithms: CREAM CE use case) per VO (see the sketch below)
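An illustrative sketch of what "multiple availabilities per VO" means: the same raw test results combined under different profiles (metric subsets) and combination rules. The profile names, metric names and the AND/OR rules below are assumptions for the example, not the actual ACE configuration.

```python
def service_ok(results, metrics, combine=all):
    """A service is up if its metrics pass; combine=all is AND, combine=any is OR."""
    return combine(results.get(m) == "OK" for m in metrics)

def site_availability(samples, profile, combine=all):
    """Percentage of time bins in which the profile's metrics were passing."""
    good = sum(1 for results in samples if service_ok(results, profile, combine))
    return 100.0 * good / len(samples)

if __name__ == "__main__":
    # Hourly samples of metric results for one site (illustrative data).
    samples = [
        {"LCG-CE-job-submit": "OK", "CREAM-CE-job-submit": "OK"},
        {"LCG-CE-job-submit": "CRITICAL", "CREAM-CE-job-submit": "OK"},
        {"LCG-CE-job-submit": "OK", "CREAM-CE-job-submit": "CRITICAL"},
    ]
    lcg_only = ["LCG-CE-job-submit"]
    either_ce = ["LCG-CE-job-submit", "CREAM-CE-job-submit"]
    print("LCG-CE-only profile: ", site_availability(samples, lcg_only))
    print("CREAM-or-LCG profile:", site_availability(samples, either_ce, combine=any))
```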
ACE – next steps
• March: validate the ACE reports against GridView
  • OPS & LHC VOs
• March: generate two ACE reports for OPS
  • CREAM & LCG-CE, and compare the results
• April: ACE validation
  • Production readiness
• May: ACE in production mode
  • Provided that no major issues are found
CMS blacklisting
• In cooperation with Andrea Sciaba
• CREAM & LCG-CE status computed based on Nagios results
• A generic programmatic interface for data export is available in JSON/XML
• Ongoing work on a solution
  • Test BDII with blacklisting by the end of this week
  • Move to production after the CMS green light
• We would like CMS to consider using the generic PI (see the sketch below)
  • Increased flexibility
  • More uniform approach, no extra service
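A sketch of how CMS could derive a CE blacklist directly from the generic programmatic interface's JSON export instead of from FCR. The JSON field names ("host", "flavour", "status") are assumptions for illustration, not the actual export schema.

```python
import json

def build_blacklist(pi_json_text, flavours=("CREAM-CE", "LCG-CE")):
    """Return the set of CE hosts whose latest computed status is not OK."""
    entries = json.loads(pi_json_text)
    return {e["host"]
            for e in entries
            if e.get("flavour") in flavours and e.get("status") != "OK"}

if __name__ == "__main__":
    # Illustrative export; a real client would fetch this from the PI endpoint.
    sample = json.dumps([
        {"host": "ce01.example-site.org", "flavour": "CREAM-CE", "status": "OK"},
        {"host": "ce02.example-site.org", "flavour": "LCG-CE", "status": "CRITICAL"},
    ])
    print(sorted(build_blacklist(sample)))
```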
Issues
• Manpower went down by 60% (4 FTEs remain)
• The team is in a Catch-22 situation
  • All resources are absorbed by operations and support
  • Decommissioning the legacy services would free resources
    • But that requires effort that is not available
• We need to stop services or move them away, or development will freeze until more resources arrive
• Risk that the use of Nagios data via the old SAM DB continues for too long → move to the new PI
Questions?