WLCG Service Report Andrea.Valassi@cern.ch ~~~ WLCG Management Board, 24th April 2012
Introduction
• 5 busy weeks since the last MB report on March 20th
• LHC beam commissioning and data taking (first stable beams on April 5)
• Busy but successful, with smooth activity also over the Easter break
• 5 Service Incident Reports received:
  • CASTOR name server stuck, 3 CMS files truncated, on Apr 4 (ALARM and SIR)
  • GGUS unreachable for some regions due to DNS update on Mar 20 (SIR)
  • RAID corruption (Adaptec 6445) at PIC on Mar 15, 1269 ATLAS files lost (SIR)
  • Defective Enstore LTO5 cartridge at PIC on Mar 9, one ATLAS file lost (SIR)
  • Database server upgrades to Oracle 11g at T0 and T1 in Q1 2012 (SIR)
• 10 real GGUS ALARMs (7 for ATLAS, 1 for CMS, 2 for LHCb)
  • Five at CERN, two at KIT, one each at INFN, Taiwan and IN2P3
• Many other issues reported at the daily meetings, most notably:
  • FTS upgrades to 2.2.8 and related issues at several sites
  • LHCb file corruption at IN2P3 (GGUS:80338)
  • Large fraction of short (<200s) pilot jobs at IN2P3 from ATLAS and LHCb
  • One node of the CMS online DB rebooted due to excessive load during data taking
  • GEANT network problem on April 13 (preliminary SIR)
  • New CVMFS client deployed to fix cache issue reported by LHCb
Support-related events since last MB
• There were 11 real (non-test) ALARM tickets since the 2012/03/20 MB (5 weeks):
  • 8 submitted by ATLAS (of which GGUS:81429 turned out to be a false ALARM, though not a test, and is therefore not drilled here)
  • 1 by CMS
  • 2 by LHCb
• Ticket closing is now automatic after 10 working days, as per EGI reporting requirements (ticket closing in CERN SNOW is also automatic, after only 3 working days).
• The GGUS monthly release took place on 2012/03/20. Bugs related to the Remedy upgrade, which prevented email notifications and attachments from being delivered, were discovered and fixed thanks to the regular test ALARM suite. Details: Savannah:127010
• Details follow…
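The ticket counts quoted here (11 tickets, 8 from ATLAS) and in the Introduction (10 real ALARMs, 7 from ATLAS) differ by one. One reading that makes them agree is that the false ATLAS ALARM GGUS:81429 is excluded from the Introduction figures. The short Python sketch below checks that arithmetic. The numbers are copied from the slides, the variable names are illustrative only, and nothing is queried from GGUS.

```python
# Figures as quoted on the slides; this is a consistency check, not a GGUS query.
submitted = {"ATLAS": 8, "CMS": 1, "LHCb": 2}   # tickets since the 2012/03/20 MB
false_alarms = {"ATLAS": 1}                      # GGUS:81429 (false, not a test)

# Assumption: the Introduction's "real" count excludes the false ALARM.
real = {vo: n - false_alarms.get(vo, 0) for vo, n in submitted.items()}

# Site breakdown quoted in the Introduction.
by_site = {"CERN": 5, "KIT": 2, "INFN": 1, "Taiwan": 1, "IN2P3": 1}

total_submitted = sum(submitted.values())
total_real = sum(real.values())
total_by_site = sum(by_site.values())

print(f"submitted={total_submitted} real={total_real} by_site={total_by_site}")
print(real)
```

Under that assumption the per-experiment totals and the per-site totals both come to 10.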
LHCb ALARM -> Tape recall rate very low at GridKa (GGUS:80589)
ATLAS ALARM -> CERN-IN2P3 transfers not processed by FTS (GGUS:80602)
CMS ALARM -> CERN storage management system shows issues with file copying (GGUS:80905, SIR)
ATLAS ALARM -> IN2P3 transfer errors due to destination SRM authentication (GGUS:81286)
ATLAS ALARM -> CERN raw data retrieval problem from CASTOR (GGUS:81352)
Analysis of the reliability plots: Week of 19/03/2012 – 25/03/2012
Trans-VO events
• [None]
ATLAS
• 1.1 IN2P3 (25/03). CREAM-CE tests failing on cccreamceli01 for the entire week and on ccreamceli06 for 50% of 25/03.
• 1.2 NIKHEF (25/03). juk and stremsel.nikhef.nl failing CREAM-CE tests for ~35% of 25/03.
• 1.3 SARA-MATRIX (21/03). creamce and creamce2.gina.sara.nl failing tests for ~35% and ~55% of 21/03.
ALICE
• [Nothing to report]
CMS
• 3.1 ASGC (24 & 25/03). srm2.grid.sinica.edu.tw failing VO Put tests on 24 & 25/03; cream03.grid.sinica.edu.tw failing JobSubmit tests from 07:00 on 25/03 onwards.
• 3.2 IN2P3-CC (22/03-23/03). cccreamceli05.in2p3.fr failing org.cms.WN-swinst tests for 13 hours, with service availability unknown for another 20 hours. cccreamceli07.in2p3.fr failing org.cms.WN-swinst tests for 9 hours, with service availability unknown for another 12 hours.
LHCb
• 4.1 CNAF (19/03). SRM-VOLs test failing from 00:00 to 09:00 on 19/03.
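Several entries for this week give failures as an approximate fraction of a day. As a rough aid to comparing them with the entries quoted in hours, the Python sketch below converts those percentages. The dictionary keys are illustrative labels, and it is assumed that the "~35% & ~55%" for SARA-MATRIX apply to creamce and creamce2 respectively. The inputs are approximate, so the outputs are too.

```python
HOURS_PER_DAY = 24

def percent_of_day_to_hours(percent):
    """Convert a percentage of one 24-hour day into hours."""
    return percent * HOURS_PER_DAY / 100

# Approximate percentages as quoted in the report (labels are illustrative).
reported_percent = {
    "IN2P3 ccreamceli06 (25/03)": 50,
    "NIKHEF (25/03)": 35,
    "SARA-MATRIX creamce (21/03)": 35,    # assumed pairing with creamce
    "SARA-MATRIX creamce2 (21/03)": 55,   # assumed pairing with creamce2
}

hours = {label: percent_of_day_to_hours(p) for label, p in reported_percent.items()}

for label, h in hours.items():
    print(f"{label}: about {h:.1f} h")
```

On this reading, ~35% of a day is a little over 8 hours and ~55% a little over 13 hours, comparable to the 9 to 13 hours of failing tests quoted for the IN2P3-CC CMS entries.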
Analysis of the reliability plots: Week of 26/03/2012 – 01/04/2012
Trans-VO events
• [None]
ATLAS
• 1.1 NIKHEF (26/03). JobSubmit tests cancelled or timed out; no ticket was opened for it.
ALICE
• [Nothing to report]
CMS
• 3.1 IN2P3 (28 & 29/03). CREAM-CE test failures (SAV).
• 3.2 ASGC (29 & 30/03). SRMv2 test failures (GGUS).
LHCb
• 4.1 RAL (30/03). DirectJobSubmit CREAM-CE test failures for ~3 hours.
Analysis of the reliability plots: Week of 02/04/2012 – 08/04/2012
Trans-VO events
• [None]
ATLAS
• [Nothing to report]
ALICE
• [Nothing to report]
CMS
• [Nothing to report]
LHCb
• 4.1 PIC (02/04-03/04). Annual power supply check. From 17:00 UTC on 02/04 the org.sam.CREAMCE-DirectJobSubmit SAM tests were cancelled, and from 02:00 UTC on 03/04 the SRM SAM tests org.lhcb.SRM-VOLsDir, org.lhcb.SRM-VOLs, and org.lhcb.SRM-VODe were failing. The failures disappeared at 17:00 UTC on 03/04, when the downtime finished.
Analysis of the reliability plots: Week of 09/04/2012 – 15/04/2012
Trans-VO events
• [None]
ATLAS
• 1.1 TRIUMF (10/04-11/04). CREAM-CE and SRMv2 SAM/Nagios tests failed between 8am UTC on 10/04 and 6am on 11/04, due to an unscheduled downtime at TRIUMF-LCG2 caused by 2 site-wide power cuts.
ALICE
• [Nothing to report]
CMS
• 3.1 TW_ASGC (11/04-12/04). CREAM-CE and SRMv2 SAM/Nagios tests failed between 5pm UTC on 11/04 and 11am on 12/04, due to an unscheduled storage downtime.
LHCb
• [Nothing to report]
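The two failure windows reported for this week are given as start and end times rather than durations. The Python sketch below works out their lengths, assuming that the end times (6am on 11/04 and 11am on 12/04) are also in UTC, which the slide states explicitly only for the start times. The function and dictionary names are illustrative.

```python
from datetime import datetime, timezone

def window_hours(start, end):
    """Length of a failure window in hours."""
    return (end - start).total_seconds() / 3600

UTC = timezone.utc

# Start/end times as quoted in the report; end times are ASSUMED to be UTC.
windows = {
    "TRIUMF": (datetime(2012, 4, 10, 8, tzinfo=UTC), datetime(2012, 4, 11, 6, tzinfo=UTC)),
    "TW_ASGC": (datetime(2012, 4, 11, 17, tzinfo=UTC), datetime(2012, 4, 12, 11, tzinfo=UTC)),
}

durations = {site: window_hours(start, end) for site, (start, end) in windows.items()}

for site, hours in durations.items():
    print(f"{site}: {hours:.0f} h of failing tests")
```

Under that assumption the TRIUMF window lasts 22 hours and the TW_ASGC window 18 hours.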
Analysis of the reliability plots: Week of 16/04/2012 – 22/04/2012
Trans-VO events
• [None]
ATLAS
• 1.1 INFN-T1 (18/04). Storage test results degraded for 9 hrs during downtime for tape facility upgrade.
• 1.2 NDGF-T1 (20/04). Storage test results degraded for 7 hrs due to an issue with dCache (GGUS:81447).
ALICE
• [Nothing to report]
CMS
• [Nothing to report]
LHCb
• [Nothing to report]
Conclusions
• Business as usual: busy (again) but successful
  • First stable beams on April 5th
• Upgrade to FTS 2.2.8 has been completed
  • Several issues with the 2.2.8 release have been reported by the sites
  • All such issues have been addressed by patches on top of FTS 2.2.8
  • These (as yet unreleased) patches will be included in the next EMI release