AMOD report 24 – 30 September 2012 • Fernando H. Barreiro Megino, CERN IT-ES
Data transfers • > 1M files a day • High number of transfer failures caused by a few NL T2s
Tue25 - High load on PanDA Servers • Average time for DQ2+LFC registration increased dramatically, causing high load on the PanDA servers • Some LFC timings in the logs indicated that the registration slowness was in DQ2 (a minimal timing sketch follows below) • [Plots: CC writer 1, CC writer 2; number of sessions open on the ADCR3 instance, mostly by the ATLAS_LFC_W user]
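As a rough illustration of how per-step timings can localize this kind of slowness, the sketch below times the two registration steps separately so the slow one stands out in the logs. dq2_register and lfc_register are hypothetical placeholders, not the real DQ2 or LFC client API, and the durations and log format are illustrative only.

```python
# Minimal sketch: time each registration step separately so that slow steps
# stand out in the logs. dq2_register / lfc_register are hypothetical
# placeholders, not the real DQ2 or LFC client calls.
import time
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def timed(label, func, *args):
    """Run func(*args) and log its wall-clock duration."""
    start = time.time()
    result = func(*args)
    logging.info("%s took %.3f s", label, time.time() - start)
    return result

def dq2_register(dataset, files):
    time.sleep(0.8)   # stands in for the DQ2 central-catalogue step

def lfc_register(files):
    time.sleep(0.1)   # stands in for the LFC step

files = ["lfn.%04d" % i for i in range(100)]
timed("DQ2 registration", dq2_register, "some.dataset", files)
timed("LFC registration", lfc_register, files)
```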
Tue25 - High load on PanDA Servers • Other observations that came up during the investigation • Some improvements to the LFC client will be discussed at the “DB technical meeting on the LFC” on Wednesday 3rd Oct • PanDA server LFC registration should be activated for all sites in order to avoid individual registrations by the pilot (see the sketch after this slide) • aCT registers in bursts without bulk methods: in the LFC logs we saw 4k accesses over one hour and only 7 accesses over another hour • There were 2 SS machines serving the DE cloud (i.e. the same sites twice) with similar configuration
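To make the bulk-registration point concrete, here is a minimal sketch contrasting per-file registration (what pilot or aCT-style bursts amount to) with a single bulk call. register_file and register_files_bulk are hypothetical stand-ins, not the actual DQ2/LFC client API; only the number of catalogue accesses is modelled.

```python
# Hypothetical sketch contrasting per-file and bulk catalogue registration.
# register_file / register_files_bulk stand in for the real client calls;
# only the number of catalogue round trips is counted.

round_trips = 0

def register_file(lfn, pfn, guid):
    """Per-file registration: one catalogue access per file."""
    global round_trips
    round_trips += 1

def register_files_bulk(entries):
    """Bulk registration: one catalogue access for the whole batch."""
    global round_trips
    round_trips += 1

files = [("lfn.%04d" % i, "srm://site/path/%04d" % i, "guid-%04d" % i)
         for i in range(4000)]

# Per-file bursts: roughly the 4k accesses per hour seen in the LFC logs.
round_trips = 0
for lfn, pfn, guid in files:
    register_file(lfn, pfn, guid)
print("per-file registration: %d catalogue accesses" % round_trips)

# Bulk registration through the PanDA server: a single access for the batch.
round_trips = 0
register_files_bulk(files)
print("bulk registration:     %d catalogue access(es)" % round_trips)
```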
Thu27 - SS callbacks to dashboard piling up (SS-FR) • Initially we thought it was exclusively due to the CERN network intervention • After checking the logs we saw slow callbacks before the intervention on different SS machines • D. Tuckett is checking the situation
Other incidents and downtimes • Monday • New PanDA proxy had not been updated on the PanDA Monitor machines (Savannah: 97737) • INFN-T1 scheduled downtime for ~1 hour • Tuesday • RAL 6h upgrade to CASTOR 2.1.12-10. Alastair set the UK cloud to brokeroff on the previous evening • Thursday • CERN network intervention to replace some switches. Services at risk were CASTOR, EOS, elog and dashboard. Smooth intervention - NTR. • Friday • BNL to ASGC transfer errors. Being investigated by both sides during the weekend. ASGC FTS is blocked from accessing the BNL SRM and the routing path has been changed (GGUS:86537)
Other incidents and downtimes (2) • Sunday • PVSS DCS replication with large delays due to a high insertion rate. The DCS expert had to be called on Sunday • RAL had failing jobs due to put errors and transfer errors, including T0 export. Caused by a problem with the Stager databases and resolved late on Sunday evening (GGUS:86552) • Saturday • SS-SARA had CRITICAL errors. MySQL DB corruption? Problem to be understood by the DDM experts.
Acknowledgements • Except for occasional highlights it has been a very quiet week • Thanks a lot to • ADCoS expert & shifters, and to the Comp@P1 shifter for the good work • the experts of the different components and sites for their quick reaction • Alessandro and Ueda for their support