AMOD Weekly report (Ale, Alexei, Jarka) Doug Benjamin (AMOD shadow)
Active week
[Plots: total jobs completed per hour (analysis/production), Tier 0 LSF running jobs, DDM data transfers per day; approximate scales 40K, 6k, 3k, 10K jobs and 700 TB, 100 TB]
Various ATLAS items
ATLAS items to follow up:
• FTS jobs not being dropped: add the conditions (error messages) to dq2.cfg – to be done. FTS returns exit code #1 if the command does not succeed.
• SLS for the Site Services becomes orange/red only if Restarts > 3 within 30 minutes; because of the DQ2 restart loop this can never happen in practice. Cedric is aware. Possible solutions: a) lower the threshold, e.g. to 2 restarts, or b) increase the time window (a sketch of this check follows after this slide).
• Night shift Thursday/Friday: the shifter missed 2 big issues. To be followed up by the shift coordinators.
• FR cloud Site Services: dq2 crashes and restarts frequently. Related to a multi-hop subscription – experts are pondering a solution.
• AGIS ToACache being tested in the SS FT, one deletion agent (atlddm17), and SS CERN (TW and IT): https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/36887
• ALARM (Wednesday) Tier 0 submission, bsub slow:
  • changing the network card of the LSF master (upgrade from 1 to 10 Gb);
  • changing the way the CE queries the LSF master, i.e. grouping the queries from the CREAM CE – expected to reduce the load considerably;
  • there is an open call with the LSF vendor (Platform, now part of IBM); an expert is looking at the problem right now.
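A minimal sketch (not the actual SLS/DQ2 code) of the restart-rate check discussed above: the status goes orange/red when the number of Site Services restarts inside a time window exceeds a threshold. The two proposed fixes correspond to lowering RESTART_THRESHOLD or widening WINDOW; names and values here are illustrative.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)      # option (b): increase this window
RESTART_THRESHOLD = 3               # option (a): lower this to 2

def sls_status(restart_times, now=None):
    """Return 'green' or 'orange/red' from a list of restart timestamps."""
    now = now or datetime.utcnow()
    recent = [t for t in restart_times if now - t <= WINDOW]
    return "orange/red" if len(recent) > RESTART_THRESHOLD else "green"

# Example: three restarts in half an hour still show green with the current
# threshold, which is why the alarm never fires in practice.
restarts = [datetime.utcnow() - timedelta(minutes=m) for m in (5, 12, 25)]
print(sls_status(restarts))
```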
Never a dull moment
Monday:
• CVMFS conddb file issue: Doug and Misha were contacted in the morning; Misha was on holidays and Doug was flying to CERN. Doug fixed the problem at lunch time (not during the flight).
• Instructions written for the AMOD – shift coordinators need to make sure they are in the instructions.
Tuesday:
• Many ATLAS Central Services (running on VMs) went haywire, due to a glitch on the hypervisors: http://itssb.web.cern.ch/service-incident/cviclr04fc-cluster-failure/12-06-2012
• CERN-PROD ALARM: T0 merge files not accessible (GGUS:83178).
• FZK FTS server channel stuck (GGUS:83111); site admin talked with FTS.
Wednesday:
• MC12*merge.TAG* replication issue: all the data were set to secondary. Immediate actions: change the metadata to primary and replicate everything to CERN-PROD. Follow-up: make sure this won't happen again (Datri modified for group production to datadisk), discuss the number of replicas, etc. The loop is still to be closed.
• RAL: attempted Oracle 11 upgrade caused a UK-wide FTS outage; FTS moved to an alternative database.
Never a dull moment (2)
Thursday:
• CERN-PROD LSF bsub slow – ALARM GGUS:83252.
• RAL Oracle 11 upgrade did not make it and the site rolled back; to be re-scheduled. For jobs submitted before the start of the intervention the FTS server was returning exit code 256, the same as "no contact to FTS server", so the ATLAS Site Services were polling all the already-lost FTS jobs. Since this is definitely related to the aborted upgrade attempt, we do not report further (i.e. no GGUS).
• A growing number of sites using CVMFS are seeing occasional errors with: [CVMFS sites] Error: cmtsite command was timed out: https://savannah.cern.ch/support/?129468
• There is an issue when a DDM SS box spends too much time scanning the FTS jobs; the further away the site, the fewer FTS jobs it takes to hit the problem. Apparently the limit is reached at about 400 FTS jobs for the TRIUMF FTS and 800-1000 for European T1s. DDM (Cedric) is trying to find the optimal way to avoid this (a sketch of one option follows after this slide).
Friday:
• CERN-PROD: overnight SRM transfer failures on CERN-PROD_TZERO; later in the morning stages were successful and the ticket was closed: GGUS:8329
• TRIUMF: FTS errors with proxy: GGUS:83293
• FR site services – see the FR cloud SS item on the "Various ATLAS items" slide.
• An overnight power cut affected the FZK DDM SS box.
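A minimal sketch, not DQ2 code, of the idea mentioned above: cap the number of FTS jobs a Site Services box polls per cycle, with a lower cap for distant FTS servers where round-trip time dominates. The numbers are the empirical limits quoted above; the function and dictionary names are hypothetical.

```python
POLL_CAP = {"TRIUMF": 400}          # distant FTS server hits the limit earlier
DEFAULT_CAP = 800                   # European Tier-1s tolerate ~800-1000 jobs

def jobs_to_poll(site, pending_job_ids):
    """Return the subset of pending FTS job IDs to poll in this cycle."""
    cap = POLL_CAP.get(site, DEFAULT_CAP)
    return pending_job_ids[:cap]    # remaining jobs wait for the next cycle

# Example: 1000 pending jobs against the TRIUMF FTS -> only 400 polled now.
print(len(jobs_to_poll("TRIUMF", list(range(1000)))))
```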
Finally the weekend
Saturday/Sunday:
• IN2P3 FTS channels got stuck (GGUS:83320, solved): some channel agents did not recover the Oracle connection after the 04:00 logrotate, due to a problem with Oracle virtual IPs. Solved by defining a new connection string which does not use the Oracle virtual IPs (a sketch follows after this slide).
• TRIUMF: 1745 files lost. Files declared to the consistency service (Savannah:95440). The ticket will be updated when the exact number of lost files is confirmed.
• Extra T1-T1 subscriptions (some missing datasets).
• All CERN resources given to Tier 0 processing.
• A critical directory fills up and our monitoring fails to catch it.
• Data quality monitoring issues.
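A minimal sketch of the kind of change applied at IN2P3: point the channel agents at the physical database listeners rather than the Oracle virtual IPs, so connections can be re-established after the 04:00 logrotate. The host names and service name below are placeholders, not the real IN2P3 ones.

```python
# Old-style DSN going through a virtual IP (placeholder host names).
OLD_DSN = ("(DESCRIPTION="
           "(ADDRESS=(PROTOCOL=TCP)(HOST=fts-vip.example.org)(PORT=1521))"
           "(CONNECT_DATA=(SERVICE_NAME=fts_prod)))")

# New-style DSN listing the physical listeners directly.
NEW_DSN = ("(DESCRIPTION=(ADDRESS_LIST="
           "(ADDRESS=(PROTOCOL=TCP)(HOST=ftsdb-node1.example.org)(PORT=1521))"
           "(ADDRESS=(PROTOCOL=TCP)(HOST=ftsdb-node2.example.org)(PORT=1521)))"
           "(CONNECT_DATA=(SERVICE_NAME=fts_prod)))")

# e.g. with cx_Oracle: cx_Oracle.connect("fts_user", password, NEW_DSN)
print(NEW_DSN)
```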
OWL monitoring / shift operations
• Around 00:00 Saturday morning the Tier 0 directory used for log file merging became full, stopping processing (an atypical occurrence).
• The browser used by the shifter at P1 crashed and was restarted at ~12:30 – the normally updating plots were wedged and the warning icons were stuck on green.
• Tier 0 processing was lost for the night, until the shift change.
• The OWL shifter is a distributed computing expert.
• Our shift monitoring is not really designed for middle-of-the-night operation, when people are most tired. My suggestion: review all shift monitoring under the assumption that shifters are tired, and make it easier for them to spot serious errors (a sketch of one such check follows after this slide).
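A minimal sketch (hypothetical, not existing ATLAS monitoring code) of the kind of staleness check suggested above: raise a loud alarm when a monitored quantity has not updated within its expected interval, instead of relying on a tired OWL shifter to notice a wedged plot or a stuck green icon. Source names and intervals are illustrative.

```python
import time

# Maximum allowed age (seconds) of the last update, per monitoring source.
EXPECTED_UPDATE_S = {"tier0_log_merge_dir": 600, "p1_dq_plots": 900}

def stale_sources(last_update, now=None):
    """Return the monitoring sources whose last update is older than allowed."""
    now = now or time.time()
    return [name for name, max_age in EXPECTED_UPDATE_S.items()
            if now - last_update.get(name, 0) > max_age]

# Example: the DQ plots last updated 30 minutes ago -> flagged as stale.
print(stale_sources({"tier0_log_merge_dir": time.time() - 60,
                     "p1_dq_plots": time.time() - 1800}))
```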
Many thanks
• Senior AMODs (Ale, Jarka and Alexei) – excellent trainers, they know how to crack the whip.
• Comp@P1 shifters – for keeping the data flowing during the week.
• ADCoS shifters and experts.
• Tier 0 experts – making the data flow.
• Site admins – quickly recovering from the daily hiccups.
• ADC experts.