AMOD Report Simone Campana CERN IT-ES
Grid Services
• A very good week for sites
• No major issues for T1s and T2s
• The only issue to report is CASTOR@TW
• Tail of problems after a hardware failure
• DB index corrupted; it needs a rebuild and a scheduled downtime
• A typhoon in TW brought complications to the schedule
ATLAS Services: DDM SS
• On Saturday there were many SS restarts due to the ARDACallback agent crashing
• The problem was not related to the Dashboard but to ActiveMQ
• Both the ActiveMQ and the ARDA Dashboard callbacks are sent by the same agent
• Martin B. spotted an issue in an ActiveMQ broker
• ActiveMQ callbacks have been disabled on many SS machines
• CERN IT has been contacted about the faulty broker
ATLAS Services: DDM SS (follow up)
• The case needs to be added to the AMOD documentation (or the DDM documentation)
• The AMOD needs to be able to see the ActiveMQ monitoring (now certificate protected)
• The AMOD needs to be able to log in to the Dashboard machines (was possible, not working now)
• DDM SS need to be protected against this behavior
• Martin has a list of possible improvements
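One way to read "protected against this behavior" is to isolate each callback channel, so that a faulty broker cannot take down the agent that also serves the Dashboard. The following is only a minimal sketch in Python under that assumption. The function `dispatch_callbacks`, the per-channel sender callables and the `max_failures` threshold are all hypothetical and are not taken from the DDM Site Services code.

```python
from typing import Callable, Dict, List


def dispatch_callbacks(
    message: dict,
    senders: Dict[str, Callable[[dict], None]],
    failures: Dict[str, int],
    max_failures: int = 3,
) -> List[str]:
    """Send `message` through every enabled channel, isolating failures.

    Returns the names of the channels that delivered the message.
    All names are hypothetical and for illustration only.
    """
    delivered = []
    for name, send in senders.items():
        # Skip a channel that has failed too many times in a row.
        if failures.get(name, 0) >= max_failures:
            continue
        try:
            send(message)
        except Exception:
            # A broken channel must not crash the agent or block the others.
            failures[name] = failures.get(name, 0) + 1
        else:
            # A successful delivery resets the consecutive-failure counter.
            failures[name] = 0
            delivered.append(name)
    return delivered
```

With a design along these lines, a failing ActiveMQ channel would be skipped after a few consecutive errors while the Dashboard callbacks keep flowing. Re-enabling a skipped channel (for instance by resetting its counter) is left out of the sketch.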
ATLAS Services: DBs
• On Sunday afternoon, Online to Offline replication of non-DCS data was “yellow” for 2 hours
• This is not an ADC responsibility:
• The P1 shifter should report to the shift leader
• The shift leader should contact the proper people
• Something went wrong in this process
• It is explained in the AMOD twiki, but the AMOD missed it
• The problem vanished by itself
ATLAS Services: schedconfig
• There was a “partial” update of schedconfig
• Some queues in IN2P3-CC ended up with “copytool=lcgcp2” and “lfcregister=None”
• What happens:
• The pilot uploads to the SE and does not register in the LFC (a feature of lcgcp2)
• The Panda server does not register in the LFC (since lfcregister=None)
• Both Panda and the pilot believe all is OK and the job finishes successfully
• Now we have dark data and Prodsys thinking the task is complete …
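The failure mode above comes down to a queue configuration in which neither the pilot nor the Panda server takes responsibility for LFC registration. A minimal sketch of a check for that combination follows, in Python. It assumes that queue settings are available as plain dictionaries, that `lcgcp2` is the only copytool that skips registration, and that any `lfcregister` value other than None means the server registers. These assumptions, the function names and the queue names in the usage note are illustrative and are not the real schedconfig schema.

```python
from typing import Dict, List, Optional

# Assumption from the slide: lcgcp2 uploads to the SE without registering in the LFC.
# The set is illustrative and not a complete list of copytools.
NON_REGISTERING_COPYTOOLS = {"lcgcp2"}

QueueConfig = Dict[str, Optional[str]]


def pilot_registers(queue: QueueConfig) -> bool:
    """True if the pilot is expected to register the output in the LFC."""
    return queue.get("copytool") not in NON_REGISTERING_COPYTOOLS


def server_registers(queue: QueueConfig) -> bool:
    """True if the Panda server is expected to register the output.

    Assumption: a missing value, an empty string or the string "None"
    all mean that the server does not register.
    """
    return queue.get("lfcregister") not in (None, "", "None")


def unregistered_queues(queues: Dict[str, QueueConfig]) -> List[str]:
    """Return the sorted names of queues where nobody registers the output."""
    return sorted(
        name
        for name, queue in queues.items()
        if not pilot_registers(queue) and not server_registers(queue)
    )
```

For example, with the hypothetical queues `{"Q1": {"copytool": "lcgcp2", "lfcregister": "None"}, "Q2": {"copytool": "lcgcp2", "lfcregister": "server"}}`, only `Q1` would be flagged. Such a check could run before a schedconfig update is accepted.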
ATLAS Services: schedconfig (follow up)
• Ueda is registering the missing files by hand
• 50% of the files produced by IN2P3-CC in one week …
• We are lucky Ueda is Ueda … I would take 2 months of holiday.
• Schedconfig should protect against this (I am not sure how, or whether and how AGIS can protect) since:
• Human errors happen
• The meaning and behavior of schedconfig fields are not well documented
• We have many queues, many Panda sites and many attributes for each of them
• BTW, let’s please push for getting rid of those Panda queues once and for all (see A. Di Girolamo’s thread)
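Identifying the files that need registering amounts to comparing what is on storage with what the catalog knows. The sketch below, in Python, assumes that both listings are already available as plain collections of file names. How those listings are obtained from the SE and the LFC is out of scope, and the function names are hypothetical.

```python
from typing import Iterable, Set


def find_dark_files(storage_files: Iterable[str], catalog_files: Iterable[str]) -> Set[str]:
    """Return the files present on storage but missing from the catalog."""
    return set(storage_files) - set(catalog_files)


def dark_fraction(storage_files: Iterable[str], catalog_files: Iterable[str]) -> float:
    """Return the fraction of distinct storage files that are dark (0.0 if storage is empty)."""
    on_storage = set(storage_files)
    if not on_storage:
        return 0.0
    return len(on_storage - set(catalog_files)) / len(on_storage)
```

The resulting set is the list of candidates for registration. The fraction gives a quick measure of how much of a site's output is affected.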
ATLAS Services: comp@P1 terminal
• Firefox in the comp@P1 terminal crashed during Wednesday night
• The shifter tried the restart procedure but did not succeed for 1h
• Unable to connect to any page
• Then he called the AMOD, who could not do much
• But the system magically started to work again
• The (unconfirmed) hypothesis is that the conTZole crashed Firefox
• This happened in the past
• But this time there was at least one other problem
• Ueda suggests running conTZole and all the rest in separate windows
Conclusions
• Very quiet shift
• My last AMOD was 1 week before the Higgs seminar …
• 2 night calls (both of them for a good reason)