
AMOD Report

AMOD Report. Simone Campana, CERN IT-ES. Grid Services: a very good week for sites, with no major issues for T1s and T2s. The only one to report is CASTOR@TW: a tail of problems after a hardware failure; the DB index was corrupted and needs a rebuild and a scheduled downtime.


Presentation Transcript


  1. AMOD Report: Simone Campana, CERN IT-ES

  2. Grid Services
     • A very good week for sites
     • No major issues for T1s and T2s
     • The only one to report is CASTOR@TW
     • Tail of problems after a hardware failure
     • DB index corrupted; it needs a rebuild and a scheduled downtime
     • A typhoon in TW brought complications to the schedule

  3. ATLAS Services: DDM SS
     • On Saturday, many SS restarts due to the ARDACallback agent crashing
     • The problem was not related to Dashboard but to ActiveMQ
     • Both ActiveMQ and ARDA Dashboard callbacks are sent by the same agent
     • Martin B. spotted an issue in an ActiveMQ broker
     • ActiveMQ callbacks have been disabled on many SS machines
     • CERN IT has been contacted about the faulty broker

  4. ATLAS Services: DDM SS (follow up)
     • The case needs to be added to the AMOD documentation (or the DDM documentation)
     • The AMOD needs to be able to see the ActiveMQ monitoring (now certificate protected)
     • The AMOD needs to be able to log in to Dashboard machines (was possible, not working now)
     • DDM SS needs to be protected against this behavior
     • Martin has a list of possible improvements

  5. ATLAS Services: DBs
     • On Sunday afternoon, Online to Offline replication of non-DCS data was “yellow” for 2 hours
     • This is not an ADC responsibility:
     • The P1 shifter should report to the shift leader
     • The shift leader should contact the proper people
     • Something went wrong in this chain
     • The procedure is explained in the AMOD twiki, but the AMOD missed it
     • The problem vanished by itself

  6. ATLAS Services: schedconfig
     • There was a “partial” update of schedconfig
     • Some queues in IN2P3-CC were left with “copytool=lcgcp2” and “lfcregister=None”
     • What happens:
     • The pilot uploads to the SE and does not register in LFC (a feature of lcgcp2)
     • The Panda server does not register in LFC (since lfcregister=None)
     • Both Panda and the pilot believe all is OK, and the job finishes successfully
     • Now we have dark data, and Prodsys thinks the task is complete …

  7. ATLAS Services: schedconfig (follow up)
     • Ueda is registering the missing files by hand
     • 50% of the files produced by IN2P3-CC in one week …
     • We are lucky Ueda is Ueda … I would take 2 months of holiday
     • Schedconfig should protect against this (I am not sure how, or whether and how AGIS can protect) since:
     • Human errors happen
     • The meaning and behavior of schedconfig fields are not well documented
     • We have many queues, many Panda sites and many attributes for each of them
     • BTW, let’s please push for getting rid of those Panda queues once and for all (see A. Di Girolamo’s thread)

  8. ATLAS Services: comp@P1 terminal
     • Firefox in the comp@P1 terminal crashed on Wednesday night
     • The shifter tried the restart procedure but did not succeed for 1 h
     • Unable to connect to any page
     • Then he called the AMOD, who could not do much
     • But the system magically started to work again
     • The (unconfirmed) hypothesis is that the conTZole crashed Firefox
     • This has happened in the past
     • But this time there was at least one other problem
     • Ueda suggests running conTZole and all the rest in separate windows

  9. Conclusions
     • Very quiet shift
     • My last AMOD was 1 week before the Higgs seminar …
     • 2 night calls (both of them for a good reason)
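
The schedconfig failure mode on slides 6 and 7 (neither the pilot nor the Panda server registers output in LFC) could in principle be caught by a consistency check run over queue definitions before an update is applied. The following is only a minimal sketch of that idea, not the actual schedconfig or AGIS logic. The `QueueConfig` class, the queue names, the `lcgcp` copytool, and the value `"server"` for `lfcregister` are hypothetical and used for illustration; only `copytool=lcgcp2` and `lfcregister=None` come from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

# Copy tools that upload to the SE without registering in LFC.
# Per slide 6 this is the case for lcgcp2; any other entry would be an assumption.
NON_REGISTERING_COPYTOOLS = {"lcgcp2"}


@dataclass
class QueueConfig:
    """Hypothetical, simplified view of one schedconfig queue entry."""
    name: str
    copytool: str
    lfcregister: Optional[str]  # None or the string "None" means the server does not register


def nobody_registers(queue: QueueConfig) -> bool:
    """Return True when neither the pilot nor the server would register files in LFC."""
    server_registers = queue.lfcregister not in (None, "", "None")
    pilot_registers = queue.copytool not in NON_REGISTERING_COPYTOOLS
    return not server_registers and not pilot_registers


def find_dark_data_queues(queues: List[QueueConfig]) -> List[str]:
    """List the names of queues whose settings would silently produce dark data."""
    return [q.name for q in queues if nobody_registers(q)]


if __name__ == "__main__":
    # Hypothetical queue definitions, for illustration only.
    queues = [
        QueueConfig("QUEUE_A", copytool="lcgcp2", lfcregister="None"),
        QueueConfig("QUEUE_B", copytool="lcgcp2", lfcregister="server"),
        QueueConfig("QUEUE_C", copytool="lcgcp", lfcregister=None),
    ]
    for name in find_dark_data_queues(queues):
        print(f"Inconsistent queue (no LFC registration): {name}")
```

A check of this kind only flags the one combination described in the report; the undocumented meaning of other schedconfig fields, noted on slide 7, would still need to be addressed separately.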
