Experiment Support

ekram

Presentation Transcript


  1. Experiment Support: AMOD weekly report, 19th July - 25th July (25 July 2011)

  2. GGUS and elogs https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports
     • 27 Team GGUS tickets + 1 Alarm ticket
     • Alarm: CERN-PROD CASTORATLAS down
     • The GGUS ALARM did not work
     • Annoying: when we need an alarm, it rarely works
     • When CASTORATLAS came back, SRM was overloaded
     • CASTOR experts say the cause was a network problem: many diskservers lost their connection at the same time
     • 252 elogs
     • 13 pages

  3. CERN-PROD CASTOR general
     • Many errors
     • Mostly: “INTERNAL ERROR” (Too many threads busy with Castor at the moment)
     • CASTOR experts are investigating
     • The problem seems to be much more frequent since the 5th of July
     • Just to mention: one diskserver has been broken (HW) for 4 weeks, and we have not received the list of files yet
     • Next AMOD should be tougher than me

  4. Cond files upload issue on CASTOR
     • For the past 3 weeks there have often been problems while uploading cond files
     • CASTOR issues related to SRM: the file is “properly” stored in CASTOR, but before the SRM PutDone there are “INTERNAL errors” and the file stays in CASTOR in “STAGEOUT” state
     • The file is accessible through rfcp or xrdcp, but this is NOT enough for us, since it cannot be exported
     • It is possible to lcg-del -l the file with the prod role
     • We also observed problems with LFC registration
        • Out of the approximately 8 to 10 times we had this problem, the LFC issue was clearly involved only once, and possibly once more
     • The FileRegister2 script is using an old version of dq2-put (since it needs the --guid option, which was not available in the latest dq2-put):
        • A new version of dq2-put with the guid option is in the validation release right now; meeting between dq2-support and Misha/cond people tomorrow
     • Ueda is calling a meeting to understand the requirements

  5. LFC 1M entries in one folder
     • Hammercloud set the UK cloud to brokeroff because Johannes had 1M entries in his user folder, so lfc-mkdir failed
     • Lfc-del -r to clean the folder
     • Are other catalogs affected in the same way?
     • Other users?
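The open questions on this slide (other catalogs, other users) amount to counting entries per folder. A minimal sketch of that check, assuming a plain list of file paths has already been exported from a catalog; the function name, the sample paths and the threshold are hypothetical and do not come from the LFC tools:

```python
from collections import Counter
import posixpath


def oversized_directories(paths, limit):
    """Return {directory: entry_count} for directories with more than `limit` direct entries."""
    # Count each path under its parent directory (direct children only).
    counts = Counter(posixpath.dirname(p) for p in paths)
    return {d: n for d, n in counts.items() if n > limit}


# Hypothetical catalog dump, for illustration only.
sample_paths = [
    "/grid/atlas/users/alice/test1",
    "/grid/atlas/users/alice/test2",
    "/grid/atlas/users/alice/test3",
    "/grid/atlas/users/bob/output1",
]

# Hypothetical threshold; a real check would use a limit far below 1M.
flagged = oversized_directories(sample_paths, limit=2)
print(flagged)
```

Counting only direct children matches the failure described here, where a single folder held all the entries.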

  6. IN2P3-CC Oracle issue
     • 3am, Tuesday night: Oracle DB HW problem
     • Reported at the WLCG daily meeting on Tuesday at 15:00
        • The report mentioned that the problem was not fully solved and the DB was running in degraded mode
        • We asked explicitly about LFC; no issues were mentioned
     • Wed at 15:00:
        • Report from IN2P3-CC: problem still not completely solved, service not stable yet
     • Wed evening:
        • ADCoS noticed that IN2P3-CC declared an outage on GCDB for LFC
        • French cloud set offline
        • Multicloud Tokyo and GRIF also set offline

  7. IN2P3-CC Oracle issue
     • Thursday at 15:00:
        • WLCG daily meeting: data corruption observed on the data recorded since Tuesday
     • Thursday late afternoon:
        • Stephane got an email from IN2P3-CC saying that the DB was restored from the backup of Monday at 12:00; all data recorded after that should be considered lost
        • French cloud blacklisted in DDM (dq2-get and SS read allowed)
        • LFC outage: autoblacklist of IN2P3-CC; we can now retrieve the list of all affected DDMEndpoints from AGIS
     • Friday morning:
        • Discussed what to do with ADC ops

  8. IN2P3-CC Oracle issue
     • During the weekend:
        • Deletion on PRODDISK
        • Produced the list of MC/group tasks in active state on Monday at 12:00
        • Consistency check done
        • 60k datasets (55k are user datasets)
     • Now:
        • French cloud re-included in DDM
        • IN2P3-CC still out of Tier0 export
        • Cloud still offline
        • We will re-evaluate tomorrow morning
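The consistency check described in slides 6 to 8 depends on separating entries registered before the backup point from those registered after it. A minimal sketch of that split, assuming each record carries a registration timestamp; the record layout and dataset names are hypothetical, and the restore point of 18 July 2011 at 12:00 is inferred from "Monday at 12:00" in the week of this report, not stated on the slides:

```python
from datetime import datetime

# Assumption: "Monday at 12:00" taken as 18 July 2011, 12:00 (time zone not given on the slides).
RESTORE_POINT = datetime(2011, 7, 18, 12, 0)


def split_by_restore_point(records, restore_point=RESTORE_POINT):
    """Split (name, registered_at) pairs into names kept in the backup and names lost after it."""
    kept, lost = [], []
    for name, registered_at in records:
        if registered_at > restore_point:
            lost.append(name)   # registered after the backup: missing from the restored DB
        else:
            kept.append(name)   # registered before the backup: still in the restored DB
    return kept, lost


# Hypothetical records, for illustration only.
records = [
    ("user.alice.dataset1", datetime(2011, 7, 17, 9, 30)),
    ("mc.task.dataset2", datetime(2011, 7, 19, 4, 15)),
    ("group.phys.dataset3", datetime(2011, 7, 18, 12, 0)),
]

kept, lost = split_by_restore_point(records)
print(kept, lost)
```

An entry stamped exactly at the restore point is treated as kept here. That boundary choice is arbitrary and would need confirming against the actual backup.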
