Experiment Support
AMOD weekly report, 19th July - 25th July
25 July 2011

GGUS and elogs
https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports
• 27 Team GGUS tickets + 1 Alarm ticket
• Alarm: CERN-PROD CASTORATLAS down
• The GGUS ALARM did not work
• Annoying: when we need an alarm, it rarely works
• When CASTORATLAS came back, SRM was overloaded
• CASTOR experts say that the cause was a network problem: many disk servers lost connection at the same time
• 252 elogs
• 13 pages

CERN-PROD CASTOR general
• Many errors
• Mostly: “INTERNAL ERROR: Too many threads busy with Castor at the moment”
• CASTOR experts are investigating
• The problem seems to be much more frequent since the 5th of July
• Just to mention: one disk server has been broken (HW) for 4 weeks, and we have not received the list of affected files yet
• Next AMOD should be tougher than me

Cond files upload issue on CASTOR
• For the past 3 weeks there have often been problems when uploading cond files
• CASTOR issues related to SRM: the file is “properly” stored in CASTOR, but “INTERNAL errors” occur before the SRM PutDone and the file stays in CASTOR in “STAGEOUT” state
• The file is accessible through rfcp or xrdcp, but this is NOT enough for us, since it cannot be exported
• It is possible to lcg-del -l the file with the prod role
• We also observed problems with LFC registration
  • Of the roughly 8 to 10 times we had this problem, an LFC issue was clearly involved only once, and possibly a second time
• The FileRegister2 script uses an old version of dq2-put (it needs the --guid option, which was not available in the latest dq2-put)
  • A new version of dq2-put with the --guid option is in the validation release right now; meeting between dq2-support and Misha/cond people tomorrow
• Ueda is calling a meeting to understand the requirements

LFC: 1M entries in one folder
• HammerCloud set the UK cloud to brokeroff because Johannes had 1M entries in his user folder, which made lfc-mkdir fail
• lfc-del -r used to clean the folder
• Do other catalogs have the same problem?
• Are other users affected?
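
The two open questions on this slide (other catalogs, other users) could be approached by counting entries per directory in a catalog listing. The following is only a minimal sketch: it assumes a plain list of logical file names is already available from some catalog dump, and the function name, the example paths and the threshold are hypothetical, not part of any LFC tool.

```python
from collections import Counter
import posixpath


def oversized_directories(lfns, threshold):
    """Return {directory: entry_count} for every directory that directly
    holds more than `threshold` of the given logical file names."""
    # Count how many entries sit directly in each parent directory.
    counts = Counter(posixpath.dirname(lfn) for lfn in lfns)
    return {d: n for d, n in counts.items() if n > threshold}


# Hypothetical listing, for illustration only.
example = [
    "/grid/atlas/users/a/f1",
    "/grid/atlas/users/a/f2",
    "/grid/atlas/users/a/f3",
    "/grid/atlas/users/b/f1",
]
flagged = oversized_directories(example, threshold=2)
```

In practice the threshold would be set well below the size at which lfc-mkdir started failing, so that folders are flagged before they cause a brokeroff.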

IN2P3-CC Oracle issue
• 3am Tuesday night: Oracle DB HW problem
• Tuesday at 15:00: reported at the WLCG daily meeting
  • The report mentioned that the problem was not fully solved and that the DB was running in degraded mode
  • We asked explicitly about LFC; no issues were mentioned
• Wednesday at 15:00
  • Report from IN2P3-CC: problem still not completely solved, service not stable yet
• Wednesday evening
  • ADCoS noticed that IN2P3-CC had declared an outage in GOCDB for LFC
  • French cloud set offline
  • Multicloud Tokyo and GRIF also set offline

IN2P3-CC Oracle issue
• Thursday at 15:00
  • WLCG daily meeting: data corruption observed in the data recorded since Tuesday
• Thursday late afternoon
  • Stephane got an email from IN2P3-CC saying that the DB had been restored from the backup taken on Monday at 12:00; all data recorded after that should be considered lost
  • French cloud blacklisted in DDM (dq2-get and SS read allowed)
  • LFC outage: autoblacklist of IN2P3-CC; from AGIS we can now retrieve the list of all affected DDMEndpoints
• Friday morning
  • Discussed with ADC ops what to do

IN2P3-CC Oracle issue
• During the weekend
  • Deletion on PRODDISK
  • Produced the list of MC/group tasks in active state on Monday at 12:00
  • Consistency check done
  • 60k datasets (55k are user datasets)
• Now
  • French cloud re-included in DDM
  • IN2P3-CC still out of Tier0 export
  • Cloud still offline
  • We will re-evaluate tomorrow morning
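
The core of the recovery is a cutoff rule: anything registered in the LFC after the backup time (Monday at 12:00) is no longer in the restored catalog. The sketch below illustrates only that rule; it is not the consistency check that was actually run. It assumes each dataset comes with a registration timestamp, and the function name, the dataset names and the concrete date (Monday 18 July 2011, inferred from the report week, timezone ignored) are assumptions for illustration.

```python
from datetime import datetime


def split_by_backup(entries, cutoff):
    """Split (name, registered_at) pairs into names still present in the
    restored catalog and names registered after the backup cutoff."""
    kept, lost = [], []
    for name, registered_at in entries:
        # Entries registered at or before the cutoff are in the backup.
        (kept if registered_at <= cutoff else lost).append(name)
    return kept, lost


# Assumed cutoff: Monday 18 July 2011, 12:00 (date inferred, not stated).
cutoff = datetime(2011, 7, 18, 12, 0)
# Hypothetical dataset names and timestamps.
entries = [
    ("user.dataset.A", datetime(2011, 7, 17, 9, 30)),
    ("user.dataset.B", datetime(2011, 7, 18, 12, 0)),
    ("mc.dataset.C", datetime(2011, 7, 19, 2, 15)),
]
kept, lost = split_by_backup(entries, cutoff)
```

The `lost` list is the candidate set for re-registration or deletion; the files themselves may still exist on storage even though their catalog entries are gone.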