This report provides updates on the WLCG service, including unscheduled interventions and service incident reports. Issues with disk buffer, LHCb job failures, and site problems are discussed.
WLCG Service Report Olof.Barring@cern.ch ~~~ WLCG Management Board, 21st April 2009
Introduction • This report covers the service since last week’s MB • GGUS ticket rate back to normal • A few unscheduled interventions occurred during this period, but no serious events that triggered Service Incident Reports
GGUS Summaries • Once again sites (e.g. IN2P3) stress that experiments should submit GGUS tickets
ATLAS alarm to FZK - details • Problem: (2009-04-16 18:07) Disk buffer in front of ATLASMCTAPE in FZK is full • Detailed description: The buffer in front of ATLASMCTAPE in FZK is full. There are errors starting at approx. 20:00 like [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [NO_SPACE_LEFT] at Thu Apr 16 19:40:08 CEST 2009 state Failed : space with id=331572 does not have enough space] Source Host [se-goegrid.gwdg.de] The problem was already reported yesterday and was due to tape migration being interrupted. I should also mention that a 1TB buffer in front of tape for both MCDISK and DATADISK for a 10% ATLAS T1 seems a bit low to me. • Solution: (2009-04-17 06:10) Hi, yes, the space token got filled up, but there was no technical or software problem, as far as I can see. According to our own monitoring graphs ( http://grid.fzk.de/monitoring/dcache_spacetokens.php ) data came in faster than it could be flushed to tape. Whether this is a matter of too small space token capacity or too slow tape writing speed has to be discussed elsewhere. Kind regards, Xavier.
ATLAS alarm to FZK - details • FZK follow-up comments of Andreas Heiss on 20/4 • “On Thursday late afternoon, around 16:20 local time, Atlas indeed sent an email requesting the increase of the spacetoken ATLASMCTAPE, but unfortunately all receivers of this email had already left for home or were on holidays. At that time, the tape backend was working with low efficiency and writing only about 120MB/s which was not sufficient. At 18:07 UTC (20:07 local), Atlas sent an alarm ticket which triggered an SMS to my mobile phone at 21:02 local time (unclear where the delay of 1h came from) which I unfortunately saw not earlier than 23:45 local time, since the phone was lying at a 'dead spot' where it didn't get the mobile network. I opened the ticket (GGUS id 47937) and saw the quoted error message concerning failing FTS transfers on a T2->T1 channel. That's why I wrote into the ticket that I do not agree to open an alarm ticket because of failing T2->T1 transfers. However, the main problem for Atlas was obviously not the failing T2->T1 transfers but failure of many local jobs which could not write their output to dCache.” • “However, independent of this incident, I think we should discuss at some point, under what circumstances an alarm ticket is ok and if there are problems where alarm tickets are not ok (e.g. failing MC data transfers or MC production jobs). We heard at the WLCG workshop that most sites trigger mobile phones of more than one person when an alarm is coming in. GridKa will also do so soon. So, a problem which justifies e.g. waking up several people at night should be _really_ a severe one. I don't want to start an email discussion, but maybe we can put that point on the agenda of e.g. a GDB?”
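As a rough illustration of the FZK buffer discussion (a 1TB buffer, tape writing at about 120MB/s), the time until such a buffer fills can be estimated from the difference between the inflow and the flush rate. This is only a back-of-the-envelope sketch: the 200MB/s inflow used below is an assumed figure that does not come from the tickets, and the function name is hypothetical.

```python
def seconds_until_full(free_bytes: float, inflow_bps: float, drain_bps: float) -> float:
    """Return the seconds until a buffer fills, or infinity if it never does.

    free_bytes: free space currently available in the buffer
    inflow_bps: rate at which data arrives (bytes per second)
    drain_bps:  rate at which data is flushed to tape (bytes per second)
    """
    net_rate = inflow_bps - drain_bps
    if net_rate <= 0:
        # Tape keeps up with (or outpaces) the incoming data.
        return float("inf")
    return free_bytes / net_rate


TB = 1e12  # decimal terabyte, in bytes
MB = 1e6   # decimal megabyte, in bytes

# 1 TB buffer and 120 MB/s tape rate are the figures quoted in the tickets;
# the 200 MB/s inflow is an ASSUMPTION chosen only for illustration.
hours = seconds_until_full(1 * TB, 200 * MB, 120 * MB) / 3600
print(f"{hours:.1f} h")  # prints "3.5 h"
```

Under these assumed rates an empty 1TB buffer would last about three and a half hours, which gives a sense of how little margin is left when the tape backend slows down outside working hours.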
LHCb alarm to FZK - details • Problem: (2009-04-11 18:36) All lhcb jobs to ce-2-fzk.gridka.de failing • Detailed description: The command "glite-brokerinfo getCE" is failing with the following error: glite-brokerinfo: error while loading shared libraries: libclassad_ns.so.0: cannot open shared object file: No such file or directory. The command is used to determine where the pilot is running. • Solution: (2009-04-14 08:25) We have added the missing library and installed a missing package. Please let us know if you still have problems with the new 64 Bit SL5 WNs. Regards, Angela
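A missing shared library of this kind can be spotted on a worker node before jobs start failing by looking for "not found" entries in the output of `ldd`. The sketch below is an illustration only, not how FZK diagnosed the problem; the helper names are hypothetical and the sample text simply mimics the usual `ldd` output format.

```python
import subprocess
from typing import List


def parse_missing_libs(ldd_output: str) -> List[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Typical ldd line: "<name> => <path> (<address>)" or "<name> => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.strip().startswith("not found"):
            missing.append(name)
    return missing


def missing_libs(binary_path: str) -> List[str]:
    """Run ldd on a binary (Linux only) and list its unresolved libraries."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=False
    )
    return parse_missing_libs(result.stdout)


# Illustrative ldd-style output, written by hand for this example.
sample = (
    "\tlibclassad_ns.so.0 => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)\n"
)
print(parse_missing_libs(sample))  # prints ['libclassad_ns.so.0']
```

On a real node one would call `missing_libs` with the full path of the binary to check; an empty list means every shared library was resolved.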
Poor availability for LHCb at several sites • LHCb problems reported: • WMS submission failures were traced to problems with short CRL for certificates created by the CERN CA (thanks to Michel Jouvin, GRIF). Fixed • CNAF BDII publishing wrong information making match-making impossible. Fixed • CERN CVS system failing: Fixed? • The job failures yesterday at Nikhef and IN2P3 are now explained by the pre-pended "root:" string to the returned tURL. ?? • The problem of jobs crashing accessing the LFC@CERN is still under investigation but seems to be that the thread pool in LFC becomes exhausted due to the way CORAL is accessing it. Understood? (“seems to be due to the suboptimal access of LFC from CORAL”)
Various site issues/highlights • Some confusion about the effect of using ‘At Risk’ for transparent interventions. Not counted as site downtime • BNL: degraded efficiency due to a large number of tape staging requests from the ATLAS production tasks (pile up and HITS merging); this causes a high load on the dCache/pnfs server, resulting in an unacceptably high failure rate for DDM transfers. • CERN: Good news on Castor Oracle BIGID problem (next slide)
Good news on Castor Oracle BIGID problem • From https://savannah.cern.ch/support/?106879 • After joint work with Sebastien and excellent feedback from some people from Oracle including Oracle development, it now looks clear that the problem is linked with the usage of "DML returning" statements accessed from OCCI. Basically it works for single row but with different types and combination of single row / multiple rows, it can “not work” and lead to issues like the Big Id issue. • Oracle has opened a documentation bug (public and accessible with Metalink account) about the issue: OCCI does not support 'returning into' … • [ Important to stress collaborative work of many people / several groups/teams ]
Summary • After a quiet Easter week, the number of GGUS tickets is back at its ‘normal’ level • A number of site issues reported for ATLAS and LHCb • FZK (both ATLAS, LHCb) • CNAF (LHCb) • Some LHCb issues also at SARA, PIC, NIKHEF, IN2P3 and CERN (LFC) • Some problems solved (or almost) • WMS submission is failing for LHCb certificate at some sites • Related to the short CRL for certificates created by the CERN CA • CASTOR “Big-Id” problem understood