GridPP 0x15th (21st) collaboration meeting, Swansea, 3-4 Sep 0x07D8 (2008). Jens Jensen: CASTOR at RAL
RAL CASTOR
2.1.7-15: allows disk pools shared between service classes (a CMS request, with GridFTP2)
2.1.7-16-1: fixes SEGV on a non-existent service class; prepareToGet bugfix for xrootd; scheduling and GC bugfixes
CERN is on -16 (or -14); the trailing -1 contains the mighunter DB procedure hotfix
The Bug
Caused a lot of downtime: first Atlas, then LHCb, then CMS
id2type.id suddenly "6.022×10²³": the sequence number comes back as a real rather than an int
Related to the stager "bulk" code (but not consistently; also seen with bulk==1)
Related to RAC? Related to CASTOR instances sharing RAC clusters?
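To illustrate the symptom (not the diagnostic procedure actually used at RAL), a check along the following lines could flag id2type rows whose id is no longer a sane integer sequence value. The table name id2type comes from the slide; the connection details and the size threshold are placeholders invented for this sketch.

# Sketch only: flag id2type rows whose id looks like a corrupted sequence value.
# Connection details and the sanity threshold are placeholders, not RAL's real setup.
import cx_Oracle

conn = cx_Oracle.connect("stager", "password", "castor-db-example")  # hypothetical credentials/DSN
cur = conn.cursor()

# An id returned as a real (e.g. 6.022e23) will either carry a fractional part
# or be absurdly larger than anything the sequence could plausibly have reached.
cur.execute("""
    SELECT id, type FROM id2type
    WHERE id <> TRUNC(id) OR id > 1e15
""")
for row_id, row_type in cur:
    print("suspicious id2type row: id=%s type=%s" % (row_id, row_type))

cur.close()
conn.close()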
The Bug
Seen in all RAC instances, but not on certification
Shotgun workaround: restart the RH (request handler) automatically when the error appears; the same approach applied to mighunter
A database parameter was added; it needs a database restart (completed Wednesday for the remaining DB, CMS)
Seems to have fixed the problem? Being watched. Also used at CERN, apparently with no side effects
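A minimal sketch of what such a "shotgun" watchdog could look like: tail the daemon's log and restart the service whenever a known error signature appears. The log path, error pattern and restart command below are placeholders, not the actual RAL scripts.

# Sketch of a "shotgun" watchdog: restart a daemon whenever a known error shows up.
# LOG_FILE, ERROR_PATTERN and RESTART_CMD are placeholders, not RAL's actual setup.
import subprocess
import time

LOG_FILE = "/var/log/castor/rhd.log"          # hypothetical request-handler log
ERROR_PATTERN = "ORA-01722"                   # hypothetical error signature
RESTART_CMD = ["service", "rhd", "restart"]   # hypothetical restart command

def follow(path):
    # Yield new lines appended to the file, like `tail -f`.
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

for line in follow(LOG_FILE):
    if ERROR_PATTERN in line:
        print("error seen, restarting request handler")
        subprocess.call(RESTART_CMD)
        time.sleep(60)  # back off so one incident triggers only one restart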
The Bug
As of 11:25 today, Sebastien reports a possible memory corruption in the variable passing the length to Oracle
Affecting ASGC; thought to be unrelated to RAL's problem. However, it is still in the bulk code?
Fixed in 2.1.8, backported to 2.1.7, in the next release
LHCb down...
Yesterday morning: LSF logs filling up, then not rotating
Workaround first, then an LSF fix
Not affecting CMS and Atlas
Repack
It was now working... sort of
Occasionally files get stuck in stage-in and have to be unstuck "manually"
High stager load with many tapes (as seen at INFN)
Stuck again as of a few minutes ago...
The repack instance shares a RAC with CMS
Data transfer and access
xrootd doesn't work at RAL; not clear why not
GridFTP v2: forks resources on demand
Dark Data
Dark storage: storage which is there but cannot be reached
Now also published via the BDII (not in production yet, though)
Published as "Reserved" (but also for non-space-token space); not yet fully WLCG compliant, but WLCG may change
Need nearline space published (we could do that in the past)
Dark data: orphaned data
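As a rough illustration of how the published numbers could be inspected, the sketch below queries a site BDII over LDAP for the storage-area sizes. The BDII endpoint is a made-up placeholder, and the GLUE 1.3 attribute names are given as commonly published; treat both as assumptions rather than the exact setup described here.

# Sketch: list the online/nearline sizes a BDII publishes for the SE's storage areas.
# The endpoint is hypothetical; attribute names follow the GLUE 1.3 schema as usually published.
import subprocess

BDII = "site-bdii.example.ac.uk:2170"   # hypothetical site BDII endpoint
cmd = [
    "ldapsearch", "-x", "-LLL",
    "-H", "ldap://%s" % BDII,
    "-b", "o=grid",
    "(objectClass=GlueSA)",
    "GlueSAReservedOnlineSize", "GlueSATotalNearlineSize", "GlueSAUsedNearlineSize",
]
subprocess.call(cmd)  # prints the matching GlueSA entries to stdout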
Releases
Release management: an extremely complex system; CERN do testing, and we do testing too (need to track CERN)
Differences between the labs (also INFN, ASGC)
Good support from CERN, often with RAL-specific patches
CASTOR 2.1.8: expected "mid September"; secure RFIO (here be dragons)
Status
CASTOR instances at RAL: Atlas, CMS, LHCb, gen, preprod, certification
24/7 support via callouts (see Andrew's talk)
Communication is important: the CASTOR-Experiments meeting; the CASTORPP-L (announce) and CASTOR-SUPPORT lists; "CASTOR external" meetings; the CASTOR team's own CASTOR meeting; CASTOR-OP
CASTOR team overview
Bonny: "benevolent coordinator"
Chris: LSF, disk server deployment
Tim: tapes, robot
Shaun: SRM, CASTOR debugging
Cheney: monitoring, Nagios, servers
Guy: servers, setup
Jens: SRM info, occasional debugging/support
And of course the T1 team and the DB team
Final words
"The log files never lie (well, hardly ever)" (Shaun)
"The Grid is an experimental science" (me)
Conclusion
High priority at RAL: CASTOR is obviously critical to UKI; this is understood, and lots of effort goes into it
Extremely complex system: three teams at RAL working together; no single person knows everything
Communication is important; testing is important
Differences between setups/labs