
CASTOR at RAL

GridPP 21st collaboration meeting, Swansea, 3-4 Sep 2008. Jens Jensen.


Presentation Transcript


  1. GridPP 21st collab'n mtg, Swansea, 3-4 Sep 2008. Jens Jensen. CASTOR at RAL

  2. RAL CASTOR
     - 2.1.7-15: allows disk pools shared between service classes (CMS request) (with GridFTP2)
     - 2.1.7-16-1: fixes SEGV on non-existent service class; PrepareToGet bugfix for xrootd; scheduling and GC bugfixes
     - CERN on -16 (or -14)
     - "Dash 1" contains the mighunter DB procedure hotfix

  3. The Bug
     - Caused a lot of downtime: first Atlas, then LHCb, then CMS
     - id2type.id suddenly "6.022×10²³": the sequence number suddenly became real rather than int
     - Related to the stager "bulk" code (but not consistently); also seen with bulk==1
     - Related to RAC? Related to CASTOR instances sharing RAC clusters?
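A corrupted sequence number like this is easy to flag defensively before it propagates. A minimal sketch of such a sanity check (hypothetical, not CASTOR code; only the column name id2type.id comes from the slide):

```python
MAX_PLAUSIBLE_ID = 2**63  # far above any real sequence value (assumption)

def check_id(value):
    """Reject id2type.id values that have gone bad.

    A healthy id is a plain integer sequence number; a float such as
    6.022e23 means the column has silently become real rather than int.
    """
    if isinstance(value, float):
        raise TypeError(f"id came back as real, not int: {value!r}")
    if not 0 <= value < MAX_PLAUSIBLE_ID:
        raise ValueError(f"id out of plausible range: {value!r}")
    return value
```

A check like this at the DB boundary turns silent corruption into an immediate, attributable error instead of inconsistent downstream failures.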

  4. The Bug
     - Seen in all RAC instances, but not in certification
     - Shotgun workaround: restart the request handler automatically when the error appears
     - Same approach applied to mighunter
     - Database parameter added; needs a database restart (completed Wed. for the remaining db (CMS))
     - Seems to have fixed the problem? Being watched
     - Also used at CERN; seems to have no side effects
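The shotgun workaround (restart the daemon whenever the error shows up) amounts to a small log-watching loop. A sketch, in which the log path, error pattern, and restart command are all hypothetical:

```python
import re
import subprocess
import time

LOG_PATH = "/var/log/castor/rh.log"         # hypothetical log location
ERROR_PAT = re.compile(r"ORA-\d{5}")        # hypothetical Oracle error pattern
RESTART_CMD = ["service", "rh", "restart"]  # hypothetical restart command

def needs_restart(new_lines):
    """True if any freshly appended log line matches the error pattern."""
    return any(ERROR_PAT.search(line) for line in new_lines)

def watch(poll_seconds=30):
    """Poll the log; restart the daemon whenever the error appears."""
    offset = 0
    while True:
        with open(LOG_PATH) as log:
            log.seek(offset)          # read only what was appended
            lines = log.readlines()
            offset = log.tell()
        if needs_restart(lines):
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(poll_seconds)
```

The same loop, pointed at a different log file and daemon, would cover the mighunter case mentioned above.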

  5. The Bug
     - As of 11:25 today, Sebastien reports: possibility of memory corruption in a variable passing a length to Oracle
     - Affecting ASGC; thought to be unrelated to RAL's
     - However, it's still in the bulk code?
     - Fixed in 2.1.8, backported to 2.1.7; in the next release

  6. LHCb down...
     - Yesterday morning: LSF logs filling up, then not rotating
     - Workaround applied, then an LSF fix
     - Not affecting CMS and Atlas

  7. Repack
     - It was working... sort of
     - Occasionally files get stuck in stage-in and have to be unstuck "manually"
     - High stager load with many tapes (seen at INFN)
     - Stuck again, as of a few minutes ago...
     - The repack instance shares a RAC with CMS

  8. Data transfer and access
     - Xrootd doesn't work at RAL; not clear why
     - GridFTP v2: forks resources on demand

  9. Dark Data
     - Dark Storage: storage which is there but cannot be reached
       - Now also published via BDII (not in production though)
       - "Reserved" (but also for non-space-token)
       - Not yet fully WLCG compliant, but WLCG may change
       - Need nearline space published (could do it in the past)
     - Dark Data: orphaned data
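Both kinds of inconsistency fall out of a two-way comparison between the name server's view and a disk-server listing. A minimal sketch with hypothetical inputs (mapping the slide's terms onto the two set differences is my reading, not the slide's):

```python
def compare_views(namespace_files, disk_files):
    """Split files by whether the namespace and the disk agree on them.

    on_disk_only:      bytes exist but cannot be reached via the namespace
    in_namespace_only: the namespace lists a file whose bytes are gone
    """
    ns, disk = set(namespace_files), set(disk_files)
    return {
        "on_disk_only": disk - ns,        # candidate dark storage
        "in_namespace_only": ns - disk,   # candidate orphaned entries
        "in_both": ns & disk,             # healthy
    }
```

In practice the comparison has to tolerate in-flight transfers, so a file would only be declared dark after showing up in the same bucket across repeated scans.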

  10. Releases
     - Release management: extremely complex system
       - CERN do testing; we do testing too, and need to track CERN
       - Differences between the labs (also INFN, ASGC)
       - Good support from CERN, often with RAL-specific patches
     - CASTOR 2.1.8
       - Expected "mid September"
       - Secure RFIO (here be dragons)

  11. Status
     - CASTOR instances at RAL: Atlas, CMS, LHCb, gen, preprod, certification
     - 24/7 support via callouts (see Andrew's talk)
     - Communication is important:
       - CASTOR-Experiments meeting
       - CASTORPP-L (announce) and CASTOR-SUPPORT
       - "CASTOR external" meetings
       - CASTOR team: CASTOR meeting, CASTOR-OP

  12. CASTOR Team overview
     - Bonny: "benevolent coordinator"
     - Chris: LSF, disk server deployment
     - Tim: tapes, robot
     - Shaun: SRM, CASTOR debugging
     - Cheney: monitoring, Nagios, servers
     - Guy: servers, setup
     - Jens: SRM info, occasional debugging/support
     - And of course the T1 team and the DB team

  13. Final words
     - "The log files never lie (well, hardly ever)" – Shaun
     - "The Grid is an experimental science" – me

  14. Conclusion
     - High priority at RAL: CASTOR is obviously critical to UKI
     - This is understood... lots of effort
     - Extremely complex system: no single person knows everything
     - Three teams at RAL working together
     - Communication is important; testing is important
     - Differences between setups/labs
