Storage Review. David Britton, 21/Nov/08.
One Year Ago • Oversight Committee – Oct 2007. • Data was expected in early summer 2008. • CASTOR was broken (2.1.2 and 2.1.3) and a serious concern. • Alternatives to CASTOR (dCache and HPSS/Enstore) had been considered and rejected. [Timeline graphic, Oct-07 to Apr-09: CASTOR 2.1.4 release, expected data ("Data?"), and the OC meeting.]
OC Feedback • NOTES FROM THE OCTOBER 2007 OC ON CASTOR: The main concern was progress towards fixing CASTOR at the Tier-1. It was understood that various actions were ongoing, but that it was necessary to manage this (and the associated expectations on each side). We were asked to make all deadlines as clear as possible to all those involved in the project (since delays in this area inevitably have a large impact across the project). We need to agree, where necessary, sets of milestones and deadlines from CERN, the Tier-1, ATLAS, CMS and LHCb for end-December, February (prior to CCRC-1) and May (prior to CCRC-2), in anticipation of the next OC meeting in mid-May. [Timeline graphic, Oct-07 to Apr-09: CASTOR 2.1.4 release, expected data ("Data?"), and the OC meeting.]
Tier-1 Review • NOTES ON CASTOR FROM THE NOVEMBER 2007 TIER-1 REVIEW: • Concerns: "2.1 CASTOR: The effort required over the next 12 months on CASTOR may be larger than planned." This was about 5 FTE (half funded by GridPP) compared to a plan of 1.5 FTE. • Recommendations: 3.1 The CASTOR level of effort is appropriate for steady-state operation, but given the current status, it needs to be monitored. Based on current input, we do not believe that a long-term redistribution of manpower in this area would lead to an optimum overall plan. In the short term, it is recognised that dedicated effort is required for testing. This should be regarded as transitionary. (Point 2.1) [Timeline graphic, Oct-07 to Apr-09: CASTOR 2.1.4 release, expected data ("Data?"), the OC meeting, and the Tier-1 Review.]
2008 [Timeline graphic, Oct-07 to Apr-09: CASTOR releases 2.1.4, 2.1.6 and 2.1.7; CASTOR S.I.R.s; expected data ("Data?"); the OC meeting; the two CCRC08 phases; and the Tier-1 Review.]
2008 – The Present [Timeline graphic, Oct-07 to Apr-09, with the months ahead marked "???": CASTOR releases 2.1.4, 2.1.6 and 2.1.7; CASTOR S.I.R.s; expected data ("Data?"); the two OC meetings; the two CCRC08 phases; the Tier-1 Review; and this Storage Review.]
Where do we go from here? • At the review last year the feedback noted: We were also pleased to see signs of improvement w.r.t. CASTOR, following dedicated efforts from several individuals, from a potentially disastrous situation. • A year later, it is clear that the CASTOR and Database teams have put in an enormous amount of work and achieved many successes. They have significantly improved the infrastructure, monitoring, and management processes. BUT… we have not yet established a stable, reliable, load-tolerant mass storage service that is adequate for data-taking. • At this point we need to take a step back and look at the big picture, to ensure that we can address this over the next six months.
(Sample) Questions • Can we benefit by making our CASTOR setup mimic CERN’s more closely? • Cost issues? • Knowledge issues? • Manpower issues? • Other non-CERN CASTOR sites?
(Sample) Questions • Is the main problem actually the database, and is the RAC set-up a large part of most problems? • Licences and hardware costs? • Oracle expertise? • CERN / Oracle support? • Other non-CERN CASTOR sites?
(Sample) Questions • What effort is needed on CASTOR/databases over the next 6 months and the next 2 years, and can we provide it? • Backdrop: • 2 FTE funded by GridPP in this area. • 11 FTE total effort reported by the Tier-1 against 17 FTE funded.
(Sample) Questions • Have we optimised the management, operation, and internal and external interfaces of the Database and CASTOR teams? • Do we have the right skill mixture? • Is there enough agility? • How do we interface to CERN? To the experiments?
(Sample) Questions • Is our hardware resilient (enough) and is our architecture optimal? • Disk failures (correlations; replacement process)? • Load levels? • RAC?
(Sample) Questions • How do we approach future CASTOR upgrades? • Is our test-bed sufficient (including the RAC)? • Can we/do we generate representative loads (see the sketch below)? • Do we have enough (and the right sort of) manpower? • How do we make the decision to deploy?
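A minimal sketch of one way to generate crude, repeatable load against a CASTOR test instance: drive many concurrent transfers with the standard rfcp client. The stager host, service class, file size and namespace paths below are hypothetical placeholders, not RAL's real configuration, and a genuinely representative load would also need to model the experiments' file-size mix, read/write ratio and SRM traffic.

#!/usr/bin/env python3
# Sketch: concurrent rfcp writes to stress a CASTOR test instance.
# All hostnames, service classes and paths are hypothetical.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

os.environ["STAGE_HOST"] = "castor-test.example.ac.uk"  # hypothetical test stager
os.environ["STAGE_SVCCLASS"] = "testDisk"               # hypothetical service class

LOCAL_FILE = "/tmp/testfile_1GB"            # pre-generated local test file
REMOTE_DIR = "/castor/example.ac.uk/test"   # hypothetical namespace directory
N_STREAMS = 50                              # concurrent transfers

def write_one(i):
    # Copy the test file into CASTOR; return the rfcp exit code.
    dest = "%s/load_%04d" % (REMOTE_DIR, i)
    return subprocess.call(["rfcp", LOCAL_FILE, dest])

with ThreadPoolExecutor(max_workers=N_STREAMS) as pool:
    codes = list(pool.map(write_one, range(N_STREAMS)))

failures = sum(1 for c in codes if c != 0)
print("failures: %d / %d" % (failures, len(codes)))

Ramping N_STREAMS while watching stager and database load on the test RAC would give a first-order answer to whether the test-bed can reproduce production load levels.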
(Sample) Questions • Do (or will) the changing relative costs of disk and tape infrastructure change the usage model? • 5 FTE = £350k/p.a. • Tape infrastructure FY08: £694k (+ £75k media). • (Back-of-envelope sums below.)
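As a rough illustration of the comparison this slide invites (my framing, using only the figures quoted above): £350k/p.a. for 5 FTE implies about £70k per FTE, so the FY08 tape spend of £769k is of the same order as roughly 11 FTE of effort.

# Back-of-envelope sums using only the figures quoted on the slide.
staff_cost = 350_000            # £/yr for 5 FTE
per_fte = staff_cost / 5        # implied cost per FTE (~£70k)
tape_total = 694_000 + 75_000   # FY08 tape infrastructure + media, £
print(f"per FTE: £{per_fte:,.0f}")                       # £70,000
print(f"tape FY08: £{tape_total:,}")                     # £769,000
print(f"as FTE-equivalents: {tape_total / per_fte:.1f}") # ~11.0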
(Sample) Questions • Is there light at the end of the CASTOR tunnel on the timescale of data? • Fundamentally, are we in a different position this year? • What are the key indicators that show this? • How do we monitor/measure/present this?
(Sample) Questions • Are there alternatives to CASTOR that we should start to look at more seriously? • Options (from AS): • Keep running CASTOR; • Switch to dCache, either with DMF or some other HSM; • Switch to dCache with Enstore; • Write our own tape-store interface for dCache; • Buy a commercial HSM and rewrite either DPM or the CASTOR SRM to interface to it, or write our own SRM interface; • Run BeStMan or JASMINE; • Stop providing tape storage and switch to a disk-only Tier-1.
(Sample) Questions • Do we (deployers and users) still believe CASTOR is the right mid- and long-term solution? • Are the experiments' mid-/long-term plans evolving? • Archival storage on spin-on-demand disks or other technologies? • Is CASTOR appropriate for disk(-only) storage (at any level)? • Can we/should we reduce our exposure to/dependence on CASTOR?
(Sample) Questions • Can we benefit by making our CASTOR setup mimic CERN’s more closely? • Is the main problem actually the database, and is the RAC set-up a large part of most problems? • What effort is needed on CASTOR/databases over the next 6 months and the next 2 years, and can we provide it? • Have we optimised the management, operation and internal and external interfaces of the Database and CASTOR teams? • Is our hardware resilient (enough) and is our architecture optimal? • How do we approach future CASTOR upgrades? • Do (or will) the changing relative costs of disk and tape infrastructure change the usage model? • Is there light at the end of the CASTOR tunnel on the timescale of data? • Are there alternatives to CASTOR that we should start to look at more seriously? • Do we (deployers and users) still believe CASTOR is the right mid- and long-term solution? • Are ATLAS’s problems due to a lack of embedded ATLAS effort at RAL and/or their file sizes? • We’ve seen lots of load-related problems – is there a need for a CCRC09? • Database load – can we reduce it for a modest cost? • 0.5% data loss: what would (spin-on-demand) disk give us? • CNAF model of CASTOR only for tape (and StoRM for disk)? • Is there a training issue for DB experts on CASTOR architecture/operation? • Oracle/RAC architecture optimisation – what are reasonable/expected loads (by VO)?