This document discusses challenges faced in accessing input data for LHCb jobs, including resolving best replicas, file staging, and accessing non-staged files. It highlights issues with SE endpoints, multiple concurrent jobs, and the fragility of storage systems. It proposes solutions like a dedicated staging service and instrumenting applications for file staging. Long-term prospects include GFAL library integration for improved data access reliability.
LHCb data access • A. Tsaregorodtsev, CPPM, Marseille • 15 March 2007, Clermont-Ferrand
Job access to the input data • The DIRAC job wrapper ensures access to the input sandbox and the input data before starting the application • Downloads the input sandbox files to the local directory • Currently the InputSandbox is a DIRAC WMS specific service • It can also use generic LFNs, which is to become the main mode of operation
Job access to the input data (2) • Resolves the input data LFN into a “best replica” PFN for the execution site (see the sketch below) • Contacts the central LFC File Catalog for replica information • Picks the replica on the local storage if one is available • Attempts to stage the data files (using lcg-gt) • File staging • Gets the TURL of the staged file, accessible with the (gsi)dcap or rfio (Castor) protocols • This needs file pinning, which is not yet available • Will be available with SRM2.2
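A minimal sketch of the kind of replica resolution the wrapper performs, assuming the lcg-util commands lcg-lr and lcg-gt are available on the worker node; the function name, SE name and protocol are illustrative and this is not the actual DIRAC code:

    import subprocess

    def resolve_best_replica(lfn, local_se, protocol="gsidcap"):
        # List all replica SURLs registered in the LFC for this LFN
        out = subprocess.run(["lcg-lr", "lfn:" + lfn],
                             capture_output=True, text=True, check=True)
        surls = out.stdout.split()
        # Prefer a replica on the local storage element, if any
        local = [s for s in surls if local_se in s]
        surl = (local or surls or [None])[0]
        if surl is None:
            return None
        # Ask the SE to stage the file and hand back a protocol-level TURL
        gt = subprocess.run(["lcg-gt", surl, protocol],
                            capture_output=True, text=True)
        if gt.returncode == 0:
            return gt.stdout.splitlines()[0]   # e.g. a gsidcap or rfio TURL
        return None                            # fall back to the next step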
Job access to the input data (3) • If the previous step fails, constructs the TURL from the information stored in the DIRAC Configuration Service • E.g. rfio:/castor/cern.ch/grid/lhcb/production/DC04/v2/<filename> • This is not a good solution, as the actual data access end-points may change or multiple end-points may be available for load balancing • If that also fails (e.g. no adequate protocol is available at the site), brings the datasets to the local disk • This is not a solution for jobs with many input files • Constructs the POOL XML catalog slice with the LFN-to-PFN mapping to be used by the applications (an example is shown below)
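For illustration, a minimal sketch of writing such a POOL XML catalog slice for a single file; the helper name and the GUID, LFN and TURL values are made up, only the slice layout follows the POOL file catalog conventions:

    # Hypothetical helper writing a one-file POOL XML catalog slice
    def write_pool_slice(path, guid, lfn, turl):
        with open(path, "w") as f:
            f.write(
                '<?xml version="1.0" encoding="UTF-8" standalone="no" ?>\n'
                '<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">\n'
                '<POOLFILECATALOG>\n'
                '  <File ID="%s">\n'
                '    <physical><pfn filetype="ROOT_All" name="%s"/></physical>\n'
                '    <logical><lfn name="%s"/></logical>\n'
                '  </File>\n'
                '</POOLFILECATALOG>\n' % (guid, turl, lfn))

    write_pool_slice("pool_xml_catalog.xml",
                     "A8A1B2C3-D4E5-F607-1829-3A4B5C6D7E8F",
                     "/lhcb/production/DC04/v2/somefile.dst",
                     "rfio:/castor/cern.ch/grid/lhcb/production/DC04/v2/somefile.dst")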
Job data access problems • The lcg-gt semantics is not the same for Castor and dCache SEs • Castor: returns the TURL after the file is staged • dCache: returns the TURL to which the file will be staged • Access to a non-staged file via the dcap protocol fails (see the sketch below) • Problem to be solved by the dCache developers and service providers • Absence of file pinning • Jobs with multiple input files can see some of them garbage-collected during the application execution
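As a crude illustration of the dCache case, a wrapper can only wait and retry until the file actually becomes readable at the returned TURL; the probe, retry count and delay below are arbitrary assumptions, not an agreed procedure:

    import subprocess, time

    def get_turl_when_online(surl, is_readable, protocol="dcap",
                             retries=20, delay=60):
        # On dCache, lcg-gt may hand back a TURL before the file has been
        # staged from tape, so the wrapper has to wait before using it
        gt = subprocess.run(["lcg-gt", surl, protocol],
                            capture_output=True, text=True, check=True)
        turl = gt.stdout.splitlines()[0]
        for _ in range(retries):
            # is_readable is a caller-supplied probe (e.g. a short dcap open
            # attempt); it is left abstract because no uniform "is this file
            # online?" check exists across the SE implementations
            if is_readable(turl):
                return turl
            time.sleep(delay)      # give the stager time and retry
        return None                # give up and let the wrapper fall back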
Job data access problems (2) • Storage system capacity is not adequate for the load of multiple concurrent jobs • Up to 100 MB/s of sustained access while massive reconstruction by ~100 concurrent jobs is ongoing at a T1 site • Needs adequate file server hardware • Simultaneous staging requests from multiple jobs can bring the Castor system down • Problems seen at CNAF • Low responsiveness of the SRM interfaces under high load • Commands time out, jobs are aborted • Overall fragility of the SE end-points • Intermittent problems lower the overall efficiency • Difficult to report: one cannot file a GGUS ticket for each occasional SE failure
Addressing the problems • Building an LHCb staging service • Files are requested to be staged before the job is sent to the site • Difficulties due to the absence of uniform staging commands for the various storage systems • The SRM2.2 BringOnline function eventually • Hacky workarounds for the time being (a sketch of such a pre-staging pass is given below) • Instrumenting the applications with staging capabilities to bring files online along with the running application • In a prototype state • Eventually the GFAL library is meant to handle these cases, opening a file directly by its SURL • A long-term prospect
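A rough sketch of what such a pre-staging pass could look like with today's tools, using lcg-gt per replica as the stop-gap staging trigger until SRM2.2 BringOnline is usable everywhere; the function name and parameters are illustrative only:

    import subprocess

    def prestage(surls, protocol="gsidcap"):
        # Trigger staging for each replica; lcg-gt is abused here as a
        # "please stage this file" request, since no uniform staging
        # command exists across Castor and dCache yet
        failed = []
        for surl in surls:
            result = subprocess.run(["lcg-gt", surl, protocol],
                                    capture_output=True, text=True)
            if result.returncode != 0:
                failed.append(surl)    # keep for a later retry
        return failed                  # the staging service retries these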
Conclusions • Data access by the running jobs remains fragile for the moment • Many problems will be solved with the new SRM2.2 storage interface • Not yet available for all the back-ends • Many new features will take time to polish • This is a critical area, with the overall service stability being an issue • Especially for data stored on tape