
LHCb data access

This document discusses challenges faced in accessing input data for LHCb jobs, including resolving best replicas, file staging, and accessing non-staged files. It highlights issues with SE endpoints, multiple concurrent jobs, and the fragility of storage systems. It proposes solutions like a dedicated staging service and instrumenting applications for file staging. Long-term prospects include GFAL library integration for improved data access reliability.


Presentation Transcript


1. LHCb data access
A. Tsaregorodtsev, CPPM, Marseille
15 March 2007, Clermont-Ferrand

2. Job access to the input data
• The DIRAC job wrapper ensures access to the input sandbox and the input data before starting the application
• Downloads the input sandbox files to the local directory (a minimal sketch of this step follows below)
• Currently the InputSandbox is a DIRAC WMS-specific service
• It can also use generic LFNs, which is to become the main mode of operation
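
As an illustration of the sandbox step, a minimal sketch is given below. It assumes the generic-LFN mode and the standard lcg-cp client; the function name, error handling and file URL form are illustrative assumptions, not the actual DIRAC wrapper code.

```python
import os
import subprocess

def download_input_sandbox(sandbox_lfns, workdir):
    """Fetch the input sandbox files into the local job directory.

    Illustrative sketch of the generic-LFN mode only; the DIRAC WMS sandbox
    service uses its own transfer mechanism.
    """
    os.makedirs(workdir, exist_ok=True)
    local_paths = []
    for lfn in sandbox_lfns:
        dest = os.path.abspath(os.path.join(workdir, os.path.basename(lfn)))
        # Copy the file registered under this LFN to the local directory.
        subprocess.run(["lcg-cp", f"lfn:{lfn}", f"file://{dest}"], check=True)
        local_paths.append(dest)
    return local_paths
```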

3. Job access to the input data (2)
• Resolves each input data LFN into a "best replica" PFN for the execution site (see the sketch below)
  • Contacts the central LFC File Catalog for replica information
  • Picks up the replica on a local storage element, if any
  • Attempts to stage the data files (using lcg-gt)
• File staging
  • Getting the TURL of the staged file, accessible with the (gsi)dcap or rfio (Castor) protocols
  • This needs file pinning, which is not yet available
  • It will become available with SRM 2.2
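
The replica resolution and staging attempt could look roughly like the sketch below. The lcg-lr and lcg-gt commands are the LCG utilities the slide refers to, but the output parsing, the LOCAL_SE_HOSTS list, the protocol order and the function name are illustrative assumptions.

```python
import subprocess

LOCAL_SE_HOSTS = ["srm.example-t1.org"]   # SEs considered local to this site (illustrative)
PROTOCOLS = ["gsidcap", "rfio"]           # protocols the application can read directly

def resolve_best_replica(lfn):
    """Resolve an LFN into a TURL usable by the job (illustrative sketch)."""
    # Ask the LFC (through the LCG utilities) for all registered replicas.
    out = subprocess.run(["lcg-lr", f"lfn:{lfn}"],
                         capture_output=True, text=True, check=True)
    replicas = out.stdout.split()

    # Prefer a replica hosted on a storage element local to the execution site.
    local = [surl for surl in replicas
             if any(host in surl for host in LOCAL_SE_HOSTS)]
    for surl in local or replicas:
        for proto in PROTOCOLS:
            # lcg-gt asks the SE to stage the file and return a TURL for the
            # requested protocol; its semantics differ per back-end (slide 5).
            res = subprocess.run(["lcg-gt", surl, proto],
                                 capture_output=True, text=True)
            if res.returncode == 0 and res.stdout.strip():
                return res.stdout.splitlines()[0].strip()
    return None   # fall back to the configuration-based TURL (next slide)
```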

4. Job access to the input data (3)
• If the previous step fails, constructs the TURL based on the information stored in the DIRAC Configuration Service
  • E.g. rfio:/castor/cern.ch/grid/lhcb/production/DC04/v2/<filename>
  • This is not a good solution, as the actual data access end-points may change or multiple end-points may be available for load balancing
• If that also fails (e.g. no adequate protocol is available at the site), brings the datasets local
  • This is not a solution for jobs with many input files
• Constructs the POOL XML slice with the LFN to PFN mapping to be used by the applications (a sketch of these steps follows below)
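
Both fallback steps can be sketched as follows: building a TURL from a configured access prefix, then writing the POOL XML slice that maps LFNs to PFNs for the application. The prefix value, function names and generated GUIDs are illustrative; the XML layout follows the usual POOL file catalogue structure.

```python
import os
import uuid

# Illustrative per-site setting, as it might come from the DIRAC Configuration
# Service: the protocol prefix to prepend to an LFN at this storage element.
SITE_ACCESS_PREFIX = "rfio:/castor/cern.ch/grid"

def turl_from_config(lfn):
    """Build a TURL by prepending the configured access prefix to the LFN."""
    return SITE_ACCESS_PREFIX + lfn   # e.g. rfio:/castor/cern.ch/grid/lhcb/...

def write_pool_xml_slice(lfn_to_pfn, path="pool_xml_catalog.xml"):
    """Write a POOL XML catalogue slice with the LFN to PFN mapping.

    Approximate layout only; the GUIDs here are generated locally rather than
    taken from the file catalogue.
    """
    entries = []
    for lfn, pfn in lfn_to_pfn.items():
        entries.append(
            f'  <File ID="{uuid.uuid4()}">\n'
            f'    <physical><pfn filetype="ROOT_All" name="{pfn}"/></physical>\n'
            f'    <logical><lfn name="{lfn}"/></logical>\n'
            f'  </File>')
    xml = ('<?xml version="1.0" encoding="UTF-8" standalone="no" ?>\n'
           '<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">\n'
           '<POOLFILECATALOG>\n' + "\n".join(entries) + '\n</POOLFILECATALOG>\n')
    with open(path, "w") as f:
        f.write(xml)
    return os.path.abspath(path)
```

The application then opens its inputs through this slice instead of querying the catalogue itself.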

5. Job data access problems
• The lcg-gt semantics are not the same for Castor and dCache SEs
  • Castor: returns the TURL after the file has been staged
  • dCache: returns the TURL to which the file will be staged
  • Access to a non-staged file via the dcap protocol fails (a possible workaround is sketched below)
  • A problem to be solved by the dCache developers and service providers
• Absence of file pinning
  • Jobs with multiple input files can see some of them garbage-collected while the application is executing
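
A hacky client-side workaround, shown purely as an illustration: keep probing the TURL with a cheap open until the file is really on disk. The probe_open hook is hypothetical; in practice it might be a short dcap read or a ROOT open with a small timeout.

```python
import time

def wait_until_online(turl, probe_open, timeout=3600, poll=60):
    """Poll a TURL until it can actually be opened.

    Works around the semantic difference between Castor and dCache: on dCache
    the TURL may be handed back before the file is staged, so a dcap open
    fails until the tape recall completes. probe_open is a caller-supplied
    function returning True when a cheap open/read of the TURL succeeds
    (a hypothetical hook, not part of any real library).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if probe_open(turl):
            return True
        time.sleep(poll)
    return False
```

Such polling only papers over the missing pinning and online-status information; proper pinning and staging semantics are what SRM 2.2 is expected to provide.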

6. Job data access problems (2)
• Storage system capacity is not adequate to the load of multiple concurrent jobs
  • Up to 100 MB/s of sustained access while massive reconstruction is ongoing at a T1 site with ~100 concurrent jobs
  • Needs adequate file server hardware
• Simultaneous staging requests from multiple jobs can bring the Castor system down
  • Problems seen at CNAF
• Low responsiveness of the SRM interfaces under high load
  • Commands time out, jobs are aborted
• Overall fragility of the SE end-points
  • Intermittent problems lowering the overall efficiency
  • Difficult to report: one cannot file a GGUS ticket for each occasional SE failure

7. Addressing problems
• Building an LHCb staging service
  • Files are requested to be staged before the job goes to a site
  • Difficulties due to the absence of uniform staging commands for the various storages (a per-back-end sketch follows below)
  • The srmBringOnline function of SRM 2.2, eventually
  • Hacky workarounds for the time being
• Instrumenting the applications with staging capabilities to bring files online alongside the running application
  • In a prototype state
• Eventually the GFAL library is meant to handle these cases, opening a file by its SURL
  • A long-term prospect
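
Until a uniform srmBringOnline call is usable everywhere, a staging service has to dispatch per-back-end commands. The sketch below shows the idea; the command templates (stager_get for Castor, dccp -P for dCache) and the SE-type mapping are assumptions about typical site setups, not LHCb's actual service.

```python
import subprocess

# Illustrative mapping from SE type to a staging command template; the real
# commands differ per site and back-end, which is exactly the difficulty
# mentioned above.
STAGE_COMMANDS = {
    "castor": ["stager_get", "-M"],   # Castor stager client (assumed available)
    "dcache": ["dccp", "-P"],         # dCache pre-stage request (assumed available)
}

def request_staging(se_type, physical_path):
    """Ask the storage system to bring one file online (sketch only).

    With SRM 2.2 this becomes a single srmBringOnline call through a uniform
    interface instead of back-end specific commands.
    """
    cmd = STAGE_COMMANDS.get(se_type)
    if cmd is None:
        raise ValueError(f"no staging recipe for SE type {se_type!r}")
    return subprocess.run(cmd + [physical_path]).returncode == 0

def prestage_for_job(se_type, physical_paths):
    """Pre-stage all input files of a job before it is sent to the site."""
    return {path: request_staging(se_type, path) for path in physical_paths}
```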

8. Conclusions
• Data access by the running jobs remains fragile for the moment
• Many problems will be solved with the new SRM 2.2 storage interface
  • Not yet available for all the back-ends
  • Many new features; they will take time to polish
• This is a critical area, with the overall service stability being an issue
  • Especially for data stored on tape storage
