LHCb data access
A.Tsaregorodtsev, CPPM, Marseille
15 March 2007, Clermont-Ferrand
Job access to the input data
• The DIRAC job wrapper ensures access to the input sandbox and data before starting the application
• Downloads input sandbox files to the local directory (see the sketch below)
  • Currently the InputSandbox is a DIRAC WMS-specific service
  • Can also use generic LFNs, which is to become the main mode of operation
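As an illustration of the sandbox-download step, a minimal sketch is shown below. It assumes the lcg-cp command-line tool is available for generic LFN entries; the function names, the "LFN:" prefix convention, and the flags shown are illustrative, not the actual DIRAC wrapper code.

```python
# Sketch of the input-sandbox download step; the DIRAC WMS sandbox transfer
# itself is left as a placeholder, since it is a DIRAC-specific service.
import os
import subprocess

def download_from_sandbox_service(entry, workdir):
    # Placeholder for the DIRAC WMS InputSandbox service call.
    raise NotImplementedError("DIRAC-specific sandbox transfer goes here")

def download_input_sandbox(entries, workdir):
    """Fetch every input sandbox entry into the job's working directory."""
    for entry in entries:
        if entry.startswith("LFN:"):
            # Generic LFN: copy the file from grid storage with lcg-cp
            lfn = entry[len("LFN:"):]
            dest = os.path.abspath(os.path.join(workdir, os.path.basename(lfn)))
            subprocess.run(["lcg-cp", "--vo", "lhcb",
                            "lfn:" + lfn, "file://" + dest], check=True)
        else:
            download_from_sandbox_service(entry, workdir)
```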
Job access to the input data (2)
• Resolves each input data LFN into a “best replica” PFN for the execution site (see the sketch below)
  • Contacts the central LFC File Catalog for replica information
  • Picks a replica on the local storage, if any
  • Attempts to stage the data files (using lcg-gt)
• File staging
  • Gets the TURL of the staged file, accessible with the (gsi)dcap or rfio (Castor) protocols
  • This needs file pinning, which is not yet available
  • Will be available with SRM 2.2
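A sketch of this resolution chain is given below, shelling out to the lcg-utils commands mentioned on this slide (lcg-lr to list replicas, lcg-gt to stage and obtain a TURL). Flags and output parsing are simplified, and the substring match against the local SE name is an illustrative shortcut rather than the actual DIRAC logic.

```python
# Sketch of LFN -> best replica -> TURL resolution using lcg-utils commands.
import subprocess

def list_replicas(lfn, vo="lhcb"):
    """Return the list of SURLs registered in the LFC for this LFN."""
    out = subprocess.run(["lcg-lr", "--vo", vo, "lfn:" + lfn],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

def best_replica(surls, local_se):
    """Prefer a replica hosted on the site's local storage element."""
    for surl in surls:
        if local_se in surl:
            return surl
    return surls[0] if surls else None

def stage_and_get_turl(surl, protocol="gsidcap"):
    """Ask the SE to stage the file and return a TURL for the given protocol."""
    out = subprocess.run(["lcg-gt", surl, protocol],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]  # first token of the output is the TURL
```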
Job access to the input data (3)
• If the previous step fails, constructs the TURL from information stored in the DIRAC Configuration Service
  • E.g. rfio:/castor/cern.ch/grid/lhcb/production/DC04/v2/<filename>
  • This is not a good solution, as the actual data-access end-points may change, or multiple end-points may be available for load balancing
• If that also fails (e.g. no adequate protocol is available at the site), brings the datasets local
  • This is not a solution for jobs with many input files
• Constructs the POOL XML slice with the LFN-to-PFN mapping to be used by the applications (see the sketch below)
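For illustration, the sketch below shows the fallback TURL construction from configured protocol/prefix values and the writing of a minimal POOL XML slice. The element layout follows the POOL file catalog schema; the configuration keys are assumed names, and the File ID is generated here with a random UUID, whereas the real slice would carry the GUID registered in the file catalogue.

```python
# Sketch of the fallback path: build a TURL from configured values and write
# a minimal POOL XML slice mapping LFNs to PFNs.
import uuid

def construct_turl(lfn, se_config):
    # se_config would come from the DIRAC Configuration Service, e.g.
    # {"protocol": "rfio", "prefix": "/castor/cern.ch/grid"} (assumed keys)
    return "%s:%s%s" % (se_config["protocol"], se_config["prefix"], lfn)

def write_pool_xml_slice(lfn_to_pfn, path="pool_xml_catalog.xml"):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">',
             '<POOLFILECATALOG>']
    for lfn, pfn in lfn_to_pfn.items():
        # In the real slice the File ID is the file's catalogue GUID.
        lines += ['  <File ID="%s">' % uuid.uuid4(),
                  '    <physical><pfn filetype="ROOT_All" name="%s"/></physical>' % pfn,
                  '    <logical><lfn name="%s"/></logical>' % lfn,
                  '  </File>']
    lines.append('</POOLFILECATALOG>')
    with open(path, "w") as f:
        f.write("\n".join(lines))
```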
Job data access problems
• lcg-gt semantics are not the same for Castor and dCache SEs
  • Castor: returns the TURL after the file has been staged
  • dCache: returns the TURL to which the file will be staged
  • Access to a non-staged file via the dcap protocol fails
  • Problem to be solved by the dCache developers and service providers
• Absence of file pinning
  • Jobs with multiple input files can see some of them garbage-collected during the application execution
Job data access problems (2)
• Storage system capacity is not adequate for the load of multiple concurrent jobs
  • Up to 100 MB/s of sustained access while massive reconstruction is ongoing at a T1 site with ~100 concurrent jobs
  • Needs adequate file-server hardware
• Simultaneous staging requests from multiple jobs can bring the Castor system down
  • Problems seen at CNAF
• Low responsiveness of the SRM interfaces under high load
  • Commands time out, jobs are aborted
• Overall fragility of the SE end-points
  • Intermittent problems lowering the overall efficiency
  • Difficult to report: cannot file a GGUS ticket for each occasional SE failure
Addressing the problems
• Building an LHCb staging service
  • Files are requested to be staged before the job is sent to a site
  • Difficulties due to the absence of uniform staging commands for the various storage systems
  • The bring-online function of SRM 2.2, eventually
  • Hacky workarounds for the time being (see the sketch below)
• Instrumenting the applications with staging capabilities to bring files online while the application is running
  • In a prototype state
• Eventually the GFAL library is meant to handle these cases, opening a file by its SURL
  • A long-term prospect
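The sketch below illustrates the kind of per-backend workaround alluded to above: dispatching a different pre-stage command depending on the storage back-end. The commands shown (stager_get for Castor, "dccp -P" for dCache) are examples of back-end-specific staging tools, not necessarily those used by the actual LHCb staging service.

```python
# Sketch of a per-backend pre-staging dispatch, issued before the job
# reaches the site; fire-and-forget, with no error handling.
import subprocess

STAGE_COMMANDS = {
    "castor": lambda path: ["stager_get", "-M", path],  # Castor stager request
    "dcache": lambda path: ["dccp", "-P", path],         # dCache pre-stage only
}

def prestage(files_by_backend):
    """Issue a staging request for each file, grouped by storage back-end."""
    for backend, paths in files_by_backend.items():
        build_command = STAGE_COMMANDS[backend]
        for path in paths:
            subprocess.run(build_command(path), check=False)
```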
Conclusions
• Data access by running jobs remains fragile for the moment
• Many problems will be solved with the new SRM 2.2 storage interface
  • Not yet available for all back-ends
  • Many new features; it will take time to polish them
• This is a critical area, with overall service stability being an issue
  • Especially for data stored on tape