MICE Data Flow
Henry Nebrensky, Brunel University
Henry Nebrensky - MICE CM24 - 2 June 2009
MICE Data and the Grid
• Storage, archiving and dissemination of experimental data:
  • Not been a high priority so far
  • Overall strategy not documented anywhere obvious
  • Individual work on parts of this – but do the pieces fit together?
• Grid:
  • Certain Grid services are separately funded to provide a production service to MICE
  • Provides a ready-made set of building blocks – but “we” have to put them together
  • MICE need to know what they want, to make sure that the finished edifice meets all their needs (and that the Grid includes all the necessary bricks)
Decision Time
• We need to start putting the pieces together very soon.
• Once data starts going to tape it will not be possible to change how and where it is stored
  • need an agreed plan in the near future (i.e. by end of CM24)
• There are a number of unresolved issues – see Note 252 and the data flow diagram.
  • Data volumes, lifetime and access control mostly unclear
  • (LFC) File naming scheme – see MICE Note 247
  • File metadata requirements – raised at CM23
The Awesome Power of Grid Computing
The Grid provides seamless interconnection between tens of thousands of computers.
It therefore generates new acronyms and jargon at superhuman speed.
Grid Middleware
We are currently using EGEE/WLCG middleware and resources, as they are receiving significant development effort and are a reasonable match for our needs (shared with various minor experiments such as LHC).
Outside Europe other software may be expected – e.g. the OSG stack in the US.
Interoperability is, from our perspective, yet another “known unknown”...
MICE and Grid Data Storage
• The Grid provides MICE not only with computing (number-crunching) power, but also with a secure global framework allowing users access to data
• Good news: storing development data on the Grid keeps it available to the collaboration – not stuck on an old PC in the corner of the lab
• Bad news: loss of ownership – who picks up the data curation responsibilities?
• Data can be downloaded from the Grid to a user’s “own” PC – it doesn’t need to be analysed remotely
Grid File Management (1)
Each file is given a unique, machine-generated GUID when stored on the Grid.
The file is physically uploaded to one (or more) SEs (Storage Elements), where it is given a machine-generated SURL (Storage URL).
Machine-generated names are not (meant to be) human-usable.
A “replica catalogue” tracks the multiple SURLs of a GUID.
For sanity's sake we would like to associate a sensible filename with each file (LFN, Logical File Name).
A “file catalogue” is a database that translates between something that looks like a Unix filesystem and the GUIDs and SURLs needed to actually access the data on the Grid.
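The relationship between LFN, GUID and SURL can be illustrated with a toy in-memory catalogue. This is only a sketch of the bookkeeping, not the LFC API: the class `ToyCatalogue`, its methods, the example LFN and the `srm://...example.org` SURLs are all hypothetical, and real SURLs are generated by the SE rather than supplied by the caller.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyCatalogue:
    """Toy stand-in for a combined file and replica catalogue (not the LFC API)."""

    lfn_to_guid: dict = field(default_factory=dict)    # file catalogue: LFN -> GUID
    guid_to_surls: dict = field(default_factory=dict)  # replica catalogue: GUID -> [SURL]

    def register(self, lfn, surl):
        """Register a new file: mint a GUID and record its first replica."""
        if lfn in self.lfn_to_guid:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = str(uuid.uuid4())  # machine-generated, not meant for humans
        self.lfn_to_guid[lfn] = guid
        self.guid_to_surls[guid] = [surl]
        return guid

    def add_replica(self, lfn, surl):
        """Record a further physical copy of an already registered file."""
        self.guid_to_surls[self.lfn_to_guid[lfn]].append(surl)

    def replicas(self, lfn):
        """Translate a human-readable LFN into the SURLs of its replicas."""
        return list(self.guid_to_surls[self.lfn_to_guid[lfn]])


catalogue = ToyCatalogue()
lfn = "/grid/mice/MICE/example-run.dat"  # hypothetical LFN
catalogue.register(lfn, "srm://se1.example.org/data/0001")     # hypothetical SURL
catalogue.add_replica(lfn, "srm://se2.example.org/data/9f3c")  # hypothetical SURL
print(catalogue.replicas(lfn))
```

The user only ever handles the LFN; the two dictionaries play the roles of the file catalogue and the replica catalogue respectively.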
Grid File Management (2)
• MICE has an instance of LFC (LCG File Catalogue) run by the Tier 1 at RAL
• The LFC service can do both the replica and LFN cataloguing
• LFC presents the user with what looks like a normal Unix filespace – the Grid client software keeps track of the data behind the scenes.
[Figure: LFC, from MICE Note 247]
Data Integrity
(For recent SE releases) a checksum is calculated automatically when a file is uploaded.
This can be checked when the file is transferred between SEs, or the value retrieved to check local copies.
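Checking a local copy against a stored checksum might look like the sketch below. The slide does not say which checksum algorithm the SEs use; Adler-32 is assumed here purely for illustration, and the function names and the temporary file are hypothetical. In practice the expected value would be retrieved from the SE or catalogue rather than computed locally.

```python
import os
import tempfile
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def matches_stored_checksum(path, expected_hex):
    """Compare a local file against a checksum string retrieved elsewhere."""
    return adler32_of_file(path) == expected_hex.lower().zfill(8)


# Demonstration with a throwaway file standing in for a downloaded copy.
payload = b"MICE example payload" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
local_copy = tmp.name

# Stand-in for the value that would be fetched from the storage system.
expected = f"{zlib.adler32(payload) & 0xFFFFFFFF:08x}"
ok = matches_stored_checksum(local_copy, expected)
os.remove(local_copy)
print(ok)
```

Reading in chunks keeps memory use flat even for large raw data files.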
The VOMS server
File permissions will be needed, e.g. to ensure that users can’t accidentally delete RAW data. These rules will need to last for at least the life of the experiment.
VOMS is a Grid service that allows us to define specific roles (e.g. DAQ data archiver) which will then be allowed certain privileges (such as writing to tape at RAL Tier 1).
The VOMS service then maps humans to those roles, via their certificates.
The MICE VOMS server is provided via GridPP at Manchester, UK. New Mice are added or assigned to roles by the VO Manager (and Mouse) Paul Hodgson.
Thus the VOMS service provides us with a single portal where we can add/remove/reassign Mice, without needing to negotiate with the operators of every Grid resource worldwide – we actually keep control “in-house.”
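The two-step mapping described here (certificate to roles, roles to privileges) can be sketched as two lookup tables. This is a conceptual illustration only, not how VOMS is queried in practice: the role names, privilege names and certificate subject strings below are all invented.

```python
# Hypothetical roles and the privileges each one grants.
ROLE_PRIVILEGES = {
    "daq-archiver": {"write-tape", "read-raw"},
    "member": {"read-raw", "read-reco"},
}

# Hypothetical certificate subjects and the roles assigned by the VO Manager.
MEMBER_ROLES = {
    "/C=UK/O=eScience/CN=example archiver": {"daq-archiver", "member"},
    "/C=UK/O=eScience/CN=example analyst": {"member"},
}


def privileges_for(subject):
    """Collect every privilege granted by the roles held by this certificate."""
    granted = set()
    for role in MEMBER_ROLES.get(subject, ()):
        granted |= ROLE_PRIVILEGES.get(role, set())
    return granted


def is_allowed(subject, privilege):
    """Unknown certificates hold no roles and therefore get no privileges."""
    return privilege in privileges_for(subject)


print(is_allowed("/C=UK/O=eScience/CN=example archiver", "write-tape"))
print(is_allowed("/C=UK/O=eScience/CN=example analyst", "write-tape"))
```

Because resources check roles rather than individuals, reassigning a person means editing only the second table, which is the “in-house” control the slide refers to.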
MICE Data Flow
• The basic data flow in MICE is thus something like:
  • The raw data files from the experiment are sent to tape using Grid protocols, including registering the files in LFC.
  • The offline reconstruction can then use Grid/LFC to pull down the raw data, and upload reconstructed (“RECO” or DST) files.
  • Users can use Grid/LFC to access the RECO files they want to play with.
• If I combine the above description with some background knowledge of the Grid, some snippets of what people are working on and a whole lot of guesswork I get:
MICE Data Flow Diagram
• Short-dashed lines indicate entities that still need confirmation
• Question marks indicate even higher levels of uncertainty
• More details in MICE Note 252
• The diagram would look pretty much the same if non-Grid tools were used
Data Flow Implementation
• Most of this is NOT in place yet (at production level)!
MICE Data Unknowns
• MICE Note 252 identifies four main flavours of data: RAW, RECO, analysis results, and Monte Carlo simulation.
• For all four, we need to understand the:
  • volume (the total amount of data, the rate at which it will be produced, and the size of the individual files in which it will be stored)
  • lifetime (ephemeral or longer lasting? will it need archiving to tape?)
  • access control (who will create the data? who is allowed to see it? can it be modified or deleted, and if so who has those privileges?)
• Also need to identify use cases I’ve missed, especially ones that will need more VOMS roles or CASTOR space tokens.
File Catalogue Namespace (1)
• Also, we need to agree on a consistent namespace for the file catalogue
• Proposal (MICE Note 247, Grid talk at CM23):
  • We get given /grid/mice/ by the server
  • Five upper-level directories:
    • Construction/ – historical data from detector development and QA
    • Calibration/ – needed during analysis (large datasets, cf. DB)
    • TestBeam/ – test beam data
    • MICE/ – DAQ output and corresponding MC simulation
File Catalogue Namespace (2)
• /grid/mice/users/name – for people to use as scratch space for their own purposes, e.g. analysis
  • Encourage people to do this through LFC – helps avoid “dark data”
  • LFC allows Unix-style access permissions
• Again, the LFC namespace is something that needs to be finalised before production data can start to be registered.
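One way to keep registrations inside the proposed namespace is a small helper that only builds LFNs under the agreed top-level directories. This is a sketch assuming the proposal from the two namespace slides is adopted as listed; the function names and the example file name are hypothetical, and the layout below each top-level directory is deliberately left open.

```python
import posixpath

ROOT = "/grid/mice"
# Top-level directories from the proposal (MICE Note 247).
TOP_LEVEL = ("Construction", "Calibration", "TestBeam", "MICE", "users")


def make_lfn(top_level, *parts):
    """Build an LFN under /grid/mice/, rejecting unknown top-level directories."""
    if top_level not in TOP_LEVEL:
        raise ValueError(f"not an agreed top-level directory: {top_level!r}")
    for part in parts:
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(ROOT, top_level, *parts)


def top_level_of(lfn):
    """Return the top-level directory of an LFN, or None if it is outside the namespace."""
    prefix = ROOT + "/"
    if not lfn.startswith(prefix):
        return None
    first = lfn[len(prefix):].split("/", 1)[0]
    return first if first in TOP_LEVEL else None


print(make_lfn("users", "someone", "scratch.dat"))
print(top_level_of("/grid/mice/MICE/example.dat"))
```

Funnelling every registration through one such function makes it harder for ad hoc directories to appear once production data is flowing.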
Metadata Catalogue
For many applications – such as analysis – you will want to identify the list of files containing the data that matches some parameters.
This is done by a “metadata catalogue”. For MICE this doesn't yet exist.
A metadata catalogue can in principle return either the GUID or an LFN – it shouldn’t matter which as long as it’s properly integrated with the other Grid services. (Grid talk at CM23)
MICE Metadata Catalogue
• We need to select a technology to use for this
  • use the configuration database?
  • gLite AMGA (who else uses it – will it remain supported?)
• Need to implement – i.e. register metadata to files
• What metadata will be needed for analysis?
• Should the catalogue include the file format and compression scheme (gzip ≠ PKzip)?
MICE Metadata Catalogue for Humans
Or, in non-Gridspeak: we have several databases (configuration DB, EPICS, e-Logbook) where we should be able to find all sorts of information about a run/timestamp.
But how do we know which runs to be interested in, for our analysis?
We need an “index” to the MICE data, and for this we need to define the set of “index terms” that will be used to search for relevant datasets.
MICE Metadata
• Run, date/time
• Step
• Nominal 4-d / transverse normalised emittance
• Diffuser setting
• Nominal momentum
• Configuration:
  • Magnet currents
  • Physical geometry
  • RF?
• ???
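A metadata lookup over index terms like these can be sketched as a filter that returns LFNs. No metadata catalogue exists yet for MICE, so this is only an illustration of the idea: the record fields are a subset of the candidate terms above, and the class, function, run numbers, values and file names are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One catalogue entry; the field names are illustrative index terms."""

    run: int
    step: int
    nominal_emittance: float  # units left unspecified in this sketch
    nominal_momentum: float   # units left unspecified in this sketch
    diffuser_setting: int
    lfn: str


def find_lfns(records, **criteria):
    """Return the LFNs of all records whose fields equal the given criteria."""
    unknown = set(criteria) - set(RunRecord.__dataclass_fields__)
    if unknown:
        raise KeyError(f"unknown index terms: {sorted(unknown)}")
    return [
        rec.lfn
        for rec in records
        if all(getattr(rec, key) == value for key, value in criteria.items())
    ]


# Invented example entries, not real runs.
records = [
    RunRecord(1001, 1, 6.0, 200.0, 2, "/grid/mice/MICE/run1001.dat"),
    RunRecord(1002, 1, 6.0, 240.0, 2, "/grid/mice/MICE/run1002.dat"),
    RunRecord(1003, 2, 10.0, 200.0, 4, "/grid/mice/MICE/run1003.dat"),
]
matches = find_lfns(records, step=1, nominal_momentum=200.0)
print(matches)
```

Exact matching is only sensible for nominal settings; measured quantities would need range queries, which is one reason the choice of index terms matters.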
Conclusions
The data flow is more complex than people realise…
… and probably won’t work by accident.
Some specific issues that need to be understood are the attributes of the data flows (Note 252), the LFC Namespace (Note 247) and the index terms for the metadata catalogue.
This needs discussion and (where necessary) decision pretty soon – by end CM24 – to be ready for data taking.