Data preservation in ALICE
Federico Carminati
Motivation
• ALICE is a 150 M CHF investment by a large scientific community
• The ALICE data are unique, and their reuse is a legacy for a potentially even larger community for many years to come
• The real net investment is larger by an order of magnitude once one includes the manpower costs over the entire project to date and an estimate of the reasonably projectable future costs... not counting the LHC itself
  • This point alone would deserve its own slide...
• ALICE event size and complexity are unprecedented and pose real challenges for the whole processing infrastructure
• Preserving data access implies not only storing the data and keeping the software, but also giving access to a sizeable amount of processing resources
Motivation
• Data preservation is a continuous effort that ALICE already makes for the needs of the collaboration
• Extending and formalizing this effort will benefit both the collaboration and the wider community, long after data taking has ended
• ALICE is committed to allowing the re-use of the data and software by this large community, including non-member HEP scientists or users, educators and students, and the general public through different forms of outreach
• Besides the cost of amassing the data in the first place, the time it takes to mount and perform such experiments is also a major issue
• Access to such data will be needed to train future generations of physicists during periods when no running experiments are available
• Moreover… one is never certain that all of the discoveries have been wrung out of the initial data set during the original analysis process...
  • Particularly given the complexity of the data
Legal Issues
• Various funding agencies that have supported the members of the collaboration in the past (and may do so in the future), as well as governments contributing directly to CERN, have imposed (and are obliged by their governments to insist on) formal data preservation requirements… The issues are:
• Who bears the cost of implementation?
  • Collaborations, institutions, P.I.s, funding agencies, CERN?
• What are the liabilities for failure to comply?
  • Now? (i.e. to the current collaborators at this time…)
  • In the future? (i.e. to today’s collaboration members in the future…)
  • Who else might be liable? (Collaborations, institutions, individual P.I.s, funding agencies, CERN?)
• Who decides when the mandates are satisfied?
  • Collaborations, funding agencies, governments?
ALICE data preservation policy
• ALICE recently initiated internal discussions on data preservation issues
• Because of the operational requirements of current data taking and processing, only a few people have been actively working on this issue
• The ALICE MB is being kept informed of this activity
• The current mandate is to define a policy stating the ALICE position with respect to the different preservation levels…
• ALICE now participates regularly in the internal LHC data preservation meetings
• ALICE strongly supports the harmonization of the base principles and the development of common tools whenever possible…
• These meetings are providing the starting points for our internal discussions
• We foresee finalizing a draft of the ALICE DP policy by the end of this year’s running periods
Levels of data preservation
• ALICE is considering all data preservation levels as defined by the DPHEP community
• The different levels may be mapped to different use cases and users
• The resource and manpower requirements clearly grow with increasing preservation level…
• The ALICE recipe will have to split the existing resources in different fractions among the levels
• Higher-level DP aspects such as raw data preservation will get much less weight until extra resources become available…
Level 1
• Includes publications, supporting documents and any additional numerical data…
• ALICE uses open access journals for all its publications
• We use standard procedures to keep track of all figures and additional data, to be propagated to systems like INSPIRE or HEPData
• We are considering frameworks like RECAST for archiving analysis output data…
Level 2
• Includes high-level, simplified data formats for basic analysis, theory comparison, education and outreach
• A simplified export format for ALICE AOD data is easy to achieve
• We are considering the possible targets for such formats and the different use cases…
• We still have to define which types of data will be converted, and how much…
• A harmonized policy among the different LHC experiments is probably possible at this level…
Level 3
• Reconstructed data and simulations, along with the analysis software environment, workflows and documentation, to allow new analyses short of redoing the basic reconstruction of the data
• We are considering making available a fraction of the ALICE AOD data, already calibrated with the best knowledge and analysis ready… (This would be provided after a delay, still to be determined, that gives the current collaboration members sufficient time to complete the initial data analysis.)
• There are significant logistical problems at this level: available storage, software updates, and the infrastructure needed to re-process the data
• We are trying to determine the actual needs and associated costs…
• This is provided for members of the Collaboration
Level 4
• Raw ‘offline’ data and the software with the calibrations required to reconstruct them, with full documentation
• ALICE raw data are unusable for physics analysis in their present form. Besides Tier-0, ALICE keeps a copy on Tier-1 for subsequent calibration and reconstruction passes…
• The ALICE software used to process raw data is released under an open source license and its documentation is available, allowing any subsequent reprocessing, which is in principle limited only by the CPU and storage resources needed
• We do not foresee allowing large-scale reprocessing of the ALICE raw data by the general public, the main constraint being the cost of the considerable resources required to maintain such a capability…
• This is provided for members of the Collaboration
Conclusions
• ALICE recognizes the importance of Data Preservation
• ALICE will work with the other experiments to contribute to and implement a common solution for DP
• ALICE would agree to Level 1 and Level 2
  • Provided resources are found!
• Level 3 seems harder, and substantial resources would have to be found inside the experiment
  • It is however provided for members of the Collaboration
• Level 4 seems to be out of scope for the moment, unless all the LHC experiments take a common decision to do it
  • It is also provided for members of the Collaboration
• In any case, computational resources (disk and CPU) will be needed to test the system