Preserving Scientific Data Jamie Shiers, Information Technology Department, CERN, Geneva, Switzerland
Agenda
• Motivation for preserving scientific data: examples from a range of sciences
• Volume of data involved and related issues
• Some concrete archiving examples from Particle Physics
• Remaining challenges
• Conclusions
UNESCO Information Preservation debate, April 2007 - Jamie.Shiers@cern.ch
Motivation
• Climate data: in an era when climate change is hotly debated, the motivations appear clear…
• Medical data: important for understanding issues such as historical pandemics and cross-species diseases (avian flu, HIV, …)
• Cosmological data: plays a vital role in our evolving understanding of the Universe. The astrophysics community has an explicit policy: data is made public after 1 year, and the data volume doubles each year
• Particle Physics data: similar arguments apply. Will we ever be able to build accelerators similar to those of today? If we ‘lose’ this data, what of our scientific heritage? Old data sometimes has to be re-examined for a signal that should have been seen (this has happened several times)
[Figure: history of the Universe; axis labels: Time; Energy, Density, Temperature]
• Standard Cosmology: good model from 0.01 sec after the Big Bang, supported by considerable observational evidence
• Elementary Particle Physics: from the Standard Model into the unknown, towards energies of 1 TeV and beyond (the Terascale)
• Towards Quantum Gravity: from the unknown into the unknown...
Source: http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
Issues
• How much data is involved?
• Preserving the bits
• Understanding the bits
How much data is involved?
[Figure: a CD stack holding 1 year of LHC data (~20 km) compared with a balloon (30 km), Concorde (15 km) and Mt. Blanc (4.8 km)]
• In 1998, estimates were made of the volume of data from LEP (1989 to 2000) that should be kept
• By today’s standards, these data volumes are trivial
• Even though the total volume of data at the LHC is much higher, the data that must be kept beyond the life of the machine (2007 to ~2020) will be easily handled by then
• The LHC will generate some 15 PB of data per year!
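As a rough cross-check of the CD-stack comparison, the arithmetic can be sketched as follows. Only the 15 PB per year comes from the slide; the disc capacity (700 MB) and disc thickness (1.2 mm) are assumptions made here for illustration.

```python
# Rough check of the "CD stack" comparison.
# Only the yearly data volume is taken from the talk; disc parameters are assumed.
LHC_BYTES_PER_YEAR = 15e15   # 15 PB per year, decimal petabytes (from the slide)
CD_CAPACITY_BYTES = 700e6    # assumption: standard 700 MB disc
CD_THICKNESS_MM = 1.2        # assumption: nominal thickness of a bare disc, no case


def cd_stack_height_km(total_bytes,
                       cd_bytes=CD_CAPACITY_BYTES,
                       thickness_mm=CD_THICKNESS_MM):
    """Height in km of a stack of bare CDs needed to hold total_bytes."""
    n_discs = total_bytes / cd_bytes
    return n_discs * thickness_mm / 1e6  # 1 km = 1e6 mm


if __name__ == "__main__":
    discs = LHC_BYTES_PER_YEAR / CD_CAPACITY_BYTES
    height = cd_stack_height_km(LHC_BYTES_PER_YEAR)
    print(f"{discs / 1e6:.1f} million discs, stack of about {height:.0f} km")
```

Under these assumptions the stack comes out at roughly 26 km, the same order of magnitude as the ~20 km shown on the slide. The result scales directly with the assumed disc capacity and thickness.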
The LHC machine - Overview
[Figure: the LHC, a 27 km ring 100 m underground, and its experiments]
• ATLAS and CMS (+TOTEM): general purpose; pp, heavy ions
• ALICE: heavy ions, pp
• LHCb: pp, B-physics, CP violation
The size of HEP detectors
[Figure: the ATLAS and CMS detectors shown alongside CERN Bld. 40]
Understanding the bits
• In the mid-1990s, 10-year-old data from the JADE collaboration at the PETRA accelerator at DESY was successfully re-analysed
• A subset of the data was found abandoned in an office corner. The programs to read the data were written in an obsolete language and were unusable. The data format was proprietary (but decodable).
• This experience provided valuable input into the LEP data archive
• Data format: will this be readable in 5 / 10 / 100 years? 1000?
• Programs: languages / operating systems / hardware platforms have very short life-spans compared with an archive
• Metadata: essential to understand what the data means
• The best solution to date is a so-called ‘Museum system’, but this is still a very short-term solution when measured against even Einstein, let alone Tycho Brahe, Kepler and Newton…
Preserving the bits
• Lifetimes of Particle Physics experiments are extremely long! Currently measured in decades…
• Ironically, one of the solutions proposed for the LEP data archive (the then-current proposal for the LHC) was later abandoned for technical / commercial reasons
• This necessitated a ‘triple migration’:
  • Of 300 TB of data between storage media;
  • Of the same data from one data format to another;
  • Of the accompanying processing codes.
• In the end, the exercise took around 2 months per 100 TB of data migrated, as well as a significant amount of effort (~1 FTE / 100 TB) and hardware resources
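To put the quoted migration rate in perspective, here is a minimal sketch of the arithmetic. It assumes a 30-day month, decimal terabytes, and that the 300 TB was migrated sequentially at the quoted 2 months per 100 TB; none of these assumptions is stated in the talk.

```python
# Back-of-the-envelope view of "around 2 months per 100 TB".
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumption: nominal month length


def sustained_rate_mb_per_s(terabytes, months):
    """Average throughput in MB/s needed to move `terabytes` in `months`."""
    total_bytes = terabytes * 1e12  # decimal terabytes
    seconds = months * DAYS_PER_MONTH * SECONDS_PER_DAY
    return total_bytes / seconds / 1e6


def total_duration_months(total_tb, months_per_100tb=2.0):
    """Elapsed time if the whole volume is migrated sequentially (assumption)."""
    return total_tb / 100 * months_per_100tb


if __name__ == "__main__":
    print(f"{sustained_rate_mb_per_s(100, 2):.1f} MB/s sustained")
    print(f"{total_duration_months(300):.0f} months for 300 TB")
```

On these assumptions the quoted rate corresponds to a sustained average of about 19 MB/s, and about 6 months of elapsed time for the full 300 TB.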
Outstanding Issues
• There are no data formats, programming languages, computing hardware or operating systems with lifetimes that can be guaranteed beyond the short term
• Virtual machine technology may extend the natural life of such an environment, perhaps doubling it
• Reducing the data into a much simplified and widely-used format can have significant advantages, but only allows restricted analyses to be performed
• Preserving the detailed knowledge of the experimental apparatus is beyond current technology: it would require extreme discipline on the part of the researchers as well as major advances in the understanding and description of metadata
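The idea of a simplified, widely-used format accompanied by descriptive metadata can be illustrated with a small sketch. This is not the format used for any of the archives mentioned in the talk; the function `export_simplified`, the column names and the example values are all hypothetical. It writes reduced per-event quantities to plain CSV and puts the meaning and units of each column in a JSON sidecar file.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_simplified(events, columns, out_dir, dataset_name):
    """Write events to CSV plus a JSON sidecar describing every column.

    events  : list of dicts, one per event, keyed by column name
    columns : list of dicts with 'name', 'unit' and 'description' (hypothetical schema)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / f"{dataset_name}.csv"
    meta_path = out_dir / f"{dataset_name}.meta.json"

    # Plain text, one event per row: readable without the original software.
    with data_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=[c["name"] for c in columns])
        writer.writeheader()
        writer.writerows(events)

    # The sidecar records what the numbers mean, not just how they are laid out.
    meta = {
        "dataset": dataset_name,
        "format": "CSV, UTF-8, one event per row",
        "columns": columns,
        "n_events": len(events),
    }
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return data_path, meta_path


if __name__ == "__main__":
    # Illustrative values only, not real measurements.
    cols = [
        {"name": "event_id", "unit": None, "description": "running event number"},
        {"name": "energy_gev", "unit": "GeV", "description": "total visible energy"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        data, meta = export_simplified(
            [{"event_id": 1, "energy_gev": 91.2}], cols, tmp, "demo")
        print(data.read_text(encoding="utf-8"))
        print(meta.read_text(encoding="utf-8"))
```

Such a reduction keeps only the quantities chosen at export time, which is the restriction noted above: analyses that need the raw detector information are no longer possible.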
Conclusions
• As long as advances in storage capacity continue, there are no significant issues related to the volume of scientific data that must be kept
• Periodic migration between different types of storage media must be foreseen
• Changes of storage format must also be catered for, and these can require much more significant (time-consuming and expensive) migrations
• By far the biggest problem concerns understanding the data, and there is currently no clear solution in this domain
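For the media-migration point, a common safeguard is to verify a checksum on both sides of every copy. The sketch below is a generic illustration, not the procedure used in the migrations described in this talk; `sha256_of` and `migrate_file` are hypothetical helper names, and real archives typically also store the digests for later re-verification.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(src, dst):
    """Copy src to dst and confirm the copy is bit-identical.

    Returns the digest so it can be recorded alongside the data.
    """
    src, dst = Path(src), Path(dst)
    before = sha256_of(src)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        raise IOError(f"checksum mismatch migrating {src} to {dst}")
    return before


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = Path(tmp) / "old_medium" / "run001.dat"
        old.parent.mkdir()
        old.write_bytes(b"example payload")
        print(migrate_file(old, Path(tmp) / "new_medium" / "run001.dat"))
```

This only addresses preserving the bits. It does nothing for format or software obsolescence, which the conclusions identify as the harder problems.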
References
• LEP Data archive
  • 1997: http://s.web.cern.ch/s/sticklan/www/archive/
  • 2002: http://mgt-focus.web.cern.ch/mgt-focus/Focus25/maggim.pdf
  • 2003: http://cern.ch/pfeiffer/LEP-Data-Archive/proposal/ProposalForTheLEPDataArchive.html
  • http://tenchini.home.cern.ch/tenchini/Status_Archiving_6_Mar_2003.pdf
• Lisbon workshop
  • http://cern.ch/knobloch/talks/CernCodataLisbon.ppt
  • http://www.erpanet.org/events/2003/lisbon/LisbonReportFinal.pdf
• COMPASS / HARP data migrations
  • http://storageconference.org/2003/papers/06-Lubeck-Overview.pdf
  • http://www.slac.stanford.edu/econf/C0303241/proc/papers/THKT001.PDF
  • http://indico.cern.ch/getFile.py/access?contribId=448&sessionId=24&resId=1&materialId=paper&confId=0
Acknowledgements
The following people provided material and / or pointers for this talk (knowingly or otherwise):
• LEP Data Archive coordinators:
  • David Stickland, David.Stickland@cern.ch (L3)
  • Andreas Pfeiffer, Andreas.Pfeiffer@cern.ch
  • Marcello Maggi, Marcello.Maggi@ba.infn.it (ALEPH)
• COMPASS / HARP migrations:
  • Andrea Valassi, Andrea.Valassi@cern.ch
• ERPANET/CODATA Workshop:
  • Jürgen Knobloch, Juergen.Knobloch@cern.ch