CERN Services for LTDP • International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics • iPRES 2016, CH • https://indico.cern.ch/event/448571/
Overview of CERN • CERN – the European Organisation for Nuclear Research – is situated just outside Geneva, extending into France • Founded in 1954, it now has 22 member states • It operates a wide range of accelerators, of which the LHC is probably the best known • The Large Hadron Collider was first proposed in the late 1970s, when discussions on a Lepton Collider (LEP) were being held • A High Luminosity upgrade (HL-LHC) was approved in June 2016, extending the LHC’s life until around 2040 • A High Energy upgrade (HE-LHC) may follow…
The CERN Accelerator Complex • The LHC (former LEP) ring
LTDP: Now & Then in HEP • Traditionally at CERN, users (experiments) were responsible for buying their own tapes and managing them • Capacity: 40–200 MB! (from 1600 bpi reels to the first 3480 cartridges) • This started to change with LEP (1989), with the introduction of tape robots and Unix-style filenames instead of tape numbers • But at the end of LEP (2000) there were still no sustainable preservation services • ~1 million tape volumes! Impossible to automate! (see the arithmetic below) • ALEPH: 1 PC with full environment + all data per collaborating institute
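Some quick arithmetic illustrates why those volume counts made automation impossible then, and why the same data is trivial to hold today. A minimal sketch in Python; the modern cartridge capacity is an assumption (roughly an LTO-7-era figure), not a number from the slides:

```python
# Why ~1 million small tape volumes were unmanageable: the same data
# fits on a handful of modern cartridges.

LEP_EXPERIMENT_BYTES = 100e12     # ~100 TB per LEP experiment (see "How Much Data?")
OLD_TAPE_BYTES = 200e6            # 200 MB, the largest capacity quoted above
MODERN_CARTRIDGE_BYTES = 6e12     # ~6 TB -- ASSUMED, roughly LTO-7 era

old_volumes = LEP_EXPERIMENT_BYTES / OLD_TAPE_BYTES
new_cartridges = LEP_EXPERIMENT_BYTES / MODERN_CARTRIDGE_BYTES

print(f"{old_volumes:,.0f} old volumes vs {new_cartridges:,.1f} modern cartridges")
# -> 500,000 old volumes vs 16.7 modern cartridges
```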
The DPHEP Study Group • Formed in late 2008 at the initiative of DESY • Included representatives from all major HEP labs worldwide, including from experiments due to end data taking shortly • Produced a Blueprint Report that detailed the situation and made concrete recommendations, now being acted upon • Input to the European Particle Physics Strategy update of 2012/13 – highly influential!
What is the problem? • The data from the world’s particle accelerators and colliders (HEP data) is both costly and time-consuming to produce • That from the LHC is a particularly striking example, ranging in volume from several hundred PB today to tens of EB by 2035 or so • HEP data contains a wealth of scientific potential, plus high value for educational outreach • Given that much of the data is unique, it is essential to preserve not only the data but also the full capability to reproduce past analyses and perform new ones • This means preserving data, documentation, software and "knowledge" • There are numerous cases where data from a past experiment has been re-analysed: we must retain this ability for the future
What does DPHEP do? • DPHEP has become a Collaboration with signatures from the main HEP laboratories and some funding agencies worldwide • It has established a "2020 vision", whereby: • All archived data – e.g. that described in the DPHEP Blueprint, including LHC data – should be easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities to annotate further; • Best practices, tools and services should be well run-in, fully documented and sustainable; built in common with other disciplines, based on standards; • There should be a DPHEP portal through which data and tools can be accessed; • Clear targets & metrics to measure the above should be agreed between Funding Agencies, Service Providers and the Experiments.
What Makes HEP Different? • We throw away most of our data before it is even recorded – “triggers” (see the sketch below) • Our detectors are relatively stable over long periods of time (years) – not “doubling every 6 or 18 months” • We make “measurements” – not “observations” • Our projects typically last for decades – we need to keep data usable for at least this length of time • We have shared “data behind publications” for more than 30 years… (HEPData)
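To make the “triggers” point concrete, here is a minimal sketch of the idea: filtering events online and discarding the rest forever. All numbers and the `passes_trigger` criterion are invented for illustration; real triggers are multi-level hardware/software systems applying fast physics selections:

```python
import random

# Illustrative orders of magnitude: LHC collisions occur at ~40 MHz,
# but experiments can afford to record only O(1) kHz of them.
COLLISION_RATE_HZ = 40_000_000
RECORD_RATE_HZ = 1_000
print(f"required rejection factor: {COLLISION_RATE_HZ / RECORD_RATE_HZ:,.0f}x")

def passes_trigger(event):
    """Toy trigger decision: keep only 'interesting' events (invented cut)."""
    return event["max_energy_gev"] > 100.0

# Toy event stream: energies drawn from an exponential with mean 20 GeV.
events = ({"max_energy_gev": random.expovariate(1 / 20)} for _ in range(1_000_000))
kept = sum(1 for e in events if passes_trigger(e))

print(f"kept {kept:,} of 1,000,000 simulated events ({kept / 1e6:.3%}); "
      "the rest were never recorded")
```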
An OBSERVATION… The gravitational bending of starlight, first observed during the solar eclipse of 1919 by Sir Arthur Eddington, when the Sun was silhouetted against the Hyades star cluster (Barry Barish; ICHEP, Chicago)
And another… 1.3 billion years ago: two black holes merging – the source of the first directly detected gravitational waves (ICHEP, Chicago)
Future Circular Colliders (FCC) • International Collaboration: ~70 institutes • International conceptual design study of a ~100 km ring: • pp collider (FCC-hh): the ultimate goal, which defines the infrastructure requirements – √s ~ 100 TeV, L ~ 2×10³⁵ cm⁻²s⁻¹; 4 IPs, ~20 ab⁻¹/expt (see the worked example below) • e+e- collider (FCC-ee): possible first step – √s = 90–350 GeV, L ~ 200–2×10³⁴ cm⁻²s⁻¹; 2 IPs • ep collider (FCC-he): option – √s ~ 3.5 TeV, L ~ 10³⁴ cm⁻²s⁻¹ • “LEP3” • Also part of the study: HE-LHC – FCC-hh dipole technology (~16 T) in the LHC tunnel, √s ~ 30 TeV • GOAL: CDR in time for the next European Strategy update • FCC-ee options could have 1000 times the luminosity of LEP2 (10 years × 100 days = 1000 days of LEP2 running) • Machine studies are site-neutral; however, FCC at CERN would greatly benefit from the existing laboratory infrastructure and accelerator complex • A 90–100 km ring fits the geology
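The quoted ~20 ab⁻¹ per experiment follows directly from the design luminosity. A worked example, assuming the common HEP rule of thumb of ~10⁷ seconds of effective running per year (an assumption, not a figure from the slide):

```python
# Integrated luminosity for FCC-hh from the numbers above.
L_INST = 2e35              # cm^-2 s^-1, FCC-hh design luminosity
SECONDS_PER_YEAR = 1e7     # ASSUMED effective running time per year
YEARS = 10                 # ASSUMED programme length

CM2_PER_INV_AB = 1e42      # 1 ab^-1 = 1e42 cm^-2 (since 1 b = 1e-24 cm^2)

integrated = L_INST * SECONDS_PER_YEAR * YEARS / CM2_PER_INV_AB
print(f"~{integrated:.0f} ab^-1 per experiment")   # -> ~20 ab^-1
```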
HEP LTDP Use Cases • Bit preservation as a basic “service” on which higher-level components can build; • “Maybe CERN does bit preservation better than anyone else in the world” (David Giaretta) • Preserve data, software, and know-how in the collaborations; the basis for reproducibility; • Share data and associated software with the (wider) scientific community, such as theorists or physicists not part of the original collaboration; • Open access to reduced data sets for the general public (LHC experiments) • These match the requirements for DMPs very well
Workshop on Active Data Management Plans • Agenda, talks, videos, conclusions • Includes more detailed talks about HEP data preservation & Open Data releases
DMPs for the LHC experiments • The first LHC experiment to produce a “DMP” was CMS in 2012 • This called for Open Data Releases of significant fractions of the (cooked) data after an embargo period (see ADMP w/s) • Now all 4 main experiments have DMPs • Foresee capturing project-specific detail in DMPs (as opposed to overall site policy) • Open Data Releases are now “routine”! See this talk @ ICHEP for more details
CERN Services for LTDP • State-of-the-art "bit preservation", implementing practices that conform to the ISO 16363 standard • "Software preservation" - a key challenge in HEP, where the software stacks are both large and complex (and dynamic) • Analysis capture and preservation, corresponding to a set of agreed Use Cases • Access to the data behind physics publications - the HEPData portal • An Open Data portal for released subsets of the (currently) LHC data • A DPHEP portal that links also to data preservation efforts at other HEP institutes worldwide • Each of these is a talk topic in its own right!
Bit Preservation: Steps Include • Regular media verification • When a tape is written, when it is filled, and every 2 years thereafter… • Controlled media lifecycle • Media kept for a maximum of 2 drive generations • Reducing tape mounts • Reduces media wear-out & increases efficiency • Data redundancy • For “smaller” communities, a 2nd copy can be created in a separate library in a different building (e.g. LEP – 3 copies at CERN!) • Protecting the physical link • Between disk caches and tape servers • Protecting the environment • Dust sensors! (Don’t let users touch tapes) • Constant improvement: bit-loss rate reduced to ~5 × 10⁻¹⁶ (see the verification sketch below)
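Media verification ultimately means re-reading data and comparing checksums against a catalogue. A minimal sketch, assuming a flat directory and a hypothetical `catalogue.json` mapping file names to digests (CERN's actual tape system uses its own formats and metadata):

```python
import hashlib
import json
from pathlib import Path

def file_checksum(path: Path, algo: str = "sha256") -> str:
    """Stream the file through a hash so arbitrarily large files fit in memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify(catalogue_path: Path) -> list[str]:
    """Return files whose current checksum no longer matches the catalogue."""
    catalogue = json.loads(catalogue_path.read_text())     # {"name": "hexdigest"}
    base = catalogue_path.parent
    return [name for name, expected in catalogue.items()
            if file_checksum(base / name) != expected]

if __name__ == "__main__":
    bad = verify(Path("catalogue.json"))
    print("all files intact" if not bad else f"CORRUPTED: {bad}")
```

At a bit-loss rate of 5 × 10⁻¹⁶, a 1 PB archive (8 × 10¹⁵ bits) expects only a few flipped bits; routine verification of this kind is what finds them.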
Software Preservation • HEP has long shared its software across international collaborations • CERNLIB – first started in 1964 and used by many communities worldwide • Today HEP s/w is O(10⁷) lines of code, in 10s to 100s of modules and many languages! (No single standard application) • Versioning filesystems and virtualisation look promising: we have demonstrated resurrecting s/w 15 years after data taking and hope to provide stability 5-15 years into the future (see the environment-capture sketch below) • We believe we can analyse LEP data ~30 years after data taking ended! • Does anyone have a better idea?
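Alongside versioned filesystems and virtualisation, one cheap complementary habit is recording the exact software environment next to the data so it can be reconstructed later. A minimal sketch; the manifest layout is invented for illustration, not an actual CERN format:

```python
import json
import platform
import sys
from importlib import metadata

def environment_manifest() -> dict:
    """Snapshot the interpreter, OS and installed packages for later reproduction."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {dist.metadata["Name"]: dist.version
                     for dist in metadata.distributions()},
    }

if __name__ == "__main__":
    with open("environment.json", "w") as f:
        json.dump(environment_manifest(), f, indent=2, sort_keys=True)
```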
Analysis Preservation • The ability to reproduce analyses is not only required by Funding Agencies but also essential to the work of the experiments / collaborations • Use Cases include: • An analysis that is underway has to be handed over, e.g. because someone is leaving the collaboration; • A previous analysis has to be repeated; • Data from different experiments have to be combined • Need to capture: metadata, software, configuration options, high-level physics information, documentation, instructions, links to presentations, quality protocols, internal notes, etc. (a sketch of such a record follows) • At least one experiment (ALICE) would like demonstrable reproducibility to be part of the publication approval process!
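To make “capture” concrete, here is a minimal sketch of the kind of record an analysis-capture service might store. All field names and values are invented for illustration; the real CERN Analysis Preservation service defines its own schemas:

```python
import json
from datetime import datetime, timezone

# Hypothetical capture record mirroring the list above: metadata, software,
# configuration options, physics information and documentation links.
analysis_record = {
    "title": "Example dimuon mass analysis",
    "captured_at": datetime.now(timezone.utc).isoformat(),
    "datasets": ["/ExampleDataset/Run2015/AOD"],            # invented identifier
    "software": {
        "framework": "example-framework",                   # invented name
        "version": "1.2.3",
        "repository": "https://example.org/analysis.git",
    },
    "configuration": {"trigger_path": "HLT_ExampleMu20",    # invented trigger
                      "cuts": {"pt_min_gev": 20}},
    "documentation": ["https://example.org/internal-note-001"],
}

with open("analysis_record.json", "w") as f:
    json.dump(analysis_record, f, indent=2)
```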
Portals • No time to discuss these in detail, but they clearly address the challenges of making the data “discoverable” and “usable” (if not necessarily F.A.I.R.)
Certification of the CERN Site • We believe certification will allow us to ensure that best practices are implemented and followed up on in the long term: “written into the fabric of the organisation” • Scope: Scientific Data and CERN’s Digital Memory • Timescale: complete prior to the 2019/2020 ESPP update • Will also “ensure” adequate resources, staffing, training, succession plans etc. • CERN can expect to exist until the HL/HE-LHC era (2040/50) • And beyond? FCC? Depends on the physics…
ISO 16363 metrics: Infrastructure & Security Risk Management (table slide)
ISO 16363 metrics: Organisational Infrastructure (table slide)
Collaboration with others • We are working with other disciplines on: • The elaboration of a clear "business case" for long-term data preservation • The development of an associated "cost model" • A common view of the Use Cases driving the need for data preservation • Understanding how to address Funding Agencies’ requirements for Data Management Plans • Preparing for Certification of HEP digital repositories and their long-term future
How Much Data? • 100 TB per LEP experiment: 3 copies at CERN (1 on disk, 2 on tape), plus copies outside • 1–10 PB for experiments at the HERA collider at DESY, the TEVATRON at Fermilab or the BaBar experiment at SLAC • The LHC experiments are already in the multi-hundred PB range (x00 PB) • 10 EB or more including the High Luminosity upgrade of the LHC (HL-LHC)
Conclusions & Next Steps • As is well known, Data Preservation is a journey, not a destination • Can we capture sufficient “knowledge” to keep the data usable beyond the lifetime of the original collaboration? • Can we prepare for major migrations, similar to those that happened in the past? (Or will x86 and Linux last “forever”?) • For the HL-LHC, we may have neither the storage resources to keep all (intermediate) data, nor the computational resources to re-compute them! • You can’t share or re-use data, nor reproduce results, if you haven’t first preserved it all (data, software, documentation, knowledge) • Open Data Releases – in addition to Certification – provide a powerful way of measuring whether we are achieving our goals!