Curation in Natural sciences Z. Z. Vilakazi iThemba LABS / UCT-CERN Research Centre
Acknowledgments • Common effort of the ALICE and LCG Collaborations. • Thanks to my colleagues of the ALICE-MUON Collaboration. Special thanks to Jean Cleymans, Bruce Becker, Artur Szostak, Gareth de Vaux, Sukalyan Chattopadhyay, Corrado Cicalo, Timm Steinbeck, Volker Lindenstruth, Heinz Tilsner, Florent Staley and others.
Topics for discussion • Management of large data sets • Inter-operability • Standards and protocols • Security and certification
Digital Curation • Maintenance of digital research data and other digital materials over their entire life-cycle, and over time for current and future generations of users. • Covers the processes of digital archiving and preservation. • Also includes all the processes needed for good data creation and management, and the capacity to add value to data to generate new sources of information and knowledge. (Digital Curation Centre)
Digital Curation (2) • Curation and long-term preservation of digital resources will be of increasing importance for a wide range of activities within research and education. • Through sensors, experiments, digitisation and computer simulation, digital resources and data are growing in volume and complexity at a staggering rate. • The cost of producing these resources is very high: satellites, particle accelerators, genome sequencing, and large-scale digitisation and electronic publishing collectively represent a cumulative investment of billions of pounds in digital research and learning. • Long-term curation and preservation of digital resources is difficult, if not impossible, for individual institutions to resolve on their own, given the complexity and scale involved.
Curation in Physical Sciences • Data are being generated in large volumes. • In laboratories, old archival material (design specifications, codes, etc.) can serve as reference resources. • Remote information access through online publications. • Data management and real-time remote analysis: heavily dependent on bandwidth; new middleware is being developed for access of data across geographically disparate centres. • Data sharing in astro-, nuclear and particle physics: usually characterised by large collaborations (in excess of 100 people). • Metadata are essential for the selection of events; the Grid file catalogue can hold one part of the metadata. During the Data Challenge we used the file catalogue for storing part of the metadata (see the sketch below).
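To make the metadata-driven selection concrete, here is a minimal sketch in Python. The catalogue layout, the field names (`lfn`, `run`, `trigger`) and the `query` helper are illustrative assumptions, not the actual AliEn file-catalogue API.

```python
# Minimal sketch of metadata-driven event selection, assuming a
# catalogue that maps logical file names (LFNs) to metadata tags.
# Field names and the query helper are hypothetical, not the AliEn API.

catalogue = [
    {"lfn": "/alice/sim/run001/galice.root", "run": 1, "trigger": "central"},
    {"lfn": "/alice/sim/run002/galice.root", "run": 2, "trigger": "dimuon"},
    {"lfn": "/alice/sim/run003/galice.root", "run": 3, "trigger": "central"},
]

def query(catalogue, **tags):
    """Return the LFNs whose metadata match all requested tags."""
    return [entry["lfn"] for entry in catalogue
            if all(entry.get(key) == value for key, value in tags.items())]

# Select only centrally triggered runs for analysis.
print(query(catalogue, trigger="central"))
```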
[Diagram: CERN data handling and computation for physics analysis. The detector feeds an event filter (selection & reconstruction) that produces raw data; reconstruction turns raw data into event summary data; batch physics analysis (with event reprocessing) extracts analysis objects by physics topic for interactive physics analysis; event simulation feeds simulated data into the same chain. Credit: les.robertson@cern.ch]
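The stages in the figure can be read as a simple pipeline. The sketch below models that flow in Python; every stage function and data shape here is a stand-in to illustrate the chain, not real ALICE or CERN software.

```python
# Illustrative pipeline mirroring the figure's data flow.

def event_filter(detector_events):
    """Selection: keep only events flagged as interesting (raw data)."""
    return [e for e in detector_events if e["interesting"]]

def reconstruction(raw_events):
    """Turn raw events into event summary data (ESD)."""
    return [{"id": e["id"], "tracks": e["signal"] // 10} for e in raw_events]

def batch_analysis(esd):
    """Extract analysis objects (here: events with at least one track)."""
    return [e for e in esd if e["tracks"] > 0]

detector = [{"id": i, "signal": i * 7, "interesting": i % 2 == 0}
            for i in range(10)]
analysis_objects = batch_analysis(reconstruction(event_filter(detector)))
print(len(analysis_objects), "events selected for interactive analysis")
```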
Experimental conditions in heavy-ion colliders • Beam: Pb-Pb, Ca-Ca, p-p, p-A • Rates: 8000 events/s minimum bias; 50-100/s central events (2-5% tot); acquisition rate 100 Hz (central), 1000 Hz (dimuons); 1 month/year (10⁶ s) ⇒ 10⁷ central events • Multiplicity: dN/dy from 2000 to 8000, so a total of about 60000 particles per event
Consequences • More than 60 GB produced per second in ALICE • High Level Trigger (HLT) + compression reduce raw data to 1.2 GB/s: 2 to 3 PB/year from 1 month of data taking • Very fast acquisition and network needed • ALICE will produce one of the largest databases in history • A Grid is needed to distribute and analyse the data (a back-of-the-envelope check follows below)
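A quick back-of-the-envelope check of these figures, as a minimal Python sketch; the rates and the 10⁶ s effective run time are taken from the slides above.

```python
# Back-of-the-envelope check of the ALICE data-volume figures quoted above.

RAW_RATE = 60e9   # bytes/s produced by the detector (> 60 GB/s)
HLT_RATE = 1.2e9  # bytes/s after High Level Trigger + compression
RUN_TIME = 1e6    # seconds of heavy-ion data taking per year (~1 month)

volume = HLT_RATE * RUN_TIME  # bytes written per year of data taking
print(f"Compression factor: ~{RAW_RATE / HLT_RATE:.0f}x")
print(f"Volume per year:    ~{volume / 1e15:.1f} PB")
# ~1.2 PB from the heavy-ion run alone; with reprocessing and simulated
# data this grows toward the 2-3 PB/year quoted above.
```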
The Grid Vision • The Grid: networked data-processing centres, with "middleware" software as the "glue" binding the resources together. • Scientific instruments and experiments provide huge amounts of data. • Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.
Classification of Grids • Computational Grids (including CPU-scavenging Grids), which focus primarily on computationally intensive operations. • Data Grids, for the controlled sharing and management of large amounts of distributed data. • Equipment Grids, which have a primary piece of equipment, e.g. a telescope, where the surrounding Grid is used to control the equipment remotely and to analyse the data produced.
Grid beyond high-energy physics • Thanks to the computational power of EGEE, new communities are requesting services for different research fields. • Normally these communities do not need the complex structure required by the HEP communities. • In many cases their production runs are shorter and well defined within the year. • The amount of CPU required is much lower, as are the storage needs. • 20 applications from 7 domains: High Energy Physics, Biomedicine, Earth Sciences, Computational Chemistry, Astronomy, Geophysics and financial simulation.
LCG services – built on two major science grid infrastructures: • EGEE – Enabling Grids for E-Science • OSG – US Open Science Grid
LCG Service Hierarchy
• Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data to Tier-1 centres.
• Tier-1 – "online" to the data acquisition process, high availability: managed mass storage – grid-enabled data service; data-heavy analysis; national and regional support. Centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY).
• Tier-2 – ~100 centres in 20 countries: simulation; end-user analysis – batch and interactive.
[Map: Tier-0 / Tier-1 / Tier-2 networks – Cape Town?]
Summary of Tier0/1/2 Roles • Tier0 (CERN): safe-keeping of RAW data (first copy); first-pass reconstruction; distribution of RAW data and reconstruction output to the Tier1s; reprocessing of data during LHC down-times. • Tier1: safe-keeping of a proportional share of RAW and reconstructed data; large-scale reprocessing and safe-keeping of the corresponding output; distribution of data products to Tier2s and safe-keeping of a share of the simulated data produced at those Tier2s. • Tier2: handling analysis requirements and a proportional share of simulated-event production and reconstruction. • Network requirements are very difficult to estimate! N.B. roles differ by experiment; it is essential to test using the complete production chain of each (a rough estimate is sketched below).
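As a purely illustrative lower bound on those network requirements, one can divide the post-HLT RAW rate among the Tier-1s. The equal-share assumption and the count of ten Tier-1 centres are simplifications of mine, not figures from the slides.

```python
# Rough lower bound on Tier-0 -> Tier-1 export bandwidth, assuming the
# 1.2 GB/s post-HLT stream is exported once and shared equally among
# ~10 Tier-1 centres (both are simplifying assumptions; real planning
# tested the complete production chain).

HLT_RATE_GBPS = 1.2 * 8  # 1.2 GB/s expressed in gigabits/s
N_TIER1 = 10             # approximate Tier-1 count from the hierarchy slide

per_site = HLT_RATE_GBPS / N_TIER1
print(f"Aggregate export: {HLT_RATE_GBPS:.1f} Gb/s")
print(f"Per Tier-1 share: {per_site:.2f} Gb/s (before overheads and bursts)")
```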
[Diagram: Physics Data Challenge(s), AliEn job control – production of RAW at Tier2s, data transfer and shipment of RAW to CERN, reconstruction of RAW in all Tier1s, analysis. Credit: F. Carminati (CERN)]
ALICE Network in the World • Active sites: OSU/OSC, LBL/NERSC, Dubna, Birmingham, NIKHEF, Saclay, GSI, CERN, Padova, Merida, IRB, Bologna, Lyon, Torino, Bari, Cagliari, Yerevan, Catania, Kolkata (India), Cape Town (ZA) • 37 people, 21 institutions • http://www.to.infn.it/activities/experiments/alice-grid
[Chart: Sample bandwidth costs for African universities. Source: IEEAF]
Topics for discussion
• Management of large data sets: $$ and R; database management; skills; digital divide: cyberinfrastructure (network / HR / libraries / data sets / LAN, etc.)
• Inter-operability: e.g. AstroGrid, MammoGrid, etc.
• Standards and protocols: preservation and quality; access (meaning of numbers), terminology and use of unfamiliar data; configuration management (example: the Particle Data Book)
• Security and certification: certification authorities
• Dialogue between researchers & librarians: role of libraries and curators; guidelines; academic training programme / schools outreach; schools: new curriculum development (lost data); research students: access to previous theses; resource management
Challenges • A strategy for the natural sciences across different domains