430 likes | 601 Views
Building an Open Data Infrastructure for Science: Policy and Practice. Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor”. APA conference, November 2011. Overview. Introduction What is STFC? What do we do? Why do we do it? An example project (Practice)
E N D
Building an Open Data Infrastructure for Science:Policy and Practice Juan Bicarregui STFC e-Science WIFI: BMA Guest “apa2 conference” “visitor” APA conference, November 2011
Overview • Introduction • What is STFC? What do we do? Why do we do it? • An example project (Practice) • Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES • RCUK Data Principles (Policy)
250m Daresbury Laboratory ESRF & ILL, Grenoble What is STFC? • Programme includes: • Neutron and Muon Source • Synchrotron Radiation Source • Lasers • Space Science • Particle Physics • Compuing and Data Management • Microstructures • Nuclear Physics • Radio Communications Square Kilometre Array Large Hadron Collider
Tomorrow’s Digital Infrastructure for Science APA Conference 2007 John Womersley/Keith Jeffery
Data and the Research Process The Innovation Lifecycle Enabling Wealth Creation Enabling Knowledge Creation Strategic Direction The Body of Knowledge The Research Process The Government Process Improved Understanding Improved Quality of Life Quality Assessment Aggregation of Knowledge lies at the heart of the innovation lifecycle
Virtual Research Environment the researcher acts through ingest and access Archival Creation Access Curation Information Infrastructure Services the researcher shouldn’t have to worry about the information infrastructure Network Storage Compute Data centric view of research Data
The 7 C’s Communication Collaboration Creation Collection Data Capacity Curation Computation
Creation Linked systems for: • Proposal submission • User management • Data acquisition Metadata carried from each system to the next Detectors moving from Hz to KHz, MHz, GHz, ... Barrel toroid magnet and detector module from ATLAS at CERN Examining the detectors on MAPS instrument on ISIS
10 x Moore’s Law (2 years) 2 x Moore’s Law (1.5 years) Capacity Currently store about 7 PetaBytes of data • 1PB = 1015 Bytes • = a Billion Floppys • = a Million CDs • = a Thousand PCs • LHC experiments ~ 2PB • Active file system based HSM ~2PB • (facilities) • Dark archive ~2PB • (facilities and non-STFC services) } of Today’s Moore’s Law x1000 in 13 years Doubling every 1.3 years
Computation Compute intensive components on the grid Fitting of experimental data to model • Laser real-time diagnostics & data flow pipeline. Computational applications for Laser Theory Group’s adoption of HPC
Curation Complexity of Facility Archives • All ISIS data (~25 years) > 3,000,000 files • All Diamond Data (~2.5 years) > 11,000,000 files Breadth of Data Sources: • STFC (Tier 1) • NERC (BADC, NEODC) • BBSRC (All institutes) • AHRC • MRC • Others: • The STFC Data Portal • TheSTFC Publications Archive • The CCPs (Collaborative Computational Projects) • The Chemical Database Service • The Digital Curation Centre • The EUROPRACTICE Software service • TheHPCx Supercomputer • The JISCmail service • The NERC Datagrid • The Starlink Software suite • The UK Grid Support Centre • The World Data Centre for Solar-Terrestrial Physics Atlas Datastore Tape Robot The StorageTek tape robot with capacity of 20PB
Collection Record Publication Proposal Approval Scheduling Data cleansing Subsequent publication registered with facility Experiment Data analysis Scientist submits application for beamtime Tools for processing made available Facility committee approves application Raw data filtered and cleansed Scientists visits, facility run’s experiment Facility registers, trains, and schedules scientist’s visit
“The web has changed everything...” Communication Immense Expectations ! • Web enables: • access to everything • Everything on-line • Interlinking enables: • Validation of results • Repetition of experiment • Discovery enables: • new knowledge from old • Archiving enables: • Unplanned reuse of data • Antarctic environmental data • Reuse of knowledge • One paper has >20,000 downloads • Completing the cycle • Publications entered in next proposal STFC’s “e-pubs” Institutional Repository has records of 30,000 publications spanning 25 years
Collaboration Technology integration facilitates scientific collaboration • Cross facility/beamline • Cross disciplinary Technology integration improves facility efficiency • PaN-data –Photon and Neutron Data infrastructure • ICAT also used in Australian Synchrotron and Oak Ridge National Lab
Overview • Introduction • What is STFC? What do we do? Why do we do it? • An example project • Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES • RCUK Data Policy Principles
Scientific Data Landscape of Initiatives – results from call9 Tools for virtual research environments OA participatory infrastructure Tools for virtual research environments Generic services, storage and computation iMarine SCIDIP-ES ESPAS Environment Atmosphere/Space Physics AgINFRA PaNdataODI Agriculture PanDataODI Physics, Engineering Aggregated Data Sets(Temporary or Permanent) diXa Biology Medicine VREs Other Data transPLANT Social Sciences Scientific Data(Discipline Specific) ENGAGE VREs Workflows Researcher 2 Scientific World Aggregation Path Researcher 1 OPENAire Plus EUDAT Non Scientific World
PaN-data Partners PaN-data bring together 11 major European Research Infrastructures ISIS is the world’s leading pulsed spallation neutron source ILL operates the most intense slow neutron source in the world PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser HZB operates the BER II research reactor the BESSY II synchrotron CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor ESRF is a third generation synchrotron light source jointly funded by 19 European countries Diamond is new 3rd generation synchrotron funded by the UK and the Wellcome Trust DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007 ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser ALBA is a new 3 GeV synchrotron facility due to become operational in 2010 JCNSJuelich Centre for Neutron Science MaxLab, Max IV Synchrotron The partners operate hundreds of instruments used by over 30,000 scientists each year PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK
Single Infrastructure Single User Experience Raw Data Catalogue Analysed Data Catalogue Publication Data Catalogue Data Analysis Different Infrastructures Different User Experiences Publications Catalogue Data Analysis Data Analysis Data Analysis Analysed Data Analysed Data Analysed Data Publication Data Publication Data Publication Data Raw Data Raw Data Raw Data Publications Publications Publications Facility 1 Facility 3 Facility 2 Capacity Storage Software Repositories Data Repositories Publications Repositories PaNdata Vision
Science driver – Data Integration Neutron diffraction NMR X-ray diffraction } } High-quality structure refinement
PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories PaN-data Standardisation PaN-data Europe is undertaking 5 standardisation activities: Development of a common data policy framework Agreement on protocols for shared user information exchange Definition of standards for common scientific data formats Strategy for the interoperation of data analysis software enabling the most appropriate software to be used independently of where the data is collected Integration and cross-linking of research outputs completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs
ERA Open Access Sharing Initiatives (examples, etc) PaNdata ODI (begins end2011) Services PaNdata Support Action (Ends 30 Nov 11) Policies and Standards Users PaNdata ODI Users Data Data Virtual Labs PaNdata ODI (begins end2011) JRAs Policies Powder Diff SAXS & SANS Software Tomography Provenance Integration Preservation Scalability ERA Infrastructure Platform Initiatives (EGI, etc)
Objectives Objective 2 – Users To deploy, operate and evaluate a system for pan-European user identification across the participating facilities and implement common processes for the joint maintenance of that system. Objective 3 – Data To deploy, operate and evaluate a generic catalogue of scientific data across the participating facilities and promote its integration with other catalogues beyond the project. Objective 4 – Provenance To research and develop a conceptual framework, defined as a metadata model, which can record the analysis process, and to provide a software infrastructure which implements that model to record analysis steps hence enabling the tracing of the derivation of analysed data outputs. Objective 5 – Preservation To add to the PaNdata infrastructure extra capabilities oriented towards long-term preservation and to integrate these within selected virtual laboratories of the project to demonstrate benefits. These capabilities should, as for the developments in the provenance JRA, be integrated into the normal scientific lifecycle as far as possible. The conceptual foundations will be the OAIS standard and the NeXus file format. Objective 6 – Scalability To develop a scalable data processing framework, combining parallel filesystems with a parallelized standard data formats (pNexus pHDF5) to permit applications to make most efficient use of dedicated multi-core environments and to permit simultaneous ingest of data from various sources, while maintaining the possibility for real-time data processing. Objective 7 – Demonstration To deploy and operate the services and technology developed in the project in virtual laboratories for three specific techniques providing a set of integrated end-to-end data services.
Standards from PaNdata Support Action PaNdata ODI Service Releases Rel 1 Rel 2 Rel 3 Rel 4 Jun 2014 Dec 2012 Dec 2013 Jun 2013 PaNdata ODI Service Activities users uCat data dCat vLabs PaNdata ODI Joint Research Activities s/w Prov Integ Pres Scale
Research Environment the researcher acts through ingest and access Archival Creation Access Services Information Infrastructure Data the researcher shouldn’t have to worry about the information infrastructure Network Storage Compute The Research Lifecycle – a personal view MetaData/ Catalogues Provenanced Research Data Portals User Info feed DAQ feed Data Analysis feed EGI GEANT Local resources e-Infrastructure
Overview • Introduction • What is STFC? What do we do? Why do we do it? • An example project • Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES • RCUK Data Policy Principles
RCUK Principles on Data Policy } Data } Access Seven (fairly) orthogonal principles: Public good Preservation Discoverability Confidentiality First use Recognition Costs } Rights
Motivation Data are a critical output of the research process: • For the integrity, transparency and robustness of the research record • Often value increases through aggregation • Enables new research questions to be addressed • Supports the wider exploitation of data
Repeat, Repeal, Repurpose Why might we want access to data? Three distinct reasons for sharing data: • Repeat - Validation of previous analysis • How does this fit with peer review? • Reconsider/Reform/Repeal/Reverse - Alternative hypotheses in the same field • c.f. Reuse • How does this fit with “right” to first use? • Repurpose - New research in another field • c.f. Recycle • How does this fit with recognition of Intellectual contribution? (What’s in it for me?) Different concerns and requirements for each type of sharing
Data are a Public Good Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property. Public good – is nonrival and non-excludable [wikipedia] consumption by one does not reduce availability for others no one can be effectively excluded from using Research Data recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings As few restrictions as possible Later (distinguish registration from restriction) Timely Later (discipline specific) Responsible Later (maximising access does not necessarily maximising research benefit) Intellectual Property Later (balance contribution from sharing and from primary research)
Data should be managed Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long term value should be preserved and remain accessible and usable for future research Policies and Plans DMPs should exist for all data Institutional/Departmental v. Project Standards and Best practice discipline specific Long term value eg. NERC has the concept of the Data Value Checklist. Future research (by current and future generations) Don’t lose it by accident
Data should be discoverable To enable research data to be discoverable and effectively reused by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data • Effectively reused by others • for validation (the data behind the graph – and derivation/provenance?) • for reworking (alternative hypothesis) • for repurposing(same data or data merging) • Sufficient Metadata ... openly available (stronger than for data) • Understand the re-use potential • Discoverable – repository? registration? • Published results should include (pointers) • could be an email address (but note longevity requirement)
Data may be protected RCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process. • Legal • Data Protection Act, Freedom of Information Act, Environmental Information Regulations, EU INSPIRE Directive (Spatial Data) • Some Case law: UEA “Climategate” and a few others - can go either way • ICO developing Guidelines on FoI • Protection of Freedoms Bill • Ethical • Consent, Privacy, National Security, • Consent eg longitudinal cohort studies – protection of cohort participation • Commercial – shared funding, patent pending, commercial in confidence • ... all stages in the research process ... • Plan in proposal – peer reviewed • external review of access requests
Originators may have first use To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake research council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils. • ... may be entitled ... • c.f. will be entitled • ... limited period of privileged use ... • “Withholding of data without good reason beyond this period is not acceptable” • ... to publish the results of their research ... • to work on and publish the results • ... the length of this period varies ... • individual research councils’ policies elaborate
Reusers have responsibilities In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed. • ... abide by the terms and conditions ... • terms and conditions may exist • to monitor use • to promote terms and conditions of use • ... should acknowledge the sources of their data ... • Data citation • c.f. should be required to acknowledge....
Data sharing is not free It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds. • Data sharing costs! • Marginal cost may be small but intial cost may be high • Even if data are free at the point of use (which it may not be) there are costs behind. • Cost is fundable by RCs - but needs tensioning against other research • The policies, models and mechanisms for managing and providing access to research data • Funders will work with grant holders to .....” • No prescription on how funding should be raised.” • You can ask for funding... • ... but be reasonable
Outcomes • Ensure the continuing availability of data of long-term value • Facilitate the development of mechanisms to improve the management, accessibility and attribution of data • Promote awareness of legislation and guidance relating to the management and dissemination of information
The Seven Ps Public good Public good Preservation Preservation Discoverability Promotion Confidentiality Protection & Privacy First use Privilege Recognition Probity Costs Price
Overview • Introduction • What is STFC? What do we do? Why do we do it? • An example project • Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES • RCUK Data Policy Principles • a final thought
Archival Creation Access Curation Services Network Storage Compute The Knowledge Lifecycle The 7 C’s Communication Collaboration Creation Collection Data Permanent Access Capacity Curation Provenanced Research Computation
Thank You www.pan-data.euwww.rcuk.ac.uk/research/Pages/DataPolicy.aspx
Building an Open Data Infrastructure for SciencePolicy and Practice Juan Bicarregui STFC e-Science APA conference, November 2011