DCC Roadshow Oxford Sam Pepler
Roadshow objectives • To describe the emerging trends and challenges associated with research data management and their potential impact on higher education institutions. • To examine case studies at both disciplinary and institutional levels, highlighting different models, approaches and working practice.
Outline • To examine case studies at both disciplinary and institutional levels, highlighting different models, approaches and working practice. • NERC data centres • BADC as an example data centre • To describe the emerging trends and challenges associated with research data management and their potential impact on higher education institutions. • Trends and Challenges
Who are we…? What do we know about data? We’re (several of) the NERC data centres!
NERC Approach to data management • A federation of subject based repositories • Subject based repositories embedded in the community they serve. E.g. National Geoscience Data Centre (NGDC) is based at the British Geological Survey (BGS). • Subject based, but flexible enough that items are handled by the most appropriate data centre even if it is not strictly in its subject area. • Same policy and principles across the data centres, but different approaches to suit differing requirements. • They have a long history. The British Oceanographic Data Centre (BODC) was created in 1969.
NERC Data Policy
It is NERC’s policy that:
9. Applications for NERC funding must include an outline Data Management Plan, which must identify which of the data sets being produced are considered to be of long-term value, based on the criteria in NERC’s Data Value Checklist. The funding application must also identify any resources necessary to implement the Data Management Plan.
10. The outline Data Management Plan will be evaluated as part of the standard NERC grant assessment process. All successful applications will be required to produce a detailed Data Management Plan in conjunction with the appropriate NERC Data Centre.
11. All NERC-funded projects must work with the relevant NERC Data Centre to implement the Data Management Plan, and ensure that data of long-term value are submitted to the data centre for long-term management and dissemination. Data must be provided in an agreed format and accompanied by all necessary metadata.
12. Data from NERC-funded activities are provided to the data centres on a non-exclusive basis without prejudice to any intellectual property rights. This is to enable NERC to manage and make openly available publicly funded research data. All users must acknowledge the contribution made by those who created the data.
13. Those in receipt of NERC funding who do not meet these policy requirements without good reason risk having award payments withheld or becoming ineligible for future funding from NERC.
Data centres: their use, value and impact • This report provides an analysis of the usage and impact of a number of research data centres, representing a cross-section of research disciplines in the UK. • Data centres are a success story for their users, and funders and policy-makers should continue to support and promote existing national data centres.
Our data centre aims: • PRESERVATION: for future generations, in the next months, decades and centuries, and • FACILITATION: for the here and now, as well as the future • BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow data manipulation • As well as providing a long-term archive for data created by NERC funded projects, the BADC also holds data from third parties, like the Met Office. • Contact: http://badc.nerc.ac.uk, badc@rl.ac.uk
Some BADC numbers for context Dataset: A collection of files sharing some administrative and/or project heritage. BADC has approximately 200 real datasets. BADC has approx 100 million files containing thousands of measured or simulated parameters. BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow one to manipulate them … Calendar year 2010: 2800 active users (of 17000 registered), downloaded 64 TB data in 16 million files from 165 datasets. Less than half of the BADC data consumers are “atmospheric science” users!
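To put the 2010 figures above in proportion, here is a minimal sketch of the arithmetic behind them. The input numbers are the ones quoted on the slide; the variable names and the decimal unit convention (1 TB = 10**6 MB) are assumptions made for illustration.

```python
# Figures quoted on the slide for calendar year 2010.
registered_users = 17_000
active_users = 2_800
downloaded_tb = 64
downloaded_files = 16_000_000
datasets_total = 200       # "approximately 200" on the slide
datasets_accessed = 165

# Assumption: decimal units, 1 TB = 10**6 MB.
MB_PER_TB = 10**6

# Share of registered users who were active during the year.
active_fraction = active_users / registered_users

# Mean size of a downloaded file, in MB.
mean_file_mb = downloaded_tb * MB_PER_TB / downloaded_files

# Mean number of files downloaded per active user.
files_per_user = downloaded_files / active_users

# Share of the (approximate) dataset count that saw any download.
dataset_coverage = datasets_accessed / datasets_total

print(f"Active users:      {active_fraction:.1%} of registered")
print(f"Mean file size:    {mean_file_mb:.1f} MB")
print(f"Files per user:    {files_per_user:.0f}")
print(f"Datasets accessed: {dataset_coverage:.1%}")
```

Roughly one registered user in six was active, and the mean downloaded file was about 4 MB: many small files rather than a few large ones.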
Example Data: Coupled Model Intercomparison Project Phase 5 (CMIP5)
IPCC assessment reports: FAR 1990, SAR 1995, TAR 2001, AR4 2007, AR5 2013
Handling the CMIP5 data
• Earth System Grid (ESG): US DoE funded project to develop software and support CMIP5. Consists of distributed data node software (to publish data), tools, and gateway software (to provide catalog and services).
• Metafor: information model to describe models and simulations, and tools to manipulate it.
• Major “technical challenges”
• Earth System Grid FEDERATION (ESGF): global initiative to deploy the ESG (and other) software to support timely access to the data, minimum international movement of the data, and long term access to significant versions of the CMIP5 data.
• Major “social challenge” as well as “technical challenge”
Trends and Challenges • Data selection is becoming a bigger issue • Data publication – fixing the publication process • More clearly defined functions mean that generic functions can be “out-sourced”
The Data Deluge
Decisions, decisions. Decisions (need) information and, better yet, PRIOR planning.
(Figure: exploded view from 2007 IDC study, but note colours swapped.)
“the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data”
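The quoted growth rate is easier to appreciate as a doubling time. The sketch below assumes, purely for illustration, that the 58% annual growth stays constant; the quote makes no such claim, and the function names are hypothetical.

```python
import math

GROWTH_RATE = 0.58          # 58% per year, as quoted
BASE_YEAR = 2010
BASE_VOLUME_BN_GB = 1250    # billion gigabytes generated in 2010, as quoted


def doubling_time_years(rate: float) -> float:
    """Years for a quantity to double under constant compound growth."""
    return math.log(2) / math.log(1 + rate)


def projected_volume(year: int) -> float:
    """Yearly volume in billion GB, ASSUMING the quoted rate never changes."""
    return BASE_VOLUME_BN_GB * (1 + GROWTH_RATE) ** (year - BASE_YEAR)


print(f"Doubling time: {doubling_time_years(GROWTH_RATE):.2f} years")
for year in range(BASE_YEAR, BASE_YEAR + 4):
    print(year, f"{projected_volume(year):,.0f} billion GB")
```

At 58% per year the volume doubles in about a year and a half, which is why selection and prior planning matter more than simply adding storage.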
The problem with environmental data... • ... is that it’s most often unique because of its environmental nature. We should keep everything indefinitely! But this is not going to be feasible for all data. • NERC is developing a Data Value Checklist that aims to identify which environmental data produced by a proposed NERC funded project should be considered for accession to NERC Data Centres to derive the maximum value for science. • A list of all expected data outputs should be captured in an outline Data Management Plan
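One way to picture an outline Data Management Plan is as a simple record of expected outputs, each flagged for long-term value, plus the resources needed. This is a hypothetical sketch only: the class and field names are invented for illustration, and the Data Value Checklist criteria are not modelled (the flag is just the outcome of applying them).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExpectedOutput:
    """One expected data output of a proposed project (hypothetical structure)."""
    name: str
    long_term_value: bool  # outcome of the Data Value Checklist; criteria not modelled


@dataclass
class OutlinePlan:
    """Hypothetical outline Data Management Plan."""
    project: str
    outputs: List[ExpectedOutput] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)

    def for_accession(self) -> List[str]:
        """Names of outputs flagged for submission to a data centre."""
        return [o.name for o in self.outputs if o.long_term_value]


# Made-up example values, for illustration only.
plan = OutlinePlan(
    project="Example project",
    outputs=[
        ExpectedOutput("field campaign measurements", long_term_value=True),
        ExpectedOutput("intermediate scratch files", long_term_value=False),
    ],
    resources=["staff time to prepare metadata"],
)
print(plan.for_accession())
```

The point of the structure is that every expected output is listed, while only those judged to be of long-term value are earmarked for the data centre.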
Journals work, but... ... They’re not enough now to communicate everything we need to know about a scientific event - whether that’s an observation, simulation, development of a theory, or any combination of these. Data always has been the foundation of scientific progress – without it, we can’t test any of our assertions. Previously data was hard to capture, but could be (relatively) easily published in image or table format – papers “included” data! But now...
(Figure: Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665)
Publishing data – why do it? • Scientific journal publication mainly focuses on the analysis, interpretation and conclusions drawn from a given dataset. • Examining the raw data that forms the dataset is more difficult, as datasets are usually stored in digital media, in a variety of (proprietary or non-standard) formats. • Peer-review is generally only applied to the methodology and final conclusions of a piece of work, and not the underlying data itself. But if the conclusions are to stand, the data and accompanying metadata must be of good quality. • A process of data publication, involving peer-review of datasets, is (in some cases) and will be (in others) of benefit to many sectors of the academic community – including data producers!
Publishing: Some terminology (1)
publish, verb /ˈpʌb.lɪʃ/ [T]
• to make information available to people, especially in a book, magazine or newspaper, or to produce and sell a book, magazine or newspaper
(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/publish)
cite, verb (GIVE EXAMPLE) /saɪt/ [T]
• formal - to mention something as proof for a theory or as a reason why something has happened
• formal - to speak or write words taken from a particular writer or written work
(Cambridge Advanced Learner's Dictionary - http://dictionary.cambridge.org/dictionary/british/cite_1)
Publishing: Some terminology (2)
2. Publication of data sets: This involves the peer-review of data sets, and gives the “stamp of approval” associated with traditional journal publications. Can’t be done without effective linking/citing of the data sets.
1. Data set citation: This is our first step for this project – formulate and formalise a way of citing data sets. Will provide benefits to our users – and a carrot to get them to provide data to us!
0. Serving of data sets: This is what data centres do as our day job – take in data supplied by scientists and make it available to other interested parties.
(Diagram labels: Doi:10232/123ro, Doi:10232/123)
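To make "formalise a way of citing data sets" concrete, here is a minimal sketch of a citation built from a fixed set of metadata fields. The field list and the output layout are assumptions for illustration, not the format the project adopted; the identifier is the placeholder label from the slide, and the other values are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical minimal metadata needed to cite a data set."""
    creators: Tuple[str, ...]
    year: int
    title: str
    publisher: str    # e.g. the data centre serving the data
    identifier: str   # persistent identifier, such as a DOI

    def format(self) -> str:
        """Render the citation in one assumed, illustrative layout."""
        names = "; ".join(self.creators)
        return (
            f"{names} ({self.year}): {self.title}. "
            f"{self.publisher}. {self.identifier}"
        )


# Placeholder values only; the identifier is the label shown on the slide.
citation = DatasetCitation(
    creators=("A. Researcher", "B. Scientist"),
    year=2010,
    title="Example data set",
    publisher="Example Data Centre",
    identifier="Doi:10232/123",
)
print(citation.format())
```

Fixing the fields up front is what allows a citation to be generated consistently for every data set a centre serves, which is the prerequisite for step 2.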
NERC DC common functions
• Acquisition & ingestion: receive data, transform data, create metadata, quality check data, document data, liaise with suppliers, load data, data management planning, update catalogues
• Information management: manage metadata, manage analogue datasets, manage electronic datasets, preserve data, manage catalogues, monitor access, review data retention, review access, dispose of data
• Access and delivery: manage non-web enquiries, apply policies, disseminate metadata, provide data discovery tools, provide catalogues, provide viewing/download, arrange loans/visits
• Community support: develop policy, develop apps, attend conferences, provide advice, manage web, monitor community, produce publications, provide data management support, informatics research
• Data Centre management: CS/NERC liaison, staff & performance management, budget management, income generation, equipment purchasing & maintenance, stakeholder reporting, reviews/audits
• Policy & standards management: maintain policies, maintain standards, legal compliance, manage licensing, preservation strategies
• Manage infrastructure: manage analogue environments, manage electronic environments, monitor technology, migration planning, database admin, IT infrastructure management, maintain apps/web
Conclusions • Subject based repositories are a good idea. Users like them, probably because their subject is their primary community. • Data selection is becoming a bigger issue • Data publication – fixing the publication process. • More clearly defined functions mean that generic functions can be “out-sourced”